2016-03-16 16:21:49

by Ion Badulescu

Subject: Re: rcu stalls and soft lockups with recent kernels

Just following up to my own email:

It turns out that we can eliminate the RCU stalls by changing from
CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
handle its own RCU callbacks completely fixes the problems for us.

Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL are the default config
for Fedora and RHEL7. Ho-humm... anybody interested in tracking this
down further?
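
For reference, the change amounts to a .config fragment along these lines
(4.4-era option names; the exact set of RCU_NOCB_* symbols offered depends
on the rest of the config):

    CONFIG_NO_HZ_FULL=y
    CONFIG_RCU_NOCB_CPU=y
    CONFIG_RCU_NOCB_CPU_NONE=y
    # CONFIG_RCU_NOCB_CPU_ZERO is not set
    # CONFIG_RCU_NOCB_CPU_ALL is not set

With RCU_NOCB_CPU_NONE, CPUs only get offloaded if rcu_nocbs= is passed on
the kernel command line, so each CPU processes its own callbacks.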

The bug is likely to be a missing wakeup for the rcuo* threads. It would
explain why the stalls get resolved eventually, maybe when another RCU
callback gets scheduled (properly this time!) and the threads wake up
and process their entire queue. But it's just speculation at this point.

And one final point: the RCU stall stack traces are actually rather
useless when the RCU callbacks are delegated to kernel threads. What one
particular cpu may be doing when it's fallen behind on RCU grace periods
isn't very interesting, given that its rcuo* threads could be running
anywhere, on any cpu.

Thanks,
-Ion


On 02/04/2016 02:12 PM, Ion Badulescu wrote:
> Hello,
>
> We run a compute cluster of about 800 or so machines here, which makes
> heavy use of NFS and fscache (on a dedicated local drive with an ext4
> filesystem) and also exercises the other local drives pretty hard. All
> the compute jobs run as unprivileged users with SCHED_OTHER scheduling,
> nice level 1.
>
> Over the past month or two, we've been seeing some strange and seemingly
> random rcu stalls and soft cpu lockups when running jobs on the cluster.
> The stalls are bad enough to cause hard drives to get kicked out of MD
> arrays, and even to get offlined altogether by the SCSI layer. The
> kernel running on these machines is based on 3.18.12 with some local
> patches, which we've been running quite happily since early May 2015.
> It's unclear what started triggering the stalls after all these months,
> as we've been unable to correlate them to any particular jobs. The one
> thing that's clear is that they tend to happen in bunches on multiple
> machines at the same time, so whatever it is, it's some centralized
> condition that triggers them. It could be a certain type of job, or it
> could be the state of the centralized NFS servers they access.
>
> In an effort to solve the issue and isolate its cause, we upgraded the
> kernel on about 80 of those machines to the latest 4.4.1 kernel, this
> time keeping out most of the local patches we had. We also enabled some
> more kernel debugging options, including lockdep. Only a few local
> patches were kept in:
> - printing some extra information in sysrq
> - disabling audit-to-printk
> - perf build fixes
> - changing the slab calculate_order algorithm to favor small allocation
> orders whenever possible.
>
> The full kernel config is available at
> http://www.badula.org/config-4.4.1 and the combined patch with our local
> changes is at http://www.badula.org/linux-ion-git-4.4.1.diff
>
> Unfortunately, the stalls did not go away. Many of the machines running
> 4.4.1 hit stalls with stack traces in the sunrpc transmit path, similar
> to this one:
>
>
> Feb 2 04:26:02 INFO: rcu_sched self-detected stall on CPUINFO:
> rcu_sched detected stalls on CPUs/tasks:
> Feb 2 04:26:02 6-...: (20999 ticks this GP) idle=cf1/140000000000001/0
> softirq=9090641/9090641 fqs=7000
> Feb 2 04:26:02 (detected by 5, t=21003 jiffies, g=1733453, c=1733452, q=0)
> Feb 2 04:26:02 running task 0 328895 2 0x00080088 6
> FAIR -20 6
> Feb 2 04:26:02
> Feb 2 04:26:02 ffffffffffffff10 ffffffff8152abaf 0000000000000000
> 0000000000000004
> Feb 2 04:26:02 000000043121b8b0 ffff881001149808 0000000000000296
> 0208002000000000
> Feb 2 04:26:02 0000000000000296 ffff88003121b908 ffffffff810968fb
> ffff8810010ed9c8
> Feb 2 04:26:02 Call Trace:
> Feb 2 04:26:02 [<ffffffff8152abaf>] ? _raw_spin_lock_irqsave+0xf/0x40
> Feb 2 04:26:02 [<ffffffff810968fb>] ? finish_wait+0x6b/0x90
> Feb 2 04:26:02 [<ffffffff814694aa>] ? sk_stream_wait_memory+0x24a/0x2d0
> Feb 2 04:26:02 [<ffffffff810966a0>] ? woken_wake_function+0x20/0x20
> Feb 2 04:26:02 [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
> Feb 2 04:26:02 [<ffffffff8145d22d>] ? release_sock+0xfd/0x150
> Feb 2 04:26:02 [<ffffffff814bfc16>] ? tcp_sendpage+0xd6/0x5e0
> Feb 2 04:26:02 [<ffffffff814e7bc0>] ? inet_sendpage+0x50/0xe0
> Feb 2 04:26:02 [<ffffffffa03b4ea5>] ? xs_nospace+0x75/0xf0 [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb 2 04:26:02 [<ffffffff81043946>] ? read_hpet+0x16/0x20
> Feb 2 04:26:02 [<ffffffff810b7ec2>] ? ktime_get+0x52/0xc0
> Feb 2 04:26:02 [<ffffffffa03b2133>] ? xprt_transmit+0x63/0x3a0 [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb 2 04:26:02 [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
> Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03ae538>] ? call_transmit+0x1d8/0x2c0 [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb 2 04:26:02 [<ffffffffa03bb4e9>] ? __rpc_execute+0x89/0x3c0 [sunrpc]
> Feb 2 04:26:02 [<ffffffff81081f15>] ? finish_task_switch+0x125/0x260
> Feb 2 04:26:02 [<ffffffffa03bb8e5>] ? rpc_async_schedule+0x15/0x20
> [sunrpc]
> Feb 2 04:26:02 [<ffffffff8106fad8>] ? process_one_work+0x148/0x450
> Feb 2 04:26:02 [<ffffffff8106ff12>] ? worker_thread+0x132/0x600
> Feb 2 04:26:02 [<ffffffff81083862>] ? default_wake_function+0x12/0x20
> Feb 2 04:26:02 [<ffffffff81096566>] ? __wake_up_common+0x56/0x90
> Feb 2 04:26:02 [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
> Feb 2 04:26:02 [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
> Feb 2 04:26:02 [<ffffffff8107511c>] ? kthread+0xcc/0xf0
> Feb 2 04:26:02 [<ffffffff81075050>] ?
> kthread_freezable_should_stop+0x70/0x70
> Feb 2 04:26:02 [<ffffffff8152b0df>] ? ret_from_fork+0x3f/0x70
> Feb 2 04:26:02 [<ffffffff81075050>] ?
> kthread_freezable_should_stop+0x70/0x70
>
>
> So we added a cond_resched() call inside the loop in __rpc_execute,
> triggered once the loop has gone through more than 20 iterations. While
> clearly just a bandaid, and maybe also in the wrong place, we just wanted
> to see if it would help.
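>
> The change was roughly of this shape (a sketch only, not the exact hunk;
> the iteration counter and the threshold of 20 are our additions, and the
> real loop lives in net/sunrpc/sched.c):
>
>     static void __rpc_execute(struct rpc_task *task)
>     {
>             int iterations = 0;
>
>             for (;;) {
>                     /* existing body: run task->tk_action, break out or
>                      * go to sleep on the rpc wait queue as appropriate */
>
>                     /* bandaid: yield the cpu if the state machine keeps
>                      * looping without ever sleeping */
>                     if (++iterations > 20)
>                             cond_resched();
>             }
>     }
>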
> And indeed it did, because the very common sunrpc-related stalls went
> away. Instead, this ext4-related stall started showing up:
>
>
> [91919.498218] INFO: rcu_sched self-detected stall on CPU
> [91919.498221] INFO: rcu_sched detected stalls on CPUs/tasks:
> [91919.498235] 15-...: (4 ticks this GP) idle=dbf/140000000000001/0
> softirq=16345141/16345141 fqs=1
> [91919.498238] (detected by 10, t=178100 jiffies, g=4061489, c=4061488,
> q=0)
> [91919.498239] Task dump for CPU 15:
> [91919.498249] condor_starter R running task 0 506564 502795
> 0x00080080 15 FAIR 0 0-23 10504 2591
> [91919.498252] 0000000000000004 ffff880049bb7958 0000000000000246
> 0000000000000002
> [91919.498254] 0000000000000000 ffff880049bb7938 ffffffff812c83b6
> 0000000000000000
> [91919.498255] ffffffff81156070 ffff880000000000 ffffffff00000000
> 0000000000000246
> [91919.498255] Call Trace:
> [91919.498261] [<ffffffff812c83b6>] ? __radix_tree_lookup+0x36/0x100
> [91919.498265] [<ffffffff81156070>] ? find_get_pages+0x280/0x280
> [91919.498267] [<ffffffff81156165>] ? find_get_entry+0xf5/0x200
> [91919.498268] [<ffffffff8115617d>] ? find_get_entry+0x10d/0x200
> [91919.498270] [<ffffffff81156070>] ? find_get_pages+0x280/0x280
> [91919.498272] [<ffffffff81158288>] ? pagecache_get_page+0x38/0x1d0
> [91919.498275] [<ffffffff8120bff2>] ? alloc_buffer_head+0x42/0x50
> [91919.498277] [<ffffffff8120c86d>] ? __find_get_block_slow+0x4d/0x160
> [91919.498280] [<ffffffff8159f35b>] ? _raw_spin_unlock+0x2b/0x40
> [91919.498281] [<ffffffff8120d74c>] ? create_empty_buffers+0x9c/0xc0
> [91919.498284] [<ffffffff810aabb0>] ? mark_held_locks+0x70/0x90
> [91919.498285] [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
> [91919.498286] [<ffffffff810aae7f>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [91919.498290] [<ffffffff81001017>] ? trace_hardirqs_on_thunk+0x17/0x19
> [91919.498292] [<ffffffff810c40cc>] ? rcu_irq_exit+0x6c/0xb0
> [91919.498293] [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
> [91919.498296] [<ffffffff812ceb95>] ?
> copy_user_enhanced_fast_string+0x5/0x10
> [91919.498299] [<ffffffff812d54ec>] ?
> iov_iter_copy_from_user_atomic+0x9c/0x240
> [91919.498300] [<ffffffff8115535e>] ? generic_perform_write+0xfe/0x1d0
> [91919.498302] [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
> [91919.498305] [<ffffffff81157466>] ?
> __generic_file_write_iter+0x196/0x1f0
> [91919.498315] [<ffffffffa007dd31>] ? ext4_file_write_iter+0x121/0x3a0
> [ext4]
> [91919.498318] [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
> [91919.498320] [<ffffffff811d0abf>] ? __vfs_write+0xcf/0xf0
> [91919.498322] [<ffffffff811d2123>] ? vfs_write+0xc3/0x1a0
> [91919.498323] [<ffffffff811d22e9>] ? SyS_write+0x59/0xc0
> [91919.498325] [<ffffffff8159f5d7>] ? entry_SYSCALL_64_fastpath+0x12/0x6f
> [91919.498327] rcu_sched kthread starved for 178097 jiffies! g4061489
> c4061488 f0x0 s3 ->state=0x1
> [91919.750048]
> [91919.751744] 15-...: (4 ticks this GP) idle=dbf/140000000000001/0
> softirq=16345141/16345141 fqs=2
> [91919.761832] (t=178100 jiffies g=4061489 c=4061488 q=0)
> [91919.767820] Task dump for CPU 15:
> [91919.767821] condor_starter R running task 0 506564 502795
> 0x00080088 15 FAIR 0 0-23 10504 2591
> [91919.767830] 0000000000000010 ffff88207fc63b58 ffffffff8108aec5
> ffffffff8108acf8
> [91919.776246] 0000000000000001 0000000033322d30 ffffffff817e3471
> 00ff882000000000
> [91919.784645] 0000000000000000 0000000000000000 000000000000000f
> ffffffff81a4bb80
> [91919.793038] Call Trace:
> [91919.795830] <IRQ> [<ffffffff8108aec5>] sched_show_task+0x235/0x450
> [91919.803003] [<ffffffff8108acf8>] ? sched_show_task+0x68/0x450
> [91919.809569] [<ffffffff810b73af>] ? vprintk_default+0x1f/0x30
> [91919.816036] [<ffffffff81154de9>] ? printk+0x46/0x48
> [91919.821625] [<ffffffff8108b11f>] dump_cpu_task+0x3f/0x50
> [91919.827700] [<ffffffff810c3594>] rcu_dump_cpu_stacks+0x84/0xc0
> [91919.834360] [<ffffffff810c3ae9>] check_cpu_stall+0x1d9/0x290
> [91919.840823] [<ffffffff810ad66c>] ? __lock_acquire+0x48c/0x17e0
> [91919.847504] [<ffffffff810c56f2>] rcu_check_callbacks+0x102/0x3a0
> [91919.854371] [<ffffffff810da629>] ? tick_sched_timer+0x29/0x80
> [91919.860936] [<ffffffff8105f0a8>] ? __raise_softirq_irqoff+0x28/0x80
> [91919.868082] [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
> [91919.879805] [<ffffffff810c9fa9>] update_process_times+0x39/0x70
> [91919.886561] [<ffffffff810da519>] tick_sched_handle+0x39/0x70
> [91919.893039] [<ffffffff810da644>] tick_sched_timer+0x44/0x80
> [91919.899435] [<ffffffff810caa56>] __hrtimer_run_queues+0xe6/0x2c0
> [91919.906286] [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
> [91919.913042] [<ffffffff810da600>] ? tick_nohz_handler+0xb0/0xb0
> [91919.919701] [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
> [91919.926365] [<ffffffff810d09a8>] ?
> ktime_get_update_offsets_now+0xb8/0x170
> [91919.934191] [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
> [91919.940948] [<ffffffff810cadbd>] hrtimer_interrupt+0x9d/0x1a0
> [91919.947518] [<ffffffff81090274>] ?
> vtime_common_account_irq_enter+0x34/0x50
> [91919.955446] [<ffffffff8103b529>] local_apic_timer_interrupt+0x39/0x60
> [91919.962786] [<ffffffff815a1fd5>] smp_apic_timer_interrupt+0x45/0x59
> [91919.969930] [<ffffffff815a035e>] apic_timer_interrupt+0x8e/0xa0
> [91919.976687] <EOI> [<ffffffff812ceb95>] ?
> copy_user_enhanced_fast_string+0x5/0x10
> [91919.985237] [<ffffffff812d54ec>] ?
> iov_iter_copy_from_user_atomic+0x9c/0x240
> [91919.993274] [<ffffffff8115535e>] generic_perform_write+0xfe/0x1d0
> [91920.000229] [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
> [91920.007083] [<ffffffff81157466>] __generic_file_write_iter+0x196/0x1f0
> [91920.014529] [<ffffffffa007dd31>] ext4_file_write_iter+0x121/0x3a0
> [ext4]
> [91920.022161] [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
> [91920.028742] [<ffffffff811d0abf>] __vfs_write+0xcf/0xf0
> [91920.034629] [<ffffffff811d2123>] vfs_write+0xc3/0x1a0
> [91920.040413] [<ffffffff811d22e9>] SyS_write+0x59/0xc0
> [91920.046098] [<ffffffff8159f5d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
>
>
> We also saw some cases of soft lockups showing no stack trace at all and
> RIP pointing to userspace memory:
>
>
> [92163.555725] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s!
> [strat_sim:570566]
> [92163.564455] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
> grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
> fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
> iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg lpc_ich
> mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core ixgbe ptp
> pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
> acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
> dm_region_hash dm_log dm_mod
> [92163.614893] irq event stamp: 12063822
> [92163.619020] hardirqs last enabled at (12063821):
> [<ffffffff815a00bf>] retint_user+0x18/0x20
> [92163.628536] hardirqs last disabled at (12063822):
> [<ffffffff815a0339>] apic_timer_interrupt+0x69/0xa0
> [92163.638920] softirqs last enabled at (12063814):
> [<ffffffff8105fa08>] __do_softirq+0x1f8/0x390
> [92163.648723] softirqs last disabled at (12063801):
> [<ffffffff8105fceb>] irq_exit+0xfb/0x110
> [92163.658042] CPU: 7 PID: 570566 Comm: strat_sim Tainted: G
> I 4.4.1-el6.ia32e.lime.0 #1
> [92163.668134] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> SE5C610.86B.01.01.0011.081020151200 08/10/2015
> [92163.679871] task: ffff880838a30000 ti: ffff8808c270c000 task.ti:
> ffff8808c270c000
> [92163.688302] RIP: 0033:[<0000000003cd9e89>] [<0000000003cd9e89>]
> 0x3cd9e89
> [92163.696036] RSP: 002b:00007ffdd497e168 EFLAGS: 00000202
> [92163.702009] RAX: 000000002860dad8 RBX: 00000000285fb790 RCX:
> 00000000285fb730
> [92163.710024] RDX: 0000000015546340 RSI: 0000000000000001 RDI:
> 000000002861afb0
> [92163.718040] RBP: 00000000285cd220 R08: 0000000015476050 R09:
> 000000000000000a
> [92163.726055] R10: 0000000000000001 R11: 0000000000000009 R12:
> ffffffffffffff80
> [92163.734071] R13: ffffffffffffff80 R14: 000000000000007f R15:
> 000000002861afe8
> [92163.742088] FS: 00002b21c95e3de0(0000) GS:ffff88103fce0000(0000)
> knlGS:0000000000000000
> [92163.751199] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [92163.757661] CR2: 00002b21e3bcbd40 CR3: 00000004b3087000 CR4:
> 00000000001406e0
>
> [these are traces from different machines, btw, so don't try to read
> anything into their timestamps]
>
> At this point we're kinda stumped. Everything seems to point to some
> problem at a pretty low level, not related to sunrpc or ext4 or any
> other high level subsystem... but we can't quite figure out what it is.
>
> Any insights or potential fixes would be highly appreciated.
>
> Thanks,
> -Ion
>
>


2016-03-18 02:29:09

by Mike Galbraith

Subject: Re: rcu stalls and soft lockups with recent kernels

On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:
> Just following up to my own email:
>
> It turns out that we can eliminate the RCU stalls by changing from
> CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
> handle its own RCU callbacks completely fixes the problems for us.
>
> Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
> for fedora and rhel7. Ho-humm...

All RCU offloaded to CPU0 of a big box seems like a very bad idea.

-Mike

2016-03-22 20:25:22

by Ion Badulescu

Subject: Re: rcu stalls and soft lockups with recent kernels

On 03/17/2016 10:28 PM, Mike Galbraith wrote:
> On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:
>> Just following up to my own email:
>>
>> It turns out that we can eliminate the RCU stalls by changing from
>> CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
>> handle its own RCU callbacks completely fixes the problems for us.
>>
>> Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
>> for fedora and rhel7. Ho-humm...
>
> All RCU offloaded to CPU0 of a big box seems like a very bad idea.


It's not offloaded to CPU0, is it? Those rcuo* threads are not cpu-bound
and can run on any cpu the scheduler will put them on. In any case,
there was no indication that the rcuo* threads wanted to run but
couldn't get cpu time.

Anyway, it looks like I spoke too soon. They happen less often with
RCU_NOCB_CPU_NONE than with RCU_NOCB_CPU_ALL, but the soft lockups and
RCU stalls are still happening.

[44206.316711] clocksource: timekeeping watchdog: Marking clocksource
'tsc' as unstable because the skew is too large:
[44206.328463] clocksource: 'hpet' wd_now:
ffffffff wd_last: 5f03cdca mask: ffffffff
[44206.339037] clocksource: 'tsc' cs_now:
64788b443c3a cs_last: 647840eea919 mask: ffffffffffffffff
[44206.351253] clocksource: Switched to clocksource hpet
[44922.301452] INFO: rcu_sched detected stalls on CPUs/tasks:
[44922.307644] 0-...: (1 GPs behind) idle=53d/140000000000001/0
softirq=8515474/8515477 fqs=6994
[44922.317435] (detected by 1, t=21019 jiffies, g=2011397, c=2011396,
q=3263)
[44922.325274] Task dump for CPU 0:
[44922.325276] python R running task 0 257113 257112
0x00080088 0 FAIR 1 0 152294 48373
[44922.325283] ffffffff8152ca8e ffff881b76870000 ffff880e83669000
0000000000007d54
[44922.333671] ffff881b1cdc7a48 ffff880a58a57e58 0000000000000086
0000000000000000
[44922.342060] 0000000000000000 0000000000003fa1 ffff880a58a54000
ffff880a58a57e88
[44922.350446] Call Trace:
[44922.353215] [<ffffffff8152ca8e>] ? __schedule+0x38e/0xa90
[44922.359388] [<ffffffff810ad536>] ? rcu_eqs_enter_common+0x66/0x130
[44922.366437] [<ffffffff810f38fc>] ? acct_account_cputime+0x1c/0x20
[44922.373388] [<ffffffff81086698>] ? account_user_time+0x78/0x80
[44922.380045] [<ffffffff810866e3>] ? vtime_account_user+0x43/0x60
[44922.386801] [<ffffffff81132de0>] ? __context_tracking_exit+0x70/0xc0
[44922.394044] [<ffffffff810019af>] ? enter_from_user_mode+0x1f/0x50
[44922.400994] [<ffffffff815312b9>] ? apic_timer_interrupt+0x69/0x90
[44923.210453] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s!
[python:257113]
[44923.218890] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg l
pc_ich mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core
ixgbe ptp pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
dm_region_hash dm_log dm_mod
[44923.269289] CPU: 0 PID: 257113 Comm: python Tainted: G I
4.4.5-el6.ia32e.lime.0 #1
[44923.279089] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
SE5C610.86B.01.01.0011.081020151200 08/10/2015
[44923.290824] task: ffff880e83669000 ti: ffff880a58a54000 task.ti:
ffff880a58a54000
[44923.299253] RIP: 0033:[<00002b5ee4e0502d>] [<00002b5ee4e0502d>]
0x2b5ee4e0502d
[44923.307508] RSP: 002b:00007ffe4120b170 EFLAGS: 00000212
[44923.313494] RAX: 0000000000000008 RBX: 0000000003fe1480 RCX:
0000000003fc9b00
[44923.321513] RDX: 00002b5ee34b4260 RSI: 0000000000002248 RDI:
00000000000000d4
[44923.329530] RBP: 0000000003fe1800 R08: 0000000000000078 R09:
0000000000000800
[44923.337548] R10: 00007ffe4120b550 R11: 0000000000011240 R12:
00002b5ee34ba938
[44923.345566] R13: 00002b5ee34c1010 R14: 00007ffe4120b428 R15:
0000000000000400
[44923.353580] FS: 00002b5ec7fe98c0(0000) GS:ffff88103fc00000(0000)
knlGS:0000000000000000
[44923.362689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44923.369147] CR2: 00002b5ee4190000 CR3: 00000005ac81b000 CR4:
00000000001406f0

That rcu_eqs_enter_common function seems to be a fairly common
occurrence in these stack traces. Not sure if it means anything, though.

Also, this seems to be a soft lockup with the RIP in userspace. Does that
mean timer interrupts are disabled? Somehow it fails to reschedule the NMI
watchdog timer.

We're at our wits' end here...

This, btw, is a 2x12 core Haswell box.

Thanks,
-Ion

2016-03-23 04:50:24

by Mike Galbraith

Subject: Re: rcu stalls and soft lockups with recent kernels

(cc)

On Tue, 2016-03-22 at 16:22 -0400, Ion Badulescu wrote:
> On 03/17/2016 10:28 PM, Mike Galbraith wrote:
> > On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:
> > > Just following up to my own email:
> > >
> > > It turns out that we can eliminate the RCU stalls by changing from
> > > CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
> > > handle its own RCU callbacks completely fixes the problems for us.
> > >
> > > Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
> > > for fedora and rhel7. Ho-humm...
> >
> > All RCU offloaded to CPU0 of a big box seems like a very bad idea.
>
>
> It's not offloaded to CPU0, is it? Those rcuo* threads are not cpu-bound
> and can run on any cpu the scheduler will put them on. In any case,
> there was no indication that the rcuo* threads wanted to run but
> couldn't get cpu time.

Right, my thinker was screwed on cross-threaded.

> Anyway, looks like I spoke too soon. It's less often with
> RCU_NOCB_CPU_NONE than with RCU_NOCB_CPU_ALL, but the soft lockups and
> rcu stalls are still happening.
>
> [44206.316711] clocksource: timekeeping watchdog: Marking clocksource
> 'tsc' as unstable because the skew is too large:
> [44206.328463] clocksource: 'hpet' wd_now:
> ffffffff wd_last: 5f03cdca mask: ffffffff
> [44206.339037] clocksource: 'tsc' cs_now:
> 64788b443c3a cs_last: 647840eea919 mask: ffffffffffffffff
> [44206.351253] clocksource: Switched to clocksource hpet
> [44922.301452] INFO: rcu_sched detected stalls on CPUs/tasks:
> [44922.307644] 0-...: (1 GPs behind) idle=53d/140000000000001/0
> softirq=8515474/8515477 fqs=6994
> [44922.317435] (detected by 1, t=21019 jiffies, g=2011397, c=2011396,
> q=3263)
> [44922.325274] Task dump for CPU 0:
> [44922.325276] python R running task 0 257113 257112
> 0x00080088 0 FAIR 1 0 152294 48373
> [44922.325283] ffffffff8152ca8e ffff881b76870000 ffff880e83669000
> 0000000000007d54
> [44922.333671] ffff881b1cdc7a48 ffff880a58a57e58 0000000000000086
> 0000000000000000
> [44922.342060] 0000000000000000 0000000000003fa1 ffff880a58a54000
> ffff880a58a57e88
> [44922.350446] Call Trace:
> [44922.353215] [] ? __schedule+0x38e/0xa90
> [44922.359388] [] ? rcu_eqs_enter_common+0x66/0x130
> [44922.366437] [] ? acct_account_cputime+0x1c/0x20
> [44922.373388] [] ? account_user_time+0x78/0x80
> [44922.380045] [] ? vtime_account_user+0x43/0x60
> [44922.386801] [] ? __context_tracking_exit+0x70/0xc0
> [44922.394044] [] ? enter_from_user_mode+0x1f/0x50
> [44922.400994] [] ? apic_timer_interrupt+0x69/0x90
> [44923.210453] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s!
> [python:257113]
> [44923.218890] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
> grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
> fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
> iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg l
> pc_ich mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core
> ixgbe ptp pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
> acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
> dm_region_hash dm_log dm_mod
> [44923.269289] CPU: 0 PID: 257113 Comm: python Tainted: G I
> 4.4.5-el6.ia32e.lime.0 #1
> [44923.279089] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> SE5C610.86B.01.01.0011.081020151200 08/10/2015
> [44923.290824] task: ffff880e83669000 ti: ffff880a58a54000 task.ti:
> ffff880a58a54000
> [44923.299253] RIP: 0033:[<00002b5ee4e0502d>] [<00002b5ee4e0502d>]
> 0x2b5ee4e0502d
> [44923.307508] RSP: 002b:00007ffe4120b170 EFLAGS: 00000212
> [44923.313494] RAX: 0000000000000008 RBX: 0000000003fe1480 RCX:
> 0000000003fc9b00
> [44923.321513] RDX: 00002b5ee34b4260 RSI: 0000000000002248 RDI:
> 00000000000000d4
> [44923.329530] RBP: 0000000003fe1800 R08: 0000000000000078 R09:
> 0000000000000800
> [44923.337548] R10: 00007ffe4120b550 R11: 0000000000011240 R12:
> 00002b5ee34ba938
> [44923.345566] R13: 00002b5ee34c1010 R14: 00007ffe4120b428 R15:
> 0000000000000400
> [44923.353580] FS: 00002b5ec7fe98c0(0000) GS:ffff88103fc00000(0000)
> knlGS:0000000000000000
> [44923.362689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [44923.369147] CR2: 00002b5ee4190000 CR3: 00000005ac81b000 CR4:
> 00000000001406f0
>
> That rcu_eqs_enter_common function seems to be a fairly common
> occurrence in these stack traces. Not sure if it means anything, though.
>
> Also, this seems to be a sock lockup with RIP in userspace. Does it mean
> timer interrupts are disabled? Somehow it fails to reschedule the NMI timer.
>
> We're at our wits' end here...
>
> This, btw, is a 2x12 core Haswell box.
>
> Thanks,
> -Ion

2016-03-23 15:36:12

by Paul E. McKenney

Subject: Re: rcu stalls and soft lockups with recent kernels

On Wed, Mar 23, 2016 at 05:50:14AM +0100, Mike Galbraith wrote:
> (cc)
>
> On Tue, 2016-03-22 at 16:22 -0400, Ion Badulescu wrote:
> > On 03/17/2016 10:28 PM, Mike Galbraith wrote:
> > > On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:
> > > > Just following up to my own email:
> > > >
> > > > It turns out that we can eliminate the RCU stalls by changing from
> > > > CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
> > > > handle its own RCU callbacks completely fixes the problems for us.
> > > >
> > > > Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
> > > > for fedora and rhel7. Ho-humm...
> > >
> > > All RCU offloaded to CPU0 of a big box seems like a very bad idea.
> >
> >
> > It's not offloaded to CPU0, is it? Those rcuo* threads are not cpu-bound
> > and can run on any cpu the scheduler will put them on. In any case,
> > there was no indication that the rcuo* threads wanted to run but
> > couldn't get cpu time.
>
> Right, my thinker was screwed on cross-threaded.
>
> > Anyway, looks like I spoke too soon. It's less often with
> > RCU_NOCB_CPU_NONE than with RCU_NOCB_CPU_ALL, but the soft lockups and
> > rcu stalls are still happening.
> >
> > [44206.316711] clocksource: timekeeping watchdog: Marking clocksource
> > 'tsc' as unstable because the skew is too large:
> > [44206.328463] clocksource: 'hpet' wd_now:
> > ffffffff wd_last: 5f03cdca mask: ffffffff
> > [44206.339037] clocksource: 'tsc' cs_now:
> > 64788b443c3a cs_last: 647840eea919 mask: ffffffffffffffff
> > [44206.351253] clocksource: Switched to clocksource hpet
> > [44922.301452] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [44922.307644] 0-...: (1 GPs behind) idle=53d/140000000000001/0
> > softirq=8515474/8515477 fqs=6994
> > [44922.317435] (detected by 1, t=21019 jiffies, g=2011397, c=2011396,
> > q=3263)
> > [44922.325274] Task dump for CPU 0:
> > [44922.325276] python R running task 0 257113 257112
> > 0x00080088 0 FAIR 1 0 152294 48373
> > [44922.325283] ffffffff8152ca8e ffff881b76870000 ffff880e83669000
> > 0000000000007d54
> > [44922.333671] ffff881b1cdc7a48 ffff880a58a57e58 0000000000000086
> > 0000000000000000
> > [44922.342060] 0000000000000000 0000000000003fa1 ffff880a58a54000
> > ffff880a58a57e88
> > [44922.350446] Call Trace:
> > [44922.353215] [] ? __schedule+0x38e/0xa90
> > [44922.359388] [] ? rcu_eqs_enter_common+0x66/0x130
> > [44922.366437] [] ? acct_account_cputime+0x1c/0x20
> > [44922.373388] [] ? account_user_time+0x78/0x80
> > [44922.380045] [] ? vtime_account_user+0x43/0x60
> > [44922.386801] [] ? __context_tracking_exit+0x70/0xc0
> > [44922.394044] [] ? enter_from_user_mode+0x1f/0x50
> > [44922.400994] [] ? apic_timer_interrupt+0x69/0x90
> > [44923.210453] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s!
> > [python:257113]
> > [44923.218890] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
> > grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
> > fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
> > iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg l
> > pc_ich mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core
> > ixgbe ptp pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
> > acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
> > dm_region_hash dm_log dm_mod
> > [44923.269289] CPU: 0 PID: 257113 Comm: python Tainted: G I
> > 4.4.5-el6.ia32e.lime.0 #1
> > [44923.279089] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> > SE5C610.86B.01.01.0011.081020151200 08/10/2015
> > [44923.290824] task: ffff880e83669000 ti: ffff880a58a54000 task.ti:
> > ffff880a58a54000
> > [44923.299253] RIP: 0033:[<00002b5ee4e0502d>] [<00002b5ee4e0502d>]
> > 0x2b5ee4e0502d
> > [44923.307508] RSP: 002b:00007ffe4120b170 EFLAGS: 00000212
> > [44923.313494] RAX: 0000000000000008 RBX: 0000000003fe1480 RCX:
> > 0000000003fc9b00
> > [44923.321513] RDX: 00002b5ee34b4260 RSI: 0000000000002248 RDI:
> > 00000000000000d4
> > [44923.329530] RBP: 0000000003fe1800 R08: 0000000000000078 R09:
> > 0000000000000800
> > [44923.337548] R10: 00007ffe4120b550 R11: 0000000000011240 R12:
> > 00002b5ee34ba938
> > [44923.345566] R13: 00002b5ee34c1010 R14: 00007ffe4120b428 R15:
> > 0000000000000400
> > [44923.353580] FS: 00002b5ec7fe98c0(0000) GS:ffff88103fc00000(0000)
> > knlGS:0000000000000000
> > [44923.362689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [44923.369147] CR2: 00002b5ee4190000 CR3: 00000005ac81b000 CR4:
> > 00000000001406f0
> >
> > That rcu_eqs_enter_common function seems to be a fairly common
> > occurrence in these stack traces. Not sure if it means anything, though.
> >
> > Also, this seems to be a sock lockup with RIP in userspace. Does it mean
> > timer interrupts are disabled? Somehow it fails to reschedule the NMI timer.
> >
> > We're at our wits' end here...
> >
> > This, btw, is a 2x12 core Haswell box.

This one would definitely make Salvador Dali proud!

All I can suggest is to reconfigure to get the highest error rate, then
bisect.
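
Something along these lines, say (hypothetical good/bad endpoints; use
whatever kernels you can actually show to be good and bad under this
workload):

    git bisect start
    git bisect bad v4.4        # a kernel that shows the stalls
    git bisect good v3.18      # a kernel believed not to
    # build and boot the commit git checks out, run the workload,
    # then mark the result and repeat until git names the first bad commit:
    git bisect good            # or: git bisect bad
    git bisect reset           # when finished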

Thanx, Paul