2018-05-21 06:16:10

by Abdul Haleem

[permalink] [raw]
Subject: [next-20180517][ppc] watchdog: CPU 88 self-detected hard LOCKUP @ update_cfs_group+0x30/0x150

Greeting's

Offlate we are seeing hard hard lockup trace messages on our powerpc
bare-metal machine for both linux-next and mainline kernel

Machine Type: Power 8 Bare-metal (minsky)
kernel: 4.17.0-rc5-next-20180517
gcc: 4.8.5
config: attached
test: the exact scenario which is triggering these traces is not known

trace logs:
-----------
Watchdog: CPU 88 TB:20328455213119, last heartbeat TB:20323303945487 (10061ms ago)
Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash
lzo_compress raid6_pq dm_snapshot dm_bufio nvme bnx2x iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp tun bridge stp llc iptable_filter dm_mirror
dm_region_hash dm_log dm_service_time vmx_crypto powernv_rng rng_core
dm_multipath kvm_hv kvm binfmt_misc nfsd ip_tables x_tables autofs4 xfs
lpfc crc_t10dif crct10dif_generic nvme_fc nvme_fabrics mdio libcrc32c
nvme_core crct10dif_common [last unloaded: nvme]
CPU: 88 PID: 0 Comm: swapper/88 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
NIP: c000000000131a80 LR: c0000000001376b8 CTR: c0000000008284c0
REGS: c000003fff42fd80 TRAP: 0900 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22042044 XER: 20000000
CFAR: c0000000001376b4 SOFTE: 1
GPR00: c000000000137b64 c000003ca93236a0 c000000001086c00 c000003c907bdc00
GPR04: c000003c907bdc00 c000003ffbf23a68 0000000000000000 c000001f4f540400
GPR08: 0000000000100000 c000003c907bde00 0000000000000400 c000003c907bde30
GPR12: c0000000008284c0 c000003fff74f880 c0000000008284c0 c000003ffbf260b8
GPR16: c000003ffbf27580 0000000000000001 0000000000000000 000000000000ba7e
GPR20: 000000000000ba7e 0000000000000001 c000001f4ac55900 c000003ffbf27580
GPR24: afb504000afb5041 0000000000000001 0000000000000000 0000000000000001
GPR28: 0000000000000001 0000000000000000 c000003ffbf27600 c000003c907bdc00
NIP [c000000000131a80] update_cfs_group+0x30/0x150
LR [c0000000001376b8] enqueue_task_fair+0x3c8/0x1390
Call Trace:
[c000003ca93236a0] [c000000000137b64] enqueue_task_fair+0x874/0x1390 (unreliable)
[c000003ca9323790] [c000000000128870] activate_task+0x80/0x120
[c000003ca9323810] [c000000000128db8] ttwu_do_activate+0x68/0xc0
[c000003ca9323850] [c00000000012a208] try_to_wake_up+0x238/0x5c0
[c000003ca93238d0] [c0000000001853b0] hrtimer_wakeup+0x30/0x60
[c000003ca93238f0] [c000000000185094] __hrtimer_run_queues+0x124/0x350
[c000003ca9323970] [c000000000185e3c] hrtimer_interrupt+0x11c/0x2f0
[c000003ca9323a20] [c000000000023290] __timer_interrupt+0x90/0x270
[c000003ca9323a70] [c000000000023900] timer_interrupt+0xa0/0xe0
[c000003ca9323aa0] [c000000000009348] decrementer_common+0x158/0x160
--- interrupt: 901 at replay_interrupt_return+0x0/0x4
LR = arch_local_irq_restore+0x74/0x90
[c000003ca9323d90] [c00000000019a92c] tick_nohz_idle_stop_tick+0x17c/0x360 (unreliable)
[c000003ca9323db0] [c00000000082b668] cpuidle_enter_state+0x108/0x3c0
[c000003ca9323e10] [c000000000130104] call_cpuidle+0x44/0x80
[c000003ca9323e30] [c000000000130698] do_idle+0x2f8/0x3a0
[c000003ca9323ec0] [c000000000130914] cpu_startup_entry+0x34/0x40
[c000003ca9323ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
[c000003ca9323f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
Instruction dump:
3c4c00f5 384251b0 e9230160 7c641b78 2fa90000 4d9e0020 e9490000 e8c900a0
e8e901d0 794ab282 e9070100 7faa3040 <419c0100> e8c70180 e8e90180 7ca851d2

kernel:watchdog: CPU 88 self-detected hard LOCKUP @ update_cfs_group+0x30/0x150

kernel:watchdog: CPU 88 TB:20328455213119, last heartbeat TB:20323303945487 (10061ms ago)

watchdog: CPU 88 became unstuck TB:20328505869807
CPU: 88 PID: 0 Comm: swapper/88 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
NIP: c00000000000aeec LR: c000000000016644 CTR: c000000000022be0
REGS: c000003ca9323b10 TRAP: 0901 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42004084 XER: 00000000
CFAR: c000000000185fe4 SOFTE: 0
GPR00: c00000000082b668 c000003ca9323d90 c000000001086c00 0000000000000900
GPR04: b000000000001033 c000000000093f80 0000000000000001 0000000000000008
GPR08: 0000000000000000 c000000000d86c84 0000000000000000 c000003ffbf27630
GPR12: c000000000022be0 c000003fff74f880
NIP [c00000000000aeec] replay_interrupt_return+0x0/0x4
LR [c000000000016644] arch_local_irq_restore+0x74/0x90
Call Trace:
[c000003ca9323d90] [c00000000019a92c] tick_nohz_idle_stop_tick+0x17c/0x360 (unreliable)
[c000003ca9323db0] [c00000000082b668] cpuidle_enter_state+0x108/0x3c0
[c000003ca9323e10] [c000000000130104] call_cpuidle+0x44/0x80
[c000003ca9323e30] [c000000000130698] do_idle+0x2f8/0x3a0
[c000003ca9323ec0] [c000000000130910] cpu_startup_entry+0x30/0x40
[c000003ca9323ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
[c000003ca9323f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
Instruction dump:
7d200026 618c8000 2c030900 4182e328 2c030500 4182dd40 2c030f00 4182f1e8
2c030a00 4182ff9c 2c030e60 4182edb8 <4e800020> 7c781b78 48000395 480003ad

kernel:watchdog: CPU 88 became unstuck TB:20328505869807

watchdog: CPU 96 self-detected hard LOCKUP @ tick_nohz_next_event+0x38/0x200
watchdog: CPU 96 TB:23470018682939, last heartbeat TB:23464816758804 (10160ms ago)
Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash
lzo_compress raid6_pq dm_snapshot dm_bufio nvme bnx2x iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp tun bridge stp llc iptable_filter dm_mirror
dm_region_hash dm_log dm_service_time vmx_crypto powernv_rng rng_core
dm_multipath kvm_hv kvm binfmt_misc nfsd ip_tables x_tables autofs4 xfs
lpfc crc_t10dif crct10dif_generic nvme_fc nvme_fabrics mdio libcrc32c
nvme_core crct10dif_common [last unloaded: nvme]
CPU: 96 PID: 0 Comm: swapper/96 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
NIP: c000000000199df8 LR: c00000000019ad3c CTR: c00000000082d5a0
REGS: c000003fff3cfd80 TRAP: 0900 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22004084 XER: 00000000
CFAR: c00000000019ad38 SOFTE: 1
GPR00: c00000000019ad3c c000003ca9343d50 c000000001086c00 c000003ffc113880
GPR04: 0000000000000060 0000000000000000 0000003ffb3a0000 c000000001196c00
GPR08: c0000000010b3b00 c000000000d86c80 c0000000010b3b80 c000003ffc127630
GPR12: c00000000082d5a0 c000003fff748f80 c000003ca9343f90 c0000000012f6020
GPR16: c000003ffcc08cb0 c0000000000410b0 c0000000000410b0 c000000000d65260
GPR20: 0000000000000001 c000000000fcf820 c000003ca9340000 c000003ca9340080
GPR24: c000000000d862e8 0000000077359400 c000003ffc1260d8 c000000000fcf820
GPR28: 0000000000000060 0000000000000060 c000003ffc113880 c000003ffc113880
NIP [c000000000199df8] tick_nohz_next_event+0x38/0x200
LR [c00000000019ad3c] tick_nohz_get_sleep_length+0xac/0x100
Call Trace:
[c000003ca9343d50] [c00000000019ad04] tick_nohz_get_sleep_length+0x74/0x100 (unreliable)
[c000003ca9343d90] [c00000000082d668] menu_select+0xc8/0x7f0
[c000003ca9343e10] [c00000000082b950] cpuidle_select+0x30/0x50
[c000003ca9343e30] [c000000000130664] do_idle+0x2c4/0x3a0
[c000003ca9343ec0] [c000000000130914] cpu_startup_entry+0x34/0x40
[c000003ca9343ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
[c000003ca9343f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
Instruction dump:
7c0802a6 fb81ffe0 fbc1fff0 3ce20011 fba1ffe8 fbe1fff8 7c7e1b78 7c9c2378
3d420003 394acf80 3d020003 3908cf00 <38e7ff08> f8010010 f821ffb1 812a0000
watchdog: CPU 96 became unstuck TB:23470021067039
CPU: 96 PID: 0 Comm: swapper/96 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
NIP: c00000000000aeec LR: c000000000016644 CTR: c00000000082d5a0
REGS: c000003ca9343ad0 TRAP: 0901 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42004024 XER: 20000000
CFAR: c00000000000c1bc SOFTE: 0
GPR00: c00000000082e4e8 c000003ca9343d50 c000000001086c00 0000000000000900
GPR04: c000000000fcf820 0000000000000000 0000000000000001 0000000000000808
GPR08: 0000000000000000 c000000000d86c80 c0000000010b3b80 c000003ffc127630
GPR12: c00000000082d5a0 c000003fff748f80

NIP [c00000000000aeec] replay_interrupt_return+0x0/0x4
LR [c000000000016644] arch_local_irq_restore+0x74/0x90
Call Trace:
[c000003ca9343d50] [c00000000019ad4c] tick_nohz_get_sleep_length+0xbc/0x100 (unreliable)
[c000003ca9343d70] [c00000000082e4e8] snooze_loop+0x48/0x190
[c000003ca9343db0] [c00000000082b618] cpuidle_enter_state+0xb8/0x3c0
[c000003ca9343e10] [c000000000130104] call_cpuidle+0x44/0x80
[c000003ca9343e30] [c000000000130698] do_idle+0x2f8/0x3a0
...
[c000003ca9343ec0] [c000000000130914] cpu_startup_entry+0x34/0x40

[c000003ca9343ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
kernel:watchdog[44665.478148] [c000003ca9343f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
Instruction dump: CPU 96 self-det
7d200026 618c8000 2c030900 4182e328 2c030500 4182dd40 2c030f00 4182f1e8
ted hard LOCKUP[44665.478356] 2c030a00 4182ff9c 2c030e60 4182edb8 <4e800020> 7c781b78 48000395 480003ad
@ tick_nohz_next_event+0x38/0x200

kernel:watchdog: CPU 96 TB:23470018682939, last heartbeat TB:23464816758804 (10160ms ago

--
Regard's

Abdul Haleem
IBM Linux Technology Centre



Attachments:
Hab-NV-config_with_NVMe (89.66 kB)

2018-05-21 06:52:47

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [next-20180517][ppc] watchdog: CPU 88 self-detected hard LOCKUP @ update_cfs_group+0x30/0x150

Ah, it's POWER8.

I'm betting we have a bug with nohz timer offloading somewhere.

I *think* we may have seen similar on P9 as well, but that may be
related to problems with stop states.

Can you reproduce it easily? I'm thinking maybe adding some
tracepoints that track decrementer settings and interrupts, and
nohz offload activity might show something up.

Thanks,
Nick


On Mon, 21 May 2018 11:45:00 +0530
Abdul Haleem <[email protected]> wrote:

> Greeting's
>
> Offlate we are seeing hard hard lockup trace messages on our powerpc
> bare-metal machine for both linux-next and mainline kernel
>
> Machine Type: Power 8 Bare-metal (minsky)
> kernel: 4.17.0-rc5-next-20180517
> gcc: 4.8.5
> config: attached
> test: the exact scenario which is triggering these traces is not known
>
> trace logs:
> -----------
> Watchdog: CPU 88 TB:20328455213119, last heartbeat TB:20323303945487 (10061ms ago)
> Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash
> lzo_compress raid6_pq dm_snapshot dm_bufio nvme bnx2x iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
> nf_reject_ipv4 xt_tcpudp tun bridge stp llc iptable_filter dm_mirror
> dm_region_hash dm_log dm_service_time vmx_crypto powernv_rng rng_core
> dm_multipath kvm_hv kvm binfmt_misc nfsd ip_tables x_tables autofs4 xfs
> lpfc crc_t10dif crct10dif_generic nvme_fc nvme_fabrics mdio libcrc32c
> nvme_core crct10dif_common [last unloaded: nvme]
> CPU: 88 PID: 0 Comm: swapper/88 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
> NIP: c000000000131a80 LR: c0000000001376b8 CTR: c0000000008284c0
> REGS: c000003fff42fd80 TRAP: 0900 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
> MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22042044 XER: 20000000
> CFAR: c0000000001376b4 SOFTE: 1
> GPR00: c000000000137b64 c000003ca93236a0 c000000001086c00 c000003c907bdc00
> GPR04: c000003c907bdc00 c000003ffbf23a68 0000000000000000 c000001f4f540400
> GPR08: 0000000000100000 c000003c907bde00 0000000000000400 c000003c907bde30
> GPR12: c0000000008284c0 c000003fff74f880 c0000000008284c0 c000003ffbf260b8
> GPR16: c000003ffbf27580 0000000000000001 0000000000000000 000000000000ba7e
> GPR20: 000000000000ba7e 0000000000000001 c000001f4ac55900 c000003ffbf27580
> GPR24: afb504000afb5041 0000000000000001 0000000000000000 0000000000000001
> GPR28: 0000000000000001 0000000000000000 c000003ffbf27600 c000003c907bdc00
> NIP [c000000000131a80] update_cfs_group+0x30/0x150
> LR [c0000000001376b8] enqueue_task_fair+0x3c8/0x1390
> Call Trace:
> [c000003ca93236a0] [c000000000137b64] enqueue_task_fair+0x874/0x1390 (unreliable)
> [c000003ca9323790] [c000000000128870] activate_task+0x80/0x120
> [c000003ca9323810] [c000000000128db8] ttwu_do_activate+0x68/0xc0
> [c000003ca9323850] [c00000000012a208] try_to_wake_up+0x238/0x5c0
> [c000003ca93238d0] [c0000000001853b0] hrtimer_wakeup+0x30/0x60
> [c000003ca93238f0] [c000000000185094] __hrtimer_run_queues+0x124/0x350
> [c000003ca9323970] [c000000000185e3c] hrtimer_interrupt+0x11c/0x2f0
> [c000003ca9323a20] [c000000000023290] __timer_interrupt+0x90/0x270
> [c000003ca9323a70] [c000000000023900] timer_interrupt+0xa0/0xe0
> [c000003ca9323aa0] [c000000000009348] decrementer_common+0x158/0x160
> --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> LR = arch_local_irq_restore+0x74/0x90
> [c000003ca9323d90] [c00000000019a92c] tick_nohz_idle_stop_tick+0x17c/0x360 (unreliable)
> [c000003ca9323db0] [c00000000082b668] cpuidle_enter_state+0x108/0x3c0
> [c000003ca9323e10] [c000000000130104] call_cpuidle+0x44/0x80
> [c000003ca9323e30] [c000000000130698] do_idle+0x2f8/0x3a0
> [c000003ca9323ec0] [c000000000130914] cpu_startup_entry+0x34/0x40
> [c000003ca9323ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
> [c000003ca9323f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
> Instruction dump:
> 3c4c00f5 384251b0 e9230160 7c641b78 2fa90000 4d9e0020 e9490000 e8c900a0
> e8e901d0 794ab282 e9070100 7faa3040 <419c0100> e8c70180 e8e90180 7ca851d2
>
> kernel:watchdog: CPU 88 self-detected hard LOCKUP @ update_cfs_group+0x30/0x150
>
> kernel:watchdog: CPU 88 TB:20328455213119, last heartbeat TB:20323303945487 (10061ms ago)
>
> watchdog: CPU 88 became unstuck TB:20328505869807
> CPU: 88 PID: 0 Comm: swapper/88 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
> NIP: c00000000000aeec LR: c000000000016644 CTR: c000000000022be0
> REGS: c000003ca9323b10 TRAP: 0901 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
> MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42004084 XER: 00000000
> CFAR: c000000000185fe4 SOFTE: 0
> GPR00: c00000000082b668 c000003ca9323d90 c000000001086c00 0000000000000900
> GPR04: b000000000001033 c000000000093f80 0000000000000001 0000000000000008
> GPR08: 0000000000000000 c000000000d86c84 0000000000000000 c000003ffbf27630
> GPR12: c000000000022be0 c000003fff74f880
> NIP [c00000000000aeec] replay_interrupt_return+0x0/0x4
> LR [c000000000016644] arch_local_irq_restore+0x74/0x90
> Call Trace:
> [c000003ca9323d90] [c00000000019a92c] tick_nohz_idle_stop_tick+0x17c/0x360 (unreliable)
> [c000003ca9323db0] [c00000000082b668] cpuidle_enter_state+0x108/0x3c0
> [c000003ca9323e10] [c000000000130104] call_cpuidle+0x44/0x80
> [c000003ca9323e30] [c000000000130698] do_idle+0x2f8/0x3a0
> [c000003ca9323ec0] [c000000000130910] cpu_startup_entry+0x30/0x40
> [c000003ca9323ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
> [c000003ca9323f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
> Instruction dump:
> 7d200026 618c8000 2c030900 4182e328 2c030500 4182dd40 2c030f00 4182f1e8
> 2c030a00 4182ff9c 2c030e60 4182edb8 <4e800020> 7c781b78 48000395 480003ad
>
> kernel:watchdog: CPU 88 became unstuck TB:20328505869807
>
> watchdog: CPU 96 self-detected hard LOCKUP @ tick_nohz_next_event+0x38/0x200
> watchdog: CPU 96 TB:23470018682939, last heartbeat TB:23464816758804 (10160ms ago)
> Modules linked in: btrfs xor zstd_decompress zstd_compress xxhash
> lzo_compress raid6_pq dm_snapshot dm_bufio nvme bnx2x iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
> nf_reject_ipv4 xt_tcpudp tun bridge stp llc iptable_filter dm_mirror
> dm_region_hash dm_log dm_service_time vmx_crypto powernv_rng rng_core
> dm_multipath kvm_hv kvm binfmt_misc nfsd ip_tables x_tables autofs4 xfs
> lpfc crc_t10dif crct10dif_generic nvme_fc nvme_fabrics mdio libcrc32c
> nvme_core crct10dif_common [last unloaded: nvme]
> CPU: 96 PID: 0 Comm: swapper/96 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
> NIP: c000000000199df8 LR: c00000000019ad3c CTR: c00000000082d5a0
> REGS: c000003fff3cfd80 TRAP: 0900 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
> MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22004084 XER: 00000000
> CFAR: c00000000019ad38 SOFTE: 1
> GPR00: c00000000019ad3c c000003ca9343d50 c000000001086c00 c000003ffc113880
> GPR04: 0000000000000060 0000000000000000 0000003ffb3a0000 c000000001196c00
> GPR08: c0000000010b3b00 c000000000d86c80 c0000000010b3b80 c000003ffc127630
> GPR12: c00000000082d5a0 c000003fff748f80 c000003ca9343f90 c0000000012f6020
> GPR16: c000003ffcc08cb0 c0000000000410b0 c0000000000410b0 c000000000d65260
> GPR20: 0000000000000001 c000000000fcf820 c000003ca9340000 c000003ca9340080
> GPR24: c000000000d862e8 0000000077359400 c000003ffc1260d8 c000000000fcf820
> GPR28: 0000000000000060 0000000000000060 c000003ffc113880 c000003ffc113880
> NIP [c000000000199df8] tick_nohz_next_event+0x38/0x200
> LR [c00000000019ad3c] tick_nohz_get_sleep_length+0xac/0x100
> Call Trace:
> [c000003ca9343d50] [c00000000019ad04] tick_nohz_get_sleep_length+0x74/0x100 (unreliable)
> [c000003ca9343d90] [c00000000082d668] menu_select+0xc8/0x7f0
> [c000003ca9343e10] [c00000000082b950] cpuidle_select+0x30/0x50
> [c000003ca9343e30] [c000000000130664] do_idle+0x2c4/0x3a0
> [c000003ca9343ec0] [c000000000130914] cpu_startup_entry+0x34/0x40
> [c000003ca9343ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
> [c000003ca9343f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
> Instruction dump:
> 7c0802a6 fb81ffe0 fbc1fff0 3ce20011 fba1ffe8 fbe1fff8 7c7e1b78 7c9c2378
> 3d420003 394acf80 3d020003 3908cf00 <38e7ff08> f8010010 f821ffb1 812a0000
> watchdog: CPU 96 became unstuck TB:23470021067039
> CPU: 96 PID: 0 Comm: swapper/96 Not tainted 4.17.0-rc5-next-20180517-autotest-autotest #1
> NIP: c00000000000aeec LR: c000000000016644 CTR: c00000000082d5a0
> REGS: c000003ca9343ad0 TRAP: 0901 Not tainted (4.17.0-rc5-next-20180517-autotest-autotest)
> MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42004024 XER: 20000000
> CFAR: c00000000000c1bc SOFTE: 0
> GPR00: c00000000082e4e8 c000003ca9343d50 c000000001086c00 0000000000000900
> GPR04: c000000000fcf820 0000000000000000 0000000000000001 0000000000000808
> GPR08: 0000000000000000 c000000000d86c80 c0000000010b3b80 c000003ffc127630
> GPR12: c00000000082d5a0 c000003fff748f80
>
> NIP [c00000000000aeec] replay_interrupt_return+0x0/0x4
> LR [c000000000016644] arch_local_irq_restore+0x74/0x90
> Call Trace:
> [c000003ca9343d50] [c00000000019ad4c] tick_nohz_get_sleep_length+0xbc/0x100 (unreliable)
> [c000003ca9343d70] [c00000000082e4e8] snooze_loop+0x48/0x190
> [c000003ca9343db0] [c00000000082b618] cpuidle_enter_state+0xb8/0x3c0
> [c000003ca9343e10] [c000000000130104] call_cpuidle+0x44/0x80
> [c000003ca9343e30] [c000000000130698] do_idle+0x2f8/0x3a0
> ...
> [c000003ca9343ec0] [c000000000130914] cpu_startup_entry+0x34/0x40
>
> [c000003ca9343ef0] [c0000000000432a4] start_secondary+0x4d4/0x520
> kernel:watchdog[44665.478148] [c000003ca9343f90] [c00000000000b270] start_secondary_prolog+0x10/0x14
> Instruction dump: CPU 96 self-det
> 7d200026 618c8000 2c030900 4182e328 2c030500 4182dd40 2c030f00 4182f1e8
> ted hard LOCKUP[44665.478356] 2c030a00 4182ff9c 2c030e60 4182edb8 <4e800020> 7c781b78 48000395 480003ad
> @ tick_nohz_next_event+0x38/0x200
>
> kernel:watchdog: CPU 96 TB:23470018682939, last heartbeat TB:23464816758804 (10160ms ago
>


2018-05-29 13:12:50

by Abdul Haleem

[permalink] [raw]
Subject: Re: [next-20180517][ppc] watchdog: CPU 88 self-detected hard LOCKUP @ update_cfs_group+0x30/0x150

On Mon, 2018-05-21 at 16:50 +1000, Nicholas Piggin wrote:
> Ah, it's POWER8.
>
> I'm betting we have a bug with nohz timer offloading somewhere.
>
> I *think* we may have seen similar on P9 as well, but that may be
> related to problems with stop states.
>
> Can you reproduce it easily? I'm thinking maybe adding some
> tracepoints that track decrementer settings and interrupts, and
> nohz offload activity might show something up.

Yes, the problem is reproducible consistently on our CI setup and today
It triggered on 4.17.0-rc6 (mainline) too.

--
Regard's

Abdul Haleem
IBM Linux Technology Centre