Date: Thu, 29 Jun 2017 19:36:14 +1000
From: Nicholas Piggin
To: Abdul Haleem
Cc: linuxppc-dev, linux-next, Stephen Rothwell, linux-kernel, sachinp, mpe, paulus@samba.org
Subject: Re: [linux-next] cpus stalls detected few hours after booting next kernel
Message-ID: <20170629193614.376e27e6@roar.ozlabs.ibm.com>
In-Reply-To: <1498728106.19484.21.camel@abdul>
References: <1498728106.19484.21.camel@abdul>
Organization: IBM
X-Mailing-List: linux-kernel@vger.kernel.org

I've seen this as well (or something like it) in mambo at boot, but
it's pretty rare to hit. I'm trying to debug it. I'm guessing possibly
an idle vs interrupt race.

> [ 4255.151192] Sending NMI from CPU 5 to CPUs 0:
> [ 4255.151246] NMI backtrace for cpu 0
> [ 4255.151287] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-rc7-next-20170628 #2
> [ 4255.151363] task: c0000007f8495600 task.stack: c0000007f842c000
> [ 4255.151428] NIP: c00000000000adb4 LR: c000000000015584 CTR: c00000000082f4b0
> [ 4255.151504] REGS: c0000007f842fb60 TRAP: 0e81 Not tainted (4.12.0-rc7-next-20170628)
> [ 4255.151578] MSR: 9000000000009033
> [ 4255.151586] CR: 22004884 XER: 00000000
> [ 4255.151675] CFAR: c00000000062c108 SOFTE: 1
> [ 4255.151675] GPR00: c00000000082d6c8 c0000007f842fde0 c000000001062b00 0000000028000000
> [ 4255.151675] GPR04: 0000000000000003 c000000000089830 00003aa8056bc35f 0000000000000001
> [ 4255.151675] GPR08: 0000000000000002 c000000000d52d80 00000007fe7d0000 9000000000001003
> [ 4255.151675] GPR12: c00000000082a0c0 c00000000fd40000
> [ 4255.152217] NIP [c00000000000adb4] .L__replay_interrupt_return+0x0/0x4
> [ 4255.152334] LR [c000000000015584] arch_local_irq_restore+0x74/0x90
> [ 4255.152447] Call Trace:
> [ 4255.152499] [c0000007f842fde0] [c00000000017cec0] tick_broadcast_oneshot_control+0x40/0x60 (unreliable)
> [ 4255.152662] [c0000007f842fe00] [c00000000082d6c8] cpuidle_enter_state+0x108/0x3d0
> [ 4255.152803] [c0000007f842fe60] [c000000000133e94] call_cpuidle+0x44/0x80
> [ 4255.152921] [c0000007f842fe80] [c000000000134240] do_idle+0x290/0x2f0
> [ 4255.153037] [c0000007f842fef0] [c000000000134474] cpu_startup_entry+0x34/0x40
> [ 4255.153176] [c0000007f842ff20] [c000000000041944] start_secondary+0x304/0x360
> [ 4255.153316] [c0000007f842ff90] [c00000000000b16c] start_secondary_prolog+0x10/0x14
> [ 4255.153455] Instruction dump:
> [ 4255.153527] 7d200026 618c8000 2c030900 4182e320 2c030500 4182dd68 2c030e80 4182ffa4
> [ 4255.153668] 2c030ea0 4182f078 2c030e60 4182edb0 <4e800020> 7c781b78 480003c9 480003e1

What I think you're seeing here is just that the lockup IPI causes the
CPU to wake from idle, and so the backtrace gets
recorded from where the interrupt gets replayed when the cpuidle code
returns from the low-level idle call and then re-enables interrupts.

I don't *think* the replay-wakeup-interrupt patch is directly involved,
but it's likely to be one of the idle patches.

Thanks,
Nick
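
For readers following the thread, the replay behaviour described above
can be modelled with a toy sketch. This is a minimal illustration in
Python, not kernel code: ToyCpu, idle_once and the context strings are
hypothetical names invented here. The model assumes only what the
message states, namely that an interrupt arriving while interrupts are
soft-disabled is recorded, and is then replayed at the point where they
are re-enabled.

```python
# Toy model of soft-masked ("lazy") interrupt handling. Not kernel code;
# every name below is an illustrative assumption.


class ToyCpu:
    def __init__(self):
        self.soft_enabled = True  # software view of "interrupts on"
        self.pending = []         # interrupts that arrived while soft-masked
        self.handled = []         # (vector, context) pairs, in handling order

    def raise_irq(self, vector, context):
        """Handle an interrupt now, or only record it if soft-masked."""
        if self.soft_enabled:
            self.handled.append((vector, context))
        else:
            self.pending.append(vector)

    def irq_disable(self):
        self.soft_enabled = False

    def irq_restore(self):
        """Re-enable interrupts and replay anything recorded meanwhile."""
        self.soft_enabled = True
        while self.pending:
            self.handled.append((self.pending.pop(0), "irq_restore replay"))


def idle_once(cpu, wake_vector):
    cpu.irq_disable()
    # Low-level idle: the CPU sleeps until an interrupt wakes it up.
    cpu.raise_irq(wake_vector, "low-level idle")
    # Back in the idle driver, interrupts are turned back on.
    cpu.irq_restore()


cpu = ToyCpu()
idle_once(cpu, "lockup IPI")
print(cpu.handled)  # [('lockup IPI', 'irq_restore replay')]
```

In this model the handler runs in the "irq_restore replay" context
rather than inside the idle call itself, which mirrors why the quoted
backtrace shows arch_local_irq_restore and the replay-return label
instead of the code that was running when the stall was detected.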