Date: Thu, 16 Apr 2015 18:31:40 +0200
From: Ingo Molnar
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez, the arch/x86 maintainers
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Message-ID: <20150416163140.GA17024@gmail.com>
In-Reply-To: <20150416155819.GA20490@canonical.com>

* Chris J Arges wrote:

> A previous backtrace of a 3.19 series kernel is here and showing interrupts
> enabled on both CPUs on L1:
> https://lkml.org/lkml/2015/2/23/234
> http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
>
> [...]
>
> Yes, I think at this point I'll go through the various backtraces
> and try to narrow things down. I think overall we're seeing a single
> effect from multiple code paths.

Now what would be nice is to observe whether the CPU that is not
doing the CSD wait is truly locked up.
It might be executing random KVM-ish workloads and the various
backtraces we've seen so far are just a random sample of those
workloads (from L1's perspective).

Yet the fact that the kdump's NMI gets through is a strong indication
that the CPU's APIC is fine: NMIs are essentially IPIs too, they just
go to the NMI vector, which punches through irqs-off regions.

So maybe another debug trick would be useful: instead of re-sending
the IPI, send a single non-destructive NMI every second or so,
creating a backtrace on the other CPU. From that we'll be able to see
whether it's locked up permanently in an irqs-off section.

I.e. basically you could try to trigger the 'show NMI backtraces on
all CPUs' logic when the lockup triggers, and repeat it every couple
of seconds.

The simplest method to do that would be to call:

	trigger_all_cpu_backtrace();

every couple of seconds, in the CSD polling loop - after the initial
timeout has passed. I'd suggest collecting at least 10 pairs of
backtraces that way.

Thanks,

	Ingo
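
For illustration, the polling logic suggested above could be sketched
roughly as follows. This is a user-space simulation, not the actual
kernel/smp.c change: the millisecond counter, the struct and function
names, and the 5 second initial timeout used in the usage note are all
assumptions made up for the sketch, and fake_trigger_all_cpu_backtrace()
is only a stand-in for the real trigger_all_cpu_backtrace(). The cap of
10 dumps is one way to get the "at least 10" backtraces and then stop
flooding the log.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the kernel's trigger_all_cpu_backtrace();
 * here it only logs that an NMI backtrace would have been requested.
 */
static void fake_trigger_all_cpu_backtrace(int seq)
{
	printf("NMI backtrace request #%d\n", seq);
}

/* Assumed tunables, not taken from any kernel patch. */
struct csd_wait_cfg {
	unsigned long initial_timeout_ms;	/* wait this long before the first dump */
	unsigned long repeat_ms;		/* "every couple of seconds" */
	int max_backtraces;			/* stop dumping after this many */
};

/*
 * Simulated CSD polling loop. 'locked_until_ms' is the simulated time at
 * which the remote CPU finally releases the CSD lock. Returns the number
 * of backtrace requests issued while waiting.
 */
static int csd_lock_wait_debug(const struct csd_wait_cfg *cfg,
			       unsigned long locked_until_ms)
{
	unsigned long now_ms = 0;
	unsigned long next_dump_ms = cfg->initial_timeout_ms;
	int dumps = 0;

	while (now_ms < locked_until_ms) {
		if (now_ms >= next_dump_ms && dumps < cfg->max_backtraces) {
			fake_trigger_all_cpu_backtrace(++dumps);
			next_dump_ms = now_ms + cfg->repeat_ms;
		}
		now_ms++;	/* stands in for cpu_relax() plus a clock read */
	}

	return dumps;
}
```

With an assumed configuration of { 5000, 2000, 10 }, a lock that is
released before the initial timeout produces no dumps, while one that
stays held keeps producing a backtrace request every 2 seconds until the
cap is reached.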