Date: Mon, 13 Apr 2015 08:14:51 +0200
From: Ingo Molnar
To: Chris J Arges
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Message-ID: <20150413061450.GA10857@gmail.com>
References: <551D8FAF.5070805@canonical.com> <20150402190725.GA10570@gmail.com>
    <551DB0E2.1020607@canonical.com> <20150403054320.GA9863@gmail.com>
    <5522BB49.5060704@canonical.com> <20150407092121.GA9971@gmail.com>
    <20150407205945.GA28212@canonical.com> <20150408064734.GA26861@gmail.com>
    <20150413035616.GA24037@canonical.com>
In-Reply-To: <20150413035616.GA24037@canonical.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Mailing-List: linux-kernel@vger.kernel.org

* Chris J Arges wrote:

> /sys/module/kvm_intel/parameters/enable_apicv on the affected
> hardware is not enabled, and unfortunately my hardware doesn't have
> the necessary features to enable it. So we are dealing with KVM's
> lapic implementation only.

That's actually pretty fortunate, as we don't have to worry about
hardware state nearly as much!

> FYI, I'm working on getting better data at the moment and here is my approach:
> * For the L0 kernel:
>   - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
>     the addresses of various useful structures)
>   - Setup crash to live dump kvm_lapic structures and associated registers for
>     both vCPUs

It would also be nice to double check the stuck vCPU's normal CPU
state: is it truly able to receive interrupts? (IRQ flags are on, or
is it sitting in the idle loop, etc.?)

If the IRQ flag (in EFLAGS) is off then the vCPU is not able to
receive interrupts, regardless of local APIC state.

> * For the L1 kernel:
>   - Dump a stacktrace when we detect a lockup.
>   - Detect a lockup and try to not alter the state.
>   - Have a reliable signal such that the L0 hypervisor can dump the lapic
>     structures and registers when csd_lock_wait detects a softlockup.

I'd also suggest adding a printk() to IPI receipt, to make sure it's
not the CSD code that is not getting called into after the IPI resend
attempt.

To make sure you only get messages after the CPU got stuck, add a
'locked_up' flag that signals this, and only print the messages if the
lockup scenario is happening.

I'd do it by adding something like this to
kernel/smp.c::generic_smp_call_function_single_interrupt():

	if (csd_locked_up)
		printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());

Having this message in place would ensure that the IPI indeed did not
get generated on the stuck vCPU. (Because we'd not get this message.)

Thanks,

	Ingo
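
For illustration only, the 'locked_up' gating suggested above can be
modeled in plain userspace C. This is a rough sketch, not kernel code:
csd_mark_locked_up(), csd_debug_msgs and ipi_callback() are hypothetical
names made up here, printf() stands in for printk(), and the CPU number
is passed in by hand instead of coming from raw_smp_processor_id(). Only
the csd_locked_up flag name and the message text come from the
suggestion itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Assumed flag: set once the detector decides a CSD lock is stuck.
 * The name follows the suggestion in the mail; it is not claimed to be
 * an existing kernel symbol.
 */
bool csd_locked_up;

/* Hypothetical counter, present only so the gating can be checked. */
int csd_debug_msgs;

/* Hypothetical stand-in for the lockup detector firing. */
void csd_mark_locked_up(void)
{
	csd_locked_up = true;
}

/*
 * Hypothetical stand-in for the function call IPI handler. It stays
 * silent during normal operation and only reports IPI receipt after a
 * lockup has been flagged.
 */
void ipi_callback(int cpu)
{
	assert(cpu >= 0);

	if (csd_locked_up) {
		printf("CSD: Function call IPI callback on CPU#%d\n", cpu);
		csd_debug_msgs++;
	}

	/* Normal call-function queue processing would follow here. */
}
```

Calling ipi_callback() before csd_mark_locked_up() prints nothing; after
it, every call prints one line. In the real debugging scenario, the
absence of that line from the stuck vCPU after a resend attempt is the
interesting signal.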