Date: Sat, 13 Jun 2015 09:15:47 +0200
From: Ingo Molnar
To: Srinivas Pandruvada
Cc: mingo@redhat.com, tglx@linutronix.de, hpa@zytor.com, pavel@ucw.cz,
    rjw@rjwysocki.net, x86@kernel.org, linux-pm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Denys Vlasenko, Andy Lutomirski,
    Borislav Petkov, Brian Gerst, Linus Torvalds, "Kleen, Andi"
Subject: [PATCH, DEBUG] x86/32: Add small delay after resume
Message-ID: <20150613071547.GA27446@gmail.com>
References: <1434066338-6619-1-git-send-email-srinivas.pandruvada@linux.intel.com>
    <20150612060747.GA25024@gmail.com>
    <1434125724.2353.19.camel@spandruv-DESK3.jf.intel.com>
In-Reply-To: <1434125724.2353.19.camel@spandruv-DESK3.jf.intel.com>

* Srinivas Pandruvada wrote:

> > Also, could you please describe how the failure triggers in your system: how
> > many times do you have to suspend/resume to trigger the segfaults, and is
> > there anything that makes the failures less or more likely?
>
> It is very random. Sometimes only few hundred trys reproduce this issue. Some
> other times it requires thousands of trys (sometimes not reproducible at all for
> days) It is very time sensitive.

So the very same kernel image will produce different crash patterns depending on
the time of day? That suggests heat/hardware problems.

> [...] A small delay or some debug code in resume path prevents this to crash.

Fun...
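Because the failure shows up only once in hundreds or thousands of suspend/resume cycles, it helps to have a rough idea of how many clean cycles a test run needs before a change can be credited with fixing anything. The following is a minimal sketch, not part of the thread: it assumes every cycle fails independently with the same fixed probability, which is a simplification given how timing-sensitive the reports are. The function name and the example rates are hypothetical.

```python
import math


def clean_cycles_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Return the smallest n with (1 - failure_rate)**n <= 1 - confidence.

    Assumption (not from the thread): each suspend/resume cycle fails
    independently with probability `failure_rate`. If the bug were still
    present, n clean cycles in a row would then happen with probability
    at most 1 - confidence.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Hypothetical per-cycle failure rates, loosely matching "a few hundred
    # tries" and "thousands of tries"; they are illustrative, not measured.
    for rate in (1 / 300, 1 / 1000, 1 / 3000):
        cycles = clean_cycles_needed(rate)
        print(f"failure rate {rate:.5f}: {cycles} clean cycles for 95% confidence")
```

Under this model the required run length is roughly three times the typical number of tries between failures, so a bug that takes thousands of tries to reproduce needs a test run of several thousand clean cycles.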
> The BIOS folks created special version to check if they are corrupting any DS,
> but they were not able to catch any corruption. [...]

So is it true that we always execute wakeup_pmode_return first after we return
from the BIOS? If so then the BIOS touching DS cannot be an issue, as we
re-initialize all segment selectors, which reloads the descriptors:

  ENTRY(wakeup_pmode_return)
  wakeup_pmode_return:
        movw    $__KERNEL_DS, %ax
        movw    %ax, %ss
        movw    %ax, %ds
        movw    %ax, %es
        movw    %ax, %fs
        movw    %ax, %gs

        # reload the gdt, as we need the full 32 bit address
        lidt    saved_idt
        lldt    saved_ldt
        ljmp    $(__KERNEL_CS), $1f

> [...] Since these are special deployed systems running critical application,
> need to request the tests again with your changes. May take long time.

So my second patch is clearly broken as per Brian Gerst's comments.

What I would suggest is to try a patch that adds just 100 NOPs or so - attached
below. This patch will add a small delay without any side effects (other than
changing the kernel image layout).

If that makes the crash go away, then I'd say the likelihood that it's hardware
related increases substantially: maybe a PLL has not stabilized yet sufficiently
after resume, or there's some latent heat sensitivity and the fan has not started
up yet, etc.

( You can then use this simple delay generating patch in production systems as
  well, to work around the problem. Maybe convince the BIOS folks to add a delay
  like this to their resume path before they call Linux.
)

Thanks,

	Ingo

=================>

 arch/x86/kernel/acpi/wakeup_32.S | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/acpi/wakeup_32.S b/arch/x86/kernel/acpi/wakeup_32.S
index 665c6b7d2ea9..ef26999da80a 100644
--- a/arch/x86/kernel/acpi/wakeup_32.S
+++ b/arch/x86/kernel/acpi/wakeup_32.S
@@ -10,6 +10,12 @@
 
 ENTRY(wakeup_pmode_return)
 wakeup_pmode_return:
+
+	/* Timing delay of a few dozen cycles: give the hardware some time to recover */
+	.rept 100
+	nop
+	.endr
+
 	movw	$__KERNEL_DS, %ax
 	movw	%ax, %ss
 	movw	%ax, %ds
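
The "few dozen cycles" in the patch comment can be sanity-checked with back-of-the-envelope arithmetic. This is only a sketch: the NOP throughput of 4 per cycle and the 2 GHz clock are hypothetical figures, not taken from the thread, and the real delay depends on the CPU model, its clock state right after resume, and how quickly instructions can be fetched on that path.

```python
def nop_delay(nop_count: int, nops_per_cycle: float, clock_ghz: float) -> tuple[float, float]:
    """Estimate the delay of a straight run of NOPs.

    Returns (cycles, nanoseconds). Both `nops_per_cycle` and `clock_ghz`
    are assumptions supplied by the caller, not measured values.
    """
    if nop_count < 0 or nops_per_cycle <= 0 or clock_ghz <= 0:
        raise ValueError("nop_count must be >= 0; throughput and clock must be > 0")
    cycles = nop_count / nops_per_cycle
    # A clock of N GHz completes N cycles per nanosecond.
    nanoseconds = cycles / clock_ghz
    return cycles, nanoseconds


if __name__ == "__main__":
    # 100 NOPs as in the patch; 4 NOPs/cycle and 2 GHz are hypothetical.
    cycles, ns = nop_delay(100, nops_per_cycle=4, clock_ghz=2.0)
    print(f"{cycles:.0f} cycles, about {ns:.1f} ns")
```

With these assumed figures the delay is on the order of tens of cycles, consistent with the wording of the patch comment. A CPU that retires only one NOP per cycle would take about 100 cycles instead.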