From: Jeff Merkey
To: Don Zickus
Cc: LKML, akpm@linux-foundation.org, uobergfe@redhat.com, atomlin@redhat.com, cmetcalf@ezchip.com, fweisbec@gmail.com
Subject: Re: [PATCH 1/1] Fix HARD Lockup Firing off while in debugger
Date: Mon, 14 Dec 2015 19:59:18 -0700
In-Reply-To: <20151214172840.GB42652@redhat.com>

On 12/14/15, Don Zickus wrote:
> On Sat, Dec 12, 2015 at 02:08:13PM -0700, Jeff Merkey wrote:
>> The current touch_nmi_watchdog() function in /kernel/watchdog.c does
>> not always catch all cases when a processor is spinning in the nmi
>> handler inside either KGDB, KDB, or MDB. The hrtimer_interrupts_saved
>> count can still end up matching the previous value in some cases,
>> resulting in the hard lockup detector tagging processors inside a
>
> Hi Jeff,
>
> I am confused here, the 'touch_nmi_watchdog()' was supposed to block the
> check for hrtimer_interrupts from happening.  So if the check is still
> being executed _after_ you executed touch_nmi_watchdog(), it would imply
> there was about 10 seconds or so of time elapse from the touch command to
> the hrtimer check.
>
> So I am not sure how the below patch would fix this, other than just add
> another 10 second delay (for a total of 20 seconds) to your timeout?
>
>> debugger and executing a panic.  The patch below corrects this
>> problem.  I did not add this to the touch_nmi_function directly
>> because of possible effects on timing issues.
>>
>> I have tested this patch and it fixes the problem for kernel debuggers
>> stopping errant hard lockup events when processors are spinning inside
>> the debugger.
>
> The kernel doesn't normally take patches like this without a corresponding
> user, which I didn't see attached in this patch or a patch series.
>
> Cheers,
> Don
>

I'll resend the patch series properly formatted and clean.  There is a
hole in there somewhere that causes this bug.  You can reproduce it by
downloading the mdb debugger, patching Linux, building it, then
removing the call to this function while spinning in the debugger with
a breakpoint on schedule() set from the debugger console.  Without the
function I have suggested, the hard lockup detector does fire off in
about 20 seconds.

You can download the debugger here:

https://github.com/jeffmerkey/linux-stable/compare/v4.3.2...jeffmerkey:mdb-v4.3.2.diff

Apply this patch to kernel v4.3.2 if you want to easily reproduce it.
Before you build, remove the function call to
touch_hardlockup_watchdog() at mdb_watchdogs() in
arch/x86/kernel/debug/mdb/mdb-main.c.

I'll format another patch, a clean one this time.  I apologize.

Jeff
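
For readers following the exchange above, here is a minimal user-space sketch of the check under discussion. It is a model, not the kernel source. It assumes the v4.3-era logic in kernel/watchdog.c, where the NMI callback skips one check if the per-CPU touch flag is set and otherwise compares hrtimer_interrupts against hrtimer_interrupts_saved. The struct cpu_model type and every model_* function are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU watchdog state. */
struct cpu_model {
	unsigned long hrtimer_interrupts;       /* bumped by the hrtimer */
	unsigned long hrtimer_interrupts_saved; /* value seen at last NMI check */
	bool nmi_touch;                         /* set by a "touch" call */
};

/* The hrtimer fired: the CPU is still taking interrupts. */
static void model_hrtimer_tick(struct cpu_model *cpu)
{
	cpu->hrtimer_interrupts++;
}

/* Models touch_nmi_watchdog(): request that the next check be skipped. */
static void model_touch_nmi_watchdog(struct cpu_model *cpu)
{
	cpu->nmi_touch = true;
}

/* No hrtimer progress since the previous check means a hard lockup. */
static bool model_is_hardlockup(struct cpu_model *cpu)
{
	unsigned long hrint = cpu->hrtimer_interrupts;

	if (cpu->hrtimer_interrupts_saved == hrint)
		return true;

	cpu->hrtimer_interrupts_saved = hrint;
	return false;
}

/* Models the periodic NMI callback; returns true if a lockup is reported. */
static bool model_nmi_check(struct cpu_model *cpu)
{
	if (cpu->nmi_touch) {
		/* A touch suppresses exactly one check, then is cleared. */
		cpu->nmi_touch = false;
		return false;
	}
	return model_is_hardlockup(cpu);
}
```

In this model a single touch buys one NMI period. If a CPU keeps spinning in the debugger with its hrtimer stopped and is not touched again, the following check sees the saved count unchanged and reports a lockup. That matches Don's reading that a report after a touch implies a further period elapsed, and is at least consistent with the roughly 20 second figure in the reply. Whether this is the actual hole in the MDB case is not established by the thread.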