The current touch_nmi_watchdog() function in kernel/watchdog.c does
not catch all cases where a processor is spinning in the NMI
handler inside KGDB, KDB, or MDB. The hrtimer_interrupts_saved
count can still end up matching the previous value in some cases,
so the hard lockup detector tags processors that are inside a
debugger and triggers a panic. The patch below corrects this
problem. I did not add this to touch_nmi_watchdog() directly
because of possible effects on timing.

I have tested this patch and it stops the errant hard lockup events
when processors are spinning inside a kernel debugger.

Signed-off-by: Jeff V. Merkey <[email protected]>
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18f34cf..b682aab 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -283,6 +283,13 @@ static bool is_hardlockup(void)
 	__this_cpu_write(hrtimer_interrupts_saved, hrint);
 	return false;
 }
+
+void touch_hardlockup_watchdog(void)
+{
+	__this_cpu_write(hrtimer_interrupts_saved, 0);
+}
+EXPORT_SYMBOL_GPL(touch_hardlockup_watchdog);
+
 #endif
 
 static int is_softlockup(unsigned long touch_ts)

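For anyone following along without the source open, here is a minimal
userspace sketch of what the hunk above touches. It is a model, not
kernel code: the struct cpu_watchdog type and the explicit pointer
argument are assumptions standing in for the per-CPU variables, and the
"saved == current" comparison is how I read is_hardlockup() in the 4.3
tree (only its last lines are visible in the hunk context).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU counters in kernel/watchdog.c. */
struct cpu_watchdog {
	unsigned long hrtimer_interrupts;       /* bumped by the watchdog hrtimer */
	unsigned long hrtimer_interrupts_saved; /* count seen at the previous check */
};

/* Model of is_hardlockup(): no hrtimer progress since the last check. */
static bool is_hardlockup(struct cpu_watchdog *w)
{
	unsigned long hrint = w->hrtimer_interrupts;

	if (w->hrtimer_interrupts_saved == hrint)
		return true;

	w->hrtimer_interrupts_saved = hrint;
	return false;
}

/* Model of the proposed helper: forget the saved count. */
static void touch_hardlockup_watchdog(struct cpu_watchdog *w)
{
	w->hrtimer_interrupts_saved = 0;
}
```

In this model, clearing the saved count makes the next check return
false and re-save the current count, so a stalled counter is only
reported on the check after that. Note that the helper has no effect
if hrtimer_interrupts itself is still 0.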
FYI, more info: this bug only occurs when the processors have all
triggered an int1 exception (do_debug) and are all spinning, waiting
to access the debugger console. It seems to be related to the NMI
handler firing on processors that are waiting to get the shared
debugger lock in each debugger.

It does affect processors that may be in both NMI handlers and int1
handlers concurrently, but it is triggered when more than one of the
processors has an outstanding int1 exception.

On Sat, Dec 12, 2015 at 02:08:13PM -0700, Jeff Merkey wrote:
> The current touch_nmi_watchdog() function in /kernel/watchdog.c does
> not always catch all cases when a processor is spinning in the nmi
> handler inside either KGDB, KDB, or MDB. The hrtimer_interrupts_saved
> count can still end up matching the previous value in some cases,
> resulting in the hard lockup detector tagging processors inside a
Hi Jeff,
I am confused here. touch_nmi_watchdog() is supposed to block the
check of hrtimer_interrupts from happening. So if the check is still being
executed _after_ you called touch_nmi_watchdog(), it would imply that
about 10 seconds or so elapsed between the touch and the hrtimer
check.

So I am not sure how the patch below would fix this, other than by adding
another 10 second delay (for a total of 20 seconds) to your timeout?
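
The timing argument above can be sketched in userspace. This is a
rough model under stated assumptions, not the kernel implementation:
struct cpu_state, watchdog_nmi() and nmis_until_lockup() are
hypothetical names, the touch flag is modeled as "skip exactly one
check", and the CPU is assumed to be stalled so its hrtimer count
never advances.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct cpu_state {
	unsigned long hrtimer_interrupts;
	unsigned long hrtimer_interrupts_saved;
	bool watchdog_nmi_touch;
};

/* Model of touch_nmi_watchdog(): ask the next NMI check to be skipped. */
static void touch_nmi_watchdog(struct cpu_state *c)
{
	c->watchdog_nmi_touch = true;
}

/* Model of the proposed helper from the patch. */
static void touch_hardlockup_watchdog(struct cpu_state *c)
{
	c->hrtimer_interrupts_saved = 0;
}

/* One watchdog NMI; returns true if it would report a hard lockup. */
static bool watchdog_nmi(struct cpu_state *c)
{
	if (c->watchdog_nmi_touch) {
		c->watchdog_nmi_touch = false;
		return false;
	}
	if (c->hrtimer_interrupts_saved == c->hrtimer_interrupts)
		return true;
	c->hrtimer_interrupts_saved = c->hrtimer_interrupts;
	return false;
}

/* Count NMIs until a lockup is reported on a stalled CPU touched once. */
static int nmis_until_lockup(bool also_clear_saved)
{
	struct cpu_state c = {
		.hrtimer_interrupts = 100,       /* stalled: never advances */
		.hrtimer_interrupts_saved = 100, /* already seen by a prior check */
		.watchdog_nmi_touch = false,
	};
	int n = 0;

	touch_nmi_watchdog(&c);
	if (also_clear_saved)
		touch_hardlockup_watchdog(&c);

	do {
		n++;
	} while (!watchdog_nmi(&c));
	return n;
}
```

In this model nmis_until_lockup(false) returns 2 and
nmis_until_lockup(true) returns 3, i.e. clearing the saved count buys
one extra NMI period when the CPU is touched only once. It says nothing
about why the touch would be missed in the first place.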
> debugger and executing a panic. The patch below corrects this
> problem. I did not add this to the touch_nmi_function directly
> because of possible effects on timing issues.
>
> I have tested this patch and it fixes the problem for kernel debuggers
> stopping errant hard lockup events when processors are spinning inside
> the debugger.
The kernel doesn't normally take patches like this without a corresponding
user, which I didn't see attached in this patch or in a patch series.
Cheers,
Don
>
>
> Signed-off-by: Jeff V. Merkey <[email protected]>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 18f34cf..b682aab 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -283,6 +283,13 @@ static bool is_hardlockup(void)
> __this_cpu_write(hrtimer_interrupts_saved, hrint);
> return false;
> }
> +
> +void touch_hardlockup_watchdog(void)
> +{
> + __this_cpu_write(hrtimer_interrupts_saved, 0);
> +}
> +EXPORT_SYMBOL_GPL(touch_hardlockup_watchdog);
> +
> #endif
>
> static int is_softlockup(unsigned long touch_ts)
On 12/14/15, Don Zickus <[email protected]> wrote:
> On Sat, Dec 12, 2015 at 02:08:13PM -0700, Jeff Merkey wrote:
>> The current touch_nmi_watchdog() function in /kernel/watchdog.c does
>> not always catch all cases when a processor is spinning in the nmi
>> handler inside either KGDB, KDB, or MDB. The hrtimer_interrupts_saved
>> count can still end up matching the previous value in some cases,
>> resulting in the hard lockup detector tagging processors inside a
>
> Hi Jeff,
>
> I am confused here, the 'touch_nmi_watchdog()' was supposed to block the
> check for hrtimer_interrupts from happening. So if the check is still
> being
> executed _after_ you executed touch_nmi_watchdog(), it would imply there
> was
> about 10 seconds or so of time elapse from the touch command to the hrtimer
> check.
>
> So I am not sure how the below patch would fix this, other than just add
> another 10 second delay (for a total of 20 seconds) to your timeout?
>
>
>> debugger and executing a panic. The patch below corrects this
>> problem. I did not add this to the touch_nmi_function directly
>> because of possible effects on timing issues.
>>
>> I have tested this patch and it fixes the problem for kernel debuggers
>> stopping errant hard lockup events when processors are spinning inside
>> the debugger.
>
> The kernel doesn't normally take patches like this without a corresponding
> user, which I didn't see attached in this patch or a patch series.
>
> Cheers,
> Don
>
I'll resend the patch series properly formatted and clean. There is
a hole in there somewhere that causes this bug. You can reproduce it
by downloading the MDB debugger, patching Linux, building it, then
removing the call to this function and spinning in the debugger with
a breakpoint on schedule() set from the debugger console. Without the
function I have suggested, the hard lockup detector fires in about
20 seconds.

You can download the debugger here:

https://github.com/jeffmerkey/linux-stable/compare/v4.3.2...jeffmerkey:mdb-v4.3.2.diff

Apply this patch to kernel v4.3.2 if you want to reproduce it easily,
and before you build, remove the call to
touch_hardlockup_watchdog() in mdb_watchdogs() in
arch/x86/kernel/debug/mdb/mdb-main.c.

I'll format another patch, a clean one this time. I apologize.

Jeff

On 12/14/15, Jeff Merkey <[email protected]> wrote:
Oh, and don't forget to type "g" for go after setting the schedule()
breakpoint. This will reload all the processors and cause them to
break into the debugger and be held by the debugger at the int1 exception.
This is when touch_nmi_watchdog() breaks.

You also need to do this on an SMP system, preferably one with 4 or
more processors. It's an SMP bug.

Jeff