LinuxLists.cc - [PATCH 1/2] RFC: hung

2019-05-02 19:43:46

Subject: [PATCH 1/2] RFC: hung_task: taint kernel

There's the hung_task_panic sysctl, but that's a bit an extreme measure.
As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

Signed-off-by: Daniel Vetter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Valdis Kletnieks <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: "Liu, Chuansheng" <[email protected]>
---
kernel/hung_task.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f108a95882c6..7fae16f1b49c 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -203,6 +203,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
if (hung_task_call_panic) {
trigger_all_cpu_backtrace();
panic("hung_task: blocked tasks");
+ } else {
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
}
}

--
2.20.1

2019-05-02 19:45:51

by Daniel Vetter

[permalink] [raw]

Subject: [PATCH 2/2] RFC: soft/hardlookup: taint kernel

There's the soft/hardlookup_panic sysctls, but that's a bit an extreme
measure. As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

Signed-off-by: Daniel Vetter <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Valdis Kletnieks <[email protected]>
Cc: Laurence Oberman <[email protected]>
Cc: Vincent Whitchurch <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Sinan Kaya <[email protected]>
Cc: Daniel Vetter <[email protected]>
---
kernel/watchdog.c | 2 ++
kernel/watchdog_hld.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6a5787233113..de7a60503517 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
if (softlockup_panic)
panic("softlockup: hung tasks");
+ else
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
__this_cpu_write(soft_watchdog_warn, true);
} else
__this_cpu_write(soft_watchdog_warn, false);
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..cce46cf75d76 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,

if (hardlockup_panic)
nmi_panic(regs, "Hard LOCKUP");
+ else
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

__this_cpu_write(hard_watchdog_warn, true);
return;
--
2.20.1

2019-05-02 20:49:39

by Daniel Vetter

[permalink] [raw]

Subject: [PATCH] RFC: hung_task: taint kernel

There's the hung_task_panic sysctl, but that's a bit an extreme measure.
As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

v2: Works much better when I put the else { add_taint() } at the right
place.

Signed-off-by: Daniel Vetter <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Valdis Kletnieks <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: "Liu, Chuansheng" <[email protected]>
---
kernel/hung_task.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f108a95882c6..d90d98f53ccb 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -117,6 +117,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
console_verbose();
hung_task_show_lock = true;
hung_task_call_panic = true;
+ } else {
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
}

/*
--
2.20.1

2019-05-03 00:02:40

by Laurence Oberman

[permalink] [raw]

Subject: Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel

On Thu, 2019-05-02 at 21:42 +0200, Daniel Vetter wrote:
> There's the soft/hardlookup_panic sysctls, but that's a bit an
> extreme
> measure. As a fallback taint at least the machine.
>
> Our CI uses this to decide when a reboot is necessary, plus to figure
> out whether the kernel is still happy.
>
> Signed-off-by: Daniel Vetter <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Valdis Kletnieks <[email protected]>
> Cc: Laurence Oberman <[email protected]>
> Cc: Vincent Whitchurch <[email protected]>
> Cc: Don Zickus <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Sergey Senozhatsky <[email protected]>
> Cc: Sinan Kaya <[email protected]>
> Cc: Daniel Vetter <[email protected]>
> ---
> kernel/watchdog.c | 2 ++
> kernel/watchdog_hld.c | 2 ++
> 2 files changed, 4 insertions(+)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 6a5787233113..de7a60503517 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -469,6 +469,8 @@ static enum hrtimer_restart
> watchdog_timer_fn(struct hrtimer *hrtimer)
> add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> if (softlockup_panic)
> panic("softlockup: hung tasks");
> + else
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> __this_cpu_write(soft_watchdog_warn, true);
> } else
> __this_cpu_write(soft_watchdog_warn, false);
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 247bf0b1582c..cce46cf75d76 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct
> perf_event *event,
>
> if (hardlockup_panic)
> nmi_panic(regs, "Hard LOCKUP");
> + else
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>
> __this_cpu_write(hard_watchdog_warn, true);
> return;

This looks OK to me, could be useful to know we would have triggered
had the flags been set.

Reviewed-by: Laurence Oberman <[email protected]>

2019-05-03 00:49:07

by Tetsuo Handa

[permalink] [raw]

Subject: Re: [PATCH] RFC: hung_task: taint kernel

On 2019/05/03 5:46, Daniel Vetter wrote:
> There's the hung_task_panic sysctl, but that's a bit an extreme measure.
> As a fallback taint at least the machine.
>
> Our CI uses this to decide when a reboot is necessary, plus to figure
> out whether the kernel is still happy.

Why your CI can't watch for "blocked for more than" message instead of
setting the taint flag? How does your CI decide a reboot is necessary?

There is no need to set the tainted flag when some task was just blocked
for a while. It might be due to memory pressure, it might be due to setting
very short timeout (e.g. a few seconds), it might be due to busy CPUs doing
something else...

2019-05-03 08:52:28

by Daniel Vetter

[permalink] [raw]

Subject: Re: [PATCH] RFC: hung_task: taint kernel

On Fri, May 03, 2019 at 09:47:03AM +0900, Tetsuo Handa wrote:
> On 2019/05/03 5:46, Daniel Vetter wrote:
> > There's the hung_task_panic sysctl, but that's a bit an extreme measure.
> > As a fallback taint at least the machine.
> >
> > Our CI uses this to decide when a reboot is necessary, plus to figure
> > out whether the kernel is still happy.
>
> Why your CI can't watch for "blocked for more than" message instead of
> setting the taint flag? How does your CI decide a reboot is necessary?

We spam an awful lot into dmesg, and at least historically had
occasionally trouble capturing it all (we're better than that now I
think). Plus the thing that parses dmesg isn't the thing that runs
testcases, hence why we started to use taint flags (or procfs lockdep
status) and similar things to check the kernel is still alive enough.

> There is no need to set the tainted flag when some task was just blocked
> for a while. It might be due to memory pressure, it might be due to setting
> very short timeout (e.g. a few seconds), it might be due to busy CPUs doing
> something else...

Yeah I realize that this probably doesn't have much use outside of our CI,
but maybe there's someone how likes the idea.

Wrt spurious taints: You can disable the hung_tasks checker outright,
which also stops the tainting.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2019-05-09 09:27:27

by Sergey Senozhatsky

[permalink] [raw]

Subject: Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel

On (05/02/19 21:42), Daniel Vetter wrote:
[..]
> @@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> if (softlockup_panic)
> panic("softlockup: hung tasks");
> + else
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> __this_cpu_write(soft_watchdog_warn, true);

Soft lockup sets TAINT_SOFTLOCKUP bit. Would it be enough for your CI?

[..]
> @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
>
> if (hardlockup_panic)
> nmi_panic(regs, "Hard LOCKUP");
> + else
> + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

Maybe you can mirror what soft lockup does. Add a HARDLOCKUP taint bit

+++ b/include/linux/kernel.h
@@ -571,7 +571,8 @@ extern enum system_states {
#define TAINT_LIVEPATCH 15
#define TAINT_AUX 16
#define TAINT_RANDSTRUCT 17
-#define TAINT_FLAGS_COUNT 18
+#define TAINT_HARDLOCKUP 18
+#define TAINT_FLAGS_COUNT 19

and then set TAINT_HARDLOCKUP in watchdog_overflow_callback().

Just a small idea, I'll leave this to more experienced people.

-ss

2019-05-09 10:17:52

by Daniel Vetter

[permalink] [raw]

Subject: Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel

On Thu, May 9, 2019 at 11:24 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> On (05/02/19 21:42), Daniel Vetter wrote:
> [..]
> > @@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> > add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> > if (softlockup_panic)
> > panic("softlockup: hung tasks");
> > + else
> > + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> > __this_cpu_write(soft_watchdog_warn, true);
>
> Soft lockup sets TAINT_SOFTLOCKUP bit. Would it be enough for your CI?

I'm blind :-/ Yes this is totally useful.

> [..]
> > @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
> >
> > if (hardlockup_panic)
> > nmi_panic(regs, "Hard LOCKUP");
> > + else
> > + add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>
> Maybe you can mirror what soft lockup does. Add a HARDLOCKUP taint bit

We'd also want a taint for hung tasks (separate patch, same idea), not
sure it's a good idea to use a new taint bit for all of them? Atm we
don't check for all taint bits (some of them are set because of things
our testcases do, like module reload or setting unsafe kernel options
meant for testing only, so picking one of the bits we check already
was least resistance.

> +++ b/include/linux/kernel.h
> @@ -571,7 +571,8 @@ extern enum system_states {
> #define TAINT_LIVEPATCH 15
> #define TAINT_AUX 16
> #define TAINT_RANDSTRUCT 17
> -#define TAINT_FLAGS_COUNT 18
> +#define TAINT_HARDLOCKUP 18
> +#define TAINT_FLAGS_COUNT 19
>
> and then set TAINT_HARDLOCKUP in watchdog_overflow_callback().
>
> Just a small idea, I'll leave this to more experienced people.

The hung_tasks taint wasn't all that positively received, I feels like
this will stay a hack private to our CI. Except if someone else pipes
up who wants this, then I'm happy to polish.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch