2023-01-12 09:46:09

by Weiping Zhang

[permalink] [raw]
Subject: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

This patch try to add more debug info to detect lost kernel log or no
hung task was detected.

The user set 10 to the hung_task_timeout_secs, the kernel log:

[ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
[ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
[ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
[ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
[ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
[ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
[ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
[ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
[ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
[ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
[ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
[ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
[ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.

There is no any log at about 3983, it's hard to know if kernel log was
lost or there is no hung task was detected at that moment. So this patch
print sysctl_hung_task_warnings to distinguish the above two cases.

Signed-off-by: Weiping Zhang <[email protected]>
---
kernel/hung_task.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index c71889f3f3fc..ca917931473d 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
* complain:
*/
if (sysctl_hung_task_warnings) {
- if (sysctl_hung_task_warnings > 0)
+ if (sysctl_hung_task_warnings > 0) {
sysctl_hung_task_warnings--;
+ pr_err("sysctl_hung_task_warnings: %d\n",
+ sysctl_hung_task_warnings);
+ }
pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
pr_err(" %s %s %.*s\n",
--
2.34.1


2023-01-13 13:56:12

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> This patch try to add more debug info to detect lost kernel log or no
> hung task was detected.
>
> The user set 10 to the hung_task_timeout_secs, the kernel log:
>
> [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
> [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
>
> There is no any log at about 3983, it's hard to know if kernel log was
> lost or there is no hung task was detected at that moment. So this patch
> print sysctl_hung_task_warnings to distinguish the above two cases.
>
> Signed-off-by: Weiping Zhang <[email protected]>
> ---
> kernel/hung_task.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index c71889f3f3fc..ca917931473d 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> * complain:
> */
> if (sysctl_hung_task_warnings) {
> - if (sysctl_hung_task_warnings > 0)
> + if (sysctl_hung_task_warnings > 0) {
> sysctl_hung_task_warnings--;
> + pr_err("sysctl_hung_task_warnings: %d\n",
> + sysctl_hung_task_warnings);
> + }

It is too much noise. But it might make sense to report it when
the counter gets down to zero. Something like:

if (sysctl_hung_task_warnings)
pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");

and move this down after printing all the details for this hung task report.

> pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
> t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
> pr_err(" %s %s %.*s\n",

Best Regards,
Petr

2023-01-29 09:10:06

by Weiping Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

Sorry for the late reply.

On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <[email protected]> wrote:
>
> On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > This patch try to add more debug info to detect lost kernel log or no
> > hung task was detected.
> >
> > The user set 10 to the hung_task_timeout_secs, the kernel log:
> >
> > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.

Expect but no hung task log at this moment.

> > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> >
> > There is no any log at about 3983, it's hard to know if kernel log was
> > lost or there is no hung task was detected at that moment. So this patch
> > print sysctl_hung_task_warnings to distinguish the above two cases.
> >
> > Signed-off-by: Weiping Zhang <[email protected]>
> > ---
> > kernel/hung_task.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index c71889f3f3fc..ca917931473d 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > * complain:
> > */
> > if (sysctl_hung_task_warnings) {
> > - if (sysctl_hung_task_warnings > 0)
> > + if (sysctl_hung_task_warnings > 0) {
> > sysctl_hung_task_warnings--;
> > + pr_err("sysctl_hung_task_warnings: %d\n",
> > + sysctl_hung_task_warnings);
> > + }
>
> It is too much noise. But it might make sense to report it when
> the counter gets down to zero. Something like:
>
> if (sysctl_hung_task_warnings)
> pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
>
> and move this down after printing all the details for this hung task report.
I'm fine to print this warning when it gets down to 0.

The warning counter is useful to detect kernel log lost or not, if add
a new line to print this count is too noise,
how about append this counter at the end of the following line:
pr_err("INFO: task %s:%d blocked for more than %ld seconds,
sysctl_hung_task_warnings: %d\n",
t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
sysctl_hung_task_warnings);
>
> > pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
> > t->comm, t->pid, (jiffies - t->last_switch_time) / HZ);
> > pr_err(" %s %s %.*s\n",
>
> Best Regards,
> Petr

Thanks a ton
Weiping

2023-01-30 10:28:43

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

On Sun 2023-01-29 17:08:29, Weiping Zhang wrote:
> Sorry for the late reply.
>
> On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <[email protected]> wrote:
> >
> > On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > > This patch try to add more debug info to detect lost kernel log or no
> > > hung task was detected.
> > >
> > > The user set 10 to the hung_task_timeout_secs, the kernel log:
> > >
> > > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
>
> Expect but no hung task log at this moment.
>
> > > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> > >
> > > There is no any log at about 3983, it's hard to know if kernel log was
> > > lost or there is no hung task was detected at that moment. So this patch
> > > print sysctl_hung_task_warnings to distinguish the above two cases.
> > >
> > > Signed-off-by: Weiping Zhang <[email protected]>
> > > ---
> > > kernel/hung_task.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index c71889f3f3fc..ca917931473d 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > > * complain:
> > > */
> > > if (sysctl_hung_task_warnings) {
> > > - if (sysctl_hung_task_warnings > 0)
> > > + if (sysctl_hung_task_warnings > 0) {
> > > sysctl_hung_task_warnings--;
> > > + pr_err("sysctl_hung_task_warnings: %d\n",
> > > + sysctl_hung_task_warnings);
> > > + }
> >
> > It is too much noise. But it might make sense to report it when
> > the counter gets down to zero. Something like:
> >
> > if (sysctl_hung_task_warnings)
> > pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> >
> > and move this down after printing all the details for this hung task report.
> I'm fine to print this warning when it gets down to 0.

I prefer this variant.

> The warning counter is useful to detect kernel log lost or not, if add
> a new line to print this count is too noise,
> how about append this counter at the end of the following line:
> pr_err("INFO: task %s:%d blocked for more than %ld seconds,
> sysctl_hung_task_warnings: %d\n",
> t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
> sysctl_hung_task_warnings);

Is it really important to print this on every line, please? IMHO, most people
do not care. IMHO. it would add more harm (confusion, noise) than good.

Best Regards,
Petr

2023-01-31 07:56:40

by Weiping Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH] hung_task: show sysctl_hung_task_warnings

On Mon, Jan 30, 2023 at 6:28 PM Petr Mladek <[email protected]> wrote:
>
> On Sun 2023-01-29 17:08:29, Weiping Zhang wrote:
> > Sorry for the late reply.
> >
> > On Fri, Jan 13, 2023 at 9:43 PM Petr Mladek <[email protected]> wrote:
> > >
> > > On Thu 2023-01-12 17:17:45, Weiping Zhang wrote:
> > > > This patch try to add more debug info to detect lost kernel log or no
> > > > hung task was detected.
> > > >
> > > > The user set 10 to the hung_task_timeout_secs, the kernel log:
> > > >
> > > > [ 3942.642220] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3952.876768] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3952.877088] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3952.878212] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3963.116805] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3963.117137] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3963.118275] INFO: task mount:19066 blocked for more than 10 seconds.
> > > > [ 3973.356837] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3973.357148] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3973.358247] INFO: task mount:19066 blocked for more than 10 seconds.
> >
> > Expect but no hung task log at this moment.
> >
> > > > [ 3993.836899] INFO: task kworker/u81:0:7 blocked for more than 10 seconds.
> > > > [ 3993.837238] INFO: task scsi_eh_0:506 blocked for more than 10 seconds.
> > > > [ 3993.838356] INFO: task mount:19066 blocked for more than 10 seconds.
> > > >
> > > > There is no any log at about 3983, it's hard to know if kernel log was
> > > > lost or there is no hung task was detected at that moment. So this patch
> > > > print sysctl_hung_task_warnings to distinguish the above two cases.
> > > >
> > > > Signed-off-by: Weiping Zhang <[email protected]>
> > > > ---
> > > > kernel/hung_task.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > > index c71889f3f3fc..ca917931473d 100644
> > > > --- a/kernel/hung_task.c
> > > > +++ b/kernel/hung_task.c
> > > > @@ -127,8 +127,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > > > * complain:
> > > > */
> > > > if (sysctl_hung_task_warnings) {
> > > > - if (sysctl_hung_task_warnings > 0)
> > > > + if (sysctl_hung_task_warnings > 0) {
> > > > sysctl_hung_task_warnings--;
> > > > + pr_err("sysctl_hung_task_warnings: %d\n",
> > > > + sysctl_hung_task_warnings);
> > > > + }
> > >
> > > It is too much noise. But it might make sense to report it when
> > > the counter gets down to zero. Something like:
> > >
> > > if (sysctl_hung_task_warnings)
> > > pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> > >
> > > and move this down after printing all the details for this hung task report.
> > I'm fine to print this warning when it gets down to 0.
>
> I prefer this variant.
It's ok, Yuanli will submit a patch for this.
>
> > The warning counter is useful to detect kernel log lost or not, if add
> > a new line to print this count is too noise,
> > how about append this counter at the end of the following line:
> > pr_err("INFO: task %s:%d blocked for more than %ld seconds,
> > sysctl_hung_task_warnings: %d\n",
> > t->comm, t->pid, (jiffies - t->last_switch_time) / HZ,
> > sysctl_hung_task_warnings);
>
> Is it really important to print this on every line, please? IMHO, most people
> do not care. IMHO. it would add more harm (confusion, noise) than good.
>
Thanks
> Best Regards,
> Petr