2017-08-09 02:46:32

by Yafang Shao

Subject: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

Sometimes we want to dump only the tasks in TASK_RUNNING state,
instead of dumping all tasks.

For example, when the loadavg is high, we want to dump the tasks in
TASK_RUNNING and TASK_UNINTERRUPTIBLE state, which contribute to the
system load. But usually there are many tasks in sleep state, and
their output fills almost the whole kernel log buffer or even
overflows it, so the useful messages get lost. We could enlarge the
kernel log buffer, but that's not a good idea.

With the current facility we can't dump only the tasks in
TASK_RUNNING state: TASK_RUNNING is 0, so calling
show_state_filter(TASK_RUNNING) dumps all tasks.

So I made this change to make show_state_filter() more flexible,
so that we can dump the tasks in TASK_RUNNING state specifically.
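
To illustrate the limitation described above, here is a minimal
user-space sketch, not kernel code. The helper name
old_filter_matches() is made up for illustration; it mirrors the
pre-patch condition "!state_filter || (p->state & state_filter)" in
show_state_filter(), and the state values are copied from
include/linux/sched.h as it was at the time of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* State values as defined in include/linux/sched.h at the time. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2

/* TASK_RUNNING is the absence of any state bit, so no mask can name it. */
static_assert(TASK_RUNNING == 0, "TASK_RUNNING has no bit of its own");

/*
 * Hypothetical helper mirroring the pre-patch test:
 *     !state_filter || (p->state & state_filter)
 * A filter of 0 means "no filter", i.e. every task matches.
 */
bool old_filter_matches(unsigned long state, unsigned long state_filter)
{
    return !state_filter || (state & state_filter) != 0;
}
```

With this predicate, old_filter_matches(TASK_INTERRUPTIBLE,
TASK_RUNNING) is true: passing TASK_RUNNING is the same as passing 0,
so sleeping tasks are dumped as well.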

The reason I modified the existing SysRq key 'w', rather than
introducing a new key to dump the tasks in running state, is that
dumping both blocked tasks and running tasks is more helpful than
dumping blocked tasks only. That's an improvement.

Signed-off-by: Yafang Shao <[email protected]>
---
drivers/tty/sysrq.c | 15 ++++++++-------
include/linux/sched.h | 1 +
include/linux/sched/debug.h | 6 ++++--
kernel/sched/core.c | 8 +++++---
4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 3ffc1ce..d1433ed 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -289,14 +289,15 @@ static void sysrq_handle_showstate(int key)
.enable_mask = SYSRQ_ENABLE_DUMP,
};

-static void sysrq_handle_showstate_blocked(int key)
+/* tasks in blocked and running state contribute to loadavg */
+static void sysrq_handle_showstate_load(int key)
{
- show_state_filter(TASK_UNINTERRUPTIBLE);
+ show_state_filter(TASK_UNINTERRUPTIBLE << 1 | (TASK_RUNNING | 0x1));
}
-static struct sysrq_key_op sysrq_showstate_blocked_op = {
- .handler = sysrq_handle_showstate_blocked,
- .help_msg = "show-blocked-tasks(w)",
- .action_msg = "Show Blocked State",
+static struct sysrq_key_op sysrq_showstate_load_op = {
+ .handler = sysrq_handle_showstate_load,
+ .help_msg = "show-blocked/running-tasks(w)",
+ .action_msg = "Show Blocked/Running State",
.enable_mask = SYSRQ_ENABLE_DUMP,
};

@@ -477,7 +478,7 @@ static void sysrq_handle_unrt(int key)
&sysrq_mountro_op, /* u */
/* v: May be registered for frame buffer console restore */
NULL, /* v */
- &sysrq_showstate_blocked_op, /* w */
+ &sysrq_showstate_load_op, /* w */
/* x: May be registered on mips for TLB dump */
/* x: May be registered on ppc/powerpc for xmon */
/* x: May be registered on sparc64 for global PMU dump */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2d..318f149 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -82,6 +82,7 @@
#define TASK_NOLOAD 1024
#define TASK_NEW 2048
#define TASK_STATE_MAX 4096
+#define TASK_ALL_BITS ((TASK_STATE_MAX << 1) - 1)

#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPNn"

diff --git a/include/linux/sched/debug.h b/include/linux/sched/debug.h
index e0eaee5..c844689 100644
--- a/include/linux/sched/debug.h
+++ b/include/linux/sched/debug.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_SCHED_DEBUG_H
#define _LINUX_SCHED_DEBUG_H

+#include <linux/sched.h>
+
/*
* Various scheduler/task debugging interfaces:
*/
@@ -10,13 +12,13 @@
extern void dump_cpu_task(int cpu);

/*
- * Only dump TASK_* tasks. (0 for all tasks)
+ * Only dump TASK_* tasks. (TASK_ALL_BITS for all tasks)
*/
extern void show_state_filter(unsigned long state_filter);

static inline void show_state(void)
{
- show_state_filter(0);
+ show_state_filter(TASK_ALL_BITS);
}

struct pt_regs;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20..873e579 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5161,19 +5161,21 @@ void show_state_filter(unsigned long state_filter)
*/
touch_nmi_watchdog();
touch_all_softlockup_watchdogs();
- if (!state_filter || (p->state & state_filter))
+ /* in case we want to set TASK_RUNNING specifically */
+ if ((p->state != TASK_RUNNING ? p->state << 1 : 1) &
+ state_filter)
sched_show_task(p);
}

#ifdef CONFIG_SCHED_DEBUG
- if (!state_filter)
+ if (state_filter == TASK_ALL_BITS)
sysrq_sched_debug_show();
#endif
rcu_read_unlock();
/*
* Only show locks if all tasks are dumped:
*/
- if (!state_filter)
+ if (state_filter == TASK_ALL_BITS)
debug_show_all_locks();
}

--
1.8.3.1
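
The filter encoding introduced by the kernel/sched/core.c hunk above
can be sketched in user space as follows. This is only an
illustration under assumptions: state_to_filter_bits(),
new_filter_matches() and LOAD_FILTER are hypothetical names that do
not appear in the patch. The expressions themselves are taken from
the diff: bit 0 of the filter stands for TASK_RUNNING and every other
state bit is shifted left by one.

```c
#include <assert.h>
#include <stdbool.h>

/* State values as defined in include/linux/sched.h at the time. */
#define TASK_RUNNING          0
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_STATE_MAX        4096
/* Added by the patch: every filter bit set, meaning "dump all tasks". */
#define TASK_ALL_BITS         ((TASK_STATE_MAX << 1) - 1)

static_assert(TASK_ALL_BITS == 0x1fff, "bit 0 plus twelve shifted state bits");

/* Hypothetical helper: map a task state to its position in the filter. */
unsigned long state_to_filter_bits(unsigned long state)
{
    return state != TASK_RUNNING ? state << 1 : 1UL;
}

/* Hypothetical helper mirroring the post-patch test in show_state_filter(). */
bool new_filter_matches(unsigned long state, unsigned long state_filter)
{
    return (state_to_filter_bits(state) & state_filter) != 0;
}

/* Hypothetical name for the mask passed by the patched SysRq 'w' handler. */
#define LOAD_FILTER ((TASK_UNINTERRUPTIBLE << 1) | 1UL)
```

Under this encoding LOAD_FILTER selects tasks in TASK_RUNNING and
TASK_UNINTERRUPTIBLE state but not those in TASK_INTERRUPTIBLE state,
while TASK_ALL_BITS selects every task.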


2017-08-09 07:43:46

by Peter Zijlstra

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
> Sometimes we want to get tasks in TASK_RUNNING specifically,
> instead of dump all tasks.
>
> For example, when the loadavg are high, we want to dump
> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
> to system load. But mostly there're lots of tasks in Sleep state,
> which occupies almost all of the kernel log buffer, even overflows
> it, that causes the useful messages get lost. Although we can
> enlarge the kernel log buffer, but that's not a good idea.

That's what you have serial consoles for...

> +static void sysrq_handle_showstate_load(int key)
> {
> + show_state_filter(TASK_UNINTERRUPTIBLE << 1 | (TASK_RUNNING | 0x1));
> }

How is that not unreadable gunk?

> @@ -477,7 +478,7 @@ static void sysrq_handle_unrt(int key)
> &sysrq_mountro_op, /* u */
> /* v: May be registered for frame buffer console restore */
> NULL, /* v */
> - &sysrq_showstate_blocked_op, /* w */
> + &sysrq_showstate_load_op, /* w */
> /* x: May be registered on mips for TLB dump */
> /* x: May be registered on ppc/powerpc for xmon */
> /* x: May be registered on sparc64 for global PMU dump */

So I'm really not convinced this is useful. The blocked thing is very
useful if you're trying to debug a deadlock. Now you get endless clutter
with runnable tasks.

High load-avg as such isn't a problem. Why do you care?

2017-08-09 08:01:52

by Yafang Shao

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

2017-08-09 15:43 GMT+08:00 Peter Zijlstra <[email protected]>:
> On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
>> Sometimes we want to get tasks in TASK_RUNNING specifically,
>> instead of dump all tasks.
>>
>> For example, when the loadavg are high, we want to dump
>> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
>> to system load. But mostly there're lots of tasks in Sleep state,
>> which occupies almost all of the kernel log buffer, even overflows
>> it, that causes the useful messages get lost. Although we can
>> enlarge the kernel log buffer, but that's not a good idea.
>
> That's what you have serial consoles for...
>
Mostly we don't even have a console, because we always log in to the
servers via ssh. And managing the servers through a console is not so
convenient.


>> +static void sysrq_handle_showstate_load(int key)
>> {
>> + show_state_filter(TASK_UNINTERRUPTIBLE << 1 | (TASK_RUNNING | 0x1));
>> }
>
> How is that not unreadable gunk?

I know it is very hard to read; that's why I wrote TASK_RUNNING
explicitly here. But to keep backward compatibility, I have to do it
like this.

>
>> @@ -477,7 +478,7 @@ static void sysrq_handle_unrt(int key)
>> &sysrq_mountro_op, /* u */
>> /* v: May be registered for frame buffer console restore */
>> NULL, /* v */
>> - &sysrq_showstate_blocked_op, /* w */
>> + &sysrq_showstate_load_op, /* w */
>> /* x: May be registered on mips for TLB dump */
>> /* x: May be registered on ppc/powerpc for xmon */
>> /* x: May be registered on sparc64 for global PMU dump */
>
> So I'm really not convinced this is useful. The blocked thing is very
> useful if you're trying to debug a deadlock. Now you get endless clutter
> with runnable tasks.
>
> High load-avg as such isn't a problem. Why do you care?

Lots of tasks in blocked state may not mean a deadlock.
In most cases, it means the tasks are blocked on disk I/O or some
other hardware resource.
So, when lots of tasks are in blocked state, we need to know which
processes are running and holding the lock, and what these processes
are running for. High load is one example of such cases.
Tasks in blocked state and in running state both contribute to the
system load; that's why I named the function with 'load'.

Thanks
Yafang

2017-08-09 09:09:33

by Peter Zijlstra

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

On Wed, Aug 09, 2017 at 04:01:49PM +0800, Yafang Shao wrote:
> 2017-08-09 15:43 GMT+08:00 Peter Zijlstra <[email protected]>:
> > On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
> >> Sometimes we want to get tasks in TASK_RUNNING specifically,
> >> instead of dump all tasks.
> >>
> >> For example, when the loadavg are high, we want to dump
> >> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
> >> to system load. But mostly there're lots of tasks in Sleep state,
> >> which occupies almost all of the kernel log buffer, even overflows
> >> it, that causes the useful messages get lost. Although we can
> >> enlarge the kernel log buffer, but that's not a good idea.
> >
> > That's what you have serial consoles for...
> >
> mostly we don't even have one console because we always login the
> servers via ssh. And manage the servers with console is not so convenient.

I find IPMI SOL very useful. Serial console (esp. earlyprintk) keeps on
working long after most other things have died.

In any case, you can easily dump the printk output into your ssh session
if you want, use something like:

cat /dev/kmsg | tee logfile & echo t > /proc/sysrq-trigger

I really see no problem here. Then you can run a bit of awk or whatever
your favourite tool is to filter out the stuff you don't want.

2017-08-09 09:26:17

by Yafang Shao

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

2017-08-09 17:09 GMT+08:00 Peter Zijlstra <[email protected]>:
> On Wed, Aug 09, 2017 at 04:01:49PM +0800, Yafang Shao wrote:
>> 2017-08-09 15:43 GMT+08:00 Peter Zijlstra <[email protected]>:
>> > On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
>> >> Sometimes we want to get tasks in TASK_RUNNING specifically,
>> >> instead of dump all tasks.
>> >>
>> >> For example, when the loadavg are high, we want to dump
>> >> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
>> >> to system load. But mostly there're lots of tasks in Sleep state,
>> >> which occupies almost all of the kernel log buffer, even overflows
>> >> it, that causes the useful messages get lost. Although we can
>> >> enlarge the kernel log buffer, but that's not a good idea.
>> >
>> > That's what you have serial consoles for...
>> >
>> mostly we don't even have one console because we always login the
>> servers via ssh. And manage the servers with console is not so convenient.
>
> I find IPMI SOL very useful. Serial console (esp. earlyprintk) keeps on
> working long after most other things have died.
>
> In any case, you can easily dump the printk output into your ssh session
> if you want, use something like:
>
> cat /dev/kmsg | tee logfile & echo t > /proc/sysrq-trigger

That's what I'm doing currently :)
Then I thought hard: why not do it in a smarter way?
Introducing a new key (here I just modified the key 'w') that only
dumps tasks in running and blocked state should be smarter.

>
> I really see no problem here. Then you can run a bit of awk or whatever
> your favourite tool is to filter out the stuff you don't want.
>
Another question: if we can filter with scripts in userland, why
did we introduce the key 'w' to dump only blocked tasks,
when we already have the key 't' to dump all tasks?

Thanks
Yafang

2017-08-09 16:43:00

by Peter Zijlstra

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

On Wed, Aug 09, 2017 at 05:26:14PM +0800, Yafang Shao wrote:
> 2017-08-09 17:09 GMT+08:00 Peter Zijlstra <[email protected]>:
> > On Wed, Aug 09, 2017 at 04:01:49PM +0800, Yafang Shao wrote:
> >> 2017-08-09 15:43 GMT+08:00 Peter Zijlstra <[email protected]>:
> >> > On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
> >> >> Sometimes we want to get tasks in TASK_RUNNING specifically,
> >> >> instead of dump all tasks.
> >> >>
> >> >> For example, when the loadavg are high, we want to dump
> >> >> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
> >> >> to system load. But mostly there're lots of tasks in Sleep state,
> >> >> which occupies almost all of the kernel log buffer, even overflows
> >> >> it, that causes the useful messages get lost. Although we can
> >> >> enlarge the kernel log buffer, but that's not a good idea.
> >> >
> >> > That's what you have serial consoles for...
> >> >
> >> mostly we don't even have one console because we always login the
> >> servers via ssh. And manage the servers with console is not so convenient.
> >
> > I find IPMI SOL very useful. Serial console (esp. earlyprintk) keeps on
> > working long after most other things have died.
> >
> > In any case, you can easily dump the printk output into your ssh session
> > if you want, use something like:
> >
> > cat /dev/kmsg | tee logfile & echo t > /proc/sysrq-trigger
>
> that's what I'm doing it currently :)
> Then I thought deeply why not do it more smartly?
> Introducing a new key(here I just modified the key 'w') only dump
> tasks in running and blocked should be more smarter.

Since you're strictly ssh based, you could maybe do a sysctl that allows
changing the 'default' filter of sysrq-t, dunno if that makes sense
though.

Also, since you're not actually debugging a dead machine, maybe you can
do a custom kernel module / systemtap / ebpf thing that collects
precisely the information you want.

sysrq is typically a last ditch debug mostly dead machine thing, which
you're very much not having.

> >
> > I really see no problem here. Then you can run a bit of awk or whatever
> > your favourite tool is to filter out the stuff you don't want.
> >
> > Another question, if we could filter with scripts in userland, why
> did we introduced the key 'w' to dump only blocked state
> as we already have a key 't' to dump all tasks ?

No idea that is long before my time, I expect because 'w' (blocked) is
typically a small number of tasks.

2017-08-10 02:44:48

by Yafang Shao

Subject: Re: [PATCH v3] scheduler: enhancement to show_state_filter and SysRq

2017-08-10 0:42 GMT+08:00 Peter Zijlstra <[email protected]>:
> On Wed, Aug 09, 2017 at 05:26:14PM +0800, Yafang Shao wrote:
>> 2017-08-09 17:09 GMT+08:00 Peter Zijlstra <[email protected]>:
>> > On Wed, Aug 09, 2017 at 04:01:49PM +0800, Yafang Shao wrote:
>> >> 2017-08-09 15:43 GMT+08:00 Peter Zijlstra <[email protected]>:
>> >> > On Wed, Aug 09, 2017 at 06:31:28PM +0800, Yafang Shao wrote:
>> >> >> Sometimes we want to get tasks in TASK_RUNNING specifically,
>> >> >> instead of dump all tasks.
>> >> >>
>> >> >> For example, when the loadavg are high, we want to dump
>> >> >> tasks in TASK_RUNNING and TASK_UNINTERRUPTIBLE, which contribute
>> >> >> to system load. But mostly there're lots of tasks in Sleep state,
>> >> >> which occupies almost all of the kernel log buffer, even overflows
>> >> >> it, that causes the useful messages get lost. Although we can
>> >> >> enlarge the kernel log buffer, but that's not a good idea.
>> >> >
>> >> > That's what you have serial consoles for...
>> >> >
>> >> mostly we don't even have one console because we always login the
>> >> servers via ssh. And manage the servers with console is not so convenient.
>> >
>> > I find IPMI SOL very useful. Serial console (esp. earlyprintk) keeps on
>> > working long after most other things have died.
>> >
>> > In any case, you can easily dump the printk output into your ssh session
>> > if you want, use something like:
>> >
>> > cat /dev/kmsg | tee logfile & echo t > /proc/sysrq-trigger
>>
>> that's what I'm doing it currently :)
>> Then I thought deeply why not do it more smartly?
>> Introducing a new key(here I just modified the key 'w') only dump
>> tasks in running and blocked should be more smarter.
>
> Since you're strictly ssh based, you could maybe do a sysctl that allows
> changing the 'default' filter of sysrq-t, dunno if that makes sense
> though.
>
> Also, since you're not actually debugging a dead machine, maybe you can
> do a custom kernel module / systemtap / ebpf thing that collects
> precisely the information you want.
>
> sysrq is typically a last ditch debug mostly dead machine thing, which
> you're very much not having.
>

Per my understanding, SysRq is a very old thing.
I agree with you that SysRq was implemented to debug dead machines. In
the old days, once a machine was dead, we could press the SysRq key
on the keyboard to collect information, then analyze and resolve the
problem. That's great.
But things have changed.
Nowadays, tens of thousands of servers run in an IDC without keyboard
or screen, but I find this old thing is still the easiest way to
troubleshoot some kernel issues triggered by applications. For
example, when there are sudden or random spikes in CPU %sys
utilization or in the system loadavg, we can use /proc/sysrq-trigger
to conveniently collect information from the kernel and analyze
what the issue is.

Old things, new issues.

>> >
>> > I really see no problem here. Then you can run a bit of awk or whatever
>> > your favourite tool is to filter out the stuff you don't want.
>> >
>> Another question, if we could filter with scripts in userland, why
>> did we introduced the key 'w' to dump only blocked state
>> as we already have a key 't' to dump all tasks ?
>
> No idea that is long before my time, I expect because 'w' (blocked) is
> typically a small number of tasks.

If the machine is dead, there should not be many running tasks either.

Thanks
Yafang