2014-04-15 13:32:49

by Dongsheng Yang

Subject: [PATCH 0/8] perf sched: Add trace event for sched wait.

Hi all:
This is to solve the bug message shown in perf sched latency.

# perf sched latency|tail
ksoftirqd/0:3 | 0.597 ms | 57 | avg: 0.004 ms | max: 0.054 ms | max at: 19681.546204 s
ksoftirqd/1:14 | 0.637 ms | 58 | avg: 0.004 ms | max: 0.066 ms | max at: 19674.687734 s
irqbalance:349 | 0.429 ms | 1 | avg: 0.004 ms | max: 0.004 ms | max at: 19675.791528 s
ksoftirqd/3:24 | 0.527 ms | 67 | avg: 0.003 ms | max: 0.011 ms | max at: 19673.285019 s
migration/3:23 | 0.000 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 19672.055354 s
-----------------------------------------------------------------------------------------
TOTAL: | 4384.616 ms | 36879 |
---------------------------------------------------
INFO: 0.030% state machine bugs (11 out of 36684)

After some investigation, I found two causes of this problem.

(1). Sometimes the scheduler wakes up a task that is already running; this
is unnecessary, so I skip the wakeup if task->state is TASK_RUNNING. [4/8]

(2). There is no tracing for sched wait.
This is a simple graph of the task state changes:

 ----------------           1           ----------------
| TASK_RUNNING   |--------------------->| TASK_RUNNING  |
| (running)      |<---------------------| (wait cpu)    |
 ----------------           2           ----------------
        ^                                      |
        | 4    -------------------------   3   |
        +------| TASK_{UN}INTERRUPTIBLE |<-----+
               |       in wait_rq       |
                -------------------------

As the graph above shows, there are four events in scheduling, and
we currently trace three of them.

1 & 2: sched:sched_switch
4: sched:sched_wakeup|sched:sched_wakeup_new

But for event 3, we have no trace event.

This patchset adds a new trace event for sched wait, with a trace point
placed before a task is added to a wait queue. [1/8] & [2/8]
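
With the patchset applied, the new event shows up in a trace as a line
like the one below (illustrative only: the payload follows the TP_printk
in [1/8], and the rest is the usual ftrace line layout):

    bash-2269  [001]  123.456789: sched_wait: pid=2269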

BTW, the other patches in this thread are small fixes and enhancements
made during development; please help to review them at the same time.

Thanx

Dongsheng (8):
sched & trace: Add a trace event for wait.
sched/wait: Add trace point before add task into wait queue.
sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.
sched/core: Skip wakeup when task is already running.
perf tools: record and process sched:sched_wait event.
perf tools: add missing event for perf sched record.
perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel
space.
perf tools: Clarify the output of perf sched map.

include/trace/events/sched.h | 20 +++++++++
kernel/sched/core.c | 3 ++
kernel/sched/wait.c | 13 ++++--
tools/perf/builtin-sched.c | 100 +++++++++++++++++++++++++++++++++++--------
4 files changed, 115 insertions(+), 21 deletions(-)

--
1.8.2.1


2014-04-15 13:32:57

by Dongsheng Yang

Subject: [PATCH 5/8] perf tools: record and process sched:sched_wait event.

From: Dongsheng <[email protected]>

Currently, the perf sched tool does not handle any trace event for a
task going from TASK_RUNNING to TASK_{UN}INTERRUPTIBLE. So when a
thread changes to TASK_{UN}INTERRUPTIBLE, no event is captured for it
and the state of its atom stays TASK_RUNNING; when we later process a
sched_wakeup event for that thread, we see that its state is not
sleeping and record a state machine bug.

This patch makes the tool record and process the sched:sched_wait
event to solve this problem.
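
For reference, this is roughly the check in latency_wakeup_event() that
counts these bugs (paraphrased from builtin-sched.c of this era, so the
details may differ slightly):

        atom = list_entry(atoms->work_list.prev, struct work_atom, list);

        /*
         * The wakee is expected to be sleeping here; if its last atom
         * is not THREAD_SLEEPING, the state machine missed a transition
         * and a bug is counted (unless only one CPU was profiled, in
         * which case missing events are expected).
         */
        if (sched->profile_cpu == -1 && atom->state != THREAD_SLEEPING)
                sched->nr_state_machine_bugs++;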

Signed-off-by: Dongsheng <[email protected]>
---
tools/perf/builtin-sched.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ac0a49..a32af4e 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -117,6 +117,10 @@ struct trace_sched_handler {
                                   struct perf_evsel *evsel,
                                   struct perf_sample *sample,
                                   struct machine *machine);
+
+       int (*wait_event)(struct perf_sched *sched, struct perf_evsel *evsel,
+                         struct perf_sample *sample, struct machine *machine);
+
 };
 
 struct perf_sched {
@@ -863,6 +867,22 @@ add_sched_out_event(struct work_atoms *atoms,
        return 0;
 }
 
+static int
+add_sched_wait_event(struct work_atoms *atoms,
+                     u64 timestamp)
+{
+       struct work_atom *atom = zalloc(sizeof(*atom));
+       if (!atom) {
+               pr_err("No memory at %s", __func__);
+               return -1;
+       }
+
+       atom->sched_out_time = timestamp;
+
+       list_add_tail(&atom->list, &atoms->work_list);
+       return 0;
+}
+
 static void
 add_runtime_event(struct work_atoms *atoms, u64 delta,
                   u64 timestamp __maybe_unused)
@@ -1100,6 +1120,32 @@ static int latency_migrate_task_event(struct perf_sched *sched,
        return 0;
 }
 
+static int latency_wait_event(struct perf_sched *sched,
+                             struct perf_evsel *evsel,
+                             struct perf_sample *sample,
+                             struct machine *machine)
+{
+       const u32 pid = perf_evsel__intval(evsel, sample, "pid");
+       struct work_atoms *atoms;
+       struct thread *wakee;
+       u64 timestamp = sample->time;
+
+       wakee = machine__findnew_thread(machine, 0, pid);
+       atoms = thread_atoms_search(&sched->atom_root, wakee, &sched->cmp_pid);
+       if (!atoms) {
+               if (thread_atoms_insert(sched, wakee))
+                       return -1;
+               atoms = thread_atoms_search(&sched->atom_root, wakee, &sched->cmp_pid);
+               if (!atoms) {
+                       pr_err("wait-event: Internal tree error");
+                       return -1;
+               }
+       }
+
+       add_sched_wait_event(atoms, timestamp);
+       return 0;
+}
+
 static void output_lat_thread(struct perf_sched *sched, struct work_atoms *work_list)
 {
        int i;
@@ -1250,6 +1296,19 @@ static void perf_sched__sort_lat(struct perf_sched *sched)
        }
 }
 
+static int process_sched_wait(struct perf_tool *tool,
+                             struct perf_evsel *evsel,
+                             struct perf_sample *sample,
+                             struct machine *machine)
+{
+       struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
+
+       if (sched->tp_handler->wait_event)
+               return sched->tp_handler->wait_event(sched, evsel, sample, machine);
+
+       return 0;
+}
+
 static int process_sched_wakeup_event(struct perf_tool *tool,
                                       struct perf_evsel *evsel,
                                       struct perf_sample *sample,
@@ -1444,6 +1503,7 @@ static int perf_sched__read_events(struct perf_sched *sched,
                { "sched:sched_wakeup", process_sched_wakeup_event, },
                { "sched:sched_wakeup_new", process_sched_wakeup_event, },
                { "sched:sched_migrate_task", process_sched_migrate_task_event, },
+               { "sched:sched_wait", process_sched_wait, },
        };
        struct perf_session *session;
        struct perf_data_file file = {
@@ -1636,6 +1696,7 @@ static int __cmd_record(int argc, const char **argv)
                "-e", "sched:sched_process_fork",
                "-e", "sched:sched_wakeup",
                "-e", "sched:sched_migrate_task",
+               "-e", "sched:sched_wait",
        };
 
        rec_argc = ARRAY_SIZE(record_args) + argc - 1;
@@ -1722,6 +1783,7 @@ int cmd_sched(int argc, const char **argv, const char *prefix __maybe_unused)
                .switch_event = latency_switch_event,
                .runtime_event = latency_runtime_event,
                .migrate_task_event = latency_migrate_task_event,
+               .wait_event = latency_wait_event,
        };
        struct trace_sched_handler map_ops = {
                .switch_event = map_switch_event,
--
1.8.2.1

2014-04-15 13:32:56

by Dongsheng Yang

Subject: [PATCH 6/8] perf tools: add missing event for perf sched record.

From: Dongsheng <[email protected]>

We should record and process the sched:sched_wakeup_new event in the
perf sched tool, but currently only the process function for it exists;
the record subcommand does not record the event.

This patch adds -e sched:sched_wakeup_new to perf sched record.

Signed-off-by: Dongsheng <[email protected]>
---
tools/perf/builtin-sched.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index a32af4e..72ec223 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1695,6 +1695,7 @@ static int __cmd_record(int argc, const char **argv)
                "-e", "sched:sched_stat_runtime",
                "-e", "sched:sched_process_fork",
                "-e", "sched:sched_wakeup",
+               "-e", "sched:sched_wakeup_new",
                "-e", "sched:sched_migrate_task",
                "-e", "sched:sched_wait",
        };
--
1.8.2.1

2014-04-15 13:33:30

by Dongsheng Yang

Subject: [PATCH 8/8] perf tools: Clarify the output of perf sched map.

From: Dongsheng <[email protected]>

In the output of perf sched map, each thread's shortname is explained
the first time it appears.

Example:
*A0 228836.978985 secs A0 => perf:23032
*. A0 228836.979016 secs B0 => swapper:0
. *C0 228836.979099 secs C0 => migration/3:22
*A0 . C0 228836.979115 secs
A0 . *. 228836.979115 secs

But B0, which is explained as swapper:0, never appears in the left
part of the output. Instead, '.' is used as the shortname of swapper:0,
so the comment "B0 => swapper:0" is not easy to understand.

This patch clarifies the output of perf sched map by not allocating a
letter-number shortname for swapper:0, and by printing ". => swapper:0"
as the explanation for swapper:0.

Example:
*A0 228836.978985 secs A0 => perf:23032
* . A0 228836.979016 secs . => swapper:0
. *B0 228836.979099 secs B0 => migration/3:22
*A0 . B0 228836.979115 secs
A0 . * . 228836.979115 secs
A0 *C0 . 228836.979225 secs C0 => ksoftirqd/2:18
A0 *D0 . 228836.979236 secs D0 => rcu_sched:7

Signed-off-by: Dongsheng <[email protected]>
---
tools/perf/builtin-sched.c | 35 +++++++++++++++++++----------------
1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 030b7d0..727b8a0 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1359,17 +1359,23 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 
        new_shortname = 0;
        if (!sched_in->shortname[0]) {
-               sched_in->shortname[0] = sched->next_shortname1;
-               sched_in->shortname[1] = sched->next_shortname2;
-
-               if (sched->next_shortname1 < 'Z') {
-                       sched->next_shortname1++;
-               } else {
-                       sched->next_shortname1='A';
-                       if (sched->next_shortname2 < '9') {
-                               sched->next_shortname2++;
+               if (!strcmp(thread__comm_str(sched_in), "swapper")) {
+                       sched_in->shortname[0] = '.';
+                       sched_in->shortname[1] = '\0';
+               }
+               else {
+                       sched_in->shortname[0] = sched->next_shortname1;
+                       sched_in->shortname[1] = sched->next_shortname2;
+
+                       if (sched->next_shortname1 < 'Z') {
+                               sched->next_shortname1++;
                        } else {
-                               sched->next_shortname2='0';
+                               sched->next_shortname1='A';
+                               if (sched->next_shortname2 < '9') {
+                                       sched->next_shortname2++;
+                               } else {
+                                       sched->next_shortname2='0';
+                               }
                        }
                }
                new_shortname = 1;
@@ -1381,12 +1387,9 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
                else
                        printf("*");
 
-               if (sched->curr_thread[cpu]) {
-                       if (sched->curr_thread[cpu]->tid)
-                               printf("%2s ", sched->curr_thread[cpu]->shortname);
-                       else
-                               printf(". ");
-               } else
+               if (sched->curr_thread[cpu])
+                       printf("%2s ", sched->curr_thread[cpu]->shortname);
+               else
                        printf(" ");
        }

--
1.8.2.1

2014-04-15 13:33:53

by Dongsheng Yang

Subject: [PATCH 7/8] perf tools: Adapt the TASK_STATE_TO_CHAR_STR to new value in kernel space.

From: Dongsheng <[email protected]>

Currently, TASK_STATE_TO_CHAR_STR in kernel space has already been
expanded to "RSDTtZXxKWP", but it is still "RSDTtZX" in the perf sched
tool.

This patch updates TASK_STATE_TO_CHAR_STR to the new kernel value.
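
For reference, my reading of the per-character mapping behind
"RSDTtZXxKWP" (from include/linux/sched.h of this era; worth
double-checking against the exact tree):

        /*
         *   R  TASK_RUNNING            S  TASK_INTERRUPTIBLE
         *   D  TASK_UNINTERRUPTIBLE    T  __TASK_STOPPED
         *   t  __TASK_TRACED           Z  EXIT_ZOMBIE
         *   X  EXIT_DEAD               x  TASK_DEAD
         *   K  TASK_WAKEKILL           W  TASK_WAKING
         *   P  TASK_PARKED
         */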

Signed-off-by: Dongsheng <[email protected]>
---
tools/perf/builtin-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 72ec223..030b7d0 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -66,7 +66,7 @@ struct sched_atom {
        struct task_desc        *wakee;
 };
 
-#define TASK_STATE_TO_CHAR_STR "RSDTtZX"
+#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
 
 enum thread_state {
        THREAD_SLEEPING = 0,
--
1.8.2.1

2014-04-15 13:32:53

by Dongsheng Yang

Subject: [PATCH 1/8] sched & trace: Add a trace event for wait.

From: Dongsheng <[email protected]>

Signed-off-by: Dongsheng <[email protected]>
---
include/trace/events/sched.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..7aa7d43a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -51,6 +51,26 @@ TRACE_EVENT(sched_kthread_stop_ret,
 );
 
 /*
+ * Tracepoint for wait:
+ */
+TRACE_EVENT(sched_wait,
+
+       TP_PROTO(struct task_struct *p),
+
+       TP_ARGS(p),
+
+       TP_STRUCT__entry(
+               __field(        pid_t,  pid     )
+       ),
+
+       TP_fast_assign(
+               __entry->pid = p->pid;
+       ),
+
+       TP_printk("pid=%d", __entry->pid)
+);
+
+/*
  * Tracepoint for waking up a task:
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
--
1.8.2.1

2014-04-15 13:34:14

by Dongsheng Yang

Subject: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

From: Dongsheng <[email protected]>

It is pointless to wake up a running task. Currently, we can
see it in perf sched latency.

# perf sched record sleep 10
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 21.170 MB perf.data (~924950 samples) ]
# perf sched latency |tail
ksoftirqd/13:57 | 0.011 ms | 1 | avg: 0.003 ms | max: 0.003 ms | max at: 254123.512011 s
khelper:6423 | 0.059 ms | 3 | avg: 0.003 ms | max: 0.005 ms | max at: 254124.947110 s
ksoftirqd/3:17 | 0.042 ms | 2 | avg: 0.003 ms | max: 0.003 ms | max at: 254126.478175 s
automount:6465 | 0.480 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
automount:6618 | 0.465 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s
-----------------------------------------------------------------------------------------
TOTAL: | 3535.020 ms | 28477 |
---------------------------------------------------
INFO: 0.922% state machine bugs (259 out of 28092)

Signed-off-by: Dongsheng <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..fc6b644 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1416,6 +1416,9 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+       if (p->state == TASK_RUNNING)
+               return;
+
        check_preempt_curr(rq, p, wake_flags);
        trace_sched_wakeup(p, true);
 
--
1.8.2.1

2014-04-15 13:34:49

by Dongsheng Yang

Subject: [PATCH 2/8] sched/wait: Add trace point before add task into wait queue.

From: Dongsheng <[email protected]>

Since there are lots of functions in sched/wait.c that call
__add_wait_queue{_tail}(), we need the trace point before every place a
task is added to a wait queue. This patch adds the trace point on every
possible route that adds a task to a wait queue.

Signed-off-by: Dongsheng <[email protected]>
---
kernel/sched/wait.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 7d50f79..283750e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -10,6 +10,8 @@
 #include <linux/wait.h>
 #include <linux/hash.h>
 
+#include <trace/events/sched.h>
+
 void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct lock_class_key *key)
 {
        spin_lock_init(&q->lock);
@@ -23,6 +25,7 @@ void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
 {
        unsigned long flags;
 
+       trace_sched_wait(current);
        wait->flags &= ~WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        __add_wait_queue(q, wait);
@@ -34,6 +37,7 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
 {
        unsigned long flags;
 
+       trace_sched_wait(current);
        wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        __add_wait_queue_tail(q, wait);
@@ -172,6 +176,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
        unsigned long flags;
 
+       trace_sched_wait(current);
        wait->flags &= ~WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
@@ -186,6 +191,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
        unsigned long flags;
 
+       trace_sched_wait(current);
        wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
@@ -205,6 +211,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
        wait->private = current;
        wait->func = autoremove_wake_function;
 
+       trace_sched_wait(current);
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list)) {
                if (wait->flags & WQ_FLAG_EXCLUSIVE)
--
1.8.2.1

2014-04-15 13:34:48

by Dongsheng Yang

Subject: [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.

From: Dongsheng <[email protected]>

There is already a function in include/linux/wait.h that covers the
'exclusive' usage, so we can use it in sched/wait.c to replace the
open-coded implementation.
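
For reference, the helper in include/linux/wait.h looks roughly like
this at the time; note that it sets WQ_FLAG_EXCLUSIVE itself before
queueing:

        static inline void
        __add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
        {
                wait->flags |= WQ_FLAG_EXCLUSIVE;
                __add_wait_queue_tail(q, wait);
        }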

Signed-off-by: Dongsheng <[email protected]>
---
kernel/sched/wait.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 283750e..b04827e 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
        unsigned long flags;
 
        trace_sched_wait(current);
-       wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
-       __add_wait_queue_tail(q, wait);
+       __add_wait_queue_tail_exclusive(q, wait);
        spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
@@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
        unsigned long flags;
 
        trace_sched_wait(current);
-       wait->flags |= WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
-               __add_wait_queue_tail(q, wait);
+               __add_wait_queue_tail_exclusive(q, wait);
        set_current_state(state);
        spin_unlock_irqrestore(&q->lock, flags);
 }
--
1.8.2.1

2014-04-15 13:49:32

by Peter Zijlstra

Subject: Re: [PATCH 1/8] sched & trace: Add a trace event for wait.

On Tue, Apr 15, 2014 at 09:32:50PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <[email protected]>
>

No changelog, and the tracepoint seems useless to me.

> Signed-off-by: Dongsheng <[email protected]>
> ---
> include/trace/events/sched.h | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 67e1bbf..7aa7d43a 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -51,6 +51,26 @@ TRACE_EVENT(sched_kthread_stop_ret,
>  );
>  
>  /*
> + * Tracepoint for wait:
> + */
> +TRACE_EVENT(sched_wait,
> +
> +       TP_PROTO(struct task_struct *p),
> +
> +       TP_ARGS(p),
> +
> +       TP_STRUCT__entry(
> +               __field(        pid_t,  pid     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->pid = p->pid;
> +       ),
> +
> +       TP_printk("pid=%d", __entry->pid)
> +);
> +
> +/*
>   * Tracepoint for waking up a task:
>   */
>  DECLARE_EVENT_CLASS(sched_wakeup_template,
> --
> 1.8.2.1
>

2014-04-15 13:49:59

by Peter Zijlstra

Subject: Re: [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.

On Tue, Apr 15, 2014 at 09:32:52PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <[email protected]>
>
> There is already a function in include/linux/wait.h that covers the
> 'exclusive' usage, so we can use it in sched/wait.c to replace the
> open-coded implementation.
>
> Signed-off-by: Dongsheng <[email protected]>
> ---
> kernel/sched/wait.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 283750e..b04827e 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
>         unsigned long flags;
>  
>         trace_sched_wait(current);
> -       wait->flags |= WQ_FLAG_EXCLUSIVE;
>         spin_lock_irqsave(&q->lock, flags);
> -       __add_wait_queue_tail(q, wait);
> +       __add_wait_queue_tail_exclusive(q, wait);
>         spin_unlock_irqrestore(&q->lock, flags);
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
> @@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
>         unsigned long flags;
>  
>         trace_sched_wait(current);
> -       wait->flags |= WQ_FLAG_EXCLUSIVE;
>         spin_lock_irqsave(&q->lock, flags);
>         if (list_empty(&wait->task_list))
> -               __add_wait_queue_tail(q, wait);
> +               __add_wait_queue_tail_exclusive(q, wait);
>         set_current_state(state);
>         spin_unlock_irqrestore(&q->lock, flags);

That is not a no-op: if !list_empty() we lose the WQ_FLAG_EXCLUSIVE.

2014-04-15 13:53:40

by Peter Zijlstra

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
> From: Dongsheng <[email protected]>
>
> It is pointless to wake up a running task. Currently, we can
> see it in perf sched latency.
>
> Signed-off-by: Dongsheng <[email protected]>
> ---
> kernel/sched/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 268a45e..fc6b644 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1416,6 +1416,9 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>  static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
> +       if (p->state == TASK_RUNNING)
> +               return;
> +
>         check_preempt_curr(rq, p, wake_flags);
>         trace_sched_wakeup(p, true);
>  


How can you get there with ->state == RUNNING? try_to_wake_up*() bail
when !(->state & state).
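
(For reference, the bail-out being referred to, abridged from
try_to_wake_up() in kernel/sched/core.c of this era:)

        raw_spin_lock_irqsave(&p->pi_lock, flags);
        /* nothing to do unless the task is in one of the requested states */
        if (!(p->state & state))
                goto out;

        success = 1;    /* we're going to change ->state */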

2014-04-15 13:55:03

by Peter Zijlstra

Subject: Re: [PATCH 0/8] perf sched: Add trace event for sched wait.

On Tue, Apr 15, 2014 at 09:32:49PM +0900, Dongsheng Yang wrote:
> Hi all:
> This is to solve the bug message shown in perf sched latency.
>
> # perf sched latency|tail
> ksoftirqd/0:3 | 0.597 ms | 57 | avg: 0.004 ms | max: 0.054 ms | max at: 19681.546204 s
> ksoftirqd/1:14 | 0.637 ms | 58 | avg: 0.004 ms | max: 0.066 ms | max at: 19674.687734 s
> irqbalance:349 | 0.429 ms | 1 | avg: 0.004 ms | max: 0.004 ms | max at: 19675.791528 s
> ksoftirqd/3:24 | 0.527 ms | 67 | avg: 0.003 ms | max: 0.011 ms | max at: 19673.285019 s
> migration/3:23 | 0.000 ms | 1 | avg: 0.002 ms | max: 0.002 ms | max at: 19672.055354 s
> -----------------------------------------------------------------------------------------
> TOTAL: | 4384.616 ms | 36879 |
> ---------------------------------------------------
> INFO: 0.030% state machine bugs (11 out of 36684)
>
> After some investigation, I found two causes of this problem.
>
> (1). Sometimes the scheduler wakes up a task that is already running; this
> is unnecessary, so I skip the wakeup if task->state is TASK_RUNNING. [4/8]
>
> (2). There is no tracing for sched wait.
> This is a simple graph of the task state changes:
>
>  ----------------           1           ----------------
> | TASK_RUNNING   |--------------------->| TASK_RUNNING  |
> | (running)      |<---------------------| (wait cpu)    |
>  ----------------           2           ----------------
>         ^                                      |
>         | 4    -------------------------   3   |
>         +------| TASK_{UN}INTERRUPTIBLE |<-----+
>                |       in wait_rq       |
>                 -------------------------
>
> As the graph above shows, there are four events in scheduling, and
> we currently trace three of them.
>
> 1 & 2: sched:sched_switch
> 4: sched:sched_wakeup|sched:sched_wakeup_new
>
> But for event 3, we have no trace event.

We do, sched_switch() includes the previous task state. Also there's
tons more than a waitqueue you can get stuck in.
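
(For reference, roughly how sched_switch prints the previous task state,
abridged from include/trace/events/sched.h of this era:)

        TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
                __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
                __entry->prev_state & (TASK_STATE_MAX-1) ?
                  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
                                { 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
                                { 16, "Z" }, { 32, "X" }, { 64, "x" },
                                { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
                __entry->prev_state & TASK_STATE_MAX ? "+" : "",
                __entry->next_comm, __entry->next_pid, __entry->next_prio)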

2014-04-16 10:51:23

by Dongsheng Yang

Subject: Re: [PATCH 3/8] sched/wait: Use __add_wait_queue{_tail}_exclusive() as possible.

On 04/15/2014 10:49 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:52PM +0900, Dongsheng Yang wrote:
>> From: Dongsheng <[email protected]>
>>
>> There is already a function in include/linux/wait.h that covers the
>> 'exclusive' usage, so we can use it in sched/wait.c to replace the
>> open-coded implementation.
>>
>> Signed-off-by: Dongsheng <[email protected]>
>> ---
>> kernel/sched/wait.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> index 283750e..b04827e 100644
>> --- a/kernel/sched/wait.c
>> +++ b/kernel/sched/wait.c
>> @@ -38,9 +38,8 @@ void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
>>         unsigned long flags;
>>  
>>         trace_sched_wait(current);
>> -       wait->flags |= WQ_FLAG_EXCLUSIVE;
>>         spin_lock_irqsave(&q->lock, flags);
>> -       __add_wait_queue_tail(q, wait);
>> +       __add_wait_queue_tail_exclusive(q, wait);
>>         spin_unlock_irqrestore(&q->lock, flags);
>>  }
>>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>> @@ -192,10 +191,9 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
>>         unsigned long flags;
>>  
>>         trace_sched_wait(current);
>> -       wait->flags |= WQ_FLAG_EXCLUSIVE;
>>         spin_lock_irqsave(&q->lock, flags);
>>         if (list_empty(&wait->task_list))
>> -               __add_wait_queue_tail(q, wait);
>> +               __add_wait_queue_tail_exclusive(q, wait);
>>         set_current_state(state);
>>         spin_unlock_irqrestore(&q->lock, flags);
> That is not a no-op: if !list_empty() we lose the WQ_FLAG_EXCLUSIVE.

Oh, yes, my mistake. Thanx

2014-04-16 11:24:04

by Dongsheng Yang

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>
> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
> when !(->state & state).
Yes, try_to_wake_up() does this check. But other callers would miss it.

With the following code, I can get the actual message of waking up
a running task:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9f63275..1369cae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 static void
 ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
-       if (p->state == TASK_RUNNING)
+       if (p->state == TASK_RUNNING) {
+               printk("Wakeup a running task.");
                return;
+       }
 
        check_preempt_curr(rq, p, wake_flags);
        trace_sched_wakeup(p, true);


# grep "Wakeup" /var/log/messages
Apr 15 20:16:21 localhost kernel: [ 5.436505] Wakeup a running task.
Apr 15 20:16:21 localhost kernel: [ 7.776042] Wakeup a running task.
Apr 15 20:16:21 localhost kernel: [ 9.324274] Wakeup a running task.

So I think some callers of ttwu_do_wakeup() are attempting to wake
up a running task.

2014-04-16 11:27:39

by Dongsheng Yang

Subject: Re: [PATCH 0/8] perf sched: Add trace event for sched wait.

On 04/15/2014 10:54 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2014 at 09:32:49PM +0900, Dongsheng Yang wrote:
>> But about 3, we have no trace event for it.
> We do, sched_switch() includes the previous task state. Also there's
> tons more than a waitqueue you can get stuck in.
Okay, prev_state is designed to trace the No. 3 situation, but from my
test results it does not seem to be working well now. I will look into
it and dig out the reason. Thanx for your help.

2014-04-16 14:23:19

by Steven Rostedt

Subject: Re: [PATCH 1/8] sched & trace: Add a trace event for wait.

On Tue, 15 Apr 2014 15:49:16 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 15, 2014 at 09:32:50PM +0900, Dongsheng Yang wrote:
> > From: Dongsheng <[email protected]>
> >
>
> No changelog, and the tracepoint seems useless to me.
>

At least patches 1 and 2 need to be combined.

-- Steve

2014-04-22 12:55:56

by Dongsheng Yang

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On 04/16/2014 07:22 PM, Dongsheng Yang wrote:
> On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
>> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>>
>> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
>> when !(->state & state).
> Yes, try_to_wake_up() does this check. But other callers would miss it.
>
> With the following code, I can get the actual message of waking up
> a running task:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9f63275..1369cae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>  static void
>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>  {
> -       if (p->state == TASK_RUNNING)
> +       if (p->state == TASK_RUNNING) {
> +               printk("Wakeup a running task.");
>                 return;
> +       }
>  
>         check_preempt_curr(rq, p, wake_flags);
>         trace_sched_wakeup(p, true);
>
>
> # grep "Wakeup" /var/log/messages
> Apr 15 20:16:21 localhost kernel: [ 5.436505] Wakeup a running task.
> Apr 15 20:16:21 localhost kernel: [ 7.776042] Wakeup a running task.
> Apr 15 20:16:21 localhost kernel: [ 9.324274] Wakeup a running task.

Hi Peter, after some more investigation, I think I found the problem:
some other task sets p->state to TASK_RUNNING without holding p->pi_lock.

As the attached graph shows, if some other task sets p->state to
TASK_RUNNING after the check if (!(p->state & state)), then we are
wasting time waking up a running task in try_to_wake_up().

If this analysis is right, I think there are two ways to solve the
problem:
* Skip the wakeup in ttwu_do_wakeup() when p->state is running, as my
  patch did.
* Take a lock whenever we set p->state; that is a lot of work, and I am
  afraid it would hurt kernel performance.

The following message is the backtrace info I got when it happened:

(gdb) bt
#0 try_to_wake_up (p=0xffff88027e651930, state=1, wake_flags=0)
at kernel/sched/core.c:1605
#1 0xffffffff81099532 in default_wake_function (curr=<value optimized
out>,
mode=<value optimized out>, wake_flags=<value optimized out>,
key=<value optimized out>) at kernel/sched/core.c:2853
#2 0xffffffff810aa489 in __wake_up_common (q=0xffff88027f03f210, mode=1,
nr_exclusive=1, wake_flags=0, key=0x4) at kernel/sched/wait.c:75
#3 0xffffffff810aa838 in __wake_up (q=0xffff88027f03f210, mode=1,
nr_exclusive=1, key=0x4) at kernel/sched/wait.c:97
#4 0xffffffff813cd0a4 in n_tty_check_unthrottle (tty=0xffff88027f03ec00,
file=0xffff880278ab1b00,
buf=0x7fff0fcf9720 "\r\nyum install -y
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum install -y
./a/albatross-xfwm4-theme-1.2-5.fc19.noarch.rpm\r\nyum instal"...,
nr=16315) at drivers/tty/n_tty.c:280
#5 n_tty_read (tty=0xffff88027f03ec00, file=0xffff880278ab1b00,
buf=0x7fff0fcf9720 "\r\nyum install -y
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum install -y
./a/albatross-xfwm4-theme-1.2-5.fc19.noarch.rpm\r\nyum instal"...,
nr=16315) at drivers/tty/n_tty.c:2259
#6 0xffffffff813c5667 in tty_read (file=0xffff880278ab1b00,
buf=0x7fff0fcf9720 "\r\nyum install -y
./a/alsa-plugins-pulseaudio-1.0.27-1.fc19.x86_64.rpml -y
./a/alsa-lib-1.0.27.1-2.fc19.x86_64.rpm.noarch.rpm\r\nyum in---Type
<return> to continue, or q <return> to quit---q
Quit
(gdb) p p->state ------> Currently, p->state is TASK_RUNNING.
$1 = 0
(gdb) l
1600
1601            success = 1; /* we're going to change ->state */
1602            cpu = task_cpu(p);
1603
1604            if (p->state == TASK_RUNNING) {
1605                    printk("Wake up a running task.");
1606            }
1607            if (p->on_rq && ttwu_remote(p, wake_flags))
1608                    goto stat;
1609
(gdb)
>
> So I think some callers of ttwu_do_wakeup() are attempting to wake
> up a running task.


Attachments:
graph (279.00 B)

2014-04-22 17:10:57

by Benjamin Segall

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

Dongsheng Yang <[email protected]> writes:

> On 04/16/2014 07:22 PM, Dongsheng Yang wrote:
>> On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
>>> On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
>>>
>>> How can you get there with ->state == RUNNING? try_to_wake_up*() bail
>>> when !(->state & state).
>> Yes, try_to_wake_up() does this check. But other callers would miss it.
>>
>> With the following code, I can get the actual message of waking up
>> a running task:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 9f63275..1369cae 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1418,8 +1418,10 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
>>  static void
>>  ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>  {
>> -       if (p->state == TASK_RUNNING)
>> +       if (p->state == TASK_RUNNING) {
>> +               printk("Wakeup a running task.");
>>                 return;
>> +       }
>>  
>>         check_preempt_curr(rq, p, wake_flags);
>>         trace_sched_wakeup(p, true);
>>
>>
>> # grep "Wakeup" /var/log/messages
>> Apr 15 20:16:21 localhost kernel: [ 5.436505] Wakeup a running task.
>> Apr 15 20:16:21 localhost kernel: [ 7.776042] Wakeup a running task.
>> Apr 15 20:16:21 localhost kernel: [ 9.324274] Wakeup a running task.
>
> Hi Peter, after some more investigation, I think I found the problem:
> some other task sets p->state to TASK_RUNNING without holding p->pi_lock.
>
> As the attached graph shows, if some other task sets p->state to
> TASK_RUNNING after the check if (!(p->state & state)), then we are
> wasting time waking up a running task in try_to_wake_up().
>
> If this analysis is right, I think there are two ways to solve the
> problem:
> * Skip the wakeup in ttwu_do_wakeup() when p->state is running, as my
>   patch did.
> * Take a lock whenever we set p->state; that is a lot of work, and I am
>   afraid it would hurt kernel performance.

This is all expected behavior, and the somewhat less than useful trace
events are expected. A task setting p->state to TASK_RUNNING without
locks is fine if and only if p == current. The standard deschedule loop is
basically:

while (1) {
        set_current_state(TASK_(UN)INTERRUPTIBLE);
        if (!should_still_sleep)
                break;
        schedule();
}
set_current_state(TASK_RUNNING);

Which can produce this in a race.

The only problem this causes is a wasted check_preempt_curr call in the
racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
Note that even if you did recheck in ttwu_do_wakeup you could still race
and get an "inaccurate" trace event. Heck, even if the ttwu is
_necessary_ because p is currently trying to take rq->lock to
deschedule, you won't get a matching sched_switch event, because the
ttwu is running before schedule is.

You could sorta fix this I guess by tracking every write to p->state
with trace events, but that would be a somewhat different change, and
might be considered too expensive for all I know (and the trace events
could /still/ be resolved in a different order across cpus compared to
p->state's memory).

2014-04-22 17:53:43

by Steven Rostedt

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On Tue, 22 Apr 2014 10:10:52 -0700
[email protected] wrote:


> This is all expected behavior, and the somewhat less than useful trace
> events are expected. A task setting p->state to TASK_RUNNING without
> locks is fine if and only if p == current. The standard deschedule loop is

Sure, and if you are not current, then all you need is the rq lock of
the rq that p's CPU is for.

> basically:
>
> while (1) {
>         set_current_state(TASK_(UN)INTERRUPTIBLE);

Yep, and set_current_state() implies a memory barrier.

>         if (!should_still_sleep)
>                 break;
>         schedule();
> }
> set_current_state(TASK_RUNNING);

The above can use __set_current_state() as there are no races to deal
with when setting current's state to RUNNING.
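
(For reference, the two helpers, roughly as defined in
include/linux/sched.h at the time:)

        #define __set_current_state(state_value)                \
                do { current->state = (state_value); } while (0)

        /* set_mb() is an assignment followed by a full memory barrier */
        #define set_current_state(state_value)                  \
                set_mb(current->state, (state_value))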

>
> Which can produce this in a race.
>
> The only problem this causes is a wasted check_preempt_curr call in the
> racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
> Note that even if you did recheck in ttwu_do_wakeup you could still race
> and get an "inaccurate" trace event. Heck, even if the ttwu is
> _necessary_ because p is currently trying to take rq->lock to
> deschedule, you won't get a matching sched_switch event, because the
> ttwu is running before schedule is.
>
> You could sorta fix this I guess by tracking every write to p->state
> with trace events, but that would be a somewhat different change, and
> might be considered too expensive for all I know (and the trace events
> could /still/ be resolved in a different order across cpus compared to
> p->state's memory).

Yeah, let's not do that.

-- Steve

2014-04-22 20:01:24

by Peter Zijlstra

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On Tue, Apr 22, 2014 at 08:56:11PM +0900, Dongsheng Yang wrote:

> >On 04/15/2014 10:53 PM, Peter Zijlstra wrote:
> >>On Tue, Apr 15, 2014 at 09:32:53PM +0900, Dongsheng Yang wrote:
> >>
> >>How can you get there with ->state == RUNNING? try_to_wake_up*() bail
> >>when !(->state & state).

> Hi Peter, after some more investigation, I think I found the problem:
> some other task sets p->state to TASK_RUNNING without holding p->pi_lock.

That would be a plain bug; you're only ever allowed to set your own
->state, never someone else's. That is:

        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (cond)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);

Is the 'normal' way of blocking, and that is very much only touching its
own state. And in the above you cannot trigger the case you're seeing,
because the ->state = TASK_RUNNING, at the end, is _after_ it's already
running, so we've completed the wakeup.

So I think you've spotted something rotten and we should catch/fix that.

Your backtrace only shows where we find ->state == TASK_RUNNING, not
where it became such.

2014-04-22 20:02:37

by Peter Zijlstra

Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running.

On Tue, Apr 22, 2014 at 10:10:52AM -0700, [email protected] wrote:
> This is all expected behavior, and the somewhat less than useful trace
> events are expected. A task setting p->state to TASK_RUNNING without
> locks is fine if and only if p == current. The standard deschedule loop is
> basically:
>
> while (1) {
>         set_current_state(TASK_(UN)INTERRUPTIBLE);
>         if (!should_still_sleep)
>                 break;
>         schedule();
> }
> set_current_state(TASK_RUNNING);
>
> Which can produce this in a race.
>
> The only problem this causes is a wasted check_preempt_curr call in the
> racing case, and a somewhat inaccurate sched:sched_wakeup trace event.
> Note that even if you did recheck in ttwu_do_wakeup you could still race
> and get an "inaccurate" trace event. Heck, even if the ttwu is
> _necessary_ because p is currently trying to take rq->lock to
> deschedule, you won't get a matching sched_switch event, because the
> ttwu is running before schedule is.
>
> You could sorta fix this I guess by tracking every write to p->state
> with trace events, but that would be a somewhat different change, and
> might be considered too expensive for all I know (and the trace events
> could /still/ be resolved in a different order across cpus compared to
> p->state's memory).

Ah, you're saying that a second task could try a spurious wakeup between
set_current_state() and schedule(). Yes, that'll trigger this indeed.
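
(A sketch of the interleaving being described, as I read the thread; the
condition name is illustrative:)

        /*
         *  sleeper (p == current)                  waker
         *  ----------------------                  -----
         *  set_current_state(TASK_INTERRUPTIBLE);
         *                                          try_to_wake_up(p, TASK_INTERRUPTIBLE, 0);
         *                                            (p->state & state) is true -> proceed
         *  if (cond)              <- already true
         *          break;
         *  __set_current_state(TASK_RUNNING);
         *                                            ttwu_do_wakeup() now observes
         *                                            p->state == TASK_RUNNING:
         *                                            a spurious but harmless wakeup
         */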