2014-12-11 16:00:31

by Josef Bacik

Subject: [PATCH] sched/fair: change where we report sched stats V2

The scheduler stats are currently emitted when the entity is enqueued, which
means that if we have stack traces enabled we get the stack trace of the
waker, not of the task being woken. This makes the backtrace useless when
trying to track down latency spikes, since what we want to know is why we
were put to sleep for as long as we were.

This patch moves the trace_sched_stat* calls to after the task has been put
back on the CPU, so that the backtrace we get is the one that put us to
sleep. Mike Galbraith suggested leaving the actual stat collection where it
is so that we don't add scheduler latency to the sleep stats, so I've added
3 new fields to the scheduler stats to record the last sleep, iowait and
block times; if any of these are set we know to fire the trace points.
Thanks,

Signed-off-by: Josef Bacik <[email protected]>
---
V1->V2: left the stat collection the way it was and just moved the
trace_sched_stat* calls to after we are put back on the CPU.

 include/linux/sched.h |  3 +++
 kernel/sched/core.c   | 14 ++++----------
 kernel/sched/fair.c   | 33 ++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  1 +
 4 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bb..f647a09 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1091,13 +1091,16 @@ struct sched_statistics {
 	u64			wait_sum;
 	u64			iowait_count;
 	u64			iowait_sum;
+	u64			last_iowait_time;
 
 	u64			sleep_start;
 	u64			sleep_max;
+	u64			last_sleep_time;
 	s64			sum_sleep_runtime;
 
 	u64			block_start;
 	u64			block_max;
+	u64			last_block_time;
 	u64			exec_max;
 	u64			slice_max;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..e763709 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,11 +2277,12 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	tick_nohz_task_switch(current);
 }
 
-#ifdef CONFIG_SMP
-
 /* rq->lock is NOT held, but preemption is disabled */
 static inline void post_schedule(struct rq *rq)
 {
+	if (rq->curr->sched_class->post_schedule_stats)
+		rq->curr->sched_class->post_schedule_stats(rq);
+#ifdef CONFIG_SMP
 	if (rq->post_schedule) {
 		unsigned long flags;
 
@@ -2292,15 +2293,8 @@ static inline void post_schedule(struct rq *rq)
 
 		rq->post_schedule = 0;
 	}
-}
-
-#else
-
-static inline void post_schedule(struct rq *rq)
-{
-}
-
 #endif
+}
 
 /**
  * schedule_tail - first thing a freshly forked thread must call.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef2b104..3038524 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2702,6 +2702,29 @@ static inline int idle_balance(struct rq *rq)
 
 #endif /* CONFIG_SMP */
 
+static void task_emit_stats(struct rq *rq)
+{
+#ifdef CONFIG_SCHEDSTATS
+	struct task_struct *tsk = rq->curr;
+	struct sched_entity *se = &tsk->se;
+
+	if (se->statistics.last_sleep_time) {
+		trace_sched_stat_sleep(tsk, se->statistics.last_sleep_time);
+		se->statistics.last_sleep_time = 0;
+	}
+
+	if (se->statistics.last_iowait_time) {
+		trace_sched_stat_iowait(tsk, se->statistics.last_iowait_time);
+		se->statistics.last_iowait_time = 0;
+	}
+
+	if (se->statistics.last_block_time) {
+		trace_sched_stat_blocked(tsk, se->statistics.last_block_time);
+		se->statistics.last_block_time = 0;
+	}
+#endif
+}
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -2721,11 +2744,10 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 		se->statistics.sleep_start = 0;
 		se->statistics.sum_sleep_runtime += delta;
+		se->statistics.last_sleep_time = delta;
 
-		if (tsk) {
+		if (tsk)
 			account_scheduler_latency(tsk, delta >> 10, 1);
-			trace_sched_stat_sleep(tsk, delta);
-		}
 	}
 	if (se->statistics.block_start) {
 		u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.block_start;
@@ -2737,16 +2759,16 @@ static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 			se->statistics.block_max = delta;
 
 		se->statistics.block_start = 0;
+		se->statistics.last_block_time = delta;
 		se->statistics.sum_sleep_runtime += delta;
 
 		if (tsk) {
 			if (tsk->in_iowait) {
 				se->statistics.iowait_sum += delta;
 				se->statistics.iowait_count++;
-				trace_sched_stat_iowait(tsk, delta);
+				se->statistics.last_iowait_time = delta;
 			}
 
-			trace_sched_stat_blocked(tsk, delta);
 
 			/*
 			 * Blocking time is in units of nanosecs, so shift by
@@ -7966,6 +7988,7 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group = task_move_group_fair,
 #endif
+	.post_schedule_stats = task_emit_stats,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2df8ef0..7c0e977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1140,6 +1140,7 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
+	void (*post_schedule_stats) (struct rq *this_rq);
 };
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
--
1.9.3


2014-12-15 10:16:39

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: change where we report sched stats V2

On Thu, Dec 11, 2014 at 10:59:55AM -0500, Josef Bacik wrote:
> The scheduler stats are currently emitted when the entity is enqueued, which
> means that if we have stack traces enabled we get the stack trace of the
> waker, not of the task being woken. This makes the backtrace useless when
> trying to track down latency spikes, since what we want to know is why we
> were put to sleep for as long as we were.
>
> This patch moves the trace_sched_stat* calls to after the task has been put
> back on the CPU, so that the backtrace we get is the one that put us to
> sleep. Mike Galbraith suggested leaving the actual stat collection where it
> is so that we don't add scheduler latency to the sleep stats, so I've added
> 3 new fields to the scheduler stats to record the last sleep, iowait and
> block times; if any of these are set we know to fire the trace points.
> Thanks,
>

Yeah, so I don't like this, it adds overhead for everyone.

Also, I'm very sure this issue has been raised before, and I'm pretty
sure the answer back then was why don't you look at the
trace_sched_switch() site matching this wakeup?
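
Roughly, I mean something like the below: pair each switch-out of a pid
with its next switch-in over a recorded trace and flag the big gaps.
Only a sketch, untested; the line format is a guess at trace-cmd report
output, and it ignores prev_state, so it lumps real sleeps in with
preemption:

/*
 * Hypothetical post-processing helper, e.g.:
 *
 *   trace-cmd report trace.dat | ./switch-match
 *
 * Reads sched_switch lines on stdin and reports how long each pid was
 * off the CPU by pairing its switch-out with its next switch-in.
 */
#include <stdio.h>
#include <string.h>

#define MAX_PID 65536

static double switched_out[MAX_PID];	/* last switch-out time per pid */

int main(void)
{
	char line[1024];

	while (fgets(line, sizeof(line), stdin)) {
		double ts, delay;
		int prev_pid, next_pid;
		char *p = strstr(line, "sched_switch:");

		/* "comm-pid [cpu] timestamp: sched_switch: ..." */
		if (!p || sscanf(line, "%*s [%*d] %lf:", &ts) != 1)
			continue;

		if (sscanf(p, "sched_switch: prev_comm=%*s prev_pid=%d",
			   &prev_pid) == 1 &&
		    prev_pid > 0 && prev_pid < MAX_PID)
			switched_out[prev_pid] = ts;

		p = strstr(p, "next_pid=");
		if (!p || sscanf(p, "next_pid=%d", &next_pid) != 1)
			continue;
		if (next_pid <= 0 || next_pid >= MAX_PID ||
		    !switched_out[next_pid])
			continue;

		delay = ts - switched_out[next_pid];
		switched_out[next_pid] = 0;
		if (delay >= 0.1)	/* flag anything off-cpu >= 100ms */
			printf("pid %d off the cpu for %.6fs\n",
			       next_pid, delay);
	}
	return 0;
}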

2014-12-15 15:37:46

by Josef Bacik

Subject: Re: [PATCH] sched/fair: change where we report sched stats V2

On 12/15/2014 05:16 AM, Peter Zijlstra wrote:
> On Thu, Dec 11, 2014 at 10:59:55AM -0500, Josef Bacik wrote:
>> The scheduler stats are currently emitted when the entity is enqueued, which
>> means that if we have stack traces enabled we get the stack trace of the
>> waker, not of the task being woken. This makes the backtrace useless when
>> trying to track down latency spikes, since what we want to know is why we
>> were put to sleep for as long as we were.
>>
>> This patch moves the trace_sched_stat* calls to after the task has been put
>> back on the CPU, so that the backtrace we get is the one that put us to
>> sleep. Mike Galbraith suggested leaving the actual stat collection where it
>> is so that we don't add scheduler latency to the sleep stats, so I've added
>> 3 new fields to the scheduler stats to record the last sleep, iowait and
>> block times; if any of these are set we know to fire the trace points.
>> Thanks,
>>
>
> Yeah, so I don't like this, it adds overhead for everyone.
>

Only if SCHEDSTATS is enabled, though, and it's no more overhead in the
SCHEDSTATS case than before. Would it be more acceptable to move the
entire callback under SCHEDSTATS?
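
To be concrete, the variant I'm offering would look roughly like the
below in core.c (untested sketch, not what I posted):

/*
 * Hypothetical variant of the hook: guard it with CONFIG_SCHEDSTATS so
 * it compiles away entirely on non-schedstats kernels.
 */
#ifdef CONFIG_SCHEDSTATS
static inline void post_schedule_stats(struct rq *rq)
{
	if (rq->curr->sched_class->post_schedule_stats)
		rq->curr->sched_class->post_schedule_stats(rq);
}
#else
static inline void post_schedule_stats(struct rq *rq) { }
#endif

post_schedule() would then call post_schedule_stats(rq) unconditionally,
and non-schedstats builds wouldn't pay anything, not even the NULL check.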

> Also, I'm very sure this issue has been raised before, and I'm pretty
> sure the answer back then was why don't you look at the
> trace_sched_switch() site matching this wakeup?
>

This is fine for discrete problems, but when trying to find a random
latency spike in a production workload it's impossible. If I do

trace-cmd record -e sched:sched_switch -T sleep 5

on just one of our random web servers I end up with this

du -h trace.dat
62M trace.dat

that's 62MB in 5 seconds. I ran the following command for almost 2
hours when searching for a latency spike

trace-cmd record -B latency -e sched:sched_stat_blocked -f "delay >= 100000" -T -o /root/latency.dat

and got the following .dat file

du -h latency.dat
48M latency.dat

That right there is a huge reason for this change, so I'd really like to
do whatever it takes to get this fixed. I'd be happy to move the
callback under CONFIG_SCHEDSTATS to reduce any possible overhead you
would be concerned about. Thanks,

Josef

2014-12-15 17:21:45

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: change where we report sched stats V2

On Mon, Dec 15, 2014 at 10:37:09AM -0500, Josef Bacik wrote:

> >Yeah, so I don't like this, it adds overhead for everyone.
> >
>
> Only if SCHEDSTATS is enabled, though, and it's no more overhead in the
> SCHEDSTATS case than before. Would it be more acceptable to move the entire
> callback under SCHEDSTATS?

Nah, doesn't work. Distros need to enable the world and then some, so
.config is a false choice.

> This is fine for discrete problems, but when trying to find a random latency
> spike in a production workload it's impossible. If I do
>
> trace-cmd record -e sched:sched_switch -T sleep 5
>
> on just one of our random web servers I end up with this
>
> du -h trace.dat
> 62M trace.dat
>
> that's 62MB in 5 seconds. I ran the following command for almost 2 hours
> when searching for a latency spike
>
> trace-cmd record -B latency -e sched:sched_stat_blocked -f "delay >= 100000" -T -o /root/latency.dat
>
> and got the following .dat file
>
> du -h latency.dat
> 48M latency.dat

Ah, regardless of what I think of our filter implementation, that actually
makes sense; let me ponder this a bit.

2014-12-15 17:30:37

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: change where we report sched stats V2

On Mon, Dec 15, 2014 at 06:21:29PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 15, 2014 at 10:37:09AM -0500, Josef Bacik wrote:
>
> > >Yeah, so I don't like this, it adds overhead for everyone.
> > >
> >
> > Only if SCHEDSTATS is enabled, though, and it's no more overhead in the
> > SCHEDSTATS case than before. Would it be more acceptable to move the entire
> > callback under SCHEDSTATS?
>
> Nah, doesn't work. Distros need to enable the world and then some, so
> .config is a false choice.
>
> > This is fine for discrete problems, but when trying to find a random latency
> > spike in a production workload it's impossible. If I do
> >
> > trace-cmd record -e sched:sched_switch -T sleep 5
> >
> > on just one of our random web servers I end up with this
> >
> > du -h trace.dat
> > 62M trace.dat
> >
> > that's 62MB in 5 seconds. I ran the following command for almost 2 hours
> > when searching for a latency spike
> >
> > trace-cmd record -B latency -e sched:sched_stat_blocked -f "delay >= 100000" -T -o /root/latency.dat
> >
> > and got the following .dat file
> >
> > du -h latency.dat
> > 48M latency.dat
>
> Ah, regardless of what I think of our filter implementation, that actually
> makes sense; let me ponder this a bit.

Oh, I just remembered we 'fixed' this for perf, see commit:

e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")

I'm not sure how to do the same thing with ftrace though; maybe Steve
knows.

The thing is, at wakeup time we know the task we're waking, so we pass
that task along and provide a trace for that instead of current. Andrew
(who implemented it) might have some userspace to share.
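
IOW, from userspace you'd do something like the below (rough sketch,
untested; the tracefs path is from memory and the ring buffer handling
is omitted entirely):

/*
 * Hypothetical example: open a sched_stat_sleep tracepoint event
 * targeted at one task via perf_event_open().  The event fires in the
 * waker's context but is delivered for the target task.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	long long id;
	FILE *f;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	/* Look up the tracepoint id for sched_stat_sleep. */
	f = fopen("/sys/kernel/debug/tracing/events/sched/sched_stat_sleep/id", "r");
	if (!f || fscanf(f, "%lld", &id) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = id;
	attr.sample_period = 1;
	attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;

	/* pid = the woken task we care about, any cpu. */
	fd = sys_perf_event_open(&attr, atoi(argv[1]), -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... mmap fd and consume PERF_RECORD_SAMPLEs here ... */
	pause();
	return 0;
}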