LinuxLists.cc - [PATCH] sched/fair: change where we report sched stats

2014-12-09 18:22:32

Subject: [PATCH] sched/fair: change where we report sched stats

The schedule stats currently spit out once the entity is being queued, which
means if we have stack traces enabled we will get the stack trace of the waker,
not of the task being woken. This makes the backtrace completely useless when
trying to track down latency spikes, as we want to know why we were put to sleep
for as long as we were.

This patch moves stat stuff to after the schedule, right as we are waking up,
this way if we have backtraces enabled we will get a useful backtrace. This
allows us to trace on the sched:sched_stat_blocked/iowait/sleep tracepoints and
limit them based on the duration rather than trace every sched_switch operation
and then post-parse that information looking for our latency problems.

I've tested this in production and it works well, I'd appreciate feedback on
this solution, I'd be happy to re-work it to be more acceptable and test here.
This is an important fix for us and anybody else who wants to do latency
debugging in production at a large scale. Thanks

Signed-off-by: Josef Bacik <[email protected]>
---
kernel/sched/core.c | 14 ++++----------
kernel/sched/fair.c | 14 ++++++--------
kernel/sched/sched.h | 1 +
3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 89e7283..e763709 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,11 +2277,12 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
tick_nohz_task_switch(current);
}

-#ifdef CONFIG_SMP
-
/* rq->lock is NOT held, but preemption is disabled */
static inline void post_schedule(struct rq *rq)
{
+ if (rq->curr->sched_class->post_schedule_stats)
+ rq->curr->sched_class->post_schedule_stats(rq);
+#ifdef CONFIG_SMP
if (rq->post_schedule) {
unsigned long flags;

@@ -2292,15 +2293,8 @@ static inline void post_schedule(struct rq *rq)

rq->post_schedule = 0;
}
-}
-
-#else
-
-static inline void post_schedule(struct rq *rq)
-{
-}
-
#endif
+}

/**
* schedule_tail - first thing a freshly forked thread must call.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef2b104..84d5804 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2702,13 +2702,12 @@ static inline int idle_balance(struct rq *rq)

#endif /* CONFIG_SMP */

-static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void task_update_stats(struct rq *rq)
{
#ifdef CONFIG_SCHEDSTATS
- struct task_struct *tsk = NULL;
-
- if (entity_is_task(se))
- tsk = task_of(se);
+ struct task_struct *tsk = rq->curr;
+ struct cfs_rq *cfs_rq = task_cfs_rq(tsk);
+ struct sched_entity *se = &tsk->se;

if (se->statistics.sleep_start) {
u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.sleep_start;
@@ -2829,10 +2828,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(cfs_rq);

- if (flags & ENQUEUE_WAKEUP) {
+ if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
- enqueue_sleeper(cfs_rq, se);
- }

update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
@@ -7966,6 +7963,7 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_move_group = task_move_group_fair,
#endif
+ .post_schedule_stats = task_update_stats,
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2df8ef0..7c0e977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1140,6 +1140,7 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
+ void (*post_schedule_stats) (struct rq *this_rq);
};

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
--
1.9.3

2014-12-10 06:23:30

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: change where we report sched stats

On Tue, 2014-12-09 at 13:21 -0500, Josef Bacik wrote:

> This patch moves stat stuff to after the schedule, right as we are waking up,

But sleep/block ends when the task is awakened/enqueued, not when it
gets the CPU. You're adding scheduling latency, breaking accounting.

-Mike

2014-12-10 21:49:29

by Josef Bacik

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: change where we report sched stats

On 12/10/2014 01:23 AM, Mike Galbraith wrote:
> On Tue, 2014-12-09 at 13:21 -0500, Josef Bacik wrote:
>
>> This patch moves stat stuff to after the schedule, right as we are waking up,
>
> But sleep/block ends when the task is awakened/enqueued, not when it
> gets the CPU. You're adding scheduling latency, breaking accounting.
>

Yes I'm aware of that. I don't care if the delay time is slightly
higher than normal, I care about knowing exactly why we were sleeping to
begin with. I suppose I could leave the accounting part where it is and
then just fire the tracepoint when it's put on the CPU so we get the
best of both worlds, but honestly I don't feel like adding the extra
scheduling latency into the accounting is that big of a deal. Thanks,

Josef

2014-12-11 03:34:21

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: change where we report sched stats

On Wed, 2014-12-10 at 16:48 -0500, Josef Bacik wrote:
> On 12/10/2014 01:23 AM, Mike Galbraith wrote:
> > On Tue, 2014-12-09 at 13:21 -0500, Josef Bacik wrote:
> >
> >> This patch moves stat stuff to after the schedule, right as we are waking up,
> >
> > But sleep/block ends when the task is awakened/enqueued, not when it
> > gets the CPU. You're adding scheduling latency, breaking accounting.
> >
>
> Yes I'm aware of that. I don't care if the delay time is slightly
> higher than normal, I care about knowing exactly why we were sleeping to
> begin with. I suppose I could leave the accounting part where it is and
> then just fire the tracepoint when it's put on the CPU so we get the
> best of both worlds, but honestly I don't feel like adding the extra
> scheduling latency into the accounting is that big of a deal. Thanks,

I think sleep/iowait should remain what they are, sleep/iowait end at
wakeup. I don't think waker trace is useless either for that matter.
Who/when ends a sleep period is just as much a part of the picture as
what triggered that sleep. Waker scheduling latency, thumb twiddling
etc. extend sleep.

Shrug, maintainer call. I don't recall ever having any difficulty
determining why a task went to sleep, so don't get it.

-Mike

2014-12-11 14:56:46

by Josef Bacik

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: change where we report sched stats

On 12/10/2014 10:34 PM, Mike Galbraith wrote:
> On Wed, 2014-12-10 at 16:48 -0500, Josef Bacik wrote:
>> On 12/10/2014 01:23 AM, Mike Galbraith wrote:
>>> On Tue, 2014-12-09 at 13:21 -0500, Josef Bacik wrote:
>>>
>>>> This patch moves stat stuff to after the schedule, right as we are waking up,
>>>
>>> But sleep/block ends when the task is awakened/enqueued, not when it
>>> gets the CPU. You're adding scheduling latency, breaking accounting.
>>>
>>
>> Yes I'm aware of that. I don't care if the delay time is slightly
>> higher than normal, I care about knowing exactly why we were sleeping to
>> begin with. I suppose I could leave the accounting part where it is and
>> then just fire the tracepoint when it's put on the CPU so we get the
>> best of both worlds, but honestly I don't feel like adding the extra
>> scheduling latency into the accounting is that big of a deal. Thanks,
>
> I think sleep/iowait should remain what they are, sleep/iowait end at
> wakeup. I don't think waker trace is useless either for that matter.
> Who/when ends a sleep period is just as much a part of the picture as
> what triggered that sleep. Waker scheduling latency, thumb twiddling
> etc. extend sleep.
>

Ok I'll re-roll with the stats themselves not changed. We can get the
waker other ways, but if we're wanting to see how long we were asleep we
are probably going to want to know why.

> Shrug, maintainer call. I don't recall ever having any difficulty
> determining why a task went to sleep, so don't get it.
>

How did you do it? I had one latency spike in a 90 minute test that
runs across 30 boxes that could have been caused by anything, so if
there is a way I could have easily found that without moving these
tracepoints around I'd love to hear it. Thanks,

Josef

2014-12-14 09:56:47

by Mike Galbraith

[permalink] [raw]

Subject: Re: [PATCH] sched/fair: change where we report sched stats

On Thu, 2014-12-11 at 09:55 -0500, Josef Bacik wrote:

> How did you do it? I had one latency spike in a 90 minute test that
> runs across 30 boxes that could have been caused by anything, so if
> there is a way I could have easily found that without moving these
> tracepoints around I'd love to hear it. Thanks,

If I have a target, I tend to set tracing_off() traps.

Seems you want an ftrace tracer that tracks sched_switch sleeping to
sched_switch running in addition to the wakeup latency tracer.

-Mike