2008-06-02 12:48:52

by Ankita Garg

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

Hi Gregory,

On Thu, May 22, 2008 at 08:06:44AM -0600, Gregory Haskins wrote:
> Hi all scheduler developers,
> I had an itch to scratch w.r.t. watching the stats in /proc/schedstats, and it appears that the perl scripts referenced in Documentation/scheduler/sched-stats.txt do not support v14 from HEAD so I whipped up a little utility I call "schedtop".
>

Nice tool! It helps visualize the data in schedstats much better.

Using the tool, I realized that most of the timing-related stats therein
might not be completely usable in many scenarios, as may already be
known.

Without any additional load on the system, all the stats are nice and
sane. But as soon as I ran my particular testcase, the data
pertaining to the delta of run_delay/cpu_time went haywire! I understand
that all the values are based on rq->clock, which relies on the TSC, and
since the TSC is not synced across cpus this results in skews/incorrect
values. So the data turns out to be not so reliable for debugging. This
is of course nothing related to the tool, but to schedstat in
general... rather, it just adds to the already existing woes with
non-synced TSCs :-)

> This utility will process statistics from /proc/schedstat such that the busiest stats will bubble up to the top. It can alternately be sorted by the largest stat, or by name. Stats can be included or excluded based on reg-ex pattern matching.
>
> You can download the tarball here:
>
> ftp://ftp.novell.com/dev/ghaskins/schedtop.tar.gz
>
> I have also posted it to the opensuse build service for generating RPMS for a handful of 32/64-bit x86 distros for your convenience:
>
> http://download.opensuse.org/repositories/home:/ghaskins/
>
> (Note that the build is still in progress for some of the favors, so if you do not see the flavor you are looking for, check back in a little while)
>
> Comments/feedback/bug-fixes welcome!

Maybe support for displaying process-specific schedstat info might help?
We already capture it under /proc/<pid>/schedstat. So, maybe a -p switch
taking a pid? The reason is that when looking at the stats for the entire
system, I might not be able to get much information specific to my
particular application... like when a particular thread suffers a lot of
delay waiting for the runqueue. Just a thought!

>
> Regards
> -Greg
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India


2008-06-02 13:09:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

On Mon, 2008-06-02 at 18:18 +0530, Ankita Garg wrote:
> Hi Gregory,
>
> On Thu, May 22, 2008 at 08:06:44AM -0600, Gregory Haskins wrote:
> > Hi all scheduler developers,
> > I had an itch to scratch w.r.t. watching the stats in /proc/schedstats, and it appears that the perl scripts referenced in Documentation/scheduler/sched-stats.txt do not support v14 from HEAD so I whipped up a little utility I call "schedtop".
> >
>
> Nice tool! Helps in better visualization of the data in schedstats.
>
> Using the tool, realized that most of the timing related stats therein
> might not be completely usable in many scenarios, as might already be
> known.
>
> Without any additional load on the system, all the stats are nice and
> sane. But, as soon as I ran my particular testcase, the data
> pertaining to the delta of run_delay/cpu_time went haywire! I understand
> that all the values are based on top of rq->clock, which relies on tsc that
> is not synced across cpus and would result in skews/incorrect values.
> But, turns out to be not so reliable data for debugging. This is
> ofcourse nothing related to the tool, but for schedstat in
> general...rather just adding on to the already existing woes with non-syned
> tscs :-)

Thing is, runtime should be calculated using per-cpu deltas.
You take a stamp when you get scheduled onto the cpu and another one
when you stop running; the delta is then added to the runtime.

Both stamps are always taken on the same cpu - when you get migrated
you're stopped and re-scheduled, so that should work out nicely.

So in that sense it shouldn't matter that the rq->clock values can get
skewed between cpus.

So I'm still a little puzzled by your observations; though it could be
that the schedstat stuff got broken - I've never really looked too
closely at it.


2008-06-02 13:20:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

Hi Ankita,
For some reason, I didn't get your original email. I had to go find it on the lkml.org archives.

But anyway, see inline

>>> On Mon, Jun 2, 2008 at 9:07 AM, in message
<[email protected]>, Peter Zijlstra
<[email protected]> wrote:
> On Mon, 2008-06-02 at 18:18 +0530, Ankita Garg wrote:
>> Hi Gregory,
>>
>> On Thu, May 22, 2008 at 08:06:44AM -0600, Gregory Haskins wrote:
>> > Hi all scheduler developers,
>> > I had an itch to scratch w.r.t. watching the stats in /proc/schedstats,
> and it appears that the perl scripts referenced in
> Documentation/scheduler/sched-stats.txt do not support v14 from HEAD so I
> whipped up a little utility I call "schedtop".
>> >
>>
>> Nice tool! Helps in better visualization of the data in schedstats.
>>
>> Using the tool, realized that most of the timing related stats therein
>> might not be completely usable in many scenarios, as might already be
>> known.
>>
>> Without any additional load on the system, all the stats are nice and
>> sane. But, as soon as I ran my particular testcase, the data
>> pertaining to the delta of run_delay/cpu_time went haywire! I understand
>> that all the values are based on top of rq->clock, which relies on tsc that
>> is not synced across cpus and would result in skews/incorrect values.
>> But, turns out to be not so reliable data for debugging. This is
>> ofcourse nothing related to the tool, but for schedstat in
>> general...rather just adding on to the already existing woes with non-syned
>> tscs :-)
>
> Thing is, things runtime should be calculated by using per cpu deltas.
> You take a stamp when you get scheduled on the cpu and another one when
> you stop running, then the delta is added to runtime.
>
> This is always on the same cpu - when you get migrated you're stopped
> and re-scheduled so that should work out nicely.
>
> So in that sense it shouldn't matter that the rq->clock values can get
> skewed between cpus.
>
> So I'm still a little puzzled by your observations; though it could be
> that the schedstat stuff got broken - I've never really looked too
> closely at it.

I suspect we must be talking about those stats that are always moving pretty fast. I see that too, and I use the (possibly little-known) filtering feature of schedtop: "-i REGEX" sets the include filter, and "-x REGEX" sets the exclude filter. By default, include allows everything and exclude filters nothing. Changing it to "-x sched_info" will exclude all those pesky stats that move fast and do not convey useful (to me, anyway) data. I hope that helps!

Also, about your idea for /proc/<pid>/schedstat: I was thinking the same thing on my trip on Friday. I will add this feature. Thanks!

-Greg

2008-06-05 05:20:32

by Ankita Garg

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

Hi,

On Mon, Jun 02, 2008 at 03:07:31PM +0200, Peter Zijlstra wrote:
> On Mon, 2008-06-02 at 18:18 +0530, Ankita Garg wrote:
> > Hi Gregory,
> >
> > On Thu, May 22, 2008 at 08:06:44AM -0600, Gregory Haskins wrote:
> > > Hi all scheduler developers,
> > > I had an itch to scratch w.r.t. watching the stats in /proc/schedstats, and it appears that the perl scripts referenced in Documentation/scheduler/sched-stats.txt do not support v14 from HEAD so I whipped up a little utility I call "schedtop".
> > >
> >
> > Nice tool! Helps in better visualization of the data in schedstats.
> >
> > Using the tool, realized that most of the timing related stats therein
> > might not be completely usable in many scenarios, as might already be
> > known.
> >
> > Without any additional load on the system, all the stats are nice and
> > sane. But, as soon as I ran my particular testcase, the data
> > pertaining to the delta of run_delay/cpu_time went haywire! I understand
> > that all the values are based on top of rq->clock, which relies on tsc that
> > is not synced across cpus and would result in skews/incorrect values.
> > But, turns out to be not so reliable data for debugging. This is
> > ofcourse nothing related to the tool, but for schedstat in
> > general...rather just adding on to the already existing woes with non-syned
> > tscs :-)
>
> Thing is, things runtime should be calculated by using per cpu deltas.
> You take a stamp when you get scheduled on the cpu and another one when
> you stop running, then the delta is added to runtime.
>
> This is always on the same cpu - when you get migrated you're stopped
> and re-scheduled so that should work out nicely.
>
> So in that sense it shouldn't matter that the rq->clock values can get
> skewed between cpus.
>
> So I'm still a little puzzled by your observations; though it could be
> that the schedstat stuff got broken - I've never really looked too
> closely at it.
>

Thanks, Peter, for the explanation...
I agree with the above, and that is the reason why I did not see weird
values for cpu_time. But run_delay would still suffer skews, as the end
points of the delta could be taken on different cpus due to migration
(more so on the RT kernel due to the push-pull operations). With the
patch below, I could not reproduce the issue I had seen earlier. On
every dequeue we take the delta, so wait measurements start from zero
when the task is moved to a different rq.


Signed-off-by: Ankita Garg <[email protected]>

Index: linux-2.6.24.4/kernel/sched.c
===================================================================
--- linux-2.6.24.4.orig/kernel/sched.c 2008-06-03 14:14:07.000000000 +0530
+++ linux-2.6.24.4/kernel/sched.c 2008-06-04 12:48:34.000000000 +0530
@@ -948,6 +948,7 @@

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
}
Index: linux-2.6.24.4/kernel/sched_stats.h
===================================================================
--- linux-2.6.24.4.orig/kernel/sched_stats.h 2008-06-03 14:14:28.000000000 +0530
+++ linux-2.6.24.4/kernel/sched_stats.h 2008-06-05 10:39:39.000000000 +0530
@@ -113,6 +113,13 @@
if (rq)
rq->rq_sched_info.cpu_time += delta;
}
+
+static inline void
+rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
+{
+ if (rq)
+ rq->rq_sched_info.run_delay += delta;
+}
# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
# define schedstat_set(var, val) do { var = (val); } while (0)
@@ -129,6 +136,11 @@
#endif

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+static inline void sched_info_reset_dequeued(struct task_struct *t)
+{
+ t->sched_info.last_queued = 0;
+}
+
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
@@ -138,15 +150,22 @@
* active queue, thus delaying tasks in the expired queue from running;
* see scheduler_tick()).
*
- * This function is only called from sched_info_arrive(), rather than
- * dequeue_task(). Even though a task may be queued and dequeued multiple
- * times as it is shuffled about, we're really interested in knowing how
- * long it was from the *first* time it was queued to the time that it
- * finally hit a cpu.
+ * Though we are interested in knowing how long it was from the *first* time a
+ * task was queued to the time that it finally hit a cpu, we call this routine
+ * from dequeue_task() to account for possible rq->clock skew across cpus. The
+ * delta taken on each cpu would annul the skew.
*/
static inline void sched_info_dequeued(struct task_struct *t)
{
- t->sched_info.last_queued = 0;
+ unsigned long long now = task_rq(t)->clock, delta = 0;
+
+ if(unlikely(sched_info_on()))
+ if(t->sched_info.last_queued)
+ delta = now - t->sched_info.last_queued;
+ sched_info_reset_dequeued(t);
+ t->sched_info.run_delay += delta;
+
+ rq_sched_info_dequeued(task_rq(t), delta);
}

/*
@@ -160,7 +179,7 @@

if (t->sched_info.last_queued)
delta = now - t->sched_info.last_queued;
- sched_info_dequeued(t);
+ sched_info_reset_dequeued(t);
t->sched_info.run_delay += delta;
t->sched_info.last_arrival = now;
t->sched_info.pcount++;

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2008-06-19 10:27:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

On Thu, 2008-06-05 at 10:50 +0530, Ankita Garg wrote:

> Thanks Peter for the explanation...
>
> I agree with the above and that is the reason why I did not see weird
> values with cpu_time. But, run_delay still would suffer skews as the end
> points for delta could be taken on different cpus due to migration (more
> so on RT kernel due to the push-pull operations). With the below patch,
> I could not reproduce the issue I had seen earlier. After every dequeue,
> we take the delta and start wait measurements from zero when moved to a
> different rq.

OK, so task delay accounting is broken because it doesn't take
migration into account.

What you've done is make it symmetric wrt enqueue, and account it like

cpu0 cpu1

enqueue
<wait-d1>
dequeue
enqueue
<wait-d2>
run

Where you add both d1 and d2 to the run_delay,.. right?

This seems like a good fix; however, it looks like the patch will break
compilation with !CONFIG_SCHEDSTATS && !CONFIG_TASK_DELAY_ACCT, as it
fails to provide a stub for sched_info_dequeued() in that case.


> Signed-off-by: Ankita Garg <[email protected]>
>
> Index: linux-2.6.24.4/kernel/sched.c
> ===================================================================
> --- linux-2.6.24.4.orig/kernel/sched.c 2008-06-03 14:14:07.000000000 +0530
> +++ linux-2.6.24.4/kernel/sched.c 2008-06-04 12:48:34.000000000 +0530
> @@ -948,6 +948,7 @@
>
> static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> {
> + sched_info_dequeued(p);
> p->sched_class->dequeue_task(rq, p, sleep);
> p->se.on_rq = 0;
> }
> Index: linux-2.6.24.4/kernel/sched_stats.h
> ===================================================================
> --- linux-2.6.24.4.orig/kernel/sched_stats.h 2008-06-03 14:14:28.000000000 +0530
> +++ linux-2.6.24.4/kernel/sched_stats.h 2008-06-05 10:39:39.000000000 +0530
> @@ -113,6 +113,13 @@
> if (rq)
> rq->rq_sched_info.cpu_time += delta;
> }
> +
> +static inline void
> +rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
> +{
> + if (rq)
> + rq->rq_sched_info.run_delay += delta;
> +}
> # define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
> # define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
> # define schedstat_set(var, val) do { var = (val); } while (0)
> @@ -129,6 +136,11 @@
> #endif
>
> #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
> +static inline void sched_info_reset_dequeued(struct task_struct *t)
> +{
> + t->sched_info.last_queued = 0;
> +}
> +
> /*
> * Called when a process is dequeued from the active array and given
> * the cpu. We should note that with the exception of interactive
> @@ -138,15 +150,22 @@
> * active queue, thus delaying tasks in the expired queue from running;
> * see scheduler_tick()).
> *
> - * This function is only called from sched_info_arrive(), rather than
> - * dequeue_task(). Even though a task may be queued and dequeued multiple
> - * times as it is shuffled about, we're really interested in knowing how
> - * long it was from the *first* time it was queued to the time that it
> - * finally hit a cpu.
> + * Though we are interested in knowing how long it was from the *first* time a
> + * task was queued to the time that it finally hit a cpu, we call this routine
> + * from dequeue_task() to account for possible rq->clock skew across cpus. The
> + * delta taken on each cpu would annul the skew.
> */
> static inline void sched_info_dequeued(struct task_struct *t)
> {
> - t->sched_info.last_queued = 0;
> + unsigned long long now = task_rq(t)->clock, delta = 0;
> +
> + if(unlikely(sched_info_on()))
> + if(t->sched_info.last_queued)
> + delta = now - t->sched_info.last_queued;
> + sched_info_reset_dequeued(t);
> + t->sched_info.run_delay += delta;
> +
> + rq_sched_info_dequeued(task_rq(t), delta);
> }
>
> /*
> @@ -160,7 +179,7 @@
>
> if (t->sched_info.last_queued)
> delta = now - t->sched_info.last_queued;
> - sched_info_dequeued(t);
> + sched_info_reset_dequeued(t);
> t->sched_info.run_delay += delta;
> t->sched_info.last_arrival = now;
> t->sched_info.pcount++;
>

2008-07-01 09:00:38

by Ankita Garg

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

Hi Peter,

On Thu, Jun 19, 2008 at 12:27:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 10:50 +0530, Ankita Garg wrote:
>
> > Thanks Peter for the explanation...
> >
> > I agree with the above and that is the reason why I did not see weird
> > values with cpu_time. But, run_delay still would suffer skews as the end
> > points for delta could be taken on different cpus due to migration (more
> > so on RT kernel due to the push-pull operations). With the below patch,
> > I could not reproduce the issue I had seen earlier. After every dequeue,
> > we take the delta and start wait measurements from zero when moved to a
> > different rq.
>
> OK, so task delay accounting is broken because it doesn't take
> migration into account.
>
> What you've done is make it symmetric wrt enqueue, and account it like
>
> cpu0 cpu1
>
> enqueue
> <wait-d1>
> dequeue
> enqueue
> <wait-d2>
> run
>
> Where you add both d1 and d2 to the run_delay,.. right?
>

Thanks for reviewing the patch. The above is exactly what I have done.

> This seems like a good fix; however, it looks like the patch will break
> compilation with !CONFIG_SCHEDSTATS && !CONFIG_TASK_DELAY_ACCT, as it
> fails to provide a stub for sched_info_dequeued() in that case.
>

Fixed. Please find the new patch below.

Signed-off-by: Ankita Garg <[email protected]>

Index: linux-2.6.24.4/kernel/sched.c
===================================================================
--- linux-2.6.24.4.orig/kernel/sched.c 2008-06-05 13:31:53.000000000 +0530
+++ linux-2.6.24.4/kernel/sched.c 2008-07-01 14:03:58.000000000 +0530
@@ -948,6 +948,7 @@

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
{
+ sched_info_dequeued(p);
p->sched_class->dequeue_task(rq, p, sleep);
p->se.on_rq = 0;
}
Index: linux-2.6.24.4/kernel/sched_stats.h
===================================================================
--- linux-2.6.24.4.orig/kernel/sched_stats.h 2008-06-05 13:31:53.000000000 +0530
+++ linux-2.6.24.4/kernel/sched_stats.h 2008-07-01 14:23:32.000000000 +0530
@@ -113,6 +113,13 @@
if (rq)
rq->rq_sched_info.cpu_time += delta;
}
+
+static inline void
+rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
+{
+ if (rq)
+ rq->rq_sched_info.run_delay += delta;
+}
# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
# define schedstat_set(var, val) do { var = (val); } while (0)
@@ -121,6 +128,9 @@
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
{}
static inline void
+rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
+{}
+static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{}
# define schedstat_inc(rq, field) do { } while (0)
@@ -129,6 +139,11 @@
#endif

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+static inline void sched_info_reset_dequeued(struct task_struct *t)
+{
+ t->sched_info.last_queued = 0;
+}
+
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
@@ -138,15 +153,22 @@
* active queue, thus delaying tasks in the expired queue from running;
* see scheduler_tick()).
*
- * This function is only called from sched_info_arrive(), rather than
- * dequeue_task(). Even though a task may be queued and dequeued multiple
- * times as it is shuffled about, we're really interested in knowing how
- * long it was from the *first* time it was queued to the time that it
- * finally hit a cpu.
+ * Though we are interested in knowing how long it was from the *first* time a
+ * task was queued to the time that it finally hit a cpu, we call this routine
+ * from dequeue_task() to account for possible rq->clock skew across cpus. The
+ * delta taken on each cpu would annul the skew.
*/
static inline void sched_info_dequeued(struct task_struct *t)
{
- t->sched_info.last_queued = 0;
+ unsigned long long now = task_rq(t)->clock, delta = 0;
+
+ if(unlikely(sched_info_on()))
+ if(t->sched_info.last_queued)
+ delta = now - t->sched_info.last_queued;
+ sched_info_reset_dequeued(t);
+ t->sched_info.run_delay += delta;
+
+ rq_sched_info_dequeued(task_rq(t), delta);
}

/*
@@ -160,7 +182,7 @@

if (t->sched_info.last_queued)
delta = now - t->sched_info.last_queued;
- sched_info_dequeued(t);
+ sched_info_reset_dequeued(t);
t->sched_info.run_delay += delta;
t->sched_info.last_arrival = now;
t->sched_info.pcount++;
@@ -231,7 +253,9 @@
__sched_info_switch(prev, next);
}
#else
-#define sched_info_queued(t) do { } while (0)
-#define sched_info_switch(t, next) do { } while (0)
+#define sched_info_queued(t) do { } while (0)
+#define sched_info_reset_dequeued(t) do { } while (0)
+#define sched_info_dequeued(t) do { } while (0)
+#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */


--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2008-07-03 14:01:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility


* Ankita Garg <[email protected]> wrote:

> > OK, so task delay accounting is broken because it doesn't take
> > migration into account.

> Fixed. Pl. find the new patch below.

applied to tip/sched/devel - thanks Ankita.

Ingo

2008-07-03 20:57:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

On Thu, 2008-07-03 at 10:34 +0200, Ingo Molnar wrote:
> * Ankita Garg <[email protected]> wrote:
>
> > > OK, so task delay accounting is broken because it doesn't take
> > > migration into account.
>
> > Fixed. Pl. find the new patch below.
>
> applied to tip/sched/devel - thanks Ankita.

ACK

2008-10-21 16:25:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: [ANNOUNCE] sched: schedtop utility

I finally put up a wiki for my schedtop utility

http://rt.wiki.kernel.org/index.php/Schedtop_utility

Hopefully someone out there besides myself finds this tool useful!

Regards,
-Greg