2007-10-16 20:37:55

by Ken Chen

Subject: [patch] sched: schedstat needs a diet

schedstat is useful for investigating CPU scheduler behavior. Ideally,
I think it is beneficial to have it on all the time. However, the
cost of turning it on in a production system is quite high, largely
due to the number of events it collects and also due to its large
memory footprint.

Most of the fields probably don't need to be a full 64 bits on 64-bit
arches. Rolling over 4 billion events will most likely take a long
time, and user-space tools can be made to accommodate that. I'm
proposing that the kernel cut back the width of most of these
variables on 64-bit systems. (Note: the following patch doesn't
affect 32-bit systems.)
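
For example, a user-space tool can tolerate the 32-bit rollover by
sampling periodically and computing deltas with modular arithmetic;
an illustrative sketch (not part of this patch):

    #include <stdint.h>

    /*
     * Delta between two samples of a 32-bit counter.  Unsigned
     * subtraction wraps modulo 2^32, so the result is correct as
     * long as fewer than 2^32 events occur between the samples.
     */
    static inline uint32_t counter_delta(uint32_t prev, uint32_t curr)
    {
        return curr - prev;
    }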


Signed-off-by: Ken Chen <[email protected]>


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 592e3a5..311a8bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -562,7 +562,7 @@ struct sched_info {
last_queued; /* when we were last queued to run */
#ifdef CONFIG_SCHEDSTATS
/* BKL stats */
- unsigned long bkl_count;
+ unsigned int bkl_count;
#endif
};
#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
@@ -698,34 +698,34 @@ struct sched_domain {

#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
- unsigned long lb_count[CPU_MAX_IDLE_TYPES];
- unsigned long lb_failed[CPU_MAX_IDLE_TYPES];
- unsigned long lb_balanced[CPU_MAX_IDLE_TYPES];
- unsigned long lb_imbalance[CPU_MAX_IDLE_TYPES];
- unsigned long lb_gained[CPU_MAX_IDLE_TYPES];
- unsigned long lb_hot_gained[CPU_MAX_IDLE_TYPES];
- unsigned long lb_nobusyg[CPU_MAX_IDLE_TYPES];
- unsigned long lb_nobusyq[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_count[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];

/* Active load balancing */
- unsigned long alb_count;
- unsigned long alb_failed;
- unsigned long alb_pushed;
+ unsigned int alb_count;
+ unsigned int alb_failed;
+ unsigned int alb_pushed;

/* SD_BALANCE_EXEC stats */
- unsigned long sbe_count;
- unsigned long sbe_balanced;
- unsigned long sbe_pushed;
+ unsigned int sbe_count;
+ unsigned int sbe_balanced;
+ unsigned int sbe_pushed;

/* SD_BALANCE_FORK stats */
- unsigned long sbf_count;
- unsigned long sbf_balanced;
- unsigned long sbf_pushed;
+ unsigned int sbf_count;
+ unsigned int sbf_balanced;
+ unsigned int sbf_pushed;

/* try_to_wake_up() stats */
- unsigned long ttwu_wake_remote;
- unsigned long ttwu_move_affine;
- unsigned long ttwu_move_balance;
+ unsigned int ttwu_wake_remote;
+ unsigned int ttwu_move_affine;
+ unsigned int ttwu_move_balance;
#endif
};

diff --git a/kernel/sched.c b/kernel/sched.c
index 0da2b26..5e7fce9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -328,22 +328,22 @@ struct rq {
struct sched_info rq_sched_info;

/* sys_sched_yield() stats */
- unsigned long yld_exp_empty;
- unsigned long yld_act_empty;
- unsigned long yld_both_empty;
- unsigned long yld_count;
+ unsigned int yld_exp_empty;
+ unsigned int yld_act_empty;
+ unsigned int yld_both_empty;
+ unsigned int yld_count;

/* schedule() stats */
- unsigned long sched_switch;
- unsigned long sched_count;
- unsigned long sched_goidle;
+ unsigned int sched_switch;
+ unsigned int sched_count;
+ unsigned int sched_goidle;

/* try_to_wake_up() stats */
- unsigned long ttwu_count;
- unsigned long ttwu_local;
+ unsigned int ttwu_count;
+ unsigned int ttwu_local;

/* BKL stats */
- unsigned long bkl_count;
+ unsigned int bkl_count;
#endif
struct lock_class_key rq_lock_key;
};
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index a5e517e..e6fb392 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -137,7 +137,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
#ifdef CONFIG_SCHEDSTATS
- SEQ_printf(m, " .%-30s: %ld\n", "bkl_count",
+ SEQ_printf(m, " .%-30s: %d\n", "bkl_count",
rq->bkl_count);
#endif
SEQ_printf(m, " .%-30s: %ld\n", "nr_spread_over",
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 1c08484..ef1a7df 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -21,7 +21,7 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %llu %llu %lu",
+ "cpu%d %u %u %u %u %u %u %u %u %u %llu %llu %lu",
cpu, rq->yld_both_empty,
rq->yld_act_empty, rq->yld_exp_empty, rq->yld_count,
rq->sched_switch, rq->sched_count, rq->sched_goidle,
@@ -42,8 +42,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
seq_printf(seq, "domain%d %s", dcount++, mask_str);
for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
itype++) {
- seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu "
- "%lu",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
@@ -53,8 +52,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
sd->lb_nobusyq[itype],
sd->lb_nobusyg[itype]);
}
- seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu"
- " %lu %lu %lu\n",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u\n",
sd->alb_count, sd->alb_failed, sd->alb_pushed,
sd->sbe_count, sd->sbe_balanced, sd->sbe_pushed,
sd->sbf_count, sd->sbf_balanced, sd->sbf_pushed,


2007-10-17 07:23:36

by Ingo Molnar

Subject: Re: [patch] sched: schedstat needs a diet


* Ken Chen <[email protected]> wrote:

> schedstat is useful for investigating CPU scheduler behavior. Ideally,
> I think it is beneficial to have it on all the time. However, the
> cost of turning it on in a production system is quite high, largely
> due to the number of events it collects and also due to its large
> memory footprint.
>
> Most of the fields probably don't need to be a full 64 bits on 64-bit
> arches. Rolling over 4 billion events will most likely take a long
> time, and user-space tools can be made to accommodate that. I'm
> proposing that the kernel cut back the width of most of these
> variables on 64-bit systems. (Note: the following patch doesn't
> affect 32-bit systems.)

thanks, applied.

note that current -git has a whole bunch of new schedstat fields in
/proc/<PID>/sched which can be used to track the exact balancing
behavior of tasks. They can be cleared by echoing 0 to the file, so
overflow is not an issue. Most of those new fields should probably be
unsigned int too (they are u64 right now).
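
for example, from user space (an illustrative sketch; only the
/proc/<PID>/sched path above is real):

    #include <stdio.h>
    #include <sys/types.h>

    /* C equivalent of "echo 0 > /proc/<pid>/sched" */
    static int clear_task_schedstats(pid_t pid)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/sched", (long)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("0\n", f);
        fclose(f);
        return 0;
    }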

Ingo

2007-10-17 09:30:04

by Peter Zijlstra

Subject: Re: [patch] sched: schedstat needs a diet

On Wed, 2007-10-17 at 09:23 +0200, Ingo Molnar wrote:
> * Ken Chen <[email protected]> wrote:
>
> > schedstat is useful for investigating CPU scheduler behavior.
> > Ideally, I think it is beneficial to have it on all the time.
> > However, the cost of turning it on in a production system is quite
> > high, largely due to the number of events it collects and also due
> > to its large memory footprint.
> >
> > Most of the fields probably don't need to be a full 64 bits on
> > 64-bit arches. Rolling over 4 billion events will most likely take
> > a long time, and user-space tools can be made to accommodate that.
> > I'm proposing that the kernel cut back the width of most of these
> > variables on 64-bit systems. (Note: the following patch doesn't
> > affect 32-bit systems.)
>
> thanks, applied.
>
> note that current -git has a whole bunch of new schedstat fields in
> /proc/<PID>/sched which can be used to track the exact balancing
> behavior of tasks. They can be cleared by echoing 0 to the file, so
> overflow is not an issue. Most of those new fields should probably
> be unsigned int too (they are u64 right now).
>

FWIW, I can't see how this patch saves a _lot_ of space. The stats
are per domain or per rq, and neither of those has many instances.

That said, I have no actual objection to the patch; I'm just not
getting it.


2007-10-18 22:20:15

by Mathieu Desnoyers

Subject: Re: [patch] sched: schedstat needs a diet

* Peter Zijlstra ([email protected]) wrote:
> On Wed, 2007-10-17 at 09:23 +0200, Ingo Molnar wrote:
> > * Ken Chen <[email protected]> wrote:
> >
> > > schedstat is useful for investigating CPU scheduler behavior.
> > > Ideally, I think it is beneficial to have it on all the time.
> > > However, the cost of turning it on in a production system is
> > > quite high, largely due to the number of events it collects and
> > > also due to its large memory footprint.
> > >
> > > Most of the fields probably don't need to be a full 64 bits on
> > > 64-bit arches. Rolling over 4 billion events will most likely
> > > take a long time, and user-space tools can be made to accommodate
> > > that. I'm proposing that the kernel cut back the width of most of
> > > these variables on 64-bit systems. (Note: the following patch
> > > doesn't affect 32-bit systems.)
> >
> > thanks, applied.
> >
> > note that current -git has a whole bunch of new schedstat fields
> > in /proc/<PID>/sched which can be used to track the exact
> > balancing behavior of tasks. They can be cleared by echoing 0 to
> > the file, so overflow is not an issue. Most of those new fields
> > should probably be unsigned int too (they are u64 right now).
> >
>
> FWIW, I can't see how this patch saves a _lot_ of space. The stats
> are per domain or per rq, and neither of those has many instances.
>
> That said, I have no actual objection to the patch; I'm just not
> getting it.
>
>

Good question indeed. How large is this memory footprint, exactly? If
it is as small as you say, I suspect that the real issue could be
that these variables are accessed by the scheduler's critical paths
and therefore thrash the caches.

(in bytes, with 8-byte longs, in 2.6.23-mm1)

task_struct:
  struct sched_entity   9 * 8 bytes
  struct sched_info     5 * 8 bytes
  (as Ingo noted, this is only in -mm. It really hurts, since it
  grows the task struct)

struct sched_domain:
  20 * 8 bytes
  O(nr cpus), or a little more on tricky setups

struct rq:
  struct sched_info     5 * 8 bytes
  10 * 8 bytes
  O(nr cpus), which is not much

If the memory footprint of struct sched_domain and struct rq really
matters, one should set NR_CPUS to the lowest value required by the
setup to help reduce the memory size, and forget about per-task
statistics.

Adding data to the task struct will turn out to be a real problem,
both for memory consumption and cache thrashing. Could we think of
allocating the memory required for statistics (scheduler, vm, ...)
only when stats collection is required? It would add one pointer to
the task struct (NULL by default, set to a memory location used to
accumulate per-task stats when we activate system-wide stats
counting). It could fit well with the immediate values, which could
be used to enable/disable statistics collection dynamically at
runtime with minimal impact on the scheduler code.
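
A minimal sketch of what I have in mind (all names hypothetical, and
assuming the usual kernel types and macros):

    struct task_stats {
        unsigned int wait_count;  /* events counted */
        u64 wait_sum;             /* accumulated wait time */
    };

    /* in task_struct: NULL until stats collection is enabled */
    struct task_stats *stats;

    /*
     * In the scheduler hot path this costs one load, one test and
     * a predictable branch while collection is disabled.
     */
    static inline void account_wait(struct task_struct *t, u64 delta)
    {
        struct task_stats *s = t->stats;

        if (unlikely(s)) {
            s->wait_count++;
            s->wait_sum += delta;
        }
    }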

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-18 22:57:48

by Ken Chen

Subject: Re: [patch] sched: schedstat needs a diet

On 10/18/07, Mathieu Desnoyers <[email protected]> wrote:
> Good question indeed. How large is this memory footprint, exactly?
> If it is as small as you say, I suspect that the real issue could be
> that these variables are accessed by the scheduler's critical paths
> and therefore thrash the caches.

Maybe my wording was ambiguous; I meant to reduce cache line
pollution when accessing these schedstat fields.

With unsigned long on x86_64, schedstat consumes 288 bytes per
sched_domain and 128 bytes in struct rq. On an extremely small system
that has a couple of CPU sockets and one level of NUMA nodes, that
adds up to 704 bytes per CPU for schedstat. Given how spread out
these fields are, we are probably talking about 11-12 cache line
evictions in several heavily used scheduler functions. Reducing cache
line pollution is the primary goal; actual memory consumption isn't
really a concern.
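
Roughly, assuming CPU_MAX_IDLE_TYPES is 3 and 64-byte cache lines:

    sched_domain: 8 arrays * 3 idle types + 12 scalars = 36 counters
                  36 counters * 8 bytes (unsigned long) = 288 bytes
    struct rq:    schedstat counters + rq_sched_info    = 128 bytes
    two domain levels + rq: 2 * 288 + 128               = 704 bytes
    704 bytes / 64 bytes per cache line                 = 11 lines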

- Ken

2007-10-18 23:13:19

by Mathieu Desnoyers

Subject: Re: [patch] sched: schedstat needs a diet

* Ken Chen ([email protected]) wrote:
> On 10/18/07, Mathieu Desnoyers <[email protected]> wrote:
> > Good question indeed. How large is this memory footprint, exactly?
> > If it is as small as you say, I suspect that the real issue could
> > be that these variables are accessed by the scheduler's critical
> > paths and therefore thrash the caches.
>
> Maybe my wording was ambiguous; I meant to reduce cache line
> pollution when accessing these schedstat fields.
>
> With unsigned long on x86_64, schedstat consumes 288 bytes per
> sched_domain and 128 bytes in struct rq. On an extremely small
> system that has a couple of CPU sockets and one level of NUMA nodes,
> that adds up to 704 bytes per CPU for schedstat. Given how spread
> out these fields are, we are probably talking about 11-12 cache line
> evictions in several heavily used scheduler functions. Reducing
> cache line pollution is the primary goal; actual memory consumption
> isn't really a concern.
>

Generally speaking, if such cache thrashing is an issue, why don't we
make sure that each task_struct member is declared in the structure
in order of its access frequency (except for #ifdef blocks, which
should stay together)? That could statistically save a lot of cache
lines.

Or is that already the case? It doesn't look like it when I see

struct list_head ptrace_list;

right beside the

struct mm_struct *mm, *active_mm;

pointers.
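
For illustration, the ordering I have in mind would look something
like this (hypothetical, showing only the two fields above):

    struct task_struct {
        /* hot: dereferenced on every context switch */
        struct mm_struct *mm, *active_mm;
        ...
        /* cold: only touched by ptrace; keep it away from the
         * hot fields so it does not share their cache lines */
        struct list_head ptrace_list;
        ...
    };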

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-31 09:06:51

by Rick Lindsley

Subject: Re: [patch] sched: schedstat needs a diet

> On 10/18/07, Mathieu Desnoyers <[email protected]> wrote:
> > Good question indeed. How large is this memory footprint, exactly?
> > If it is as small as you say, I suspect that the real issue could
> > be that these variables are accessed by the scheduler's critical
> > paths and therefore thrash the caches.
>
> Maybe my wording was ambiguous; I meant to reduce cache line
> pollution when accessing these schedstat fields.

As the original author, I always intended schedstats to be low enough
impact that it could be turned on all the time, if need be. That's
why, for the most part, it does increments and decrements of counters
and leaves the actual math to the apps that gather the data. Initial
measurements showed no measurable impact on performance.
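
The helpers are deliberately trivial; in kernel/sched_stats.h they
compile down to roughly the following when CONFIG_SCHEDSTATS is set,
and to empty statements otherwise:

    # define schedstat_inc(rq, field)  do { (rq)->field++; } while (0)
    # define schedstat_add(rq, field, amt) \
            do { (rq)->field += (amt); } while (0)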

Of course, that was years ago. Systems (and hardware) have changed
considerably since then. Are we talking about theoretical cache
pollution or measured? And if measured, what effect is it having?

(sorry for the late followup; was on vacation ...)

Rick