2010-11-20 02:09:23

by John Stultz

Subject: [PATCH 0/5] [RFC] Trivial scheduler related Android patches

So after all the heat that was generated in the various Android
discussions, I took a look at the Android git tree, and
while there are a fair number of large and controversial
infrastructure changes, there are also a number of small fixes
that apply easily against Linus' git tree.

So after cherry picking these 50-some small patches out of the
android tree, I organized them into topic branches, and over
the next few weeks, I hope to send them out to lkml and topic
maintainers for comments.

Now, I'm not proposing that these changes be merged as-is. It
may very well be that, unknown to me, Android developers have
already tried to submit these patches and they have been rejected
for good reason. Or some patches may very well be necessary hacks
to get things shipping while deeper fixes are being worked on. If
that is the case, let me know and forgive me for the noise.

But as it seemed that many of these small changes had been obscured
by the debate over the larger infrastructure changes, I wanted
to bring them forward so that possibly good fixes were not missed
in the controversy.

Maintainers: If you do find any of these patches distasteful,
that's fine, I'll be happy to drop them from my tree for now.
I really don't want to stir up another huge mail thread over these
small patches, but I'd appreciate it if you'd consider them as a
bug report illustrating an issue or a desired feature, and suggest
what you see as a reasonable way to accomplish the desired
functionality presented in the patch.

The following patches are just the scheduler related trivial patches
from the Android tree. You can find this as well as my other trivial
Android topic branches here:
http://git.linaro.org/gitweb?p=people/jstultz/linux.git;a=summary

thanks
-john

Cc: Arve Hjønnevåg <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Dima Zavin <[email protected]>
CC: Erik Gilling <[email protected]>
CC: Mike Chan <[email protected]>


Arve Hjønnevåg (1):
sched: Enable might_sleep before initializing drivers.

Dima Zavin (1):
sched: use the old min_vruntime when normalizing on dequeue

Erik Gilling (1):
sched: make task dump print all 15 chars of proc comm

Mike Chan (2):
scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU
frequencies
scheduler: cpuacct: Enable platform callbacks for cpuacct power
tracking

Documentation/cgroups/cpuacct.txt | 7 +++
include/linux/cpuacct.h | 43 +++++++++++++++++++
kernel/sched.c | 84 ++++++++++++++++++++++++++++++++++++-
kernel/sched_fair.c | 6 ++-
4 files changed, 137 insertions(+), 3 deletions(-)
create mode 100644 include/linux/cpuacct.h

--
1.7.3.2.146.gca209


2010-11-20 02:09:24

by John Stultz

Subject: [PATCH 1/5] sched: Enable might_sleep before initializing drivers.

From: Arve Hjønnevåg <[email protected]>

This allows detection of init bugs in built-in drivers.

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Signed-off-by: Arve Hjønnevåg <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index aa14a56..0b58415 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8104,13 +8104,24 @@ static inline int preempt_count_equals(int preempt_offset)
return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
}

+static int __might_sleep_init_called;
+int __init __might_sleep_init(void)
+{
+ __might_sleep_init_called = 1;
+ return 0;
+}
+early_initcall(__might_sleep_init);
+
void __might_sleep(const char *file, int line, int preempt_offset)
{
#ifdef in_atomic
static unsigned long prev_jiffy; /* ratelimiting */

if ((preempt_count_equals(preempt_offset) && !irqs_disabled()) ||
- system_state != SYSTEM_RUNNING || oops_in_progress)
+ oops_in_progress)
+ return;
+ if (system_state != SYSTEM_RUNNING &&
+ (!__might_sleep_init_called || system_state != SYSTEM_BOOTING))
return;
if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
return;
--
1.7.3.2.146.gca209
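
The early-return condition changed by this patch can be modelled outside the kernel. The following is a minimal userspace sketch, not kernel code: the enum, the struct, and both function names are hypothetical stand-ins, and the ratelimiting step is left out. It only shows which combinations of inputs make the check return early before and after the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's system_state; only the values the patch tests. */
enum system_state_model { STATE_BOOTING, STATE_RUNNING, STATE_OTHER };

/* The inputs __might_sleep() consults, gathered into one struct (hypothetical). */
struct sleep_check_inputs {
	bool context_ok;       /* preempt_count_equals(offset) && !irqs_disabled() */
	bool oops_in_progress;
	bool init_called;      /* models __might_sleep_init_called */
	enum system_state_model state;
};

/* Before the patch: stay silent unless the system is fully running. */
static bool old_check_skipped(const struct sleep_check_inputs *in)
{
	return in->context_ok || in->state != STATE_RUNNING ||
	       in->oops_in_progress;
}

/* After the patch: also report during boot once the early initcall has run. */
static bool new_check_skipped(const struct sleep_check_inputs *in)
{
	if (in->context_ok || in->oops_in_progress)
		return true;
	if (in->state != STATE_RUNNING &&
	    (!in->init_called || in->state != STATE_BOOTING))
		return true;
	return false;
}
```

With an atomic context during boot and the initcall already run, the old predicate skips the report and the new one does not, which is the window in which built-in driver init code runs.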

2010-11-20 02:09:46

by John Stultz

Subject: [PATCH 2/5] sched: make task dump print all 15 chars of proc comm

From: Erik Gilling <[email protected]>

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Change-Id: I1a5c9676baa06c9f9b4424bbcab01b9b2fbfcd99
Signed-off-by: Erik Gilling <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0b58415..c99bbb2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5390,7 +5390,7 @@ void sched_show_task(struct task_struct *p)
unsigned state;

state = p->state ? __ffs(p->state) + 1 : 0;
- printk(KERN_INFO "%-13.13s %c", p->comm,
+ printk(KERN_INFO "%-15.15s %c", p->comm,
state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
#if BITS_PER_LONG == 32
if (state == TASK_RUNNING)
--
1.7.3.2.146.gca209
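
The effect of the format change can be reproduced in userspace, since printk and snprintf share the `%-N.Ns` width and precision semantics. This is a small sketch under that assumption; `format_comm` is a hypothetical helper, and the state character is hard-coded to 'S' for illustration. TASK_COMM_LEN is 16 in the kernel, so a comm holds at most 15 characters plus the terminating NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16 /* 15 characters plus the terminating NUL */

/*
 * Format a comm the way sched_show_task() does, with a selectable field
 * width. Returns the number of characters written, excluding the NUL.
 */
static int format_comm(char *out, size_t outlen, const char *comm, int width)
{
	assert(strlen(comm) < TASK_COMM_LEN);
	/* "%-*.*s": left-justify, pad to width, truncate at width */
	return snprintf(out, outlen, "%-*.*s %c", width, width, comm, 'S');
}
```

With a width of 13, a 15-character name such as "kworker_example" is cut to "kworker_examp"; with 15 it is printed in full, and shorter names are padded with spaces so the columns still line up.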

2010-11-20 02:09:25

by John Stultz

Subject: [PATCH 4/5] scheduler: cpuacct: Enable platform callbacks for cpuacct power tracking

From: Mike Chan <[email protected]>

Platforms must register a CPU power function that returns power in
milliWatt seconds.

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Change-Id: I1caa0335e316c352eee3b1ddf326fcd4942bcbe8
Signed-off-by: Mike Chan <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
Documentation/cgroups/cpuacct.txt | 3 +++
include/linux/cpuacct.h | 4 +++-
kernel/sched.c | 24 ++++++++++++++++++++++--
3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 600d2d0..84e471b 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -44,6 +44,9 @@ cpuacct.cpufreq file gives CPU time (in nanoseconds) spent at each CPU
frequency. Platform hooks must be implemented inorder to properly track
time at each CPU frequency.

+cpuacct.power file gives CPU power consumed (in milliWatt seconds). Platform
+must provide and implement power callback functions.
+
cpuacct controller uses percpu_counter interface to collect user and
system times. This has two side effects:

diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
index 560df02..8f68e73 100644
--- a/include/linux/cpuacct.h
+++ b/include/linux/cpuacct.h
@@ -31,7 +31,9 @@ struct cpuacct_charge_calls {
*/
void (*init) (void **cpuacct_data);
void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
- void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+ void (*cpufreq_show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+ /* Returns power consumed in milliWatt seconds */
+ u64 (*power_usage) (void *cpuacct_data);
};

int cpuacct_charge_register(struct cpuacct_charge_calls *fn);
diff --git a/kernel/sched.c b/kernel/sched.c
index 35055fc..270d34a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9282,12 +9282,28 @@ static int cpuacct_cpufreq_show(struct cgroup *cgrp, struct cftype *cft,
struct cgroup_map_cb *cb)
{
struct cpuacct *ca = cgroup_ca(cgrp);
- if (ca->cpufreq_fn && ca->cpufreq_fn->show)
- ca->cpufreq_fn->show(ca->cpuacct_data, cb);
+ if (ca->cpufreq_fn && ca->cpufreq_fn->cpufreq_show)
+ ca->cpufreq_fn->cpufreq_show(ca->cpuacct_data, cb);

return 0;
}

+/* return total cpu power usage (milliWatt second) of a group */
+static u64 cpuacct_powerusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ int i;
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ u64 totalpower = 0;
+
+ if (ca->cpufreq_fn && ca->cpufreq_fn->power_usage)
+ for_each_present_cpu(i) {
+ totalpower += ca->cpufreq_fn->power_usage(
+ ca->cpuacct_data);
+ }
+
+ return totalpower;
+}
+
static struct cftype files[] = {
{
.name = "usage",
@@ -9306,6 +9322,10 @@ static struct cftype files[] = {
.name = "cpufreq",
.read_map = cpuacct_cpufreq_show,
},
+ {
+ .name = "power",
+ .read_u64 = cpuacct_powerusage_read
+ },
};

static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
--
1.7.3.2.146.gca209
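
To illustrate what a platform might supply for these hooks, here is a userspace sketch only. Everything prefixed `toy_` is hypothetical, the power figures are made up, and `struct toy_charge_calls` mirrors just three of the callbacks in `struct cpuacct_charge_calls` (the show callback is omitted because it needs kernel types). It assumes a single frequency domain, so the cpu argument is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_NR_FREQS 3

/* Hypothetical power draw per frequency bucket, in milliwatts. */
static const uint64_t toy_power_mw[TOY_NR_FREQS] = { 100, 250, 600 };

/* Bucket the imaginary platform is currently running at. */
static unsigned int toy_cur_freq;

/* Per-cgroup private data handed back through cpuacct_data. */
struct toy_acct {
	uint64_t time_ns[TOY_NR_FREQS];
};

/* Userspace mirror of three callbacks from struct cpuacct_charge_calls. */
struct toy_charge_calls {
	void (*init)(void **cpuacct_data);
	void (*charge)(void *cpuacct_data, uint64_t cputime, unsigned int cpu);
	uint64_t (*power_usage)(void *cpuacct_data);
};

static void toy_init(void **cpuacct_data)
{
	*cpuacct_data = calloc(1, sizeof(struct toy_acct));
}

/* Add the charged time to the bucket for the current frequency. */
static void toy_charge(void *cpuacct_data, uint64_t cputime, unsigned int cpu)
{
	struct toy_acct *acct = cpuacct_data;

	(void)cpu; /* single frequency domain assumed */
	assert(toy_cur_freq < TOY_NR_FREQS);
	if (acct)
		acct->time_ns[toy_cur_freq] += cputime;
}

/* Sum time * power over all buckets and convert mW*ns to mW*s. */
static uint64_t toy_power_usage(void *cpuacct_data)
{
	struct toy_acct *acct = cpuacct_data;
	uint64_t mw_ns = 0;
	unsigned int f;

	if (!acct)
		return 0;
	for (f = 0; f < TOY_NR_FREQS; f++)
		mw_ns += acct->time_ns[f] * toy_power_mw[f];
	return mw_ns / 1000000000ULL;
}

static const struct toy_charge_calls toy_calls = {
	toy_init, toy_charge, toy_power_usage
};
```

In this model, 2 s charged at the 100 mW bucket plus 0.5 s at the 600 mW bucket reads back as 500 milliWatt seconds.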

2010-11-20 02:10:06

by John Stultz

Subject: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

From: Mike Chan <[email protected]>

Introduce new platform callback hooks for cpuacct for tracking CPU frequencies

Not all platforms / architectures have a set CPU_FREQ_TABLE defined
for CPU transition speeds. In order to track time spent at various
CPU frequencies, we enable platform callbacks from cpuacct for this accounting.

Architectures that support overclock boosting, or don't have pre-defined
frequency tables can implement their own bucketing system that makes sense
given their cpufreq scaling abilities.

New file:
cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
frequency.

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Change-Id: I10a80b3162e6fff3a8a2f74dd6bb37e88b12ba96
Signed-off-by: Mike Chan <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
Documentation/cgroups/cpuacct.txt | 4 +++
include/linux/cpuacct.h | 41 +++++++++++++++++++++++++++++++
kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++
3 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 include/linux/cpuacct.h

diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b93094..600d2d0 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -40,6 +40,10 @@ system: Time spent by tasks of the cgroup in kernel mode.

user and system are in USER_HZ unit.

+cpuacct.cpufreq file gives CPU time (in nanoseconds) spent at each CPU
+frequency. Platform hooks must be implemented inorder to properly track
+time at each CPU frequency.
+
cpuacct controller uses percpu_counter interface to collect user and
system times. This has two side effects:

diff --git a/include/linux/cpuacct.h b/include/linux/cpuacct.h
new file mode 100644
index 0000000..560df02
--- /dev/null
+++ b/include/linux/cpuacct.h
@@ -0,0 +1,41 @@
+/* include/linux/cpuacct.h
+ *
+ * Copyright (C) 2010 Google, Inc.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _CPUACCT_H_
+#define _CPUACCT_H_
+
+#include <linux/cgroup.h>
+
+#ifdef CONFIG_CGROUP_CPUACCT
+
+/*
+ * Platform specific CPU frequency hooks for cpuacct. These functions are
+ * called from the scheduler.
+ */
+struct cpuacct_charge_calls {
+ /*
+ * Platforms can take advantage of this data and use
+ * per-cpu allocations if necessary.
+ */
+ void (*init) (void **cpuacct_data);
+ void (*charge) (void *cpuacct_data, u64 cputime, unsigned int cpu);
+ void (*show) (void *cpuacct_data, struct cgroup_map_cb *cb);
+};
+
+int cpuacct_charge_register(struct cpuacct_charge_calls *fn);
+
+#endif /* CONFIG_CGROUP_CPUACCT */
+
+#endif // _CPUACCT_H_
diff --git a/kernel/sched.c b/kernel/sched.c
index c99bbb2..35055fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -72,6 +72,7 @@
#include <linux/ctype.h>
#include <linux/ftrace.h>
#include <linux/slab.h>
+#include <linux/cpuacct.h>

#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -9082,8 +9083,30 @@ struct cpuacct {
u64 __percpu *cpuusage;
struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
struct cpuacct *parent;
+ struct cpuacct_charge_calls *cpufreq_fn;
+ void *cpuacct_data;
};

+static struct cpuacct *cpuacct_root;
+
+/* Default calls for cpufreq accounting */
+static struct cpuacct_charge_calls *cpuacct_cpufreq;
+int cpuacct_register_cpufreq(struct cpuacct_charge_calls *fn)
+{
+ cpuacct_cpufreq = fn;
+
+ /*
+ * Root node is created before platform can register callbacks,
+ * initalize here.
+ */
+ if (cpuacct_root && fn) {
+ cpuacct_root->cpufreq_fn = fn;
+ if (fn->init)
+ fn->init(&cpuacct_root->cpuacct_data);
+ }
+ return 0;
+}
+
struct cgroup_subsys cpuacct_subsys;

/* return cpu accounting group corresponding to this container */
@@ -9118,8 +9141,16 @@ static struct cgroup_subsys_state *cpuacct_create(
if (percpu_counter_init(&ca->cpustat[i], 0))
goto out_free_counters;

+ ca->cpufreq_fn = cpuacct_cpufreq;
+
+ /* If available, have platform code initalize cpu frequency table */
+ if (ca->cpufreq_fn && ca->cpufreq_fn->init)
+ ca->cpufreq_fn->init(&ca->cpuacct_data);
+
if (cgrp->parent)
ca->parent = cgroup_ca(cgrp->parent);
+ else
+ cpuacct_root = ca;

return &ca->css;

@@ -9247,6 +9278,16 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
return 0;
}

+static int cpuacct_cpufreq_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct cpuacct *ca = cgroup_ca(cgrp);
+ if (ca->cpufreq_fn && ca->cpufreq_fn->show)
+ ca->cpufreq_fn->show(ca->cpuacct_data, cb);
+
+ return 0;
+}
+
static struct cftype files[] = {
{
.name = "usage",
@@ -9261,6 +9302,10 @@ static struct cftype files[] = {
.name = "stat",
.read_map = cpuacct_stats_show,
},
+ {
+ .name = "cpufreq",
+ .read_map = cpuacct_cpufreq_show,
+ },
};

static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
@@ -9290,6 +9335,10 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
for (; ca; ca = ca->parent) {
u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
*cpuusage += cputime;
+
+ /* Call back into platform code to account for CPU speeds */
+ if (ca->cpufreq_fn && ca->cpufreq_fn->charge)
+ ca->cpufreq_fn->charge(ca->cpuacct_data, cputime, cpu);
}

rcu_read_unlock();
--
1.7.3.2.146.gca209

2010-11-20 02:10:04

by John Stultz

Subject: [PATCH 5/5] sched: use the old min_vruntime when normalizing on dequeue

From: Dima Zavin <[email protected]>

After pulling the thread off the run-queue during a cgroup change,
the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
then gets normalized to this new value. This can then lead to the thread
getting an unfair boost in the new group if the vruntime of the next
task in the old run-queue was way further ahead.

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Signed-off-by: Dima Zavin <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/sched_fair.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4f6a83..72f19ad 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -802,6 +802,8 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
+ u64 min_vruntime;
+
/*
* Update run-time statistics of the 'current'.
*/
@@ -826,6 +828,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
account_entity_dequeue(cfs_rq, se);
+
+ min_vruntime = cfs_rq->min_vruntime;
update_min_vruntime(cfs_rq);

/*
@@ -834,7 +838,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* movement in our normalized position.
*/
if (!(flags & DEQUEUE_SLEEP))
- se->vruntime -= cfs_rq->min_vruntime;
+ se->vruntime -= min_vruntime;
}

/*
--
1.7.3.2.146.gca209
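
The arithmetic behind the changelog can be checked with a toy model. This is a sketch with hypothetical names, not the scheduler's code: the run-queue is reduced to its min_vruntime, and the update step is simplified to "advance to the smallest vruntime still queued, never move backwards".

```c
#include <assert.h>
#include <stdint.h>

/* Toy run-queue: only the field the patch is concerned with. */
struct toy_cfs_rq {
	uint64_t min_vruntime;
};

/* Simplified update: min_vruntime only ever moves forward. */
static void toy_update_min_vruntime(struct toy_cfs_rq *rq,
				    uint64_t leftmost_vruntime)
{
	if ((int64_t)(leftmost_vruntime - rq->min_vruntime) > 0)
		rq->min_vruntime = leftmost_vruntime;
}

/*
 * Dequeue an entity that is about to move to another queue and return the
 * relative vruntime it carries along. use_old_min selects the patched
 * behaviour (normalize against the value sampled before the update).
 */
static int64_t toy_dequeue_for_move(struct toy_cfs_rq *rq,
				    uint64_t se_vruntime,
				    uint64_t leftmost_after,
				    int use_old_min)
{
	uint64_t old_min = rq->min_vruntime;

	toy_update_min_vruntime(rq, leftmost_after);
	return (int64_t)(se_vruntime -
			 (use_old_min ? old_min : rq->min_vruntime));
}
```

With min_vruntime at 1000, the departing entity at 1000, and the next task at 5000, normalizing after the update yields -4000, so the entity lands 4000 units behind the new queue's min_vruntime; normalizing against the old value yields 0.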

2010-11-20 10:42:22

by Peter Zijlstra

Subject: Re: [PATCH 1/5] sched: Enable might_sleep before initializing drivers.

On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> From: Arve Hjønnevåg <[email protected]>
>
> This allows detection of init bugs in built-in drivers.
>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Arve Hjønnevåg <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched.c | 13 ++++++++++++-
> 1 files changed, 12 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index aa14a56..0b58415 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8104,13 +8104,24 @@ static inline int preempt_count_equals(int preempt_offset)
> return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
> }
>
> +static int __might_sleep_init_called;
> +int __init __might_sleep_init(void)
> +{
> + __might_sleep_init_called = 1;
> + return 0;
> +}
> +early_initcall(__might_sleep_init);
> +
> void __might_sleep(const char *file, int line, int preempt_offset)
> {
> #ifdef in_atomic
> static unsigned long prev_jiffy; /* ratelimiting */
>
> if ((preempt_count_equals(preempt_offset) && !irqs_disabled()) ||
> - system_state != SYSTEM_RUNNING || oops_in_progress)
> + oops_in_progress)
> + return;
> + if (system_state != SYSTEM_RUNNING &&
> + (!__might_sleep_init_called || system_state != SYSTEM_BOOTING))
> return;
> if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
> return;

Remind me, why isn't scheduler_running good enough?

2010-11-20 10:48:10

by Peter Zijlstra

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> From: Mike Chan <[email protected]>
>
> Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
>
> Not all platforms / architectures have a set CPU_FREQ_TABLE defined
> for CPU transition speeds. In order to track time spent in at various
> CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
>
> Architectures that support overclock boosting, or don't have pre-defined
> frequency tables can implement their own bucketing system that makes sense
> given their cpufreq scaling abilities.
>
> New file:
> cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
> frequency.

I utterly detest all such accounting crap.. it adds ABI constraints, it
adds runtime overhead, etc..

Can't you get the same information by using the various perf bits? If
you trace the cpufreq changes you can compute the time spent in each
power state; if you additionally trace the sched_switch you can compute
it for each task.
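
The post-processing described here can be sketched in a few lines of C. The event layout below is hypothetical and is not the perf or ftrace record format; it assumes a single CPU, a time-ordered event stream, and small fixed tables indexed by task and by frequency bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_TASKS 4
#define N_FREQS 4

enum ev_type { EV_FREQ_CHANGE, EV_SCHED_SWITCH };

/* Hypothetical, already-decoded trace event. */
struct trace_ev {
	uint64_t ts_ns;       /* timestamp in nanoseconds */
	enum ev_type type;
	unsigned int value;   /* new frequency index, or next task index */
};

/*
 * Walk the events in order and charge each interval to the (task, freq)
 * pair that was active while it elapsed. task and freq give the state in
 * effect at the first event's timestamp.
 */
static void accumulate_residency(const struct trace_ev *ev, size_t n,
				 unsigned int task, unsigned int freq,
				 uint64_t residency[N_TASKS][N_FREQS])
{
	uint64_t last_ts;
	size_t i;

	assert(task < N_TASKS && freq < N_FREQS);
	if (n == 0)
		return;

	last_ts = ev[0].ts_ns;
	for (i = 0; i < n; i++) {
		assert(ev[i].ts_ns >= last_ts);
		residency[task][freq] += ev[i].ts_ns - last_ts;
		last_ts = ev[i].ts_ns;

		if (ev[i].type == EV_FREQ_CHANGE) {
			assert(ev[i].value < N_FREQS);
			freq = ev[i].value;
		} else {
			assert(ev[i].value < N_TASKS);
			task = ev[i].value;
		}
	}
}
```

Summing a row of the table gives a task's total run time; summing a column gives the time spent in one frequency bucket across all tasks.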

2010-11-20 10:54:57

by Peter Zijlstra

Subject: Re: [PATCH 5/5] sched: use the old min_vruntime when normalizing on dequeue

On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> From: Dima Zavin <[email protected]>
>
> After pulling the thread off the run-queue during a cgroup change,
> the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
> then gets normalized to this new value. This can then lead to the thread
> getting an unfair boost in the new group if the vruntime of the next
> task in the old run-queue was way further ahead.
>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> Cc: Arve Hjønnevåg <[email protected]>
> Signed-off-by: Dima Zavin <[email protected]>
> Signed-off-by: John Stultz <[email protected]>
> ---
> kernel/sched_fair.c | 6 +++++-
> 1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index f4f6a83..72f19ad 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -802,6 +802,8 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
> static void
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> + u64 min_vruntime;
> +
> /*
> * Update run-time statistics of the 'current'.
> */
> @@ -826,6 +828,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if (se != cfs_rq->curr)
> __dequeue_entity(cfs_rq, se);
> account_entity_dequeue(cfs_rq, se);
> +
> + min_vruntime = cfs_rq->min_vruntime;
> update_min_vruntime(cfs_rq);
>
> /*
> @@ -834,7 +838,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * movement in our normalized position.
> */
> if (!(flags & DEQUEUE_SLEEP))
> - se->vruntime -= cfs_rq->min_vruntime;
> + se->vruntime -= min_vruntime;
> }

Right, so assuming the reasoning is right (my brain still needs to wake
up) the patch is weird; why not simply move the code block up and avoid
the whole extra variable, like so?

---
kernel/sched_fair.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d35f464..dfa28ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1003,8 +1003,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->on_rq = 0;
update_cfs_load(cfs_rq, 0);
account_entity_dequeue(cfs_rq, se);
- update_min_vruntime(cfs_rq);
- update_cfs_shares(cfs_rq, 0);

/*
* Normalize the entity after updating the min_vruntime because the
@@ -1013,6 +1011,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if (!(flags & DEQUEUE_SLEEP))
se->vruntime -= cfs_rq->min_vruntime;
+
+ update_min_vruntime(cfs_rq);
+ update_cfs_shares(cfs_rq, 0);
}

/*

2010-11-20 12:33:16

by Peter Zijlstra

Subject: Re: [PATCH 5/5] sched: use the old min_vruntime when normalizing on dequeue

On Sat, 2010-11-20 at 11:55 +0100, Peter Zijlstra wrote:
> Right, so assuming the reasoning is right (my brain still needs to wake
> up) the patch is weird, by not simply move the code bock up and avoid
> the whole extra variable like so?

Also, clearly that comment needs addressing..

2010-11-22 05:52:08

by Florian Mickler

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Sat, 20 Nov 2010 11:48:24 +0100
Peter Zijlstra <[email protected]> wrote:

> On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> > From: Mike Chan <[email protected]>
> >
> > Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
> >
> > Not all platforms / architectures have a set CPU_FREQ_TABLE defined
> > for CPU transition speeds. In order to track time spent in at various
> > CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
> >
> > Architectures that support overclock boosting, or don't have pre-defined
> > frequency tables can implement their own bucketing system that makes sense
> > given their cpufreq scaling abilities.
> >
> > New file:
> > cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
> > frequency.
>
> I utterly detest all such accounting crap.. it adds ABI constraints it
> add runtime overhead. etc..
>
> Can't you get the same information by using the various perf bits? If
> you trace the cpufreq changes you can compute the time spend in each
> power state, if you additionally trace the sched_switch you can compute
> it for each task.
>
>
This is probably used for "on-site" debugging of production systems.

I.e. when someone sends them a problem report using a
bugreport-tool, they gather all useful information they can get on the
system because they only have one-way communication with their bug
reporters.

Do the perf bits work for such a usecase? If I guess correctly, the
perf bits need a userspace part that computes what would be in the
cpuacct.cpufreq file?

Regards,
Flo

2010-11-22 10:43:41

by Peter Zijlstra

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Mon, 2010-11-22 at 06:51 +0100, Florian Mickler wrote:
> On Sat, 20 Nov 2010 11:48:24 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> > > From: Mike Chan <[email protected]>
> > >
> > > Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
> > >
> > > Not all platforms / architectures have a set CPU_FREQ_TABLE defined
> > > for CPU transition speeds. In order to track time spent in at various
> > > CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
> > >
> > > Architectures that support overclock boosting, or don't have pre-defined
> > > frequency tables can implement their own bucketing system that makes sense
> > > given their cpufreq scaling abilities.
> > >
> > > New file:
> > > cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
> > > frequency.
> >
> > I utterly detest all such accounting crap.. it adds ABI constraints it
> > add runtime overhead. etc..
> >
> > Can't you get the same information by using the various perf bits? If
> > you trace the cpufreq changes you can compute the time spend in each
> > power state, if you additionally trace the sched_switch you can compute
> > it for each task.
> >
> >
> This is probably used for "on-site" debugging of production systems.

Dude, it's from the _android_ tree... it's cpufreq crud.. it must be some
crack induced power management scheme.

2010-11-22 12:23:43

by Florian Mickler

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Mon, 22 Nov 2010 11:43:59 +0100
Peter Zijlstra <[email protected]> wrote:

> On Mon, 2010-11-22 at 06:51 +0100, Florian Mickler wrote:
> > On Sat, 20 Nov 2010 11:48:24 +0100
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
> > > > From: Mike Chan <[email protected]>
> > > >
> > > > Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
> > > >
> > > > Not all platforms / architectures have a set CPU_FREQ_TABLE defined
> > > > for CPU transition speeds. In order to track time spent in at various
> > > > CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
> > > >
> > > > Architectures that support overclock boosting, or don't have pre-defined
> > > > frequency tables can implement their own bucketing system that makes sense
> > > > given their cpufreq scaling abilities.
> > > >
> > > > New file:
> > > > cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
> > > > frequency.
> > >
> > > I utterly detest all such accounting crap.. it adds ABI constraints it
> > > add runtime overhead. etc..
> > >
> > > Can't you get the same information by using the various perf bits? If
> > > you trace the cpufreq changes you can compute the time spend in each
> > > power state, if you additionally trace the sched_switch you can compute
> > > it for each task.
> > >
> > >
> > This is probably used for "on-site" debugging of production systems.
>
> Dude, its from the _android_ tree... its cpufreq crud.. it must be some
> crack induced power management scheme.
>
>

:)

What I wanted to get at was that they probably need these stats
aggregated somewhere neat and tidy and cannot compute them on the fly
by recording massive amounts of data...

I wonder why they didn't put this in the
idle-driver. I don't know.

Regards,
Flo

2010-11-23 02:05:42

by Mike Chan

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Mon, Nov 22, 2010 at 4:23 AM, Florian Mickler <[email protected]> wrote:
> On Mon, 22 Nov 2010 11:43:59 +0100
> Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, 2010-11-22 at 06:51 +0100, Florian Mickler wrote:
>> > On Sat, 20 Nov 2010 11:48:24 +0100
>> > Peter Zijlstra <[email protected]> wrote:
>> >
>> > > On Fri, 2010-11-19 at 18:08 -0800, John Stultz wrote:
>> > > > From: Mike Chan <[email protected]>
>> > > >
>> > > > Introduce new platform callback hooks for cpuacct for tracking CPU frequencies
>> > > >
>> > > > Not all platforms / architectures have a set CPU_FREQ_TABLE defined
>> > > > for CPU transition speeds. In order to track time spent in at various
>> > > > CPU frequencies, we enable platform callbacks from cpuacct for this accounting.
>> > > >
>> > > > Architectures that support overclock boosting, or don't have pre-defined
>> > > > frequency tables can implement their own bucketing system that makes sense
>> > > > given their cpufreq scaling abilities.
>> > > >
>> > > > New file:
>> > > > cpuacct.cpufreq reports the CPU time (in nanoseconds) spent at each CPU
>> > > > frequency.
>> > >
>> > > I utterly detest all such accounting crap.. it adds ABI constraints it
>> > > add runtime overhead. etc..
>> > >
>> > > Can't you get the same information by using the various perf bits? If
>> > > you trace the cpufreq changes you can compute the time spend in each
>> > > power state, if you additionally trace the sched_switch you can compute
>> > > it for each task.
>> > >
>> > >
>> > This is probably used for "on-site" debugging of production systems.
>>
>> Dude, its from the _android_ tree... its cpufreq crud.. it must be some
>> crack induced power management scheme.
>>
>>
>
> :)
>
> what I wanted to get at, was that they probably need these stats
> aggregated somewhere neat and tidy and can not compute them on the fly
> recording massive amounts of data...
>
> I wonder why they didn't put this in the
> idle-driver. I don't know.
>

This is useful for tracking cpu power per c-group. We split each
android application into its own c-group and track which cpu speeds
each one ran at and for how long. Peter, we've actually discussed
this before:
http://lkml.org/lkml/2010/5/6/301

These patches were discussed with Paul Menage and Balbir Singh back in
April, as well as on lkml and the cpufreq mailing lists. These may or
may not be useful for mainline; I assume anyone wanting to track power
specifically for c-groups would be interested. I'm open to different
implementations that can help achieve cpu power tracking per-cgroup if
this particular implementation is controversial, or if you just want
to help make Android's kernel better.

-- Mike

> Regards,
> Flo
>
>

2010-11-23 10:22:30

by Erik Gilling

Subject: [tip:sched/core] sched: Make task dump print all 15 chars of proc comm

Commit-ID: 28d0686cf7b14e30243096bd874d3f80591ed392
Gitweb: http://git.kernel.org/tip/28d0686cf7b14e30243096bd874d3f80591ed392
Author: Erik Gilling <[email protected]>
AuthorDate: Fri, 19 Nov 2010 18:08:51 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 23 Nov 2010 10:29:07 +0100

sched: Make task dump print all 15 chars of proc comm

Signed-off-by: Erik Gilling <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 550cf3a..324afce 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5249,7 +5249,7 @@ void sched_show_task(struct task_struct *p)
unsigned state;

state = p->state ? __ffs(p->state) + 1 : 0;
- printk(KERN_INFO "%-13.13s %c", p->comm,
+ printk(KERN_INFO "%-15.15s %c", p->comm,
state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
#if BITS_PER_LONG == 32
if (state == TASK_RUNNING)

2010-11-23 11:35:45

by Peter Zijlstra

Subject: Re: [PATCH 3/5] scheduler: cpuacct: Enable platform hooks to track cpuusage for CPU frequencies

On Mon, 2010-11-22 at 18:05 -0800, Mike Chan wrote:

> This is useful for tracking cpu power per c-group. We split each
> android application into its own c-group and track what cpu speeds and
> how long the cpu spent for each one. Peter we've actually discussed
> this before:
> http://lkml.org/lkml/2010/5/6/301
>
> These patches were discussed with Paul Menage and Balbir Singh back in
> April, as well as on lmkl and the cpufreq mailing lists. These may or
> may not be useful for mainline, I assume anyone wanting to track power
> specific for c-groups would be interested. I'm open for different
> implementations that can help achieve cpu power tracking per-cgroup if
> this particular implementation is controversial, or if you just want
> to help make Android's kernel better.

Right, so Stephane is working on perf-cgroup bits (I saw he recently
posted another version, which I guess I ought to look at soonish).

With that it would be rather simple to use perf to track per-cgroup
power state.