2007-08-31 02:04:54

by Roman Zippel

Subject: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

I'm glad to announce a working prototype of the basic algorithm I
already suggested last time.
As I already tried to explain previously, CFS has considerable
algorithmic and computational complexity. This patch should now make it
clearer why I could so easily skip over Ingo's long explanation of all
the tricks CFS uses to keep the computational overhead low - I simply
don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
kernel; the first 3 runs are without the patch, the last 3 runs are with
the patch:

Xeon 1.86GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host      OS             2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
suse      Linux 2.6.23- 0.9500 1.3900 1.1700 1.6500 1.1700 1.68000 1.22000
suse      Linux 2.6.23- 0.9700 1.3600 1.1400 1.6100 1.1500 1.67000 1.18000
suse      Linux 2.6.23- 0.9800 1.4000 1.1500 1.6600 1.1400 1.70000 1.19000
suse      Linux 2.6.23- 0.7500 1.2000 1.0000 1.4800 0.9800 1.52000 1.06000
suse      Linux 2.6.23- 0.7800 1.2200 0.9800 1.4600 1.0100 1.53000 1.05000
suse      Linux 2.6.23- 0.8300 1.2200 1.0000 1.4800 0.9800 1.52000 1.05000

Celeron 2.66GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host      OS             2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
devel6    Linux 2.6.23- 3.1300 3.3800 5.9000 5.8300   18.1 9.33000    18.0
devel6    Linux 2.6.23- 3.1400 3.2800 7.0700 6.0900   17.6 9.18000    17.8
devel6    Linux 2.6.23- 3.1400 3.6000 3.5400 6.1300   17.2 9.02000    17.2
devel6    Linux 2.6.23- 2.7400 3.0300 5.5200 6.4900   16.6 8.70000    17.0
devel6    Linux 2.6.23- 2.7400 3.2700 8.7700 6.8700   17.1 8.79000    17.4
devel6    Linux 2.6.23- 2.7500 3.3100 5.5600 6.1700   16.8 9.07000    17.2


Besides these numbers I can also provide a mathematical foundation for
it; I tried the same for CFS, but IMHO that's not really sanely possible.
This model is far more accurate than CFS and doesn't accumulate an error
over time, thus there is no underflow/overflow anymore within the
described limits.
The small example program also demonstrates how it can easily be scaled
down to millisecond resolution to completely get rid of the 64 bit math.
This may be interesting for a few archs even if they have a finer clock
resolution, as most scheduling decisions are made on a millisecond basis.

The basic idea of this scheduler is somewhat different from CFS. Where
the old scheduler maintained fixed time slices, CFS still maintains a
dynamic per-task time slice. This model does away with that completely;
instead it puts the tasks on a virtual (normalized) time line, where only
the relative distance between any two tasks is relevant.

So here are all the mathematical details necessary to understand what the
scheduler does, so anyone can judge for himself how solid this design
is. First some basics:

(1) time = sum_{t}^{T}(time_{t})
(2) weight_sum = sum_{t}^{T}(weight_{t})

Time per task is calculated with:

(3) time_{t} = time * weight_{t} / weight_sum

This can be also written as:

(4) time_{t} / weight_{t} = time / weight_sum

This way we have the normalized time:

(5) time_norm = time / weight_sum
(6) time_norm_{t} = time_{t} / weight_{t}

If every task got its share, they are all the same. Using time_norm one
can calculate the time tasks should get based on their weight:

(7) sum_{t}^{T}(time_{t}) = sum_{t}^{T}(round(time / weight_sum) * weight_{t})

This is basically what CFS currently does and it demonstrates the basic
problem it faces. It rounds the normalized time, and the rounding error
is distributed to the time a task gets, so there is a difference
between the time which is distributed to the tasks and the time consumed
by them. On the upside the error is distributed to all tasks equally
relative to their weight (so it isn't immediately visible via top). On
the downside the error itself is weighted too, so a small error can
become quite large: the higher the weight, the more it contributes to
the error and the more likely it hits one of the limits. Once it hits
a limit, the overflow/underflow time is simply thrown away, it is lost
for accounting and the task doesn't get the time it's supposed to get.
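
To make the effect of (7) concrete, here is a small stand-alone toy
example (not part of the patch or the attached fs3.c, numbers picked
arbitrarily): it distributes 10ms between a weight 1024 and a weight 110
task via a rounded normalized time and shows that the distributed time
differs from the consumed time, with the high-weight task absorbing most
of the error.

#include <stdio.h>

int main(void)
{
	long weight[2] = { 1024, 110 };		/* e.g. nice 0 and nice 10 */
	long weight_sum = weight[0] + weight[1];
	long long time = 10000000, distributed = 0;	/* 10ms in ns */
	/* round(time / weight_sum), as in (7) */
	long long time_norm = (time + weight_sum / 2) / weight_sum;
	int t;

	for (t = 0; t < 2; t++) {
		long long share = time_norm * weight[t];
		distributed += share;
		printf("task %d gets %lld ns (exact share %lld ns)\n",
		       t, share, time * weight[t] / weight_sum);
	}
	printf("consumed %lld ns, distributed %lld ns, error %lld ns\n",
	       time, distributed, distributed - time);
	return 0;
}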

An alternative approach is to not use time_norm at all to distribute the
time. Any task can be used to calculate the time any other task needs
relative to it. For this the normalized time per task is maintained
based on (6):

(8) time_norm_{t} * 2^16 = time_{t} * round(2^16 / weight_{t})

Using the difference between the normalized times of two tasks, one can
calculate the time needed to equalize their normalized times.
This has the advantage that round(2^16 / weight_{t}) is constant (unless
the task is reniced) and thus so is the error due to the rounding. The
time one task gets relative to another is based on these constants. As
only the delta of these times is needed, the absolute value can simply
overflow, and the limit for the maximum time delta is:

(9) time_delta_max = KCLOCK_MAX / (2^16 / weight_min)
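
As an illustration only (a user-space sketch with hypothetical names; it
follows the 2^16 scale of (8), the patch uses its own prio_to_wmult
constants), the per-task bookkeeping boils down to a single fixed-point
accumulation with a constant inverse weight, and comparing two tasks
only needs the signed difference of their time_norm values:

typedef long long kclock_t;

struct norm_entity {
	unsigned long inv_weight;	/* round(2^16 / weight), constant unless reniced */
	kclock_t time_norm;		/* accumulated time_{t} * inv_weight, may wrap */
};

/* (8): advance the normalized time by the executed time scaled with the
 * constant inverse weight; wrapping is harmless within the limit of (9) */
static void account_exec(struct norm_entity *e, unsigned long delta_exec)
{
	e->time_norm += (kclock_t)delta_exec * e->inv_weight;
}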


The global normalized time is still needed and useful (e.g. for waking
tasks) and thus faces the same issue as CFS right now - managing
the rounding error. This means one can't directly use the real
weight_{t} value anymore without producing new errors, so either one
uses this approximate weight:

(10) weight_app_{t} = 2^16 / round(2^16 / weight_{t})

or, even better, one gets rid of it completely.
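
To give one concrete number (using the 2^16 scale from (8), my example):
for weight 110 one gets round(2^16 / 110) = 596 and thus
weight_app_{t} = 65536 / 596 ~= 109.96, i.e. the approximation error for
this weight is well below 0.1 percent.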

Based on (5) and (6) one can calculate the global normalized time as:

(11) time_norm = sum_{t}^{T}(time_{t}) / sum_{t}^{T}(weight_app_{t})
= sum_{t}^{T}(time_norm_{t} * weight_app_{t}) / sum_{t}^{T}(weight_app_{t})

This is now a weighted average and provides the possibility to get rid
of weight_app_{t} by simply replacing it:

(12) time_norm_app = sum_{t}^{T}(time_norm_{t} * weight_{t}) / sum_{t}^{T}(weight_{t})

This produces only an approximate normalized time, but if all
time_norm_{t} are equal (i.e. all tasks got their share), the result is
the same, thus the error is only temporary. If one approximates this
value anyway, other replacements for the weight are possible too. In the
previous example program I simply used 1:

(13) time_norm_app = sum_{t}^{T}(time_norm_{t}) / T

Another approximation is to use a shift:

(14) time_norm_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) / sum_{t}^{T}(2^weight_shift_{t})

This helps to avoid a possible full 64 bit multiply, makes other
operations elsewhere simpler too, and the result should be close enough.
So by maintaining these two sums one can calculate an approximate
normalized time value:

(15) time_sum_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
(16) weight_sum_app = sum_{t}^{T}(2^weight_shift_{t})

Using (6) the dividend can also be written as:

(17) time_sum_app * 2^16 = sum_{t}^{T}(time_{t} * round(2^16 / weight_{t}) * 2^weight_shift_{t})

The multiplier "round(2^16 / weight_{t}) * 2^weight_shift_{t}" is off at
most by 1.42 (or e(.5*l(2))), so this approximate time sum value grows
almost linearly with the real time and thus the maximum of this sum is
reached approximately after:

(18) time_max = KCLOCK_MAX / 2^16 / 1.42
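
To put this into perspective (my numbers, assuming a signed 64 bit
kclock_t with nanosecond resolution, i.e. KCLOCK_MAX = 2^63 - 1):
time_max ~= 2^63 / 2^16 / 1.42 ~= 9.9 * 10^13 ns, which is roughly 27
hours of accumulated CPU time.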

The problem now is to avoid this overflow; for this the sum is regularly
updated by splitting it:

(19) time_sum_app = time_sum_base + time_sum_off

Using this (14) can be written as:

(20) time_norm_app = (time_sum_base + time_sum_off) / weight_sum_app
= time_sum_base / weight_sum_app + time_sum_off / weight_sum_app
= time_norm_base + time_sum_off / weight_sum_app

time_norm_base is maintained incrementally by defining this increment:

(21) time_norm_inc = time_sum_max / weight_sum_app

Every time time_sum_off exceeds time_sum_max, time_sum_off and
time_norm_base are adjusted appropriately. time_sum_max is scaled so
that the required update frequency is reduced to a minimum, but also so
that time_sum_off can be easily scaled down to a 32 bit value when
needed.
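
For illustration, here is a minimal user-space sketch of this folding
step (hypothetical struct, but it mirrors what entity_tick() and
enqueue_entity() in the patch below do):

typedef long long kclock_t;

struct norm_rq {			/* stand-in for the cfs_rq fields */
	kclock_t time_norm_base;	/* time_sum_base / weight_sum_app, see (20) */
	kclock_t time_norm_inc;		/* time_sum_max / weight_sum_app, see (21) */
	kclock_t time_sum_off;
	kclock_t time_sum_max;
};

/* move whole increments from the offset into the base, so time_sum_off
 * stays bounded by time_sum_max and can later be scaled down to 32 bit;
 * assumes time_sum_max > 0, i.e. at least one queued entity */
static void fold_time_sum(struct norm_rq *rq)
{
	while (rq->time_sum_off >= rq->time_sum_max) {
		rq->time_sum_off -= rq->time_sum_max;
		rq->time_norm_base += rq->time_norm_inc;
	}
}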

This basically describes the static system, but to allow for sleeping
and waking these sums need adjustments to preserve a proper average:

(22) weight_sum_app += 2^weight_shift_{new}
(23) time_sum_max += time_norm_inc * 2^weight_shift_{new}
(24) time_sum_off += (time_norm_{new} - time_norm_base) * 2^weight_shift_{new}

The last one is a little less obvious; it can be derived from (15) and
(19):

(25) time_sum_base + time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_sum_base
= sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_norm_base * weight_sum_app
= sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - sum_{t}^{T}(time_norm_base * 2^weight_shift_{t})
= sum_{t}^{T}((time_norm_{t} - time_norm_base) * 2^weight_shift_{t})

As one can see here, time_norm_base is only interesting relative to the
normalized task time, so the same limits apply.

The average from (20) can now be used to calculate the normalized time
for the new task in (24). It can be given a bonus relative to the other
tasks, or it might still be within a certain limit because it hasn't
slept long enough. The limit (9) still applies here, so a simple
generation counter may still be needed for long sleeps.
The time_sum_off value used to calculate the average can be scaled down
as mentioned above. As it contains far more resolution than needed for
short-term scheduling decisions, the lower bits can be thrown away to
get a 32 bit value. The scaling of time_sum_max makes sure that one
knows the location of the most significant bit, so that the 32 bits are
used as fully as possible.
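
A sketch of how the average from (20) can be computed for that purpose,
corresponding to get_time_avg() in the patch (names as in the sketch
above; the shift by MSHIFT first drops the excess resolution so the
division operates on a narrow value):

#define MSHIFT	16

static kclock_t time_norm_avg(struct norm_rq *rq, unsigned long weight_sum_app)
{
	kclock_t avg = rq->time_norm_base;

	/* (20): time_norm_base + time_sum_off / weight_sum_app */
	if (weight_sum_app)
		avg += (kclock_t)((long)(rq->time_sum_off >> MSHIFT) /
				  (long)weight_sum_app) << MSHIFT;
	return avg;
}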


Finally a few more notes about the patch. The current task is not kept
in the tree (this alone saves a lot of tree updates), so I faced a
similar problem as FAIR_GROUP_SCHED: enqueue_task/dequeue_task can
be called for the current task, so the class has to maintain the current
task pointer itself. Instead of adding a set_curr_task it would IMO be
simpler to further reduce the number of indirect calls, e.g. the
deactivate_task/put_prev_task sequence could be replaced with a single
call (and I don't need the sleep/wake arguments anymore, so it could be
reused for that).
I disabled the use of the cpu load, as its calculation is also rather
64 bit heavy. I suspect it could be easily scaled down, but this way
it's not my immediate concern.

Ingo, from this point now I need your help, you have to explain to me,
what is missing now relative to CFS. I tried to ask questions, but that
wasn't very successful...

bye, Roman

---
include/linux/sched.h | 21 -
kernel/sched.c | 158 +++++----
kernel/sched_debug.c | 26 -
kernel/sched_norm.c | 812 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 910 insertions(+), 107 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,39 +884,28 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned long delta_fair_sleep;
- unsigned long delta_exec;
- s64 fair_key;
+ s64 time_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ unsigned int on_rq, queued;;
+ unsigned int weight_shift;

u64 exec_start;
u64 sum_exec_runtime;
- u64 wait_start_fair;
- u64 sleep_start_fair;
+ s64 time_norm;
+ s64 req_weight_inv;

#ifdef CONFIG_SCHEDSTATS
- u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;

- u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;

- u64 block_start;
u64 block_max;
u64 exec_max;

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -183,18 +183,24 @@ struct cfs_rq {

s64 fair_clock;
u64 exec_clock;
- s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;

+ u64 prev_update;
+ s64 time_norm_base, time_norm_inc;
+ u64 run_start, run_end;
+ u64 run_end_next, run_end_curr;
+ s64 time_sum_max, time_sum_off;
+ unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
- struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
- struct sched_entity *curr;
+ struct sched_entity *curr, *next;
+ struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -231,12 +237,16 @@ struct rq {
*/
unsigned long nr_running;
#define CPU_LOAD_IDX_MAX 5
+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
+#ifdef CONFIG_SMP
struct load_stat ls; /* capture load from *all* tasks on this cpu */
+#endif
unsigned long nr_load_updates;
u64 nr_switches;

@@ -636,13 +646,6 @@ static void resched_cpu(int cpu)
resched_task(cpu_curr(cpu));
spin_unlock_irqrestore(&rq->lock, flags);
}
-#else
-static inline void resched_task(struct task_struct *p)
-{
- assert_spin_locked(&task_rq(p)->lock);
- set_tsk_need_resched(p);
-}
-#endif

static u64 div64_likely32(u64 divident, unsigned long divisor)
{
@@ -657,6 +660,13 @@ static u64 div64_likely32(u64 divident,
#endif
}

+#else
+static inline void resched_task(struct task_struct *p)
+{
+ assert_spin_locked(&task_rq(p)->lock);
+ set_tsk_need_resched(p);
+}
+#endif
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -734,15 +744,33 @@ static void update_load_sub(struct load_
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
+#ifdef SANE_NICE_LEVEL
+static const int prio_to_weight[40] = {
+ 3567, 3102, 2703, 2351, 2048, 1783, 1551, 1351, 1177, 1024,
+ 892, 776, 676, 588, 512, 446, 388, 338, 294, 256,
+ 223, 194, 169, 147, 128, 111, 97, 84, 74, 64,
+ 56, 49, 42, 37, 32, 28, 24, 21, 18, 16
+};
+
+static const u32 prio_to_wmult[40] = {
+ 294, 338, 388, 446, 512, 588, 676, 776, 891, 1024,
+ 1176, 1351, 1552, 1783, 2048, 2353, 2702, 3104, 3566, 4096,
+ 4705, 5405, 6208, 7132, 8192, 9410, 10809, 12417, 14263, 16384,
+ 18820, 21619, 24834, 28526, 32768, 37641, 43238, 49667, 57052, 65536
+};
+
+static const u32 prio_to_wshift[40] = {
+ 8, 8, 7, 7, 7, 7, 7, 6, 6, 6,
+ 6, 6, 5, 5, 5, 5, 5, 4, 4, 4,
+ 4, 4, 3, 3, 3, 3, 3, 2, 2, 2,
+ 2, 2, 1, 1, 1, 1, 1, 0, 0, 0
+};
+#else
static const int prio_to_weight[40] = {
- /* -20 */ 88761, 71755, 56483, 46273, 36291,
- /* -15 */ 29154, 23254, 18705, 14949, 11916,
- /* -10 */ 9548, 7620, 6100, 4904, 3906,
- /* -5 */ 3121, 2501, 1991, 1586, 1277,
- /* 0 */ 1024, 820, 655, 526, 423,
- /* 5 */ 335, 272, 215, 172, 137,
- /* 10 */ 110, 87, 70, 56, 45,
- /* 15 */ 36, 29, 23, 18, 15,
+ 95325, 74898, 61681, 49932, 38836, 31775, 24966, 20165, 16132, 12945,
+ 10382, 8257, 6637, 5296, 4228, 3393, 2709, 2166, 1736, 1387,
+ 1111, 888, 710, 568, 455, 364, 291, 233, 186, 149,
+ 119, 95, 76, 61, 49, 39, 31, 25, 20, 16
};

/*
@@ -753,16 +781,20 @@ static const int prio_to_weight[40] = {
* into multiplications:
*/
static const u32 prio_to_wmult[40] = {
- /* -20 */ 48388, 59856, 76040, 92818, 118348,
- /* -15 */ 147320, 184698, 229616, 287308, 360437,
- /* -10 */ 449829, 563644, 704093, 875809, 1099582,
- /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
- /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
- /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
- /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
- /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
+ 11, 14, 17, 21, 27, 33, 42, 52, 65, 81,
+ 101, 127, 158, 198, 248, 309, 387, 484, 604, 756,
+ 944, 1181, 1476, 1845, 2306, 2882, 3603, 4504, 5629, 7037,
+ 8796, 10995, 13744, 17180, 21475, 26844, 33554, 41943, 52429, 65536
};

+static const u32 prio_to_wshift[40] = {
+ 13, 12, 12, 12, 11, 11, 11, 10, 10, 10,
+ 9, 9, 9, 8, 8, 8, 7, 7, 7, 6,
+ 6, 6, 5, 5, 5, 5, 4, 4, 4, 3,
+ 3, 3, 2, 2, 2, 1, 1, 1, 0, 0
+};
+#endif
+
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);

/*
@@ -784,7 +816,8 @@ static int balance_tasks(struct rq *this

#include "sched_stats.h"
#include "sched_rt.c"
-#include "sched_fair.c"
+//#include "sched_fair.c"
+#include "sched_norm.c"
#include "sched_idletask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -792,6 +825,7 @@ static int balance_tasks(struct rq *this

#define sched_class_highest (&rt_sched_class)

+#ifdef CONFIG_SMP
static void __update_curr_load(struct rq *rq, struct load_stat *ls)
{
if (rq->curr != rq->idle && ls->load.weight) {
@@ -843,6 +877,14 @@ static inline void dec_load(struct rq *r
update_curr_load(rq);
update_load_sub(&rq->ls.load, p->se.load.weight);
}
+#else
+static inline void inc_load(struct rq *rq, const struct task_struct *p)
+{
+}
+static inline void dec_load(struct rq *rq, const struct task_struct *p)
+{
+}
+#endif

static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
@@ -858,9 +900,6 @@ static void dec_nr_running(struct task_s

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
- p->se.wait_runtime = 0;
-
if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
@@ -878,6 +917,8 @@ static void set_load_weight(struct task_

p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+ p->se.weight_shift = prio_to_wshift[p->static_prio - MAX_RT_PRIO];
+ p->se.req_weight_inv = p->se.load.inv_weight * (kclock_t)sysctl_sched_granularity;
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -986,11 +1027,13 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}

+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif

static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
@@ -1004,27 +1047,6 @@ static inline void __set_task_cpu(struct

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
- int old_cpu = task_cpu(p);
- struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
- u64 clock_offset, fair_clock_offset;
-
- clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->cfs.fair_clock - new_rq->cfs.fair_clock;
-
- if (p->se.wait_start_fair)
- p->se.wait_start_fair -= fair_clock_offset;
- if (p->se.sleep_start_fair)
- p->se.sleep_start_fair -= fair_clock_offset;
-
-#ifdef CONFIG_SCHEDSTATS
- if (p->se.wait_start)
- p->se.wait_start -= clock_offset;
- if (p->se.sleep_start)
- p->se.sleep_start -= clock_offset;
- if (p->se.block_start)
- p->se.block_start -= clock_offset;
-#endif
-
__set_task_cpu(p, new_cpu);
}

@@ -1584,21 +1606,12 @@ int fastcall wake_up_state(struct task_s
*/
static void __sched_fork(struct task_struct *p)
{
- p->se.wait_start_fair = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
- p->se.delta_exec = 0;
- p->se.delta_fair_run = 0;
- p->se.delta_fair_sleep = 0;
- p->se.wait_runtime = 0;
- p->se.sleep_start_fair = 0;

#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
p->se.sum_wait_runtime = 0;
p->se.sum_sleep_runtime = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
p->se.sleep_max = 0;
p->se.block_max = 0;
p->se.exec_max = 0;
@@ -1970,6 +1983,7 @@ unsigned long nr_active(void)
return running + uninterruptible;
}

+#ifdef CONFIG_SMP
/*
* Update rq->cpu_load[] statistics. This function is usually called every
* scheduler tick (TICK_NSEC).
@@ -2026,8 +2040,6 @@ do_avg:
}
}

-#ifdef CONFIG_SMP
-
/*
* double_rq_lock - safely lock two runqueues
*
@@ -3349,7 +3361,9 @@ void scheduler_tick(void)
if (unlikely(rq->clock < next_tick))
rq->clock = next_tick;
rq->tick_timestamp = rq->clock;
+#ifdef CONFIG_SMP
update_cpu_load(rq);
+#endif
if (curr != rq->idle) /* FIXME: needed? */
curr->sched_class->task_tick(rq, curr);
spin_unlock(&rq->lock);
@@ -6515,6 +6529,9 @@ static inline void init_cfs_rq(struct cf
{
cfs_rq->tasks_timeline = RB_ROOT;
cfs_rq->fair_clock = 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->inc_shift = 0;
#ifdef CONFIG_FAIR_GROUP_SCHED
cfs_rq->rq = rq;
#endif
@@ -6522,7 +6539,6 @@ static inline void init_cfs_rq(struct cf

void __init sched_init(void)
{
- u64 now = sched_clock();
int highest_cpu = 0;
int i, j;

@@ -6547,12 +6563,11 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
list_add(&rq->cfs.leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
#endif
- rq->ls.load_update_last = now;
- rq->ls.load_update_start = now;
+#ifdef CONFIG_SMP
+ rq->ls.load_update_last = rq->ls.load_update_start = sched_clock();

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
-#ifdef CONFIG_SMP
rq->sd = NULL;
rq->active_balance = 0;
rq->next_balance = jiffies;
@@ -6642,16 +6657,7 @@ void normalize_rt_tasks(void)

read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
- p->se.fair_key = 0;
- p->se.wait_runtime = 0;
p->se.exec_start = 0;
- p->se.wait_start_fair = 0;
- p->se.sleep_start_fair = 0;
-#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
-#endif
task_rq(p)->cfs.fair_clock = 0;
task_rq(p)->clock = 0;

Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_norm.c
@@ -0,0 +1,812 @@
+/*
+ * Completely Fair Scheduling (CFS) Class (SCHED_NORMAL/SCHED_BATCH)
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Really fair scheduling
+ * Copyright (C) 2007, Roman Zippel <[email protected]>
+ */
+
+typedef s64 kclock_t;
+
+static inline kclock_t kclock_max(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) > 0 ? x : y;
+}
+static inline kclock_t kclock_min(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) < 0 ? x : y;
+}
+
+#define MSHIFT 16
+#define MAX_TIMESUM ((kclock_t)1 << (30 + MSHIFT))
+
+/*
+ * Preemption granularity:
+ * (default: 2 msec, units: nanoseconds)
+ *
+ * NOTE: this granularity value is not the same as the concept of
+ * 'timeslice length' - timeslices in CFS will typically be somewhat
+ * larger than this value. (to see the precise effective timeslice
+ * length of your workload, run vmstat and monitor the context-switches
+ * field)
+ *
+ * On SMP systems the value of this is multiplied by the log2 of the
+ * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
+ * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
+ */
+unsigned int sysctl_sched_granularity __read_mostly = 2000000000ULL/HZ;
+
+/*
+ * SCHED_BATCH wake-up granularity.
+ * (default: 10 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_batch_wakeup_granularity __read_mostly =
+ 10000000000ULL/HZ;
+
+/*
+ * SCHED_OTHER wake-up granularity.
+ * (default: 1 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_wakeup_granularity __read_mostly = 1000000000ULL/HZ;
+
+unsigned int sysctl_sched_stat_granularity __read_mostly;
+
+/*
+ * Initialized in sched_init_granularity():
+ */
+unsigned int sysctl_sched_runtime_limit __read_mostly;
+
+/*
+ * Debugging: various feature bits
+ */
+enum {
+ SCHED_FEAT_FAIR_SLEEPERS = 1,
+ SCHED_FEAT_SLEEPER_AVG = 2,
+ SCHED_FEAT_SLEEPER_LOAD_AVG = 4,
+ SCHED_FEAT_PRECISE_CPU_LOAD = 8,
+ SCHED_FEAT_START_DEBIT = 16,
+ SCHED_FEAT_SKIP_INITIAL = 32,
+};
+
+unsigned int sysctl_sched_features __read_mostly =
+ SCHED_FEAT_FAIR_SLEEPERS *1 |
+ SCHED_FEAT_SLEEPER_AVG *1 |
+ SCHED_FEAT_SLEEPER_LOAD_AVG *1 |
+ SCHED_FEAT_PRECISE_CPU_LOAD *1 |
+ SCHED_FEAT_START_DEBIT *1 |
+ SCHED_FEAT_SKIP_INITIAL *0;
+
+extern struct sched_class fair_sched_class;
+
+static inline kclock_t get_time_avg(struct cfs_rq *cfs_rq)
+{
+ kclock_t avg;
+
+ avg = cfs_rq->time_norm_base;
+ if (cfs_rq->weight_sum)
+ avg += (kclock_t)((int)(cfs_rq->time_sum_off >> MSHIFT) / cfs_rq->weight_sum) << MSHIFT;
+
+ return avg;
+}
+
+/**************************************************************
+ * CFS operations on generic schedulable entities:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* cpu runqueue to which this cfs_rq is attached */
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rq;
+}
+
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return container_of(cfs_rq, struct rq, cfs);
+}
+
+#define entity_is_task(se) 1
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ return container_of(se, struct task_struct, se);
+}
+
+
+/**************************************************************
+ * Scheduling class tree data structure manipulation methods:
+ */
+
+/*
+ * Enqueue an entity into the rb-tree:
+ */
+static inline void
+__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_entity *entry;
+ kclock_t key;
+ int leftmost = 1;
+
+ se->time_key = key = se->time_norm + (se->req_weight_inv >> 1);
+ /*
+ * Find the right place in the rbtree:
+ */
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_entity, run_node);
+ /*
+ * We dont care about collisions. Nodes with
+ * the same key stay together.
+ */
+ if (key - entry->time_key < 0) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used):
+ */
+ if (leftmost) {
+ cfs_rq->rb_leftmost = se;
+ if (cfs_rq->curr) {
+ cfs_rq->run_end_next = se->time_norm + se->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(se->time_norm, cfs_rq->run_end_curr));
+ }
+ }
+
+ rb_link_node(&se->run_node, parent, link);
+ rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_add(&cfs_rq->load, se->load.weight);
+ se->queued = 1;
+}
+
+static inline void
+__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (cfs_rq->rb_leftmost == se) {
+ struct rb_node *next = rb_next(&se->run_node);
+ cfs_rq->rb_leftmost = next ? rb_entry(next, struct sched_entity, run_node) : NULL;
+ }
+ rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_sub(&cfs_rq->load, se->load.weight);
+ se->queued = 0;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static void verify_queue(struct cfs_rq *cfs_rq, int inc_curr, struct sched_entity *se2)
+{
+ struct rb_node *node;
+ struct sched_entity *se;
+ kclock_t sum = 0;
+
+ se = cfs_rq->curr;
+ if (inc_curr && se)
+ sum = (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ node = rb_first(&cfs_rq->tasks_timeline);
+ WARN_ON(node && cfs_rq->rb_leftmost != rb_entry(node, struct sched_entity, run_node));
+ while (node) {
+ se = rb_entry(node, struct sched_entity, run_node);
+ sum += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ node = rb_next(node);
+ }
+ if (sum != cfs_rq->time_sum_off) {
+ kclock_t oldsum = cfs_rq->time_sum_off;
+ cfs_rq->time_sum_off = sum;
+ printk("%ld:%Lx,%Lx,%p,%p,%d\n", cfs_rq->nr_running, sum, oldsum, cfs_rq->curr, se2, inc_curr);
+ WARN_ON(1);
+ }
+}
+#else
+#define verify_queue(q,c,s) ((void)0)
+#endif
+
+/**************************************************************
+ * Scheduling class statistics methods:
+ */
+
+/*
+ * Update the current task's runtime statistics. Skip current tasks that
+ * are not in our scheduling class.
+ */
+static inline void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ kclock_t now = rq_of(cfs_rq)->clock;
+ unsigned long delta_exec;
+ kclock_t delta_norm;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = now - cfs_rq->prev_update;
+ if (!delta_exec)
+ return;
+ cfs_rq->prev_update = now;
+
+ curr->sum_exec_runtime += delta_exec;
+ cfs_rq->exec_clock += delta_exec;
+
+ delta_norm = (kclock_t)delta_exec * curr->load.inv_weight;
+ curr->time_norm += delta_norm;
+ cfs_rq->time_sum_off += delta_norm << curr->weight_shift;
+
+ verify_queue(cfs_rq, 4, curr);
+}
+
+static void
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ kclock_t min_time;
+
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+ min_time = get_time_avg(cfs_rq) - se->req_weight_inv;
+ if ((kclock_t)(se->time_norm - min_time) < 0)
+ se->time_norm = min_time;
+
+ cfs_rq->nr_running++;
+ cfs_rq->weight_sum += 1 << se->weight_shift;
+ if (cfs_rq->inc_shift < se->weight_shift) {
+ cfs_rq->time_norm_inc >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->time_sum_max >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->inc_shift = se->weight_shift;
+ }
+ cfs_rq->time_sum_max += cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max >= MAX_TIMESUM) {
+ cfs_rq->time_norm_inc >>= 1;
+ cfs_rq->time_sum_max >>= 1;
+ cfs_rq->inc_shift++;
+ }
+ cfs_rq->time_sum_off += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ if (cfs_rq->time_sum_off >= cfs_rq->time_sum_max) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ }
+
+ if (&rq_of(cfs_rq)->curr->se == se)
+ cfs_rq->curr = se;
+ if (cfs_rq->curr != se)
+ __enqueue_entity(cfs_rq, se);
+ verify_queue(cfs_rq, 2, se);
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
+{
+ verify_queue(cfs_rq, 3, se);
+
+ cfs_rq->weight_sum -= 1 << se->weight_shift;
+ if (cfs_rq->weight_sum) {
+ cfs_rq->time_sum_max -= cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max < (MAX_TIMESUM >> 1)) {
+ cfs_rq->time_norm_inc <<= 1;
+ cfs_rq->time_sum_max <<= 1;
+ cfs_rq->inc_shift--;
+ }
+
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ if (cfs_rq->time_sum_off < 0) {
+ cfs_rq->time_sum_off += cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base -= cfs_rq->time_norm_inc;
+ }
+ } else {
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ BUG_ON(cfs_rq->time_sum_off);
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->inc_shift = 0;
+ }
+
+
+ cfs_rq->nr_running--;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ if (cfs_rq->curr == se)
+ cfs_rq->curr = NULL;
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+}
+
+static inline void
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ cfs_rq->prev_update = rq_of(cfs_rq)->clock;
+ cfs_rq->run_start = se->time_norm;
+ cfs_rq->run_end = cfs_rq->run_end_curr = cfs_rq->run_start + se->req_weight_inv;
+ cfs_rq->curr = se;
+}
+
+static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se = cfs_rq->next ? cfs_rq->next : cfs_rq->rb_leftmost;
+ struct sched_entity *next;
+
+ cfs_rq->next = NULL;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se);
+
+ next = cfs_rq->rb_leftmost;
+ if (next) {
+ cfs_rq->run_end_next = next->time_norm + next->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(next->time_norm, cfs_rq->run_end_curr));
+ }
+
+ return se;
+}
+
+static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
+{
+ update_curr(cfs_rq);
+ if (prev->on_rq)
+ __enqueue_entity(cfs_rq, prev);
+ cfs_rq->curr = NULL;
+}
+
+static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ update_curr(cfs_rq);
+
+ while (cfs_rq->time_sum_off >= cfs_rq->time_sum_max) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ }
+
+ /*
+ * Reschedule if another task tops the current one.
+ */
+ if (cfs_rq->rb_leftmost && (kclock_t)(curr->time_norm - cfs_rq->run_end) >= 0) {
+ cfs_rq->next = cfs_rq->rb_leftmost;
+ resched_task(rq_of(cfs_rq)->curr);
+ }
+}
+
+/**************************************************
+ * CFS operations on tasks:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* Walk up scheduling entities hierarchy */
+#define for_each_sched_entity(se) \
+ for (; se; se = se->parent)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return grp->my_q;
+}
+
+/* Given a group's cfs_rq on one cpu, return its corresponding cfs_rq on
+ * another cpu ('this_cpu')
+ */
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ /* A later patch will take group into account */
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+
+/* Do the two (enqueued) tasks belong to the same group ? */
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ if (curr->se.cfs_rq == p->se.cfs_rq)
+ return 1;
+
+ return 0;
+}
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+#define for_each_sched_entity(se) \
+ for (; se; se = NULL)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ struct task_struct *p = task_of(se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return NULL;
+}
+
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ return 1;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+/*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+ cfs_rq = cfs_rq_of(se);
+ enqueue_entity(cfs_rq, se);
+ }
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ dequeue_entity(cfs_rq, se, sleep);
+ /* Don't dequeue parent if it has other entities besides us */
+ if (cfs_rq->weight_sum)
+ break;
+ }
+}
+
+/*
+ * sched_yield() support is very simple - we dequeue and enqueue
+ */
+static void yield_task_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *next;
+ __update_rq_clock(rq);
+
+ update_curr(cfs_rq);
+ next = cfs_rq->rb_leftmost;
+ if (next && (kclock_t)(p->se.time_norm - next->time_norm) > 0) {
+ cfs_rq->next = next;
+ return;
+ }
+ cfs_rq->next = &p->se;
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *curr = rq->curr;
+ struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+ unsigned long gran;
+
+ if (!cfs_rq->curr || unlikely(rt_prio(p->prio))) {
+ resched_task(curr);
+ return;
+ }
+ update_curr(cfs_rq);
+
+ gran = sysctl_sched_wakeup_granularity;
+ /*
+ * Batch tasks prefer throughput over latency:
+ */
+ if (unlikely(p->policy == SCHED_BATCH))
+ gran = sysctl_sched_batch_wakeup_granularity;
+
+ if (is_same_group(curr, p) &&
+ cfs_rq->rb_leftmost == &p->se &&
+ curr->se.time_norm - p->se.time_norm >= cfs_rq->curr->load.inv_weight * (kclock_t)gran) {
+ cfs_rq->next = &p->se;
+ resched_task(curr);
+ }
+}
+
+static struct task_struct *pick_next_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (unlikely(!cfs_rq->nr_running))
+ return NULL;
+
+ if (cfs_rq->nr_running == 1 && cfs_rq->curr)
+ return task_of(cfs_rq->curr);
+
+ do {
+ se = pick_next_entity(cfs_rq);
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+ struct sched_entity *se = &prev->se;
+ struct cfs_rq *cfs_rq;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ put_prev_entity(cfs_rq, se);
+ }
+}
+
+/**************************************************
+ * Fair scheduling class load-balancing methods:
+ */
+
+/*
+ * Load-balancing iterator. Note: while the runqueue stays locked
+ * during the whole iteration, the current task might be
+ * dequeued so the iterator has to be dequeue-safe. Here we
+ * achieve that by always pre-iterating before returning
+ * the current task:
+ */
+static inline struct task_struct *
+__load_balance_iterator(struct cfs_rq *cfs_rq)
+{
+ struct task_struct *p;
+ struct rb_node *curr;
+
+ curr = cfs_rq->rb_load_balance_curr;
+ if (!curr)
+ return NULL;
+
+ p = rb_entry(curr, struct task_struct, se.run_node);
+ cfs_rq->rb_load_balance_curr = rb_next(curr);
+
+ return p;
+}
+
+static struct task_struct *load_balance_start_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ cfs_rq->rb_load_balance_curr = cfs_rq->rb_leftmost ?
+ &cfs_rq->rb_leftmost->run_node : NULL;
+ if (cfs_rq->curr)
+ return rb_entry(&cfs_rq->curr->run_node, struct task_struct, se.run_node);
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+static struct task_struct *load_balance_next_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr;
+ struct task_struct *p;
+
+ if (!cfs_rq->nr_running)
+ return MAX_PRIO;
+
+ curr = cfs_rq->rb_leftmost;
+ p = task_of(curr);
+
+ return p->prio;
+}
+#endif
+
+static unsigned long
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_nr_move, unsigned long max_load_move,
+ struct sched_domain *sd, enum cpu_idle_type idle,
+ int *all_pinned, int *this_best_prio)
+{
+ struct cfs_rq *busy_cfs_rq;
+ unsigned long load_moved, total_nr_moved = 0, nr_moved;
+ long rem_load_move = max_load_move;
+ struct rq_iterator cfs_rq_iterator;
+
+ cfs_rq_iterator.start = load_balance_start_fair;
+ cfs_rq_iterator.next = load_balance_next_fair;
+
+ for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct cfs_rq *this_cfs_rq;
+ long imbalance;
+ unsigned long maxload;
+
+ this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
+
+ imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
+ /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
+ if (imbalance <= 0)
+ continue;
+
+ /* Don't pull more than imbalance/2 */
+ imbalance /= 2;
+ maxload = min(rem_load_move, imbalance);
+
+ *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+#else
+# define maxload rem_load_move
+#endif
+ /* pass busy_cfs_rq argument into
+ * load_balance_[start|next]_fair iterators
+ */
+ cfs_rq_iterator.arg = busy_cfs_rq;
+ nr_moved = balance_tasks(this_rq, this_cpu, busiest,
+ max_nr_move, maxload, sd, idle, all_pinned,
+ &load_moved, this_best_prio, &cfs_rq_iterator);
+
+ total_nr_moved += nr_moved;
+ max_nr_move -= nr_moved;
+ rem_load_move -= load_moved;
+
+ if (max_nr_move <= 0 || rem_load_move <= 0)
+ break;
+ }
+
+ return max_load_move - rem_load_move;
+}
+
+/*
+ * scheduler tick hitting a task of our scheduling class:
+ */
+static void task_tick_fair(struct rq *rq, struct task_struct *curr)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &curr->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ entity_tick(cfs_rq, se);
+ }
+}
+
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *se = &p->se;
+ kclock_t time;
+
+ sched_info_queued(p);
+
+ time = (rq->curr->se.time_norm - get_time_avg(cfs_rq)) >> 1;
+ cfs_rq->time_sum_off -= (time << rq->curr->se.weight_shift);
+ rq->curr->se.time_norm -= time;
+ se->time_norm = rq->curr->se.time_norm;
+
+ enqueue_entity(cfs_rq, se);
+ p->se.on_rq = 1;
+
+ cfs_rq->next = se;
+ resched_task(rq->curr);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Account for a task changing its policy or group.
+ *
+ * This routine is mostly called to set cfs_rq->curr field when a task
+ * migrates between groups/classes.
+ */
+static void set_curr_task_fair(struct rq *rq)
+{
+ struct sched_entity *se = &rq->curr->se;
+
+ for_each_sched_entity(se)
+ set_next_entity(cfs_rq_of(se), se);
+}
+#else
+static void set_curr_task_fair(struct rq *rq)
+{
+}
+#endif
+
+/*
+ * All the scheduling class methods:
+ */
+struct sched_class fair_sched_class __read_mostly = {
+ .enqueue_task = enqueue_task_fair,
+ .dequeue_task = dequeue_task_fair,
+ .yield_task = yield_task_fair,
+
+ .check_preempt_curr = check_preempt_curr_fair,
+
+ .pick_next_task = pick_next_task_fair,
+ .put_prev_task = put_prev_task_fair,
+
+ .load_balance = load_balance_fair,
+
+ .set_curr_task = set_curr_task_fair,
+ .task_tick = task_tick_fair,
+ .task_new = task_new_fair,
+};
+
+#ifdef CONFIG_SCHED_DEBUG
+static void print_cfs_stats(struct seq_file *m, int cpu)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+ print_cfs_rq(m, cpu, cfs_rq);
+}
+#endif
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -38,9 +38,9 @@ print_task(struct seq_file *m, struct rq

SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d ",
p->comm, p->pid,
- (long long)p->se.fair_key,
- (long long)(p->se.fair_key - rq->cfs.fair_clock),
- (long long)p->se.wait_runtime,
+ (long long)p->se.time_norm >> 16,
+ (long long)((p->se.time_key >> 16) - rq->cfs.fair_clock),
+ ((long long)((rq->cfs.fair_clock << 16) - p->se.time_norm) * p->se.load.weight) >> 20,
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
@@ -73,6 +73,7 @@ static void print_rq(struct seq_file *m,

read_lock_irq(&tasklist_lock);

+ rq->cfs.fair_clock = get_time_avg(&rq->cfs) >> 16;
do_each_thread(g, p) {
if (!p->se.on_rq || task_cpu(p) != rq_cpu)
continue;
@@ -93,10 +94,10 @@ print_cfs_rq_runtime_sum(struct seq_file
struct rq *rq = &per_cpu(runqueues, cpu);

spin_lock_irqsave(&rq->lock, flags);
- curr = first_fair(cfs_rq);
+ curr = cfs_rq->rb_leftmost ? &cfs_rq->rb_leftmost->run_node : NULL;
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
- wait_runtime_rq_sum += p->se.wait_runtime;
+ //wait_runtime_rq_sum += p->se.wait_runtime;

curr = rb_next(curr);
}
@@ -110,12 +111,12 @@ void print_cfs_rq(struct seq_file *m, in
{
SEQ_printf(m, "\ncfs_rq\n");

+ cfs_rq->fair_clock = get_time_avg(cfs_rq) >> 16;
#define P(x) \
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(cfs_rq->x))

P(fair_clock);
P(exec_clock);
- P(wait_runtime);
P(wait_runtime_overruns);
P(wait_runtime_underruns);
P(sleeper_bonus);
@@ -143,10 +144,12 @@ static void print_cpu(struct seq_file *m
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x))

P(nr_running);
+#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load",
rq->ls.load.weight);
P(ls.delta_fair);
P(ls.delta_exec);
+#endif
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
@@ -160,11 +163,13 @@ static void print_cpu(struct seq_file *m
P(clock_overflows);
P(clock_deep_idle_events);
P(clock_max_delta);
+#ifdef CONFIG_SMP
P(cpu_load[0]);
P(cpu_load[1]);
P(cpu_load[2]);
P(cpu_load[3]);
P(cpu_load[4]);
+#endif
#undef P

print_cfs_stats(m, cpu);
@@ -241,16 +246,7 @@ void proc_sched_show_task(struct task_st
#define P(F) \
SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)

- P(se.wait_runtime);
- P(se.wait_start_fair);
- P(se.exec_start);
- P(se.sleep_start_fair);
- P(se.sum_exec_runtime);
-
#ifdef CONFIG_SCHEDSTATS
- P(se.wait_start);
- P(se.sleep_start);
- P(se.block_start);
P(se.sleep_max);
P(se.block_max);
P(se.exec_max);


Attachments:
fs3.c (8.31 kB)

2007-08-31 09:37:30

by Mike Galbraith

Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Fri, 2007-08-31 at 04:05 +0200, Roman Zippel wrote:
> Hi,

Greetings,

> I'm glad to announce a working prototype of the basic algorithm I
> already suggested last time.

(finding it difficult to resist the urge to go shopping, I
fast-forwarded to test drive... grep shopping arch/i386/kernel/tcs.c if
you're curious;)

I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
loads, and hit some starvation. If I got it in right (think so) there's
a bug lurking somewhere. taskset -c 1 fairtest2 resulted in the below.
It starts up running both tasks at about 60/40 for hog/sleeper, then
after a short while goes nuts. The hog component eats 100% cpu and
starves the sleeper (and events, forever).

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
events/1 8 13979193020350 -3984516524180 541606276813 2014 115 0 0 0 0 0
R fairtest2 7984 10027571241955 -7942765479096 5707836335486 294 120 0 0 0 0 0
fairtest2 7989 13539381091732 -4424328443109 8147144513458 286 120 0 0 0 0 0

taskset -c 1 massive_intr 3 9999 produces much saner looking numbers,
and is fair...

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
massive_intr 12808 29762662234003 21699 -506538 4650 120 0 0 0 0 0
R massive_intr 12809 29762662225939 -687 53406 4633 120 0 0 0 0 0
massive_intr 12810 29762662220183 7879 453097 4619 120 0 0 0 0 0


Attachments:
fairtest2.c (3.86 kB)

2007-08-31 10:55:13

by Ingo Molnar

Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Roman Zippel <[email protected]> wrote:

> Hi,
>
> I'm glad to announce a working prototype of the basic algorithm I
> already suggested last time. As I already tried to explain previously
> CFS has a considerable algorithmic and computational complexity. [...]

hey, thanks for working on this, it's appreciated! In terms of merging
your stuff, your patch looks a bit large and because you did not tell us
that you were coding in this area, you probably missed Peter Zijlstra's
excellent CFS work:

http://programming.kicks-ass.net/kernel-patches/sched-cfs/

The following portion of Peter's series does much of the core math
changes of what your patch provides (and which makes up for most of the
conceptual delta in your patch), on a step by step basis:

sched-update_weight_inv.patch
sched-se_load.patch
sched-se_load-2.patch
sched-64-wait_runtime.patch
sched-scaled-limit.patch
sched-wait_runtime-scale.patch
sched-calc_delta.patch

So the most intrusive (math) aspects of your patch have been implemented
already for CFS (almost a month ago), in a finegrained way.

Peter's patches change the CFS calculations gradually over from
'normalized' to 'non-normalized' wait-runtime, to avoid the
normalizing/denormalizing overhead and rounding error. Turn off sleeper
fairness, remove the limit code, and we should arrive at something quite
close to the core math in your patch, with similar rounding properties
and similar overhead/complexity. (there are some other small details in
the math but this is the biggest item by far.) I find Peter's series
very understandable and he outlined the math arguments in his replies to
your review mails. (would be glad to re-open those discussions though if
you still think there are disagreements.)

Peter's work fully solves the rounding corner-cases you described as:

> This model is far more accurate than CFS is and doesn't add an error
> over time, thus there are no more underflow/overflow anymore within
> the described limits.

( your characterisation errs in that it makes it appear to be a common
problem, while in practice it's only a corner-case limited to extreme
negative nice levels and even there it needs a very high rate of
scheduling and an artificially constructed workload: several hundreds
of thousand of context switches per second with a yield-ing loop to be
even measurable with unmodified CFS. So this is not a 2.6.23 issue at
all - unless there's some testcase that proves the opposite. )

with Peter's queue there are no underflows/overflows either anymore in
any synthetic corner-case we could come up with. Peter's queue works
well but it's 2.6.24 material.

Non-normalized wait-runtime is simply a different unit (resulting in
slightly higher context-switch performance); the principle and the end
result do not change.

All in all, we don't disagree; this is an incremental improvement we are
thinking about for 2.6.24. We do disagree with this being positioned as
something fundamentally different though - it's just the same thing
mathematically, expressed without a "/weight" divisor, resulting in no
change in scheduling behavior. (except for a small shift of CPU
utilization for a synthetic corner-case)

And if we handled that fundamental aspect via Peter's queue, all the
remaining changes you did can be done (and considered and merged)
evolutionarily instead of revolutionarily, on top of CFS - this should
cut down on the size and the impact of your changes significantly!

So if you'd like to work with us on this and get items that make sense
merged (which we'd very much like to see happen), could you please
re-shape the rest of your changes and ideas (such as whether to use
ready-queueing or a runqueue concept, which does look interesting) ontop
of Peter's queue, and please do it as a finegrained, reviewable,
mergable series of patches, like Peter did. Thanks!

Ingo

2007-08-31 13:19:36

by Roman Zippel

Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Fri, 31 Aug 2007, Ingo Molnar wrote:

> So the most intrusive (math) aspects of your patch have been implemented
> already for CFS (almost a month ago), in a finegrained way.

Interesting claim, please substantiate.

> Peter's patches change the CFS calculations gradually over from
> 'normalized' to 'non-normalized' wait-runtime, to avoid the
> normalizing/denormalizing overhead and rounding error.

Actually it changes wait-runtime to a normalized value and it changes
nothing about the rounding error I was talking about.
It addresses the conversion error between the different units I was
mentioning in an earlier mail, but the value is still rounded.

> > This model is far more accurate than CFS is and doesn't add an error
> > over time, thus there are no more underflow/overflow anymore within
> > the described limits.
>
> ( your characterisation errs in that it makes it appear to be a common
> problem, while in practice it's only a corner-case limited to extreme
> negative nice levels and even there it needs a very high rate of
> scheduling and an artificially constructed workload: several hundreds
> of thousand of context switches per second with a yield-ing loop to be
> even measurable with unmodified CFS. So this is not a 2.6.23 issue at
> all - unless there's some testcase that proves the opposite. )
>
> with Peter's queue there are no underflows/overflows either anymore in
> any synthetic corner-case we could come up with. Peter's queue works
> well but it's 2.6.24 material.

Did you even try to understand what I wrote?
I didn't say that it's a "common problem"; it's a conceptual problem. The
rounding has been improved lately, so it's not as easy to trigger with
some simple busy loops.
Peter's patches don't remove limit_wait_runtime() and AFAICT they can't,
so I'm really amazed how you can make such claims.

> All in one, we dont disagree, this is an incremental improvement we are
> thinking about for 2.6.24. We do disagree with this being positioned as
> something fundamentally different though - it's just the same thing
> mathematically, expressed without a "/weight" divisor, resulting in no
> change in scheduling behavior. (except for a small shift of CPU
> utilization for a synthetic corner-case)

Every time I'm amazed how quickly you arrive at your judgements... :-(
Especially interesting is that you don't need to ask a single question for
that, which would mean you actually understood what I wrote; OTOH your
wild claims tell me something completely different.

BTW who is "we" and how is it possible that this meta mind can come to
such quick judgements?

The basic concept is different enough; one can e.g. see that I have
to calculate some of the key CFS variables for the debug output.
The concepts are related, but they are definitely not "the same thing
mathematically"; the method of resolution is quite different, and if you
think otherwise then please _prove_ it.

bye, Roman

2007-08-31 13:22:27

by Roman Zippel

Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Fri, 31 Aug 2007, Mike Galbraith wrote:

> I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
> loads, and hit some starvation. If I got it in right (think so) there's
> a bug lurking somewhere. taskset -c 1 fairtest2 resulted in the below.
> It starts up running both tasks at about 60/40 for hog/sleeper, then
> after a short while goes nuts. The hog component eats 100% cpu and
> starves the sleeper (and events, forever).

Thanks for testing, although your test program does nothing unusual here.
Can you please send me your .config?
Were there some kernel messages while running it?

bye, Roman

2007-08-31 13:55:51

by Mike Galbraith

Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Fri, 2007-08-31 at 15:22 +0200, Roman Zippel wrote:
> Hi,
>
> On Fri, 31 Aug 2007, Mike Galbraith wrote:
>
> > I plunked it into 2.6.23-rc4 to see how it reacts to various sleeper
> > loads, and hit some starvation. If I got it in right (think so) there's
> > a bug lurking somewhere. taskset -c 1 fairtest2 resulted in the below.
> > It starts up running both tasks at about 60/40 for hog/sleeper, then
> > after a short while goes nuts. The hog component eats 100% cpu and
> > starves the sleeper (and events, forever).
>
> Thanks for testing, although your test program does nothing unusual here.
> Can you please send me your .config?

Attached.

> Were there some kernel messages while running it?

I didn't look actually, was in rather a hurry. I'll try it again
tomorrow.

-Mike


Attachments:
.config (52.47 kB)

2007-09-01 04:35:22

by Mike Galbraith

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Fri, 2007-08-31 at 15:22 +0200, Roman Zippel wrote:

> Were there some kernel messages while running it?

Nope.

-Mike

2007-09-01 06:48:44

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Fri, 31 Aug 2007, Ingo Molnar wrote:

Maybe I should explain for everyone else (especially after seeing some of
the comments on kerneltrap) why I reacted somewhat irritated to what
looks like such an innocent mail.
The problem is that without the necessary background one can't know how
wrong statements such as this are; the level of confidence is amazing though:

> Peter's work fully solves the rounding corner-cases you described as:

I'd expect Ingo to know better, but the more he refuses to answer my
questions, the more I doubt it, at least when it comes to the math part.

While Peter's patches are interesting, they are only a small step toward
what I'm trying to achieve. With these patches the basic CFS math pretty much
looks like this:

sum_{t}^{T}(round(time_{t} * round(WMULT / weight_{t}) * WEIGHT0 / WMULT))
= sum_{r}^{R}(round(time_{r} * round(WMULT / weight_sum) * WEIGHT0 / WMULT))

It's based on this equation:

sum_{t}^{T}(time_{t} / weight_{t}) = sum_{r}^{R}(time_{r} / weight_sum)

This is the time a single task gets relative to the runtime of all
tasks. These sums are incrementally added to/subtracted from wait_runtime,
which should stay around zero.
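
To make the effect concrete, here is a small standalone toy (illustrative
weights and slice length, not the kernel's tables, and without the
limiting/bonus logic CFS wraps around this) that does nothing but the
truncated per-slice accounting from the equation above:

#include <stdio.h>
#include <stdint.h>

#define WMULT_SHIFT	32
#define WMULT		(1ULL << WMULT_SHIFT)

/* truncated "entitlement" of one task out of a slice of delta_exec ns:
 * delta_exec * weight / weight_sum, done via a rounded reciprocal as in
 * the equation above */
static int64_t entitlement(uint64_t delta_exec, uint32_t weight,
			   uint32_t weight_sum)
{
	uint64_t inv_sum = WMULT / weight_sum;	/* round(WMULT / weight_sum) */

	return (int64_t)((delta_exec * weight * inv_sum) >> WMULT_SHIFT);
}

int main(void)
{
	uint32_t w[3] = { 1024, 335, 110 };	/* made-up weights */
	uint32_t wsum = w[0] + w[1] + w[2];
	uint64_t slice = 1000;			/* 1us accounted per step */
	int64_t residual = 0;
	int step, t;

	for (step = 0; step < 1000000; step++) {
		int64_t accounted = 0;

		for (t = 0; t < 3; t++)
			accounted += entitlement(slice, w[t], wsum);
		/* the truncated shares generally don't add up to the time
		 * that was actually spent */
		residual += accounted - (int64_t)slice;
	}
	printf("accumulated accounting error: %lld ns\n", (long long)residual);
	return 0;
}

With these weights every slice loses about a nanosecond; over a few million
bookkeeping steps that is already a millisecond of misaccounted time, the
kind of drift that things like limit_wait_runtime() end up having to clip.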

All Peter's wait_runtime-scale patch does is move the weight_{t} from
one side to the other - that's it. It changes _nothing_ about the
rounding above; "the rounding corner-cases" are still there.

In my announcement I describe in quite some detail how I get rid of
this rounding effect. The only rounding from the above equation which is
still left is "round(WMULT / weight_{t})", but this is luckily a constant,
and so is the time one task gets relative to another (unless reniced).
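
To illustrate that claim, here is another standalone toy (made-up weights
and a made-up WMULT scale, not code from my patch): each task's normalized
clock advances by delta_exec times a reciprocal that is rounded exactly
once, so the CPU ratio between two tasks is pinned by two constants instead
of drifting with accumulated per-step rounding:

#include <stdio.h>
#include <stdint.h>

#define TOY_WMULT_SHIFT	16	/* toy scale, not the kernel's */

typedef int64_t kclock_t;

struct toy_task {
	kclock_t inv_weight;	/* round(WMULT / weight), computed once */
	kclock_t time_norm;	/* normalized (weighted) virtual time */
};

static void task_init(struct toy_task *t, uint32_t weight)
{
	t->inv_weight = (1LL << TOY_WMULT_SHIFT) / weight;	/* the only rounding */
	t->time_norm = 0;
}

/* account delta_exec nanoseconds of real CPU time to a task */
static void account(struct toy_task *t, uint64_t delta_exec)
{
	t->time_norm += (kclock_t)delta_exec * t->inv_weight;
}

int main(void)
{
	struct toy_task a, b;
	uint64_t ran_a = 0, ran_b = 0;
	int i;

	task_init(&a, 1024);
	task_init(&b, 335);

	/* always run whichever task's normalized time is behind */
	for (i = 0; i < 3000000; i++) {
		if (a.time_norm <= b.time_norm) {
			account(&a, 1000);
			ran_a += 1000;
		} else {
			account(&b, 1000);
			ran_b += 1000;
		}
	}
	/* the ratio settles at inv_weight(b)/inv_weight(a) = 195/64 here:
	 * slightly off the exact 1024/335 because of the one-time rounding,
	 * but constant - no error keeps growing with runtime */
	printf("a ran %llu ns, b ran %llu ns, ratio %.4f\n",
	       (unsigned long long)ran_a, (unsigned long long)ran_b,
	       (double)ran_a / (double)ran_b);
	return 0;
}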

Anyone who has actually read and understood what I wrote will hopefully
realize what complete nonsense a statement like this is:

> So the most intrusive (math) aspects of your patch have been implemented
> already for CFS (almost a month ago), in a finegrained way.

I'm not repeating the whole math again here; if anyone has questions about
it, I'll try my best to answer them. So instead here are some of the
intrusive aspects, which supposedly have been implemented already.

One key difference is that I don't maintain the global sum (fair_clock)
directly anymore; I can calculate it if needed, but it's no longer used to
update wait_runtime. This has the advantage that the whole
rounding involved in it no longer has any influence on how much time a task
gets. Without this value the whole math around how to schedule a task is
quite different as well.
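
Stripped down to its core (illustrative struct and function names, and
without the time_norm_base/time_sum_off split the actual patch uses to keep
the running sum bounded), that on-demand calculation is just a weighted
average of the queued tasks' normalized times:

#include <stdio.h>
#include <stdint.h>

typedef int64_t kclock_t;

/* per-task state for the sketch: normalized time plus the weight reduced
 * to a power of two (weight_shift), as the patch does */
struct toy_se {
	kclock_t time_norm;
	unsigned int weight_shift;
};

/* on-demand replacement for a maintained fair_clock: the weighted average
 * of the normalized times of all queued tasks */
static kclock_t queue_time_avg(const struct toy_se *se, int nr)
{
	kclock_t sum = 0;
	unsigned long weight_sum = 0;
	int i;

	for (i = 0; i < nr; i++) {
		sum += se[i].time_norm << se[i].weight_shift;
		weight_sum += 1UL << se[i].weight_shift;
	}
	return weight_sum ? sum / (kclock_t)weight_sum : 0;
}

int main(void)
{
	struct toy_se q[2] = {
		{ .time_norm = 16000, .weight_shift = 4 },
		{ .time_norm = 20000, .weight_shift = 2 },
	};

	printf("queue average: %lld\n", (long long)queue_time_avg(q, 2));
	return 0;
}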

Another key difference is that I got rid of (WEIGHT0 / WMULT); this has
the advantage that it completely gets rid of the problematic rounding and
the scheduler can now finally be as precise as the hardware allows.
OTOH this has consequences for the range of values, as they can and are
expected to overflow after some time, which the math has to take into
account.
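
The wrap-around handling boils down to the usual trick of only ever
comparing differences; a tiny standalone illustration (the
kclock_min()/kclock_max() helpers in the patch use the same idiom with a
plain signed subtraction, the unsigned detour here only keeps the toy
strictly defined C):

#include <assert.h>
#include <stdint.h>

typedef int64_t kclock_t;

/* "a is after b" for clock values that are allowed to wrap: compare the
 * wrapping difference instead of the absolute values.  This works as long
 * as any two values being compared are less than half the range of
 * kclock_t apart. */
static inline int kclock_after(kclock_t a, kclock_t b)
{
	return (kclock_t)((uint64_t)a - (uint64_t)b) > 0;
}

int main(void)
{
	kclock_t near_max = INT64_MAX - 100;
	/* 200 ticks later the clock has wrapped to a negative value */
	kclock_t wrapped = (kclock_t)((uint64_t)near_max + 200);

	assert(kclock_after(wrapped, near_max));	/* a plain ">" gets this wrong */
	return 0;
}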

One positive side effect of these overflows is that I can reduce the
resolution the scheduler is working with and thus get rid of pretty
much all of the 64bit math, where the reduced resolution is sufficient, e.g.
for all archs which have a jiffies-based scheduler clock, but even
embedded archs might like it.
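
Just to sketch the direction (made-up scaling factor and field names,
purely illustrative): with wrap-around tolerated and a millisecond clock,
the per-task accounting shrinks to a single 32x32 multiply:

#include <stdint.h>

typedef int32_t kclock32_t;

#define TOY_WMULT32_SHIFT	10	/* made-up scaling factor */

/* 32-bit flavour of the normalized-time bookkeeping: everything wraps in
 * 32 bits, and as with the 64-bit variant only differences between values
 * are ever compared */
struct toy_se32 {
	uint32_t inv_weight;	/* (1 << TOY_WMULT32_SHIFT) / weight, rounded once */
	kclock32_t time_norm;
};

/* account delta_ms milliseconds of CPU time: one 32x32 multiply,
 * no 64-bit arithmetic anywhere */
static inline void account_ms(struct toy_se32 *se, uint32_t delta_ms)
{
	se->time_norm += (kclock32_t)(delta_ms * se->inv_weight);
}

int main(void)
{
	struct toy_se32 se = { (1u << TOY_WMULT32_SHIFT) / 335, 0 };

	account_ms(&se, 10);	/* 10ms of CPU */
	return se.time_norm > 0 ? 0 : 1;
}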

> So if you'd like to work with us on this and get items that make sense
> merged (which we'd very much like to see happen), could you please
> re-shape the rest of your changes and ideas (such as whether to use
> ready-queueing or a runqueue concept, which does look interesting) ontop
> of Peter's queue, and please do it as a finegrained, reviewable,
> mergable series of patches, like Peter did. Thanks!

The funny thing is it should not be that hard to split the patch, but that
wasn't the point. The point is to discuss the differences - how the different
approach affects the scheduling decisions, as the new scheduler maintains
somewhat different values. If I'm now told "it's just the same thing
mathematically", which is provably nonsense, I'm a little stunned, and the
point that aggravates me is that most people are simply going to believe
Ingo, because they don't understand the issue (and they don't really have
to). I'm still amazed how easily Ingo can just ignore the main part of my
work and still get away with it. The last thing I want is yet another
flame war; all I want is for this work to be taken seriously, but it took
Ingo less than 9 hours to arrive at a thorough judgement...

bye, Roman

2007-09-02 01:03:36

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Fri, 2007-08-31 at 04:05 +0200, Roman Zippel wrote:
> Hi,
>
> I'm glad to announce a working prototype of the basic algorithm I
> already suggested last time.
> As I already tried to explain previously CFS has a considerable
> algorithmic and computational complexity. This patch should now make it
> clearer, why I could so easily skip over Ingo's long explanation of all
> the tricks CFS uses to keep the computational overhead low - I simply
> don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
> kernel, the first 3 runs are without patch, the last 3 runs are with the
> patch:

Out of curiosity I was reviewing your patch .. Since you create
kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see
what had changed .. So I re-diffed your patch to eliminate
kernel/sched_norm.c and just make the changes to kernel/sched_fair.c ..

The patch is near the end of this email.. The most notable thing
about the rediff is the line count,

4 files changed, 323 insertions(+), 729 deletions(-)

That's impressive (assuming my rediff is correct). Although I don't know
for certain how that line reduction is achieved..

I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
my tree, and RFS was faster to some degree (it varied)..

> (1) time = sum_{t}^{T}(time_{t})
> (2) weight_sum = sum_{t}^{T}(weight_{t})

I read your description, but I was distracted by this LaTeX-style
notation .. Could you walk through in English what these two equations
are doing ..

The patch below is on top of my tree (2.6.23-rc5-dw1),
ftp://source.mvista.com/pub/dwalker/rt/

Daniel

---
include/linux/sched.h | 21 -
kernel/sched.c | 186 ++++-------
kernel/sched_debug.c | 26 -
kernel/sched_fair.c | 819 +++++++++++++-------------------------------------
4 files changed, 323 insertions(+), 729 deletions(-)

Index: linux-2.6.22/include/linux/sched.h
===================================================================
--- linux-2.6.22.orig/include/linux/sched.h
+++ linux-2.6.22/include/linux/sched.h
@@ -1046,40 +1046,29 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned long delta_fair_sleep;
- unsigned long delta_exec;
- s64 fair_key;
+ s64 time_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ unsigned int on_rq, queued;;
+ unsigned int weight_shift;

u64 exec_start;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
- u64 wait_start_fair;
- u64 sleep_start_fair;
+ s64 time_norm;
+ s64 req_weight_inv;

#ifdef CONFIG_SCHEDSTATS
- u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;

- u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;

- u64 block_start;
u64 block_max;
u64 exec_max;

Index: linux-2.6.22/kernel/sched.c
===================================================================
--- linux-2.6.22.orig/kernel/sched.c
+++ linux-2.6.22/kernel/sched.c
@@ -230,18 +230,24 @@ struct cfs_rq {

s64 fair_clock;
u64 exec_clock;
- s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;

+ u64 prev_update;
+ s64 time_norm_base, time_norm_inc;
+ u64 run_start, run_end;
+ u64 run_end_next, run_end_curr;
+ s64 time_sum_max, time_sum_off;
+ unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
- struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
- struct sched_entity *curr;
+ struct sched_entity *curr, *next;
+ struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -278,12 +284,16 @@ struct rq {
*/
unsigned long nr_running;
#define CPU_LOAD_IDX_MAX 5
+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
+#ifdef CONFIG_SMP
struct load_stat ls; /* capture load from *all* tasks on this cpu */
+#endif
unsigned long nr_load_updates;
u64 nr_switches;

@@ -758,13 +768,6 @@ static void resched_cpu(int cpu)
resched_task(cpu_curr(cpu));
spin_unlock_irqrestore(&rq->lock, flags);
}
-#else
-static inline void resched_task(struct task_struct *p)
-{
- assert_spin_locked(&task_rq(p)->lock);
- set_tsk_need_resched(p);
-}
-#endif

static u64 div64_likely32(u64 divident, unsigned long divisor)
{
@@ -779,6 +782,13 @@ static u64 div64_likely32(u64 divident,
#endif
}

+#else
+static inline void resched_task(struct task_struct *p)
+{
+ assert_spin_locked(&task_rq(p)->lock);
+ set_tsk_need_resched(p);
+}
+#endif
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -856,15 +866,33 @@ static void update_load_sub(struct load_
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
+#ifdef SANE_NICE_LEVEL
static const int prio_to_weight[40] = {
- /* -20 */ 88761, 71755, 56483, 46273, 36291,
- /* -15 */ 29154, 23254, 18705, 14949, 11916,
- /* -10 */ 9548, 7620, 6100, 4904, 3906,
- /* -5 */ 3121, 2501, 1991, 1586, 1277,
- /* 0 */ 1024, 820, 655, 526, 423,
- /* 5 */ 335, 272, 215, 172, 137,
- /* 10 */ 110, 87, 70, 56, 45,
- /* 15 */ 36, 29, 23, 18, 15,
+ 3567, 3102, 2703, 2351, 2048, 1783, 1551, 1351, 1177, 1024,
+ 892, 776, 676, 588, 512, 446, 388, 338, 294, 256,
+ 223, 194, 169, 147, 128, 111, 97, 84, 74, 64,
+ 56, 49, 42, 37, 32, 28, 24, 21, 18, 16
+};
+
+static const u32 prio_to_wmult[40] = {
+ 294, 338, 388, 446, 512, 588, 676, 776, 891, 1024,
+ 1176, 1351, 1552, 1783, 2048, 2353, 2702, 3104, 3566, 4096,
+ 4705, 5405, 6208, 7132, 8192, 9410, 10809, 12417, 14263, 16384,
+ 18820, 21619, 24834, 28526, 32768, 37641, 43238, 49667, 57052, 65536
+};
+
+static const u32 prio_to_wshift[40] = {
+ 8, 8, 7, 7, 7, 7, 7, 6, 6, 6,
+ 6, 6, 5, 5, 5, 5, 5, 4, 4, 4,
+ 4, 4, 3, 3, 3, 3, 3, 2, 2, 2,
+ 2, 2, 1, 1, 1, 1, 1, 0, 0, 0
+};
+#else
+static const int prio_to_weight[40] = {
+ 95325, 74898, 61681, 49932, 38836, 31775, 24966, 20165, 16132, 12945,
+ 10382, 8257, 6637, 5296, 4228, 3393, 2709, 2166, 1736, 1387,
+ 1111, 888, 710, 568, 455, 364, 291, 233, 186, 149,
+ 119, 95, 76, 61, 49, 39, 31, 25, 20, 16
};

/*
@@ -875,15 +903,19 @@ static const int prio_to_weight[40] = {
* into multiplications:
*/
static const u32 prio_to_wmult[40] = {
- /* -20 */ 48388, 59856, 76040, 92818, 118348,
- /* -15 */ 147320, 184698, 229616, 287308, 360437,
- /* -10 */ 449829, 563644, 704093, 875809, 1099582,
- /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
- /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
- /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
- /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
- /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
+ 11, 14, 17, 21, 27, 33, 42, 52, 65, 81,
+ 101, 127, 158, 198, 248, 309, 387, 484, 604, 756,
+ 944, 1181, 1476, 1845, 2306, 2882, 3603, 4504, 5629, 7037,
+ 8796, 10995, 13744, 17180, 21475, 26844, 33554, 41943, 52429, 65536
+};
+
+static const u32 prio_to_wshift[40] = {
+ 13, 12, 12, 12, 11, 11, 11, 10, 10, 10,
+ 9, 9, 9, 8, 8, 8, 7, 7, 7, 6,
+ 6, 6, 5, 5, 5, 5, 4, 4, 4, 3,
+ 3, 3, 2, 2, 2, 1, 1, 1, 0, 0
};
+#endif

static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);

@@ -914,6 +946,7 @@ static int balance_tasks(struct rq *this

#define sched_class_highest (&rt_sched_class)

+#ifdef CONFIG_SMP
static void __update_curr_load(struct rq *rq, struct load_stat *ls)
{
if (rq->curr != rq->idle && ls->load.weight) {
@@ -965,6 +998,14 @@ static inline void dec_load(struct rq *r
update_curr_load(rq);
update_load_sub(&rq->ls.load, p->se.load.weight);
}
+#else
+static inline void inc_load(struct rq *rq, const struct task_struct *p)
+{
+}
+static inline void dec_load(struct rq *rq, const struct task_struct *p)
+{
+}
+#endif

static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
@@ -980,9 +1021,6 @@ static void dec_nr_running(struct task_s

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
- p->se.wait_runtime = 0;
-
if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
@@ -1000,6 +1038,8 @@ static void set_load_weight(struct task_

p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+ p->se.weight_shift = prio_to_wshift[p->static_prio - MAX_RT_PRIO];
+ p->se.req_weight_inv = p->se.load.inv_weight * (kclock_t)sysctl_sched_granularity;
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -1129,11 +1169,13 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}

+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif

/*
* Pick up the highest-prio task:
@@ -1180,27 +1222,6 @@ static inline void __set_task_cpu(struct

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
- int old_cpu = task_cpu(p);
- struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
- u64 clock_offset, fair_clock_offset;
-
- clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->cfs.fair_clock - new_rq->cfs.fair_clock;
-
- if (p->se.wait_start_fair)
- p->se.wait_start_fair -= fair_clock_offset;
- if (p->se.sleep_start_fair)
- p->se.sleep_start_fair -= fair_clock_offset;
-
-#ifdef CONFIG_SCHEDSTATS
- if (p->se.wait_start)
- p->se.wait_start -= clock_offset;
- if (p->se.sleep_start)
- p->se.sleep_start -= clock_offset;
- if (p->se.block_start)
- p->se.block_start -= clock_offset;
-#endif
-
__set_task_cpu(p, new_cpu);
}

@@ -1968,22 +1989,13 @@ int fastcall wake_up_state(struct task_s
*/
static void __sched_fork(struct task_struct *p)
{
- p->se.wait_start_fair = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
- p->se.delta_exec = 0;
- p->se.delta_fair_run = 0;
- p->se.delta_fair_sleep = 0;
- p->se.wait_runtime = 0;
- p->se.sleep_start_fair = 0;

#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
p->se.sum_wait_runtime = 0;
p->se.sum_sleep_runtime = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
p->se.sleep_max = 0;
p->se.block_max = 0;
p->se.exec_max = 0;
@@ -2426,6 +2438,7 @@ unsigned long nr_active(void)
return running + uninterruptible;
}

+#ifdef CONFIG_SMP
/*
* Update rq->cpu_load[] statistics. This function is usually called every
* scheduler tick (TICK_NSEC).
@@ -2482,8 +2495,6 @@ do_avg:
}
}

-#ifdef CONFIG_SMP
-
/*
* double_rq_lock - safely lock two runqueues
*
@@ -3814,7 +3825,9 @@ void scheduler_tick(void)
if (unlikely(rq->clock < next_tick))
rq->clock = next_tick;
rq->tick_timestamp = rq->clock;
+#ifdef CONFIG_SMP
update_cpu_load(rq);
+#endif
if (curr != rq->idle) /* FIXME: needed? */
curr->sched_class->task_tick(rq, curr);
spin_unlock(&rq->lock);
@@ -5554,32 +5567,6 @@ void __cpuinit init_idle(struct task_str
*/
cpumask_t nohz_cpu_mask = CPU_MASK_NONE;

-/*
- * Increase the granularity value when there are more CPUs,
- * because with more CPUs the 'effective latency' as visible
- * to users decreases. But the relationship is not linear,
- * so pick a second-best guess by going with the log2 of the
- * number of CPUs.
- *
- * This idea comes from the SD scheduler of Con Kolivas:
- */
-static inline void sched_init_granularity(void)
-{
- unsigned int factor = 1 + ilog2(num_online_cpus());
- const unsigned long limit = 100000000;
-
- sysctl_sched_min_granularity *= factor;
- if (sysctl_sched_min_granularity > limit)
- sysctl_sched_min_granularity = limit;
-
- sysctl_sched_latency *= factor;
- if (sysctl_sched_latency > limit)
- sysctl_sched_latency = limit;
-
- sysctl_sched_runtime_limit = sysctl_sched_latency;
- sysctl_sched_wakeup_granularity = sysctl_sched_min_granularity / 2;
-}
-
#ifdef CONFIG_SMP
/*
* This is how migration works:
@@ -7155,13 +7142,10 @@ void __init sched_init_smp(void)
/* Move init over to a non-isolated CPU */
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
- sched_init_granularity();
}
#else
void __init sched_init_smp(void)
-{
- sched_init_granularity();
-}
+{ }
#endif /* CONFIG_SMP */

int in_sched_functions(unsigned long addr)
@@ -7178,6 +7162,9 @@ static inline void init_cfs_rq(struct cf
{
cfs_rq->tasks_timeline = RB_ROOT;
cfs_rq->fair_clock = 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->inc_shift = 0;
#ifdef CONFIG_FAIR_GROUP_SCHED
cfs_rq->rq = rq;
#endif
@@ -7185,7 +7172,6 @@ static inline void init_cfs_rq(struct cf

void __init sched_init(void)
{
- u64 now = sched_clock();
int highest_cpu = 0;
int i, j;

@@ -7210,12 +7196,11 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
list_add(&rq->cfs.leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
#endif
- rq->ls.load_update_last = now;
- rq->ls.load_update_start = now;
+#ifdef CONFIG_SMP
+ rq->ls.load_update_last = rq->ls.load_update_start = sched_clock();

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
-#ifdef CONFIG_SMP
rq->sd = NULL;
rq->active_balance = 0;
rq->next_balance = jiffies;
@@ -7312,16 +7297,7 @@ void normalize_rt_tasks(void)

read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
- p->se.fair_key = 0;
- p->se.wait_runtime = 0;
p->se.exec_start = 0;
- p->se.wait_start_fair = 0;
- p->se.sleep_start_fair = 0;
-#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
-#endif
task_rq(p)->cfs.fair_clock = 0;
task_rq(p)->clock = 0;

Index: linux-2.6.22/kernel/sched_debug.c
===================================================================
--- linux-2.6.22.orig/kernel/sched_debug.c
+++ linux-2.6.22/kernel/sched_debug.c
@@ -38,9 +38,9 @@ print_task(struct seq_file *m, struct rq

SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d ",
p->comm, p->pid,
- (long long)p->se.fair_key,
- (long long)(p->se.fair_key - rq->cfs.fair_clock),
- (long long)p->se.wait_runtime,
+ (long long)p->se.time_norm >> 16,
+ (long long)((p->se.time_key >> 16) - rq->cfs.fair_clock),
+ ((long long)((rq->cfs.fair_clock << 16) - p->se.time_norm) * p->se.load.weight) >> 20,
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
@@ -73,6 +73,7 @@ static void print_rq(struct seq_file *m,

read_lock_irq(&tasklist_lock);

+ rq->cfs.fair_clock = get_time_avg(&rq->cfs) >> 16;
do_each_thread(g, p) {
if (!p->se.on_rq || task_cpu(p) != rq_cpu)
continue;
@@ -93,10 +94,10 @@ print_cfs_rq_runtime_sum(struct seq_file
struct rq *rq = &per_cpu(runqueues, cpu);

spin_lock_irqsave(&rq->lock, flags);
- curr = first_fair(cfs_rq);
+ curr = cfs_rq->rb_leftmost ? &cfs_rq->rb_leftmost->run_node : NULL;
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
- wait_runtime_rq_sum += p->se.wait_runtime;
+ //wait_runtime_rq_sum += p->se.wait_runtime;

curr = rb_next(curr);
}
@@ -110,12 +111,12 @@ void print_cfs_rq(struct seq_file *m, in
{
SEQ_printf(m, "\ncfs_rq\n");

+ cfs_rq->fair_clock = get_time_avg(cfs_rq) >> 16;
#define P(x) \
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(cfs_rq->x))

P(fair_clock);
P(exec_clock);
- P(wait_runtime);
P(wait_runtime_overruns);
P(wait_runtime_underruns);
P(sleeper_bonus);
@@ -143,10 +144,12 @@ static void print_cpu(struct seq_file *m
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x))

P(nr_running);
+#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load",
rq->ls.load.weight);
P(ls.delta_fair);
P(ls.delta_exec);
+#endif
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
@@ -160,11 +163,13 @@ static void print_cpu(struct seq_file *m
P(clock_overflows);
P(clock_deep_idle_events);
P(clock_max_delta);
+#ifdef CONFIG_SMP
P(cpu_load[0]);
P(cpu_load[1]);
P(cpu_load[2]);
P(cpu_load[3]);
P(cpu_load[4]);
+#endif
#ifdef CONFIG_PREEMPT_RT
/* Print rt related rq stats */
P(rt_nr_running);
@@ -252,16 +257,7 @@ void proc_sched_show_task(struct task_st
#define P(F) \
SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)

- P(se.wait_runtime);
- P(se.wait_start_fair);
- P(se.exec_start);
- P(se.sleep_start_fair);
- P(se.sum_exec_runtime);
-
#ifdef CONFIG_SCHEDSTATS
- P(se.wait_start);
- P(se.sleep_start);
- P(se.block_start);
P(se.sleep_max);
P(se.block_max);
P(se.exec_max);
Index: linux-2.6.22/kernel/sched_fair.c
===================================================================
--- linux-2.6.22.orig/kernel/sched_fair.c
+++ linux-2.6.22/kernel/sched_fair.c
@@ -16,41 +16,50 @@
* Scaled math optimizations by Thomas Gleixner
* Copyright (C) 2007, Thomas Gleixner <[email protected]>
*
- * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
- * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ * Really fair scheduling
+ * Copyright (C) 2007, Roman Zippel <[email protected]>
*/

+typedef s64 kclock_t;
+
+static inline kclock_t kclock_max(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) > 0 ? x : y;
+}
+static inline kclock_t kclock_min(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) < 0 ? x : y;
+}
+
+#define MSHIFT 16
+#define MAX_TIMESUM ((kclock_t)1 << (30 + MSHIFT))
+
/*
- * Targeted preemption latency for CPU-bound tasks:
- * (default: 20ms, units: nanoseconds)
+ * Preemption granularity:
+ * (default: 2 msec, units: nanoseconds)
*
- * NOTE: this latency value is not the same as the concept of
- * 'timeslice length' - timeslices in CFS are of variable length.
- * (to see the precise effective timeslice length of your workload,
- * run vmstat and monitor the context-switches field)
+ * NOTE: this granularity value is not the same as the concept of
+ * 'timeslice length' - timeslices in CFS will typically be somewhat
+ * larger than this value. (to see the precise effective timeslice
+ * length of your workload, run vmstat and monitor the context-switches
+ * field)
*
* On SMP systems the value of this is multiplied by the log2 of the
* number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
* systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
- * Targeted preemption latency for CPU-bound tasks:
- */
-unsigned int sysctl_sched_latency __read_mostly = 20000000ULL;
-
-/*
- * Minimal preemption granularity for CPU-bound tasks:
- * (default: 2 msec, units: nanoseconds)
*/
-unsigned int sysctl_sched_min_granularity __read_mostly = 2000000ULL;
+unsigned int sysctl_sched_granularity __read_mostly = 2000000000ULL/HZ;

/*
* SCHED_BATCH wake-up granularity.
- * (default: 25 msec, units: nanoseconds)
+ * (default: 10 msec, units: nanoseconds)
*
* This option delays the preemption effects of decoupled workloads
* and reduces their over-scheduling. Synchronous workloads will still
* have immediate wakeup/sleep latencies.
*/
-unsigned int sysctl_sched_batch_wakeup_granularity __read_mostly = 25000000UL;
+unsigned int sysctl_sched_batch_wakeup_granularity __read_mostly =
+ 10000000000ULL/HZ;

/*
* SCHED_OTHER wake-up granularity.
@@ -60,12 +69,12 @@ unsigned int sysctl_sched_batch_wakeup_g
* and reduces their over-scheduling. Synchronous workloads will still
* have immediate wakeup/sleep latencies.
*/
-unsigned int sysctl_sched_wakeup_granularity __read_mostly = 1000000UL;
+unsigned int sysctl_sched_wakeup_granularity __read_mostly = 1000000000ULL/HZ;

unsigned int sysctl_sched_stat_granularity __read_mostly;

/*
- * Initialized in sched_init_granularity() [to 5 times the base granularity]:
+ * Initialized in sched_init_granularity():
*/
unsigned int sysctl_sched_runtime_limit __read_mostly;

@@ -83,7 +92,7 @@ enum {

unsigned int sysctl_sched_features __read_mostly =
SCHED_FEAT_FAIR_SLEEPERS *1 |
- SCHED_FEAT_SLEEPER_AVG *0 |
+ SCHED_FEAT_SLEEPER_AVG *1 |
SCHED_FEAT_SLEEPER_LOAD_AVG *1 |
SCHED_FEAT_PRECISE_CPU_LOAD *1 |
SCHED_FEAT_START_DEBIT *1 |
@@ -91,6 +100,17 @@ unsigned int sysctl_sched_features __rea

extern struct sched_class fair_sched_class;

+static inline kclock_t get_time_avg(struct cfs_rq *cfs_rq)
+{
+ kclock_t avg;
+
+ avg = cfs_rq->time_norm_base;
+ if (cfs_rq->weight_sum)
+ avg += (kclock_t)((int)(cfs_rq->time_sum_off >> MSHIFT) / cfs_rq->weight_sum) << MSHIFT;
+
+ return avg;
+}
+
/**************************************************************
* CFS operations on generic schedulable entities:
*/
@@ -103,21 +123,9 @@ static inline struct rq *rq_of(struct cf
return cfs_rq->rq;
}

-/* currently running entity (if any) on this cfs_rq */
-static inline struct sched_entity *cfs_rq_curr(struct cfs_rq *cfs_rq)
-{
- return cfs_rq->curr;
-}
-
/* An entity is a task if it doesn't "own" a runqueue */
#define entity_is_task(se) (!se->my_q)

-static inline void
-set_cfs_rq_curr(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- cfs_rq->curr = se;
-}
-
#else /* CONFIG_FAIR_GROUP_SCHED */

static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
@@ -125,21 +133,8 @@ static inline struct rq *rq_of(struct cf
return container_of(cfs_rq, struct rq, cfs);
}

-static inline struct sched_entity *cfs_rq_curr(struct cfs_rq *cfs_rq)
-{
- struct rq *rq = rq_of(cfs_rq);
-
- if (unlikely(rq->curr->sched_class != &fair_sched_class))
- return NULL;
-
- return &rq->curr->se;
-}
-
#define entity_is_task(se) 1

-static inline void
-set_cfs_rq_curr(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
-
#endif /* CONFIG_FAIR_GROUP_SCHED */

static inline struct task_struct *task_of(struct sched_entity *se)
@@ -161,9 +156,10 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
- s64 key = se->fair_key;
+ kclock_t key;
int leftmost = 1;

+ se->time_key = key = se->time_norm + (se->req_weight_inv >> 1);
/*
* Find the right place in the rbtree:
*/
@@ -174,7 +170,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if (key - entry->fair_key < 0) {
+ if (key - entry->time_key < 0) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
@@ -186,584 +182,221 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
* Maintain a cache of leftmost tree entries (it is frequently
* used):
*/
- if (leftmost)
- cfs_rq->rb_leftmost = &se->run_node;
+ if (leftmost) {
+ cfs_rq->rb_leftmost = se;
+ if (cfs_rq->curr) {
+ cfs_rq->run_end_next = se->time_norm + se->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(se->time_norm, cfs_rq->run_end_curr));
+ }
+ }

rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
update_load_add(&cfs_rq->load, se->load.weight);
- cfs_rq->nr_running++;
- se->on_rq = 1;
+ se->queued = 1;
}

static inline void
__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if (cfs_rq->rb_leftmost == &se->run_node)
- cfs_rq->rb_leftmost = rb_next(&se->run_node);
+ if (cfs_rq->rb_leftmost == se) {
+ struct rb_node *next = rb_next(&se->run_node);
+ cfs_rq->rb_leftmost = next ? rb_entry(next, struct sched_entity, run_node) : NULL;
+ }
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
update_load_sub(&cfs_rq->load, se->load.weight);
- cfs_rq->nr_running--;
- se->on_rq = 0;
+ se->queued = 0;
}

-static inline struct rb_node *first_fair(struct cfs_rq *cfs_rq)
+#ifdef CONFIG_SCHED_DEBUG
+static void verify_queue(struct cfs_rq *cfs_rq, int inc_curr, struct sched_entity *se2)
{
- return cfs_rq->rb_leftmost;
-}
+ struct rb_node *node;
+ struct sched_entity *se;
+ kclock_t sum = 0;

-static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
-{
- return rb_entry(first_fair(cfs_rq), struct sched_entity, run_node);
+ se = cfs_rq->curr;
+ if (inc_curr && se)
+ sum = (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ node = rb_first(&cfs_rq->tasks_timeline);
+ WARN_ON(node && cfs_rq->rb_leftmost != rb_entry(node, struct sched_entity, run_node));
+ while (node) {
+ se = rb_entry(node, struct sched_entity, run_node);
+ sum += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ node = rb_next(node);
+ }
+ if (sum != cfs_rq->time_sum_off) {
+ kclock_t oldsum = cfs_rq->time_sum_off;
+ cfs_rq->time_sum_off = sum;
+ printk("%ld:%Lx,%Lx,%p,%p,%d\n", cfs_rq->nr_running, sum, oldsum, cfs_rq->curr, se2, inc_curr);
+ WARN_ON(1);
+ }
}
+#else
+#define verify_queue(q,c,s) ((void)0)
+#endif

/**************************************************************
* Scheduling class statistics methods:
*/

/*
- * Calculate the preemption granularity needed to schedule every
- * runnable task once per sysctl_sched_latency amount of time.
- * (down to a sensible low limit on granularity)
- *
- * For example, if there are 2 tasks running and latency is 10 msecs,
- * we switch tasks every 5 msecs. If we have 3 tasks running, we have
- * to switch tasks every 3.33 msecs to get a 10 msecs observed latency
- * for each task. We do finer and finer scheduling up to until we
- * reach the minimum granularity value.
- *
- * To achieve this we use the following dynamic-granularity rule:
- *
- * gran = lat/nr - lat/nr/nr
- *
- * This comes out of the following equations:
- *
- * kA1 + gran = kB1
- * kB2 + gran = kA2
- * kA2 = kA1
- * kB2 = kB1 - d + d/nr
- * lat = d * nr
- *
- * Where 'k' is key, 'A' is task A (waiting), 'B' is task B (running),
- * '1' is start of time, '2' is end of time, 'd' is delay between
- * 1 and 2 (during which task B was running), 'nr' is number of tasks
- * running, 'lat' is the the period of each task. ('lat' is the
- * sched_latency that we aim for.)
- */
-static long
-sched_granularity(struct cfs_rq *cfs_rq)
-{
- unsigned int gran = sysctl_sched_latency;
- unsigned int nr = cfs_rq->nr_running;
-
- if (nr > 1) {
- gran = gran/nr - gran/nr/nr;
- gran = max(gran, sysctl_sched_min_granularity);
- }
-
- return gran;
-}
-
-/*
- * We rescale the rescheduling granularity of tasks according to their
- * nice level, but only linearly, not exponentially:
- */
-static long
-niced_granularity(struct sched_entity *curr, unsigned long granularity)
-{
- u64 tmp;
-
- if (likely(curr->load.weight == NICE_0_LOAD))
- return granularity;
- /*
- * Positive nice levels get the same granularity as nice-0:
- */
- if (likely(curr->load.weight < NICE_0_LOAD)) {
- tmp = curr->load.weight * (u64)granularity;
- return (long) (tmp >> NICE_0_SHIFT);
- }
- /*
- * Negative nice level tasks get linearly finer
- * granularity:
- */
- tmp = curr->load.inv_weight * (u64)granularity;
-
- /*
- * It will always fit into 'long':
- */
- return (long) (tmp >> WMULT_SHIFT);
-}
-
-static inline void
-limit_wait_runtime(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- long limit = sysctl_sched_runtime_limit;
-
- /*
- * Niced tasks have the same history dynamic range as
- * non-niced tasks:
- */
- if (unlikely(se->wait_runtime > limit)) {
- se->wait_runtime = limit;
- schedstat_inc(se, wait_runtime_overruns);
- schedstat_inc(cfs_rq, wait_runtime_overruns);
- }
- if (unlikely(se->wait_runtime < -limit)) {
- se->wait_runtime = -limit;
- schedstat_inc(se, wait_runtime_underruns);
- schedstat_inc(cfs_rq, wait_runtime_underruns);
- }
-}
-
-static inline void
-__add_wait_runtime(struct cfs_rq *cfs_rq, struct sched_entity *se, long delta)
-{
- se->wait_runtime += delta;
- schedstat_add(se, sum_wait_runtime, delta);
- limit_wait_runtime(cfs_rq, se);
-}
-
-static void
-add_wait_runtime(struct cfs_rq *cfs_rq, struct sched_entity *se, long delta)
-{
- schedstat_add(cfs_rq, wait_runtime, -se->wait_runtime);
- __add_wait_runtime(cfs_rq, se, delta);
- schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
-}
-
-/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
-static inline void
-__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+static inline void update_curr(struct cfs_rq *cfs_rq)
{
- unsigned long delta, delta_exec, delta_fair, delta_mine;
- struct load_weight *lw = &cfs_rq->load;
- unsigned long load = lw->weight;
-
- delta_exec = curr->delta_exec;
- schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));
-
- curr->sum_exec_runtime += delta_exec;
- cfs_rq->exec_clock += delta_exec;
-
- if (unlikely(!load))
- return;
-
- delta_fair = calc_delta_fair(delta_exec, lw);
- delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);
-
- if (cfs_rq->sleeper_bonus > sysctl_sched_min_granularity) {
- delta = min((u64)delta_mine, cfs_rq->sleeper_bonus);
- delta = min(delta, (unsigned long)(
- (long)sysctl_sched_runtime_limit - curr->wait_runtime));
- cfs_rq->sleeper_bonus -= delta;
- delta_mine -= delta;
- }
-
- cfs_rq->fair_clock += delta_fair;
- /*
- * We executed delta_exec amount of time on the CPU,
- * but we were only entitled to delta_mine amount of
- * time during that period (if nr_running == 1 then
- * the two values are equal)
- * [Note: delta_mine - delta_exec is negative]:
- */
- add_wait_runtime(cfs_rq, curr, delta_mine - delta_exec);
-}
-
-static void update_curr(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *curr = cfs_rq_curr(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ kclock_t now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;
+ kclock_t delta_norm;

if (unlikely(!curr))
return;

- /*
- * Get the amount of time the current task was running
- * since the last time we changed load (this cannot
- * overflow on 32 bits):
- */
- delta_exec = (unsigned long)(rq_of(cfs_rq)->clock - curr->exec_start);
-
- curr->delta_exec += delta_exec;
-
- if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
- __update_curr(cfs_rq, curr);
- curr->delta_exec = 0;
- }
- curr->exec_start = rq_of(cfs_rq)->clock;
-}
-
-static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- se->wait_start_fair = cfs_rq->fair_clock;
- schedstat_set(se->wait_start, rq_of(cfs_rq)->clock);
-}
-
-/*
- * We calculate fair deltas here, so protect against the random effects
- * of a multiplication overflow by capping it to the runtime limit:
- */
-#if BITS_PER_LONG == 32
-static inline unsigned long
-calc_weighted(unsigned long delta, unsigned long weight, int shift)
-{
- u64 tmp = (u64)delta * weight >> shift;
-
- if (unlikely(tmp > sysctl_sched_runtime_limit*2))
- return sysctl_sched_runtime_limit*2;
- return tmp;
-}
-#else
-static inline unsigned long
-calc_weighted(unsigned long delta, unsigned long weight, int shift)
-{
- return delta * weight >> shift;
-}
-#endif
-
-/*
- * Task is being enqueued - update stats:
- */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- s64 key;
-
- /*
- * Are we enqueueing a waiting task? (for current tasks
- * a dequeue/enqueue event is a NOP)
- */
- if (se != cfs_rq_curr(cfs_rq))
- update_stats_wait_start(cfs_rq, se);
- /*
- * Update the key:
- */
- key = cfs_rq->fair_clock;
-
- /*
- * Optimize the common nice 0 case:
- */
- if (likely(se->load.weight == NICE_0_LOAD)) {
- key -= se->wait_runtime;
- } else {
- u64 tmp;
-
- if (se->wait_runtime < 0) {
- tmp = -se->wait_runtime;
- key += (tmp * se->load.inv_weight) >>
- (WMULT_SHIFT - NICE_0_SHIFT);
- } else {
- tmp = se->wait_runtime;
- key -= (tmp * se->load.inv_weight) >>
- (WMULT_SHIFT - NICE_0_SHIFT);
- }
- }
-
- se->fair_key = key;
-}
-
-/*
- * Note: must be called with a freshly updated rq->fair_clock.
- */
-static inline void
-__update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- unsigned long delta_fair = se->delta_fair_run;
-
- schedstat_set(se->wait_max, max(se->wait_max,
- rq_of(cfs_rq)->clock - se->wait_start));
-
- if (unlikely(se->load.weight != NICE_0_LOAD))
- delta_fair = calc_weighted(delta_fair, se->load.weight,
- NICE_0_SHIFT);
-
- add_wait_runtime(cfs_rq, se, delta_fair);
-}
-
-static void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- unsigned long delta_fair;
-
- if (unlikely(!se->wait_start_fair))
- return;
-
- delta_fair = (unsigned long)min((u64)(2*sysctl_sched_runtime_limit),
- (u64)(cfs_rq->fair_clock - se->wait_start_fair));
-
- se->delta_fair_run += delta_fair;
- if (unlikely(abs(se->delta_fair_run) >=
- sysctl_sched_stat_granularity)) {
- __update_stats_wait_end(cfs_rq, se);
- se->delta_fair_run = 0;
- }
-
- se->wait_start_fair = 0;
- schedstat_set(se->wait_start, 0);
-}
-
-static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- update_curr(cfs_rq);
- /*
- * Mark the end of the wait period if dequeueing a
- * waiting task:
- */
- if (se != cfs_rq_curr(cfs_rq))
- update_stats_wait_end(cfs_rq, se);
-}
-
-/*
- * We are picking a new current task - update its stats:
- */
-static inline void
-update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- /*
- * We are starting a new run period:
- */
- se->exec_start = rq_of(cfs_rq)->clock;
-}
-
-/*
- * We are descheduling a task - update its stats:
- */
-static inline void
-update_stats_curr_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- se->exec_start = 0;
-}
-
-/**************************************************
- * Scheduling class queueing methods:
- */
-
-static void __enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- unsigned long load = cfs_rq->load.weight, delta_fair;
- long prev_runtime;
-
- /*
- * Do not boost sleepers if there's too much bonus 'in flight'
- * already:
- */
- if (unlikely(cfs_rq->sleeper_bonus > sysctl_sched_runtime_limit))
- return;
-
- if (sysctl_sched_features & SCHED_FEAT_SLEEPER_LOAD_AVG)
- load = rq_of(cfs_rq)->cpu_load[2];
-
- delta_fair = se->delta_fair_sleep;
-
- /*
- * Fix up delta_fair with the effect of us running
- * during the whole sleep period:
- */
- if (sysctl_sched_features & SCHED_FEAT_SLEEPER_AVG)
- delta_fair = div64_likely32((u64)delta_fair * load,
- load + se->load.weight);
-
- if (unlikely(se->load.weight != NICE_0_LOAD))
- delta_fair = calc_weighted(delta_fair, se->load.weight,
- NICE_0_SHIFT);
-
- prev_runtime = se->wait_runtime;
- __add_wait_runtime(cfs_rq, se, delta_fair);
- schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
- delta_fair = se->wait_runtime - prev_runtime;
-
- /*
- * Track the amount of bonus we've given to sleepers:
- */
- cfs_rq->sleeper_bonus += delta_fair;
-}
-
-static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- struct task_struct *tsk = task_of(se);
- unsigned long delta_fair;
-
- if ((entity_is_task(se) && tsk->policy == SCHED_BATCH) ||
- !(sysctl_sched_features & SCHED_FEAT_FAIR_SLEEPERS))
+ delta_exec = now - cfs_rq->prev_update;
+ if (!delta_exec)
return;
+ cfs_rq->prev_update = now;

- delta_fair = (unsigned long)min((u64)(2*sysctl_sched_runtime_limit),
- (u64)(cfs_rq->fair_clock - se->sleep_start_fair));
-
- se->delta_fair_sleep += delta_fair;
- if (unlikely(abs(se->delta_fair_sleep) >=
- sysctl_sched_stat_granularity)) {
- __enqueue_sleeper(cfs_rq, se);
- se->delta_fair_sleep = 0;
- }
-
- se->sleep_start_fair = 0;
-
-#ifdef CONFIG_SCHEDSTATS
- if (se->sleep_start) {
- u64 delta = rq_of(cfs_rq)->clock - se->sleep_start;
-
- if ((s64)delta < 0)
- delta = 0;
-
- if (unlikely(delta > se->sleep_max))
- se->sleep_max = delta;
-
- se->sleep_start = 0;
- se->sum_sleep_runtime += delta;
- }
- if (se->block_start) {
- u64 delta = rq_of(cfs_rq)->clock - se->block_start;
-
- if ((s64)delta < 0)
- delta = 0;
+ curr->sum_exec_runtime += delta_exec;
+ cfs_rq->exec_clock += delta_exec;

- if (unlikely(delta > se->block_max))
- se->block_max = delta;
+ delta_norm = (kclock_t)delta_exec * curr->load.inv_weight;
+ curr->time_norm += delta_norm;
+ cfs_rq->time_sum_off += delta_norm << curr->weight_shift;

- se->block_start = 0;
- se->sum_sleep_runtime += delta;
- }
-#endif
+ verify_queue(cfs_rq, 4, curr);
}

static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * Update the fair clock.
- */
- update_curr(cfs_rq);
+ kclock_t min_time;

- if (wakeup)
- enqueue_sleeper(cfs_rq, se);
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+ min_time = get_time_avg(cfs_rq) - se->req_weight_inv;
+ if ((kclock_t)(se->time_norm - min_time) < 0)
+ se->time_norm = min_time;

- update_stats_enqueue(cfs_rq, se);
- __enqueue_entity(cfs_rq, se);
+ cfs_rq->nr_running++;
+ cfs_rq->weight_sum += 1 << se->weight_shift;
+ if (cfs_rq->inc_shift < se->weight_shift) {
+ cfs_rq->time_norm_inc >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->time_sum_max >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->inc_shift = se->weight_shift;
+ }
+ cfs_rq->time_sum_max += cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max >= MAX_TIMESUM) {
+ cfs_rq->time_norm_inc >>= 1;
+ cfs_rq->time_sum_max >>= 1;
+ cfs_rq->inc_shift++;
+ }
+ cfs_rq->time_sum_off += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ if (cfs_rq->time_sum_off >= cfs_rq->time_sum_max) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ }
+
+ if (&rq_of(cfs_rq)->curr->se == se)
+ cfs_rq->curr = se;
+ if (cfs_rq->curr != se)
+ __enqueue_entity(cfs_rq, se);
+ verify_queue(cfs_rq, 2, se);
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
- update_stats_dequeue(cfs_rq, se);
- if (sleep) {
- se->sleep_start_fair = cfs_rq->fair_clock;
-#ifdef CONFIG_SCHEDSTATS
- if (entity_is_task(se)) {
- struct task_struct *tsk = task_of(se);
-
- if (tsk->state & TASK_INTERRUPTIBLE)
- se->sleep_start = rq_of(cfs_rq)->clock;
- if (tsk->state & TASK_UNINTERRUPTIBLE)
- se->block_start = rq_of(cfs_rq)->clock;
+ verify_queue(cfs_rq, 3, se);
+
+ cfs_rq->weight_sum -= 1 << se->weight_shift;
+ if (cfs_rq->weight_sum) {
+ cfs_rq->time_sum_max -= cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max < (MAX_TIMESUM >> 1)) {
+ cfs_rq->time_norm_inc <<= 1;
+ cfs_rq->time_sum_max <<= 1;
+ cfs_rq->inc_shift--;
}
- cfs_rq->wait_runtime -= se->wait_runtime;
-#endif
+
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ if (cfs_rq->time_sum_off < 0) {
+ cfs_rq->time_sum_off += cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base -= cfs_rq->time_norm_inc;
+ }
+ } else {
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ BUG_ON(cfs_rq->time_sum_off);
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->inc_shift = 0;
}
- __dequeue_entity(cfs_rq, se);
-}

-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static int
-__check_preempt_curr_fair(struct cfs_rq *cfs_rq, struct sched_entity *se,
- struct sched_entity *curr, unsigned long granularity)
-{
- s64 __delta = curr->fair_key - se->fair_key;

- /*
- * Take scheduling granularity into account - do not
- * preempt the current task unless the best task has
- * a larger than sched_granularity fairness advantage:
- */
- if (__delta > niced_granularity(curr, granularity)) {
- resched_task(rq_of(cfs_rq)->curr);
- return 1;
- }
- return 0;
+ cfs_rq->nr_running--;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ if (cfs_rq->curr == se)
+ cfs_rq->curr = NULL;
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
}

static inline void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * Any task has to be enqueued before it get to execute on
- * a CPU. So account for the time it spent waiting on the
- * runqueue. (note, here we rely on pick_next_task() having
- * done a put_prev_task_fair() shortly before this, which
- * updated rq->fair_clock - used by update_stats_wait_end())
- */
- update_stats_wait_end(cfs_rq, se);
- update_stats_curr_start(cfs_rq, se);
- set_cfs_rq_curr(cfs_rq, se);
+ cfs_rq->prev_update = rq_of(cfs_rq)->clock;
+ cfs_rq->run_start = se->time_norm;
+ cfs_rq->run_end = cfs_rq->run_end_curr = cfs_rq->run_start + se->req_weight_inv;
+ cfs_rq->curr = se;
}

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
- struct sched_entity *se = __pick_next_entity(cfs_rq);
+ struct sched_entity *se = cfs_rq->next ? cfs_rq->next : cfs_rq->rb_leftmost;
+ struct sched_entity *next;

+ cfs_rq->next = NULL;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
set_next_entity(cfs_rq, se);

+ next = cfs_rq->rb_leftmost;
+ if (next) {
+ cfs_rq->run_end_next = next->time_norm + next->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(next->time_norm, cfs_rq->run_end_curr));
+ }
+
return se;
}

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
- /*
- * If still on the runqueue then deactivate_task()
- * was not called and update_curr() has to be done:
- */
- if (prev->on_rq)
- update_curr(cfs_rq);
-
- update_stats_curr_end(cfs_rq, prev);
-
+ update_curr(cfs_rq);
if (prev->on_rq)
- update_stats_wait_start(cfs_rq, prev);
- set_cfs_rq_curr(cfs_rq, NULL);
+ __enqueue_entity(cfs_rq, prev);
+ cfs_rq->curr = NULL;
}

static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
- unsigned long gran, ideal_runtime, delta_exec;
- struct sched_entity *next;
+ update_curr(cfs_rq);

- /*
- * Dequeue and enqueue the task to update its
- * position within the tree:
- */
- dequeue_entity(cfs_rq, curr, 0);
- enqueue_entity(cfs_rq, curr, 0);
+ while (cfs_rq->time_sum_off >= cfs_rq->time_sum_max) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ }

/*
* Reschedule if another task tops the current one.
*/
- next = __pick_next_entity(cfs_rq);
- if (next == curr)
- return;
-
- gran = sched_granularity(cfs_rq);
- ideal_runtime = niced_granularity(curr,
- max(sysctl_sched_latency / cfs_rq->nr_running,
- (unsigned long)sysctl_sched_min_granularity));
- /*
- * If we executed more than what the latency constraint suggests,
- * reduce the rescheduling granularity. This way the total latency
- * of how much a task is not scheduled converges to
- * sysctl_sched_latency:
- */
- delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime)
- gran = 0;
-
- if (__check_preempt_curr_fair(cfs_rq, next, curr, gran))
- curr->prev_sum_exec_runtime = curr->sum_exec_runtime;
+ if (cfs_rq->rb_leftmost && (kclock_t)(curr->time_norm - cfs_rq->run_end) >= 0) {
+ cfs_rq->next = cfs_rq->rb_leftmost;
+ resched_task(rq_of(cfs_rq)->curr);
+ }
}

/**************************************************
@@ -868,7 +501,7 @@ static void enqueue_task_fair(struct rq
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, wakeup);
+ enqueue_entity(cfs_rq, se);
}
}

@@ -886,7 +519,7 @@ static void dequeue_task_fair(struct rq
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->weight_sum)
break;
}
}
@@ -897,14 +530,16 @@ static void dequeue_task_fair(struct rq
static void yield_task_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
-
+ struct sched_entity *next;
__update_rq_clock(rq);
- /*
- * Dequeue and enqueue the task to update its
- * position within the tree:
- */
- dequeue_entity(cfs_rq, &p->se, 0);
- enqueue_entity(cfs_rq, &p->se, 0);
+
+ update_curr(cfs_rq);
+ next = cfs_rq->rb_leftmost;
+ if (next && (kclock_t)(p->se.time_norm - next->time_norm) > 0) {
+ cfs_rq->next = next;
+ return;
+ }
+ cfs_rq->next = &p->se;
}

/*
@@ -916,12 +551,11 @@ static void check_preempt_curr_fair(stru
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
unsigned long gran;

- if (unlikely(rt_prio(p->prio))) {
- update_rq_clock(rq);
- update_curr(cfs_rq);
+ if (!cfs_rq->curr || unlikely(rt_prio(p->prio))) {
resched_task(curr);
return;
}
+ update_curr(cfs_rq);

gran = sysctl_sched_wakeup_granularity;
/*
@@ -930,8 +564,12 @@ static void check_preempt_curr_fair(stru
if (unlikely(p->policy == SCHED_BATCH))
gran = sysctl_sched_batch_wakeup_granularity;

- if (is_same_group(curr, p))
- __check_preempt_curr_fair(cfs_rq, &p->se, &curr->se, gran);
+ if (is_same_group(curr, p) &&
+ cfs_rq->rb_leftmost == &p->se &&
+ curr->se.time_norm - p->se.time_norm >= cfs_rq->curr->load.inv_weight * (kclock_t)gran) {
+ cfs_rq->next = &p->se;
+ resched_task(curr);
+ }
}

static struct task_struct *pick_next_task_fair(struct rq *rq)
@@ -942,6 +580,9 @@ static struct task_struct *pick_next_tas
if (unlikely(!cfs_rq->nr_running))
return NULL;

+ if (cfs_rq->nr_running == 1 && cfs_rq->curr)
+ return task_of(cfs_rq->curr);
+
do {
se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
@@ -976,10 +617,12 @@ static void put_prev_task_fair(struct rq
* the current task:
*/
static inline struct task_struct *
-__load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
+__load_balance_iterator(struct cfs_rq *cfs_rq)
{
struct task_struct *p;
+ struct rb_node *curr;

+ curr = cfs_rq->rb_load_balance_curr;
if (!curr)
return NULL;

@@ -993,14 +636,19 @@ static struct task_struct *load_balance_
{
struct cfs_rq *cfs_rq = arg;

- return __load_balance_iterator(cfs_rq, first_fair(cfs_rq));
+ cfs_rq->rb_load_balance_curr = cfs_rq->rb_leftmost ?
+ &cfs_rq->rb_leftmost->run_node : NULL;
+ if (cfs_rq->curr)
+ return rb_entry(&cfs_rq->curr->run_node, struct task_struct, se.run_node);
+
+ return __load_balance_iterator(cfs_rq);
}

static struct task_struct *load_balance_next_fair(void *arg)
{
struct cfs_rq *cfs_rq = arg;

- return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
+ return __load_balance_iterator(cfs_rq);
}

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1012,7 +660,7 @@ static int cfs_rq_best_prio(struct cfs_r
if (!cfs_rq->nr_running)
return MAX_PRIO;

- curr = __pick_next_entity(cfs_rq);
+ curr = cfs_rq->rb_leftmost;
p = task_of(curr);

return p->prio;
@@ -1097,36 +745,21 @@ static void task_tick_fair(struct rq *rq
static void task_new_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
- struct sched_entity *se = &p->se, *curr = cfs_rq_curr(cfs_rq);
+ struct sched_entity *se = &p->se;
+ kclock_t time;

sched_info_queued(p);

- update_curr(cfs_rq);
- update_stats_enqueue(cfs_rq, se);
- /*
- * Child runs first: we let it run before the parent
- * until it reschedules once. We set up the key so that
- * it will preempt the parent:
- */
- se->fair_key = curr->fair_key -
- niced_granularity(curr, sched_granularity(cfs_rq)) - 1;
- /*
- * The first wait is dominated by the child-runs-first logic,
- * so do not credit it with that waiting time yet:
- */
- if (sysctl_sched_features & SCHED_FEAT_SKIP_INITIAL)
- se->wait_start_fair = 0;
+ time = (rq->curr->se.time_norm - get_time_avg(cfs_rq)) >> 1;
+ cfs_rq->time_sum_off -= (time << rq->curr->se.weight_shift);
+ rq->curr->se.time_norm -= time;
+ se->time_norm = rq->curr->se.time_norm;

- /*
- * The statistical average of wait_runtime is about
- * -granularity/2, so initialize the task with that:
- */
- if (sysctl_sched_features & SCHED_FEAT_START_DEBIT) {
- se->wait_runtime = -(sched_granularity(cfs_rq) / 2);
- schedstat_add(cfs_rq, wait_runtime, se->wait_runtime);
- }
+ enqueue_entity(cfs_rq, se);
+ p->se.on_rq = 1;

- __enqueue_entity(cfs_rq, se);
+ cfs_rq->next = se;
+ resched_task(rq->curr);
}

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1137,7 +770,7 @@ static void task_new_fair(struct rq *rq,
*/
static void set_curr_task_fair(struct rq *rq)
{
- struct sched_entity *se = &rq->curr->se;
+ struct sched_entity *se = &rq->curr.se;

for_each_sched_entity(se)
set_next_entity(cfs_rq_of(se), se);


2007-09-02 02:24:05

by Bill Davidsen

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Roman Zippel wrote:
> Hi,
>
> On Fri, 31 Aug 2007, Ingo Molnar wrote:
>
> Maybe I should explain for everyone else (especially after seeing some of
> the comments on kerneltrap) why I reacted somewhat irritated to what
> looks like such an innocent mail.
> The problem is that without the necessary background one can't know how
> wrong statements such as this are; the level of confidence is amazing though:
>
>> Peter's work fully solves the rounding corner-cases you described as:
>
> I'd expect Ingo to know better, but the more he refuses to answer my
> questions, the more I doubt it, at least when it comes to the math part.
>
The "math part" is important if you're doing a thesis defense, but
demonstrating better behavior under some realistic load would probably
be a better starting point. Maybe Ingo missed something in your math,
and maybe he just didn't find a benefit.

The ck and sd schedulers developed over a long time of public use and
redefinition of goals. CFS developed faster, but still had months of
testing and the benefit of -rt experience. Your scheduler is more of a
virgin birth: no slow public discussion beforehand, no time yet for people
to run it under real loads, so you're seeing first impressions.

You dropped this on the world two days before a major US holiday, at the
end of the summer, when those of us not on vacation may be covering for
those who are. Did you expect Ingo to drop his regular work to look at
your stuff? And do you think there are many people who can look at your
math, and look at your code, and then have any clue how well it works in
practice? I bet there aren't ten people in the world who would even
claim to do that, and half of them are kidding themselves.

So give people a few weeks to see if the rounding errors you eliminated
mean anything in practice. Faster context switching is nice, but it's not
something most people count as their major problem. I did a *lot* of cfs
vs. sd testing, not just all the glitch1 tests I posted here but also lots
of stuff I just sent to Ingo, and lots of tests like IPCbench and NNTP
server loading showed nothing. (That's good, it means cfs isn't worse
than the old scheduler for those loads.) I played with the tuning, I
even diddled the code a bit, without finding meaningful improvement, so
your possibly better math may not change the overall behavior much.

I will wait until I can post actual test results before I say any more.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-09-02 07:20:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Daniel Walker <[email protected]> wrote:

> The patch is near the end of this email.. The most notable thing
> about the rediff is the line count,
>
> 4 files changed, 323 insertions(+), 729 deletions(-)
>
> That's impressive (assuming my rediff is correct). [...]

Yeah, at first glance i liked that too, then i looked into the diff and
noticed that a good chunk of the removal "win" comes from the removal of
~35 comment blocks while adding new code that has no comments at all
(!).

And if you look at the resulting code size/complexity, it actually
increases with Roman's patch (UP, nodebug, x86):

text data bss dec hex filename
13420 228 1204 14852 3a04 sched.o.rc5
13554 228 1228 15010 3aa2 sched.o.rc5-roman

Although it _should_ have been a net code size win, because if you look
at the diff you'll see that other useful things were removed as well:
sleeper fairness, CPU time distribution smarts, tunings, scheduler
instrumentation code, etc.

> I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
> my tree, and RFS was faster to some degree (it varied)..

here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive
runs fresh after bootup, Core2Duo, UP:

-rc5(cfs) -rc5+rfs
-------------------------------
Time: 3.905 Time: 4.259
Time: 3.962 Time: 4.190
Time: 3.981 Time: 4.241
Time: 3.986 Time: 3.937
Time: 3.984 Time: 4.120
Time: 4.001 Time: 4.013
Time: 3.980 Time: 4.248
Time: 3.983 Time: 3.961
Time: 3.989 Time: 4.345
Time: 3.981 Time: 4.294
-------------------------------
Avg: 3.975 Avg: 4.160 (+4.6%)
Fluct: 0.138 Fluct: 1.671

so unmodified CFS is 4.6% faster on this box than with Roman's patch and
it's also more consistent/stable (10 times lower fluctuations).

At lower hackbench levels (hackbench 10) the numbers are closer - that
could be what you saw.

But, this measurement too is apples to oranges, given the amount of
useful code the patch removes - fact is that you can always speed up the
scheduler by removing stuff, just run hackbench as SCHED_FIFO (via "chrt
-f 90 ./hackbench 50") to see what a minimal scheduler can do.

It would be far more reviewable and objectively judgeable on an item by
item basis if Roman posted the finegrained patches i asked for. (which
patch series should be sorted in order of intrusiveness - i.e. leaving
the harder changes to the end of the series.)

Ingo

2007-09-02 08:27:45

by Satyam Sharma

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler



On Sun, 2 Sep 2007, Ingo Molnar wrote:
>
> Although it _should_ have been a net code size win, because if you look
> at the diff you'll see that other useful things were removed as well:
> sleeper fairness, CPU time distribution smarts, tunings, scheduler
> instrumentation code, etc.

To be fair to Roman, he probably started development off an earlier CFS,
most probably 2.6.23-rc3-git1, if I guess correctly from his original
posting. So it's likely he missed out on some of the tunings/comments(?)
etc code that got merged after that.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
> > my tree, and RFS was faster to some degree (it varied)..
>
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive
> runs fresh after bootup, Core2Duo, UP:

Again, it would be interesting to benchmark against 2.6.23-rc3-git1. And
also probably rediff vs 2.6.23-rc3-git1 and compare how the code actually
changed ... but admittedly, doing so would be utterly pointless, because
much water has flowed down the Ganges since -rc3 (misc CFS improvements,
Peter's patches that you mentioned). So a "look, I told you so" kind of
situation wouldn't really be constructive at all.

> It would be far more reviewable and objectively judgeable on an item by
> item basis if Roman posted the finegrained patches i asked for. (which
> patch series should be sorted in order of intrusiveness - i.e. leaving
> the harder changes to the end of the series.)

Absolutely. And if there indeed are net improvements (be it for corner
cases) over latest CFS-rc5, while maintaining performance for the common
cases at the same time, well, that can only be a good thing.

Just my Rs. 0.02,

Satyam

2007-09-02 09:26:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Roman Zippel <[email protected]> wrote:

> > with Peter's queue there are no underflows/overflows either anymore
> > in any synthetic corner-case we could come up with. Peter's queue
> > works well but it's 2.6.24 material.
>
> Did you even try to understand what I wrote? I didn't say that it's a
> "common problem", it's a conceptual problem. The rounding has been
> improved lately, so it's not as easy to trigger with some simple busy
> loops.

As i mentioned it in my first reply to you i really welcome your changes
and your interest in the Linux scheduler, but i'm still somewhat
surprised why you focus on rounding so much (and why you attack CFS's
math implementation so vehemently and IMO so unfairly) - and we had this
discussion before in the "CFS review" thread that you started.

The kind of rounding error you seem to be worried about is very, very
small. For normal nice-0 tasks it's in the "one part per a hundred
million" range, or smaller. As a comparison: "top"'s CPU utilization
statistics are accurate to "one part per thousand" - and that's roughly
the precision range that humans care about. (Graphical tools are even
coarser - one part per hundred or worse.)

I suspect if rounding effects are measurable you will post numbers that
prove it, correct? You did not do that so far. Or if they are not
measurable for any workload we care about, why should we worry about it
if it causes no problems in practice? (or if it causes problems, what
are the actual effects and can they be addressed in a simpler way?)

The main reason i'm interested in changing the fairness math under CFS
(be that Peter's changes or your changes) is _not_ primarily to address
any rounding behavior, but to potentially improve performance! Rounding
errors are at most an academic issue, unless they are actually
measurable in a workload we care about. (If rounding gets improved as a
side-effect of a change that's an added, albeit lower-prio bonus.) But
even more important is quality of scheduling - performance is secondary
to that. (unless performance is so bad that it becomes a quality issue:
such as an O(N) algorithm would do.)

Your math is fairly simple (and that is _good_, just like CFS's existing
math is simple), it can be summed up in essence as (without complicating
it with nice-level weighting, for easy understandability):

" use the already existing p->sum_exec_runtime 'task runtime' metric
that CFS maintains, and use that as the key into the rb-tree that
selects tasks that should be run next. To handle sleeping tasks: keep
a per-rq sum of all runnable task's ->sum_exec_runtime values and
start newly woken tasks at the average rq->sum/nr_running value. "
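
In code, that scheme might look roughly like the following standalone
sketch (purely illustrative - this is neither CFS code nor Roman's
patch, all structures and names here are made up):

/* illustrative only: pick the runnable task with the smallest
 * accumulated runtime; a newly woken task starts at the current
 * average so it neither lags behind nor gets an unfair boost */
#include <stdio.h>

struct task {
	const char *name;
	unsigned long long sum_exec_runtime;	/* accumulated CPU time (ns) */
	int runnable;
};

/* the real thing uses an rb-tree keyed by the runtime;
 * a linear scan is enough for a sketch */
static struct task *pick_next(struct task *t, int n)
{
	struct task *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (t[i].runnable &&
		    (!best || t[i].sum_exec_runtime < best->sum_exec_runtime))
			best = &t[i];
	return best;
}

static void wake_task(struct task *t, int n, struct task *woken)
{
	unsigned long long sum = 0;
	int i, nr = 0;

	for (i = 0; i < n; i++)
		if (t[i].runnable) {
			sum += t[i].sum_exec_runtime;
			nr++;
		}
	woken->sum_exec_runtime = nr ? sum / nr : 0;
	woken->runnable = 1;
}

int main(void)
{
	struct task tasks[3] = {
		{ "A", 0, 1 }, { "B", 0, 1 }, { "C", 0, 0 },	/* C sleeps */
	};
	int tick;

	for (tick = 0; tick < 10; tick++) {
		struct task *curr;

		if (tick == 5)			/* C wakes up halfway through */
			wake_task(tasks, 3, &tasks[2]);
		curr = pick_next(tasks, 3);
		curr->sum_exec_runtime += 1000000;	/* ran for 1ms */
		printf("tick %2d: ran %s (A=%llu B=%llu C=%llu)\n",
		       tick, curr->name, tasks[0].sum_exec_runtime,
		       tasks[1].sum_exec_runtime, tasks[2].sum_exec_runtime);
	}
	return 0;
}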

Now your patch does not actually do it that way in a clearly discernible
manner because lots of changes are intermixed into one big patch.

( please correct me if i got your math wrong. Your patch does not add
any comments at all to the new code and this slowed down my review
and analysis of your patch quite considerably. Lack of comments makes
it harder to see the purpose and makes it harder to notice the
benefits/tradeoffs involved in each change. )

Much of the remaining complexity in your patch i see as an add-on
optimization to that concept: you use a quite complex Bresenham
implementation that hides the real machinery of the code. Yes, rounding
improves too with Bresenham, but to me that is secondary - i'm mainly
interested in the performance aspects of Bresenham. There are
advantages to your approach but it also has disadvantages: you removed
sleeper fairness for example, which makes it harder to compare the
scheduling quality of the two implementations.

To sum up: the math in your patch _could_ be implemented as a much
smaller add-on to the already existing variables maintained by CFS, but
you chose to do lots of changes, variable-renames and removals at once
and posted them as one big change.

I really welcome large changes to the scheduler (hey, in the past 2.5
years alone we added 350+ scheduler commits from over 95 unique
contributors, so i'm as feature-happy as it gets), but it's far easier
to review and merge stuff if it's nicely split up. (I'll eventually get
through your patch too, but it's much harder that way and as you
probably know every core kernel hacker is away for the Kernel Summit so
don't expect much activity for a week or so.)

One conceptual reason behind the intrusive policy-modularization done in
CFS was to _never again_ be forced to do "big" scheduler changes - we
now can do most things in small, understandable steps.

I posted my very first, raw version of CFS after 3 days of hacking which
showed the raw math and nothing more, and i posted every iteration since
then, so you can follow through its history. _Please_, separate things
out so that they can be reviewed one by one. And _please_ order the
patches in "level of intrusiveness" order - i.e. leave the more
intrusive patches as late in the patch-queue as possible. Thanks!

Ingo

2007-09-02 10:00:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Satyam Sharma <[email protected]> wrote:

> On Sun, 2 Sep 2007, Ingo Molnar wrote:
> >
> > Although it _should_ have been a net code size win, because if you
> > look at the diff you'll see that other useful things were removed as
> > well: sleeper fairness, CPU time distribution smarts, tunings,
> > scheduler instrumentation code, etc.
>
> To be fair to Roman, he probably started development off an earlier
> CFS, most probably 2.6.23-rc3-git1, if I guess correctly from his
> original posting. So it's likely he missed out on some of the
> tunings/comments(?) etc code that got merged after that.

actually, here are the rc3->rc5 changes to CFS:

sched.c | 89 ++++++++++++++++++++++++------------
sched_debug.c | 3 -
sched_fair.c | 142 +++++++++++++++++++++++++++++++++++++++++++++-------------
sched_rt.c | 11 +++-
4 files changed, 182 insertions(+), 63 deletions(-)

so since -rc3 CFS's size _increased_ (a bit).

and i just checked, the sched.o codesize still increases even when
comparing rc4 against rc4+patch (his original patch) and there are no
comments added by Roman's patch at all. (all the comments in
sched_norm.c were inherited from sched_fair.c and none of the new code
comes with comments - this can be seen in Daniel's rediffed patch.)

(and it's still not apples to oranges, for the reasons i outlined - so
this whole comparison is unfair to CFS on several levels.)

also, note that CFS's modularity probably enabled Roman to do a fairly
stable kernel/sched_norm.c (as most of the post-rc3 CFS changes were not
to sched.c but to sched_fair.c) with easy porting. So with the CFS
modular framework you can easily whip up and prototype a new scheduler
and name it whatever you like. [ i expect the RCFS (Really Completely
Fair Scheduler) patches to be posted to lkml any minute ;-) ]

> > It would be far more reviewable and objectively judgeable on an item
> > by item basis if Roman posted the finegrained patches i asked for.
> > (which patch series should be sorted in order of intrusiveness -
> > i.e. leaving the harder changes to the end of the series.)
>
> Absolutely. And if there indeed are net improvements (be it for corner
> cases) over latest CFS-rc5, while maintaining performance for the
> common cases at the same time, well, that can only be a good thing.

yeah - and as i said to Roman, i like for example the use of a
ready-queue instead of a run-queue. (but these are independent of the
math changes, obviously.)

I also think that the core math changes should be split from the
Bresenham optimizations. I.e. the Bresenham _fract code should be done
as a "this improves performance and improves rounding, without changing
behavior" add-on on top of a fairly simple core math change. I think
Roman will easily be able to do this with a few hours of effort which
should present much reduced .text overhead in his next version of the
patch, to demonstrate the simplicity of his implementation of the CFS
fairness math - this really isn't hard to do.

Ingo

2007-09-02 14:47:21

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sat, 1 Sep 2007, Daniel Walker wrote:

> Out of curiosity I was reviewing your patch .. Since you create
> kernel/sched_norm.c as a copy of kernel/sched_fair.c it was hard to see
> what had changed .. So I re-diffed your patch to eliminate
> kernel/sched_norm.c and just make the changes to kernel/sched_fair.c ..

I mainly did it this way to avoid annoying conflicts. I also started with
a copy where I threw out everything but a basic skeleton and then just
copied the stuff I needed.

> > (1) time = sum_{t}^{T}(time_{t})
> > (2) weight_sum = sum_{t}^{T}(weight_{t})
>
> I read your description, but I was distracted by this latex style
> notation .. Could you walk through in english what these two equation
> are doing ..

T is the number of tasks, so the first is the sum of the time every task
gets over a period of time and the second is the sum of all weights of
each task.

bye, Roman

2007-09-02 15:11:19

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Sun, 2007-09-02 at 16:47 +0200, Roman Zippel wrote:

> > > (1) time = sum_{t}^{T}(time_{t})
> > > (2) weight_sum = sum_{t}^{T}(weight_{t})
> >
> > I read your description, but I was distracted by this latex style
> > notation .. Could you walk through in english what these two equation
> > are doing ..
>
> T is the number of tasks, so the first is the sum of the time every task
> gets over a period of time and the second is the sum of all weights of
> each task.

I think I'm more puzzled now than I was before, (scratches head for
minutes) ..

I see .. It's a summation symbol ..

For instance, say there are three tasks in the system. Call them A, B,
and C.

then

time equals "time of A" + "time of B" + "time of C"

Right?

Daniel

2007-09-02 15:16:17

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> And if you look at the resulting code size/complexity, it actually
> increases with Roman's patch (UP, nodebug, x86):
>
> text data bss dec hex filename
> 13420 228 1204 14852 3a04 sched.o.rc5
> 13554 228 1228 15010 3aa2 sched.o.rc5-roman

That's pretty easy to explain due to differences in inlining:

text data bss dec hex filename
15092 228 1204 16524 408c kernel/sched.o
15444 224 1228 16896 4200 kernel/sched.o.rfs
14708 224 1228 16160 3f20 kernel/sched.o.rfs.noinline

Sorry, but I didn't spend as much time as you on tuning these numbers.

Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- linux-2.6.orig/kernel/sched_norm.c 2007-09-02 16:58:05.000000000 +0200
+++ linux-2.6/kernel/sched_norm.c 2007-09-02 16:10:58.000000000 +0200
@@ -145,7 +145,7 @@ static inline struct task_struct *task_o
/*
* Enqueue an entity into the rb-tree:
*/
-static inline void
+static void
__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
@@ -192,7 +192,7 @@ __enqueue_entity(struct cfs_rq *cfs_rq,
se->queued = 1;
}

-static inline void
+static void
__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (cfs_rq->rb_leftmost == se) {
@@ -240,7 +240,7 @@ static void verify_queue(struct cfs_rq *
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
-static inline void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
kclock_t now = rq_of(cfs_rq)->clock;

> Although it _should_ have been a net code size win, because if you look
> at the diff you'll see that other useful things were removed as well:
> sleeper fairness, CPU time distribution smarts, tunings, scheduler
> instrumentation code, etc.

Well, these are things I'd like you to explain a little, for example I
repeatedly asked you about the sleeper fairness and I got no answer.
BTW you seem to have missed that I actually give a bonus to sleepers
as well.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
> > my tree, and RFS was faster to some degree (it varied)..
>
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive
> runs fresh after bootup, Core2Duo, UP:
>
> -rc5(cfs) -rc5+rfs
> -------------------------------
> Time: 3.905 Time: 4.259
> Time: 3.962 Time: 4.190
> Time: 3.981 Time: 4.241
> Time: 3.986 Time: 3.937
> Time: 3.984 Time: 4.120
> Time: 4.001 Time: 4.013
> Time: 3.980 Time: 4.248
> Time: 3.983 Time: 3.961
> Time: 3.989 Time: 4.345
> Time: 3.981 Time: 4.294
> -------------------------------
> Avg: 3.975 Avg: 4.160 (+4.6%)
> Fluct: 0.138 Fluct: 1.671
>
> so unmodified CFS is 4.6% faster on this box than with Roman's patch and
> it's also more consistent/stable (10 times lower fluctuations).

Was SCHED_DEBUG enabled or disabled for these runs?

bye, Roman

2007-09-02 15:29:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Roman Zippel <[email protected]> wrote:

> > And if you look at the resulting code size/complexity, it actually
> > increases with Roman's patch (UP, nodebug, x86):
> >
> > text data bss dec hex filename
> > 13420 228 1204 14852 3a04 sched.o.rc5
> > 13554 228 1228 15010 3aa2 sched.o.rc5-roman
>
> That's pretty easy to explain due to differences in inlining:
>
> text data bss dec hex filename
> 15092 228 1204 16524 408c kernel/sched.o
> 15444 224 1228 16896 4200 kernel/sched.o.rfs
> 14708 224 1228 16160 3f20 kernel/sched.o.rfs.noinline

no, when generating those numbers i used:

CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_FORCED_INLINING is not set

(but i also re-did it for all the other combinations of these build
flags and similar results can be seen - your patch, despite removing
lots of source code, produces a larger sched.o.)

> Sorry, but I didn't spend as much time as you on tuning these numbers.

some changes did slip into your patch that have no other purpose but to
reduce code size:

+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
[...]
+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif
[...]

so i thought you must be aware of the problem - at least considering how
much you've criticised CFS's "complexity" both in your initial review of
CFS (which included object size comparisons) and in this patch
submission of yours (which did not include object size comparisons
though).

> > so unmodified CFS is 4.6% faster on this box than with Roman's patch
> > and it's also more consistent/stable (10 times lower fluctuations).
>
> Was SCHED_DEBUG enabled or disabled for these runs?

debugging disabled of course. (your patch has a self-validity checking
function [verify_queue()] that is called on SCHED_DEBUG=y, it would have
been unfair to test your patch with that included.)

Ingo

2007-09-02 17:02:36

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sat, 1 Sep 2007, Bill Davidsen wrote:

> > I'd expect Ingo to know better, but the more he refuses to answer my
> > questions, the more I doubt it, at least when it comes to the math part.
> >
> The "math part" is important if you're doing a thesis defense, but
> demonstrating better behavior under some realistic load would probably be a
> better starting point.

That wasn't my design goal. The initial trigger for this was all the 64bit
math in CFS and the question of how it can be tuned using arch specific
knowledge about the input values. For this I needed a correct and
verifiable model to analyze, so I know where the limits are and how it
reacts to different sets of inputs. This is where I got into trouble with
all the rounding - I couldn't substantially change the math without
convincing myself that it's still correct for all kinds of input data.
That's why I completely redid the math from scratch - it's based on the
same basic ideas, but the solution and thus the math is quite different.

The main reason I didn't concentrate on the behaviour so much is that
since both schedulers are conceptually not that far apart, it should be
possible to apply any tuning done to CFS to my model as well. But this
requires that someone actually understands what tuning was done and that
it wasn't done by random changes and seeing what happens, i.e. someone
should be able to explain it to me.
BTW in this context it's rather interesting to see that Ingo attacks me
now for not having done this tuning yet, which I explicitly asked for
help with...

> Maybe Ingo missed something in your math, and maybe he
> just didn't find a benefit.

Maybe he didn't understand it at all and is too proud to admit it?
(At least that's the only explanation which makes sense to me.)

> You dropped this on the world two days before a major US holiday, at the end
> of the summer when those of us not on vacation may be covering for those who
> are, did you expect Ingo to drop his regular work to look at your stuff?

Did I demand anything like that? It would have been fine if he had asked
for more time or if he had more questions first, but it took him only a
few hours to come to his conclusion without any need for further
questions.

> And
> do you think there are many people who can look at your math, and look at your
> code, and then have any clue how well it works in practice? I bet there aren't
> ten people in the world who would even claim to do that, and half of them are
> kidding themselves.

I don't think the math is that complex, though it may need some more
explanations outside the math formulas. But the math is needed to analyze
what effect changes to the scheduler have, otherwise there is a risk it
becomes only guesswork with random changes which may help in some
situations but break others.

bye, Roman

2007-09-02 17:16:21

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> so i thought you must be aware of the problem - at least considering how
> much you've criticised CFS's "complexity" both in your initial review of
> CFS (which included object size comparisons) and in this patch
> submission of yours (which did not include object size comparisons
> though).

Ingo, as long as you freely mix algorithmic and code complexity we won't
get very far. I did code size comparisons for your _stable_ code, which
was merged into Linus' tree. I explicitly said that my patch is only a
prototype, so I haven't done any cleanups and tuning in this direction
yet, so drawing any conclusions based on code size comparisons is quite
ridiculous at this point.
The whole point of this patch was to demonstrate the algorithmic changes,
not to present a final and perfectly tuned scheduler.

> > > so unmodified CFS is 4.6% faster on this box than with Roman's patch
> > > and it's also more consistent/stable (10 times lower fluctuations).
> >
> > Was SCHED_DEBUG enabled or disabled for these runs?
>
> debugging disabled of course. (your patch has a self-validity checking
> function [verify_queue()] that is called on SCHED_DEBUG=y, it would have
> been unfair to test your patch with that included.)

I'll look into it next week.

bye, Roman

2007-09-02 19:22:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler


* Roman Zippel <[email protected]> wrote:

> > > > so unmodified CFS is 4.6% faster on this box than with Roman's
> > > > patch and it's also more consistent/stable (10 times lower
> > > > fluctuations).
> > >
> > > Was SCHED_DEBUG enabled or disabled for these runs?
> >
> > debugging disabled of course. (your patch has a self-validity
> > checking function [verify_queue()] that is called on SCHED_DEBUG=y,
> > it would have been unfair to test your patch with that included.)
>
> I'll look into it next week.

thanks. FYI, there's plenty of time - i'll be at the KS next week so
i'll be quite unresponsive to emails. Would be nice if you could take a
quick look at the trivial patch i posted today though. How close is it
to your algorithm, have i missed any important details? (not counting
nice levels and rounding, it's just a quick & dirty prototype to show
the first layer of the core math and nothing more.)

Ingo

2007-09-03 02:57:41

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

> > Did you even try to understand what I wrote? I didn't say that it's a
> > "common problem", it's a conceptual problem. The rounding has been
> > improved lately, so it's not as easy to trigger with some simple busy
> > loops.
>
> As i mentioned it in my first reply to you i really welcome your changes
> and your interest in the Linux scheduler, but i'm still somewhat
> surprised why you focus on rounding so much (and why you attack CFS's
> math implementation so vehemently and IMO so unfairly) - and we had this
> discussion before in the "CFS review" thread that you started.

The rounding is insofar a problem as it makes it very difficult to produce
a correct mathematical model of CFS, which could be used to verify that
the code works correctly.
With the recent rounding changes the errors indeed have little effect in
the short term, especially as the error is distributed equally across the
tasks, so every task gets relatively the same time; the error still exists
nonetheless and adds up over time (although with the rounding changes it
doesn't grow in one direction anymore). The problem is now the limiting,
which you can't remove as long as the error exists. Part of this problem
is that CFS doesn't really maintain a balanced sum of the available
runtime, i.e. the sum of all updated wait_runtime values of all active
tasks should be zero. This means under complex loads it's possible to hit
the wait_runtime limits, and the overflow/underflow time is lost in the
calculation - this value can easily be in the millisecond range.
To be honest I have no idea how real this problem is in practice right
now; quite possibly people will only see it as a small glitch and in most
cases they won't even notice. Previously it had been rather easy to
trigger these problems due to the rounding problems. The main problem for
me here is now that with any substantial change in CFS I'm running the
risk of making this worse and noticeable again somehow. For example CFS
relies on the 64bit math to have enough resolution so that the errors
aren't noticeable.
That's why I needed a mathematical model I can work with; with it I can
for example easily downscale the math to 32bit and I know exactly the
limits within which it will work correctly.
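
To make the kind of drift I mean concrete, here is a tiny standalone
illustration (just the arithmetic, not code from CFS or from my patch):
a 1ms tick distributed among weights 1, 2 and 3 with plain integer
division silently loses 1ns per tick - a full millisecond over a million
ticks - while carrying the division remainder forward keeps the
accumulated error bounded to a few nanoseconds no matter how long it
runs:

#include <stdio.h>

int main(void)
{
	const long long tick = 1000000;			/* 1ms in ns */
	const long long weight[3] = { 1, 2, 3 };
	long long wsum = 0, naive[3] = { 0, 0, 0 };
	long long carry[3] = { 0, 0, 0 }, rem[3] = { 0, 0, 0 };
	long long t, total, naive_sum = 0, carry_sum = 0;
	int i;

	for (i = 0; i < 3; i++)
		wsum += weight[i];

	for (t = 0; t < 1000000; t++) {			/* one million ticks */
		for (i = 0; i < 3; i++) {
			long long num;

			/* naive: the division remainder is thrown away */
			naive[i] += tick * weight[i] / wsum;

			/* Bresenham-style: carry the remainder forward */
			num = tick * weight[i] + rem[i];
			carry[i] += num / wsum;
			rem[i] = num % wsum;
		}
	}

	total = tick * 1000000;
	for (i = 0; i < 3; i++) {
		naive_sum += naive[i];
		carry_sum += carry[i];
	}
	printf("real time distributed: %lld ns\n", total);
	printf("naive accounting:      %lld ns (%lld ns lost)\n",
	       naive_sum, total - naive_sum);
	printf("remainder carried:     %lld ns (%lld ns lost)\n",
	       carry_sum, total - carry_sum);
	return 0;
}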

> Your math is fairly simple (and that is _good_, just like CFS's existing
> math is simple),

I really wouldn't call CFS's existing math simple; the basic ideas are
indeed simple, but the math of the actual implementation is quite a
bit more complex.

> it can be summed up in essence as (without complicating
> it with nice-level weighting, for easy understandability):
>
> " use the already existing p->sum_exec_runtime 'task runtime' metric
> that CFS maintains, and use that as the key into the rb-tree that
> selects tasks that should be run next. To handle sleeping tasks: keep
> a per-rq sum of all runnable task's ->sum_exec_runtime values and
> start newly woken tasks at the average rq->sum/nr_running value. "
>
> Now your patch does not actually do it that way in a clearly discernible
> manner because lots of changes are intermixed into one big patch.

Well, it's part of the math, but I wouldn't call it the "essence" of the
_math_, as it leaves many questions unanswered; the essence of the math
mainly involves how the problems created by the weighting are solved.

If you want to describe the basic logical difference, you can do it a
little more simply: CFS maintains the runtime of a task relative to a
global clock, while in my model every task has its own clock and an
approximated average is used to initialize the clock of new tasks.

> ( please correct me if i got your math wrong. Your patch does not add
> any comments at all to the new code and this slowed down my review
> and analysis of your patch quite considerably. Lack of comments makes
> it harder to see the purpose and makes it harder to notice the
> benefits/tradeoffs involved in each change. )

I hope you did notice that I explained in quite some detail in the
announcement what the code does.

> To sum up: the math in your patch _could_ be implemented as a much
> smaller add-on to the already existing variables maintained by CFS, but
> you chose to do lots of changes, variable-renames and removals at once
> and posted them as one big change.

I didn't rename that much and there are only a few variables that could
have been reused; for the most part I indeed removed quite a bit and the
rest really has a different meaning.

> _Please_, separate things
> out so that they can be reviewed one by one.

That's not really the problem, but _please_ someone explain to me the
features you want to see preserved. I don't want to be left guessing
what they're intended to do and blindly implement something which should
do something similar.
Thanks.

bye, Roman

2007-09-03 18:20:36

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sun, 2 Sep 2007, Daniel Walker wrote:

> For instance, say there are three tasks in the system. Call them A, B,
> and C.
>
> then
>
> time equals "time of A" + "time of B" + "time of C"

Ok, let's take a simple example. :)

If we have three tasks A, B, C with weights of 1, 2 and 3 respectively,
the weight sum is 6. If we let these three tasks run over a period of 12s,
each task gets time/weight_sum*weight_{t} seconds, that is the tasks get
2, 4 and 6 seconds of runtime respectively.
The basic idea of CFS is now that, using this formula, the time is
distributed to the active tasks and accounted against the runtime they
get. Let's assume a time slice of 1s, where each task gets
1s/weight_sum*weight_{t} of eligible runtime every second, so in the
following table the task columns contain the values (eligible runtime
- obtained runtime):

time current A B C fair_clock
1 C 1/6-0 2/6-0 3/6-1 1/6
2 C 2/6-0 4/6-0 6/6-2 2/6
3 C 3/6-0 6/6-0 9/6-3 3/6
4 B 4/6-0 8/6-1 12/6-3 4/6
5 B 5/6-0 10/6-2 15/6-3 5/6
6 A 6/6-1 12/6-2 18/6-3 6/6
7 C 7/6-1 14/6-2 21/6-4 7/6
8 C 8/6-1 16/6-2 24/6-5 8/6
9 C 9/6-1 18/6-2 27/6-6 9/6
10 B 10/6-1 20/6-3 30/6-6 10/6
11 B 11/6-1 22/6-4 33/6-6 11/6
12 A 12/6-2 24/6-4 36/6-6 12/6

In practice the eligible runtime isn't updated constantly; a delta
against the last update is used instead. If everything is working
correctly, in the end the difference between the eligible and obtained
runtime is zero.
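
For anyone who wants to check the table: the following small standalone
program (an illustration only, not scheduler code) replays the run order
from the table and tracks eligible minus obtained runtime in sixths of a
second; the values match the table and their sum is zero after every
step, ending at exactly zero for all three tasks.

#include <stdio.h>

int main(void)
{
	const int weight[3] = { 1, 2, 3 };	/* A, B, C; weight_sum = 6 */
	const char order[] = "CCCBBACCCBBA";	/* who ran in each second */
	int obtained[3] = { 0, 0, 0 };		/* whole seconds of CPU */
	int t, i;

	for (t = 1; t <= 12; t++) {
		int sum = 0;

		obtained[order[t - 1] - 'A']++;
		printf("t=%2d ran %c ", t, order[t - 1]);
		for (i = 0; i < 3; i++) {
			/* eligible - obtained, in sixths of a second:
			 * eligible = t * weight / weight_sum */
			int wait = t * weight[i] - 6 * obtained[i];

			sum += wait;
			printf(" %c=%3d/6", 'A' + i, wait);
		}
		printf("  (sum %d/6)\n", sum);
	}
	return 0;
}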

Peter's patches now make an interesting step: basically they change the
position of the weight_{t} variable, so every task now gets
1s/weight_sum per second and weight_{t} is instead used to scale the
obtained runtime:

time current A B C fair_clock
1 C 1/6-0/1 1/6-0/2 1/6-1/3 1/6
2 C 2/6-0/1 2/6-0/2 2/6-2/3 2/6
3 C 3/6-0/1 3/6-0/2 3/6-3/3 3/6
4 B 4/6-0/1 4/6-1/2 4/6-3/3 4/6
5 B 5/6-0/1 5/6-2/2 5/6-3/3 5/6
6 A 6/6-1/1 6/6-2/2 6/6-3/3 6/6
7 C 7/6-1/1 7/6-2/2 7/6-4/3 7/6
8 C 8/6-1/1 8/6-2/2 8/6-5/3 8/6
9 C 9/6-1/1 9/6-2/2 9/6-6/3 9/6
10 B 10/6-1/1 10/6-3/2 10/6-6/3 10/6
11 B 11/6-1/1 11/6-4/2 11/6-6/3 11/6
12 A 12/6-2/1 12/6-4/2 12/6-6/3 12/6

As you can see here the fair_clock part is the same for every task in
every row, so it can be removed; this is basically the step I take in
my patch:

time current A B C fair_clock
1 C 0/1 0/2 1/3 1/6
2 C 0/1 0/2 2/3 2/6
3 C 0/1 0/2 3/3 3/6
4 B 0/1 1/2 3/3 4/6
5 B 0/1 2/2 3/3 5/6
6 A 1/1 2/2 3/3 6/6
7 C 1/1 2/2 4/3 7/6
8 C 1/1 2/2 5/3 8/6
9 C 1/1 2/2 6/3 9/6
10 B 1/1 3/2 6/3 10/6
11 B 1/1 4/2 6/3 11/6
12 A 2/1 4/2 6/3 12/6

This means fair_clock isn't used in the runtime calculation anymore and
time isn't relative to fair_clock, but is an absolute value. It looks
relatively simple, but it means all calculations involving fair_clock
deltas and wait_runtime values have to be redone in this context, and
that's the part I need some help with - some of the tuning
features aren't converted yet.
The importance of this step is that it removes a major error source for
the accuracy of the scheduler. All these fractions have to be rounded to
integer values at some point and the main problem here is that the
denominator in the fair_clock calculation is variable - it changes with
the number of active tasks and it requires quite some resolution, so the
error doesn't become noticeable.

fair_clock isn't used here anymore, but it's still used to initialize the
time value of new tasks. For this I can approximate it, and that is the
major part of the math involved in making it so accurate that the error
doesn't grow over time and stays within a certain limit.
The basic idea here is to compensate for the rounding errors, which are
inevitable. Let's look at the last row at time 12: if we add up the time
values and denormalize them (remove the weight part), we get a sum like
this: round(2/1)*1 + round(4/2)*2 + round(6/3)*3 ~= 12. That means I don't
use the real time value 12, I use an approximated time value instead which
is calculated from the individual task times, so it includes the rounding
errors that happened during the calculation of these task times.

Basically that's it and I hope that explains the basic math a bit easier. :-)
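
As a standalone illustration of the per-task clock idea (again not code
from the patch; the 16-bit fixed point and the tie-breaking are arbitrary
choices here), the following program lets each task advance its own clock
by 1/weight per second of CPU time, always runs the task whose clock is
furthest behind, and shows that the weighted sum of the clocks tracks the
real elapsed time (the weighted average, i.e. that sum divided by the
weight sum, is what a newly started task would be initialized with):

#include <stdio.h>

#define SHIFT	16			/* fixed point fraction bits */
#define ONE	(1LL << SHIFT)

int main(void)
{
	const long long weight[3] = { 1, 2, 3 };	/* A, B, C */
	long long time_norm[3] = { 0, 0, 0 };		/* per-task clocks */
	long long weight_sum = 0;
	int t, i;

	for (i = 0; i < 3; i++)
		weight_sum += weight[i];

	for (t = 1; t <= 12; t++) {
		long long sum = 0;
		int next = 0;

		/* run the task whose normalized clock is furthest behind */
		for (i = 1; i < 3; i++)
			if (time_norm[i] < time_norm[next])
				next = i;
		time_norm[next] += ONE / weight[next];	/* 1s of CPU time */

		/* weighted sum of the clocks approximates the elapsed time;
		 * the average (sum / weight_sum) is what a new task gets */
		for (i = 0; i < 3; i++)
			sum += time_norm[i] * weight[i];

		printf("t=%2d ran %c  approx. elapsed %.4f s (real %d s)\n",
		       t, 'A' + next, (double)sum / ONE, t);
	}
	return 0;
}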

bye, Roman

2007-09-03 21:18:33

by Daniel Walker

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Mon, 2007-09-03 at 20:20 +0200, Roman Zippel wrote:
> Basically that's it and I hope that explains the basic math a bit easier. :-)
>

It helps a tiny bit .. However, I appreciate that you took the time to
write this .. Thank you.

Daniel

2007-09-06 03:04:16

by Syren Baran

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Monday, 2007-09-03 at 04:58 +0200, Roman Zippel wrote:
> Hi,
>
> On Sun, 2 Sep 2007, Ingo Molnar wrote:
>
> > > Did you even try to understand what I wrote? I didn't say that it's a
> > > "common problem", it's a conceptual problem. The rounding has been
> > > improved lately, so it's not as easy to trigger with some simple busy
> > > loops.
> >
> > As i mentioned it in my first reply to you i really welcome your changes
> > and your interest in the Linux scheduler, but i'm still somewhat
> > surprised why you focus on rounding so much (and why you attack CFS's
> > math implementation so vehemently and IMO so unfairly) - and we had this
> > discussion before in the "CFS review" thread that you started.
>
> The rounding is insofar a problem as it makes it very difficult to produce
> a correct mathematical model of CFS, which can be used to verify the
> working of code.
> With the recent rounding changes they have indeed little effect in the
> short term, especially as the error is distributed equally to the task, so
> every task gets relatively the same time, the error still exists
> nonetheless and adds up over time (although with the rounding changes it
> doesn't grow into one direction anymore).
So, it's like I roll a die repeatedly and subtract 3.5 every time? Even
in the long run the result should be close to zero.
> The problem is now the limiting,
> which you can't remove as long as the error exists. Part of this problem
> is that CFS doesn't really maintain a balanced sum of the available
> runtime, i.e. the sum of all updated wait_runtime values of all active
> tasks should be zero. This means under complex loads it's possible to hit
> the wait_runtime limits and the overflow/underflow time is lost in the
> calculation and this value can be easily in the millisecond range.
> To be honest I have no idea how real this problem in the praxis is right
> now, quite possibly people will only see it as a small glitch and in most
> cases they won't even notice.
So, given the above example, if the result of the dice rolls ever
exceeds +/- 1,000,000 I'll get some ugly timing glitches? As the number
of dice rolls grows towards infinity the chance of remaining within this
boundary goes steadily towards 0%.
What does this equate to in the real world? Weeks, months, years,
millennia?

> Previously it had been rather easy to
> trigger these problems due to the rounding problems. The main problem for
> me here is now that with any substantial change in CFS I'm running risk of
> making this worse and noticable again somehow. For example CFS relies on
> the 64bit math to have enough resolution so that the errors aren't
> noticable.
> That's why I needed a mathematical model I can work with and with it for
> example I can easily downscale the math to 32bit and I know exactly the
> limits within it will work correctly.

If I understand the problem correctly these errors occur due to the fact
that delta-values are used instead of recalculating the "fair" process
time on every "dice" roll?
Somehow the dead simple solution would seem to be to "reset" or
"reinitialise" the "dice" roll every million rolls or so.
Or, to put this into context again: figure out how large the accumulated
error would have to be to be noticeable. Then figure out how long it
would take for such an error to occur in the real world and just
reinitialise the damn thing. Shouldn't be too complicated stochastics.

Greetings,
Syren

2007-09-07 15:35:42

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sun, 2 Sep 2007, Ingo Molnar wrote:

Below is a patch updated against the latest git tree, no major changes.
For a split version I'm still waiting for some more explanation about the
CFS tuning parameter.
I added another check to the debug version so that any imbalances (as
e.g. Mike saw) should be reported. I left some of the proc parameters
visible even in the non-debug version, so one can play with them without
getting the debug overhead.

> And if you look at the resulting code size/complexity, it actually
> increases with Roman's patch (UP, nodebug, x86):
>
> text data bss dec hex filename
> 13420 228 1204 14852 3a04 sched.o.rc5
> 13554 228 1228 15010 3aa2 sched.o.rc5-roman

You'll be happy to know that with the patch even the SMP code size
decreases.
BTW if I gave the impression that overall code size must decrease, then I
apologize; overall code size is of course only an indication, IMO more
important is where that code is, i.e. that regularly executed code is as
compact as possible.
64bit math simply adds a level of bloat which is hard to avoid, but
with the fixed math it's actually possible to further reduce the code size
by getting rid of the 64bit math, so that anyone preferring a really
compact scheduler can have that.

> > I also ran hackbench (in a haphazard way) a few times on it vs. CFS in
> > my tree, and RFS was faster to some degree (it varied)..
>
> here are some actual numbers for "hackbench 50" on -rc5, 10 consecutive
> runs fresh after bootup, Core2Duo, UP:
>
> -rc5(cfs) -rc5+rfs
> -------------------------------
> Time: 3.905 Time: 4.259
> Time: 3.962 Time: 4.190
> Time: 3.981 Time: 4.241
> Time: 3.986 Time: 3.937
> Time: 3.984 Time: 4.120
> Time: 4.001 Time: 4.013
> Time: 3.980 Time: 4.248
> Time: 3.983 Time: 3.961
> Time: 3.989 Time: 4.345
> Time: 3.981 Time: 4.294
> -------------------------------
> Avg: 3.975 Avg: 4.160 (+4.6%)
> Fluct: 0.138 Fluct: 1.671
>
> so unmodified CFS is 4.6% faster on this box than with Roman's patch and
> it's also more consistent/stable (10 times lower fluctuations).

hackbench apparently works best if one doesn't wake up the consumer too
early, so in the patch I don't preempt a just-woken process, which
produces far better numbers.

bye, Roman


---
include/linux/sched.h | 26 -
kernel/sched.c | 173 +++++-----
kernel/sched_debug.c | 26 -
kernel/sched_norm.c | 841 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 30 -
5 files changed, 948 insertions(+), 148 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned long delta_fair_sleep;
- unsigned long delta_exec;
- s64 fair_key;
+ s64 time_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ unsigned int on_rq, queued;
+ unsigned int weight_shift;

u64 exec_start;
u64 sum_exec_runtime;
- u64 prev_sum_exec_runtime;
- u64 wait_start_fair;
- u64 sleep_start_fair;
+ s64 time_norm;
+ s64 req_weight_inv;

#ifdef CONFIG_SCHEDSTATS
- u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;

- u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;

- u64 block_start;
u64 block_max;
u64 exec_max;

@@ -1400,12 +1388,10 @@ static inline void idle_task_exit(void)

extern void sched_idle_next(void);

-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
-extern unsigned int sysctl_sched_runtime_limit;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -183,18 +183,24 @@ struct cfs_rq {

s64 fair_clock;
u64 exec_clock;
- s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;

+ u64 prev_update;
+ s64 time_norm_base, time_norm_inc;
+ u64 run_start, run_end;
+ u64 run_end_next, run_end_curr;
+ s64 time_sum_max, time_sum_off;
+ unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
- struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
- struct sched_entity *curr;
+ struct sched_entity *curr, *next;
+ struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -231,12 +237,16 @@ struct rq {
*/
unsigned long nr_running;
#define CPU_LOAD_IDX_MAX 5
+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
+#ifdef CONFIG_SMP
struct load_stat ls; /* capture load from *all* tasks on this cpu */
+#endif
unsigned long nr_load_updates;
u64 nr_switches;

@@ -636,13 +646,6 @@ static void resched_cpu(int cpu)
resched_task(cpu_curr(cpu));
spin_unlock_irqrestore(&rq->lock, flags);
}
-#else
-static inline void resched_task(struct task_struct *p)
-{
- assert_spin_locked(&task_rq(p)->lock);
- set_tsk_need_resched(p);
-}
-#endif

static u64 div64_likely32(u64 divident, unsigned long divisor)
{
@@ -657,6 +660,13 @@ static u64 div64_likely32(u64 divident,
#endif
}

+#else
+static inline void resched_task(struct task_struct *p)
+{
+ assert_spin_locked(&task_rq(p)->lock);
+ set_tsk_need_resched(p);
+}
+#endif
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -734,15 +744,33 @@ static void update_load_sub(struct load_
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
+#ifdef SANE_NICE_LEVEL
+static const int prio_to_weight[40] = {
+ 3567, 3102, 2703, 2351, 2048, 1783, 1551, 1351, 1177, 1024,
+ 892, 776, 676, 588, 512, 446, 388, 338, 294, 256,
+ 223, 194, 169, 147, 128, 111, 97, 84, 74, 64,
+ 56, 49, 42, 37, 32, 28, 24, 21, 18, 16
+};
+
+static const u32 prio_to_wmult[40] = {
+ 294, 338, 388, 446, 512, 588, 676, 776, 891, 1024,
+ 1176, 1351, 1552, 1783, 2048, 2353, 2702, 3104, 3566, 4096,
+ 4705, 5405, 6208, 7132, 8192, 9410, 10809, 12417, 14263, 16384,
+ 18820, 21619, 24834, 28526, 32768, 37641, 43238, 49667, 57052, 65536
+};
+
+static const u32 prio_to_wshift[40] = {
+ 8, 8, 7, 7, 7, 7, 7, 6, 6, 6,
+ 6, 6, 5, 5, 5, 5, 5, 4, 4, 4,
+ 4, 4, 3, 3, 3, 3, 3, 2, 2, 2,
+ 2, 2, 1, 1, 1, 1, 1, 0, 0, 0
+};
+#else
static const int prio_to_weight[40] = {
- /* -20 */ 88761, 71755, 56483, 46273, 36291,
- /* -15 */ 29154, 23254, 18705, 14949, 11916,
- /* -10 */ 9548, 7620, 6100, 4904, 3906,
- /* -5 */ 3121, 2501, 1991, 1586, 1277,
- /* 0 */ 1024, 820, 655, 526, 423,
- /* 5 */ 335, 272, 215, 172, 137,
- /* 10 */ 110, 87, 70, 56, 45,
- /* 15 */ 36, 29, 23, 18, 15,
+ 95325, 74898, 61681, 49932, 38836, 31775, 24966, 20165, 16132, 12945,
+ 10382, 8257, 6637, 5296, 4228, 3393, 2709, 2166, 1736, 1387,
+ 1111, 888, 710, 568, 455, 364, 291, 233, 186, 149,
+ 119, 95, 76, 61, 49, 39, 31, 25, 20, 16
};

/*
@@ -753,16 +781,20 @@ static const int prio_to_weight[40] = {
* into multiplications:
*/
static const u32 prio_to_wmult[40] = {
- /* -20 */ 48388, 59856, 76040, 92818, 118348,
- /* -15 */ 147320, 184698, 229616, 287308, 360437,
- /* -10 */ 449829, 563644, 704093, 875809, 1099582,
- /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
- /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
- /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
- /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
- /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
+ 11, 14, 17, 21, 27, 33, 42, 52, 65, 81,
+ 101, 127, 158, 198, 248, 309, 387, 484, 604, 756,
+ 944, 1181, 1476, 1845, 2306, 2882, 3603, 4504, 5629, 7037,
+ 8796, 10995, 13744, 17180, 21475, 26844, 33554, 41943, 52429, 65536
};

+static const u32 prio_to_wshift[40] = {
+ 13, 12, 12, 12, 11, 11, 11, 10, 10, 10,
+ 9, 9, 9, 8, 8, 8, 7, 7, 7, 6,
+ 6, 6, 5, 5, 5, 5, 4, 4, 4, 3,
+ 3, 3, 2, 2, 2, 1, 1, 1, 0, 0
+};
+#endif
+
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);

/*
@@ -784,7 +816,8 @@ static int balance_tasks(struct rq *this

#include "sched_stats.h"
#include "sched_rt.c"
-#include "sched_fair.c"
+//#include "sched_fair.c"
+#include "sched_norm.c"
#include "sched_idletask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -792,6 +825,7 @@ static int balance_tasks(struct rq *this

#define sched_class_highest (&rt_sched_class)

+#ifdef CONFIG_SMP
static void __update_curr_load(struct rq *rq, struct load_stat *ls)
{
if (rq->curr != rq->idle && ls->load.weight) {
@@ -843,6 +877,14 @@ static inline void dec_load(struct rq *r
update_curr_load(rq);
update_load_sub(&rq->ls.load, p->se.load.weight);
}
+#else
+static inline void inc_load(struct rq *rq, const struct task_struct *p)
+{
+}
+static inline void dec_load(struct rq *rq, const struct task_struct *p)
+{
+}
+#endif

static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
@@ -858,9 +900,6 @@ static void dec_nr_running(struct task_s

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
- p->se.wait_runtime = 0;
-
if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
@@ -878,6 +917,8 @@ static void set_load_weight(struct task_

p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+ p->se.weight_shift = prio_to_wshift[p->static_prio - MAX_RT_PRIO];
+ p->se.req_weight_inv = p->se.load.inv_weight * (kclock_t)sysctl_sched_granularity;
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -986,11 +1027,13 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}

+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif

static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
@@ -1004,27 +1047,6 @@ static inline void __set_task_cpu(struct

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
- int old_cpu = task_cpu(p);
- struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
- u64 clock_offset, fair_clock_offset;
-
- clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->cfs.fair_clock - new_rq->cfs.fair_clock;
-
- if (p->se.wait_start_fair)
- p->se.wait_start_fair -= fair_clock_offset;
- if (p->se.sleep_start_fair)
- p->se.sleep_start_fair -= fair_clock_offset;
-
-#ifdef CONFIG_SCHEDSTATS
- if (p->se.wait_start)
- p->se.wait_start -= clock_offset;
- if (p->se.sleep_start)
- p->se.sleep_start -= clock_offset;
- if (p->se.block_start)
- p->se.block_start -= clock_offset;
-#endif
-
__set_task_cpu(p, new_cpu);
}

@@ -1584,22 +1606,12 @@ int fastcall wake_up_state(struct task_s
*/
static void __sched_fork(struct task_struct *p)
{
- p->se.wait_start_fair = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
- p->se.prev_sum_exec_runtime = 0;
- p->se.delta_exec = 0;
- p->se.delta_fair_run = 0;
- p->se.delta_fair_sleep = 0;
- p->se.wait_runtime = 0;
- p->se.sleep_start_fair = 0;

#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
p->se.sum_wait_runtime = 0;
p->se.sum_sleep_runtime = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
p->se.sleep_max = 0;
p->se.block_max = 0;
p->se.exec_max = 0;
@@ -1971,6 +1983,7 @@ unsigned long nr_active(void)
return running + uninterruptible;
}

+#ifdef CONFIG_SMP
/*
* Update rq->cpu_load[] statistics. This function is usually called every
* scheduler tick (TICK_NSEC).
@@ -2027,8 +2040,6 @@ do_avg:
}
}

-#ifdef CONFIG_SMP
-
/*
* double_rq_lock - safely lock two runqueues
*
@@ -3350,7 +3361,9 @@ void scheduler_tick(void)
if (unlikely(rq->clock < next_tick))
rq->clock = next_tick;
rq->tick_timestamp = rq->clock;
+#ifdef CONFIG_SMP
update_cpu_load(rq);
+#endif
if (curr != rq->idle) /* FIXME: needed? */
curr->sched_class->task_tick(rq, curr);
spin_unlock(&rq->lock);
@@ -4914,16 +4927,12 @@ static inline void sched_init_granularit
unsigned int factor = 1 + ilog2(num_online_cpus());
const unsigned long limit = 100000000;

- sysctl_sched_min_granularity *= factor;
- if (sysctl_sched_min_granularity > limit)
- sysctl_sched_min_granularity = limit;
-
- sysctl_sched_latency *= factor;
- if (sysctl_sched_latency > limit)
- sysctl_sched_latency = limit;
+ sysctl_sched_granularity *= factor;
+ if (sysctl_sched_granularity > limit)
+ sysctl_sched_granularity = limit;

- sysctl_sched_runtime_limit = sysctl_sched_latency;
- sysctl_sched_wakeup_granularity = sysctl_sched_min_granularity / 2;
+ sysctl_sched_wakeup_granularity = sysctl_sched_granularity / 2;
+ sysctl_sched_batch_wakeup_granularity = sysctl_sched_granularity;
}

#ifdef CONFIG_SMP
@@ -6516,6 +6525,9 @@ static inline void init_cfs_rq(struct cf
{
cfs_rq->tasks_timeline = RB_ROOT;
cfs_rq->fair_clock = 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->inc_shift = 0;
#ifdef CONFIG_FAIR_GROUP_SCHED
cfs_rq->rq = rq;
#endif
@@ -6523,7 +6535,6 @@ static inline void init_cfs_rq(struct cf

void __init sched_init(void)
{
- u64 now = sched_clock();
int highest_cpu = 0;
int i, j;

@@ -6548,12 +6559,11 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
list_add(&rq->cfs.leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
#endif
- rq->ls.load_update_last = now;
- rq->ls.load_update_start = now;
+#ifdef CONFIG_SMP
+ rq->ls.load_update_last = rq->ls.load_update_start = sched_clock();

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
-#ifdef CONFIG_SMP
rq->sd = NULL;
rq->active_balance = 0;
rq->next_balance = jiffies;
@@ -6643,16 +6653,7 @@ void normalize_rt_tasks(void)

read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
- p->se.fair_key = 0;
- p->se.wait_runtime = 0;
p->se.exec_start = 0;
- p->se.wait_start_fair = 0;
- p->se.sleep_start_fair = 0;
-#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
-#endif
task_rq(p)->cfs.fair_clock = 0;
task_rq(p)->clock = 0;

Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -38,9 +38,9 @@ print_task(struct seq_file *m, struct rq

SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d ",
p->comm, p->pid,
- (long long)p->se.fair_key,
- (long long)(p->se.fair_key - rq->cfs.fair_clock),
- (long long)p->se.wait_runtime,
+ (long long)p->se.time_norm >> 16,
+ (long long)((p->se.time_key >> 16) - rq->cfs.fair_clock),
+ ((long long)((rq->cfs.fair_clock << 16) - p->se.time_norm) * p->se.load.weight) >> 20,
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
@@ -73,6 +73,7 @@ static void print_rq(struct seq_file *m,

read_lock_irq(&tasklist_lock);

+ rq->cfs.fair_clock = get_time_avg(&rq->cfs) >> 16;
do_each_thread(g, p) {
if (!p->se.on_rq || task_cpu(p) != rq_cpu)
continue;
@@ -93,10 +94,10 @@ print_cfs_rq_runtime_sum(struct seq_file
struct rq *rq = &per_cpu(runqueues, cpu);

spin_lock_irqsave(&rq->lock, flags);
- curr = first_fair(cfs_rq);
+ curr = cfs_rq->rb_leftmost ? &cfs_rq->rb_leftmost->run_node : NULL;
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
- wait_runtime_rq_sum += p->se.wait_runtime;
+ //wait_runtime_rq_sum += p->se.wait_runtime;

curr = rb_next(curr);
}
@@ -110,12 +111,12 @@ void print_cfs_rq(struct seq_file *m, in
{
SEQ_printf(m, "\ncfs_rq\n");

+ cfs_rq->fair_clock = get_time_avg(cfs_rq) >> 16;
#define P(x) \
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(cfs_rq->x))

P(fair_clock);
P(exec_clock);
- P(wait_runtime);
P(wait_runtime_overruns);
P(wait_runtime_underruns);
P(sleeper_bonus);
@@ -143,10 +144,12 @@ static void print_cpu(struct seq_file *m
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x))

P(nr_running);
+#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load",
rq->ls.load.weight);
P(ls.delta_fair);
P(ls.delta_exec);
+#endif
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
@@ -160,11 +163,13 @@ static void print_cpu(struct seq_file *m
P(clock_overflows);
P(clock_deep_idle_events);
P(clock_max_delta);
+#ifdef CONFIG_SMP
P(cpu_load[0]);
P(cpu_load[1]);
P(cpu_load[2]);
P(cpu_load[3]);
P(cpu_load[4]);
+#endif
#undef P

print_cfs_stats(m, cpu);
@@ -241,16 +246,7 @@ void proc_sched_show_task(struct task_st
#define P(F) \
SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)

- P(se.wait_runtime);
- P(se.wait_start_fair);
- P(se.exec_start);
- P(se.sleep_start_fair);
- P(se.sum_exec_runtime);
-
#ifdef CONFIG_SCHEDSTATS
- P(se.wait_start);
- P(se.sleep_start);
- P(se.block_start);
P(se.sleep_max);
P(se.block_max);
P(se.exec_max);
Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_norm.c
@@ -0,0 +1,841 @@
+/*
+ * Completely Fair Scheduling (CFS) Class (SCHED_NORMAL/SCHED_BATCH)
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Really fair scheduling
+ * Copyright (C) 2007, Roman Zippel <[email protected]>
+ */
+
+typedef s64 kclock_t;
+
+static inline kclock_t kclock_max(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) > 0 ? x : y;
+}
+static inline kclock_t kclock_min(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) < 0 ? x : y;
+}
+
+#define MSHIFT 16
+#define MAX_TIMESUM ((kclock_t)1 << (30 + MSHIFT))
+
+/*
+ * Preemption granularity:
+ * (default: 10 msec, units: nanoseconds)
+ *
+ * NOTE: this granularity value is not the same as the concept of
+ * 'timeslice length' - timeslices in CFS will typically be somewhat
+ * larger than this value. (to see the precise effective timeslice
+ * length of your workload, run vmstat and monitor the context-switches
+ * field)
+ *
+ * On SMP systems the value of this is multiplied by the log2 of the
+ * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
+ * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
+ */
+unsigned int sysctl_sched_granularity __read_mostly = 10000000ULL;
+
+/*
+ * SCHED_BATCH wake-up granularity.
+ * (default: 10 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_batch_wakeup_granularity __read_mostly =
+ 10000000ULL;
+
+/*
+ * SCHED_OTHER wake-up granularity.
+ * (default: 2 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_wakeup_granularity __read_mostly = 2000000ULL;
+
+unsigned int sysctl_sched_stat_granularity __read_mostly;
+
+/*
+ * Debugging: various feature bits
+ */
+enum {
+ SCHED_FEAT_FAIR_SLEEPERS = 1,
+ SCHED_FEAT_SLEEPER_AVG = 2,
+ SCHED_FEAT_SLEEPER_LOAD_AVG = 4,
+ SCHED_FEAT_PRECISE_CPU_LOAD = 8,
+ SCHED_FEAT_START_DEBIT = 16,
+ SCHED_FEAT_SKIP_INITIAL = 32,
+};
+
+unsigned int sysctl_sched_features __read_mostly =
+ SCHED_FEAT_FAIR_SLEEPERS *1 |
+ SCHED_FEAT_SLEEPER_AVG *1 |
+ SCHED_FEAT_SLEEPER_LOAD_AVG *1 |
+ SCHED_FEAT_PRECISE_CPU_LOAD *1 |
+ SCHED_FEAT_START_DEBIT *1 |
+ SCHED_FEAT_SKIP_INITIAL *0;
+
+extern struct sched_class fair_sched_class;
+
+static kclock_t get_time_avg(struct cfs_rq *cfs_rq)
+{
+ kclock_t avg;
+
+ avg = cfs_rq->time_norm_base;
+ if (cfs_rq->weight_sum)
+ avg += (kclock_t)((int)(cfs_rq->time_sum_off >> MSHIFT) / cfs_rq->weight_sum) << MSHIFT;
+
+ return avg;
+}
+
+static void normalize_time_sum_off(struct cfs_rq *cfs_rq)
+{
+ if (unlikely(cfs_rq->time_sum_off >= cfs_rq->time_sum_max)) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ } else if (unlikely(cfs_rq->time_sum_off < 0)) {
+ cfs_rq->time_sum_off += cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base -= cfs_rq->time_norm_inc;
+ }
+}
+
+/**************************************************************
+ * CFS operations on generic schedulable entities:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* cpu runqueue to which this cfs_rq is attached */
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rq;
+}
+
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return container_of(cfs_rq, struct rq, cfs);
+}
+
+#define entity_is_task(se) 1
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ return container_of(se, struct task_struct, se);
+}
+
+
+/**************************************************************
+ * Scheduling class tree data structure manipulation methods:
+ */
+
+/*
+ * Enqueue an entity into the rb-tree:
+ */
+static void
+__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_entity *entry;
+ kclock_t key;
+ int leftmost = 1;
+
+ se->time_key = key = se->time_norm + (se->req_weight_inv >> 1);
+ /*
+ * Find the right place in the rbtree:
+ */
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_entity, run_node);
+ /*
+ * We dont care about collisions. Nodes with
+ * the same key stay together.
+ */
+ if (key - entry->time_key < 0) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used):
+ */
+ if (leftmost) {
+ cfs_rq->rb_leftmost = se;
+ if (cfs_rq->curr) {
+ cfs_rq->run_end_next = se->time_norm + se->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(se->time_norm, cfs_rq->run_end_curr));
+ }
+ }
+
+ rb_link_node(&se->run_node, parent, link);
+ rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_add(&cfs_rq->load, se->load.weight);
+ se->queued = 1;
+}
+
+static void
+__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (cfs_rq->rb_leftmost == se) {
+ struct rb_node *next = rb_next(&se->run_node);
+ cfs_rq->rb_leftmost = next ? rb_entry(next, struct sched_entity, run_node) : NULL;
+ }
+ rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_sub(&cfs_rq->load, se->load.weight);
+ se->queued = 0;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static void verify_queue(struct cfs_rq *cfs_rq, int inc_curr, struct sched_entity *se2)
+{
+ struct rb_node *node;
+ struct sched_entity *se;
+ kclock_t avg, sum = 0;
+ static int busy = 0, cnt = 10;
+
+#define RANGE_CHECK(se, se2) ({ \
+ if (cnt > 0 && abs(avg - se->time_norm) > (se->req_weight_inv << 2)) { \
+ cnt--; \
+ printk("%ld,%p(%u): %Lx,%Lx,%p(%u),%d\n", \
+ cfs_rq->nr_running, se, task_of(se)->pid, \
+ se->time_norm, avg, se2, task_of(se2)->pid, inc_curr); \
+ WARN_ON(1); \
+ } })
+
+ if (busy)
+ return;
+ busy = 1;
+
+ avg = get_time_avg(cfs_rq);
+ se = cfs_rq->curr;
+ if (inc_curr && se) {
+ sum = (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ RANGE_CHECK(se, se2);
+ }
+ node = rb_first(&cfs_rq->tasks_timeline);
+ WARN_ON(node && cfs_rq->rb_leftmost != rb_entry(node, struct sched_entity, run_node));
+ while (node) {
+ se = rb_entry(node, struct sched_entity, run_node);
+ sum += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ RANGE_CHECK(se, se2);
+ node = rb_next(node);
+ }
+ if (sum != cfs_rq->time_sum_off) {
+ kclock_t oldsum = cfs_rq->time_sum_off;
+ cfs_rq->time_sum_off = sum;
+ printk("%ld:%Lx,%Lx,%p,%p,%d\n", cfs_rq->nr_running, sum, oldsum, cfs_rq->curr, se2, inc_curr);
+ WARN_ON(1);
+ }
+ busy = 0;
+}
+#else
+#define verify_queue(q,c,s) ((void)0)
+#endif
+
+/**************************************************************
+ * Scheduling class statistics methods:
+ */
+
+/*
+ * Update the current task's runtime statistics. Skip current tasks that
+ * are not in our scheduling class.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ kclock_t now = rq_of(cfs_rq)->clock;
+ unsigned long delta_exec;
+ kclock_t delta_norm;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = now - cfs_rq->prev_update;
+ if (!delta_exec)
+ return;
+ cfs_rq->prev_update = now;
+
+ curr->sum_exec_runtime += delta_exec;
+ cfs_rq->exec_clock += delta_exec;
+
+ delta_norm = (kclock_t)delta_exec * curr->load.inv_weight;
+ curr->time_norm += delta_norm;
+ cfs_rq->time_sum_off += delta_norm << curr->weight_shift;
+
+ verify_queue(cfs_rq, 4, curr);
+}
+
+static void
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ kclock_t min_time;
+
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+ min_time = get_time_avg(cfs_rq) - se->req_weight_inv;
+ if ((kclock_t)(se->time_norm - min_time) < 0)
+ se->time_norm = min_time;
+
+ cfs_rq->nr_running++;
+ cfs_rq->weight_sum += 1 << se->weight_shift;
+ if (cfs_rq->inc_shift < se->weight_shift) {
+ cfs_rq->time_norm_inc >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->time_sum_max >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->inc_shift = se->weight_shift;
+ }
+ cfs_rq->time_sum_max += cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max >= MAX_TIMESUM) {
+ cfs_rq->time_norm_inc >>= 1;
+ cfs_rq->time_sum_max >>= 1;
+ cfs_rq->inc_shift++;
+ }
+ cfs_rq->time_sum_off += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ normalize_time_sum_off(cfs_rq);
+
+ if (&rq_of(cfs_rq)->curr->se == se)
+ cfs_rq->curr = se;
+ if (cfs_rq->curr != se)
+ __enqueue_entity(cfs_rq, se);
+ verify_queue(cfs_rq, 2, se);
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
+{
+ verify_queue(cfs_rq, 3, se);
+
+ cfs_rq->weight_sum -= 1 << se->weight_shift;
+ if (cfs_rq->weight_sum) {
+ cfs_rq->time_sum_max -= cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max < (MAX_TIMESUM >> 1)) {
+ cfs_rq->time_norm_inc <<= 1;
+ cfs_rq->time_sum_max <<= 1;
+ cfs_rq->inc_shift--;
+ }
+
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ normalize_time_sum_off(cfs_rq);
+ } else {
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ BUG_ON(cfs_rq->time_sum_off);
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->inc_shift = 0;
+ }
+
+
+ cfs_rq->nr_running--;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ if (cfs_rq->curr == se)
+ cfs_rq->curr = NULL;
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+}
+
+static inline void
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ cfs_rq->prev_update = rq_of(cfs_rq)->clock;
+ cfs_rq->run_start = se->time_norm;
+ cfs_rq->run_end = cfs_rq->run_end_curr = cfs_rq->run_start + se->req_weight_inv;
+ cfs_rq->curr = se;
+}
+
+static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se = cfs_rq->next ? cfs_rq->next : cfs_rq->rb_leftmost;
+ struct sched_entity *next;
+
+ cfs_rq->next = NULL;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se);
+
+ next = cfs_rq->rb_leftmost;
+ if (next) {
+ cfs_rq->run_end_next = next->time_norm + next->req_weight_inv;
+ cfs_rq->run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(next->time_norm, cfs_rq->run_end_curr));
+ }
+
+ return se;
+}
+
+static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
+{
+ update_curr(cfs_rq);
+ if (prev->on_rq)
+ __enqueue_entity(cfs_rq, prev);
+ cfs_rq->curr = NULL;
+}
+
+static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ update_curr(cfs_rq);
+ normalize_time_sum_off(cfs_rq);
+
+#ifdef CONFIG_SCHED_DEBUG
+{
+ static int cnt = 10;
+ kclock_t avg = get_time_avg(cfs_rq);
+ int inc_curr = 5;
+ RANGE_CHECK(curr, curr);
+}
+#endif
+
+ /*
+ * Reschedule if another task tops the current one.
+ */
+ if (cfs_rq->rb_leftmost && (kclock_t)(curr->time_norm - cfs_rq->run_end) >= 0) {
+ cfs_rq->next = cfs_rq->rb_leftmost;
+ resched_task(rq_of(cfs_rq)->curr);
+ }
+}
+
+/**************************************************
+ * CFS operations on tasks:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* Walk up scheduling entities hierarchy */
+#define for_each_sched_entity(se) \
+ for (; se; se = se->parent)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return grp->my_q;
+}
+
+/* Given a group's cfs_rq on one cpu, return its corresponding cfs_rq on
+ * another cpu ('this_cpu')
+ */
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ /* A later patch will take group into account */
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+
+/* Do the two (enqueued) tasks belong to the same group ? */
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ if (curr->se.cfs_rq == p->se.cfs_rq)
+ return 1;
+
+ return 0;
+}
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+#define for_each_sched_entity(se) \
+ for (; se; se = NULL)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ struct task_struct *p = task_of(se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return NULL;
+}
+
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ return 1;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+/*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+ cfs_rq = cfs_rq_of(se);
+ enqueue_entity(cfs_rq, se);
+ }
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ dequeue_entity(cfs_rq, se, sleep);
+ /* Don't dequeue parent if it has other entities besides us */
+ if (cfs_rq->weight_sum)
+ break;
+ }
+}
+
+/*
+ * sched_yield() support is very simple - we dequeue and enqueue
+ */
+static void yield_task_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *next;
+ __update_rq_clock(rq);
+
+ update_curr(cfs_rq);
+ next = cfs_rq->rb_leftmost;
+ if (next && (kclock_t)(p->se.time_norm - next->time_norm) > 0) {
+ cfs_rq->next = next;
+ return;
+ }
+ cfs_rq->next = &p->se;
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *curr = rq->curr;
+ struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+ unsigned long gran;
+ kclock_t gran_norm;
+
+ if (!cfs_rq->curr || unlikely(rt_prio(p->prio))) {
+ resched_task(curr);
+ return;
+ }
+ update_curr(cfs_rq);
+
+ if (!is_same_group(curr, p) || cfs_rq->rb_leftmost != &p->se)
+ return;
+
+ gran = sysctl_sched_wakeup_granularity;
+ /*
+ * Batch tasks prefer throughput over latency:
+ */
+ if (unlikely(p->policy == SCHED_BATCH))
+ gran = sysctl_sched_batch_wakeup_granularity;
+ gran_norm = cfs_rq->curr->load.inv_weight * (kclock_t)gran;
+
+ if (curr->se.time_norm - cfs_rq->run_start >= gran_norm &&
+ curr->se.time_norm - p->se.time_norm >= gran_norm) {
+ cfs_rq->next = &p->se;
+ resched_task(curr);
+ }
+}
+
+static struct task_struct *pick_next_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (unlikely(!cfs_rq->nr_running))
+ return NULL;
+
+ if (cfs_rq->nr_running == 1 && cfs_rq->curr)
+ return task_of(cfs_rq->curr);
+
+ do {
+ se = pick_next_entity(cfs_rq);
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+ struct sched_entity *se = &prev->se;
+ struct cfs_rq *cfs_rq;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ put_prev_entity(cfs_rq, se);
+ }
+}
+
+/**************************************************
+ * Fair scheduling class load-balancing methods:
+ */
+
+/*
+ * Load-balancing iterator. Note: while the runqueue stays locked
+ * during the whole iteration, the current task might be
+ * dequeued so the iterator has to be dequeue-safe. Here we
+ * achieve that by always pre-iterating before returning
+ * the current task:
+ */
+static inline struct task_struct *
+__load_balance_iterator(struct cfs_rq *cfs_rq)
+{
+ struct task_struct *p;
+ struct rb_node *curr;
+
+ curr = cfs_rq->rb_load_balance_curr;
+ if (!curr)
+ return NULL;
+
+ p = rb_entry(curr, struct task_struct, se.run_node);
+ cfs_rq->rb_load_balance_curr = rb_next(curr);
+
+ return p;
+}
+
+static struct task_struct *load_balance_start_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ cfs_rq->rb_load_balance_curr = cfs_rq->rb_leftmost ?
+ &cfs_rq->rb_leftmost->run_node : NULL;
+ if (cfs_rq->curr)
+ return rb_entry(&cfs_rq->curr->run_node, struct task_struct, se.run_node);
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+static struct task_struct *load_balance_next_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr;
+ struct task_struct *p;
+
+ if (!cfs_rq->nr_running)
+ return MAX_PRIO;
+
+ curr = cfs_rq->rb_leftmost;
+ p = task_of(curr);
+
+ return p->prio;
+}
+#endif
+
+static unsigned long
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_nr_move, unsigned long max_load_move,
+ struct sched_domain *sd, enum cpu_idle_type idle,
+ int *all_pinned, int *this_best_prio)
+{
+ struct cfs_rq *busy_cfs_rq;
+ unsigned long load_moved, total_nr_moved = 0, nr_moved;
+ long rem_load_move = max_load_move;
+ struct rq_iterator cfs_rq_iterator;
+
+ cfs_rq_iterator.start = load_balance_start_fair;
+ cfs_rq_iterator.next = load_balance_next_fair;
+
+ for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct cfs_rq *this_cfs_rq;
+ long imbalance;
+ unsigned long maxload;
+
+ this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
+
+ imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
+ /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
+ if (imbalance <= 0)
+ continue;
+
+ /* Don't pull more than imbalance/2 */
+ imbalance /= 2;
+ maxload = min(rem_load_move, imbalance);
+
+ *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+#else
+# define maxload rem_load_move
+#endif
+ /* pass busy_cfs_rq argument into
+ * load_balance_[start|next]_fair iterators
+ */
+ cfs_rq_iterator.arg = busy_cfs_rq;
+ nr_moved = balance_tasks(this_rq, this_cpu, busiest,
+ max_nr_move, maxload, sd, idle, all_pinned,
+ &load_moved, this_best_prio, &cfs_rq_iterator);
+
+ total_nr_moved += nr_moved;
+ max_nr_move -= nr_moved;
+ rem_load_move -= load_moved;
+
+ if (max_nr_move <= 0 || rem_load_move <= 0)
+ break;
+ }
+
+ return max_load_move - rem_load_move;
+}
+
+/*
+ * scheduler tick hitting a task of our scheduling class:
+ */
+static void task_tick_fair(struct rq *rq, struct task_struct *curr)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &curr->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ entity_tick(cfs_rq, se);
+ }
+}
+
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *se = &p->se;
+ kclock_t time;
+
+ sched_info_queued(p);
+
+ time = (rq->curr->se.time_norm - get_time_avg(cfs_rq)) >> 1;
+ cfs_rq->time_sum_off -= (time << rq->curr->se.weight_shift);
+ normalize_time_sum_off(cfs_rq);
+ rq->curr->se.time_norm -= time;
+ se->time_norm = rq->curr->se.time_norm;
+
+ enqueue_entity(cfs_rq, se);
+ p->se.on_rq = 1;
+
+ cfs_rq->next = se;
+ resched_task(rq->curr);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Account for a task changing its policy or group.
+ *
+ * This routine is mostly called to set cfs_rq->curr field when a task
+ * migrates between groups/classes.
+ */
+static void set_curr_task_fair(struct rq *rq)
+{
+ struct sched_entity *se = &rq->curr.se;
+
+ for_each_sched_entity(se)
+ set_next_entity(cfs_rq_of(se), se);
+}
+#else
+static void set_curr_task_fair(struct rq *rq)
+{
+}
+#endif
+
+/*
+ * All the scheduling class methods:
+ */
+struct sched_class fair_sched_class __read_mostly = {
+ .enqueue_task = enqueue_task_fair,
+ .dequeue_task = dequeue_task_fair,
+ .yield_task = yield_task_fair,
+
+ .check_preempt_curr = check_preempt_curr_fair,
+
+ .pick_next_task = pick_next_task_fair,
+ .put_prev_task = put_prev_task_fair,
+
+ .load_balance = load_balance_fair,
+
+ .set_curr_task = set_curr_task_fair,
+ .task_tick = task_tick_fair,
+ .task_new = task_new_fair,
+};
+
+#ifdef CONFIG_SCHED_DEBUG
+static void print_cfs_stats(struct seq_file *m, int cpu)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+ print_cfs_rq(m, cpu, cfs_rq);
+}
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -211,30 +211,16 @@ static ctl_table root_table[] = {
{ .ctl_name = 0 }
};

-#ifdef CONFIG_SCHED_DEBUG
static unsigned long min_sched_granularity_ns = 100000; /* 100 usecs */
static unsigned long max_sched_granularity_ns = 1000000000; /* 1 second */
static unsigned long min_wakeup_granularity_ns; /* 0 usecs */
static unsigned long max_wakeup_granularity_ns = 1000000000; /* 1 second */
-#endif

static ctl_table kern_table[] = {
-#ifdef CONFIG_SCHED_DEBUG
- {
- .ctl_name = CTL_UNNUMBERED,
- .procname = "sched_min_granularity_ns",
- .data = &sysctl_sched_min_granularity,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .strategy = &sysctl_intvec,
- .extra1 = &min_sched_granularity_ns,
- .extra2 = &max_sched_granularity_ns,
- },
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_latency_ns",
- .data = &sysctl_sched_latency,
+ .procname = "sched_granularity_ns",
+ .data = &sysctl_sched_granularity,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
@@ -264,6 +250,7 @@ static ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SCHED_DEBUG
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_stat_granularity_ns",
@@ -277,17 +264,6 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_runtime_limit_ns",
- .data = &sysctl_sched_runtime_limit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .strategy = &sysctl_intvec,
- .extra1 = &min_sched_granularity_ns,
- .extra2 = &max_sched_granularity_ns,
- },
- {
- .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),

2007-09-08 07:57:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Fri, 2007-09-07 at 17:35 +0200, Roman Zippel wrote:
> Hi,
>
> On Sun, 2 Sep 2007, Ingo Molnar wrote:
>
> Below is a patch updated against the latest git tree, no major changes.

Interesting, I see major behavioral changes.

I still see an aberration with fairtest2. On startup, the hog component
will consume 100% cpu for a bit, then the sleeper shows up. This
doesn't always happen, but it happens quite often. The time it takes for
the sleeper to show up varies, but is roughly as below. With the
previous kernel, sleeper starvation was permanent once the hog had run
for a bit. That behavior has now inverted itself: starvation goes away
after a bit.

6573 root 20 0 1568 468 384 R 51 0.0 0:08.30 1 fairtest2
6574 root 20 0 1568 112 28 R 47 0.0 0:01.07 1 fairtest2

Once it coughs up its fairtest2 startup-furball, it does well on mixed
load (various interval sleepers + hog) fairness. The previous kernel
did not do at all well at mixed load: as soon as I added a hog, it was
all over for sleepers, and even without a hog, fairness among the
various interval sleepers was not at all good.

On the interactivity front, your first cut was not really usable here,
but this one seems fine <not heavily tested caveat>. One test I do here
is Amarok song-switch time under a hefty kbuild load, and this kernel
passes that test just fine. I didn't try this test with your first cut,
because it was very ragged even under modest load. All in all, this one
is behaving quite well, which is radically different from my experience
with the first cut.

Debug messages triggered on boot, but haven't triggered since.

[ 113.426575] audit(1189232695.670:2): audit_backlog_limit=256 old=64 by auid=4294967295 res=1
[ 113.953958] IA-32 Microcode Update Driver: v1.14a <[email protected]>
[ 115.851209] audit(1189232698.095:3): audit_pid=5597 old=0 by auid=4294967295
[ 116.707631] 2,f73035a0(5624): 1fa78027f5c,1fb37ff0000,f7d035a0(5626),1
[ 116.723091] WARNING: at kernel/sched_norm.c:243 verify_queue()
[ 116.737979] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 116.752191] [<c0105d83>] show_trace+0x12/0x14
[ 116.765544] [<c0105d9b>] dump_stack+0x16/0x18
[ 116.778579] [<c011e489>] verify_queue+0x158/0x388
[ 116.791716] [<c011e6db>] enqueue_entity+0x22/0x19a
[ 116.804880] [<c011e9e2>] task_new_fair+0xa9/0xd2
[ 116.817724] [<c0123c47>] wake_up_new_task+0x9e/0xad
[ 116.830723] [<c0125ddc>] do_fork+0x13c/0x205
[ 116.843063] [<c0102258>] sys_clone+0x33/0x39
[ 116.855353] [<c0104182>] sysenter_past_esp+0x5f/0x85
[ 116.868344] =======================
[ 116.879674] 2,f7303ae0(5618): 1fbf7fca89d,1fb37ff0000,f7d035a0(5626),1
[ 116.894100] WARNING: at kernel/sched_norm.c:250 verify_queue()
[ 116.907863] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 116.920912] [<c0105d83>] show_trace+0x12/0x14
[ 116.933141] [<c0105d9b>] dump_stack+0x16/0x18
[ 116.945361] [<c011e5b9>] verify_queue+0x288/0x388
[ 116.957945] [<c011e6db>] enqueue_entity+0x22/0x19a
[ 116.970363] [<c011e9e2>] task_new_fair+0xa9/0xd2
[ 116.982308] [<c0123c47>] wake_up_new_task+0x9e/0xad
[ 116.994597] [<c0125ddc>] do_fork+0x13c/0x205
[ 117.006331] [<c0102258>] sys_clone+0x33/0x39
[ 117.017848] [<c0104182>] sysenter_past_esp+0x5f/0x85
[ 117.029869] =======================

<repeats dozen times or so>


2007-09-08 08:24:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Sat, 2007-09-08 at 09:56 +0200, Mike Galbraith wrote:

> <repeats dozen times or so>

They weren't all repeats after all; the last few were...

[ 120.267389] 2,f73035a0(5624): 1fa7e90b58c,1fb3b460000,f73035a0(5624),5
[ 120.281110] WARNING: at kernel/sched_norm.c:413 entity_tick()
[ 120.294101] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 120.306425] [<c0105d83>] show_trace+0x12/0x14
[ 120.317857] [<c0105d9b>] dump_stack+0x16/0x18
[ 120.329124] [<c011eb68>] task_tick_fair+0x15d/0x170
[ 120.340988] [<c01221f3>] scheduler_tick+0x18d/0x2d5
[ 120.352896] [<c012f91d>] update_process_times+0x44/0x63
[ 120.365177] [<c014161e>] tick_sched_timer+0x5c/0xbd
[ 120.377146] [<c013ca61>] hrtimer_interrupt+0x12c/0x1a0
[ 120.389392] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 120.402114] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 120.414491] [<c0123c1b>] wake_up_new_task+0x72/0xad
[ 120.426477] [<c0125ddc>] do_fork+0x13c/0x205
[ 120.437795] [<c0102258>] sys_clone+0x33/0x39
[ 120.449062] [<c0104182>] sysenter_past_esp+0x5f/0x85
[ 120.461047] =======================


2007-09-10 23:23:41

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Sat, 8 Sep 2007, Mike Galbraith wrote:

> > On Sun, 2 Sep 2007, Ingo Molnar wrote:
> >
> > Below is a patch updated against the latest git tree, no major changes.
>
> Interesting, I see major behavioral changes.
>
> I still see an aberration with fairtest2. On startup, the hog component
> will consume 100% cpu for a bit, then the sleeper shows up. This
> doesn't always happen, but happens quite often.

I found the problem for this. What basically happened is that a task which
hadn't been running for a second was enqueued first on an idle queue, and it
kept that advantage over tasks that had been running more recently until it
caught up. The new version now remembers where the last task left off and
uses that value for the first task which restarts the queue. As a side
effect it also limits the bonus a task gets if multiple tasks are woken at
the same time.
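
To illustrate the new enqueue rule, here is a simplified standalone
sketch (not the patch itself; only the field names mirror the real code,
the types and the main() driver are made up for the example):

	#include <stdint.h>
	#include <stdio.h>

	typedef int64_t kclock_t;

	/* stripped-down view of the state the rule operates on */
	struct rq_sketch { kclock_t time_avg_min; kclock_t time_avg; };
	struct se_sketch { kclock_t time_norm; kclock_t req_weight_inv; };

	static kclock_t kmax(kclock_t x, kclock_t y) { return x > y ? x : y; }

	/* Remember the furthest point the queue has reached (time_avg_min)
	 * and never let a waking task start more than one requested slice
	 * behind it, so a long sleeper cannot carry an unbounded advantage
	 * onto an otherwise idle queue. */
	static void enqueue_sketch(struct rq_sketch *rq, struct se_sketch *se)
	{
		rq->time_avg_min = kmax(rq->time_avg_min, rq->time_avg);
		se->time_norm = kmax(rq->time_avg_min - se->req_weight_inv,
				     se->time_norm);
	}

	int main(void)
	{
		struct rq_sketch rq = { .time_avg_min = 1000, .time_avg = 1000 };
		struct se_sketch sleeper = { .time_norm = 0, .req_weight_inv = 100 };

		enqueue_sketch(&rq, &sleeper);
		/* the sleeper is pulled up to 900 instead of keeping its old 0 */
		printf("%lld\n", (long long)sleeper.time_norm);
		return 0;
	}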

This version also limits runtime overruns, that is, cases where a task
exceeds its time slice because the timer interrupt was disabled and so
the current task wasn't preempted. This usually happens during boot, but
also when using a serial console.
bye, Roman

---
include/linux/sched.h | 26 -
kernel/sched.c | 173 +++++----
kernel/sched_debug.c | 26 -
kernel/sched_norm.c | 870 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 30 -
5 files changed, 977 insertions(+), 148 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -884,40 +884,28 @@ struct load_weight {
*
* Current field usage histogram:
*
- * 4 se->block_start
* 4 se->run_node
- * 4 se->sleep_start
- * 4 se->sleep_start_fair
* 6 se->load.weight
- * 7 se->delta_fair
- * 15 se->wait_runtime
*/
struct sched_entity {
- long wait_runtime;
- unsigned long delta_fair_run;
- unsigned long delta_fair_sleep;
- unsigned long delta_exec;
- s64 fair_key;
+ s64 time_key;
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
- unsigned int on_rq;
+ unsigned int on_rq, queued;
+ unsigned int weight_shift;

u64 exec_start;
u64 sum_exec_runtime;
- u64 prev_sum_exec_runtime;
- u64 wait_start_fair;
- u64 sleep_start_fair;
+ s64 time_norm;
+ s64 req_weight_inv;

#ifdef CONFIG_SCHEDSTATS
- u64 wait_start;
u64 wait_max;
s64 sum_wait_runtime;

- u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;

- u64 block_start;
u64 block_max;
u64 exec_max;

@@ -1400,12 +1388,10 @@ static inline void idle_task_exit(void)

extern void sched_idle_next(void);

-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_batch_wakeup_granularity;
extern unsigned int sysctl_sched_stat_granularity;
-extern unsigned int sysctl_sched_runtime_limit;
extern unsigned int sysctl_sched_child_runs_first;
extern unsigned int sysctl_sched_features;

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -183,18 +183,24 @@ struct cfs_rq {

s64 fair_clock;
u64 exec_clock;
- s64 wait_runtime;
u64 sleeper_bonus;
unsigned long wait_runtime_overruns, wait_runtime_underruns;

+ u64 prev_update;
+ s64 time_norm_base, time_norm_inc, time_avg_min;
+ u64 run_start, run_end;
+ u64 run_end_next, run_end_curr;
+ s64 time_sum_max, time_sum_off;
+ unsigned long inc_shift, weight_sum;
+
struct rb_root tasks_timeline;
- struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
-#ifdef CONFIG_FAIR_GROUP_SCHED
/* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
- struct sched_entity *curr;
+ struct sched_entity *curr, *next;
+ struct sched_entity *rb_leftmost;
+#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

/* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
@@ -231,12 +237,16 @@ struct rq {
*/
unsigned long nr_running;
#define CPU_LOAD_IDX_MAX 5
+#ifdef CONFIG_SMP
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+#endif
unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
+#ifdef CONFIG_SMP
struct load_stat ls; /* capture load from *all* tasks on this cpu */
+#endif
unsigned long nr_load_updates;
u64 nr_switches;

@@ -636,13 +646,6 @@ static void resched_cpu(int cpu)
resched_task(cpu_curr(cpu));
spin_unlock_irqrestore(&rq->lock, flags);
}
-#else
-static inline void resched_task(struct task_struct *p)
-{
- assert_spin_locked(&task_rq(p)->lock);
- set_tsk_need_resched(p);
-}
-#endif

static u64 div64_likely32(u64 divident, unsigned long divisor)
{
@@ -657,6 +660,13 @@ static u64 div64_likely32(u64 divident,
#endif
}

+#else
+static inline void resched_task(struct task_struct *p)
+{
+ assert_spin_locked(&task_rq(p)->lock);
+ set_tsk_need_resched(p);
+}
+#endif
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -734,15 +744,33 @@ static void update_load_sub(struct load_
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
+#ifdef SANE_NICE_LEVEL
+static const int prio_to_weight[40] = {
+ 3567, 3102, 2703, 2351, 2048, 1783, 1551, 1351, 1177, 1024,
+ 892, 776, 676, 588, 512, 446, 388, 338, 294, 256,
+ 223, 194, 169, 147, 128, 111, 97, 84, 74, 64,
+ 56, 49, 42, 37, 32, 28, 24, 21, 18, 16
+};
+
+static const u32 prio_to_wmult[40] = {
+ 294, 338, 388, 446, 512, 588, 676, 776, 891, 1024,
+ 1176, 1351, 1552, 1783, 2048, 2353, 2702, 3104, 3566, 4096,
+ 4705, 5405, 6208, 7132, 8192, 9410, 10809, 12417, 14263, 16384,
+ 18820, 21619, 24834, 28526, 32768, 37641, 43238, 49667, 57052, 65536
+};
+
+static const u32 prio_to_wshift[40] = {
+ 8, 8, 7, 7, 7, 7, 7, 6, 6, 6,
+ 6, 6, 5, 5, 5, 5, 5, 4, 4, 4,
+ 4, 4, 3, 3, 3, 3, 3, 2, 2, 2,
+ 2, 2, 1, 1, 1, 1, 1, 0, 0, 0
+};
+#else
static const int prio_to_weight[40] = {
- /* -20 */ 88761, 71755, 56483, 46273, 36291,
- /* -15 */ 29154, 23254, 18705, 14949, 11916,
- /* -10 */ 9548, 7620, 6100, 4904, 3906,
- /* -5 */ 3121, 2501, 1991, 1586, 1277,
- /* 0 */ 1024, 820, 655, 526, 423,
- /* 5 */ 335, 272, 215, 172, 137,
- /* 10 */ 110, 87, 70, 56, 45,
- /* 15 */ 36, 29, 23, 18, 15,
+ 95325, 74898, 61681, 49932, 38836, 31775, 24966, 20165, 16132, 12945,
+ 10382, 8257, 6637, 5296, 4228, 3393, 2709, 2166, 1736, 1387,
+ 1111, 888, 710, 568, 455, 364, 291, 233, 186, 149,
+ 119, 95, 76, 61, 49, 39, 31, 25, 20, 16
};

/*
@@ -753,16 +781,20 @@ static const int prio_to_weight[40] = {
* into multiplications:
*/
static const u32 prio_to_wmult[40] = {
- /* -20 */ 48388, 59856, 76040, 92818, 118348,
- /* -15 */ 147320, 184698, 229616, 287308, 360437,
- /* -10 */ 449829, 563644, 704093, 875809, 1099582,
- /* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
- /* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
- /* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
- /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
- /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
+ 11, 14, 17, 21, 27, 33, 42, 52, 65, 81,
+ 101, 127, 158, 198, 248, 309, 387, 484, 604, 756,
+ 944, 1181, 1476, 1845, 2306, 2882, 3603, 4504, 5629, 7037,
+ 8796, 10995, 13744, 17180, 21475, 26844, 33554, 41943, 52429, 65536
};

+static const u32 prio_to_wshift[40] = {
+ 13, 12, 12, 12, 11, 11, 11, 10, 10, 10,
+ 9, 9, 9, 8, 8, 8, 7, 7, 7, 6,
+ 6, 6, 5, 5, 5, 5, 4, 4, 4, 3,
+ 3, 3, 2, 2, 2, 1, 1, 1, 0, 0
+};
+#endif
+
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup);

/*
@@ -784,7 +816,8 @@ static int balance_tasks(struct rq *this

#include "sched_stats.h"
#include "sched_rt.c"
-#include "sched_fair.c"
+//#include "sched_fair.c"
+#include "sched_norm.c"
#include "sched_idletask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -792,6 +825,7 @@ static int balance_tasks(struct rq *this

#define sched_class_highest (&rt_sched_class)

+#ifdef CONFIG_SMP
static void __update_curr_load(struct rq *rq, struct load_stat *ls)
{
if (rq->curr != rq->idle && ls->load.weight) {
@@ -843,6 +877,14 @@ static inline void dec_load(struct rq *r
update_curr_load(rq);
update_load_sub(&rq->ls.load, p->se.load.weight);
}
+#else
+static inline void inc_load(struct rq *rq, const struct task_struct *p)
+{
+}
+static inline void dec_load(struct rq *rq, const struct task_struct *p)
+{
+}
+#endif

static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
@@ -858,9 +900,6 @@ static void dec_nr_running(struct task_s

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->cfs.wait_runtime -= p->se.wait_runtime;
- p->se.wait_runtime = 0;
-
if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
@@ -878,6 +917,8 @@ static void set_load_weight(struct task_

p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
+ p->se.weight_shift = prio_to_wshift[p->static_prio - MAX_RT_PRIO];
+ p->se.req_weight_inv = p->se.load.inv_weight * (kclock_t)sysctl_sched_granularity;
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -986,11 +1027,13 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}

+#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
return cpu_rq(cpu)->ls.load.weight;
}
+#endif

static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
@@ -1004,27 +1047,6 @@ static inline void __set_task_cpu(struct

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
- int old_cpu = task_cpu(p);
- struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
- u64 clock_offset, fair_clock_offset;
-
- clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->cfs.fair_clock - new_rq->cfs.fair_clock;
-
- if (p->se.wait_start_fair)
- p->se.wait_start_fair -= fair_clock_offset;
- if (p->se.sleep_start_fair)
- p->se.sleep_start_fair -= fair_clock_offset;
-
-#ifdef CONFIG_SCHEDSTATS
- if (p->se.wait_start)
- p->se.wait_start -= clock_offset;
- if (p->se.sleep_start)
- p->se.sleep_start -= clock_offset;
- if (p->se.block_start)
- p->se.block_start -= clock_offset;
-#endif
-
__set_task_cpu(p, new_cpu);
}

@@ -1584,22 +1606,12 @@ int fastcall wake_up_state(struct task_s
*/
static void __sched_fork(struct task_struct *p)
{
- p->se.wait_start_fair = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
- p->se.prev_sum_exec_runtime = 0;
- p->se.delta_exec = 0;
- p->se.delta_fair_run = 0;
- p->se.delta_fair_sleep = 0;
- p->se.wait_runtime = 0;
- p->se.sleep_start_fair = 0;

#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
p->se.sum_wait_runtime = 0;
p->se.sum_sleep_runtime = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
p->se.sleep_max = 0;
p->se.block_max = 0;
p->se.exec_max = 0;
@@ -1971,6 +1983,7 @@ unsigned long nr_active(void)
return running + uninterruptible;
}

+#ifdef CONFIG_SMP
/*
* Update rq->cpu_load[] statistics. This function is usually called every
* scheduler tick (TICK_NSEC).
@@ -2027,8 +2040,6 @@ do_avg:
}
}

-#ifdef CONFIG_SMP
-
/*
* double_rq_lock - safely lock two runqueues
*
@@ -3350,7 +3361,9 @@ void scheduler_tick(void)
if (unlikely(rq->clock < next_tick))
rq->clock = next_tick;
rq->tick_timestamp = rq->clock;
+#ifdef CONFIG_SMP
update_cpu_load(rq);
+#endif
if (curr != rq->idle) /* FIXME: needed? */
curr->sched_class->task_tick(rq, curr);
spin_unlock(&rq->lock);
@@ -4914,16 +4927,12 @@ static inline void sched_init_granularit
unsigned int factor = 1 + ilog2(num_online_cpus());
const unsigned long limit = 100000000;

- sysctl_sched_min_granularity *= factor;
- if (sysctl_sched_min_granularity > limit)
- sysctl_sched_min_granularity = limit;
-
- sysctl_sched_latency *= factor;
- if (sysctl_sched_latency > limit)
- sysctl_sched_latency = limit;
+ sysctl_sched_granularity *= factor;
+ if (sysctl_sched_granularity > limit)
+ sysctl_sched_granularity = limit;

- sysctl_sched_runtime_limit = sysctl_sched_latency;
- sysctl_sched_wakeup_granularity = sysctl_sched_min_granularity / 2;
+ sysctl_sched_wakeup_granularity = sysctl_sched_granularity / 2;
+ sysctl_sched_batch_wakeup_granularity = sysctl_sched_granularity;
}

#ifdef CONFIG_SMP
@@ -6516,6 +6525,9 @@ static inline void init_cfs_rq(struct cf
{
cfs_rq->tasks_timeline = RB_ROOT;
cfs_rq->fair_clock = 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->inc_shift = 0;
#ifdef CONFIG_FAIR_GROUP_SCHED
cfs_rq->rq = rq;
#endif
@@ -6523,7 +6535,6 @@ static inline void init_cfs_rq(struct cf

void __init sched_init(void)
{
- u64 now = sched_clock();
int highest_cpu = 0;
int i, j;

@@ -6548,12 +6559,11 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
list_add(&rq->cfs.leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
#endif
- rq->ls.load_update_last = now;
- rq->ls.load_update_start = now;
+#ifdef CONFIG_SMP
+ rq->ls.load_update_last = rq->ls.load_update_start = sched_clock();

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
-#ifdef CONFIG_SMP
rq->sd = NULL;
rq->active_balance = 0;
rq->next_balance = jiffies;
@@ -6643,16 +6653,7 @@ void normalize_rt_tasks(void)

read_lock_irq(&tasklist_lock);
do_each_thread(g, p) {
- p->se.fair_key = 0;
- p->se.wait_runtime = 0;
p->se.exec_start = 0;
- p->se.wait_start_fair = 0;
- p->se.sleep_start_fair = 0;
-#ifdef CONFIG_SCHEDSTATS
- p->se.wait_start = 0;
- p->se.sleep_start = 0;
- p->se.block_start = 0;
-#endif
task_rq(p)->cfs.fair_clock = 0;
task_rq(p)->clock = 0;

Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -38,9 +38,9 @@ print_task(struct seq_file *m, struct rq

SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d ",
p->comm, p->pid,
- (long long)p->se.fair_key,
- (long long)(p->se.fair_key - rq->cfs.fair_clock),
- (long long)p->se.wait_runtime,
+ (long long)p->se.time_norm >> 16,
+ (long long)((p->se.time_key >> 16) - rq->cfs.fair_clock),
+ ((long long)((rq->cfs.fair_clock << 16) - p->se.time_norm) * p->se.load.weight) >> 20,
(long long)(p->nvcsw + p->nivcsw),
p->prio);
#ifdef CONFIG_SCHEDSTATS
@@ -73,6 +73,7 @@ static void print_rq(struct seq_file *m,

read_lock_irq(&tasklist_lock);

+ rq->cfs.fair_clock = get_time_avg(&rq->cfs) >> 16;
do_each_thread(g, p) {
if (!p->se.on_rq || task_cpu(p) != rq_cpu)
continue;
@@ -93,10 +94,10 @@ print_cfs_rq_runtime_sum(struct seq_file
struct rq *rq = &per_cpu(runqueues, cpu);

spin_lock_irqsave(&rq->lock, flags);
- curr = first_fair(cfs_rq);
+ curr = cfs_rq->rb_leftmost ? &cfs_rq->rb_leftmost->run_node : NULL;
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
- wait_runtime_rq_sum += p->se.wait_runtime;
+ //wait_runtime_rq_sum += p->se.wait_runtime;

curr = rb_next(curr);
}
@@ -110,12 +111,12 @@ void print_cfs_rq(struct seq_file *m, in
{
SEQ_printf(m, "\ncfs_rq\n");

+ cfs_rq->fair_clock = get_time_avg(cfs_rq) >> 16;
#define P(x) \
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(cfs_rq->x))

P(fair_clock);
P(exec_clock);
- P(wait_runtime);
P(wait_runtime_overruns);
P(wait_runtime_underruns);
P(sleeper_bonus);
@@ -143,10 +144,12 @@ static void print_cpu(struct seq_file *m
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rq->x))

P(nr_running);
+#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "load",
rq->ls.load.weight);
P(ls.delta_fair);
P(ls.delta_exec);
+#endif
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
@@ -160,11 +163,13 @@ static void print_cpu(struct seq_file *m
P(clock_overflows);
P(clock_deep_idle_events);
P(clock_max_delta);
+#ifdef CONFIG_SMP
P(cpu_load[0]);
P(cpu_load[1]);
P(cpu_load[2]);
P(cpu_load[3]);
P(cpu_load[4]);
+#endif
#undef P

print_cfs_stats(m, cpu);
@@ -241,16 +246,7 @@ void proc_sched_show_task(struct task_st
#define P(F) \
SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)

- P(se.wait_runtime);
- P(se.wait_start_fair);
- P(se.exec_start);
- P(se.sleep_start_fair);
- P(se.sum_exec_runtime);
-
#ifdef CONFIG_SCHEDSTATS
- P(se.wait_start);
- P(se.sleep_start);
- P(se.block_start);
P(se.sleep_max);
P(se.block_max);
P(se.exec_max);
Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_norm.c
@@ -0,0 +1,870 @@
+/*
+ * Completely Fair Scheduling (CFS) Class (SCHED_NORMAL/SCHED_BATCH)
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * Interactivity improvements by Mike Galbraith
+ * (C) 2007 Mike Galbraith <[email protected]>
+ *
+ * Various enhancements by Dmitry Adamushko.
+ * (C) 2007 Dmitry Adamushko <[email protected]>
+ *
+ * Group scheduling enhancements by Srivatsa Vaddagiri
+ * Copyright IBM Corporation, 2007
+ * Author: Srivatsa Vaddagiri <[email protected]>
+ *
+ * Scaled math optimizations by Thomas Gleixner
+ * Copyright (C) 2007, Thomas Gleixner <[email protected]>
+ *
+ * Really fair scheduling
+ * Copyright (C) 2007, Roman Zippel <[email protected]>
+ */
+
+typedef s64 kclock_t;
+
+static inline kclock_t kclock_max(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) > 0 ? x : y;
+}
+static inline kclock_t kclock_min(kclock_t x, kclock_t y)
+{
+ return (kclock_t)(x - y) < 0 ? x : y;
+}
+
+#define MSHIFT 16
+#define MAX_TIMESUM ((kclock_t)1 << (30 + MSHIFT))
+
+/*
+ * Preemption granularity:
+ * (default: 10 msec, units: nanoseconds)
+ *
+ * NOTE: this granularity value is not the same as the concept of
+ * 'timeslice length' - timeslices in CFS will typically be somewhat
+ * larger than this value. (to see the precise effective timeslice
+ * length of your workload, run vmstat and monitor the context-switches
+ * field)
+ *
+ * On SMP systems the value of this is multiplied by the log2 of the
+ * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way
+ * systems, 4x on 8-way systems, 5x on 16-way systems, etc.)
+ */
+unsigned int sysctl_sched_granularity __read_mostly = 10000000ULL;
+
+/*
+ * SCHED_BATCH wake-up granularity.
+ * (default: 10 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_batch_wakeup_granularity __read_mostly =
+ 10000000ULL;
+
+/*
+ * SCHED_OTHER wake-up granularity.
+ * (default: 2 msec, units: nanoseconds)
+ *
+ * This option delays the preemption effects of decoupled workloads
+ * and reduces their over-scheduling. Synchronous workloads will still
+ * have immediate wakeup/sleep latencies.
+ */
+unsigned int sysctl_sched_wakeup_granularity __read_mostly = 2000000ULL;
+
+unsigned int sysctl_sched_stat_granularity __read_mostly;
+
+/*
+ * Debugging: various feature bits
+ */
+enum {
+ SCHED_FEAT_FAIR_SLEEPERS = 1,
+ SCHED_FEAT_SLEEPER_AVG = 2,
+ SCHED_FEAT_SLEEPER_LOAD_AVG = 4,
+ SCHED_FEAT_PRECISE_CPU_LOAD = 8,
+ SCHED_FEAT_START_DEBIT = 16,
+ SCHED_FEAT_SKIP_INITIAL = 32,
+};
+
+unsigned int sysctl_sched_features __read_mostly =
+ SCHED_FEAT_FAIR_SLEEPERS *1 |
+ SCHED_FEAT_SLEEPER_AVG *1 |
+ SCHED_FEAT_SLEEPER_LOAD_AVG *1 |
+ SCHED_FEAT_PRECISE_CPU_LOAD *1 |
+ SCHED_FEAT_START_DEBIT *1 |
+ SCHED_FEAT_SKIP_INITIAL *0;
+
+extern struct sched_class fair_sched_class;
+
+static kclock_t get_time_avg(struct cfs_rq *cfs_rq)
+{
+ kclock_t avg;
+
+ avg = cfs_rq->time_norm_base;
+ if (cfs_rq->weight_sum)
+ avg += (kclock_t)((int)(cfs_rq->time_sum_off >> MSHIFT) / cfs_rq->weight_sum) << MSHIFT;
+
+ return avg;
+}
+
+static void normalize_time_sum_off(struct cfs_rq *cfs_rq)
+{
+ if (unlikely(cfs_rq->time_sum_off >= cfs_rq->time_sum_max)) {
+ cfs_rq->time_sum_off -= cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base += cfs_rq->time_norm_inc;
+ } else if (unlikely(cfs_rq->time_sum_off < 0)) {
+ cfs_rq->time_sum_off += cfs_rq->time_sum_max;
+ cfs_rq->time_norm_base -= cfs_rq->time_norm_inc;
+ }
+}
+
+/* When determining the end runtime of the current task, two main cases have to
+ * be considered relative to the next task:
+ * - if the next task has a higher priority, it will shorten the runtime of the
+ * current task by using run_end_next
+ * - if the next task has a lower priority, it may have gotten already enough
+ * time, so we keep the current task running until it's close enough again
+ */
+static void update_run_end(struct cfs_rq *cfs_rq)
+{
+ kclock_t run_end;
+
+ run_end = cfs_rq->run_end_curr;
+ if (cfs_rq->rb_leftmost)
+ run_end = kclock_min(cfs_rq->run_end_next,
+ kclock_max(cfs_rq->rb_leftmost->time_norm, run_end));
+ cfs_rq->run_end = run_end;
+}
+
+/**************************************************************
+ * CFS operations on generic schedulable entities:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* cpu runqueue to which this cfs_rq is attached */
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rq;
+}
+
+/* An entity is a task if it doesn't "own" a runqueue */
+#define entity_is_task(se) (!se->my_q)
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return container_of(cfs_rq, struct rq, cfs);
+}
+
+#define entity_is_task(se) 1
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline struct task_struct *task_of(struct sched_entity *se)
+{
+ return container_of(se, struct task_struct, se);
+}
+
+
+/**************************************************************
+ * Scheduling class tree data structure manipulation methods:
+ */
+
+/*
+ * Enqueue an entity into the rb-tree:
+ */
+static void
+__enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_entity *entry;
+ kclock_t key;
+ int leftmost = 1;
+
+ se->time_key = key = se->time_norm + (se->req_weight_inv >> 1);
+ /*
+ * Find the right place in the rbtree:
+ */
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_entity, run_node);
+ /*
+ * We dont care about collisions. Nodes with
+ * the same key stay together.
+ */
+ if (key - entry->time_key < 0) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ /*
+ * Maintain a cache of leftmost tree entries (it is frequently
+ * used):
+ */
+ if (leftmost) {
+ cfs_rq->rb_leftmost = se;
+ cfs_rq->run_end_next = se->time_norm + se->req_weight_inv;
+ }
+
+ rb_link_node(&se->run_node, parent, link);
+ rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_add(&cfs_rq->load, se->load.weight);
+ se->queued = 1;
+}
+
+static void
+__dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (cfs_rq->rb_leftmost == se) {
+ struct sched_entity *next = NULL;
+ struct rb_node *next_node = rb_next(&se->run_node);
+ if (next_node) {
+ next = rb_entry(next_node, struct sched_entity, run_node);
+ cfs_rq->run_end_next = next->time_norm + next->req_weight_inv;
+ }
+ cfs_rq->rb_leftmost = next;
+ }
+ rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
+ update_load_sub(&cfs_rq->load, se->load.weight);
+ se->queued = 0;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static void verify_queue(struct cfs_rq *cfs_rq, int inc_curr, struct sched_entity *se2)
+{
+ struct rb_node *node;
+ struct sched_entity *se;
+ kclock_t avg, sum = 0;
+ static int busy = 0, cnt = 10;
+
+#define RANGE_CHECK(se, se2) ({ \
+ if (cnt > 0 && abs(avg - se->time_norm) > (se->req_weight_inv << 2)) { \
+ cnt--; \
+ printk("%ld,%p(%u,%lx,%Lx): %Lx,%Lx,%p(%u,%lx,%Lx),%d\n", \
+ cfs_rq->nr_running, se, task_of(se)->pid, \
+ se->load.inv_weight, se->req_weight_inv, \
+ se->time_norm, avg, se2, task_of(se2)->pid, \
+ se2->load.inv_weight, se2->req_weight_inv, inc_curr); \
+ WARN_ON(1); \
+ } })
+
+ if (busy)
+ return;
+ busy = 1;
+
+ avg = get_time_avg(cfs_rq);
+ se = cfs_rq->curr;
+ if (inc_curr && se) {
+ sum = (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ RANGE_CHECK(se, se2);
+ }
+ node = rb_first(&cfs_rq->tasks_timeline);
+ WARN_ON(node && cfs_rq->rb_leftmost != rb_entry(node, struct sched_entity, run_node));
+ while (node) {
+ se = rb_entry(node, struct sched_entity, run_node);
+ sum += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ RANGE_CHECK(se, se2);
+ node = rb_next(node);
+ }
+ if (sum != cfs_rq->time_sum_off) {
+ kclock_t oldsum = cfs_rq->time_sum_off;
+ cfs_rq->time_sum_off = sum;
+ printk("%ld:%Lx,%Lx,%p,%p,%d\n", cfs_rq->nr_running, sum, oldsum, cfs_rq->curr, se2, inc_curr);
+ WARN_ON(1);
+ }
+ busy = 0;
+}
+#else
+#define verify_queue(q,c,s) ((void)0)
+#endif
+
+/**************************************************************
+ * Scheduling class statistics methods:
+ */
+
+/*
+ * Update the current task's runtime statistics. Skip current tasks that
+ * are not in our scheduling class.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ kclock_t now = rq_of(cfs_rq)->clock;
+ unsigned long delta_exec;
+ kclock_t delta_norm;
+
+ if (unlikely(!curr))
+ return;
+
+ delta_exec = now - cfs_rq->prev_update;
+ if (!delta_exec)
+ return;
+ cfs_rq->prev_update = now;
+
+ curr->sum_exec_runtime += delta_exec;
+ cfs_rq->exec_clock += delta_exec;
+
+ delta_norm = (kclock_t)delta_exec * curr->load.inv_weight;
+ curr->time_norm += delta_norm;
+ if (unlikely((kclock_t)(curr->time_norm - cfs_rq->run_end) > curr->req_weight_inv)) {
+ if (cfs_rq->rb_leftmost) {
+ kclock_t end = cfs_rq->run_end + curr->req_weight_inv;
+ cfs_rq->wait_runtime_overruns++;
+ delta_norm -= curr->time_norm - end;
+ curr->time_norm = end;
+ } else
+ cfs_rq->run_end = cfs_rq->run_end_curr = curr->time_norm + curr->req_weight_inv;
+ }
+ cfs_rq->time_sum_off += delta_norm << curr->weight_shift;
+
+ verify_queue(cfs_rq, 4, curr);
+}
+
+static void
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+ cfs_rq->time_avg_min = kclock_max(cfs_rq->time_avg_min, get_time_avg(cfs_rq));
+ se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+
+ cfs_rq->nr_running++;
+ cfs_rq->weight_sum += 1 << se->weight_shift;
+ if (cfs_rq->inc_shift < se->weight_shift) {
+ cfs_rq->time_norm_inc >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->time_sum_max >>= se->weight_shift - cfs_rq->inc_shift;
+ cfs_rq->inc_shift = se->weight_shift;
+ }
+ cfs_rq->time_sum_max += cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max >= MAX_TIMESUM) {
+ cfs_rq->time_norm_inc >>= 1;
+ cfs_rq->time_sum_max >>= 1;
+ cfs_rq->inc_shift++;
+ }
+ cfs_rq->time_sum_off += (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ normalize_time_sum_off(cfs_rq);
+
+ if (&rq_of(cfs_rq)->curr->se == se) {
+ cfs_rq->curr = se;
+ cfs_rq->run_end_curr = cfs_rq->run_start + se->req_weight_inv;
+ }
+ if (cfs_rq->curr != se)
+ __enqueue_entity(cfs_rq, se);
+ update_run_end(cfs_rq);
+ verify_queue(cfs_rq, 2, se);
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
+{
+ verify_queue(cfs_rq, 3, se);
+
+ cfs_rq->weight_sum -= 1 << se->weight_shift;
+ if (cfs_rq->weight_sum) {
+ cfs_rq->time_sum_max -= cfs_rq->time_norm_inc << se->weight_shift;
+ while (cfs_rq->time_sum_max < (MAX_TIMESUM >> 1)) {
+ cfs_rq->time_norm_inc <<= 1;
+ cfs_rq->time_sum_max <<= 1;
+ cfs_rq->inc_shift--;
+ }
+
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ normalize_time_sum_off(cfs_rq);
+ } else {
+ cfs_rq->time_avg_min = kclock_max(cfs_rq->time_avg_min, se->time_norm);
+ cfs_rq->time_sum_off -= (se->time_norm - cfs_rq->time_norm_base) << se->weight_shift;
+ BUG_ON(cfs_rq->time_sum_off);
+ cfs_rq->time_norm_inc = MAX_TIMESUM >> 1;
+ cfs_rq->time_sum_max = 0;
+ cfs_rq->inc_shift = 0;
+ }
+
+
+ cfs_rq->nr_running--;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ update_run_end(cfs_rq);
+ if (cfs_rq->curr == se)
+ cfs_rq->curr = NULL;
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ verify_queue(cfs_rq, cfs_rq->curr != se, se);
+}
+
+static inline void
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ cfs_rq->prev_update = rq_of(cfs_rq)->clock;
+ cfs_rq->run_start = se->time_norm;
+ cfs_rq->run_end_curr = cfs_rq->run_start + se->req_weight_inv;
+ cfs_rq->curr = se;
+}
+
+static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se = cfs_rq->next ? cfs_rq->next : cfs_rq->rb_leftmost;
+
+ cfs_rq->next = NULL;
+ if (se->queued)
+ __dequeue_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se);
+ update_run_end(cfs_rq);
+
+ return se;
+}
+
+static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
+{
+ update_curr(cfs_rq);
+ if (prev->on_rq)
+ __enqueue_entity(cfs_rq, prev);
+ cfs_rq->curr = NULL;
+}
+
+static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ update_curr(cfs_rq);
+ normalize_time_sum_off(cfs_rq);
+
+#ifdef CONFIG_SCHED_DEBUG
+{
+ static int cnt = 10;
+ kclock_t avg = get_time_avg(cfs_rq);
+ int inc_curr = 5;
+ RANGE_CHECK(curr, curr);
+}
+#endif
+
+ /*
+ * Reschedule if another task tops the current one.
+ */
+ if ((kclock_t)(curr->time_norm - cfs_rq->run_end) >= 0) {
+ if (cfs_rq->rb_leftmost) {
+ cfs_rq->next = cfs_rq->rb_leftmost;
+ resched_task(rq_of(cfs_rq)->curr);
+ } else
+ cfs_rq->run_end = cfs_rq->run_end_curr = curr->time_norm + curr->req_weight_inv;
+ }
+}
+
+/**************************************************
+ * CFS operations on tasks:
+ */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* Walk up scheduling entities hierarchy */
+#define for_each_sched_entity(se) \
+ for (; se; se = se->parent)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return p->se.cfs_rq;
+}
+
+/* runqueue on which this entity is (to be) queued */
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ return se->cfs_rq;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return grp->my_q;
+}
+
+/* Given a group's cfs_rq on one cpu, return its corresponding cfs_rq on
+ * another cpu ('this_cpu')
+ */
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ /* A later patch will take group into account */
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+/* Iterate thr' all leaf cfs_rq's on a runqueue */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+
+/* Do the two (enqueued) tasks belong to the same group ? */
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ if (curr->se.cfs_rq == p->se.cfs_rq)
+ return 1;
+
+ return 0;
+}
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+#define for_each_sched_entity(se) \
+ for (; se; se = NULL)
+
+static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
+{
+ return &task_rq(p)->cfs;
+}
+
+static inline struct cfs_rq *cfs_rq_of(struct sched_entity *se)
+{
+ struct task_struct *p = task_of(se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->cfs;
+}
+
+/* runqueue "owned" by this group */
+static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
+{
+ return NULL;
+}
+
+static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
+{
+ return &cpu_rq(this_cpu)->cfs;
+}
+
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+ for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
+
+static inline int is_same_group(struct task_struct *curr, struct task_struct *p)
+{
+ return 1;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+/*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+ cfs_rq = cfs_rq_of(se);
+ enqueue_entity(cfs_rq, se);
+ }
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &p->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ dequeue_entity(cfs_rq, se, sleep);
+ /* Don't dequeue parent if it has other entities besides us */
+ if (cfs_rq->weight_sum)
+ break;
+ }
+}
+
+/*
+ * sched_yield() support is very simple - we yield to the leftmost task if it is behind us
+ */
+static void yield_task_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *next;
+ __update_rq_clock(rq);
+
+ update_curr(cfs_rq);
+ next = cfs_rq->rb_leftmost;
+ if (next && (kclock_t)(p->se.time_norm - next->time_norm) > 0) {
+ cfs_rq->next = next;
+ return;
+ }
+ cfs_rq->next = &p->se;
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *curr = rq->curr;
+ struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+ unsigned long gran;
+ kclock_t gran_norm;
+
+ if (!cfs_rq->curr || unlikely(rt_prio(p->prio))) {
+ resched_task(curr);
+ return;
+ }
+
+ if (!is_same_group(curr, p) || cfs_rq->rb_leftmost != &p->se)
+ return;
+
+ update_curr(cfs_rq);
+
+ gran = sysctl_sched_wakeup_granularity;
+ /*
+ * Batch tasks prefer throughput over latency:
+ */
+ if (unlikely(p->policy == SCHED_BATCH))
+ gran = sysctl_sched_batch_wakeup_granularity;
+ gran_norm = cfs_rq->curr->load.inv_weight * (kclock_t)gran;
+
+ if (curr->se.time_norm - cfs_rq->run_start >= gran_norm &&
+ curr->se.time_norm - p->se.time_norm >= gran_norm) {
+ cfs_rq->next = &p->se;
+ resched_task(curr);
+ }
+}
+
+static struct task_struct *pick_next_task_fair(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ struct sched_entity *se;
+
+ if (unlikely(!cfs_rq->nr_running))
+ return NULL;
+
+ if (cfs_rq->nr_running == 1 && cfs_rq->curr)
+ return task_of(cfs_rq->curr);
+
+ do {
+ se = pick_next_entity(cfs_rq);
+ cfs_rq = group_cfs_rq(se);
+ } while (cfs_rq);
+
+ return task_of(se);
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
+{
+ struct sched_entity *se = &prev->se;
+ struct cfs_rq *cfs_rq;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ put_prev_entity(cfs_rq, se);
+ }
+}
+
+/**************************************************
+ * Fair scheduling class load-balancing methods:
+ */
+
+/*
+ * Load-balancing iterator. Note: while the runqueue stays locked
+ * during the whole iteration, the current task might be
+ * dequeued so the iterator has to be dequeue-safe. Here we
+ * achieve that by always pre-iterating before returning
+ * the current task:
+ */
+static inline struct task_struct *
+__load_balance_iterator(struct cfs_rq *cfs_rq)
+{
+ struct task_struct *p;
+ struct rb_node *curr;
+
+ curr = cfs_rq->rb_load_balance_curr;
+ if (!curr)
+ return NULL;
+
+ p = rb_entry(curr, struct task_struct, se.run_node);
+ cfs_rq->rb_load_balance_curr = rb_next(curr);
+
+ return p;
+}
+
+static struct task_struct *load_balance_start_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ cfs_rq->rb_load_balance_curr = cfs_rq->rb_leftmost ?
+ &cfs_rq->rb_leftmost->run_node : NULL;
+ if (cfs_rq->curr)
+ return rb_entry(&cfs_rq->curr->run_node, struct task_struct, se.run_node);
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+static struct task_struct *load_balance_next_fair(void *arg)
+{
+ struct cfs_rq *cfs_rq = arg;
+
+ return __load_balance_iterator(cfs_rq);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr;
+ struct task_struct *p;
+
+ if (!cfs_rq->nr_running)
+ return MAX_PRIO;
+
+ curr = cfs_rq->rb_leftmost;
+ p = task_of(curr);
+
+ return p->prio;
+}
+#endif
+
+static unsigned long
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_nr_move, unsigned long max_load_move,
+ struct sched_domain *sd, enum cpu_idle_type idle,
+ int *all_pinned, int *this_best_prio)
+{
+ struct cfs_rq *busy_cfs_rq;
+ unsigned long load_moved, total_nr_moved = 0, nr_moved;
+ long rem_load_move = max_load_move;
+ struct rq_iterator cfs_rq_iterator;
+
+ cfs_rq_iterator.start = load_balance_start_fair;
+ cfs_rq_iterator.next = load_balance_next_fair;
+
+ for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct cfs_rq *this_cfs_rq;
+ long imbalance;
+ unsigned long maxload;
+
+ this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
+
+ imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
+ /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
+ if (imbalance <= 0)
+ continue;
+
+ /* Don't pull more than imbalance/2 */
+ imbalance /= 2;
+ maxload = min(rem_load_move, imbalance);
+
+ *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+#else
+# define maxload rem_load_move
+#endif
+ /* pass busy_cfs_rq argument into
+ * load_balance_[start|next]_fair iterators
+ */
+ cfs_rq_iterator.arg = busy_cfs_rq;
+ nr_moved = balance_tasks(this_rq, this_cpu, busiest,
+ max_nr_move, maxload, sd, idle, all_pinned,
+ &load_moved, this_best_prio, &cfs_rq_iterator);
+
+ total_nr_moved += nr_moved;
+ max_nr_move -= nr_moved;
+ rem_load_move -= load_moved;
+
+ if (max_nr_move <= 0 || rem_load_move <= 0)
+ break;
+ }
+
+ return max_load_move - rem_load_move;
+}
+
+/*
+ * scheduler tick hitting a task of our scheduling class:
+ */
+static void task_tick_fair(struct rq *rq, struct task_struct *curr)
+{
+ struct cfs_rq *cfs_rq;
+ struct sched_entity *se = &curr->se;
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ entity_tick(cfs_rq, se);
+ }
+}
+
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+ struct sched_entity *se = &p->se;
+ kclock_t time;
+
+ sched_info_queued(p);
+
+ time = (rq->curr->se.time_norm - get_time_avg(cfs_rq)) >> 1;
+ cfs_rq->time_sum_off -= (time << rq->curr->se.weight_shift);
+ normalize_time_sum_off(cfs_rq);
+ rq->curr->se.time_norm -= time;
+ se->time_norm = rq->curr->se.time_norm;
+
+ enqueue_entity(cfs_rq, se);
+ p->se.on_rq = 1;
+
+ cfs_rq->next = se;
+ resched_task(rq->curr);
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Account for a task changing its policy or group.
+ *
+ * This routine is mostly called to set cfs_rq->curr field when a task
+ * migrates between groups/classes.
+ */
+static void set_curr_task_fair(struct rq *rq)
+{
+ struct sched_entity *se = &rq->curr->se;
+
+ for_each_sched_entity(se)
+ set_next_entity(cfs_rq_of(se), se);
+}
+#else
+static void set_curr_task_fair(struct rq *rq)
+{
+}
+#endif
+
+/*
+ * All the scheduling class methods:
+ */
+struct sched_class fair_sched_class __read_mostly = {
+ .enqueue_task = enqueue_task_fair,
+ .dequeue_task = dequeue_task_fair,
+ .yield_task = yield_task_fair,
+
+ .check_preempt_curr = check_preempt_curr_fair,
+
+ .pick_next_task = pick_next_task_fair,
+ .put_prev_task = put_prev_task_fair,
+
+ .load_balance = load_balance_fair,
+
+ .set_curr_task = set_curr_task_fair,
+ .task_tick = task_tick_fair,
+ .task_new = task_new_fair,
+};
+
+#ifdef CONFIG_SCHED_DEBUG
+static void print_cfs_stats(struct seq_file *m, int cpu)
+{
+ struct cfs_rq *cfs_rq;
+
+ for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
+ print_cfs_rq(m, cpu, cfs_rq);
+}
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -211,30 +211,16 @@ static ctl_table root_table[] = {
{ .ctl_name = 0 }
};

-#ifdef CONFIG_SCHED_DEBUG
static unsigned long min_sched_granularity_ns = 100000; /* 100 usecs */
static unsigned long max_sched_granularity_ns = 1000000000; /* 1 second */
static unsigned long min_wakeup_granularity_ns; /* 0 usecs */
static unsigned long max_wakeup_granularity_ns = 1000000000; /* 1 second */
-#endif

static ctl_table kern_table[] = {
-#ifdef CONFIG_SCHED_DEBUG
- {
- .ctl_name = CTL_UNNUMBERED,
- .procname = "sched_min_granularity_ns",
- .data = &sysctl_sched_min_granularity,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .strategy = &sysctl_intvec,
- .extra1 = &min_sched_granularity_ns,
- .extra2 = &max_sched_granularity_ns,
- },
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_latency_ns",
- .data = &sysctl_sched_latency,
+ .procname = "sched_granularity_ns",
+ .data = &sysctl_sched_granularity,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
@@ -264,6 +250,7 @@ static ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SCHED_DEBUG
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_stat_granularity_ns",
@@ -277,17 +264,6 @@ static ctl_table kern_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
- .procname = "sched_runtime_limit_ns",
- .data = &sysctl_sched_runtime_limit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = &proc_dointvec_minmax,
- .strategy = &sysctl_intvec,
- .extra1 = &min_sched_granularity_ns,
- .extra2 = &max_sched_granularity_ns,
- },
- {
- .ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),

2007-09-11 06:27:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

On Tue, 2007-09-11 at 01:23 +0200, Roman Zippel wrote:
> Hi,
>
> On Sat, 8 Sep 2007, Mike Galbraith wrote:
>
> > > On Sun, 2 Sep 2007, Ingo Molnar wrote:
> > >
> > > Below is a patch updated against the latest git tree, no major changes.
> >
> > Interesting, I see major behavioral changes.
> >
> > I still see an aberration with fairtest2. On startup, the hog component
> > will consume 100% cpu for a bit, then the sleeper shows up. This
> > doesn't always happen, but happens quite often.
>
> I found the problem for this. What basically happened is that a task that
> hasn't been running for a second is enqueued first on an idle queue and it
> keeps that advantage over tasks that have been running more recently
> until it has caught up. The new version now remembers where the last task
> left off and uses that for the first task which restarts the queue. As a
> side effect it also limits the bonus a task gets if multiple tasks are
> woken at the same time.
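
For reference, the clamping described above is the time_avg_min/time_norm
handling at the top of enqueue_entity(), visible in the incremental patch
later in this thread. A minimal stand-alone sketch of the placement rule
(an illustration only; kclock_t is assumed here to be a signed 64-bit clock
type and kclock_max() a plain maximum):

    #include <stdint.h>

    typedef int64_t kclock_t;

    static kclock_t kclock_max(kclock_t a, kclock_t b)
    {
        return a > b ? a : b;    /* assumed: plain maximum */
    }

    /*
     * A waking task resumes where it left off (se_time_norm), but never
     * more than one request (req_weight_inv) behind the queue's remembered
     * minimum (time_avg_min), which bounds the bonus a long sleeper can
     * accumulate.
     */
    static kclock_t wakeup_placement(kclock_t se_time_norm,
                                     kclock_t time_avg_min,
                                     kclock_t req_weight_inv)
    {
        return kclock_max(time_avg_min - req_weight_inv, se_time_norm);
    }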

I still see the fairtest2 sleeper startup anomaly. Sometimes it starts
up normally, other times the sleeper is delayed. It seems to require idle
time to trigger the worst-case startup delay.

14854 root 20 0 1568 468 384 R 52 0.0 0:07.50 1 fairtest2
14855 root 20 0 1568 112 28 R 45 0.0 0:00.46 1 fairtest2

Everything else still seems fine. Boot-time warnings:

[ 113.504259] audit(1189488395.753:2): audit_pid=5403 old=0 by auid=4294967295
[ 114.077979] IA-32 Microcode Update Driver: v1.14a <[email protected]>
[ 116.281216] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b49210000,f70355a0(5633,b,d1cef00),2
[ 116.298004] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 116.312380] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 116.326193] [<c0105d83>] show_trace+0x12/0x14
[ 116.339270] [<c0105d9b>] dump_stack+0x16/0x18
[ 116.352199] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 116.365406] [<c011e8a3>] enqueue_entity+0x186/0x21b
[ 116.378571] [<c0123344>] enqueue_task_fair+0x2f/0x31
[ 116.391545] [<c011d043>] enqueue_task+0xd/0x18
[ 116.403833] [<c011dffe>] activate_task+0x20/0x2d
[ 116.416174] [<c0120336>] __migrate_task+0x9a/0xc4
[ 116.428490] [<c0122d0b>] migration_thread+0x175/0x220
[ 116.441159] [<c01393f7>] kthread+0x37/0x59
[ 116.452824] [<c0104dd3>] kernel_thread_helper+0x7/0x14
[ 116.465582] =======================
[ 116.476497] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b496c0000,f7e1a5a0(4179,3b0,232aaf800),4
[ 116.492899] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 116.506459] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 116.519302] [<c0105d83>] show_trace+0x12/0x14
[ 116.531313] [<c0105d9b>] dump_stack+0x16/0x18
[ 116.543334] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 116.555710] [<c011ea1e>] update_curr+0xe6/0x102
[ 116.567654] [<c011ebb9>] task_tick_fair+0x3f/0x1f2
[ 116.579657] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 116.591808] [<c012fa9d>] update_process_times+0x44/0x63
[ 116.604400] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 116.616741] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 116.629230] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 116.642047] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 116.654692] [<c011f119>] __wake_up+0x3a/0x42
[ 116.666158] [<c03f02d3>] sock_def_readable+0x6e/0x70
[ 116.678143] [<c046195c>] unix_stream_sendmsg+0x19d/0x32c
[ 116.690520] [<c03ebf11>] sock_aio_write+0x104/0x123
[ 116.702479] [<c0178c60>] do_sync_readv_writev+0xb4/0xea
[ 116.714907] [<c0179318>] do_readv_writev+0xbb/0x1d4
[ 116.727074] [<c0179470>] vfs_writev+0x3f/0x51
[ 116.738800] [<c01798b4>] sys_writev+0x3d/0x64
[ 116.750154] [<c0104182>] sysenter_past_esp+0x5f/0x85
[ 116.761932] =======================
[ 116.772092] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b496c0000,f793bae0(4010,3b0,232aaf800),3
[ 116.787811] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 116.800930] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 116.813437] [<c0105d83>] show_trace+0x12/0x14
[ 116.825164] [<c0105d9b>] dump_stack+0x16/0x18
[ 116.836828] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 116.848910] [<c0123391>] dequeue_task_fair+0x4b/0x2d3
[ 116.861390] [<c011d05b>] dequeue_task+0xd/0x18
[ 116.873193] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 116.885335] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 116.897357] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 116.909627] [<c011d141>] move_tasks+0x5b/0x77
[ 116.921200] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 116.932786] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 116.944998] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 116.956238] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 116.967316] =======================
[ 116.977554] 3,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b2ded0000,f793bae0(4010,3b0,232aaf800),1
[ 116.993315] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 117.006408] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 117.018914] [<c0105d83>] show_trace+0x12/0x14
[ 117.030667] [<c0105d9b>] dump_stack+0x16/0x18
[ 117.042332] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 117.054335] [<c01234ec>] dequeue_task_fair+0x1a6/0x2d3
[ 117.066781] [<c011d05b>] dequeue_task+0xd/0x18
[ 117.078429] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 117.090380] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 117.102203] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 117.114283] [<c011d141>] move_tasks+0x5b/0x77
[ 117.125689] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 117.137320] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 117.149773] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 117.161343] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 117.172724] =======================
[ 117.183276] 3,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b2ded0000,f7035ae0(5612,3b0,232aaf800),3
[ 117.199354] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 117.212692] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 117.225188] [<c0105d83>] show_trace+0x12/0x14
[ 117.236680] [<c0105d9b>] dump_stack+0x16/0x18
[ 117.247991] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 117.259717] [<c0123391>] dequeue_task_fair+0x4b/0x2d3
[ 117.271903] [<c011d05b>] dequeue_task+0xd/0x18
[ 117.283414] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 117.295253] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 117.307030] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 117.319101] [<c011d141>] move_tasks+0x5b/0x77
[ 117.330543] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 117.342243] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 117.354801] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 117.366493] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 117.377949] =======================
[ 117.388461] 8,f70355a0(5633,b,d1cef00): 32af5bc18d4,31c54050000,f70355a0(5633,b,d1cef00),2
[ 117.404131] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 117.417451] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 117.429922] [<c0105d83>] show_trace+0x12/0x14
[ 117.441387] [<c0105d9b>] dump_stack+0x16/0x18
[ 117.452681] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 117.464372] [<c011e8a3>] enqueue_entity+0x186/0x21b
[ 117.476324] [<c0123344>] enqueue_task_fair+0x2f/0x31
[ 117.488346] [<c011d043>] enqueue_task+0xd/0x18
[ 117.499760] [<c011dffe>] activate_task+0x20/0x2d
[ 117.511365] [<c011e0bf>] balance_tasks+0xb4/0x140
[ 117.523046] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 117.535015] [<c011d141>] move_tasks+0x5b/0x77
[ 117.546335] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 117.557923] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 117.570385] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 117.581929] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 117.593248] =======================
[ 117.603676] 8,f70355a0(5633,b,d1cef00): 32af5bc18d4,31c54050000,dfcd4060(939,135,b82da880),4
[ 117.619422] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 117.632410] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 117.644554] [<c0105d83>] show_trace+0x12/0x14
[ 117.655838] [<c0105d9b>] dump_stack+0x16/0x18
[ 117.667078] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 117.678770] [<c011ea1e>] update_curr+0xe6/0x102
[ 117.690245] [<c011ebb9>] task_tick_fair+0x3f/0x1f2
[ 117.701972] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 117.713818] [<c012fa9d>] update_process_times+0x44/0x63
[ 117.726056] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 117.737964] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 117.750175] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 117.762890] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 117.775290] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 117.787875] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 117.799549] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 117.811040] =======================
[ 117.821594] 8,f70355a0(5633,b,d1cef00): 32af5bc18d4,32397330000,dfc46060(3936,3b0,232aaf800),3
[ 117.837448] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 117.850404] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 117.862659] [<c0105d83>] show_trace+0x12/0x14
[ 117.874184] [<c0105d9b>] dump_stack+0x16/0x18
[ 117.885711] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 117.897601] [<c0123391>] dequeue_task_fair+0x4b/0x2d3
[ 117.909865] [<c011d05b>] dequeue_task+0xd/0x18
[ 117.921487] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 117.933509] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 117.945443] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 117.957671] [<c011d141>] move_tasks+0x5b/0x77
[ 117.969259] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 117.981115] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 117.993846] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 118.005658] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 118.017255] =======================
[ 118.027946] 7,f70355a0(5633,b,d1cef00): 32af5bc18d4,3238da60000,dfc46060(3936,3b0,232aaf800),1
[ 118.044138] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 118.057413] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 118.069850] [<c0105d83>] show_trace+0x12/0x14
[ 118.081438] [<c0105d9b>] dump_stack+0x16/0x18
[ 118.092983] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 118.104976] [<c01234ec>] dequeue_task_fair+0x1a6/0x2d3
[ 118.117448] [<c011d05b>] dequeue_task+0xd/0x18
[ 118.129182] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 118.141272] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 118.153258] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 118.165505] [<c011d141>] move_tasks+0x5b/0x77
[ 118.177084] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 118.188888] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 118.201507] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 118.213207] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 118.224673] =======================
[ 118.235200] 7,f70355a0(5633,b,d1cef00): 32af5bc18d4,3238da60000,f791f060(5607,3b0,232aaf800),3
[ 118.251260] WARNING: at kernel/sched_norm.c:271 verify_queue()
[ 118.264614] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 118.277128] [<c0105d83>] show_trace+0x12/0x14
[ 118.288631] [<c0105d9b>] dump_stack+0x16/0x18
[ 118.299967] [<c011e61a>] verify_queue+0x2f8/0x3fb
[ 118.311709] [<c0123391>] dequeue_task_fair+0x4b/0x2d3
[ 118.323920] [<c011d05b>] dequeue_task+0xd/0x18
[ 118.335457] [<c011dfa1>] deactivate_task+0x20/0x3a
[ 118.347331] [<c011e0aa>] balance_tasks+0x9f/0x140
[ 118.359125] [<c011e1ab>] load_balance_fair+0x60/0x7d
[ 118.371206] [<c011d141>] move_tasks+0x5b/0x77
[ 118.382656] [<c0120c0b>] load_balance+0xd3/0x2ab
[ 118.394366] [<c0123fca>] run_rebalance_domains+0x89/0x319
[ 118.406932] [<c012bbc2>] __do_softirq+0x73/0xe0
[ 118.418632] [<c0106bc3>] do_softirq+0x6e/0xc1
[ 118.430088] =======================
[ 118.661317] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 118.841985] 3,f7913ae0(5648,b,d1cef00): 30aea0eef39,2dc5dc40000,f7913ae0(5648,b,d1cef00),5
[ 118.876767] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 118.908572] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 118.939760] [<c0105d83>] show_trace+0x12/0x14
[ 118.970122] [<c0105d9b>] dump_stack+0x16/0x18
[ 119.000496] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 119.031508] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 119.062350] [<c012fa9d>] update_process_times+0x44/0x63
[ 119.093732] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 119.124944] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 119.156611] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 119.188954] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 119.221063] [<c0165c22>] __do_fault+0x3d/0x363
[ 119.252382] [<c0167288>] handle_mm_fault+0x11b/0x6a8
[ 119.284202] [<c011c158>] do_page_fault+0x3ed/0x615
[ 119.315726] [<c04d2bba>] error_code+0x72/0x78
[ 119.346869] =======================
[ 119.408744] 3,f7dceae0(5657,b,d1cef00): 333cd7dba80,3291a150000,f7dceae0(5657,b,d1cef00),5
[ 119.424943] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 119.438312] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 119.451068] [<c0105d83>] show_trace+0x12/0x14
[ 119.463097] [<c0105d9b>] dump_stack+0x16/0x18
[ 119.475119] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 119.487694] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 119.500217] [<c012fa9d>] update_process_times+0x44/0x63
[ 119.513155] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 119.525773] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 119.538696] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 119.552119] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 119.565221] =======================
[ 119.650421] NFSD: starting 90-second grace period
[ 120.632062] 2,f7335060(5712,b,d1cef00): 35ac9c90eea,ffffe84b40460000,f7335060(5712,b,d1cef00),5
[ 120.652839] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 120.670841] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 120.688353] [<c0105d83>] show_trace+0x12/0x14
[ 120.705102] [<c0105d9b>] dump_stack+0x16/0x18
[ 120.721703] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 120.738851] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 120.755914] [<c012fa9d>] update_process_times+0x44/0x63
[ 120.773311] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 120.790296] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 120.807555] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 120.825351] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 120.842811] =======================
[ 120.858451] 2,f7335060(5712,b,d1cef00): 35aca70e122,ffffe84b40e90000,f7335060(5712,b,d1cef00),5
[ 120.879922] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 120.898237] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 120.915750] [<c0105d83>] show_trace+0x12/0x14
[ 120.932336] [<c0105d9b>] dump_stack+0x16/0x18
[ 120.948858] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 120.966070] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 120.983249] [<c012fa9d>] update_process_times+0x44/0x63
[ 121.000839] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 121.018090] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 121.035610] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 121.053627] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 121.071256] =======================
[ 121.087023] 2,f7335060(5712,b,d1cef00): 35acb18b35a,ffffe84b418c0000,f7335060(5712,b,d1cef00),5
[ 121.108549] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 121.127230] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 121.145279] [<c0105d83>] show_trace+0x12/0x14
[ 121.162114] [<c0105d9b>] dump_stack+0x16/0x18
[ 121.178420] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 121.195216] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 121.211843] [<c012fa9d>] update_process_times+0x44/0x63
[ 121.228703] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 121.245445] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 121.262515] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 121.279696] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 121.296219] =======================
[ 121.310817] 2,f7335060(5712,b,d1cef00): 35acbc08592,ffffe84b422e0000,f7335060(5712,b,d1cef00),5
[ 121.331338] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 121.349145] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 121.366481] [<c0105d83>] show_trace+0x12/0x14
[ 121.383035] [<c0105d9b>] dump_stack+0x16/0x18
[ 121.399470] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 121.416436] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 121.433297] [<c012fa9d>] update_process_times+0x44/0x63
[ 121.450497] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 121.467300] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 121.484335] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 121.501525] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 121.518032] =======================
[ 121.532610] 2,f7335060(5712,b,d1cef00): 35acc6857ca,ffffe84b42d10000,f7335060(5712,b,d1cef00),5
[ 121.553097] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 121.570885] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 121.588190] [<c0105d83>] show_trace+0x12/0x14
[ 121.604731] [<c0105d9b>] dump_stack+0x16/0x18
[ 121.621151] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 121.638084] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 121.654955] [<c012fa9d>] update_process_times+0x44/0x63
[ 121.672143] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 121.688956] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 121.705988] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 121.723180] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 121.739686] =======================
[ 121.754270] 3,f7335060(5712,b,d1cef00): 35acd102a0d,ffffe85123590000,f7335060(5712,b,d1cef00),5
[ 121.774772] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 121.792570] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 121.809871] [<c0105d83>] show_trace+0x12/0x14
[ 121.826405] [<c0105d9b>] dump_stack+0x16/0x18
[ 121.842791] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 121.859719] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 121.876564] [<c012fa9d>] update_process_times+0x44/0x63
[ 121.893738] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 121.910522] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 121.927532] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 121.944700] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 121.961179] =======================
[ 121.975739] 2,f7335060(5712,b,d1cef00): 35acdb7fc45,ffffe85306070000,f7335060(5712,b,d1cef00),5
[ 121.996410] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 122.014189] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 122.031484] [<c0105d83>] show_trace+0x12/0x14
[ 122.047999] [<c0105d9b>] dump_stack+0x16/0x18
[ 122.064367] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 122.081282] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 122.098123] [<c012fa9d>] update_process_times+0x44/0x63
[ 122.115307] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 122.132098] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 122.149090] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 122.166256] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 122.182763] =======================
[ 122.197362] 2,f7335060(5712,b,d1cef00): 35ace5fce7d,ffffe85306aa0000,f7335060(5712,b,d1cef00),5
[ 122.217856] WARNING: at kernel/sched_norm.c:438 entity_tick()
[ 122.235637] [<c0105188>] show_trace_log_lvl+0x1a/0x30
[ 122.252929] [<c0105d83>] show_trace+0x12/0x14
[ 122.269437] [<c0105d9b>] dump_stack+0x16/0x18
[ 122.285823] [<c011ed0e>] task_tick_fair+0x194/0x1f2
[ 122.302718] [<c01223f3>] scheduler_tick+0x18d/0x2d5
[ 122.319572] [<c012fa9d>] update_process_times+0x44/0x63
[ 122.336736] [<c014179e>] tick_sched_timer+0x5c/0xbd
[ 122.353530] [<c013cbe1>] hrtimer_interrupt+0x12c/0x1a0
[ 122.370532] [<c01173f0>] smp_apic_timer_interrupt+0x57/0x88
[ 122.387687] [<c0104c50>] apic_timer_interrupt+0x28/0x30
[ 122.404192] =======================
[ 128.411998] [drm] Initialized drm 1.1.0 20060810



2007-09-11 11:28:49

by Roman Zippel

[permalink] [raw]
Subject: Re: [ANNOUNCE/RFC] Really Fair Scheduler

Hi,

On Tue, 11 Sep 2007, Mike Galbraith wrote:

> I still see the fairtest2 sleeper startup anomaly. Sometimes it starts
> up normally, other times the sleeper is delayed. It seems to require idle
> time to trigger the worst-case startup delay.
>
> 14854 root 20 0 1568 468 384 R 52 0.0 0:07.50 1 fairtest2
> 14855 root 20 0 1568 112 28 R 45 0.0 0:00.46 1 fairtest2
>
> Everything else still seems fine. Boot-time warnings:
>
> [ 113.504259] audit(1189488395.753:2): audit_pid=5403 old=0 by auid=4294967295
> [ 114.077979] IA-32 Microcode Update Driver: v1.14a <[email protected]>
> [ 116.281216] 4,f70355a0(5633,b,d1cef00): 32af5bc18d4,31b49210000,f70355a0(5633,b,d1cef00),2
> [ 116.298004] WARNING: at kernel/sched_norm.c:271 verify_queue()
> [ 116.312380] [<c0105188>] show_trace_log_lvl+0x1a/0x30
> [ 116.326193] [<c0105d83>] show_trace+0x12/0x14
> [ 116.339270] [<c0105d9b>] dump_stack+0x16/0x18
> [ 116.352199] [<c011e61a>] verify_queue+0x2f8/0x3fb
> [ 116.365406] [<c011e8a3>] enqueue_entity+0x186/0x21b
> [ 116.378571] [<c0123344>] enqueue_task_fair+0x2f/0x31
> [ 116.391545] [<c011d043>] enqueue_task+0xd/0x18
> [ 116.403833] [<c011dffe>] activate_task+0x20/0x2d
> [ 116.416174] [<c0120336>] __migrate_task+0x9a/0xc4
> [ 116.428490] [<c0122d0b>] migration_thread+0x175/0x220
> [ 116.441159] [<c01393f7>] kthread+0x37/0x59
> [ 116.452824] [<c0104dd3>] kernel_thread_helper+0x7/0x14
> [ 116.465582] =======================

Damn, I forgot that tasks which are reniced or migrate to another cpu
go through a non-wakeup enqueue and still need their normalized time set
up for the new queue. The small incremental patch below does that by
passing the wakeup flag down to enqueue_entity() and placing such tasks
at the queue's current average (see also the short sketch after the
patch). Thanks again for testing.

bye, Roman

Index: linux-2.6/kernel/sched_norm.c
===================================================================
--- linux-2.6.orig/kernel/sched_norm.c 2007-09-11 13:15:00.000000000 +0200
+++ linux-2.6/kernel/sched_norm.c 2007-09-11 13:13:43.000000000 +0200
@@ -326,11 +326,14 @@ static void update_curr(struct cfs_rq *c
}

static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
{
verify_queue(cfs_rq, cfs_rq->curr != se, se);
cfs_rq->time_avg_min = kclock_max(cfs_rq->time_avg_min, get_time_avg(cfs_rq));
- se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+ if (likely(wakeup))
+ se->time_norm = kclock_max(cfs_rq->time_avg_min - se->req_weight_inv, se->time_norm);
+ else
+ se->time_norm = cfs_rq->time_avg_min;

cfs_rq->nr_running++;
cfs_rq->weight_sum += 1 << se->weight_shift;
@@ -553,7 +556,7 @@ static void enqueue_task_fair(struct rq
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se);
+ enqueue_entity(cfs_rq, se, wakeup);
}
}

@@ -813,7 +816,7 @@ static void task_new_fair(struct rq *rq,
rq->curr->se.time_norm -= time;
se->time_norm = rq->curr->se.time_norm;

- enqueue_entity(cfs_rq, se);
+ enqueue_entity(cfs_rq, se, 1);
p->se.on_rq = 1;

cfs_rq->next = se;
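
Distilled from the hunks above, the placement of a task on enqueue now
depends on why it is enqueued. A minimal sketch of the resulting rule (an
illustration only, using the field names from the patch; kclock_t and
kclock_max() carry the same assumptions as in the sketch earlier in this
thread):

    #include <stdint.h>

    typedef int64_t kclock_t;                        /* assumed signed 64-bit clock */
    #define kclock_max(a, b) ((a) > (b) ? (a) : (b)) /* assumed plain maximum */

    /*
     * wakeup: a sleeper resumes where it left off, but at most one request
     * (req_weight_inv) behind the queue's remembered minimum, so the
     * sleeper bonus stays bounded.
     * !wakeup (renice, migration to this cpu): the task simply joins at the
     * queue's current level and gets no bonus at all.
     */
    static kclock_t enqueue_placement(int wakeup, kclock_t se_time_norm,
                                      kclock_t time_avg_min,
                                      kclock_t req_weight_inv)
    {
        if (wakeup)
            return kclock_max(time_avg_min - req_weight_inv, se_time_norm);
        return time_avg_min;
    }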