2007-05-23 17:11:48

by Srivatsa Vaddagiri

Subject: [RFC] [PATCH 0/3] Add group fairness to CFS

Here's an attempt to extend CFS (v13) to be fair at a group level, rather than
just at the task level. The patch is in a very premature state at this point
(it passes simple tests; SMP load balancing is not supported yet). I am sending
it out early to find out whether this is a good direction to proceed in.

Salient points which need discussion:

1. This patch reuses the CFS core to achieve fairness at the group level as well.

To make this possible, the CFS core has been abstracted to deal with generic
schedulable "entities" (tasks, users etc).

2. The per-cpu rb-tree has been split to be per-group, per-cpu.

schedule() now becomes a two-step operation on every CPU: first pick a group
(from the group rb-tree), then pick a task within that group (from that
group's task rb-tree). A sketch of this two-level pick is given after this list.

3. Grouping mechanism - I have used 'uid' as the basis of grouping for the
time being (since that grouping concept is already in mainline today).
The patch can be adapted to a more generic process-grouping mechanism
(like http://lkml.org/lkml/2007/4/27/146) later.
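
A minimal sketch of the two-level pick mentioned in point 2 above, written
against the helpers that patch 3/3 introduces (__pick_next_entity(),
entity_is_task(), task_entity(), se->my_q). The function name and exact
control flow here are illustrative only, not the patch's actual
pick_next_task_fair():

static struct task_struct *two_level_pick_sketch(struct rq *rq)
{
	struct sched_entity *se;

	if (!rq->nr_running)
		return NULL;

	/* step 1: leftmost entity in the per-cpu group rb-tree (e.g. a user) */
	se = __pick_next_entity(&rq->lrq);

	/* step 2: descend into that group's own rb-tree until we hit a task */
	while (!entity_is_task(se))
		se = __pick_next_entity(se->my_q);

	return task_entity(se);
}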

Some results are below, obtained on a 4-way (with HT) Intel Xeon box. All
numbers reflect single-CPU performance (the tests were forced to run on a
single CPU since load balancing is not yet supported).


uid "vatsa" uid "guest"
(make -s -j4 bzImage) (make -s -j20 bzImage)

2.6.22-rc1 772.02 sec 497.42 sec (real)
2.6.22-rc1+cfs-v13 780.62 sec 478.35 sec (real)
2.6.22-rc1+cfs-v13+this patch 776.36 sec 776.68 sec (real)

[ An exclusive cpuset containing only one CPU was created and the
compilation jobs of both users were run simultaneously in this cpuset ]

I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
cfs-v13:

uid "vatsa"
make -s -j4 bzImage

2.6.22-rc1+cfs-v13 395.57 sec (real)
2.6.22-rc1+cfs-v13+this_patch 388.54 sec (real)

There is no regression that I can see (rather some improvement, which I can't
explain at the moment). I will run more tests later to check for regressions.

I request your comments on the future direction to proceed!


--
Regards,
vatsa


2007-05-23 17:12:10

by Srivatsa Vaddagiri

Subject: [RFC] [PATCH 1/3] task_cpu(p) needs to be correct always

We rely very much on task_cpu(p) being correct at all times, so that we can
correctly find the task_grp_rq from which the task has to be removed or to
which it has to be added.

There is, however, one place in the scheduler where this assumption is broken:
__migrate_task() updates the task's CPU via set_task_cpu() before the task has
been deactivated from its source runqueue. This patch reorders that code so
that the task is dequeued from the old runqueue before its CPU is changed.

(Thanks to Balbir Singh for pointing this out to me)

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c 2007-05-23 20:46:41.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c 2007-05-23 20:48:26.000000000 +0530
@@ -4571,7 +4571,7 @@
static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
{
struct rq *rq_dest, *rq_src;
- int ret = 0;
+ int ret = 0, on_rq;

if (unlikely(cpu_is_offline(dest_cpu)))
return ret;
@@ -4587,9 +4587,11 @@
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

- set_task_cpu(p, dest_cpu);
- if (p->on_rq) {
+ on_rq = p->on_rq;
+ if (on_rq)
deactivate_task(rq_src, p, 0);
+ set_task_cpu(p, dest_cpu);
+ if (on_rq) {
activate_task(rq_dest, p, 0);
check_preempt_curr(rq_dest, p);
}

--
Regards,
vatsa

2007-05-23 17:12:35

by Srivatsa Vaddagiri

Subject: [RFC] [PATCH 2/3] Introduce two new structures - struct lrq and sched_entity

This patch groups together fields used by CFS (for SCHED_NORMAL tasks)
in task_struct and runqueue into separate structures so that they can be
reused in a later patch.

'struct sched_entity' represents the attributes used by CFS for every
schedulable entity (task in this case).

'struct lrq' represents the runqueue used to store schedulable entities (tasks
in this case) and to maintain various clocks (e.g. the fair clock for tasks).

This patch also modifies the rest of the kernel to reflect these new structures.

The change is mechanical (for example, p->sum_exec_runtime becomes
p->se.sum_exec_runtime and rq->fair_clock becomes rq->lrq.fair_clock); the
intended effect of this patch on the overall functionality of the scheduler
is zero.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---
fs/proc/array.c | 2
include/linux/sched.h | 44 ++++----
kernel/exit.c | 2
kernel/posix-cpu-timers.c | 19 +--
kernel/sched.c | 195 +++++++++++++++++++------------------
kernel/sched_debug.c | 85 ++++++++--------
kernel/sched_fair.c | 238 +++++++++++++++++++++++-----------------------
7 files changed, 301 insertions(+), 284 deletions(-)

Index: linux-2.6.21-rc7/fs/proc/array.c
===================================================================
--- linux-2.6.21-rc7.orig/fs/proc/array.c 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/fs/proc/array.c 2007-05-23 20:48:34.000000000 +0530
@@ -412,7 +412,7 @@
* Use CFS's precise accounting, if available:
*/
if (!has_rt_policy(task)) {
- utime = nsec_to_clock_t(task->sum_exec_runtime);
+ utime = nsec_to_clock_t(task->se.sum_exec_runtime);
stime = 0;
}

Index: linux-2.6.21-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/sched.h 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/include/linux/sched.h 2007-05-23 20:48:34.000000000 +0530
@@ -838,6 +838,29 @@
void (*task_new) (struct rq *rq, struct task_struct *p);
};

+/* CFS scheduling entity (task, user etc) statistics fields: */
+struct sched_entity {
+ int load_weight; /* for niceness load balancing purposes */
+ int on_rq;
+ struct rb_node run_node;
+ u64 wait_start_fair;
+ u64 wait_start;
+ u64 exec_start;
+ u64 sleep_start, sleep_start_fair;
+ u64 block_start;
+ u64 sleep_max;
+ u64 block_max;
+ u64 exec_max;
+ u64 wait_max;
+ u64 last_ran;
+
+ s64 wait_runtime;
+ u64 sum_exec_runtime;
+ s64 fair_key;
+ s64 sum_wait_runtime, sum_sleep_runtime;
+ unsigned long wait_runtime_overruns, wait_runtime_underruns;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -852,34 +875,15 @@
int oncpu;
#endif
#endif
- int load_weight; /* for niceness load balancing purposes */

int prio, static_prio, normal_prio;
- int on_rq;
struct list_head run_list;
- struct rb_node run_node;
+ struct sched_entity se;

unsigned short ioprio;
#ifdef CONFIG_BLK_DEV_IO_TRACE
unsigned int btrace_seq;
#endif
- /* CFS scheduling class statistics fields: */
- u64 wait_start_fair;
- u64 wait_start;
- u64 exec_start;
- u64 sleep_start, sleep_start_fair;
- u64 block_start;
- u64 sleep_max;
- u64 block_max;
- u64 exec_max;
- u64 wait_max;
- u64 last_ran;
-
- s64 wait_runtime;
- u64 sum_exec_runtime;
- s64 fair_key;
- s64 sum_wait_runtime, sum_sleep_runtime;
- unsigned long wait_runtime_overruns, wait_runtime_underruns;

unsigned int policy;
cpumask_t cpus_allowed;
Index: linux-2.6.21-rc7/kernel/exit.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/exit.c 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/exit.c 2007-05-23 20:48:34.000000000 +0530
@@ -124,7 +124,7 @@
sig->nivcsw += tsk->nivcsw;
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
- sig->sum_sched_runtime += tsk->sum_exec_runtime;
+ sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
sig = NULL; /* Marker for below. */
}

Index: linux-2.6.21-rc7/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/posix-cpu-timers.c 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/posix-cpu-timers.c 2007-05-23 20:48:34.000000000 +0530
@@ -161,7 +161,8 @@
}
static inline unsigned long long sched_ns(struct task_struct *p)
{
- return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime;
+ return (p == current) ? current_sched_runtime(p) :
+ p->se.sum_exec_runtime;
}

int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -249,7 +250,7 @@
cpu->sched = p->signal->sum_sched_runtime;
/* Add in each other live thread. */
while ((t = next_thread(t)) != p) {
- cpu->sched += t->sum_exec_runtime;
+ cpu->sched += t->se.sum_exec_runtime;
}
cpu->sched += sched_ns(p);
break;
@@ -467,7 +468,7 @@
void posix_cpu_timers_exit(struct task_struct *tsk)
{
cleanup_timers(tsk->cpu_timers,
- tsk->utime, tsk->stime, tsk->sum_exec_runtime);
+ tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);

}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -475,7 +476,7 @@
cleanup_timers(tsk->signal->cpu_timers,
cputime_add(tsk->utime, tsk->signal->utime),
cputime_add(tsk->stime, tsk->signal->stime),
- tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime);
+ tsk->se.sum_exec_runtime + tsk->signal->sum_sched_runtime);
}


@@ -536,7 +537,7 @@
nsleft = max_t(unsigned long long, nsleft, 1);
do {
if (likely(!(t->flags & PF_EXITING))) {
- ns = t->sum_exec_runtime + nsleft;
+ ns = t->se.sum_exec_runtime + nsleft;
if (t->it_sched_expires == 0 ||
t->it_sched_expires > ns) {
t->it_sched_expires = ns;
@@ -1004,7 +1005,7 @@
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
- if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) {
+ if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
tsk->it_sched_expires = t->expires.sched;
break;
}
@@ -1049,7 +1050,7 @@
do {
utime = cputime_add(utime, t->utime);
stime = cputime_add(stime, t->stime);
- sum_sched_runtime += t->sum_exec_runtime;
+ sum_sched_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
} while (t != tsk);
ptime = cputime_add(utime, stime);
@@ -1208,7 +1209,7 @@
t->it_virt_expires = ticks;
}

- sched = t->sum_exec_runtime + sched_left;
+ sched = t->se.sum_exec_runtime + sched_left;
if (sched_expires && (t->it_sched_expires == 0 ||
t->it_sched_expires > sched)) {
t->it_sched_expires = sched;
@@ -1300,7 +1301,7 @@

if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
(tsk->it_sched_expires == 0 ||
- tsk->sum_exec_runtime < tsk->it_sched_expires))
+ tsk->se.sum_exec_runtime < tsk->it_sched_expires))
return;

#undef UNEXPIRED
Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c 2007-05-23 20:48:26.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c 2007-05-23 20:48:34.000000000 +0530
@@ -114,6 +114,23 @@
struct list_head queue[MAX_RT_PRIO];
};

+/* CFS-related fields in a runqueue */
+struct lrq {
+ unsigned long raw_weighted_load;
+ #define CPU_LOAD_IDX_MAX 5
+ unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+ unsigned long nr_load_updates;
+
+ u64 fair_clock, prev_fair_clock;
+ u64 exec_clock, prev_exec_clock;
+ s64 wait_runtime;
+ unsigned long wait_runtime_overruns, wait_runtime_underruns;
+
+ struct rb_root tasks_timeline;
+ struct rb_node *rb_leftmost;
+ struct rb_node *rb_load_balance_curr;
+};
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -129,16 +146,13 @@
* remote CPUs use both these fields when doing load calculation.
*/
long nr_running;
- unsigned long raw_weighted_load;
- #define CPU_LOAD_IDX_MAX 5
- unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+ struct lrq lrq;

unsigned char idle_at_tick;
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
u64 nr_switches;
- unsigned long nr_load_updates;

/*
* This is part of a global counter where only the total sum
@@ -154,10 +168,6 @@

u64 clock, prev_clock_raw;
s64 clock_max_delta;
- u64 fair_clock, prev_fair_clock;
- u64 exec_clock, prev_exec_clock;
- s64 wait_runtime;
- unsigned long wait_runtime_overruns, wait_runtime_underruns;

unsigned int clock_warps;
unsigned int clock_unstable_events;
@@ -168,10 +178,6 @@
int rt_load_balance_idx;
struct list_head *rt_load_balance_head, *rt_load_balance_curr;

- struct rb_root tasks_timeline;
- struct rb_node *rb_leftmost;
- struct rb_node *rb_load_balance_curr;
-
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -573,13 +579,13 @@
static inline void
inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
{
- rq->raw_weighted_load += p->load_weight;
+ rq->lrq.raw_weighted_load += p->se.load_weight;
}

static inline void
dec_raw_weighted_load(struct rq *rq, const struct task_struct *p)
{
- rq->raw_weighted_load -= p->load_weight;
+ rq->lrq.raw_weighted_load -= p->se.load_weight;
}

static inline void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -605,22 +611,22 @@

static void set_load_weight(struct task_struct *p)
{
- task_rq(p)->wait_runtime -= p->wait_runtime;
- p->wait_runtime = 0;
+ task_rq(p)->lrq.wait_runtime -= p->se.wait_runtime;
+ p->se.wait_runtime = 0;

if (has_rt_policy(p)) {
- p->load_weight = prio_to_weight[0] * 2;
+ p->se.load_weight = prio_to_weight[0] * 2;
return;
}
/*
* SCHED_BATCH tasks get minimal weight:
*/
if (p->policy == SCHED_BATCH) {
- p->load_weight = 1;
+ p->se.load_weight = 1;
return;
}

- p->load_weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
+ p->se.load_weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
}

static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
@@ -629,7 +635,7 @@

sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup, now);
- p->on_rq = 1;
+ p->se.on_rq = 1;
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
@@ -637,7 +643,7 @@
u64 now = rq_clock(rq);

p->sched_class->dequeue_task(rq, p, sleep, now);
- p->on_rq = 0;
+ p->se.on_rq = 0;
}

/*
@@ -725,7 +731,7 @@
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->raw_weighted_load;
+ return cpu_rq(cpu)->lrq.raw_weighted_load;
}

#ifdef CONFIG_SMP
@@ -742,18 +748,18 @@
u64 clock_offset, fair_clock_offset;

clock_offset = old_rq->clock - new_rq->clock;
- fair_clock_offset = old_rq->fair_clock - new_rq->fair_clock;
+ fair_clock_offset = old_rq->lrq.fair_clock - new_rq->lrq.fair_clock;

- if (p->wait_start)
- p->wait_start -= clock_offset;
- if (p->wait_start_fair)
- p->wait_start_fair -= fair_clock_offset;
- if (p->sleep_start)
- p->sleep_start -= clock_offset;
- if (p->block_start)
- p->block_start -= clock_offset;
- if (p->sleep_start_fair)
- p->sleep_start_fair -= fair_clock_offset;
+ if (p->se.wait_start)
+ p->se.wait_start -= clock_offset;
+ if (p->se.wait_start_fair)
+ p->se.wait_start_fair -= fair_clock_offset;
+ if (p->se.sleep_start)
+ p->se.sleep_start -= clock_offset;
+ if (p->se.block_start)
+ p->se.block_start -= clock_offset;
+ if (p->se.sleep_start_fair)
+ p->se.sleep_start_fair -= fair_clock_offset;

task_thread_info(p)->cpu = new_cpu;

@@ -781,7 +787,7 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->on_rq && !task_running(rq, p)) {
+ if (!p->se.on_rq && !task_running(rq, p)) {
set_task_cpu(p, dest_cpu);
return 0;
}
@@ -812,7 +818,7 @@
repeat:
rq = task_rq_lock(p, &flags);
/* Must be off runqueue entirely, not preempted. */
- if (unlikely(p->on_rq || task_running(rq, p))) {
+ if (unlikely(p->se.on_rq || task_running(rq, p))) {
/* If it's preempted, we yield. It could be a while. */
preempted = !task_running(rq, p);
task_rq_unlock(rq, &flags);
@@ -860,9 +866,9 @@
struct rq *rq = cpu_rq(cpu);

if (type == 0)
- return rq->raw_weighted_load;
+ return rq->lrq.raw_weighted_load;

- return min(rq->cpu_load[type-1], rq->raw_weighted_load);
+ return min(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
}

/*
@@ -874,9 +880,9 @@
struct rq *rq = cpu_rq(cpu);

if (type == 0)
- return rq->raw_weighted_load;
+ return rq->lrq.raw_weighted_load;

- return max(rq->cpu_load[type-1], rq->raw_weighted_load);
+ return max(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
}

/*
@@ -887,7 +893,7 @@
struct rq *rq = cpu_rq(cpu);
unsigned long n = rq->nr_running;

- return n ? rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
+ return n ? rq->lrq.raw_weighted_load / n : SCHED_LOAD_SCALE;
}

/*
@@ -1118,7 +1124,7 @@
if (!(old_state & state))
goto out;

- if (p->on_rq)
+ if (p->se.on_rq)
goto out_running;

cpu = task_cpu(p);
@@ -1173,11 +1179,11 @@
* of the current CPU:
*/
if (sync)
- tl -= current->load_weight;
+ tl -= current->se.load_weight;

if ((tl <= load &&
- tl + target_load(cpu, idx) <= tl_per_task) ||
- 100*(tl + p->load_weight) <= imbalance*load) {
+ tl + target_load(cpu, idx) <= tl_per_task) ||
+ 100*(tl + p->se.load_weight) <= imbalance*load) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and
@@ -1211,7 +1217,7 @@
old_state = p->state;
if (!(old_state & state))
goto out;
- if (p->on_rq)
+ if (p->se.on_rq)
goto out_running;

this_cpu = smp_processor_id();
@@ -1275,18 +1281,19 @@
*/
static void __sched_fork(struct task_struct *p)
{
- p->wait_start_fair = p->wait_start = p->exec_start = p->last_ran = 0;
- p->sum_exec_runtime = 0;
-
- p->wait_runtime = 0;
-
- p->sum_wait_runtime = p->sum_sleep_runtime = 0;
- p->sleep_start = p->sleep_start_fair = p->block_start = 0;
- p->sleep_max = p->block_max = p->exec_max = p->wait_max = 0;
- p->wait_runtime_overruns = p->wait_runtime_underruns = 0;
+ p->se.wait_start_fair = p->se.wait_start = p->se.exec_start = 0;
+ p->se.last_ran = 0;
+ p->se.sum_exec_runtime = 0;
+
+ p->se.wait_runtime = 0;
+
+ p->se.sum_wait_runtime = p->se.sum_sleep_runtime = 0;
+ p->se.sleep_start = p->se.sleep_start_fair = p->se.block_start = 0;
+ p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
+ p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;

INIT_LIST_HEAD(&p->run_list);
- p->on_rq = 0;
+ p->se.on_rq = 0;
p->nr_switches = 0;

/*
@@ -1357,7 +1364,7 @@
p->prio = effective_prio(p);

if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) ||
- task_cpu(p) != this_cpu || !current->on_rq) {
+ task_cpu(p) != this_cpu || !current->se.on_rq) {
activate_task(rq, p, 0);
} else {
/*
@@ -1372,7 +1379,7 @@

void sched_dead(struct task_struct *p)
{
- WARN_ON_ONCE(p->on_rq);
+ WARN_ON_ONCE(p->se.on_rq);
}

/**
@@ -1584,17 +1591,19 @@
unsigned long tmp;
u64 tmp64;

- this_rq->nr_load_updates++;
+ this_rq->lrq.nr_load_updates++;
if (!(sysctl_sched_load_smoothing & 64)) {
- this_load = this_rq->raw_weighted_load;
+ this_load = this_rq->lrq.raw_weighted_load;
goto do_avg;
}

- fair_delta64 = this_rq->fair_clock - this_rq->prev_fair_clock + 1;
- this_rq->prev_fair_clock = this_rq->fair_clock;
-
- exec_delta64 = this_rq->exec_clock - this_rq->prev_exec_clock + 1;
- this_rq->prev_exec_clock = this_rq->exec_clock;
+ fair_delta64 = this_rq->lrq.fair_clock -
+ this_rq->lrq.prev_fair_clock + 1;
+ this_rq->lrq.prev_fair_clock = this_rq->lrq.fair_clock;
+
+ exec_delta64 = this_rq->lrq.exec_clock -
+ this_rq->lrq.prev_exec_clock + 1;
+ this_rq->lrq.prev_exec_clock = this_rq->lrq.exec_clock;

if (fair_delta64 > (s64)LONG_MAX)
fair_delta64 = (s64)LONG_MAX;
@@ -1620,10 +1629,10 @@

/* scale is effectively 1 << i now, and >> i divides by scale */

- old_load = this_rq->cpu_load[i];
+ old_load = this_rq->lrq.cpu_load[i];
new_load = this_load;

- this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
+ this_rq->lrq.cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
}
}

@@ -1879,7 +1888,8 @@
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
- skip_for_load = (p->load_weight >> 1) > rem_load_move + SCHED_LOAD_SCALE_FUZZ;
+ skip_for_load = (p->se.load_weight >> 1) > rem_load_move +
+ SCHED_LOAD_SCALE_FUZZ;
if (skip_for_load && p->prio < this_best_prio)
skip_for_load = !best_prio_seen && p->prio == best_prio;
if (skip_for_load ||
@@ -1892,7 +1902,7 @@

pull_task(busiest, p, this_rq, this_cpu);
pulled++;
- rem_load_move -= p->load_weight;
+ rem_load_move -= p->se.load_weight;

/*
* We only want to steal up to the prescribed number of tasks
@@ -1989,7 +1999,7 @@

avg_load += load;
sum_nr_running += rq->nr_running;
- sum_weighted_load += rq->raw_weighted_load;
+ sum_weighted_load += rq->lrq.raw_weighted_load;
}

/*
@@ -2223,11 +2233,12 @@

rq = cpu_rq(i);

- if (rq->nr_running == 1 && rq->raw_weighted_load > imbalance)
+ if (rq->nr_running == 1 &&
+ rq->lrq.raw_weighted_load > imbalance)
continue;

- if (rq->raw_weighted_load > max_load) {
- max_load = rq->raw_weighted_load;
+ if (rq->lrq.raw_weighted_load > max_load) {
+ max_load = rq->lrq.raw_weighted_load;
busiest = rq;
}
}
@@ -2830,7 +2841,7 @@
unsigned long flags;

local_irq_save(flags);
- ns = p->sum_exec_runtime + sched_clock() - p->last_ran;
+ ns = p->se.sum_exec_runtime + sched_clock() - p->se.last_ran;
local_irq_restore(flags);

return ns;
@@ -3518,7 +3529,7 @@
rq = task_rq_lock(p, &flags);

oldprio = p->prio;
- on_rq = p->on_rq;
+ on_rq = p->se.on_rq;
if (on_rq)
dequeue_task(rq, p, 0);

@@ -3571,7 +3582,7 @@
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
- on_rq = p->on_rq;
+ on_rq = p->se.on_rq;
if (on_rq) {
dequeue_task(rq, p, 0);
dec_raw_weighted_load(rq, p);
@@ -3708,7 +3719,7 @@
static void
__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
{
- BUG_ON(p->on_rq);
+ BUG_ON(p->se.on_rq);

p->policy = policy;
switch (p->policy) {
@@ -3814,7 +3825,7 @@
spin_unlock_irqrestore(&p->pi_lock, flags);
goto recheck;
}
- on_rq = p->on_rq;
+ on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq, p, 0);
oldprio = p->prio;
@@ -4468,7 +4479,7 @@
unsigned long flags;

__sched_fork(idle);
- idle->exec_start = sched_clock();
+ idle->se.exec_start = sched_clock();

idle->prio = idle->normal_prio = MAX_PRIO;
idle->cpus_allowed = cpumask_of_cpu(cpu);
@@ -4587,7 +4598,7 @@
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

- on_rq = p->on_rq;
+ on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(rq_src, p, 0);
set_task_cpu(p, dest_cpu);
@@ -5986,11 +5997,11 @@
spin_lock_init(&rq->lock);
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
- rq->tasks_timeline = RB_ROOT;
- rq->clock = rq->fair_clock = 1;
+ rq->lrq.tasks_timeline = RB_ROOT;
+ rq->clock = rq->lrq.fair_clock = 1;

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
- rq->cpu_load[j] = 0;
+ rq->lrq.cpu_load[j] = 0;
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->active_balance = 0;
@@ -6072,15 +6083,15 @@

read_lock_irq(&tasklist_lock);
for_each_process(p) {
- p->fair_key = 0;
- p->wait_runtime = 0;
- p->wait_start_fair = 0;
- p->wait_start = 0;
- p->exec_start = 0;
- p->sleep_start = 0;
- p->sleep_start_fair = 0;
- p->block_start = 0;
- task_rq(p)->fair_clock = 0;
+ p->se.fair_key = 0;
+ p->se.wait_runtime = 0;
+ p->se.wait_start_fair = 0;
+ p->se.wait_start = 0;
+ p->se.exec_start = 0;
+ p->se.sleep_start = 0;
+ p->se.sleep_start_fair = 0;
+ p->se.block_start = 0;
+ task_rq(p)->lrq.fair_clock = 0;
task_rq(p)->clock = 0;

if (!rt_task(p)) {
@@ -6103,7 +6114,7 @@
goto out_unlock;
#endif

- on_rq = p->on_rq;
+ on_rq = p->se.on_rq;
if (on_rq)
deactivate_task(task_rq(p), p, 0);
__setscheduler(rq, p, SCHED_NORMAL, 0);
Index: linux-2.6.21-rc7/kernel/sched_debug.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched_debug.c 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched_debug.c 2007-05-23 20:48:34.000000000 +0530
@@ -40,15 +40,16 @@
SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d "
"%15Ld %15Ld %15Ld %15Ld %15Ld\n",
p->comm, p->pid,
- (long long)p->fair_key, (long long)p->fair_key - rq->fair_clock,
- (long long)p->wait_runtime,
+ (long long)p->se.fair_key,
+ (long long)p->se.fair_key - rq->lrq.fair_clock,
+ (long long)p->se.wait_runtime,
(long long)p->nr_switches,
p->prio,
- (long long)p->sum_exec_runtime,
- (long long)p->sum_wait_runtime,
- (long long)p->sum_sleep_runtime,
- (long long)p->wait_runtime_overruns,
- (long long)p->wait_runtime_underruns);
+ (long long)p->se.sum_exec_runtime,
+ (long long)p->se.sum_wait_runtime,
+ (long long)p->se.sum_sleep_runtime,
+ (long long)p->se.wait_runtime_overruns,
+ (long long)p->se.wait_runtime_underruns);
}

static void print_rq(struct seq_file *m, struct rq *rq, u64 now)
@@ -69,7 +70,7 @@

curr = first_fair(rq);
while (curr) {
- p = rb_entry(curr, struct task_struct, run_node);
+ p = rb_entry(curr, struct task_struct, se.run_node);
print_task(m, rq, p, now);

curr = rb_next(curr);
@@ -86,8 +87,8 @@
spin_lock_irqsave(&rq->lock, flags);
curr = first_fair(rq);
while (curr) {
- p = rb_entry(curr, struct task_struct, run_node);
- wait_runtime_rq_sum += p->wait_runtime;
+ p = rb_entry(curr, struct task_struct, se.run_node);
+ wait_runtime_rq_sum += p->se.wait_runtime;

curr = rb_next(curr);
}
@@ -106,9 +107,9 @@
SEQ_printf(m, " .%-22s: %Ld\n", #x, (long long)(rq->x))

P(nr_running);
- P(raw_weighted_load);
+ P(lrq.raw_weighted_load);
P(nr_switches);
- P(nr_load_updates);
+ P(lrq.nr_load_updates);
P(nr_uninterruptible);
SEQ_printf(m, " .%-22s: %lu\n", "jiffies", jiffies);
P(next_balance);
@@ -119,18 +120,18 @@
P(clock_unstable_events);
P(clock_max_delta);
rq->clock_max_delta = 0;
- P(fair_clock);
- P(prev_fair_clock);
- P(exec_clock);
- P(prev_exec_clock);
- P(wait_runtime);
- P(wait_runtime_overruns);
- P(wait_runtime_underruns);
- P(cpu_load[0]);
- P(cpu_load[1]);
- P(cpu_load[2]);
- P(cpu_load[3]);
- P(cpu_load[4]);
+ P(lrq.fair_clock);
+ P(lrq.prev_fair_clock);
+ P(lrq.exec_clock);
+ P(lrq.prev_exec_clock);
+ P(lrq.wait_runtime);
+ P(lrq.wait_runtime_overruns);
+ P(lrq.wait_runtime_underruns);
+ P(lrq.cpu_load[0]);
+ P(lrq.cpu_load[1]);
+ P(lrq.cpu_load[2]);
+ P(lrq.cpu_load[3]);
+ P(lrq.cpu_load[4]);
#undef P
print_rq_runtime_sum(m, rq);

@@ -190,21 +191,21 @@
#define P(F) \
SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)

- P(wait_start);
- P(wait_start_fair);
- P(exec_start);
- P(sleep_start);
- P(sleep_start_fair);
- P(block_start);
- P(sleep_max);
- P(block_max);
- P(exec_max);
- P(wait_max);
- P(last_ran);
- P(wait_runtime);
- P(wait_runtime_overruns);
- P(wait_runtime_underruns);
- P(sum_exec_runtime);
+ P(se.wait_start);
+ P(se.wait_start_fair);
+ P(se.exec_start);
+ P(se.sleep_start);
+ P(se.sleep_start_fair);
+ P(se.block_start);
+ P(se.sleep_max);
+ P(se.block_max);
+ P(se.exec_max);
+ P(se.wait_max);
+ P(se.last_ran);
+ P(se.wait_runtime);
+ P(se.wait_runtime_overruns);
+ P(se.wait_runtime_underruns);
+ P(se.sum_exec_runtime);
#undef P

{
@@ -218,7 +219,7 @@

void proc_sched_set_task(struct task_struct *p)
{
- p->sleep_max = p->block_max = p->exec_max = p->wait_max = 0;
- p->wait_runtime_overruns = p->wait_runtime_underruns = 0;
- p->sum_exec_runtime = 0;
+ p->se.sleep_max = p->se.block_max = p->se.exec_max = p->se.wait_max = 0;
+ p->se.wait_runtime_overruns = p->se.wait_runtime_underruns = 0;
+ p->se.sum_exec_runtime = 0;
}
Index: linux-2.6.21-rc7/kernel/sched_fair.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched_fair.c 2007-05-23 20:46:40.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched_fair.c 2007-05-23 20:48:34.000000000 +0530
@@ -55,10 +55,10 @@
*/
static inline void __enqueue_task_fair(struct rq *rq, struct task_struct *p)
{
- struct rb_node **link = &rq->tasks_timeline.rb_node;
+ struct rb_node **link = &rq->lrq.tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct task_struct *entry;
- s64 key = p->fair_key;
+ s64 key = p->se.fair_key;
int leftmost = 1;

/*
@@ -66,12 +66,12 @@
*/
while (*link) {
parent = *link;
- entry = rb_entry(parent, struct task_struct, run_node);
+ entry = rb_entry(parent, struct task_struct, se.run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if ((s64)(key - entry->fair_key) < 0) {
+ if ((s64)(key - entry->se.fair_key) < 0) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
@@ -84,31 +84,31 @@
* used):
*/
if (leftmost)
- rq->rb_leftmost = &p->run_node;
+ rq->lrq.rb_leftmost = &p->se.run_node;

- rb_link_node(&p->run_node, parent, link);
- rb_insert_color(&p->run_node, &rq->tasks_timeline);
+ rb_link_node(&p->se.run_node, parent, link);
+ rb_insert_color(&p->se.run_node, &rq->lrq.tasks_timeline);
}

static inline void __dequeue_task_fair(struct rq *rq, struct task_struct *p)
{
- if (rq->rb_leftmost == &p->run_node)
- rq->rb_leftmost = NULL;
- rb_erase(&p->run_node, &rq->tasks_timeline);
+ if (rq->lrq.rb_leftmost == &p->se.run_node)
+ rq->lrq.rb_leftmost = NULL;
+ rb_erase(&p->se.run_node, &rq->lrq.tasks_timeline);
}

static inline struct rb_node * first_fair(struct rq *rq)
{
- if (rq->rb_leftmost)
- return rq->rb_leftmost;
+ if (rq->lrq.rb_leftmost)
+ return rq->lrq.rb_leftmost;
/* Cache the value returned by rb_first() */
- rq->rb_leftmost = rb_first(&rq->tasks_timeline);
- return rq->rb_leftmost;
+ rq->lrq.rb_leftmost = rb_first(&rq->lrq.tasks_timeline);
+ return rq->lrq.rb_leftmost;
}

static struct task_struct * __pick_next_task_fair(struct rq *rq)
{
- return rb_entry(first_fair(rq), struct task_struct, run_node);
+ return rb_entry(first_fair(rq), struct task_struct, se.run_node);
}

/**************************************************************/
@@ -125,24 +125,24 @@
/*
* Negative nice levels get the same granularity as nice-0:
*/
- if (curr->load_weight >= NICE_0_LOAD)
+ if (curr->se.load_weight >= NICE_0_LOAD)
return granularity;
/*
* Positive nice level tasks get linearly finer
* granularity:
*/
- return curr->load_weight * (s64)(granularity / NICE_0_LOAD);
+ return curr->se.load_weight * (s64)(granularity / NICE_0_LOAD);
}

unsigned long get_rq_load(struct rq *rq)
{
- unsigned long load = rq->cpu_load[CPU_LOAD_IDX_MAX-1] + 1;
+ unsigned long load = rq->lrq.cpu_load[CPU_LOAD_IDX_MAX-1] + 1;

if (!(sysctl_sched_load_smoothing & 1))
- return rq->raw_weighted_load;
+ return rq->lrq.raw_weighted_load;

if (sysctl_sched_load_smoothing & 4)
- load = max(load, rq->raw_weighted_load);
+ load = max(load, rq->lrq.raw_weighted_load);

return load;
}
@@ -156,31 +156,31 @@
* Niced tasks have the same history dynamic range as
* non-niced tasks, but their limits are offset.
*/
- if (p->wait_runtime > nice_limit) {
- p->wait_runtime = nice_limit;
- p->wait_runtime_overruns++;
- rq->wait_runtime_overruns++;
+ if (p->se.wait_runtime > nice_limit) {
+ p->se.wait_runtime = nice_limit;
+ p->se.wait_runtime_overruns++;
+ rq->lrq.wait_runtime_overruns++;
}
limit = (limit << 1) - nice_limit;
- if (p->wait_runtime < -limit) {
- p->wait_runtime = -limit;
- p->wait_runtime_underruns++;
- rq->wait_runtime_underruns++;
+ if (p->se.wait_runtime < -limit) {
+ p->se.wait_runtime = -limit;
+ p->se.wait_runtime_underruns++;
+ rq->lrq.wait_runtime_underruns++;
}
}

static void __add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta)
{
- p->wait_runtime += delta;
- p->sum_wait_runtime += delta;
+ p->se.wait_runtime += delta;
+ p->se.sum_wait_runtime += delta;
limit_wait_runtime(rq, p);
}

static void add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta)
{
- rq->wait_runtime -= p->wait_runtime;
+ rq->lrq.wait_runtime -= p->se.wait_runtime;
__add_wait_runtime(rq, p, delta);
- rq->wait_runtime += p->wait_runtime;
+ rq->lrq.wait_runtime += p->se.wait_runtime;
}

/*
@@ -193,15 +193,15 @@
struct task_struct *curr = rq->curr;

if (curr->sched_class != &fair_sched_class || curr == rq->idle
- || !curr->on_rq)
+ || !curr->se.on_rq)
return;
/*
* Get the amount of time the current task was running
* since the last time we changed raw_weighted_load:
*/
- delta_exec = now - curr->exec_start;
- if (unlikely(delta_exec > curr->exec_max))
- curr->exec_max = delta_exec;
+ delta_exec = now - curr->se.exec_start;
+ if (unlikely(delta_exec > curr->se.exec_max))
+ curr->se.exec_max = delta_exec;

if (sysctl_sched_load_smoothing & 1) {
unsigned long load = get_rq_load(rq);
@@ -211,24 +211,24 @@
do_div(delta_fair, load);
} else {
delta_fair = delta_exec * NICE_0_LOAD;
- do_div(delta_fair, rq->raw_weighted_load);
+ do_div(delta_fair, rq->lrq.raw_weighted_load);
}

- delta_mine = delta_exec * curr->load_weight;
+ delta_mine = delta_exec * curr->se.load_weight;
do_div(delta_mine, load);
} else {
delta_fair = delta_exec * NICE_0_LOAD;
- delta_fair += rq->raw_weighted_load >> 1;
- do_div(delta_fair, rq->raw_weighted_load);
+ delta_fair += rq->lrq.raw_weighted_load >> 1;
+ do_div(delta_fair, rq->lrq.raw_weighted_load);

- delta_mine = delta_exec * curr->load_weight;
- delta_mine += rq->raw_weighted_load >> 1;
- do_div(delta_mine, rq->raw_weighted_load);
+ delta_mine = delta_exec * curr->se.load_weight;
+ delta_mine += rq->lrq.raw_weighted_load >> 1;
+ do_div(delta_mine, rq->lrq.raw_weighted_load);
}

- curr->sum_exec_runtime += delta_exec;
- curr->exec_start = now;
- rq->exec_clock += delta_exec;
+ curr->se.sum_exec_runtime += delta_exec;
+ curr->se.exec_start = now;
+ rq->lrq.exec_clock += delta_exec;

/*
* Task already marked for preemption, do not burden
@@ -237,7 +237,7 @@
if (unlikely(test_tsk_thread_flag(curr, TIF_NEED_RESCHED)))
goto out_nowait;

- rq->fair_clock += delta_fair;
+ rq->lrq.fair_clock += delta_fair;
/*
* We executed delta_exec amount of time on the CPU,
* but we were only entitled to delta_mine amount of
@@ -253,8 +253,8 @@
static inline void
update_stats_wait_start(struct rq *rq, struct task_struct *p, u64 now)
{
- p->wait_start_fair = rq->fair_clock;
- p->wait_start = now;
+ p->se.wait_start_fair = rq->lrq.fair_clock;
+ p->se.wait_start = now;
}

/*
@@ -274,29 +274,29 @@
/*
* Update the key:
*/
- key = rq->fair_clock;
+ key = rq->lrq.fair_clock;

/*
* Optimize the common nice 0 case:
*/
- if (likely(p->load_weight == NICE_0_LOAD)) {
- key -= p->wait_runtime;
+ if (likely(p->se.load_weight == NICE_0_LOAD)) {
+ key -= p->se.wait_runtime;
} else {
- int negative = p->wait_runtime < 0;
+ int negative = p->se.wait_runtime < 0;
u64 tmp;

- if (p->load_weight > NICE_0_LOAD) {
+ if (p->se.load_weight > NICE_0_LOAD) {
/* negative-reniced tasks get helped: */

if (negative) {
- tmp = -p->wait_runtime;
+ tmp = -p->se.wait_runtime;
tmp *= NICE_0_LOAD;
- do_div(tmp, p->load_weight);
+ do_div(tmp, p->se.load_weight);

key += tmp;
} else {
- tmp = p->wait_runtime;
- tmp *= p->load_weight;
+ tmp = p->se.wait_runtime;
+ tmp *= p->se.load_weight;
do_div(tmp, NICE_0_LOAD);

key -= tmp;
@@ -305,16 +305,16 @@
/* plus-reniced tasks get hurt: */

if (negative) {
- tmp = -p->wait_runtime;
+ tmp = -p->se.wait_runtime;

tmp *= NICE_0_LOAD;
- do_div(tmp, p->load_weight);
+ do_div(tmp, p->se.load_weight);

key += tmp;
} else {
- tmp = p->wait_runtime;
+ tmp = p->se.wait_runtime;

- tmp *= p->load_weight;
+ tmp *= p->se.load_weight;
do_div(tmp, NICE_0_LOAD);

key -= tmp;
@@ -322,7 +322,7 @@
}
}

- p->fair_key = key;
+ p->se.fair_key = key;
}

/*
@@ -333,20 +333,20 @@
{
s64 delta_fair, delta_wait;

- delta_wait = now - p->wait_start;
- if (unlikely(delta_wait > p->wait_max))
- p->wait_max = delta_wait;
-
- if (p->wait_start_fair) {
- delta_fair = rq->fair_clock - p->wait_start_fair;
- if (unlikely(p->load_weight != NICE_0_LOAD))
- delta_fair = (delta_fair * p->load_weight) /
+ delta_wait = now - p->se.wait_start;
+ if (unlikely(delta_wait > p->se.wait_max))
+ p->se.wait_max = delta_wait;
+
+ if (p->se.wait_start_fair) {
+ delta_fair = rq->lrq.fair_clock - p->se.wait_start_fair;
+ if (unlikely(p->se.load_weight != NICE_0_LOAD))
+ delta_fair = (delta_fair * p->se.load_weight) /
NICE_0_LOAD;
add_wait_runtime(rq, p, delta_fair);
}

- p->wait_start_fair = 0;
- p->wait_start = 0;
+ p->se.wait_start_fair = 0;
+ p->se.wait_start = 0;
}

static inline void
@@ -370,7 +370,7 @@
/*
* We are starting a new run period:
*/
- p->exec_start = now;
+ p->se.exec_start = now;
}

/*
@@ -381,7 +381,7 @@
{
update_curr(rq, now);

- p->exec_start = 0;
+ p->se.exec_start = 0;
}

/**************************************************************/
@@ -396,7 +396,7 @@
if (!(sysctl_sched_load_smoothing & 16))
goto out;

- delta_fair = rq->fair_clock - p->sleep_start_fair;
+ delta_fair = rq->lrq.fair_clock - p->se.sleep_start_fair;
if ((s64)delta_fair < 0)
delta_fair = 0;

@@ -406,15 +406,15 @@
*/
if (sysctl_sched_load_smoothing & 8) {
delta_fair = delta_fair * load;
- do_div(delta_fair, load + p->load_weight);
+ do_div(delta_fair, load + p->se.load_weight);
}

__add_wait_runtime(rq, p, delta_fair);

out:
- rq->wait_runtime += p->wait_runtime;
+ rq->lrq.wait_runtime += p->se.wait_runtime;

- p->sleep_start_fair = 0;
+ p->se.sleep_start_fair = 0;
}

/*
@@ -433,29 +433,29 @@
update_curr(rq, now);

if (wakeup) {
- if (p->sleep_start) {
- delta = now - p->sleep_start;
+ if (p->se.sleep_start) {
+ delta = now - p->se.sleep_start;
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > p->sleep_max))
- p->sleep_max = delta;
+ if (unlikely(delta > p->se.sleep_max))
+ p->se.sleep_max = delta;

- p->sleep_start = 0;
+ p->se.sleep_start = 0;
}
- if (p->block_start) {
- delta = now - p->block_start;
+ if (p->se.block_start) {
+ delta = now - p->se.block_start;
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > p->block_max))
- p->block_max = delta;
+ if (unlikely(delta > p->se.block_max))
+ p->se.block_max = delta;

- p->block_start = 0;
+ p->se.block_start = 0;
}
- p->sum_sleep_runtime += delta;
+ p->se.sum_sleep_runtime += delta;

- if (p->sleep_start_fair)
+ if (p->se.sleep_start_fair)
enqueue_sleeper(rq, p);
}
update_stats_enqueue(rq, p, now);
@@ -473,11 +473,11 @@
update_stats_dequeue(rq, p, now);
if (sleep) {
if (p->state & TASK_INTERRUPTIBLE)
- p->sleep_start = now;
+ p->se.sleep_start = now;
if (p->state & TASK_UNINTERRUPTIBLE)
- p->block_start = now;
- p->sleep_start_fair = rq->fair_clock;
- rq->wait_runtime -= p->wait_runtime;
+ p->se.block_start = now;
+ p->se.sleep_start_fair = rq->lrq.fair_clock;
+ rq->lrq.wait_runtime -= p->se.wait_runtime;
}
__dequeue_task_fair(rq, p);
}
@@ -509,9 +509,9 @@
* position within the tree:
*/
dequeue_task_fair(rq, p, 0, now);
- p->on_rq = 0;
+ p->se.on_rq = 0;
enqueue_task_fair(rq, p, 0, now);
- p->on_rq = 1;
+ p->se.on_rq = 1;

/*
* Reschedule if another task tops the current one.
@@ -526,11 +526,11 @@
* yield-to support: if we are on the same runqueue then
* give half of our wait_runtime (if it's positive) to the other task:
*/
- if (p_to && p->wait_runtime > 0) {
- p_to->wait_runtime += p->wait_runtime >> 1;
- p->wait_runtime >>= 1;
+ if (p_to && p->se.wait_runtime > 0) {
+ p_to->se.wait_runtime += p->se.wait_runtime >> 1;
+ p->se.wait_runtime >>= 1;
}
- curr = &p->run_node;
+ curr = &p->se.run_node;
first = first_fair(rq);
/*
* Move this task to the second place in the tree:
@@ -547,25 +547,25 @@
return;
}

- p_next = rb_entry(next, struct task_struct, run_node);
+ p_next = rb_entry(next, struct task_struct, se.run_node);
/*
* Minimally necessary key value to be the second in the tree:
*/
- yield_key = p_next->fair_key + 1;
+ yield_key = p_next->se.fair_key + 1;

now = __rq_clock(rq);
dequeue_task_fair(rq, p, 0, now);
- p->on_rq = 0;
+ p->se.on_rq = 0;

/*
* Only update the key if we need to move more backwards
* than the minimally necessary position to be the second:
*/
- if (p->fair_key < yield_key)
- p->fair_key = yield_key;
+ if (p->se.fair_key < yield_key)
+ p->se.fair_key = yield_key;

__enqueue_task_fair(rq, p);
- p->on_rq = 1;
+ p->se.on_rq = 1;
}

/*
@@ -575,7 +575,7 @@
__check_preempt_curr_fair(struct rq *rq, struct task_struct *p,
struct task_struct *curr, unsigned long granularity)
{
- s64 __delta = curr->fair_key - p->fair_key;
+ s64 __delta = curr->se.fair_key - p->se.fair_key;

/*
* Take scheduling granularity into account - do not
@@ -631,7 +631,7 @@
* If the task is still waiting for the CPU (it just got
* preempted), start the wait period:
*/
- if (prev->on_rq)
+ if (prev->se.on_rq)
update_stats_wait_start(rq, prev, now);
}

@@ -654,23 +654,23 @@
if (!first)
return NULL;

- p = rb_entry(first, struct task_struct, run_node);
+ p = rb_entry(first, struct task_struct, se.run_node);

- rq->rb_load_balance_curr = rb_next(first);
+ rq->lrq.rb_load_balance_curr = rb_next(first);

return p;
}

static struct task_struct * load_balance_next_fair(struct rq *rq)
{
- struct rb_node *curr = rq->rb_load_balance_curr;
+ struct rb_node *curr = rq->lrq.rb_load_balance_curr;
struct task_struct *p;

if (!curr)
return NULL;

- p = rb_entry(curr, struct task_struct, run_node);
- rq->rb_load_balance_curr = rb_next(curr);
+ p = rb_entry(curr, struct task_struct, se.run_node);
+ rq->lrq.rb_load_balance_curr = rb_next(curr);

return p;
}
@@ -688,9 +688,9 @@
* position within the tree:
*/
dequeue_task_fair(rq, curr, 0, now);
- curr->on_rq = 0;
+ curr->se.on_rq = 0;
enqueue_task_fair(rq, curr, 0, now);
- curr->on_rq = 1;
+ curr->se.on_rq = 1;

/*
* Reschedule if another task tops the current one.
@@ -723,16 +723,16 @@
* until it reschedules once. We set up the key so that
* it will preempt the parent:
*/
- p->fair_key = current->fair_key - niced_granularity(rq->curr,
+ p->se.fair_key = current->se.fair_key - niced_granularity(rq->curr,
sysctl_sched_granularity) - 1;
/*
* The first wait is dominated by the child-runs-first logic,
* so do not credit it with that waiting time yet:
*/
- p->wait_start_fair = 0;
+ p->se.wait_start_fair = 0;

__enqueue_task_fair(rq, p);
- p->on_rq = 1;
+ p->se.on_rq = 1;
inc_nr_running(p, rq);
}


--
Regards,
vatsa

2007-05-23 17:12:59

by Srivatsa Vaddagiri

Subject: [RFC] [PATCH 3/3] Generalize CFS core and provide per-user fairness

This patch reuses the CFS core to provide inter-task-group fairness. For
demonstration purposes, the patch also extends CFS to provide per-user
fairness. The patch is very experimental at the moment and, in particular,
SMP LOAD BALANCING IS DISABLED to keep the patch simple. I think group-based
SMP load balancing is trickier and I intend to look at it next.

Although user id is chosen as the basis for grouping tasks, the patch can
be adapted to work with other task-grouping mechanisms (like
http://lkml.org/lkml/2007/4/27/146).
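
For orientation while reading the diff, here is an abbreviated sketch of the
data-structure relationships the patch sets up. Only the grouping-related
fields are shown; the names are taken from the hunks below, everything else
is omitted.

struct sched_entity {
	/* ... CFS accounting fields introduced in patch 2/3 ... */
	struct sched_entity *parent;	/* task entity -> its user's per-cpu entity */
	struct lrq *my_q;		/* for a group entity: the queue it owns (NULL for tasks) */
};

struct user_struct {
	/* ... existing uid bookkeeping ... */
	struct sched_entity *se;	/* per-cpu: this user, schedulable as one entity per CPU */
	struct lrq *lrq;		/* per-cpu: the runqueue holding this user's tasks */
};

Each CPU's rq->lrq then queues the per-user group entities, while
user->lrq[cpu] queues that user's tasks on that CPU - which gives the
two-level pick described in the cover note.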

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


---
arch/i386/Kconfig | 6
arch/x86_64/Kconfig | 6
include/linux/sched.h | 16
kernel/sched.c | 256 +++++++++++----
kernel/sched_debug.c | 4
kernel/sched_fair.c | 820 ++++++++++++++++++++++++++++++++------------------
kernel/sched_rt.c | 2
kernel/user.c | 5
8 files changed, 753 insertions(+), 362 deletions(-)

Index: linux-2.6.21-rc7/arch/i386/Kconfig
===================================================================
--- linux-2.6.21-rc7.orig/arch/i386/Kconfig 2007-05-23 20:46:38.000000000 +0530
+++ linux-2.6.21-rc7/arch/i386/Kconfig 2007-05-23 20:48:39.000000000 +0530
@@ -307,6 +307,12 @@
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.

+config FAIR_USER_SCHED
+ bool "Fair user scheduler"
+ default n
+ help
+ Fair user scheduler
+
source "kernel/Kconfig.preempt"

config X86_UP_APIC
Index: linux-2.6.21-rc7/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.21-rc7.orig/arch/x86_64/Kconfig 2007-05-23 20:46:38.000000000 +0530
+++ linux-2.6.21-rc7/arch/x86_64/Kconfig 2007-05-23 20:48:39.000000000 +0530
@@ -330,6 +330,12 @@
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.

+config FAIR_USER_SCHED
+ bool "Fair user scheduler"
+ default n
+ help
+ Fair user scheduler
+
source "kernel/Kconfig.preempt"

config NUMA
Index: linux-2.6.21-rc7/include/linux/sched.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/sched.h 2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/include/linux/sched.h 2007-05-23 20:48:39.000000000 +0530
@@ -551,6 +551,16 @@
#define is_rt_policy(p) ((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))

+#ifdef CONFIG_FAIR_USER_SCHED
+int sched_alloc_user(struct user_struct *user);
+void sched_free_user(struct user_struct *user);
+void sched_move_task(struct user_struct *old);
+#else
+static inline int sched_alloc_user(struct user_struct *user) { return 0; }
+static inline void sched_free_user(struct user_struct *user) { }
+static inline void sched_move_task(struct user_struct *user) { }
+#endif
+
/*
* Some day this will be a full-fledged user tracking system..
*/
@@ -575,6 +585,10 @@
/* Hash table maintenance information */
struct list_head uidhash_list;
uid_t uid;
+#ifdef CONFIG_FAIR_USER_SCHED
+ struct sched_entity *se; /* per-cpu sched_entity */
+ struct lrq *lrq; /* per-cpu runqueue for this user */
+#endif
};

extern struct user_struct *find_user(uid_t);
@@ -859,6 +873,8 @@
s64 fair_key;
s64 sum_wait_runtime, sum_sleep_runtime;
unsigned long wait_runtime_overruns, wait_runtime_underruns;
+ struct sched_entity *parent;
+ struct lrq *my_q; /* The queue owned by this entity */
};

struct task_struct {
Index: linux-2.6.21-rc7/kernel/sched.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched.c 2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched.c 2007-05-23 20:48:39.000000000 +0530
@@ -129,6 +129,14 @@
struct rb_root tasks_timeline;
struct rb_node *rb_leftmost;
struct rb_node *rb_load_balance_curr;
+ struct sched_entity *curr;
+ unsigned int *sched_granularity; /* &sysctl_sched_granularity */
+ struct rq *rq;
+ unsigned long nice_0_load;
+#ifdef CONFIG_FAIR_USER_SCHED
+ struct list_head lrq_list;
+ struct rcu_head rcu;
+#endif
};

/*
@@ -164,6 +172,7 @@

struct task_struct *curr, *idle;
unsigned long next_balance;
+ unsigned long rt_load;
struct mm_struct *prev_mm;

u64 clock, prev_clock_raw;
@@ -214,6 +223,32 @@
struct lock_class_key rq_lock_key;
};

+#define NICE_0_LOAD SCHED_LOAD_SCALE
+#define NICE_0_SHIFT SCHED_LOAD_SHIFT
+
+#ifdef CONFIG_FAIR_USER_SCHED
+static struct sched_entity root_user_se[NR_CPUS];
+static struct lrq root_user_lrq[NR_CPUS];
+
+static inline void init_se(struct sched_entity *se, struct lrq *lrq)
+{
+ se->my_q = lrq;
+ se->load_weight = NICE_0_LOAD;
+}
+
+static inline void init_lrq(struct lrq *lrq, struct rq *rq)
+{
+ lrq->rq = rq;
+ lrq->fair_clock = 1;
+ lrq->tasks_timeline = RB_ROOT;
+ lrq->nice_0_load = NICE_0_LOAD;
+ lrq->sched_granularity = &sysctl_sched_granularity;
+ INIT_LIST_HEAD(&lrq->lrq_list);
+ list_add_rcu(&lrq->lrq_list, &rq->lrq.lrq_list);
+}
+
+#endif
+
static DEFINE_PER_CPU(struct rq, runqueues) ____cacheline_aligned_in_smp;
static DEFINE_MUTEX(sched_hotcpu_mutex);

@@ -555,9 +590,6 @@
#define RTPRIO_TO_LOAD_WEIGHT(rp) \
(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))

-#define NICE_0_LOAD SCHED_LOAD_SCALE
-#define NICE_0_SHIFT SCHED_LOAD_SHIFT
-
/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
@@ -576,16 +608,22 @@
/* 10 */ 110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
};

+extern struct sched_class rt_sched_class;
+
static inline void
inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
{
- rq->lrq.raw_weighted_load += p->se.load_weight;
+ /* Hack - needs better handling */
+ if (p->sched_class == &rt_sched_class)
+ rq->rt_load += p->se.load_weight;
}

static inline void
dec_raw_weighted_load(struct rq *rq, const struct task_struct *p)
{
- rq->lrq.raw_weighted_load -= p->se.load_weight;
+ /* Hack - needs better handling */
+ if (p->sched_class == &rt_sched_class)
+ rq->rt_load -= p->se.load_weight;
}

static inline void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -728,10 +766,32 @@
return cpu_curr(task_cpu(p)) == p;
}

+#ifdef CONFIG_FAIR_USER_SCHED
+
+#define for_each_lrq(rq, lrq) \
+ for (lrq = container_of((rq)->lrq.lrq_list.next, struct lrq, lrq_list);\
+ prefetch(rcu_dereference(lrq->lrq_list.next)), lrq != &(rq)->lrq;\
+ lrq = container_of(lrq->lrq_list.next, struct lrq, lrq_list))
+
+#else
+
+#define for_each_lrq(rq, lrq) \
+ for (lrq = &rq->lrq; lrq != NULL; lrq = NULL)
+
+#endif
+
/* Used instead of source_load when we know the type == 0 */
unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->lrq.raw_weighted_load;
+ struct lrq *lrq;
+ unsigned long weight = 0;
+
+ for_each_lrq(cpu_rq(cpu), lrq)
+ weight += lrq->raw_weighted_load;
+
+ weight += cpu_rq(cpu)->rt_load;
+
+ return weight;
}

#ifdef CONFIG_SMP
@@ -761,6 +821,10 @@
if (p->se.sleep_start_fair)
p->se.sleep_start_fair -= fair_clock_offset;

+#ifdef CONFIG_FAIR_USER_SCHED
+ p->se.parent = &p->user->se[new_cpu];
+#endif
+
task_thread_info(p)->cpu = new_cpu;

}
@@ -863,12 +927,18 @@
*/
static inline unsigned long source_load(int cpu, int type)
{
- struct rq *rq = cpu_rq(cpu);
+ unsigned long rwl, cpl = 0;
+ struct lrq *lrq;
+
+ rwl = weighted_cpuload(cpu);

if (type == 0)
- return rq->lrq.raw_weighted_load;
+ return rwl;
+
+ for_each_lrq(cpu_rq(cpu), lrq)
+ cpl += lrq->cpu_load[type-1];

- return min(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
+ return min(cpl, rwl);
}

/*
@@ -877,12 +947,18 @@
*/
static inline unsigned long target_load(int cpu, int type)
{
- struct rq *rq = cpu_rq(cpu);
+ unsigned long rwl, cpl = 0;
+ struct lrq *lrq;
+
+ rwl = weighted_cpuload(cpu);

if (type == 0)
- return rq->lrq.raw_weighted_load;
+ return rwl;
+
+ for_each_lrq(cpu_rq(cpu), lrq)
+ cpl += lrq->cpu_load[type-1];

- return max(rq->lrq.cpu_load[type-1], rq->lrq.raw_weighted_load);
+ return max(cpl, rwl);
}

/*
@@ -893,7 +969,7 @@
struct rq *rq = cpu_rq(cpu);
unsigned long n = rq->nr_running;

- return n ? rq->lrq.raw_weighted_load / n : SCHED_LOAD_SCALE;
+ return n ? weighted_cpuload(cpu) / n : SCHED_LOAD_SCALE;
}

/*
@@ -1583,59 +1659,6 @@
return running + uninterruptible;
}

-static void update_load_fair(struct rq *this_rq)
-{
- unsigned long this_load, fair_delta, exec_delta, idle_delta;
- unsigned int i, scale;
- s64 fair_delta64, exec_delta64;
- unsigned long tmp;
- u64 tmp64;
-
- this_rq->lrq.nr_load_updates++;
- if (!(sysctl_sched_load_smoothing & 64)) {
- this_load = this_rq->lrq.raw_weighted_load;
- goto do_avg;
- }
-
- fair_delta64 = this_rq->lrq.fair_clock -
- this_rq->lrq.prev_fair_clock + 1;
- this_rq->lrq.prev_fair_clock = this_rq->lrq.fair_clock;
-
- exec_delta64 = this_rq->lrq.exec_clock -
- this_rq->lrq.prev_exec_clock + 1;
- this_rq->lrq.prev_exec_clock = this_rq->lrq.exec_clock;
-
- if (fair_delta64 > (s64)LONG_MAX)
- fair_delta64 = (s64)LONG_MAX;
- fair_delta = (unsigned long)fair_delta64;
-
- if (exec_delta64 > (s64)LONG_MAX)
- exec_delta64 = (s64)LONG_MAX;
- exec_delta = (unsigned long)exec_delta64;
- if (exec_delta > TICK_NSEC)
- exec_delta = TICK_NSEC;
-
- idle_delta = TICK_NSEC - exec_delta;
-
- tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
- tmp64 = (u64)tmp * (u64)exec_delta;
- do_div(tmp64, TICK_NSEC);
- this_load = (unsigned long)tmp64;
-
-do_avg:
- /* Update our load: */
- for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
- unsigned long old_load, new_load;
-
- /* scale is effectively 1 << i now, and >> i divides by scale */
-
- old_load = this_rq->lrq.cpu_load[i];
- new_load = this_load;
-
- this_rq->lrq.cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
- }
-}
-
#ifdef CONFIG_SMP

/*
@@ -1999,7 +2022,7 @@

avg_load += load;
sum_nr_running += rq->nr_running;
- sum_weighted_load += rq->lrq.raw_weighted_load;
+ sum_weighted_load += weighted_cpuload(i);
}

/*
@@ -2227,18 +2250,19 @@
int i;

for_each_cpu_mask(i, group->cpumask) {
+ unsigned long rwl;

if (!cpu_isset(i, *cpus))
continue;

rq = cpu_rq(i);
+ rwl = weighted_cpuload(i);

- if (rq->nr_running == 1 &&
- rq->lrq.raw_weighted_load > imbalance)
+ if (rq->nr_running == 1 && rwl > imbalance)
continue;

- if (rq->lrq.raw_weighted_load > max_load) {
- max_load = rq->lrq.raw_weighted_load;
+ if (rwl > max_load) {
+ max_load = rwl;
busiest = rq;
}
}
@@ -5988,6 +6012,12 @@
*/
rt_sched_class.next = &fair_sched_class;
fair_sched_class.next = NULL;
+#ifdef CONFIG_FAIR_USER_SCHED
+ root_user.se = root_user_se; /* per-cpu schedulable entities */
+ root_user.lrq = root_user_lrq; /* per-cpu runqueue */
+ root_user_lrq[0].curr = &current->se; /* todo: remove this */
+ cpu_rq(0)->lrq.curr = current->se.parent = &root_user.se[0];
+#endif

for_each_possible_cpu(i) {
struct prio_array *array;
@@ -5999,6 +6029,9 @@
rq->nr_running = 0;
rq->lrq.tasks_timeline = RB_ROOT;
rq->clock = rq->lrq.fair_clock = 1;
+ rq->lrq.nice_0_load = NICE_0_LOAD;
+ rq->lrq.sched_granularity = &sysctl_sched_granularity;
+ rq->lrq.rq = rq;

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->lrq.cpu_load[j] = 0;
@@ -6020,6 +6053,16 @@
highest_cpu = i;
/* delimiter for bitsearch: */
__set_bit(MAX_RT_PRIO, array->bitmap);
+#ifdef CONFIG_FAIR_USER_SCHED
+ INIT_LIST_HEAD(&rq->lrq.lrq_list);
+ {
+ struct lrq *lrq = &current->user->lrq[i];
+ struct sched_entity *se = &current->user->se[i];
+
+ init_se(se, lrq);
+ init_lrq(lrq, rq);
+ }
+#endif
}

set_load_weight(&init_task);
@@ -6176,3 +6219,74 @@
}

#endif
+
+#ifdef CONFIG_FAIR_USER_SCHED
+
+int sched_alloc_user(struct user_struct *new)
+{
+ int i = num_possible_cpus();
+
+ new->se = kzalloc(sizeof(struct sched_entity) * i, GFP_KERNEL);
+ if (!new->se)
+ return -ENOMEM;
+
+ new->lrq = kzalloc(sizeof(struct lrq) * i, GFP_KERNEL);
+ if (!new->lrq) {
+ kfree(new->se);
+ return -ENOMEM;
+ }
+
+ for_each_possible_cpu(i) {
+ struct lrq *lrq = &new->lrq[i];
+ struct sched_entity *se = &new->se[i];
+ struct rq *rq = cpu_rq(i);
+
+ init_se(se, lrq);
+ init_lrq(lrq, rq);
+ }
+
+ return 0;
+}
+
+static void free_lrq(struct rcu_head *rhp)
+{
+ struct lrq *lrq = container_of(rhp, struct lrq, rcu);
+
+ kfree(lrq);
+}
+
+void sched_free_user(struct user_struct *up)
+{
+ int i;
+ struct lrq *lrq;
+
+ for_each_possible_cpu(i) {
+ lrq = &up->lrq[i];
+ list_del_rcu(&lrq->lrq_list);
+ }
+
+ lrq = &up->lrq[0];
+ call_rcu(&lrq->rcu, free_lrq);
+
+ kfree(up->se);
+}
+
+void sched_move_task(struct user_struct *old)
+{
+ unsigned long flags;
+ struct user_struct *new = current->user;
+ struct rq *rq;
+
+ rq = task_rq_lock(current, &flags);
+
+ current->user = old;
+ deactivate_task(rq, current, 0);
+ current->user = new;
+ current->se.parent = &new->se[task_cpu(current)];
+ activate_task(rq, current, 0);
+
+ task_rq_unlock(rq, &flags);
+}
+
+
+#endif
Index: linux-2.6.21-rc7/kernel/sched_debug.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched_debug.c 2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched_debug.c 2007-05-23 20:48:39.000000000 +0530
@@ -68,7 +68,7 @@
"------------------------------------------------"
"--------------------------------\n");

- curr = first_fair(rq);
+ curr = first_fair(&rq->lrq);
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
print_task(m, rq, p, now);
@@ -85,7 +85,7 @@
unsigned long flags;

spin_lock_irqsave(&rq->lock, flags);
- curr = first_fair(rq);
+ curr = first_fair(&rq->lrq);
while (curr) {
p = rb_entry(curr, struct task_struct, se.run_node);
wait_runtime_rq_sum += p->se.wait_runtime;
Index: linux-2.6.21-rc7/kernel/sched_fair.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched_fair.c 2007-05-23 20:48:34.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched_fair.c 2007-05-23 20:48:39.000000000 +0530
@@ -46,19 +46,25 @@

extern struct sched_class fair_sched_class;

+#define entity_is_task(t) (!t->my_q)
+#define task_entity(t) container_of(t, struct task_struct, se)
+static inline void update_curr(struct lrq *lrq, u64 now);
+
/**************************************************************/
/* Scheduling class tree data structure manipulation methods:
*/

+/************* Start generic schedulable entity operations ********************/
+
/*
* Enqueue a task into the rb-tree:
*/
-static inline void __enqueue_task_fair(struct rq *rq, struct task_struct *p)
+static inline void __enqueue_entity(struct lrq *lrq, struct sched_entity *p)
{
- struct rb_node **link = &rq->lrq.tasks_timeline.rb_node;
+ struct rb_node **link = &lrq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
- struct task_struct *entry;
- s64 key = p->se.fair_key;
+ struct sched_entity *entry;
+ s64 key = p->fair_key;
int leftmost = 1;

/*
@@ -66,12 +72,12 @@
*/
while (*link) {
parent = *link;
- entry = rb_entry(parent, struct task_struct, se.run_node);
+ entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if ((s64)(key - entry->se.fair_key) < 0) {
+ if ((s64)(key - entry->fair_key) < 0) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
@@ -84,31 +90,35 @@
* used):
*/
if (leftmost)
- rq->lrq.rb_leftmost = &p->se.run_node;
+ lrq->rb_leftmost = &p->run_node;

- rb_link_node(&p->se.run_node, parent, link);
- rb_insert_color(&p->se.run_node, &rq->lrq.tasks_timeline);
+ rb_link_node(&p->run_node, parent, link);
+ rb_insert_color(&p->run_node, &lrq->tasks_timeline);
+ lrq->raw_weighted_load += p->load_weight;
+ p->on_rq = 1;
}

-static inline void __dequeue_task_fair(struct rq *rq, struct task_struct *p)
+static inline void __dequeue_entity(struct lrq *lrq, struct sched_entity *p)
{
- if (rq->lrq.rb_leftmost == &p->se.run_node)
- rq->lrq.rb_leftmost = NULL;
- rb_erase(&p->se.run_node, &rq->lrq.tasks_timeline);
+ if (lrq->rb_leftmost == &p->run_node)
+ lrq->rb_leftmost = NULL;
+ rb_erase(&p->run_node, &lrq->tasks_timeline);
+ lrq->raw_weighted_load -= p->load_weight;
+ p->on_rq = 0;
}

-static inline struct rb_node * first_fair(struct rq *rq)
+static inline struct rb_node * first_fair(struct lrq *lrq)
{
- if (rq->lrq.rb_leftmost)
- return rq->lrq.rb_leftmost;
+ if (lrq->rb_leftmost)
+ return lrq->rb_leftmost;
/* Cache the value returned by rb_first() */
- rq->lrq.rb_leftmost = rb_first(&rq->lrq.tasks_timeline);
- return rq->lrq.rb_leftmost;
+ lrq->rb_leftmost = rb_first(&lrq->tasks_timeline);
+ return lrq->rb_leftmost;
}

-static struct task_struct * __pick_next_task_fair(struct rq *rq)
+static struct sched_entity * __pick_next_entity(struct lrq *lrq)
{
- return rb_entry(first_fair(rq), struct task_struct, se.run_node);
+ return rb_entry(first_fair(lrq), struct sched_entity, run_node);
}

/**************************************************************/
@@ -119,125 +129,126 @@
* We rescale the rescheduling granularity of tasks according to their
* nice level, but only linearly, not exponentially:
*/
-static u64
-niced_granularity(struct task_struct *curr, unsigned long granularity)
+static u64 niced_granularity(struct lrq *lrq, struct sched_entity *curr,
+ unsigned long granularity)
{
/*
* Negative nice levels get the same granularity as nice-0:
*/
- if (curr->se.load_weight >= NICE_0_LOAD)
+ if (curr->load_weight >= lrq->nice_0_load)
return granularity;
/*
* Positive nice level tasks get linearly finer
* granularity:
*/
- return curr->se.load_weight * (s64)(granularity / NICE_0_LOAD);
+ return curr->load_weight * (s64)(granularity / lrq->nice_0_load);
}

-unsigned long get_rq_load(struct rq *rq)
+unsigned long get_lrq_load(struct lrq *lrq)
{
- unsigned long load = rq->lrq.cpu_load[CPU_LOAD_IDX_MAX-1] + 1;
+ unsigned long load = lrq->cpu_load[CPU_LOAD_IDX_MAX-1] + 1;

if (!(sysctl_sched_load_smoothing & 1))
- return rq->lrq.raw_weighted_load;
+ return lrq->raw_weighted_load;

if (sysctl_sched_load_smoothing & 4)
- load = max(load, rq->lrq.raw_weighted_load);
+ load = max(load, lrq->raw_weighted_load);

return load;
}

-static void limit_wait_runtime(struct rq *rq, struct task_struct *p)
+static void limit_wait_runtime(struct lrq *lrq, struct sched_entity *p)
{
- s64 limit = sysctl_sched_runtime_limit;
+ s64 limit = *(lrq->sched_granularity);
s64 nice_limit = limit; // niced_granularity(p, limit);

/*
* Niced tasks have the same history dynamic range as
* non-niced tasks, but their limits are offset.
*/
- if (p->se.wait_runtime > nice_limit) {
- p->se.wait_runtime = nice_limit;
- p->se.wait_runtime_overruns++;
- rq->lrq.wait_runtime_overruns++;
+ if (p->wait_runtime > nice_limit) {
+ p->wait_runtime = nice_limit;
+ p->wait_runtime_overruns++;
+ lrq->wait_runtime_overruns++;
}
limit = (limit << 1) - nice_limit;
- if (p->se.wait_runtime < -limit) {
- p->se.wait_runtime = -limit;
- p->se.wait_runtime_underruns++;
- rq->lrq.wait_runtime_underruns++;
+ if (p->wait_runtime < -limit) {
+ p->wait_runtime = -limit;
+ p->wait_runtime_underruns++;
+ lrq->wait_runtime_underruns++;
}
}

-static void __add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta)
+static void
+__add_wait_runtime(struct lrq *lrq, struct sched_entity *p, s64 delta)
{
- p->se.wait_runtime += delta;
- p->se.sum_wait_runtime += delta;
- limit_wait_runtime(rq, p);
+ p->wait_runtime += delta;
+ p->sum_wait_runtime += delta;
+ limit_wait_runtime(lrq, p);
}

-static void add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta)
+static void add_wait_runtime(struct lrq *lrq, struct sched_entity *p, s64 delta)
{
- rq->lrq.wait_runtime -= p->se.wait_runtime;
- __add_wait_runtime(rq, p, delta);
- rq->lrq.wait_runtime += p->se.wait_runtime;
+ lrq->wait_runtime -= p->wait_runtime;
+ __add_wait_runtime(lrq, p, delta);
+ lrq->wait_runtime += p->wait_runtime;
}

/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
-static inline void update_curr(struct rq *rq, u64 now)
+static inline void _update_curr(struct lrq *lrq, u64 now)
{
u64 delta_exec, delta_fair, delta_mine;
- struct task_struct *curr = rq->curr;
+ struct sched_entity *curr = lrq->curr;
+ struct task_struct *curtask = lrq->rq->curr;

- if (curr->sched_class != &fair_sched_class || curr == rq->idle
- || !curr->se.on_rq)
+ if (!curr->on_rq || !curr->exec_start)
return;
/*
* Get the amount of time the current task was running
* since the last time we changed raw_weighted_load:
*/
- delta_exec = now - curr->se.exec_start;
- if (unlikely(delta_exec > curr->se.exec_max))
- curr->se.exec_max = delta_exec;
+ delta_exec = now - curr->exec_start;
+ if (unlikely(delta_exec > curr->exec_max))
+ curr->exec_max = delta_exec;

if (sysctl_sched_load_smoothing & 1) {
- unsigned long load = get_rq_load(rq);
+ unsigned long load = get_lrq_load(lrq);

if (sysctl_sched_load_smoothing & 2) {
- delta_fair = delta_exec * NICE_0_LOAD;
+ delta_fair = delta_exec * lrq->nice_0_load;
do_div(delta_fair, load);
} else {
- delta_fair = delta_exec * NICE_0_LOAD;
- do_div(delta_fair, rq->lrq.raw_weighted_load);
+ delta_fair = delta_exec * lrq->nice_0_load;
+ do_div(delta_fair, lrq->raw_weighted_load);
}

- delta_mine = delta_exec * curr->se.load_weight;
+ delta_mine = delta_exec * curr->load_weight;
do_div(delta_mine, load);
} else {
- delta_fair = delta_exec * NICE_0_LOAD;
- delta_fair += rq->lrq.raw_weighted_load >> 1;
- do_div(delta_fair, rq->lrq.raw_weighted_load);
-
- delta_mine = delta_exec * curr->se.load_weight;
- delta_mine += rq->lrq.raw_weighted_load >> 1;
- do_div(delta_mine, rq->lrq.raw_weighted_load);
+ delta_fair = delta_exec * lrq->nice_0_load;
+ delta_fair += lrq->raw_weighted_load >> 1;
+ do_div(delta_fair, lrq->raw_weighted_load);
+
+ delta_mine = delta_exec * curr->load_weight;
+ delta_mine += lrq->raw_weighted_load >> 1;
+ do_div(delta_mine, lrq->raw_weighted_load);
}

- curr->se.sum_exec_runtime += delta_exec;
- curr->se.exec_start = now;
- rq->lrq.exec_clock += delta_exec;
+ curr->sum_exec_runtime += delta_exec;
+ curr->exec_start = now;
+ lrq->exec_clock += delta_exec;

/*
* Task already marked for preemption, do not burden
* it with the cost of not having left the CPU yet.
*/
- if (unlikely(test_tsk_thread_flag(curr, TIF_NEED_RESCHED)))
+ if (unlikely(test_tsk_thread_flag(curtask, TIF_NEED_RESCHED)))
goto out_nowait;

- rq->lrq.fair_clock += delta_fair;
+ lrq->fair_clock += delta_fair;
/*
* We executed delta_exec amount of time on the CPU,
* but we were only entitled to delta_mine amount of
@@ -245,23 +256,23 @@
* the two values are equal)
* [Note: delta_mine - delta_exec is negative]:
*/
- add_wait_runtime(rq, curr, delta_mine - delta_exec);
+ add_wait_runtime(lrq, curr, delta_mine - delta_exec);
out_nowait:
;
}

static inline void
-update_stats_wait_start(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_wait_start(struct lrq *lrq, struct sched_entity *p, u64 now)
{
- p->se.wait_start_fair = rq->lrq.fair_clock;
- p->se.wait_start = now;
+ p->wait_start_fair = lrq->fair_clock;
+ p->wait_start = now;
}

/*
* Task is being enqueued - update stats:
*/
static inline void
-update_stats_enqueue(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_enqueue(struct lrq *lrq, struct sched_entity *p, u64 now)
{
s64 key;

@@ -269,35 +280,35 @@
* Are we enqueueing a waiting task? (for current tasks
* a dequeue/enqueue event is a NOP)
*/
- if (p != rq->curr)
- update_stats_wait_start(rq, p, now);
+ if (p != lrq->curr)
+ update_stats_wait_start(lrq, p, now);
/*
* Update the key:
*/
- key = rq->lrq.fair_clock;
+ key = lrq->fair_clock;

/*
* Optimize the common nice 0 case:
*/
- if (likely(p->se.load_weight == NICE_0_LOAD)) {
- key -= p->se.wait_runtime;
+ if (likely(p->load_weight == lrq->nice_0_load)) {
+ key -= p->wait_runtime;
} else {
- int negative = p->se.wait_runtime < 0;
+ int negative = p->wait_runtime < 0;
u64 tmp;

- if (p->se.load_weight > NICE_0_LOAD) {
+ if (p->load_weight > lrq->nice_0_load) {
/* negative-reniced tasks get helped: */

if (negative) {
- tmp = -p->se.wait_runtime;
- tmp *= NICE_0_LOAD;
- do_div(tmp, p->se.load_weight);
+ tmp = -p->wait_runtime;
+ tmp *= lrq->nice_0_load;
+ do_div(tmp, p->load_weight);

key += tmp;
} else {
- tmp = p->se.wait_runtime;
- tmp *= p->se.load_weight;
- do_div(tmp, NICE_0_LOAD);
+ tmp = p->wait_runtime;
+ tmp *= p->load_weight;
+ do_div(tmp, lrq->nice_0_load);

key -= tmp;
}
@@ -305,98 +316,98 @@
/* plus-reniced tasks get hurt: */

if (negative) {
- tmp = -p->se.wait_runtime;
+ tmp = -p->wait_runtime;

- tmp *= NICE_0_LOAD;
- do_div(tmp, p->se.load_weight);
+ tmp *= lrq->nice_0_load;
+ do_div(tmp, p->load_weight);

key += tmp;
} else {
- tmp = p->se.wait_runtime;
+ tmp = p->wait_runtime;

- tmp *= p->se.load_weight;
- do_div(tmp, NICE_0_LOAD);
+ tmp *= p->load_weight;
+ do_div(tmp, lrq->nice_0_load);

key -= tmp;
}
}
}

- p->se.fair_key = key;
+ p->fair_key = key;
}

/*
* Note: must be called with a freshly updated rq->fair_clock.
*/
static inline void
-update_stats_wait_end(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_wait_end(struct lrq *lrq, struct sched_entity *p, u64 now)
{
s64 delta_fair, delta_wait;

- delta_wait = now - p->se.wait_start;
- if (unlikely(delta_wait > p->se.wait_max))
- p->se.wait_max = delta_wait;
-
- if (p->se.wait_start_fair) {
- delta_fair = rq->lrq.fair_clock - p->se.wait_start_fair;
- if (unlikely(p->se.load_weight != NICE_0_LOAD))
- delta_fair = (delta_fair * p->se.load_weight) /
- NICE_0_LOAD;
- add_wait_runtime(rq, p, delta_fair);
+ delta_wait = now - p->wait_start;
+ if (unlikely(delta_wait > p->wait_max))
+ p->wait_max = delta_wait;
+
+ if (p->wait_start_fair) {
+ delta_fair = lrq->fair_clock - p->wait_start_fair;
+ if (unlikely(p->load_weight != lrq->nice_0_load))
+ delta_fair = (delta_fair * p->load_weight) /
+ lrq->nice_0_load;
+ add_wait_runtime(lrq, p, delta_fair);
}

- p->se.wait_start_fair = 0;
- p->se.wait_start = 0;
+ p->wait_start_fair = 0;
+ p->wait_start = 0;
}

static inline void
-update_stats_dequeue(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_dequeue(struct lrq *lrq, struct sched_entity *p, u64 now)
{
- update_curr(rq, now);
+ update_curr(lrq, now);
/*
* Mark the end of the wait period if dequeueing a
* waiting task:
*/
- if (p != rq->curr)
- update_stats_wait_end(rq, p, now);
+ if (p != lrq->curr)
+ update_stats_wait_end(lrq, p, now);
}

/*
* We are picking a new current task - update its stats:
*/
static inline void
-update_stats_curr_start(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_curr_start(struct lrq *lrq, struct sched_entity *p, u64 now)
{
/*
* We are starting a new run period:
*/
- p->se.exec_start = now;
+ p->exec_start = now;
}

/*
* We are descheduling a task - update its stats:
*/
static inline void
-update_stats_curr_end(struct rq *rq, struct task_struct *p, u64 now)
+update_stats_curr_end(struct lrq *lrq, struct sched_entity *p, u64 now)
{
- update_curr(rq, now);
+ update_curr(lrq, now);

- p->se.exec_start = 0;
+ p->exec_start = 0;
}

/**************************************************************/
/* Scheduling class queueing methods:
*/

-static void enqueue_sleeper(struct rq *rq, struct task_struct *p)
+static void enqueue_sleeper(struct lrq *lrq, struct sched_entity *p)
{
- unsigned long load = get_rq_load(rq);
+ unsigned long load = get_lrq_load(lrq);
u64 delta_fair = 0;

if (!(sysctl_sched_load_smoothing & 16))
goto out;

- delta_fair = rq->lrq.fair_clock - p->se.sleep_start_fair;
+ delta_fair = lrq->fair_clock - p->sleep_start_fair;
if ((s64)delta_fair < 0)
delta_fair = 0;

@@ -406,15 +417,15 @@
*/
if (sysctl_sched_load_smoothing & 8) {
delta_fair = delta_fair * load;
- do_div(delta_fair, load + p->se.load_weight);
+ do_div(delta_fair, load + p->load_weight);
}

- __add_wait_runtime(rq, p, delta_fair);
+ __add_wait_runtime(lrq, p, delta_fair);

out:
- rq->lrq.wait_runtime += p->se.wait_runtime;
+ lrq->wait_runtime += p->wait_runtime;

- p->se.sleep_start_fair = 0;
+ p->sleep_start_fair = 0;
}

/*
@@ -423,43 +434,43 @@
* then put the task into the rbtree:
*/
static void
-enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
+enqueue_entity(struct lrq *lrq, struct sched_entity *p, int wakeup, u64 now)
{
u64 delta = 0;

/*
* Update the fair clock.
*/
- update_curr(rq, now);
+ update_curr(lrq, now);

if (wakeup) {
- if (p->se.sleep_start) {
- delta = now - p->se.sleep_start;
+ if (p->sleep_start && entity_is_task(p)) {
+ delta = now - p->sleep_start;
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > p->se.sleep_max))
- p->se.sleep_max = delta;
+ if (unlikely(delta > p->sleep_max))
+ p->sleep_max = delta;

- p->se.sleep_start = 0;
+ p->sleep_start = 0;
}
- if (p->se.block_start) {
- delta = now - p->se.block_start;
+ if (p->block_start && entity_is_task(p)) {
+ delta = now - p->block_start;
if ((s64)delta < 0)
delta = 0;

- if (unlikely(delta > p->se.block_max))
- p->se.block_max = delta;
+ if (unlikely(delta > p->block_max))
+ p->block_max = delta;

- p->se.block_start = 0;
+ p->block_start = 0;
}
- p->se.sum_sleep_runtime += delta;
+ p->sum_sleep_runtime += delta;

- if (p->se.sleep_start_fair)
- enqueue_sleeper(rq, p);
+ if (p->sleep_start_fair)
+ enqueue_sleeper(lrq, p);
}
- update_stats_enqueue(rq, p, now);
- __enqueue_task_fair(rq, p);
+ update_stats_enqueue(lrq, p, now);
+ __enqueue_entity(lrq, p);
}

/*
@@ -468,18 +479,374 @@
* update the fair scheduling stats:
*/
static void
-dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now)
+dequeue_entity(struct lrq *lrq, struct sched_entity *p, int sleep, u64 now)
{
- update_stats_dequeue(rq, p, now);
+ update_stats_dequeue(lrq, p, now);
if (sleep) {
- if (p->state & TASK_INTERRUPTIBLE)
- p->se.sleep_start = now;
- if (p->state & TASK_UNINTERRUPTIBLE)
- p->se.block_start = now;
- p->se.sleep_start_fair = rq->lrq.fair_clock;
- rq->lrq.wait_runtime -= p->se.wait_runtime;
+ if (entity_is_task(p)) {
+ struct task_struct *tsk = task_entity(p);
+
+ if (tsk->state & TASK_INTERRUPTIBLE)
+ p->sleep_start = now;
+ if (tsk->state & TASK_UNINTERRUPTIBLE)
+ p->block_start = now;
+ }
+ p->sleep_start_fair = lrq->fair_clock;
+ lrq->wait_runtime -= p->wait_runtime;
+ }
+ __dequeue_entity(lrq, p);
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static inline void
+__check_preempt_curr_fair(struct lrq *lrq, struct sched_entity *p,
+ struct sched_entity *curr, unsigned long granularity)
+{
+ s64 __delta = curr->fair_key - p->fair_key;
+
+ /*
+ * Take scheduling granularity into account - do not
+ * preempt the current task unless the best task has
+ * a larger than sched_granularity fairness advantage:
+ */
+ if (__delta > niced_granularity(lrq, curr, granularity))
+ resched_task(lrq->rq->curr);
+}
+
+static struct sched_entity * pick_next_entity(struct lrq *lrq, u64 now)
+{
+ struct sched_entity *p = __pick_next_entity(lrq);
+
+ /*
+ * Any task has to be enqueued before it get to execute on
+ * a CPU. So account for the time it spent waiting on the
+ * runqueue. (note, here we rely on pick_next_task() having
+ * done a put_prev_task_fair() shortly before this, which
+ * updated rq->fair_clock - used by update_stats_wait_end())
+ */
+ update_stats_wait_end(lrq, p, now);
+ update_stats_curr_start(lrq, p, now);
+ lrq->curr = p;
+
+ return p;
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_entity(struct lrq *lrq, struct sched_entity *prev, u64 now)
+{
+ if (!prev) /* Don't update idle task's stats */
+ return;
+
+ update_stats_curr_end(lrq, prev, now);
+ /*
+ * If the task is still waiting for the CPU (it just got
+ * preempted), start the wait period:
+ */
+ if (prev->on_rq)
+ update_stats_wait_start(lrq, prev, now);
+}
+
+/*
+ * scheduler tick hitting a task of our scheduling class:
+ */
+static void entity_tick(struct lrq *lrq, struct sched_entity *curr)
+{
+ struct sched_entity *next;
+ u64 now = __rq_clock(lrq->rq);
+
+ /*
+ * Dequeue and enqueue the task to update its
+ * position within the tree:
+ */
+ dequeue_entity(lrq, curr, 0, now);
+ enqueue_entity(lrq, curr, 0, now);
+
+ /*
+ * Reschedule if another task tops the current one.
+ */
+ next = __pick_next_entity(lrq);
+ if (next == curr)
+ return;
+
+ if (entity_is_task(curr)) {
+ struct task_struct *c = task_entity(curr),
+ *n = task_entity(next);
+
+ if ((c == lrq->rq->idle) || (rt_prio(n->prio) &&
+ (n->prio < c->prio)))
+ resched_task(c);
+ } else
+ __check_preempt_curr_fair(lrq, next, curr,
+ *(lrq->sched_granularity));
+}
+
+static void _update_load(struct lrq *this_rq)
+{
+ unsigned long this_load, fair_delta, exec_delta, idle_delta;
+ unsigned int i, scale;
+ s64 fair_delta64, exec_delta64;
+ unsigned long tmp;
+ u64 tmp64;
+
+ this_rq->nr_load_updates++;
+ if (!(sysctl_sched_load_smoothing & 64)) {
+ this_load = this_rq->raw_weighted_load;
+ goto do_avg;
+ }
+
+ fair_delta64 = this_rq->fair_clock -
+ this_rq->prev_fair_clock + 1;
+ this_rq->prev_fair_clock = this_rq->fair_clock;
+
+ exec_delta64 = this_rq->exec_clock -
+ this_rq->prev_exec_clock + 1;
+ this_rq->prev_exec_clock = this_rq->exec_clock;
+
+ if (fair_delta64 > (s64)LONG_MAX)
+ fair_delta64 = (s64)LONG_MAX;
+ fair_delta = (unsigned long)fair_delta64;
+
+ if (exec_delta64 > (s64)LONG_MAX)
+ exec_delta64 = (s64)LONG_MAX;
+ exec_delta = (unsigned long)exec_delta64;
+ if (exec_delta > TICK_NSEC)
+ exec_delta = TICK_NSEC;
+
+ idle_delta = TICK_NSEC - exec_delta;
+
+ tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
+ tmp64 = (u64)tmp * (u64)exec_delta;
+ do_div(tmp64, TICK_NSEC);
+ this_load = (unsigned long)tmp64;
+
+do_avg:
+ /* Update our load: */
+ for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
+ unsigned long old_load, new_load;
+
+ /* scale is effectively 1 << i now, and >> i divides by scale */
+
+ old_load = this_rq->cpu_load[i];
+ new_load = this_load;
+
+ this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
+ }
+}
+
+/******************* Start task operations **********************************/
+
+static inline struct lrq * task_grp_lrq(const struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_USER_SCHED
+ return &p->user->lrq[task_cpu(p)];
+#else
+ return &task_rq(p)->lrq;
+#endif
+}
+
+/*
+ * The enqueue_task method is called before nr_running is
+ * increased. Here we update the fair scheduling stats and
+ * then put the task into the rbtree:
+ */
+static void
+enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now)
+{
+ struct lrq *lrq = task_grp_lrq(p);
+ struct sched_entity *se = &p->se;
+
+ enqueue_entity(lrq, se, wakeup, now);
+ if (p == rq->curr)
+ lrq->curr = se;
+
+ if (likely(!se->parent || se->parent->on_rq))
+ return;
+
+ lrq = &rq->lrq;
+ se = se->parent;
+ if (p == rq->curr)
+ lrq->curr = se;
+ enqueue_entity(lrq, se, wakeup, now);
+ se->on_rq = 1;
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static void
+dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now)
+{
+ struct lrq *lrq = task_grp_lrq(p);
+ struct sched_entity *se = &p->se;
+
+ dequeue_entity(lrq, se, sleep, now);
+
+ if (likely(!se->parent || lrq->raw_weighted_load))
+ return;
+
+ se = se->parent;
+ lrq = &rq->lrq;
+ dequeue_entity(lrq, se, sleep, now);
+ se->on_rq = 0;
+}
+
+static struct task_struct * pick_next_task_fair(struct rq *rq, u64 now)
+{
+ struct lrq *lrq;
+ struct sched_entity *se;
+
+ lrq = &rq->lrq;
+ se = pick_next_entity(lrq, now);
+
+ if (se->my_q) {
+ lrq = se->my_q;
+ se = pick_next_entity(lrq, now);
+ }
+
+ return task_entity(se);
+}
+
+/*
+ * Account for a descheduled task:
+ */
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, u64 now)
+{
+ struct lrq *lrq = task_grp_lrq(prev);
+ struct sched_entity *se = &prev->se;
+
+ if (prev == rq->idle)
+ return;
+
+ put_prev_entity(lrq, se, now);
+
+ if (!se->parent)
+ return;
+
+ se = se->parent;
+ lrq = &rq->lrq;
+ put_prev_entity(lrq, se, now);
+}
+
+/*
+ * scheduler tick hitting a task of our scheduling class:
+ */
+static void task_tick_fair(struct rq *rq, struct task_struct *curr)
+{
+ struct lrq *lrq;
+ struct sched_entity *se;
+
+ se = &curr->se;
+ lrq = task_grp_lrq(curr);
+ entity_tick(lrq, se);
+
+ if (likely(!se->parent))
+ return;
+
+ /* todo: reduce tick frequency at higher scheduling levels? */
+ se = se->parent;
+ lrq = &rq->lrq;
+ entity_tick(lrq, se);
+}
+
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *curr = rq->curr;
+
+ if ((curr == rq->idle) || rt_prio(p->prio)) {
+ resched_task(curr);
+ } else {
+ struct sched_entity *cse, *nse;
+
+ if (!curr->se.parent || (curr->se.parent == p->se.parent)) {
+ cse = &curr->se;
+ nse = &p->se;
+ } else {
+ cse = curr->se.parent;
+ nse = p->se.parent;
+ }
+
+ __check_preempt_curr_fair(&rq->lrq, cse, nse,
+ sysctl_sched_wakeup_granularity);
+ }
+}
+
+static inline void update_curr(struct lrq *lrq, u64 now)
+{
+ struct task_struct *curtask = lrq->rq->curr;
+ struct lrq *curq;
+
+ if (curtask->sched_class != &fair_sched_class ||
+ curtask == lrq->rq->idle || !curtask->se.on_rq)
+ return;
+
+ /* this is slightly inefficient - need better way of updating clock */
+ curq = task_grp_lrq(curtask);
+ _update_curr(curq, now);
+
+ if (unlikely(curtask->se.parent)) {
+ curq = &lrq->rq->lrq;
+ _update_curr(curq, now);
+ }
+}
+
+void update_load_fair(struct rq *this_rq)
+{
+ struct task_struct *curr = this_rq->curr;
+ struct lrq *lrq = task_grp_lrq(curr);
+ struct sched_entity *se = &curr->se;
+
+ _update_load(lrq);
+
+ if (!se->parent)
+ return;
+
+ lrq = &this_rq->lrq;
+ _update_load(lrq);
+}
+
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+ struct lrq *lrq = task_grp_lrq(p);
+ struct sched_entity *se = &p->se;
+
+ sched_info_queued(p);
+ update_stats_enqueue(lrq, se, rq_clock(rq));
+ /*
+ * Child runs first: we let it run before the parent
+ * until it reschedules once. We set up the key so that
+ * it will preempt the parent:
+ */
+ p->se.fair_key = current->se.fair_key - niced_granularity(lrq,
+ &rq->curr->se, sysctl_sched_granularity) - 1;
+ /*
+ * The first wait is dominated by the child-runs-first logic,
+ * so do not credit it with that waiting time yet:
+ */
+ p->se.wait_start_fair = 0;
+
+ __enqueue_entity(lrq, se);
+ if (unlikely(se && !se->on_rq)) { /* idle task forking */
+ lrq = &rq->lrq;
+ update_stats_enqueue(lrq, se, rq_clock(rq));
+ __enqueue_entity(lrq, se);
}
- __dequeue_task_fair(rq, p);
+ inc_nr_running(p, rq);
}

/*
@@ -494,6 +861,8 @@
struct task_struct *p_next;
s64 yield_key;
u64 now;
+ struct lrq *lrq = task_grp_lrq(p);
+ struct sched_entity *se = &p->se;

/*
* Bug workaround for 3D apps running on the radeon 3D driver:
@@ -508,15 +877,14 @@
* Dequeue and enqueue the task to update its
* position within the tree:
*/
- dequeue_task_fair(rq, p, 0, now);
- p->se.on_rq = 0;
- enqueue_task_fair(rq, p, 0, now);
- p->se.on_rq = 1;
+ dequeue_entity(lrq, se, 0, now);
+ enqueue_entity(lrq, se, 0, now);

/*
* Reschedule if another task tops the current one.
*/
- p_next = __pick_next_task_fair(rq);
+ se = __pick_next_entity(lrq);
+ p_next = task_entity(se);
if (p_next != p)
resched_task(p);
return;
@@ -531,7 +899,7 @@
p->se.wait_runtime >>= 1;
}
curr = &p->se.run_node;
- first = first_fair(rq);
+ first = first_fair(lrq);
/*
* Move this task to the second place in the tree:
*/
@@ -554,8 +922,7 @@
yield_key = p_next->se.fair_key + 1;

now = __rq_clock(rq);
- dequeue_task_fair(rq, p, 0, now);
- p->se.on_rq = 0;
+ dequeue_entity(lrq, se, 0, now);

/*
* Only update the key if we need to move more backwards
@@ -564,75 +931,7 @@
if (p->se.fair_key < yield_key)
p->se.fair_key = yield_key;

- __enqueue_task_fair(rq, p);
- p->se.on_rq = 1;
-}
-
-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static inline void
-__check_preempt_curr_fair(struct rq *rq, struct task_struct *p,
- struct task_struct *curr, unsigned long granularity)
-{
- s64 __delta = curr->se.fair_key - p->se.fair_key;
-
- /*
- * Take scheduling granularity into account - do not
- * preempt the current task unless the best task has
- * a larger than sched_granularity fairness advantage:
- */
- if (__delta > niced_granularity(curr, granularity))
- resched_task(curr);
-}
-
-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
-{
- struct task_struct *curr = rq->curr;
-
- if ((curr == rq->idle) || rt_prio(p->prio)) {
- resched_task(curr);
- } else {
- __check_preempt_curr_fair(rq, p, curr,
- sysctl_sched_wakeup_granularity);
- }
-}
-
-static struct task_struct * pick_next_task_fair(struct rq *rq, u64 now)
-{
- struct task_struct *p = __pick_next_task_fair(rq);
-
- /*
- * Any task has to be enqueued before it get to execute on
- * a CPU. So account for the time it spent waiting on the
- * runqueue. (note, here we rely on pick_next_task() having
- * done a put_prev_task_fair() shortly before this, which
- * updated rq->fair_clock - used by update_stats_wait_end())
- */
- update_stats_wait_end(rq, p, now);
- update_stats_curr_start(rq, p, now);
-
- return p;
-}
-
-/*
- * Account for a descheduled task:
- */
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, u64 now)
-{
- if (prev == rq->idle)
- return;
-
- update_stats_curr_end(rq, prev, now);
- /*
- * If the task is still waiting for the CPU (it just got
- * preempted), start the wait period:
- */
- if (prev->se.on_rq)
- update_stats_wait_start(rq, prev, now);
+ __enqueue_entity(lrq, se);
}

/**************************************************************/
@@ -648,6 +947,7 @@
*/
static struct task_struct * load_balance_start_fair(struct rq *rq)
{
+#if 0
struct rb_node *first = first_fair(rq);
struct task_struct *p;

@@ -659,10 +959,13 @@
rq->lrq.rb_load_balance_curr = rb_next(first);

return p;
+#endif
+ return NULL; /* todo: fix load balance */
}

static struct task_struct * load_balance_next_fair(struct rq *rq)
{
+#if 0
struct rb_node *curr = rq->lrq.rb_load_balance_curr;
struct task_struct *p;

@@ -673,67 +976,8 @@
rq->lrq.rb_load_balance_curr = rb_next(curr);

return p;
-}
-
-/*
- * scheduler tick hitting a task of our scheduling class:
- */
-static void task_tick_fair(struct rq *rq, struct task_struct *curr)
-{
- struct task_struct *next;
- u64 now = __rq_clock(rq);
-
- /*
- * Dequeue and enqueue the task to update its
- * position within the tree:
- */
- dequeue_task_fair(rq, curr, 0, now);
- curr->se.on_rq = 0;
- enqueue_task_fair(rq, curr, 0, now);
- curr->se.on_rq = 1;
-
- /*
- * Reschedule if another task tops the current one.
- */
- next = __pick_next_task_fair(rq);
- if (next == curr)
- return;
-
- if ((curr == rq->idle) || (rt_prio(next->prio) &&
- (next->prio < curr->prio)))
- resched_task(curr);
- else
- __check_preempt_curr_fair(rq, next, curr,
- sysctl_sched_granularity);
-}
-
-/*
- * Share the fairness runtime between parent and child, thus the
- * total amount of pressure for CPU stays equal - new tasks
- * get a chance to run but frequent forkers are not allowed to
- * monopolize the CPU. Note: the parent runqueue is locked,
- * the child is not running yet.
- */
-static void task_new_fair(struct rq *rq, struct task_struct *p)
-{
- sched_info_queued(p);
- update_stats_enqueue(rq, p, rq_clock(rq));
- /*
- * Child runs first: we let it run before the parent
- * until it reschedules once. We set up the key so that
- * it will preempt the parent:
- */
- p->se.fair_key = current->se.fair_key - niced_granularity(rq->curr,
- sysctl_sched_granularity) - 1;
- /*
- * The first wait is dominated by the child-runs-first logic,
- * so do not credit it with that waiting time yet:
- */
- p->se.wait_start_fair = 0;
-
- __enqueue_task_fair(rq, p);
- p->se.on_rq = 1;
- inc_nr_running(p, rq);
+#endif
+ return NULL;
}

/*
Index: linux-2.6.21-rc7/kernel/user.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/user.c 2007-05-23 20:46:38.000000000 +0530
+++ linux-2.6.21-rc7/kernel/user.c 2007-05-23 20:48:39.000000000 +0530
@@ -112,6 +112,7 @@
if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
uid_hash_remove(up);
spin_unlock_irqrestore(&uidhash_lock, flags);
+ sched_free_user(up);
key_put(up->uid_keyring);
key_put(up->session_keyring);
kmem_cache_free(uid_cachep, up);
@@ -153,6 +154,8 @@
return NULL;
}

+ sched_alloc_user(new);
+
/*
* Before adding this, check whether we raced
* on adding the same user already..
@@ -163,6 +166,7 @@
key_put(new->uid_keyring);
key_put(new->session_keyring);
kmem_cache_free(uid_cachep, new);
+ sched_free_user(new);
} else {
uid_hash_insert(new, hashent);
up = new;
@@ -187,6 +191,7 @@
atomic_dec(&old_user->processes);
switch_uid_keyring(new_user);
current->user = new_user;
+ sched_move_task(old_user);

/*
* We need to synchronize with __sigqueue_alloc()
Index: linux-2.6.21-rc7/kernel/sched_rt.c
===================================================================
--- linux-2.6.21-rc7.orig/kernel/sched_rt.c 2007-05-23 09:28:03.000000000 +0530
+++ linux-2.6.21-rc7/kernel/sched_rt.c 2007-05-23 20:48:39.000000000 +0530
@@ -166,7 +166,7 @@
activate_task(rq, p, 1);
}

-static struct sched_class rt_sched_class __read_mostly = {
+struct sched_class rt_sched_class __read_mostly = {
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
--
Regards,
vatsa

2007-05-23 18:33:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> Here's an attempt to extend CFS (v13) to be fair at a group level,
> rather than just at task level. The patch is in a very premature state
> (passes simple tests, smp load balance not supported yet) at this
> point. I am sending it out early to know if this is a good direction
> to proceed.

cool patch! :-)

> Salient points which needs discussion:
>
> 1. This patch reuses CFS core to achieve fairness at group level also.
>
> To make this possible, CFS core has been abstracted to deal with
> generic schedulable "entities" (tasks, users etc).

yeah, i like this alot.

The "struct sched_entity" abstraction looks very clean, and that's the
main thing that matters: it allows for a design that will only cost us
performance if group scheduling is desired.

If you could do a -v14 port and at least add minimal SMP support: i.e.
it shouldnt crash on SMP, but otherwise no extra load-balancing logic is
needed for the first cut - then i could try to pick all these core
changes up for -v15. (I'll let you know about any other thoughts/details
when i do the integration.)

> 2. The per-cpu rb-tree has been split to be per-group per-cpu.
>
> schedule() now becomes two step on every cpu : pick a group first
> (from group rb-tree) and a task within that group next (from that
> group's task rb-tree)

yeah. It might even become more steps if someone wants to have a
different, deeper hierarchy (at the price of performance). Containers
will for example certainly want to use one more level.

> 3. Grouping mechanism - I have used 'uid' as the basis of grouping for
> timebeing (since that grouping concept is already in mainline
> today). The patch can be adapted to a more generic process grouping
> mechanism (like http://lkml.org/lkml/2007/4/27/146) later.

yeah, agreed.

> Some results below, obtained on a 4way (with HT) Intel Xeon box. All
> number are reflective of single CPU performance (tests were forced to
> run on single cpu since load balance is not yet supported).
>
>
> uid "vatsa" uid "guest"
> (make -s -j4 bzImage) (make -s -j20 bzImage)
>
> 2.6.22-rc1 772.02 sec 497.42 sec (real)
> 2.6.22-rc1+cfs-v13 780.62 sec 478.35 sec (real)
> 2.6.22-rc1+cfs-v13+this patch 776.36 sec 776.68 sec (real)
>
> [ An exclusive cpuset containing only one CPU was created and the
> compilation jobs of both users were run simultaneously in this cpuset
> ]

looks really promising!

> I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
> cfs-v13:
>
> uid "vatsa"
> make -s -j4 bzImage
>
> 2.6.22-rc1+cfs-v13 395.57 sec (real)
> 2.6.22-rc1+cfs-v13+this_patch 388.54 sec (real)
>
> There is no regression I can see (rather some improvement, which I
> can't understand atm). I will run more tests later to check this
> regression aspect.

kernel builds dont really push scheduling micro-costs, rather try
something like 'hackbench.c' to measure that. (kernel builds are of
course one of our primary benchmarks.)

> Request your comments on the future direction to proceed!

full steam ahead please! =B-)

Ingo

2007-05-25 07:37:27

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Thu, May 24, 2007 at 12:26:16AM +0200, Guillaume Chazarain wrote:
> As a sidenote, while in CFS-v13 a nice=0 tasks seems to get 10x more CPU
> than a nice=10 one, with the group fairness patch, the ratio drops to
> less than 2x (for tasks with the same UID).

gah ..silly me.

Can you repeat your tests with this patch pls? With the patch applied, I am
now getting the same split between nice 0 and nice 10 task as CFS-v13
provides (90:10 as reported by top )

5418 guest 20 0 2464 304 236 R 90 0.0 5:41.40 3 hog
5419 guest 30 10 2460 304 236 R 10 0.0 0:43.62 3 nice10hog



Fix a stupid bug, where I was not calling __check_preempt_curr_fair()
at task level during task_tick ..

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


---


diff -puN kernel/sched_fair.c~fix kernel/sched_fair.c
--- linux-2.6.22-rc1-cfs-group/kernel/sched_fair.c~fix 2007-05-25 12:28:52.000000000 +0530
+++ linux-2.6.22-rc1-cfs-group-vatsa/kernel/sched_fair.c 2007-05-25 12:30:06.000000000 +0530
@@ -577,11 +577,12 @@ static void entity_tick(struct lrq *lrq,
*n = task_entity(next);

if ((c == lrq->rq->idle) || (rt_prio(n->prio) &&
- (n->prio < c->prio)))
+ (n->prio < c->prio))) {
resched_task(c);
- } else
- __check_preempt_curr_fair(lrq, next, curr,
- *(lrq->sched_granularity));
+ return;
+ }
+ }
+ __check_preempt_curr_fair(lrq, next, curr, *(lrq->sched_granularity));
}

static void _update_load(struct lrq *this_rq)
_


--
Regards,
vatsa

2007-05-25 07:51:36

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 23, 2007 at 08:32:52PM +0200, Ingo Molnar wrote:
> > Here's an attempt to extend CFS (v13) to be fair at a group level,
> > rather than just at task level. The patch is in a very premature state
> > (passes simple tests, smp load balance not supported yet) at this
> > point. I am sending it out early to know if this is a good direction
> > to proceed.
>
> cool patch! :-)

Thanks!

> > 1. This patch reuses CFS core to achieve fairness at group level also.
> >
> > To make this possible, CFS core has been abstracted to deal with
> > generic schedulable "entities" (tasks, users etc).
>
> yeah, i like this alot.
>
> The "struct sched_entity" abstraction looks very clean, and that's the
> main thing that matters: it allows for a design that will only cost us
> performance if group scheduling is desired.
>
> If you could do a -v14 port and at least add minimal SMP support: i.e.
> it shouldnt crash on SMP, but otherwise no extra load-balancing logic is
> needed for the first cut - then i could try to pick all these core
> changes up for -v15. (I'll let you know about any other thoughts/details
> when i do the integration.)

Sure ..I will work on a -v14 port. I would like to aim for something which:

1. doesn't break performance/functionality of existing CFS scheduler
-if- CONFIG_FAIR_USER_SCHEDULER is disabled. This also means load
balance should work as it works today when the config option is
disabled.

Do you recommend a set of tests that I need to run to ensure there
is no regression? I know that there is a bunch of scheduler
tests floating around on lkml ..Just need to dig them up (or if
someone has all these tests handy on a website, I will download from
that site!)

2. Provides fairness at group (user) level at the cost of missing load
balance functionality (missing until I get around to work on it, that
is).

> kernel builds dont really push scheduling micro-costs, rather try
> something like 'hackbench.c' to measure that. (kernel builds are of
> course one of our primary benchmarks.)

sure i will try that on my next version.

--
Regards,
vatsa

2007-05-25 08:35:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> Can you repeat your tests with this patch pls? With the patch applied,
> I am now getting the same split between nice 0 and nice 10 task as
> CFS-v13 provides (90:10 as reported by top )
>
> 5418 guest 20 0 2464 304 236 R 90 0.0 5:41.40 3 hog
> 5419 guest 30 10 2460 304 236 R 10 0.0 0:43.62 3 nice10hog

btw., what are you thoughts about SMP?

it's a natural extension of your current code. I think the best approach
would be to add a level of 'virtual CPU' objects above struct user. (how
to set the attributes of those objects is open - possibly combine it
with cpusets?)

That way the scheduler would first pick a "virtual CPU" to schedule, and
then pick a user from that virtual CPU, and then a task from the user.

To make group accounting scalable, the accounting object attached to the
user struct should/must be per-cpu (per-vcpu) too. That way we'd have a
clean hierarchy like:

CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]

VCPU A => USER X [ 10% ] + USER Y [ 90% ]
VCPU B => USER X [ 10% ] + USER Y [ 90% ]
VCPU C => USER X [ 10% ] + USER Y [ 90% ]
VCPU D => USER X [ 10% ] + USER Y [ 90% ]

the scheduler first picks a vcpu, then a user from a vcpu. (the actual
external structure of the hierarchy should be opaque to the scheduler
core, naturally, so that we can use other hierarchies too)

whenever the scheduler does accounting, it knows where in the hierarchy
it is and updates all higher level entries too. This means that the
accounting object for USER X is replicated for each VCPU it participates
in.
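
A minimal sketch of that per-level replication idea (illustrative only; the
struct and function names below are hypothetical, not from the patch):

	/* Illustrative sketch: propagate exec time from the task's
	 * accounting object up through the user and vcpu levels. */
	struct acct_entity {
		u64 exec_clock;
		struct acct_entity *parent;	/* user -> vcpu -> NULL */
	};

	static void account_exec(struct acct_entity *e, u64 delta_exec)
	{
		for (; e; e = e->parent)
			e->exec_clock += delta_exec;
	}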

SMP balancing is straightforward: it would fundamentally iterate through
the same hierarchy and would attempt to keep all levels balanced - i
abstracted away its iterators already.

Hm?

Ingo

2007-05-25 10:02:52

by Guillaume Chazarain

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

Srivatsa Vaddagiri wrote:
> Can you repeat your tests with this patch pls? With the patch applied, I am
> now getting the same split between nice 0 and nice 10 task as CFS-v13
> provides (90:10 as reported by top )

Yep, this fixes the problem for me too.

Thanks.

--
Guillaume

2007-05-25 10:48:51

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 10:29:51AM +0200, Ingo Molnar wrote:
> btw., what are you thoughts about SMP?

I was planning on reusing smpnice concepts here, with the difference
that we balance group weights across CPU in addition to total weight of
CPUs.

For ex, assuming weight of each task is 10

CPU0 => USER X (Wt 100) + USER Y (Wt 200) => total weight 300
CPU1 => USER X (Wt 30) + USER Y (Wt 80) => total weight 110

So first we notice that CPU0 and CPU1 are imbalanced by weight 190 and
target at reducing this imbalance by half i.e CPU1 has to pull total
weight of 95 (190/2) from CPU0. However while pulling weights, we apply
the same imbalance/2 rule at group level also. For ex: we cannot pull more than
70/2 = 35 from USER X on CPU0 or more than 120/2 = 60 from USER Y on CPU0.

Using this rule, after balance, the two CPUs may look like:

CPU0 => USER X (Wt 70) + USER Y (Wt 140) => total weight 210
CPU1 => USER X (Wt 60) + USER Y (Wt 140) => total weight 200
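
A rough sketch of the per-group imbalance/2 cap described above (illustrative
only, not code from the patch):

	/* Illustrative: upper bound on weight pulled from one group,
	 * applying the same imbalance/2 rule used at the CPU level. */
	static unsigned long group_pull_limit(unsigned long busiest_grp_wt,
					      unsigned long this_grp_wt)
	{
		if (busiest_grp_wt <= this_grp_wt)
			return 0;	/* group already balanced */
		return (busiest_grp_wt - this_grp_wt) / 2;
	}

With the numbers above this caps the pull at (100-30)/2 = 35 for user X and
(200-80)/2 = 60 for user Y, while the total pulled stays within the CPU-level
limit of 95.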

I had tried this approach earlier (in
https://lists.linux-foundation.org/pipermail/containers/2007-April/004580.html)
and had obtained decent results. It also required minimal changes to
smpnice.

Compared to this, what better degree of control/flexibilty does virtual
cpu approach give?

> it's a natural extension of your current code. I think the best approach
> would be to add a level of 'virtual CPU' objects above struct user. (how
> to set the attributes of those objects is open - possibly combine it
> with cpusets?)

are these virtual CPUs visible to users (ex: does smp_processor_id()
return virtual cpu id rather than physical id and does DEFINE_PER_CPU
create per-cpu data for virtual CPUs rather than physical cpus)?

> That way the scheduler would first pick a "virtual CPU" to schedule,

are virtual cpus pinned to their physical cpu or can they bounce around?
i.e can CPU #0 schedule VCPU D (in your example below)? If bouncing is
allowed, I am not sure whether that is a good thing for performance. How
do we minimize this performance cost?

> and then pick a user from that virtual CPU, and then a task from the user.
>
> To make group accounting scalable, the accounting object attached to the
> user struct should/must be per-cpu (per-vcpu) too. That way we'd have a
> clean hierarchy like:
>
> CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
> CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]
>
> VCPU A => USER X [ 10% ] + USER Y [ 90% ]
> VCPU B => USER X [ 10% ] + USER Y [ 90% ]
> VCPU C => USER X [ 10% ] + USER Y [ 90% ]
> VCPU D => USER X [ 10% ] + USER Y [ 90% ]
>
> the scheduler first picks a vcpu, then a user from a vcpu. (the actual
> external structure of the hierarchy should be opaque to the scheduler
> core, naturally, so that we can use other hierarchies too)
>
> whenever the scheduler does accounting, it knows where in the hierarchy
> it is and updates all higher level entries too. This means that the
> accounting object for USER X is replicated for each VCPU it participates
> in.
>
> SMP balancing is straightforward: it would fundamentally iterate through
> the same hierarchy and would attempt to keep all levels balanced - i
> abstracted away its iterators already
>
> Hm?

--
Regards,
vatsa

2007-05-25 11:12:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Fri, May 25, 2007 at 10:29:51AM +0200, Ingo Molnar wrote:
> > btw., what are you thoughts about SMP?
>
> I was planning on reusing smpnice concepts here, with the difference
> that we balance group weights across CPU in addition to total weight
> of CPUs.

ok, that would be (much) simpler that any explicit vcpu approach. Do you
plan to add this next?

Ingo

2007-05-25 11:20:58

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 01:11:40PM +0200, Ingo Molnar wrote:
> > I was planning on reusing smpnice concepts here, with the difference
> > that we balance group weights across CPU in addition to total weight
> > of CPUs.
>
> ok, that would be (much) simpler that any explicit vcpu approach. Do you
> plan to add this next?

sure, after i first cleanup current group changes a bit. Hopefully I can
send you first version of smp support by early next week.

--
Regards,
vatsa

2007-05-25 12:07:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Fri, May 25, 2007 at 01:11:40PM +0200, Ingo Molnar wrote:
> > > I was planning on reusing smpnice concepts here, with the difference
> > > that we balance group weights across CPU in addition to total weight
> > > of CPUs.
> >
> > ok, that would be (much) simpler that any explicit vcpu approach. Do you
> > plan to add this next?
>
> sure, after i first cleanup current group changes a bit. Hopefully I
> can send you first version of smp support by early next week.

great. Btw., could you please keep the "up to this point there should be
no behavioral change in CFS" fundamental splitup of your patches - that
way i can look at the core changes (and possibly apply them) without
having to consider the uid based changes (which do change behavior and
which need more upstream buy-in).

Ingo

2007-05-25 12:33:43

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 02:05:36PM +0200, Ingo Molnar wrote:
> great. Btw., could you please keep the "up to this point there should be
> no behavioral change in CFS" fundamental splitup of your patches -

sure ..basically the changes required in the CFS core are the introduction of two
structures - struct sched_entity and struct lrq - and the generalization of
CFS routines to operate on these structures rather than on
task_struct/rq structures directly.

In the first split, I will ensure that this generalization is applied
only to tasks (which represents reorganization of core with no
behavioral/functional change in scheduler) and in a subsequent split/patch I
will apply the generalization to uids also (which will add group fairness
aspect to scheduler), as you require.
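
Condensed, the two-level pick from the patch boils down to something like this
(same names as in the patch, reproduced here only for clarity):

	static struct task_struct * pick_next_task_fair(struct rq *rq, u64 now)
	{
		struct lrq *lrq = &rq->lrq;		/* group-level rb-tree */
		struct sched_entity *se = pick_next_entity(lrq, now);

		if (se->my_q)		/* picked entity is a group, not a task */
			se = pick_next_entity(se->my_q, now);

		return task_entity(se);
	}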

Thanks for your feedback so far!

> that way i can look at the core changes (and possibly apply them) without
> having to consider the uid based changes (which do change behavior and
> which need more upstream buy-in).

--
Regards,
vatsa

2007-05-25 13:08:51

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

Ingo Molnar wrote:
> * Srivatsa Vaddagiri <[email protected]> wrote:
>
>
>>Can you repeat your tests with this patch pls? With the patch applied,
>>I am now getting the same split between nice 0 and nice 10 task as
>>CFS-v13 provides (90:10 as reported by top )
>>
>> 5418 guest 20 0 2464 304 236 R 90 0.0 5:41.40 3 hog
>> 5419 guest 30 10 2460 304 236 R 10 0.0 0:43.62 3 nice10hog
>
>
> btw., what are you thoughts about SMP?
>
> it's a natural extension of your current code. I think the best approach
> would be to add a level of 'virtual CPU' objects above struct user. (how
> to set the attributes of those objects is open - possibly combine it
> with cpusets?)

> That way the scheduler would first pick a "virtual CPU" to schedule, and
> then pick a user from that virtual CPU, and then a task from the user.

don't you mean vice versa:
first a user to schedule, then a VCPU (which is essentially a runqueue or rbtree),
then a task from the VCPU?

this is the approach we use in OpenVZ and if you don't mind
I would propose to go this way for fair-scheduling in mainstream.
It has it's own advantages and disatvantages.

This is not the easy way to go and I can outline the problems/disadvantages
which appear on this way:
- tasks which bind to CPU mask will bind to virtual CPUs.
no problem with user tasks, but some kernel threads
use this to do CPU-related management (like cpufreq).
This can be fixed using SMP IPI actually.
- VCPUs should not change PCPUs very frequently,
otherwise there is some overhead. Solvable.

Advantages:
- High precision and fairness.
- Allows using different group scheduling algorithms
on top of the VCPU concept.
OpenVZ uses a fair scheduler with a CPU limiting feature that allows
setting the maximum CPU time given to a group of tasks.

> To make group accounting scalable, the accounting object attached to the
> user struct should/must be per-cpu (per-vcpu) too. That way we'd have a
> clean hierarchy like:
>
> CPU #0 => VCPU A [ 40% ] + VCPU B [ 60% ]
> CPU #1 => VCPU C [ 30% ] + VCPU D [ 70% ]

how did you select these 40%:60% and 30%:70% split?

> VCPU A => USER X [ 10% ] + USER Y [ 90% ]
> VCPU B => USER X [ 10% ] + USER Y [ 90% ]
> VCPU C => USER X [ 10% ] + USER Y [ 90% ]
> VCPU D => USER X [ 10% ] + USER Y [ 90% ]
>
> the scheduler first picks a vcpu, then a user from a vcpu. (the actual
> external structure of the hierarchy should be opaque to the scheduler
> core, naturally, so that we can use other hierarchies too)
>
> whenever the scheduler does accounting, it knows where in the hierarchy
> it is and updates all higher level entries too. This means that the
> accounting object for USER X is replicated for each VCPU it participates
> in.

So if 2 VCPUs running on 2 physical CPUs do accounting, they have to update the same
user X accounting information, which is not per-[v]cpu?

> SMP balancing is straightforward: it would fundamentally iterate through
> the same hierarchy and would attempt to keep all levels balanced - i
> abstracted away its iterators already.

Thanks,
Kirill

2007-05-25 15:27:14

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 05:05:16PM +0400, Kirill Korotaev wrote:
> > That way the scheduler would first pick a "virtual CPU" to schedule, and
> > then pick a user from that virtual CPU, and then a task from the user.
>
> don't you mean the vice versa:
> first use to scheduler, then VCPU (which is essentially a runqueue or rbtree),
> then a task from VCPU?
>
> this is the approach we use in OpenVZ [...]

So is this how it looks in OpenVZ?

CONTAINER1 => VCPU0 + VCPU1
CONTAINER2 => VCPU2 + VCPU3

PCPU0 picks a container first, a vcpu next and then a task in it
PCPU1 also picks a container first, a vcpu next and then a task in it.

Few questions:

1. Are VCPU runqueues (on which tasks are present) global queues?

That is, let's say that both PCPU0 and PCPU1 pick CONTAINER1 to schedule
(first level) at the same time and next (let's say) they pick same vcpu
VCPU0 to schedule (second level). Will the two pcpu's now have to be
serialized for scanning task to schedule next (third level) within VCPU0
using a spinlock? Won't that shoot up scheduling costs (esp on large
systems), compared to (local scheduling + balance across cpus once in a
while, the way its done today)?

Or do you require that two pcpus don't schedule the same vcpu at the
same time (the way hypervisors normally work)? Even then I would
imagine a fair level of contention to be present in second step (pick
a virtual cpu from a container's list of vcpus).

2. How would this load balance at virtual cpu level and sched domain based
load balancing interact?

The current sched domain based balancing code has many HT/MC/SMT related
optimizations, which ensure that tasks are spread across physical
threads/cores/packages in a most efficient manner - so as to utilize
hardware bandwidth to the maximum. You would now need to introduce
those optimizations essentially at schedule() time ..? Don't know
if that is a wise thing to do.

3. How do you determine the number of VCPUs per container? Is there any
relation for number of virtual cpus exposed per user/container and
the number of available cpus? For ex: in case of user-driven
scheduling, we would want all users to see the same number of cpus
(which is the number available in the system).

4. VCPU ids (namespace) - is it different for different containers?

For ex: can id's of vcpus belonging to different containers (say VCPU0 and
VCPU2), as seen by users thr' vgetcpu/smp_processor_id() that is, be same?
If so, then potentially two threads belonging to different users may find
that they are running -truly simultaneously- on /same/ cpu 0 (one on
VCPU0/PCPU0 and another on VCPU2/PCPU1) which normally isn't possible!

This may be ok for containers, with non-overlapping cpu id namespace,
but when applied to group scheduling for, say, users, which require a
global cpu id namespace, wondering how that would be addressed ..


> and if you don't mind I would propose to go this way for fair-scheduling in
> mainstream.
> It has it's own advantages and disatvantages.
>
> This is not the easy way to go and I can outline the problems/disadvantages
> which appear on this way:
> - tasks which bind to CPU mask will bind to virtual CPUs.
> no problem with user tasks, [...]

Why is this not a problem for user tasks? Tasks which bind to different
CPUs for performance reasons can now find that they are running on the same
(physical) CPU unknowingly.

> but some kernel threads
> use this to do CPU-related management (like cpufreq).
> This can be fixed using SMP IPI actually.
> - VCPUs should not change PCPUs very frequently,
> otherwise there is some overhead. Solvable.
>
> Advantages:
> - High precision and fairness.

I just don't know if this benefit of high degree of fairness is worth the
complexity it introduces. Besides, having some data which shows how much better
it is with respect to fairness/overhead when compared with other approaches
(like smpnice) would help I guess. I will however let experts like Ingo make
the final call here :)

--
Regards,
vatsa

2007-05-25 16:06:25

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 23, 2007 at 11:03:16AM -0700, William Lee Irwin III wrote:
> Well, SMP load balancing is what makes all this hard.

Agreed. I am optimistic that we can achieve good degree of SMP
fairness using similar mechanism as smpnice ..

> On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> > Salient points which needs discussion:
> > 1. This patch reuses CFS core to achieve fairness at group level also.
> > To make this possible, CFS core has been abstracted to deal with generic
> > schedulable "entities" (tasks, users etc).
>
> The ability to handle deeper hierarchies would be useful for those
> who want such semantics.

sure, although the more levels of hierarchy the scheduler recognizes, the
higher the (accounting/scheduling) cost is!

> On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> > 2. The per-cpu rb-tree has been split to be per-group per-cpu.
> > schedule() now becomes two step on every cpu : pick a group first (from
> > group rb-tree) and a task within that group next (from that group's task
> > rb-tree)
>
> That assumes per-user scheduling groups; other configurations would
> make it one step for each level of hierarchy. It may be possible to
> reduce those steps to only state transitions that change weightings
> and incremental updates of task weightings. By and large, one needs
> the groups to determine task weightings as opposed to hierarchically
> scheduling, so there are alternative ways of going about this, ones
> that would even make load balancing easier.

Yeah I agree that providing hierarchical group-fairness at the cost of single
(or fewer) scheduling levels would be a nice thing to target for,
although I don't know of any good way to do it. Do you have any ideas
here? Doing group fairness in a single level, using a common rb-tree for tasks
from all groups is very difficult IMHO. We need at least two levels.

One possibility is that we recognize deeper hierarchies only in user-space,
but flatten this view from kernel perspective i.e some user space tool
will have to distribute the weights accordingly in this flattened view
to the kernel.
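
For example, such a tool could multiply the fractional shares along the path
from the root to each leaf and hand the kernel a single flat weight per group
(a sketch under that assumption; the helper below is hypothetical):

	/* Hypothetical user-space helper: flatten a hierarchy by multiplying
	 * the share of each level along the path (shares given out of 'scale'). */
	unsigned long flattened_weight(const unsigned long *share, int depth,
				       unsigned long scale)
	{
		unsigned long w = scale;
		int i;

		for (i = 0; i < depth; i++)
			w = w * share[i] / scale;

		return w;
	}

e.g. a group entitled to 50% of a user who is in turn entitled to 20% of the
machine would end up with a flat weight of 10% of the scale.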

> On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> > 3. Grouping mechanism - I have used 'uid' as the basis of grouping for
> > timebeing (since that grouping concept is already in mainline today).
> > The patch can be adapted to a more generic process grouping mechanism
> > (like http://lkml.org/lkml/2007/4/27/146) later.
>
> I'd like to see how desirable the semantics achieved by reflecting
> more of the process hierarchy structure in the scheduler groupings are.
> Users, sessions, pgrps, and thread_groups would be the levels of
> hierarchy there, where some handling of orphan pgrps is needed.

Good point. Essentially all users should get fair cpu first, then all
sessions/pgrps under a user should get fair share, followed by
process-groups under a session, followed by processes in a
process-group, followed by threads in a process (phew) .. ?

The container patches by Paul Menage at http://lkml.org/lkml/2007/4/27/146
provide a generic enough mechanism to group tasks in a hierarchical
manner for each resource controller. For ex: for the cpu controller, if the
desired fairness is as per the above scheme (user/session/pgrp/threads etc),
then it is possible to write a script which creates such a tree under cpu
controller filesystem:

# mkdir /dev/cpuctl
# mount -t container -o cpuctl none /dev/cpuctl

/dev/cpuctl is the cpu controller filesystem which can look like this:

/dev/cpuctl
|----uid root
| |-- sid 10
| | |------ pgrp 20
| | | |-- process 100
| | | |-- process 101
| | | |
| |
| |-- sid 11
|
|--- uid guest

(If the cpu controller really supports those many levels that is!)

User scripts can be written to modify this filesystem tree upon every
login/session/user creation (if that is possible to trap). Essentially it
lets these semantics (what you ask about) be dynamic/tunable by the user.

> Kernel compiles are markedly poor benchmarks. Try lat_ctx from lmbench,
> VolanoMark, AIM7, OAST, SDET, and so on.

Thanks for this list of tests. I intend to run all of them if possible
for my next version.

--
Regards,
vatsa

2007-05-25 16:21:14

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

Srivatsa Vaddagiri wrote:
> On Fri, May 25, 2007 at 05:05:16PM +0400, Kirill Korotaev wrote:
>
>>>That way the scheduler would first pick a "virtual CPU" to schedule, and
>>>then pick a user from that virtual CPU, and then a task from the user.
>>
>>don't you mean vice versa:
>>first a user to schedule, then a VCPU (which is essentially a runqueue or rbtree),
>>then a task from the VCPU?
>>
>>this is the approach we use in OpenVZ [...]
>
>
> So is this how it looks in OpenVZ?
>
> CONTAINER1 => VCPU0 + VCPU1
> CONTAINER2 => VCPU2 + VCPU3
>
> PCPU0 picks a container first, a vcpu next and then a task in it
> PCPU1 also picks a container first, a vcpu next and then a task in it.

correct.

> Few questions:
>
> 1. Are VCPU runqueues (on which tasks are present) global queues?
>
> That is, let's say that both PCPU0 and PCPU1 pick CONTAINER1 to schedule
> (first level) at the same time and next (let's say) they pick same vcpu
> VCPU0 to schedule (second level). Will the two pcpu's now have to be
> serialized for scanning task to schedule next (third level) within VCPU0
> using a spinlock? Won't that shoot up scheduling costs (esp on large
> systems), compared to (local scheduling + balance across cpus once in a
> while, the way its done today)?
> Or do you require that two pcpus don't schedule the same vcpu at the
> same time (the way hypervisors normally work)? Even then I would
> imagine a fair level of contention to be present in second step (pick
> a virtual cpu from a container's list of vcpus).

2 physical CPUs can't select the same VCPU at the same time,
i.e. a VCPU can be running on only 1 PCPU at any given moment,
and vice versa: a PCPU can run only 1 VCPU at a given moment.

So serialization is done only at the moment when we need to assign a VCPU to a
PCPU, not when we select a particular task from the runqueue.

About the contention: you can control how often VCPUs should be rescheduled,
so the contention can be quite small. This contention is unavoidable in any fair
scheduler since fairness implies cross-CPU accounting and decision making at
least once in some period of time.

Well, it is possible to avoid contention entirely - if we do fair scheduling
separately on each CPU. But in this case we still do user-based balancing
(which requires serialization) and precision can be nasty.
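
For what it's worth, a rough sketch of that split (purely illustrative - the
structures and names below are invented for this mail, not actual OpenVZ code):

struct vcpu {
        struct list_head node;          /* link in container->idle_vcpus */
        struct rb_root tasks;           /* this vcpu's private runqueue */
};

struct container {
        spinlock_t lock;                /* serializes vcpu <-> pcpu assignment only */
        struct list_head idle_vcpus;    /* vcpus not currently backed by a pcpu */
};

static struct vcpu *claim_vcpu(struct container *cont)
{
        struct vcpu *v = NULL;

        /* The only serialized step: bind one idle vcpu to this pcpu. */
        spin_lock(&cont->lock);
        if (!list_empty(&cont->idle_vcpus)) {
                v = list_entry(cont->idle_vcpus.next, struct vcpu, node);
                list_del(&v->node);
        }
        spin_unlock(&cont->lock);

        /*
         * Task selection from v->tasks then proceeds without the container
         * lock, since no other pcpu can be running this vcpu.
         */
        return v;
}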

> 2. How would this load balance at virtual cpu level and sched domain based
> load balancing interact?
>
> The current sched domain based balancing code has many HT/MC/SMT related
> optimizations, which ensure that tasks are spread across physical
> threads/cores/packages in a most efficient manner - so as to utilize
> hardware bandwidth to the maximum. You would now need to introduce
> those optimizations essentially at schedule() time ..? Don't know
> if that is a wise thing to do.

Load balancing is done taking into account the *current* VCPU assignments to PCPUs,
i.e. sched domains are taken into account.
Nothing is introduced at schedule() time - not sure what you actually meant by this.

> 3. How do you determine the number of VCPUs per container? Is there any
> relation for number of virtual cpus exposed per user/container and
> the number of available cpus? For ex: in case of user-driven
> scheduling, we would want all users to see the same number of cpus
> (which is the number available in the system).

by default every user is given num_online_cpus() VCPUs, i.e. it can
run on all physical CPUs at the same time. If needed a user can be limited.


> 4. VCPU ids (namespace) - is it different for different containers?

yes.

> For ex: can id's of vcpus belonging to different containers (say VCPU0 and
> VCPU2), as seen by users thr' vgetcpu/smp_processor_id() that is, be same?

yes.

> If so, then potentially two threads belonging to different users may find
> that they are running -truly simultaneously- on /same/ cpu 0 (one on
> VCPU0/PCPU0 and another on VCPU2/PCPU1) which normally isn't possible!

yes. but for user space this has no implications. You see, there is no way for user space
to determine whether it is "-truly simultaneously- running on /same/ cpu 0".

> This may be ok for containers, with non-overlapping cpu id namespace,
> but when applied to group scheduling for, say, users, which require a
> global cpu id namespace, wondering how that would be addressed ..

very simple imho.
the only way from user space to get some task's CPU id is /proc.
All you need is to return *some* value there.
For example, one can report the PCPU id to which the VCPU is assigned.

>>and if you don't mind I would propose to go this way for fair-scheduling in
>>mainstream.
>>It has it's own advantages and disatvantages.
>>
>>This is not the easy way to go and I can outline the problems/disadvantages
>>which appear on this way:
>>- tasks which bind to CPU mask will bind to virtual CPUs.
>> no problem with user tasks, [...]
>
>
> Why is this not a problem for user tasks? Tasks which bind to different
> CPUs for performance reasons now can find that they are running on the same
> (physical) CPU unknowingly.

if there is no high load, tasks will be running on different PCPUs
as the author intended, since VCPUs will get different PCPUs for sure.
Otherwise it is not performance critical, and moreover it is in *conflict* with *fairness*.

Let me provide an example of why binding conflicts with fairness.
Imagine that we have 2 USERs - USER1 and USER2 - and 2 CPUs in the system.
USER1 has 50 tasks bound to CPU0 and 50 tasks bound to CPU1.
USER2 has 1 task.

Let USER2 be as important as USER1 is, so these USERs should
share total CPU time 1:1.
How will it work with your approach?

>>but some kernel threads
>> use this to do CPU-related management (like cpufreq).
>> This can be fixed using SMP IPI actually.
>>- VCPUs should not change PCPUs very frequently,
>> otherwise there is some overhead. Solvable.
>>
>>Advantages:
>>- High precision and fairness.
>
>
I just don't know if this benefit of a high degree of fairness is worth the
complexity it introduces. Besides, having some data which shows how much better
it is with respect to fairness/overhead when compared with other approaches
(like smpnice) would help, I guess. I will however let experts like Ingo make
the final call here :)

sure. The "perfect" solution doesn't exist :( So I would be happy to know Ingo
opinion as well.

Thanks,
Kirill

2007-05-25 17:15:42

by Tong Li

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, 2007-05-25 at 21:44 +0530, Srivatsa Vaddagiri wrote:
> >
> > That assumes per-user scheduling groups; other configurations would
> > make it one step for each level of hierarchy. It may be possible to
> > reduce those steps to only state transitions that change weightings
> > and incremental updates of task weightings. By and large, one needs
> > the groups to determine task weightings as opposed to hierarchically
> > scheduling, so there are alternative ways of going about this, ones
> > that would even make load balancing easier.
>
> Yeah I agree that providing hierarchical group-fairness at the cost of single
> (or fewer) scheduling levels would be a nice thing to target for,
> although I don't know of any good way to do it. Do you have any ideas
> here? Doing group fairness in a single level, using a common rb-tree for tasks
> from all groups is very difficult IMHO. We need atleast two levels.
>
> One possibility is that we recognize deeper hierarchies only in user-space,
> but flatten this view from kernel perspective i.e some user space tool
> will have to distributed the weights accordingly in this flattened view
> to the kernel.

Nice work, Vatsa. When I wrote the DWRR algorithm, I flattened the
hierarchies into one level, so maybe that approach can be applied to
your code as well. What I did is to maintain task and task group weights
and reservations separately from the scheduler, while the scheduler only
sees one system-wide weight per task and does not concern itself with which
group a task is in. The key here is that the system-wide weight of each task
should represent a share equivalent to the share represented by the
group hierarchies. To do this, the scheduler looks up the task and group
weights/reservations it maintains, and dynamically computes the
system-wide weight *only* when it needs a weight for a given task while
scheduling. The on-demand weight computation makes sure the cost is
small (constant time). The computation itself can be seen from an
example: assume we have a group of two tasks and the group's total share
is represented by a weight of 10. Inside the group, let's say the two
tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
weights into one level without changing the shares they represent.
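
The on-demand computation really is just a couple of lines; a toy version of
the 10 / (1, 2) example above (plain C, with made-up field names):

#include <stdio.h>

struct group { double weight, sum_task_weights; };
struct task  { double weight; struct group *grp; };

/* System-wide weight, computed on demand in constant time: the task's
 * share of its group's weight. */
static double system_weight(const struct task *t)
{
        return t->grp->weight * t->weight / t->grp->sum_task_weights;
}

int main(void)
{
        struct group g  = { .weight = 10.0, .sum_task_weights = 3.0 };
        struct task  p1 = { .weight = 1.0, .grp = &g };
        struct task  p2 = { .weight = 2.0, .grp = &g };

        /* Prints P1: 3.3333  P2: 6.6667, i.e. 10/3 and 20/3 as above. */
        printf("P1: %.4f  P2: %.4f\n", system_weight(&p1), system_weight(&p2));
        return 0;
}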

tong

2007-05-25 18:01:16

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 08:18:56PM +0400, Kirill Korotaev wrote:
> 2 physical CPUs can't select the same VCPU at the same time.
> i.e. VCPU can be running on 1 PCPU only at the moment.
> and vice versa: PCPU can run only 1 VCPU at the given moment.
>
> So serialization is done when we need to assign VCPU to PCPU moment only,
> not when we select a particular task from the runqueue.
>
> About the contention: you can control how often VCPUs should be rescheduled,
> so the contention can be quite small. This contention is unavoidable in any fair
> scheduler since fairness implies across CPUs accounting and decision making at least
> with some period of time.
>
> Well it is possible to avoid contention at all - if we do fair scheduling
> separately on each CPU. But in this case we still do user-based balancing
> (which requires serialization) and precision can be nasty.

I guess how nasty or not it is depends on the interval over which fairness
is expected. The longer the interval, the better the scope to provide good fairness
based on load-balancing schemes like sched-domains/smpnice ..

> > 2. How would this load balance at virtual cpu level and sched domain based
> > load balancing interact?

[snip]

> load balancing is done taking into account *current* VCPUs assignments to PCPUs.
> i.e. sched domains are taken into account.
> nothing is introduces at schedule() time - not sure what you meant actually by this.
>

Basically, let's say that there are 2 CPUs, each with two threads:

          --------------------
          |                  |
      ---------          ---------
      |       |          |       |
    CPU0    CPU1       CPU2    CPU3

CPU0/1 and CPU2/3 are siblings.

Let's say that CPU0 has two or more tasks while CPU[1-3] are idle. Then the HT
optimizations in the sched-domains-based load balancer ensure that CPU2 or
CPU3 picks up one (or more) task from CPU0 rather than CPU1 picking it up. The
same logic works at higher packaging levels (core, package, node? etc).

How would the virtual cpu scheduler achieve such optimizations? My
thought was that such optimizations (if they have to be retained) have to be
introduced at virtual cpu schedule time ..?

Not just this, the sched domain based load balancer has other (nice?)
properties, e.g. it balances less frequently across nodes than across cpus
inside a node. This lets tasks execute longer on the same node.

Essentially what I wanted to know was: what will be the role of the sched
domain based load balancer on top of the virtual cpu load balancer? Do you
disable the sched domain based balancer completely? Or does it balance tasks
across virtual cpus? Given that virtual cpus can run anywhere, it needs
to be significantly changed to understand that (for ex: the HT/MC
optimizations in the sched domain balancer need to become aware of this
virtual->physical cpu relationship).

> > 4. VCPU ids (namespace) - is it different for different containers?
>
> yes.

Is this namespace different inside kernel context too?

To illustrate, consider my previous example:

CONTAINER1 => VCPU0 + VCPU1
CONTAINER2 => VCPU2 + VCPU3

Let's say that T1 in CONTAINER1 is running on VCPU0/PCPU0 at the same time
that T2 in CONTAINER2 is running on VCPU2/PCPU1. As we discussed
earlier, they both can see themselves running on the same (virtual) cpu 0. Let's
say they both make system calls now; what does the kernel see this cpu id as
through smp_processor_id()? Hopefully it is different for the two threads
when they are inside the kernel ..

> > For ex: can id's of vcpus belonging to different containers (say VCPU0 and
> > VCPU2), as seen by users thr' vgetcpu/smp_processor_id() that is, be same?
>
> yes.
>
> > If so, then potentially two threads belonging to different users may find
> > that they are running -truly simultaneously- on /same/ cpu 0 (one on
> > VCPU0/PCPU0 and another on VCPU2/PCPU1) which normally isn't possible!
>
> yes. but for user space this has no any implications. You see, there is no way for user space
> to determine whether it is "-truly simultaneously- running on /same/ cpu 0".

Hmm ..what if some user space app maintains per-cpu stuff and (now that
we return the same cpu id to both tasks running simultaneously on two different
physical cpus) they collide writing to the same per-cpu area?

> > This may be ok for containers, with non-overlapping cpu id namespace,
> > but when applied to group scheduling for, say, users, which require a
> > global cpu id namespace, wondering how that would be addressed ..
>
> very simple imho.
> the only way from user space to get some task CPU id is /proc.

I believe there is a syscall (vgetcpu?) in works to get a cpu id.

> All you need is to return *some* value there.
> For example, one can report PCPU id to which VCPU is assigned.

That may break other things. What if a task had bound itself to (virtual) cpu0
and now its call to vgetcpu returns different values at different times,
based on where that virtual cpu0 was running at that moment!

IMHO the introduction of virtual cpus for cases which require a global cpu
id space will be a pretty complex thingy (both for user space and for the
kernel). Not sure if it is worth the benefits.

> > Why is this not a problem for user tasks? Tasks which bind to different
> > CPUs for performance reasons now can find that they are running on the same
> > (physical) CPU unknowingly.
>
> if there is no high load, tasks will be running on different PCPUs
> as the author intended, since VCPUs will get different PCPUs for sure.

Again it depends on the application, I guess. Even under high load, it
may be the user's expectation to run on the same cpu for correctness
reasons.

> Otherwise it is not performance critical, and moreover it is in *conflict* with *fairness*.
>
> Let me provide an example of why binding conflicts with fairness.
> Imagine that we have 2 USERs - USER1 and USER2 - and 2 CPUs in the system.
> USER1 has 50 tasks bound to CPU0 and 50 tasks bound to CPU1.
> USER2 has 1 task.
>
> Let USER2 be as important as USER1 is, so these USERs should
> share total CPU time 1:1.
> How will it work with your approach?

Good example :) USER2's single task will have to share its CPU with
USER1's 50 tasks (unless we modify the smpnice load balancer to
disregard cpu affinity that is - which I would not prefer to do).

Ingo/Peter, any thoughts here? CFS and smpnice are probably "broken"
with respect to an example such as the above, albeit for nice-based tasks.

--
Regards,
vatsa

2007-05-26 00:17:58

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

Srivatsa Vaddagiri wrote:
> Good example :) USER2's single task will have to share its CPU with
> USER1's 50 tasks (unless we modify the smpnice load balancer to
> disregard cpu affinity that is - which I would not prefer to do).

I don't think that ignoring cpu affinity is an option. Setting the cpu
affinity of tasks is a deliberate policy action on the part of the
system administrator and has to be honoured. Load balancing has to do
the best it can in these circumstances, which may mean sub-optimal
distribution of the load, BUT it is the result of a deliberate policy
decision by the system administrator.

>
> Ingo/Peter, any thoughts here? CFS and smpnice are probably "broken"
> with respect to an example such as the above, albeit for nice-based tasks.
>

See above. I think that faced with cpu affinity use by the system
administrator that smpnice will tend towards a task to cpu allocation
that is (close to) the best that can be achieved without violating the
cpu affinity assignments. (It may take a little longer than normal but
it should get there eventually.)

You have to assume that the system administrator knows what (s)he's
doing and is willing to accept the impact of their policy decision on
the overall system performance.

Having said that, if it was deemed necessary you could probably increase
the speed at which the load balancer converged on a good result in the
face of cpu affinity by keeping a "pinned weighted load" value for each
run queue and using that to modify find_busiest_group() and
find_busiest_queue() to be a bit smarter. But I'm not sure that it
would be worth the added complexity.
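
Not a real patch, just a sketch of the bookkeeping I mean (rq->pinned_load
and the helper below are invented names; cpus_allowed/load_weight are the
existing fields):

static inline int task_is_pinned(struct task_struct *p)
{
        return cpus_weight(p->cpus_allowed) == 1;
}

/* Called from the enqueue/dequeue paths to keep rq->pinned_load current. */
static void account_pinned(struct rq *rq, struct task_struct *p, int enqueue)
{
        if (task_is_pinned(p)) {
                if (enqueue)
                        rq->pinned_load += p->load_weight;
                else
                        rq->pinned_load -= p->load_weight;
        }
}

find_busiest_group()/find_busiest_queue() could then compare the movable load
of run queues (raw weighted load minus pinned load) rather than the raw load.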

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-26 15:44:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

Srivatsa Vaddagiri wrote:
>> Ingo/Peter, any thoughts here? CFS and smpnice are probably "broken"
>> with respect to an example such as the above, albeit for nice-based tasks.

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
> See above. I think that faced with cpu affinity use by the system
> administrator that smpnice will tend towards a task to cpu allocation
> that is (close to) the best that can be achieved without violating the
> cpu affinity assignments. (It may take a little longer than normal but
> it should get there eventually.)
> You have to assume that the system administrator knows what (s)he's
> doing and is willing to accept the impact of their policy decision on
> the overall system performance.
> Having said that, if it was deemed necessary you could probably increase
> the speed at which the load balancer converged on a good result in the
> face of cpu affinity by keeping a "pinned weighted load" value for each
> run queue and using that to modify find_busiest_group() and
> find_busiest_queue() to be a bit smarter. But I'm not sure that it
> would be worth the added complexity.

Just in case anyone was looking for algorithms...

Lag should be considered in lieu of load because lag is what the
scheduler is trying to minimize; load is not directly relevant, but
appears to have some sort of relationship. Also, instead of pinned,
unpinned should be considered. It's unpinned that load balancing can
actually migrate. Using the signed minimax pseudonorm (i.e. the highest
signed lag, where positive is higher than all negative regardless of
magnitude) on unpinned lags yields a rather natural load balancing
algorithm consisting of migrating from highest to lowest signed lag,
with progressively longer periods for periodic balancing across
progressively higher levels of hierarchy in sched_domains etc. as usual.
Basically skip over pinned tasks as far as lag goes.
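
In code form the core of that might be hedged roughly as follows (a sketch
only; task_lag() and the rq->runnable list are hypothetical stand-ins, not
existing scheduler fields):

/*
 * Worst (highest) signed lag among the tasks on this runqueue that load
 * balancing could actually migrate; pinned tasks are skipped entirely.
 */
static long max_unpinned_lag(struct rq *rq, struct task_struct **worst)
{
        struct task_struct *p;
        long max = LONG_MIN;

        *worst = NULL;
        list_for_each_entry(p, &rq->runnable, run_list) {
                if (cpus_weight(p->cpus_allowed) == 1)
                        continue;       /* pinned to one cpu: not migratable */
                if (task_lag(p) > max) {
                        max = task_lag(p);
                        *worst = p;
                }
        }
        return max;
}

A balancer built on this would migrate the returned task from the cpu where
this value is highest to the cpu where it is lowest, on the usual
sched_domains periods.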

The trick with all that comes when tasks are pinned within a set of
cpus (especially crossing sched_domains) instead of to a single cpu.
There one can just consider a cpu to enter a periodic load balance
cycle, and then consider pushing and pulling, perhaps what could be
called the "exchange lags" for the pair of cpus. That would be the
minimax lag pseudonorms for the tasks migratable to both cpus of the
pair. That makes the notion of moving things from highest to lowest
lag (where load is now considered) unambiguous apart from whether all
this converges, but not when to actually try to load balance vs. when
not to, or when it's urgent vs. when it should be done periodically.

To clarify that, an O(cpus**2) notion appears to be necessary, namely
the largest exchange lag differential between any pair of cpus. There
is also the open question of whether moving tasks between cpus with the
highest exchange lag differential will actually reduce it or whether it
runs the risk of increasing it by creating a larger exchange lag
differential between different pairs of cpus. A similar open question
is raised by localizing balancing decisions to sched_domains. What
remains clear is that any such movement reduces the worst-case lag in
the whole system. Because of that, the worst-case lag in the whole
system monotonically decreases as balancing decisions are made, and
that much is subject to an infinite descent argument. Unfortunately,
determining the largest exchange lag differential appears to be more
complex than merely finding the highest and lowest lags. Bipartite
forms of the problem also arise from sched_domains.

I doubt anyone's really paying any sort of attention, so I'll not
really bother working out much more in the way of details with respect
to load balancing. It may be that there are better ways to communicate
algorithmic notions than prose descriptions. However, it's doubtful I'll
produce anything in a timely enough fashion to attract or hold interest.

The smpnice affair is better phrased in terms of task weighting. It's
simple to honor nice in such an arrangement. First unravel the
grouping hierarchy, then weight by nice. This looks like

task nice hier1 hier2 ... hierN
t_1 w_n1 w_h11 w_h21 ... w_hN1
t_2 w_n2 w_h12 w_h22 ... w_hN2
...

For the example of nice 0 vs. nice 10 as distinct users with 10%
steppings between nice levels, one would have

task nice hier1
t_1 1 1
t_2 0.3855 1

w_1, the weight of t_1, would be
(w_h11*w_n1/(w_h11*w_n1 + w_h12*w_n2))
= (1*1/(1 + 1*0.3855..))
= 0.7217..
w_2, the weight of t_2, would be
(w_h12*w_n2/(w_h11*w_n1 + w_h12*w_n2))
= (1*0.3855../(1 + 1*0.3855..))
= 0.27826..
This just so happens to work out to being the same as if t_1 and t_2
had their respective nice numbers without the scheduler grouping, which
is basically what everyone wants to happen.

It's more obvious how to extend it to more tasks than levels of
hierarchy. An example of that follows:

task nice hier1 hier2 ... hierN
t_1 0.3 0.6 * ... *
t_2 0.7 0.4 * ... *

hier2 through hierN are ignorable since t_1 and t_2 are both the only
members at those levels of hierarchy. We then get something just like
the above example, w_1 = 0.3*0.6/(0.3*0.6+0.7*0.4) = 0.3913.. and
w2 = 0.7*0.4/(0.3*0.6+0.7*0.4) = 0.6087..

It's more interesting with enough tasks to have more meaningful levels
of hierarchy.

task nice hier1 hier2
t_1 0.7 0.6 0.6
t_2 0.3 0.6 0.4
t_3 0.7 0.4 0.6
t_4 0.3 0.4 0.4

where t_1 and t_2 share a hier1 grouping and t_3 and t_4 also share
a hier1 grouping, but the hier1 grouping for t_1 and t_2 is distinct
from the hier1 grouping for t_3 and t_4. All hier2 groupings are
distinct. So t_1 would have pre-nice weight 0.6*0.6, t_2 0.6*0.4,
t_3 0.6*0.4, and t_4 0.4*0.4 (the numbers were chosen so denominators
conveniently collapse to 1). Now that the hierarchy is flattened,
nice numbers can be factored in for t_1's final weight being
0.7*0.36/(0.7*0.36+0.3*0.24+0.7*0.24+0.3*0.16) = 0.252/0.54 = 0.467..
and the others being 0.133.. (t_2), 0.311.. (t_3), and 0.0889.. (t_4).

In such a manner nice numbers obey the principle of least surprise.
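
For concreteness, the four-task example above in a few lines of plain,
standalone C (nothing scheduler-specific about it):

#include <stdio.h>

int main(void)
{
        /* nice weights and per-level hierarchy weights for t_1..t_4,
         * exactly as in the table above */
        double nice[4]  = { 0.7, 0.3, 0.7, 0.3 };
        double hier1[4] = { 0.6, 0.6, 0.4, 0.4 };
        double hier2[4] = { 0.6, 0.4, 0.6, 0.4 };
        double w[4], sum = 0.0;
        int i;

        for (i = 0; i < 4; i++) {
                /* flatten the grouping hierarchy first, then fold in nice */
                w[i] = nice[i] * hier1[i] * hier2[i];
                sum += w[i];
        }
        for (i = 0; i < 4; i++)
                printf("t_%d: %.4f\n", i + 1, w[i] / sum);
        /* prints 0.4667, 0.1333, 0.3111, 0.0889 as worked out above */
        return 0;
}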


-- wli

2007-05-27 01:30:14

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

William Lee Irwin III wrote:
> Srivatsa Vaddagiri wrote:
>>> Ingo/Peter, any thoughts here? CFS and smpnice are probably "broken"
>>> with respect to an example such as the above, albeit for nice-based tasks.
>
> On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
>> See above. I think that faced with cpu affinity use by the system
>> administrator that smpnice will tend towards a task to cpu allocation
>> that is (close to) the best that can be achieved without violating the
>> cpu affinity assignments. (It may take a little longer than normal but
>> it should get there eventually.)
>> You have to assume that the system administrator knows what (s)he's
>> doing and is willing to accept the impact of their policy decision on
>> the overall system performance.
>> Having said that, if it was deemed necessary you could probably increase
>> the speed at which the load balancer converged on a good result in the
>> face of cpu affinity by keeping a "pinned weighted load" value for each
>> run queue and using that to modify find_busiest_group() and
>> find_busiest_queue() to be a bit smarter. But I'm not sure that it
>> would be worth the added complexity.
>
> Just in case anyone was looking for algorithms...
>
> Lag should be considered in lieu of load because lag

What's the definition of lag here?

> is what the
> scheduler is trying to minimize;

This isn't always the case. Some may prefer fairness to minimal lag.
Others may prefer particular tasks to receive preferential treatment.

> load is not directly relevant, but
> appears to have some sort of relationship. Also, instead of pinned,
> unpinned should be considered.

If you have total and pinned you can get unpinned. It's probably
cheaper to maintain data for pinned than unpinned as there's less of it
on normal systems.

> It's unpinned that load balancing can
> actually migrate.

True but see previous comment.

> Using the signed minimax pseudonorm (i.e. the highest
> signed lag, where positive is higher than all negative regardless of
> magnitude) on unpinned lags yields a rather natural load balancing
> algorithm consisting of migrating from highest to lowest signed lag,
> with progressively longer periods for periodic balancing across
> progressively higher levels of hierarchy in sched_domains etc. as usual.
> Basically skip over pinned tasks as far as lag goes.
>
> The trick with all that comes when tasks are pinned within a set of
> cpus (especially crossing sched_domains) instead of to a single cpu.

Yes, this makes the cost of maintaining the required data higher which
makes keeping pinned data more attractive than unpinned.

BTW keeping data for sets of CPU affinities could cause problems as the
number of possible sets is quite large (being 2 to the power of the
number of CPUs). So you need an algorithm based on pinned data for
single CPUs that knows the pinning isn't necessarily exclusive rather
than one based on sets of CPUs. As I understand it (which may be
wrong), the mechanism you describe below takes that approach.

> There one can just consider a cpu to enter a periodic load balance
> cycle, and then consider pushing and pulling, perhaps what could be
> called the "exchange lags" for the pair of cpus. That would be the
> minimax lag pseudonorms for the tasks migratable to both cpus of the
> pair. That makes the notion of moving things from highest to lowest
> lag (where load is now considered) unambiguous apart from whether all
> this converges, but not when to actually try to load balance vs. when
> not to, or when it's urgent vs. when it should be done periodically.
>
> To clarify that, an O(cpus**2) notion appears to be necessary, namely
> the largest exchange lag differential between any pair of cpus. There
> is also the open question of whether moving tasks between cpus with the
> highest exchange lag differential will actually reduce it or whether it
> runs the risk of increasing it by creating a larger exchange lag
> differential between different pairs of cpus. A similar open question
> is raised by localizing balancing decisions to sched_domains. What
> remains clear is that any such movement reduces the worst-case lag in
> the whole system. Because of that, the worst-case lag in the whole
> system monotonically decreases as balancing decisions are made, and
> that much is subject to an infinite descent argument. Unfortunately,
> determining the largest exchange lag differential appears to be more
> complex than merely finding the highest and lowest lags. Bipartite
> forms of the problem also arise from sched_domains.
>
> I doubt anyone's really paying any sort of attention, so I'll not
> really bother working out much more in the way of details with respect
> to load balancing. It may be that there are better ways to communicate
> algorithmic notions than prose descriptions. However, it's doubtful I'll
> produce anything in a timely enough fashion to attract or hold interest.
>
> The smpnice affair is better phrased in terms of task weighting. It's
> simple to honor nice in such an arrangement. First unravel the
> grouping hierarchy, then weight by nice. This looks like
>
> task nice hier1 hier2 ... hierN
> t_1 w_n1 w_h11 w_h21 ... w_hN1
> t_2 w_n2 w_h12 w_h22 ... w_hN2
> ...
>
> For the example of nice 0 vs. nice 10 as distinct users with 10%
> steppings between nice levels, one would have
>
> task nice hier1
> t_1 1 1
> t_2 0.3855 1
>
> w_1, the weight of t_1, would be
> (w_h11*w_n1/(w_h11*w_n1 + w_h12*w_n2))
> = (1*1/(1 + 1*0.3855..))
> = 0.7217..
> w_2, the weight of t_2, would be
> (w_h12*w_n2/(w_h11*w_n1 + w_h12*w_n2))
> = (1*0.3855../(1 + 1*0.3855..))
> = 0.27826..
> This just so happens to work out to being the same as if t_1 and t_2
> had their respective nice numbers without the scheduler grouping, which
> is basically what everyone wants to happen.
>
> It's more obvious how to extend it to more tasks than levels of
> hierarchy. An example of that follows:
>
> task nice hier1 hier2 ... hierN
> t_1 0.3 0.6 * ... *
> t_2 0.7 0.4 * ... *
>
> hier2 through hierN are ignorable since t_1 and t_2 are both the only
> members at those levels of hierarchy. We then get something just like
> the above example, w_1 = 0.3*0.6/(0.3*0.6+0.7*0.4) = 0.3913.. and
> w2 = 0.7*0.4/(0.3*0.6+0.7*0.4) = 0.6087..
>
> It's more interesting with enough tasks to have more meaningful levels
> of hierarchy.
>
> task nice hier1 hier2
> t_1 0.7 0.6 0.6
> t_2 0.3 0.6 0.4
> t_3 0.7 0.4 0.6
> t_4 0.3 0.4 0.4
>
> where t_1 and t_2 share a hier1 grouping and t_3 and t_4 also share
> a hier1 grouping, but the hier1 grouping for t_1 and t_2 is distinct
> from the hier1 grouping for t_3 and t_4. All hier2 groupings are
> distinct. So t_1 would have pre-nice weight 0.6*0.6, t_2 0.6*0.4,
> t_3 0.6*0.4, and t_4 0.4*0.4 (the numbers were chosen so denominators
> conveniently collapse to 1). Now that the hierarchy is flattened,
> nice numbers can be factored in for t_1's final weight being
> 0.7*0.36/(0.7*0.36+0.3*0.24+0.7*0.24+0.3*0.16) = 0.252/0.54 = 0.467..
> and the others being 0.133.. (t_2), 0.311.. (t_3), and 0.0889.. (t_4).
>
> In such a manner nice numbers obey the principle of least surprise.

Is it just me or did you stray from the topic of handling cpu affinity
during load balancing to hierarchical load balancing? I couldn't see
anything in the above explanation that would improve the handling of cpu
affinity.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-28 16:31:28

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote:
> Nice work, Vatsa. When I wrote the DWRR algorithm, I flattened the
> hierarchies into one level, so maybe that approach can be applied to
> your code as well. What I did is to maintain task and task group weights
> and reservations separately from the scheduler, while the scheduler only
> sees one system-wide weight per task and does not concern itself with which
> group a task is in. The key here is that the system-wide weight of each task
> should represent a share equivalent to the share represented by the
> group hierarchies. To do this, the scheduler looks up the task and group
> weights/reservations it maintains, and dynamically computes the
> system-wide weight *only* when it needs a weight for a given task while
> scheduling. The on-demand weight computation makes sure the cost is
> small (constant time). The computation itself can be seen from an
> example: assume we have a group of two tasks and the group's total share
> is represented by a weight of 10. Inside the group, let's say the two
> tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
> P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
> weights into one level without changing the shares they represent.

What do these task weights control? Timeslice primarily? If so, I am not
sure how well it can co-exist with cfs then (unless you are planning to
replace cfs with an equally good interactive/fair scheduler :)

I would be very interested if this weight calculation can be used for
smpnice based load balancing purposes too ..

--
Regards,
vatsa

2007-05-28 17:18:59

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
> I don't think that ignoring cpu affinity is an option. Setting the cpu
> affinity of tasks is a deliberate policy action on the part of the
> system administrator and has to be honoured.

mmm ..but users can set cpu affinity w/o administrator privileges ..

--
Regards,
vatsa

2007-05-29 00:19:00

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

Srivatsa Vaddagiri wrote:
> On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
>> I don't think that ignoring cpu affinity is an option. Setting the cpu
>> affinity of tasks is a deliberate policy action on the part of the
>> system administrator and has to be honoured.
>
> mmm ..but users can set cpu affinity w/o administrator privileges ..
>

OK. So you have to assume the users know what they're doing. :-)

In reality though, the policy of allowing ordinary users to set affinity
on their tasks should be rethought.

In any case, there's no point having cpu affinity if it's going to be
ignored. Maybe you could have two levels of affinity: 1. if set by a
root it must be obeyed; and 2. if set by an ordinary user it can be
overridden if the best interests of the system dictate. BUT I think
that would be a bad idea.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-29 01:56:20

by Paul Menage

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On 5/28/07, Peter Williams <[email protected]> wrote:
>
> In any case, there's no point having cpu affinity if it's going to be
> ignored. Maybe you could have two levels of affinity: 1. if set by a
> root it must be obeyed; and 2. if set by an ordinary user it can be
> overridden if the best interests of the system dictate. BUT I think
> that would be a bad idea.

Something like that already exists (at least for controlling the
bounding set of allowed cpus) via cpusets.

Paul

2007-05-29 03:30:41

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

Peter Williams wrote:
> Srivatsa Vaddagiri wrote:
>> On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
>>> I don't think that ignoring cpu affinity is an option. Setting the
>>> cpu affinity of tasks is a deliberate policy action on the part of
>>> the system administrator and has to be honoured.
>>
>> mmm ..but users can set cpu affinity w/o administrator privileges ..
>>
>
> OK. So you have to assume the users know what they're doing. :-)
>
> In reality though, the policy of allowing ordinary users to set affinity
> on their tasks should be rethought.

After more contemplation, I now think I may have gone overboard here. I
am now of the opinion that any degradation of overall system performance
due to the use of cpu affinity would be confined to the tasks with cpu
affinity set. So there's no need to prevent ordinary users from setting
cpu affinity on their own processes as any degradation will only affect
them.

So it goes back to the situation where you have to assume that they know
what they're doing and obey their policy.

>
> In any case, there's no point having cpu affinity if it's going to be
> ignored. Maybe you could have two levels of affinity: 1. if set by a
> root it must be obeyed; and 2. if set by an ordinary user it can be
> overridden if the best interests of the system dictate. BUT I think
> that would be a bad idea.

This idea is now not just bad but unnecessary.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-29 10:49:31

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

William Lee Irwin III wrote:
>> Lag should be considered in lieu of load because lag

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> What's the definition of lag here?

Lag is the deviation of a task's allocated CPU time from the CPU time
it would be granted by the ideal fair scheduling algorithm (generalized
processor sharing; take the limit of RR with per-task timeslices
proportional to load weight as the scale factor approaches zero).
Negative lag reflects receipt of excess CPU time. A close-to-canonical
"fairness metric" is the maximum of the absolute values of the lags of
all the tasks on the system. The "signed minimax pseudonorm" is the
largest lag without taking absolute values; it's a term I devised ad
hoc to describe the proposed algorithm.
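
As a rough illustration (plain C, names invented, and treating the competing
set of tasks - hence total_weight - as fixed over the interval), the lag of a
runnable task under that definition is its weighted share of the wall-clock
time being divided up, minus the CPU time it actually received:

struct toy_task {
        double weight;          /* load weight */
        double runnable_time;   /* wall-clock time spent runnable */
        double exec_time;       /* CPU time actually received */
};

/* total_weight: sum of the weights of all tasks sharing the CPU. */
double lag(const struct toy_task *t, double total_weight)
{
        double ideal = t->runnable_time * t->weight / total_weight;

        /* positive: owed CPU time; negative: received excess CPU time */
        return ideal - t->exec_time;
}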


William Lee Irwin III wrote:
>> is what the
>> scheduler is trying to minimize;

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> This isn't always the case. Some may prefer fairness to minimal lag.
> Others may prefer particular tasks to receive preferential treatment.

This comment does not apply. Generalized processor sharing expresses
preferential treatment via weighting. Various other forms of
preferential treatment require more elaborate idealized models.


>> load is not directly relevant, but
>> appears to have some sort of relationship. Also, instead of pinned,
>> unpinned should be considered.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> If you have total and pinned you can get unpinned. It's probably
> cheaper to maintain data for pinned than unpinned as there's less of it
> on normal systems.

Regardless of the underlying accounting, I've presented a coherent
algorithm. It may be that there's no demonstrable problem to solve.
On the other hand, if there really is a question as to how to load
balance in the presence of tasks pinned to cpus, I just answered it.


William Lee Irwin III wrote:
>> Using the signed minimax pseudonorm (i.e. the highest
>> signed lag, where positive is higher than all negative regardless of
>> magnitude) on unpinned lags yields a rather natural load balancing
>> algorithm consisting of migrating from highest to lowest signed lag,
>> with progressively longer periods for periodic balancing across
>> progressively higher levels of hierarchy in sched_domains etc. as usual.
>> Basically skip over pinned tasks as far as lag goes.
>> The trick with all that comes when tasks are pinned within a set of
>> cpus (especially crossing sched_domains) instead of to a single cpu.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> Yes, this makes the cost of maintaining the required data higher which
> makes keeping pinned data more attractive than unpinned.
> BTW keeping data for sets of CPU affinities could cause problems as the
> number of possible sets is quite large (being 2 to the power of the
> number of CPUs). So you need an algorithm based on pinned data for
> single CPUs that knows the pinning isn't necessarily exclusive rather
> than one based on sets of CPUs. As I understand it (which may be
> wrong), the mechanism you describe below takes that approach.

Yes, the mechanism I described takes that approach.


William Lee Irwin III wrote:
>> The smpnice affair is better phrased in terms of task weighting. It's
>> simple to honor nice in such an arrangement. First unravel the
>> grouping hierarchy, then weight by nice. This looks like
[...]
>> In such a manner nice numbers obey the principle of least surprise.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> Is it just me or did you stray from the topic of handling cpu affinity
> during load balancing to hierarchical load balancing? I couldn't see
> anything in the above explanation that would improve the handling of cpu
> affinity.

There was a second issue raised to which I responded. I didn't stray
per se. I addressed a second topic in the post.


-- wli

2007-05-30 00:09:45

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Lag should be considered in lieu of load because lag
>
> On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
>> What's the definition of lag here?
>
> Lag is the deviation of a task's allocated CPU time from the CPU time
> it would be granted by the ideal fair scheduling algorithm (generalized
> processor sharing; take the limit of RR with per-task timeslices
> proportional to load weight as the scale factor approaches zero).

Over what time period does this operate?

> Negative lag reflects receipt of excess CPU time. A close-to-canonical
> "fairness metric" is the maximum of the absolute values of the lags of
> all the tasks on the system. The "signed minimax pseudonorm" is the
> largest lag without taking absolute values; it's a term I devised ad
> hoc to describe the proposed algorithm.

So what you're saying is that you think dynamic priority (or its
equivalent) should be used for load balancing instead of static priority?

>
> William Lee Irwin III wrote:
>>> is what the
>>> scheduler is trying to minimize;
>
> On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
>> This isn't always the case. Some may prefer fairness to minimal lag.
>> Others may prefer particular tasks to receive preferential treatment.
>
> This comment does not apply. Generalized processor sharing expresses
> preferential treatment via weighting. Various other forms of
> preferential treatment require more elaborate idealized models.

This was said before I realized that your "lag" is just a measure of
fairness.

>
>
>>> load is not directly relevant, but
>>> appears to have some sort of relationship. Also, instead of pinned,
>>> unpinned should be considered.
>
> On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
>> If you have total and pinned you can get unpinned. It's probably
>> cheaper to maintain data for pinned than unpinned as there's less of it
>> on normal systems.
>
> Regardless of the underlying accounting,

I was just replying to your criticism of my suggestion to keep pinned
task statistics and use them.

> I've presented a coherent
> algorithm. It may be that there's no demonstrable problem to solve.
> On the other hand, if there really is a question as to how to load
> balance in the presence of tasks pinned to cpus, I just answered it.

Unless I missed something there's nothing in your suggestion that does
anything more about handling pinned tasks than is already done by the
load balancer.

>
>
> William Lee Irwin III wrote:
>>> Using the signed minimax pseudonorm (i.e. the highest
>>> signed lag, where positive is higher than all negative regardless of
>>> magnitude) on unpinned lags yields a rather natural load balancing
>>> algorithm consisting of migrating from highest to lowest signed lag,
>>> with progressively longer periods for periodic balancing across
>>> progressively higher levels of hierarchy in sched_domains etc. as usual.
>>> Basically skip over pinned tasks as far as lag goes.
>>> The trick with all that comes when tasks are pinned within a set of
>>> cpus (especially crossing sched_domains) instead of to a single cpu.
>
> On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
>> Yes, this makes the cost of maintaining the required data higher which
>> makes keeping pinned data more attractive than unpinned.
>> BTW keeping data for sets of CPU affinities could cause problems as the
>> number of possible sets is quite large (being 2 to the power of the
>> number of CPUs). So you need an algorithm based on pinned data for
>> single CPUs that knows the pinning isn't necessarily exclusive rather
>> than one based on sets of CPUs. As I understand it (which may be
>> wrong), the mechanism you describe below takes that approach.
>
> Yes, the mechanism I described takes that approach.
>
>
> William Lee Irwin III wrote:
>>> The smpnice affair is better phrased in terms of task weighting. It's
>>> simple to honor nice in such an arrangement. First unravel the
>>> grouping hierarchy, then weight by nice. This looks like
> [...]
>>> In such a manner nice numbers obey the principle of least surprise.
>
> On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
>> Is it just me or did you stray from the topic of handling cpu affinity
>> during load balancing to hierarchical load balancing? I couldn't see
>> anything in the above explanation that would improve the handling of cpu
>> affinity.
>
> There was a second issue raised to which I responded. I didn't stray
> per se. I addressed a second topic in the post.

OK.

To reiterate, I don't think that my suggestion is really necessary. I
think that the current load balancing (stand fast a small bug that's
being investigated) will come up with a good distribution of tasks to
CPUs within the constraints imposed by any CPU affinity settings.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-30 00:16:23

by Bill Huey

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> On Fri, May 25, 2007 at 10:14:58AM -0700, Li, Tong N wrote:
> > is represented by a weight of 10. Inside the group, let's say the two
> > tasks, P1 and P2, have weights 1 and 2. Then the system-wide weight for
> > P1 is 10/3 and the weight for P2 is 20/3. In essence, this flattens
> > weights into one level without changing the shares they represent.
>
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with an equally good interactive/fair scheduler :)

It's called SD, from Con Kolivas, who got it right the first time around :)

> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

bill

2007-05-30 02:48:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

William Lee Irwin III wrote:
>> Lag is the deviation of a task's allocated CPU time from the CPU time
>> it would be granted by the ideal fair scheduling algorithm (generalized
>> processor sharing; take the limit of RR with per-task timeslices
>> proportional to load weight as the scale factor approaches zero).

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> Over what time period does this operate?

Over a period of time while the task is runnable.


William Lee Irwin III wrote:
>> Negative lag reflects receipt of excess CPU time. A close-to-canonical
>> "fairness metric" is the maximum of the absolute values of the lags of
>> all the tasks on the system. The "signed minimax pseudonorm" is the
>> largest lag without taking absolute values; it's a term I devised ad
>> hoc to describe the proposed algorithm.

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> So what you're saying is that you think dynamic priority (or its
> equivalent) should be used for load balancing instead of static priority?

It doesn't do much in other schemes, but when fairness is directly
measured by the dynamic priority, it is a priori more meaningful.
This is not to say the net effect of using it is so different.


William Lee Irwin III wrote:
>> I've presented a coherent
>> algorithm. It may be that there's no demonstrable problem to solve.
>> On the other hand, if there really is a question as to how to load
>> balance in the presence of tasks pinned to cpus, I just answered it.
>
On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> Unless I missed something there's nothing in your suggestion that does
> anything more about handling pinned tasks than is already done by the
> load balancer.

There are two things, both of which are relatively subtle and
coincidentally happen to some extent. The first is the unpinned lag,
which behaves much differently from unpinned load weight even if it's
not so different in concept, but apparently achieves a similar result.
The second is the relativistic point of view, which happens somewhat by
coincidence anyway, but isn't formalized anywhere at all as a basis for
handling tasks pinned to cpus. The first difference is minor and the
second formalizes something that mostly or completely already happens.


William Lee Irwin III wrote:
>> There was a second issue raised to which I responded. I didn't stray
>> per se. I addressed a second topic in the post.

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> OK.
> To reiterate, I don't think that my suggestion is really necessary. I
> think that the current load balancing (stand fast a small bug that's
> being investigated) will come up with a good distribution of tasks to
> CPUs within the constraints imposed by any CPU affinity settings.

It's sort of like performance. If the numbers are already good enough,
my algorithm on that front need not be bothered with either.


-- wli

2007-05-30 02:51:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with an equally good interactive/fair scheduler :)
> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

Task weights represent shares of CPU bandwidth. If task i has weight w_i
then its share of CPU bandwidth is intended to be w_i/sum_i w_i.

"Load weight" seems to be used more in the scheduler source.


-- wli

2007-05-30 04:08:09

by Peter Williams

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

William Lee Irwin III wrote:
> On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
>> So what you're saying is that you think dynamic priority (or its
>> equivalent) should be used for load balancing instead of static priority?
>
> It doesn't do much in other schemes, but when fairness is directly
> measured by the dynamic priority, it is a priori more meaningful.
> This is not to say the net effect of using it is so different.

I suspect that while it's probably theoretically better it wouldn't make
much difference on a real system (probably not enough to justify any
extra complexity if there were any). The exception might be on systems
where there were lots of CPU intensive tasks that used relatively large
chunks of CPU each time they were runnable which would give the load
balancer a more stable load to try and balance. It might be worth the
extra effort to get it exactly right on those systems. On most normal
systems this isn't the case and the load balancer is always playing
catch up to a constantly changing scenario.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2007-05-30 17:06:35

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
> The smpnice affair is better phrased in terms of task weighting. It's
> simple to honor nice in such an arrangement. First unravel the
> grouping hierarchy, then weight by nice. This looks like
>
> task nice hier1 hier2 ... hierN
> t_1 w_n1 w_h11 w_h21 ... w_hN1
> t_2 w_n2 w_h12 w_h22 ... w_hN2
> ...
>
> For the example of nice 0 vs. nice 10 as distinct users with 10%
> steppings between nice levels, one would have
>
> task nice hier1
> t_1 1 1
> t_2 0.3855 1
>
> w_1, the weight of t_1, would be
> (w_h11*w_n1/(w_h11*w_n1 + w_h12*w_n2))
> = (1*1/(1 + 1*0.3855..))
> = 0.7217..
> w_2, the weight of t_2, would be
> (w_h12*w_n2/(w_h11*w_n1 + w_h12*w_n2))
> = (1*0.3855../(1 + 1*0.3855..))
> = 0.27826..
> This just so happens to work out to being the same as if t_1 and t_2
> had their respective nice numbers without the scheduler grouping, which
> is basically what everyone wants to happen.
>
> It's more obvious how to extend it to more tasks than levels of
> hierarchy. An example of that follows:
>
> task nice hier1 hier2 ... hierN
> t_1 0.3 0.6 * ... *
> t_2 0.7 0.4 * ... *
>
> hier2 through hierN are ignorable since t_1 and t_2 are both the only
> members at those levels of hierarchy. We then get something just like
> the above example, w_1 = 0.3*0.6/(0.3*0.6+0.7*0.4) = 0.3913.. and
> w2 = 0.7*0.4/(0.3*0.6+0.7*0.4) = 0.6087..
>
> It's more interesting with enough tasks to have more meaningful levels
> of hierarchy.
>
> task nice hier1 hier2
> t_1 0.7 0.6 0.6
> t_2 0.3 0.6 0.4
> t_3 0.7 0.4 0.6
> t_4 0.3 0.4 0.4
>
> where t_1 and t_2 share a hier1 grouping and t_3 and t_4 also share
> a hier1 grouping, but the hier1 grouping for t_1 and t_2 is distinct
> from the hier1 grouping for t_3 and t_4. All hier2 groupings are
> distinct. So t_1 would have pre-nice weight 0.6*0.6, t_2 0.6*0.4,
> t_3 0.6*0.4, and t_4 0.4*0.4 (the numbers were chosen so denominators
> conveniently collapse to 1). Now that the hierarchy is flattened,
> nice numbers can be factored in for t_1's final weight being
> 0.7*0.36/(0.7*0.36+0.3*0.24+0.7*0.24+0.3*0.16) = 0.252/0.54 = 0.467..
> and the others being 0.133.. (t_2), 0.311.. (t_3), and 0.0889.. (t_4).

Hmm ..so do you think this weight decomposition can be used to flatten
the tree all the way to a single level in the case of cfs? That would mean we
can achieve group fairness with single-level scheduling in cfs ..I am
somewhat skeptical that we can achieve group fairness with a single-level
rb-tree (and w/o substantial changes to the pick_next_task logic in cfs,
that is), but if it can be accomplished it would definitely be a great win.

> In such a manner nice numbers obey the principle of least surprise.

--
Regards,
vatsa

2007-05-30 20:13:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
>> The smpnice affair is better phrased in terms of task weighting. It's
>> simple to honor nice in such an arrangement. First unravel the
>> grouping hierarchy, then weight by nice. This looks like
[...]
>> conveniently collapse to 1). Now that the hierarchy is flattened,
>> nice numbers can be factored in for t_1's final weight being
>> 0.7*0.36/(0.7*0.36+0.3*0.24+0.7*0.24+0.3*0.16) = 0.252/0.54 = 0.467..
>> and the others being 0.133.. (t_2), 0.311.. (t_3), and 0.0889.. (t_4).

On Wed, May 30, 2007 at 10:44:05PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..so do you think this weight decomposition can be used to flatten
> the tree all the way to a single level in the case of cfs? That would mean we
> can achieve group fairness with single-level scheduling in cfs ..I am
> somewhat skeptical that we can achieve group fairness with a single-level
> rb-tree (and w/o substantial changes to the pick_next_task logic in cfs,
> that is), but if it can be accomplished it would definitely be a great win.

Yes, the hierarchy can be flattened completely and global task weights
computed and used to achieve group fairness. The changes aren't to
pick_next_task() but rather to the ->fair_key computations. In fact, I
went a step beyond that.


On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
>> In such a manner nice numbers obey the principle of least surprise.

The step beyond was to show how nice numbers can be done with all that
hierarchical task grouping so they have global effects instead of
effects limited to the scope of the narrowest grouping hierarchy
containing the task. I had actually assumed the weighting and
flattening bits were already in your plans from some other post you
made and was building upon that.


-- wli

2007-05-31 03:19:07

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 01:13:59PM -0700, William Lee Irwin III wrote:
> On Wed, May 30, 2007 at 10:44:05PM +0530, Srivatsa Vaddagiri wrote:
> > Hmm ..so do you think this weight decomposition can be used to flatten
> > the tree all the way to a single level in case of cfs? That would mean we can
> > achieve group fairness with single level scheduling in cfs ..I am
> > somewhat skeptical that we can achieve group fairness with a single
> > level rb-tree (and w/o substantial changes to pick_next_task logic in cfs
> > that is), but if it can be accomplished would definitely be a great win.
>
> Yes, the hierarchy can be flattened completely and global task weights
> computed and used to achieve group fairness.

ok, let's say we are considering a hierarchy of user->process->thread as
below:

Users = {U1, U2, U3}

where the processes in each user are:

U1 = {P0, P1, P2, P3, P4}
U2 = {P5, P6}
U3 = {P7}

and where threads/tasks in a process are:

P0 = {T0, T1}
P1 = {T2}
P2 = {T3, T4, T5}
P3 = {T6}
P4 = {T8}
P5 = {T9, T10}
P6 = {T11}
P7 = {T14, T15, T16}


If we need to achieve group fairness given a single-level hierarchy,
then tasks need to be spread out in the rb-tree something like below?

     U1        U2        U3        U1        U2        U3
|---------|---------|---------|---------|---------|---------|--- // ---
t0        t1        t2        t3        t4        t5        t6

[ t0, t1 ..tN are equally spaced points in time. t1 = t0 + K, t2 = t1 + K .. ]

Viewed at the top hierarchy level (users), tasks are spread such that each user
gets "equal" execution over some interval (let's say between t0-t3).

When viewed at the next lower hierarchy level (processes), it should look like:

| U1 | U2 | U3 | U1 | U2 | U3 |
| | | | | | |
| P0 P1 P2| P5 P6 | P7 | P3 P4 P0 | P5 P6 | P7 |
|---|---|---|----|-----|----------|---|---|---|----|-----|---------|-//-
t0 t1 t2 t3 t3' t4 t5 t6


[contd below ..]

| U1 | U2 | U3 | U1 | U2 | U3 |
| | | | | | |
| P1 P2 P3| P5 P6 | P7 | P4 P0 P1| P5 P6 | P7 |
--// |---|---|---|----|-----|----------|---|---|---|----|-----|---------|-//-
t6 t7 t8 t9 t10 t11 t12


The bandwidth available to a user is divided "equally" between the various
processes of the user over some time (say between t0 - t3').

When viewed at the next lower hierarchy level (threads), it should look like:



| U1 | U2 | U3 | U1 | U2 | U3 |
| | | | | | |
| P0| P1| P2| P5 | P6 | P7 | P3| P4| P0| P5 | P6 | P7 |
| | | | | | | | | | | | | | |
| T0| T2| T3| T9 | T11 | T14| T15 | T6| T8| T1| T10| T11 | T16| T14 |
|---|---|---|----|-----|----|-----|---|---|---|----|-----|----|-----|-//
t0 t1 t2 t3 t4 t5 t6

(continuing below)

| U1 | U2 | U3 | U1 | U2 | U3 |
| | | | | | |
| P1| P2| P3| P5 | P6 | P7 | P4| P0| P1| P5 | P6 | P7 |
| | | | | | | | | | | | | | |
| T2| T4| T6| T9 | T11 | T15| T16 | T8| T0|T2 | T10| T11 | T14| T14 |
--// |---|---|---|----|-----|----|-----|---|---|---|----|-----|----|-----|-//
t6 t7 t8 t9 t10 t11 t12

Available bandwidth to a process is divided "equally" between threads of
the process.

Although I have been using "equally" everywhere above, it actually needs to
take into account the relative importance (weights) of tasks/processes/users.
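
To make that concrete: assuming equal weights at every level, a task's
flattened share is simply the product of its share at each level of the
hierarchy above, e.g.:

	T0's share  = 1/3 (user U1) * 1/5 (P0 among U1's 5 processes)
	              * 1/2 (T0 among P0's 2 threads)  = 1/30
	T9's share  = 1/3 * 1/2 * 1/2                  = 1/12
	T14's share = 1/3 * 1/1 * 1/3                  = 1/9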

> The changes aren't to pick_next_task() but rather to the ->fair_key
> computations.

That's the $$ question I guess :) What computation of ->fair_key can we use
such that the task execution sequence is (from the above example):

T0, T2, T3, T9, T11, T14, T15, T6, T8, T1, T10, T11, T16, T14 ...

?

Of course, this ->fair_key computation should default to what it is
today when hierarchical res mgmt is disabled?
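
Just thinking aloud, one toy model would be to charge each task's key in
inverse proportion to its flattened load weight, so that tasks of a
heavily-populated group drift to the right of the rb-tree faster. This is
only a sketch of the idea (it is not CFS's actual ->fair_key arithmetic,
and the names below are invented):

#include <stdint.h>
#include <stdio.h>

/* Toy model: a task with half the weight is charged twice as much per
 * nanosecond it runs, so it sorts further right in the rb-tree. */
struct toy_task {
	uint64_t key;			/* rb-tree key: smaller == runs sooner */
	unsigned long load_weight;	/* flattened weight (scaled integer)   */
};

#define TOY_NICE_0_LOAD	1024UL		/* reference weight, like NICE_0_LOAD */

static void toy_charge(struct toy_task *p, uint64_t delta_exec_ns)
{
	p->key += delta_exec_ns * TOY_NICE_0_LOAD / p->load_weight;
}

int main(void)
{
	/* one task out of a 1000-task user vs. the lone task of another user */
	struct toy_task t0    = { .key = 0, .load_weight = 1 };
	struct toy_task t1000 = { .key = 0, .load_weight = 1000 };

	toy_charge(&t0, 4000000);	/* both run one 4 ms tick */
	toy_charge(&t1000, 4000000);
	printf("t0.key=%llu t1000.key=%llu\n",
	       (unsigned long long)t0.key, (unsigned long long)t1000.key);
	return 0;
}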

> In fact, I went a step beyond that.
>
>
> On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
> >> In such a manner nice numbers obey the principle of least surprise.
>
> The step beyond was to show how nice numbers can be done with all that
> hierarchical task grouping so they have global effects instead of
> effects limited to the scope of the narrowest grouping hierarchy
> containing the task. I had actually assumed the weighting and
> flattening bits were already in your plans from some other post you
> made and was building upon that.

I would definitely be willing to try out any experiments you think of,
esp. those that allow the hierarchy to be flattened. Atm the fair_key
calculation (in the context of cfs) seems to be the biggest challenge to
surmount for this to work.

--
Regards,
vatsa

2007-05-31 04:09:21

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 01:13:59PM -0700, William Lee Irwin III wrote:
>> The step beyond was to show how nice numbers can be done with all that
>> hierarchical task grouping so they have global effects instead of
>> effects limited to the scope of the narrowest grouping hierarchy
>> containing the task. I had actually assumed the weighting and
>> flattening bits were already in your plans from some other post you
>> made and was building upon that.

On Thu, May 31, 2007 at 08:56:57AM +0530, Srivatsa Vaddagiri wrote:
> I would definitely be willing to try out any experiments you think of,
> esp. those that allow the hierarchy to be flattened. Atm the fair_key
> calculation (in the context of cfs) seems to be the biggest challenge to
> surmount for this to work.

It's not all that tricky. The ->fair_key computations are already
parametrized on load weights. The "task weights" here are just what
Linux calls "load weight," so we're largely done once task weights
are calculated.
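
As a rough illustration of "task weights are just load weights" (purely a
sketch with invented names, not code from the patches): a task's flattened
load weight is its own share at the task level multiplied by its group's
share at every ancestor level, e.g.

#include <stdio.h>

/* Flattened weight = product of the entity's share at each hierarchy level. */
static double flat_weight(const double share[], int levels)
{
	double w = 1.0;
	int i;

	for (i = 0; i < levels; i++)
		w *= share[i];
	return w;
}

int main(void)
{
	/* user-level share first, then task-level share within that user;
	 * these are the WT0/WT1000 numbers worked through further below */
	double t0[]    = { 1.0 / 2, 1.0 / 1000 };  /* one of U1's 1000 tasks */
	double t1000[] = { 1.0 / 2, 1.0 / 1 };     /* U2's only task */

	printf("WT0    = %g\n", flat_weight(t0, 2));     /* 0.0005 */
	printf("WT1000 = %g\n", flat_weight(t1000, 2));  /* 0.5    */
	return 0;
}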

The tricky part (if any) is essentially what you've already got nailed
down, that is, creating and manipulating the accounting objects for the
task groups or whatever you're calling them.


-- wli

2007-05-31 05:40:41

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 09:09:26PM -0700, William Lee Irwin III wrote:
> It's not all that tricky.

Hmm ..the fact that each task runs for a minimum of 1 tick seems to
complicate matters for me (when doing group fairness given a single
level hierarchy). A user with 1000 (or more) tasks can be unduly
advantaged compared to another user with just 1 (or fewer) task because of this?

> The ->fair_key computations are already
> parametrized on load weights. The "task weights" here are just what
> Linux calls "load weight," so we're largely done once task weights
> are calculated.
>
> The tricky part (if any) is essentially what you've already got nailed
> down, that is, creating and manipulating the accounting objects for the
> task groups or whatever you're calling them.

--
Regards,
vatsa

2007-05-31 06:36:16

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 09:09:26PM -0700, William Lee Irwin III wrote:
>> It's not all that tricky.

On Thu, May 31, 2007 at 11:18:28AM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..the fact that each task runs for a minimum of 1 tick seems to
> complicate matters for me (when doing group fairness given a single
> level hierarchy). A user with 1000 (or more) tasks can be unduly
> advantaged compared to another user with just 1 (or fewer) task
> because of this?

Temporarily, yes. All this only works when averaged out. The basic
idea is that you want a constant upper bound on the difference between
the CPU time a task receives and the CPU time it was intended to get.
This discretization is one of the larger sources of the "error" in the
CPU time granted. The constant upper bound usually only applies to the
largest difference for any task. When absolute values of differences
are summed across tasks the aggregate will be O(tasks) because there's
something almost like a constant per-task lower bound a la Heisenberg.
It would have to get more exact the more tasks there are on the system
for that to work, and something of the opposite actually holds.

It might be appropriate for the scheduler to dynamically adjust a
periodic timer's period or to set up one-shot timers at involuntary
preemption times in order to achieve more precise fairness in this
sort of situation. In the case of few preemption points such one-shot
code or low periodicity code would also save on taking interrupts that
would otherwise manifest as overhead.

In short, a user with many tasks can reap a temporary advantage
relative to users with fewer tasks because of this, but over time,
longer-running tasks will receive the CPU time intended to within
some constant upper bound, provided other things aren't broken.


-- wli

2007-05-31 08:26:04

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 11:36:47PM -0700, William Lee Irwin III wrote:
> On Thu, May 31, 2007 at 11:18:28AM +0530, Srivatsa Vaddagiri wrote:
> > Hmm ..the fact that each task runs for a minimum of 1 tick seems to
> > complicate matters for me (when doing group fairness given a single
> > level hierarchy). A user with 1000 (or more) tasks can be unduly
> > advantaged compared to another user with just 1 (or fewer) task
> > because of this?
>
> Temporarily, yes. All this only works when averaged out.

So essentially when we calculate delta_mine component for each of those
1000 tasks, we will find that it has executed for 1 tick (4 ms say) but
its fair share was very very low.

fair_share = delta_exec * p->load_weight / total_weight

If p->load_weight has been calculated after factoring in hierarchy (as
you outlined in a previous mail), then p->load_weight of those 1000 tasks
will be far less compared to the p->load_weight of one task belonging to
other user, correct? Just to make sure I get all this correct:

User U1 has tasks T0 - T999
User U2 has task T1000

assuming each task's weight is 1 and each user's weight is 1 then:


WT0 = (WU1 / (WU1 + WU2)) * (WT0 / (WT0 + WT1 + ... + WT999))
    = (1 / (1 + 1)) * (1 / 1000)
= 1/2000
= 0.0005

WT1 ..WT999 will be same as WT0

whereas, weight of T1000 will be:


WT1000 = (WU1 / (WU1 + WU2)) * (WT1000 / WT1000)
       = (1 / (1 + 1)) * (1/1)
= 0.5

?


So when T0 (or T1 ..T999) executes for 1 tick (4ms), their fair share would
be:
T0's fair_share (delta_mine)
= 4 ms * 0.0005 / (0.0005 * 1000 + 0.5)
= 4 ms * 0.0005 / 1
= 0.002 ms (2000 ns)

This would cause T0's ->wait_runtime to go negative sharply, causing it to be
inserted back in the rb-tree well ahead in the future. One change I can foresee
in CFS is with regard to limit_wait_runtime() ..We will have to change
its default limit, at least when the group fairness thingy is enabled.

Compared to this when T1000 executes for 1 tick, its fair share would be
calculated as:

T1000's fair_share (delta_mine)
= 4 ms * 0.5 / (0.0005 * 1000 + 0.5)
= 4 ms * 0.5 / 1
= 2 ms (2000000 ns)

Its ->wait_runtime will drop less significantly, which lets it be
inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
lets it gain back its fair share during subsequent schedule cycles).
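
As a quick numeric check of the above (a throwaway snippet, nothing
scheduler-specific, just the arithmetic):

#include <stdio.h>

/* delta_mine = delta_exec * w_task / total_weight, using the flattened
 * weights from the example above: 1000 tasks at 0.0005 plus one at 0.5. */
int main(void)
{
	double tick_ns = 4e6;			/* one tick = 4 ms */
	double w_t0 = 0.0005, w_t1000 = 0.5;
	double total = 1000 * w_t0 + w_t1000;	/* = 1.0 */

	printf("T0    : %.0f ns per tick\n", tick_ns * w_t0 / total);     /* 2000    */
	printf("T1000 : %.0f ns per tick\n", tick_ns * w_t1000 / total);  /* 2000000 */
	return 0;
}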

Hmm ..is that the theory?

Ingo, do you have any comments on this approach?

/me is tempted to try this all out.


> The basic
> idea is that you want a constant upper bound on the difference between
> the CPU time a task receives and the CPU time it was intended to get.
> This discretization is one of the larger sources of the "error" in the
> CPU time granted. The constant upper bound usually only applies to the
> largest difference for any task. When absolute values of differences
> are summed across tasks the aggregate will be O(tasks) because there's
> something almost like a constant per-task lower bound a la Heisenberg.
> It would have to get more exact the more tasks there are on the system
> for that to work, and something of the opposite actually holds.
>
> It might be appropriate for the scheduler to dynamically adjust a
> periodic timer's period or to set up one-shot timers at involuntary
> preemption times in order to achieve more precise fairness in this
> sort of situation. In the case of few preemption points such one-shot
> code or low periodicity code would also save on taking interrupts that
> would otherwise manifest as overhead.
>
> In short, a user with many tasks can reap a temporary advantage
> relative to users with fewer tasks because of this, but over time,
> longer-running tasks will receive the CPU time intended to within
> some constant upper bound, provided other things aren't broken.

--
Regards,
vatsa

2007-05-31 08:42:48

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Wed, May 30, 2007 at 11:36:47PM -0700, William Lee Irwin III wrote:
>> Temporarily, yes. All this only works when averaged out.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So essentially when we calculate delta_mine component for each of those
> 1000 tasks, we will find that it has executed for 1 tick (4 ms say) but
> its fair share was very very low.
> fair_share = delta_exec * p->load_weight / total_weight
> If p->load_weight has been calculated after factoring in hierarchy (as
> you outlined in a previous mail), then p->load_weight of those 1000 tasks
> will be far less compared to the p->load_weight of one task belonging to
> other user, correct? Just to make sure I get all this correct:

You've got it all correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> User U1 has tasks T0 - T999
> User U2 has task T1000
> assuming each task's weight is 1 and each user's weight is 1 then:
> WT0 = (WU1 / (WU1 + WU2)) * (WT0 / (WT0 + WT1 + ... + WT999))
> = (1 / (1 + 1)) * (1 / 1000)
> = 1/2000
> = 0.0005
> WT1 ..WT999 will be same as WT0
> whereas, weight of T1000 will be:
> WT1000 = (WU1 / (WU1 + WU2)) * (WT1000 / WT1000)
> = (1 / (1 + 1)) * (1/1)
> = 0.5
> ?

Yes, these calculations are correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So when T0 (or T1 ..T999) executes for 1 tick (4ms), their fair share would
> be:
> T0's fair_share (delta_mine)
> = 4 ms * 0.0005 / (0.0005 * 1000 + 0.5)
> = 4 ms * 0.0005 / 1
> = 0.002 ms (2000 ns)
> This would cause T0's ->wait_runtime to go negative sharply, causing it to be
> inserted back in the rb-tree well ahead in the future. One change I can foresee
> in CFS is with regard to limit_wait_runtime() ..We will have to change
> its default limit, at least when the group fairness thingy is enabled.
> Compared to this when T1000 executes for 1 tick, its fair share would be
> calculated as:
> T1000's fair_share (delta_mine)
> = 4 ms * 0.5 / (0.0005 * 1000 + 0.5)
> = 4 ms * 0.5 / 1
> = 2 ms (2000000 ns)
> Its ->wait_runtime will drop less significantly, which lets it be
> inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
> lets it gain back its fair share during subsequent schedule cycles).

This analysis is again entirely correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..is that the theory?
> Ingo, do you have any comments on this approach?
> /me is tempted to try this all out.

Yes, this is the theory behind using task weights to flatten the task
group hierarchies. My prior post assumed all this and described a method
to make nice numbers behave as expected in the global context atop it.


-- wli

2007-05-31 08:49:04

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> Its ->wait_runtime will drop less significantly, which lets it be
> inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
> lets it gain back its fair share during subsequent schedule cycles).
>
> Hmm ..is that the theory?

My only concern is the time needed to converge to this fair
distribution, especially in the face of fluctuating workloads. For example, a
container that does a fork bomb can have a very adverse impact on other
containers' fair share under this scheme, compared to other schemes which
dedicate separate rb-trees for different containers (and which also support
two-level hierarchical scheduling inside the core scheduler).

I am inclined to have the core scheduler support at least two levels of
hierarchy (to better isolate each container) and resort to the flattening
trick for higher levels.

--
Regards,
vatsa

2007-05-31 09:14:50

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>> Its ->wait_runtime will drop less significantly, which lets it be
>> inserted in rb-tree much to the left of those 1000 tasks (and which
>> indirectly lets it gain back its fair share during subsequent
>> schedule cycles).
>> Hmm ..is that the theory?

On Thu, May 31, 2007 at 02:26:00PM +0530, Srivatsa Vaddagiri wrote:
> My only concern is the time needed to converge to this fair
> distribution, especially in the face of fluctuating workloads. For example, a
> container that does a fork bomb can have a very adverse impact on
> other containers' fair share under this scheme, compared to other
> schemes which dedicate separate rb-trees for different containers
> (and which also support two-level hierarchical scheduling inside the
> core scheduler).
> I am inclined to have the core scheduler support at least two levels
> of hierarchy (to better isolate each container) and resort to the
> flattening trick for higher levels.

Yes, the larger number of schedulable entities and hence slower
convergence to groupwise weightings is a disadvantage of the flattening.
A hybrid scheme seems reasonable enough. Ideally one would chop the
hierarchy in pieces so that n levels of hierarchy become k levels of n/k
weight-flattened hierarchies for this sort of attack to be most effective
(at least assuming similar branching factors at all levels of hierarchy
and sufficient depth to the hierarchy to make it meaningful) but this is
awkward to do. Peeling off the outermost container or whichever level is
deemed most important in terms of accuracy of aggregate enforcement as
a hierarchical scheduler is a practical compromise.

Hybrid schemes will still incur the difficulties of hierarchical
scheduling, but they're by no means insurmountable. Sadly, only
complete flattening yields the simplifications that make task group
weighting enforcement orthogonal to load balancing and the like. The
scheme I described for global nice number behavior is also not readily
adaptable to hybrid schemes.


-- wli

2007-05-31 09:29:00

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

On Thu, May 31, 2007 at 02:15:34AM -0700, William Lee Irwin III wrote:
> Yes, the larger number of schedulable entities and hence slower
> convergence to groupwise weightings is a disadvantage of the flattening.
> A hybrid scheme seems reasonable enough.

Cool! This puts me back on track to implement hierarchical scheduling in
CFS :)

Once this is done and once I can get containers running on a box, I will
experiment with the flattening trick for user and process levels inside
containers.

Thanks for your feedback so far!

> Ideally one would chop the
> hierarchy in pieces so that n levels of hierarchy become k levels of n/k
> weight-flattened hierarchies for this sort of attack to be most effective
> (at least assuming similar branching factors at all levels of hierarchy
> and sufficient depth to the hierarchy to make it meaningful) but this is
> awkward to do. Peeling off the outermost container or whichever level is
> deemed most important in terms of accuracy of aggregate enforcement as
> a hierarchical scheduler is a practical compromise.
>
> Hybrid schemes will still incur the difficulties of hierarchical
> scheduling, but they're by no means insurmountable. Sadly, only
> complete flattening yields the simplifications that make task group
> weighting enforcement orthogonal to load balancing and the like. The
> scheme I described for global nice number behavior is also not readily
> adaptable to hybrid schemes.

--
Regards,
vatsa