2006-09-28 17:25:47

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 0/9] CPU Controller V2

Here's V2 of the token-based CPU controller I have been working on.

Changes since last version (posted at http://lkml.org/lkml/2006/8/20/115):

- A task's load was not updated when it moved between task-groups
  with different quotas (bug hit by Mike Galbraith).

- SMP load balancing now seems to work -much- better w.r.t. its
  awareness of the quota of each task-group. The trick was to go beyond
  the max_load difference in __move_tasks and instead use the load
  difference between the two task-groups on the different CPUs as the
  basis for pulling tasks.

- Better timeslice management, aimed at handling bursty
  workloads. Patch 3/9 has documentation on timeslice
  management for the various task-groups.

- Modified cpuset interface as per Paul Jackson's suggestions.
Some of the changes are:
- s/meter_cpu/cpu_meter_enabled
- s/cpu_quota/cpu_meter_quota
- s/FILE_METER_FLAG/FILE_CPU_METER_ENABLED
- s/FILE_METER_QUOTA/FILE_CPU_METER_QUOTA
	- Don't allow cpu_meter_enabled to be turned on for an
	  "in-use" cpuset (which has tasks attached to it)
	- Don't allow cpu_meter_quota to be changed for an
	  "in-use" cpuset (which has tasks attached to it)

The last two are temporary limitations until we figure out how
to get at a cpuset's task-list more easily.

Still on my todo list:

- Improved surplus-cycle management. If groups A, B and C have
  been given 50%, 30% and 20% quota respectively and group B
  is idle, B's quota has to be divided between A and C in a 5:2
  proportion (a small worked example follows this list).

- Although load balancing seems to be working nicely for the
  testcases I have been running, I anticipate certain corner
  cases which are yet to be worked out. In particular, I need to
  make sure some of the HT/MC optimizations are not broken.
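
As a worked example for the surplus-cycle item above (assuming strictly
proportional redistribution, which is what that todo aims at): B's idle 30%
would be split 5:2 between A and C, so A would get an extra 30 * 5/7 ~= 21.4%
(71.4% in total) and C an extra 30 * 2/7 ~= 8.6% (28.6% in total).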


Ingo/Nick, IMHO the approach of virtualizing CPU runqueues to meet the
controller requirement is not a good idea, since:

- retaining the existing load-balance optimizations for the MC/SMT case is
  going to be hard (it has to be done at schedule time now)
- because of virtualization, two virtual CPUs could end up running on
  the same physical CPU, which would defeat the careful SMP
  optimizations put in place all over the kernel
- not to mention that specialized apps which want to bind to CPUs for
  performance reasons may behave badly in such a virtualized
  environment.

Hence I have been pursuing simpler approaches, like the one in this patch series.

Your comments/views on this are highly appreciated.

--
Regards,
vatsa


2006-09-28 17:27:26

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 1/9] CPU Controller V2 - Split runqueue

This patch introduces a new runqueue structure (struct task_grp_rq) which is
used to hold "related" tasks together. The structure is quite similar to how
the runqueue structure looks today.

The scheme is that every task-group has its own per-CPU runqueues (of type
'struct task_grp_rq'), which hold its tasks together on each CPU.

This is in addition to a per-cpu runqueue which holds runnable task-groups
themselves.

The patch also converts over several references of 'struct rq' to
'struct task_grp_rq' (since tasks now reside in task_grp_rq).
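
With this, the task pick in schedule() becomes a two-step affair. A simplified
sketch of that pick, using the names from the patch (locking, the
active/expired switches and the !CONFIG_CPUMETER case are omitted):

	struct task_grp_rq *grp;
	struct task_struct *next;
	int idx;

	/* step 1: pick the highest-priority runnable task-group */
	idx = sched_find_first_bit(rq->active->bitmap);
	grp = list_entry(rq->active->queue[idx].next,
			 struct task_grp_rq, list);

	/* step 2: pick the highest-priority task within that group */
	idx = sched_find_first_bit(grp->active->bitmap);
	next = list_entry(grp->active->queue[idx].next,
			  struct task_struct, run_list);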

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/init/Kconfig | 8 +
linux-2.6.18-root/kernel/sched.c | 204 +++++++++++++++++++++++++++++++--------
2 files changed, 174 insertions(+), 38 deletions(-)

diff -puN kernel/sched.c~base_cpu_ctlr kernel/sched.c
--- linux-2.6.18/kernel/sched.c~base_cpu_ctlr 2006-09-27 12:48:23.000000000 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:22.133456408 +0530
@@ -191,11 +191,48 @@ static inline unsigned int task_timeslic

struct prio_array {
unsigned int nr_active;
+ int best_static_prio;
DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_PRIO];
};

/*
+ * Each task belongs to some task-group. Task-groups and what tasks they contain
+ * are defined by the administrator using some suitable interface.
+ * Administrator can also define the CPU bandwidth provided to each task-group.
+ *
+ * Task-groups are given a certain priority to run on every CPU. Currently
+ * task-group priority on a CPU is defined to be the same as that of
+ * highest-priority runnable task it has on that CPU. Task-groups also
+ * get their own runqueue on every CPU. The main runqueue on each CPU is
+ * used to hold task-groups, rather than tasks.
+ *
+ * Scheduling decision on a CPU is now two-step : first pick highest priority
+ * task-group from the main runqueue and next pick highest priority task from
+ * the runqueue of that group. Both decisions are of O(1) complexity.
+ */
+
+/* runqueue used for every task-group */
+struct task_grp_rq {
+ struct prio_array arrays[2];
+ struct prio_array *active, *expired, *rq_array;
+ unsigned long expired_timestamp;
+ int prio; /* Priority of the task-group */
+ struct list_head list;
+ struct task_grp *tg;
+};
+
+static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);
+
+/* task-group object - maintains information about each task-group */
+struct task_grp {
+ struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
+};
+
+/* The "default" task-group */
+struct task_grp init_task_grp;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -224,12 +261,13 @@ struct rq {
*/
unsigned long nr_uninterruptible;

- unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
struct task_struct *curr, *idle;
struct mm_struct *prev_mm;
+ /* these arrays hold task-groups.
+ * xxx: Avoid this for !CONFIG_CPUMETER?
+ */
struct prio_array *active, *expired, arrays[2];
- int best_expired_prio;
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -244,6 +282,7 @@ struct rq {
#endif

#ifdef CONFIG_SCHEDSTATS
+ /* xxx: move these to task-group runqueue where necessary */
/* latency stats */
struct sched_info rq_sched_info;

@@ -267,6 +306,13 @@ struct rq {

static DEFINE_PER_CPU(struct rq, runqueues);

+#define switch_array(array1, array2) \
+ { \
+ struct prio_array *tmp = array2; \
+ array2 = array1; \
+ array1 = tmp; \
+ }
+
/*
* The domain tree (rq->sd) is protected by RCU's quiescent state transition.
* See detach_destroy_domains: synchronize_sched for details.
@@ -280,6 +326,7 @@ static DEFINE_PER_CPU(struct rq, runqueu
#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() (&__get_cpu_var(runqueues))
#define task_rq(p) cpu_rq(task_cpu(p))
+#define task_grp_rq(p) (task_grp(p)->rq[task_cpu(p)])
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)

#ifndef prepare_arch_switch
@@ -359,6 +406,12 @@ static inline void finish_lock_switch(st
}
#endif /* __ARCH_WANT_UNLOCKED_CTXSW */

+/* return the task-group to which a task belongs */
+static inline struct task_grp *task_grp(struct task_struct *p)
+{
+ return &init_task_grp;
+}
+
/*
* __task_rq_lock - lock the runqueue a given task resides on.
* Must be called interrupts disabled.
@@ -831,10 +884,11 @@ static int effective_prio(struct task_st
*/
static void __activate_task(struct task_struct *p, struct rq *rq)
{
- struct prio_array *target = rq->active;
+ struct task_grp_rq *tgrq = task_grp_rq(p);
+ struct prio_array *target = tgrq->active;

if (batch_task(p))
- target = rq->expired;
+ target = tgrq->expired;
enqueue_task(p, target);
inc_nr_running(p, rq);
}
@@ -844,7 +898,10 @@ static void __activate_task(struct task_
*/
static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
{
- enqueue_task_head(p, rq->active);
+ struct task_grp_rq *tgrq = task_grp_rq(p);
+ struct prio_array *target = tgrq->active;
+
+ enqueue_task_head(p, target);
inc_nr_running(p, rq);
}

@@ -2077,7 +2134,7 @@ int can_migrate_task(struct task_struct
return 1;
}

-#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
+#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)

/*
* move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
@@ -2884,6 +2941,8 @@ unsigned long long current_sched_time(co
return ns;
}

+#define nr_tasks(tgrq) (tgrq->active->nr_active + tgrq->expired->nr_active)
+
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -2894,13 +2953,13 @@ unsigned long long current_sched_time(co
* increasing number of running tasks. We also ignore the interactivity
* if a better static_prio task has expired:
*/
-static inline int expired_starving(struct rq *rq)
+static inline int expired_starving(struct task_grp_rq *rq)
{
- if (rq->curr->static_prio > rq->best_expired_prio)
+ if (current->static_prio > rq->expired->best_static_prio)
return 1;
if (!STARVATION_LIMIT || !rq->expired_timestamp)
return 0;
- if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
+ if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * nr_tasks(rq))
return 1;
return 0;
}
@@ -2991,6 +3050,7 @@ void scheduler_tick(void)
struct task_struct *p = current;
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
+ struct task_grp_rq *tgrq = task_grp_rq(p);

update_cpu_clock(p, rq, now);

@@ -3004,7 +3064,7 @@ void scheduler_tick(void)
}

/* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
+ if (p->array != tgrq->active) {
set_tsk_need_resched(p);
goto out;
}
@@ -3027,25 +3087,26 @@ void scheduler_tick(void)
set_tsk_need_resched(p);

/* put it at the end of the queue: */
- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
}
goto out_unlock;
}
if (!--p->time_slice) {
- dequeue_task(p, rq->active);
+ dequeue_task(p, tgrq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
- enqueue_task(p, rq->expired);
- if (p->static_prio < rq->best_expired_prio)
- rq->best_expired_prio = p->static_prio;
+ if (!tgrq->expired_timestamp)
+ tgrq->expired_timestamp = jiffies;
+ if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
+ enqueue_task(p, tgrq->expired);
+ if (p->static_prio < tgrq->expired->best_static_prio)
+ tgrq->expired->best_static_prio =
+ p->static_prio;
} else
- enqueue_task(p, rq->active);
+ enqueue_task(p, tgrq->active);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3066,12 +3127,13 @@ void scheduler_tick(void)
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
+ (p->array == tgrq->active)) {

- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
set_tsk_need_resched(p);
}
}
+
out_unlock:
spin_unlock(&rq->lock);
out:
@@ -3264,6 +3326,7 @@ asmlinkage void __sched schedule(void)
int cpu, idx, new_prio;
long *switch_count;
struct rq *rq;
+ struct task_grp_rq *next_grp;

/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -3332,23 +3395,36 @@ need_resched_nonpreemptible:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
- rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu);
goto switch_tasks;
}
}

+ /* Pick a task group first */
+#ifdef CONFIG_CPUMETER
array = rq->active;
if (unlikely(!array->nr_active)) {
+ switch_array(rq->active, rq->expired);
+ array = rq->active;
+ }
+ idx = sched_find_first_bit(array->bitmap);
+ queue = array->queue + idx;
+ next_grp = list_entry(queue->next, struct task_grp_rq, list);
+#else
+ next_grp = init_task_grp.rq[cpu];
+#endif
+
+ /* Pick a task within that group next */
+ array = next_grp->active;
+ if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
- schedstat_inc(rq, sched_switch);
- rq->active = rq->expired;
- rq->expired = array;
- array = rq->active;
- rq->expired_timestamp = 0;
- rq->best_expired_prio = MAX_PRIO;
+ schedstat_inc(next_grp, sched_switch);
+ switch_array(next_grp->active, next_grp->expired);
+ array = next_grp->active;
+ next_grp->expired_timestamp = 0;
+ next_grp->expired->best_static_prio = MAX_PRIO;
}

idx = sched_find_first_bit(array->bitmap);
@@ -3826,11 +3902,13 @@ void rt_mutex_setprio(struct task_struct
struct prio_array *array;
unsigned long flags;
struct rq *rq;
+ struct task_grp_rq *tgrq;
int oldprio;

BUG_ON(prio < 0 || prio > MAX_PRIO);

rq = task_rq_lock(p, &flags);
+ tgrq = task_grp_rq(p);

oldprio = p->prio;
array = p->array;
@@ -3844,7 +3922,7 @@ void rt_mutex_setprio(struct task_struct
* in the active array!
*/
if (rt_task(p))
- array = rq->active;
+ array = tgrq->active;
enqueue_task(p, array);
/*
* Reschedule if we are currently running on this runqueue and
@@ -4411,7 +4489,8 @@ asmlinkage long sys_sched_getaffinity(pi
asmlinkage long sys_sched_yield(void)
{
struct rq *rq = this_rq_lock();
- struct prio_array *array = current->array, *target = rq->expired;
+ struct task_grp_rq *tgrq = task_grp_rq(current);
+ struct prio_array *array = current->array, *target = tgrq->expired;

schedstat_inc(rq, yld_cnt);
/*
@@ -4422,13 +4501,13 @@ asmlinkage long sys_sched_yield(void)
* array.)
*/
if (rt_task(current))
- target = rq->active;
+ target = tgrq->active;

if (array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);
- if (!rq->expired->nr_active)
+ if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_both_empty);
- } else if (!rq->expired->nr_active)
+ } else if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_exp_empty);

if (array != target) {
@@ -5131,9 +5210,8 @@ static void migrate_dead(unsigned int de
}

/* release_task() removes task from tasklist, so we won't find dead tasks. */
-static void migrate_dead_tasks(unsigned int dead_cpu)
+static void migrate_dead_tasks_from_grp(struct task_grp_rq *rq)
{
- struct rq *rq = cpu_rq(dead_cpu);
unsigned int arr, i;

for (arr = 0; arr < 2; arr++) {
@@ -5146,6 +5224,26 @@ static void migrate_dead_tasks(unsigned
}
}
}
+
+static void migrate_dead_tasks(unsigned int dead_cpu)
+{
+ struct rq *rq = cpu_rq(dead_cpu);
+ unsigned int arr, i;
+
+ for (arr = 0; arr < 2; arr++) {
+ for (i = 0; i < MAX_PRIO; i++) {
+ struct list_head *list = &rq->arrays[arr].queue[i];
+
+ while (!list_empty(list)) {
+ struct task_grp_rq *tgrq =
+ list_entry(list->next,
+ struct task_grp_rq, list);
+
+ migrate_dead_tasks_from_grp(tgrq);
+ }
+ }
+ }
+}
#endif /* CONFIG_HOTPLUG_CPU */

/*
@@ -6725,21 +6823,48 @@ int in_sched_functions(unsigned long add
&& addr < (unsigned long)__sched_text_end);
}

+static void task_grp_rq_init(struct task_grp_rq *tgrq, struct task_grp *tg)
+{
+ int j, k;
+
+ tgrq->active = tgrq->arrays;
+ tgrq->expired = tgrq->arrays + 1;
+ tgrq->rq_array = NULL;
+ tgrq->expired->best_static_prio = MAX_PRIO;
+ tgrq->active->best_static_prio = MAX_PRIO;
+ tgrq->prio = MAX_PRIO;
+ tgrq->tg = tg;
+ INIT_LIST_HEAD(&tgrq->list);
+
+ for (j = 0; j < 2; j++) {
+ struct prio_array *array;
+
+ array = tgrq->arrays + j;
+ for (k = 0; k < MAX_PRIO; k++) {
+ INIT_LIST_HEAD(array->queue + k);
+ __clear_bit(k, array->bitmap);
+ }
+ // delimiter for bitsearch
+ __set_bit(MAX_PRIO, array->bitmap);
+ }
+}
+
void __init sched_init(void)
{
- int i, j, k;
+ int i, j;

for_each_possible_cpu(i) {
- struct prio_array *array;
struct rq *rq;
+ struct task_grp_rq *tgrq;

rq = cpu_rq(i);
+ tgrq = init_task_grp.rq[i] = &per_cpu(init_tg_rq, i);
spin_lock_init(&rq->lock);
+ task_grp_rq_init(tgrq, &init_task_grp);
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;

#ifdef CONFIG_SMP
rq->sd = NULL;
@@ -6753,6 +6878,9 @@ void __init sched_init(void)
atomic_set(&rq->nr_iowait, 0);

for (j = 0; j < 2; j++) {
+ struct prio_array *array;
+ int k;
+
array = rq->arrays + j;
for (k = 0; k < MAX_PRIO; k++) {
INIT_LIST_HEAD(array->queue + k);
diff -puN init/Kconfig~base_cpu_ctlr init/Kconfig
--- linux-2.6.18/init/Kconfig~base_cpu_ctlr 2006-09-27 12:48:23.000000000 +0530
+++ linux-2.6.18-root/init/Kconfig 2006-09-27 12:48:24.000000000 +0530
@@ -231,6 +231,14 @@ config CPUSETS

Say N if unsure.

+config CPUMETER
+ bool "CPU resource control"
+ depends on CPUSETS && EXPERIMENTAL
+ help
+ This option lets you create cpu resource partitions within
+ cpusets. Each resource partition can be given a different quota
+ of CPU usage.
+
config RELAY
bool "Kernel->user space relay support (formerly relayfs)"
help
_
--
Regards,
vatsa

2006-09-28 17:28:14

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 2/9] CPU Controller V2 - Task-group priority handling

To retain good interactivity, priority of a task-group is defined to be the
same as the priority of the highest-priority runnable task it has in its
runqueue.

This means that, as tasks come and go, the priority of a task-group can change.

This patch deals with such changes to task-group priority.
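
A toy user-space model of the rule (illustrative only, not kernel code; the
value 140 stands in for MAX_PRIO, i.e. "no runnable task"):

	#include <stdio.h>

	int main(void)
	{
		int prios[] = { 120, 115, 130 };	/* priorities of runnable tasks */
		int i, grp_prio = 140;			/* 140 ~ MAX_PRIO: empty group */

		for (i = 0; i < 3; i++)
			if (prios[i] < grp_prio)	/* lower value == higher priority */
				grp_prio = prios[i];

		printf("task-group priority = %d\n", grp_prio);	/* prints 115 */
		return 0;
	}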

P.S.: Is this a good idea? I don't know. The other option is to have a static
task-group priority which is decided by the group's quota.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/kernel/sched.c | 80 ++++++++++++++++++++++++++++++++++++++-
1 files changed, 78 insertions(+), 2 deletions(-)

diff -puN kernel/sched.c~update_task_grp_prio kernel/sched.c
--- linux-2.6.18/kernel/sched.c~update_task_grp_prio 2006-09-28 16:39:15.840754048 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:21.191599592 +0530
@@ -191,7 +191,7 @@ static inline unsigned int task_timeslic

struct prio_array {
unsigned int nr_active;
- int best_static_prio;
+ int best_static_prio, best_dyn_prio;
DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_PRIO];
};
@@ -710,6 +710,55 @@ sched_info_switch(struct task_struct *pr
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+/* Dequeue task-group object from the main per-cpu runqueue */
+static void dequeue_task_grp(struct task_grp_rq *tgrq)
+{
+ struct prio_array *array = tgrq->rq_array;
+
+ array->nr_active--;
+ list_del(&tgrq->list);
+ if (list_empty(array->queue + tgrq->prio))
+ __clear_bit(tgrq->prio, array->bitmap);
+ tgrq->rq_array = NULL;
+}
+
+/* Enqueue task-group object on the main per-cpu runqueue */
+static void enqueue_task_grp(struct task_grp_rq *tgrq, struct prio_array *array,
+ int head)
+{
+ if (!head)
+ list_add_tail(&tgrq->list, array->queue + tgrq->prio);
+ else
+ list_add(&tgrq->list, array->queue + tgrq->prio);
+ __set_bit(tgrq->prio, array->bitmap);
+ array->nr_active++;
+ tgrq->rq_array = array;
+}
+
+/* Priority of a task-group is defined to be the priority of the highest
+ * priority runnable task it has. As tasks belonging to a task-group come in
+ * and go, the task-group priority can change on a particular CPU.
+ */
+static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,
+ int head)
+{
+ int new_prio;
+ struct prio_array *array = tgrq->rq_array;
+
+ new_prio = tgrq->active->best_dyn_prio;
+ if (new_prio == MAX_PRIO)
+ new_prio = tgrq->expired->best_dyn_prio;
+
+ if (array)
+ dequeue_task_grp(tgrq);
+ tgrq->prio = new_prio;
+ if (new_prio != MAX_PRIO) {
+ if (!array)
+ array = rq->active;
+ enqueue_task_grp(tgrq, array, head);
+ }
+}
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -717,8 +766,17 @@ static void dequeue_task(struct task_str
{
array->nr_active--;
list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
+ if (list_empty(array->queue + p->prio)) {
__clear_bit(p->prio, array->bitmap);
+ if (p->prio == array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp_rq(p);
+
+ array->best_dyn_prio =
+ sched_find_first_bit(array->bitmap);
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
+ }
}

static void enqueue_task(struct task_struct *p, struct prio_array *array)
@@ -728,6 +786,14 @@ static void enqueue_task(struct task_str
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp_rq(p);
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
}

/*
@@ -746,6 +812,14 @@ enqueue_task_head(struct task_struct *p,
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp_rq(p);
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 1);
+ }
}

/*
@@ -6831,7 +6905,9 @@ static void task_grp_rq_init(struct task
tgrq->expired = tgrq->arrays + 1;
tgrq->rq_array = NULL;
tgrq->expired->best_static_prio = MAX_PRIO;
+ tgrq->expired->best_dyn_prio = MAX_PRIO;
tgrq->active->best_static_prio = MAX_PRIO;
+ tgrq->active->best_dyn_prio = MAX_PRIO;
tgrq->prio = MAX_PRIO;
tgrq->tg = tg;
INIT_LIST_HEAD(&tgrq->list);
_
--
Regards,
vatsa

2006-09-28 17:29:06

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 3/9] CPU Controller V2 - group timeslice management

This patch does the actual job of dividing up CPU time among the various
task-groups as per their quota.

High-level overview:

Let's say that we have three task-groups/classes A, B and C
whose quotas are 50%, 30% and 20%. The quota specifies how
much execution time each group gets per some quantum.
If 10 seconds is chosen as this quantum, then group A's tasks
get 5 seconds' worth of execution time -every- 10 seconds,
group B gets 3 seconds and so on, as shown below:


A (5 sec) B (3 sec) C (2 sec)
|------------------------|---------------|-----------|

T0<---------------- 10 sec quantum ----------------->T1


In this scheme, group C's tasks could potentially starve,
waiting for their turn.

This patch avoids the above problem by breaking the 10 sec
quantum into several smaller quanta. A, B and C get their
due share in each smaller quantum and hence each group
progresses more quickly than in the above scheme.


A B C A B C A B C A B C
|-----|---|--|-----|---|--| .../ / ...|-----|---|--|-----|---|--|

|<---------->|<---------->| |<---------->|
1 sec 1 sec 1 sec
^ ^
| |
(CPU_CONTROL_SHORT_WINDOW)

|<------------------------------------------------------------->|
10 sec quantum (CPU_CONTROL_LONG_WINDOW)


The latter scheme is implemented by maintaining two counters (tokens)
for each task-group (on each CPU).

o ticks -> which is exhausted over the short quantum
o long_ticks -> which is exhausted over the longer quantum

To begin with (with the quota expressed as a percentage and both windows
expressed in scheduler ticks, as in the code):

	grp->ticks      = (quota * CPU_CONTROL_SHORT_WINDOW) / 100
	grp->long_ticks = CPU_CONTROL_LONG_WINDOW / CPU_CONTROL_SHORT_WINDOW


grp->ticks is decremented on every timer tick. When it hits zero,
grp->long_ticks is decremented. If the resulting grp->long_ticks is
non-zero, grp->ticks wraps back to its initial value and the group
is put in an expired bucket. If the resulting grp->long_ticks is
zero instead, the group is put in a greedy bucket.
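
As an example of the resulting numbers (CPU_CONTROL_SHORT_WINDOW and
CPU_CONTROL_LONG_WINDOW as defined in this patch; HZ=1000 is merely assumed
here), a group with a 30% quota starts each long window with:

	#include <stdio.h>

	#define HZ				1000	/* assumed for illustration */
	#define CPU_CONTROL_SHORT_WINDOW	(HZ)
	#define CPU_CONTROL_LONG_WINDOW		(5 * HZ)

	int main(void)
	{
		int quota = 30;	/* percent */
		int ticks = (quota * CPU_CONTROL_SHORT_WINDOW) / 100;
		int long_ticks = CPU_CONTROL_LONG_WINDOW / CPU_CONTROL_SHORT_WINDOW;

		/* prints "ticks = 300, long_ticks = 5": 300 ticks per 1-second
		 * short window, 5 short windows per long window */
		printf("ticks = %d, long_ticks = %d\n", ticks, long_ticks);
		return 0;
	}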


Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/kernel/sched.c | 124 ++++++++++++++++++++++++++++++++++++---
1 files changed, 115 insertions(+), 9 deletions(-)

diff -puN kernel/sched.c~handle_grp_timeslice kernel/sched.c
--- linux-2.6.18/kernel/sched.c~handle_grp_timeslice 2006-09-28 16:39:22.407755712 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:20.141759192 +0530
@@ -157,9 +157,6 @@
(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))

-#define TASK_PREEMPTS_CURR(p, rq) \
- ((p)->prio < (rq)->curr->prio)
-
/*
* task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
* to time slice values: [800ms ... 100ms ... 5ms]
@@ -217,7 +214,9 @@ struct task_grp_rq {
struct prio_array arrays[2];
struct prio_array *active, *expired, *rq_array;
unsigned long expired_timestamp;
- int prio; /* Priority of the task-group */
+ unsigned short ticks, long_ticks;
+ int prio; /* Priority of the task-group */
+ unsigned long last_update;
struct list_head list;
struct task_grp *tg;
};
@@ -226,6 +225,7 @@ static DEFINE_PER_CPU(struct task_grp_rq

/* task-group object - maintains information about each task-group */
struct task_grp {
+ unsigned short ticks, long_ticks; /* bandwidth given to task-group */
struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
};

@@ -267,7 +267,9 @@ struct rq {
/* these arrays hold task-groups.
* xxx: Avoid this for !CONFIG_CPUMETER?
*/
- struct prio_array *active, *expired, arrays[2];
+ struct prio_array *active, *expired, *greedy_active, *greedy_expired;
+ struct prio_array arrays[4];
+ unsigned long last_update;
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -710,6 +712,14 @@ sched_info_switch(struct task_struct *pr
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+#define CPU_CONTROL_SHORT_WINDOW (HZ)
+#define CPU_CONTROL_LONG_WINDOW (5 * HZ)
+#define NUM_LONG_TICKS (unsigned short) (CPU_CONTROL_LONG_WINDOW / \
+ CPU_CONTROL_SHORT_WINDOW)
+
+#define greedy_grp(grp) (!grp->long_ticks)
+#define greedy_task(p) (greedy_grp(task_grp_rq(p)))
+
/* Dequeue task-group object from the main per-cpu runqueue */
static void dequeue_task_grp(struct task_grp_rq *tgrq)
{
@@ -753,12 +763,29 @@ static inline void update_task_grp_prio(
dequeue_task_grp(tgrq);
tgrq->prio = new_prio;
if (new_prio != MAX_PRIO) {
- if (!array)
+ if (!greedy_grp(tgrq))
array = rq->active;
+ else
+ array = rq->greedy_active;
enqueue_task_grp(tgrq, array, head);
}
}

+#ifdef CONFIG_CPUMETER
+
+#define TASK_PREEMPTS_CURR(p, rq) \
+ (idle_cpu(task_cpu(p)) || \
+ (greedy_task((rq)->curr) && !greedy_task(p)) || \
+ ((task_grp_rq(p)->rq_array == task_grp_rq((rq)->curr)->rq_array) && \
+ (((p)->prio < (rq)->curr->prio))))
+
+#else
+
+#define TASK_PREEMPTS_CURR(p, rq) \
+ ((p)->prio < (rq)->curr->prio)
+
+#endif
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -3111,6 +3138,29 @@ void account_steal_time(struct task_stru
cpustat->steal = cputime64_add(cpustat->steal, tmp);
}

+#ifdef CONFIG_CPUMETER
+static inline void move_task_grp_list(struct prio_array *src,
+ struct prio_array *dst)
+{
+ unsigned int i;
+
+ if (!src->nr_active)
+ return;
+
+ for (i = 0; i < MAX_PRIO; i++) {
+ struct list_head *list = src->queue + i;
+
+ while (!list_empty(list)) {
+ struct task_grp_rq *tgrq;
+
+ tgrq = list_entry(list->next, struct task_grp_rq, list);
+ dequeue_task_grp(tgrq);
+ enqueue_task_grp(tgrq, dst, 0);
+ }
+ }
+}
+#endif
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -3209,6 +3259,39 @@ void scheduler_tick(void)
}

out_unlock:
+#ifdef CONFIG_CPUMETER
+ if (!--tgrq->ticks) {
+ struct prio_array *target;
+
+ if (tgrq->long_ticks) {
+ --tgrq->long_ticks;
+ if (tgrq->long_ticks)
+ target = rq->expired;
+ else
+ target = rq->greedy_active;
+ } else /* should be in rq->greedy_active */
+ target = rq->greedy_expired;
+
+ /* Move the task group to expired list */
+ dequeue_task_grp(tgrq);
+ tgrq->ticks = task_grp(p)->ticks;
+ enqueue_task_grp(tgrq, target, 0);
+ set_tsk_need_resched(p);
+ }
+
+ if (unlikely(jiffies - rq->last_update > CPU_CONTROL_LONG_WINDOW)) {
+ if (rq->active->nr_active || rq->expired->nr_active) {
+ move_task_grp_list(rq->greedy_active, rq->active);
+ move_task_grp_list(rq->greedy_expired, rq->active);
+ } else {
+ switch_array(rq->active, rq->greedy_active);
+ switch_array(rq->expired, rq->greedy_expired);
+ }
+ rq->last_update = jiffies;
+ set_tsk_need_resched(p);
+ }
+#endif
+
spin_unlock(&rq->lock);
out:
rebalance_tick(cpu, rq, NOT_IDLE);
@@ -3478,12 +3561,27 @@ need_resched_nonpreemptible:
#ifdef CONFIG_CPUMETER
array = rq->active;
if (unlikely(!array->nr_active)) {
- switch_array(rq->active, rq->expired);
- array = rq->active;
+ if (rq->expired->nr_active) {
+ switch_array(rq->active, rq->expired);
+ array = rq->active;
+ } else {
+ array = rq->greedy_active;
+ if (!array->nr_active) {
+ switch_array(rq->greedy_active,
+ rq->greedy_expired);
+ array = rq->greedy_active;
+ }
+ }
}
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next_grp = list_entry(queue->next, struct task_grp_rq, list);
+
+ if (unlikely(next_grp->last_update != rq->last_update)) {
+ next_grp->ticks = next_grp->tg->ticks;
+ next_grp->long_ticks = next_grp->tg->long_ticks;
+ next_grp->last_update = rq->last_update;
+ }
#else
next_grp = init_task_grp.rq[cpu];
#endif
@@ -6909,6 +7007,8 @@ static void task_grp_rq_init(struct task
tgrq->active->best_static_prio = MAX_PRIO;
tgrq->active->best_dyn_prio = MAX_PRIO;
tgrq->prio = MAX_PRIO;
+ tgrq->ticks = tg->ticks;
+ tgrq->long_ticks = tg->long_ticks;
tgrq->tg = tg;
INIT_LIST_HEAD(&tgrq->list);

@@ -6929,6 +7029,9 @@ void __init sched_init(void)
{
int i, j;

+ init_task_grp.ticks = CPU_CONTROL_SHORT_WINDOW; /* 100% bandwidth */
+ init_task_grp.long_ticks = NUM_LONG_TICKS;
+
for_each_possible_cpu(i) {
struct rq *rq;
struct task_grp_rq *tgrq;
@@ -6941,6 +7044,9 @@ void __init sched_init(void)
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
+ rq->greedy_active = rq->arrays + 2;
+ rq->greedy_expired = rq->arrays + 3;
+ rq->last_update = tgrq->last_update = jiffies;

#ifdef CONFIG_SMP
rq->sd = NULL;
@@ -6953,7 +7059,7 @@ void __init sched_init(void)
#endif
atomic_set(&rq->nr_iowait, 0);

- for (j = 0; j < 2; j++) {
+ for (j = 0; j < 4; j++) {
struct prio_array *array;
int k;

_
--
Regards,
vatsa

2006-09-28 17:30:25

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 4/9] CPU Controller V2 - define group operations

Define these operations for a task-group:

- create new group
- destroy existing group
- assign bandwidth (quota) for a group
- get bandwidth (quota) of a group
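
A rough sketch of how an interface layer might drive these operations through
the ops table exported below (illustrative only; the 30% quota is arbitrary
and error handling is omitted):

	void *grp;
	int q;

	grp = cpu_ctlr->alloc_group();			/* create a new task-group */
	if (grp) {
		cpu_ctlr->assign_quota(grp, 30);	/* give it 30% of the CPU */
		q = cpu_ctlr->get_quota(grp);		/* reads back 30 */
		cpu_ctlr->dealloc_group(grp);		/* destroy it again */
	}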


Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/include/linux/sched.h | 13 +++++
linux-2.6.18-root/kernel/sched.c | 79 ++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+)

diff -puN kernel/sched.c~cpu_ctlr_grp_ops kernel/sched.c
--- linux-2.6.18/kernel/sched.c~cpu_ctlr_grp_ops 2006-09-28 16:40:21.106832096 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:19.055924264 +0530
@@ -7192,3 +7192,82 @@ void set_curr_task(int cpu, struct task_
}

#endif
+
+#ifdef CONFIG_CPUMETER
+
+/* Allocate runqueue structures for the new task-group */
+void *sched_alloc_group(void)
+{
+ struct task_grp *tg;
+ struct task_grp_rq *tgrq;
+ int i;
+
+ tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ if (!tg)
+ return NULL;
+
+ tg->ticks = CPU_CONTROL_SHORT_WINDOW;
+ tg->long_ticks = NUM_LONG_TICKS;
+
+ for_each_possible_cpu(i) {
+ tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
+ if (!tgrq)
+ goto oom;
+ tg->rq[i] = tgrq;
+ task_grp_rq_init(tgrq, tg);
+ }
+
+ return tg;
+oom:
+ while (i--)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+ return NULL;
+}
+
+/* Deallocate runqueue structures */
+void sched_dealloc_group(struct task_grp *tg)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+}
+
+/* Assign quota to this group */
+int sched_assign_quota(struct task_grp *tg, int quota)
+{
+ int i;
+
+ tg->ticks = (quota * CPU_CONTROL_SHORT_WINDOW) / 100;
+
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+
+ return 0;
+}
+
+static inline int cpu_quota(struct task_grp *tg)
+{
+ return (tg->ticks * 100) / CPU_CONTROL_SHORT_WINDOW;
+}
+
+/* Return assigned quota for this group */
+int sched_get_quota(struct task_grp *tg)
+{
+ return cpu_quota(tg);
+}
+
+static struct task_grp_ops sched_grp_ops = {
+ .alloc_group = sched_alloc_group,
+ .dealloc_group = sched_dealloc_group,
+ .assign_quota = sched_assign_quota,
+ .get_quota = sched_get_quota,
+};
+
+struct task_grp_ops *cpu_ctlr = &sched_grp_ops;
+
+#endif /* CONFIG_CPUMETER */
diff -puN include/linux/sched.h~cpu_ctlr_grp_ops include/linux/sched.h
--- linux-2.6.18/include/linux/sched.h~cpu_ctlr_grp_ops 2006-09-28 16:40:21.111831336 +0530
+++ linux-2.6.18-root/include/linux/sched.h 2006-09-28 17:23:19.059923656 +0530
@@ -1611,6 +1611,19 @@ static inline void thaw_processes(void)
static inline int try_to_freeze(void) { return 0; }

#endif /* CONFIG_PM */
+
+#ifdef CONFIG_CPUMETER
+struct task_grp;
+struct task_grp_ops {
+ void *(*alloc_group)(void);
+ void (*dealloc_group)(struct task_grp *grp);
+ int (*assign_quota)(struct task_grp *grp, int quota);
+ int (*get_quota)(struct task_grp *grp);
+};
+
+extern struct task_grp_ops *cpu_ctlr;
+#endif
+
#endif /* __KERNEL__ */

#endif
_
--
Regards,
vatsa

2006-09-28 17:31:20

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 5/9] CPU Controller V2 - deal with movement of tasks

When a task moves between groups (as initiated by an administrator), it has to
be removed from the runqueue of its old group and added to the runqueue of its
new group. This patch defines this move operation.
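
The intended calling sequence looks roughly like this (a sketch only;
"switch the task's group pointer" stands for whatever the interface layer
does, e.g. updating the task's cpuset pointer):

	unsigned long flags;
	int rc;

	/* step 1: called while tsk still points at its old group */
	rc = cpu_ctlr->pre_move_task(tsk, &flags, old_grp, new_grp);

	/* ... switch the task's group pointer to new_grp here ... */

	/* step 2: called once tsk points at its new group */
	cpu_ctlr->post_move_task(tsk, &flags, rc);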

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/include/linux/sched.h | 3 +
linux-2.6.18-root/kernel/sched.c | 61 ++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)

diff -puN kernel/sched.c~cpu_ctlr_move_task kernel/sched.c
--- linux-2.6.18/kernel/sched.c~cpu_ctlr_move_task 2006-09-28 16:40:24.971244616 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:18.115067296 +0530
@@ -7261,10 +7261,71 @@ int sched_get_quota(struct task_grp *tg)
return cpu_quota(tg);
}

+/*
+ * Move a task from one group to another. If the task is already on a
+ * runqueue, this involves removing the task from its old group's runqueue
+ * and adding to its new group's runqueue.
+ *
+ * This however is slightly tricky, given the facts that:
+ *
+ * - some pointer in the task struct (ex: cpuset) represents the group
+ * to which a task belongs.
+ * - At any given point during the move operation, the pointer either
+ * points to the old group or to the new group, but not both!
+ * - dequeue_task/enqueue_task rely on this pointer to know which
+ * task_grp_rq the task is to be removed from/added to.
+ *
+ * Hence the move is accomplished in two steps:
+ *
+ * 1. In first step, sched_pre_move_task() is called with the group
+ * pointer set to the old group to which the task belonged.
+ * If the task was on a runqueue, sched_pre_move_task() will
+ * remove it from the runqueue.
+ *
+ * 2. In second step, sched_post_move_task() is called with the group
+ * pointer set to the new group to which the task belongs.
+ * sched_post_move_task() will add the task in its new runqueue
+ * if it was on a runqueue in step 1.
+ *
+ */
+
+int sched_pre_move_task(struct task_struct *tsk, unsigned long *flags,
+ struct task_grp *tg_old, struct task_grp *tg_new)
+{
+ struct rq *rq;
+ int rc = 0;
+
+ if (tg_new == tg_old)
+ return rc;
+
+ rq = task_rq_lock(tsk, flags);
+
+ rc = 1;
+ if (tsk->array) {
+ rc = 2;
+ deactivate_task(tsk, rq);
+ }
+
+ return rc;
+}
+
+/* called with rq lock held */
+void sched_post_move_task(struct task_struct *tsk, unsigned long *flags, int rc)
+{
+ struct rq *rq = task_rq(tsk);
+
+ if (rc == 2)
+ __activate_task(tsk, rq);
+
+ task_rq_unlock(rq, flags);
+}
+
static struct task_grp_ops sched_grp_ops = {
.alloc_group = sched_alloc_group,
.dealloc_group = sched_dealloc_group,
.assign_quota = sched_assign_quota,
+ .pre_move_task = sched_pre_move_task,
+ .post_move_task = sched_post_move_task,
.get_quota = sched_get_quota,
};

diff -puN include/linux/sched.h~cpu_ctlr_move_task include/linux/sched.h
--- linux-2.6.18/include/linux/sched.h~cpu_ctlr_move_task 2006-09-28 16:40:24.976243856 +0530
+++ linux-2.6.18-root/include/linux/sched.h 2006-09-28 17:23:18.119066688 +0530
@@ -1618,6 +1618,9 @@ struct task_grp_ops {
void *(*alloc_group)(void);
void (*dealloc_group)(struct task_grp *grp);
int (*assign_quota)(struct task_grp *grp, int quota);
+ int (*pre_move_task)(struct task_struct *tsk, unsigned long *,
+ struct task_grp *old, struct task_grp *new);
+ void (*post_move_task)(struct task_struct *tsk, unsigned long *, int);
int (*get_quota)(struct task_grp *grp);
};

_
--
Regards,
vatsa

2006-09-28 17:32:14

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 6/9] CPU Controller V2 - Handle dont care groups

Deal with task-groups whose bandwidth hasn't been explicitly set by the
administrator. Unallocated CPU bandwidth is distributed equally among such
"don't care" groups.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/include/linux/sched.h | 2
linux-2.6.18-root/kernel/sched.c | 92 ++++++++++++++++++++++++++++++--
2 files changed, 88 insertions(+), 6 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_handle_dont_cares kernel/sched.c
--- linux-2.6.18/kernel/sched.c~cpu_ctlr_handle_dont_cares 2006-09-28 16:40:28.772666712 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:16.989238448 +0530
@@ -226,6 +226,12 @@ static DEFINE_PER_CPU(struct task_grp_rq
/* task-group object - maintains information about each task-group */
struct task_grp {
unsigned short ticks, long_ticks; /* bandwidth given to task-group */
+ int left_over_pct;
+ int total_dont_care_grps;
+ int dont_care; /* Does this group care for its bandwidth ? */
+ struct task_grp *parent;
+ struct list_head dont_care_list;
+ struct list_head list;
struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
};

@@ -7031,6 +7037,12 @@ void __init sched_init(void)

init_task_grp.ticks = CPU_CONTROL_SHORT_WINDOW; /* 100% bandwidth */
init_task_grp.long_ticks = NUM_LONG_TICKS;
+ init_task_grp.left_over_pct = 100; /* 100% unallocated bandwidth */
+ init_task_grp.parent = NULL;
+ init_task_grp.total_dont_care_grps = 1; /* init_task_grp itself */
+ init_task_grp.dont_care = 1;
+ INIT_LIST_HEAD(&init_task_grp.dont_care_list);
+ list_add_tail(&init_task_grp.list, &init_task_grp.dont_care_list);

for_each_possible_cpu(i) {
struct rq *rq;
@@ -7195,8 +7207,31 @@ void set_curr_task(int cpu, struct task_

#ifdef CONFIG_CPUMETER

+/* Distribute left over bandwidth equally to all "dont care" task groups */
+static void recalc_dontcare(struct task_grp *tg_root)
+{
+ int ticks;
+ struct list_head *entry;
+
+ if (!tg_root->total_dont_care_grps)
+ return;
+
+ ticks = ((tg_root->left_over_pct / tg_root->total_dont_care_grps) *
+ CPU_CONTROL_SHORT_WINDOW) / 100;
+
+ list_for_each(entry, &tg_root->dont_care_list) {
+ struct task_grp *tg;
+ int i;
+
+ tg = list_entry(entry, struct task_grp, list);
+ tg->ticks = ticks;
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+ }
+}
+
/* Allocate runqueue structures for the new task-group */
-void *sched_alloc_group(void)
+void *sched_alloc_group(struct task_grp *tg_parent)
{
struct task_grp *tg;
struct task_grp_rq *tgrq;
@@ -7208,6 +7243,10 @@ void *sched_alloc_group(void)

tg->ticks = CPU_CONTROL_SHORT_WINDOW;
tg->long_ticks = NUM_LONG_TICKS;
+ tg->parent = tg_parent;
+ tg->dont_care = 1;
+ tg->left_over_pct = 100;
+ INIT_LIST_HEAD(&tg->dont_care_list);

for_each_possible_cpu(i) {
tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
@@ -7217,6 +7256,15 @@ void *sched_alloc_group(void)
task_grp_rq_init(tgrq, tg);
}

+ if (tg->parent) {
+ tg->parent->total_dont_care_grps++;
+ list_add_tail(&tg->list, &tg->parent->dont_care_list);
+ recalc_dontcare(tg->parent);
+ } else {
+ tg->total_dont_care_grps = 1;
+ list_add_tail(&tg->list, &tg->dont_care_list);
+ }
+
return tg;
oom:
while (i--)
@@ -7230,6 +7278,16 @@ oom:
void sched_dealloc_group(struct task_grp *tg)
{
int i;
+ struct task_grp *tg_root = tg->parent;
+
+ if (!tg_root)
+ tg_root = tg;
+
+ if (tg->dont_care) {
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ recalc_dontcare(tg_root);
+ }

for_each_possible_cpu(i)
kfree(tg->rq[i]);
@@ -7240,12 +7298,33 @@ void sched_dealloc_group(struct task_grp
/* Assign quota to this group */
int sched_assign_quota(struct task_grp *tg, int quota)
{
- int i;
+ int i, old_quota = 0;
+ struct task_grp *tg_root = tg->parent;
+
+ if (!tg_root)
+ tg_root = tg;
+
+ if (!tg->dont_care)
+ old_quota = (tg->ticks * 100) / CPU_CONTROL_SHORT_WINDOW;
+
+ if (quota > (tg_root->left_over_pct - old_quota))
+ return -EINVAL;
+
+ if (tg->dont_care) {
+ tg->dont_care = 0;
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ }

tg->ticks = (quota * CPU_CONTROL_SHORT_WINDOW) / 100;
+ for_each_possible_cpu(i) {
+ tg->rq[i]->ticks = tg->ticks;
+ tg->rq[i]->long_ticks = tg->long_ticks;
+ }

- for_each_possible_cpu(i)
- tg->rq[i]->ticks = tg->ticks;
+ /* xxx: needs some locking */
+ tg_root->left_over_pct -= (quota - old_quota);
+ recalc_dontcare(tg_root);

return 0;
}
@@ -7258,7 +7337,10 @@ static inline int cpu_quota(struct task_
/* Return assigned quota for this group */
int sched_get_quota(struct task_grp *tg)
{
- return cpu_quota(tg);
+ if (tg->dont_care)
+ return 0;
+ else
+ return cpu_quota(tg);
}

/*
diff -puN include/linux/sched.h~cpu_ctlr_handle_dont_cares include/linux/sched.h
--- linux-2.6.18/include/linux/sched.h~cpu_ctlr_handle_dont_cares 2006-09-28 16:40:28.777665952 +0530
+++ linux-2.6.18-root/include/linux/sched.h 2006-09-28 16:40:28.801662304 +0530
@@ -1615,7 +1615,7 @@ static inline int try_to_freeze(void) {
#ifdef CONFIG_CPUMETER
struct task_grp;
struct task_grp_ops {
- void *(*alloc_group)(void);
+ void *(*alloc_group)(struct task_grp *grp_parent);
void (*dealloc_group)(struct task_grp *grp);
int (*assign_quota)(struct task_grp *grp, int quota);
int (*pre_move_task)(struct task_struct *tsk, unsigned long *,
_
--
Regards,
vatsa

2006-09-28 17:32:49

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 7/9] CPU Controller V2 - SMP load balance changes

When balancing tasks across CPU runqueues, we need to take into account the
quota of the task-groups (for example: we don't want to pull all tasks of a
group with a 90% quota onto the same CPU).

This is easily accomplished by piggybacking on the smpnice mechanism.
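
Concretely, set_load_weight() below scales a task's load_weight by its
group's quota, so tasks of heavily-weighted groups look "heavier" to the
existing balancer and get spread out. A standalone sketch of just that
scaling (the base weight of 128 is an assumed example value, not the kernel
constant):

	#include <stdio.h>

	/* (base_weight * quota%) / 100, as done by set_load_weight() here */
	static int scaled_load_weight(int base_weight, int quota_pct)
	{
		return (base_weight * quota_pct) / 100;
	}

	int main(void)
	{
		printf("same task, 90%% quota group: %d\n", scaled_load_weight(128, 90));
		printf("same task, 10%% quota group: %d\n", scaled_load_weight(128, 10));
		return 0;
	}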

This patch also modifies move_tasks() to look at all the arrays of the
runqueue when pulling tasks.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/kernel/sched.c | 155 +++++++++++++++++++++++++++++----------
1 files changed, 119 insertions(+), 36 deletions(-)

diff -puN kernel/sched.c~load_balance kernel/sched.c
--- linux-2.6.18/kernel/sched.c~load_balance 2006-09-28 16:40:34.992721120 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:15.762424952 +0530
@@ -216,7 +216,7 @@ struct task_grp_rq {
unsigned long expired_timestamp;
unsigned short ticks, long_ticks;
int prio; /* Priority of the task-group */
- unsigned long last_update;
+ unsigned long last_update, raw_weighted_load;
struct list_head list;
struct task_grp *tg;
};
@@ -906,6 +906,11 @@ static inline int __normal_prio(struct t
#define RTPRIO_TO_LOAD_WEIGHT(rp) \
(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))

+static inline int cpu_quota(struct task_grp *tg)
+{
+ return (tg->ticks * 100) / CPU_CONTROL_SHORT_WINDOW;
+}
+
static void set_load_weight(struct task_struct *p)
{
if (has_rt_policy(p)) {
@@ -919,21 +924,25 @@ static void set_load_weight(struct task_
p->load_weight = 0;
else
#endif
- p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
+ p->load_weight = (RTPRIO_TO_LOAD_WEIGHT(p->rt_priority)
+ * cpu_quota(task_grp(p))) / 100;
} else
- p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+ p->load_weight = (PRIO_TO_LOAD_WEIGHT(p->static_prio)
+ * cpu_quota(task_grp(p))) / 100;
}

static inline void
-inc_raw_weighted_load(struct rq *rq, const struct task_struct *p)
+inc_raw_weighted_load(struct rq *rq, struct task_struct *p)
{
rq->raw_weighted_load += p->load_weight;
+ task_grp_rq(p)->raw_weighted_load += p->load_weight;
}

static inline void
-dec_raw_weighted_load(struct rq *rq, const struct task_struct *p)
+dec_raw_weighted_load(struct rq *rq, struct task_struct *p)
{
rq->raw_weighted_load -= p->load_weight;
+ task_grp_rq(p)->raw_weighted_load -= p->load_weight;
}

static inline void inc_nr_running(struct task_struct *p, struct rq *rq)
@@ -2241,42 +2250,30 @@ int can_migrate_task(struct task_struct
return 1;
}

-#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)
+#define rq_best_prio(rq) min((rq)->active->best_dyn_prio, \
+ (rq)->expired->best_static_prio)

-/*
- * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
- * load from busiest to this_rq, as part of a balancing operation within
- * "domain". Returns the number of tasks moved.
- *
- * Called with both runqueues locked.
- */
-static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_nr_move, unsigned long max_load_move,
- struct sched_domain *sd, enum idle_type idle,
- int *all_pinned)
+static int __move_tasks(struct task_grp_rq *this_rq, int this_cpu,
+ struct task_grp_rq *busiest, unsigned long max_nr_move,
+ unsigned long max_load_move, struct sched_domain *sd,
+ enum idle_type idle, int *all_pinned, long *load_moved)
{
int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
- best_prio_seen, skip_for_load;
+ best_prio_seen = 0, skip_for_load;
struct prio_array *array, *dst_array;
struct list_head *head, *curr;
struct task_struct *tmp;
- long rem_load_move;
+ long rem_load_move, grp_load_diff;
+
+ grp_load_diff = busiest->raw_weighted_load - this_rq->raw_weighted_load;
+ rem_load_move = grp_load_diff/2;

- if (max_nr_move == 0 || max_load_move == 0)
+ if (grp_load_diff < 0 || max_nr_move == 0 || max_load_move == 0)
goto out;

- rem_load_move = max_load_move;
pinned = 1;
this_best_prio = rq_best_prio(this_rq);
best_prio = rq_best_prio(busiest);
- /*
- * Enable handling of the case where there is more than one task
- * with the best priority. If the current running task is one
- * of those with prio==best_prio we know it won't be moved
- * and therefore it's safe to override the skip (based on load) of
- * any task we find with that prio.
- */
- best_prio_seen = best_prio == busiest->curr->prio;

/*
* We first consider expired tasks. Those will likely not be
@@ -2325,7 +2322,7 @@ skip_queue:
if (skip_for_load && idx < this_best_prio)
skip_for_load = !best_prio_seen && idx == best_prio;
if (skip_for_load ||
- !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
+ !can_migrate_task(tmp, task_rq(tmp), this_cpu, sd, idle, &pinned)) {

best_prio_seen |= idx == best_prio;
if (curr != head)
@@ -2339,7 +2336,8 @@ skip_queue:
schedstat_inc(sd, lb_hot_gained[idle]);
#endif

- pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
+ pull_task(task_rq(tmp), array, tmp, cpu_rq(this_cpu), dst_array,
+ this_cpu);
pulled++;
rem_load_move -= tmp->load_weight;

@@ -2365,9 +2363,98 @@ out:

if (all_pinned)
*all_pinned = pinned;
+ *load_moved = grp_load_diff/2 - rem_load_move;
return pulled;
}

+static inline int choose_next_array(struct prio_array *arr[], int start_idx)
+{
+ int i;
+
+ for (i = start_idx; arr[i]; i++)
+ if (arr[i]->nr_active)
+ break;
+
+ return i;
+}
+
+/*
+ * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
+ * load from busiest to this_rq, as part of a balancing operation within
+ * "domain". Returns the number of tasks moved.
+ *
+ * Called with both runqueues locked.
+ */
+static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_nr_move, unsigned long max_load_move,
+ struct sched_domain *sd, enum idle_type idle,
+ int *all_pinned)
+{
+ struct task_grp_rq *busy_grp, *this_grp;
+ long load_moved;
+ unsigned long total_nr_moved = 0, nr_moved;
+ int idx;
+ int arr_idx = 0;
+ struct list_head *head, *curr;
+ struct prio_array *array, *array_prefs[5];
+
+
+ /* order in which tasks are picked up */
+ array_prefs[0] = busiest->greedy_expired;
+ array_prefs[1] = busiest->greedy_active;
+ array_prefs[2] = busiest->expired;
+ array_prefs[3] = busiest->active;
+ array_prefs[4] = NULL;
+
+ arr_idx = choose_next_array(array_prefs, arr_idx);
+ array = array_prefs[arr_idx++];
+
+ if (!array)
+ goto out;
+
+new_array:
+ /* Start searching at priority 0: */
+ idx = 0;
+skip_bitmap:
+ if (!idx)
+ idx = sched_find_first_bit(array->bitmap);
+ else
+ idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
+ if (idx >= MAX_PRIO) {
+ arr_idx = choose_next_array(array_prefs, arr_idx);
+ array = array_prefs[arr_idx++];
+ if (!array)
+ goto out;
+ goto new_array;
+ }
+
+ head = array->queue + idx;
+ curr = head->prev;
+skip_queue:
+ busy_grp = list_entry(curr, struct task_grp_rq, list);
+ this_grp = busy_grp->tg->rq[this_cpu];
+
+ curr = curr->prev;
+
+ nr_moved = __move_tasks(this_grp, this_cpu, busy_grp, max_nr_move,
+ max_load_move, sd, idle, all_pinned, &load_moved);
+
+ total_nr_moved += nr_moved;
+ max_nr_move -= nr_moved;
+ max_load_move -= load_moved;
+
+ if (!max_nr_move || (long)max_load_move <= 0)
+ goto out;
+
+ if (curr != head)
+ goto skip_queue;
+ idx++;
+ goto skip_bitmap;
+
+out:
+ return total_nr_moved;
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -7329,11 +7416,6 @@ int sched_assign_quota(struct task_grp *
return 0;
}

-static inline int cpu_quota(struct task_grp *tg)
-{
- return (tg->ticks * 100) / CPU_CONTROL_SHORT_WINDOW;
-}
-
/* Return assigned quota for this group */
int sched_get_quota(struct task_grp *tg)
{
@@ -7396,6 +7478,7 @@ void sched_post_move_task(struct task_st
{
struct rq *rq = task_rq(tsk);

+ set_load_weight(tsk);
if (rc == 2)
__activate_task(tsk, rq);

_
--
Regards,
vatsa

2006-09-28 17:33:32

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 8/9] CPU Controller V2 - task_cpu(p) needs to be correct always

We rely very much on task_cpu(p) being correct at all times, so that we can
correctly find the task_grp_rq from which the task has to be removed or to
which it has to be added.

There is however one place in the scheduler where this assumption of
task_cpu(p) being correct is broken. This patch fixes that piece of code.

(Thanks to Balbir Singh for pointing this out to me)

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/kernel/sched.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_setcpu kernel/sched.c
--- linux-2.6.18/kernel/sched.c~cpu_ctlr_setcpu 2006-09-28 16:40:37.844287616 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 17:23:14.896556584 +0530
@@ -5230,6 +5230,7 @@ static int __migrate_task(struct task_st
{
struct rq *rq_dest, *rq_src;
int ret = 0;
+ struct prio_array *array;

if (unlikely(cpu_is_offline(dest_cpu)))
return ret;
@@ -5245,8 +5246,8 @@ static int __migrate_task(struct task_st
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

- set_task_cpu(p, dest_cpu);
- if (p->array) {
+ array = p->array;
+ if (array) {
/*
* Sync timestamp with rq_dest's before activating.
* The same thing could be achieved by doing this step
@@ -5256,6 +5257,11 @@ static int __migrate_task(struct task_st
p->timestamp = p->timestamp - rq_src->timestamp_last_tick
+ rq_dest->timestamp_last_tick;
deactivate_task(p, rq_src);
+ }
+
+ set_task_cpu(p, dest_cpu);
+
+ if (array) {
__activate_task(p, rq_dest);
if (TASK_PREEMPTS_CURR(p, rq_dest))
resched_task(rq_dest->curr);
_
--
Regards,
vatsa

2006-09-28 17:34:55

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 9/9] CPU Controller V2 - cpuset interface

A controller cannot live in isolation. *Somebody* needs to tell it about
events like the creation of a new task-group, the destruction of an existing
group, the assignment of bandwidth and, finally, the movement of tasks
between task-groups.

We also need *some* interface for grouping tasks and assigning bandwidth
to the task-groups.

cpusets is chosen as this interface for now.

===============================================================================

DISCLAIMER:

- This is by no means a recommendation that cpusets be chosen as
  the resource-management interface!

- The cpuset interface is chosen for now because it represents a
  task-grouping mechanism which is already in mainline and hence
  makes the review of the rest of the scheduler changes easy.

- The CPU controller can easily be hooked to some other interface,
  like Resource Groups or User-bean-counter, should that be decided
  down the line.

===============================================================================

This patch makes some extensions to cpusets, as below:

- Every cpuset directory has two new files:
cpu_meter_enabled, cpu_meter_quota
- Writing '1' into 'cpu_meter_enabled' marks the cpuset as "metered", i.e.
tasks in this cpuset and its child cpusets will have their CPU
usage controlled by the controller as per the quota settings
mentioned in 'cpu_meter_quota'.
- Writing a number between 0-100 into 'cpu_meter_quota' file of a
metered cpuset assigns corresponding bandwidth for all tasks in the
cpuset (task-group).
- Only cpusets marked as "exclusive" can be metered.
- cpusets that have child cpusets cannot be metered.
- (Temporary limitation:) cpusets which are "in-use" (i.e. tasks are
  attached to them) cannot be metered
- Metered cpusets cannot have grand-children!

Metered cpusets was discussed elaborately at:

http://lkml.org/lkml/2005/9/26/58

This patch was inspired by those discussions and is a subset of the features
provided in the earlier patch.

As an example, follow these steps to create metered cpusets:


# alias echo="/bin/echo"

# cd /dev
# mkdir cpuset
# mount -t cpuset cpuset cpuset

# cd cpuset

# echo 1 > cpu_exclusive

# mkdir all
# cd all

# echo "0-N" > cpus # Assign all CPUs to the cpuset.
# (replace N with the highest CPU number in the system)
# echo 0 > mems # assign node 0
# echo 1 > cpu_exclusive
# echo 1 > cpu_meter_enabled # "all" is now a "metered" cpuset

# # Because parent is marked 'cpu_meter_enabled',
# # mkdir below automatically copied 'cpus', 'mems' and
# # 'cpu_meter_enabled' from the parent to the new child.

# mkdir default
# mkdir very_imp_grp
# mkdir less_imp_grp

# # Assign 10% bandwidth to default group
# echo 10 > default/cpu_meter_quota

# # Assign 70% bandwidth to very_imp_grp
# echo 70 > very_imp_grp/cpu_meter_quota

# # Assign 20% bandwidth to less_imp_grp
# echo 20 > less_imp_grp/cpu_meter_quota


# # Now transfer all existing tasks in system to default cpuset
# sed -nu p < /dev/cpuset/tasks > /dev/cpuset/all/default/tasks

# # After the above step, 'cat /dev/cpuset/tasks' should output
# # nothing. Otherwise, repeat the sed command, until cat displays
# # nothing.

# echo $very_imp_task1_pid > very_imp_grp/tasks
# echo $very_imp_task2_pid > very_imp_grp/tasks

# echo $less_imp_task1_pid > less_imp_grp/tasks
# echo $less_imp_task2_pid > less_imp_grp/tasks


Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---

linux-2.6.18-root/include/linux/cpuset.h | 9 +
linux-2.6.18-root/init/main.c | 2
linux-2.6.18-root/kernel/cpuset.c | 255 ++++++++++++++++++++++++++++++-
linux-2.6.18-root/kernel/sched.c | 9 -
4 files changed, 271 insertions(+), 4 deletions(-)

diff -puN kernel/cpuset.c~cpuset_interface kernel/cpuset.c
--- linux-2.6.18/kernel/cpuset.c~cpuset_interface 2006-09-28 16:42:00.073786832 +0530
+++ linux-2.6.18-root/kernel/cpuset.c 2006-09-28 17:18:32.261523640 +0530
@@ -99,6 +99,9 @@ struct cpuset {
int mems_generation;

struct fmeter fmeter; /* memory_pressure filter */
+#ifdef CONFIG_CPUMETER
+ void *cpu_ctlr_data;
+#endif
};

/* bits in struct cpuset flags field */
@@ -110,6 +113,9 @@ typedef enum {
CS_NOTIFY_ON_RELEASE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+#ifdef CONFIG_CPUMETER
+ CS_METER_FLAG,
+#endif
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -148,6 +154,180 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+#ifdef CONFIG_CPUMETER
+static inline int is_metered(const struct cpuset *cs)
+{
+ return test_bit(CS_METER_FLAG, &cs->flags);
+}
+
+/* does the new combination of meter and exclusive flags makes sense? */
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ struct cpuset *parent = cur->parent;
+ int is_changed, is_exclusive_changed, rc = -EINVAL;
+ cpumask_t cur_mask, trial_mask;
+
+ get_cpu(); /* Disable CPU Hotplug for accessing cpu_online_map */
+
+ /* checks for change in meter flag */
+ is_changed = (is_metered(trial) != is_metered(cur));
+
+ is_exclusive_changed = (is_cpu_exclusive(trial) !=
+ is_cpu_exclusive(cur));
+
+ /* Temporary limitation until we can get to tasks of a cpuset easily:
+ *
+ * can't change meter setting of cpuset which is already "in-use"
+ * (it has tasks attached to it)
+ */
+ if (is_changed && atomic_read(&cur->count) != 0) {
+ rc = -EBUSY;
+ goto out;
+ }
+
+ /* Cant change meter setting if a cpuset already has children */
+ if (is_changed && !list_empty(&cur->children)) {
+ rc = -EBUSY;
+ goto out;
+ }
+
+ /* Must meter child of metered parent */
+ if (parent && is_metered(parent) && !is_metered(trial))
+ goto out;
+
+ /* Turn on metering only if a cpuset is exclusive */
+ if (parent && !is_metered(parent) && is_metered(trial) &&
+ !is_cpu_exclusive(cur))
+ goto out;
+
+ /* Cant change exclusive setting if a cpuset is metered */
+ if (is_exclusive_changed && is_metered(cur))
+ goto out;
+
+
+ /* Can't change cpus_allowed of a metered cpuset */
+ cpus_and(trial_mask, trial->cpus_allowed, cpu_online_map);
+ cpus_and(cur_mask, cur->cpus_allowed, cpu_online_map);
+ if (is_metered(cur) && !cpus_equal(trial_mask, cur_mask)) {
+ rc = -EBUSY;
+ goto out;
+ }
+
+ rc = 0;
+
+out:
+ put_cpu();
+ return rc;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ if (is_metered(cs)) {
+ void *parent_data = NULL;
+
+ if (cs->parent)
+ parent_data = cs->parent->cpu_ctlr_data;
+ cs->cpu_ctlr_data = cpu_ctlr->alloc_group(parent_data);
+ } else {
+ cpu_ctlr->dealloc_group(cs->cpu_ctlr_data);
+ /* Do we have any races here ? */
+ cs->cpu_ctlr_data = NULL;
+ }
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ int quota;
+
+ if (!buf || !is_metered(cs))
+ return -EINVAL;
+
+ /* Temporary limitation until we can get to tasks of a cpuset easily:
+ *
+ * can't change quota of cpuset which is already "in-use"
+ * (it has tasks attached to it)
+ */
+ if (atomic_read(&cs->count))
+ return -EBUSY;
+
+ quota = simple_strtoul(buf, NULL, 10);
+ cpu_ctlr->assign_quota(cs->cpu_ctlr_data, quota);
+
+ return 0;
+}
+
+static int cpuset_sprintf_quota(char *page, struct cpuset *cs)
+{
+ int quota = 0;
+ char *c = page;
+
+ if (is_metered(cs))
+ quota = cpu_ctlr->get_quota(cs->cpu_ctlr_data);
+ c += sprintf (c, "%d", quota);
+
+ return c - page;
+}
+
+void *cpu_ctlr_data(struct cpuset *cs)
+{
+ return cs->cpu_ctlr_data;
+}
+
+static int cpu_ctlr_pre_move_task(struct task_struct *tsk,
+ unsigned long *flags, struct cpuset *oldcs, struct cpuset *newcs)
+{
+ int change_rq, rc = 0;
+
+ change_rq = is_metered(oldcs) || is_metered(newcs);
+
+ if (change_rq)
+ rc = cpu_ctlr->pre_move_task(tsk, flags, oldcs->cpu_ctlr_data,
+ newcs->cpu_ctlr_data);
+
+ return rc;
+}
+
+static void cpu_ctlr_post_move_task(struct task_struct *tsk,
+ unsigned long *flags, int rc)
+{
+ cpu_ctlr->post_move_task(tsk, flags, rc);
+}
+
+#else
+
+static inline int is_metered(const struct cpuset *cs)
+{
+ return 0;
+}
+
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ return 0;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ return;
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ return 0;
+}
+
+static int cpu_ctlr_pre_move_task(struct task_struct *tsk,
+ unsigned long *flags, struct cpuset *oldcs, struct cpuset *newcs)
+{
+ return 0;
+}
+
+static void cpu_ctlr_post_move_task(struct task_struct *tsk,
+ unsigned long *flags, int rc)
+{
+ return;
+}
+#endif /* CONFIG_CPUMETER */
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -722,6 +902,10 @@ static int is_cpuset_subset(const struct
static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
{
struct cpuset *c, *par;
+ int rc = 0;
+
+ if ((rc = validate_meters(cur, trial)))
+ return rc;

/* Each of our child cpusets must be a subset of us */
list_for_each_entry(c, &cur->children, sibling) {
@@ -1041,7 +1225,7 @@ static int update_flag(cpuset_flagbits_t
{
int turning_on;
struct cpuset trialcs;
- int err, cpu_exclusive_changed;
+ int err, cpu_exclusive_changed, meter_changed;

turning_on = (simple_strtoul(buf, NULL, 10) != 0);

@@ -1056,6 +1240,7 @@ static int update_flag(cpuset_flagbits_t
return err;
cpu_exclusive_changed =
(is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs));
+ meter_changed = (is_metered(cs) != is_metered(&trialcs));
mutex_lock(&callback_mutex);
if (turning_on)
set_bit(bit, &cs->flags);
@@ -1065,7 +1250,9 @@ static int update_flag(cpuset_flagbits_t

if (cpu_exclusive_changed)
update_cpu_domains(cs);
- return 0;
+ if (meter_changed)
+ update_meter(cs);
+ return err;
}

/*
@@ -1184,6 +1371,7 @@ static int attach_task(struct cpuset *cs
nodemask_t from, to;
struct mm_struct *mm;
int retval;
+ unsigned long flags;

if (sscanf(pidbuf, "%d", &pid) != 1)
return -EIO;
@@ -1229,7 +1417,10 @@ static int attach_task(struct cpuset *cs
return -ESRCH;
}
atomic_inc(&cs->count);
+ retval = cpu_ctlr_pre_move_task(tsk, &flags, oldcs, cs);
rcu_assign_pointer(tsk->cpuset, cs);
+ if (retval)
+ cpu_ctlr_post_move_task(tsk, &flags, retval);
task_unlock(tsk);

guarantee_online_cpus(cs, &cpus);
@@ -1271,6 +1462,10 @@ typedef enum {
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
FILE_TASKLIST,
+#ifdef CONFIG_CPUMETER
+ FILE_CPU_METER_ENABLED,
+ FILE_CPU_METER_QUOTA,
+#endif
} cpuset_filetype_t;

static ssize_t cpuset_common_file_write(struct file *file, const char __user *userbuf,
@@ -1340,6 +1535,14 @@ static ssize_t cpuset_common_file_write(
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_CPU_METER_ENABLED:
+ retval = update_flag(CS_METER_FLAG, cs, buffer);
+ break;
+ case FILE_CPU_METER_QUOTA:
+ retval = update_quota(cs, buffer);
+ break;
+#endif
default:
retval = -EINVAL;
goto out2;
@@ -1452,6 +1655,14 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_CPU_METER_ENABLED:
+ *s++ = is_metered(cs) ? '1' : '0';
+ break;
+ case FILE_CPU_METER_QUOTA:
+ s += cpuset_sprintf_quota(s, cs);
+ break;
+#endif
default:
retval = -EINVAL;
goto out;
@@ -1825,6 +2036,19 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};

+#ifdef CONFIG_CPUMETER
+static struct cftype cft_meter_flag = {
+ .name = "cpu_meter_enabled",
+ .private = FILE_CPU_METER_ENABLED,
+};
+
+static struct cftype cft_meter_quota = {
+ .name = "cpu_meter_quota",
+ .private = FILE_CPU_METER_QUOTA,
+};
+
+#endif
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1849,6 +2073,12 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
+#ifdef CONFIG_CPUMETER
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_flag)) < 0)
+ return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_quota)) < 0)
+ return err;
+#endif
return 0;
}

@@ -1866,6 +2096,10 @@ static long cpuset_create(struct cpuset
struct cpuset *cs;
int err;

+ /* metered cpusets can't have grandchildren! */
+ if (parent->parent && is_metered(parent->parent))
+ return -EINVAL;
+
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return -ENOMEM;
@@ -1898,6 +2132,13 @@ static long cpuset_create(struct cpuset
if (err < 0)
goto err;

+ if (is_metered(parent)) {
+ cs->cpus_allowed = parent->cpus_allowed;
+ cs->mems_allowed = parent->mems_allowed;
+ set_bit(CS_METER_FLAG, &cs->flags);
+ update_meter(cs);
+ }
+
/*
* Release manage_mutex before cpuset_populate_dir() because it
* will down() this new directory's i_mutex and if we race with
@@ -1953,6 +2194,10 @@ static int cpuset_rmdir(struct inode *un
mutex_unlock(&manage_mutex);
return -EBUSY;
}
+ if (is_metered(cs)) {
+ clear_bit(CS_METER_FLAG, &cs->flags);
+ update_meter(cs);
+ }
if (is_cpu_exclusive(cs)) {
int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
if (retval < 0) {
@@ -2012,6 +2257,7 @@ int __init cpuset_init(void)
top_cpuset.mems_generation = cpuset_mems_generation++;

init_task.cpuset = &top_cpuset;
+ /* xxx: initialize init_task.cpu_ctlr_data */

err = register_filesystem(&cpuset_fs_type);
if (err < 0)
@@ -2166,9 +2412,14 @@ void cpuset_fork(struct task_struct *chi
void cpuset_exit(struct task_struct *tsk)
{
struct cpuset *cs;
+ int rc;
+ unsigned long flags;

cs = tsk->cpuset;
+ rc = cpu_ctlr_pre_move_task(tsk, &flags, cs, &top_cpuset);
tsk->cpuset = &top_cpuset; /* the_top_cpuset_hack - see above */
+ if (rc)
+ cpu_ctlr_post_move_task(tsk, &flags, rc);

if (notify_on_release(cs)) {
char *pathbuf = NULL;
diff -puN include/linux/cpuset.h~cpuset_interface include/linux/cpuset.h
--- linux-2.6.18/include/linux/cpuset.h~cpuset_interface 2006-09-28 16:42:00.077786224 +0530
+++ linux-2.6.18-root/include/linux/cpuset.h 2006-09-28 16:42:00.110781208 +0530
@@ -63,6 +63,15 @@ static inline int cpuset_do_slab_mem_spr
return current->flags & PF_SPREAD_SLAB;
}

+#ifdef CONFIG_CPUMETER
+void *cpu_ctlr_data(struct cpuset *cs);
+#else
+static inline void *cpu_ctlr_data(struct cpuset *cs)
+{
+ return NULL;
+}
+#endif
+
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init_early(void) { return 0; }
diff -puN kernel/sched.c~cpuset_interface kernel/sched.c
--- linux-2.6.18/kernel/sched.c~cpuset_interface 2006-09-28 16:42:00.084785160 +0530
+++ linux-2.6.18-root/kernel/sched.c 2006-09-28 16:42:00.122779384 +0530
@@ -417,7 +417,12 @@ static inline void finish_lock_switch(st
/* return the task-group to which a task belongs */
static inline struct task_grp *task_grp(struct task_struct *p)
{
- return &init_task_grp;
+ struct task_grp *tg = cpu_ctlr_data(p->cpuset);
+
+ if (!tg)
+ tg = &init_task_grp;
+
+ return tg;
}

/*
@@ -7419,6 +7424,8 @@ int sched_assign_quota(struct task_grp *
tg_root->left_over_pct -= (quota - old_quota);
recalc_dontcare(tg_root);

+ /* xxx: change tsk->load_weight of all tasks belonging to this group */
+
return 0;
}

diff -puN init/main.c~cpuset_interface init/main.c
--- linux-2.6.18/init/main.c~cpuset_interface 2006-09-28 16:42:00.088784552 +0530
+++ linux-2.6.18-root/init/main.c 2006-09-28 16:42:00.124779080 +0530
@@ -482,6 +482,7 @@ asmlinkage void __init start_kernel(void
setup_arch(&command_line);
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+ cpuset_init_early();

/*
* Set up the scheduler prior starting any interrupts (such as the
@@ -545,7 +546,6 @@ asmlinkage void __init start_kernel(void
}
#endif
vfs_caches_init_early();
- cpuset_init_early();
mem_init();
kmem_cache_init();
setup_per_cpu_pageset();
_
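
For reference, the cpuset code above reaches the CPU controller only through
six hooks dereferenced via 'cpu_ctlr': alloc_group, dealloc_group,
assign_quota, get_quota, pre_move_task and post_move_task. A rough sketch of
the ops table it assumes is below; the struct name and layout are illustrative
guesses, the real definition comes from the scheduler patches in this series.

/*
 * Illustrative sketch only.  The six function pointers mirror exactly the
 * calls made through 'cpu_ctlr' in kernel/cpuset.c above; the group data is
 * kept opaque (void *) on the cpuset side.
 */
struct cpu_ctlr_ops {
	void *(*alloc_group)(void *parent_data);
	void (*dealloc_group)(void *grp_data);
	int (*assign_quota)(void *grp_data, int quota);
	int (*get_quota)(void *grp_data);
	int (*pre_move_task)(struct task_struct *tsk, unsigned long *flags,
			     void *old_grp_data, void *new_grp_data);
	void (*post_move_task)(struct task_struct *tsk, unsigned long *flags,
			       int pre_move_rc);
};

extern struct cpu_ctlr_ops *cpu_ctlr;	/* provided by the scheduler patches */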
--
Regards,
vatsa

2006-09-29 03:29:00

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC, PATCH 0/9] CPU Controller V2

Srivatsa wrote:
> - Modified cpuset interface as per Paul Jackson's suggestions.
> Some of the changes are:

thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401