2006-08-20 17:40:55

by Srivatsa Vaddagiri

Subject: [PATCH 0/7] CPU controller - V1

This patch is an attempt to provide a minimal CPU controller for resource
management purposes. Although we have seen several proposals for one before,
none has gained acceptance yet. For example, http://lkml.org/lkml/2006/8/4/137
summarizes the feedback received on the last CPU controller posted by the
Fujitsu folks.

This is V1 of the patch posted at http://lkml.org/lkml/2006/8/4/6
and has been tested against 2.6.18-rc3.

Changes since the last version:

- Basic task-group-aware SMP load balancing, based on the SMP nice
mechanism

Still on my todo list:

- Cleanup so that !CONFIG_CPUMETER has zero impact
- Better timeslice management, so that bursty workloads
can be handled well. I have some ideas (for example: don't
switch the active/expired arrays as soon as the active
array becomes empty) which I would like to experiment
with a bit before sending a patch for this
- Better SMP load balancing. I feel that some feedback
about how much bandwidth a task-group is actually
getting (compared to its quota) could be included in
runqueue pressure calculations.

Salient design points of this patch:

- Each task-group gets its own runqueue on every cpu.

- In addition, there is an active and an expired array of
task-groups themselves. Task-groups that have exhausted their
quota are put into the expired array.

- Task-groups have priorities. The priority of a task-group is the
same as the priority of the highest-priority runnable task it
has. This, I feel, will retain the interactivity of the system
as it is today.

- Scheduling the next task involves picking the highest-priority task-group
from the active array first and then picking the highest-priority task
within it. Both steps are O(1) (see the sketch after this list).

- Tokens are assigned to task-groups based on their assigned
quota. Once a task-group runs out of tokens, it is put
into the expired array. The array switch happens when the
active array is empty.

- SMP load-balancing is accomplished along the lines of smpnice.

- Although the algorithm is very simple, it perhaps needs more
refinement to handle different cases. In particular, I
feel that task-groups which are idle most of the time and
experience bursts once in a while will need to be handled
better than this simple scheme allows.

Ingo/Nick,
I would request your inputs, as before.


--
Regards,
vatsa


2006-08-20 17:42:18

by Srivatsa Vaddagiri

Subject: [PATCH 1/7] CPU controller V1 - split runqueue


This patch splits the single main runqueue into several runqueues on each CPU:
one (main) runqueue is used to hold task-groups, and there is one runqueue for
each task-group in the system.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>



init/Kconfig | 8 +
kernel/sched.c | 254 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 231 insertions(+), 31 deletions(-)

diff -puN kernel/sched.c~base_cpu_ctlr kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~base_cpu_ctlr 2006-08-19 21:58:54.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 21:44:30.000000000 +0530
@@ -191,11 +191,51 @@ static inline unsigned int task_timeslic

struct prio_array {
unsigned int nr_active;
+ int best_static_prio, best_dyn_prio;
DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_PRIO];
};

/*
+ * Each task belong to some task-group. Task-groups and what tasks they contain
+ * are defined by the administrator using some suitable interface.
+ * Administrator can also define the CPU bandwidth provided to each task-group.
+ *
+ * Task-groups are given a certain priority to run on every CPU. Currently
+ * task-group priority on a CPU is defined to be the same as that of
+ * highest-priority runnable task it has on that CPU. Task-groups also
+ * get their own runqueue on every CPU. The main runqueue on each CPU is
+ * used to hold task-groups, rather than tasks.
+ *
+ * Scheduling decision on a CPU is now two-step : first pick highest priority
+ * task-group from the main runqueue and next pick highest priority task from
+ * the runqueue of that group. Both decisions are of O(1) complexity.
+ */
+
+/* runqueue used for every task-group */
+struct task_grp_rq {
+ struct prio_array arrays[2];
+ struct prio_array *active, *expired, *rq_array;
+ unsigned long expired_timestamp;
+ unsigned int ticks; /* Decremented every tick */
+ int prio; /* Priority of the task-group */
+ struct list_head list;
+ struct task_grp *tg;
+};
+
+/* xxx: Avoid this for !CONFIG_CPU_METER */
+static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);
+
+/* task-group object - maintains information about each task-group */
+struct task_grp {
+ int ticks; /* bandwidth given to the task-group */
+ struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
+};
+
+/* The "default" task-group */
+struct task_grp init_task_grp;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -224,12 +264,11 @@ struct rq {
*/
unsigned long nr_uninterruptible;

- unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
struct task_struct *curr, *idle;
struct mm_struct *prev_mm;
+ /* these arrays hold task-groups */
struct prio_array *active, *expired, arrays[2];
- int best_expired_prio;
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -244,6 +283,7 @@ struct rq {
#endif

#ifdef CONFIG_SCHEDSTATS
+ /* xxx: move these to task-group runqueue */
/* latency stats */
struct sched_info rq_sched_info;

@@ -657,6 +697,63 @@ sched_info_switch(struct task_struct *pr
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+static unsigned int task_grp_timeslice(struct task_grp *tg)
+{
+ /* xxx: take into account sleep_avg etc of the group */
+ return tg->ticks;
+}
+
+/* Dequeue task-group object from the main per-cpu runqueue */
+static void dequeue_task_grp(struct task_grp_rq *tgrq)
+{
+ struct prio_array *array = tgrq->rq_array;
+
+ array->nr_active--;
+ list_del(&tgrq->list);
+ if (list_empty(array->queue + tgrq->prio))
+ __clear_bit(tgrq->prio, array->bitmap);
+ tgrq->rq_array = NULL;
+}
+
+/* Enqueue task-group object on the main per-cpu runqueue */
+static void enqueue_task_grp(struct task_grp_rq *tgrq, struct prio_array *array,
+ int head)
+{
+ if (!head)
+ list_add_tail(&tgrq->list, array->queue + tgrq->prio);
+ else
+ list_add(&tgrq->list, array->queue + tgrq->prio);
+ __set_bit(tgrq->prio, array->bitmap);
+ array->nr_active++;
+ tgrq->rq_array = array;
+}
+
+/* return the task-group to which a task belongs */
+static inline struct task_grp *task_grp(struct task_struct *p)
+{
+ return &init_task_grp;
+}
+
+static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,
+ int head)
+{
+ int new_prio;
+ struct prio_array *array = tgrq->rq_array;
+
+ new_prio = tgrq->active->best_dyn_prio;
+ if (new_prio == MAX_PRIO)
+ new_prio = tgrq->expired->best_dyn_prio;
+
+ if (array)
+ dequeue_task_grp(tgrq);
+ tgrq->prio = new_prio;
+ if (new_prio != MAX_PRIO) {
+ if (!array)
+ array = rq->active;
+ enqueue_task_grp(tgrq, array, head);
+ }
+}
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -664,8 +761,17 @@ static void dequeue_task(struct task_str
{
array->nr_active--;
list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
+ if (list_empty(array->queue + p->prio)) {
__clear_bit(p->prio, array->bitmap);
+ if (p->prio == array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio =
+ sched_find_first_bit(array->bitmap);
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
+ }
}

static void enqueue_task(struct task_struct *p, struct prio_array *array)
@@ -675,6 +781,14 @@ static void enqueue_task(struct task_str
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
}

/*
@@ -693,6 +807,14 @@ enqueue_task_head(struct task_struct *p,
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 1);
+ }
}

/*
@@ -831,10 +953,11 @@ static int effective_prio(struct task_st
*/
static void __activate_task(struct task_struct *p, struct rq *rq)
{
- struct prio_array *target = rq->active;
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+ struct prio_array *target = tgrq->active;

if (batch_task(p))
- target = rq->expired;
+ target = tgrq->expired;
enqueue_task(p, target);
inc_nr_running(p, rq);
}
@@ -844,7 +967,10 @@ static void __activate_task(struct task_
*/
static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
{
- enqueue_task_head(p, rq->active);
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+ struct prio_array *target = tgrq->active;
+
+ enqueue_task_head(p, target);
inc_nr_running(p, rq);
}

@@ -2077,7 +2203,7 @@ int can_migrate_task(struct task_struct
return 1;
}

-#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
+#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)

/*
* move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
@@ -2884,6 +3010,8 @@ unsigned long long current_sched_time(co
return ns;
}

+#define nr_tasks(tgrq) (tgrq->active->nr_active + tgrq->expired->nr_active)
+
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -2894,13 +3022,13 @@ unsigned long long current_sched_time(co
* increasing number of running tasks. We also ignore the interactivity
* if a better static_prio task has expired:
*/
-static inline int expired_starving(struct rq *rq)
+static inline int expired_starving(struct task_grp_rq *rq)
{
- if (rq->curr->static_prio > rq->best_expired_prio)
+ if (current->static_prio > rq->expired->best_static_prio)
return 1;
if (!STARVATION_LIMIT || !rq->expired_timestamp)
return 0;
- if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
+ if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * nr_tasks(rq))
return 1;
return 0;
}
@@ -2991,6 +3119,7 @@ void scheduler_tick(void)
struct task_struct *p = current;
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
+ struct task_grp_rq *tgrq = task_grp(p)->rq[cpu];

update_cpu_clock(p, rq, now);

@@ -3004,7 +3133,7 @@ void scheduler_tick(void)
}

/* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
+ if (p->array != tgrq->active) {
set_tsk_need_resched(p);
goto out;
}
@@ -3027,25 +3156,26 @@ void scheduler_tick(void)
set_tsk_need_resched(p);

/* put it at the end of the queue: */
- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
}
goto out_unlock;
}
if (!--p->time_slice) {
- dequeue_task(p, rq->active);
+ dequeue_task(p, tgrq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
- enqueue_task(p, rq->expired);
- if (p->static_prio < rq->best_expired_prio)
- rq->best_expired_prio = p->static_prio;
+ if (!tgrq->expired_timestamp)
+ tgrq->expired_timestamp = jiffies;
+ if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
+ enqueue_task(p, tgrq->expired);
+ if (p->static_prio < tgrq->expired->best_static_prio)
+ tgrq->expired->best_static_prio =
+ p->static_prio;
} else
- enqueue_task(p, rq->active);
+ enqueue_task(p, tgrq->active);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3066,12 +3196,21 @@ void scheduler_tick(void)
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
+ (p->array == tgrq->active)) {

- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
set_tsk_need_resched(p);
}
}
+
+ if (!--tgrq->ticks) {
+ /* Move the task group to expired list */
+ dequeue_task_grp(tgrq);
+ tgrq->ticks = task_grp_timeslice(task_grp(p));
+ enqueue_task_grp(tgrq, rq->expired, 0);
+ set_tsk_need_resched(p);
+ }
+
out_unlock:
spin_unlock(&rq->lock);
out:
@@ -3264,6 +3403,7 @@ asmlinkage void __sched schedule(void)
int cpu, idx, new_prio;
long *switch_count;
struct rq *rq;
+ struct task_grp_rq *next_grp;

/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -3332,23 +3472,41 @@ need_resched_nonpreemptible:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
- rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu);
goto switch_tasks;
}
}

+ /* Pick a task group first */
+#ifdef CONFIG_CPUMETER
array = rq->active;
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
- schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
- rq->expired_timestamp = 0;
- rq->best_expired_prio = MAX_PRIO;
+ }
+ idx = sched_find_first_bit(array->bitmap);
+ queue = array->queue + idx;
+ next_grp = list_entry(queue->next, struct task_grp_rq, list);
+#else
+ next_grp = init_task_grp.rq[cpu];
+#endif
+
+ /* Pick a task within that group next */
+ array = next_grp->active;
+ if (!array->nr_active) {
+ /*
+ * Switch the active and expired arrays.
+ */
+ schedstat_inc(next_grp, sched_switch);
+ next_grp->active = next_grp->expired;
+ next_grp->expired = array;
+ array = next_grp->active;
+ next_grp->expired_timestamp = 0;
+ next_grp->expired->best_static_prio = MAX_PRIO;
}

idx = sched_find_first_bit(array->bitmap);
@@ -4413,7 +4571,8 @@ asmlinkage long sys_sched_getaffinity(pi
asmlinkage long sys_sched_yield(void)
{
struct rq *rq = this_rq_lock();
- struct prio_array *array = current->array, *target = rq->expired;
+ struct task_grp_rq *tgrq = task_grp(current)->rq[task_cpu(current)];
+ struct prio_array *array = current->array, *target = tgrq->expired;

schedstat_inc(rq, yld_cnt);
/*
@@ -4424,13 +4583,13 @@ asmlinkage long sys_sched_yield(void)
* array.)
*/
if (rt_task(current))
- target = rq->active;
+ target = tgrq->active;

if (array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);
- if (!rq->expired->nr_active)
+ if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_both_empty);
- } else if (!rq->expired->nr_active)
+ } else if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_exp_empty);

if (array != target) {
@@ -6722,21 +6881,54 @@ int in_sched_functions(unsigned long add
&& addr < (unsigned long)__sched_text_end);
}

+static void task_grp_rq_init(struct task_grp_rq *tgrq, struct task_grp *tg)
+{
+ int j, k;
+
+ tgrq->active = tgrq->arrays;
+ tgrq->expired = tgrq->arrays + 1;
+ tgrq->rq_array = NULL;
+ tgrq->expired->best_static_prio = MAX_PRIO;
+ tgrq->expired->best_dyn_prio = MAX_PRIO;
+ tgrq->active->best_static_prio = MAX_PRIO;
+ tgrq->active->best_dyn_prio = MAX_PRIO;
+ tgrq->prio = MAX_PRIO;
+ tgrq->ticks = tg->ticks;
+ tgrq->tg = tg;
+ INIT_LIST_HEAD(&tgrq->list);
+
+ for (j = 0; j < 2; j++) {
+ struct prio_array *array;
+
+ array = tgrq->arrays + j;
+ for (k = 0; k < MAX_PRIO; k++) {
+ INIT_LIST_HEAD(array->queue + k);
+ __clear_bit(k, array->bitmap);
+ }
+ // delimiter for bitsearch
+ __set_bit(MAX_PRIO, array->bitmap);
+ }
+}
+
void __init sched_init(void)
{
int i, j, k;

+ init_task_grp.ticks = -1; /* Unlimited bandwidth */
+
for_each_possible_cpu(i) {
struct prio_array *array;
struct rq *rq;
+ struct task_grp_rq *tgrq;

rq = cpu_rq(i);
+ tgrq = init_task_grp.rq[i] = &per_cpu(init_tg_rq, i);
spin_lock_init(&rq->lock);
+ task_grp_rq_init(tgrq, &init_task_grp);
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;

#ifdef CONFIG_SMP
rq->sd = NULL;
diff -puN init/Kconfig~base_cpu_ctlr init/Kconfig
--- linux-2.6.18-rc3/init/Kconfig~base_cpu_ctlr 2006-08-19 21:58:54.000000000 +0530
+++ linux-2.6.18-rc3-root/init/Kconfig 2006-08-19 21:58:54.000000000 +0530
@@ -248,6 +248,14 @@ config CPUSETS

Say N if unsure.

+config CPUMETER
+ bool "CPU resource control"
+ depends on CPUSETS
+ help
+ This options lets you create cpu resource partitions within
+ cpusets. Each resource partition can be given a different quota
+ of CPU usage.
+
config RELAY
bool "Kernel->user space relay support (formerly relayfs)"
help

_
--
Regards,
vatsa

2006-08-20 17:43:25

by Srivatsa Vaddagiri

Subject: [PATCH 2/7] CPU controller V1 - define group operations


Define these operations for a task-group (a small quota-arithmetic sketch
follows the list):

- create a new group
- destroy an existing group
- assign bandwidth (quota) to a group
- get the bandwidth (quota) of a group
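
As a quick illustration of the quota <-> ticks arithmetic used by
sched_assign_quota() and sched_get_quota() in the patch below, here is a
minimal standalone sketch (HZ is assumed to be 250 here; the real value is a
kernel configuration choice):

#include <stdio.h>

#define HZ 250			/* assumed tick rate; config-dependent in a real kernel */
#define CTL_WINDOW (5 * HZ)	/* bandwidth is metered in ticks over a 5*HZ window */

/* Mirrors sched_assign_quota(): percent -> ticks per control window. */
static int quota_to_ticks(int quota_pct)
{
	return (quota_pct * CTL_WINDOW) / 100;
}

/* Mirrors cpu_quota(): ticks -> percent, with -1 meaning "unlimited". */
static int ticks_to_quota(int ticks)
{
	return (ticks == -1) ? 100 : (ticks * 100) / CTL_WINDOW;
}

int main(void)
{
	int ticks = quota_to_ticks(40);

	printf("40%% quota -> %d ticks per %d-tick window -> reads back as %d%%\n",
	       ticks, CTL_WINDOW, ticks_to_quota(ticks));
	return 0;
}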


Signed-off-by : Srivatsa Vaddagiri <[email protected]>



include/linux/sched.h | 13 +++++++
kernel/sched.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 98 insertions(+)

diff -puN kernel/sched.c~cpu_ctlr_grp_ops kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_grp_ops 2006-08-20 21:54:17.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 21:55:28.000000000 +0530
@@ -7066,3 +7066,88 @@ void set_curr_task(int cpu, struct task_
}

#endif
+
+#ifdef CONFIG_CPUMETER
+
+/* Allocate runqueue structures for the new task-group */
+void *sched_alloc_group(void)
+{
+ struct task_grp *tg;
+ struct task_grp_rq *tgrq;
+ int i;
+
+ tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ if (!tg)
+ return NULL;
+
+ tg->ticks = -1; /* No limit */
+
+ for_each_possible_cpu(i) {
+ tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
+ if (!tgrq)
+ goto oom;
+ tg->rq[i] = tgrq;
+ task_grp_rq_init(tgrq, tg);
+ }
+
+ return tg;
+oom:
+ while (i--)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+ return NULL;
+}
+
+/* Deallocate runqueue structures */
+void sched_dealloc_group(struct task_grp *tg)
+{
+ int i;
+
+ for_each_possible_cpu(i)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+}
+
+/* Assign quota to this group */
+void sched_assign_quota(struct task_grp *tg, int quota)
+{
+ int i;
+
+ /* xxx: check validity of quota */
+ tg->ticks = (quota * 5 * HZ) / 100;
+
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+
+}
+
+static inline int cpu_quota(struct task_grp *tg)
+{
+ int val;
+
+ if (tg->ticks == -1)
+ val = 100;
+ else
+ val = (tg->ticks * 100) / (5 * HZ);
+
+ return val;
+}
+
+/* Return assigned quota for this group */
+int sched_get_quota(struct task_grp *tg)
+{
+ return cpu_quota(tg);
+}
+
+static struct task_grp_ops sched_grp_ops = {
+ .alloc_group = sched_alloc_group,
+ .dealloc_group = sched_dealloc_group,
+ .assign_quota = sched_assign_quota,
+ .get_quota = sched_get_quota,
+};
+
+struct task_grp_ops *cpu_ctlr = &sched_grp_ops;
+
+#endif /* CONFIG_CPUMETER */
diff -puN include/linux/sched.h~cpu_ctlr_grp_ops include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_grp_ops 2006-08-20 21:54:17.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-20 21:54:17.000000000 +0530
@@ -1604,6 +1604,19 @@ static inline void thaw_processes(void)
static inline int try_to_freeze(void) { return 0; }

#endif /* CONFIG_PM */
+
+#ifdef CONFIG_CPUMETER
+struct task_grp;
+struct task_grp_ops {
+ void *(*alloc_group)(void);
+ void (*dealloc_group)(struct task_grp *grp);
+ void (*assign_quota)(struct task_grp *grp, int quota);
+ int (*get_quota)(struct task_grp *grp);
+};
+
+extern struct task_grp_ops *cpu_ctlr;
+#endif
+
#endif /* __KERNEL__ */

#endif

_
--
Regards,
vatsa

2006-08-20 17:45:12

by Srivatsa Vaddagiri

Subject: [PATCH 3/7] CPU controller V1 - deal with movement of tasks


When a task moves between groups (as initiated by an administrator), it has to
be removed from the runqueue of its old group and added to the runqueue of its
new group. This patch defines that move operation.
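
To make the two-step protocol concrete, here is a minimal standalone mock of
the pre/post sequence. The mock_task type and the printed messages are
illustrative stand-ins only; the real steps are sched_pre_move_task() and
sched_post_move_task() in the patch below.

#include <stdio.h>

/* Illustrative stand-in, not the kernel's task_struct. */
struct mock_task {
	int queued;		/* stands in for tsk->array != NULL */
	const char *group;	/* stands in for the task's group pointer */
};

/* Step 1: called while the task still points at its old group. */
static int pre_move(struct mock_task *t)
{
	if (t->queued) {
		printf("dequeue task from group '%s'\n", t->group);
		t->queued = 0;
		return 1;	/* tell the caller that step 2 is needed */
	}
	return 0;
}

/* Step 2: called after the caller has switched the group pointer. */
static void post_move(struct mock_task *t)
{
	t->queued = 1;
	printf("enqueue task on group '%s'\n", t->group);
}

int main(void)
{
	struct mock_task t = { 1, "old_grp" };
	int requeue = pre_move(&t);	/* group pointer still refers to old_grp */

	t.group = "new_grp";		/* the administrator's move happens here */
	if (requeue)
		post_move(&t);		/* task reappears on the new group's runqueue */
	return 0;
}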

Signed-off-by : Srivatsa Vaddagiri <[email protected]>



include/linux/sched.h | 3 ++
kernel/sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)

diff -puN kernel/sched.c~cpu_ctlr_move_task kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_move_task 2006-08-20 21:56:09.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 21:57:36.000000000 +0530
@@ -7141,10 +7141,73 @@ int sched_get_quota(struct task_grp *tg)
return cpu_quota(tg);
}

+/*
+ * Move a task from one group to another. If the task is already on a
+ * runqueue, this involves removing the task from its old group's runqueue
+ * and adding to its new group's runqueue.
+ *
+ * This however is slightly tricky, given the facts that:
+ *
+ * - some pointer in the task struct (ex: cpuset) represents the group
+ * to which it belongs.
+ * - At any give point during the move operation, the pointer either
+ * points to the old group or to the new group, but not both!
+ * - dequeue_task/enqueue_task rely on this pointer to know which
+ * task_grp_rq the task is to be removed from/added to.
+ *
+ * Hence the dequeue_task/enqueue_task is accomplished in two steps.
+ *
+ * 1. In first step, sched_pre_move_task() is called with the group
+ * pointer set to the old group to which the task belonged.
+ * If the task was on a runqueue, sched_pre_move_task() merely
+ * removes it from the runqueue and returns 1 back to its
+ * caller indicating that step 2 is necessary.
+ *
+ * 2. In second step, sched_post_move_task() is called with the group
+ * pointer set to the new group to which the task belongs.
+ * sched_post_move_task() merely invokes __activate_task() to add the
+ * task in its new runqueue.
+ *
+ */
+
+static unsigned long irq_flags;
+
+int sched_pre_move_task(struct task_struct *tsk, struct task_grp *tg_old,
+ struct task_grp *tg_new)
+{
+ int activate = 0;
+ struct rq *rq;
+
+ if (tg_new == tg_old)
+ return 0;
+
+ rq = task_rq_lock(tsk, &irq_flags);
+
+ if (tsk->array) {
+ deactivate_task(tsk, rq);
+ activate = 1;
+ } else
+ task_rq_unlock(rq, &irq_flags);
+
+ return activate;
+}
+
+/* called with rq lock held */
+void sched_post_move_task(struct task_struct *tsk)
+{
+ struct rq *rq = task_rq(tsk);
+
+ __activate_task(tsk, rq);
+
+ task_rq_unlock(rq, &irq_flags);
+}
+
static struct task_grp_ops sched_grp_ops = {
.alloc_group = sched_alloc_group,
.dealloc_group = sched_dealloc_group,
.assign_quota = sched_assign_quota,
+ .pre_move_task = sched_pre_move_task,
+ .post_move_task = sched_post_move_task,
.get_quota = sched_get_quota,
};

diff -puN include/linux/sched.h~cpu_ctlr_move_task include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_move_task 2006-08-20 21:56:09.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-20 21:56:09.000000000 +0530
@@ -1611,6 +1611,9 @@ struct task_grp_ops {
void *(*alloc_group)(void);
void (*dealloc_group)(struct task_grp *grp);
void (*assign_quota)(struct task_grp *grp, int quota);
+ int (*pre_move_task)(struct task_struct *tsk, struct task_grp *old,
+ struct task_grp *new);
+ void (*post_move_task)(struct task_struct *tsk);
int (*get_quota)(struct task_grp *grp);
};


_
--
Regards,
vatsa

2006-08-20 17:45:51

by Srivatsa Vaddagiri

Subject: [PATCH 4/7] CPU controller V1 - Handle dont care groups


Deal with task-groups whose bandwidth hasn't been explicitly set by the
administrator. Unallocated CPU bandwidth is equally distributed among such
"don't care" groups.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


include/linux/sched.h | 2 -
kernel/sched.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 80 insertions(+), 4 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_handle_dont_cares kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_handle_dont_cares 2006-08-20 21:58:30.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 21:59:19.000000000 +0530
@@ -229,6 +229,12 @@ static DEFINE_PER_CPU(struct task_grp_rq
/* task-group object - maintains information about each task-group */
struct task_grp {
int ticks; /* bandwidth given to the task-group */
+ int left_over_pct;
+ int total_dont_care_grps;
+ int dont_care; /* Does this group care for its bandwidth ? */
+ struct task_grp *parent;
+ struct list_head dont_care_list;
+ struct list_head list;
struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
};

@@ -6915,6 +6921,12 @@ void __init sched_init(void)
int i, j, k;

init_task_grp.ticks = -1; /* Unlimited bandwidth */
+ init_task_grp.left_over_pct = 100; /* 100% unallocated bandwidth */
+ init_task_grp.parent = NULL;
+ init_task_grp.total_dont_care_grps = 1; /* init_task_grp itself */
+ init_task_grp.dont_care = 1;
+ INIT_LIST_HEAD(&init_task_grp.dont_care_list);
+ list_add_tail(&init_task_grp.list, &init_task_grp.dont_care_list);

for_each_possible_cpu(i) {
struct prio_array *array;
@@ -7069,8 +7081,31 @@ void set_curr_task(int cpu, struct task_

#ifdef CONFIG_CPUMETER

+/* Distribute left over bandwidth equally to all "dont care" task groups */
+static void recalc_dontcare(struct task_grp *tg_root)
+{
+ int ticks;
+ struct list_head *entry;
+
+ if (!tg_root->total_dont_care_grps)
+ return;
+
+ ticks = ((tg_root->left_over_pct /
+ tg_root->total_dont_care_grps) * 5 * HZ) / 100;
+
+ list_for_each(entry, &tg_root->dont_care_list) {
+ struct task_grp *tg;
+ int i;
+
+ tg = list_entry(entry, struct task_grp, list);
+ tg->ticks = ticks;
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+ }
+}
+
/* Allocate runqueue structures for the new task-group */
-void *sched_alloc_group(void)
+void *sched_alloc_group(struct task_grp *tg_parent)
{
struct task_grp *tg;
struct task_grp_rq *tgrq;
@@ -7081,6 +7116,11 @@ void *sched_alloc_group(void)
return NULL;

tg->ticks = -1; /* No limit */
+ tg->parent = tg_parent;
+ tg->dont_care = 1;
+ tg->left_over_pct = 100;
+ tg->ticks = -1; /* No limit */
+ INIT_LIST_HEAD(&tg->dont_care_list);

for_each_possible_cpu(i) {
tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
@@ -7090,6 +7130,15 @@ void *sched_alloc_group(void)
task_grp_rq_init(tgrq, tg);
}

+ if (tg->parent) {
+ tg->parent->total_dont_care_grps++;
+ list_add_tail(&tg->list, &tg->parent->dont_care_list);
+ recalc_dontcare(tg->parent);
+ } else {
+ tg->total_dont_care_grps = 1;
+ list_add_tail(&tg->list, &tg->dont_care_list);
+ }
+
return tg;
oom:
while (i--)
@@ -7103,6 +7152,16 @@ oom:
void sched_dealloc_group(struct task_grp *tg)
{
int i;
+ struct task_grp *tg_root = tg->parent;
+
+ if (!tg_root)
+ tg_root = tg;
+
+ if (tg->dont_care) {
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ recalc_dontcare(tg_root);
+ }

for_each_possible_cpu(i)
kfree(tg->rq[i]);
@@ -7113,7 +7172,20 @@ void sched_dealloc_group(struct task_grp
/* Assign quota to this group */
void sched_assign_quota(struct task_grp *tg, int quota)
{
- int i;
+ int i, old_quota = 0;
+ struct task_grp *tg_root = tg->parent;
+
+ if (!tg_root)
+ tg_root = tg;
+
+ if (tg->dont_care) {
+ tg->dont_care = 0;
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ } else
+ old_quota = (tg->ticks * 100) / (5 * HZ);
+
+ tg_root->left_over_pct -= (quota - old_quota);

/* xxx: check validity of quota */
tg->ticks = (quota * 5 * HZ) / 100;
@@ -7121,6 +7193,7 @@ void sched_assign_quota(struct task_grp
for_each_possible_cpu(i)
tg->rq[i]->ticks = tg->ticks;

+ recalc_dontcare(tg_root);
}

static inline int cpu_quota(struct task_grp *tg)
@@ -7138,7 +7211,10 @@ static inline int cpu_quota(struct task_
/* Return assigned quota for this group */
int sched_get_quota(struct task_grp *tg)
{
- return cpu_quota(tg);
+ if (tg->dont_care)
+ return 0;
+ else
+ return cpu_quota(tg);
}

/*
diff -puN include/linux/sched.h~cpu_ctlr_handle_dont_cares include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_handle_dont_cares 2006-08-20 21:58:30.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-20 21:58:30.000000000 +0530
@@ -1608,7 +1608,7 @@ static inline int try_to_freeze(void) {
#ifdef CONFIG_CPUMETER
struct task_grp;
struct task_grp_ops {
- void *(*alloc_group)(void);
+ void *(*alloc_group)(struct task_grp *grp_parent);
void (*dealloc_group)(struct task_grp *grp);
void (*assign_quota)(struct task_grp *grp, int quota);
int (*pre_move_task)(struct task_struct *tsk, struct task_grp *old,

_
--
Regards,
vatsa

2006-08-20 17:47:54

by Srivatsa Vaddagiri

Subject: [PATCH 6/7] CPU controller V1 - task_cpu(p) needs to be correct always


We rely very much on task_cpu(p) being correct at all times, so that we can
correctly find the task_grp_rq from which the task has to be removed or to
which it has to be added.

There is, however, one place in the scheduler where this assumption about
task_cpu(p) being correct is broken. This patch fixes that piece of code.

(Thanks to Balbir Singh for pointing this out to me)

Signed-off-by : Srivatsa Vaddagiri <[email protected]>




kernel/sched.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_setcpu kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_setcpu 2006-08-20 22:09:31.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 22:09:31.000000000 +0530
@@ -5129,6 +5129,7 @@ static int __migrate_task(struct task_st
{
struct rq *rq_dest, *rq_src;
int ret = 0;
+ struct prio_array *array;

if (unlikely(cpu_is_offline(dest_cpu)))
return ret;
@@ -5144,8 +5145,8 @@ static int __migrate_task(struct task_st
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

- set_task_cpu(p, dest_cpu);
- if (p->array) {
+ array = p->array;
+ if (array) {
/*
* Sync timestamp with rq_dest's before activating.
* The same thing could be achieved by doing this step
@@ -5155,6 +5156,11 @@ static int __migrate_task(struct task_st
p->timestamp = p->timestamp - rq_src->timestamp_last_tick
+ rq_dest->timestamp_last_tick;
deactivate_task(p, rq_src);
+ }
+
+ set_task_cpu(p, dest_cpu);
+
+ if (array) {
__activate_task(p, rq_dest);
if (TASK_PREEMPTS_CURR(p, rq_dest))
resched_task(rq_dest->curr);

_
--
Regards,
vatsa

2006-08-20 17:47:31

by Srivatsa Vaddagiri

Subject: [PATCH 5/7] CPU controller V1 - Extend smpnice to be task-group aware


This patch extends the smpnice mechanism to be aware of task-groups and the
quota given to each task-group.
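
A minimal standalone sketch of the load-weight scaling this patch introduces
via TASK_GROUP_QUOTA: a task's smpnice load weight is scaled by its group's
quota, which after macro expansion works out to weight * quota / 100. The base
weight of 1024 below is an assumed stand-in for
PRIO_TO_LOAD_WEIGHT(p->static_prio).

#include <stdio.h>

/* Mirrors the effective computation: load weight scaled by group quota (%). */
static unsigned long group_scaled_weight(unsigned long base_weight, int quota_pct)
{
	return base_weight * quota_pct / 100;
}

int main(void)
{
	unsigned long base = 1024;	/* assumed stand-in for PRIO_TO_LOAD_WEIGHT() */

	printf("unlimited group (100%%): %lu\n", group_scaled_weight(base, 100));
	printf("40%% quota group:        %lu\n", group_scaled_weight(base, 40));
	return 0;
}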

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


kernel/sched.c | 127 +++++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 96 insertions(+), 31 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_smp_nice kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_smp_nice 2006-08-20 22:03:42.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 22:03:42.000000000 +0530
@@ -874,6 +874,25 @@ static inline int __normal_prio(struct t
#define RTPRIO_TO_LOAD_WEIGHT(rp) \
(PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp))

+#ifdef CONFIG_CPUMETER
+
+static inline int cpu_quota(struct task_grp *tg)
+{
+ int val;
+
+ if (tg->ticks == -1)
+ val = 100;
+ else
+ val = (tg->ticks * 100) / (5 * HZ);
+
+ return val;
+}
+
+#define TASK_GROUP_QUOTA(p) cpu_quota(task_grp(p)) / 100
+#else
+#define TASK_GROUP_QUOTA(p) 1
+#endif
+
static void set_load_weight(struct task_struct *p)
{
if (has_rt_policy(p)) {
@@ -887,9 +906,11 @@ static void set_load_weight(struct task_
p->load_weight = 0;
else
#endif
- p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
+ p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority)
+ * TASK_GROUP_QUOTA(p);
} else
- p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+ p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio)
+ * TASK_GROUP_QUOTA(p);
}

static inline void
@@ -2209,7 +2230,8 @@ int can_migrate_task(struct task_struct
return 1;
}

-#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)
+#define rq_best_prio(rq) min((rq)->active->best_dyn_prio, \
+ (rq)->expired->best_static_prio)

/*
* move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
@@ -2218,17 +2240,17 @@ int can_migrate_task(struct task_struct
*
* Called with both runqueues locked.
*/
-static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
- unsigned long max_nr_move, unsigned long max_load_move,
- struct sched_domain *sd, enum idle_type idle,
- int *all_pinned)
+static int __move_tasks(struct task_grp_rq *this_rq, int this_cpu,
+ struct task_grp_rq *busiest, unsigned long max_nr_move,
+ unsigned long max_load_move, struct sched_domain *sd,
+ enum idle_type idle, int *all_pinned, long *load_moved)
{
int idx, pulled = 0, pinned = 0, this_best_prio, best_prio,
- best_prio_seen, skip_for_load;
+ best_prio_seen = 0, skip_for_load;
struct prio_array *array, *dst_array;
struct list_head *head, *curr;
struct task_struct *tmp;
- long rem_load_move;
+ long rem_load_move = 0;

if (max_nr_move == 0 || max_load_move == 0)
goto out;
@@ -2237,14 +2259,6 @@ static int move_tasks(struct rq *this_rq
pinned = 1;
this_best_prio = rq_best_prio(this_rq);
best_prio = rq_best_prio(busiest);
- /*
- * Enable handling of the case where there is more than one task
- * with the best priority. If the current running task is one
- * of those with prio==best_prio we know it won't be moved
- * and therefore it's safe to override the skip (based on load) of
- * any task we find with that prio.
- */
- best_prio_seen = best_prio == busiest->curr->prio;

/*
* We first consider expired tasks. Those will likely not be
@@ -2293,7 +2307,7 @@ skip_queue:
if (skip_for_load && idx < this_best_prio)
skip_for_load = !best_prio_seen && idx == best_prio;
if (skip_for_load ||
- !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
+ !can_migrate_task(tmp, task_rq(tmp), this_cpu, sd, idle, &pinned)) {

best_prio_seen |= idx == best_prio;
if (curr != head)
@@ -2307,7 +2321,8 @@ skip_queue:
schedstat_inc(sd, lb_hot_gained[idle]);
#endif

- pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
+ pull_task(task_rq(tmp), array, tmp, cpu_rq(this_cpu), dst_array,
+ this_cpu);
pulled++;
rem_load_move -= tmp->load_weight;

@@ -2333,9 +2348,70 @@ out:

if (all_pinned)
*all_pinned = pinned;
+ *load_moved = max_load_move - rem_load_move;
return pulled;
}

+static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_nr_move, unsigned long max_load_move,
+ struct sched_domain *sd, enum idle_type idle,
+ int *all_pinned)
+{
+ int idx;
+ long load_moved;
+ unsigned long total_nr_moved = 0, nr_moved;
+ struct prio_array *array;
+ struct task_grp_rq *busy_q, *this_q;
+ struct list_head *head, *curr;
+
+ if (busiest->expired->nr_active)
+ array = busiest->expired;
+ else
+ array = busiest->active;
+
+new_array:
+ /* Start searching at priority 0: */
+ idx = 0;
+skip_bitmap:
+ if (!idx)
+ idx = sched_find_first_bit(array->bitmap);
+ else
+ idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
+ if (idx >= MAX_PRIO) {
+ if (array == busiest->expired && busiest->active->nr_active) {
+ array = busiest->active;
+ goto new_array;
+ }
+ goto out;
+ }
+
+ head = array->queue + idx;
+ curr = head->prev;
+skip_queue:
+ busy_q = list_entry(curr, struct task_grp_rq, list);
+ this_q = busy_q->tg->rq[this_cpu];
+
+ curr = curr->prev;
+
+ nr_moved = __move_tasks(this_q, this_cpu, busy_q, max_nr_move,
+ max_load_move, sd, idle, all_pinned, &load_moved);
+
+ total_nr_moved += nr_moved;
+ max_nr_move -= nr_moved;
+ max_load_move -= load_moved;
+
+ BUG_ON(max_load_move < 0);
+ BUG_ON(max_nr_move < 0);
+
+ if (curr != head)
+ goto skip_queue;
+ idx++;
+ goto skip_bitmap;
+
+out:
+ return total_nr_moved;
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the amount of weighted load which
@@ -7196,18 +7272,6 @@ void sched_assign_quota(struct task_grp
recalc_dontcare(tg_root);
}

-static inline int cpu_quota(struct task_grp *tg)
-{
- int val;
-
- if (tg->ticks == -1)
- val = 100;
- else
- val = (tg->ticks * 100) / (5 * HZ);
-
- return val;
-}
-
/* Return assigned quota for this group */
int sched_get_quota(struct task_grp *tg)
{
@@ -7273,6 +7337,7 @@ void sched_post_move_task(struct task_st
{
struct rq *rq = task_rq(tsk);

+ set_load_weight(tsk);
__activate_task(tsk, rq);

task_rq_unlock(rq, &irq_flags);

_
--
Regards,
vatsa

2006-08-20 17:49:25

by Srivatsa Vaddagiri

Subject: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface


A controller cannot live in isolation. *Somebody* needs to tell it about
events like creation of a new task-group, destruction of an existing group,
assignment of bandwidth and, finally, movement of tasks between task-groups.

We also need *some* interface for grouping tasks and assigning bandwidth
to the task-groups.

cpusets is chosen as this interface for now.

===============================================================================

DISCLAIMER:

- This by no means is any recommendation for cpuset to be chosen as
resource management interface!

- The cpuset interface is chosen for now because it represents a
task-grouping mechanism which is already in mainline and hence
makes the review of the rest of the scheduler changes easy.

- The CPU controller can easily be hooked up to some other interface,
like Resource Groups or User-bean-counter, should that be decided
down the line.

===============================================================================

This patch makes some extensions to cpusets, as below:

- Every cpuset directory has two new files - meter_cpu, cpu_quota
- Writing '1' into 'meter_cpu' marks the cpuset as "metered", i.e.
tasks in this cpuset and its child cpusets will have their CPU
usage controlled by the controller as per the quota settings
mentioned in 'cpu_quota'.
- Writing a number between 0 and 100 into the 'cpu_quota' file of a metered
cpuset assigns the corresponding bandwidth to all tasks in the
cpuset (task-group).
- Only cpusets marked as "exclusive" and not having any child cpusets
can be metered.
- Metered cpusets cannot have grand-children!

Metered cpusets were discussed elaborately at:

http://lkml.org/lkml/2005/9/26/58

This patch was inspired by those discussions and is a subset of the features
provided in the earlier patch.

As an example, follow these steps to create metered cpusets:


# cd /dev
# mkdir cpuset
# mount -t cpuset cpuset cpuset
# cd cpuset
# mkdir grp_a
# cd grp_a
# /bin/echo "6-7" > cpus # assign CPUs 6,7 for this cpuset
# /bin/echo 0 > mems # assign node 0 for this cpuset
# /bin/echo 1 > cpu_exclusive
# /bin/echo 1 > meter_cpu

# mkdir very_imp_grp
# Assign 80% bandwidth to this group
# /bin/echo 80 > very_imp_grp/cpu_quota

# echo $apache_webserver_pid > very_imp_grp/tasks

# mkdir less_imp_grp
# Assign 5% bandwidth to this group
# /bin/echo 5 > less_imp_grp/cpu_quota

# echo $mozilla_browser_pid > less_imp_grp/tasks



Signed-off-by : Srivatsa Vaddagiri <[email protected]>




include/linux/cpuset.h | 9 ++
init/main.c | 2
kernel/cpuset.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched.c | 7 +
4 files changed, 228 insertions(+), 4 deletions(-)

diff -puN kernel/cpuset.c~cpuset_interface kernel/cpuset.c
--- linux-2.6.18-rc3/kernel/cpuset.c~cpuset_interface 2006-08-20 22:22:25.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/cpuset.c 2006-08-20 22:22:25.000000000 +0530
@@ -99,6 +99,9 @@ struct cpuset {
int mems_generation;

struct fmeter fmeter; /* memory_pressure filter */
+#ifdef CONFIG_CPUMETER
+ void *cpu_ctlr_data;
+#endif
};

/* bits in struct cpuset flags field */
@@ -110,6 +113,9 @@ typedef enum {
CS_NOTIFY_ON_RELEASE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+#ifdef CONFIG_CPUMETER
+ CS_METER_FLAG,
+#endif
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -148,6 +154,139 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+#ifdef CONFIG_CPUMETER
+static inline int is_metered(const struct cpuset *cs)
+{
+ return test_bit(CS_METER_FLAG, &cs->flags);
+}
+
+/* does the new combination of meter and exclusive flags makes sense? */
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ struct cpuset *parent = cur->parent;
+ int is_changed;
+
+ /* checks for change in meter flag */
+ is_changed = (is_metered(trial) != is_metered(cur));
+
+ /* Cant change meter setting if a cpuset already has children */
+ if (is_changed && !list_empty(&cur->children))
+ return -EINVAL;
+
+ /* Cant change meter setting if parent is metered */
+ if (is_changed && parent && is_metered(parent))
+ return -EINVAL;
+
+ /* Turn on metering only if a cpuset is exclusive */
+ if (is_metered(trial) && !is_cpu_exclusive(cur))
+ return -EINVAL;
+
+ /* Cant change exclusive setting if a cpuset is metered */
+ if (!is_cpu_exclusive(trial) && is_metered(cur))
+ return -EINVAL;
+
+ /* Cant change exclusive setting if parent is metered */
+ if (parent && is_metered(parent) && is_cpu_exclusive(trial))
+ return -EINVAL;
+
+ return 0;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ if (is_metered(cs)) {
+ void *parent_data = NULL;
+
+ if (cs->parent)
+ parent_data = cs->parent->cpu_ctlr_data;
+ cs->cpu_ctlr_data = cpu_ctlr->alloc_group(parent_data);
+ } else
+ cpu_ctlr->dealloc_group(cs->cpu_ctlr_data);
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ int quota;
+
+ if (!buf || !is_metered(cs))
+ return -EINVAL;
+
+ quota = simple_strtoul(buf, NULL, 10);
+ cpu_ctlr->assign_quota(cs->cpu_ctlr_data, quota);
+
+ return 0;
+}
+
+static int cpuset_sprintf_quota(char *page, struct cpuset *cs)
+{
+ int quota = 0;
+ char *c = page;
+
+ if (is_metered(cs))
+ quota = cpu_ctlr->get_quota(cs->cpu_ctlr_data);
+ c += sprintf (c, "%d", quota);
+
+ return c - page;
+}
+
+void *cpu_ctlr_data(struct cpuset *cs)
+{
+ return cs->cpu_ctlr_data;
+}
+
+static int cpu_ctlr_pre_move_task(struct task_struct *tsk, struct cpuset *oldcs,
+ struct cpuset *newcs)
+{
+ int change_rq, rc = 0;
+
+ change_rq = is_metered(oldcs) || is_metered(newcs);
+
+ if (change_rq)
+ rc = cpu_ctlr->pre_move_task(tsk, oldcs->cpu_ctlr_data,
+ newcs->cpu_ctlr_data);
+
+ return rc;
+}
+
+static void cpu_ctlr_post_move_task(struct task_struct *tsk)
+{
+ cpu_ctlr->post_move_task(tsk);
+}
+
+#else
+
+static inline int is_metered(const struct cpuset *cs)
+{
+ return 0;
+}
+
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ return 0;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ return 0;
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ return 0;
+}
+
+static int cpu_ctlr_pre_move_task(struct task_struct *tsk, struct cpuset *oldcs,
+ struct cpuset *newcs)
+{
+ return 0;
+}
+
+static void cpu_ctlr_post_move_task(struct task_struct *tsk)
+{
+ return;
+}
+#endif /* CONFIG_CPUMETER */
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -723,6 +862,9 @@ static int validate_change(const struct
{
struct cpuset *c, *par;

+ if (validate_meters(cur, trial))
+ return -EINVAL;
+
/* Each of our child cpusets must be a subset of us */
list_for_each_entry(c, &cur->children, sibling) {
if (!is_cpuset_subset(c, trial))
@@ -1037,7 +1179,7 @@ static int update_flag(cpuset_flagbits_t
{
int turning_on;
struct cpuset trialcs;
- int err, cpu_exclusive_changed;
+ int err, cpu_exclusive_changed, meter_changed;

turning_on = (simple_strtoul(buf, NULL, 10) != 0);

@@ -1052,6 +1194,7 @@ static int update_flag(cpuset_flagbits_t
return err;
cpu_exclusive_changed =
(is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs));
+ meter_changed = (is_metered(cs) != is_metered(&trialcs));
mutex_lock(&callback_mutex);
if (turning_on)
set_bit(bit, &cs->flags);
@@ -1061,7 +1204,9 @@ static int update_flag(cpuset_flagbits_t

if (cpu_exclusive_changed)
update_cpu_domains(cs);
- return 0;
+ if (meter_changed)
+ update_meter(cs);
+ return err;
}

/*
@@ -1180,6 +1325,7 @@ static int attach_task(struct cpuset *cs
nodemask_t from, to;
struct mm_struct *mm;
int retval;
+ int callback = 0;

if (sscanf(pidbuf, "%d", &pid) != 1)
return -EIO;
@@ -1225,7 +1371,10 @@ static int attach_task(struct cpuset *cs
return -ESRCH;
}
atomic_inc(&cs->count);
+ retval = cpu_ctlr_pre_move_task(tsk, oldcs, cs);
rcu_assign_pointer(tsk->cpuset, cs);
+ if (retval)
+ cpu_ctlr_post_move_task(tsk);
task_unlock(tsk);

guarantee_online_cpus(cs, &cpus);
@@ -1267,6 +1416,10 @@ typedef enum {
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
FILE_TASKLIST,
+#ifdef CONFIG_CPUMETER
+ FILE_METER_FLAG,
+ FILE_METER_QUOTA,
+#endif
} cpuset_filetype_t;

static ssize_t cpuset_common_file_write(struct file *file, const char __user *userbuf,
@@ -1336,6 +1489,14 @@ static ssize_t cpuset_common_file_write(
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_METER_FLAG:
+ retval = update_flag(CS_METER_FLAG, cs, buffer);
+ break;
+ case FILE_METER_QUOTA:
+ retval = update_quota(cs, buffer);
+ break;
+#endif
default:
retval = -EINVAL;
goto out2;
@@ -1448,6 +1609,14 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_METER_FLAG:
+ *s++ = is_metered(cs) ? '1' : '0';
+ break;
+ case FILE_METER_QUOTA:
+ s += cpuset_sprintf_quota(s, cs);
+ break;
+#endif
default:
retval = -EINVAL;
goto out;
@@ -1821,6 +1990,18 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};

+#ifdef CONFIG_CPUMETER
+static struct cftype cft_meter_flag = {
+ .name = "meter_cpu",
+ .private = FILE_METER_FLAG,
+};
+
+static struct cftype cft_meter_quota = {
+ .name = "cpu_quota",
+ .private = FILE_METER_QUOTA,
+};
+#endif
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1845,6 +2026,12 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
+#ifdef CONFIG_CPUMETER
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_flag)) < 0)
+ return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_quota)) < 0)
+ return err;
+#endif
return 0;
}

@@ -1862,6 +2049,10 @@ static long cpuset_create(struct cpuset
struct cpuset *cs;
int err;

+ /* metered cpusets cant have grand-children! */
+ if (parent->parent && is_metered(parent->parent))
+ return -EINVAL;
+
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return -ENOMEM;
@@ -1894,6 +2085,13 @@ static long cpuset_create(struct cpuset
if (err < 0)
goto err;

+ if (is_metered(parent)) {
+ cs->cpus_allowed = parent->cpus_allowed;
+ cs->mems_allowed = parent->mems_allowed;
+ set_bit(CS_METER_FLAG, &cs->flags);
+ update_meter(cs);
+ }
+
/*
* Release manage_mutex before cpuset_populate_dir() because it
* will down() this new directory's i_mutex and if we race with
@@ -1956,6 +2154,13 @@ static int cpuset_rmdir(struct inode *un
return retval;
}
}
+ if (is_metered(cs)) {
+ int retval = update_flag(CS_METER_FLAG, cs, "0");
+ if (retval < 0) {
+ mutex_unlock(&manage_mutex);
+ return retval;
+ }
+ }
parent = cs->parent;
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
@@ -2008,6 +2213,7 @@ int __init cpuset_init(void)
top_cpuset.mems_generation = cpuset_mems_generation++;

init_task.cpuset = &top_cpuset;
+ /* xxx: initialize init_task.cpu_ctlr_data */

err = register_filesystem(&cpuset_fs_type);
if (err < 0)
@@ -2133,9 +2339,13 @@ void cpuset_fork(struct task_struct *chi
void cpuset_exit(struct task_struct *tsk)
{
struct cpuset *cs;
+ int rc;

cs = tsk->cpuset;
+ rc = cpu_ctlr_pre_move_task(tsk, cs, &top_cpuset);
tsk->cpuset = &top_cpuset; /* the_top_cpuset_hack - see above */
+ if (rc)
+ cpu_ctlr_post_move_task(tsk);

if (notify_on_release(cs)) {
char *pathbuf = NULL;
diff -puN include/linux/cpuset.h~cpuset_interface include/linux/cpuset.h
--- linux-2.6.18-rc3/include/linux/cpuset.h~cpuset_interface 2006-08-20 22:22:25.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/cpuset.h 2006-08-20 22:22:25.000000000 +0530
@@ -63,6 +63,15 @@ static inline int cpuset_do_slab_mem_spr
return current->flags & PF_SPREAD_SLAB;
}

+#ifdef CONFIG_CPUMETER
+void *cpu_ctlr_data(struct cpuset *cs);
+#else
+static inline void *cpu_ctlr_data(struct cpuset *cs)
+{
+ return NULL;
+}
+#endif
+
#else /* !CONFIG_CPUSETS */

static inline int cpuset_init_early(void) { return 0; }
diff -puN init/main.c~cpuset_interface init/main.c
--- linux-2.6.18-rc3/init/main.c~cpuset_interface 2006-08-20 22:22:25.000000000 +0530
+++ linux-2.6.18-rc3-root/init/main.c 2006-08-20 22:22:25.000000000 +0530
@@ -482,6 +482,7 @@ asmlinkage void __init start_kernel(void
setup_arch(&command_line);
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+ cpuset_init_early();

/*
* Set up the scheduler prior starting any interrupts (such as the
@@ -545,7 +546,6 @@ asmlinkage void __init start_kernel(void
}
#endif
vfs_caches_init_early();
- cpuset_init_early();
mem_init();
kmem_cache_init();
setup_per_cpu_pageset();
diff -puN kernel/sched.c~cpuset_interface kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpuset_interface 2006-08-20 22:22:25.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 22:22:25.000000000 +0530
@@ -737,7 +737,12 @@ static void enqueue_task_grp(struct task
/* return the task-group to which a task belongs */
static inline struct task_grp *task_grp(struct task_struct *p)
{
- return &init_task_grp;
+ struct task_grp *tg = cpu_ctlr_data(p->cpuset);
+
+ if (!tg)
+ tg = &init_task_grp;
+
+ return tg;
}

static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,

_
--
Regards,
vatsa

2006-08-20 20:49:17

by Paul Jackson

Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

(I added Suresh to the cc list - a question relating to his work is at the bottom of this reply. -pj)

Srivatsa wrote:
> - This by no means is any recommendation for cpuset to be chosen as
> resource management interface!

Good ;). My following comments are also neither a recommendation for nor
against this use of cpusets.

> - Every cpuset directory has two new files - meter_cpu, cpu_quota

To keep the ever-growing namespace of per-cpuset files clear, I'd
prefer naming these two new files something such as:

cpu_meter_enabled
cpu_meter_quota

You wrote:
> - Only cpusets marked as "exclusive" and not having any child cpusets
> can be metered.

and then gave a nice example having cpusets:

1) /dev/cpuset
2) /dev/cpuset/grp_a
3) /dev/cpuset/grp_a/very_imp_grp
4) /dev/cpuset/grp_a/less_imp_grp

Your statement tells me cpusets (3) and (4) should be marked
cpu_exclusive, but your example only marked (2) as cpu_exclusive.
Which is it?

Aha - neither is right. I see further down in the code that
when you create a child of a metered parent, you automatically
copy the 'cpus', 'mems', and meter flag to the new child.

Perhaps it would help to add a comment in the example shell
script setup code, right after each of the lines:
# mkdir very_imp_grp
...
# mkdir less_imp_grp
something like:
# # Because the parent is marked 'cpu_meter_enabled',
# # this mkdir automatically copied 'cpus', 'mems' and
# # 'cpu_meter_enabled' from the parent to the new child.

Essentially, if the user does:
	cd /dev/cpuset/...
	echo 1 > cpu_meter_enabled
	mkdir some_grp
the kernel, on the 'mkdir', automatically does:
	cp cpus mems cpu_meter_enabled some_grp

Hmmm ... with this automatic copy, and with the carefully imposed
sequence of operations on the order in which such metered groups can be
set up and torn down, you are implicitly imposing additional constraints
on which configurations of cpusets support metered cpu groups.
The constraints seem to be that all metered child cpusets of a
metered parent must have the same cpus and mems, and must also be
marked metered. And the parent must be cpu_exclusive, and the children
cannot be cpu_exclusive. And the children must have no children of
their own.

I can certainly imagine that such constraints make it easier to write
correct scheduler group management code. And I can certainly allow
that the cpuset style API, allowing one bit at a time to be changed
with each separate open/write/close sequence of system calls, makes it
difficult to impose such constraints over the state of several settings
at once.

It sure would be nice to avoid the implicit side effects of copying
parent state on the mkdir, and it sure would be nice to reduce the
enforced sequencing of operations. Off hand, I don't see how to do
this, however.

If you actually ended up using cpusets for this API, then this deserves
some more head scratching.

In the "validate_meters()" routine:

+ /* Cant change meter setting if parent is metered */
+ if (is_changed && parent && is_metered(parent))
+ return -EINVAL;

Would the narrower check "Must meter child of metered parent":
if (parent && is_metered(parent) && !is_metered(trial))
be sufficient?


+ /* Turn on metering only if a cpuset is exclusive */
+ if (is_metered(trial) && !is_cpu_exclusive(cur))
+ return -EINVAL;
+
+ /* Cant change exclusive setting if a cpuset is metered */
+ if (!is_cpu_exclusive(trial) && is_metered(cur))
+ return -EINVAL;

Can the above two checks be collapsed to one check:

/* Can only meter if cpu_exclusive */
if (is_metered(trial) && !is_cpu_exclusive(trial))
return -EINVAL;

Hmmm ... perhaps that's not right either. If I understood this right,
the -children- in a metered group are -always- marked meter on, and
cpu_exclusive off. So they would pass neither your two checks, nor my
one rewrite.

After doing your example, try:
/bin/echo 1 > /dev/cpuset/grp_a/very_imp_grp/notify_on_release
and see if it fails, EINVAL. I'll bet it does.


+ /* Cant change exclusive setting if parent is metered */
+ if (parent && is_metered(parent) && is_cpu_exclusive(trial))
+ return -EINVAL;

Comment doesn't seem to match code. s/change/turn on/ ?

+#ifdef CONFIG_CPUMETER
+ FILE_METER_FLAG,
+ FILE_METER_QUOTA,
+#endif

Could these names match the per-cpuset file names:
FILE_CPU_METER_ENABLED
FILE_CPU_METER_QUOTA


One other question ... how does all this interact with Suresh's
dynamic sched domains?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-21 08:34:17

by Mike Galbraith

Subject: Re: [PATCH 0/7] CPU controller - V1

On Sun, 2006-08-20 at 23:10 +0530, Srivatsa Vaddagiri wrote:
> Salient design points of this patch:
>
> - Each task-group gets its own runqueue on every cpu.
>
> - In addition, there is an active and expired array of
> task-groups themselves. Task-groups who have expired their
> quota are put into expired array.
>
> - Task-groups have priorities. Priority of a task-group is the
> same as the priority of the highest-priority runnable task it
> has. This I feel will retain interactiveness of the system
> as it is today.

WRT interactivity: Looking at try_to_wake_up(), it appears that wake-up
of a high priority group-a task will not result in preemption of a lower
priority current group-b task. True?

-Mike

2006-08-21 12:49:15

by Srivatsa Vaddagiri

Subject: Re: [PATCH 0/7] CPU controller - V1

On Mon, Aug 21, 2006 at 10:42:40AM +0000, Mike Galbraith wrote:
> WRT interactivity: Looking at try_to_wake_up(), it appears that wake-up
> of a high priority group-a task will not result in preemption of a lower
> priority current group-b task. True?

I don't think it is true. The definition of TASK_PREEMPTS_CURR() is
unchanged with these patches. Also, group priority is linked to the
highest-priority task it has. As a result, the high-priority group-a
task should preempt the low-priority group-b task.

This is unless group-a has currently run out of its bandwidth and is sitting in
the expired queue (which is something that TASK_PREEMPTS_CURR()
could be optimized to check for).

--
Regards,
vatsa

2006-08-21 15:02:16

by Mike Galbraith

Subject: Re: [PATCH 0/7] CPU controller - V1

On Mon, 2006-08-21 at 18:18 +0530, Srivatsa Vaddagiri wrote:
> On Mon, Aug 21, 2006 at 10:42:40AM +0000, Mike Galbraith wrote:
> > WRT interactivity: Looking at try_to_wake_up(), it appears that wake-up
> > of a high priority group-a task will not result in preemption of a lower
> > priority current group-b task. True?
>
> I don't think it is true. The definition of TASK_PREEMPTS_CURR() is
> unchanged with these patches.

I must be missing something. If current and awakening tasks have
separate runqueues, task_rq(awakening)->curr != current. We won't look
at current->prio, so won't resched(current).

-Mike

2006-08-21 16:46:39

by Srivatsa Vaddagiri

Subject: Re: [PATCH 0/7] CPU controller - V1

On Mon, Aug 21, 2006 at 05:10:41PM +0000, Mike Galbraith wrote:
> I must be missing something. If current and awakening tasks have
> separate runqueues, task_rq(awakening)->curr != current. We won't look
> at current->prio, so won't resched(current).

Ok ..we have two types of runqueues here:

1. struct task_grp_rq
per-task-group-per-cpu runqueue, which holds ready-to-run tasks
belonging to the group in active and expired arrays.

2. struct rq
per-cpu runqueue, which holds ready-to-run task-groups in active and
expired arrays. This structure also holds some members like
curr, nr_running etc which more or less have the same significance as
the current runqueue members.

task_rq(tsk) still extracts "struct rq", while
task_grp(tsk)->rq[task_cpu(tsk)] extracts "struct task_grp_rq".
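
A simplified sketch of the two (fields abbreviated from the patch):

    /* per-task-group, per-cpu runqueue: holds tasks */
    struct task_grp_rq {
        struct prio_array arrays[2];
        struct prio_array *active, *expired;
        int prio;                       /* priority of the task-group */
        struct task_grp *tg;
    };

    /* per-cpu runqueue: its arrays now hold task-groups, not tasks */
    struct rq {
        unsigned long nr_running;
        struct task_struct *curr, *idle;
        struct prio_array *active, *expired, arrays[2];
    };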

Hence task_rq(awakening)->curr == current, which should be sufficient to
resched(current), although I think there is a bug in current code
(irrespective of these patches):

try_to_wake_up() :

...

if (!sync || cpu != this_cpu) {
    if (TASK_PREEMPTS_CURR(p, rq))
        resched_task(rq->curr);
}
success = 1;

...

TASK_PREEMPTS_CURR() is examined and resched_task() is called only if
(cpu != this_cpu). What about the case (cpu == this_cpu) - who will
call resched_task() on current? I had expected the back-end of interrupt
handling to do that, but didn't find any code to do so.

--
Regards,
vatsa

2006-08-21 17:49:50

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Sun, Aug 20, 2006 at 01:48:49PM -0700, Paul Jackson wrote:
> Srivatsa wrote:
> > - This by no means is any recommendation for cpuset to be chosen as
> > resource management interface!
>
> Good ;). My following comments are also neither a recommendation for or
> against this use of cpusets.

Good :)

> To keep the ever growing namespace of per-cpuset files clear, I'd
> prefer naming these two new files something such as:
>
> cpu_meter_enabled
> cpu_meter_quota

Ok ..

> Perhaps it would help to add a comment in the example shell
> script setup code, right after each of the lines:
> # mkdir very_imp_grp
> ...
> # mkdir less_imp_grp
> something like:
> # # Because the parent is marked 'cpu_meter_enabled',
> # # this mkdir automatically copied 'cpus', 'mems' and
> # # 'cpu_meter_enabled' from the parent to the new child.

Ok sure.

> Hmmm ... with this automatic copy, and with the carefully imposed
> sequence of operations on the order in which such metered groups can be
> setup and torn down, you are implicitly imposing additional constraints
> on what configurations of cpusets support metered cpu groups
>
> The constraints seem to be that all metered child cpusets of a
> metered parent must have the same cpus and mems, and must also be
> marked metered. And the parent must be cpu_exclusive, and the children
> cannot be cpu_exclusive. And the children must have no children of
> their own.

Yes. IMHO allowing metered child cpusets to have different cpus doesn't
make much sense, because the child cpusets represent some task-grouping
here (which has been provided some CPU bandwidth) rather than a cpuset in its
true sense.

> I can certainly imagine that such constraints make it easier to write
> correct scheduler group management code. And I can certainly allow
> that the cpuset style API, allowing one bit at a time to be changed
> with each separate open/write/close sequence of system calls, makes it
> difficult to impose such constraints over the state of several settings
> at once.

Could you elaborate on this a bit? What do you mean by "difficult to impose
such constraints"? Are you referring to things like: after metered child
cpusets have been created, any changes to the cpus field of the parent have to
be reflected in all its child-cpusets 'atomically'?

> It sure would be nice to avoid the implicit side effects of copying
> parent state on the mkdir, and it sure would be nice to reduce the
> enforced sequencing of operations.

How would this help?

> If you actually ended up using cpusets for this API, then this deserves
> some more head scratching.

There is at least one other serious issue in using cpusets for this API -
how do we easily find all tasks belonging to a cpuset? There seems to be
no easy way of doing this, w/o walking through the complete task-list.

We need this task-list for various reasons - for example, to change
task->load_weight of all tasks in a child-cpuset when its cpu quota is
changed, etc.

> In the "validate_meters()" routine:
>
> + /* Cant change meter setting if parent is metered */
> + if (is_changed && parent && is_metered(parent))
> + return -EINVAL;
>
> Would the narrower check "Must meter child of metered parent":
> if (parent && is_metered(parent) && !is_metered(trial))
> be sufficient?

Yes ..although I think the former check is more readable.

> + /* Turn on metering only if a cpuset is exclusive */
> + if (is_metered(trial) && !is_cpu_exclusive(cur))
> + return -EINVAL;
> +
> + /* Cant change exclusive setting if a cpuset is metered */
> + if (!is_cpu_exclusive(trial) && is_metered(cur))
> + return -EINVAL;
>
> Can the above two checks be collapsed to one check:
>
> /* Can only meter if cpu_exclusive */
> if (is_metered(trial) && !is_cpu_exclusive(trial))
> return -EINVAL;
>
> Hmmm ... perhaps that's not right either. If I understood this right,
> the -children- in a metered group are -always- marked meter on, and
> cpu_exclusive off.

Yes.

> So they would pass neither your two checks, nor my one rewrite.

Hmm .. you are right. How about this:

/* Turn on metering only if a cpuset is exclusive */
if (!is_metered(parent) && is_metered(trial) && !is_cpu_exclusive(cur))
    return -EINVAL;

/* Cant change exclusive setting if a cpuset is metered */
if (!is_metered(parent) && !is_cpu_exclusive(trial) && is_metered(cur))
    return -EINVAL;

> After doing your example, try:
> /bin/echo 1 > /dev/cpuset/grp_a/very_imp_grp/notify_on_release
> and see if it fails, EINVAL. I'll bet it does.

Or perhaps even /bin/echo 0 > /dev/cpuset/grp_a/very_imp_grp/cpu_exclusive
(to see the failure)?


> + /* Cant change exclusive setting if parent is metered */
> + if (parent && is_metered(parent) && is_cpu_exclusive(trial))
> + return -EINVAL;
>
> Comment doesn't seem to match code. s/change/turn on/ ?

Yes, "turn on" perhaps makes better sense here in the comment.

> +#ifdef CONFIG_CPUMETER
> + FILE_METER_FLAG,
> + FILE_METER_QUOTA,
> +#endif
>
> Could these names match the per-cpuset file names:
> FILE_CPU_METER_ENABLED
> FILE_CPU_METER_QUOTA

Sure ..Will fix in the next version.

> One other question ... how does all this interact with Suresh's
> dynamic sched domains?

By "all this" I presume you are referring to the changes to cpuset in
this patch. I see little interaction. AFAICS all metered child-cpusets should
be in the same dynamic sched-domain.

--
Regards,
vatsa

2006-08-21 18:24:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/7] CPU controller - V1

On Mon, 2006-08-21 at 22:15 +0530, Srivatsa Vaddagiri wrote:

> Hence task_rq(awakening)->curr == current, which should be sufficient to

Ah, ok. Thanks. I should have read more of the code instead of
pondering the text.

> resched(current), although I think there is a bug in current code
> (irrespective of these patches):
>
> try_to_wake_up() :
>
> ...
>
> if (!sync || cpu != this_cpu) {
>     if (TASK_PREEMPTS_CURR(p, rq))
>         resched_task(rq->curr);
> }
> success = 1;
>
> ...
>
> TASK_PREEMPTS_CURR() is examined and resched_task() is called only if
> (cpu != this_cpu). What about the case (cpu == this_cpu) - who will
> call resched_task() on current? I had expected the back-end of interrupt
> handling to do that, but didn't find any code to do so.

Looks ok to me. Everything except sync && cpu == this_cpu checks.

-Mike

2006-08-21 18:36:55

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 0/7] CPU controller - V1

On Mon, Aug 21, 2006 at 08:33:08PM +0000, Mike Galbraith wrote:
> Looks ok to me. Everything except sync && cpu == this_cpu checks.

Hmm ..yes ..stupid of me. !sync will catch the case I had in mind.
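
Spelling it out, for the record:

    if (!sync || cpu != this_cpu) {
        /*
         * Taken for every non-sync wakeup, including the local
         * cpu == this_cpu case, and for any cross-cpu wakeup.
         * Only a sync wakeup on the local cpu skips the
         * TASK_PREEMPTS_CURR()/resched_task() check.
         */
        if (TASK_PREEMPTS_CURR(p, rq))
            resched_task(rq->curr);
    }

so the (cpu == this_cpu) case I had in mind is covered whenever the wakeup
is not sync.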

--
Regards,
vatsa

2006-08-22 09:02:11

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Sun, 2006-08-20 at 23:18 +0530, Srivatsa Vaddagiri wrote:

> As an example, follow these steps to create metered cpusets:
>
>
> # cd /dev
> # mkdir cpuset
> # mount -t cpuset cpuset cpuset
> # cd cpuset
> # mkdir grp_a
> # cd grp_a
> # /bin/echo "6-7" > cpus # assign CPUs 6,7 for this cpuset
> # /bin/echo 0 > mems # assign node 0 for this cpuset
> # /bin/echo 1 > cpu_exclusive
> # /bin/echo 1 > meter_cpu
>
> # mkdir very_imp_grp
> # Assign 80% bandwidth to this group
> # /bin/echo 80 > very_imp_grp/cpu_quota
>
> # echo $apache_webserver_pid > very_imp_grp/tasks
>
> # mkdir less_imp_grp
> # Assign 5% bandwidth to this group
> # /bin/echo 5 > less_imp_grp/cpu_quota
>
> # echo $mozilla_browser_pid > less_imp_grp/tasks

Doesn't seem to work here, but maybe I'm doing something wrong.

I set up cpuset "all" containing cpu 0-1 (all, 1.something cpus I have;)
exactly as you created grp_a. I then created sub-groups mikeg and root,
and gave them 20% and 80% sub-group/cpu_quota respectively, and plunked
one shell in mikeg/tasks, and two in root/tasks.

In each root shell, I started a proggy that munches ~80% cpu.

top - 10:51:12 up 32 min, 10 users, load average: 2.00, 2.81, 2.69
Tasks: 114 total, 3 running, 111 sleeping, 0 stopped, 0 zombie
Cpu(s): 79.7% us, 0.0% sy, 0.0% ni, 20.3% id, 0.0% wa, 0.0% hi, 0.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6503 root 15 0 1368 276 228 S 79 0.0 0:31.03 f
6504 root 15 0 1368 280 228 R 78 0.0 0:29.17 f

I then add the same to the mikeg shell.

top - 10:54:37 up 35 min, 10 users, load average: 3.80, 2.95, 2.74
Tasks: 115 total, 6 running, 109 sleeping, 0 stopped, 0 zombie
Cpu(s): 94.7% us, 0.5% sy, 0.0% ni, 4.8% id, 0.0% wa, 0.0% hi, 0.0% si

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6503 root 15 0 1368 276 228 S 65 0.0 3:12.86 f
6505 mikeg 15 0 1368 276 228 R 64 0.0 0:11.77 f
6504 root 16 0 1368 280 228 R 59 0.0 3:10.58 f

If I add a third to root, the percentages go to roughly 50% per.

-Mike

2006-08-22 10:11:08

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, Aug 22, 2006 at 11:10:36AM +0000, Mike Galbraith wrote:
> Doesn't seem to work here, but maybe I'm doing something wrong.
>
> I set up cpuset "all" containing cpu 0-1 (all, 1.something cpus I have;)

You are assigning all the CPUs to the cpuset "all" and then making it an
exclusive/metered cpuset?

I don't think I am handling that case well (yet), primarily because usage of
remaining tasks (which are not in cpuset "all", "mikeg" & "root") is not
accounted/controlled. Note that those remaining tasks will be running on one of
the CPUs assigned to "all". What needs to happen is those remaining tasks need
to be moved to a separate group (and a runqueue), being given some left-over
CPU quota (which is left over from assignment of quota to mikeg and root),
which is not handled in the patches (yet). One of the reasons why I haven't
handled it yet is that there is no easy way to retrieve the list of tasks attached
to a cpuset.

Can you try assigning (NUM_CPUS-1) cpus to "all" and give it a shot?
Essentially you need to ensure that only tasks chosen by you are running in
cpus given to "all" and other child-cpusets under it.


--
Regards,
vatsa

2006-08-22 12:32:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 15:40 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Aug 22, 2006 at 11:10:36AM +0000, Mike Galbraith wrote:
> > Doesn't seem to work here, but maybe I'm doing something wrong.
> >
> > I set up cpuset "all" containing cpu 0-1 (all, 1.something cpus I have;)
>
> You are assigning all the CPUs to the cpuset "all" and then making it an
> exclusive/metered cpuset?

Yeah.

> I don't think I am handling that case well (yet), primarily because usage of
> remaining tasks (which are not in cpuset "all", "mikeg" & "root") is not
> accounted/controlled. Note that those remaining tasks will be running on one of
> the CPUs assigned to "all". What needs to happen is those remaining tasks need
> to be moved to a separate group (and a runqueue), being given some left-over
> CPU quota (which is left over from assignment of quota to mikeg and root),
> which is not handled in the patches (yet). One of the reasons why I haven't
> handled it yet is that there is no easy way to retrieve the list of tasks attached
> to a cpuset.

I try it with everything in either root or mikeg.

> Can you try assigning (NUM_CPUS-1) cpus to "all" and give it a shot?
> Essentially you need to ensure that only tasks chosen by you are running in
> cpus given to "all" and other child-cpusets under it.

Sure.

-Mike

2006-08-22 13:15:00

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 14:41 +0000, Mike Galbraith wrote:
> On Tue, 2006-08-22 at 15:40 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Aug 22, 2006 at 11:10:36AM +0000, Mike Galbraith wrote:
> > > Doesn't seem to work here, but maybe I'm doing something wrong.
> > >
> > > I set up cpuset "all" containing cpu 0-1 (all, 1.something cpus I have;)
> >
> > You are assigning all the CPUs to the cpuset "all" and then making it an
> > exclusive/metered cpuset?
>
> Yeah.
>
> > I don't think I am handling that case well (yet), primarily because usage of
> > remaining tasks (which are not in cpuset "all", "mikeg" & "root") is not
> > accounted/controlled. Note that those remaining tasks will be running on one of
> > the CPUs assigned to "all". What needs to happen is those remaining tasks need
> > to be moved to a separate group (and a runqueue), being given some left-over
> > CPU quota (which is left over from assignment of quota to mikeg and root),
> > which is not handled in the patches (yet). One of the reasons why I haven't
> > handled it yet is that there is no easy way to retrieve the list of tasks attached
> > to a cpuset.
>
> I try it with everything in either root or mikeg.

That didn't work.

> > Can you try assigning (NUM_CPUS-1) cpus to "all" and give it a shot?
> > Essentially you need to ensure that only tasks chosen by you are running in
> > cpus given to "all" and other child-cpusets under it.

With only cpu 1 in the cpuset, it worked.

-Mike

2006-08-22 13:36:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 15:23 +0000, Mike Galbraith wrote:

> > > Can you try assigning (NUM_CPUS-1) cpus to "all" and give it a shot?
> > > Essentially you need to ensure that only tasks chosen by you are running in
> > > cpus given to "all" and other child-cpusets under it.
>
> With only cpu 1 in the cpuset, it worked.

P.S. Since it worked (and there are other tasks on it besides what I
assigned - kthreads for example; I only assigned the one shell), I figured
I'd try adding the other cpu and see what happened. It let me hot add
cpu 0, but tasks continue to run only on cpu 1.

2006-08-22 13:51:39

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, Aug 22, 2006 at 03:45:05PM +0000, Mike Galbraith wrote:
> > With only cpu 1 in the cpuset, it worked.
>
> P.S. Since it worked (and there are other tasks on it besides what I
> assigned - kthreads for example; I only assigned the one shell), I figured
> I'd try adding the other cpu and see what happened. It let me hot add
> cpu 0, but tasks continue to run only on cpu 1.

How did you add the other cpu? To the "all" cpuset? I don't think I am
percolating changes to the "cpus" field of the parent down to the child
cpusets, which may explain why tasks continue to run only on cpu 1.
Whether the cpuset code can handle making this change atomically for all
child cpusets is again something I don't know.

--
Regards,
vatsa

2006-08-22 14:02:08

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, Aug 22, 2006 at 03:23:29PM +0000, Mike Galbraith wrote:
> > I try it with everything in either root or mikeg.

How did you transfer everything to root? By cat'ing each task pid
(including init's) to the root (or mikeg) tasks file?

I will give your experiment a try here and find out what's happening.

You said that you spawn a task which munches ~80% cpu. Is that by
something like:

do {
    gettimeofday(&t1, NULL);
loop:
    gettimeofday(&t2, NULL);
    while (t2.tv_sec - t1.tv_sec != 48)
        goto loop;
    sleep 12

} while (1);


> That didn't work.

Ok. I will repeat your experiment and see what I can learn from it.


--
Regards,
vatsa

2006-08-22 15:52:31

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 19:31 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Aug 22, 2006 at 03:23:29PM +0000, Mike Galbraith wrote:
> > > I try it with everything in either root or mikeg.
>
> How did you transfer everything to root? By cat'ing each task pid
> (including init's) to the root (or mikeg) tasks file?

Yes.

> I will give your experiment a try here and find out what's happening.
>
> You said that you spawn a task which munches ~80% cpu. Is that by
> something like:
>
> do {
>     gettimeofday(&t1, NULL);
> loop:
>     gettimeofday(&t2, NULL);
>     while (t2.tv_sec - t1.tv_sec != 48)
>         goto loop;
>     sleep 12
>
> } while (1);

Yeah, a sleep/burn loop. The proggy is one of several scheduler
exploits posted to lkml over the years. The reason I wanted to test
this patch set was to see how well it handles various nasty loads.
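
For anyone wanting to reproduce this, a minimal sleep/burn loop along the
lines sketched above would look something like the following - only an
illustration, not the actual exploit program:

    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            time_t start = time(NULL);

            /* burn cpu for ~48 seconds ... */
            while (time(NULL) - start < 48)
                ;

            /* ... then sleep for 12 seconds: roughly 80% cpu */
            sleep(12);
        }
        return 0;
    }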

-Mike

2006-08-22 15:57:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 19:20 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Aug 22, 2006 at 03:45:05PM +0000, Mike Galbraith wrote:
> > > With only cpu 1 in the cpuset, it worked.
> >
> > P.S. Since it worked (and there are other tasks on it besides what I
> > assigned - kthreads for example; I only assigned the one shell), I figured
> > I'd try adding the other cpu and see what happened. It let me hot add
> > cpu 0, but tasks continue to run only on cpu 1.
>
> How did you add the other cpu? To the "all" cpuset? I don't think I am
> percolating changes to the "cpus" field of the parent down to the child
> cpusets, which may explain why tasks continue to run only on cpu 1.
> Whether the cpuset code can handle making this change atomically for all
> child cpusets is again something I don't know.

I just did echo "0-1" > cpus.

-Mike

2006-08-22 15:58:45

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, Aug 22, 2006 at 06:01:01PM +0000, Mike Galbraith wrote:
> Yeah, a sleep/burn loop. The proggy is one of several scheduler
> exploits posted to lkml over the years. The reason I wanted to test
> this patch set was to see how well it handles various nasty loads.

Ok ..thanks for pointing me to it. I am testing it now and will post my
observations soon.

That said, I think this patch needs more work, especially in the area
of better timeslice management and SMP load-balance. If you have any
comments on its general design and the direction in which this is
proceeding, I would love to hear them.

Thanks for your feedback so far on this!

--
Regards,
vatsa

2006-08-22 16:02:53

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, Aug 22, 2006 at 06:05:31PM +0000, Mike Galbraith wrote:
> I just did echo "0-1" > cpus.

"cpus" here is the "cpus" file in "all" cpuset?

If so, that explains why tasks were stuck on cpu 1 (because child cpusets
aren't currently updated with the new cpus setting - which by itself is a tricky
thing to accomplish under cpusets perhaps).

--
Regards,
vatsa

2006-08-22 17:00:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Tue, 2006-08-22 at 21:32 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Aug 22, 2006 at 06:05:31PM +0000, Mike Galbraith wrote:
> > I just did echo "0-1" > cpus.
>
> "cpus" here is the "cpus" file in "all" cpuset?

Yes.

-Mike

2006-08-22 18:56:15

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

Mike wrote:
> By cat'ing each task pid
> (including init's) to the root (or mikeg) tasks file?

I guess you meant:
echo'ing
not:
cat'ing

Lets say for example one has cpusets:
/dev/cpuset
/dev/cpuset/foo

One cannot move the tasks in 'foo' to the top (root) cpuset by doing:

cat < /dev/cpuset/foo/tasks > /dev/cpuset/tasks # fails

That cat fails because the tasks file has to be written one pid at a
time, not in big buffered writes of multiple lines like cat does.

The usual code for doing this move is:

while read i
do
    /bin/echo $i > /dev/cpuset/tasks
done < /dev/cpuset/foo/tasks

There is a cute trick that lets you move all the tasks in one cpuset to
another cpuset in a one-liner, by making use of the "sed -u" unbuffered
option:

sed -nu p < /dev/cpuset/foo/tasks > /dev/cpuset/tasks # works

For serious production work, the above is still racy. A task could be
added to the 'foo' cpuset when another task in 'foo' forks while the
copying is being done. The following loop minimizes (doesn't perfectly
solve) this race:

while test -s /dev/cpuset/foo/tasks
do
    sed -nu p < /dev/cpuset/foo/tasks > /dev/cpuset/tasks
done

The above loop is still theoretically racy with fork, but seems to
work in practice.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-23 07:34:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Sun, 2006-08-20 at 23:18 +0530, Srivatsa Vaddagiri wrote:

> As an example, follow these steps to create metered cpusets:
>
>
> # cd /dev
> # mkdir cpuset
> # mount -t cpuset cpuset cpuset
> # cd cpuset
> # mkdir grp_a
> # cd grp_a
> # /bin/echo "6-7" > cpus # assign CPUs 6,7 for this cpuset
> # /bin/echo 0 > mems # assign node 0 for this cpuset
> # /bin/echo 1 > cpu_exclusive
> # /bin/echo 1 > meter_cpu

Implementation might need some idiot proofing.

After mount/cd cpuset, somebody might think "gee, why should I need to
create a group 'all', when it's right there in the root, including all
its tasks? I'll just set cpu_exclusive, set meter_cpu...

...and after my box finishes rebooting, I'll alert the developer of this
patch set that very bad things happen the instant you do that :)

-Mike

2006-08-23 13:16:11

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Wed, 2006-08-23 at 09:43 +0000, Mike Galbraith wrote:

> ...and after my box finishes rebooting, I'll alert the developer of this
> patch set that very bad things happen the instant you do that :)

Good news is that it doesn't always spontaneously reboot. Sometimes, it
just goes pop.

(gdb) list *schedule+0x18a
0xb13c4aba is in schedule (kernel/sched.c:3586).
3581 #else
3582 next_grp = init_task_grp.rq[cpu];
3583 #endif
3584
3585 /* Pick a task within that group next */
3586 array = next_grp->active;
3587 if (!array->nr_active) {
3588 /*
3589 * Switch the active and expired arrays.
3590 */
(gdb)

BUG: unable to handle kernel paging request at virtual address ffffffe8
printing eip:
b13c4aba
*pde = 00004067
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: xt_pkttype ipt_LOG xt_limit eeprom snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_seq edd ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 nls_utf8 tuner snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc saa7134 prism54 bt878 ir_kbd_i2c snd_mpu401 snd_mpu401_uart snd_rawmidi snd_seq_device snd i2c_i801 bttv ohci1394 ieee1394 video_buf ir_common btcx_risc tveeprom soundcore sd_mod
CPU: 0
EIP: 0060:[<b13c4aba>] Not tainted VLI
EFLAGS: 00210097 (2.6.18-rc4-ctrl #166)
EIP is at schedule+0x18a/0xb5b
eax: 0000008c ebx: b1488f40 ecx: b1e65f60 edx: fffff6e8
esi: b1489054 edi: b15ce300 ebp: b158efac esp: b158ef38
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, ti=b158e000 task=b1488f40 task.ti=b158e000)
Stack: 00000000 00200046 b158ef3c b1488f40 00000000 00000000 b158ef78 b15ce380
b15ce300 b158ef64 b1027f51 00000001 b1488f40 00000000 008961d3 00000035
0005466d 00000000 b1489054 b1e65f60 b15ce300 b158efac 00000000 b1e6007b
Call Trace:
[<b1001a52>] cpu_idle+0x8f/0xc9
[<b1000622>] rest_init+0x39/0x47
[<b15957c8>] start_kernel+0x34d/0x405
[<b1000210>] 0xb1000210
DWARF2 unwinder stuck at 0xb1000210
Leftover inexact backtrace:
=======================
=======================
BUG: unable to handle kernel paging request at virtual address 38f28034
printing eip:
b1004074
*pde = 00000000
Oops: 0000 [#2]
PREEMPT SMP
Modules linked in: xt_pkttype ipt_LOG xt_limit eeprom snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_seq edd ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 nls_utf8 tuner snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd_page_alloc saa7134 prism54 bt878 ir_kbd_i2c snd_mpu401 snd_mpu401_uart snd_rawmidi snd_seq_device snd i2c_i801 bttv ohci1394 ieee1394 video_buf ir_common btcx_risc tveeprom soundcore sd_mod
CPU: 0
EIP: 0060:[<b1004074>] Not tainted VLI
EFLAGS: 00210812 (2.6.18-rc4-ctrl #166)
EIP is at show_trace_log_lvl+0x92/0x191
eax: 38f28ffd ebx: 38f289c9 ecx: ffffffff edx: 00200006
esi: b158ee1c edi: 38f28000 ebp: b158ee1c esp: b158edc8
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, ti=b158e000 task=b1488f40 task.ti=b158e000)
Stack: b1407cc4 b1407d3c 00020809 b1e65f60 fffff6e8 00099100 b1500800 01675007
0000008c 0020007b 0000007b ffffffff b1000210 00000060 00210097 b158f000
00000068 b1488f40 b158ef9b 00000018 b1407d3c b158ee44 b1004219 b1407d3c
Call Trace:
[<b1004219>] show_stack_log_lvl+0xa6/0xcb
[<b10043e7>] show_registers+0x1a9/0x22e
[<b10045a2>] die+0x136/0x2ea
[<b10191cb>] do_page_fault+0x269/0x529
[<b1003bb1>] error_code+0x39/0x40
[<b13c4aba>] schedule+0x18a/0xb5b
[<b1001a52>] cpu_idle+0x8f/0xc9
[<b1000622>] rest_init+0x39/0x47
[<b15957c8>] start_kernel+0x34d/0x405
[<b1000210>] 0xb1000210
DWARF2 unwinder stuck at 0xb1000210
Leftover inexact backtrace:
=======================
=======================
BUG: unable to handle kernel paging request at virtual address 38f28034
printing eip:
b1004074
*pde = 00000000
Recursive die() failure, output suppressed




2006-08-23 13:25:53

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

On Wed, Aug 23, 2006 at 03:24:43PM +0000, Mike Galbraith wrote:
> On Wed, 2006-08-23 at 09:43 +0000, Mike Galbraith wrote:
>
> > ...and after my box finishes rebooting, I'll alert the developer of this
> > patch set that very bad things happen the instant you do that :)
>
> Good news is that it doesn't always spontaneously reboot. Sometimes, it
> just goes pop.

Using the top-level cpuset itself is something that I don't think the
patches support yet. Shouldn't be hard to make it work, except that usage of
tasks in the top-level cpuset needs to be accounted/controlled. I am working on
this now, as well as on some changes wrt timeslice management. Will send out some
patches soon!

--
Regards,
vatsa

2006-08-25 12:36:06

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

Srivatsa,

I suggest to split existing runqueue structure
into 2 pieces: physical cpu (sd, ...) and
virtual cpu (essentially a runqueue - array, nr_running, load etc.)

Then replace all references to cpu as int with vcpu_t pointer.
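
As a rough illustration of the split (names and fields here are only a
sketch, not the actual OpenVZ code):

    struct pcpu;
    struct vcpu;

    /* "physical cpu": what stays tied to the hardware cpu */
    struct pcpu {
        struct sched_domain *sd;
        struct vcpu *running;       /* virtual cpu currently running here */
    };

    /* "virtual cpu": essentially today's runqueue */
    struct vcpu {
        spinlock_t lock;
        unsigned long nr_running, raw_weighted_load;
        struct prio_array *active, *expired, arrays[2];
        struct pcpu *last_pcpu;     /* last physical cpu it ran on */
    };

    typedef struct vcpu *vcpu_t;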

What advantages does it give?
1. it isolates Linux std scheduler code for scheduling
tasks inside runqueues, while adds possibility
to add cleanly more high-level scheduler, which can select
runqueues to run (let's call it "process groups scheduler" - PGS).
2. runqueues can run on arbitrary physical CPUs if needed
which helps to solve balancing problem on SMP.
3. it allows naturally to use different PGS algorithms
on top of Linux one, e.g. your algorithm (probabilistic) or
fair scheduling algorithms like SFQ, EEVDF, BVT with more
predictable parameters of QoS.
4. it will help us to get to the consensus and commit this work
into mainstream, because different PGS with different properties
will be possible.

Part of this idea is implemented in OpenVZ scheduler and in some
regards looks very much like your work, so I think if you like the idea
we can elaborate.

What do you think?

Thanks,
Kirill

> This patch splits the single main runqueue into several runqueues on each CPU.
> One (main) runqueue is used to hold task-groups and one runqueue for each
> task-group in the system.
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>
>
>
>
> init/Kconfig | 8 +
> kernel/sched.c | 254 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 231 insertions(+), 31 deletions(-)
>
> diff -puN kernel/sched.c~base_cpu_ctlr kernel/sched.c
> --- linux-2.6.18-rc3/kernel/sched.c~base_cpu_ctlr 2006-08-19 21:58:54.000000000 +0530
> +++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-20 21:44:30.000000000 +0530
> @@ -191,11 +191,51 @@ static inline unsigned int task_timeslic
>
> struct prio_array {
> unsigned int nr_active;
> + int best_static_prio, best_dyn_prio;
> DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
> struct list_head queue[MAX_PRIO];
> };
>
> /*
> + * Each task belong to some task-group. Task-groups and what tasks they contain
> + * are defined by the administrator using some suitable interface.
> + * Administrator can also define the CPU bandwidth provided to each task-group.
> + *
> + * Task-groups are given a certain priority to run on every CPU. Currently
> + * task-group priority on a CPU is defined to be the same as that of
> + * highest-priority runnable task it has on that CPU. Task-groups also
> + * get their own runqueue on every CPU. The main runqueue on each CPU is
> + * used to hold task-groups, rather than tasks.
> + *
> + * Scheduling decision on a CPU is now two-step : first pick highest priority
> + * task-group from the main runqueue and next pick highest priority task from
> + * the runqueue of that group. Both decisions are of O(1) complexity.
> + */
> +
> +/* runqueue used for every task-group */
> +struct task_grp_rq {
> + struct prio_array arrays[2];
> + struct prio_array *active, *expired, *rq_array;
> + unsigned long expired_timestamp;
> + unsigned int ticks; /* Decremented every tick */
> + int prio; /* Priority of the task-group */
> + struct list_head list;
> + struct task_grp *tg;
> +};
> +
> +/* xxx: Avoid this for !CONFIG_CPU_METER */
> +static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);
> +
> +/* task-group object - maintains information about each task-group */
> +struct task_grp {
> + int ticks; /* bandwidth given to the task-group */
> + struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
> +};
> +
> +/* The "default" task-group */
> +struct task_grp init_task_grp;
> +
> +/*
> * This is the main, per-CPU runqueue data structure.
> *
> * Locking rule: those places that want to lock multiple runqueues
> @@ -224,12 +264,11 @@ struct rq {
> */
> unsigned long nr_uninterruptible;
>
> - unsigned long expired_timestamp;
> unsigned long long timestamp_last_tick;
> struct task_struct *curr, *idle;
> struct mm_struct *prev_mm;
> + /* these arrays hold task-groups */
> struct prio_array *active, *expired, arrays[2];
> - int best_expired_prio;
> atomic_t nr_iowait;
>
> #ifdef CONFIG_SMP
> @@ -244,6 +283,7 @@ struct rq {
> #endif
>
> #ifdef CONFIG_SCHEDSTATS
> + /* xxx: move these to task-group runqueue */
> /* latency stats */
> struct sched_info rq_sched_info;
>
> @@ -657,6 +697,63 @@ sched_info_switch(struct task_struct *pr
> #define sched_info_switch(t, next) do { } while (0)
> #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
>
> +static unsigned int task_grp_timeslice(struct task_grp *tg)
> +{
> + /* xxx: take into account sleep_avg etc of the group */
> + return tg->ticks;
> +}
> +
> +/* Dequeue task-group object from the main per-cpu runqueue */
> +static void dequeue_task_grp(struct task_grp_rq *tgrq)
> +{
> + struct prio_array *array = tgrq->rq_array;
> +
> + array->nr_active--;
> + list_del(&tgrq->list);
> + if (list_empty(array->queue + tgrq->prio))
> + __clear_bit(tgrq->prio, array->bitmap);
> + tgrq->rq_array = NULL;
> +}
> +
> +/* Enqueue task-group object on the main per-cpu runqueue */
> +static void enqueue_task_grp(struct task_grp_rq *tgrq, struct prio_array *array,
> + int head)
> +{
> + if (!head)
> + list_add_tail(&tgrq->list, array->queue + tgrq->prio);
> + else
> + list_add(&tgrq->list, array->queue + tgrq->prio);
> + __set_bit(tgrq->prio, array->bitmap);
> + array->nr_active++;
> + tgrq->rq_array = array;
> +}
> +
> +/* return the task-group to which a task belongs */
> +static inline struct task_grp *task_grp(struct task_struct *p)
> +{
> + return &init_task_grp;
> +}
> +
> +static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,
> + int head)
> +{
> + int new_prio;
> + struct prio_array *array = tgrq->rq_array;
> +
> + new_prio = tgrq->active->best_dyn_prio;
> + if (new_prio == MAX_PRIO)
> + new_prio = tgrq->expired->best_dyn_prio;
> +
> + if (array)
> + dequeue_task_grp(tgrq);
> + tgrq->prio = new_prio;
> + if (new_prio != MAX_PRIO) {
> + if (!array)
> + array = rq->active;
> + enqueue_task_grp(tgrq, array, head);
> + }
> +}
> +
> /*
> * Adding/removing a task to/from a priority array:
> */
> @@ -664,8 +761,17 @@ static void dequeue_task(struct task_str
> {
> array->nr_active--;
> list_del(&p->run_list);
> - if (list_empty(array->queue + p->prio))
> + if (list_empty(array->queue + p->prio)) {
> __clear_bit(p->prio, array->bitmap);
> + if (p->prio == array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio =
> + sched_find_first_bit(array->bitmap);
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 0);
> + }
> + }
> }
>
> static void enqueue_task(struct task_struct *p, struct prio_array *array)
> @@ -675,6 +781,14 @@ static void enqueue_task(struct task_str
> __set_bit(p->prio, array->bitmap);
> array->nr_active++;
> p->array = array;
> +
> + if (p->prio < array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio = p->prio;
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 0);
> + }
> }
>
> /*
> @@ -693,6 +807,14 @@ enqueue_task_head(struct task_struct *p,
> __set_bit(p->prio, array->bitmap);
> array->nr_active++;
> p->array = array;
> +
> + if (p->prio < array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio = p->prio;
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 1);
> + }
> }
>
> /*
> @@ -831,10 +953,11 @@ static int effective_prio(struct task_st
> */
> static void __activate_task(struct task_struct *p, struct rq *rq)
> {
> - struct prio_array *target = rq->active;
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> + struct prio_array *target = tgrq->active;
>
> if (batch_task(p))
> - target = rq->expired;
> + target = tgrq->expired;
> enqueue_task(p, target);
> inc_nr_running(p, rq);
> }
> @@ -844,7 +967,10 @@ static void __activate_task(struct task_
> */
> static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
> {
> - enqueue_task_head(p, rq->active);
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> + struct prio_array *target = tgrq->active;
> +
> + enqueue_task_head(p, target);
> inc_nr_running(p, rq);
> }
>
> @@ -2077,7 +2203,7 @@ int can_migrate_task(struct task_struct
> return 1;
> }
>
> -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
> +#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)
>
> /*
> * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
> @@ -2884,6 +3010,8 @@ unsigned long long current_sched_time(co
> return ns;
> }
>
> +#define nr_tasks(tgrq) (tgrq->active->nr_active + tgrq->expired->nr_active)
> +
> /*
> * We place interactive tasks back into the active array, if possible.
> *
> @@ -2894,13 +3022,13 @@ unsigned long long current_sched_time(co
> * increasing number of running tasks. We also ignore the interactivity
> * if a better static_prio task has expired:
> */
> -static inline int expired_starving(struct rq *rq)
> +static inline int expired_starving(struct task_grp_rq *rq)
> {
> - if (rq->curr->static_prio > rq->best_expired_prio)
> + if (current->static_prio > rq->expired->best_static_prio)
> return 1;
> if (!STARVATION_LIMIT || !rq->expired_timestamp)
> return 0;
> - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
> + if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * nr_tasks(rq))
> return 1;
> return 0;
> }
> @@ -2991,6 +3119,7 @@ void scheduler_tick(void)
> struct task_struct *p = current;
> int cpu = smp_processor_id();
> struct rq *rq = cpu_rq(cpu);
> + struct task_grp_rq *tgrq = task_grp(p)->rq[cpu];
>
> update_cpu_clock(p, rq, now);
>
> @@ -3004,7 +3133,7 @@ void scheduler_tick(void)
> }
>
> /* Task might have expired already, but not scheduled off yet */
> - if (p->array != rq->active) {
> + if (p->array != tgrq->active) {
> set_tsk_need_resched(p);
> goto out;
> }
> @@ -3027,25 +3156,26 @@ void scheduler_tick(void)
> set_tsk_need_resched(p);
>
> /* put it at the end of the queue: */
> - requeue_task(p, rq->active);
> + requeue_task(p, tgrq->active);
> }
> goto out_unlock;
> }
> if (!--p->time_slice) {
> - dequeue_task(p, rq->active);
> + dequeue_task(p, tgrq->active);
> set_tsk_need_resched(p);
> p->prio = effective_prio(p);
> p->time_slice = task_timeslice(p);
> p->first_time_slice = 0;
>
> - if (!rq->expired_timestamp)
> - rq->expired_timestamp = jiffies;
> - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> - enqueue_task(p, rq->expired);
> - if (p->static_prio < rq->best_expired_prio)
> - rq->best_expired_prio = p->static_prio;
> + if (!tgrq->expired_timestamp)
> + tgrq->expired_timestamp = jiffies;
> + if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
> + enqueue_task(p, tgrq->expired);
> + if (p->static_prio < tgrq->expired->best_static_prio)
> + tgrq->expired->best_static_prio =
> + p->static_prio;
> } else
> - enqueue_task(p, rq->active);
> + enqueue_task(p, tgrq->active);
> } else {
> /*
> * Prevent a too long timeslice allowing a task to monopolize
> @@ -3066,12 +3196,21 @@ void scheduler_tick(void)
> if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
> p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
> (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
> - (p->array == rq->active)) {
> + (p->array == tgrq->active)) {
>
> - requeue_task(p, rq->active);
> + requeue_task(p, tgrq->active);
> set_tsk_need_resched(p);
> }
> }
> +
> + if (!--tgrq->ticks) {
> + /* Move the task group to expired list */
> + dequeue_task_grp(tgrq);
> + tgrq->ticks = task_grp_timeslice(task_grp(p));
> + enqueue_task_grp(tgrq, rq->expired, 0);
> + set_tsk_need_resched(p);
> + }
> +
> out_unlock:
> spin_unlock(&rq->lock);
> out:
> @@ -3264,6 +3403,7 @@ asmlinkage void __sched schedule(void)
> int cpu, idx, new_prio;
> long *switch_count;
> struct rq *rq;
> + struct task_grp_rq *next_grp;
>
> /*
> * Test if we are atomic. Since do_exit() needs to call into
> @@ -3332,23 +3472,41 @@ need_resched_nonpreemptible:
> idle_balance(cpu, rq);
> if (!rq->nr_running) {
> next = rq->idle;
> - rq->expired_timestamp = 0;
> wake_sleeping_dependent(cpu);
> goto switch_tasks;
> }
> }
>
> + /* Pick a task group first */
> +#ifdef CONFIG_CPUMETER
> array = rq->active;
> if (unlikely(!array->nr_active)) {
> /*
> * Switch the active and expired arrays.
> */
> - schedstat_inc(rq, sched_switch);
> rq->active = rq->expired;
> rq->expired = array;
> array = rq->active;
> - rq->expired_timestamp = 0;
> - rq->best_expired_prio = MAX_PRIO;
> + }
> + idx = sched_find_first_bit(array->bitmap);
> + queue = array->queue + idx;
> + next_grp = list_entry(queue->next, struct task_grp_rq, list);
> +#else
> + next_grp = init_task_grp.rq[cpu];
> +#endif
> +
> + /* Pick a task within that group next */
> + array = next_grp->active;
> + if (!array->nr_active) {
> + /*
> + * Switch the active and expired arrays.
> + */
> + schedstat_inc(next_grp, sched_switch);
> + next_grp->active = next_grp->expired;
> + next_grp->expired = array;
> + array = next_grp->active;
> + next_grp->expired_timestamp = 0;
> + next_grp->expired->best_static_prio = MAX_PRIO;
> }
>
> idx = sched_find_first_bit(array->bitmap);
> @@ -4413,7 +4571,8 @@ asmlinkage long sys_sched_getaffinity(pi
> asmlinkage long sys_sched_yield(void)
> {
> struct rq *rq = this_rq_lock();
> - struct prio_array *array = current->array, *target = rq->expired;
> + struct task_grp_rq *tgrq = task_grp(current)->rq[task_cpu(current)];
> + struct prio_array *array = current->array, *target = tgrq->expired;
>
> schedstat_inc(rq, yld_cnt);
> /*
> @@ -4424,13 +4583,13 @@ asmlinkage long sys_sched_yield(void)
> * array.)
> */
> if (rt_task(current))
> - target = rq->active;
> + target = tgrq->active;
>
> if (array->nr_active == 1) {
> schedstat_inc(rq, yld_act_empty);
> - if (!rq->expired->nr_active)
> + if (!tgrq->expired->nr_active)
> schedstat_inc(rq, yld_both_empty);
> - } else if (!rq->expired->nr_active)
> + } else if (!tgrq->expired->nr_active)
> schedstat_inc(rq, yld_exp_empty);
>
> if (array != target) {
> @@ -6722,21 +6881,54 @@ int in_sched_functions(unsigned long add
> && addr < (unsigned long)__sched_text_end);
> }
>
> +static void task_grp_rq_init(struct task_grp_rq *tgrq, struct task_grp *tg)
> +{
> + int j, k;
> +
> + tgrq->active = tgrq->arrays;
> + tgrq->expired = tgrq->arrays + 1;
> + tgrq->rq_array = NULL;
> + tgrq->expired->best_static_prio = MAX_PRIO;
> + tgrq->expired->best_dyn_prio = MAX_PRIO;
> + tgrq->active->best_static_prio = MAX_PRIO;
> + tgrq->active->best_dyn_prio = MAX_PRIO;
> + tgrq->prio = MAX_PRIO;
> + tgrq->ticks = tg->ticks;
> + tgrq->tg = tg;
> + INIT_LIST_HEAD(&tgrq->list);
> +
> + for (j = 0; j < 2; j++) {
> + struct prio_array *array;
> +
> + array = tgrq->arrays + j;
> + for (k = 0; k < MAX_PRIO; k++) {
> + INIT_LIST_HEAD(array->queue + k);
> + __clear_bit(k, array->bitmap);
> + }
> + // delimiter for bitsearch
> + __set_bit(MAX_PRIO, array->bitmap);
> + }
> +}
> +
> void __init sched_init(void)
> {
> int i, j, k;
>
> + init_task_grp.ticks = -1; /* Unlimited bandwidth */
> +
> for_each_possible_cpu(i) {
> struct prio_array *array;
> struct rq *rq;
> + struct task_grp_rq *tgrq;
>
> rq = cpu_rq(i);
> + tgrq = init_task_grp.rq[i] = &per_cpu(init_tg_rq, i);
> spin_lock_init(&rq->lock);
> + task_grp_rq_init(tgrq, &init_task_grp);
> lockdep_set_class(&rq->lock, &rq->rq_lock_key);
> rq->nr_running = 0;
> rq->active = rq->arrays;
> rq->expired = rq->arrays + 1;
> - rq->best_expired_prio = MAX_PRIO;
>
> #ifdef CONFIG_SMP
> rq->sd = NULL;
> diff -puN init/Kconfig~base_cpu_ctlr init/Kconfig
> --- linux-2.6.18-rc3/init/Kconfig~base_cpu_ctlr 2006-08-19 21:58:54.000000000 +0530
> +++ linux-2.6.18-rc3-root/init/Kconfig 2006-08-19 21:58:54.000000000 +0530
> @@ -248,6 +248,14 @@ config CPUSETS
>
> Say N if unsure.
>
> +config CPUMETER
> + bool "CPU resource control"
> + depends on CPUSETS
> + help
> + This options lets you create cpu resource partitions within
> + cpusets. Each resource partition can be given a different quota
> + of CPU usage.
> +
> config RELAY
> bool "Kernel->user space relay support (formerly relayfs)"
> help
>
> _

2006-08-28 01:50:34

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 7/7] CPU controller V1 - (temporary) cpuset interface

Vatsa, responding to pj:
> > I can certainly imagine that such constraints make it easier to write
> > correct scheduler group management code. And I can certainly allow
> > that the cpuset style API, allowing one bit at a time to be changed
> > with each separate open/write/close sequence of system calls, makes it
> > difficult to impose such constraints over the state of several settings
> > at once.
>
> Could you elaborate on this a bit? What do you mean by "difficult to impose
> such constraints"? Are you referring to things like: after metered child
> cpusets have been created, any changes to the cpus field of the parent have to
> be reflected in all its child-cpusets 'atomically'?
>
> > It sure would be nice to avoid the implicit side effects of copying
> > parent state on the mkdir, and it sure would be nice to reduce the
> > enforced sequencing of operations.
>
> How would this help?

Sorry for the delay in responding - I got distracted by other stuff.

I wasn't very clear ... Let me try again. You have some reasonable
enough constraints that apply to several per-cpuset attributes in
several cpusets, including a parent and its children. However, it
took me a bit of reading to figure this out, as these constraints are
magically imposed on the child cpusets when created beneath a parent
marked cpu_meter_enabled.

Normal cpuset operations are done by an open/write/close on a single
per-cpuset attribute at a time. I prefer to avoid magic in the cpuset
API, such as the special copying of the 'cpus', 'mems' and
'cpu_meter_enabled' attributes from the parent to the new child as a
consequence of doing the mkdir. Rather, if some setup requires these
particular settings, I prefer that we require the user code to have to
set each of these attributes, explicitly, one at a time.

However, your code, reasonably enough, does not want to deal with half
baked cpu_meter enabled setups. If the child is there at all, it must
instantly conform to all the necessary constraints. So you need to have
the kernel code modify several per-cpuset attributes instantly (from the
user perspective) and automagically, as a side effect of the mkdir call.

I don't like that kind of magic, because I think it makes it a bit
harder to understand the interface. But I don't have any idea how to
avoid it in this case. So, as I noted before, this seems worth some
more head scratching, if this CPU controller ends up adapting this
sort of cpuset API mechanism.

Granted, even the current cpuset code sometimes does that magic, as
for instance the creation of the several per-cpuset special files as
a side effect of the mkdir creating the cpuset. So I can understand
that you might not have realized I had any dislike of that style of
API ;).

On a related note, I am inclined, whenever I think about this, toward
suggesting that the 'cpu_meter_enabled' flag, which is set in both the
parent and child, should be two flags. Just as I like my operations
to be simple, with minimum side effects and special cases, so do I
like my flags to be simple, with minimum logical complications. In this
case, that flag means one of two different things, depending on whether
it is the parent or child. Perhaps splitting that flag into two flags,
cpuset_meter_parent and cpuset_meter_child, would help clarify this.

Sorry ... too many words ... hopefully it is clearer this time.


> There is at least one other serious issue in using cpusets for this API -
> how do we easily find all tasks belonging to a cpuset? There seems to be
> no easy way of doing this, w/o walking through the complete task-list.
>
> We need this task-list for various reasons - for example, to change
> task->load_weight of all tasks in a child-cpuset when its cpu quota is
> changed, etc.

Yes, the only way that exists with the current cpuset code to get a list
of the tasks in a cpuset requires locking and scanning the tasklist.

For current cpuset needs, that seems reasonable enough - a fairly
minimal kernel solution sufficient for our uses.

But I can well imagine that you would need a more efficient, scalable
way to list the tasks in a cpuset.

I suppose this would lead to a linked list of the tasks in a cpuset.
Given the already somewhat fancy [grep rcu kernel/cpuset.c] mechanism
we use to allow one task to modify another task's cpuset pointer, even
as that affected task is accessing that cpuset lock free, this could be
interesting coding.


> > One other question ... how does all this interact with Suresh's
> > dynamic sched domains?
>
> By "all this" I presume you are referring to the changes to cpuset in
> this patch. I see little interaction. AFAICS all metered child-cpusets should
> be in the same dynamic sched-domain.

That's the possible issue -- ensuring that all metered child-cpusets are
in the same dynamic sched domain. If you know that works out that way,
then good. I have had to impose a voluntary retirement on myself from
commenting on sched domain code -- it's just too hard for my modest
brain.

My basic concern was not that there would be specific detailed code
level conflicts between your cpu controllers and Suresh's dynamic sched
domains, but rather that I sensed two mechanisms here both intending to
group tasks for the purposes of affecting their scheduling, and I hoped
that it was clear how these two mechanisms played together.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-28 03:34:10

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

On Fri, Aug 25, 2006 at 04:38:00PM +0400, Kirill Korotaev wrote:
> Srivatsa,
>
> I suggest to split existing runqueue structure
> into 2 pieces: physical cpu (sd, ...) and
> virtual cpu (essentially a runqueue - array, nr_running, load etc.)
>
> Then replace all references to cpu as int with vcpu_t pointer.

That's going to be a massive change! If I understand you correctly,
things like get_cpu() return a virtual CPU number rather than the
corresponding "physical" CPU (the latter is anyway a misnomer on
virtualized platforms)? Also, get_cpu() now reads some structure and can
tell which CPU a task is running on. Now with virtual CPUs, another
level of translation is needed? Wonder what the performance impact of
that would be ..

> What advantages does it give?
> 1. it isolates Linux std scheduler code for scheduling
> tasks inside runqueues, while adds possibility
> to add cleanly more high-level scheduler, which can select
>> runqueues to run (let's call it "process groups scheduler" - PGS).
> 2. runqueues can run on arbitrary physical CPUs if needed
> which helps to solve balancing problem on SMP.

How do you see the relation between load-balance done through the sched-domain
hierarchy today and what will be done through virtual runqueues?

> 3. it allows naturally to use different PGS algorithms
> on top of Linux one. e.g. yours algorithm (probobalistic) or
> fair scheduling algorithms like SFQ, EEVDF, BVT with more
> predictable parameters of QoS.
> 4. it will help us to get to the consensus and commit this work
> into mainstream, because different PGS with different properties
> will be possible.
>
> Part of this idea is implemented in OpenVZ scheduler and in some
> regards looks very much like your work, so I think if you like the idea
> we can eloborate.
>
> What do you think?

I believe hypervisors like Xen have a similar approach (virtualizing the CPU
resource and running a virtual CPU on any available physical CPU). The
worry I have in applying this to the Linux kernel scheduler is in terms of its
invasiveness and thus general acceptability. I will however let the maintainers
decide on that. Sending some patches also probably will help measure this
"invasiveness/acceptability".

I had another question related to real-time tasks. How do you control
CPU usage of real-time tasks in different containers (especially if they
are SCHED_FIFO types)? Do they get capped at the bandwidth provided to
the container?

Also do you take any special steps to retain interactivity?

--
Regards,
vatsa

2006-08-28 08:13:12

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

Srivatsa Vaddagiri wrote:
> On Fri, Aug 25, 2006 at 04:38:00PM +0400, Kirill Korotaev wrote:
>
>>Srivatsa,
>>
>>I suggest to split existing runqueue structure
>>into 2 pieces: physical cpu (sd, ...) and
>>virtual cpu (essentially a runqueue - array, nr_running, load etc.)
>>
>>Then replace all references to cpu as int with vcpu_t pointer.
>
>
> That's going to be a massive change! If I understand you correctly,
> things like get_cpu() return a virtual CPU number rather than the
> corresponding "physical" CPU (the latter is anyway a misnomer on
> virtualized platforms)? Also, get_cpu() now reads some structure and can
> tell which CPU a task is running on. Now with virtual CPUs, another
> level of translation is needed? Wonder what the performance impact of
> that would be ..
First, in most places get_cpu() should not be replaced with get_vcpu()
(e.g. accessing per-CPU arrays requires get_cpu(), not get_vcpu()).
Next, these changes are _trivial_, which is good as they don't change the logic.

>>What advantages does it give?
>>1. it isolates Linux std scheduler code for scheduling
>> tasks inside runqueues, while adds possibility
>> to add cleanly more high-level scheduler, which can select
>> runqueues to run (let's call it "process groups scheduler" - PGS).
>>2. runqueues can run on arbitrary physical CPUs if needed
>> which helps to solve balancing problem on SMP.
>
> How do you see the relation between load-balance done through the sched-domain
> hierarchy today and what will be done through virtual runqueues?
sorry, can't get your question.

>>3. it allows naturally to use different PGS algorithms
>> on top of Linux one, e.g. your algorithm (probabilistic) or
>> fair scheduling algorithms like SFQ, EEVDF, BVT with more
>> predictable parameters of QoS.
>>4. it will help us to get to the consensus and commit this work
>> into mainstream, because different PGS with different properties
>> will be possible.
>>
>>Part of this idea is implemented in OpenVZ scheduler and in some
>>regards looks very much like your work, so I think if you like the idea
>>we can elaborate.
>>
>>What do you think?
>
>
> I believe hypervisors like Xen have a similar approach (virtualizing the CPU
> resource and running a virtual CPU on any available physical CPU). The
> worry I have in applying this to the Linux kernel scheduler is in terms of its
> invasiveness and thus general acceptability.
When I talked with Nick Piggin on summit he was quite optimistic
with such an approach. And again, this invasiveness is very simple
so I do not foresee many objections.

> I will however let the maintainers
> decide on that. Sending some patches also probably will help measure this
> "invasiveness/acceptability".
I propose to work on this together helping each other.
This makes part of your patches simpler and ours as well.
And what is good allows different approaches with different properties to be used.

> I had another question related to real-time tasks. How do you control
> CPU usage of real-time tasks in different containers (especially if they
> are SCHED_FIFO types)? Do they get capped at the bandwidth provided to
> the container?
Yes. The std Linux scheduler runs these processes as far as the fair scheduler
allows the container to run.

> Also do you take any special steps to retain interactivity?
no, it turned out to work quite fine w/o changes.

Thanks,
Kirill

2006-08-28 11:04:20

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

On Mon, Aug 28, 2006 at 12:15:40PM +0400, Kirill Korotaev wrote:
> >How do you see the relation between load-balance done through the sched-domain
> >hierarchy today and what will be done through virtual runqueues?
> sorry, can't get your question.

Currently, sched-domain hierarchies exist to facilitate load-balance at
different levels (SMT, MC, CPU and finally Node). They have been set up
to do more frequent load-balancing across groups at lower levels (like
SMT/CPU) than across groups at higher levels (ex: node).
Also, the domain defines what physical cpus you can actually balance
across (which can be modified by something like exclusive cpusets).
And some domains support balance-on-exec/fork while others needn't
support it.

The scheduler currently also relies on the load-balance done through this
mechanism to keep each physical CPU busy with work. When CPUs are left
idle because of this mechanism, it may be on purpose (for example the
recent HT/MC optimizations, where we strive to keep each package busy
rather than each CPU - achieved, I think, through active_load_balance).

My question was: when you want to exploit the physical vs virtual
runqueue separation on each CPU for load-balance purposes, how would that
play with the above-mentioned sched-domain based load-balance mechanisms?
For example, we need to preserve the HT/MC optimizations currently handled
in the sched-domains code.

> When I talked with Nick Piggin at the summit he was quite optimistic
> about such an approach. And again, the invasive changes are very simple,
> so I do not foresee many objections.

Ingo/Nick, what do you think? If we decide that is a useful thing to
try, I can see how these mechanisms will be useful for general SMP
systems too (w/o depending on resource management).

> >I will however let the maintainers
> >decide on that. Sending some patches also probably will help measure this
> >"invasiveness/acceptability".
> I propose we work on this together, helping each other.
> This makes part of your patches simpler, and ours as well.
> And, what is good, it allows different approaches with different properties
> to be used.

Are you advocating that we should be able to switch between
approaches at run-time? Linux (for some good reasons perhaps) has avoided
going that route so far (ex: pluggable schedulers).

--
Regards,
vatsa

2006-08-28 12:31:54

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

Srivatsa Vaddagiri wrote:
> On Mon, Aug 28, 2006 at 12:15:40PM +0400, Kirill Korotaev wrote:
>>When I talked with Nick Piggin at the summit he was quite optimistic
>>about such an approach. And again, the invasive changes are very simple,
>>so I do not foresee many objections.
>
>
> Ingo/Nick, what do you think? If we decide that is a useful thing to
> try, I can see how these mechanisms will be useful for general SMP
> systems too (w/o depending on resource management).

I still haven't had much time to look at the implementation, but this
design seems the cleanest I've considered, IMO.

Of course I would really hope we don't need any special casing in the
SMP balancing (which may be the tricky part). However, hopefully, if
things don't work well in that department, they can be made to by
improving the core code to be more general rather than by special casing.

Do you have a better (/another) idea for the design?

--
SUSE Labs, Novell Inc.

2006-08-28 12:53:14

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [PATCH 1/7] CPU controller V1 - split runqueue

On Mon, Aug 28, 2006 at 10:31:18PM +1000, Nick Piggin wrote:
> I still haven't had much time to look at the implementation, but this
> design seems the cleanest I've considered, IMO.
>
> Of course I would really hope we don't need any special casing in the
> SMP balancing (which may be the tricky part). However, hopefully, if
> things don't work well in that department, they can be made to by
> improving the core code to be more general rather than by special casing.
>
> Do you have a better (/another) idea for the design?

I don't know if it is a better idea, but I have been trying to
experiment with some token-based system where task-groups run until
they exhaust their tokens. Of course, this will be work-conserving
in the sense that expired task-groups continue running if there aren't
others who want to use their share. Tokens are renewed at periodic
intervals. I believe that is how the vserver scheduler works (though I haven't
looked at their code).
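
A minimal sketch of that token idea, with made-up names (group_tick,
refill_tokens, pick_group) and numbers; it only shows the accounting and
the work-conserving choice, not the scheduler integration.

#include <stdio.h>
#include <stdbool.h>

#define NR_GROUPS 2

struct task_grp {
	int quota;   /* tokens granted per refill period */
	int tokens;  /* tokens left in the current period */
};

/* Charge one tick to the running group; report whether it has expired. */
static bool group_tick(struct task_grp *g)
{
	if (g->tokens > 0)
		g->tokens--;
	return g->tokens == 0;
}

/* Work-conserving choice: prefer a group with tokens left, but let an
 * expired group keep running if nobody else wants the CPU. */
static struct task_grp *pick_group(struct task_grp *g, int n, int current_idx)
{
	for (int i = 0; i < n; i++)
		if (g[i].tokens > 0)
			return &g[i];
	return &g[current_idx];
}

/* Periodic refill, e.g. from a timer: everyone gets their quota back. */
static void refill_tokens(struct task_grp *g, int n)
{
	for (int i = 0; i < n; i++)
		g[i].tokens = g[i].quota;
}

int main(void)
{
	struct task_grp groups[NR_GROUPS] = {
		{ .quota = 3, .tokens = 3 },
		{ .quota = 1, .tokens = 1 },
	};
	int running = 0;

	for (int tick = 0; tick < 6; tick++) {
		if (group_tick(&groups[running]))
			running = (int)(pick_group(groups, NR_GROUPS, running) - groups);
		printf("tick %d: running group %d (tokens left %d)\n",
		       tick, running, groups[running].tokens);
	}
	refill_tokens(groups, NR_GROUPS);
	printf("after refill: group0=%d group1=%d tokens\n",
	       groups[0].tokens, groups[1].tokens);
	return 0;
}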

And I was thinking of using something similar to smpnice for
load-balance purposes.

The main point here is that the next-task-group scheduling decision is local
to each CPU (very similar to how the next task is picked currently), with
some load-balance code expected to balance tasks/task-groups across all
CPUs.

In what Kirill is proposing, this "scheduling next-task-group decision"
on each CPU perhaps takes a global view and, because of the
physical/virtual CPU separation, any CPU can be running any other CPU's
tasks (smp_processor_id()/get_cpu() etc. now returning a virtual CPU number rather
than the actual CPU on which they are running). Kirill, is that description correct?
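
One entirely made-up way to picture that contrast (not how OpenVZ actually
does it): in the local model each physical CPU picks only among its own
groups, while in the global model a shared table assigns virtual runqueues
to physical CPUs, so the vcpu id a task sees need not match the physical
CPU it runs on.

#include <stdio.h>

#define NR_PCPUS 2
#define NR_VRQS  4

static int load_of_vrq[NR_VRQS] = { 3, 0, 5, 1 };  /* runnable tasks per vrq */
static int vrq_of_pcpu[NR_PCPUS] = { -1, -1 };     /* which vrq each pcpu runs */

/* Global decision: give an idle physical CPU the busiest unassigned vrq. */
static void assign_vrq(int pcpu)
{
	int best = -1;

	for (int v = 0; v < NR_VRQS; v++) {
		int taken = 0;
		for (int p = 0; p < NR_PCPUS; p++)
			if (vrq_of_pcpu[p] == v)
				taken = 1;
		if (!taken && load_of_vrq[v] > 0 &&
		    (best < 0 || load_of_vrq[v] > load_of_vrq[best]))
			best = v;
	}
	vrq_of_pcpu[pcpu] = best;
}

int main(void)
{
	for (int p = 0; p < NR_PCPUS; p++) {
		assign_vrq(p);
		printf("physical cpu %d runs virtual runqueue %d\n",
		       p, vrq_of_pcpu[p]);
	}
	return 0;
}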


--
Regards,
vatsa