2006-08-04 05:03:55

by Srivatsa Vaddagiri

Subject: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Resource management has been talked about quite extensively in the
past, more recently in the context of containers. The basic requirement
here is to provide isolation between *groups* of tasks wrt their use
of various resources like CPU, Memory, I/O bandwidth, open file-descriptors etc.

Different maintainers have, however, expressed different opinions over the need
to complicate the kernel to meet this requirement, especially since it involves
core kernel code like the resource schedulers.

A BoF was hence held at OLS this year to arrive at a consensus on the minimum
requirements of a resource management solution for the Linux kernel. Some notes
taken at the BoF are posted here:

http://www.uwsg.indiana.edu/hypermail/linux/kernel/0607.3/0896.html

An important consensus point of the BoF seemed to be "focus on real
controllers more, preferably memory first, using some simple interface
and task grouping mechanism".

Going forward, the following points will need to be addressed:

- Grouping and interface
  - What mechanism to use for grouping tasks and for
    specifying task-group resource usage limits?
- Design of individual resource controllers like memory and cpu

This patch series is an attempt to take forward the design discussion of a
CPU controller.

For simplicity and convenience, cpusets have been chosen as the means to group
tasks here, primarily because cpusets already exist in the kernel, and also
because a resource-container definition should perhaps be unique only inside a cpuset.

Also, I think the controller design can be independent of the grouping
interface and hence can work with any other grouping interface we may
finally settle on for resource management.

Other salient notes about this CPU controller:

- It is a work in progress! I am sending this early so that I can get
some feedback on the general direction in which to proceed
further.

- Works only on UP for now (boot with maxcpus=1). IMO group-aware SMP
load-balancing can be achieved using the smpnice feature. I will work on
this next.

- Only a soft limit is supported (the controller is work-conserving).

- Each task-group gets its own runqueue on every cpu.

- In addition, there are active and expired arrays of the
task-groups themselves. Task-groups that have exhausted their
quota are put into the expired array.

- Task-groups have priorities. The priority of a task-group is the
same as the priority of the highest-priority runnable task it
has. I feel this will retain the interactivity of the system
as it is today.

- Scheduling the next task involves picking the highest-priority
task-group from the active array first and then picking the highest-priority
task within it. Both steps are O(1) (a rough sketch follows this list).

- Tokens are assigned to task-groups based on their assigned quota. Once
a task-group runs out of tokens, it is put into the expired array.
The array switch happens when the active array is empty.

- Although the algorithm is very simple, it perhaps needs more
refinement to handle different cases. In particular, I feel that task-groups
which are idle most of the time and experience occasional bursts
will need to be handled better than in this simple scheme.
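
To make the two-step pick and the token accounting concrete, here is a rough
sketch condensed from the schedule() and scheduler_tick() changes in the first
patch below (empty-array switching, locking and the CONFIG_CPUMETER=n case are
omitted; this is an illustration, not the exact patched code):

	/* Sketch only: two-step O(1) pick plus per-group token accounting */
	struct prio_array *array;
	struct task_grp_rq *grp;
	struct task_struct *next;
	int idx;

	array = rq->active;			/* step 1: pick a task-group */
	idx = sched_find_first_bit(array->bitmap);
	grp = list_entry(array->queue[idx].next, struct task_grp_rq, list);

	array = grp->active;			/* step 2: pick a task in it */
	idx = sched_find_first_bit(array->bitmap);
	next = list_entry(array->queue[idx].next, struct task_struct, run_list);

	/* every scheduler tick charges the running group one token */
	if (!--grp->ticks) {
		dequeue_task_grp(grp);
		grp->ticks = task_grp_timeslice(task_grp(next));
		enqueue_task_grp(grp, rq->expired, 0);	/* until array switch */
	}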

I would love to hear your comments on these design aspects of the
controller.

--
Regards,
vatsa


2006-08-04 05:05:12

by Srivatsa Vaddagiri

Subject: [ RFC, PATCH 1/5 ] CPU controller - base changes


This patch splits the single main runqueue into several runqueues on each CPU.
One (main) runqueue is used to hold task-groups, and there is one runqueue for
each task-group in the system.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


init/Kconfig | 8 +
kernel/sched.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 228 insertions(+), 31 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_base_changes kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_base_changes 2006-08-04 07:56:46.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 07:56:54.000000000 +0530
@@ -191,10 +191,48 @@ static inline unsigned int task_timeslic

struct prio_array {
unsigned int nr_active;
+ int best_static_prio, best_dyn_prio;
DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_PRIO];
};

+/*
+ * Each task belongs to some task-group. Task-groups and what tasks they contain
+ * are defined by the administrator using some suitable interface.
+ * Administrator can also define the CPU bandwidth provided to each task-group.
+ *
+ * Task-groups are given a certain priority to run on every CPU. Currently
+ * task-group priority on a CPU is defined to be the same as that of
+ * highest-priority runnable task it has on that CPU. Task-groups also
+ * get their own runqueue on every CPU. The main runqueue on each CPU is
+ * used to hold task-groups, rather than tasks.
+ *
+ * Scheduling decision on a CPU is now two-step : first pick highest priority
+ * task-group from the main runqueue and next pick highest priority task from
+ * the runqueue of that group. Both decisions are of O(1) complexity.
+ */
+
+/* runqueue used for every task-group */
+struct task_grp_rq {
+ struct prio_array arrays[2];
+ struct prio_array *active, *expired, *rq_array;
+ unsigned long expired_timestamp;
+ unsigned int ticks;
+ int prio; /* Priority of the task-group */
+ struct list_head list;
+};
+
+static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);
+
+/* task-group object - maintains information about each task-group */
+struct task_grp {
+ int ticks; /* bandwidth given to the task-group */
+ struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
+};
+
+/* The "default" task-group */
+struct task_grp init_task_grp;
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -224,12 +262,11 @@ struct rq {
*/
unsigned long nr_uninterruptible;

- unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
struct task_struct *curr, *idle;
struct mm_struct *prev_mm;
+ /* these arrays hold task-groups */
struct prio_array *active, *expired, arrays[2];
- int best_expired_prio;
atomic_t nr_iowait;

#ifdef CONFIG_SMP
@@ -244,6 +281,7 @@ struct rq {
#endif

#ifdef CONFIG_SCHEDSTATS
+ /* xxx: move these to task-group runqueue */
/* latency stats */
struct sched_info rq_sched_info;

@@ -657,6 +695,63 @@ sched_info_switch(struct task_struct *pr
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+static unsigned int task_grp_timeslice(struct task_grp *tg)
+{
+ /* xxx: take into account sleep_avg etc of the group */
+ return tg->ticks;
+}
+
+/* Dequeue task-group object from the main runqueue */
+static void dequeue_task_grp(struct task_grp_rq *tgrq)
+{
+ struct prio_array *array = tgrq->rq_array;
+
+ array->nr_active--;
+ list_del(&tgrq->list);
+ if (list_empty(array->queue + tgrq->prio))
+ __clear_bit(tgrq->prio, array->bitmap);
+ tgrq->rq_array = NULL;
+}
+
+/* Enqueue task-group object on the main runqueue */
+static void enqueue_task_grp(struct task_grp_rq *tgrq, struct prio_array *array,
+ int head)
+{
+ if (!head)
+ list_add_tail(&tgrq->list, array->queue + tgrq->prio);
+ else
+ list_add(&tgrq->list, array->queue + tgrq->prio);
+ __set_bit(tgrq->prio, array->bitmap);
+ array->nr_active++;
+ tgrq->rq_array = array;
+}
+
+/* return the task-group to which a task belongs */
+static inline struct task_grp *task_grp(struct task_struct *p)
+{
+ return &init_task_grp;
+}
+
+static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,
+ int head)
+{
+ int new_prio;
+ struct prio_array *array = tgrq->rq_array;
+
+ new_prio = tgrq->active->best_dyn_prio;
+ if (new_prio == MAX_PRIO)
+ new_prio = tgrq->expired->best_dyn_prio;
+
+ if (array)
+ dequeue_task_grp(tgrq);
+ tgrq->prio = new_prio;
+ if (new_prio != MAX_PRIO) {
+ if (!array)
+ array = rq->active;
+ enqueue_task_grp(tgrq, array, head);
+ }
+}
+
/*
* Adding/removing a task to/from a priority array:
*/
@@ -664,8 +759,17 @@ static void dequeue_task(struct task_str
{
array->nr_active--;
list_del(&p->run_list);
- if (list_empty(array->queue + p->prio))
+ if (list_empty(array->queue + p->prio)) {
__clear_bit(p->prio, array->bitmap);
+ if (p->prio == array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio =
+ sched_find_first_bit(array->bitmap);
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
+ }
}

static void enqueue_task(struct task_struct *p, struct prio_array *array)
@@ -675,6 +779,14 @@ static void enqueue_task(struct task_str
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 0);
+ }
}

/*
@@ -693,6 +805,14 @@ enqueue_task_head(struct task_struct *p,
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
+
+ if (p->prio < array->best_dyn_prio) {
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
+ array->best_dyn_prio = p->prio;
+ if (array == tgrq->active || !tgrq->active->nr_active)
+ update_task_grp_prio(tgrq, task_rq(p), 1);
+ }
}

/*
@@ -831,10 +951,11 @@ static int effective_prio(struct task_st
*/
static void __activate_task(struct task_struct *p, struct rq *rq)
{
- struct prio_array *target = rq->active;
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+ struct prio_array *target = tgrq->active;

if (batch_task(p))
- target = rq->expired;
+ target = tgrq->expired;
enqueue_task(p, target);
inc_nr_running(p, rq);
}
@@ -844,7 +965,10 @@ static void __activate_task(struct task_
*/
static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
{
- enqueue_task_head(p, rq->active);
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+ struct prio_array *target = tgrq->active;
+
+ enqueue_task_head(p, target);
inc_nr_running(p, rq);
}

@@ -2077,7 +2201,7 @@ int can_migrate_task(struct task_struct
return 1;
}

-#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
+#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)

/*
* move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
@@ -2884,6 +3008,8 @@ unsigned long long current_sched_time(co
return ns;
}

+#define nr_tasks(tgrq) (tgrq->active->nr_active + tgrq->expired->nr_active)
+
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -2894,13 +3020,13 @@ unsigned long long current_sched_time(co
* increasing number of running tasks. We also ignore the interactivity
* if a better static_prio task has expired:
*/
-static inline int expired_starving(struct rq *rq)
+static inline int expired_starving(struct task_grp_rq *rq)
{
- if (rq->curr->static_prio > rq->best_expired_prio)
+ if (current->static_prio > rq->expired->best_static_prio)
return 1;
if (!STARVATION_LIMIT || !rq->expired_timestamp)
return 0;
- if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
+ if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * nr_tasks(rq))
return 1;
return 0;
}
@@ -2991,6 +3117,7 @@ void scheduler_tick(void)
struct task_struct *p = current;
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
+ struct task_grp_rq *tgrq = task_grp(p)->rq[cpu];

update_cpu_clock(p, rq, now);

@@ -3004,7 +3131,7 @@ void scheduler_tick(void)
}

/* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
+ if (p->array != tgrq->active) {
set_tsk_need_resched(p);
goto out;
}
@@ -3027,25 +3154,26 @@ void scheduler_tick(void)
set_tsk_need_resched(p);

/* put it at the end of the queue: */
- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
}
goto out_unlock;
}
if (!--p->time_slice) {
- dequeue_task(p, rq->active);
+ dequeue_task(p, tgrq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
- enqueue_task(p, rq->expired);
- if (p->static_prio < rq->best_expired_prio)
- rq->best_expired_prio = p->static_prio;
+ if (!tgrq->expired_timestamp)
+ tgrq->expired_timestamp = jiffies;
+ if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
+ enqueue_task(p, tgrq->expired);
+ if (p->static_prio < tgrq->expired->best_static_prio)
+ tgrq->expired->best_static_prio =
+ p->static_prio;
} else
- enqueue_task(p, rq->active);
+ enqueue_task(p, tgrq->active);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3066,12 +3194,21 @@ void scheduler_tick(void)
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
+ (p->array == tgrq->active)) {

- requeue_task(p, rq->active);
+ requeue_task(p, tgrq->active);
set_tsk_need_resched(p);
}
}
+
+ if (!--tgrq->ticks) {
+ /* Move the task group to expired list */
+ dequeue_task_grp(tgrq);
+ tgrq->ticks = task_grp_timeslice(task_grp(p));
+ enqueue_task_grp(tgrq, rq->expired, 0);
+ set_tsk_need_resched(p);
+ }
+
out_unlock:
spin_unlock(&rq->lock);
out:
@@ -3264,6 +3401,7 @@ asmlinkage void __sched schedule(void)
int cpu, idx, new_prio;
long *switch_count;
struct rq *rq;
+ struct task_grp_rq *next_grp;

/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -3332,23 +3470,41 @@ need_resched_nonpreemptible:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
- rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu);
goto switch_tasks;
}
}

+ /* Pick a task group first */
+#ifdef CONFIG_CPUMETER
array = rq->active;
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
- schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
- rq->expired_timestamp = 0;
- rq->best_expired_prio = MAX_PRIO;
+ }
+ idx = sched_find_first_bit(array->bitmap);
+ queue = array->queue + idx;
+ next_grp = list_entry(queue->next, struct task_grp_rq, list);
+#else
+ next_grp = init_task_grp.rq[cpu];
+#endif
+
+ /* Pick a task within that group next */
+ array = next_grp->active;
+ if (!array->nr_active) {
+ /*
+ * Switch the active and expired arrays.
+ */
+ schedstat_inc(next_grp, sched_switch);
+ next_grp->active = next_grp->expired;
+ next_grp->expired = array;
+ array = next_grp->active;
+ next_grp->expired_timestamp = 0;
+ next_grp->expired->best_static_prio = MAX_PRIO;
}

idx = sched_find_first_bit(array->bitmap);
@@ -4413,7 +4569,8 @@ asmlinkage long sys_sched_getaffinity(pi
asmlinkage long sys_sched_yield(void)
{
struct rq *rq = this_rq_lock();
- struct prio_array *array = current->array, *target = rq->expired;
+ struct task_grp_rq *tgrq = task_grp(current)->rq[task_cpu(current)];
+ struct prio_array *array = current->array, *target = tgrq->expired;

schedstat_inc(rq, yld_cnt);
/*
@@ -4424,13 +4581,13 @@ asmlinkage long sys_sched_yield(void)
* array.)
*/
if (rt_task(current))
- target = rq->active;
+ target = tgrq->active;

if (array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);
- if (!rq->expired->nr_active)
+ if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_both_empty);
- } else if (!rq->expired->nr_active)
+ } else if (!tgrq->expired->nr_active)
schedstat_inc(rq, yld_exp_empty);

if (array != target) {
@@ -6722,21 +6879,53 @@ int in_sched_functions(unsigned long add
&& addr < (unsigned long)__sched_text_end);
}

+static void task_grp_rq_init(struct task_grp_rq *tgrq, int ticks)
+{
+ int j, k;
+
+ tgrq->active = tgrq->arrays;
+ tgrq->expired = tgrq->arrays + 1;
+ tgrq->rq_array = NULL;
+ tgrq->expired->best_static_prio = MAX_PRIO;
+ tgrq->expired->best_dyn_prio = MAX_PRIO;
+ tgrq->active->best_static_prio = MAX_PRIO;
+ tgrq->active->best_dyn_prio = MAX_PRIO;
+ tgrq->prio = MAX_PRIO;
+ tgrq->ticks = ticks;
+ INIT_LIST_HEAD(&tgrq->list);
+
+ for (j = 0; j < 2; j++) {
+ struct prio_array *array;
+
+ array = tgrq->arrays + j;
+ for (k = 0; k < MAX_PRIO; k++) {
+ INIT_LIST_HEAD(array->queue + k);
+ __clear_bit(k, array->bitmap);
+ }
+ // delimiter for bitsearch
+ __set_bit(MAX_PRIO, array->bitmap);
+ }
+}
+
void __init sched_init(void)
{
int i, j, k;

+ init_task_grp.ticks = -1; /* Unlimited bandwidth */
+
for_each_possible_cpu(i) {
struct prio_array *array;
struct rq *rq;
+ struct task_grp_rq *tgrq;

rq = cpu_rq(i);
+ tgrq = init_task_grp.rq[i] = &per_cpu(init_tg_rq, i);
spin_lock_init(&rq->lock);
+ task_grp_rq_init(tgrq, init_task_grp.ticks);
lockdep_set_class(&rq->lock, &rq->rq_lock_key);
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;

#ifdef CONFIG_SMP
rq->sd = NULL;
diff -puN init/Kconfig~cpu_ctlr_base_changes init/Kconfig
--- linux-2.6.18-rc3/init/Kconfig~cpu_ctlr_base_changes 2006-08-04 07:56:46.000000000 +0530
+++ linux-2.6.18-rc3-root/init/Kconfig 2006-08-04 07:56:54.000000000 +0530
@@ -248,6 +248,14 @@ config CPUSETS

Say N if unsure.

+config CPUMETER
+ bool "CPU resource control"
+ depends on CPUSETS
+ help
+ This options lets you create cpu resource partitions within
+ cpusets. Each resource partition can be given a different quota
+ of CPU usage.
+
config RELAY
bool "Kernel->user space relay support (formerly relayfs)"
help

_
--
Regards,
vatsa

2006-08-04 05:05:57

by Srivatsa Vaddagiri

Subject: [ RFC, PATCH 2/5 ] CPU controller - Define group operations


Define these operations for a task-group:

- create new group
- destroy existing group
- assign bandwidth (quota) for a group
- get bandwidth (quota) of a group
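
A rough sketch of how a caller might exercise these operations through the
cpu_ctlr ops table defined below (the caller itself is hypothetical; patch 5
wires the operations up to cpusets, and error handling is omitted here):

	void *grp;

	grp = cpu_ctlr->alloc_group();			/* create a new task-group */
	if (grp) {
		cpu_ctlr->assign_quota(grp, 30);	/* 30% of CPU bandwidth */
		printk(KERN_INFO "group quota: %d%%\n", cpu_ctlr->get_quota(grp));
		cpu_ctlr->dealloc_group(grp);		/* destroy the group again */
	}

Internally, sched_assign_quota() converts the percentage into scheduler ticks as
(quota * 5 * HZ) / 100, i.e. the group's share of a 5*HZ-tick budget; assuming
HZ=1000 (HZ is a config choice), a 30% quota maps to 1500 ticks.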


Signed-off-by : Srivatsa Vaddagiri <[email protected]>



include/linux/sched.h | 12 +++++++
kernel/sched.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 91 insertions(+)

diff -puN kernel/sched.c~cpu_ctlr_grp_ops kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_grp_ops 2006-08-04 07:58:50.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 07:58:50.000000000 +0530
@@ -7063,3 +7063,82 @@ void set_curr_task(int cpu, struct task_
}

#endif
+
+#ifdef CONFIG_CPUMETER
+
+/* Allocate runqueue structures for the new task-group */
+void *sched_alloc_group(void)
+{
+ struct task_grp *tg;
+ struct task_grp_rq *tgrq;
+ int i;
+
+ tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ if (!tg)
+ return NULL;
+
+ tg->ticks = -1; /* No limit */
+
+ for_each_possible_cpu(i) {
+ tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
+ if (!tgrq)
+ goto oom;
+ tg->rq[i] = tgrq;
+ task_grp_rq_init(tgrq, tg->ticks);
+ }
+
+ return (void *)tg;
+oom:
+ while (i--)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+ return NULL;
+}
+
+/* Deallocate runqueue structures */
+void sched_dealloc_group(void *grp)
+{
+ struct task_grp *tg = (struct task_grp *)grp;
+ int i;
+
+ for_each_possible_cpu(i)
+ kfree(tg->rq[i]);
+
+ kfree(tg);
+}
+
+/* Assign quota to this group */
+void sched_assign_quota(void *grp, int quota)
+{
+ struct task_grp *tg = (struct task_grp *)grp;
+ int i;
+
+ tg->ticks = (quota * 5 * HZ) / 100;
+
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+
+}
+
+/* Return assigned quota for this group */
+int sched_get_quota(void *grp)
+{
+ struct task_grp *tg = (struct task_grp *)grp;
+ int quota;
+
+ quota = (tg->ticks * 100) / (5 * HZ);
+
+ return quota;
+}
+
+static struct task_grp_ops sched_grp_ops = {
+ .alloc_group = sched_alloc_group,
+ .dealloc_group = sched_dealloc_group,
+ .assign_quota = sched_assign_quota,
+ .get_quota = sched_get_quota,
+};
+
+struct task_grp_ops *cpu_ctlr = &sched_grp_ops;
+
+#endif /* CONFIG_CPUMETER */
diff -puN include/linux/sched.h~cpu_ctlr_grp_ops include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_grp_ops 2006-08-04 07:58:50.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-04 07:58:50.000000000 +0530
@@ -1604,6 +1604,18 @@ static inline void thaw_processes(void)
static inline int try_to_freeze(void) { return 0; }

#endif /* CONFIG_PM */
+
+#ifdef CONFIG_CPUMETER
+struct task_grp_ops {
+ void *(*alloc_group)(void);
+ void (*dealloc_group)(void *grp);
+ void (*assign_quota)(void *grp, int quota);
+ int (*get_quota)(void *grp);
+};
+
+extern struct task_grp_ops *cpu_ctlr;
+#endif
+
#endif /* __KERNEL__ */

#endif

_
--
Regards,
vatsa

2006-08-04 05:06:56

by Srivatsa Vaddagiri

Subject: [ RFC, PATCH 3/5 ] CPU controller - deal with movement of tasks


When a task moves between groups (as initiated by an administrator), it has to
be removed from the runqueue of its old group and added to the runqueue of the
new group. This patch defines this move operation.

Also, during this move the group identity of a task can be ambiguous, since the
identity is usually derived from some pointer in the task_struct, and that
pointer can point to either the old group or the new group and not both! Hence
the task-group runqueue from which the task is being removed, or to which it is
being added, is now passed explicitly into the de(en)queue_task() functions.
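
As a rough sketch of what this looks like (simplified from the sched_move_task()
added further down; locking and the NULL-group defaults are omitted, and
tg_old/tg_new stand for the two groups involved):

	/* the caller names both group runqueues explicitly, because
	 * task_grp(tsk) is ambiguous while the move is in progress */
	struct task_grp_rq *old_rq = tg_old->rq[task_cpu(tsk)];
	struct task_grp_rq *new_rq = tg_new->rq[task_cpu(tsk)];

	if (tsk->array) {
		dequeue_task(tsk, tsk->array, old_rq);		/* leave old group */
		enqueue_task(tsk, new_rq->active, new_rq);	/* join new group  */
	}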


Signed-off-by : Srivatsa Vaddagiri <[email protected]>



include/linux/sched.h | 1
kernel/sched.c | 90 ++++++++++++++++++++++++++++++++++++--------------
2 files changed, 66 insertions(+), 25 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_move_task kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_move_task 2006-08-04 07:59:08.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 08:09:15.000000000 +0530
@@ -755,15 +755,14 @@ static inline void update_task_grp_prio(
/*
* Adding/removing a task to/from a priority array:
*/
-static void dequeue_task(struct task_struct *p, struct prio_array *array)
+static void dequeue_task(struct task_struct *p, struct prio_array *array,
+ struct task_grp_rq *tgrq)
{
array->nr_active--;
list_del(&p->run_list);
if (list_empty(array->queue + p->prio)) {
__clear_bit(p->prio, array->bitmap);
if (p->prio == array->best_dyn_prio) {
- struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
-
array->best_dyn_prio =
sched_find_first_bit(array->bitmap);
if (array == tgrq->active || !tgrq->active->nr_active)
@@ -772,7 +771,8 @@ static void dequeue_task(struct task_str
}
}

-static void enqueue_task(struct task_struct *p, struct prio_array *array)
+static void enqueue_task(struct task_struct *p, struct prio_array *array,
+ struct task_grp_rq *tgrq)
{
sched_info_queued(p);
list_add_tail(&p->run_list, array->queue + p->prio);
@@ -781,8 +781,6 @@ static void enqueue_task(struct task_str
p->array = array;

if (p->prio < array->best_dyn_prio) {
- struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
-
array->best_dyn_prio = p->prio;
if (array == tgrq->active || !tgrq->active->nr_active)
update_task_grp_prio(tgrq, task_rq(p), 0);
@@ -799,7 +797,8 @@ static void requeue_task(struct task_str
}

static inline void
-enqueue_task_head(struct task_struct *p, struct prio_array *array)
+enqueue_task_head(struct task_struct *p, struct prio_array *array,
+ struct task_grp_rq *tgrq)
{
list_add(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
@@ -807,8 +806,6 @@ enqueue_task_head(struct task_struct *p,
p->array = array;

if (p->prio < array->best_dyn_prio) {
- struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
-
array->best_dyn_prio = p->prio;
if (array == tgrq->active || !tgrq->active->nr_active)
update_task_grp_prio(tgrq, task_rq(p), 1);
@@ -956,7 +953,7 @@ static void __activate_task(struct task_

if (batch_task(p))
target = tgrq->expired;
- enqueue_task(p, target);
+ enqueue_task(p, target, tgrq);
inc_nr_running(p, rq);
}

@@ -968,7 +965,7 @@ static inline void __activate_idle_task(
struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
struct prio_array *target = tgrq->active;

- enqueue_task_head(p, target);
+ enqueue_task_head(p, target, tgrq);
inc_nr_running(p, rq);
}

@@ -1097,8 +1094,10 @@ static void activate_task(struct task_st
*/
static void deactivate_task(struct task_struct *p, struct rq *rq)
{
+ struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
+
dec_nr_running(p, rq);
- dequeue_task(p, p->array);
+ dequeue_task(p, p->array, tgrq);
p->array = NULL;
}

@@ -2151,11 +2150,15 @@ static void pull_task(struct rq *src_rq,
struct task_struct *p, struct rq *this_rq,
struct prio_array *this_array, int this_cpu)
{
- dequeue_task(p, src_array);
+ struct task_grp_rq *tgrq;
+
+ tgrq = task_grp(p)->rq[task_cpu(p)];
+ dequeue_task(p, src_array, tgrq);
dec_nr_running(p, src_rq);
set_task_cpu(p, this_cpu);
inc_nr_running(p, this_rq);
- enqueue_task(p, this_array);
+ tgrq = task_grp(p)->rq[task_cpu(p)];
+ enqueue_task(p, this_array, tgrq);
p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
+ this_rq->timestamp_last_tick;
/*
@@ -3159,7 +3162,7 @@ void scheduler_tick(void)
goto out_unlock;
}
if (!--p->time_slice) {
- dequeue_task(p, tgrq->active);
+ dequeue_task(p, tgrq->active, tgrq);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
@@ -3168,12 +3171,12 @@ void scheduler_tick(void)
if (!tgrq->expired_timestamp)
tgrq->expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
- enqueue_task(p, tgrq->expired);
+ enqueue_task(p, tgrq->expired, tgrq);
if (p->static_prio < tgrq->expired->best_static_prio)
tgrq->expired->best_static_prio =
p->static_prio;
} else
- enqueue_task(p, tgrq->active);
+ enqueue_task(p, tgrq->active, tgrq);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3523,9 +3526,11 @@ need_resched_nonpreemptible:
new_prio = recalc_task_prio(next, next->timestamp + delta);

if (unlikely(next->prio != new_prio)) {
- dequeue_task(next, array);
+ struct task_grp_rq *tgrq =
+ task_grp(next)->rq[task_cpu(next)];
+ dequeue_task(next, array, tgrq);
next->prio = new_prio;
- enqueue_task(next, array);
+ enqueue_task(next, array, tgrq);
}
}
next->sleep_type = SLEEP_NORMAL;
@@ -3982,16 +3987,18 @@ void rt_mutex_setprio(struct task_struct
struct prio_array *array;
unsigned long flags;
struct rq *rq;
+ struct task_grp_rq *tgrq;
int oldprio;

BUG_ON(prio < 0 || prio > MAX_PRIO);

rq = task_rq_lock(p, &flags);
+ tgrq = task_grp(p)->rq[task_cpu(p)];

oldprio = p->prio;
array = p->array;
if (array)
- dequeue_task(p, array);
+ dequeue_task(p, array, tgrq);
p->prio = prio;

if (array) {
@@ -4001,7 +4008,7 @@ void rt_mutex_setprio(struct task_struct
*/
if (rt_task(p))
array = rq->active;
- enqueue_task(p, array);
+ enqueue_task(p, array, tgrq);
/*
* Reschedule if we are currently running on this runqueue and
* our priority decreased, or if we are not currently running on
@@ -4024,6 +4031,7 @@ void set_user_nice(struct task_struct *p
int old_prio, delta;
unsigned long flags;
struct rq *rq;
+ struct task_grp_rq *tgrq;

if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
return;
@@ -4032,6 +4040,8 @@ void set_user_nice(struct task_struct *p
* the task might be in the middle of scheduling on another CPU.
*/
rq = task_rq_lock(p, &flags);
+ tgrq = task_grp(p)->rq[task_cpu(p)];
+
/*
* The RT priorities are set via sched_setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
@@ -4044,7 +4054,7 @@ void set_user_nice(struct task_struct *p
}
array = p->array;
if (array) {
- dequeue_task(p, array);
+ dequeue_task(p, array, tgrq);
dec_raw_weighted_load(rq, p);
}

@@ -4055,7 +4065,7 @@ void set_user_nice(struct task_struct *p
delta = p->prio - old_prio;

if (array) {
- enqueue_task(p, array);
+ enqueue_task(p, array, tgrq);
inc_raw_weighted_load(rq, p);
/*
* If the task increased its priority or is running and
@@ -4591,8 +4601,8 @@ asmlinkage long sys_sched_yield(void)
schedstat_inc(rq, yld_exp_empty);

if (array != target) {
- dequeue_task(current, array);
- enqueue_task(current, target);
+ dequeue_task(current, array, tgrq);
+ enqueue_task(current, target, tgrq);
} else
/*
* requeue_task is cheaper so perform that if possible.
@@ -7132,10 +7142,40 @@ int sched_get_quota(void *grp)
return quota;
}

+/* Move a task from one task_grp_rq to another */
+void sched_move_task(struct task_struct *tsk, void *grp_old, void *grp_new)
+{
+ struct rq *rq;
+ unsigned long flags;
+ struct task_grp *tg_old = (struct task_grp *)grp_old,
+ *tg_new = (struct task_grp *)grp_new;
+ struct task_grp_rq *tgrq;
+
+ if (tg_new == tg_old)
+ return;
+
+ if (!tg_new)
+ tg_new = &init_task_grp;
+ if (!tg_old)
+ tg_old = &init_task_grp;
+
+ rq = task_rq_lock(tsk, &flags);
+
+ if (tsk->array) {
+ tgrq = tg_old->rq[task_cpu(tsk)];
+ dequeue_task(tsk, tsk->array, tgrq);
+ tgrq = tg_new->rq[task_cpu(tsk)];
+ enqueue_task(tsk, tg_new->rq[task_cpu(tsk)]->active, tgrq);
+ }
+
+ task_rq_unlock(rq, &flags);
+}
+
static struct task_grp_ops sched_grp_ops = {
.alloc_group = sched_alloc_group,
.dealloc_group = sched_dealloc_group,
.assign_quota = sched_assign_quota,
+ .move_task = sched_move_task,
.get_quota = sched_get_quota,
};

diff -puN include/linux/sched.h~cpu_ctlr_move_task include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_move_task 2006-08-04 08:09:27.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-04 08:09:41.000000000 +0530
@@ -1610,6 +1610,7 @@ struct task_grp_ops {
void *(*alloc_group)(void);
void (*dealloc_group)(void *grp);
void (*assign_quota)(void *grp, int quota);
+ void (*move_task)(struct task_struct *tsk, void *old, void *new);
int (*get_quota)(void *grp);
};


_
--
Regards,
vatsa

2006-08-04 05:07:44

by Srivatsa Vaddagiri

Subject: [ RFC, PATCH 4/5 ] CPU controller - deal with dont care groups


Deal with task-groups whose bandwidth hasn't been explicitly set by the
administrator. Unallocated CPU bandwidth is distributed equally among such
"don't care" groups.

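As a rough worked example of this distribution (HZ=1000 is assumed here; HZ is
a config choice): suppose a parent group has two children that have been
assigned 80% and 5%, leaving left_over_pct = 15, and three remaining "don't
care" children. The recalc_dontcare() helper added below then gives each of
those three

	ticks = ((15 / 3) * 5 * HZ) / 100 = (5 * 5 * 1000) / 100 = 250

i.e. 5% of the 5*HZ-tick budget apiece.
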
Signed-off-by : Srivatsa Vaddagiri <[email protected]>


include/linux/sched.h | 2 -
kernel/sched.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 79 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~cpu_ctlr_handle_dont_cares kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_handle_dont_cares 2006-08-04 08:20:53.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 08:31:38.000000000 +0530
@@ -227,6 +227,12 @@ static DEFINE_PER_CPU(struct task_grp_rq
/* task-group object - maintains information about each task-group */
struct task_grp {
int ticks; /* bandwidth given to the task-group */
+ int left_over_pct;
+ int total_dont_care_grps;
+ int dont_care; /* Does this group care for its bandwidth ? */
+ struct task_grp *parent;
+ struct list_head dont_care_list;
+ struct list_head list;
struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
};

@@ -6922,6 +6928,12 @@ void __init sched_init(void)
int i, j, k;

init_task_grp.ticks = -1; /* Unlimited bandwidth */
+ init_task_grp.left_over_pct = 100; /* 100% unallocated bandwidth */
+ init_task_grp.parent = NULL;
+ init_task_grp.total_dont_care_grps = 1; /* init_task_grp itself */
+ init_task_grp.dont_care = 1;
+ INIT_LIST_HEAD(&init_task_grp.dont_care_list);
+ list_add_tail(&init_task_grp.list, &init_task_grp.dont_care_list);

for_each_possible_cpu(i) {
struct prio_array *array;
@@ -7076,10 +7088,33 @@ void set_curr_task(int cpu, struct task_

#ifdef CONFIG_CPUMETER

+/* Distribute left over bandwidth equally to all "dont care" task groups */
+static void recalc_dontcare(struct task_grp *tg_root)
+{
+ int ticks;
+ struct list_head *entry;
+
+ if (!tg_root->total_dont_care_grps)
+ return;
+
+ ticks = ((tg_root->left_over_pct /
+ tg_root->total_dont_care_grps) * 5 * HZ) / 100;
+
+ list_for_each(entry, &tg_root->dont_care_list) {
+ struct task_grp *tg;
+ int i;
+
+ tg = list_entry(entry, struct task_grp, list);
+ tg->ticks = ticks;
+ for_each_possible_cpu(i)
+ tg->rq[i]->ticks = tg->ticks;
+ }
+}
+
/* Allocate runqueue structures for the new task-group */
-void *sched_alloc_group(void)
+void *sched_alloc_group(void *grp_parent)
{
- struct task_grp *tg;
+ struct task_grp *tg, *tg_parent = (struct task_grp *)grp_parent;
struct task_grp_rq *tgrq;
int i;

@@ -7088,6 +7123,11 @@ void *sched_alloc_group(void)
return NULL;

tg->ticks = -1; /* No limit */
+ tg->parent = tg_parent;
+ tg->dont_care = 1;
+ tg->left_over_pct = 100;
+ tg->ticks = -1; /* No limit */
+ INIT_LIST_HEAD(&tg->dont_care_list);

for_each_possible_cpu(i) {
tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
@@ -7097,6 +7137,15 @@ void *sched_alloc_group(void)
task_grp_rq_init(tgrq, tg->ticks);
}

+ if (tg->parent) {
+ tg->parent->total_dont_care_grps++;
+ list_add_tail(&tg->list, &tg->parent->dont_care_list);
+ recalc_dontcare(tg->parent);
+ } else {
+ tg->total_dont_care_grps = 1;
+ list_add_tail(&tg->list, &tg->dont_care_list);
+ }
+
return (void *)tg;
oom:
while (i--)
@@ -7109,9 +7158,18 @@ oom:
/* Deallocate runqueue structures */
void sched_dealloc_group(void *grp)
{
- struct task_grp *tg = (struct task_grp *)grp;
+ struct task_grp *tg = (struct task_grp *)grp, *tg_root = tg->parent;
int i;

+ if (!tg_root)
+ tg_root = tg;
+
+ if (tg->dont_care) {
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ recalc_dontcare(tg_root);
+ }
+
for_each_possible_cpu(i)
kfree(tg->rq[i]);

@@ -7121,14 +7179,27 @@ void sched_dealloc_group(void *grp)
/* Assign quota to this group */
void sched_assign_quota(void *grp, int quota)
{
- struct task_grp *tg = (struct task_grp *)grp;
+ struct task_grp *tg = (struct task_grp *)grp, *tg_root = tg->parent;
+ int old_quota = 0;
int i;

+ if (!tg_root)
+ tg_root = tg;
+
+ if (tg->dont_care) {
+ tg->dont_care = 0;
+ tg_root->total_dont_care_grps--;
+ list_del(&tg->list);
+ } else
+ old_quota = (tg->ticks * 100) / (5 * HZ);
+
+ tg_root->left_over_pct -= (quota - old_quota);
tg->ticks = (quota * 5 * HZ) / 100;

for_each_possible_cpu(i)
tg->rq[i]->ticks = tg->ticks;

+ recalc_dontcare(tg_root);
}

/* Return assigned quota for this group */
@@ -7137,6 +7208,9 @@ int sched_get_quota(void *grp)
struct task_grp *tg = (struct task_grp *)grp;
int quota;

+ if (tg->dont_care)
+ return 0;
+
quota = (tg->ticks * 100) / (5 * HZ);

return quota;
diff -puN include/linux/sched.h~cpu_ctlr_handle_dont_cares include/linux/sched.h
--- linux-2.6.18-rc3/include/linux/sched.h~cpu_ctlr_handle_dont_cares 2006-08-04 08:31:56.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/sched.h 2006-08-04 08:32:07.000000000 +0530
@@ -1607,7 +1607,7 @@ static inline int try_to_freeze(void) {

#ifdef CONFIG_CPUMETER
struct task_grp_ops {
- void *(*alloc_group)(void);
+ void *(*alloc_group)(void *grp_parent);
void (*dealloc_group)(void *grp);
void (*assign_quota)(void *grp, int quota);
void (*move_task)(struct task_struct *tsk, void *old, void *new);

_
--
Regards,
vatsa

2006-08-04 05:08:35

by Srivatsa Vaddagiri

Subject: [ RFC, PATCH 5/5 ] CPU controller - interface with cpusets


A controller cannot live in isolation. *Somebody* needs to tell it about
events like the creation of a new task-group, the destruction of an existing
group, the assignment of bandwidth, and finally the movement of tasks between
task-groups.

We also need *some* interface for grouping tasks and assigning bandwidth
to the task-groups. cpusets have been chosen as this interface for now.

This patch makes the following extensions to cpusets:

- Every cpuset directory has two new files: 'meter_cpu' and 'cpu_quota'.
- Writing '1' into 'meter_cpu' marks the cpuset as "metered", i.e.
  tasks in this cpuset and its child cpusets will have their CPU
  usage controlled by the controller as per the quota settings
  in 'cpu_quota'.
- Writing a number between 0 and 100 into the 'cpu_quota' file of a metered
  cpuset assigns the corresponding bandwidth to all tasks in the
  cpuset (task-group).
- Only cpusets that are marked "exclusive" and have no child cpusets
  can be metered.
- Metered cpusets cannot have grand-children!

Metered cpusets were discussed at length at:

http://lkml.org/lkml/2005/9/26/58

This patch was inspired by those discussions and provides a subset of the
features in that earlier patch.

As an example, follow these steps to create metered cpusets:


# cd /dev
# mkdir cpuset
# mount -t cpuset cpuset cpuset
# cd cpuset
# mkdir grp_a
# cd grp_a
# /bin/echo 1 > cpus # assign CPU 1 for this cpuset
# /bin/echo 0 > mems # assign node 0 for this cpuset
# /bin/echo 1 > cpu_exclusive
# /bin/echo 1 > meter_cpu

# mkdir very_imp_grp
# Assign 80% bandwidth to this group
# /bin/echo 80 > very_imp_grp/cpu_quota

# echo $apache_webserver_pid > very_imp_grp/tasks

# mkdir less_imp_grp
# Assign 5% bandwidth to this group
# /bin/echo 5 > less_imp_grp/cpu_quota

# echo $mozilla_browser_pid > less_imp_grp/tasks



Signed-off-by : Srivatsa Vaddagiri <[email protected]>




include/linux/cpuset.h | 4 +
kernel/cpuset.c | 180 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched.c | 7 +
3 files changed, 188 insertions(+), 3 deletions(-)

diff -puN kernel/cpuset.c~cpuset_interface kernel/cpuset.c
--- linux-2.6.18-rc3/kernel/cpuset.c~cpuset_interface 2006-08-04 08:36:55.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/cpuset.c 2006-08-04 09:15:52.000000000 +0530
@@ -99,6 +99,9 @@ struct cpuset {
int mems_generation;

struct fmeter fmeter; /* memory_pressure filter */
+#ifdef CONFIG_CPUMETER
+ void *cpu_ctlr_data;
+#endif
};

/* bits in struct cpuset flags field */
@@ -110,6 +113,9 @@ typedef enum {
CS_NOTIFY_ON_RELEASE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
+#ifdef CONFIG_CPUMETER
+ CS_METER_FLAG,
+#endif
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -148,6 +154,110 @@ static inline int is_spread_slab(const s
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

+#ifdef CONFIG_CPUMETER
+static inline int is_metered(const struct cpuset *cs)
+{
+ return test_bit(CS_METER_FLAG, &cs->flags);
+}
+
+/* does the new combination of meter and exclusive flags make sense? */
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ struct cpuset *parent = cur->parent;
+ int is_changed;
+
+ /* checks for change in meter flag */
+ is_changed = (is_metered(trial) != is_metered(cur));
+
+ /* Cant change meter setting if a cpuset already has children */
+ if (is_changed && !list_empty(&cur->children))
+ return -EINVAL;
+
+ /* Cant change meter setting if parent is metered */
+ if (is_changed && parent && is_metered(parent))
+ return -EINVAL;
+
+ /* Turn on metering only if a cpuset is exclusive */
+ if (is_metered(trial) && !is_cpu_exclusive(cur))
+ return -EINVAL;
+
+ /* Cant change exclusive setting if a cpuset is metered */
+ if (!is_cpu_exclusive(trial) && is_metered(cur))
+ return -EINVAL;
+
+ /* Cant change exclusive setting if parent is metered */
+ if (parent && is_metered(parent) && is_cpu_exclusive(trial))
+ return -EINVAL;
+
+ return 0;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ if (is_metered(cs)) {
+ void *parent_data = NULL;
+
+ if (cs->parent)
+ parent_data = cs->parent->cpu_ctlr_data;
+ cs->cpu_ctlr_data = cpu_ctlr->alloc_group(parent_data);
+ } else
+ cpu_ctlr->dealloc_group(cs->cpu_ctlr_data);
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ int quota;
+
+ if (!buf || !is_metered(cs))
+ return -EINVAL;
+
+ quota = simple_strtoul(buf, NULL, 10);
+ cpu_ctlr->assign_quota(cs->cpu_ctlr_data, quota);
+
+ return 0;
+}
+
+static int cpuset_sprintf_quota(char *page, struct cpuset *cs)
+{
+ int quota = 0;
+ char *c = page;
+
+ if (is_metered(cs))
+ quota = cpu_ctlr->get_quota(cs->cpu_ctlr_data);
+ c += sprintf (c, "%d", quota);
+
+ return c - page;
+}
+
+void *cpu_ctlr_data(struct cpuset *cs)
+{
+ return cs->cpu_ctlr_data;
+}
+
+#else
+
+static inline int is_metered(const struct cpuset *cs)
+{
+ return 0;
+}
+
+static int validate_meters(const struct cpuset *cur, const struct cpuset *trial)
+{
+ return 0;
+}
+
+static void update_meter(struct cpuset *cs)
+{
+ return;
+}
+
+static int update_quota(struct cpuset *cs, char *buf)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CPUMETER */
+
/*
* Increment this integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -723,6 +833,9 @@ static int validate_change(const struct
{
struct cpuset *c, *par;

+ if (validate_meters(cur, trial))
+ return -EINVAL;
+
/* Each of our child cpusets must be a subset of us */
list_for_each_entry(c, &cur->children, sibling) {
if (!is_cpuset_subset(c, trial))
@@ -1037,7 +1150,7 @@ static int update_flag(cpuset_flagbits_t
{
int turning_on;
struct cpuset trialcs;
- int err, cpu_exclusive_changed;
+ int err, cpu_exclusive_changed, meter_changed;

turning_on = (simple_strtoul(buf, NULL, 10) != 0);

@@ -1052,6 +1165,7 @@ static int update_flag(cpuset_flagbits_t
return err;
cpu_exclusive_changed =
(is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs));
+ meter_changed = (is_metered(cs) != is_metered(&trialcs));
mutex_lock(&callback_mutex);
if (turning_on)
set_bit(bit, &cs->flags);
@@ -1061,7 +1175,9 @@ static int update_flag(cpuset_flagbits_t

if (cpu_exclusive_changed)
update_cpu_domains(cs);
- return 0;
+ if (meter_changed)
+ update_meter(cs);
+ return err;
}

/*
@@ -1225,6 +1341,8 @@ static int attach_task(struct cpuset *cs
return -ESRCH;
}
atomic_inc(&cs->count);
+ if (is_metered(cs))
+ cpu_ctlr->move_task(tsk, oldcs->cpu_ctlr_data, cs->cpu_ctlr_data);
rcu_assign_pointer(tsk->cpuset, cs);
task_unlock(tsk);

@@ -1267,6 +1385,10 @@ typedef enum {
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
FILE_TASKLIST,
+#ifdef CONFIG_CPUMETER
+ FILE_METER_FLAG,
+ FILE_METER_QUOTA,
+#endif
} cpuset_filetype_t;

static ssize_t cpuset_common_file_write(struct file *file, const char __user *userbuf,
@@ -1336,6 +1458,14 @@ static ssize_t cpuset_common_file_write(
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_METER_FLAG:
+ retval = update_flag(CS_METER_FLAG, cs, buffer);
+ break;
+ case FILE_METER_QUOTA:
+ retval = update_quota(cs, buffer);
+ break;
+#endif
default:
retval = -EINVAL;
goto out2;
@@ -1448,6 +1578,14 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+#ifdef CONFIG_CPUMETER
+ case FILE_METER_FLAG:
+ *s++ = is_metered(cs) ? '1' : '0';
+ break;
+ case FILE_METER_QUOTA:
+ s += cpuset_sprintf_quota(s, cs);
+ break;
+#endif
default:
retval = -EINVAL;
goto out;
@@ -1821,6 +1959,18 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
};

+#ifdef CONFIG_CPUMETER
+static struct cftype cft_meter_flag = {
+ .name = "meter_cpu",
+ .private = FILE_METER_FLAG,
+};
+
+static struct cftype cft_meter_quota = {
+ .name = "cpu_quota",
+ .private = FILE_METER_QUOTA,
+};
+#endif
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1845,6 +1995,12 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
+#ifdef CONFIG_CPUMETER
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_flag)) < 0)
+ return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_meter_quota)) < 0)
+ return err;
+#endif
return 0;
}

@@ -1862,6 +2018,10 @@ static long cpuset_create(struct cpuset
struct cpuset *cs;
int err;

+ /* metered cpusets cant have grand-children! */
+ if (parent->parent && is_metered(parent->parent))
+ return -EINVAL;
+
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return -ENOMEM;
@@ -1894,6 +2054,13 @@ static long cpuset_create(struct cpuset
if (err < 0)
goto err;

+ if (is_metered(parent)) {
+ cs->cpus_allowed = parent->cpus_allowed;
+ cs->mems_allowed = parent->mems_allowed;
+ set_bit(CS_METER_FLAG, &cs->flags);
+ update_meter(cs);
+ }
+
/*
* Release manage_mutex before cpuset_populate_dir() because it
* will down() this new directory's i_mutex and if we race with
@@ -1956,6 +2123,13 @@ static int cpuset_rmdir(struct inode *un
return retval;
}
}
+ if (is_metered(cs)) {
+ int retval = update_flag(CS_METER_FLAG, cs, "0");
+ if (retval < 0) {
+ mutex_unlock(&manage_mutex);
+ return retval;
+ }
+ }
parent = cs->parent;
mutex_lock(&callback_mutex);
set_bit(CS_REMOVED, &cs->flags);
@@ -2008,6 +2182,7 @@ int __init cpuset_init(void)
top_cpuset.mems_generation = cpuset_mems_generation++;

init_task.cpuset = &top_cpuset;
+ /* xxx: initialize init_task.cpu_ctlr_data */

err = register_filesystem(&cpuset_fs_type);
if (err < 0)
@@ -2135,6 +2310,7 @@ void cpuset_exit(struct task_struct *tsk
struct cpuset *cs;

cs = tsk->cpuset;
+ cpu_ctlr->move_task(tsk, cs->cpu_ctlr_data, top_cpuset.cpu_ctlr_data);
tsk->cpuset = &top_cpuset; /* the_top_cpuset_hack - see above */

if (notify_on_release(cs)) {
diff -puN include/linux/cpuset.h~cpuset_interface include/linux/cpuset.h
--- linux-2.6.18-rc3/include/linux/cpuset.h~cpuset_interface 2006-08-04 08:36:55.000000000 +0530
+++ linux-2.6.18-rc3-root/include/linux/cpuset.h 2006-08-04 08:36:55.000000000 +0530
@@ -52,6 +52,10 @@ extern void cpuset_lock(void);
extern void cpuset_unlock(void);

extern int cpuset_mem_spread_node(void);
+#ifdef CONFIG_CPUMETER
+extern void *cpu_ctlr_data(struct cpuset *cs);
+#endif
+

static inline int cpuset_do_page_mem_spread(void)
{
diff -puN kernel/sched.c~cpuset_interface kernel/sched.c
--- linux-2.6.18-rc3/kernel/sched.c~cpuset_interface 2006-08-04 09:01:54.000000000 +0530
+++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 09:03:39.000000000 +0530
@@ -735,7 +735,12 @@ static void enqueue_task_grp(struct task
/* return the task-group to which a task belongs */
static inline struct task_grp *task_grp(struct task_struct *p)
{
- return &init_task_grp;
+ struct task_grp *tg = (struct task_grp *)cpu_ctlr_data(p->cpuset);
+
+ if (!tg)
+ tg = &init_task_grp;
+
+ return tg;
}

static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,

_
--
Regards,
vatsa

2006-08-04 05:37:18

by Andrew Morton

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 10:37:53 +0530
Srivatsa Vaddagiri <[email protected]> wrote:

> Resource management has been talked about quite extensively in the
> past, more recently in the context of containers. The basic requirement
> here is to provide isolation between *groups* of tasks wrt their use
> of various resources like CPU, Memory, I/O bandwidth, open file-descriptors etc.
>
> Different maintainers have however expressed different opinions over the need to
> complicate the kernel to meet this need, especially since it involves core
> kernel code like the resource schedulers.
>
> A BoF was hence held at OLS this year to come to a consensus on the minimum
> requirements of a resource management solution for Linux kernel. Some notes
> taken at the BoF are posted here:
>
> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0607.3/0896.html
>
> An important consensus point of the BoF seemed to be "focus on real
> controllers more, preferably memory first, using some simple interface
> and task grouping mechanism".

ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
this whole strategy.

I thought the most recently posted CKRM core was a fine piece of code. It
provides the machinery for grouping tasks together and the machinery for
establishing and viewing those groupings via configfs, and other such
common functionality. My 20-minute impression was that this code was an
easy merge and it was just awaiting some useful controllers to come along.

And now we've dumped the good infrastructure and instead we've concentrated
on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
layer.

I wonder how many of the consensus-makers were familiar with the
contemporary CKRM core?

> In going forward, following points will need to be addressed:
>
> - Grouping and interface
> - What mechanism to use for grouping tasks and
> for specifying task-group resource usage limits?

ckrm ;)

> - Design of individual resource controllers like memory and cpu

Right. We won't be controlling memory, numtasks, disk, network etc
controllers via cpusets, will we?


> This patch series is an attempt to take forward the design discussion of a
> CPU controller.

That's good. The controllers are the sticking point. Most especially the
memory controller.

> For simplicity and convenience, cpuset has been chosen as the means to group
> tasks here, primarily because cpuset already exists in the kernel and also
> perhaps resource container definition should be unique only inside a cpuset.
>

Correct me if I'm wrong, but a cpuset isn't the appropriate machinery to be
using to group tasks.

And if this whole resource-control effort is to end up being successful, it
should have as core infrastructure a flexible, appropriate and uniform way
of grouping tasks together and of getting data into and out of those
aggregates. We already have that, don't we?



2006-08-04 05:43:15

by Andrew Morton

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, 3 Aug 2006 22:36:50 -0700
Andrew Morton <[email protected]> wrote:

> I thought the most recently posted CKRM core was a fine piece of code.

I mean, subject to more review, testing, input from stakeholders and blah,
I'd be OK with merging the CKRM core fairly aggressively. With just a
minimal controller suite. Because it is good to define the infrastructure
and APIs for task grouping and to then let the controllers fall into place.

The downside to such a strategy is that there is a risk that nobody ever
gets around to implementing useful controllers, so it ends up dead code.
I'd judge that the interest in resource management is such that the risk of
this happening is low.

2006-08-04 05:59:05

by Paul Jackson

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Andrew wrote:
> And now we've dumped the good infrastructure and instead we've concentrated
> on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> layer.

Odd ... I'm usually fairly keen to notice any use or abuse of cpuset stuff.

I didn't see any mention of 'cpuset' in:
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0607.3/0896.html

I -do- see there:
* non-hierarchical,
* can't move tasks and
* syscall rather than file system API.

This all sounds like the polar antithesis of cpusets to me.

What did I miss, Andrew?


Before seeing your response, I was inclined to suggest that:
1) containers should have a good infrastructure, from the get go
(you just said the same thing of CKRM, as I read it), and
2) containers -should- look at a hierarchical pseudo file system
for this, as that seems like the 'natural' shape for
containers to take.
3) the syscall API, no hierarchy, 'simple' interface style
suggested for containers in the above notes sounded like
a really bad idea to me.

However, I was thinking of 'containers' when I thought this,
not of CKRM. And I haven't considered CKRM's infrastructure
in recent times. From what you say, it's worthy of serious
consideration now - good.

Perhaps (wild idea here) if 'containers' did lead us to looking
for a hierarchical pseudo file system interface, we could make
this common technology that both containers and the existing
cpusets could use, avoiding duplicating a chunk of vfs-aware
generic code that's now in kernel/cpuset.c to provide the file
system style interface. Cpusets would keep their existing API,
just share some generic vfs-aware code with these new containers.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-04 06:02:41

by Paul Jackson

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Crap - I was reading my email backwards.

I now see "[ RFC, PATCH 5/5 ] CPU controller - interface with cpusets".

I haven't read it yet, but I will likely agree that
this is an abuse of cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-04 06:16:37

by Paul Jackson

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

pj wrote:
> I haven't read it yet, but I will likely agree that
> this is an abuse of cpusets.

This likely just drove Srivatsa up a wall (sorry), as my comments
in the earlier thread he referenced:

http://lkml.org/lkml/2005/9/26/58

enthusiastically supported adding a cpu controller interface to cpusets.

We need to think through what are the relations between CKRM
controllers, containers and cpusets. But I don't think that
people will naturally want to manage CKRM controllers via cpusets.
That sounds odd to me now. My earlier enthusiasm for it seems
wrong to me now.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-04 06:24:53

by Dipankar Sarma

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, Aug 03, 2006 at 10:36:50PM -0700, Andrew Morton wrote:
> On Fri, 4 Aug 2006 10:37:53 +0530
>
> > Resource management has been talked about quite extensively in the
> > past, more recently in the context of containers. The basic requirement
> > here is to provide isolation between *groups* of tasks wrt their use
> > of various resources like CPU, Memory, I/O bandwidth, open file-descriptors etc.
> >
> > Different maintainers have however expressed different opinions over the need to
> > complicate the kernel to meet this need, especially since it involves core
> > kernel code like the resource schedulers.
> >
> > A BoF was hence held at OLS this year to come to a consensus on the minimum
> > requirements of a resource management solution for Linux kernel. Some notes
> > taken at the BoF are posted here:
> >
> > http://www.uwsg.indiana.edu/hypermail/linux/kernel/0607.3/0896.html
> >
> > An important consensus point of the BoF seemed to be "focus on real
> > controllers more, preferably memory first, using some simple interface
> > and task grouping mechanism".
>
> ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
> this whole strategy.

Ah, wish you were there :)

> I thought the most recently posted CKRM core was a fine piece of code. It
> provides the machinery for grouping tasks together and the machinery for
> establishing and viewing those groupings via configfs, and other such
> common functionality. My 20-minute impression was that this code was an
> easy merge and it was just awaiting some useful controllers to come along.
>
> And now we've dumped the good infrastructure and instead we've concentrated
> on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> layer.

FWIW, this controller was originally written for f-series. It should
be trivial to put it back there. So really, f-series isn't gone
anywhere. If you want to merge it, I am sure it can be re-submitted.

> I wonder how many of the consensus-makers were familiar with the
> contemporary CKRM core?

I think what would be nice is a clear strategy on whether we need
to work out the infrastructure or the controllers first. One of
the strongest points raised in the BoF was - forget the infrastructure
for now, get some mergable controllers developed. If you
want to stick to f-series infrastructure and want to see some
consensus controllers evolve on top of it, that can be done too.

Thanks
Dipankar

2006-08-04 06:31:38

by Paul Jackson

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Dipankar wrote:
> f-series infrastructure

Do you have a good link to follow for more on this?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-04 06:43:00

by Dipankar Sarma

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, Aug 03, 2006 at 11:31:13PM -0700, Paul Jackson wrote:
> Dipankar wrote:
> > f-series infrastructure
>
> Do you have a good link to follow for more on this?
>

Here is the link to the last posting of the series by Chandra -

http://lkml.org/lkml/2006/4/27/378

Thanks
Dipankar

2006-08-04 06:46:10

by Andrew Morton

Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 11:50:36 +0530
Dipankar Sarma <[email protected]> wrote:

> > And now we've dumped the good infrastructure and instead we've concentrated
> > on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> > layer.
>
> FWIW, this controller was originally written for f-series.

What the heck is an f-series?

<googles, looks at
http://images.automotive.com/cob/factory_automotive/images/Features/auto_shows/2005_IEAS/2005_Ford_F-series_Harley-Davidson_front.JPG,
gets worried about IBM design methodologies>

> It should
> be trivial to put it back there. So really, f-series isn't gone
> anywhere. If you want to merge it, I am sure it can be re-submitted.

Well. It shouldn't be a matter of what I want to merge - you're the
resource-controller guys. But...

> > I wonder how many of the consensus-makers were familiar with the
> > contemporary CKRM core?
>
> I think what would be nice is a clear strategy on whether we need
> to work out the infrastructure or the controllers first.

We should put our thinking hats on and decide what sort of infrastructure
will need to be in place by the time we have a useful number of useful
controllers implemented.

> One of
> the strongest points raised in the BoF was - forget the infrastructure
> for now, get some mergable controllers developed.

I wonder what inspired that approach. Perhaps it was a reaction to CKRM's
long and difficult history? Perhaps it was felt that doing standalone
controllers with ad-hoc interfaces would make things easier to merge?

Perhaps. But I think you guys know that the end result would be inferior,
and that getting good infrastructure in place first will produce a better
end result, but you decided to go this way because you want to get
_something_ going?

> If you
> want to stick to f-series infrastructure and want to see some
> consensus controllers evolve on top of it, that can be done too.

Do you think that the CKRM core as last posted had any unnecessary
features? I don't have the operational experience to be able to judge
that, so I'd prefer to trust your experience and judgement on that. But
the features which _were_ there looked quite OK from an implementation POV.

So my take was "looks good, done deal - let's go get some sane controllers
working". And now this!

2006-08-04 06:49:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 12:07:38 +0530
Dipankar Sarma <[email protected]> wrote:

> On Thu, Aug 03, 2006 at 11:31:13PM -0700, Paul Jackson wrote:
> > Dipankar wrote:
> > > f-series infrastructure
> >
> > Do you have a good link to follow for more on this?
> >
>
> Here is the link to the last posting of the series by Chandra -
>
> http://lkml.org/lkml/2006/4/27/378
>

Right, thanks. Look at the diffstats there.

This:

init/main.c | 2 +
kernel/exit.c | 2 +
kernel/fork.c | 2 +
fs/proc/base.c | 19 +++++++++++++++++++
include/linux/sched.h | 4 +

is the *entire* impact on existing kernel code. Sweet.

2006-08-04 06:51:56

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, Aug 03, 2006 at 10:36:50PM -0700, Andrew Morton wrote:
> ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
> this whole strategy.
>
> I thought the most recently posted CKRM core was a fine piece of code. It
> provides the machinery for grouping tasks together and the machinery for
> establishing and viewing those groupings via configfs, and other such
> common functionality. My 20-minute impression was that this code was an
> easy merge and it was just awaiting some useful controllers to come along.
>
> And now we've dumped the good infrastructure and instead we've concentrated
> on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> layer.

Andrew,
CPUset was used in this patch series primarily because it
represents a task-grouping mechanism already in the kernel and because
people at the BoF wanted to start with something simple. The idea of using
cpusets here was not to push this as a final solution, but to use it as a means to
discuss the effects of task-grouping on the CPU scheduler.

We would have been more than happy to work with the ckrm core which was posted last.
In fact I had sent out the same cpu controller against the ckrm core itself last
time around to Nick/Ingo.

> Right. We won't be controlling memory, numtasks, disk, network etc
> controllers via cpusets, will we?

Agreed. Using the CPUset interface makes sense mainly for cpu and memory.

> Correct me if I'm wrong, but a cpuset isn't the appropriate machinery to be
> using to group tasks.
>
> And if this whole resource-control effort is to end up being successful, it
> should have as core infrastructure a flexible, appropriate and uniform way
> of grouping tasks together and of getting data into and out of those
> aggregates. We already have that, don't we?

--
Regards,
vatsa

2006-08-04 07:14:22

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, Aug 03, 2006 at 11:45:37PM -0700, Andrew Morton wrote:
> On Fri, 4 Aug 2006 11:50:36 +0530
> >
> > FWIW, this controller was originally written for f-series.
>
> What the heck is an f-series?

Sorry, the resource group (formerly ckrm) series that Chandra posted
a few months ago.

http://lkml.org/lkml/2006/4/27/378

> > It should
> > be trivial to put it back there. So really, f-series isn't gone
> > anywhere. If you want to merge it, I am sure it can be re-submitted.
>
> Well. It shouldn't be a matter of what I want to merge - you're the
> resource-controller guys. But...

The point is that putting the controller back to the ckrm framework
would not be that difficult since it came from there.

> > One of
> > the strongest points raised in the BoF was - forget the infrastructure
> > for now, get some mergable controllers developed.
>
> I wonder what inspired that approach. Perhaps it was a reaction to CKRM's
> long and difficult history? Perhaps it was felt that doing standalone
> controllers with ad-hoc interfaces would make things easier to merge?
>
> Perhaps. But I think you guys know that the end result would be inferior,
> and that getting good infrastructure in place first will produce a better
> end result, but you decided to go this way because you want to get
> _something_ going?

I think part of it was the fact that the two main controllers
were in scheduler and VM code, which made people more cautious
about the controllers and want those issues to be sorted out first.

> > If you
> > want to stick to f-series infrastructure and want to see some
> > consensus controllers evolve on top of it, that can be done too.
>
> Do you think that the CKRM core as last posted had any unnecessary
> features? I don't have the operational experience to be able to judge
> that, so I'd prefer to trust your experience and judgement on that. But
> the features which _were_ there looked quite OK from an implementation POV.
>
> So my take was "looks good, done deal - let's go get some sane controllers
> working". And now this!

My recollection from the BoF is that people felt the interface wasn't a major
issue and it wasn't discussed much. Most of the discussion centered
around grouping of tasks and the mem/cpu controllers and some additional
resource control requirements from openvz folks.

One way to go forward with the interface could be to request Chandra
to repost the ckrm infrastructure and see if the stakeholders (primarily
container folks) can review and agree on it.

Thanks
Dipankar

2006-08-04 07:14:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 12:26:15 +0530
Srivatsa Vaddagiri <[email protected]> wrote:

> On Thu, Aug 03, 2006 at 10:36:50PM -0700, Andrew Morton wrote:
> > ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
> > this whole strategy.
> >
> > I thought the most recently posted CKRM core was a fine piece of code. It
> > provides the machinery for grouping tasks together and the machinery for
> > establishing and viewing those groupings via configfs, and other such
> > common functionality. My 20-minute impression was that this code was an
> > easy merge and it was just awaiting some useful controllers to come along.
> >
> > And now we've dumped the good infrastructure and instead we've concentrated
> > on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> > layer.
>
> Andrew,
> CPUset was used in this patch series primarily because it
> represent a task-grouping mechanism already in the kernel and because
> people at the BoF wanted to start with something simple. The idea of using
> cpusets here was not to push this as a final solution, but use it as a means to
> discuss the effects of task-grouping on CPU scheduler.

OK.

> We had be more than happy to work with the ckrm core which was posted last.
> In fact I had sent out the same cpu controller against ckrm core itself last
> time around to Nick/Ingo.

Yup.

Please don't let me derail the main intent of this work - to make some progress
on the CPU controller.

There was a lot of discussion last time - Mike, Ingo, others. It would be
a useful starting point if we could be refreshed on what the main issues
were, and whether/how this new patchset addresses them.

2006-08-04 07:28:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 12:40:09 +0530
Dipankar Sarma <[email protected]> wrote:

> > So my take was "looks good, done deal - let's go get some sane controllers
> > working". And now this!
>
> My recollection from the BoF is that people felt interface wasn't a major
> issue and it wasn't discussed much. Most of the discussion centered
> around grouping of tasks and the mem/cpu controllers and some additional
> resource control requirements from openvz folks.

OK. That's certainly the area to be concentrating on.

> One way to go forward with the interface could be to request Chandra
> to repost the ckrm infrastructure and see if the stakeholders (primarily
> container folks) can review and agree on it.

Fine. But let's separate that activity from the primary one: worrying
about the controllers.

2006-08-04 07:35:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [ RFC, PATCH 1/5 ] CPU controller - base changes

On Fri, 4 Aug 2006 10:39:32 +0530
Srivatsa Vaddagiri <[email protected]> wrote:

> +/* runqueue used for every task-group */
> +struct task_grp_rq {
> + struct prio_array arrays[2];
> + struct prio_array *active, *expired, *rq_array;
> + unsigned long expired_timestamp;
> + unsigned int ticks;
> + int prio; /* Priority of the task-group */
> + struct list_head list;
> +};
> +
> +static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);

That's another 4.5kbytes/cpu of our precious per-cpu memory. Is it feasible to make this
dependent upon CONFIG_CPUMETER?

2006-08-04 09:30:51

by Alan

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, 2006-08-03 at 22:42 -0700, Andrew Morton wrote:
> The downside to such a strategy is that there is a risk that nobody ever
> gets around to implementing useful controllers, so it ends up dead code.
> I'd judge that the interest in resource management is such that the risk of
> this happening is low.

I think the risk is that OpenVZ has all the controls and resource
managers we need, while CKRM is still more research-ish. I find the
OpenVZ code much clearer, cleaner and complete at the moment, although
also much more conservative in its approach to solving problems.

Alan

2006-08-04 11:12:33

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, Aug 04, 2006 at 12:13:42AM -0700, Andrew Morton wrote:
> There was a lot of discussion last time - Mike, Ingo, others. It would be
> a useful starting point if we could be refreshed on what the main issues
> were, and whether/how this new patchset addresses them.

The main issues raised against the CPU controller posted last time were
these:

(ref : http://lkml.org/lkml/2006/4/20/404)

a. Interactive tasks not handled
The patch, which was mainly based on scaling down the timeslice of tasks
that are above their guarantee, left interactive tasks untouched. This
meant that interactive tasks could run uncontrolled and would have
affected the guaranteed bandwidth provided for other tasks.

b. Task groups with uncontrolled number of tasks not handled well
The patch retained the current single runqueue per CPU. Thus the runqueue
would contain a mixture of tasks belonging to different groups. Also
each task was given a minimum timeslice of 1 tick. This meant that we
could not limit the CPU bandwidth of a group that has a large number of
tasks to the desired extent.

c. SMP-correctness not implemented
Guaranteed bandwidth wasn't observed on all CPUs put together.

d. Supported only guaranteed bandwidth and not soft/hard limits.

e. Bursty workloads not handled well
Scaling down of timeslices, to meet the increased demand of
higher-guaranteed task-groups, was not instantaneous. Rather,
timeslices were scaled down when tasks expired their timeslice
and were moved to the expired array. This meant that bursty workloads
would get their due share rather slowly.

Apart from these, the other observation I had was that:

f. Incorrect load accounting?
The load of a task was accounted only when it expired its timeslice, rather
than while it was running. This IMO can lead to an improper observation of
the load a task-group has on a given CPU at times and thus affect the
guaranteed bandwidth of other task-groups.

Could we have overcome all these issues with slight changes to the
design? Hard to say. IMHO we get better control only by segregating tasks
into different runqueues and getting control over which task-group to
schedule next, which is what this new patch attempts to do.

In summary, the patch should address limitations a, b, e and f. I am hoping to
address c using the smpnice approach. Regarding d, this patch provides more
of a soft-limit feature. Some guaranteed usage for task-groups can still
be met, I feel, by limiting the CPU usage of other groups.

To take all this forward, these significant points need to be decided
for a CPU controller:

1. Do we want to split the current 1 runqueue per-CPU into 1 runqueue
per-task-group per-CPU?

2. How should task-group priority be decided? The decision we take for
this impacts the interactivity of the system. In my patch, I attempt to
retain good interactivity by letting task-group priority be decided by
the highest-priority task it has (a minimal sketch of the resulting
two-step pick follows this list).

3. How do we accomplish SMP correctness for task-group bandwidth?
I believe OpenVZ uses virtual runqueues, which simplifies
load balancing a bit, though I am not sure if that is at the expense
of increased lock contention. IMHO we can try going the smpnice route and
see how far that can take us.
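
To make the two-step pick behind points 1 and 2 concrete, here is a minimal
user-space sketch (an illustration added for this discussion, not the posted
patch); the bitmap width, the type names and the one-group-per-priority-slot
layout are simplified assumptions rather than the kernel's actual data
structures:

#include <stdio.h>
#include <stdint.h>

#define MAX_PRIO 8   /* tiny for illustration; the kernel uses 140 priority levels */

struct prio_bitmap {
    uint32_t bits;   /* bit i set => something runnable at priority i */
};

static int find_first_prio(const struct prio_bitmap *b)
{
    /* lowest set bit == numerically smallest == best priority (GCC/Clang builtin) */
    return b->bits ? __builtin_ctz(b->bits) : MAX_PRIO;
}

struct task_grp_rq_sketch {
    struct prio_bitmap tasks;   /* runnable tasks of this group, by priority */
    const char *name;
};

struct main_rq_sketch {
    struct prio_bitmap groups;                 /* runnable groups, by group priority */
    struct task_grp_rq_sketch *grp[MAX_PRIO];  /* one group per priority slot, for simplicity */
};

/* Two O(1) steps: best group from the main array, then best task inside it. */
static void pick_next(const struct main_rq_sketch *rq)
{
    int gprio = find_first_prio(&rq->groups);

    if (gprio == MAX_PRIO) {
        printf("idle\n");
        return;
    }
    printf("run a prio-%d task from group '%s' (group prio %d)\n",
           find_first_prio(&rq->grp[gprio]->tasks), rq->grp[gprio]->name, gprio);
}

int main(void)
{
    struct task_grp_rq_sketch web = { .tasks = { 0x0c }, .name = "web" };  /* tasks at prio 2 and 3 */
    struct task_grp_rq_sketch db  = { .tasks = { 0x20 }, .name = "db" };   /* task at prio 5 */
    struct main_rq_sketch rq = { .groups = { 0 } };

    /* a group's priority is that of its best (highest-priority) runnable task */
    rq.grp[find_first_prio(&web.tasks)] = &web;
    rq.groups.bits |= 1u << find_first_prio(&web.tasks);
    rq.grp[find_first_prio(&db.tasks)] = &db;
    rq.groups.bits |= 1u << find_first_prio(&db.tasks);

    pick_next(&rq);   /* -> the prio-2 task from group 'web' */
    return 0;
}

In the actual patch the same effect is obtained by keeping best_dyn_prio up to
date in enqueue_task()/dequeue_task() and re-queueing the group via
update_task_grp_prio() whenever it changes.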

Ingo/Nick, what are your thoughts here?

--
Regards,
vatsa

2006-08-04 11:13:49

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ RFC, PATCH 1/5 ] CPU controller - base changes

On Fri, Aug 04, 2006 at 12:35:28AM -0700, Andrew Morton wrote:
> That's another 4.5kbytes/cpu of our precious per-cpu memory. Is it feasible to make this
> dependent upon CONFIG_CPUMETER?

Yes, I think so. I will consider that in my next version.

--
Regards,
vatsa

2006-08-04 11:37:09

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, Aug 04, 2006 at 10:49:10AM +0100, Alan Cox wrote:
> I think the risk is that OpenVZ has all the controls and resource
> managers we need, while CKRM is still more research-ish. I find the
> OpenVZ code much clearer, cleaner and complete at the moment, although
> also much more conservative in its approach to solving problems.

I think it would be nice to first compare the features provided by ckrm and
openvz at some point and agree upon the minimum common features we need to have
as we go forward. For instance, I think OpenVZ assumes that tasks do
not need to move between containers (task-groups), whereas ckrm provides this
flexibility for workload management. This may have some effect on the
controller/interface design, no?

--
Regards,
vatsa

2006-08-04 14:19:36

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Andrew,

> ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
> this whole strategy.
I wish you were here :/

> I thought the most recently posted CKRM core was a fine piece of code. It
> provides the machinery for grouping tasks together and the machinery for
> establishing and viewing those groupings via configfs, and other such
> common functionality. My 20-minute impression was that this code was an
> easy merge and it was just awaiting some useful controllers to come along.
The User BeanCounter patch does the same thing.
The infrastructure is totally a separate question.
Moreover, as CKRM/UBC experience shows, it is the easiest code to agree on.
And as people agreed, we need to speak about the serious problems first,
not the interface (at least, not at the moment when we have no working
resource controllers).

> And now we've dumped the good infrastructure and instead we've concentrated
> on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> layer.
Exactly. We have no workable controller.
So if there is no code we have agreed on, who is this infrastructure for?

> I wonder how many of the consensus-makers were familiar with the
> contemporary CKRM core?
I was familiar. And I can raise many arguable points about the CKRM infrastructure.

For example:
1. Should the task-group be changeable after being set/inherited once?
Are you planning to recalculate resources on a group change?
E.g. shared memory or used kernel memory is hard to recalculate.
2. Should a task-group resource container manage all the resources as a whole?
E.g. in OpenVZ tasks can belong to different CPU and UBC containers.
It is more flexible, and e.g. we used to put some vital kernel threads
into a separate CPU group to decrease delays in service.
3. I also don't understand why a normal binary interface like a system call is not used.
We have set_uid and sys_setrlimit, and they work pretty well, don't they?
4. Do we want hierarchical grouping?

>> - Design of individual resource controllers like memory and cpu
>
>
> Right. We won't be controlling memory, numtasks, disk, network etc
> controllers via cpusets, will we?
I hope so :)

Thanks,
Kirill

2006-08-04 14:33:52

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [ RFC, PATCH 1/5 ] CPU controller - base changes

Srivatsa,

AFAICS, you wanted to go the way we used in OpenVZ - 2-level scheduling.
However, you don't do any process balancing between runqueues taking into account
other groups.
In many cases you will simply end up with tasks collected on the same physical
CPU and poor performance. I'm not talking about fairness (proportional CPU scheduling).
I don't think it is possible to make any QoS estimates for such a scheduler.

What do you think about full runqueue virtualization, so that
the first-level CPU scheduler could select a task-group on any basis and then
an arbitrary runqueue is selected for running?

Thanks,
Kirill
P.S. BTW, this patch doesn't allow hierarchy in the CPU controller.

> This patch splits the single main runqueue into several runqueues on each CPU.
> One (main) runqueue is used to hold task-groups and one runqueue for each
> task-group in the system.
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>
>
>
> init/Kconfig | 8 +
> kernel/sched.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> 2 files changed, 228 insertions(+), 31 deletions(-)
>
> diff -puN kernel/sched.c~cpu_ctlr_base_changes kernel/sched.c
> --- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_base_changes 2006-08-04 07:56:46.000000000 +0530
> +++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 07:56:54.000000000 +0530
> @@ -191,10 +191,48 @@ static inline unsigned int task_timeslic
>
> struct prio_array {
> unsigned int nr_active;
> + int best_static_prio, best_dyn_prio;
> DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
> struct list_head queue[MAX_PRIO];
> };
>
> +/*
> + * Each task belong to some task-group. Task-groups and what tasks they contain
> + * are defined by the administrator using some suitable interface.
> + * Administrator can also define the CPU bandwidth provided to each task-group.
> + *
> + * Task-groups are given a certain priority to run on every CPU. Currently
> + * task-group priority on a CPU is defined to be the same as that of
> + * highest-priority runnable task it has on that CPU. Task-groups also
> + * get their own runqueue on every CPU. The main runqueue on each CPU is
> + * used to hold task-groups, rather than tasks.
> + *
> + * Scheduling decision on a CPU is now two-step : first pick highest priority
> + * task-group from the main runqueue and next pick highest priority task from
> + * the runqueue of that group. Both decisions are of O(1) complexity.
> + */
> +
> +/* runqueue used for every task-group */
> +struct task_grp_rq {
> + struct prio_array arrays[2];
> + struct prio_array *active, *expired, *rq_array;
> + unsigned long expired_timestamp;
> + unsigned int ticks;
> + int prio; /* Priority of the task-group */
> + struct list_head list;
> +};
> +
> +static DEFINE_PER_CPU(struct task_grp_rq, init_tg_rq);
> +
> +/* task-group object - maintains information about each task-group */
> +struct task_grp {
> + int ticks; /* bandwidth given to the task-group */
> + struct task_grp_rq *rq[NR_CPUS]; /* runqueue pointer for every cpu */
> +};
> +
> +/* The "default" task-group */
> +struct task_grp init_task_grp;
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -224,12 +262,11 @@ struct rq {
> */
> unsigned long nr_uninterruptible;
>
> - unsigned long expired_timestamp;
> unsigned long long timestamp_last_tick;
> struct task_struct *curr, *idle;
> struct mm_struct *prev_mm;
> + /* these arrays hold task-groups */
> struct prio_array *active, *expired, arrays[2];
> - int best_expired_prio;
> atomic_t nr_iowait;
>
> #ifdef CONFIG_SMP
> @@ -244,6 +281,7 @@ struct rq {
> #endif
>
> #ifdef CONFIG_SCHEDSTATS
> + /* xxx: move these to task-group runqueue */
> /* latency stats */
> struct sched_info rq_sched_info;
>
> @@ -657,6 +695,63 @@ sched_info_switch(struct task_struct *pr
> #define sched_info_switch(t, next) do { } while (0)
> #endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
>
> +static unsigned int task_grp_timeslice(struct task_grp *tg)
> +{
> + /* xxx: take into account sleep_avg etc of the group */
> + return tg->ticks;
> +}
> +
> +/* Dequeue task-group object from the main runqueue */
> +static void dequeue_task_grp(struct task_grp_rq *tgrq)
> +{
> + struct prio_array *array = tgrq->rq_array;
> +
> + array->nr_active--;
> + list_del(&tgrq->list);
> + if (list_empty(array->queue + tgrq->prio))
> + __clear_bit(tgrq->prio, array->bitmap);
> + tgrq->rq_array = NULL;
> +}
> +
> +/* Enqueue task-group object on the main runqueue */
> +static void enqueue_task_grp(struct task_grp_rq *tgrq, struct prio_array *array,
> + int head)
> +{
> + if (!head)
> + list_add_tail(&tgrq->list, array->queue + tgrq->prio);
> + else
> + list_add(&tgrq->list, array->queue + tgrq->prio);
> + __set_bit(tgrq->prio, array->bitmap);
> + array->nr_active++;
> + tgrq->rq_array = array;
> +}
> +
> +/* return the task-group to which a task belongs */
> +static inline struct task_grp *task_grp(struct task_struct *p)
> +{
> + return &init_task_grp;
> +}
> +
> +static inline void update_task_grp_prio(struct task_grp_rq *tgrq, struct rq *rq,
> + int head)
> +{
> + int new_prio;
> + struct prio_array *array = tgrq->rq_array;
> +
> + new_prio = tgrq->active->best_dyn_prio;
> + if (new_prio == MAX_PRIO)
> + new_prio = tgrq->expired->best_dyn_prio;
> +
> + if (array)
> + dequeue_task_grp(tgrq);
> + tgrq->prio = new_prio;
> + if (new_prio != MAX_PRIO) {
> + if (!array)
> + array = rq->active;
> + enqueue_task_grp(tgrq, array, head);
> + }
> +}
> +
> /*
> * Adding/removing a task to/from a priority array:
> */
> @@ -664,8 +759,17 @@ static void dequeue_task(struct task_str
> {
> array->nr_active--;
> list_del(&p->run_list);
> - if (list_empty(array->queue + p->prio))
> + if (list_empty(array->queue + p->prio)) {
> __clear_bit(p->prio, array->bitmap);
> + if (p->prio == array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio =
> + sched_find_first_bit(array->bitmap);
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 0);
> + }
> + }
> }
>
> static void enqueue_task(struct task_struct *p, struct prio_array *array)
> @@ -675,6 +779,14 @@ static void enqueue_task(struct task_str
> __set_bit(p->prio, array->bitmap);
> array->nr_active++;
> p->array = array;
> +
> + if (p->prio < array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio = p->prio;
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 0);
> + }
> }
>
> /*
> @@ -693,6 +805,14 @@ enqueue_task_head(struct task_struct *p,
> __set_bit(p->prio, array->bitmap);
> array->nr_active++;
> p->array = array;
> +
> + if (p->prio < array->best_dyn_prio) {
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> +
> + array->best_dyn_prio = p->prio;
> + if (array == tgrq->active || !tgrq->active->nr_active)
> + update_task_grp_prio(tgrq, task_rq(p), 1);
> + }
> }
>
> /*
> @@ -831,10 +951,11 @@ static int effective_prio(struct task_st
> */
> static void __activate_task(struct task_struct *p, struct rq *rq)
> {
> - struct prio_array *target = rq->active;
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> + struct prio_array *target = tgrq->active;
>
> if (batch_task(p))
> - target = rq->expired;
> + target = tgrq->expired;
> enqueue_task(p, target);
> inc_nr_running(p, rq);
> }
> @@ -844,7 +965,10 @@ static void __activate_task(struct task_
> */
> static inline void __activate_idle_task(struct task_struct *p, struct rq *rq)
> {
> - enqueue_task_head(p, rq->active);
> + struct task_grp_rq *tgrq = task_grp(p)->rq[task_cpu(p)];
> + struct prio_array *target = tgrq->active;
> +
> + enqueue_task_head(p, target);
> inc_nr_running(p, rq);
> }
>
> @@ -2077,7 +2201,7 @@ int can_migrate_task(struct task_struct
> return 1;
> }
>
> -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio)
> +#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->expired->best_static_prio)
>
> /*
> * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted
> @@ -2884,6 +3008,8 @@ unsigned long long current_sched_time(co
> return ns;
> }
>
> +#define nr_tasks(tgrq) (tgrq->active->nr_active + tgrq->expired->nr_active)
> +
> /*
> * We place interactive tasks back into the active array, if possible.
> *
> @@ -2894,13 +3020,13 @@ unsigned long long current_sched_time(co
> * increasing number of running tasks. We also ignore the interactivity
> * if a better static_prio task has expired:
> */
> -static inline int expired_starving(struct rq *rq)
> +static inline int expired_starving(struct task_grp_rq *rq)
> {
> - if (rq->curr->static_prio > rq->best_expired_prio)
> + if (current->static_prio > rq->expired->best_static_prio)
> return 1;
> if (!STARVATION_LIMIT || !rq->expired_timestamp)
> return 0;
> - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running)
> + if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * nr_tasks(rq))
> return 1;
> return 0;
> }
> @@ -2991,6 +3117,7 @@ void scheduler_tick(void)
> struct task_struct *p = current;
> int cpu = smp_processor_id();
> struct rq *rq = cpu_rq(cpu);
> + struct task_grp_rq *tgrq = task_grp(p)->rq[cpu];
>
> update_cpu_clock(p, rq, now);
>
> @@ -3004,7 +3131,7 @@ void scheduler_tick(void)
> }
>
> /* Task might have expired already, but not scheduled off yet */
> - if (p->array != rq->active) {
> + if (p->array != tgrq->active) {
> set_tsk_need_resched(p);
> goto out;
> }
> @@ -3027,25 +3154,26 @@ void scheduler_tick(void)
> set_tsk_need_resched(p);
>
> /* put it at the end of the queue: */
> - requeue_task(p, rq->active);
> + requeue_task(p, tgrq->active);
> }
> goto out_unlock;
> }
> if (!--p->time_slice) {
> - dequeue_task(p, rq->active);
> + dequeue_task(p, tgrq->active);
> set_tsk_need_resched(p);
> p->prio = effective_prio(p);
> p->time_slice = task_timeslice(p);
> p->first_time_slice = 0;
>
> - if (!rq->expired_timestamp)
> - rq->expired_timestamp = jiffies;
> - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> - enqueue_task(p, rq->expired);
> - if (p->static_prio < rq->best_expired_prio)
> - rq->best_expired_prio = p->static_prio;
> + if (!tgrq->expired_timestamp)
> + tgrq->expired_timestamp = jiffies;
> + if (!TASK_INTERACTIVE(p) || expired_starving(tgrq)) {
> + enqueue_task(p, tgrq->expired);
> + if (p->static_prio < tgrq->expired->best_static_prio)
> + tgrq->expired->best_static_prio =
> + p->static_prio;
> } else
> - enqueue_task(p, rq->active);
> + enqueue_task(p, tgrq->active);
> } else {
> /*
> * Prevent a too long timeslice allowing a task to monopolize
> @@ -3066,12 +3194,21 @@ void scheduler_tick(void)
> if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
> p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
> (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
> - (p->array == rq->active)) {
> + (p->array == tgrq->active)) {
>
> - requeue_task(p, rq->active);
> + requeue_task(p, tgrq->active);
> set_tsk_need_resched(p);
> }
> }
> +
> + if (!--tgrq->ticks) {
> + /* Move the task group to expired list */
> + dequeue_task_grp(tgrq);
> + tgrq->ticks = task_grp_timeslice(task_grp(p));
> + enqueue_task_grp(tgrq, rq->expired, 0);
> + set_tsk_need_resched(p);
> + }
> +
> out_unlock:
> spin_unlock(&rq->lock);
> out:
> @@ -3264,6 +3401,7 @@ asmlinkage void __sched schedule(void)
> int cpu, idx, new_prio;
> long *switch_count;
> struct rq *rq;
> + struct task_grp_rq *next_grp;
>
> /*
> * Test if we are atomic. Since do_exit() needs to call into
> @@ -3332,23 +3470,41 @@ need_resched_nonpreemptible:
> idle_balance(cpu, rq);
> if (!rq->nr_running) {
> next = rq->idle;
> - rq->expired_timestamp = 0;
> wake_sleeping_dependent(cpu);
> goto switch_tasks;
> }
> }
>
> + /* Pick a task group first */
> +#ifdef CONFIG_CPUMETER
> array = rq->active;
> if (unlikely(!array->nr_active)) {
> /*
> * Switch the active and expired arrays.
> */
> - schedstat_inc(rq, sched_switch);
> rq->active = rq->expired;
> rq->expired = array;
> array = rq->active;
> - rq->expired_timestamp = 0;
> - rq->best_expired_prio = MAX_PRIO;
> + }
> + idx = sched_find_first_bit(array->bitmap);
> + queue = array->queue + idx;
> + next_grp = list_entry(queue->next, struct task_grp_rq, list);
> +#else
> + next_grp = init_task_grp.rq[cpu];
> +#endif
> +
> + /* Pick a task within that group next */
> + array = next_grp->active;
> + if (!array->nr_active) {
> + /*
> + * Switch the active and expired arrays.
> + */
> + schedstat_inc(next_grp, sched_switch);
> + next_grp->active = next_grp->expired;
> + next_grp->expired = array;
> + array = next_grp->active;
> + next_grp->expired_timestamp = 0;
> + next_grp->expired->best_static_prio = MAX_PRIO;
> }
>
> idx = sched_find_first_bit(array->bitmap);
> @@ -4413,7 +4569,8 @@ asmlinkage long sys_sched_getaffinity(pi
> asmlinkage long sys_sched_yield(void)
> {
> struct rq *rq = this_rq_lock();
> - struct prio_array *array = current->array, *target = rq->expired;
> + struct task_grp_rq *tgrq = task_grp(current)->rq[task_cpu(current)];
> + struct prio_array *array = current->array, *target = tgrq->expired;
>
> schedstat_inc(rq, yld_cnt);
> /*
> @@ -4424,13 +4581,13 @@ asmlinkage long sys_sched_yield(void)
> * array.)
> */
> if (rt_task(current))
> - target = rq->active;
> + target = tgrq->active;
>
> if (array->nr_active == 1) {
> schedstat_inc(rq, yld_act_empty);
> - if (!rq->expired->nr_active)
> + if (!tgrq->expired->nr_active)
> schedstat_inc(rq, yld_both_empty);
> - } else if (!rq->expired->nr_active)
> + } else if (!tgrq->expired->nr_active)
> schedstat_inc(rq, yld_exp_empty);
>
> if (array != target) {
> @@ -6722,21 +6879,53 @@ int in_sched_functions(unsigned long add
> && addr < (unsigned long)__sched_text_end);
> }
>
> +static void task_grp_rq_init(struct task_grp_rq *tgrq, int ticks)
> +{
> + int j, k;
> +
> + tgrq->active = tgrq->arrays;
> + tgrq->expired = tgrq->arrays + 1;
> + tgrq->rq_array = NULL;
> + tgrq->expired->best_static_prio = MAX_PRIO;
> + tgrq->expired->best_dyn_prio = MAX_PRIO;
> + tgrq->active->best_static_prio = MAX_PRIO;
> + tgrq->active->best_dyn_prio = MAX_PRIO;
> + tgrq->prio = MAX_PRIO;
> + tgrq->ticks = ticks;
> + INIT_LIST_HEAD(&tgrq->list);
> +
> + for (j = 0; j < 2; j++) {
> + struct prio_array *array;
> +
> + array = tgrq->arrays + j;
> + for (k = 0; k < MAX_PRIO; k++) {
> + INIT_LIST_HEAD(array->queue + k);
> + __clear_bit(k, array->bitmap);
> + }
> + // delimiter for bitsearch
> + __set_bit(MAX_PRIO, array->bitmap);
> + }
> +}
> +
> void __init sched_init(void)
> {
> int i, j, k;
>
> + init_task_grp.ticks = -1; /* Unlimited bandwidth */
> +
> for_each_possible_cpu(i) {
> struct prio_array *array;
> struct rq *rq;
> + struct task_grp_rq *tgrq;
>
> rq = cpu_rq(i);
> + tgrq = init_task_grp.rq[i] = &per_cpu(init_tg_rq, i);
> spin_lock_init(&rq->lock);
> + task_grp_rq_init(tgrq, init_task_grp.ticks);
> lockdep_set_class(&rq->lock, &rq->rq_lock_key);
> rq->nr_running = 0;
> rq->active = rq->arrays;
> rq->expired = rq->arrays + 1;
> - rq->best_expired_prio = MAX_PRIO;
>
> #ifdef CONFIG_SMP
> rq->sd = NULL;
> diff -puN init/Kconfig~cpu_ctlr_base_changes init/Kconfig
> --- linux-2.6.18-rc3/init/Kconfig~cpu_ctlr_base_changes 2006-08-04 07:56:46.000000000 +0530
> +++ linux-2.6.18-rc3-root/init/Kconfig 2006-08-04 07:56:54.000000000 +0530
> @@ -248,6 +248,14 @@ config CPUSETS
>
> Say N if unsure.
>
> +config CPUMETER
> + bool "CPU resource control"
> + depends on CPUSETS
> + help
> + This options lets you create cpu resource partitions within
> + cpusets. Each resource partition can be given a different quota
> + of CPU usage.
> +
> config RELAY
> bool "Kernel->user space relay support (formerly relayfs)"
> help
>
> _

2006-08-04 14:36:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, Aug 04, 2006 at 06:20:04PM +0400, Kirill Korotaev wrote:
> For example:
> 1. Should task-group be changeable after set/inherited once?
> Are you planning to recalculate resources on group change?
> e.g. shared memory or used kernel memory is hard to recalculate.

I think this is a nice feature, although not on the top priority list.

> 2. should task-group resource container manage all the resources as a whole?
> e.g. in OpenVZ tasks can belong to different CPU and UBC containers.
> It is more flexible and e.g. we used to put some vital kernel threads
> to a separate CPU group to decrease delays in service.

We already support different resource groups for the very limited rlimit
interface. If we can keep the interface clean, doing separate resource
groups is fine.

> 3. I also don't understand why normal binary interface like system call is
> not used.
> We have set_uid, sys_setrlimit and it works pretty good, does it?

Yes. If you can design a syscall interface that is as clean as the two
mentioned above, a syscall interface is the best way to go forward.

> 4. do we want hierarchical grouping?

Not at all. It just causes a lot of pain and complexity for no real-world
benefits.

2006-08-04 14:47:55

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [ RFC, PATCH 1/5 ] CPU controller - base changes

On Fri, Aug 04, 2006 at 06:34:00PM +0400, Kirill Korotaev wrote:
> Srivatsa,
>
> AFAICS, you wanted to go the way we used in OpenVZ - 2-level scheduling.
> However, you don't do any process balancing between runqueues taking into
> account
> other groups.
> In many cases you will simply end up with tasks collected on the same
> physical
> CPU and poor performance. I'm not talking about fairness (proportional CPU
> scheduling).

> I don't think it is possible to make any estimations for QoS of such a
> scheduler.

Yes, the patch (as mentioned earlier) does not address SMP correctness
_yet_. That will definitely need to be addressed for an acceptable
controller. My thought was that we could try the smpnice approach (which
attempts to deal with the same problem, albeit for niced tasks) and
see how far we can go. I am planning to work on it next.

> What do you think about a full runqueue virtualization, so that
> first level CPU scheduler could select task-group on any basis and then
> arbitrary runqueue was selected for running?

That may solve the load-balance problem nicely. But isn't there some cost
to be paid for it (like lock contention on the virtual runqueues)?

> P.S. BTW, this patch doesn't allow hierarchy in CPU controler.

Do we want hierarchy?


--
Regards,
vatsa

2006-08-04 14:51:03

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>I think the risk is that OpenVZ has all the controls and resource
>>managers we need, while CKRM is still more research-ish. I find the
>>OpenVZ code much clearer, cleaner and complete at the moment, although
>>also much more conservative in its approach to solving problems.
>
>
> I think it would be nice to compare first the features provided by ckrm and
> openvz at some point and agree upon the minimum common features we need to have
> as we go forward. For instance I think Openvz assumes that tasks do
> not need to move between containers (task-groups), whereas ckrm provides this
> flexibility for workload management. This may have some effect on the
> controller/interface design, no?
OpenVZ assumes that tasks can't move between task-groups for a single reason:
the user shouldn't be able to escape from the container.
But this has no implications for the design/implementation.

BTW, do you see any practical use cases for tasks jumping between resource-containers?

Kirill

2006-08-04 14:51:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [ RFC, PATCH 1/5 ] CPU controller - base changes

Kirill Korotaev wrote:
> Srivatsa,
>
> AFAICS, you wanted to go the way we used in OpenVZ - 2-level scheduling.
> However, you don't do any process balancing between runqueues taking
> into account
> other groups.

The plan is to do load balancing using the smpnice feature, which is yet to be
worked on.

From vatsa's comments

"Works only on UP for now (boot with maxcpus=1). IMO group-aware SMP
load-balancing can be met using smpnice feature. I will work on this
feature next."

> What do you think about a full runqueue virtualization, so that
> first level CPU scheduler could select task-group on any basis and then
> arbitrary runqueue was selected for running?

The patch selects the task group first, based on priority. From the patch

"+ /* Pick a task group first */
+#ifdef CONFIG_CPUMETER "

Did I miss something?

>
> Thanks,
> Kirill
> P.S. BTW, this patch doesn't allow hierarchy in CPU controler.
>


--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-08-04 14:57:16

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller


>>The downside to such a strategy is that there is a risk that nobody ever
>>gets around to implementing useful controllers, so it ends up dead code.
>>I'd judge that the interest in resource management is such that the risk of
>>this happening is low.
>
>
> I think the risk is that OpenVZ has all the controls and resource
> managers we need, while CKRM is still more research-ish. I find the
> OpenVZ code much clearer, cleaner and complete at the moment, although
> also much more conservative in its approach to solving problems.

Alan, I will be happy to hear what you mean by conservative :)
Maybe we can make it more revolutionary then.

Kirill

2006-08-04 15:27:05

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, Aug 04, 2006 at 06:51:55PM +0400, Kirill Korotaev wrote:
> OpenVZ assumes that tasks can't move between task-groups for a single
> reason:
> user shouldn't be able to escape from the container.
> But this have no implication on the design/implementation.

Doesn't the ability to move tasks between groups dynamically affect
(at least) the memory controller design (in giving up ownership etc.)?
Also, if we need to support this movement, we need to have some
corresponding system-call/file-system interface which supports this move
operation.

> BTW, do you see any practical use cases for tasks jumping between
> resource-containers?

One use case I have heard of which would benefit from such a feature is
(say) database threads which want to change their "resource
affinity" status depending on which customer query they are currently handling.
If they are handling a query for an "important" customer, they will want to be affined
to a high-bandwidth resource container; later, if they start handling
a less important query, they will want to give up this affinity and
instead move to a low-bandwidth container.

--
Regards,
vatsa

2006-08-04 15:30:01

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Kirill Korotaev wrote:

> I was familiar. And I can arise many arguable points in the CKRM
> infrastructure.
>
> For example:
> 1. Should task-group be changeable after set/inherited once?
> Are you planning to recalculate resources on group change?
> e.g. shared memory or used kernel memory is hard to recalculate.

As far as Resource Group requirements go, preserving the ability to
migrate tasks from one task-group to another is essential.

As Christoph points out, it's not something we need to do right now
but it shouldn't and needn't be eliminated.

If userspace wants an application's resource usage to depend on the kind
of work it is doing, then it must have a way to move the tasks of that
application to a different task-group (which encapsulates a different resource
limit setting) without having to restart it.

If the application has heavy shared resource usage that makes it hard
to have *accurate* control once it is moved to a different resource group, that's
fine - the flexibility to move has a price.

But let's not worry about this aspect right now until the controllers are sorted
out.

> 2. should task-group resource container manage all the resources as a
> whole?
> e.g. in OpenVZ tasks can belong to different CPU and UBC containers.
> It is more flexible and e.g. we used to put some vital kernel threads
> to a separate CPU group to decrease delays in service.

I personally like this flexibility too.
The price you pay for it is
- that the number of "attachments" from a task
grows with the number of controllers (not a big deal)
- if you expose the groupings to userspace, users have a larger namespace
(of controlled groups) to deal with which can get confusing if all they want
is "one" systemwide view of resource groups (and not one per resource).

At the BoF (and in earlier discussions on this topic in CKRM), the reasons offered
for going with a single systemwide grouping were simplicity for the user and
that it would be simpler to implement initially.

But since one can always have a userspace tool that creates identical
groupings for each controller and presents a single systemwide grouping to the
user, I don't think this is a real issue.

> 3. I also don't understand why normal binary interface like system call
> is not used.
> We have set_uid, sys_setrlimit and it works pretty good, does it?

If there are no hierarchies, a syscall interface is fine since the namespace
for the task-group is flat (so one can export to userspace either a number or a
string as a handle to that task-group for operations like create, delete,
set limit, get usage, etc)

A filesystem-based interface is useful when you have hierarchies (as resource
groups and cpusets do) since it naturally defines a convenient-to-use
hierarchical namespace.
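
As a purely hypothetical illustration of the flat, handle-based option (not an
existing or proposed kernel interface; the function names, the fixed-size table
and the meaning of "limit" are all invented for the example), the operations
above could be modelled like this:

#include <stdio.h>

#define MAX_GROUPS 16

struct group_sketch {
    int  in_use;
    long cpu_limit;   /* illustrative unit only, e.g. ticks per period */
    long cpu_usage;
};

static struct group_sketch groups[MAX_GROUPS];

/* create: hand back a small integer, the group's "name" in a flat namespace */
static int grp_create(void)
{
    for (int h = 0; h < MAX_GROUPS; h++)
        if (!groups[h].in_use) {
            groups[h] = (struct group_sketch){ .in_use = 1, .cpu_limit = -1 };
            return h;
        }
    return -1;
}

static int grp_set_limit(int h, long limit)
{
    if (h < 0 || h >= MAX_GROUPS || !groups[h].in_use)
        return -1;
    groups[h].cpu_limit = limit;
    return 0;
}

static long grp_get_usage(int h)
{
    return (h >= 0 && h < MAX_GROUPS && groups[h].in_use) ? groups[h].cpu_usage : -1;
}

int main(void)
{
    int h = grp_create();   /* flat handle, much like an rlimit resource id */

    grp_set_limit(h, 50);
    printf("group %d: limit=%ld usage=%ld\n", h, groups[h].cpu_limit, grp_get_usage(h));
    return 0;
}

A hierarchical scheme would instead name groups by paths (e.g. "/parent/child"),
which is what a filesystem-based interface such as configfs gives for free.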

> 4. do we want hierarchical grouping?
>
>>> - Design of individual resource controllers like memory and cpu
>>
>>
>>
>> Right. We won't be controlling memory, numtasks, disk, network etc
>> controllers via cpusets, will we?
>
> I hope so :)
>
> Thanks,
> Kirill
>

2006-08-04 16:07:09

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>OpenVZ assumes that tasks can't move between task-groups for a single
>>reason:
>>user shouldn't be able to escape from the container.
>>But this have no implication on the design/implementation.
>
>
> Doesn't the ability to move tasks between groups dynamically affect
> (at least) memory controller design (in giving up ownership etc)?
We save the object owner on the object. So if you change the container,
objects are still correctly charged to the creator and are uncharged
correctly on free.

> Also if we need to support this movement, we need to have some
> corresponding system call/file-system interface which supports this move
> operation.
It can be done by the same syscall (or whatever) which sets your
container group.
We have the same syscall for creating/setting/entering the container,
i.e. changing the container dynamically doesn't change the interface.

>>BTW, do you see any practical use cases for tasks jumping between
>>resource-containers?
>
>
> The use cases I have heard of which would benefit such a feature is
> (say) for database threads which want to change their "resource
> affinity" status depending on which customer query they are currently handling.
> If they are handling a query for an "important" customer, they will want to be affined
> to a high bandwidth resource container and later if they start handling
> a less important query they will want to give up this affinity and
> instead move to a low-bandwidth container.
This works mostly for CPU only. And the OpenVZ design allows changing the CPU
resource container dynamically.

But such a trick works poorly for memory, because:
1. threads share lots of resources.
2. complex databases can have more complicated handling than a thread per request,
e.g. one thread serves memory pools, another one caches, some handle stored procedures, some handle requests, etc.

BTW, exactly this difference shows the reason to have different groups for different resources.

Thanks,
Kirill

2006-08-04 16:16:02

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>I think the risk is that OpenVZ has all the controls and resource
>>managers we need, while CKRM is still more research-ish. I find the
>>OpenVZ code much clearer, cleaner and complete at the moment, although
>>also much more conservative in its approach to solving problems.
>
>
> I think it would be nice to compare first the features provided by ckrm and
> openvz at some point and agree upon the minimum common features we need to have
> as we go forward. For instance I think Openvz assumes that tasks do
> not need to move between containers (task-groups), whereas ckrm provides this
> flexibility for workload management. This may have some effect on the
> controller/interface design, no?

BTW, to help compare (as you noted above), here is the list of features provided by OpenVZ:

Memory and some other resources related to mem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- kernel memory. vmas, LDT, page tables, poll, select, ipc undos and many other kernel
structures which can be created on user requests.
Without its accounting/limiting a system is DoS'able.

user memory (private memory, shared memory, tmpfs, swap):
- locked pages
- shmpages
- physpages. accounting only. Correctly accounts fractions of memory
shared between containers. Can't be limited in a user-friendly manner,
since memory denials from page faults are not handled from user space :/
- private memory pages. These are private pages which are not backed up
in a file or swap and which are pure user pages. These are anonymous
private mappings and cow-able mappings (e.g. glibc .data) which result in private memory.
Accounted correctly, taking into account sharing between containers (i.e. page
fraction is accounted).
This resource is limited on mmap() call.

others:
- 2-level OOM killer. The fattest container should be selected to be killed first.
We introduce some guarantee against OOM, so that if the container
consumes less memory than it is guaranteed to, then it won't be killed.
- memory pinned by dcache (there is a simple DoS which can be done
by any Linux user to consume the whole normal zone)
- number of iptables entries (with virtualized networking
containers can allocate memory for iptable rules)
- other socket buffers (unix, netlinks)
- TCP rcv/snd buffers
- UDP rcv buffers
- number of TCP sockets
- number of unix/netlink/other sockets
- number of flocks
- number of ptys
- number of siginfo's
- number of files
- number of tasks

CPU management
~~~~~~~~~~~~~~
1. 2-level fair CPU scheduler with known theoretical fairness and latency bounds:
- 1st level selects a container to run based on the container weight (see the sketch below)
- 2nd level selects a runqueue in the container and a task in the runqueue

2. CPU limits. Limits the container to some CPU rate even if CPUs are idle.
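
(A generic illustration of weight-based first-level selection, added for this
discussion: it is not OpenVZ's actual algorithm, and the container names,
weights and the "least weighted virtual time" rule are assumptions made purely
for the example.)

#include <stdio.h>

struct container_sketch {
    const char *name;
    unsigned int weight;   /* bigger weight => bigger CPU share */
    double vtime;          /* CPU time consumed, normalised by weight */
};

/* 1st level: run the container that has consumed the least weighted time */
static struct container_sketch *pick_container(struct container_sketch *c, int n)
{
    struct container_sketch *best = &c[0];

    for (int i = 1; i < n; i++)
        if (c[i].vtime < best->vtime)
            best = &c[i];
    return best;
}

int main(void)
{
    struct container_sketch c[] = {
        { "ve101", 3, 0.0 },   /* should get roughly 3x the CPU of ve102 */
        { "ve102", 1, 0.0 },
    };
    int ran[2] = { 0, 0 };

    for (int tick = 0; tick < 40; tick++) {
        struct container_sketch *next = pick_container(c, 2);

        next->vtime += 1.0 / next->weight;   /* charge one tick, scaled by weight */
        ran[next - c]++;
        /* the 2nd level (not shown) would pick a runqueue/task inside 'next' */
    }
    printf("%s ran %d ticks, %s ran %d ticks\n", c[0].name, ran[0], c[1].name, ran[1]);
    return 0;
}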


2 level disk quota
~~~~~~~~~~~~~~~~~~
Allows limiting a directory subtree to some amount of disk space.
Inside this quota, standard Linux per-user quotas are available.

Thanks,
Kirill

2006-08-04 16:49:27

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller


> BTW, to help to compare (as you noted above) here is the list of
> features provided by OpenVZ:
>

Could you point to a place where we can get a broken-down set of
patches for OpenVZ or (even better) UBC?

For purposes of the resource management discussion, it will be
useful to be able to look at the UBC patches in isolation
and perhaps port them over to some common interface for testing and
comparing with other implementations.

--Shailabh

2006-08-04 17:03:06

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Kirill Korotaev wrote:

>> The use cases I have heard of which would benefit such a feature is
>> (say) for database threads which want to change their "resource
>> affinity" status depending on which customer query they are currently
>> handling. If they are handling a query for a "important" customer,
>> they will want to be affined
>> to a high bandwidth resource container and later if they start handling
>> a less important query they will want to give up this affinity and
>> instead move to a low-bandwidth container.
>
> this works mostly for CPU only.

And for block I/O bandwidth control, since the priority of I/O requests can
also be changed dynamically pretty easily.


> And OpenVZ design allows to change CPU
> resource container dynamically.
>
> But such a trick works poorly for memory, because:
> 1. threads share lots of resources.
> 2. complex databases can have more complicated handling than a thread
> per request.
> e.g. one thread servers memory pools, another one caches, some for
> stored procedures, some for requests etc.
>

True. Stuff like memory, open files etc. are harder to control since
you can't take back allocations that easily and sharing with other tasks is
possible.

> BTW, exactly this difference shows the reason to have different groups
> for different resources.
>

Good point.

> Thanks,
> Kirill
>

2006-08-04 17:06:46

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, Aug 04, 2006 at 12:49:16PM -0400, Shailabh Nagar wrote:
>
> >BTW, to help to compare (as you noted above) here is the list of
> >features provided by OpenVZ:
> >
>
> Could you point to a place where we can get a broken-down set of
> patches for OpenVZ or (even better), UBC ?
>
> For purposes of the resource management discussion, it will be
> useful to be able to look at the UBC patches in isolation
> and perhaps port them over to some common interface for testing
> comparing with other implementations.

Kirill,

Is this the latest set (based on your last publication) for
people to look at ?

http://download.openvz.org/kernel/broken-out/2.6.16-026test005.1/

Thanks
Dipankar

2006-08-04 17:50:25

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

> OpenVZ assumes that tasks can't move between task-groups for a single
> reason:
> user shouldn't be able to escape from the container.
> But this have no implication on the design/implementation.

It does, for the memory controller at least. Things like shared
anon_vma's between tasks across containers make it somewhat harder.
It's much worse if you allow threads to split across containers.

> BTW, do you see any practical use cases for tasks jumping between
> resource-containers?

2006-08-04 18:18:04

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Dipankar Sarma wrote:
> On Fri, Aug 04, 2006 at 12:49:16PM -0400, Shailabh Nagar wrote:
>
>>>BTW, to help to compare (as you noted above) here is the list of
>>>features provided by OpenVZ:
>>>
>>
>>Could you point to a place where we can get a broken-down set of
>>patches for OpenVZ or (even better), UBC ?
>>
>>For purposes of the resource management discussion, it will be
>>useful to be able to look at the UBC patches in isolation
>>and perhaps port them over to some common interface for testing
>>comparing with other implementations.
>
>
> Kirill,
>
> Is this the latest set (based on your last publication) for
> people to look at ?
>
> http://download.openvz.org/kernel/broken-out/2.6.16-026test005.1/
>
> Thanks
> Dipankar

Thanks for the link. That should suffice for resource group folks to start
looking at UBC carefully. The UBC stuff looks like it's been around since
1998! So hopefully not much has changed between 2.6.16 and now :-)

--Shailabh

2006-08-04 18:27:45

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 2006-08-04 at 20:03 +0400, Kirill Korotaev wrote:

> > Doesn't the ability to move tasks between groups dynamically affect
> > (at least) memory controller design (in giving up ownership etc)?
> we save object owner on the object. So if you change the container,
> objects are still correctly charged to the creator and are uncharged
> correctly on free.
>

Seems like the object owner should also change when the object moves
from one container to another.

> > Also if we need to support this movement, we need to have some
> > corresponding system call/file-system interface which supports this move
> > operation.
> it can be done by the same syscall or whatever which sets your
> container group.
> we have the same syscall for creating/setting/entering to the container.
> i.e. chaning the container dynamically doesn't change the interface.
>
> >>BTW, do you see any practical use cases for tasks jumping between
> >>resource-containers?
> >

I think the ability to move file-backed memory from one container to
another is useful. This allows appropriate containers to get charged
based on the usage pattern. Though this (movement between containers)
is not something that should be encouraged.

> >
> > The use cases I have heard of which would benefit such a feature is
> > (say) for database threads which want to change their "resource
> > affinity" status depending on which customer query they are currently handling.
> > If they are handling a query for an "important" customer, they will want to be affined
> > to a high bandwidth resource container and later if they start handling
> > a less important query they will want to give up this affinity and
> > instead move to a low-bandwidth container.

Hmm, would it not be better to have a thread each in two different
containers for handling different kinds of requests? Or if there is too
much sharing between threads, then setting the individual priority
should help.

> this works mostly for CPU only. And OpenVZ design allows to change CPU
> resource container dynamically.
> But such a trick works poorly for memory, because:
> 1. threads share lots of resources.
> 2. complex databases can have more complicated handling than a thread per request.
> e.g. one thread servers memory pools, another one caches, some for stored procedures, some for requests etc.
>

Any resource movement between containers should be best-effort. The
stats will tend to be more inaccurate (which I think is okay) as the
sharing of resources across containers increases.

> BTW, exactly this difference shows the reason to have different groups for different resources.
>

Well, for a set of processes that are sharing a set of resources
perfectly, it would be okay to combine all such resources in a single
container. But for a shared resource, like a file (that spans across
processes in different containers), it could be useful to have a
stand-alone container.

-rohit

2006-08-04 18:51:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 4 Aug 2006 16:46:38 +0530
Srivatsa Vaddagiri <[email protected]> wrote:

> On Fri, Aug 04, 2006 at 12:13:42AM -0700, Andrew Morton wrote:
> > There was a lot of discussion last time - Mike, Ingo, others. It would be
> > a useful starting point if we could be refreshed on what the main issues
> > were, and whether/how this new patchset addresses them.
>
> The main issues raised against the CPU controller posted last time were
> these:
>

A useful summary, thanks. It will probably help people if this description
could be maintained along with the patch. Because these issues are complex
and we have a habit of dropping the ball then picking it up months later
when everyone has forgotten everything.

>
> Ingo/Nick, what are your thoughts here?

I believe Ingo is having a bit of downtime for another week or so.

2006-08-04 19:10:42

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Thu, 2006-08-03 at 23:45 -0700, Andrew Morton wrote:
> On Fri, 4 Aug 2006 11:50:36 +0530
> Dipankar Sarma <[email protected]> wrote:
>
> > > And now we've dumped the good infrastructure and instead we've contentrated
> > > on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> > > layer.
> >
> > FWIW, this controller was originally written for f-series.
>
> What the heck is an f-series?

That is how the latest posting was referred to on the ckrm-tech mailing list.
>
> <googles, looks at
> http://images.automotive.com/cob/factory_automotive/images/Features/auto_shows/2005_IEAS/2005_Ford_F-series_Harley-Davidson_front.JPG,
> gets worried about IBM design methodologies>

he he...
>
> > It should
> > be trivial to put it back there. So really, f-series isn't gone
> > anywhere. If you want to merge it, I am sure it can be re-submitted.
>
> Well. It shouldn't be a matter of what I want to merge - you're the
> resource-controller guys. But...
>
> > > I wonder how many of the consensus-makers were familiar with the
> > > contemporary CKRM core?
> >
> > I think what would be nice is a clear strategy on whether we need
> > to work out the infrastructure or the controllers first.
>
> We should put our thinking hats on and decide what sort of infrastructure
> will need to be in place by the time we have a useful number of useful
> controllers implemented.
>
> > One of
> > the strongest points raised in the BoF was - forget the infrastructure
> > for now, get some mergable controllers developed.
>
> I wonder what inspired that approach. Perhaps it was a reaction to CKRM's
> long and difficult history? Perhaps it was felt that doing standalne
> controllers with ad-hoc interfaces would make things easier to merge?
>
> Perhaps. But I think you guys know that the end result would be inferior,
> and that getting good infrastructure in place first will produce a better
> end result, but you decided to go this way because you want to get
> _something_ going?

To some extent yes...
>
> > If you
> > want to stick to f-series infrastructure and want to see some
> > consensus controllers evolve on top of it, that can be done too.
>
> Do you think that the CKRM core as last posted had any unnecessary
> features? I don't have the operational experience to be able to judge

No, not at all. But there were some disagreements w.r.t. which interface
to use. So, as pointed out by Vatsa in a different email, we wanted to
proceed with the controller implementation (where we can get more
agreement) and avoid the controversial topic for now.
> that, so I'd prefer to trust your experience and judgement on that. But
> the features which _were_ there looked quite OK from an implementation POV.
>
> So my take was "looks good, done deal - let's go get some sane controllers
> working". And now this!
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2006-08-04 19:11:50

by Shailabh Nagar

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Rohit Seth wrote:

>>>The use cases I have heard of which would benefit such a feature is
>>>(say) for database threads which want to change their "resource
>>>affinity" status depending on which customer query they are currently handling.
>>>If they are handling a query for a "important" customer, they will want affinied
>>>to a high bandwidth resource container and later if they start handling
>>>a less important query they will want to give up this affinity and
>>>instead move to a low-bandwidth container.
>
>
> hmm, would it not be better to have a thread each in two different
> containers for handling different kind of requests.

It's possible, but now you're effectively requiring the thread pool to
expand by as many times as there are service levels supported.

Any long-running job whose prioritization changes during its lifetime
also benefits from being able to be moved.


> Or if there is too
> much of sharing between threads, then setting the individual priority
> should help.
>

2006-08-04 19:25:34

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Fri, 2006-08-04 at 15:11 -0400, Shailabh Nagar wrote:
> Rohit Seth wrote:
>
> >>>The use cases I have heard of which would benefit such a feature is
> >>>(say) for database threads which want to change their "resource
> >>>affinity" status depending on which customer query they are currently handling.
> >>>If they are handling a query for a "important" customer, they will want affinied
> >>>to a high bandwidth resource container and later if they start handling
> >>>a less important query they will want to give up this affinity and
> >>>instead move to a low-bandwidth container.
> >
> >
> > hmm, would it not be better to have a thread each in two different
> > containers for handling different kind of requests.
>
> Its possible but now you're effectively requiring the thread pool to
> expand as many times as service levels supported.
>

Either increase the number of threads to match the number of service
levels OR have some kind of individual task priority changes.
Individual processes/tasks in a container should be able to have
different priorities.

> any long running job whose prioritization changes during its lifetime
> also benefits from being able to be moved.
>

Moving a task (or any other resource for that matter) from one container
to another should be considered an extreme step. There are
associated resources, like anon memory etc., that need to also be
appropriately accounted in the new container.

>
> > Or if there is too
> > much of sharing between threads, then setting the individual priority
> > should help.
> >

-rohit

2006-08-04 23:09:58

by Jiri Slaby

[permalink] [raw]
Subject: Re: [ RFC, PATCH 2/5 ] CPU controller - Define group operations

Srivatsa Vaddagiri wrote:
> Define these operations for a task-group:
>
> - create new group
> - destroy existing group
> - assign bandwidth (quota) for a group
> - get bandwidth (quota) of a group
>
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>
>
>
>
> include/linux/sched.h | 12 +++++++
> kernel/sched.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 91 insertions(+)
>
> diff -puN kernel/sched.c~cpu_ctlr_grp_ops kernel/sched.c
> --- linux-2.6.18-rc3/kernel/sched.c~cpu_ctlr_grp_ops 2006-08-04 07:58:50.000000000 +0530
> +++ linux-2.6.18-rc3-root/kernel/sched.c 2006-08-04 07:58:50.000000000 +0530
> @@ -7063,3 +7063,82 @@ void set_curr_task(int cpu, struct task_
> }
>
> #endif
> +
> +#ifdef CONFIG_CPUMETER
> +
> +/* Allocate runqueue structures for the new task-group */
> +void *sched_alloc_group(void)
> +{
> + struct task_grp *tg;
> + struct task_grp_rq *tgrq;
> + int i;
> +
> + tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> + if (!tg)
> + return NULL;
> +
> + tg->ticks = -1; /* No limit */
> +
> + for_each_possible_cpu(i) {
> + tgrq = kzalloc(sizeof(*tgrq), GFP_KERNEL);
> + if (!tgrq)
> + goto oom;
> + tg->rq[i] = tgrq;
> + task_grp_rq_init(tgrq, tg->ticks);
> + }
> +
> + return (void *)tg;

unneeded cast

> +oom:
> + while (i--)
> + kfree(tg->rq[i]);
> +
> + kfree(tg);
> + return NULL;
> +}
> +
> +/* Deallocate runqueue structures */
> +void sched_dealloc_group(void *grp)
> +{
> + struct task_grp *tg = (struct task_grp *)grp;

again

> + int i;
> +
> + for_each_possible_cpu(i)
> + kfree(tg->rq[i]);
> +
> + kfree(tg);
> +}
> +
> +/* Assign quota to this group */
> +void sched_assign_quota(void *grp, int quota)
> +{
> + struct task_grp *tg = (struct task_grp *)grp;

and one more time

> + int i;
> +
> + tg->ticks = (quota * 5 * HZ) / 100;
> +
> + for_each_possible_cpu(i)
> + tg->rq[i]->ticks = tg->ticks;
> +
> +}
> +
> +/* Return assigned quota for this group */
> +int sched_get_quota(void *grp)
> +{
> + struct task_grp *tg = (struct task_grp *)grp;

...

> + int quota;
> +
> + quota = (tg->ticks * 100) / (5 * HZ);
> +
> + return quota;

what about just
return (tg->ticks * 100) / (5 * HZ);
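
For reference, here is a minimal sketch of that helper with both review comments
applied; struct task_grp is trimmed to the one field used here, and an HZ value
is assumed only so the fragment stands alone:

#define HZ 250			/* assumed for the example */

struct task_grp {
	int ticks;
};

/* void * converts implicitly to any object pointer in C, so no cast is needed */
int sched_get_quota(void *grp)
{
	struct task_grp *tg = grp;

	return (tg->ticks * 100) / (5 * HZ);
}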

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-05 03:30:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Andrew Morton wrote:
> On Fri, 4 Aug 2006 10:37:53 +0530
> Srivatsa Vaddagiri <[email protected]> wrote:
>
>
>>Resource management has been talked about quite extensively in the
>>past, more recently in the context of containers. The basic requirement
>>here is to provide isolation between *groups* of task wrt their use
>>of various resources like CPU, Memory, I/O bandwidth, open file-descriptors etc.
>>
>>Different maintainers have however expressed different opinions over the need to
>>complicate the kernel to meet this need, especially since it involves core
>>kernel code like the resource schedulers.
>>
>>A BoF was hence held at OLS this year to come to a consensus on the minimum
>>requirements of a resource management solution for Linux kernel. Some notes
>>taken at the BoF are posted here:
>>
>> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0607.3/0896.html
>>
>>An important consensus point of the BoF seemed to be "focus on real
>>controllers more, preferably memory first, using some simple interface
>>and task grouping mechanism".
>
>
> ug, I didn't know this. Had I been there (sorry) I'd have disagreed with
> this whole strategy.
>
> I thought the most recently posted CKRM core was a fine piece of code. It
> provides the machinery for grouping tasks together and the machinery for
> establishing and viewing those groupings via configfs, and other such
> common functionality. My 20-minute impression was that this code was an
> easy merge and it was just awaiting some useful controllers to come along.
>
> And now we've dumped the good infrastructure and instead we've contentrated
> on the controller, wired up via some imaginative ab^H^Hreuse of the cpuset
> layer.
>
> I wonder how many of the consensus-makers were familiar with the
> contemporary CKRM core?

Sorry, I've been busy with offline stuff and won't be able to catch up with
emails until next week -- someone else might have already covered this.

But: I think we definitely agreed that a nice simple implementation and even
userspace API for grouping tasks would be a no-brainer.

I advocated implementing some simple controllers on top of such an interface
first, so that people can start to put in some of their requirements, see if a
common controller framework should be created, and look at what interfaces people
want for them.

I don't have a problem with CKRM as such, but I think there are other groups
with good approaches and the problem has been to get people working together.

--
SUSE Labs, Novell Inc.

2006-08-07 07:19:36

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>>Doesnt the ability to move tasks between groups dynamically affect
>>>(atleast) memory controller design (in giving up ownership etc)?
>>
>>we save object owner on the object. So if you change the container,
>>objects are still correctly charged to the creator and are uncharged
>>correctly on free.
>>
>
>
> Seems like the object owner should also change when the object moves
> from one container to another.
Consider a file which is opened in 2 processes. One of the processes
then wants to move to another container. How would you decide whether
to change the file owner or not?

Kirill

2006-08-07 07:23:15

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Dipankar Sarma wrote:
> On Fri, Aug 04, 2006 at 12:49:16PM -0400, Shailabh Nagar wrote:
>
>>>BTW, to help to compare (as you noted above) here is the list of
>>>features provided by OpenVZ:
>>>
>>
>>Could you point to a place where we can get a broken-down set of
>>patches for OpenVZ or (even better), UBC ?
>>
>>For purposes of the resource management discussion, it will be
>>useful to be able to look at the UBC patches in isolation
>>and perhaps port them over to some common interface for testing
>>comparing with other implementations.
>
>
> Kirill,
>
> Is this the latest set (based on your last publication) for
> people to look at ?
>
> http://download.openvz.org/kernel/broken-out/2.6.16-026test005.1/
It is quite recent; all the diff-ubc-* patches.
Here you can find a tool for UBC control, ubctl.c:
http://download.openvz.org/contrib/utils/

Kirill

2006-08-07 07:24:48

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>> OpenVZ assumes that tasks can't move between task-groups for a single
>> reason:
>> user shouldn't be able to escape from the container.
>> But this have no implication on the design/implementation.
>
>
> It does, for the memory controller at least. Things like shared
> anon_vma's between tasks across containers make it somewhat harder.
> It's much worse if you allow threads to split across containers.
we already have the code to account page fractions shared between containers.
Though, it is quite useless to do so for threads... since these numbers have no meaning (not a real usage)
and only their sum is a correct value.

>> BTW, do you see any practical use cases for tasks jumping between
>> resource-containers?
>
>

2006-08-07 07:28:40

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>3. I also don't understand why normal binary interface like system call
>>is not used.
>> We have set_uid, sys_setrlimit and it works pretty good, does it?
>
>
> If there are no hierarchies, a syscall interface is fine since the namespace
> for the task-group is flat (so one can export to userspace either a number or a
> string as a handle to that task-group for operations like create, delete,
> set limit, get usage, etc)
Syscalls work fine here as well: you need to specify parent_id and new_id for creation,
that's all. We have such an interface for the hierarchical CPU scheduler.
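
To make the comparison concrete, a flat id-based interface of the kind described
above might look roughly like the hypothetical prototypes below; these are not
the actual OpenVZ or CKRM syscalls:

#include <sys/types.h>

/* hypothetical prototypes, for illustration only */
long taskgrp_create(int parent_id, int new_id);
long taskgrp_set_quota(int id, int cpu_quota_percent);
long taskgrp_attach(int id, pid_t pid);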

> A filesystem based interface is useful when you have hierarchies (as resource
> groups and cpusets do) since it naturally defines a convenient to use
> hierarchical namespace.
but it is not very convenient for applications then.

Thanks,
Kirill

2006-08-07 09:31:24

by Paul Jackson

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Kirill wrote:
> > A filesystem based interface is useful when you have hierarchies (as resource
> > groups and cpusets do) since it naturally defines a convenient to use
> > hierarchical namespace.
> but it is not much convinient for applications then.

Is this simply a language issue? File system hierarchies
are more easily manipulated with shell utilities (ls, cat,
find, grep, ...) and system call API's are easier to access
from C?

If so, then perhaps all that's lacking for convenient C access
to a filesystem based interface is a good library that presents
an API convenient for use from C code, but underneath makes the
necessary file system calls (open, read, opendir, stat, ...).
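
As a very rough sketch of such a library call, using only standard file-system
syscalls underneath (the mount point and attribute names are made up for
illustration):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* e.g. rg_read_attr("web", "cpu_quota", buf, sizeof(buf)) */
int rg_read_attr(const char *group, const char *attr, char *buf, size_t len)
{
	char path[PATH_MAX];
	ssize_t n;
	int fd;

	/* hide the filesystem layout behind the library */
	snprintf(path, sizeof(path), "/mnt/resgroups/%s/%s", group, attr);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}

The application then never parses the hierarchy itself; the library owns the
path layout and text format.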

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-07 14:34:39

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Kirill Korotaev wrote:

>>> OpenVZ assumes that tasks can't move between task-groups for a
>>> single reason:
>>> user shouldn't be able to escape from the container.
>>> But this have no implication on the design/implementation.
>>
>>
>>
>> It does, for the memory controller at least. Things like shared
>> anon_vma's between tasks across containers make it somewhat harder.
>> It's much worse if you allow threads to split across containers.
>
> we already have the code to account page fractions shared between
> containers.
> Though, it is quite useless to do so for threads... Since this numbers
> have no meaning (not a real usage)
> and only the sum of it will be a correct value.
>
That sort of accounting poses various horrible problems, which is
why we steered away from it. If you share pages between containers
(presumably billing them equal shares per user), what happens
when you're already at your limit, and one of your sharers exits?

Plus, are you billing by vma or address_space?

M.


2006-08-07 15:58:31

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 02:30 -0700, Paul Jackson wrote:
> Kirill wrote:
> > > A filesystem based interface is useful when you have hierarchies (as resource
> > > groups and cpusets do) since it naturally defines a convenient to use
> > > hierarchical namespace.
> > but it is not much convinient for applications then.
>
> Is this simply a language issue? File systems hierarchies
> are more easily manipulated with shell utilities (ls, cat,
> find, grep, ...) and system call API's are easier to access
> from C?
>
> If so, then perhaps all that's lacking for convenient C access
> to a filesystem based interface is a good library, that presents
> an API convenient for use from C code, but underneath makes the
> necessary file system calls (open, read, diropen, stat, ...).
>

I totally agree.

When the difference comes down to a language issue, one advantage of a filesystem
interface is that there is no need for a user-space app to do simple management.
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2006-08-07 16:08:55

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>>A filesystem based interface is useful when you have hierarchies (as resource
>>>groups and cpusets do) since it naturally defines a convenient to use
>>>hierarchical namespace.
>>
>>but it is not much convinient for applications then.
>
>
> Is this simply a language issue? File systems hierarchies
> are more easily manipulated with shell utilities (ls, cat,
> find, grep, ...) and system call API's are easier to access
> from C?
>
> If so, then perhaps all that's lacking for convenient C access
> to a filesystem based interface is a good library, that presents
> an API convenient for use from C code, but underneath makes the
> necessary file system calls (open, read, diropen, stat, ...).

IMHO:
file system APIs are not good for accessing attributed data.
E.g. we have /proc, which is very convenient for use from a shell etc., but
is not good for applications, not fast enough, etc.
Moreover, /proc has always had problems with locking and races, and people tend to
feel like they can change the text presentation of data, while applications parsing
it tend to break.

Kirill

2006-08-07 16:32:56

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>> we already have the code to account page fractions shared between
>> containers.
>> Though, it is quite useless to do so for threads... Since this numbers
>> have no meaning (not a real usage)
>> and only the sum of it will be a correct value.
>>
> THat sort of accounting poses various horrible problems, which is
> why we steered away from it. If you share pages between containers
> (presumably billing them equal shares per user), what happens
> when you're already at your limit, and one of your sharer's exits?
you come across your limit and new allocations will fail.
BUT! IMPORTANT!
In the real-life use case with OpenVZ we allow only a limited amount of data to be shared across containers:
vmas mapped as private, i.e. the .data sections of glibc and other libraries
(and .code if it is writable). So if you use the same glibc and other executables
in the containers then you are charged only a fraction of the physical memory used by them.
This kind of sharing is not that huge (<< memory limits usually),
so the situation you described is not a problem
in real life (at least for OpenVZ).

> Plus, are you billing by vma or address_space?
Not sure what you mean. We charge pages, first by whole vma,
but as a page gets used by more containers your usage decreases.
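
A toy illustration of the fractional charging described above (not the OpenVZ
code); it only shows how the per-container charge for one shared page shrinks
as more containers map it:

#include <stdio.h>

int main(void)
{
	const unsigned int page_size = 4096;	/* bytes in one shared page */
	int sharers;

	for (sharers = 1; sharers <= 4; sharers++)
		printf("%d container(s) mapping the page: %u bytes charged to each\n",
		       sharers, page_size / sharers);
	return 0;
}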

Thanks,
Kirill

2006-08-07 17:14:50

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 11:19 +0400, Kirill Korotaev wrote:
> >>>Doesnt the ability to move tasks between groups dynamically affect
> >>>(atleast) memory controller design (in giving up ownership etc)?
> >>
> >>we save object owner on the object. So if you change the container,
> >>objects are still correctly charged to the creator and are uncharged
> >>correctly on free.
> >>
> >
> >
> > Seems like the object owner should also change when the object moves
> > from one container to another.

> Consider a file which is opened in 2 processes. one of the processes
> wants to move to another container then. How would you decide whether
> to change the file owner or not?
>

If a process has sufficient rights to move a file to a new container
then it should be okay to assign the file to the new container.

Though the point is, if a resource (like a file) is getting migrated to a
new container then all the attributes (like owner, #pages in memory,
etc.) attached to that resource (file) should also migrate to this new
container. Otherwise the semantics of where the resource belongs
become very difficult.

And if you really want a resource to not be able to migrate from one
container then we could define an IMMUTABLE flag to indicate that behavior.

-rohit

2006-08-07 17:15:55

by Paul Jackson

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

> we have a /proc which is very convenient for use from shell etc. but
> is not good for applications, not fast enough etc.
> moreover, /proc had always problems with locking, races and people tend to
> feel like they can change text presention of data, while applications parsing
> it tend to break.

Yes - one can botch a file system API.

One can botch syscalls, too. Do you love 'ioctl'?

For some calls, the performance of a raw syscall is critical. And
eventually, filesystem API's must resolve to raw file i/o syscalls.

But for these sorts of system configuration and management, the
difference in performance between file system API's and raw syscall
API's is not one of the decisive issues that determines success or
failure.

Getting a decent API that naturally reflects the long term essential
shape of the data is one of these decisive issues.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-07 18:21:47

by Rohit Seth

[permalink] [raw]
Subject: Re: [ProbableSpam] Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 10:15 -0700, Paul Jackson wrote:
> > we have a /proc which is very convenient for use from shell etc. but
> > is not good for applications, not fast enough etc.
> > moreover, /proc had always problems with locking, races and people tend to
> > feel like they can change text presention of data, while applications parsing
> > it tend to break.
>
> Yes - one can botch a file system API.
>
> One can botch syscalls, too. Do you love 'ioctl'?
>
> For some calls, the performance of a raw syscall is critical. And
> eventually, filesystem API's must resolve to raw file i/o syscalls.
>
> But for these sorts of system configuration and management, the
> difference in performance between file system API's and raw syscall
> API's is not one of the decisive issues that determines success or
> failure.
>
> Getting a decent API that naturally reflects the long term essential
> shape of the data is one of these decisive issues.
>

Absolutely. Performance is really not key here. Setting up (or
tearing down) shouldn't be that frequent an operation. Configfs
(or proc) should be able to handle those.


-rohit


2006-08-07 18:32:22

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 20:33 +0400, Kirill Korotaev wrote:
> >> we already have the code to account page fractions shared between
> >> containers.
> >> Though, it is quite useless to do so for threads... Since this numbers
> >> have no meaning (not a real usage)
> >> and only the sum of it will be a correct value.
> >>
> > THat sort of accounting poses various horrible problems, which is
> > why we steered away from it. If you share pages between containers
> > (presumably billing them equal shares per user), what happens
> > when you're already at your limit, and one of your sharer's exits?
> you come across your limit and new allocations will fail.
> BUT! IMPORTANT!
> in real life use case with OpenVZ we allow sharing not that much data across containers:
> vmas mapped as private, i.e. glibc and other libraries .data section
> (and .code if it is writable). So if you use the same glibc and other executables
> in the containers then your are charged only a fraction of phys memory used by it.
> This kind of sharing is not that huge (<< memory limits usually),
> so the situation you described is not a problem
> in real life (at least for OpenVZ).
>

I think it is not a problem for OpenVZ because there is not that much
sharing going on between containers, as you mentioned (btw, this minimal
amount of sharing is a very good thing). Though I'm not sure if one has
to go to the extent of doing fractions with memory accounting. If the
containers are set up in such a way that there is some sharing across
containers then it is okay to be unfair and charge one of those
containers for the specific resource completely.

-rohit

2006-08-07 18:44:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 11:31 -0700, Rohit Seth wrote:
> I think it is not a problem for OpenVZ because there is not that much
> of
> sharing going between containers as you mentioned (btw, this least
> amount of sharing is a very good thing). Though I'm not sure if one
> has
> to go to the extent of doing fractions with memory accounting. If the
> containers are set up in such a way that there is some sharing across
> containers then it is okay to be unfair and charge one of those
> containers for the specific resource completely.

Right, and if you do reclaim against containers which are over their
limits, the containers being unfairly charged will tend to get hit
first. But, once this happens, I would hope that the ownership of those
shared pages should settle out among all of the users.

If you have 100 containers sharing 100 pages, container0 might be
charged for all 100 pages at first, but I'd hope that eventually
containers 0->99 would each get charged for a single page.

-- Dave

2006-08-07 19:03:31

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Mon, 2006-08-07 at 11:43 -0700, Dave Hansen wrote:
> On Mon, 2006-08-07 at 11:31 -0700, Rohit Seth wrote:
> > I think it is not a problem for OpenVZ because there is not that much
> > of
> > sharing going between containers as you mentioned (btw, this least
> > amount of sharing is a very good thing). Though I'm not sure if one
> > has
> > to go to the extent of doing fractions with memory accounting. If the
> > containers are set up in such a way that there is some sharing across
> > containers then it is okay to be unfair and charge one of those
> > containers for the specific resource completely.
>
> Right, and if you do reclaim against containers which are over their
> limits, the containers being unfairly charged will tend to get hit
> first. But, once this happens, I would hope that the ownership of those
> shared pages should settle out among all of the users.
>

I think there is a lot of simplicity and value added by charging one
container (even unfairly) for one resource completely. This puts the
onus on the system admin to set up the containers appropriately.

> If you have 100 containers sharing 100 pages, container0 might be
> charged for all 100 pages at first, but I'd hope that eventually
> containers 0->99 would each get charged for a single page.
>

You would be better off having a notion of a "shared" container where
these kinds of resources get charged. So, if 100 processes are all
touching pages from some common file then the file could be part of a
self-contained container with its own limits etc.

-rohit

2006-08-07 19:46:59

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

Rohit Seth wrote:
> On Mon, 2006-08-07 at 11:43 -0700, Dave Hansen wrote:
>
>>On Mon, 2006-08-07 at 11:31 -0700, Rohit Seth wrote:
>>
>>>I think it is not a problem for OpenVZ because there is not that much
>>>of
>>>sharing going between containers as you mentioned (btw, this least
>>>amount of sharing is a very good thing). Though I'm not sure if one
>>>has
>>>to go to the extent of doing fractions with memory accounting. If the
>>>containers are set up in such a way that there is some sharing across
>>>containers then it is okay to be unfair and charge one of those
>>>containers for the specific resource completely.
>>
>>Right, and if you do reclaim against containers which are over their
>>limits, the containers being unfairly charged will tend to get hit
>>first. But, once this happens, I would hope that the ownership of those
>>shared pages should settle out among all of the users.
>>
>
>
> I think there is lot of simplicity and value add by charging one
> container (even unfairly) for one resource completely. This puts the
> onus on system admin to set up the containers appropriately.

It also saves you from maintaining huge lists against each page.

Worst case, you want to bill everyone who opens that address_space
equally. But the semantics on exit still suck.

What was Alan's quote again? "unfair, unreliable, inefficient ...
pick at least one out of the three". Or something like that.

M.

2006-08-08 07:17:07

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>>>>Doesnt the ability to move tasks between groups dynamically affect
>>>>>(atleast) memory controller design (in giving up ownership etc)?
>>>>
>>>>we save object owner on the object. So if you change the container,
>>>>objects are still correctly charged to the creator and are uncharged
>>>>correctly on free.
>>>>
>>>
>>>
>>>Seems like the object owner should also change when the object moves
>>>from one container to another.
>
>
>>Consider a file which is opened in 2 processes. one of the processes
>>wants to move to another container then. How would you decide whether
>>to change the file owner or not?
>>
>
>
> If a process has sufficient rights to move a file to a new container
> then it should be okay to assign the file to the new container.
there is no such notion as "rights to move a file to a new container".
The same file can be opened by processes belonging to other containers,
and you have no clue whether you have to change the owner or not.

> Though the point is, if a resource (like file) is getting migrated to a
> new container then all the attributes (like owner, #pages in memory
> etc.) attached to that resource (file) should also migrate to this new
> container. Otherwise the semantics of where does the resource belong
> becomes very difficult.
The same goes for many other resources. It is a big mistake to think that most resources
belong to processes and that the owner process can be easily determined.

> And if you really want a resource to not be able to migrate from one
> container then we could define IMMUTABLE flag to indicate that behavior.
I hope not the one used in ext[23]? :)

Kirill

2006-08-08 07:19:31

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

>>you come across your limit and new allocations will fail.
>>BUT! IMPORTANT!
>>in real life use case with OpenVZ we allow sharing not that much data across containers:
>>vmas mapped as private, i.e. glibc and other libraries .data section
>>(and .code if it is writable). So if you use the same glibc and other executables
>>in the containers then your are charged only a fraction of phys memory used by it.
>>This kind of sharing is not that huge (<< memory limits usually),
>>so the situation you described is not a problem
>>in real life (at least for OpenVZ).
>>
>
>
> I think it is not a problem for OpenVZ because there is not that much of
> sharing going between containers as you mentioned (btw, this least
> amount of sharing is a very good thing). Though I'm not sure if one has
> to go to the extent of doing fractions with memory accounting. If the
> containers are set up in such a way that there is some sharing across
> containers then it is okay to be unfair and charge one of those
> containers for the specific resource completely.
In this case you can't plan your resources, can't say which one consumes
more memory, can't select the worst container to kill, and have many other drawbacks.

Kirill

2006-08-08 14:19:51

by Nick Piggin

[permalink] [raw]
Subject: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

Martin Bligh wrote:
> Rohit Seth wrote:
>

Hi guys, let me make a fool of myself and jump in this thread...

>> I think there is lot of simplicity and value add by charging one
>> container (even unfairly) for one resource completely. This puts the
>> onus on system admin to set up the containers appropriately.
>
>
> It also saves you from maintaining huge lists against each page.
>
> Worse case, you want to bill everyone who opens that address_space
> equally. But the semantics on exit still suck.
>
> What was Alan's quote again? "unfair, unreliable, inefficient ...
> pick at least one out of the three". or something like that.

What's the sucking semantics on exit? I haven't looked much at the
existing memory controllers going around, but the implementation I
imagine looks something like this (I think it is conceptually similar
to the basic beancounters idea):

- anyone who allocates a page for anything gets charged for that page.
Except interrupt/softirq context. Could we ignore these for the moment?

This does give you kernel (slab, pagetable, etc) allocations as well as
userspace. I don't like the idea of doing controllers for inode cache
and controllers for dentry cache, etc, etc, ad infinitum.

- each struct page has a backpointer to its billed container. At the mm
summit Linus said he didn't want back pointers, but I clarified with him
and he isn't against them if they are easily configured out when not using
memory controllers.

- memory accounting containers are in a hierarchy. If you want to destroy a
container but it still has billed memory outstanding, that gets charged
back to the parent. The data structure itself obviously still needs to
stay around, to keep the backpointers from going stale... but that could
be as little as a word or two in size.

The reason I like this way of accounting is that it can be done with a couple
of hooks into page_alloc.c and an ifdef in mm.h, and that is the extent of
the impact on core mm/ so I'd be against anything more intrusive unless this
really doesn't work.
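
Roughly, the shape being proposed might look like the sketch below; all names
are invented for illustration and this is not a patch against page_alloc.c:

#include <stddef.h>

/* illustrative only */
struct mem_container {
	struct mem_container *parent;	/* hierarchy: charges roll up here */
	long pages_charged;
	long page_limit;
};

/* stand-in for struct page, with the optional back-pointer */
struct page_stub {
	unsigned long flags;
	struct mem_container *container;	/* configured out when unused */
};

/* hook on the allocation path (process context only) */
int container_charge_page(struct page_stub *page, struct mem_container *c)
{
	if (c->pages_charged >= c->page_limit)
		return -1;	/* caller reclaims within the container, or fails */
	c->pages_charged++;
	page->container = c;
	return 0;
}

/* hook on the free path */
void container_uncharge_page(struct page_stub *page)
{
	if (page->container) {
		page->container->pages_charged--;
		page->container = NULL;
	}
}

/* on container destroy: outstanding charges roll up to the parent */
void container_destroy(struct mem_container *c)
{
	if (c->parent)
		c->parent->pages_charged += c->pages_charged;
}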

--
SUSE Labs, Novell Inc.

2006-08-08 14:58:14

by Dave Hansen

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

On Wed, 2006-08-09 at 00:19 +1000, Nick Piggin wrote:
> This does give you kernel (slab, pagetable, etc) allocations as well as
> userspace. I don't like the idea of doing controllers for inode cache
> and controllers for dentry cache, etc, etc, ad infinitum.

Those two might not be such a bad idea. Of the slab in my system, 90%
is reliably from those two slabs alone. Now, a controller for the
'Acpi-Operand' slab might be going too far. ;)

Certainly something we should at least consider down the road.

-- Dave

2006-08-08 15:33:53

by Nick Piggin

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

Dave Hansen wrote:
> On Wed, 2006-08-09 at 00:19 +1000, Nick Piggin wrote:
>
>> This does give you kernel (slab, pagetable, etc) allocations as well as
>> userspace. I don't like the idea of doing controllers for inode cache
>> and controllers for dentry cache, etc, etc, ad infinitum.
>
>
> Those two might not be such a bad idea. Of the slab in my system, 90%
> is reliably from those two slabs alone. Now, a controller for the
> 'Acpi-Operand' slab might be going too far. ;)
>
> Certainly something we should at least consider down the road.

But if you have a unified struct page accounting, you don't need that.
You don't need struct radix_tree_node accounting, you don't need buffer_head
accounting, pagetable page accounting, vm_area_struct accounting, task_struct
accounting, etc etc in order to do your memory accounting if what you just
want to know is "who allocated what".

And remember that if you have one container going crazy with inode/dentry
cache, it will get hit by its resource limit and end up having to reclaim
them or go OOM.

Now you *may* want to split the actual accounting into kernel and user parts
if you're worried about obscure corner cases in kernel memory accounting. But
this would basically come for free when you have the GFP_EASYRECLAIM thingy
(at any rate, it is quite unintrusive).


Basically, what I have been hearing is that people want to be able to
surgically isolate the memory allocation of one container from that of
another. IMO this is simply infeasible (and exploit-prone) to do on a
per-kernel-object basis.

--
SUSE Labs, Novell Inc.

2006-08-08 17:08:14

by Martin Bligh

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

>> It also saves you from maintaining huge lists against each page.
>>
>> Worse case, you want to bill everyone who opens that address_space
>> equally. But the semantics on exit still suck.
>>
>> What was Alan's quote again? "unfair, unreliable, inefficient ...
>> pick at least one out of the three". or something like that.
>
> What's the sucking semantics on exit? I haven't looked much at the
> existing memory controllers going around, but the implementation I
> imagine looks something like this (I think it is conceptually similar
> to the basic beancounters idea):

You have to increase the other processes' allocations, putting them
over their limits. If you then force them into reclaim, they're going
to stall, and give bad latency.

> - anyone who allocates a page for anything gets charged for that page.
> Except interrupt/softirq context. Could we ignore these for the moment?
>
> This does give you kernel (slab, pagetable, etc) allocations as well as
> userspace. I don't like the idea of doing controllers for inode cache
> and controllers for dentry cache, etc, etc, ad infinitum.
>
> - each struct page has a backpointer to its billed container. At the mm
> summit Linus said he didn't want back pointers, but I clarified with him
> and he isn't against them if they are easily configured out when not
> using memory controllers.
>
> - memory accounting containers are in a hierarchy. If you want to destroy a
> container but it still has billed memory outstanding, that gets charged
> back to the parent. The data structure itself obviously still needs to
> stay around, to keep the backpointers from going stale... but that could
> be as little as a word or two in size.
>
> The reason I like this way of accounting is that it can be done with a
> couple
> of hooks into page_alloc.c and an ifdef in mm.h, and that is the extent of
> the impact on core mm/ so I'd be against anything more intrusive unless
> this
> really doesn't work.
>

See "inefficent" above (sorry ;-)) What you've chosen is more correct,
but much higher overhead. The point was that there's tradeoffs either
way - the conclusion we came to last time was that to make it 100%
correct, you'd be better off going with a model like Xen.

1. You're adding a backpointer to struct page.

2. Each page is not accounted to one container, but shared across them,
so the billing changes every time someone forks or exits. And not just
for that container, but all of them. Think pte chain based rmap ...
except worse.

3. When a container needs to "shrink" when somebody else exits, how do
we reclaim pages from a specific container?

M.

2006-08-08 17:19:05

by Rohit Seth

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller

On Tue, 2006-08-08 at 11:17 +0400, Kirill Korotaev wrote:
> >>>>>Doesnt the ability to move tasks between groups dynamically affect
> >>>>>(atleast) memory controller design (in giving up ownership etc)?
> >>>>
> >>>>we save object owner on the object. So if you change the container,
> >>>>objects are still correctly charged to the creator and are uncharged
> >>>>correctly on free.
> >>>>
> >>>
> >>>
> >>>Seems like the object owner should also change when the object moves
> >>>from one container to another.
> >
> >
> >>Consider a file which is opened in 2 processes. one of the processes
> >>wants to move to another container then. How would you decide whether
> >>to change the file owner or not?
> >>
> >
> >
> > If a process has sufficient rights to move a file to a new container
> > then it should be okay to assign the file to the new container.

> there is no such notion as "rights to move a file to a new container".
> The same file can be opened in processes belonging to other containers.
> And you have no any clue whether to have to change the owner or not.
>

I think this is where more details on a design will help. What I'm
thinking is for each address_space to have a container pointer. In this
case, pages belonging to a file will be charged against one single container
(irrespective of how many processes are touching those pages).
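
A minimal sketch of that data-structure change, with hypothetical names (it is
not the real struct address_space):

struct container {
	long pagecache_pages;
	long pagecache_limit;
};

/* stand-in for struct address_space: the whole file bills one container */
struct mapping_stub {
	struct container *container;
	/* ... page tree, etc. ... */
};

/* called when a page is added to this file's page cache */
int charge_pagecache_page(struct mapping_stub *mapping)
{
	struct container *c = mapping->container;

	if (c->pagecache_pages >= c->pagecache_limit)
		return -1;	/* over limit: reclaim from this container */
	c->pagecache_pages++;
	return 0;
}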

> > Though the point is, if a resource (like file) is getting migrated to a
> > new container then all the attributes (like owner, #pages in memory
> > etc.) attached to that resource (file) should also migrate to this new
> > container. Otherwise the semantics of where does the resource belong
> > becomes very difficult.
> The same for many other resources. It is a big mistake thinking that most resources
> belong to the processes and the owner process can be easily determined.
>

Sure, for some resources it wouldn't make sense to move them (or to find
out which is the real owner). And I'm not saying that we have to bind them
hard to a process either... but to a single container if they belong to the
same file (for example).

-rohit

2006-08-08 17:36:31

by Rohit Seth

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

On Wed, 2006-08-09 at 00:19 +1000, Nick Piggin wrote:

>
> What's the sucking semantics on exit? I haven't looked much at the
> existing memory controllers going around, but the implementation I
> imagine looks something like this (I think it is conceptually similar
> to the basic beancounters idea):
>
> - anyone who allocates a page for anything gets charged for that page.
> Except interrupt/softirq context. Could we ignore these for the moment?
>

And what happens when processes belonging to different containers start
accessing the same page?

> This does give you kernel (slab, pagetable, etc) allocations as well as
> userspace. I don't like the idea of doing controllers for inode cache
> and controllers for dentry cache, etc, etc, ad infinitum.
>

IMO, we don't need to worry about the kernel's internal data structures in
the first pass of container support. I agree that something like the dcache
can grow to consume a meaningful amount of memory in a system, but I
still think something simpler to start with, which can track user memory
(both anon and pagecache), would be a good start.

> - each struct page has a backpointer to its billed container. At the mm
> summit Linus said he didn't want back pointers, but I clarified with him
> and he isn't against them if they are easily configured out when not using
> memory controllers.
>

I think adding a pointer to struct page brings additional cost without
much additional benefit. Doing it at the address_space/anon_vma
level for the page cache is useful.

> - memory accounting containers are in a hierarchy. If you want to destroy a
> container but it still has billed memory outstanding, that gets charged
> back to the parent. The data structure itself obviously still needs to
> stay around, to keep the backpointers from going stale... but that could
> be as little as a word or two in size.
>

Before we go and say that we need a hierarchy of containers, we should
have a design of what a container should be containing. AFAICS, flat
containers should be able to do the job.

But in general, if a container is getting aborted, then any residual
resources should also be aborted wherever that makes sense (which may mean
flushing any page_cache pages), or the operation should not be permitted.

> The reason I like this way of accounting is that it can be done with a couple
> of hooks into page_alloc.c and an ifdef in mm.h, and that is the extent of
> the impact on core mm/ so I'd be against anything more intrusive unless this
> really doesn't work.
>

hmm, probably the changes to core mm are not going to be that intrusive.
The catch will be what happens when you hit the limit of memory assigned
to a container.

-rohit

2006-08-09 01:54:16

by Nick Piggin

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

Martin Bligh wrote:

>>> It also saves you from maintaining huge lists against each page.
>>>
>>> Worse case, you want to bill everyone who opens that address_space
>>> equally. But the semantics on exit still suck.
>>>
>>> What was Alan's quote again? "unfair, unreliable, inefficient ...
>>> pick at least one out of the three". or something like that.
>>
>>
>> What's the sucking semantics on exit? I haven't looked much at the
>> existing memory controllers going around, but the implementation I
>> imagine looks something like this (I think it is conceptually similar
>> to the basic beancounters idea):
>
>
> You have to increase the other processes allocations, putting them
> over their limits. If you then force them into reclaim, they're going
> to stall, and give bad latency.


Not within a particular container. If the process exits but leaves around
some memory charge, then that just remains within the same container.

If you want to remove a container, then you have a hierarchy of billing
and your charge just gets accounted to the parent.

>
>> - anyone who allocates a page for anything gets charged for that page.
>> Except interrupt/softirq context. Could we ignore these for the
>> moment?
>>
>> This does give you kernel (slab, pagetable, etc) allocations as
>> well as
>> userspace. I don't like the idea of doing controllers for inode cache
>> and controllers for dentry cache, etc, etc, ad infinitum.
>>
>> - each struct page has a backpointer to its billed container. At the mm
>> summit Linus said he didn't want back pointers, but I clarified
>> with him
>> and he isn't against them if they are easily configured out when
>> not using memory controllers.
>>
>> - memory accounting containers are in a hierarchy. If you want to
>> destroy a
>> container but it still has billed memory outstanding, that gets
>> charged
>> back to the parent. The data structure itself obviously still needs to
>> stay around, to keep the backpointers from going stale... but that
>> could
>> be as little as a word or two in size.
>>
>> The reason I like this way of accounting is that it can be done with
>> a couple
>> of hooks into page_alloc.c and an ifdef in mm.h, and that is the
>> extent of
>> the impact on core mm/ so I'd be against anything more intrusive
>> unless this
>> really doesn't work.
>>
>
> See "inefficent" above (sorry ;-)) What you've chosen is more correct,
> but much higher overhead. The point was that there's tradeoffs either
> way - the conclusion we came to last time was that to make it 100%
> correct, you'd be better off going with a model like Xen.


So if someone says they want it 100% correct, I can tell them to use
Xen and not put accounting into any place in the kernel that allocates
memory? Sweet OK.

If we're happy with doing userspace only memory, then a similar scheme
can be implemented on an object-accounting basis (eg. vmas). I think
there is something that already implements this.

>
> 1. You're adding a backpointer to struct page.


That's nowhere near the overhead of pte chain rmaps, though. I think it
is perfectly acceptable (assuming you *did* want to account kernel page
allocations) and probably will be difficult to notice on non-crazy-highmem
boxes. Which is just about everyone we care about now.

>
> 2. Each page is not accounted to one container, but shared across them,
> so the billing changes every time someone forks or exits. And not just
> for that container, but all of them. Think pte chain based rmap ...
> except worse.


In my proposed scheme, it is just the first allocator who gets charged. You hope
that, statistically, that is good enough. Otherwise you could go into tracking
which process has a reference to which dentry... good luck getting that
past Al and Christoph.

>
> 3. When a container needs to "shrink" when somebody else exits, how do
> we do reclaim pages from a specific container?


Not the problem of accounting. Any other scheme will have a similar
problem.

However, having the container in the struct page *could* actually help
directed reclaim FWIW.

--


2006-08-09 04:34:14

by Andi Kleen

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

Nick Piggin <[email protected]> writes:
>
> - each struct page has a backpointer to its billed container. At the mm
> summit Linus said he didn't want back pointers, but I clarified with him
> and he isn't against them if they are easily configured out when not using
> memory controllers.

This would need to be done at runtime though, otherwise it's useless
for distributions and other people who want a single kernel binary image.

It would probably need a second parallel table, but for that you would
need to know already at boot whether you plan to use them or not. Ugly.
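
One way to picture such a parallel table, purely as an illustration (a real
version would be sized from max_pfn and allocated from boot memory only when
the feature is requested on the command line):

#include <stddef.h>
#include <stdlib.h>

struct mem_container;

/* pfn-indexed side table; stays NULL when accounting is disabled */
static struct mem_container **page_container_table;

int page_container_table_init(unsigned long max_pfn)
{
	page_container_table = calloc(max_pfn, sizeof(*page_container_table));
	return page_container_table ? 0 : -1;
}

struct mem_container *page_container(unsigned long pfn)
{
	return page_container_table ? page_container_table[pfn] : NULL;
}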

-Andi

2006-08-09 06:00:15

by Magnus Damm

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

On 09 Aug 2006 06:33:54 +0200, Andi Kleen <[email protected]> wrote:
> Nick Piggin <[email protected]> writes:
> >
> > - each struct page has a backpointer to its billed container. At the mm
> > summit Linus said he didn't want back pointers, but I clarified with him
> > and he isn't against them if they are easily configured out when not using
> > memory controllers.
>
> This would need to be done at runtime though, otherwise it's useless
> for distributions and other people who want single kernel binary images.
>
> Probably would need a second parallel table, but for that you would
> need to know at boot already if you plan to use them or not. Ugly.

I've been thinking a bit about replacing the mapping and index members
in struct page with a single pointer that points into a cluster data
type. The cluster data type is aligned to a power of two and contains
a header that is shared between all pages within the cluster. The
header contains a base index and mapping. The rest of the cluster is
an array of pfns that point back to the actual pages.

The cluster can be seen as a leaf node in the inode radix tree.
Contiguous pages in inode space are kept together in the cluster - not
physically contiguous pages. The cluster pointer in struct page is
used together with alignment to determine the address of the cluster
header and the real index (alignment + base index).

Anyway, what does all this have to do with memory resource management? It
should be possible to add a per-cluster container pointer in the
header. This way the per-page overhead is fairly small - all depending
on the cluster size of course.
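
A rough sketch of the cluster layout being described, assuming a cluster size
of 16 pages; all names are illustrative:

struct address_space;
struct mem_container;

#define CLUSTER_PAGES 16	/* assumed; contiguous in inode space, not physically */

/* one header shared by CLUSTER_PAGES struct pages; aligned to a power of two
   so a page's slot (and hence its index) falls out of its cluster pointer */
struct page_cluster {
	struct address_space *mapping;		/* shared mapping */
	unsigned long base_index;		/* index of slot 0 in the mapping */
	struct mem_container *container;	/* one accounting pointer per cluster */
	unsigned long pfn[CLUSTER_PAGES];	/* back-pointers to the member pages */
};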

/ magnus

2006-08-09 06:06:22

by Andi Kleen

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)


> I've been thinking a bit about replacing the mapping and index members
> in struct page with a single pointer that point into a cluster data
> type. The cluster data type is aligned to a power of two and contains
> a header that is shared between all pages within the cluster. The
> header contains a base index and mapping. The rest of the cluster is
> an array of pfn:s that point back to the actual page.

Nice. While the code would probably do more references, I bet
it would be faster overall because it would have a smaller cache footprint.

But doing it would be a *lot* of editing work all over the file systems/VM/etc.

-Andi

2006-08-09 06:56:07

by Andrey Savochkin

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

Nick,

On Wed, Aug 09, 2006 at 12:19:41AM +1000, Nick Piggin wrote:
[snip]
>
> What's the sucking semantics on exit? I haven't looked much at the
> existing memory controllers going around, but the implementation I
> imagine looks something like this (I think it is conceptually similar
> to the basic beancounters idea):

What you suggest implies an over-simplification of how memory is used in the
system, and doesn't take into account sharing and caching.

Namely:

> - anyone who allocates a page for anything gets charged for that page.
> Except interrupt/softirq context. Could we ignore these for the moment?
>
> This does give you kernel (slab, pagetable, etc) allocations as well as
> userspace. I don't like the idea of doing controllers for inode cache
> and controllers for dentry cache, etc, etc, ad infinitum.

So, are you suggesting that the user (or container) that initially looked up
some directory /var/dir will stay billed for this memory until
- all users of all subdirectories /var/dir/a/b/c/d/e/f/g/h/i etc.
are gone, and
- the dentry cache has been shrunk because of memory pressure?

It is unfair.
But more than that:
one of the goals of resource accounting and restrictions is to give users an
idea of how many resources they are using. Then, when they start to use more
than their allotment, they should be given the opportunity to consider what
they are doing and reduce their resource usage.

In my opinion, to make resource accounting useful, serious effort should
be made not to bill anyone for resources which he isn't really using and has
no control over releasing.

>
> - each struct page has a backpointer to its billed container. At the mm
> summit Linus said he didn't want back pointers, but I clarified with him
> and he isn't against them if they are easily configured out when not using
> memory controllers.

The same thing: if one user mapped and then released some pages from a shared
library, should he be billed for these pages until all other users have
unmapped this library and the page cache has been shrunk?

Best regards
Andrey

2006-08-09 13:44:41

by Kirill Korotaev

[permalink] [raw]
Subject: Re: memory resource accounting (was Re: [RFC, PATCH 0/5] Going forward with Resource Management - A cpu controller)

> But if you have a unified struct page accounting, you don't need that.
> You don't need struct radix_tree_node accounting, you don't need
> buffer_head
> accounting, pagetable page accounting, vm_area_struct accounting,
> task_struct
> accounting, etc etc in order to do your memory accounting if what you just
> want to know is "who allocated what".
Sorry, are you suggesting using page accounting for slab objects?
You mean that if we can account page fractions, then we can charge
part of a slab page to the object owners?
If this is correct, then I think it is inefficient.
In our current implementation, page beancounters can charge only an equal
fraction of a page to each owner, so this is not suitable for slabs.
Moreover, it is easier to do correct accounting from the slab allocator
itself, and with much less overhead.
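
To put made-up numbers on it (plain userspace C, purely for illustration
and nothing to do with the actual patches): one slab page shared by two
containers that own very different numbers of objects gets billed
half-and-half.

#include <stdio.h>

int main(void)
{
        const int page_size = 4096, obj_size = 128;     /* one slab page, 32 objects */
        const int objs_a = 31, objs_b = 1;              /* container A owns 31, B owns 1 */

        /* accounting done per object from inside the slab allocator */
        printf("per-object:   A=%4d bytes  B=%4d bytes\n",
               objs_a * obj_size, objs_b * obj_size);

        /* equal-fraction-per-owner page accounting loses that information */
        printf("per-fraction: A=%4d bytes  B=%4d bytes\n",
               page_size / 2, page_size / 2);
        return 0;
}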

> And remember that if you have one container going crazy with inode/dentry
> cache, it will get hit by its resource limit and end up having to reclaim
> them or go OOM.
In order to decide which of the containers is the crazy one, you need
to correctly account the amount of _pinned_ dcache memory.
And even to select the correct container for OOM, you need to have correct
accounting of _pinned_ dcache.

> Now you *may* want to split the actual accounting into kernel and user
> parts
> if you're worried about obscure corner cases in kernel memory
> accounting. But
> this would basically come for free when you have the GFP_EASYRECLAIM thingy
> (at any rate, it is quite unintrusive).
>
>
> Basically, what I have been hearing is that people want to be able to
> surgically isolate the memory allocation of one container from that of
> another. IMO this is simply infeasible (and exploit prone) to do it on a
> per-kernel-object basis.
We have the following scheme:
caches which should be charged are marked as SLAB_UBC. The same goes for
particular allocations: we have a GFP_UBC flag specifying that the allocation
should be charged to the owner. Does that look good to you?
I will post the kernel memory accounting patches here soon.
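
Roughly, the intended usage looks like this (a sketch only; SLAB_UBC and
GFP_UBC are the flags described above and are not in mainline, and the
cache here is a made-up example):

#include <linux/init.h>
#include <linux/slab.h>

/* SLAB_UBC and GFP_UBC are the charging flags described above;
 * they are not in mainline, so this only illustrates the intended usage. */

struct example_object {
        int data[32];
};

static kmem_cache_t *example_cachep;

static int __init charged_alloc_example(void)
{
        void *buf;

        /* a cache whose objects get charged to the allocating task's owner */
        example_cachep = kmem_cache_create("example_object",
                                           sizeof(struct example_object), 0,
                                           SLAB_UBC, NULL, NULL);

        /* a one-off allocation charged the same way */
        buf = kmalloc(128, GFP_KERNEL | GFP_UBC);
        kfree(buf);
        return 0;
}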

Thanks,
Kirill