2006-05-26 04:20:26

by Peter Williams

Subject: [RFC 0/5] sched: Add CPU rate caps

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity. The limits are specified in parts per
thousand and come in two varieties -- hard and soft. The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available. By default, tasks
will have both caps set to 1000 (i.e. no limit) but newly forked tasks
will inherit any caps that have been imposed on their parent. The
minimum soft cap allowed is 0 (which effectively puts the task in the
background) and the minimum hard cap allowed is 1.
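
For concreteness, here is the arithmetic behind these units as a small
sketch (my own illustration with made-up numbers; the actual test is
task_exceeding_cap() in patch 2):

/* Sketch only -- mirrors the exceeding-cap test from patch 2.
 * cap is in parts per thousand of one CPU's capacity. */
static int exceeding_cap(unsigned long long avg_cpu_per_cycle,
			 unsigned long long avg_cycle_length,
			 unsigned int cap)
{
	return avg_cpu_per_cycle * 1000 > avg_cycle_length * cap;
}

/* e.g. a task averaging 30ms of CPU per 100ms scheduling cycle against
 * a cap of 250 (i.e. 25%): 30 * 1000 > 100 * 250, so it is over its cap. */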

Care has been taken to minimize the overhead inflicted on tasks that
have no caps and my tests using kernbench indicate that it is hidden in
the noise.

Note:

The first patch in this series fixes some problems with priority
inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
the next -mm kernel.

Signed-off-by: Peter Williams <[email protected]>


--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce


2006-05-26 04:20:55

by Peter Williams

Subject: [RFC 2/5] sched: Add CPU rate soft caps

This patch implements (soft) CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand. The CPU usage
of capped tasks is determined by using Kalman filters to calculate the
(recent) average lengths of the task's scheduling cycle and the time
spent on the CPU each cycle and taking the ratio of the latter to the
former. To minimize overhead associated with uncapped tasks these
statistics are not kept for them.
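
In concrete terms the two averages are exponentially decayed sums; the
following is a condensed paraphrase (not a copy) of inc_cap_stats_both()
and decay_cap_stats() from the patch below, with CAP_STATS_OFFSET = 8
giving a decay factor of 255/256 per scheduling cycle:

/* delta is the sched_clock() time being accounted to the current cycle */
static void account_on_cpu_time(task_t *p, unsigned long long delta)
{
	/* time spent on the CPU counts towards both averages */
	p->avg_cycle_length += delta;
	p->avg_cpu_per_cycle += delta;
}

static void decay_averages(task_t *p)
{
	/* multiply by 255/256 each cycle so that old cycles fade out;
	 * the scaling cancels when the two are compared as a ratio */
	p->avg_cycle_length = (p->avg_cycle_length * 255) >> 8;
	p->avg_cpu_per_cycle = (p->avg_cpu_per_cycle * 255) >> 8;
}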

Notes:

1. To minimize the overhead incurred when testing to skip caps processing for
uncapped tasks, a new flag PF_HAS_CAP has been added to the task's flags.

2. The implementation involves the addition of two priority slots to the
run queue priority arrays, which means that MAX_PRIO no longer
represents the scheduling priority of the idle process and can't be used to
test whether priority values are in the valid range. To alleviate this
problem a new function sched_idle_prio() has been provided (the resulting
layout is sketched after these notes).

3. Enforcement of caps is not as strict as it could be in order to
reduce the possibility of a task being starved of CPU while holding
an important system resource with resultant overall performance
degradation. In effect, all runnable capped tasks will get some amount
of CPU access every active/expired swap cycle. This will be most
apparent for small or zero soft caps.
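
For reference, the priority layout that results from the two extra slots
(taken from the IDLE_PRIO/BGND_PRIO/CAPPED_PRIO definitions in the patch
below):

/*
 *   0 .. MAX_RT_PRIO-1          real time tasks
 *   MAX_RT_PRIO .. MAX_PRIO-1   SCHED_NORMAL/SCHED_BATCH tasks
 *   CAPPED_PRIO  (MAX_PRIO)     tasks currently exceeding their cap
 *   BGND_PRIO    (MAX_PRIO+1)   background tasks (zero soft cap)
 *   IDLE_PRIO    (MAX_PRIO+2)   the idle task, returned by sched_idle_prio()
 */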

Signed-off-by: Peter Williams <[email protected]>

include/linux/sched.h | 16 ++
init/Kconfig | 2
kernel/Kconfig.caps | 13 +
kernel/rtmutex-debug.c | 4
kernel/sched.c | 362 ++++++++++++++++++++++++++++++++++++++++++++++---
5 files changed, 375 insertions(+), 22 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h 2006-05-26 10:43:21.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h 2006-05-26 10:46:35.000000000 +1000
@@ -494,6 +494,12 @@ struct signal_struct {
#define has_rt_policy(p) \
unlikely((p)->policy != SCHED_NORMAL && (p)->policy != SCHED_BATCH)

+#ifdef CONFIG_CPU_RATE_CAPS
+int sched_idle_prio(void);
+#else
+#define sched_idle_prio() MAX_PRIO
+#endif
+
/*
* Some day this will be a full-fledged user tracking system..
*/
@@ -787,6 +793,10 @@ struct task_struct {
unsigned long sleep_avg;
unsigned long long timestamp, last_ran;
unsigned long long sched_time; /* sched_clock time spent running */
+#ifdef CONFIG_CPU_RATE_CAPS
+ unsigned long long avg_cpu_per_cycle, avg_cycle_length;
+ unsigned int cpu_rate_cap;
+#endif
enum sleep_type sleep_type;

unsigned long policy;
@@ -981,6 +991,11 @@ struct task_struct {
#endif
};

+#ifdef CONFIG_CPU_RATE_CAPS
+unsigned int get_cpu_rate_cap(const struct task_struct *);
+int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#endif
+
static inline pid_t process_group(struct task_struct *tsk)
{
return tsk->signal->pgrp;
@@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
#define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches over cpuset */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x02000000 /* Thread belongs to the rt mutex tester */
+#define PF_HAS_CAP 0x20000000 /* Has a CPU rate cap */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: MM-2.6.17-rc4-mm3/init/Kconfig
===================================================================
--- MM-2.6.17-rc4-mm3.orig/init/Kconfig 2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/init/Kconfig 2006-05-26 10:45:26.000000000 +1000
@@ -286,6 +286,8 @@ config RELAY

If unsure, say N.

+source "kernel/Kconfig.caps"
+
source "usr/Kconfig"

config UID16
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps 2006-05-26 10:45:26.000000000 +1000
@@ -0,0 +1,13 @@
+#
+# CPU Rate Caps Configuration
+#
+
+config CPU_RATE_CAPS
+ bool "Support (soft) CPU rate caps"
+ default n
+ ---help---
+ Say y here if you wish to be able to put a (soft) upper limit on
+ the rate of CPU usage by individual tasks. A task which has been
+ allocated a soft CPU rate cap will be limited to that rate of CPU
+ usage unless there are spare CPU resources available after the needs
+ of uncapped tasks are met.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26 10:44:51.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 11:00:02.000000000 +1000
@@ -57,6 +57,19 @@

#include <asm/unistd.h>

+#ifdef CONFIG_CPU_RATE_CAPS
+#define IDLE_PRIO (MAX_PRIO + 2)
+#else
+#define IDLE_PRIO MAX_PRIO
+#endif
+#define BGND_PRIO (IDLE_PRIO - 1)
+#define CAPPED_PRIO (IDLE_PRIO - 2)
+
+int sched_idle_prio(void)
+{
+ return IDLE_PRIO;
+}
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
@@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
return static_prio_timeslice(p->static_prio);
}

+#ifdef CONFIG_CPU_RATE_CAPS
+#define CAP_STATS_OFFSET 8
+#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
+/* this assumes that p is not a real time task */
+#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
+#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
+#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+
+static void init_cpu_rate_caps(task_t *p)
+{
+ p->cpu_rate_cap = 1000;
+ p->flags &= ~PF_HAS_CAP;
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+ if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+ p->flags |= PF_HAS_CAP;
+ else
+ p->flags &= ~PF_HAS_CAP;
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+ return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static unsigned int smt_timeslice(task_t *p)
+{
+ if (task_has_cap(p) && task_being_capped(p))
+ return 0;
+
+ return task_timeslice(p);
+}
+
+static int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+ if (task_has_cap(thisp) && (task_being_capped(thisp)))
+ return 0;
+
+ if (task_has_cap(thatp) && (task_being_capped(thatp)))
+ return 1;
+
+ return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+/*
+ * Update usage stats to "now" before making comparison
+ * Assume: task is actually on a CPU
+ */
+static int task_exceeding_cap_now(const task_t *p, unsigned long long now)
+{
+ unsigned long long delta, lhs, rhs;
+
+ delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+ lhs = (p->avg_cpu_per_cycle + delta) * 1000;
+ rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+
+ return lhs > rhs;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+ p->avg_cpu_per_cycle = 0;
+ p->avg_cycle_length = 0;
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+ unsigned long long delta;
+
+ delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+ p->avg_cycle_length += delta;
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+ unsigned long long delta;
+
+ delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
+ p->avg_cycle_length += delta;
+ p->avg_cpu_per_cycle += delta;
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+ p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
+ p->avg_cycle_length >>= CAP_STATS_OFFSET;
+ p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
+ p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
+}
+#else
+#define task_has_cap(p) 0
+#define task_is_background(p) 0
+#define task_being_capped(p) 0
+#define cap_load_weight(p) SCHED_LOAD_SCALE
+
+static inline void init_cpu_rate_caps(task_t *p)
+{
+}
+
+static inline void set_cap_flag(task_t *p)
+{
+}
+
+static inline int task_exceeding_cap(const task_t *p)
+{
+ return 0;
+}
+
+#ifdef CONFIG_SCHED_SMT
+#define smt_timeslice(p) task_timeslice(p)
+
+static inline int task_priority_gt(const task_t *thisp, const task_t *thatp)
+{
+ return thisp->static_prio < thatp->static_prio;
+}
+#endif
+
+static inline int task_exceeding_cap_now(const task_t *p, unsigned long long now)
+{
+ return 0;
+}
+
+static inline void init_cap_stats(task_t *p)
+{
+}
+
+static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
+{
+}
+
+static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
+{
+}
+
+static inline void decay_cap_stats(task_t *p)
+{
+}
+#endif
+
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
< (long long) (sd)->cache_hot_time)

@@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;

struct prio_array {
unsigned int nr_active;
- DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
- struct list_head queue[MAX_PRIO];
+ DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for delimiter */
+ struct list_head queue[IDLE_PRIO];
};

/*
@@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
{
int bonus, prio;

+ /* Ensure that background tasks stay at BGND_PRIO */
+ if (task_is_background(p))
+ return BGND_PRIO;
+
bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

prio = p->static_prio - bonus;
@@ -786,6 +946,8 @@ static inline int expired_starving(runqu

static void set_load_weight(task_t *p)
{
+ set_cap_flag(p);
+
if (has_rt_policy(p)) {
#ifdef CONFIG_SMP
if (p == task_rq(p)->migration_thread)
@@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
else
#endif
p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
- } else
+ } else {
p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
+
+ /*
+ * Reduce the probability of a task escaping its CPU rate cap
+ * due to load balancing leaving it on a lightly used CPU.
+ * This will be optimized away if rate caps aren't configured
+ */
+ if (task_has_cap(p)) {
+ unsigned int clw; /* load weight based on cap */
+
+ clw = cap_load_weight(p);
+ if (clw < p->load_weight)
+ p->load_weight = clw;
+ }
+ }
}

static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
@@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
{
prio_array_t *target = rq->active;

- if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p))))
+ if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p)) ||
+ task_being_capped(p)))
target = rq->expired;
enqueue_task(p, target);
inc_nr_running(p, rq);
@@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
#endif

if (!rt_task(p))
+ /*
+ * We want to do the recalculation even if we're exceeding
+ * a cap so that everything still works when we stop
+ * exceeding our cap.
+ */
p->prio = recalc_task_prio(p, now);

+ if (task_has_cap(p)) {
+ inc_cap_stats_cycle(p, now);
+ /* Background tasks are handled in effective_prio()
+ * in order to ensure that they stay at BGND_PRIO
+ * but we need to be careful that we don't override
+ * it here
+ */
+ if (task_exceeding_cap(p) && !task_is_background(p)) {
+ p->normal_prio = CAPPED_PRIO;
+ /*
+ * Don't undo any priority inheritance
+ */
+ if (!rt_task(p))
+ p->prio = CAPPED_PRIO;
+ }
+ }
+
/*
* This checks to make sure it's not an uninterruptible task
* that is now waking up.
@@ -1566,6 +1765,7 @@ void fastcall sched_fork(task_t *p, int
#endif
set_task_cpu(p, cpu);

+ init_cap_stats(p);
/*
* We mark the process as running here, but have not actually
* inserted it onto the runqueue yet. This guarantees that
@@ -2040,7 +2240,7 @@ void pull_task(runqueue_t *src_rq, prio_
p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
+ this_rq->timestamp_last_tick;
/*
- * Note that idle threads have a prio of MAX_PRIO, for this test
+ * Note that idle threads have a prio of IDLE_PRIO, for this test
* to be always true for them.
*/
if (TASK_PREEMPTS_CURR(p, this_rq))
@@ -2140,8 +2340,8 @@ skip_bitmap:
if (!idx)
idx = sched_find_first_bit(array->bitmap);
else
- idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
- if (idx >= MAX_PRIO) {
+ idx = find_next_bit(array->bitmap, IDLE_PRIO, idx);
+ if (idx >= IDLE_PRIO) {
if (array == busiest->expired && busiest->active->nr_active) {
array = busiest->active;
dst_array = this_rq->active;
@@ -2931,15 +3131,58 @@ void scheduler_tick(void)
}
goto out_unlock;
}
+ /* Only check for task exceeding cap if it's worthwhile */
+ if (task_has_cap(p)) {
+ /*
+ * Do this even if there's only one task on the queue as
+ * we want to set the priority low so that any waking tasks
+ * can preempt.
+ */
+ if (task_being_capped(p)) {
+ /*
+ * Tasks whose cap is currently being enforced will be
+ * at CAPPED_PRIO or BGND_PRIO priority and preemption
+ * should be enough to keep them in check provided we
+ * don't let them adversely affect tasks on the expired
+ * array
+ */
+ if (!task_is_background(p) && !task_exceeding_cap_now(p, now)) {
+ dequeue_task(p, rq->active);
+ p->prio = effective_prio(p);
+ enqueue_task(p, rq->active);
+ } else if (rq->expired->nr_active && rq->best_expired_prio < p->prio) {
+ dequeue_task(p, rq->active);
+ enqueue_task(p, rq->expired);
+ set_tsk_need_resched(p);
+ goto out_unlock;
+ }
+ } else if (task_exceeding_cap_now(p, now)) {
+ dequeue_task(p, rq->active);
+ p->prio = CAPPED_PRIO;
+ enqueue_task(p, rq->expired);
+ /*
+ * think about making this conditional to reduce
+ * context switch rate
+ */
+ set_tsk_need_resched(p);
+ goto out_unlock;
+ }
+ }
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
- p->prio = effective_prio(p);
+ if (!task_being_capped(p))
+ p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
+ /*
+ * No need to do anything special for capped tasks as here
+ * TASK_INTERACTIVE() should fail when they're exceeding
+ * their caps.
+ */
if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
@@ -3104,9 +3347,9 @@ static int dependent_sleeper(int this_cp
(sd->per_cpu_gain * DEF_TIMESLICE / 100))
ret = 1;
} else
- if (smt_curr->static_prio < p->static_prio &&
+ if (task_priority_gt(smt_curr, p) &&
!TASK_PREEMPTS_CURR(p, smt_rq) &&
- smt_slice(smt_curr, sd) > task_timeslice(p))
+ smt_slice(smt_curr, sd) > smt_timeslice(p))
ret = 1;

check_smt_task:
@@ -3129,7 +3372,7 @@ check_smt_task:
resched_task(smt_curr);
} else {
if (TASK_PREEMPTS_CURR(p, smt_rq) &&
- smt_slice(p, sd) > task_timeslice(smt_curr))
+ smt_slice(p, sd) > smt_timeslice(smt_curr))
resched_task(smt_curr);
else
wakeup_busy_runqueue(smt_rq);
@@ -3265,6 +3508,10 @@ need_resched_nonpreemptible:
}
}

+ /* do this now so that stats are correct for SMT code */
+ if (task_has_cap(prev))
+ inc_cap_stats_both(prev, now);
+
cpu = smp_processor_id();
if (unlikely(!rq->nr_running)) {
go_idle:
@@ -3305,7 +3552,7 @@ go_idle:
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
- rq->best_expired_prio = MAX_PRIO;
+ rq->best_expired_prio = IDLE_PRIO;
}

idx = sched_find_first_bit(array->bitmap);
@@ -3323,7 +3570,7 @@ go_idle:
array = next->array;
new_prio = recalc_task_prio(next, next->timestamp + delta);

- if (unlikely(next->prio != new_prio)) {
+ if (unlikely(next->prio != new_prio && !task_being_capped(next))) {
dequeue_task(next, array);
next->prio = new_prio;
enqueue_task(next, array);
@@ -3347,6 +3594,10 @@ switch_tasks:

sched_info_switch(prev, next);
if (likely(prev != next)) {
+ if (task_has_cap(next)) {
+ decay_cap_stats(next);
+ inc_cap_stats_cycle(next, now);
+ }
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
@@ -3792,7 +4043,7 @@ void rt_mutex_setprio(task_t *p, int pri
runqueue_t *rq;
int oldprio;

- BUG_ON(prio < 0 || prio > MAX_PRIO);
+ BUG_ON(prio < 0 || prio > IDLE_PRIO);

rq = task_rq_lock(p, &flags);

@@ -4220,6 +4471,76 @@ out_unlock:
return retval;
}

+#ifdef CONFIG_CPU_RATE_CAPS
+unsigned int get_cpu_rate_cap(const struct task_struct *p)
+{
+ return p->cpu_rate_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_cap);
+
+/*
+ * Require: 0 <= new_cap <= 1000
+ */
+int set_cpu_rate_cap(struct task_struct *p, unsigned int new_cap)
+{
+ int is_allowed;
+ unsigned long flags;
+ struct runqueue *rq;
+ prio_array_t *array;
+ int delta;
+
+ if (new_cap > 1000)
+ return -EINVAL;
+ is_allowed = capable(CAP_SYS_NICE);
+ /*
+ * We have to be careful, if called from /proc code,
+ * the task might be in the middle of scheduling on another CPU.
+ */
+ rq = task_rq_lock(p, &flags);
+ delta = new_cap - p->cpu_rate_cap;
+ if (!is_allowed) {
+ /*
+ * Ordinary users can set/change caps on their own tasks
+ * provided that the new setting is MORE constraining
+ */
+ if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+ task_rq_unlock(rq, &flags);
+ return -EPERM;
+ }
+ }
+ /*
+ * The RT tasks don't have caps, but we still allow the caps to be
+ * set - but as expected it won't have any effect on scheduling until
+ * the task becomes SCHED_NORMAL/SCHED_BATCH:
+ */
+ p->cpu_rate_cap = new_cap;
+
+ if (has_rt_policy(p))
+ goto out;
+
+ array = p->array;
+ if (array) {
+ dec_raw_weighted_load(rq, p);
+ dequeue_task(p, array);
+ }
+
+ set_load_weight(p);
+ p->prio = effective_prio(p);
+
+ if (array) {
+ enqueue_task(p, array);
+ inc_raw_weighted_load(rq, p);
+ }
+out:
+ task_rq_unlock(rq, &flags);
+
+ return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_cap);
+#endif
+
long sched_setaffinity(pid_t pid, cpumask_t new_mask)
{
task_t *p;
@@ -4733,7 +5054,7 @@ void __devinit init_idle(task_t *idle, i
idle->timestamp = sched_clock();
idle->sleep_avg = 0;
idle->array = NULL;
- idle->prio = idle->normal_prio = MAX_PRIO;
+ idle->prio = idle->normal_prio = IDLE_PRIO;
idle->state = TASK_RUNNING;
idle->cpus_allowed = cpumask_of_cpu(cpu);
set_task_cpu(idle, cpu);
@@ -5074,7 +5395,7 @@ static void migrate_dead_tasks(unsigned
struct runqueue *rq = cpu_rq(dead_cpu);

for (arr = 0; arr < 2; arr++) {
- for (i = 0; i < MAX_PRIO; i++) {
+ for (i = 0; i < IDLE_PRIO; i++) {
struct list_head *list = &rq->arrays[arr].queue[i];
while (!list_empty(list))
migrate_dead(dead_cpu,
@@ -5244,7 +5565,7 @@ static int migration_call(struct notifie
/* Idle task back to normal (off runqueue, low prio) */
rq = task_rq_lock(rq->idle, &flags);
deactivate_task(rq->idle, rq);
- rq->idle->static_prio = MAX_PRIO;
+ rq->idle->static_prio = IDLE_PRIO;
__setscheduler(rq->idle, SCHED_NORMAL, 0);
migrate_dead_tasks(cpu);
task_rq_unlock(rq, &flags);
@@ -6657,7 +6978,7 @@ void __init sched_init(void)
rq->nr_running = 0;
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;
+ rq->best_expired_prio = IDLE_PRIO;

#ifdef CONFIG_SMP
rq->sd = NULL;
@@ -6673,15 +6994,16 @@ void __init sched_init(void)

for (j = 0; j < 2; j++) {
array = rq->arrays + j;
- for (k = 0; k < MAX_PRIO; k++) {
+ for (k = 0; k < IDLE_PRIO; k++) {
INIT_LIST_HEAD(array->queue + k);
__clear_bit(k, array->bitmap);
}
// delimiter for bitsearch
- __set_bit(MAX_PRIO, array->bitmap);
+ __set_bit(IDLE_PRIO, array->bitmap);
}
}

+ init_cpu_rate_caps(&init_task);
set_load_weight(&init_task);
/*
* The boot idle thread does lazy MMU switching as well:
Index: MM-2.6.17-rc4-mm3/kernel/rtmutex-debug.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/rtmutex-debug.c 2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/rtmutex-debug.c 2006-05-26 10:45:26.000000000 +1000
@@ -479,8 +479,8 @@ void debug_rt_mutex_proxy_unlock(struct
void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
- plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
+ plist_node_init(&waiter->list_entry, sched_idle_prio());
+ plist_node_init(&waiter->pi_list_entry, sched_idle_prio());
}

void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 04:20:34

by Peter Williams

Subject: [RFC 1/5] sched: Fix priority inheritance before CPU rate soft caps

Problem:

The advent of priority inheritance (PI) (in -mm kernels) means that the
prio field for non real time tasks can no longer be guaranteed to be
greater than or equal to MAX_RT_PRIO. This, in turn, means that the
rt_task() macro is no longer a reliable test for determining if the
scheduler policy of a task is one of the real time policies.

Redefining rt_task() is not a good solution as in the majority of places
where it is used within sched.c the current definition is what is
required. However, this is not the case in the functions
set_load_weight() and set_user_nice() (and perhaps elsewhere in the
kernel).

Solution:

Define a new macro, has_rt_policy(), that returns true if the task given
as an argument has a policy of SCHED_RR or SCHED_FIFO and use this
inside set_load_weight() and set_user_nice(). The definition is made in
sched.h so that it is generally available should it be needed elsewhere
in the kernel.

Signed-off-by: Peter Williams <[email protected]>
include/linux/sched.h | 2 ++
kernel/sched.c | 15 +++++++--------
2 files changed, 9 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h 2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h 2006-05-26 10:43:21.000000000 +1000
@@ -491,6 +491,8 @@ struct signal_struct {
#define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO)
#define rt_task(p) rt_prio((p)->prio)
#define batch_task(p) (unlikely((p)->policy == SCHED_BATCH))
+#define has_rt_policy(p) \
+ unlikely((p)->policy != SCHED_NORMAL && (p)->policy != SCHED_BATCH)

/*
* Some day this will be a full-fledged user tracking system..
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26 10:39:59.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 10:44:51.000000000 +1000
@@ -786,7 +786,7 @@ static inline int expired_starving(runqu

static void set_load_weight(task_t *p)
{
- if (rt_task(p)) {
+ if (has_rt_policy(p)) {
#ifdef CONFIG_SMP
if (p == task_rq(p)->migration_thread)
/*
@@ -835,7 +835,7 @@ static inline int normal_prio(task_t *p)
{
int prio;

- if (p->policy != SCHED_NORMAL && p->policy != SCHED_BATCH)
+ if (has_rt_policy(p))
prio = MAX_RT_PRIO-1 - p->rt_priority;
else
prio = __normal_prio(p);
@@ -3831,7 +3831,7 @@ void set_user_nice(task_t *p, long nice)
unsigned long flags;
prio_array_t *array;
runqueue_t *rq;
- int old_prio, new_prio, delta;
+ int old_prio, delta;

if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
return;
@@ -3846,7 +3846,7 @@ void set_user_nice(task_t *p, long nice)
* it wont have any effect on scheduling until the task is
* not SCHED_NORMAL/SCHED_BATCH:
*/
- if (rt_task(p)) {
+ if (has_rt_policy(p)) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
@@ -3856,12 +3856,11 @@ void set_user_nice(task_t *p, long nice)
dec_raw_weighted_load(rq, p);
}

- old_prio = p->prio;
- new_prio = NICE_TO_PRIO(nice);
- delta = new_prio - old_prio;
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
- p->prio += delta;
+ old_prio = p->prio;
+ p->prio = effective_prio(p);
+ delta = p->prio - old_prio;

if (array) {
enqueue_task(p, array);

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 04:21:01

by Peter Williams

Subject: [RFC 3/5] sched: Add CPU rate hard caps

This patch implements hard CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.
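
Enforcement works by removing a task from the runqueue ("sinbinning" it)
when it would otherwise run while over its hard cap, and requeueing it
from a timer. The duration is computed by reqd_sinbin_ticks() below; my
reading of that code, shown in isolation:

/* Pick the shortest sleep that brings the usage ratio back under the
 * cap, i.e. solve  avg_cpu * 1000 == (avg_cpu + sleep) * cap  for
 * sleep.  The averages are decayed sums whose steady state is about
 * 2^CAP_STATS_OFFSET times the per-cycle value, hence the shift, and
 * the result is converted from nanoseconds to timer ticks. */
sleep_ns = ((avg_cpu_per_cycle * 1000 / cpu_rate_hard_cap)
	    - avg_cpu_per_cycle) >> CAP_STATS_OFFSET;
sinbin_ticks = sleep_ns / (1000000000 / HZ);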

Signed-off-by: Peter Williams <[email protected]>
include/linux/sched.h | 8 ++
kernel/Kconfig.caps | 14 +++-
kernel/sched.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 168 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h 2006-05-26 10:46:35.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h 2006-05-26 11:00:07.000000000 +1000
@@ -796,6 +796,10 @@ struct task_struct {
#ifdef CONFIG_CPU_RATE_CAPS
unsigned long long avg_cpu_per_cycle, avg_cycle_length;
unsigned int cpu_rate_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ unsigned int cpu_rate_hard_cap;
+ struct timer_list sinbin_timer;
+#endif
#endif
enum sleep_type sleep_type;

@@ -994,6 +998,10 @@ struct task_struct {
#ifdef CONFIG_CPU_RATE_CAPS
unsigned int get_cpu_rate_cap(const struct task_struct *);
int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
+int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
+#endif
#endif

static inline pid_t process_group(struct task_struct *tsk)
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps 2006-05-26 10:45:26.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps 2006-05-26 11:00:07.000000000 +1000
@@ -3,11 +3,21 @@
#

config CPU_RATE_CAPS
- bool "Support (soft) CPU rate caps"
+ bool "Support CPU rate caps"
default n
---help---
- Say y here if you wish to be able to put a (soft) upper limit on
+ Say y here if you wish to be able to put a soft upper limit on
the rate of CPU usage by individual tasks. A task which has been
allocated a soft CPU rate cap will be limited to that rate of CPU
usage unless there are spare CPU resources available after the needs
of uncapped tasks are met.
+
+config CPU_RATE_HARD_CAPS
+ bool "Support CPU rate hard caps"
+ depends on CPU_RATE_CAPS
+ default n
+ ---help---
+ Say y here if you wish to be able to put a hard upper limit on
+ the rate of CPU usage by individual tasks. A task which has been
+ allocated a hard CPU rate cap will be limited to that rate of CPU
+ usage regardless of whether there are spare CPU resources available.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26 11:00:02.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 13:50:11.000000000 +1000
@@ -201,21 +201,33 @@ static inline unsigned int task_timeslic

#ifdef CONFIG_CPU_RATE_CAPS
#define CAP_STATS_OFFSET 8
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static void sinbin_release_fn(unsigned long arg);
+#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
+#else
+#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
+#endif
#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
/* this assumes that p is not a real time task */
#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
-#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)

static void init_cpu_rate_caps(task_t *p)
{
p->cpu_rate_cap = 1000;
p->flags &= ~PF_HAS_CAP;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ p->cpu_rate_hard_cap = 1000;
+ init_timer(&p->sinbin_timer);
+ p->sinbin_timer.function = sinbin_release_fn;
+ p->sinbin_timer.data = (unsigned long) p;
+#endif
}

static inline void set_cap_flag(task_t *p)
{
- if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+ if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
p->flags |= PF_HAS_CAP;
else
p->flags &= ~PF_HAS_CAP;
@@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *

static inline int task_exceeding_cap(const task_t *p)
{
- return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+ return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
}

#ifdef CONFIG_SCHED_SMT
@@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const

delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
lhs = (p->avg_cpu_per_cycle + delta) * 1000;
- rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+ rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);

return lhs > rhs;
}
@@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
{
p->avg_cpu_per_cycle = 0;
p->avg_cycle_length = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ init_timer(&p->sinbin_timer);
+ p->sinbin_timer.data = (unsigned long) p;
+#endif
}

static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
@@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
p->array = NULL;
}

+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
+
+/*
+ * Release a task from the sinbin
+ */
+static void sinbin_release_fn(unsigned long arg)
+{
+ unsigned long flags;
+ struct task_struct *p = (struct task_struct*)arg;
+ struct runqueue *rq = task_rq_lock(p, &flags);
+
+ p->prio = effective_prio(p);
+
+ __activate_task(p, rq);
+
+ task_rq_unlock(rq, &flags);
+}
+
+static unsigned long reqd_sinbin_ticks(const task_t *p)
+{
+ unsigned long long res;
+
+ res = p->avg_cpu_per_cycle * 1000;
+
+ if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
+ (void)do_div(res, p->cpu_rate_hard_cap);
+ res -= p->avg_cpu_per_cycle;
+ /*
+ * IF it was available we'd also subtract
+ * the average sleep per cycle here
+ */
+ res >>= CAP_STATS_OFFSET;
+ (void)do_div(res, (1000000000 / HZ));
+
+ return res ? : 1;
+ }
+
+ return 0;
+}
+
+static void sinbin_task(task_t *p, unsigned long durn)
+{
+ if (durn == 0)
+ return;
+ deactivate_task(p, task_rq(p));
+ p->sinbin_timer.expires = jiffies + durn;
+ add_timer(&p->sinbin_timer);
+}
+#else
+#define task_has_hard_cap(p) 0
+#define reqd_sinbin_ticks(p) 0
+
+static inline void sinbin_task(task_t *p, unsigned long durn)
+{
+}
+#endif
+
/*
* resched_task - mark a task 'to be rescheduled now'.
*
@@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
}
}

- /* do this now so that stats are correct for SMT code */
- if (task_has_cap(prev))
+ if (task_has_cap(prev)) {
inc_cap_stats_both(prev, now);
+ if (task_has_hard_cap(prev) && !prev->state &&
+ !rt_task(prev) && !signal_pending(prev)) {
+ unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
+
+ if (sinbin_ticks)
+ sinbin_task(prev, sinbin_ticks);
+ }
+ }

cpu = smp_processor_id();
if (unlikely(!rq->nr_running)) {
@@ -4539,6 +4620,67 @@ out:
}

EXPORT_SYMBOL(set_cpu_rate_cap);
+
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+ return p->cpu_rate_hard_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_hard_cap);
+
+/*
+ * Require: 1 <= new_cap <= 1000
+ */
+int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
+{
+ int is_allowed;
+ unsigned long flags;
+ struct runqueue *rq;
+ int delta;
+
+ if (new_cap > 1000 && new_cap > 0)
+ return -EINVAL;
+ is_allowed = capable(CAP_SYS_NICE);
+ /*
+ * We have to be careful, if called from /proc code,
+ * the task might be in the middle of scheduling on another CPU.
+ */
+ rq = task_rq_lock(p, &flags);
+ delta = new_cap - p->cpu_rate_hard_cap;
+ if (!is_allowed) {
+ /*
+ * Ordinary users can set/change caps on their own tasks
+ * provided that the new setting is MORE constraining
+ */
+ if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+ task_rq_unlock(rq, &flags);
+ return -EPERM;
+ }
+ }
+ /*
+ * The RT tasks don't have caps, but we still allow the caps to be
+ * set - but as expected it won't have any effect on scheduling until
+ * the task becomes SCHED_NORMAL/SCHED_BATCH:
+ */
+ p->cpu_rate_hard_cap = new_cap;
+
+ if (has_rt_policy(p))
+ goto out;
+
+ if (p->array)
+ dec_raw_weighted_load(rq, p);
+ set_load_weight(p);
+ if (p->array)
+ inc_raw_weighted_load(rq, p);
+out:
+ task_rq_unlock(rq, &flags);
+
+ return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_hard_cap);
+#endif
#endif

long sched_setaffinity(pid_t pid, cpumask_t new_mask)

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 04:21:42

by Peter Williams

Subject: [RFC 4/5] sched: Add procfs interface for CPU rate soft caps

This patch implements a procfs interface for soft CPU rate caps.
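
For illustration, a user space sketch of the interface (assumptions: the
entry is created per thread, so the path below uses the usual
/proc/<pid>/task/<tid>/ layout, and the helper name is made up):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Cap thread `tid` of process `pid` to `ppt` parts per thousand of one
 * CPU by writing to the cpu_rate_cap file added by this patch. */
int set_soft_cap(pid_t pid, pid_t tid, unsigned int ppt)
{
	char path[64], val[16];
	int fd, ok;

	snprintf(path, sizeof(path), "/proc/%d/task/%d/cpu_rate_cap",
		 (int)pid, (int)tid);
	snprintf(val, sizeof(val), "%u\n", ppt);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ok = write(fd, val, strlen(val)) == (ssize_t)strlen(val);
	close(fd);
	return ok ? 0 : -1;
}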

Signed-off-by: Peter Williams <[email protected]>
fs/proc/base.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

Index: MM-2.6.17-rc4-mm3/fs/proc/base.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/fs/proc/base.c 2006-05-26 13:46:40.000000000 +1000
+++ MM-2.6.17-rc4-mm3/fs/proc/base.c 2006-05-26 13:50:57.000000000 +1000
@@ -167,6 +167,9 @@ enum pid_directory_inos {
#ifdef CONFIG_CPUSETS
PROC_TID_CPUSET,
#endif
+#ifdef CONFIG_CPU_RATE_CAPS
+ PROC_TID_CPU_RATE_CAP,
+#endif
#ifdef CONFIG_SECURITY
PROC_TID_ATTR,
PROC_TID_ATTR_CURRENT,
@@ -280,6 +283,9 @@ static struct pid_entry tid_base_stuff[]
#ifdef CONFIG_AUDITSYSCALL
E(PROC_TID_LOGINUID, "loginuid", S_IFREG|S_IWUSR|S_IRUGO),
#endif
+#ifdef CONFIG_CPU_RATE_CAPS
+ E(PROC_TID_CPU_RATE_CAP, "cpu_rate_cap", S_IFREG|S_IRUGO|S_IWUSR),
+#endif
{0,0,NULL,0}
};

@@ -1036,6 +1042,54 @@ static struct file_operations proc_secco
};
#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_CPU_RATE_CAPS
+static ssize_t cpu_rate_cap_read(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ char buffer[64];
+ size_t len;
+ unsigned int cppt = get_cpu_rate_cap(task);
+
+ if (*ppos)
+ return 0;
+ *ppos = len = sprintf(buffer, "%u\n", cppt);
+ if (copy_to_user(buf, buffer, len))
+ return -EFAULT;
+
+ return len;
+}
+
+static ssize_t cpu_rate_cap_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ char buffer[128] = "";
+ char *endptr = NULL;
+ unsigned long hcppt;
+ int res;
+
+
+ if ((count > 63) || *ppos)
+ return -EFBIG;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ hcppt = simple_strtoul(buffer, &endptr, 0);
+ if ((endptr == buffer) || (hcppt == ULONG_MAX))
+ return -EINVAL;
+
+ if ((res = set_cpu_rate_cap(task, hcppt)) != 0)
+ return res;
+
+ return count;
+}
+
+struct file_operations proc_cpu_rate_cap_operations = {
+ read: cpu_rate_cap_read,
+ write: cpu_rate_cap_write,
+};
+#endif
+
static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
{
struct inode *inode = dentry->d_inode;
@@ -1796,6 +1850,11 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_loginuid_operations;
break;
#endif
+#ifdef CONFIG_CPU_RATE_CAPS
+ case PROC_TID_CPU_RATE_CAP:
+ inode->i_fop = &proc_cpu_rate_cap_operations;
+ break;
+#endif
default:
printk("procfs: impossible type (%d)",p->type);
iput(inode);

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 04:21:42

by Peter Williams

Subject: [RFC 5/5] sched: Add procfs interface for CPU rate hard caps

This patch implements a procfs interface for hard CPU rate caps.
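
Usage mirrors the cpu_rate_cap file from the previous patch (assuming a
set_hard_cap() helper analogous to the set_soft_cap() sketch there).
Note that when both files are written, the effective cap is the smaller
of the two values, per min_cpu_rate_cap() in patch 3:

/* Illustration only, with made-up values: after these two calls the
 * task is effectively capped at 50/1000, since the scheduler uses
 * min(cpu_rate_cap, cpu_rate_hard_cap) for its weight and cap tests. */
set_soft_cap(pid, tid, 200);  /* writes /proc/<pid>/task/<tid>/cpu_rate_cap */
set_hard_cap(pid, tid, 50);   /* writes .../cpu_rate_hard_cap */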

Signed-off-by: Peter Williams <[email protected]>
fs/proc/base.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)

Index: MM-2.6.17-rc4-mm3/fs/proc/base.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/fs/proc/base.c 2006-05-26 13:50:57.000000000 +1000
+++ MM-2.6.17-rc4-mm3/fs/proc/base.c 2006-05-26 13:51:01.000000000 +1000
@@ -170,6 +170,9 @@ enum pid_directory_inos {
#ifdef CONFIG_CPU_RATE_CAPS
PROC_TID_CPU_RATE_CAP,
#endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ PROC_TID_CPU_RATE_HARD_CAP,
+#endif
#ifdef CONFIG_SECURITY
PROC_TID_ATTR,
PROC_TID_ATTR_CURRENT,
@@ -286,6 +289,9 @@ static struct pid_entry tid_base_stuff[]
#ifdef CONFIG_CPU_RATE_CAPS
E(PROC_TID_CPU_RATE_CAP, "cpu_rate_cap", S_IFREG|S_IRUGO|S_IWUSR),
#endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ E(PROC_TID_CPU_RATE_HARD_CAP, "cpu_rate_hard_cap", S_IFREG|S_IRUGO|S_IWUSR),
+#endif
{0,0,NULL,0}
};

@@ -1090,6 +1096,54 @@ struct file_operations proc_cpu_rate_cap
};
#endif

+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static ssize_t cpu_rate_hard_cap_read(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ char buffer[64];
+ size_t len;
+ unsigned int cppt = get_cpu_rate_hard_cap(task);
+
+ if (*ppos)
+ return 0;
+ *ppos = len = sprintf(buffer, "%u\n", cppt);
+ if (copy_to_user(buf, buffer, len))
+ return -EFAULT;
+
+ return len;
+}
+
+static ssize_t cpu_rate_hard_cap_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ char buffer[128] = "";
+ char *endptr = NULL;
+ unsigned long hcppt;
+ int res;
+
+
+ if ((count > 63) || *ppos)
+ return -EFBIG;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ hcppt = simple_strtoul(buffer, &endptr, 0);
+ if ((endptr == buffer) || (hcppt == ULONG_MAX))
+ return -EINVAL;
+
+ if ((res = set_cpu_rate_hard_cap(task, hcppt)) != 0)
+ return res;
+
+ return count;
+}
+
+struct file_operations proc_cpu_rate_hard_cap_operations = {
+ read: cpu_rate_hard_cap_read,
+ write: cpu_rate_hard_cap_write,
+};
+#endif
+
static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd)
{
struct inode *inode = dentry->d_inode;
@@ -1855,6 +1909,11 @@ static struct dentry *proc_pident_lookup
inode->i_fop = &proc_cpu_rate_cap_operations;
break;
#endif
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+ case PROC_TID_CPU_RATE_HARD_CAP:
+ inode->i_fop = &proc_cpu_rate_hard_cap_operations;
+ break;
+#endif
default:
printk("procfs: impossible type (%d)",p->type);
iput(inode);

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 07:30:17

by Kari Hurtta

Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Peter Williams <[email protected]> writes in gmane.linux.kernel:

> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

> + * Require: 1 <= new_cap <= 1000
> + */
> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
> +{
> + int is_allowed;
> + unsigned long flags;
> + struct runqueue *rq;
> + int delta;
> +
> + if (new_cap > 1000 && new_cap > 0)
> + return -EINVAL;

That condition looks wrong.



2006-05-26 08:02:57

by Mike Galbraith

Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
>
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity.

The killer problem I see with this approach is that it doesn't address
the divide and conquer problem. If a task is capped, and forks off
workers, each worker inherits the total cap, effectively extending it.

IMHO, per task resource management is too severely limited in its
usefulness, because jobs are what need managing, and they're seldom
single threaded. In order to use per task limits to manage any given
job, you have to both know the number of components, and manually
distribute resources to each component of the job. If a job has a
dynamic number of components, it becomes impossible to manage.

-Mike

2006-05-26 10:41:35

by Con Kolivas

Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On Friday 26 May 2006 14:20, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.

Nice :)

> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity. The limits are specified in parts per
> thousand and come in two varieties -- hard and soft.

Why 1000? I doubt that degree of accuracy is possible in cpu accounting,
or even required. To me it would seem to make more sense to just be
a percentage.

--
-ck

2006-05-26 10:49:41

by Con Kolivas

Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements (soft) CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand. The CPU usage
> of capped tasks is determined by using Kalman filters to calculate the
> (recent) average lengths of the task's scheduling cycle and the time
> spent on the CPU each cycle and taking the ratio of the latter to the
> former. To minimize overhead associated with uncapped tasks these
> statistics are not kept for them.
>
> Notes:
>
> 1. To minimize the overhead incurred when testing to skip caps processing
> for uncapped tasks, a new flag PF_HAS_CAP has been added to the task's flags.

[ot]I'm sorry to see an Australian adopt American spelling [/ot]

> 3. Enforcement of caps is not as strict as it could be in order to
> reduce the possibility of a task being starved of CPU while holding
> an important system resource with resultant overall performance
> degradation. In effect, all runnable capped tasks will get some amount
> of CPU access every active/expired swap cycle. This will be most
> apparent for small or zero soft caps.

The array swap happens very frequently if there are nothing but heavily cpu
bound tasks, which is not an infrequent workload. I doubt the zero caps are
very effective in that environment.

--
-ck

2006-05-26 11:00:53

by Con Kolivas

Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

A hard cap of 1/1000 could lead to interesting starvation scenarios where a
mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
a lesser extent to a 0 soft cap.

Here is how I handle idleprio tasks in current -ck:

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
tags tasks that are holding a mutex

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
is the idleprio policy for staircase.

What it does is run idleprio tasks as normal tasks when they hold a mutex or
are waking up after calling down() (ie holding a semaphore). These two in
combination have shown resistance to any priority inversion problems in
widespread testing. An attempt was made to track semaphores held via a
down_interruptible() but unfortunately the lack of strict rules about who
could release the semaphore meant accounting of this scenario was impossible.
In practice, though, there were no test cases that showed it to be an issue,
and the recent conversion en-masse of semaphores to mutexes in the kernel
means it has pretty much covered most possibilities.

--
-ck

2006-05-26 11:09:46

by Con Kolivas

Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On Friday 26 May 2006 14:20, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
>
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity. The limits are specified in parts per
> thousand and come in two varieties -- hard and soft. The difference
> between the two is that the system tries to enforce hard caps regardless
> of the other demand for CPU resources but allows soft caps to be
> exceeded if there are spare CPU resources available. By default, tasks
> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
> will inherit any caps that have been imposed on their parent. The
> minimum soft cap allowed is 0 (which effectively puts the task in the
> background) and the minimum hard cap allowed is 1.
>
> Care has been taken to minimize the overhead inflicted on tasks that
> have no caps and my tests using kernbench indicate that it is hidden in
> the noise.

The presence of tasks with caps will break smp balancing and smp nice. I
suspect you could probably provide a reasonable workaround by altering their
priority bias effect in the raw weighted load in smp nice for soft caps by
the percentage cpu of the cap. Hard caps provide more "interesting"
challenges though. I can't think of a valid biasing off hand for them, but at
least initially using the same logic as soft caps should help.

--
-ck

2006-05-26 11:14:17

by Mike Galbraith

Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
> > 3. Enforcement of caps is not as strict as it could be in order to
> > reduce the possibility of a task being starved of CPU while holding
> > an important system resource with resultant overall performance
> > degradation. In effect, all runnable capped tasks will get some amount
> > of CPU access every active/expired swap cycle. This will be most
> > apparent for small or zero soft caps.
>
> The array swap happens very frequently if there are nothing but heavily cpu
> bound tasks, which is not an infrequent workload. I doubt the zero caps are
> very effective in that environment.

Hmm. I think that came out kinda back-assward. You meant "the array
swap happens very frequently _unless_..." No?

But anyway, I can't think of any reason to hold back an uncontested
resource.

-Mike

2006-05-26 11:18:07

by Con Kolivas

Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On Friday 26 May 2006 21:15, Mike Galbraith wrote:
> On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> > > 3. Enforcement of caps is not as strict as it could be in order to
> > > reduce the possibility of a task being starved of CPU while holding
> > > an important system resource with resultant overall performance
> > > degradation. In effect, all runnable capped tasks will get some amount
> > > of CPU access every active/expired swap cycle. This will be most
> > > apparent for small or zero soft caps.
> >
> > The array swap happens very frequently if there are nothing but heavily
> > cpu bound tasks, which is not an infrequent workload. I doubt the zero
> > caps are very effective in that environment.
>
> Hmm. I think that came out kinda back-assward. You meant "the array
> swap happens very frequently _unless_..." No?

No I didn't. If all you are doing is compiling code then the array swap will
happen often as they will always use up their full timeslice and expire.
Therefore an array swap will follow shortly afterwards.

> But anyway, I can't think of any reason to hold back an uncontested
> resource.

If you are compiling applications it's a contested resource.

--
-ck

2006-05-26 11:28:42

by Mike Galbraith

Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On Fri, 2006-05-26 at 21:17 +1000, Con Kolivas wrote:
> On Friday 26 May 2006 21:15, Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 20:48 +1000, Con Kolivas wrote:
> > > On Friday 26 May 2006 14:20, Peter Williams wrote:
> > > > 3. Enforcement of caps is not as strict as it could be in order to
> > > > reduce the possibility of a task being starved of CPU while holding
> > > > an important system resource with resultant overall performance
> > > > degradation. In effect, all runnable capped tasks will get some amount
> > > > of CPU access every active/expired swap cycle. This will be most
> > > > apparent for small or zero soft caps.
> > >
> > > The array swap happens very frequently if there are nothing but heavily
> > > cpu bound tasks, which is not an infrequent workload. I doubt the zero
> > > caps are very effective in that environment.
> >
> > Hmm. I think that came out kinda back-assward. You meant "the array
> > swap happens very frequently _unless_..." No?
>
> No I didn't. If all you are doing is compiling code then the array swap will
> happen often as they will always use up their full timeslice and expire.
> Therefore an array swap will follow shortly afterwards.

Afterward being possibly ages. Frequent array switch happens when you
have mostly sleepy processes, not cpu bound. But whatever.

> > But anyway, I can't think of any reason to hold back an uncontested
> > resource.
>
> If you are compiling applications it's a contested resource.

These zero capped tasks are at the bottom of the heap. They won't be
selected if there's any other runnable task, so it's not contested.

-Mike

2006-05-26 11:34:09

by Balbir Singh

Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On Fri, May 26, 2006 at 02:20:21PM +1000, Peter Williams wrote:
> These patches implement CPU usage rate limits for tasks.
>
> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> it is a total usage limit and therefore (to my mind) not very useful.
> These patches provide an alternative whereby the (recent) average CPU
> usage rate of a task can be limited to a (per task) specified proportion
> of a single CPU's capacity. The limits are specified in parts per
> thousand and come in two varieties -- hard and soft. The difference
> between the two is that the system tries to enforce hard caps regardless
> of the other demand for CPU resources but allows soft caps to be
> exceeded if there are spare CPU resources available. By default, tasks
> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
> will inherit any caps that have been imposed on their parent. The
> minimum soft cap allowed is 0 (which effectively puts the task in the
> background) and the minimum hard cap allowed is 1.
>
> Care has been taken to minimize the overhead inflicted on tasks that
> have no caps and my tests using kernbench indicate that it is hidden in
> the noise.
>
> Note:
>
> The first patch in this series fixes some problems with priority
> inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
> the next -mm kernel.
>

1000 sounds like a coarse number. A more natural unit for the user setting
these limits would be a percentage, or better yet, let the user decide on the
parts. For example, the user could divide the available CPU's capacity
into 2000 parts and ask for 200 parts, or divide it into 100 parts and ask for
10 parts. The default capacity could be 100 or 1000 parts. Maybe the part
setting could be a system tunable.

I would also prefer making the capacity defined as a specified portion
of the capacity of all CPUs. This would make the behaviour more predictable.

Consider a task "T" which has 10 percent of a single CPU's capacity as its
hard limit. If it migrated to another CPU, would the new CPU also make 10% of
its capacity available to "T"?

What is the interval over which the 10% is tracked? Does the task that crosses
its hard limit get killed? If not, when does a task which has exceeded its
hard limit get a new lease of another 10% to use?

I guess I should move on to reading the code for this feature now :-)

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-05-26 13:55:39

by Peter Williams

Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements (soft) CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand. The CPU usage
>> of capped tasks is determined by using Kalman filters to calculate the
>> (recent) average lengths of the task's scheduling cycle and the time
>> spent on the CPU each cycle and taking the ratio of the latter to the
>> former. To minimize overhead associated with uncapped tasks these
>> statistics are not kept for them.
>>
>> Notes:
>>
>> 1. To minimize the overhead incurred when testing to skip caps processing
>> for uncapped tasks, a new flag PF_HAS_CAP has been added to the task's flags.
>
> [ot]I'm sorry to see an Australian adopt American spelling [/ot]

I think you'll find the Oxford English Dictionary (which was the
reference when I went to school in the middle of last century) uses the
z and offers the s version as an option.

>
>> 3. Enforcement of caps is not as strict as it could be in order to
>> reduce the possibility of a task being starved of CPU while holding
>> an important system resource with resultant overall performance
>> degradation. In effect, all runnable capped tasks will get some amount
>> of CPU access every active/expired swap cycle. This will be most
>> apparent for small or zero soft caps.
>
> The array swap happens very frequently if there are nothing but heavily cpu
> bound tasks, which is not an infrequent workload. I doubt the zero caps are
> very effective in that environment.

Yes and it depends on HZ as well (i.e. it works better when HZ is higher).
With HZ=250 and a zero capped hard spinning task competing with
another hard spinning task on a single CPU system it struggles to keep
it below 4%. I've tested hard caps down to 0.5% in the same test
and it copes. So a long term solution such as something similar to the
rt_mutex priority inheritance is needed so that stricter soft capping
can be enforced. I don't think that it would be hard to be more strict
as it would just involve some checking when determining idx in schedule().

BTW in my SPA schedulers this can be controlled by varying the promotion
rate.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 13:59:18

by Peter Williams

Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> a lesser extent to a 0 soft cap.
>
> Here is how I handle idleprio tasks in current -ck:
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> tags tasks that are holding a mutex
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> is the idleprio policy for staircase.
>
> What it does is run idleprio tasks as normal tasks when they hold a mutex or
> are waking up after calling down() (ie holding a semaphore).

I wasn't aware that you could detect those conditions. They could be
very useful.

> These two in
> combination have shown resistance to any priority inversion problems in
> widespread testing. An attempt was made to track semaphores held via a
> down_interruptible() but unfortunately the lack of strict rules about who
> could release the semaphore meant accounting of this scenario was impossible.
> In practice, though, there were no test cases that showed it to be an issue,
> and the recent conversion en-masse of semaphores to mutexes in the kernel
> means it has pretty much covered most possibilities.
>

Thanks,
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 14:00:53

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity. The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft. The difference
>> between the two is that the system tries to enforce hard caps regardless
>> of the other demand for CPU resources but allows soft caps to be
>> exceeded if there are spare CPU resources available. By default, tasks
>> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
>> will inherit any caps that have been imposed on their parent from the
>> parent. The mimimim soft cap allowed is 0 (which effectively puts the
>> task in the background) and the minimim hard cap allowed is 1.
>>
>> Care has been taken to minimize the overhead inflicted on tasks that
>> have no caps and my tests using kernbench indicate that it is hidden in
>> the noise.
>
> The presence of tasks with caps will break smp balancing and smp nice. I
> suspect you could probably provide a reasonable workaround by altering their
> priority bias effect in the raw weighted load in smp nice for soft caps by
> the percentage cpu of the cap. Hard caps provide more "interesting"
> challenges though. I can't think of a valid biasing off hand for them, but at
> least initially using the same logic as soft caps should help.
>

I thought that I already did that? Check the changes to set_load_weight().

--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-26 14:12:49

by Con Kolivas

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On Friday 26 May 2006 23:59, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where
> > a mutex or semaphore was held by a task that hardly ever got cpu. Same
> > goes to a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/track_mutexes-1.patch tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/sched-idleprio-1.7.patch is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a
> > mutex or are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions. They could be
> very useful.

Ingo's mutex infrastructure made it possible to accurately track mutexes held,
and basically anything entering uninterruptible sleep has called down().
Mainline, as you know, already flags the latter for interactivity purposes.

--
-ck

2006-05-26 14:21:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> > mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> > a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> > tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> > is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> > are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions. They could be
> very useful.

Isn't this exactly what the PI code is there to handle? Is something
more than PI needed?

-Mike

2006-05-26 16:11:59

by Björn Steinbrink

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On 2006.05.26 10:04:20 +0200, Mike Galbraith wrote:
> On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
> > These patches implement CPU usage rate limits for tasks.
> >
> > Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> > it is a total usage limit and therefore (to my mind) not very useful.
> > These patches provide an alternative whereby the (recent) average CPU
> > usage rate of a task can be limited to a (per task) specified proportion
> > of a single CPU's capacity.
>
> The killer problem I see with this approach is that it doesn't address
> the divide and conquer problem. If a task is capped, and forks off
> workers, each worker inherits the total cap, effectively extending same.
>
> IMHO, per task resource management is too severely limited in its
> usefulness, because jobs are what need managing, and they're seldom
> single threaded. In order to use per task limits to manage any given
> job, you have to both know the number of components, and manually
> distribute resources to each component of the job. If a job has a
> dynamic number of components, it becomes impossible to manage.

Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
for processes in a "context". All processes in a context share one token
bucket, which has a set of parameters to tune scheduling behaviour.
As the token bucket is shared by a group of processes, and inherited by
child processes/threads, management is quite easy. And the parameters
can be tuned to allow different scheduling behaviours, like allowing a
process group to burst, i.e. use as much cpu time as is available, after
being idle for some time, but being limited to X % cpu time on average.
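
Very roughly, the mechanism is something like the sketch below (an
illustration with made-up names only, NOT the actual Linux-VServer
code):

/* One bucket is shared by every process in a context. */
struct tb_bucket {
	int tokens;	/* current fill level */
	int fill_rate;	/* tokens added every 'interval' ticks */
	int interval;
	int tokens_max;	/* bucket size: bounds how long a burst can last */
};

/* called every 'interval' scheduler ticks */
static void tb_refill(struct tb_bucket *b)
{
	b->tokens += b->fill_rate;
	if (b->tokens > b->tokens_max)
		b->tokens = b->tokens_max;
}

/* the context's running task consumes one token per tick; an empty
 * bucket means the context's tasks are held back until it refills */
static int tb_consume(struct tb_bucket *b)
{
	if (b->tokens <= 0)
		return 0;	/* hold */
	b->tokens--;
	return 1;		/* may keep running */
}

On average a context then gets about fill_rate/interval of a CPU, and
tokens_max bounds how large a burst it can save up while idle.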

I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
lot better and I'm not familiar with implementation details.

Regards
Björn

2006-05-27 00:16:20

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 14:20 +1000, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity.
>
> The killer problem I see with this approach is that it doesn't address
> the divide and conquer problem. If a task is capped, and forks off
> workers, each worker inherits the total cap, effectively extending same.
>
> IMHO, per task resource management is too severely limited in its
> usefulness, because jobs are what need managing, and they're seldom
> single threaded. In order to use per task limits to manage any given
> job, you have to both know the number of components, and manually
> distribute resources to each component of the job. If a job has a
> dynamic number of components, it becomes impossible to manage.

Doing caps at a process level inside the scheduler is doable but would
involve an extra level of complexity, including locking at the process
level to calculate process usage rates. Also, the calculation of usage
rates would be more complex than just doing it for tasks, and the fact
that there are not separate structures for processes and threads also
complicates the code compared to what would be required otherwise (e.g.
on Solaris).

I'm not sure that this extra complexity is warranted when it is possible
to implement caps at the process level from outside the scheduler using
the task level caps provided by this patch. However, to allow the costs
to be properly evaluated, I'll put some effort into a process level
capping mechanism over the next few weeks.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 00:16:37

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>> Con Kolivas wrote:
>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>> single CPU's capacity expressed in parts per thousand.
>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>> a lesser extent to a 0 soft cap.
>>>
>>> Here is how I handle idleprio tasks in current -ck:
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>> tags tasks that are holding a mutex
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>> is the idleprio policy for staircase.
>>>
>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>> are waking up after calling down() (ie holding a semaphore).
>> I wasn't aware that you could detect those conditions. They could be
>> very useful.
>
> Isn't this exactly what the PI code is there to handle? Is something
> more than PI needed?
>

AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
be extended. It could be argued that extending PI so that it can be
used by non RT tasks is a worthwhile endeavour in its own right.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 01:00:45

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Kari Hurtta wrote:
> Peter Williams <[email protected]> writes in gmane.linux.kernel:
>
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
>> + * Require: 1 <= new_cap <= 1000
>> + */
>> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
>> +{
>> + int is_allowed;
>> + unsigned long flags;
>> + struct runqueue *rq;
>> + int delta;
>> +
>> + if (new_cap > 1000 && new_cap > 0)
>> + return -EINVAL;
>
> That condition looks wrong.

It certainly does.
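
Presumably, given the stated requirement (1 <= new_cap <= 1000), it
should read:

	if (new_cap > 1000 || new_cap < 1)
		return -EINVAL;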

Thanks
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 01:28:07

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>
> Nice :)

Thanks.

>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity. The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft.
>
> Why 1000?

Probably a hangover from a version where the units were proportion of a
whole machine. Percentage doesn't work very well if there are more than
1 CPU in that case (especially if there are more than 100 CPUs :-)).
But it's also useful to have the extra range if you're trying to cap
processes (or users) from outside the scheduler using these primitives.

> I doubt that degree of accuracy is possible in cpu accounting or even
> required. To me it would seem to make more sense to just be
> a percentage.
>

It's not meant to imply accuracy :-). The main issue is avoiding
overflow when doing the multiplications during the comparisons.
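
To spell that out (the comparison below is the one task_exceeding_cap()
in the patch actually makes, annotated here):

/*
 * Conceptually the test is
 *
 *	avg_cpu_per_cycle / avg_cycle_length  >  cpu_rate_cap / 1000
 *
 * but it is cross-multiplied to avoid 64 bit division:
 *
 *	avg_cpu_per_cycle * 1000  >  avg_cycle_length * cpu_rate_cap
 *
 * Both sides are unsigned long long holding (decayed) nanosecond
 * averages, so multiplying by 1000 (or by a cap <= 1000) can't
 * overflow until an average approaches 2^64/1000 ns -- around 200
 * days' worth of average cycle length, far beyond anything plausible.
 */
static inline int task_exceeding_cap(const task_t *p)
{
	return (p->avg_cpu_per_cycle * 1000) >
	       (p->avg_cycle_length * p->cpu_rate_cap);
}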

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 01:41:04

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Balbir Singh wrote:
> On Fri, May 26, 2006 at 02:20:21PM +1000, Peter Williams wrote:
>> These patches implement CPU usage rate limits for tasks.
>>
>> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
>> it is a total usage limit and therefore (to my mind) not very useful.
>> These patches provide an alternative whereby the (recent) average CPU
>> usage rate of a task can be limited to a (per task) specified proportion
>> of a single CPU's capacity. The limits are specified in parts per
>> thousand and come in two varieties -- hard and soft. The difference
>> between the two is that the system tries to enforce hard caps regardless
>> of the other demand for CPU resources but allows soft caps to be
>> exceeded if there are spare CPU resources available. By default, tasks
>> will have both caps set to 1000 (i.e. no limit) but newly forked tasks
>> will inherit any caps that have been imposed on their parent from the
>> parent. The mimimim soft cap allowed is 0 (which effectively puts the
>> task in the background) and the minimim hard cap allowed is 1.
>>
>> Care has been taken to minimize the overhead inflicted on tasks that
>> have no caps and my tests using kernbench indicate that it is hidden in
>> the noise.
>>
>> Note:
>>
>> The first patch in this series fixes some problems with priority
>> inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in
>> the next -mm kernel.
>>
>
> 1000 sounds like a coarse number. A good estimate for the user setting
> these limits would be percentage or, better yet, let the user decide on the
> parts. For example, the user could divide the available CPU's capacity
> into 2000 parts and ask for 200 parts, or divide it into 100 parts and ask
> for 10 parts. The default capacity can be 100 or 1000 parts. Maybe the part
> setting could be a system tunable.
>
> I would also prefer making the capacity defined as a specified portion
> of the capacity of all CPUs. This would make the behaviour more predictable.

The meaning of a cap would change every time you took a CPU off/on line.
This makes the behaviour less predictable, not more predictable (at
least in my opinion). You also have the possibility of a cap being
larger than the capacity of a single CPU which doesn't make sense when
capping at the task level.

However, if you still preferred that interface, it could be implemented
as a wrapper around these functionalities out in user space or inside a
resource management component.

>
> Consider a task "T" which has 10 percent of a single CPU's capacity as its
> hard limit. If it migrated to another CPU, would the new CPU also make 10%
> of its capacity available to "T"?
>
> What is the interval over which the 10% is tracked? Does the task that crosses
> its hard limit get killed? If not, when does a task which has exceeded its
> hard limit get a new lease of another 10% to use?
>
> I guess I should move on to reading the code for this feature now :-)

I look forward to your comments.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 01:43:07

by Con Kolivas

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

On Saturday 27 May 2006 11:28, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
> >> it is a total usage limit and therefore (to my mind) not very useful.
> >> These patches provide an alternative whereby the (recent) average CPU
> >> usage rate of a task can be limited to a (per task) specified proportion
> >> of a single CPU's capacity. The limits are specified in parts per
> >> thousand and come in two varieties -- hard and soft.
> >
> > Why 1000?
>
> Probably a hangover from a version where the units were proportion of a
> whole machine. Percentage doesn't work very well if there are more than
> 1 CPU in that case (especially if there are more than 100 CPUs :-)).
> But it's also useful to have the extra range if you're trying to cap
> processes (or users) from outside the scheduler using these primitives.
>
> > I doubt that degree of accuracy is possible in cpu accounting or even
> > required. To me it would seem to make more sense to just
> > be a percentage.
>
> It's not meant to imply accuracy :-). The main issue is avoiding
> overflow when doing the multiplications during the comparisons.

Well you could always expose a smaller, more meaningful value than what is
stored internally. However, you've already implied that there are requirements
in userspace for more granularity in the proportioning than percentage can
give.

--
-ck

2006-05-27 06:31:12

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On 5/26/06, Peter Williams <[email protected]> wrote:
<snip>
>
> Notes:
>
> 1. To minimize the overhead incurred when testing to skip caps processing for
> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>
> 2. The implementation involves the addition of two priority slots to the
> run queue priority arrays and this means that MAX_PRIO no longer
> represents the scheduling priority of the idle process and can't be used to
> test whether priority values are in the valid range. To alleviate this
> problem a new function sched_idle_prio() has been provided.

I am a little confused by this. Why link tasks that have expired their
CPU bandwidth (their caps) to a priority slot? Is this a hack to continue
using the prio_array? Why not move such tasks to the expired array?

<snip>
> /*
> * Some day this will be a full-fledged user tracking system..
> */
> @@ -787,6 +793,10 @@ struct task_struct {
> unsigned long sleep_avg;
> unsigned long long timestamp, last_ran;
> unsigned long long sched_time; /* sched_clock time spent running */
> +#ifdef CONFIG_CPU_RATE_CAPS
> + unsigned long long avg_cpu_per_cycle, avg_cycle_length;
> + unsigned int cpu_rate_cap;
> +#endif

How is a cycle defined? What are the units of a cycle? Could we please
document the units for the declarations above?

> enum sleep_type sleep_type;
>
> unsigned long policy;
> @@ -981,6 +991,11 @@ struct task_struct {
> #endif
> };
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +unsigned int get_cpu_rate_cap(const struct task_struct *);
> +int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#endif
> +
> static inline pid_t process_group(struct task_struct *tsk)
> {
> return tsk->signal->pgrp;
> @@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
> #define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches over cpuset */
> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
> #define PF_MUTEX_TESTER 0x02000000 /* Thread belongs to the rt mutex tester */
> +#define PF_HAS_CAP 0x20000000 /* Has a CPU rate cap */
>
> /*
> * Only the _current_ task can read/write to tsk->flags, but other
> Index: MM-2.6.17-rc4-mm3/init/Kconfig
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/init/Kconfig 2006-05-26 10:39:59.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/init/Kconfig 2006-05-26 10:45:26.000000000 +1000
> @@ -286,6 +286,8 @@ config RELAY
>
> If unsure, say N.
>
> +source "kernel/Kconfig.caps"
> +
> source "usr/Kconfig"
>
> config UID16
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps 2006-05-26 10:45:26.000000000 +1000
> @@ -0,0 +1,13 @@
> +#
> +# CPU Rate Caps Configuration
> +#
> +
> +config CPU_RATE_CAPS
> + bool "Support (soft) CPU rate caps"
> + default n
> + ---help---
> + Say y here if you wish to be able to put a (soft) upper limit on
> + the rate of CPU usage by individual tasks. A task which has been
> + allocated a soft CPU rate cap will be limited to that rate of CPU
> + usage unless there is spare CPU resources available after the needs
> + of uncapped tasks are met.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26 10:44:51.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 11:00:02.000000000 +1000
> @@ -57,6 +57,19 @@
>
> #include <asm/unistd.h>
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +#define IDLE_PRIO (MAX_PRIO + 2)
> +#else
> +#define IDLE_PRIO MAX_PRIO
> +#endif
> +#define BGND_PRIO (IDLE_PRIO - 1)
> +#define CAPPED_PRIO (IDLE_PRIO - 2)
> +
> +int sched_idle_prio(void)
> +{
> + return IDLE_PRIO;
> +}
> +
> /*
> * Convert user-nice values [ -20 ... 0 ... 19 ]
> * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
> @@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
> return static_prio_timeslice(p->static_prio);
> }
>
> +#ifdef CONFIG_CPU_RATE_CAPS
> +#define CAP_STATS_OFFSET 8
> +#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
> +/* this assumes that p is not a real time task */
> +#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
> +#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> +#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)

Could we please use a const or #define'd name instead of 1000. How
about TOTAL_CAP_IN_PARTS? It would make the code easier to read and
maintain.

> +
> +static void init_cpu_rate_caps(task_t *p)
> +{
> + p->cpu_rate_cap = 1000;
> + p->flags &= ~PF_HAS_CAP;
> +}
> +
> +static inline void set_cap_flag(task_t *p)
> +{
> + if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> + p->flags |= PF_HAS_CAP;
> + else
> + p->flags &= ~PF_HAS_CAP;
> +}

Why don't you re-use RLIMIT_INFINITY?

> +
> +static inline int task_exceeding_cap(const task_t *p)
> +{
> + return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> +}
> +
> +#ifdef CONFIG_SCHED_SMT
> +static unsigned int smt_timeslice(task_t *p)
> +{
> + if (task_has_cap(p) && task_being_capped(p))
> + return 0;
> +
> + return task_timeslice(p);
> +}
> +
> +static int task_priority_gt(const task_t *thisp, const task_t *thatp)
> +{
> + if (task_has_cap(thisp) && (task_being_capped(thisp)))
> + return 0;
> +
> + if (task_has_cap(thatp) && (task_being_capped(thatp)))
> + return 1;
> +
> + return thisp->static_prio < thatp->static_prio;
> +}

This function needs some comments. At least with respect to what
thisp and thatp are.

> +#endif
> +
> +/*
> + * Update usage stats to "now" before making comparison
> + * Assume: task is actually on a CPU
> + */
> +static int task_exceeding_cap_now(const task_t *p, unsigned long long now)
> +{
> + unsigned long long delta, lhs, rhs;
> +
> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> + lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> + rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> +
> + return lhs > rhs;
> +}
> +
> +static inline void init_cap_stats(task_t *p)
> +{
> + p->avg_cpu_per_cycle = 0;
> + p->avg_cycle_length = 0;
> +}
> +
> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> +{
> + unsigned long long delta;
> +
> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> + p->avg_cycle_length += delta;
> +}
> +
> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
> +{
> + unsigned long long delta;
> +
> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> + p->avg_cycle_length += delta;
> + p->avg_cpu_per_cycle += delta;
> +}
> +
> +static inline void decay_cap_stats(task_t *p)
> +{
> + p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
> + p->avg_cycle_length >>= CAP_STATS_OFFSET;
> + p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
> + p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
> +}
> +#else
> +#define task_has_cap(p) 0
> +#define task_is_background(p) 0
> +#define task_being_capped(p) 0
> +#define cap_load_weight(p) SCHED_LOAD_SCALE
> +
> +static inline void init_cpu_rate_caps(task_t *p)
> +{
> +}
> +
> +static inline void set_cap_flag(task_t *p)
> +{
> +}
> +
> +static inline int task_exceeding_cap(const task_t *p)
> +{
> + return 0;
> +}
> +
> +#ifdef CONFIG_SCHED_SMT
> +#define smt_timeslice(p) task_timeslice(p)
> +
> +static inline int task_priority_gt(const task_t *thisp, const task_t *thatp)
> +{
> + return thisp->static_prio < thatp->static_prio;
> +}
> +#endif
> +
> +static inline int task_exceeding_cap_now(const task_t *p, unsigned long long now)
> +{
> + return 0;
> +}
> +
> +static inline void init_cap_stats(task_t *p)
> +{
> +}
> +
> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> +{
> +}
> +
> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
> +{
> +}
> +
> +static inline void decay_cap_stats(task_t *p)
> +{
> +}
> +#endif
> +
> #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
> < (long long) (sd)->cache_hot_time)
>
> @@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;
>
> struct prio_array {
> unsigned int nr_active;
> - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */
> - struct list_head queue[MAX_PRIO];
> + DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for delimiter */
> + struct list_head queue[IDLE_PRIO];
> };
>
> /*
> @@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
> {
> int bonus, prio;
>
> + /* Ensure that background tasks stay at BGND_PRIO */
> + if (task_is_background(p))
> + return BGND_PRIO;
> +
> bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>
> prio = p->static_prio - bonus;
> @@ -786,6 +946,8 @@ static inline int expired_starving(runqu
>
> static void set_load_weight(task_t *p)
> {
> + set_cap_flag(p);
> +
> if (has_rt_policy(p)) {
> #ifdef CONFIG_SMP
> if (p == task_rq(p)->migration_thread)
> @@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
> else
> #endif
> p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
> - } else
> + } else {
> p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
> +
> + /*
> + * Reduce the probability of a task escaping its CPU rate cap
> + * due to load balancing leaving it on a lighly used CPU
> + * This will be optimized away if rate caps aren't configured
> + */
> + if (task_has_cap(p)) {
> + unsigned int clw; /* load weight based on cap */
> +
> + clw = cap_load_weight(p);
> + if (clw < p->load_weight)
> + p->load_weight = clw;
> + }

You could use
p->load_weight = min(cap_load_weight(p), p->load_weight);


> + }
> }
>
> static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t *p)
> @@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
> {
> prio_array_t *target = rq->active;
>
> - if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p))))
> + if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p)) ||
> + task_being_capped(p)))
> target = rq->expired;
> enqueue_task(p, target);
> inc_nr_running(p, rq);
> @@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
> #endif
>
> if (!rt_task(p))
> + /*
> + * We want to do the recalculation even if we're exceeding
> + * a cap so that everything still works when we stop
> + * exceeding our cap.
> + */
> p->prio = recalc_task_prio(p, now);
>
> + if (task_has_cap(p)) {
> + inc_cap_stats_cycle(p, now);
> + /* Background tasks are handled in effective_prio()
> + * in order to ensure that they stay at BGND_PRIO
> + * but we need to be careful that we don't override
> + * it here
> + */
> + if (task_exceeding_cap(p) && !task_is_background(p)) {
> + p->normal_prio = CAPPED_PRIO;
> + /*
> + * Don't undo any priority inheritance
> + */
> + if (!rt_task(p))
> + p->prio = CAPPED_PRIO;
> + }
> + }

Within all tasks at CAPPED_PRIO, is the priority of a task used for scheduling?

<snip>

Cheers,
Balbir
Linux Technology Center
IBM Software Labs

2006-05-27 06:48:33

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On 5/26/06, Peter Williams <[email protected]> wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.
>
> Signed-off-by: Peter Williams <[email protected]>
> include/linux/sched.h | 8 ++
> kernel/Kconfig.caps | 14 +++-
> kernel/sched.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 168 insertions(+), 8 deletions(-)
>
> Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h 2006-05-26 10:46:35.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/include/linux/sched.h 2006-05-26 11:00:07.000000000 +1000
> @@ -796,6 +796,10 @@ struct task_struct {
> #ifdef CONFIG_CPU_RATE_CAPS
> unsigned long long avg_cpu_per_cycle, avg_cycle_length;
> unsigned int cpu_rate_cap;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> + unsigned int cpu_rate_hard_cap;
> + struct timer_list sinbin_timer;

Using a timer for releasing tasks from their sinbin sounds like a bit
of an overhead, given that there could be 10s of thousands of tasks.
Is it possible to use the scheduler_tick() function to take a look at all
deactivated tasks (as efficiently as possible) and activate them when
it's time to activate them, or just fold the functionality in by defining a
time quantum after which everyone is woken up? This time quantum
could be the same as the time over which limits are honoured.
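
Something like the following sketch is what I have in mind (names
invented and untested; it assumes a per-runqueue list of sinbinned
tasks and a new p->sinbin_release jiffies field, and locking is
elided):

/* called from scheduler_tick(): release any tasks whose sinbin
 * time is up, instead of arming one kernel timer per task */
static void sinbin_scan(struct list_head *sinbinned, unsigned long now)
{
	struct task_struct *p, *tmp;

	list_for_each_entry_safe(p, tmp, sinbinned, run_list) {
		if (time_after_eq(now, p->sinbin_release)) {
			list_del_init(&p->run_list);
			/* same reactivation as sinbin_release_fn() */
			p->prio = effective_prio(p);
			__activate_task(p, task_rq(p));
		}
	}
}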

> +#endif
> #endif
> enum sleep_type sleep_type;
>
> @@ -994,6 +998,10 @@ struct task_struct {
> #ifdef CONFIG_CPU_RATE_CAPS
> unsigned int get_cpu_rate_cap(const struct task_struct *);
> int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
> +int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
> +#endif
> #endif
>
> static inline pid_t process_group(struct task_struct *tsk)
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps 2006-05-26 10:45:26.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps 2006-05-26 11:00:07.000000000 +1000
> @@ -3,11 +3,21 @@
> #
>
> config CPU_RATE_CAPS
> - bool "Support (soft) CPU rate caps"
> + bool "Support CPU rate caps"
> default n
> ---help---
> - Say y here if you wish to be able to put a (soft) upper limit on
> + Say y here if you wish to be able to put a soft upper limit on
> the rate of CPU usage by individual tasks. A task which has been
> allocated a soft CPU rate cap will be limited to that rate of CPU
> usage unless there is spare CPU resources available after the needs
> of uncapped tasks are met.
> +
> +config CPU_RATE_HARD_CAPS
> + bool "Support CPU rate hard caps"
> + depends on CPU_RATE_CAPS
> + default n
> + ---help---
> + Say y here if you wish to be able to put a hard upper limit on
> + the rate of CPU usage by individual tasks. A task which has been
> + allocated a hard CPU rate cap will be limited to that rate of CPU
> + usage regardless of whether there is spare CPU resources available.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26 11:00:02.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 13:50:11.000000000 +1000
> @@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
>
> #ifdef CONFIG_CPU_RATE_CAPS
> #define CAP_STATS_OFFSET 8
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +static void sinbin_release_fn(unsigned long arg);
> +#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
> +#else
> +#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
> +#endif
> #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
> /* this assumes that p is not a real time task */
> #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
> #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> -#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
> +#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
>
> static void init_cpu_rate_caps(task_t *p)
> {
> p->cpu_rate_cap = 1000;
> p->flags &= ~PF_HAS_CAP;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> + p->cpu_rate_hard_cap = 1000;
> + init_timer(&p->sinbin_timer);
> + p->sinbin_timer.function = sinbin_release_fn;
> + p->sinbin_timer.data = (unsigned long) p;
> +#endif
> }
>
> static inline void set_cap_flag(task_t *p)
> {
> - if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> + if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
> p->flags |= PF_HAS_CAP;
> else
> p->flags &= ~PF_HAS_CAP;
> @@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
>
> static inline int task_exceeding_cap(const task_t *p)
> {
> - return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> + return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
> }
>
> #ifdef CONFIG_SCHED_SMT
> @@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
>
> delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
> lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> - rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> + rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
>
> return lhs > rhs;
> }
> @@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
> {
> p->avg_cpu_per_cycle = 0;
> p->avg_cycle_length = 0;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> + init_timer(&p->sinbin_timer);
> + p->sinbin_timer.data = (unsigned long) p;
> +#endif
> }
>
> static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> @@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
> p->array = NULL;
> }
>
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
> +
> +/*
> + * Release a task from the sinbin
> + */
> +static void sinbin_release_fn(unsigned long arg)
> +{
> + unsigned long flags;
> + struct task_struct *p = (struct task_struct*)arg;
> + struct runqueue *rq = task_rq_lock(p, &flags);
> +
> + p->prio = effective_prio(p);
> +
> + __activate_task(p, rq);
> +
> + task_rq_unlock(rq, &flags);
> +}
> +
> +static unsigned long reqd_sinbin_ticks(const task_t *p)
> +{
> + unsigned long long res;
> +
> + res = p->avg_cpu_per_cycle * 1000;
> +
> + if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
> + (void)do_div(res, p->cpu_rate_hard_cap);
> + res -= p->avg_cpu_per_cycle;
> + /*
> + * IF it was available we'd also subtract
> + * the average sleep per cycle here
> + */
> + res >>= CAP_STATS_OFFSET;
> + (void)do_div(res, (1000000000 / HZ));

Please use NSEC_PER_SEC if that is what 10^9 stands for in the above
calculation.

> +
> + return res ? : 1;
> + }
> +
> + return 0;
> +}
> +
> +static void sinbin_task(task_t *p, unsigned long durn)
> +{
> + if (durn == 0)
> + return;
> + deactivate_task(p, task_rq(p));
> + p->sinbin_timer.expires = jiffies + durn;
> + add_timer(&p->sinbin_timer);
> +}
> +#else
> +#define task_has_hard_cap(p) 0
> +#define reqd_sinbin_ticks(p) 0
> +
> +static inline void sinbin_task(task_t *p, unsigned long durn)
> +{
> +}
> +#endif
> +
> /*
> * resched_task - mark a task 'to be rescheduled now'.
> *
> @@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
> }
> }
>
> - /* do this now so that stats are correct for SMT code */
> - if (task_has_cap(prev))
> + if (task_has_cap(prev)) {
> inc_cap_stats_both(prev, now);
> + if (task_has_hard_cap(prev) && !prev->state &&
> + !rt_task(prev) && !signal_pending(prev)) {
> + unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
> +
> + if (sinbin_ticks)
> + sinbin_task(prev, sinbin_ticks);
> + }
> + }
>
> cpu = smp_processor_id();
> if (unlikely(!rq->nr_running)) {
> @@ -4539,6 +4620,67 @@ out:
> }
>
<snip>

Balbir
Linux Technology Center
IBM Software Labs

2006-05-27 07:04:04

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Balbir Singh wrote:
> On 5/26/06, Peter Williams <[email protected]> wrote:
> <snip>
>>
>> Notes:
>>
>> 1. To minimize the overhead incurred when testing to skip caps
>> processing for
>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>>
>> 2. The implementation involves the addition of two priority slots to the
>> run queue priority arrays and this means that MAX_PRIO no longer
>> represents the scheduling priority of the idle process and can't be
>> used to
>> test whether priority values are in the valid range. To alleviate this
>> problem a new function sched_idle_prio() has been provided.
>
> I am a little confused by this. Why link tasks that have expired their
> CPU bandwidth (their caps) to a priority slot? Is this a hack to continue
> using the prio_array? Why not move such tasks to the expired array?

Because it won't work: after the array switch they may get to run
before tasks that aren't exceeding their cap (or don't have a cap).

>
> <snip>
>> /*
>> * Some day this will be a full-fledged user tracking system..
>> */
>> @@ -787,6 +793,10 @@ struct task_struct {
>> unsigned long sleep_avg;
>> unsigned long long timestamp, last_ran;
>> unsigned long long sched_time; /* sched_clock time spent
>> running */
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> + unsigned long long avg_cpu_per_cycle, avg_cycle_length;
>> + unsigned int cpu_rate_cap;
>> +#endif
>
> How is a cycle defined?

From one "on CPU" to the next.

> What are the units of a cycle?

Well since sched_clock() is used they are obviously nanoseconds
multiplied by 2 to the power of CAP_STATS_OFFSET. But it's fairly
irrelevant as we're only interested in their ratio and that's dimensionless.
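
(For what it's worth, with CAP_STATS_OFFSET equal to 8 decay_cap_stats()
multiplies each average by 255/256 per cycle, so the "recent" average
has a half life of roughly

    n = ln(2) / ln(256/255) ~= 177 cycles

-- just a back-of-the-envelope figure, not something the code depends
on.)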

> Could we please
> document the units for the declarations above?

No.

>
>> enum sleep_type sleep_type;
>>
>> unsigned long policy;
>> @@ -981,6 +991,11 @@ struct task_struct {
>> #endif
>> };
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +unsigned int get_cpu_rate_cap(const struct task_struct *);
>> +int set_cpu_rate_cap(struct task_struct *, unsigned int);
>> +#endif
>> +
>> static inline pid_t process_group(struct task_struct *tsk)
>> {
>> return tsk->signal->pgrp;
>> @@ -1040,6 +1055,7 @@ static inline void put_task_struct(struc
>> #define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches
>> over cpuset */
>> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
>> #define PF_MUTEX_TESTER 0x02000000 /* Thread belongs to
>> the rt mutex tester */
>> +#define PF_HAS_CAP 0x20000000 /* Has a CPU rate cap */
>>
>> /*
>> * Only the _current_ task can read/write to tsk->flags, but other
>> Index: MM-2.6.17-rc4-mm3/init/Kconfig
>> ===================================================================
>> --- MM-2.6.17-rc4-mm3.orig/init/Kconfig 2006-05-26 10:39:59.000000000
>> +1000
>> +++ MM-2.6.17-rc4-mm3/init/Kconfig 2006-05-26 10:45:26.000000000
>> +1000
>> @@ -286,6 +286,8 @@ config RELAY
>>
>> If unsure, say N.
>>
>> +source "kernel/Kconfig.caps"
>> +
>> source "usr/Kconfig"
>>
>> config UID16
>> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
>> ===================================================================
>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps 2006-05-26
>> 10:45:26.000000000 +1000
>> @@ -0,0 +1,13 @@
>> +#
>> +# CPU Rate Caps Configuration
>> +#
>> +
>> +config CPU_RATE_CAPS
>> + bool "Support (soft) CPU rate caps"
>> + default n
>> + ---help---
>> + Say y here if you wish to be able to put a (soft) upper
>> limit on
>> + the rate of CPU usage by individual tasks. A task which has
>> been
>> + allocated a soft CPU rate cap will be limited to that rate
>> of CPU
>> + usage unless there is spare CPU resources available after
>> the needs
>> + of uncapped tasks are met.
>> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
>> ===================================================================
>> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c 2006-05-26
>> 10:44:51.000000000 +1000
>> +++ MM-2.6.17-rc4-mm3/kernel/sched.c 2006-05-26 11:00:02.000000000
>> +1000
>> @@ -57,6 +57,19 @@
>>
>> #include <asm/unistd.h>
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +#define IDLE_PRIO (MAX_PRIO + 2)
>> +#else
>> +#define IDLE_PRIO MAX_PRIO
>> +#endif
>> +#define BGND_PRIO (IDLE_PRIO - 1)
>> +#define CAPPED_PRIO (IDLE_PRIO - 2)
>> +
>> +int sched_idle_prio(void)
>> +{
>> + return IDLE_PRIO;
>> +}
>> +
>> /*
>> * Convert user-nice values [ -20 ... 0 ... 19 ]
>> * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
>> @@ -186,6 +199,149 @@ static inline unsigned int task_timeslic
>> return static_prio_timeslice(p->static_prio);
>> }
>>
>> +#ifdef CONFIG_CPU_RATE_CAPS
>> +#define CAP_STATS_OFFSET 8
>> +#define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
>> +/* this assumes that p is not a real time task */
>> +#define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
>> +#define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
>> +#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) /
>> 1000)
>
> Could we please use a const or #define'd name instead of 1000. How
> about TOTAL_CAP_IN_PARTS? It would make the code easier to read and
> maintain.
>
>> +
>> +static void init_cpu_rate_caps(task_t *p)
>> +{
>> + p->cpu_rate_cap = 1000;
>> + p->flags &= ~PF_HAS_CAP;
>> +}
>> +
>> +static inline void set_cap_flag(task_t *p)
>> +{
>> + if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
>> + p->flags |= PF_HAS_CAP;
>> + else
>> + p->flags &= ~PF_HAS_CAP;
>> +}
>
> Why don't you re-use RLIMIT_INFINITY?

I presume that it means something else.

>
>> +
>> +static inline int task_exceeding_cap(const task_t *p)
>> +{
>> + return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length *
>> p->cpu_rate_cap);
>> +}
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> +static unsigned int smt_timeslice(task_t *p)
>> +{
>> + if (task_has_cap(p) && task_being_capped(p))
>> + return 0;
>> +
>> + return task_timeslice(p);
>> +}
>> +
>> +static int task_priority_gt(const task_t *thisp, const task_t *thatp)
>> +{
>> + if (task_has_cap(thisp) && (task_being_capped(thisp)))
>> + return 0;
>> +
>> + if (task_has_cap(thatp) && (task_being_capped(thatp)))
>> + return 1;
>> +
>> + return thisp->static_prio < thatp->static_prio;
>> +}
>
> This function needs some comments. At least with respect to what
> thisp and thatp are.

gt means "greater than" (as any Fortran programmer knows :-)) and is
sufficient documentation.

>
>> +#endif
>> +
>> +/*
>> + * Update usage stats to "now" before making comparison
>> + * Assume: task is actually on a CPU
>> + */
>> +static int task_exceeding_cap_now(const task_t *p, unsigned long long
>> now)
>> +{
>> + unsigned long long delta, lhs, rhs;
>> +
>> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> + lhs = (p->avg_cpu_per_cycle + delta) * 1000;
>> + rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
>> +
>> + return lhs > rhs;
>> +}
>> +
>> +static inline void init_cap_stats(task_t *p)
>> +{
>> + p->avg_cpu_per_cycle = 0;
>> + p->avg_cycle_length = 0;
>> +}
>> +
>> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long
>> now)
>> +{
>> + unsigned long long delta;
>> +
>> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> + p->avg_cycle_length += delta;
>> +}
>> +
>> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
>> +{
>> + unsigned long long delta;
>> +
>> + delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>> + p->avg_cycle_length += delta;
>> + p->avg_cpu_per_cycle += delta;
>> +}
>> +
>> +static inline void decay_cap_stats(task_t *p)
>> +{
>> + p->avg_cycle_length *= ((1 << CAP_STATS_OFFSET) - 1);
>> + p->avg_cycle_length >>= CAP_STATS_OFFSET;
>> + p->avg_cpu_per_cycle *= ((1 << CAP_STATS_OFFSET) - 1);
>> + p->avg_cpu_per_cycle >>= CAP_STATS_OFFSET;
>> +}
>> +#else
>> +#define task_has_cap(p) 0
>> +#define task_is_background(p) 0
>> +#define task_being_capped(p) 0
>> +#define cap_load_weight(p) SCHED_LOAD_SCALE
>> +
>> +static inline void init_cpu_rate_caps(task_t *p)
>> +{
>> +}
>> +
>> +static inline void set_cap_flag(task_t *p)
>> +{
>> +}
>> +
>> +static inline int task_exceeding_cap(const task_t *p)
>> +{
>> + return 0;
>> +}
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> +#define smt_timeslice(p) task_timeslice(p)
>> +
>> +static inline int task_priority_gt(const task_t *thisp, const task_t
>> *thatp)
>> +{
>> + return thisp->static_prio < thatp->static_prio;
>> +}
>> +#endif
>> +
>> +static inline int task_exceeding_cap_now(const task_t *p, unsigned
>> long long now)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline void init_cap_stats(task_t *p)
>> +{
>> +}
>> +
>> +static inline void inc_cap_stats_cycle(task_t *p, unsigned long long
>> now)
>> +{
>> +}
>> +
>> +static inline void inc_cap_stats_both(task_t *p, unsigned long long now)
>> +{
>> +}
>> +
>> +static inline void decay_cap_stats(task_t *p)
>> +{
>> +}
>> +#endif
>> +
>> #define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
>> < (long long) (sd)->cache_hot_time)
>>
>> @@ -197,8 +353,8 @@ typedef struct runqueue runqueue_t;
>>
>> struct prio_array {
>> unsigned int nr_active;
>> - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for
>> delimiter */
>> - struct list_head queue[MAX_PRIO];
>> + DECLARE_BITMAP(bitmap, IDLE_PRIO+1); /* include 1 bit for
>> delimiter */
>> + struct list_head queue[IDLE_PRIO];
>> };
>>
>> /*
>> @@ -710,6 +866,10 @@ static inline int __normal_prio(task_t *
>> {
>> int bonus, prio;
>>
>> + /* Ensure that background tasks stay at BGND_PRIO */
>> + if (task_is_background(p))
>> + return BGND_PRIO;
>> +
>> bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>>
>> prio = p->static_prio - bonus;
>> @@ -786,6 +946,8 @@ static inline int expired_starving(runqu
>>
>> static void set_load_weight(task_t *p)
>> {
>> + set_cap_flag(p);
>> +
>> if (has_rt_policy(p)) {
>> #ifdef CONFIG_SMP
>> if (p == task_rq(p)->migration_thread)
>> @@ -798,8 +960,22 @@ static void set_load_weight(task_t *p)
>> else
>> #endif
>> p->load_weight =
>> RTPRIO_TO_LOAD_WEIGHT(p->rt_priority);
>> - } else
>> + } else {
>> p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio);
>> +
>> + /*
>> + * Reduce the probability of a task escaping its CPU
>> rate cap
>> + * due to load balancing leaving it on a lightly used CPU
>> + * This will be optimized away if rate caps aren't
>> configured
>> + */
>> + if (task_has_cap(p)) {
>> + unsigned int clw; /* load weight based on cap */
>> +
>> + clw = cap_load_weight(p);
>> + if (clw < p->load_weight)
>> + p->load_weight = clw;
>> + }
>
> You could use
> p->load_weight = min(cap_load_weight(p), p->load_weight);

Yes.

>
>
>> + }
>> }
>>
>> static inline void inc_raw_weighted_load(runqueue_t *rq, const task_t
>> *p)
>> @@ -869,7 +1045,8 @@ static void __activate_task(task_t *p, r
>> {
>> prio_array_t *target = rq->active;
>>
>> - if (unlikely(batch_task(p) || (expired_starving(rq) &&
>> !rt_task(p))))
>> + if (unlikely(batch_task(p) || (expired_starving(rq) &&
>> !rt_task(p)) ||
>> + task_being_capped(p)))
>> target = rq->expired;
>> enqueue_task(p, target);
>> inc_nr_running(p, rq);
>> @@ -975,8 +1152,30 @@ static void activate_task(task_t *p, run
>> #endif
>>
>> if (!rt_task(p))
>> + /*
>> + * We want to do the recalculation even if we're
>> exceeding
>> + * a cap so that everything still works when we stop
>> + * exceeding our cap.
>> + */
>> p->prio = recalc_task_prio(p, now);
>>
>> + if (task_has_cap(p)) {
>> + inc_cap_stats_cycle(p, now);
>> + /* Background tasks are handled in effective_prio()
>> + * in order to ensure that they stay at BGND_PRIO
>> + * but we need to be careful that we don't override
>> + * it here
>> + */
>> + if (task_exceeding_cap(p) && !task_is_background(p)) {
>> + p->normal_prio = CAPPED_PRIO;
>> + /*
>> + * Don't undo any priority inheritance
>> + */
>> + if (!rt_task(p))
>> + p->prio = CAPPED_PRIO;
>> + }
>> + }
>
> Within all tasks at CAPPED_PRIO, is the priority of a task used for
> scheduling?

Unless a task is exceeding its cap it gets scheduled as per usual, and
even if it is exceeding its cap its time slice allocation is still done
as per usual, so "nice" and interactivity will still have some effect for
competing "over cap" tasks.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 08:44:26

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Balbir Singh wrote:
>
> Using a timer for releasing tasks from their sinbin sounds like a bit
> of an overhead, given that there could be 10s of thousands of tasks.

The more runnable tasks there are, the less likely it is that any of them
is exceeding its hard cap due to normal competition for the CPUs. So I
think that it's unlikely that there will ever be a very large number of
tasks in the sinbin at the same time.

> Is it possible to use the scheduler_tick() function to take a look at all
> deactivated tasks (as efficiently as possible) and activate them when
> it's time to activate them, or just fold the functionality in by defining a
> time quantum after which everyone is woken up? This time quantum
> could be the same as the time over which limits are honoured.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-27 09:26:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> >> Con Kolivas wrote:
> >>> On Friday 26 May 2006 14:20, Peter Williams wrote:
> >>>> This patch implements hard CPU rate caps per task as a proportion of a
> >>>> single CPU's capacity expressed in parts per thousand.
> >>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> >>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> >>> a lesser extent to a 0 soft cap.
> >>>
> >>> Here is how I handle idleprio tasks in current -ck:
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> >>> tags tasks that are holding a mutex
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> >>> is the idleprio policy for staircase.
> >>>
> >>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> >>> are waking up after calling down() (ie holding a semaphore).
> >> I wasn't aware that you could detect those conditions. They could be
> >> very useful.
> >
> > Isn't this exactly what the PI code is there to handle? Is something
> > more than PI needed?
> >
>
> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
> be extended. It could be argued that extending PI so that it can be
> used by non RT tasks is a worthwhile endeavour in its own right.

Hm. Looking around a bit, it appears to me that we're one itty bitty
redefine away from PI being global. No idea if/when that will happen
though.

-Mike

2006-05-28 00:11:12

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Peter Williams wrote:
> Balbir Singh wrote:
>> On 5/26/06, Peter Williams <[email protected]> wrote:
>> <snip>
>>>
>>> Notes:
>>>
>>> 1. To minimize the overhead incurred when testing to skip caps
>>> processing for
>>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>>>
>>> 2. The implementation involves the addition of two priority slots to the
>>> run queue priority arrays and this means that MAX_PRIO no longer
>>> represents the scheduling priority of the idle process and can't be
>>> used to
>>> test whether priority values are in the valid range. To alleviate this
>>> problem a new function sched_idle_prio() has been provided.
>>
>> I am a little confused by this. Why link tasks that have expired their
>> CPU bandwidth (their caps) to a priority slot? Is this a hack to continue
>> using the prio_array? Why not move such tasks to the expired array?
>
> Because it won't work: after the array switch they may get to run
> before tasks that aren't exceeding their cap (or don't have a cap).

Another important reason for using these slots is that it allows waking
tasks to preempt tasks that have exceeded their cap.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-28 02:09:59

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Mike Galbraith wrote:
> On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>>>> Con Kolivas wrote:
>>>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>>>> single CPU's capacity expressed in parts per thousand.
>>>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>>>> a lesser extent to a 0 soft cap.
>>>>>
>>>>> Here is how I handle idleprio tasks in current -ck:
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>>>> tags tasks that are holding a mutex
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>>>> is the idleprio policy for staircase.
>>>>>
>>>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>>>> are waking up after calling down() (ie holding a semaphore).
>>>> I wasn't aware that you could detect those conditions. They could be
>>>> very useful.
>>> Isn't this exactly what the PI code is there to handle? Is something
>>> more than PI needed?
>>>
>> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
>> be extended. It could be argued that extending PI so that it can be
>> used by non RT tasks is a worthwhile endeavour in its own right.
>
> Hm. Looking around a bit, it appears to me that we're one itty bitty
> redefine away from PI being global. No idea if/when that will happen
> though.

It needs slightly more than that. It's currently relying on the way
tasks with prio less than MAX_RT_PRIO are treated to prevent the
priority of tasks that are inheriting a priority from having that
priority reset to their normal priority at various places in sched.c.
So something would need to be done in that regard but it shouldn't be
too difficult.
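
For reference, the treatment in question boils down to tests of the
form used by rt_task():

	#define rt_task(p)	(unlikely((p)->prio < MAX_RT_PRIO))

Places in sched.c that rely on that test to mean "this priority may be
inherited, don't recalculate it" would instead need an explicit check,
something along the lines of this (hypothetical) macro:

	/* hypothetical: is p currently running with a boosted priority? */
	#define pi_boosted(p)	((p)->prio != (p)->normal_prio)

so that resetting prio to normal_prio doesn't quietly undo a non-RT
inherited priority.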

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-28 07:38:18

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On 5/28/06, Peter Williams <[email protected]> wrote:
> Peter Williams wrote:
> > Balbir Singh wrote:
> >> On 5/26/06, Peter Williams <[email protected]> wrote:
> >> <snip>
> >>>
> >>> Notes:
> >>>
> >>> 1. To minimize the overhead incurred when testing to skip caps
> >>> processing for
> >>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
> >>>
> >>> 2. The implementation involves the addition of two priority slots to the
> >>> run queue priority arrays and this means that MAX_PRIO no longer
> >>> represents the scheduling priority of the idle process and can't be
> >>> used to
> >>> test whether priority values are in the valid range. To alleviate this
> >>> problem a new function sched_idle_prio() has been provided.
> >>
> >> I am a little confused by this. Why tie the tasks that have expired
> >> their CPU bandwidth (i.e. exceeded their caps) to a priority slot? Is
> >> this a hack to continue using the prio_array? Why not move such tasks
> >> to the expired array?
> >
> > Because it won't work: after the array switch they may get to run
> > before tasks that aren't exceeding their cap (or don't have a cap).
>

That behaviour would be fair. Let the priority govern who gets to run
first (irrespective of their cap) and then use the cap to limit their
timeslice (execution time).

> Another important reason for using these slots is that it allows waking
> tasks to preempt tasks that have exceeded their cap.
>

But only among the tasks that have exceeded their cap (there is no
priority based scheduling within that group). As far as preemption is
concerned, if they are moved to the expired array, capped tasks will
only run when an array switch takes place. If it does, then they get
their fair share of the cap again until they exceed it.

Balbir

2006-05-28 13:35:27

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Balbir Singh wrote:
> On 5/28/06, Peter Williams <[email protected]> wrote:
>> Peter Williams wrote:
>> > Balbir Singh wrote:
>> >> On 5/26/06, Peter Williams <[email protected]> wrote:
>> >> <snip>
>> >>>
>> >>> Notes:
>> >>>
>> >>> 1. To minimize the overhead incurred when testing to skip caps
>> >>> processing for
>> >>> uncapped tasks a new flag PF_HAS_CAP has been added to flags.
>> >>>
>> >>> 2. The implementation involves the addition of two priority slots
>> to the
>> >>> run queue priority arrays and this means that MAX_PRIO no longer
>> >>> represents the scheduling priority of the idle process and can't be
>> >>> used to
>> >>> test whether priority values are in the valid range. To alleviate
>> this
>> >>> problem a new function sched_idle_prio() has been provided.
>> >>
>> >> I am a little confused by this. Why tie the tasks that have expired
>> >> their CPU bandwidth (i.e. exceeded their caps) to a priority slot? Is
>> >> this a hack to continue using the prio_array? Why not move such tasks
>> >> to the expired array?
>> >
>> > Because it won't work: after the array switch they may get to run
>> > before tasks that aren't exceeding their cap (or don't have a cap).
>>
>
> That behaviour would be fair.

Caps aren't about being fair. In fact, giving a task a cap is an
explicit instruction to the scheduler that the task should be treated
unfairly in some circumstances (namely when it's exceeding that cap).

Similarly, the interactive bonus mechanism is not about fairness either.
It's about giving tasks that are thought to be interactive an unfair
advantage so that the user experiences good responsiveness.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-28 14:42:06

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

On 5/28/06, Peter Williams <[email protected]> wrote:
<snip>

> >
> > That behaviour would be fair.
>
> Caps aren't about being fair. In fact, giving a task a cap is an
> explicit instruction to the scheduler that the task should be treated
> unfairly in some circumstances (namely when it's exceeding that cap).
>
> Similarly, the interactive bonus mechanism is not about fairness either.
> It's about giving tasks that are thought to be interactive an unfair
> advantage so that the user experiences good responsiveness.
>

I understand that, I was talking about fairness between capped tasks
and what might be considered fair or intuitive between capped tasks and
regular tasks. Of course, the last point is debatable ;)

Balbir

2006-05-28 22:47:03

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Björn Steinbrink wrote:

>>The killer problem I see with this approach is that it doesn't address
>>the divide and conquer problem. If a task is capped, and forks off
>>workers, each worker inherits the total cap, effectively extending same.
>>
>>

Yes, although the current thinking is that you need to set a special
clone() flag (which may be restricted via capabilities such as
CAP_SYS_RESOURCE) to set a new CPU scheduling namespace, so the workers
will inherit the same scheduling ns and therefore be accounted against
the one resource.
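
Roughly, the fork-time logic being contemplated would look like this
(a sketch only: CLONE_NEWSCHED, sched_ns and the helper functions are
invented names, not existing kernel interfaces):

static int copy_sched_ns(unsigned long clone_flags, struct task_struct *p)
{
	/* Default: share the parent's scheduling namespace, so that
	 * forked workers are accounted against the same CPU resource. */
	if (!(clone_flags & CLONE_NEWSCHED)) {
		p->sched_ns = get_sched_ns(current->sched_ns);
		return 0;
	}
	/* Creating a fresh namespace is the privileged operation. */
	if (!capable(CAP_SYS_RESOURCE))
		return -EPERM;
	p->sched_ns = new_sched_ns(current->sched_ns);
	return p->sched_ns ? 0 : -ENOMEM;
}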

Sorry if I'm replying out of context, I'll catch up on this thread
shortly. Thanks for including me.

>>IMHO, per task resource management is too severely limited in its
>>usefulness, because jobs are what need managing, and they're seldom
>>single threaded. In order to use per task limits to manage any given
>>job, you have to both know the number of components, and manually
>>distribute resources to each component of the job. If a job has a
>>dynamic number of components, it becomes impossible to manage.
>>
>>
>
>Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
>for processes in a "context". All processes in a context share one token
>bucket, which has a set of parameters to tune scheduling behaviour.
>As the token bucket is shared by a group of processes, and inherited by
>child processes/threads, management is quite easy. And the parameters
>can be tuned to allow different scheduling behaviours, like allowing a
>process group to burst, ie. use as much cpu time as is available, after
>being idle for some time, but being limited to X % cpu time on average.
>
>

This is correct. Basically I read the LARTC.org (which explains Linux
network schedulers etc) and the description of the Token Bucket
Scheduler inspired me to write the same thing for CPU resources. It was
originally developed for the 2.4 Alan Cox series kernels. The primary
design guarantee of the scheduler is a low total performance impact and
maximum CPU utilisation, with prioritisation and fairness secondary
concerns. It was built with the idea that people wanting different sorts
of scheduling policies could at least get a set of userland controls to
implement their approach - to the limit of the effectiveness of task
priorities.
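
For concreteness, the bucket mechanics described above work roughly
like this (an illustrative sketch: the struct and field names are
invented, this is not the actual Linux-VServer code):

struct tb_context {
	long tokens;		/* current bucket fill */
	long fill_rate;		/* tokens added every 'interval' ticks */
	long interval;
	long bucket_size;	/* maximum fill: bounds the burst */
	unsigned long last;	/* tick count at the last refill */
};

/* Charge one tick of CPU to the context.  Returns 0 when the bucket
 * has run dry, i.e. the context should be throttled until it refills. */
static int tb_charge_tick(struct tb_context *tb, unsigned long now)
{
	unsigned long elapsed = now - tb->last;

	tb->tokens += (elapsed / tb->interval) * tb->fill_rate;
	tb->last = now - (elapsed % tb->interval);	/* keep remainder */
	if (tb->tokens > tb->bucket_size)
		tb->tokens = tb->bucket_size;	/* idle time is banked, up to a limit */
	if (tb->tokens > 0)
		tb->tokens--;
	return tb->tokens > 0;
}

On average a context is then held to fill_rate/interval of a CPU, but
after idling it can burst for up to bucket_size ticks, which is exactly
the behaviour described in the quoted text.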

I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
lot of that thread is probably worth catching up on.

It would be nice if we could somehow re-use the scheduling algorithms in
use in the network space here, if it doesn't impact on performance.

For instance, the "CBQ" network scheduler is the same approach as used
in OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
approach used in VServer. The "Sched_prio" and "Sched_hard" distinction
in vserver could probably be compared to "Ingress Policing", where
available CPU that could run a process instead sits idle - similar to
the network world where data that has arrived is dropped to try to
convince the application to throttle its network activity.

As in the network space (http://lkml.org/lkml/2006/5/19/216) in this
space we have a continual scale of possible implementations, marked by a
highly efficient solution akin to "binding" at one end, and a
virtualisation at the other. Each delivers guarantees most applicable to
certain users or workloads.

Sam.

>I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
>lot better and I'm not familiar with implementation details.
>
>Regards
>Björn
>
>

2006-05-28 23:30:31

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Björn Steinbrink wrote:
>
>>> The killer problem I see with this approach is that it doesn't address
>>> the divide and conquer problem. If a task is capped, and forks off
>>> workers, each worker inherits the total cap, effectively extending same.
>>>
>>>
>
> Yes, although the current thinking is that you need to set a special
> clone() flag (which may be restricted via capabilities such as
> CAP_SYS_RESOURCE) to set a new CPU scheduling namespace, so the workers
> will inherit the same scheduling ns and therefore be accounted against
> the one resource.
>
> Sorry if I'm replying out of context, I'll catch up on this thread
> shortly. Thanks for including me.
>
>>> IMHO, per task resource management is too severely limited in its
>>> usefulness, because jobs are what need managing, and they're seldom
>>> single threaded. In order to use per task limits to manage any given
>>> job, you have to both know the number of components, and manually
>>> distribute resources to each component of the job. If a job has a
>>> dynamic number of components, it becomes impossible to manage.
>>>
>>>
>> Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
>> for processes in a "context". All processes in a context share one token
>> bucket, which has a set of parameters to tune scheduling behaviour.
>> As the token bucket is shared by a group of processes, and inherited by
>> child processes/threads, management is quite easy. And the parameters
>> can be tuned to allow different scheduling behaviours, like allowing a
>> process group to burst, ie. use as much cpu time as is available, after
>> being idle for some time, but being limited to X % cpu time on average.
>>
>>
>
> This is correct. Basically I read the LARTC.org (which explains Linux
> network schedulers etc) and the description of the Token Bucket
> Scheduler inspired me to write the same thing for CPU resources. It was
> originally developed for the 2.4 Alan Cox series kernels. The primary
> design guarantee of the scheduler is a low total performance impact and
> maximum CPU utilisation, with prioritisation and fairness secondary
> concerns. It was built with the idea that people wanting different sorts
> of scheduling policies could at least get a set of userland controls to
> implement their approach - to the limit of the effectiveness of task
> priorities.
>
> I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
> lot of that thread is probably worth catching up on.
>
> It would be nice if we could somehow re-use the scheduling algorithms in
> use in the network space here, if it doesn't impact on performance.
>
> For instance, the "CBQ" network scheduler is the same approach as used
> in OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
> approach used in VServer. The "Sched_prio" and "Sched_hard" distinction
> in vserver could probably be compared to "Ingress Policing", where
> available CPU that could run a process instead sits idle - similar to
> the network world where data that has arrived is dropped to try to
> convince the application to throttle its network activity.
>
> As in the network space (http://lkml.org/lkml/2006/5/19/216) in this
> space we have a continual scale of possible implementations, marked by a
> highly efficient solution akin to "binding" at one end, and a
> virtualisation at the other. Each delivers guarantees most applicable to
> certain users or workloads.
>
> Sam.
>
>> I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
>> lot better and I'm not familiar with implementation details.

Have you considered adding an implementation of these schedulers to
PlugSched?

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-28 23:27:29

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Balbir Singh wrote:
> On 5/28/06, Peter Williams <[email protected]> wrote:
> <snip>
>
>> >
>> > That behaviour would be fair.
>>
>> Caps aren't about being fair. In fact, giving a task a cap is an
>> explicit instruction to the scheduler that the task should be treated
>> unfairly in some circumstances (namely when it's exceeding that cap).
>>
>> Similarly, the interactive bonus mechanism is not about fairness either.
>> It's about giving tasks that are thought to be interactive an unfair
>> advantage so that the user experiences good responsiveness.
>>
>
> I understand that, I was talking about fairness between capped tasks
> and what might be considered fair or intuitive between capped tasks and
> regular tasks. Of course, the last point is debatable ;)

Well, the primary fairness mechanism in the scheduler is the time slice
allocation and the capping code doesn't fiddle with those so there
should be a reasonable degree of fairness (taking into account "nice")
between capped tasks. To improve that would require allocating several
new priority slots for use by tasks exceeding their caps and fiddling
with those. I don't think that it's worth the bother.

When capped tasks aren't exceeding their cap they are treated just like
any other task and will get the same amount of fairness.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-29 03:09:49

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Peter Williams wrote:

>>This is correct. Basically I read the LARTC.org (which explains Linux
>>network schedulers etc) and the description of the Token Bucket
>>Scheduler inspired me to write the same thing for CPU resources. It was
>>originally developed for the 2.4 Alan Cox series kernels. The primary
>>[...]
>>I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
>>lot of that thread is probably worth catching up on.
>>[...]
>>
>>
>Have you considered adding an implementation of these schedulers to
>PlugSched?
>
>

No, I haven't; I'd be happy to do so, given appropriate pointers to a
codebase I can produce commits for. Is there a public git tree for the
patches, or a series of split out patches? I see only combined patches
on the SourceForge site.

Sam.

2006-05-29 03:41:51

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Peter Williams wrote:
>
>>> This is correct. Basically I read the LARTC.org (which explains Linux
>>> network schedulers etc) and the description of the Token Bucket
>>> Scheduler inspired me to write the same thing for CPU resources. It was
>>> originally developed for the 2.4 Alan Cox series kernels. The primary
>>> [...]
>>> I most recently described this at http://lkml.org/lkml/2006/3/29/59, a
>>> lot of that thread is probably worth catching up on.
>>> [...]
>>>
>>>
>> Have you considered adding an implementation of these schedulers to
>> PlugSched?
>>
>>
>
> No, I haven't; I'd be happy to do so, given appropriate pointers to a
> codebase I can produce commits for. Is there a public git tree for the
> patches, or a series of split out patches?

Yes, but not yet publicly available. I use quilt to keep the patch
series up to date and manage the changes as a relatively large series
(30 or so) to make it easier for me to cope with changes in the kernel.
When I do the next release I'll make a tarball of the patch series
available.

Of course, if you're eager to start right away I could make the
2.6.17-rc4-mm1 one available?

> I see only combined patches
> on the SourceForge site.

Yes, I'm trying not to be too greedy in my disk space use :-)

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-29 21:17:04

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Peter Williams wrote:

>Yes, but not yet publicly available. I use quilt to keep the patch
>series up to date and manage the changes as a relatively large series
>(30 or so) to make it easier for me to cope with changes in the kernel.
>When I do the next release I'll make a tarball of the patch series
>available.
>
>Of course, if you're eager to start right away I could make the
>2.6.17-rc4-mm1 one available?
>
>

Well a piecewise patchset does make it a lot easier to see what's going
on, especially if it's got descriptions of each patch along the way.
I'd certainly be interested in having a look through the split out patch
to see how namespaces and this advanced scheduling system might
interoperate.

Sam.

2006-05-29 23:12:26

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Peter Williams wrote:
>
>> Yes, but not yet publicly available. I use quilt to keep the patch
>> series up to date and manage the changes as a relatively large series
>> (30 or so) to make it easier for me to cope with changes in the kernel.
>> When I do the next release I'll make a tarball of the patch series
>> available.
>>
>> Of course, if you're eager to start right away I could make the
>> 2.6.17-rc4-mm1 one available?
>>
>>
>
> Well a piecewise patchset does make it a lot easier to see what's going
> on, especially if it's got descriptions of each patch along the way.

It's a bit light on descriptions at the moment :-( as I keep putting
that in the "do later" bin.

> I'd certainly be interested in having a look through the split out patch
> to see how namespaces and this advanced scheduling system might
> interoperate.

OK. I've tried very hard to make the scheduling code orthogonal to
everything else and it essentially separates out the scheduling within a
CPU from other issues e.g. load balancing. This separation is
sufficiently good for me to have merged PlugSched with an earlier
version of CKRM's CPU management module in a way that made each of
PlugSched's schedulers available within CKRM's infrastructure. (CKRM
have radically rewritten their CPU code since then and I haven't
bothered to keep up.)

The key point that I'm trying to make is that I would expect PlugSched
and namespaces to coexist without any problems. How it integrates with
the "advanced" scheduling system would depend on how that system alters
things such as load balancing and/or whether it goes for scheduling
outcomes at a higher level than the task.

I'm assuming that you're happy to wait for the next release? That will
improve the likelihood of descriptions in the patches :-).

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-30 02:08:03

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Peter Williams wrote:

>>I'd certainly be interested in having a look through the split out patch
>>to see how namespaces and this advanced scheduling system might
>>interoperate.
>>
>>
>
>OK. I've tried very hard to make the scheduling code orthogonal to
>everything else and it essentially separates out the scheduling within a
>CPU from other issues e.g. load balancing. This separation is
>sufficiently good for me to have merged PlugSched with an earlier
>version of CKRM's CPU management module in a way that made each of
>PlugSched's schedulers available within CKRM's infrastructure. (CKRM
>have radically rewritten their CPU code since then and I haven't
>bothered to keep up.)
>
>The key point that I'm trying to make is that I would expect PlugSched
>and namespaces to coexist without any problems. How it integrates with
>the "advanced" scheduling system would depend on how that system alters
>things such as load balancing and/or whether it goes for scheduling
>outcomes at a higher level than the task.
>
>

Coexisting is the baseline and I don't think they'll 'interfere' with
each other, per se, but we specifically want to make it easy for
userland to set up and configure scheduling policies to apply to groups
of processes.

For instance, the vserver scheduling extension I wrote changes
scheduling policy so that the interactivity bonus is reduced to -5 ..
-5, and -5 .. +15 is given as a bonus or penalty for an entire vserver
that is currently below or above its allocated CPU quotient. In this
case the scheduling algorithm hasn't changed, just more feedback is
given into the effective priorities of the processes being scheduled.
ie, there are now two "inputs" (task and vserver) to the existing scheduler.
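
In sketch form (the helper names are invented and the ranges are those
given above; the point is that the existing algorithm is unchanged, it
just sees one more input):

static int effective_prio_sketch(struct task_struct *p)
{
	int prio = p->static_prio;

	prio -= task_bonus(p);		/* per-task interactivity bonus */
	prio += vserver_penalty(p);	/* negative (a bonus) while the
					 * whole vserver is below its CPU
					 * quotient, positive above it */
	return prio;	/* the real code clamps to the valid range */
}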

I guess the big question is - is there a corresponding concept in
PlugSched? for instance, is there a reference in the task_struct to the
current scheduling domain, or is it more CKRM-style with classification
modules?

If there is a reference in the task_struct to some set of scheduling
counters, maybe we could squint and say that looks like a namespace, and
throw it into the nsproxy.

>I'm assuming that you're happy to wait for the next release? That will
>improve the likelihood of descriptions in the patches :-).
>
>

Let's keep the technical dialogue going for now, then.

Now, forgive me if I'm preaching to the vicar here, but have you tried
using Stacked Git for the patch development? I find the way that you
build patch descriptions as you go along, can easily wind the commit
stack to work on individual patches and import other people's work to be
very simple and powerful.

http://www.procode.org/stgit/

I just mention this because you're not the first person I've talked to
using Quilt to express some difficulty in producing incremental patchset
snapshots. Not having used Quilt myself I'm unsure whether this is a
deficiency or just "the way it is" once a patch set gets big.

Sam.

2006-05-30 02:45:30

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Peter Williams wrote:
>
>>> I'd certainly be interested in having a look through the split out patch
>>> to see how namespaces and this advanced scheduling system might
>>> interoperate.
>>>
>>>
>> OK. I've tried very hard to make the scheduling code orthogonal to
>> everything else and it essentially separates out the scheduling within a
>> CPU from other issues e.g. load balancing. This separation is
>> sufficiently good for me to have merged PlugSched with an earlier
>> version of CKRM's CPU management module in a way that made each of
>> PlugSched's schedulers available within CKRM's infrastructure. (CKRM
>> have radically rewritten their CPU code since then and I haven't
>> bothered to keep up.)
>>
>> The key point that I'm trying to make is that I would expect PlugSched
>> and namespaces to coexist without any problems. How it integrates with
>> the "advanced" scheduling system would depend on how that system alters
>> things such as load balancing and/or whether it goes for scheduling
>> outcomes at a higher level than the task.
>>
>>
>
> Coexisting is the baseline and I don't think they'll 'interfere' with
> each other, per se, but we specifically want to make it easy for
> userland to set up and configure scheduling policies to apply to groups
> of processes.

They shouldn't interfere as which scheduler to use is a boot time
selection and only one scheduler is in force. It's mainly a coding
matter and in particular whether the "scheduler driver" interface would
need to be modified or whether your scheduler can be implemented using
the current interface.
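
For reference, the sort of "scheduler driver" interface meant here
looks roughly like this (the member names are illustrative, not
PlugSched's actual definition):

struct sched_drv {
	const char *name;	/* selected by a boot-time option */
	void (*enqueue_task)(struct runqueue *rq, struct task_struct *p);
	void (*dequeue_task)(struct runqueue *rq, struct task_struct *p);
	struct task_struct *(*next_task)(struct runqueue *rq);
	void (*tick)(struct runqueue *rq, struct task_struct *p);
};

A scheduler that needs inputs the interface doesn't carry (e.g. a
per-group quota) is the case where the interface itself would have to
grow.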

>
> For instance, the vserver scheduling extension I wrote changes
> scheduling policy so that the interactivity bonus is reduced to -5 ..
> -5, and -5 .. +15 is given as a bonus or penalty for an entire vserver
> that is currently below or above its allocated CPU quotient. In this
> case the scheduling algorithm hasn't changed, just more feedback is
> given into the effective priorities of the processes being scheduled.
> ie, there are now two "inputs" (task and vserver) to the existing scheduler.
>
> I guess the big question is - is there a corresponding concept in
> PlugSched? for instance, is there a reference in the task_struct to the
> current scheduling domain, or is it more CKRM-style with classification
> modules?

It uses the standard run queue structure with per scheduler
modifications (via a union) to handle the different ways that the
schedulers manage priority arrays (so yes). As I said it restricts
itself to scheduling matters within each run queue and leaves the wider
aspects to the normal code.

At first guess, it sounds like adding your scheduler could be as simple
as taking a copy of ingosched.c (which is the implementation of the
standard scheduler within PlugSched) and then making your modifications.
You could probably even share the same run queue components but
there's nothing to stop you adding new ones.

Each scheduler can also have its own per task data via a union in the
task struct.
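
In outline (the struct and member names are invented; only the shape
matters):

struct runqueue {
	spinlock_t lock;
	unsigned long nr_running;
	/* ... fields common to every scheduler ... */
	union {
		struct ingo_rq ingo;	/* standard scheduler's arrays */
		struct your_rq yours;	/* whatever your scheduler needs */
	} qu;
};

/* ... and similarly for the per-task data in the task struct: */
union task_sched_data {
	struct ingo_task_data ingo;
	struct your_task_data yours;
};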

>
> If there is a reference in the task_struct to some set of scheduling
> counters, maybe we could squint and say that looks like a namespace, and
> throw it into the nsproxy.

Depends on the scheduler.

>
>> I'm assuming that you're happy to wait for the next release? That will
>> improve the likelihood of descriptions in the patches :-).
>>
>>
>
> Let's keep the technical dialogue going for now, then.

OK. I'm waiting for the next -mm kernel before I make the next release.

>
> Now, forgive me if I'm preaching to the vicar here, but have you tried
> using Stacked Git for the patch development?

No, I actually use the gquilt GUI wrapper for quilt
<http://freshmeat.net/projects/gquilt/> and, although I've modified it
to use a generic interface to the underlying patch management system
(a.k.a. back end), I haven't yet modified it to use Stacked GIT as a
back end. I have thought about it and it was the primary motivation for
adding the generic interface but I ran out of enthusiasm.

> I find the way that you
> build patch descriptions as you go along, can easily wind the commit
> stack to work on individual patches and import other people's work to be
> very simple and powerful.
>
> http://www.procode.org/stgit/
>
> I just mention this because you're not the first person I've talked to
> using Quilt to express some difficulty in producing incremental patchset
> snapshots. Not having used Quilt myself I'm unsure whether this is a
> deficiency or just "the way it is" once a patch set gets big.

Making quilt easier to use is why I wrote gquilt :-)

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-30 22:05:36

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Peter Williams wrote:

>They shouldn't interfere as which scheduler to use is a boot time
>selection and only one scheduler is in force. It's mainly a coding
>matter and in particular whether the "scheduler driver" interface would
>need to be modified or whether your scheduler can be implemented using
>the current interface.
>
>

Yes, that's the key issue I think - the interface now has more inputs.

>>I guess the big question is - is there a corresponding concept in
>>> PlugSched? for instance, is there a reference in the task_struct to the
>>> current scheduling domain, or is it more CKRM-style with classification
>>> modules?
>>
>>
> It uses the standard run queue structure with per scheduler
> modifications (via a union) to handle the different ways that the
> schedulers manage priority arrays (so yes). As I said it restricts
> itself to scheduling matters within each run queue and leaves the
> wider aspects to the normal code.


Ok, so there is no existing "classification" abstraction? The
classification is tied to the scheduler implementation?

>At first guess, it sounds like adding your scheduler could be as simple
>as taking a copy of ingosched.c (which is the implementation of the
>standard scheduler within PlugSched) and then making your modifications.
> You could probably even share the same run queue components but
>there's nothing to stop you adding new ones.
>
>Each scheduler can also have its own per task data via a union in the
>task struct.
>
>

Ok, sounds like that problem is solved - just the classification one
remaining.

>OK. I'm waiting for the next -mm kernel before I make the next release.
>
>

Looking forward to it.

>>Now, forgive me if I'm preaching to the vicar here, but have you tried
>>using Stacked Git for the patch development?
>>
>>
>
>No, I actually use the gquilt GUI wrapper for quilt
><http://freshmeat.net/projects/gquilt/> and, although I've modified it
>to use a generic interface to the underlying patch management system
>(a.k.a. back end), I haven't yet modified it to use Stacked GIT as a
>back end. I have thought about it and it was the primary motivation for
>adding the generic interface but I ran out of enthusiasm.
>
>

Hmm, guess the vicar disclaimer was a good one to make.

Well maybe you'll find the attached file motivating, then.

Sam.


Attachments:
gquilt_stgit.py (10.42 kB)

2006-05-30 23:22:17

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Peter Williams wrote:
>
>> They shouldn't interfere as which scheduler to use is a boot time
>> selection and only one scheduler is in force. It's mainly a coding
>> matter and in particular whether the "scheduler driver" interface would
>> need to be modified or whether your scheduler can be implemented using
>> the current interface.
>>
>>
>
> Yes, that's the key issue I think - the interface now has more inputs.
>
>>> I guess the big question is - is there a corresponding concept in
>>>> PlugSched? for instance, is there a reference in the task_struct to the
>>>> current scheduling domain, or is it more CKRM-style with classification
>>>> modules?
>>>
>>>
>> It uses the standard run queue structure with per scheduler
>> modifications (via a union) to handle the different ways that the
>> schedulers manage priority arrays (so yes). As I said it restricts
>> itself to scheduling matters within each run queue and leaves the
>> wider aspects to the normal code.
>
>
> Ok, so there is no existing "classification" abstraction? The
> classification is tied to the scheduler implementation?

Yes.

>
>> At first guess, it sounds like adding your scheduler could be as simple
>> as taking a copy of ingosched.c (which is the implementation of the
>> standard scheduler within PlugSched) and then making your modifications.
>> You could probably even share the same run queue components but
>> there's nothing to stop you adding new ones.
>>
>> Each scheduler can also have its own per task data via a union in the
>> task struct.
>>
>>
>
> Ok, sounds like that problem is solved - just the classification one
> remaining.
>
>> OK. I'm waiting for the next -mm kernel before I make the next release.
>>
>>
>
> Looking forward to it.

Andrew released 2.6.17-rc5-mm1 yesterday so I should have a new version
in a day or two.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-30 23:25:25

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 0/5] sched: Add CPU rate caps

Sam Vilain wrote:
> Peter Williams wrote:
>
>> No, I actually use the gquilt GUI wrapper for quilt
>> <http://freshmeat.net/projects/gquilt/> and, although I've modified it
>> to use a generic interface to the underlying patch management system
>> (a.k.a. back end), I haven't yet modified it to use Stacked GIT as a
>> back end. I have thought about it and it was the primary motivation for
>> adding the generic interface but I ran out of enthusiasm.
>>
>>
>
> Hmm, guess the vicar disclaimer was a good one to make.
>
> Well maybe you'll find the attached file motivating, then.

Maybe.

Thanks
Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-31 13:12:56

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

>> Using a timer for releasing tasks from their sinbin sounds like a bit
>> of an overhead. Given that there could be 10s of thousands of tasks.
>
>
> The more runnable tasks there are the less likely it is that any of them
> is exceeding its hard cap due to normal competition for the CPUs. So I
> think that it's unlikely that there will ever be a very large number of
> tasks in the sinbin at the same time.
for containers this can be untrue... :( Actually, even for 1000 tasks
(I suppose this is the maximum in your case) it can slow down
significantly as well.

>> Is it possible to use the scheduler_tick() function to take a look
>> at all deactivated tasks (as efficiently as possible) and activate
>> them when it's time to activate them, or just fold the functionality
>> in by defining a time quantum after which everyone is woken up. This
>> time quantum could be the same as the time over which limits are
>> honoured.
agree with it.

Kirill

2006-05-31 13:19:14

by Kirill Korotaev

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

>> I understand that, I was talking about fairness between capped tasks
>> and what might be considered fair or intuitive between capped tasks and
>> regular tasks. Of course, the last point is debatable ;)
>
>
> Well, the primary fairness mechanism in the scheduler is the time slice
> allocation and the capping code doesn't fiddle with those so there
> should be a reasonable degree of fairness (taking into account "nice")
> between capped tasks. To improve that would require allocating several
> new priority slots for use by tasks exceeding their caps and fiddling
> with those. I don't think that it's worth the bother.
I suppose it should still be handled; a subjective feeling :)

BTW, do you have any test results for your patch?
It would be interesting to see how precise these limitations are and
whether or not we should bother about the above...

Kirill

2006-05-31 16:04:51

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>>
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs. So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
>
> for containers this can be untrue... :( Actually, even for 1000 tasks
> (I suppose this is the maximum in your case) it can slow down
> significantly as well.

Do you have any documented requirements for container resource management?
Is there a minimum list of features, and nice-to-have features, for
containers as far as resource management is concerned?


>
>>> Is it possible to use the scheduler_tick() function to take a look
>>> at all deactivated tasks (as efficiently as possible) and activate
>>> them when it's time to activate them, or just fold the functionality
>>> in by defining a time quantum after which everyone is woken up. This
>>> time quantum could be the same as the time over which limits are
>>> honoured.
>
> agree with it.

Thinking a bit more along these lines, it would probably break the O(1)
property. But I guess a good algorithm can amortize the cost.
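
Something along these lines (a sketch: the list and field names are
invented, not the patch's) would keep sinbinned tasks on a list sorted
by release time and drain the due ones from scheduler_tick(), so the
per-tick cost is amortized over the tasks actually released:

struct sinbin {
	spinlock_t lock;
	struct list_head tasks;		/* sorted by release time */
};

static void sinbin_release_due(struct sinbin *sb, unsigned long now)
{
	struct task_struct *p, *n;

	spin_lock(&sb->lock);
	list_for_each_entry_safe(p, n, &sb->tasks, sinbin_list) {
		if (time_before(now, p->sinbin_release_time))
			break;	/* sorted: the rest aren't due yet */
		list_del(&p->sinbin_list);
		__activate_task_sketch(p);	/* back onto a run queue */
	}
	spin_unlock(&sb->lock);
}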

>
> Kirill
>
--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-05-31 18:07:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

On Wed, 2006-05-31 at 21:29 +0530, Balbir Singh wrote:

> Do you have any documented requirements for container resource management?

(?? where would that come from?)

Containers, I can imagine ~working (albeit I don't see why the num_tasks
dilution problem shouldn't apply to num_containers... it's the same
thing, stale info)

2006-05-31 23:28:48

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 3/5] sched: Add CPU rate hard caps

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs. So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
> for containers this can be untrue...

Why will this be untrue for containers?

> :( Actually, even for 1000 tasks (I
> suppose this is the maximum in your case) it can slow down
> significantly as well.
>
>>> Is it possible to use the scheduler_tick() function to take a look
>>> at all deactivated tasks (as efficiently as possible) and activate
>>> them when it's time to activate them, or just fold the functionality
>>> in by defining a time quantum after which everyone is woken up. This
>>> time quantum could be the same as the time over which limits are
>>> honoured.
> agree with it.

If there are a lot of RUNNABLE (i.e. on a run queue) tasks then normal
competition will mean that their CPU usage rates are small and
therefore unlikely to be greater than their cap. (E.g. with 100
runnable tasks sharing one CPU, each averages only about 1% of a CPU,
so only caps below 10 parts per thousand could be exceeded.) The
sinbin is only used for tasks that are EXCEEDING their cap.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-31 23:39:35

by Peter Williams

[permalink] [raw]
Subject: Re: [RFC 2/5] sched: Add CPU rate soft caps

Kirill Korotaev wrote:
>>> I understand that, I was talking about fairness between capped tasks
>>> and what might be considered fair or intutive between capped tasks and
>>> regular tasks. Of course, the last point is debatable ;)
>>
>>
>> Well, the primary fairness mechanism in the scheduler is the time
>> slice allocation and the capping code doesn't fiddle with those so
>> there should be a reasonable degree of fairness (taking into account
>> "nice") between capped tasks. To improve that would require
>> allocating several new priority slots for use by tasks exceeding their
>> caps and fiddling with those. I don't think that it's worth the bother.

I think more needs to be said about the fairness issue.

1. If a task is getting its cap or more, then it's getting its fair
share as specified by that cap. Yes?

2. If a task is getting less CPU usage than its cap, then it is being
scheduled just as if it had no cap and is getting its fair share just
as much as any task is.

So there is no fairness problem.

> I suppose it should still be handled; a subjective feeling :)
>
> BTW, do you have any test results for your patch?
> It would be interesting to see how precise these limitations are and
> whether or not we should bother about the above...

I tend to test by observing the results of setting caps on running tasks
and this doesn't generate something that can be e-mailed.

Observations indicate that hard caps are enforced to within less than
1%, and likewise for soft caps, except for small soft caps: there, as
stated in the patches, enforcement is deliberately not fully strict (to
prevent priority inversion and starvation), so the cap is generally
exceeded. I'm currently making modifications (based on suggestions by
Con Kolivas) that implement an alternative method for avoiding priority
inversion and starvation and allow the enforcement to be stricter.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce