2007-01-26 06:01:51

by Srivatsa Vaddagiri

Subject: [RFC] Fair-user scheduler

The current Linux CPU scheduler doesn't recognize process aggregates while
allocating bandwidth. As a result, a user could simply spawn a large number
of processes and get more bandwidth than other users.
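For example, with the stock scheduler, if user A runs a single CPU-bound task
while user B runs nine such tasks, user B ends up with roughly 90% of the CPU
even though both are single users.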

Here's a patch that provides fair allocation for all users in a system.

Some benchmark numbers with and without the patch applied follow:


user "vatsa" user "guest"
(make -s -j4 bzImage) (make -s -j20 bzImage)

2.6.20-rc5 472.07s (real) 257.48s (real)
2.6.20-rc5+fairsched 766.74s (real) 766.73s (real)


(Numbers taken on a 2-way Intel x86_64 box)

Eventually something like this can be extended to do weighted fair share
scheduling for:

- KVM
- containers
- resource management

Salient features of the patch:

- Based on Ingo's RLIMIT_RT_CPU patch [1]. The primary differences between
  the RLIMIT_RT_CPU patch and this one are that this patch handles
  starvation of lower-priority tasks in a group and that accounting
  is token based (rather than using a decaying average). A simplified
  sketch of the token idea follows this list.

- Retains existing one-runqueue-per-cpu design

- Breaks O(1) scheduling (ouch!).
  The best way to avoid this is to split the runqueue to be per-user and
  per-cpu, which I have not implemented in order to keep the patch simple.

- Fairsched aware SMP load balance NOT addressed (yet)
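
To make the token-based accounting idea concrete, here is a simplified,
standalone sketch (identifiers such as demo_user and DEMO_WINDOW are made up
for illustration and are not the names used in the actual patch):

struct demo_user {
	long tokens;		/* ticks left in the current window */
	int  cpu_limit;		/* percent of CPU this user may consume */
};

#define DEMO_WINDOW	100	/* scheduler ticks per control window */

/* At each new window, refill tokens in proportion to the user's limit. */
static void demo_refill(struct demo_user *u)
{
	u->tokens = u->cpu_limit * DEMO_WINDOW / 100;
}

/* Charge one tick of CPU to the user; called from the timer tick. */
static void demo_charge_tick(struct demo_user *u)
{
	u->tokens--;
}

/* Once tokens are exhausted, the user's tasks should not be picked
 * again until the next window refills them. */
static int demo_over_limit(struct demo_user *u)
{
	return u->tokens <= 0;
}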

Comments/flames welcome!


References:

1. http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc2/2.6.11-rc2-mm2/broken-out/rlimit_rt_cpu.patch

--
Regards,
vatsa


2007-01-26 06:03:19

by Srivatsa Vaddagiri

Subject: [PATCH 1/2] core scheduler changes


This patch does several things:

- Introduces the notion of a control window (currently set at 1
  sec - ideally the window size should be adjusted based on the
  number of users to avoid rapid context switches). The bandwidth of each
  user is controlled within this window. rq->last_update tracks where
  we are in the current window. (A worked example of the token
  arithmetic follows this list.)

- Modifies scheduler_tick() to account cpu bandwidth consumption
by a task group. Basically bandwidth consumed by a task is
charged to itself (p->time_slice) -and- to its group as well.

- A task is forced off the CPU once its group has exhausted its
  bandwidth in the current control window. Such a task is also
  marked as "starving".

- schedule() avoids picking tasks whose group has exhausted its
  bandwidth in the current control window. Any task (with non-zero
  p->time_slice) that is not picked to run in schedule() for this
  reason is marked "starving".

- If a group has bandwidth left and it has starving tasks, then
schedule() prefers picking such tasks over non-starving tasks.
This will avoid starvation of lower-priority tasks in a group.
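
As a worked example of the token arithmetic (assuming HZ=1000 and a user
whose cpu_limit is 50): at the start of each 1-second window the user's
per-cpu token count is topped up by 50 * 1000 / 100 = 500 tokens, and one
token is consumed for every tick one of the user's tasks runs, so after
roughly half a second of CPU time in that window the group is throttled
until the window rolls over.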


Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---


diff -puN include/linux/sched.h~cpu-controller-based-on-rtlimit_rt_cpu-patch include/linux/sched.h
--- linux-2.6.20-rc5/include/linux/sched.h~cpu-controller-based-on-rtlimit_rt_cpu-patch 2007-01-19 15:17:27.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/include/linux/sched.h 2007-01-26 09:04:07.000000000 +0530
@@ -531,6 +531,10 @@ struct signal_struct {
#define is_rt_policy(p) ((p) != SCHED_NORMAL && (p) != SCHED_BATCH)
#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy))

+#ifdef CONFIG_FAIRSCHED
+struct cpu_usage;
+#endif
+
/*
* Some day this will be a full-fledged user tracking system..
*/
@@ -555,6 +559,10 @@ struct user_struct {
/* Hash table maintenance information */
struct list_head uidhash_list;
uid_t uid;
+#ifdef CONFIG_FAIRSCHED
+ int cpu_limit;
+ struct cpu_usage *cpu_usage;
+#endif
};

extern struct user_struct *find_user(uid_t);
@@ -1137,6 +1145,9 @@ static inline void put_task_struct(struc
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
+#ifdef CONFIG_FAIRSCHED
+#define PF_STARVING 0x00000010 /* Task starving for CPU */
+#endif
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
diff -puN kernel/sched.c~cpu-controller-based-on-rtlimit_rt_cpu-patch kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~cpu-controller-based-on-rtlimit_rt_cpu-patch 2007-01-19 15:17:27.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c 2007-01-26 09:04:07.000000000 +0530
@@ -266,6 +266,9 @@ struct rq {
unsigned long ttwu_local;
#endif
struct lock_class_key rq_lock_key;
+#ifdef CONFIG_FAIRSCHED
+ unsigned long last_update;
+#endif
};

static DEFINE_PER_CPU(struct rq, runqueues);
@@ -710,6 +713,126 @@ enqueue_task_head(struct task_struct *p,
p->array = array;
}

+#ifdef CONFIG_FAIRSCHED
+
+struct cpu_usage {
+ long tokens;
+ unsigned long last_update;
+ int starve_count;
+};
+
+#define task_starving(p) (p->flags & PF_STARVING)
+
+/* Mark a task starving - either we shortcircuited its timeslice or we didnt
+ * pick it to run (because user ran out of bandwidth limit in current epoch).
+ */
+static inline void set_tsk_starving(struct task_struct *p)
+{
+ struct user_struct *user = p->user;
+ struct cpu_usage *cu;
+
+ if (task_starving(p) || !user->cpu_limit)
+ return;
+
+ cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+ cu->starve_count++;
+ p->flags |= PF_STARVING;
+}
+
+/* Clear a task's starving flag */
+static inline void clear_tsk_starving(struct task_struct *p)
+{
+ struct user_struct *user = p->user;
+ struct cpu_usage *cu;
+
+ if (!task_starving(p) || !user->cpu_limit)
+ return;
+
+ cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+ cu->starve_count--;
+ p->flags &= ~PF_STARVING;
+}
+
+/* Does the task's group have starving tasks? */
+static inline int is_user_starving(struct task_struct *p)
+{
+ struct user_struct *user = p->user;
+ struct cpu_usage *cu;
+
+ if (!user->cpu_limit)
+ return 0;
+
+ cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+ if (cu->starve_count)
+ return 1;
+
+ return 0;
+}
+
+/* Are we past the 1-sec control window? If so, all groups get to renew their
+ * expired tokens.
+ */
+static inline void adjust_control_window(void)
+{
+ struct rq *rq = this_rq();
+ unsigned long delta;
+
+ delta = jiffies - rq->last_update;
+ if (delta >= HZ)
+ rq->last_update += (delta/HZ) * HZ;
+}
+
+/* Account group's cpu usage */
+static inline void inc_cpu_usage(struct task_struct *p)
+{
+ struct user_struct *user = p->user;
+ struct cpu_usage *cu;
+
+ adjust_control_window();
+
+ if (!user->cpu_limit)
+ return;
+
+ cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+ cu->tokens--;
+}
+
+static inline int task_over_cpu_limit(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ struct user_struct *user = p->user;
+ struct cpu_usage *cu;
+
+ adjust_control_window();
+
+ if (!user->cpu_limit)
+ return 0;
+
+ cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
+ if (cu->last_update != rq->last_update) {
+ /* Replenish tokens */
+ cu->tokens += user->cpu_limit * HZ / 100;
+ cu->last_update = rq->last_update;
+ }
+
+ if (cu->tokens <= 0)
+ return 1;
+
+ return 0;
+}
+
+#else
+
+#define task_starving(p) 0
+
+static void inc_cpu_usage(struct task_struct *p) { }
+static int task_over_cpu_limit(struct task_struct *p) { return 0; }
+static void set_tsk_starving(struct task_struct *p) { }
+static void clear_tsk_starving(struct task_struct *p) { }
+static int is_user_starving(struct task_struct *p) { return 0;}
+
+#endif /* CONFIG_FAIRSCHED */
+
/*
* __normal_prio - return the priority that is based on the static
* priority but is modified by bonuses/penalties.
@@ -1607,6 +1730,9 @@ void fastcall sched_fork(struct task_str
/* Want to start with kernel preemption disabled. */
task_thread_info(p)->preempt_count = 1;
#endif
+#ifdef CONFIG_FAIRSCHED
+ p->flags &= ~PF_STARVING;
+#endif
/*
* Share the timeslice between parent and child, thus the
* total amount of pending timeslices in the system doesn't change,
@@ -2074,6 +2200,7 @@ static void pull_task(struct rq *src_rq,
{
dequeue_task(p, src_array);
dec_nr_running(p, src_rq);
+ clear_tsk_starving(p);
set_task_cpu(p, this_cpu);
inc_nr_running(p, this_rq);
enqueue_task(p, this_array);
@@ -3137,6 +3264,9 @@ static void task_running_tick(struct rq
return;
}
spin_lock(&rq->lock);
+
+ inc_cpu_usage(p);
+
/*
* The task was running during this tick - update the
* time slice counter. Note: we do not update a thread's
@@ -3163,17 +3293,18 @@ static void task_running_tick(struct rq
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
p->first_time_slice = 0;

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+ if (!TASK_INTERACTIVE(p) || expired_starving(rq)
+ || task_over_cpu_limit(p)) {
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
rq->best_expired_prio = p->static_prio;
} else
enqueue_task(p, rq->active);
+ goto out_unlock;
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3200,6 +3331,14 @@ static void task_running_tick(struct rq
set_tsk_need_resched(p);
}
}
+
+ if (task_over_cpu_limit(p)) {
+ dequeue_task(p, rq->active);
+ set_tsk_need_resched(p);
+ enqueue_task(p, rq->expired);
+ set_tsk_starving(p);
+ }
+
out_unlock:
spin_unlock(&rq->lock);
}
@@ -3416,7 +3555,7 @@ asmlinkage void __sched schedule(void)
struct list_head *queue;
unsigned long long now;
unsigned long run_time;
- int cpu, idx, new_prio;
+ int cpu, idx, new_prio, array_switch;
long *switch_count;
struct rq *rq;

@@ -3478,6 +3617,7 @@ need_resched_nonpreemptible:
else {
if (prev->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
+ clear_tsk_starving(prev);
deactivate_task(prev, rq);
}
}
@@ -3493,11 +3633,15 @@ need_resched_nonpreemptible:
}
}

+ array_switch = 0;
+
+pick_next_task:
array = rq->active;
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
+ array_switch++;
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
@@ -3510,6 +3654,25 @@ need_resched_nonpreemptible:
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

+ /* If we have done an array switch twice, it means we cant find any
+ * task which isn't above its limit and hence we just run the
+ * first task on the active array.
+ */
+ if (array_switch < 2 && (task_over_cpu_limit(next) ||
+ (!task_starving(next) && is_user_starving(next)))) {
+ dequeue_task(next, rq->active);
+ enqueue_task(next, rq->expired);
+ if (next->time_slice)
+ set_tsk_starving(next);
+ goto pick_next_task;
+ }
+
+ if (task_over_cpu_limit(next))
+ rq->last_update = jiffies;
+ if (!next->time_slice)
+ next->time_slice = task_timeslice(next);
+ clear_tsk_starving(next);
+
if (!rt_task(next) && interactive_sleep(next->sleep_type)) {
unsigned long long delta = now - next->timestamp;
if (unlikely((long long)(now - next->timestamp) < 0))
@@ -5061,6 +5224,7 @@ static int __migrate_task(struct task_st
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

+ clear_tsk_starving(p);
set_task_cpu(p, dest_cpu);
if (p->array) {
/*
_

--
Regards,
vatsa

2007-01-26 06:05:06

by Srivatsa Vaddagiri

Subject: [PATCH 2/2] Track number of users in the system

This patch tracks the number of users in the system and divides CPU bandwidth
equally among them.
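For illustration: with four non-root users present, recalc_user_limits()
below gives each a cpu_limit of 100/4 = 25, i.e. 25% of each CPU per control
window, while root tasks remain unlimited.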

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


---


diff -puN include/linux/sched.h~user-interface include/linux/sched.h
--- linux-2.6.20-rc5/include/linux/sched.h~user-interface 2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/include/linux/sched.h 2007-01-25 20:44:53.000000000 +0530
@@ -533,7 +533,18 @@ struct signal_struct {

#ifdef CONFIG_FAIRSCHED
struct cpu_usage;
-#endif
+
+int sched_alloc_user(struct user_struct *user);
+void sched_free_user(struct user_struct *user);
+void sched_move_task(void);
+
+#else
+
+static inline int sched_alloc_user(struct user_struct *user) { return 0; }
+static inline void sched_free_user(struct user_struct *user) { }
+static inline void sched_move_task(void) { }
+
+#endif /* CONFIG_FAIRSCHED */

/*
* Some day this will be a full-fledged user tracking system..
@@ -562,6 +573,7 @@ struct user_struct {
#ifdef CONFIG_FAIRSCHED
int cpu_limit;
struct cpu_usage *cpu_usage;
+ struct list_head fair_sched_list;
#endif
};

diff -puN kernel/user.c~user-interface kernel/user.c
--- linux-2.6.20-rc5/kernel/user.c~user-interface 2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/user.c 2007-01-25 20:44:53.000000000 +0530
@@ -51,6 +51,10 @@ struct user_struct root_user = {
.uid_keyring = &root_user_keyring,
.session_keyring = &root_session_keyring,
#endif
+#ifdef CONFIG_FAIRSCHED
+ .cpu_limit = 0, /* No limit */
+ .fair_sched_list = LIST_HEAD_INIT(root_user.fair_sched_list),
+#endif
};

/*
@@ -112,6 +116,7 @@ void free_uid(struct user_struct *up)
if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
uid_hash_remove(up);
spin_unlock_irqrestore(&uidhash_lock, flags);
+ sched_free_user(up);
key_put(up->uid_keyring);
key_put(up->session_keyring);
kmem_cache_free(uid_cachep, up);
@@ -153,6 +158,8 @@ struct user_struct * alloc_uid(uid_t uid
return NULL;
}

+ sched_alloc_user(new);
+
/*
* Before adding this, check whether we raced
* on adding the same user already..
@@ -163,6 +170,7 @@ struct user_struct * alloc_uid(uid_t uid
key_put(new->uid_keyring);
key_put(new->session_keyring);
kmem_cache_free(uid_cachep, new);
+ sched_free_user(new);
} else {
uid_hash_insert(new, hashent);
up = new;
@@ -186,6 +194,7 @@ void switch_uid(struct user_struct *new_
atomic_inc(&new_user->processes);
atomic_dec(&old_user->processes);
switch_uid_keyring(new_user);
+ sched_move_task();
current->user = new_user;

/*
diff -puN kernel/sched.c~user-interface kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~user-interface 2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c 2007-01-26 09:04:04.000000000 +0530
@@ -7221,3 +7221,63 @@ void set_curr_task(int cpu, struct task_
}

#endif
+
+#ifdef CONFIG_FAIRSCHED
+
+static struct list_head user_list = LIST_HEAD_INIT(user_list);
+static atomic_t non_root_users;
+
+static void recalc_user_limits(void)
+{
+ int nr_users;
+ struct user_struct *user;
+
+ nr_users = atomic_read(&non_root_users);
+ if (!nr_users)
+ return;
+
+ list_for_each_entry(user, &user_list, fair_sched_list)
+ user->cpu_limit = 100/nr_users;
+}
+
+/* Allocate cpu_usage structure for the new task-group */
+int sched_alloc_user(struct user_struct *user)
+{
+ int i;
+
+ user->cpu_usage = alloc_percpu(struct cpu_usage);
+ if (!user->cpu_usage)
+ return -ENOMEM;
+
+ for_each_possible_cpu(i) {
+ struct cpu_usage *cu;
+
+ cu = per_cpu_ptr(user->cpu_usage, i);
+ cu->tokens = 1;
+ cu->last_update = 0;
+ cu->starve_count = 0;
+ }
+
+ list_add(&user->fair_sched_list, &user_list);
+ atomic_inc(&non_root_users);
+
+ recalc_user_limits();
+
+ return 0;
+}
+
+/* Deallocate cpu_usage structure */
+void sched_free_user(struct user_struct *user)
+{
+ list_del(&user->fair_sched_list);
+ atomic_dec(&non_root_users);
+ recalc_user_limits();
+ free_percpu(user->cpu_usage);
+}
+
+void sched_move_task(void)
+{
+ clear_tsk_starving(current);
+}
+
+#endif
diff -puN init/Kconfig~user-interface init/Kconfig
--- linux-2.6.20-rc5/init/Kconfig~user-interface 2007-01-25 20:44:53.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/init/Kconfig 2007-01-25 20:44:54.000000000 +0530
@@ -249,6 +249,13 @@ config CPUSETS

Say N if unsure.

+config FAIRSCHED
+ bool "Fair user CPU scheduler"
+ depends on EXPERIMENTAL
+ help
+ This options lets you allocate equal cpu bandwidth to each
+ non-root user.
+
config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
_

--
Regards,
vatsa

2007-01-26 13:58:47

by Kirill Korotaev

Subject: Re: [RFC] Fair-user scheduler

Srivatsa,

> Current Linux CPU scheduler doesnt recognize process aggregates while
> allocating bandwidth. As a result of this, an user could simply spawn large
> number of processes and get more bandwidth than others.
>
> Here's a patch that provides fair allocation for all users in a system.
>
> Some benchmark numbers with and without the patch applied follows:
>
>
> user "vatsa" user "guest"
> (make -s -j4 bzImage) (make -s -j20 bzImage)
>
> 2.6.20-rc5 472.07s (real) 257.48s (real)
> 2.6.20-rc5+fairsched 766.74s (real) 766.73s (real)
1. If I interpret these numbers correctly, then your scheduler is not work-conserving,
i.e. 766.74 + 766.73 >> 472.07 + 257.48.
Why does it slow down users so much?
2. Compilation of a kernel is quite a CPU-bound task, so it's not that hard to be fair :)
Can you please try some other applications,
e.g. pipe-based context switching, the Java Volano benchmark, etc.?

Thanks,
Kirill

2007-01-26 18:41:55

by Chris Friesen

Subject: Re: [RFC] Fair-user scheduler

Srivatsa Vaddagiri wrote:
> Current Linux CPU scheduler doesnt recognize process aggregates while
> allocating bandwidth. As a result of this, an user could simply spawn large
> number of processes and get more bandwidth than others.
>
> Here's a patch that provides fair allocation for all users in a system.
>
> Some benchmark numbers with and without the patch applied follows:
>
>
> user "vatsa" user "guest"
> (make -s -j4 bzImage) (make -s -j20 bzImage)
>
> 2.6.20-rc5 472.07s (real) 257.48s (real)
> 2.6.20-rc5+fairsched 766.74s (real) 766.73s (real)

As Kirill brought up, why does it take so much more time? Are you
thrashing the cache?

> - breaks O(1) (ouch!)
> Best way to avoid this is to split runqueue to be per-user and
> per-cpu, which I have not implemented to keep the patch simple.

Presumably this would be made generic, as in per-"group" rather than per
user?

> - Fairsched aware SMP load balance NOT addressed (yet)

This is kind of important, no?

Chris

2007-01-26 19:23:13

by Éric Piel

Subject: Re: [RFC] Fair-user scheduler

On 01/26/2007 03:09 PM, Kirill Korotaev wrote:
> Srivatsa,
>
>> Current Linux CPU scheduler doesnt recognize process aggregates while
>> allocating bandwidth. As a result of this, an user could simply spawn large
>> number of processes and get more bandwidth than others.
>>
>> Here's a patch that provides fair allocation for all users in a system.
>>
>> Some benchmark numbers with and without the patch applied follows:
>>
>>
>>                          user "vatsa"             user "guest"
>>                          (make -s -j4 bzImage)    (make -s -j20 bzImage)
>>
>> 2.6.20-rc5               472.07s (real)           257.48s (real)
>> 2.6.20-rc5+fairsched     766.74s (real)           766.73s (real)
> 1. If I interpret these numbers correctly, then your scheduler is not work-conservative,
> i.e. 766.74 + 766.73 >> 472.07 + 257.48
> why does it slow down users so much?
You can't measure work-conservation by summing! Everything is run
_concurrently_. A proof that computing power is lost would be to show
"MAX(new_algorithm execution_times) > MAX(old_algorithm
execution_times)". Anyway... it still seems a lot of power is lost:
MAX(766,766) >> MAX(472,257).

Actually, I'd be very interested in a "fairness number", and I believe so
far no one has proposed such a thing. It would probably need to take into
account the loss of CPU power and the variance of execution times between
the sets of tasks which are supposed to be treated fairly.
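One candidate, just as a suggestion and not something measured in this
thread, would be Jain's fairness index over the per-user CPU times x_1..x_n:
f = (x_1 + ... + x_n)^2 / (n * (x_1^2 + ... + x_n^2)), which is 1 when every
user gets the same CPU time and 1/n in the worst case; reported together with
the total throughput loss versus the vanilla scheduler, it would capture both
aspects.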

> 2. compilation of kernel is quite CPU-bound task. So it's not that hard to be fair :)
> Can you please try some other applications?
> e.g. pipe-based context switching, java Volano benchmark etc.
Another worthy benchmark would be :
(make -s -j4 bzImage) vs (nice make -s -j20 bzImage)
^^^^


Regards,
Eric

2007-01-31 15:02:20

by Srivatsa Vaddagiri

Subject: Re: [PATCH 1/2] core scheduler changes

On Fri, Jan 26, 2007 at 11:33:17AM +0530, Srivatsa Vaddagiri wrote:
>
> This patch does several things:
>
> - Introduces the notion of control window (current set at 1
> sec - ideally the window size should be adjusted based on
> number of users to avoid rapid context switches). Bandwidth of each
> user is controlled within this window. rq->last_update tracks where
> are in the current window.

The patch below makes the control window size configurable (as a macro),
sets the default window size to 10 sec, and also fixes a compile error (for
the !CONFIG_FAIRSCHED case).

Ideally the window size needs to be determined at run time (based on the
number of users). I will address that if there is sufficient interest in a
patch like this.
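
One possibility, purely as an untested sketch on my part and not part of this
patch, would be to scale the window with the number of users, clamped to a
sane range, e.g.:

	/* hypothetical: ~5s of window per non-root user, kept within
	 * [10s, 60s]; nr_users would come from the non_root_users count
	 * maintained in patch 2/2 */
	window_size = min(max(nr_users * 5 * HZ, 10UL * HZ), 60UL * HZ);

so that each user still gets a usable slice per window as the user count
grows.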

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---


diff -puN kernel/sched.c~window-fix kernel/sched.c
--- linux-2.6.20-rc5/kernel/sched.c~window-fix 2007-01-31 19:59:48.000000000 +0530
+++ linux-2.6.20-rc5-vatsa/kernel/sched.c 2007-01-31 20:08:57.000000000 +0530
@@ -769,17 +769,24 @@ static inline int is_user_starving(struc
return 0;
}

-/* Are we past the 1-sec control window? If so, all groups get to renew their
+#define WINDOW_SIZE (10*HZ)
+
+/* Are we past the control window? If so, all groups get to renew their
* expired tokens.
*/
-static inline void adjust_control_window(void)
+static inline void adjust_control_window(int force)
{
struct rq *rq = this_rq();
unsigned long delta;

+ if (force) {
+ rq->last_update = jiffies;
+ return;
+ }
+
delta = jiffies - rq->last_update;
- if (delta >= HZ)
- rq->last_update += (delta/HZ) * HZ;
+ if (delta >= WINDOW_SIZE)
+ rq->last_update += (delta/WINDOW_SIZE) * WINDOW_SIZE;
}

/* Account group's cpu usage */
@@ -788,7 +795,7 @@ static inline void inc_cpu_usage(struct
struct user_struct *user = p->user;
struct cpu_usage *cu;

- adjust_control_window();
+ adjust_control_window(0);

if (!user->cpu_limit)
return;
@@ -803,7 +810,7 @@ static inline int task_over_cpu_limit(st
struct user_struct *user = p->user;
struct cpu_usage *cu;

- adjust_control_window();
+ adjust_control_window(0);

if (!user->cpu_limit)
return 0;
@@ -811,7 +818,7 @@ static inline int task_over_cpu_limit(st
cu = per_cpu_ptr(user->cpu_usage, task_cpu(p));
if (cu->last_update != rq->last_update) {
/* Replenish tokens */
- cu->tokens += user->cpu_limit * HZ / 100;
+ cu->tokens += user->cpu_limit * WINDOW_SIZE / 100;
cu->last_update = rq->last_update;
}

@@ -830,6 +837,7 @@ static int task_over_cpu_limit(struct ta
static void set_tsk_starving(struct task_struct *p) { }
static void clear_tsk_starving(struct task_struct *p) { }
static int is_user_starving(struct task_struct *p) { return 0;}
+static inline void adjust_control_window(int force) { }

#endif /* CONFIG_FAIRSCHED */

@@ -3668,7 +3676,7 @@ pick_next_task:
}

if (task_over_cpu_limit(next))
- rq->last_update = jiffies;
+ adjust_control_window(1);
if (!next->time_slice)
next->time_slice = task_timeslice(next);
clear_tsk_starving(next);
_


--
Regards,
vatsa

2007-01-31 15:10:17

by Srivatsa Vaddagiri

Subject: Re: [RFC] Fair-user scheduler

On Fri, Jan 26, 2007 at 07:52:49PM +0100, Eric Piel wrote:
> >>                          user "vatsa"             user "guest"
> >>                          (make -s -j4 bzImage)    (make -s -j20 bzImage)
> >>
> >> 2.6.20-rc5               472.07s (real)           257.48s (real)
> >> 2.6.20-rc5+fairsched     766.74s (real)           766.73s (real)
> >
> >1. If I interpret these numbers correctly, then your scheduler is not
> >work-conservative,
> >i.e. 766.74 + 766.73 >> 472.07 + 257.48
> >why does it slow down users so much?

Ok, I investigated this a bit. The "1-sec" control window was the
killer. I guess it was causing too much thrashing. I increased the
control window to 10 sec [1] and I get decent results now.

NOTE : Since the patches don't support SMP load-balancing currently,
all benchmarks were run on only ONE cpu (using cpuset's cpu_exclusive
feature) and hence the numbers are indicative of UP performance.

First, results of kernel compilation:


-------------------------------------------------------------------------------
                                   vatsa                  guest
                                   (make -j4 bzImage)     (make -j20 bzImage)
-------------------------------------------------------------------------------

2.6.20-rc5                         767.16s (real)         495.98s (real)
2.6.20-rc5 + fairsched (W=10s)     765.89s (real)         764.89s (real)

-------------------------------------------------------------------------------



-------------------------------------------------------------------------------
                                   vatsa                  guest
                                   (make -j4 bzImage)     (nice make -j20 bzImage)
-------------------------------------------------------------------------------

2.6.20-rc5                         767.16s (real)         636.51s (real)
2.6.20-rc5 + fairsched (W=10s)     761.19s (real)         766.51s (real)

-------------------------------------------------------------------------------

Second, results of volanomark benchmark [2].

Both users, vatsa and guest, ran a copy of the volano server on different
ports. Each user then launched the volano client against their respective
port, and the throughput of the client was measured. Ideally we want the
throughput to be the same for both users.


-------------------------------------------------------------------------------
                                   vatsa                    guest
                                   (volano server/client)   (volano server/client)
-------------------------------------------------------------------------------

2.6.20-rc5                         400000 msg in 39.495s    400000 msg in 39.528s
                                   (10128 msg/sec)          (10119 msg/sec)

2.6.20-rc5 + fairsched (W=10s)     400000 msg in 39.662s    400000 msg in 39.646s
                                   (10085 msg/sec)          (10089 msg/sec)

-------------------------------------------------------------------------------


Here we don't see much difference between the vanilla kernel and the patch
because the number of threads spawned by both users is the same.

Kirill, did you have any other specific volano configuration in mind?

> Actually, I'd be very interested by a "fairness number" and believe so
> far no one as proposed such thing. Probably needs to take into account
> the loss of CPU power and the variance of execution time in between the
> sets of tasks which are supposed to be fair.

Yeah, in my patch I don't account/limit the execution time of root tasks at
all. That needs to be factored in when we distribute left-over cycles to
non-root users (which I don't think I have addressed in my current
patch).

> >2. compilation of kernel is quite CPU-bound task. So it's not that hard to
> >be fair :)
> > Can you please try some other applications?
> > e.g. pipe-based context switching, java Volano benchmark etc.

Kirill, I have provided volano numbers above. Did you have any specific
volano configuration in mind that you'd like tested?

Also, do you have a pointer to a ready-made pipe-based benchmark I can use?
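
In case it helps pin down what I mean, here is a rough standalone sketch of
what I imagine such a pipe-based ping-pong microbenchmark looks like (this is
only an illustration, not an existing tool, and the code below is untested):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define ITERS 100000

int main(void)
{
	int p2c[2], c2p[2];	/* parent->child and child->parent pipes */
	char buf = 'x';
	int i;

	if (pipe(p2c) || pipe(c2p)) {
		perror("pipe");
		exit(1);
	}

	if (fork() == 0) {		/* child: echo every byte back */
		for (i = 0; i < ITERS; i++) {
			if (read(p2c[0], &buf, 1) != 1)
				exit(1);
			if (write(c2p[1], &buf, 1) != 1)
				exit(1);
		}
		exit(0);
	}

	/* parent: send a byte, wait for the echo; each round trip forces
	 * two context switches.  Time the whole run externally, e.g. with
	 * "time", or bracket the loop with gettimeofday(). */
	for (i = 0; i < ITERS; i++) {
		if (write(p2c[1], &buf, 1) != 1)
			exit(1);
		if (read(c2p[0], &buf, 1) != 1)
			exit(1);
	}
	wait(NULL);
	return 0;
}

Each user would run a copy of this concurrently and we would compare the
completion times, similar to the kernel-compile runs above.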

> Another worthy benchmark would be :
> (make -s -j4 bzImage) vs (nice make -s -j20 bzImage)
> ^^^^

I have provided the result of the above benchmark as well.


References:

1. http://lkml.org/lkml/2007/01/31/138

2. Volanomark benchmark configuration details:

java.vendor = IBM Corporation
java.vendor.url = http://www.ibm.com/
java.version = 1.4.2
java.class.version = 48.0
java.compiler = j9jit22
os.name = Linux
os.version = 2.6.20-rc5-1k
os.arch = amd64

VolanoMark version = 2.5.0.9
Messages sent = 20000
Messages received = 380000
Total messages = 400000





--
Regards,
vatsa

2007-01-31 15:17:03

by Srivatsa Vaddagiri

Subject: Re: [RFC] Fair-user scheduler

On Fri, Jan 26, 2007 at 12:41:27PM -0600, Chris Friesen wrote:
> As Kirill brought up, why does it take so much more time? Are you
> thrashing the cache?

Yes, the default 1-sec control window I had was causing a lot of
thrashing. See http://lkml.org/lkml/2007/1/31/142

> Presumably this would be made generic, as in per-"group" rather than per
> user?

Ideally yes! But I am trying to apply the solution to an existing problem
first and extend it for other needs later.

> > - Fairsched aware SMP load balance NOT addressed (yet)
>
> This is kind of important, no?

Yes, again! I have avoided it for now to keep the patch simple.
Based on interest/response, I can address this later.


--
Regards,
vatsa