2007-11-19 12:14:42

by Srivatsa Vaddagiri

Subject: [PATCH 0/2] sched: Group scheduler related patches

Hi Ingo,
Here are two patches related to group cpu scheduling.

Patch 1/2 -> Fixes minor bugs and coding style issues
Patch 2/2 -> Improves group fairness on SMP systems.

This is probably one of the last big changes related to the group scheduler
that I had on my plate. Please apply if there is no objection from anyone.

--
Regards,
vatsa


2007-11-19 12:15:45

by Srivatsa Vaddagiri

Subject: [PATCH 1/2] sched: Minor cleanups


Minor scheduler cleanups:

- Fix coding style
- remove obsolete comment
- Use list_for_each_entry_rcu when walking task group list


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 21 +++------------------
kernel/sched_fair.c | 5 ++++-
2 files changed, 7 insertions(+), 19 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD /* root user's cpu share */
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(

/* Iterate thr' all leaf cfs_rq's on a runqueue */
#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+ list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

/* Do the two (enqueued) entities belong to the same group ? */
static inline int
@@ -1126,7 +1126,10 @@ static void print_cfs_stats(struct seq_f
#ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
#endif
+
+ rcu_read_lock();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+ rcu_read_unlock();
}
#endif

--
Regards,
vatsa

2007-11-19 12:18:18

by Srivatsa Vaddagiri

Subject: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups


The current load balancing scheme isn't good for group fairness.

For ex: on an 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Before this patch:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of output:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test scripts that produce such output if required]
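
For instance, plugging in the numbers above for group a: Col3 = 100 * 277.676 /
(8 * 60) = 57.8%, and its deviation from the ideal 33.33% per group is
57.8% - 33.33% = +24.47%, which is the figure listed below.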

The deviation from desired group fairness (i.e. Col3 - 33.33%) is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of the other runs have yielded a deviation within
+-2% at most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently a cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0           CPU1
---------------------------
A (10)         B(5)
               C(5)
---------------------------

Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each, all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
defined (see below)
- API Change
- Minimum shares that can be allocated to a group is
set to 100 (MIN_GROUP_SHARES macro introduced)
- Two tunables introduced in sysctl (under SCHED_DEBUG) to
control the frequency at which the load balance monitor
thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. It is now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i) = Weight of group K on cpu i
T(K,i) = Task load present in group K's cfs_rq on cpu i
T(K) = Total task load of group K across various cpus
S(K) = Shares allocated to group K
NRCPUS = Number of online cpus in the scheduler domain to
which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
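
To make the formula concrete, here is a small standalone sketch (illustrative
only, not kernel code; the helper name and the hard-coded numbers, such as
shares of 1024 per group, are mine) that applies it to the 2-cpu example above:

#include <stdio.h>

#define NICE_0_LOAD     1024UL

/* W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) */
static unsigned long group_weight(unsigned long shares, unsigned long nrcpus,
                                  unsigned long load_on_cpu,
                                  unsigned long total_load)
{
        return shares * nrcpus * load_on_cpu / total_load;
}

int main(void)
{
        /* A: 10 tasks on CPU0; B and C: 5 tasks each on CPU1, all NICE_0_LOAD */
        unsigned long wa0 = group_weight(1024, 2, 10 * NICE_0_LOAD, 10 * NICE_0_LOAD);
        unsigned long wb1 = group_weight(1024, 2, 5 * NICE_0_LOAD, 5 * NICE_0_LOAD);
        unsigned long wc1 = group_weight(1024, 2, 5 * NICE_0_LOAD, 5 * NICE_0_LOAD);

        printf("W(A,0)=%lu W(B,1)=%lu W(C,1)=%lu\n", wa0, wb1, wc1);
        printf("CPU0 load=%lu, CPU1 load=%lu\n", wa0, wb1 + wc1);
        return 0;
}

Each group now exerts a weight of 2048, so CPU0 carries 2048 while CPU1 carries
4096, and the imbalance that the old task-weight scheme hid becomes visible to
the load balancer.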

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which
it runs: the interval doubles from the min towards the max while shares stay
balanced, and drops back to the min whenever an adjustment was needed.


Impact on sched.o size
======================

CONFIG_FAIR_GROUP_SCHED not defined ->

text data bss dec hex filename
36829 2766 48 39643 9adb sched.o-before
36912 2766 48 39726 9b2e sched.o-after
---------
+83
--------

CONFIG_FAIR_GROUP_SCHED defined ->

text data bss dec hex filename
39019 3346 336 42701 a6cd sched.o-before
40223 3482 340 44045 ac0d sched.o-after
------
+1344
------

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
include/linux/sched.h | 4
kernel/sched.c | 292 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched_fair.c | 95 ++++++++++------
kernel/sched_rt.c | 2
kernel/sysctl.c | 16 ++
5 files changed, 348 insertions(+), 61 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -1467,6 +1467,10 @@ extern unsigned int sysctl_sched_child_r
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern unsigned int sysctl_sched_min_bal_int_shares;
+extern unsigned int sysctl_sched_max_bal_int_shares;
+#endif

int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -168,9 +168,46 @@ struct task_group {
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
+
+ /* shares assigned to a task group govern how much of the cpu bandwidth
+ * is allocated to the group. The more shares a group has, the more
+ * cpu bandwidth is allocated to it.
+ *
+ * For ex, lets say that there are three task groups, A, B and C which
+ * have been assigned shares 1000, 2000 and 3000 respectively. Then,
+ * cpu bandwidth allocated by the scheduler to task groups A, B and C
+ * should be:
+ *
+ * Bw(A) = 1000/(1000+2000+3000) * 100 = 16.66%
+ * Bw(B) = 2000/(1000+2000+3000) * 100 = 33.33%
+ * Bw(C) = 3000/(1000+2000+3000) * 100 = 50%
+ *
+ * The weight assigned to a task group's schedulable entities on every
+ * cpu (task_group.se[a_cpu]->load.weight) is derived from the task
+ * group's shares. For ex: lets say that task group A has been
+ * assigned shares of 1000 and there are two CPUs in a system. Then,
+ *
+ * tg_A->se[0]->load.weight = tg_A->se[1]->load.weight = 1000;
+ *
+ * Note: It's not necessary that each of a task group's schedulable
+ * entities has the same weight on all CPUs. If the group
+ * has 2 of its tasks on CPU0 and 1 task on CPU1, then a
+ * better distribution of weight could be:
+ *
+ * tg_A->se[0]->load.weight = 2/3 * 2000 = 1333
+ * tg_A->se[1]->load.weight = 1/3 * 2000 = 667
+ *
+ * rebalance_shares() is responsible for distributing the shares of a
+ * task group like this among the group's schedulable entities across
+ * cpus.
+ *
+ */
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
+
+ /* lock to serialize modification to shares */
+ struct mutex lock;
+
+ unsigned long last_total_load;
struct rcu_head rcu;
};

@@ -182,6 +219,14 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];

+static DEFINE_MUTEX(doms_cur_mutex); /* serialize access to doms_cur[] array */
+
+/* kernel thread that runs rebalance_shares() periodically */
+static struct task_struct *lb_monitor_task;
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+static int load_balance_monitor(void *unused);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -196,6 +241,8 @@ struct task_group init_task_group = {
# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

+#define MIN_GROUP_SHARES 100
+
static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
@@ -222,9 +269,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}

+static inline void lock_doms_cur(void)
+{
+ mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+ mutex_unlock(&doms_cur_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -854,6 +913,16 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif

+static inline void inc_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -864,26 +933,14 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}

static void dec_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running--;
- dec_load(rq, p);
}

static void set_load_weight(struct task_struct *p)
@@ -4055,10 +4112,8 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- dec_load(rq, p);
- }

p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4068,7 +4123,6 @@ void set_user_nice(struct task_struct *p

if (on_rq) {
enqueue_task(rq, p, 0);
- inc_load(rq, p);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
@@ -6509,6 +6563,8 @@ void partition_sched_domains(int ndoms_n
{
int i, j;

+ lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();

@@ -6549,6 +6605,8 @@ match2:
ndoms_cur = ndoms_new;

register_sched_domain_sysctl();
+
+ unlock_doms_cur();
}

#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -6683,6 +6741,17 @@ void __init sched_init_smp(void)
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
sched_init_granularity();
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ lb_monitor_task = kthread_create(load_balance_monitor, NULL,
+ "load_balance_monitor");
+ if (!IS_ERR(lb_monitor_task))
+ wake_up_process(lb_monitor_task);
+ else {
+ printk("Could not create load balance monitor thread "
+ "(error = %ld)\n", PTR_ERR(lb_monitor_task));
+ }
+#endif
}
#else
void __init sched_init_smp(void)
@@ -6747,7 +6816,7 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
- spin_lock_init(&init_task_group.lock);
+ mutex_init(&init_task_group.lock);
#endif

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6939,6 +7008,136 @@ void set_curr_task(int cpu, struct task_

#ifdef CONFIG_FAIR_GROUP_SCHED

+/* distribute shares of all task groups among their schedulable entities,
+ * to reflect load distribution across cpus.
+ */
+static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+{
+ struct cfs_rq *cfs_rq;
+ struct rq *rq = cpu_rq(this_cpu);
+ cpumask_t sdspan = sd->span;
+ int balanced = 1;
+
+ /* Walk thr' all the task groups that we have */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ int i;
+ unsigned long total_load = 0, total_shares;
+ struct task_group *tg = cfs_rq->tg;
+
+ /* Gather total task load of this group across cpus */
+ for_each_cpu_mask(i, sdspan)
+ total_load += tg->cfs_rq[i]->load.weight;
+
+ /* Nothing to do if this group has no load or if its load
+ * hasn't changed since the last time we checked.
+ */
+ if (!total_load || total_load == tg->last_total_load)
+ continue;
+
+ tg->last_total_load = total_load;
+
+ /* tg->shares represents the number of cpu shares the task group
+ * is eligible to hold on a single cpu. On N cpus, it is
+ * eligible to hold (N * tg->shares) number of cpu shares.
+ */
+ total_shares = tg->shares * cpus_weight(sdspan);
+
+ /* redistribute total_shares across cpus as per the task load
+ * distribution.
+ */
+ for_each_cpu_mask(i, sdspan) {
+ unsigned long local_load, local_shares, irqflags;
+
+ local_load = tg->cfs_rq[i]->load.weight;
+ local_shares = (local_load * total_shares) / total_load;
+ if (!local_shares)
+ local_shares = MIN_GROUP_SHARES;
+ if (local_shares == tg->se[i]->load.weight)
+ continue;
+
+ spin_lock_irqsave(&cpu_rq(i)->lock, irqflags);
+ set_se_shares(tg->se[i], local_shares);
+ spin_unlock_irqrestore(&cpu_rq(i)->lock, irqflags);
+ balanced = 0;
+ }
+ }
+
+ return balanced;
+}
+
+/*
+ * How frequently should we rebalance_shares() across cpus?
+ *
+ * The more frequently we rebalance shares, the more accurate is the fairness
+ * of cpu bandwidth distribution between task groups. However higher frequency
+ * also implies increased scheduling overhead.
+ *
+ * sysctl_sched_min_bal_int_shares represents the minimum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * sysctl_sched_max_bal_int_shares represents the maximum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * These settings allow for the appropriate tradeoff between accuracy of
+ * fairness and the associated overhead.
+ *
+ */
+
+/* default: 8ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_min_bal_int_shares = 8;
+
+/* default: 128ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_max_bal_int_shares = 128;
+
+static int load_balance_monitor(void *unused)
+{
+ unsigned int timeout = sysctl_sched_min_bal_int_shares;
+
+ while (!kthread_should_stop()) {
+ int i, cpu, balanced = 1;
+
+ lock_cpu_hotplug(); /* Prevent cpus going down or coming up */
+ lock_doms_cur(); /* lockout changes to doms_cur[] array */
+
+ rcu_read_lock(); /* to walk rq->sd chain on various cpus */
+
+ for (i = 0; i < ndoms_cur; i++) {
+ cpumask_t cpumap = doms_cur[i];
+ struct sched_domain *sd = NULL, *sd_prev = NULL;
+
+ cpu = first_cpu(cpumap);
+
+ /* Find the highest domain at which to balance shares */
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ continue;
+ sd_prev = sd;
+ }
+
+ sd = sd_prev;
+ /* sd == NULL? No load balance reqd in this domain */
+ if (!sd)
+ continue;
+
+ balanced &= rebalance_shares(sd, cpu);
+ }
+
+ rcu_read_unlock();
+
+ unlock_doms_cur();
+ unlock_cpu_hotplug();
+
+ if (!balanced)
+ timeout = sysctl_sched_min_bal_int_shares;
+ else if (timeout < sysctl_sched_max_bal_int_shares)
+ timeout *= 2;
+
+ msleep(timeout);
+ }
+
+ return 0;
+}
+
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(void)
{
@@ -6994,7 +7193,7 @@ struct task_group *sched_create_group(vo
}

tg->shares = NICE_0_LOAD;
- spin_lock_init(&tg->lock);
+ mutex_init(&tg->lock);

return tg;

@@ -7092,41 +7291,78 @@ done:
task_rq_unlock(rq, &flags);
}

+/* rq->lock to be locked by caller */
static void set_se_shares(struct sched_entity *se, unsigned long shares)
{
struct cfs_rq *cfs_rq = se->cfs_rq;
struct rq *rq = cfs_rq->rq;
int on_rq;

- spin_lock_irq(&rq->lock);
+ if (!shares)
+ shares = MIN_GROUP_SHARES;

on_rq = se->on_rq;
- if (on_rq)
+ if (on_rq) {
dequeue_entity(cfs_rq, se, 0);
+ dec_load(rq, se->load.weight);
+ }

se->load.weight = shares;
se->load.inv_weight = div64_64((1ULL<<32), shares);

- if (on_rq)
+ if (on_rq) {
enqueue_entity(cfs_rq, se, 0);
-
- spin_unlock_irq(&rq->lock);
+ inc_load(rq, se->load.weight);
+ }
}

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+
+ mutex_lock(&tg->lock);

- spin_lock(&tg->lock);
if (tg->shares == shares)
goto done;

+ if (shares < MIN_GROUP_SHARES)
+ shares = MIN_GROUP_SHARES;
+
+ /* Prevent any load balance activity (rebalance_shares,
+ * load_balance_fair) from referring to this group first,
+ * by taking it off the rq->leaf_cfs_rq_list on each cpu.
+ */
+ for_each_possible_cpu(i) {
+ cfs_rq = tg->cfs_rq[i];
+ list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+ }
+
+ /* wait for any ongoing reference to this group to finish */
+ synchronize_sched();
+
+ /* Now we are free to modify the group's share on each cpu
+ * w/o tripping rebalance_shares or load_balance_fair.
+ */
tg->shares = shares;
- for_each_possible_cpu(i)
+ for_each_possible_cpu(i) {
+ spin_lock_irq(&cpu_rq(i)->lock);
set_se_shares(tg->se[i], shares);
+ spin_unlock_irq(&cpu_rq(i)->lock);
+ }
+
+ /* Enable load balance activity on this group, by inserting it back on
+ * each cpu's rq->leaf_cfs_rq_list.
+ */
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ cfs_rq = tg->cfs_rq[i];
+ list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
+ }

done:
- spin_unlock(&tg->lock);
+ mutex_unlock(&tg->lock);
return 0;
}

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -702,6 +702,8 @@ static inline struct sched_entity *paren
return se->parent;
}

+#define GROUP_IMBALANCE_PCT 20
+
#else /* CONFIG_FAIR_GROUP_SCHED */

#define for_each_sched_entity(se) \
@@ -756,6 +758,7 @@ static void enqueue_task_fair(struct rq
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ unsigned long old_load = rq->cfs.load.weight, new_load, delta_load;

for_each_sched_entity(se) {
if (se->on_rq)
@@ -764,6 +767,10 @@ static void enqueue_task_fair(struct rq
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+
+ new_load = rq->cfs.load.weight;
+ delta_load = new_load - old_load;
+ inc_load(rq, delta_load);
}

/*
@@ -775,6 +782,7 @@ static void dequeue_task_fair(struct rq
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ unsigned long old_load = rq->cfs.load.weight, new_load, delta_load;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -784,6 +792,10 @@ static void dequeue_task_fair(struct rq
break;
sleep = 1;
}
+
+ new_load = rq->cfs.load.weight;
+ delta_load = old_load - new_load;
+ dec_load(rq, delta_load);
}

/*
@@ -938,25 +950,6 @@ static struct task_struct *load_balance_
return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
-static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *curr;
- struct task_struct *p;
-
- if (!cfs_rq->nr_running)
- return MAX_PRIO;
-
- curr = cfs_rq->curr;
- if (!curr)
- curr = __pick_next_entity(cfs_rq);
-
- p = task_of(curr);
-
- return p->prio;
-}
-#endif
-
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -966,28 +959,44 @@ load_balance_fair(struct rq *this_rq, in
struct cfs_rq *busy_cfs_rq;
long rem_load_move = max_load_move;
struct rq_iterator cfs_rq_iterator;
+ unsigned long load_moved;

cfs_rq_iterator.start = load_balance_start_fair;
cfs_rq_iterator.next = load_balance_next_fair;

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
#ifdef CONFIG_FAIR_GROUP_SCHED
- struct cfs_rq *this_cfs_rq;
- long imbalance;
- unsigned long maxload;
-
- this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
-
- imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
- /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
- if (imbalance <= 0)
+ struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
+ unsigned long maxload, task_load, group_weight;
+ unsigned long thisload, per_task_load;
+ struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
+
+ task_load = busy_cfs_rq->load.weight;
+ group_weight = se->load.weight;
+
+ /*
+ * 'group_weight' is contributed by tasks of total weight
+ * 'task_load'. To move 'rem_load_move' worth of weight only,
+ * we need to move a maximum task load of:
+ *
+ * maxload = (rem_load_move / group_weight) * task_load;
+ */
+ maxload = (rem_load_move * task_load) / group_weight;
+
+ if (!maxload || !task_load)
continue;

- /* Don't pull more than imbalance/2 */
- imbalance /= 2;
- maxload = min(rem_load_move, imbalance);
+ per_task_load = task_load/busy_cfs_rq->nr_running;

- *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+ /* balance_tasks will try to forcibly move at least one task if
+ * possible (because of SCHED_LOAD_SCALE_FUZZ). Avoid that if
+ * maxload is less than GROUP_IMBALANCE_PCT% of the per_task_load.
+ */
+ if (100 * maxload < GROUP_IMBALANCE_PCT * per_task_load)
+ continue;
+
+ *this_best_prio = 0;
+ thisload = this_cfs_rq->load.weight;
#else
# define maxload rem_load_move
#endif
@@ -996,11 +1005,31 @@ load_balance_fair(struct rq *this_rq, in
* load_balance_[start|next]_fair iterators
*/
cfs_rq_iterator.arg = busy_cfs_rq;
- rem_load_move -= balance_tasks(this_rq, this_cpu, busiest,
+ load_moved = balance_tasks(this_rq, this_cpu, busiest,
maxload, sd, idle, all_pinned,
this_best_prio,
&cfs_rq_iterator);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* load_moved holds the task load that was moved. The
+ * effective weight moved would be:
+ * load_moved_eff = load_moved/task_load * group_weight;
+ */
+ load_moved = (group_weight * load_moved) / task_load;
+
+ /* Adjust shares on both cpus to reflect load_moved */
+ group_weight -= load_moved;
+ set_se_shares(se, group_weight);
+
+ se = busy_cfs_rq->tg->se[this_cpu];
+ if (!thisload)
+ group_weight = load_moved;
+ else
+ group_weight = se->load.weight + load_moved;
+ set_se_shares(se, group_weight);
+#endif
+
+ rem_load_move -= load_moved;
if (rem_load_move <= 0)
break;
}
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+ inc_load(rq, p->se.load.weight);
}

/*
@@ -45,6 +46,7 @@ static void dequeue_task_rt(struct rq *r
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+ dec_load(rq, p->se.load.weight);
}

/*
Index: current/kernel/sysctl.c
===================================================================
--- current.orig/kernel/sysctl.c
+++ current/kernel/sysctl.c
@@ -309,6 +309,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_min_bal_int_shares",
+ .data = &sysctl_sched_min_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_max_bal_int_shares",
+ .data = &sysctl_sched_max_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#endif
{
.ctl_name = CTL_UNNUMBERED,





--
Regards,
vatsa

2007-11-19 13:08:27

by Ingo Molnar

Subject: Re: [PATCH 1/2] sched: Minor cleanups


* Srivatsa Vaddagiri <[email protected]> wrote:

> Minor scheduler cleanups:
>
> - Fix coding style
> - remove obsolete comment
> - Use list_for_each_entry_rcu when walking task group list

> #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> - list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
>
> /* Do the two (enqueued) entities belong to the same group ? */
> static inline int
> @@ -1126,7 +1126,10 @@ static void print_cfs_stats(struct seq_f
> #ifdef CONFIG_FAIR_GROUP_SCHED
> print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
> #endif
> +
> + rcu_read_lock();
> for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> print_cfs_rq(m, cpu, cfs_rq);
> + rcu_read_unlock();

hm, why is this a cleanup?

Ingo

2007-11-19 13:12:22

by Ingo Molnar

Subject: Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups


* Srivatsa Vaddagiri <[email protected]> wrote:

> The current load balancing scheme isn't good for group fairness.

> ---
> include/linux/sched.h | 4
> kernel/sched.c | 292 +++++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched_fair.c | 95 ++++++++++------
> kernel/sched_rt.c | 2
> kernel/sysctl.c | 16 ++
> 5 files changed, 348 insertions(+), 61 deletions(-)

i'm leaning towards making this v2.6.25 material, as it affects the
non-group-scheduling bits too and is rather large. When i tested it,
group scheduling worked pretty well - at least for CPU bound tasks - and
on SMP too. Could we live with what we have for now and defer this patch
to v2.6.25? If not, could you split up this patch in a way to defer all
the FAIR_GROUP_SCHED relevant changes to a separate patch which will not
affect the !FAIR_GROUP_SCHED case at all? That will make the case much
clearer.

Ingo

2007-11-19 14:48:54

by Srivatsa Vaddagiri

Subject: Re: [PATCH 1/2] sched: Minor cleanups

On Mon, Nov 19, 2007 at 02:08:03PM +0100, Ingo Molnar wrote:
> > #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> > - list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> > + list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> >
> > /* Do the two (enqueued) entities belong to the same group ? */
> > static inline int
> > @@ -1126,7 +1126,10 @@ static void print_cfs_stats(struct seq_f
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
> > #endif
> > +
> > + rcu_read_lock();
> > for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
> > print_cfs_rq(m, cpu, cfs_rq);
> > + rcu_read_unlock();
>
> hm, why is this a cleanup?

Sorry for the wrong subject. It was supposed to include the above bug fix,
related to how we walk the task group list.

Thinking about it now, I realize that print_cfs_rq() can potentially
sleep and hence it cannot be surrounded by rcu_read_lock()/unlock().

And as Dipankar just pointed out to me, sched_create/destroy_group aren't
serialized at all currently, so we need a mutex to protect them. The
same mutex can then be used when walking the list in print_cfs_stats().

Will send updated patches soon.


--
Regards,
vatsa

2007-11-19 14:50:33

by Srivatsa Vaddagiri

Subject: Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

On Mon, Nov 19, 2007 at 02:12:01PM +0100, Ingo Molnar wrote:
> > include/linux/sched.h | 4
> > kernel/sched.c | 292 +++++++++++++++++++++++++++++++++++++++++++++-----
> > kernel/sched_fair.c | 95 ++++++++++------
> > kernel/sched_rt.c | 2
> > kernel/sysctl.c | 16 ++
> > 5 files changed, 348 insertions(+), 61 deletions(-)
>
> i'm leaning towards making this v2.6.25 material, as it affects the
> non-group-scheduling bits too and is rather large. When i tested it,
> group scheduling worked pretty well - at least for CPU bound tasks - and
> on SMP too. Could we live with what we have for now and defer this patch
> to v2.6.25?

Hi Ingo,
I would prefer this to go in 2.6.24 if possible. 2.6.24 would be the
first kernel to support a group scheduler in its entirety (user interface +
related scheduler support) and one that works reasonably well :) It would
also give me early test feedback.

> If not, could you split up this patch in a way to defer all
> the FAIR_GROUP_SCHED relevant changes to a separate patch which will not
> affect the !FAIR_GROUP_SCHED case at all? That will make the case much
> clearer.

From my inspection, here are the changes introduced by this patch
for !CONFIG_FAIR_GROUP_SCHED case:

- inc/dec_load() takes a load input instead of task pointer input as their
2nd arg
- inc/dec_nr_running don't call inc/dec_load. Instead,
- enqueue/dequeue_task class callbacks call inc/dec_load
- [Unintended/will-fix change] min/max tunables added in
/proc/sys/kernel

All of above changes (except last, which I will fix) should have zero
functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I don't see how
I can split Patch 2/2 further.

Or do you prefer that I introduce #ifdef's such that even these minor changes to
inc/dec_load are avoided for the !CONFIG_FAIR_GROUP_SCHED case? That would
make the code slightly ugly, I suspect.

--
Regards,
vatsa

2007-11-19 15:23:27

by Ingo Molnar

Subject: Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups


* Srivatsa Vaddagiri <[email protected]> wrote:

> - inc/dec_load() takes a load input instead of task pointer input as their
> 2nd arg
> - inc/dec_nr_running don't call inc/dec_load. Instead,
> - enqueue/dequeue_task class callbacks call inc/dec_load
> - [Unintended/will-fix change] min/max tunables added in
> /proc/sys/kernel
>
> All of above changes (except last, which I will fix) should have zero
> functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I
> don't see how I can split Patch 2/2 further.

ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could try
it.

Ingo

2007-11-19 15:54:20

by Srivatsa Vaddagiri

Subject: Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

On Mon, Nov 19, 2007 at 04:22:58PM +0100, Ingo Molnar wrote:
>
> * Srivatsa Vaddagiri <[email protected]> wrote:
>
> > - inc/dec_load() takes a load input instead of task pointer input as their
> > 2nd arg
> > - inc/dec_nr_running don't call inc/dec_load. Instead,
> > - enqueue/dequeue_task class callbacks call inc/dec_load
> > - [Unintended/will-fix change] min/max tunables added in
> > /proc/sys/kernel
> >
> > All of above changes (except last, which I will fix) should have zero
> > functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I
> > don't see how I can split Patch 2/2 further.
>
> ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could try
> it.

Ok ..thx. I was beginning to make changes to avoid even the above minor changes
for !CONFIG_FAIR_GROUP_SCHED case, but it doesn't look neat, hence will drop
that effort.

I am fixing other problems observed with Patch 1/2 (usage of a mutex to
serialize create/destroy groups) and will resend the series very soon.

--
Regards,
vatsa

2007-11-19 19:01:29

by Ingo Molnar

Subject: Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups


* Srivatsa Vaddagiri <[email protected]> wrote:

> > ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could
> > try it.
>
> Ok ..thx. I was beginning to make changes to avoid even the above minor
> changes for !CONFIG_FAIR_GROUP_SCHED case, but it doesn't look neat,
> hence will drop that effort.

well please try it nevertheless, if the changes are not a NOP for the
!FAIR_GROUP_SCHED case it's harder to determine that they are indeed
harmless for the !FAIR_GROUP_SCHED case.

Ingo

2007-11-26 04:48:18

by Srivatsa Vaddagiri

Subject: [PATCH 0/4] sched: group scheduler related patches (V3)

Here's V3 of the group scheduler related patches, which mainly addresses
improved fairness of cpu bandwidth allocation for task groups.

Patch 1/4 -> coding style cleanup
Patch 2/4 -> Minor group scheduling related bug fixes

Patch 3/4 (v1) -> Modifies how cpu load is calculated, such that there is zero
impact on !CONFIG_FAIR_GROUP_SCHED
Patch 3/4 (v2) -> Modifies how cpu load is calculated, such that there is a
small impact on code size (but should have NO impact on
functionality or runtime behavior) for
!CONFIG_FAIR_GROUP_SCHED case. The resulting code however is
much neater since it avoids some #ifdefs. I prefer v2.

Patch 4/4 -> Updates load balance logic to provide improved fairness for
task groups.

To have zero impact on !CONFIG_FAIR_GROUP_SCHED case, please apply the
following patches:

- Patch 1/4
- Patch 2/4
- Patch 3/4 (v1)
- Patch 4/4

I personally prefer v2 of Patch 3/4. Even though it has a minor impact
on code size for !CONFIG_FAIR_GROUP_SCHED case, the overall code is much
neater IMHO.

Impact on sched.o size:
=======================

!CONFIG_FAIR_GROUP_SCHED:

text data bss dec hex filename
36829 2766 48 39643 9adb sched.o-before-nofgs
36829 2766 48 39643 9adb sched.o-after-v1-nofgs (v1 of Patch 3/4)
36843 2766 48 39657 9ae9 sched.o-after-v2-nofgs (v2 of Patch 3/4)

CONFIG_FAIR_GROUP_SCHED:

text data bss dec hex filename
39019 3346 336 42701 a6cd sched.o-before-fgs
40303 3482 308 44093 ac3d sched.o-after-v1-fgs (v1 of Patch 3/4)
40303 3482 308 44093 ac3d sched.o-after-v2-fgs (v2 of Patch 3/4)


Changes since V2 of this patchset [1]:

- Split the patches better and make them pass under checkpatch.pl
script
- Fixed compile issues under different config options and also
a suspend failure (as posted by Ingo at [2])
- Make the load_balance_monitor thread run as a real-time task,
so that its execution is not limited by the shares allocated to
the default task group (init_task_group).
- Reduced the minimum shares that can be allocated to a group to 1
(from 100). This would be useful if someone wants a task group
to get very low bandwidth or get bandwidth only when other groups
are idle.
- Removed the tg->last_total_load check in rebalance_shares()
(which was incorrect in V2)

Changes since V1 of this patchset [3]:

- Introduced a task_group_mutex to serialize add/removal of task groups (as
pointed by Dipankar)

Please apply if there are no major concerns.


References:

1. http://marc.info/?l=linux-kernel&m=119549585223262
2. http://lkml.org/lkml/2007/11/19/127
3. http://marc.info/?l=linux-kernel&m=119547452517055


--
Regards,
vatsa

2007-11-26 04:49:55

by Srivatsa Vaddagiri

Subject: [PATCH 1/4] sched: code cleanup


Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);

--
Regards,
vatsa

2007-11-26 04:51:05

by Srivatsa Vaddagiri

Subject: [PATCH 2/4] sched: minor fixes for group scheduler


Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task
group list)


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 33 +++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 28 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};

@@ -182,6 +180,11 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];

+/* task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +225,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}

+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6747,7 +6762,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
- spin_lock_init(&init_task_group.lock);
#endif

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7001,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}

+ tg->shares = NICE_0_LOAD;
+
+ lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
- tg->shares = NICE_0_LOAD;
- spin_lock_init(&tg->lock);
+ unlock_task_group_list();

return tg;

@@ -7040,10 +7055,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;

+ lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+ unlock_task_group_list();

BUG_ON(!cfs_rq);

@@ -7117,7 +7134,7 @@ int sched_group_set_shares(struct task_g
{
int i;

- spin_lock(&tg->lock);
+ lock_task_group_list();
if (tg->shares == shares)
goto done;

@@ -7126,7 +7143,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);

done:
- spin_unlock(&tg->lock);
+ unlock_task_group_list();
return 0;
}

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(

/* Iterate thr' all leaf cfs_rq's on a runqueue */
#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+ list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

/* Do the two (enqueued) entities belong to the same group ? */
static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
#ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
#endif
+ lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+ unlock_task_group_list();
}
#endif

--
Regards,
vatsa

2007-11-26 04:52:35

by Srivatsa Vaddagiri

Subject: [Patch 3/4 v1] sched: change how cpu load is calculated

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated to it.

This version of patch (v1 of Patch 3/4) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.
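
As a rough illustration of the accounting change (a standalone sketch with
hypothetical names, not the patch code): under the old scheme a cpu's load is
the sum of the weights of its runnable tasks, while under the new scheme each
group with runnable tasks on the cpu contributes its per-cpu group weight once.

#include <stdio.h>

struct grp {
        unsigned long se_weight;        /* group's per-cpu weight, derived from its shares */
        int nr_tasks_on_cpu;            /* runnable tasks of this group on the cpu */
        unsigned long task_weight;      /* weight of each task, e.g. NICE_0_LOAD */
};

/* Old scheme: every runnable task contributes its own weight. */
static unsigned long cpu_load_old(const struct grp *g, int n)
{
        unsigned long load = 0;
        int i;

        for (i = 0; i < n; i++)
                load += g[i].nr_tasks_on_cpu * g[i].task_weight;
        return load;
}

/* New scheme: a group contributes its group weight once, however many tasks it has. */
static unsigned long cpu_load_new(const struct grp *g, int n)
{
        unsigned long load = 0;
        int i;

        for (i = 0; i < n; i++)
                if (g[i].nr_tasks_on_cpu)
                        load += g[i].se_weight;
        return load;
}

int main(void)
{
        /* one cpu running 5 tasks of a single group whose per-cpu weight is 1024 */
        struct grp g[] = { { .se_weight = 1024, .nr_tasks_on_cpu = 5, .task_weight = 1024 } };

        printf("old=%lu new=%lu\n", cpu_load_old(g, 1), cpu_load_new(g, 1));
        return 0;
}

The five tasks used to contribute 5120 to the cpu's load; with this patch they
contribute only the group's weight of 1024.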

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,15 +869,25 @@
struct rq_iterator *iterator);
#endif

-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED

-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }

static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -889,6 +899,18 @@
update_load_sub(&rq->load, p->se.load.weight);
}

+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se,
+ *topse = NULL; /* Highest schedulable entity */
+ int incload = 1;

for_each_sched_entity(se) {
- if (se->on_rq)
+ topse = se;
+ if (se->on_rq) {
+ incload = 0;
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+ /* Increment cpu load if we just enqueued the first task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (incload)
+ inc_cpu_load(rq, topse->load.weight);
}

/*
@@ -774,16 +785,28 @@
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se,
+ *topse = NULL; /* Highest schedulable entity */
+ int decload = 1;

for_each_sched_entity(se) {
+ topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ if (parent_entity(se))
+ decload = 0;
break;
+ }
sleep = 1;
}
+ /* Decrement cpu load if we just dequeued the last task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (decload)
+ dec_cpu_load(rq, topse->load.weight);
}

/*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+ inc_cpu_load(rq, p->se.load.weight);
}

/*
@@ -45,6 +46,7 @@
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+ dec_cpu_load(rq, p->se.load.weight);
}

/*


--
Regards,
vatsa

2007-11-26 04:53:42

by Srivatsa Vaddagiri

Subject: [Patch 3/4 v2] sched: change how cpu load is calculated


This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of the patch (v2 of Patch 3/4) has a minor impact on code size
(but should have no runtime/functional impact) for the !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/4
(because of fewer #ifdefs).

I prefer v2 of Patch 3/4.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,6 +869,16 @@
struct rq_iterator *iterator);
#endif

+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -879,26 +889,14 @@

#define sched_class_highest (&rt_sched_class)

-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}

static void dec_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running--;
- dec_load(rq, p);
}

static void set_load_weight(struct task_struct *p)
@@ -4070,10 +4068,8 @@
goto out_unlock;
}
on_rq = p->se.on_rq;
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- dec_load(rq, p);
- }

p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4083,7 +4079,6 @@

if (on_rq) {
enqueue_task(rq, p, 0);
- inc_load(rq, p);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se,
+ *topse = NULL; /* Highest schedulable entity */
+ int incload = 1;

for_each_sched_entity(se) {
- if (se->on_rq)
+ topse = se;
+ if (se->on_rq) {
+ incload = 0;
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+ /* Increment cpu load if we just enqueued the first task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (incload)
+ inc_cpu_load(rq, topse->load.weight);
}

/*
@@ -774,16 +785,28 @@
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se,
+ *topse = NULL; /* Highest schedulable entity */
+ int decload = 1;

for_each_sched_entity(se) {
+ topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ if (parent_entity(se))
+ decload = 0;
break;
+ }
sleep = 1;
}
+ /* Decrement cpu load if we just dequeued the last task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (decload)
+ dec_cpu_load(rq, topse->load.weight);
}

/*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+ inc_cpu_load(rq, p->se.load.weight);
}

/*
@@ -45,6 +46,7 @@
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+ dec_cpu_load(rq, p->se.load.weight);
}

/*

--
Regards,
vatsa

2007-11-26 04:56:29

by Srivatsa Vaddagiri

Subject: [Patch 4/4] sched: Improve fairness of cpu bandwidth allocation for task groups


The current load balancing scheme isn't good for group fairness.

For ex: on an 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of output:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces similar output if required]

The deviation from desired group fairness (i.e. Col3 - 33.33%) is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of the other runs have yielded a deviation within
+-2% at most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently a cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0           CPU1
---------------------------
A (10)         B(5)
               C(5)
---------------------------

Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each, all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
defined (see below)
- API Change
- Two tunables introduced in sysctl (under SCHED_DEBUG) to
control the frequency at which the load balance monitor
thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. It is now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i) = Weight of group K on cpu i
T(K,i) = Task load present in group K's cfs_rq on cpu i
T(K) = Total task load of group K across various cpus
S(K) = Shares allocated to group K
NRCPUS = Number of online cpus in the scheduler domain to
which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
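
As a quick sanity check, applying this to the 2-cpu example above (assuming
each group holds shares of 1024 and every task weighs NICE_0_LOAD = 1024):

W(A,0) = 1024 * 2 * 10240/10240 = 2048
W(B,1) = W(C,1) = 1024 * 2 * 5120/5120 = 2048

so CPU0 now carries a weight of 2048 while CPU1 carries 4096, an imbalance the
load balancer can act on.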

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which
it runs: the interval doubles from the min towards the max while shares stay
balanced, and drops back to the min whenever an adjustment was needed.


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
include/linux/sched.h | 4
kernel/sched.c | 265 ++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 86 ++++++++++------
kernel/sysctl.c | 18 +++
4 files changed, 334 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -1467,6 +1467,10 @@
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern unsigned int sysctl_sched_min_bal_int_shares;
+extern unsigned int sysctl_sched_max_bal_int_shares;
+#endif

int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -168,7 +168,43 @@
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
+
+ /*
+ * shares assigned to a task group governs how much of cpu bandwidth
+ * is allocated to the group. The more shares a group has, the more is
+ * the cpu bandwidth allocated to it.
+ *
+ * For ex, lets say that there are three task groups, A, B and C which
+ * have been assigned shares 1000, 2000 and 3000 respectively. Then,
+ * cpu bandwidth allocated by the scheduler to task groups A, B and C
+ * should be:
+ *
+ * Bw(A) = 1000/(1000+2000+3000) * 100 = 16.66%
+ * Bw(B) = 2000/(1000+2000+3000) * 100 = 33.33%
+ * Bw(C) = 3000/(1000+2000+3000) * 100 = 50%
+ *
+ * The weight assigned to a task group's schedulable entities on every
+ * cpu (task_group.se[a_cpu]->load.weight) is derived from the task
+ * group's shares. For ex: lets say that task group A has been
+ * assigned shares of 1000 and there are two CPUs in a system. Then,
+ *
+ * tg_A->se[0]->load.weight = tg_A->se[1]->load.weight = 1000;
+ *
+ * Note: It's not necessary that each of a task's group schedulable
+ * entity have the same weight on all CPUs. If the group
+ * has 2 of its tasks on CPU0 and 1 task on CPU1, then a
+ * better distribution of weight could be:
+ *
+ * tg_A->se[0]->load.weight = 2/3 * 2000 = 1333
+ * tg_A->se[1]->load.weight = 1/2 * 2000 = 667
+ *
+ * rebalance_shares() is responsible for distributing the shares of a
+ * task groups like this among the group's schedulable entities across
+ * cpus.
+ *
+ */
unsigned long shares;
+
struct rcu_head rcu;
};

@@ -184,6 +220,15 @@
* a task group's cpu shares.
*/
static DEFINE_MUTEX(task_group_mutex);
+static DEFINE_MUTEX(doms_cur_mutex); /* serialize access to doms_curr[] array */
+
+#ifdef CONFIG_SMP
+/* kernel thread that runs rebalance_shares() periodically */
+static struct task_struct *lb_monitor_task;
+static int load_balance_monitor(void *unused);
+#endif
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);

/* Default task group.
* Every task in system belong to this group at bootup.
@@ -199,6 +244,8 @@
# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

+#define MIN_GROUP_SHARES 1
+
static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
@@ -235,11 +282,23 @@
mutex_unlock(&task_group_mutex);
}

+static inline void lock_doms_cur(void)
+{
+ mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+ mutex_unlock(&doms_cur_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
static inline void lock_task_group_list(void) { }
static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6546,6 +6605,8 @@
{
int i, j;

+ lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();

@@ -6586,6 +6647,8 @@
ndoms_cur = ndoms_new;

register_sched_domain_sysctl();
+
+ unlock_doms_cur();
}

#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
@@ -6720,6 +6783,18 @@
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
sched_init_granularity();
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ lb_monitor_task = kthread_create(load_balance_monitor, NULL,
+ "load_balance_monitor");
+ if (!IS_ERR(lb_monitor_task)) {
+ lb_monitor_task->flags |= PF_NOFREEZE;
+ wake_up_process(lb_monitor_task);
+ } else {
+ printk(KERN_ERR "Could not create load balance monitor thread"
+ "(error = %ld) \n", PTR_ERR(lb_monitor_task));
+ }
+#endif
}
#else
void __init sched_init_smp(void)
@@ -6975,6 +7050,149 @@

#ifdef CONFIG_FAIR_GROUP_SCHED

+#ifdef CONFIG_SMP
+/* distribute shares of all task groups among their schedulable entities,
+ * to reflect load distrbution across cpus.
+ */
+static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+{
+ struct cfs_rq *cfs_rq;
+ struct rq *rq = cpu_rq(this_cpu);
+ cpumask_t sdspan = sd->span;
+ int balanced = 1;
+
+ /* Walk thr' all the task groups that we have */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ int i;
+ unsigned long total_load = 0, total_shares;
+ struct task_group *tg = cfs_rq->tg;
+
+ /* Gather total task load of this group across cpus */
+ for_each_cpu_mask(i, sdspan)
+ total_load += tg->cfs_rq[i]->load.weight;
+
+ /* Nothing to do if this group has no load */
+ if (!total_load)
+ continue;
+
+ /* tg->shares represents the number of cpu shares the task group
+ * is eligible to hold on a single cpu. On N cpus, it is
+ * eligible to hold (N * tg->shares) number of cpu shares.
+ */
+ total_shares = tg->shares * cpus_weight(sdspan);
+
+ /* redistribute total_shares across cpus as per the task load
+ * distribution.
+ */
+ for_each_cpu_mask(i, sdspan) {
+ unsigned long local_load, local_shares;
+
+ local_load = tg->cfs_rq[i]->load.weight;
+ local_shares = (local_load * total_shares) / total_load;
+ if (!local_shares)
+ local_shares = MIN_GROUP_SHARES;
+ if (local_shares == tg->se[i]->load.weight)
+ continue;
+
+ spin_lock_irq(&cpu_rq(i)->lock);
+ set_se_shares(tg->se[i], local_shares);
+ spin_unlock_irq(&cpu_rq(i)->lock);
+ balanced = 0;
+ }
+ }
+
+ return balanced;
+}
+
+/*
+ * How frequently should we rebalance_shares() across cpus?
+ *
+ * The more frequently we rebalance shares, the more accurate is the fairness
+ * of cpu bandwidth distribution between task groups. However higher frequency
+ * also implies increased scheduling overhead.
+ *
+ * sysctl_sched_min_bal_int_shares represents the minimum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * sysctl_sched_max_bal_int_shares represents the maximum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * These settings allows for the appropriate tradeoff between accuracy of
+ * fairness and the associated overhead.
+ *
+ */
+
+/* default: 8ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_min_bal_int_shares = 8;
+
+/* default: 128ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_max_bal_int_shares = 128;
+
+/* kernel thread that invokes rebalance_shares() periodically */
+static int load_balance_monitor(void *unused)
+{
+ unsigned int timeout = sysctl_sched_min_bal_int_shares;
+ struct sched_param schedparm;
+ int ret;
+
+ /* we don't want this thread's execution to be limited by the shares
+ * assigned to default group (init_task_group). Hence make it run
+ * as a RT task.
+ */
+ schedparm.sched_priority = 0; /* run at the lowest RT prio */
+ ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+ if (ret)
+ printk(KERN_ERR "Couldn't set SCHED_RR policy for load balance"
+ "monitor thread (error = %d) \n", ret);
+
+ while (!kthread_should_stop()) {
+ int i, cpu, balanced = 1;
+
+ lock_cpu_hotplug(); /* Prevent cpus going down or coming up */
+ lock_doms_cur(); /* lockout changes to doms_cur[] array */
+
+ rcu_read_lock(); /* to walk rq->sd chain on various cpus and to
+ * walk task group list in rebalance_shares().
+ */
+
+ for (i = 0; i < ndoms_cur; i++) {
+ cpumask_t cpumap = doms_cur[i];
+ struct sched_domain *sd = NULL, *sd_prev = NULL;
+
+ cpu = first_cpu(cpumap);
+
+ /* Find the highest domain at which to balance shares */
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ continue;
+ sd_prev = sd;
+ }
+
+ sd = sd_prev;
+ /* sd == NULL? No load balance reqd in this domain */
+ if (!sd)
+ continue;
+
+ balanced &= rebalance_shares(sd, cpu);
+ }
+
+ rcu_read_unlock();
+
+ unlock_doms_cur();
+ unlock_cpu_hotplug();
+
+ if (!balanced)
+ timeout = sysctl_sched_min_bal_int_shares;
+ else if (timeout < sysctl_sched_max_bal_int_shares)
+ timeout *= 2;
+
+ msleep_interruptible(timeout);
+ }
+
+ return 0;
+}
+#endif /* CONFIG_SMP */
+
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(void)
{
@@ -7131,39 +7349,74 @@
task_rq_unlock(rq, &flags);
}

+/* rq->lock to be locked by caller */
static void set_se_shares(struct sched_entity *se, unsigned long shares)
{
struct cfs_rq *cfs_rq = se->cfs_rq;
struct rq *rq = cfs_rq->rq;
int on_rq;

- spin_lock_irq(&rq->lock);
+ if (!shares)
+ shares = MIN_GROUP_SHARES;

on_rq = se->on_rq;
- if (on_rq)
+ if (on_rq) {
dequeue_entity(cfs_rq, se, 0);
+ dec_cpu_load(rq, se->load.weight);
+ }

se->load.weight = shares;
se->load.inv_weight = div64_64((1ULL<<32), shares);

- if (on_rq)
+ if (on_rq) {
enqueue_entity(cfs_rq, se, 0);
-
- spin_unlock_irq(&rq->lock);
+ inc_cpu_load(rq, se->load.weight);
+ }
}

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;

lock_task_group_list();
if (tg->shares == shares)
goto done;

+ if (shares < MIN_GROUP_SHARES)
+ shares = MIN_GROUP_SHARES;
+
+ /* Prevent any load balance activity (rebalance_shares,
+ * load_balance_fair) from referring to this group first,
+ * by taking it off the rq->leaf_cfs_rq_list on each cpu.
+ */
+ for_each_possible_cpu(i) {
+ cfs_rq = tg->cfs_rq[i];
+ list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+ }
+
+ /* wait for any ongoing reference to this group to finish */
+ synchronize_sched();
+
+ /* Now we are free to modify the group's share on each cpu
+ * w/o tripping rebalance_share or load_balance_fair.
+ */
tg->shares = shares;
- for_each_possible_cpu(i)
+ for_each_possible_cpu(i) {
+ spin_lock_irq(&cpu_rq(i)->lock);
set_se_shares(tg->se[i], shares);
+ spin_unlock_irq(&cpu_rq(i)->lock);
+ }

+ /* Enable load balance activity on this group, by inserting it back on
+ * each cpu's rq->leaf_cfs_rq_list.
+ */
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ cfs_rq = tg->cfs_rq[i];
+ list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
+ }
done:
unlock_task_group_list();
return 0;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -702,6 +702,8 @@
return se->parent;
}

+#define GROUP_IMBALANCE_PCT 20
+
#else /* CONFIG_FAIR_GROUP_SCHED */

#define for_each_sched_entity(se) \
@@ -961,25 +963,6 @@
return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
-static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *curr;
- struct task_struct *p;
-
- if (!cfs_rq->nr_running)
- return MAX_PRIO;
-
- curr = cfs_rq->curr;
- if (!curr)
- curr = __pick_next_entity(cfs_rq);
-
- p = task_of(curr);
-
- return p->prio;
-}
-#endif
-
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -989,28 +972,44 @@
struct cfs_rq *busy_cfs_rq;
long rem_load_move = max_load_move;
struct rq_iterator cfs_rq_iterator;
+ unsigned long load_moved;

cfs_rq_iterator.start = load_balance_start_fair;
cfs_rq_iterator.next = load_balance_next_fair;

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
#ifdef CONFIG_FAIR_GROUP_SCHED
- struct cfs_rq *this_cfs_rq;
- long imbalance;
- unsigned long maxload;
-
- this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
-
- imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
- /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
- if (imbalance <= 0)
+ struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
+ unsigned long maxload, task_load, group_weight;
+ unsigned long thisload, per_task_load;
+ struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
+
+ task_load = busy_cfs_rq->load.weight;
+ group_weight = se->load.weight;
+
+ /*
+ * 'group_weight' is contributed by tasks of total weight
+ * 'task_load'. To move 'rem_load_move' worth of weight only,
+ * we need to move a maximum task load of:
+ *
+ * maxload = (remload / group_weight) * task_load;
+ */
+ maxload = (rem_load_move * task_load) / group_weight;
+
+ if (!maxload || !task_load)
continue;

- /* Don't pull more than imbalance/2 */
- imbalance /= 2;
- maxload = min(rem_load_move, imbalance);
+ per_task_load = task_load / busy_cfs_rq->nr_running;
+ /* balance_tasks will try to forcibly move atleast one task if
+ * possible (because of SCHED_LOAD_SCALE_FUZZ). Avoid that if
+ * maxload is less than GROUP_IMBALANCE_FUZZ% the per_task_load.
+ */
+ if (100 * maxload < GROUP_IMBALANCE_PCT * per_task_load)
+ continue;

- *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+ /* Disable priority-based load balance */
+ *this_best_prio = 0;
+ thisload = this_cfs_rq->load.weight;
#else
# define maxload rem_load_move
#endif
@@ -1019,11 +1018,32 @@
* load_balance_[start|next]_fair iterators
*/
cfs_rq_iterator.arg = busy_cfs_rq;
- rem_load_move -= balance_tasks(this_rq, this_cpu, busiest,
+ load_moved = balance_tasks(this_rq, this_cpu, busiest,
maxload, sd, idle, all_pinned,
this_best_prio,
&cfs_rq_iterator);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ /* load_moved holds the task load that was moved. The
+ * effective weight moved would be:
+ * load_moved_eff = load_moved/task_load * group_weight;
+ */
+ load_moved = (group_weight * load_moved) / task_load;
+
+ /* Adjust shares on both cpus to reflect load_moved */
+ group_weight -= load_moved;
+ set_se_shares(se, group_weight);
+
+ se = busy_cfs_rq->tg->se[this_cpu];
+ if (!thisload)
+ group_weight = load_moved;
+ else
+ group_weight = se->load.weight + load_moved;
+ set_se_shares(se, group_weight);
+#endif
+
+ rem_load_move -= load_moved;
+
if (rem_load_move <= 0)
break;
}
Index: current/kernel/sysctl.c
===================================================================
--- current.orig/kernel/sysctl.c
+++ current/kernel/sysctl.c
@@ -309,6 +309,24 @@
.mode = 644,
.proc_handler = &proc_dointvec,
},
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_min_bal_int_shares",
+ .data = &sysctl_sched_min_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_max_bal_int_shares",
+ .data = &sysctl_sched_max_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#endif
{
.ctl_name = CTL_UNNUMBERED,

--
Regards,
vatsa

2007-11-26 20:28:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 4/4] sched: Improve fairness of cpu bandwidth allocation for task groups


* Srivatsa Vaddagiri <[email protected]> wrote:

> + /* we don't want this thread's execution to be limited by the shares
> + * assigned to default group (init_task_group). Hence make it run
> + * as a RT task.
> + */
> + schedparm.sched_priority = 0; /* run at the lowest RT prio */
> + ret = sched_setscheduler(current, SCHED_RR, &schedparm);
> + if (ret)
> + printk(KERN_ERR "Couldn't set SCHED_RR policy for load balance"
> + "monitor thread (error = %d) \n", ret);

the first SCHED_RR priority is 1, not 0 - so this call will always fail.
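
For reference, the minimal fix (and what V4 of the patchset below ends up
doing) is to request the lowest valid RT priority, which for SCHED_RR is 1:

	struct sched_param schedparm;
	int ret;

	/* the lowest valid SCHED_RR priority is 1, not 0 */
	schedparm.sched_priority = 1;
	ret = sched_setscheduler(current, SCHED_RR, &schedparm);
	if (ret)
		printk(KERN_ERR "Couldn't set SCHED_RR policy for load balance"
		       " monitor thread (error = %d)\n", ret);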

> + lock_cpu_hotplug(); /* Prevent cpus going down or coming up */
> + lock_doms_cur(); /* lockout changes to doms_cur[] array */

please put comments in front of the line like we do in most of sched.c.

> + rcu_read_lock(); /* to walk rq->sd chain on various cpus and to
> + * walk task group list in rebalance_shares().
> + */

the proper comment format is in front of the line and in:

/*
* Comment.
*/

Ingo

2007-11-26 20:30:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 4/4] sched: Improve fairness of cpu bandwidth allocation for task groups


* Srivatsa Vaddagiri <[email protected]> wrote:

> +static inline void lock_doms_cur(void)
> +{
> + mutex_lock(&doms_cur_mutex);
> +}
> +
> +static inline void unlock_doms_cur(void)
> +{
> + mutex_unlock(&doms_cur_mutex);
> +}
> +
> #else
>
> static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
> static inline void lock_task_group_list(void) { }
> static inline void unlock_task_group_list(void) { }
> +static inline void lock_doms_cur(void) { }
> +static inline void unlock_doms_cur(void) { }
>
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> @@ -6546,6 +6605,8 @@
> {
> int i, j;
>
> + lock_doms_cur();
> +
> /* always unregister in case we don't destroy any domains */
> unregister_sched_domain_sysctl();
>
> @@ -6586,6 +6647,8 @@
> ndoms_cur = ndoms_new;
>
> register_sched_domain_sysctl();
> +
> + unlock_doms_cur();
> }

this API and locking should be introduced in a separate patch first, to
reduce the size and impact of the 4/4 patch.

Ingo

2007-11-27 04:54:17

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 0/5] sched: group scheduler related patches (V4)

On Mon, Nov 26, 2007 at 09:28:36PM +0100, Ingo Molnar wrote:
> the first SCHED_RR priority is 1, not 0 - so this call will always fail.

Thanks for spotting this bug and rest of your review comments.

Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.

Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):

- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patch

I have also tested this patchset against your latest git tree as of
this morning.

Please apply if there are no major concerns.


--
Regards,
vatsa

2007-11-27 04:55:42

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 1/5] sched: code cleanup

Minor cleanups:

- Fix coding style
- remove obsolete comment


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);


--
Regards,
vatsa

2007-11-27 04:56:37

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 2/5] sched: minor fixes for group scheduler

Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
walking task group list)


Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 34 ++++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 29 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};

@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];

+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}

+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.shares = init_task_group_load;
- spin_lock_init(&init_task_group.lock);
#endif

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
@@ -6987,14 +7002,15 @@ struct task_group *sched_create_group(vo
se->parent = NULL;
}

+ tg->shares = NICE_0_LOAD;
+
+ lock_task_group_list();
for_each_possible_cpu(i) {
rq = cpu_rq(i);
cfs_rq = tg->cfs_rq[i];
list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
}
-
- tg->shares = NICE_0_LOAD;
- spin_lock_init(&tg->lock);
+ unlock_task_group_list();

return tg;

@@ -7040,10 +7056,12 @@ void sched_destroy_group(struct task_gro
struct cfs_rq *cfs_rq = NULL;
int i;

+ lock_task_group_list();
for_each_possible_cpu(i) {
cfs_rq = tg->cfs_rq[i];
list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}
+ unlock_task_group_list();

BUG_ON(!cfs_rq);

@@ -7117,7 +7135,7 @@ int sched_group_set_shares(struct task_g
{
int i;

- spin_lock(&tg->lock);
+ lock_task_group_list();
if (tg->shares == shares)
goto done;

@@ -7126,7 +7144,7 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);

done:
- spin_unlock(&tg->lock);
+ unlock_task_group_list();
return 0;
}

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -685,7 +685,7 @@ static inline struct cfs_rq *cpu_cfs_rq(

/* Iterate thr' all leaf cfs_rq's on a runqueue */
#define for_each_leaf_cfs_rq(rq, cfs_rq) \
- list_for_each_entry(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
+ list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)

/* Do the two (enqueued) entities belong to the same group ? */
static inline int
@@ -1126,7 +1126,9 @@ static void print_cfs_stats(struct seq_f
#ifdef CONFIG_FAIR_GROUP_SCHED
print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
#endif
+ lock_task_group_list();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
+ unlock_task_group_list();
}
#endif

--
Regards,
vatsa

2007-11-27 04:58:20

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 3/5 v1] sched: change how cpu load is calculated

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.
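
A tiny illustration of the difference (mine, not from the patch; the numbers
are made up):

	/*
	 * Two groups have entities queued on a cpu:
	 *   tg_A->se[cpu]->load.weight = 2048
	 *   tg_B->se[cpu]->load.weight = 1024
	 *
	 * Old scheme: rq->load.weight = sum of all task weights
	 *             (e.g. 15 runnable tasks * 1024 = 15360)
	 * New scheme: rq->load.weight = 2048 + 1024 = 3072,
	 *             i.e. the sum of the group entity weights, regardless of
	 *             how many tasks each group has queued on this cpu.
	 */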

This version of the patch (v1 of Patch 3/5) has zero impact for the
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>


---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif

-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED

-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }

static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
}

+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#include "sched_fair.c"
+#include "sched_rt.c"
+#ifdef CONFIG_SCHED_DEBUG
+# include "sched_debug.c"
+#endif
+
+#define sched_class_highest (&rt_sched_class)
+
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se, *topse = NULL;
+ int incload = 1;

for_each_sched_entity(se) {
- if (se->on_rq)
+ topse = se;
+ if (se->on_rq) {
+ incload = 0;
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+ /*
+ * Increment cpu load if we just enqueued the first task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (incload)
+ inc_cpu_load(rq, topse->load.weight);
}

/*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se, *topse = NULL;
+ int decload = 1;

for_each_sched_entity(se) {
+ topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ if (parent_entity(se))
+ decload = 0;
break;
+ }
sleep = 1;
}
+ /*
+ * Decrement cpu load if we just dequeued the last task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (decload)
+ dec_cpu_load(rq, topse->load.weight);
}

/*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+ inc_cpu_load(rq, p->se.load.weight);
}

/*
@@ -45,6 +46,7 @@ static void dequeue_task_rt(struct rq *r
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+ dec_cpu_load(rq, p->se.load.weight);
}

/*

--
Regards,
vatsa

2007-11-27 04:59:57

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 3/5 v2] sched: change how cpu load is calculated

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of the patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for the !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of fewer #ifdefs).

I prefer v2 of Patch 3/5.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif

+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}

static void dec_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running--;
- dec_load(rq, p);
}

static void set_load_weight(struct task_struct *p)
@@ -4071,10 +4069,8 @@ void set_user_nice(struct task_struct *p
goto out_unlock;
}
on_rq = p->se.on_rq;
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- dec_load(rq, p);
- }

p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p);
@@ -4084,7 +4080,6 @@ void set_user_nice(struct task_struct *p

if (on_rq) {
enqueue_task(rq, p, 0);
- inc_load(rq, p);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -755,15 +755,26 @@ static inline struct sched_entity *paren
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se, *topse = NULL;
+ int incload = 1;

for_each_sched_entity(se) {
- if (se->on_rq)
+ topse = se;
+ if (se->on_rq) {
+ incload = 0;
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
+ /*
+ * Increment cpu load if we just enqueued the first task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (incload)
+ inc_cpu_load(rq, topse->load.weight);
}

/*
@@ -774,16 +785,28 @@ static void enqueue_task_fair(struct rq
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
+ struct sched_entity *se = &p->se, *topse = NULL;
+ int decload = 1;

for_each_sched_entity(se) {
+ topse = se;
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
/* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ if (cfs_rq->load.weight) {
+ if (parent_entity(se))
+ decload = 0;
break;
+ }
sleep = 1;
}
+ /*
+ * Decrement cpu load if we just dequeued the last task of a group on
+ * 'rq->cpu'. 'topse' represents the group to which task 'p' belongs
+ * at the highest grouping level.
+ */
+ if (decload)
+ dec_cpu_load(rq, topse->load.weight);
}

/*
Index: current/kernel/sched_rt.c
===================================================================
--- current.orig/kernel/sched_rt.c
+++ current/kernel/sched_rt.c
@@ -31,6 +31,7 @@ static void enqueue_task_rt(struct rq *r

list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
+ inc_cpu_load(rq, p->se.load.weight);
}

/*
@@ -45,6 +46,7 @@ static void dequeue_task_rt(struct rq *r
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
+ dec_cpu_load(rq, p->se.load.weight);
}

/*

--
Regards,
vatsa

2007-11-27 05:08:43

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 4/5] sched: introduce a mutex and corresponding API to serialize access to doms_cur[] array

The doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently the cpusets code can modify this array (by calling
partition_sched_domains()) as a result of the user modifying the
sched_load_balance flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array without worrying about concurrent modifications to the array.

The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk through the doms_cur[] array while rebalancing
shares of task groups across cpus.
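
A reader's critical section then looks roughly as below (a sketch of the
pattern the next patch uses in its load balance monitor thread; error and
cpu-hotplug handling omitted):

	int i;

	lock_doms_cur();
	for (i = 0; i < ndoms_cur; i++) {
		cpumask_t span = doms_cur[i];

		/* ... pick a cpu in 'span' and rebalance group shares ... */
	}
	unlock_doms_cur();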

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
kernel/sched.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
*/
static DEFINE_MUTEX(task_group_mutex);

+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
}

+static inline void lock_doms_cur(void)
+{
+ mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+ mutex_unlock(&doms_cur_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
static inline void lock_task_group_list(void) { }
static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
{
int i, j;

+ lock_doms_cur();
+
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();

@@ -6587,6 +6604,8 @@ match2:
ndoms_cur = ndoms_new;

register_sched_domain_sysctl();
+
+ unlock_doms_cur();
}

#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)


--
Regards,
vatsa

2007-11-27 05:15:12

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: [Patch 5/5] sched: Improve fairness of cpu bandwidth allocation for task groups


The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most other runs have yielded a deviation within
+/-2% at most, which is good.

Why do we see bad (group) fairness with the current scheduler?
==============================================================

Currently a cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0 CPU1
---------------------------
A (10) B(5)
C(5)
---------------------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
defined (see below)
- API Change
- Two tunables introduced via sysctl (under SCHED_DEBUG) to
control the frequency at which the load balance monitor
thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. It's now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i) = Weight of group K on cpu i
T(K,i) = Task load present in group K's cfs_rq on cpu i
T(K) = Total task load of group K across various cpus
S(K) = Shares allocated to group K
NRCPUS = Number of online cpus in the scheduler domain to
which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts each group's weight on each cpu. To keep its overhead in
check, two min/max tunables are introduced (under SCHED_DEBUG) that control
the rate at which it runs.

Signed-off-by: Srivatsa Vaddagiri <[email protected]>

---
include/linux/sched.h | 4
kernel/sched.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 88 ++++++++++------
kernel/sysctl.c | 18 +++
4 files changed, 330 insertions(+), 39 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -1467,6 +1467,10 @@ extern unsigned int sysctl_sched_child_r
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+extern unsigned int sysctl_sched_min_bal_int_shares;
+extern unsigned int sysctl_sched_max_bal_int_shares;
+#endif

int sched_nr_latency_handler(struct ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length,
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -168,7 +168,43 @@ struct task_group {
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
+
+ /*
+ * shares assigned to a task group governs how much of cpu bandwidth
+ * is allocated to the group. The more shares a group has, the more is
+ * the cpu bandwidth allocated to it.
+ *
+ * For ex, lets say that there are three task groups, A, B and C which
+ * have been assigned shares 1000, 2000 and 3000 respectively. Then,
+ * cpu bandwidth allocated by the scheduler to task groups A, B and C
+ * should be:
+ *
+ * Bw(A) = 1000/(1000+2000+3000) * 100 = 16.66%
+ * Bw(B) = 2000/(1000+2000+3000) * 100 = 33.33%
+ * Bw(C) = 3000/(1000+2000+3000) * 100 = 50%
+ *
+ * The weight assigned to a task group's schedulable entities on every
+ * cpu (task_group.se[a_cpu]->load.weight) is derived from the task
+ * group's shares. For ex: lets say that task group A has been
+ * assigned shares of 1000 and there are two CPUs in a system. Then,
+ *
+ * tg_A->se[0]->load.weight = tg_A->se[1]->load.weight = 1000;
+ *
+ * Note: It's not necessary that each of a task's group schedulable
+ * entity have the same weight on all CPUs. If the group
+ * has 2 of its tasks on CPU0 and 1 task on CPU1, then a
+ * better distribution of weight could be:
+ *
+ * tg_A->se[0]->load.weight = 2/3 * 2000 = 1333
+ * tg_A->se[1]->load.weight = 1/2 * 2000 = 667
+ *
+ * rebalance_shares() is responsible for distributing the shares of a
+ * task groups like this among the group's schedulable entities across
+ * cpus.
+ *
+ */
unsigned long shares;
+
struct rcu_head rcu;
};

@@ -189,6 +225,14 @@ static DEFINE_MUTEX(task_group_mutex);
/* doms_cur_mutex serializes access to doms_cur[] array */
static DEFINE_MUTEX(doms_cur_mutex);

+#ifdef CONFIG_SMP
+/* kernel thread that runs rebalance_shares() periodically */
+static struct task_struct *lb_monitor_task;
+static int load_balance_monitor(void *unused);
+#endif
+
+static void set_se_shares(struct sched_entity *se, unsigned long shares);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -203,6 +247,8 @@ struct task_group init_task_group = {
# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

+#define MIN_GROUP_SHARES 1
+
static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
@@ -6740,6 +6786,18 @@ void __init sched_init_smp(void)
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
sched_init_granularity();
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ lb_monitor_task = kthread_create(load_balance_monitor, NULL,
+ "load_balance_monitor");
+ if (!IS_ERR(lb_monitor_task)) {
+ lb_monitor_task->flags |= PF_NOFREEZE;
+ wake_up_process(lb_monitor_task);
+ } else {
+ printk(KERN_ERR "Could not create load balance monitor thread"
+ "(error = %ld) \n", PTR_ERR(lb_monitor_task));
+ }
+#endif
}
#else
void __init sched_init_smp(void)
@@ -6995,6 +7053,157 @@ void set_curr_task(int cpu, struct task_

#ifdef CONFIG_FAIR_GROUP_SCHED

+#ifdef CONFIG_SMP
+/*
+ * distribute shares of all task groups among their schedulable entities,
+ * to reflect load distrbution across cpus.
+ */
+static int rebalance_shares(struct sched_domain *sd, int this_cpu)
+{
+ struct cfs_rq *cfs_rq;
+ struct rq *rq = cpu_rq(this_cpu);
+ cpumask_t sdspan = sd->span;
+ int balanced = 1;
+
+ /* Walk thr' all the task groups that we have */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ int i;
+ unsigned long total_load = 0, total_shares;
+ struct task_group *tg = cfs_rq->tg;
+
+ /* Gather total task load of this group across cpus */
+ for_each_cpu_mask(i, sdspan)
+ total_load += tg->cfs_rq[i]->load.weight;
+
+ /* Nothing to do if this group has no load */
+ if (!total_load)
+ continue;
+
+ /*
+ * tg->shares represents the number of cpu shares the task group
+ * is eligible to hold on a single cpu. On N cpus, it is
+ * eligible to hold (N * tg->shares) number of cpu shares.
+ */
+ total_shares = tg->shares * cpus_weight(sdspan);
+
+ /*
+ * redistribute total_shares across cpus as per the task load
+ * distribution.
+ */
+ for_each_cpu_mask(i, sdspan) {
+ unsigned long local_load, local_shares;
+
+ local_load = tg->cfs_rq[i]->load.weight;
+ local_shares = (local_load * total_shares) / total_load;
+ if (!local_shares)
+ local_shares = MIN_GROUP_SHARES;
+ if (local_shares == tg->se[i]->load.weight)
+ continue;
+
+ spin_lock_irq(&cpu_rq(i)->lock);
+ set_se_shares(tg->se[i], local_shares);
+ spin_unlock_irq(&cpu_rq(i)->lock);
+ balanced = 0;
+ }
+ }
+
+ return balanced;
+}
+
+/*
+ * How frequently should we rebalance_shares() across cpus?
+ *
+ * The more frequently we rebalance shares, the more accurate is the fairness
+ * of cpu bandwidth distribution between task groups. However higher frequency
+ * also implies increased scheduling overhead.
+ *
+ * sysctl_sched_min_bal_int_shares represents the minimum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * sysctl_sched_max_bal_int_shares represents the maximum interval between
+ * consecutive calls to rebalance_shares() in the same sched domain.
+ *
+ * These settings allows for the appropriate tradeoff between accuracy of
+ * fairness and the associated overhead.
+ *
+ */
+
+/* default: 8ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_min_bal_int_shares = 8;
+
+/* default: 128ms, units: milliseconds */
+const_debug unsigned int sysctl_sched_max_bal_int_shares = 128;
+
+/* kernel thread that runs rebalance_shares() periodically */
+static int load_balance_monitor(void *unused)
+{
+ unsigned int timeout = sysctl_sched_min_bal_int_shares;
+ struct sched_param schedparm;
+ int ret;
+
+ /*
+ * We don't want this thread's execution to be limited by the shares
+ * assigned to default group (init_task_group). Hence make it run
+ * as a SCHED_RR RT task at the lowest priority.
+ */
+ schedparm.sched_priority = 1;
+ ret = sched_setscheduler(current, SCHED_RR, &schedparm);
+ if (ret)
+ printk(KERN_ERR "Couldn't set SCHED_RR policy for load balance"
+ " monitor thread (error = %d) \n", ret);
+
+ while (!kthread_should_stop()) {
+ int i, cpu, balanced = 1;
+
+ /* Prevent cpus going down or coming up */
+ lock_cpu_hotplug();
+ /* lockout changes to doms_cur[] array */
+ lock_doms_cur();
+ /*
+ * Enter a rcu read-side critical section to safely walk rq->sd
+ * chain on various cpus and to walk task group list
+ * (rq->leaf_cfs_rq_list) in rebalance_shares().
+ */
+ rcu_read_lock();
+
+ for (i = 0; i < ndoms_cur; i++) {
+ cpumask_t cpumap = doms_cur[i];
+ struct sched_domain *sd = NULL, *sd_prev = NULL;
+
+ cpu = first_cpu(cpumap);
+
+ /* Find the highest domain at which to balance shares */
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ continue;
+ sd_prev = sd;
+ }
+
+ sd = sd_prev;
+ /* sd == NULL? No load balance reqd in this domain */
+ if (!sd)
+ continue;
+
+ balanced &= rebalance_shares(sd, cpu);
+ }
+
+ rcu_read_unlock();
+
+ unlock_doms_cur();
+ unlock_cpu_hotplug();
+
+ if (!balanced)
+ timeout = sysctl_sched_min_bal_int_shares;
+ else if (timeout < sysctl_sched_max_bal_int_shares)
+ timeout *= 2;
+
+ msleep_interruptible(timeout);
+ }
+
+ return 0;
+}
+#endif /* CONFIG_SMP */
+
/* allocate runqueue etc for a new task group */
struct task_group *sched_create_group(void)
{
@@ -7151,39 +7360,77 @@ done:
task_rq_unlock(rq, &flags);
}

+/* rq->lock to be locked by caller */
static void set_se_shares(struct sched_entity *se, unsigned long shares)
{
struct cfs_rq *cfs_rq = se->cfs_rq;
struct rq *rq = cfs_rq->rq;
int on_rq;

- spin_lock_irq(&rq->lock);
+ if (!shares)
+ shares = MIN_GROUP_SHARES;

on_rq = se->on_rq;
- if (on_rq)
+ if (on_rq) {
dequeue_entity(cfs_rq, se, 0);
+ dec_cpu_load(rq, se->load.weight);
+ }

se->load.weight = shares;
se->load.inv_weight = div64_64((1ULL<<32), shares);

- if (on_rq)
+ if (on_rq) {
enqueue_entity(cfs_rq, se, 0);
-
- spin_unlock_irq(&rq->lock);
+ inc_cpu_load(rq, se->load.weight);
+ }
}

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;

lock_task_group_list();
if (tg->shares == shares)
goto done;

+ if (shares < MIN_GROUP_SHARES)
+ shares = MIN_GROUP_SHARES;
+
+ /*
+ * Prevent any load balance activity (rebalance_shares,
+ * load_balance_fair) from referring to this group first,
+ * by taking it off the rq->leaf_cfs_rq_list on each cpu.
+ */
+ for_each_possible_cpu(i) {
+ cfs_rq = tg->cfs_rq[i];
+ list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+ }
+
+ /* wait for any ongoing reference to this group to finish */
+ synchronize_sched();
+
+ /*
+ * Now we are free to modify the group's share on each cpu
+ * w/o tripping rebalance_share or load_balance_fair.
+ */
tg->shares = shares;
- for_each_possible_cpu(i)
+ for_each_possible_cpu(i) {
+ spin_lock_irq(&cpu_rq(i)->lock);
set_se_shares(tg->se[i], shares);
+ spin_unlock_irq(&cpu_rq(i)->lock);
+ }

+ /*
+ * Enable load balance activity on this group, by inserting it back on
+ * each cpu's rq->leaf_cfs_rq_list.
+ */
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ cfs_rq = tg->cfs_rq[i];
+ list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
+ }
done:
unlock_task_group_list();
return 0;
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -702,6 +702,8 @@ static inline struct sched_entity *paren
return se->parent;
}

+#define GROUP_IMBALANCE_PCT 20
+
#else /* CONFIG_FAIR_GROUP_SCHED */

#define for_each_sched_entity(se) \
@@ -961,25 +963,6 @@ static struct task_struct *load_balance_
return __load_balance_iterator(cfs_rq, cfs_rq->rb_load_balance_curr);
}

-#ifdef CONFIG_FAIR_GROUP_SCHED
-static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
-{
- struct sched_entity *curr;
- struct task_struct *p;
-
- if (!cfs_rq->nr_running)
- return MAX_PRIO;
-
- curr = cfs_rq->curr;
- if (!curr)
- curr = __pick_next_entity(cfs_rq);
-
- p = task_of(curr);
-
- return p->prio;
-}
-#endif
-
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -989,28 +972,45 @@ load_balance_fair(struct rq *this_rq, in
struct cfs_rq *busy_cfs_rq;
long rem_load_move = max_load_move;
struct rq_iterator cfs_rq_iterator;
+ unsigned long load_moved;

cfs_rq_iterator.start = load_balance_start_fair;
cfs_rq_iterator.next = load_balance_next_fair;

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
#ifdef CONFIG_FAIR_GROUP_SCHED
- struct cfs_rq *this_cfs_rq;
- long imbalance;
- unsigned long maxload;
-
- this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
-
- imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
- /* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
- if (imbalance <= 0)
+ struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
+ unsigned long maxload, task_load, group_weight;
+ unsigned long thisload, per_task_load;
+ struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
+
+ task_load = busy_cfs_rq->load.weight;
+ group_weight = se->load.weight;
+
+ /*
+ * 'group_weight' is contributed by tasks of total weight
+ * 'task_load'. To move 'rem_load_move' worth of weight only,
+ * we need to move a maximum task load of:
+ *
+ * maxload = (remload / group_weight) * task_load;
+ */
+ maxload = (rem_load_move * task_load) / group_weight;
+
+ if (!maxload || !task_load)
continue;

- /* Don't pull more than imbalance/2 */
- imbalance /= 2;
- maxload = min(rem_load_move, imbalance);
+ per_task_load = task_load / busy_cfs_rq->nr_running;
+ /*
+ * balance_tasks will try to forcibly move atleast one task if
+ * possible (because of SCHED_LOAD_SCALE_FUZZ). Avoid that if
+ * maxload is less than GROUP_IMBALANCE_FUZZ% the per_task_load.
+ */
+ if (100 * maxload < GROUP_IMBALANCE_PCT * per_task_load)
+ continue;

- *this_best_prio = cfs_rq_best_prio(this_cfs_rq);
+ /* Disable priority-based load balance */
+ *this_best_prio = 0;
+ thisload = this_cfs_rq->load.weight;
#else
# define maxload rem_load_move
#endif
@@ -1019,11 +1019,33 @@ load_balance_fair(struct rq *this_rq, in
* load_balance_[start|next]_fair iterators
*/
cfs_rq_iterator.arg = busy_cfs_rq;
- rem_load_move -= balance_tasks(this_rq, this_cpu, busiest,
+ load_moved = balance_tasks(this_rq, this_cpu, busiest,
maxload, sd, idle, all_pinned,
this_best_prio,
&cfs_rq_iterator);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ /*
+ * load_moved holds the task load that was moved. The
+ * effective (group) weight moved would be:
+ * load_moved_eff = load_moved/task_load * group_weight;
+ */
+ load_moved = (group_weight * load_moved) / task_load;
+
+ /* Adjust shares on both cpus to reflect load_moved */
+ group_weight -= load_moved;
+ set_se_shares(se, group_weight);
+
+ se = busy_cfs_rq->tg->se[this_cpu];
+ if (!thisload)
+ group_weight = load_moved;
+ else
+ group_weight = se->load.weight + load_moved;
+ set_se_shares(se, group_weight);
+#endif
+
+ rem_load_move -= load_moved;
+
if (rem_load_move <= 0)
break;
}
Index: current/kernel/sysctl.c
===================================================================
--- current.orig/kernel/sysctl.c
+++ current/kernel/sysctl.c
@@ -309,6 +309,24 @@ static struct ctl_table kern_table[] = {
.mode = 644,
.proc_handler = &proc_dointvec,
},
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_min_bal_int_shares",
+ .data = &sysctl_sched_min_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_max_bal_int_shares",
+ .data = &sysctl_sched_max_bal_int_shares,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#endif
{
.ctl_name = CTL_UNNUMBERED,

--
Regards,
vatsa

2007-11-27 11:09:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 0/5] sched: group scheduler related patches (V4)


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Mon, Nov 26, 2007 at 09:28:36PM +0100, Ingo Molnar wrote:
> > the first SCHED_RR priority is 1, not 0 - so this call will always fail.
>
> Thanks for spotting this bug and rest of your review comments.
>
> Here's V4 of the patchset, aimed at improving fairness of cpu
> bandwidth allocation for task groups.

thanks, it looks good - but the fact that we are at v4 of the patchset
underlines the point that this is more of a v2.6.25 patchset than a
v2.6.24 one. Group fairness certainly works fine enough in most setups
and we are late into the v2.6.24 -rc stage. We'll merge this patchset
(or any later versions of it) into v2.6.25 for sure, so distros that
happen to base things off v2.6.24 can pick up your patches just fine.

Ingo

2007-11-27 11:30:18

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [Patch 0/5] sched: group scheduler related patches (V4)

On Tue, Nov 27, 2007 at 12:09:10PM +0100, Ingo Molnar wrote:
> thanks, it looks good - but the fact that we are at v4 of the patchset
> underlines the point that this is more of a v2.6.25 patchset than a
> v2.6.24 one.

Fine. I will resubmit this patchset then once we are into 2.6.25 cycle.

> Group fairness certainly works fine enough in most setups
> and we are late into the v2.6.24 -rc stage. We'll merge this patchset
> (or any later versions of it) into v2.6.25 for sure,

Ok ..thx.

> so distros that
> happen to base things off v2.6.24 can pick up your patches just fine.

--
Regards,
vatsa

2007-11-27 12:53:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 0/5] sched: group scheduler related patches (V4)


* Srivatsa Vaddagiri <[email protected]> wrote:

> On Tue, Nov 27, 2007 at 12:09:10PM +0100, Ingo Molnar wrote:
> > thanks, it looks good - but the fact that we are at v4 of the patchset
> > underlines the point that this is more of a v2.6.25 patchset than a
> > v2.6.24 one.
>
> Fine. I will resubmit this patchset then once we are into 2.6.25
> cycle.

no need (unless you have bugfixes) i'm carrying this around in the
scheduler git tree. (it will show up in sched-devel.git)

Ingo

2007-11-27 14:20:27

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [Patch 0/5] sched: group scheduler related patches (V4)

On Tue, Nov 27, 2007 at 01:53:12PM +0100, Ingo Molnar wrote:
> > Fine. I will resubmit this patchset then once we are into 2.6.25
> > cycle.
>
> no need (unless you have bugfixes) i'm carrying this around in the
> scheduler git tree. (it will show up in sched-devel.git)

Cool .. Thx! It will get me some testing results (from those who test
sched-devel tree) and also will save me some maintenance trouble :)

--
Regards,
vatsa