2013-03-28 07:59:02

by Joonsoo Kim

Subject: [PATCH 0/5] optimization, clean-up, correctness about fair.c

There is no unified subject for this patchset.

Patch 1 removes one division operation by replacing it with a multiplication.
Patches 2 and 3 are clean-ups related to load_balance(); they improve
code size and, IMO, may also improve readability.
Patches 4 and 5 are correctness fixes for sched_slice().

Feel free to comment on this patchset.

It's based on v3.9-rc4, on top of my previous patchset. But perhaps it
does not really depend on my previous patchset. :)

https://lkml.org/lkml/2013/3/26/28
"[PATCH v2 0/6] correct load_balance()"

Thanks.

Joonsoo Kim (5):
sched: remove one division operation in find_busiest_queue()
sched: factor out code to should_we_balance()
sched: clean-up struct sd_lb_stats
sched: don't consider upper se in sched_slice()
sched: limit sched_slice if it is more than sysctl_sched_latency

kernel/sched/fair.c | 347 +++++++++++++++++++++++++--------------------------
1 file changed, 171 insertions(+), 176 deletions(-)

--
1.7.9.5


2013-03-28 07:59:04

by Joonsoo Kim

Subject: [PATCH 1/5] sched: remove one division operation in find_busiest_queue()

Remove one division operation in find_busiest_queue() by replacing the
per-cpu scaling division with a cross-multiplication in the comparison.
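
The idea is that wl_a / power_a > wl_b / power_b holds exactly when
wl_a * power_b > wl_b * power_a (both powers are positive), so the
per-iteration scaling division (wl * SCHED_POWER_SCALE / power) can be
dropped; SCHED_POWER_SCALE cancels out of the comparison. A minimal
userspace sketch of the equivalence (not kernel code; the helper name
and the numbers are only illustrative):

    #include <stdio.h>

    /*
     * Decide whether cpu A (load wl_a, power power_a) carries a higher
     * relative load than cpu B without dividing:
     * wl_a/power_a > wl_b/power_b  <=>  wl_a*power_b > wl_b*power_a
     */
    static int busier(unsigned long wl_a, unsigned long power_a,
                      unsigned long wl_b, unsigned long power_b)
    {
            return wl_a * power_b > wl_b * power_a;
    }

    int main(void)
    {
            /* prints 0: B (load 2048 at power 1024, i.e. 2.0x) is busier
             * than A (load 3072 at power 2048, i.e. 1.5x) */
            printf("A busier than B? %d\n", busier(3072, 2048, 2048, 1024));
            return 0;
    }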

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f238d2..1d8774f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4911,7 +4911,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
struct sched_group *group)
{
struct rq *busiest = NULL, *rq;
- unsigned long max_load = 0;
+ unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
int i;

for_each_cpu(i, sched_group_cpus(group)) {
@@ -4942,10 +4942,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* the load can be moved away from the cpu that is potentially
* running at a lower capacity.
*/
- wl = (wl * SCHED_POWER_SCALE) / power;
-
- if (wl > max_load) {
- max_load = wl;
+ if (wl * busiest_power > busiest_load * power) {
+ busiest_load = wl;
+ busiest_power = power;
busiest = rq;
}
}
--
1.7.9.5

2013-03-28 07:59:12

by Joonsoo Kim

Subject: [PATCH 3/5] sched: clean-up struct sd_lb_stats

There is no reason to maintain separate variables for this_group
and busiest_group in sd_lb_stats, except to save some space.
But this structure is always allocated on the stack, so the saving
isn't really beneficial.

This patch unifies these variables, so IMO readability may be improved.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 95ec757..204a9a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4175,36 +4175,6 @@ static unsigned long task_h_load(struct task_struct *p)

/********** Helpers for find_busiest_group ************************/
/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
@@ -4219,6 +4189,24 @@ struct sg_lb_stats {
int group_has_capacity; /* Is there extra capacity in the group? */
};

+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long sd_avg_load; /* Average load across all groups in sd */
+
+ /* Statistics of this group */
+ struct sg_lb_stats this_stat;
+
+ /* Statistics of the busiest group */
+ struct sg_lb_stats busiest_stat;
+};
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -4499,7 +4487,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->max_load)
+ if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;

if (sgs->sum_nr_running > sgs->group_capacity)
@@ -4536,7 +4524,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
{
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
- struct sg_lb_stats sgs;
+ struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;

if (child && child->flags & SD_PREFER_SIBLING)
@@ -4546,13 +4534,16 @@ static inline void update_sd_lb_stats(struct lb_env *env,

do {
int local_group;
+ struct sg_lb_stats *sgs;

local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
- memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);
-
- sds->total_load += sgs.group_load;
- sds->total_pwr += sg->sgp->power;
+ if (local_group)
+ sgs = &sds->this_stat;
+ else {
+ sgs = &tmp_sgs;
+ memset(sgs, 0, sizeof(*sgs));
+ }
+ update_sg_lb_stats(env, sg, load_idx, local_group, sgs);

/*
* In case the child domain prefers tasks go to siblings
@@ -4564,26 +4555,19 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (prefer_sibling && !local_group &&
+ sds->this && sds->this_stat.group_has_capacity)
+ sgs->group_capacity = min(sgs->group_capacity, 1UL);

- if (local_group) {
- sds->this_load = sgs.avg_load;
+ /* Now, start updating sd_lb_stats */
+ sds->total_load += sgs->group_load;
+ sds->total_pwr += sg->sgp->power;
+
+ if (local_group)
sds->this = sg;
- sds->this_nr_running = sgs.sum_nr_running;
- sds->this_load_per_task = sgs.sum_weighted_load;
- sds->this_has_capacity = sgs.group_has_capacity;
- sds->this_idle_cpus = sgs.idle_cpus;
- } else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
- sds->max_load = sgs.avg_load;
+ else if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
- sds->busiest_nr_running = sgs.sum_nr_running;
- sds->busiest_idle_cpus = sgs.idle_cpus;
- sds->busiest_group_capacity = sgs.group_capacity;
- sds->busiest_load_per_task = sgs.sum_weighted_load;
- sds->busiest_has_capacity = sgs.group_has_capacity;
- sds->busiest_group_weight = sgs.group_weight;
- sds->group_imb = sgs.group_imb;
+ sds->busiest_stat = *sgs;
}

sg = sg->next;
@@ -4627,8 +4611,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
if (env->dst_cpu > busiest_cpu)
return 0;

- env->imbalance = DIV_ROUND_CLOSEST(
- sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+ env->imbalance = DIV_ROUND_CLOSEST(sds->busiest_stat.avg_load *
+ sds->busiest->sgp->power, SCHED_POWER_SCALE);

return 1;
}
@@ -4646,24 +4630,22 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
unsigned long scaled_busy_load_per_task;
+ struct sg_lb_stats *this, *busiest;

- if (sds->this_nr_running) {
- sds->this_load_per_task /= sds->this_nr_running;
- if (sds->busiest_load_per_task >
- sds->this_load_per_task)
+ this = &sds->this_stat;
+ busiest = &sds->busiest_stat;
+
+ if (!this->sum_nr_running)
+ this->sum_weighted_load = cpu_avg_load_per_task(env->dst_cpu);
+ else if (busiest->sum_weighted_load > this->sum_weighted_load)
imbn = 1;
- } else {
- sds->this_load_per_task =
- cpu_avg_load_per_task(env->dst_cpu);
- }

- scaled_busy_load_per_task = sds->busiest_load_per_task
- * SCHED_POWER_SCALE;
- scaled_busy_load_per_task /= sds->busiest->sgp->power;
+ scaled_busy_load_per_task = (busiest->sum_weighted_load *
+ SCHED_POWER_SCALE) / sds->busiest->sgp->power;

- if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
- (scaled_busy_load_per_task * imbn)) {
- env->imbalance = sds->busiest_load_per_task;
+ if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
+ env->imbalance = busiest->sum_weighted_load;
return;
}

@@ -4674,33 +4656,34 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
*/

pwr_now += sds->busiest->sgp->power *
- min(sds->busiest_load_per_task, sds->max_load);
+ min(busiest->sum_weighted_load, busiest->avg_load);
pwr_now += sds->this->sgp->power *
- min(sds->this_load_per_task, sds->this_load);
+ min(this->sum_weighted_load, this->avg_load);
pwr_now /= SCHED_POWER_SCALE;

/* Amount of load we'd subtract */
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+ tmp = (busiest->sum_weighted_load * SCHED_POWER_SCALE) /
sds->busiest->sgp->power;
- if (sds->max_load > tmp)
+ if (busiest->avg_load > tmp)
pwr_move += sds->busiest->sgp->power *
- min(sds->busiest_load_per_task, sds->max_load - tmp);
+ min(busiest->sum_weighted_load,
+ busiest->avg_load - tmp);

/* Amount of load we'd add */
- if (sds->max_load * sds->busiest->sgp->power <
- sds->busiest_load_per_task * SCHED_POWER_SCALE)
- tmp = (sds->max_load * sds->busiest->sgp->power) /
+ if (busiest->avg_load * sds->busiest->sgp->power <
+ busiest->sum_weighted_load * SCHED_POWER_SCALE)
+ tmp = (busiest->avg_load * sds->busiest->sgp->power) /
sds->this->sgp->power;
else
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
+ tmp = (busiest->sum_weighted_load * SCHED_POWER_SCALE) /
sds->this->sgp->power;
pwr_move += sds->this->sgp->power *
- min(sds->this_load_per_task, sds->this_load + tmp);
+ min(this->sum_weighted_load, this->avg_load + tmp);
pwr_move /= SCHED_POWER_SCALE;

/* Move if we gain throughput */
if (pwr_move > pwr_now)
- env->imbalance = sds->busiest_load_per_task;
+ env->imbalance = busiest->sum_weighted_load;
}

/**
@@ -4712,11 +4695,20 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
unsigned long max_pull, load_above_capacity = ~0UL;
+ struct sg_lb_stats *this, *busiest;
+
+ /* After below code, we use sum_weighted_load of
+ * this_stat and busiest_stat as load_per_task */
+ this = &sds->this_stat;
+ if (this->sum_nr_running)
+ this->sum_weighted_load /= this->sum_nr_running;

- sds->busiest_load_per_task /= sds->busiest_nr_running;
- if (sds->group_imb) {
- sds->busiest_load_per_task =
- min(sds->busiest_load_per_task, sds->avg_load);
+ busiest = &sds->busiest_stat;
+ /* busiest must have some tasks */
+ busiest->sum_weighted_load /= busiest->sum_nr_running;
+ if (busiest->group_imb) {
+ busiest->sum_weighted_load =
+ min(busiest->sum_weighted_load, sds->sd_avg_load);
}

/*
@@ -4724,17 +4716,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* max load less than avg load(as we skip the groups at or below
* its cpu_power, while calculating max_load..)
*/
- if (sds->max_load < sds->avg_load) {
+ if (busiest->avg_load < this->avg_load) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}

- if (!sds->group_imb) {
+ if (!busiest->group_imb) {
/*
* Don't want to pull so many tasks that a group would go idle.
*/
- load_above_capacity = (sds->busiest_nr_running -
- sds->busiest_group_capacity);
+ load_above_capacity = (busiest->sum_nr_running -
+ busiest->group_capacity);

load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);

@@ -4751,12 +4743,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* Be careful of negative numbers as they'll appear as very large values
* with unsigned longs.
*/
- max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+ max_pull = min(busiest->avg_load - sds->sd_avg_load,
+ load_above_capacity);

/* How much load to actually move to equalise the imbalance */
env->imbalance = min(max_pull * sds->busiest->sgp->power,
- (sds->avg_load - sds->this_load) * sds->this->sgp->power)
- / SCHED_POWER_SCALE;
+ (sds->sd_avg_load - this->avg_load) *
+ sds->this->sgp->power) / SCHED_POWER_SCALE;

/*
* if *imbalance is less than the average load per runnable task
@@ -4764,7 +4757,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* a think about bumping its value to force at least one task to be
* moved
*/
- if (env->imbalance < sds->busiest_load_per_task)
+ if (env->imbalance < busiest->sum_weighted_load)
return fix_small_imbalance(env, sds);

}
@@ -4792,6 +4785,7 @@ static struct sched_group *
find_busiest_group(struct lb_env *env)
{
struct sd_lb_stats sds;
+ struct sg_lb_stats *this, *busiest;

memset(&sds, 0, sizeof(sds));

@@ -4800,42 +4794,44 @@ find_busiest_group(struct lb_env *env)
* this level.
*/
update_sd_lb_stats(env, &sds);
+ this = &sds.this_stat;
+ busiest = &sds.busiest_stat;

if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;

/* There is no busy sibling group to pull tasks from */
- if (!sds.busiest || sds.busiest_nr_running == 0)
+ if (!sds.busiest || busiest->sum_nr_running == 0)
goto out_balanced;

- sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;
+ sds.sd_avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;

/*
* If the busiest group is imbalanced the below checks don't
* work because they assumes all things are equal, which typically
* isn't true due to cpus_allowed constraints and the like.
*/
- if (sds.group_imb)
+ if (busiest->group_imb)
goto force_balance;

/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
- !sds.busiest_has_capacity)
+ if (env->idle == CPU_NEWLY_IDLE &&
+ this->group_has_capacity && !busiest->group_has_capacity)
goto force_balance;

/*
* If the local group is more busy than the selected busiest group
* don't try and pull any tasks.
*/
- if (sds.this_load >= sds.max_load)
+ if (this->avg_load >= busiest->avg_load)
goto out_balanced;

/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
- if (sds.this_load >= sds.avg_load)
+ if (this->avg_load >= sds.sd_avg_load)
goto out_balanced;

if (env->idle == CPU_IDLE) {
@@ -4845,15 +4841,16 @@ find_busiest_group(struct lb_env *env)
* there is no imbalance between this and busiest group
* wrt to idle cpu's, it is balanced.
*/
- if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
- sds.busiest_nr_running <= sds.busiest_group_weight)
+ if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
+ busiest->sum_nr_running <= busiest->group_weight)
goto out_balanced;
} else {
/*
* In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
* imbalance_pct to be conservative.
*/
- if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+ if (100 * busiest->avg_load <=
+ env->sd->imbalance_pct * this->avg_load)
goto out_balanced;
}

--
1.7.9.5

2013-03-28 07:59:11

by Joonsoo Kim

Subject: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Following up the upper se in sched_slice() should not be done,
because sched_slice() is used for checking whether a resched is needed
within *this* cfs_rq, and there is one problem related to this
in the current implementation.

The problem is that if we follow up the upper se in sched_slice(), it is
possible to get an ideal slice which is lower than
sysctl_sched_min_granularity.

For example, assume that we have 4 task groups attached to the root tg
with the same share, and each one has 20 runnable tasks on cpu0.
In this case, __sched_period() returns sysctl_sched_min_granularity * 20
and we then go into the loop. At the first iteration, we compute this
task's portion of the slice on its own cfs_rq and get
sysctl_sched_min_granularity. Afterward, at the second iteration, we get
a slice which is a quarter of sysctl_sched_min_granularity, because there
are 4 task groups with the same share in that cfs_rq.
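
To make the numbers concrete, here is a minimal userspace sketch of the
two-level scaling described above (not kernel code; it assumes
sysctl_sched_min_granularity = 4ms and equal weights everywhere, so the
weight ratios reduce to simple divisions):

    #include <stdio.h>

    int main(void)
    {
            unsigned int min_gran_ms = 4;   /* assumed sysctl_sched_min_granularity */
            unsigned int nr_tasks = 20, nr_tgs = 4;

            /* __sched_period(): 4ms * 20 = 80ms when nr_running > sched_nr_latency */
            unsigned int period_ms = min_gran_ms * nr_tasks;
            /* first iteration: the task's share of its own cfs_rq */
            unsigned int task_slice_ms = period_ms / nr_tasks;      /* 4ms */
            /* second iteration: scaled again by the tg's share of the root */
            unsigned int final_slice_ms = task_slice_ms / nr_tgs;   /* 1ms */

            printf("period=%ums task slice=%ums after tg level=%ums\n",
                   period_ms, task_slice_ms, final_slice_ms);
            return 0;
    }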

Ensuring a slice larger than min_granularity is important for performance,
and since there is no lower bound on this other than the timer tick, we
should fix sched_slice() not to consider the upper se.

Below is my test result on my 4-cpu machine.

I did a test to verify this effect in the following environment.

CONFIG_HZ=1000 and CONFIG_SCHED_AUTOGROUP=y
/proc/sys/kernel/sched_min_granularity_ns is 2250000, that is, 2.25ms.

I ran the following commands.

For each of 4 sessions:
for i in `seq 20`; do taskset -c 3 sh -c 'while true; do :; done' & done

./perf sched record
./perf script -C 003 | grep sched_switch | cut -b -40 | less

Result is below.

*Vanilla*
sh 2724 [003] 152.52801
sh 2779 [003] 152.52900
sh 2775 [003] 152.53000
sh 2751 [003] 152.53100
sh 2717 [003] 152.53201

*With this patch*
sh 2640 [003] 147.48700
sh 2662 [003] 147.49000
sh 2601 [003] 147.49300
sh 2633 [003] 147.49400

In the vanilla case, the computed slice is lower than 1ms (one tick), so
every tick triggers a reschedule. After the patch is applied, we can see
that min_granularity is ensured.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 204a9a9..e232421 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -631,23 +631,20 @@ static u64 __sched_period(unsigned long nr_running)
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ struct load_weight *load;
+ struct load_weight lw;
u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

- for_each_sched_entity(se) {
- struct load_weight *load;
- struct load_weight lw;
-
- cfs_rq = cfs_rq_of(se);
- load = &cfs_rq->load;
+ load = &cfs_rq->load;

- if (unlikely(!se->on_rq)) {
- lw = cfs_rq->load;
+ if (unlikely(!se->on_rq)) {
+ lw = cfs_rq->load;

- update_load_add(&lw, se->load.weight);
- load = &lw;
- }
- slice = calc_delta_mine(slice, se->load.weight, load);
+ update_load_add(&lw, se->load.weight);
+ load = &lw;
}
+ slice = calc_delta_mine(slice, se->load.weight, load);
+
return slice;
}

--
1.7.9.5

2013-03-28 07:59:43

by Joonsoo Kim

Subject: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

sched_slice() computes the ideal runtime slice. If there are many tasks
in a cfs_rq, the period for this cfs_rq is extended to guarantee that each
task gets a time slice of at least sched_min_granularity. Each task then
gets a portion of this period. If there is a task with a much larger load
weight than the others, its portion of the period can far exceed
sysctl_sched_latency.

For example, simply imagine one task with nice -20 and 9 tasks with
nice 0 on one cfs_rq. In this case, the load weight sum for this cfs_rq
is 88761 + 9 * 1024 = 97977. So the slice for the task with nice -20 is
sysctl_sched_min_granularity * 10 * (88761 / 97977), that is,
approximately sysctl_sched_min_granularity * 9. This can be much larger
if there are more tasks with nice 0.
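
Spelling out the arithmetic with a minimal userspace sketch (not kernel
code; 88761 and 1024 are the prio_to_weight values for nice -20 and
nice 0, and sysctl_sched_min_granularity = 4ms is an assumed value,
giving a 40ms period for 10 tasks):

    #include <stdio.h>

    int main(void)
    {
            unsigned long w_m20 = 88761, w_0 = 1024;    /* prio_to_weight values */
            unsigned long sum = w_m20 + 9 * w_0;        /* 97977 */
            unsigned long min_gran_ms = 4;              /* assumed */
            unsigned long period_ms = min_gran_ms * 10; /* 40ms */

            /* the nice -20 task's share of the extended period */
            double slice_ms = (double)period_ms * w_m20 / sum;

            printf("weight sum=%lu slice=%.1fms\n", sum, slice_ms); /* ~36.2ms */
            return 0;
    }

Assuming a 20ms sysctl_sched_latency, as later in this thread, this single
slice is already far above it, which is the situation the cap below prevents.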

So we should limit this possible weird situation.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e232421..6ceffbc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
slice = calc_delta_mine(slice, se->load.weight, load);

+ if (unlikely(slice > sysctl_sched_latency))
+ slice = sysctl_sched_latency;
+
return slice;
}

--
1.7.9.5

2013-03-28 07:59:59

by Joonsoo Kim

Subject: [PATCH 2/5] sched: factor out code to should_we_balance()

Currently, the check whether this cpu is appropriate to balance is
embedded in update_sg_lb_stats(), although it has no direct relationship
to that function. This patch factors it out into a new function,
should_we_balance(), which load_balance() calls before gathering any
statistics.

There is not enough reason to keep this check in update_sg_lb_stats(),
except that it saves one iteration over sched_group_cpus. With this
change, we can save two memset costs and can expect better compiler
optimization, so the clean-up may be more beneficial to us.

Below is the code size result of this patch.

* Vanilla *
text data bss dec hex filename
34499 1136 116 35751 8ba7 kernel/sched/fair.o

* Patched *
text data bss dec hex filename
34243 1136 116 35495 8aa7 kernel/sched/fair.o

In addition, rename @balance to @should_balance in order to represent
its purpose more clearly.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d8774f..95ec757 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4406,22 +4406,17 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
* @group: sched_group whose statistics are to be updated.
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
* @sgs: variable to hold the statistics for this group.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
- int local_group, int *balance, struct sg_lb_stats *sgs)
+ int local_group, struct sg_lb_stats *sgs)
{
unsigned long nr_running, max_nr_running, min_nr_running;
unsigned long load, max_cpu_load, min_cpu_load;
- unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long avg_load_per_task = 0;
int i;

- if (local_group)
- balance_cpu = group_balance_cpu(group);
-
/* Tally up the load of all CPUs in the group */
max_cpu_load = 0;
min_cpu_load = ~0UL;
@@ -4434,15 +4429,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;

/* Bias balancing toward cpus of our domain */
- if (local_group) {
- if (idle_cpu(i) && !first_idle_cpu &&
- cpumask_test_cpu(i, sched_group_mask(group))) {
- first_idle_cpu = 1;
- balance_cpu = i;
- }
-
+ if (local_group)
load = target_load(i, load_idx);
- } else {
+ else {
load = source_load(i, load_idx);
if (load > max_cpu_load)
max_cpu_load = load;
@@ -4462,22 +4451,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
}

- /*
- * First idle cpu or the first cpu(busiest) in this sched group
- * is eligible for doing load balancing at this and above
- * domains. In the newly idle case, we will allow all the cpu's
- * to do the newly idle load balance.
- */
- if (local_group) {
- if (env->idle != CPU_NEWLY_IDLE) {
- if (balance_cpu != env->dst_cpu) {
- *balance = 0;
- return;
- }
- update_group_power(env->sd, env->dst_cpu);
- } else if (time_after_eq(jiffies, group->sgp->next_update))
- update_group_power(env->sd, env->dst_cpu);
- }
+ if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+ time_after_eq(jiffies, group->sgp->next_update)))
+ update_group_power(env->sd, env->dst_cpu);

/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
@@ -4556,7 +4532,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* @sds: variable to hold the statistics for this sched_domain.
*/
static inline void update_sd_lb_stats(struct lb_env *env,
- int *balance, struct sd_lb_stats *sds)
+ struct sd_lb_stats *sds)
{
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
@@ -4573,10 +4549,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,

local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
-
- if (local_group && !(*balance))
- return;
+ update_sg_lb_stats(env, sg, load_idx, local_group, &sgs);

sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;
@@ -4809,8 +4782,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* to restore balance.
*
* @env: The load balancing environment.
- * @balance: Pointer to a variable indicating if this_cpu
- * is the appropriate cpu to perform load balancing at this_level.
*
* Returns: - the busiest group if imbalance exists.
* - If no imbalance and user has opted for power-savings balance,
@@ -4818,7 +4789,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* put to idle by rebalancing its tasks onto our group.
*/
static struct sched_group *
-find_busiest_group(struct lb_env *env, int *balance)
+find_busiest_group(struct lb_env *env)
{
struct sd_lb_stats sds;

@@ -4828,14 +4799,7 @@ find_busiest_group(struct lb_env *env, int *balance)
* Compute the various statistics relavent for load balancing at
* this level.
*/
- update_sd_lb_stats(env, balance, &sds);
-
- /*
- * this_cpu is not the appropriate cpu to perform load balancing at
- * this level.
- */
- if (!(*balance))
- goto ret;
+ update_sd_lb_stats(env, &sds);

if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
@@ -4899,7 +4863,6 @@ force_balance:
return sds.busiest;

out_balanced:
-ret:
env->imbalance = 0;
return NULL;
}
@@ -4981,13 +4944,46 @@ static int need_active_balance(struct lb_env *env)

static int active_load_balance_cpu_stop(void *data);

+static int should_we_balance(struct lb_env *env)
+{
+ struct sched_group *sg = env->sd->groups;
+ int cpu, balance_cpu = -1;
+
+ /*
+ * In the newly idle case, we will allow all the cpu's
+ * to do the newly idle load balance.
+ */
+ if (env->idle == CPU_NEWLY_IDLE)
+ return 1;
+
+ /* Try to find first idle cpu */
+ for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus)
+ if (cpumask_test_cpu(cpu, sched_group_mask(sg)) &&
+ idle_cpu(cpu)) {
+ balance_cpu = cpu;
+ break;
+ }
+
+ if (balance_cpu == -1)
+ balance_cpu = group_balance_cpu(sg);
+
+ /*
+ * First idle cpu or the first cpu(busiest) in this sched group
+ * is eligible for doing load balancing at this and above domains.
+ */
+ if (balance_cpu != env->dst_cpu)
+ return 0;
+
+ return 1;
+}
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
*/
static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *balance)
+ int *should_balance)
{
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_group *group;
@@ -5014,12 +5010,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,

schedstat_inc(sd, lb_count[idle]);

-redo:
- group = find_busiest_group(&env, balance);
-
- if (*balance == 0)
+ if (!should_we_balance(&env)) {
+ *should_balance = 0;
goto out_balanced;
+ } else
+ *should_balance = 1;

+redo:
+ group = find_busiest_group(&env);
if (!group) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
@@ -5234,7 +5232,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
rcu_read_lock();
for_each_domain(this_cpu, sd) {
unsigned long interval;
- int balance = 1;
+ int should_balance;

if (!(sd->flags & SD_LOAD_BALANCE))
continue;
@@ -5242,7 +5240,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
if (sd->flags & SD_BALANCE_NEWIDLE) {
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
- sd, CPU_NEWLY_IDLE, &balance);
+ sd, CPU_NEWLY_IDLE,
+ &should_balance);
}

interval = msecs_to_jiffies(sd->balance_interval);
@@ -5476,7 +5475,7 @@ void update_max_interval(void)
*/
static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
- int balance = 1;
+ int should_balance;
struct rq *rq = cpu_rq(cpu);
unsigned long interval;
struct sched_domain *sd;
@@ -5508,7 +5507,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
}

if (time_after_eq(jiffies, sd->last_balance + interval)) {
- if (load_balance(cpu, rq, sd, idle, &balance)) {
+ if (load_balance(cpu, rq, sd, idle, &should_balance)) {
/*
* We've pulled tasks over so either we may
* be no longer idle.
@@ -5530,7 +5529,7 @@ out:
* CPU in our sched group which is doing load balancing more
* actively.
*/
- if (!balance)
+ if (!should_balance)
break;
}
rcu_read_unlock();
--
1.7.9.5

2013-03-29 07:14:26

by Preeti U Murthy

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hi Joonsoo,

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> Following-up upper se in sched_slice() should not be done,
> because sched_slice() is used for checking that resched is needed
> whithin *this* cfs_rq and there is one problem related to this
> in current implementation.
>
> The problem is that if we follow-up upper se in sched_slice(), it is
> possible that we get a ideal slice which is lower than
> sysctl_sched_min_granularity.
>
> For example, we assume that we have 4 tg which is attached to root tg
> with same share and each one have 20 runnable tasks on cpu0, respectivly.
> In this case, __sched_period() return sysctl_sched_min_granularity * 20
> and then go into loop. At first iteration, we compute a portion of slice
> for this task on this cfs_rq, so get a slice, sysctl_sched_min_granularity.
> Afterward, we enter second iteration and get a slice which is a quarter of
> sysctl_sched_min_granularity, because there is 4 tgs with same share
> in that cfs_rq.
>
> Ensuring slice larger than min_granularity is important for performance
> and there is no lower bound about this, except timer tick, we should
> fix it not to consider upper se when calculating sched_slice.
>

I am assuming the sysctl_sched_latency = 20ms and
sysctl_sched_min_granularity = 4ms.
In that case:

With your patch, the sum of sched_slice(runnable_task) over each tg is
40ms = sched_min_granularity * 10, while the parent tg has a
sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5ms.

Ideally the children's cpu share must add up to the parent's share.


Thank you

Regards
Preeti U Murthy

2013-03-29 11:37:25

by Preeti U Murthy

Subject: Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

Hi Joonsoo

On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> sched_slice() compute ideal runtime slice. If there are many tasks
> in cfs_rq, period for this cfs_rq is extended to guarantee that each task
> has time slice at least, sched_min_granularity. And then each task get
> a portion of this period for it. If there is a task which have much larger
> load weight than others, a portion of period can exceed far more than
> sysctl_sched_latency.

Correct. But that does not matter; the length of the scheduling latency
period is determined by the return value of __sched_period(), not the
value of sysctl_sched_latency. You would not extend the period if you
wanted all tasks to have a slice within sysctl_sched_latency, right?

So since the length of the scheduling latency period is dynamic, depending
on the number of processes running, sysctl_sched_latency, which is the
default latency period length, is not messed with, but is only used as a
base to determine the actual scheduling period.

>
> For exampple, you can simply imagine that one task with nice -20 and
> 9 tasks with nice 0 on one cfs_rq. In this case, load weight sum for
> this cfs_rq is 88761 + 9 * 1024, 97977. So a portion of slice for the
> task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 / 97977),
> that is, approximately, sysctl_sched_min_granularity * 9. This aspect
> can be much larger if there is more tasks with nice 0.

Yeah, so __sched_period() says that within 40ms, all tasks need to be
scheduled at least once, and the highest priority task gets nearly 36ms of
it, while the rest is distributed among the others.

>
> So we should limit this possible weird situation.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e232421..6ceffbc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> }
> slice = calc_delta_mine(slice, se->load.weight, load);
>
> + if (unlikely(slice > sysctl_sched_latency))
> + slice = sysctl_sched_latency;

Then in this case the highest priority thread would get
20ms (sysctl_sched_latency), and the rest would each get
sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms.
Then all tasks would get scheduled at least once within 20ms + (0.4*9) ms
= 23.7ms, while your scheduling latency period was extended to 40ms, just
so that each of these tasks doesn't have its sched_slice shrunk due to
the large number of tasks.

> +
> return slice;
> }
>

Regards
Preeti U Murthy

2013-03-29 11:45:31

by Peter Zijlstra

Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> There is not enough reason to place this checking at
> update_sg_lb_stats(),
> except saving one iteration for sched_group_cpus. But with this
> change,
> we can save two memset cost and can expect better compiler
> optimization,
> so clean-up may be more beneficial to us.

It would be good if you'd described what you actually did, there's a
lot of code movement and now I have to figure out wth you did and why.

2013-03-29 11:58:40

by Peter Zijlstra

Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> +static int should_we_balance(struct lb_env *env)
> +{
> + struct sched_group *sg = env->sd->groups;
> + int cpu, balance_cpu = -1;
> +
> + /*
> + * In the newly idle case, we will allow all the cpu's
> + * to do the newly idle load balance.
> + */
> + if (env->idle == CPU_NEWLY_IDLE)
> + return 1;
> +
> + /* Try to find first idle cpu */
> + for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus)
> + if (cpumask_test_cpu(cpu, sched_group_mask(sg)) &&
> + idle_cpu(cpu)) {
> + balance_cpu = cpu;
> + break;
> + }

/me hands you a bucket of curlies too.. use them, I'll send you another
bucket when this one gets empty!

(We always put in curlies on multi-line statements -- as opposed to
multi-statement blocks, where they're required.)

> +
> + if (balance_cpu == -1)
> + balance_cpu = group_balance_cpu(sg);
> +
> + /*
> + * First idle cpu or the first cpu(busiest) in this sched
> group
> + * is eligible for doing load balancing at this and above
> domains.
> + */
> + if (balance_cpu != env->dst_cpu)
> + return 0;
> +
> + return 1;
> +}

2013-04-01 04:08:13

by Joonsoo Kim

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Preeti.

On Fri, Mar 29, 2013 at 12:42:53PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
> On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> > Following-up upper se in sched_slice() should not be done,
> > because sched_slice() is used for checking that resched is needed
> > whithin *this* cfs_rq and there is one problem related to this
> > in current implementation.
> >
> > The problem is that if we follow-up upper se in sched_slice(), it is
> > possible that we get a ideal slice which is lower than
> > sysctl_sched_min_granularity.
> >
> > For example, we assume that we have 4 tg which is attached to root tg
> > with same share and each one have 20 runnable tasks on cpu0, respectivly.
> > In this case, __sched_period() return sysctl_sched_min_granularity * 20
> > and then go into loop. At first iteration, we compute a portion of slice
> > for this task on this cfs_rq, so get a slice, sysctl_sched_min_granularity.
> > Afterward, we enter second iteration and get a slice which is a quarter of
> > sysctl_sched_min_granularity, because there is 4 tgs with same share
> > in that cfs_rq.
> >
> > Ensuring slice larger than min_granularity is important for performance
> > and there is no lower bound about this, except timer tick, we should
> > fix it not to consider upper se when calculating sched_slice.
> >
>
> I am assuming the sysctl_sched_latency = 20ms and
> sysctl_sched_min_granularity = 4ms.
> In that case:
>
> With your patch, the sum of the sched_slice(runnable_task) on each_tg is
> 40ms = sched_min_granularity * 10, while the parent tg has a
> sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5ms.
>
> Ideally the children's cpu share must add upto the parent's share.
>

I don't think so.

We should schedule out the parent tg once its 5ms are over. That way, we
can fairly distribute the time slice to every tg within a short term. If
we add the children's cpu share up to the parent's, the parent tg may have
a large time slice, so it cannot be preempted easily. There may be a
latency problem if there are many tgs.

Thanks.

>
> Thank you
>
> Regards
> Preeti U Murthy
>

2013-04-01 05:09:20

by Joonsoo Kim

Subject: Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

Hello Preeti.

On Fri, Mar 29, 2013 at 05:05:37PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo
>
> On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
> > sched_slice() compute ideal runtime slice. If there are many tasks
> > in cfs_rq, period for this cfs_rq is extended to guarantee that each task
> > has time slice at least, sched_min_granularity. And then each task get
> > a portion of this period for it. If there is a task which have much larger
> > load weight than others, a portion of period can exceed far more than
> > sysctl_sched_latency.
>
> Correct. But that does not matter, the length of the scheduling latency
> period is determined by the return value of ___sched_period(), not the
> value of sysctl_sched_latency. You would not extend the period,if you
> wanted all tasks to have a slice within the sysctl_sched_latency, right?
>
> So since the value of the length of the scheduling latency period, is
> dynamic depending on the number of the processes running, the
> sysctl_sched_latency which is the default latency period length is not
> mesed with, but is only used as a base to determine the actual
> scheduling period.
>
> >
> > For exampple, you can simply imagine that one task with nice -20 and
> > 9 tasks with nice 0 on one cfs_rq. In this case, load weight sum for
> > this cfs_rq is 88761 + 9 * 1024, 97977. So a portion of slice for the
> > task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 / 97977),
> > that is, approximately, sysctl_sched_min_granularity * 9. This aspect
> > can be much larger if there is more tasks with nice 0.
>
> Yeah so the __sched_period says that within 40ms, all tasks need to be
> scheduled ateast once, and the highest priority task gets nearly 36ms of
> it, while the rest is distributed among the others.
>
> >
> > So we should limit this possible weird situation.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e232421..6ceffbc 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > }
> > slice = calc_delta_mine(slice, se->load.weight, load);
> >
> > + if (unlikely(slice > sysctl_sched_latency))
> > + slice = sysctl_sched_latency;
>
> Then in this case the highest priority thread would get
> 20ms(sysctl_sched_latency), and the rest would get
> sysctl_sched_min_granularity * 10 * (1024/97977) which would be 0.4ms.
> Then all tasks would get scheduled ateast once within 20ms + (0.4*9) ms
> = 23.7ms, while your scheduling latency period was extended to 40ms,just
> so that each of these tasks don't have their sched_slices shrunk due to
> large number of tasks.

I don't know if I understand your question correctly.
I will do my best to answer your comment. :)

With this patch, I just limit the maximum slice at one time. Scheduling is
controlled through the vruntime. So, in this case, the task with nice -20
will be scheduled twice.

20 + (0.4 * 9) + 20 = 43.9 ms

And after 43.9 ms, this process is repeated.

So I can tell you that the scheduling period is preserved as before.

If we give a task a long slice in one go, it can cause
a latency problem. So IMHO, limiting this is meaningful.

Thanks.

>
> > +
> > return slice;
> > }
> >
>
> Regards
> Preeti U Murthy
>

2013-04-01 05:10:35

by Joonsoo Kim

Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

Hello, Peter.

On Fri, Mar 29, 2013 at 12:45:14PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> > There is not enough reason to place this checking at
> > update_sg_lb_stats(),
> > except saving one iteration for sched_group_cpus. But with this
> > change,
> > we can save two memset cost and can expect better compiler
> > optimization,
> > so clean-up may be more beneficial to us.
>
> It would be good if you'd described what you actually did, there's a
> lot of code movement and now I have to figure out wth you did and why.

Okay. I will do that when I respin.

Thanks.


2013-04-01 05:16:21

by Joonsoo Kim

Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Fri, Mar 29, 2013 at 12:58:26PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> > +static int should_we_balance(struct lb_env *env)
> > +{
> > + struct sched_group *sg = env->sd->groups;
> > + int cpu, balance_cpu = -1;
> > +
> > + /*
> > + * In the newly idle case, we will allow all the cpu's
> > + * to do the newly idle load balance.
> > + */
> > + if (env->idle == CPU_NEWLY_IDLE)
> > + return 1;
> > +
> > + /* Try to find first idle cpu */
> > + for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus)
> > + if (cpumask_test_cpu(cpu, sched_group_mask(sg)) &&
> > + idle_cpu(cpu)) {
> > + balance_cpu = cpu;
> > + break;
> > + }
>
> /me hands you a bucket of curlies too.. use them I'll send you another
> bucket when this one gets empty!

Thanks!

> (We always put in curlies on multi lines statements -- as opposed to
> multi statement blocks where they're required)

Okay. I will do that when I respin.

Thanks.

>
> > +
> > + if (balance_cpu == -1)
> > + balance_cpu = group_balance_cpu(sg);
> > +
> > + /*
> > + * First idle cpu or the first cpu(busiest) in this sched
> > group
> > + * is eligible for doing load balancing at this and above
> > domains.
> > + */
> > + if (balance_cpu != env->dst_cpu)
> > + return 0;
> > +
> > + return 1;
> > +}
>
>

2013-04-01 06:47:23

by Preeti U Murthy

Subject: Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

Hi Joonsoo,

On 04/01/2013 10:39 AM, Joonsoo Kim wrote:
> Hello Preeti.
> So we should limit this possible weird situation.
>>>
>>> Signed-off-by: Joonsoo Kim <[email protected]>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e232421..6ceffbc 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> }
>>> slice = calc_delta_mine(slice, se->load.weight, load);
>>>
>>> + if (unlikely(slice > sysctl_sched_latency))
>>> + slice = sysctl_sched_latency;
>>
>> Then in this case the highest priority thread would get
>> 20ms(sysctl_sched_latency), and the rest would get
>> sysctl_sched_min_granularity * 10 * (1024/97977) which would be 0.4ms.
>> Then all tasks would get scheduled ateast once within 20ms + (0.4*9) ms
>> = 23.7ms, while your scheduling latency period was extended to 40ms,just
>> so that each of these tasks don't have their sched_slices shrunk due to
>> large number of tasks.
>
> I don't know I understand your question correctly.
> I will do my best to answer your comment. :)
>
> With this patch, I just limit maximum slice at one time. Scheduling is
> controlled through the vruntime. So, in this case, the task with nice -20
> will be scheduled twice.
>
> 20 + (0.4 * 9) + 20 = 43.9 ms
>
> And after 43.9 ms, this process is repeated.
>
> So I can tell you that scheduling period is preserved as before.
>
> If we give a long period to a task at one go, it can cause
> a latency problem. So IMHO, limiting this is meaningful.

Thank you very much for the explanation. Just one question. What is the
reason behind you choosing sysctl_sched_latency as the upper bound here?

Regards
Preeti U Murthy

2013-04-01 07:09:00

by Preeti U Murthy

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hi Joonsoo,

On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> Hello, Preeti.
>

>>
>> Ideally the children's cpu share must add upto the parent's share.
>>
>
> I don't think so.
>
> We should schedule out the parent tg if 5ms is over. As we do so, we can
> fairly distribute time slice to every tg within short term. If we add
> the children's cpu share upto the parent's, the parent tg may have
> large time slice, so it cannot be preempted easily. There may be a latency
> problem if there are many tgs.

In the case where #running < sched_nr_latency, the children's
sched_slices add up to the parent's.

Take a rq with two tgs, each with 3 tasks.

Each of these tasks has a sched_slice of
[(sysctl_sched_latency / 3) / 2] as of the present implementation.

The sum of the above sched_slice over all tasks of a tg will lead to the
sched_slice of its parent: sysctl_sched_latency / 2.
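(To put numbers on this, with the 20ms sysctl_sched_latency assumed
earlier in the thread: each task gets (20ms / 3) / 2, i.e. roughly 3.3ms,
so the three tasks of one tg together get roughly 10ms, which is exactly
the parent's sysctl_sched_latency / 2.)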

This breaks when nr_running on each tg > sched_nr_latency. However, I
don't know if this is a good thing or a bad thing.

Regards
Preeti U Murthy

2013-04-02 02:02:27

by Joonsoo Kim

Subject: Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

Hello, Preeti.

On Mon, Apr 01, 2013 at 12:15:50PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
> On 04/01/2013 10:39 AM, Joonsoo Kim wrote:
> > Hello Preeti.
> > So we should limit this possible weird situation.
> >>>
> >>> Signed-off-by: Joonsoo Kim <[email protected]>
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e232421..6ceffbc 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >>> }
> >>> slice = calc_delta_mine(slice, se->load.weight, load);
> >>>
> >>> + if (unlikely(slice > sysctl_sched_latency))
> >>> + slice = sysctl_sched_latency;
> >>
> >> Then in this case the highest priority thread would get
> >> 20ms(sysctl_sched_latency), and the rest would get
> >> sysctl_sched_min_granularity * 10 * (1024/97977) which would be 0.4ms.
> >> Then all tasks would get scheduled ateast once within 20ms + (0.4*9) ms
> >> = 23.7ms, while your scheduling latency period was extended to 40ms,just
> >> so that each of these tasks don't have their sched_slices shrunk due to
> >> large number of tasks.
> >
> > I don't know I understand your question correctly.
> > I will do my best to answer your comment. :)
> >
> > With this patch, I just limit maximum slice at one time. Scheduling is
> > controlled through the vruntime. So, in this case, the task with nice -20
> > will be scheduled twice.
> >
> > 20 + (0.4 * 9) + 20 = 43.9 ms
> >
> > And after 43.9 ms, this process is repeated.
> >
> > So I can tell you that scheduling period is preserved as before.
> >
> > If we give a long period to a task at one go, it can cause
> > a latency problem. So IMHO, limiting this is meaningful.
>
> Thank you very much for the explanation. Just one question. What is the
> reason behind you choosing sysctl_sched_latency as the upper bound here?

sysctl_sched_latency is the sched_slice when there is only one task.
So I think it is proper as an upper bound.

Thanks.

> Regards
> Preeti U Murthy
>

2013-04-02 02:25:45

by Joonsoo Kim

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Preeti.

On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
> On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> > Hello, Preeti.
> >
>
> >>
> >> Ideally the children's cpu share must add upto the parent's share.
> >>
> >
> > I don't think so.
> >
> > We should schedule out the parent tg if 5ms is over. As we do so, we can
> > fairly distribute time slice to every tg within short term. If we add
> > the children's cpu share upto the parent's, the parent tg may have
> > large time slice, so it cannot be preempted easily. There may be a latency
> > problem if there are many tgs.
>
> In the case where the #running < sched_nr_latency, the children's
> sched_slices add up to the parent's.
>
> A rq with two tgs,each with 3 tasks.
>
> Each of these tasks have a sched slice of
> [(sysctl_sched_latency / 3) / 2] as of the present implementation.
>
> The sum of the above sched_slice of all tasks of a tg will lead to the
> sched_slice of its parent: sysctl_sched_latency / 2
>
> This breaks when the nr_running on each tg > sched_nr_latency. However I
> don't know if this is a good thing or a bad thing.

Ah.. Now I get your point. Yes, you are right, and it may be a good thing.
With that property, all tasks in the system can be scheduled at least once
within sysctl_sched_latency. sysctl_sched_latency is a system-wide
configuration, so my patch may be wrong. With my patch, not all tasks in
the system can be scheduled at least once within sysctl_sched_latency.
Instead, it schedules all tasks in a cfs_rq at least once within
sysctl_sched_latency if there are no other tgs.

I think the real problem is that sysctl_sched_min_granularity is not
guaranteed for each task.
Instead of this patch, how about considering a lower bound?

	if (slice < sysctl_sched_min_granularity)
		slice = sysctl_sched_min_granularity;

Thanks.

>
> Regards
> Preeti U Murthy
>

2013-04-02 02:35:34

by Mike Galbraith

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

On Tue, 2013-04-02 at 11:25 +0900, Joonsoo Kim wrote:
> Hello, Preeti.
>
> On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> > Hi Joonsoo,
> >
> > On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> > > Hello, Preeti.
> > >
> >
> > >>
> > >> Ideally the children's cpu share must add upto the parent's share.
> > >>
> > >
> > > I don't think so.
> > >
> > > We should schedule out the parent tg if 5ms is over. As we do so, we can
> > > fairly distribute time slice to every tg within short term. If we add
> > > the children's cpu share upto the parent's, the parent tg may have
> > > large time slice, so it cannot be preempted easily. There may be a latency
> > > problem if there are many tgs.
> >
> > In the case where the #running < sched_nr_latency, the children's
> > sched_slices add up to the parent's.
> >
> > A rq with two tgs,each with 3 tasks.
> >
> > Each of these tasks have a sched slice of
> > [(sysctl_sched_latency / 3) / 2] as of the present implementation.
> >
> > The sum of the above sched_slice of all tasks of a tg will lead to the
> > sched_slice of its parent: sysctl_sched_latency / 2
> >
> > This breaks when the nr_running on each tg > sched_nr_latency. However I
> > don't know if this is a good thing or a bad thing.
>
> Ah.. Now I get your point. Yes, you are right and it may be good thing.
> With that property, all tasks in the system can be scheduled at least once
> in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> so my patch may be wrong. With my patch, all tasks in the system cannot be
> scheduled at least once in sysctl_sched_latency. Instead, it schedule
> all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> no other tgs.
>
> I think that it is real problem that sysctl_sched_min_granularity is not
> guaranteed for each task.
> Instead of this patch, how about considering low bound?
>
> if (slice < sysctl_sched_min_granularity)
> slice = sysctl_sched_min_granularity;

How many SCHED_IDLE or +nice tasks will fit in that?

-Mike

2013-04-02 04:57:39

by Preeti U Murthy

Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hi Joonsoo,

On 04/02/2013 07:55 AM, Joonsoo Kim wrote:
> Hello, Preeti.
>
> On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
>> Hi Joonsoo,
>>
>> On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
>>> Hello, Preeti.
>>>
>>
>>>>
>>>> Ideally the children's cpu share must add upto the parent's share.
>>>>
>>>
>>> I don't think so.
>>>
>>> We should schedule out the parent tg if 5ms is over. As we do so, we can
>>> fairly distribute time slice to every tg within short term. If we add
>>> the children's cpu share upto the parent's, the parent tg may have
>>> large time slice, so it cannot be preempted easily. There may be a latency
>>> problem if there are many tgs.
>>
>> In the case where the #running < sched_nr_latency, the children's
>> sched_slices add up to the parent's.
>>
>> A rq with two tgs,each with 3 tasks.
>>
>> Each of these tasks have a sched slice of
>> [(sysctl_sched_latency / 3) / 2] as of the present implementation.
>>
>> The sum of the above sched_slice of all tasks of a tg will lead to the
>> sched_slice of its parent: sysctl_sched_latency / 2
>>
>> This breaks when the nr_running on each tg > sched_nr_latency. However I
>> don't know if this is a good thing or a bad thing.
>
> Ah.. Now I get your point. Yes, you are right and it may be good thing.
> With that property, all tasks in the system can be scheduled at least once
> in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> so my patch may be wrong. With my patch, all tasks in the system cannot be
> scheduled at least once in sysctl_sched_latency. Instead, it schedule
> all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> no other tgs.

Exactly. You have got all the above points right.

>
> I think that it is real problem that sysctl_sched_min_granularity is not
> guaranteed for each task.
> Instead of this patch, how about considering low bound?
>
> if (slice < sysctl_sched_min_granularity)
> slice = sysctl_sched_min_granularity;

Consider the below scenario.

A runqueue has two task groups, each with 10 tasks.

With the current implementation, each of these tasks gets a sched_slice of
2ms. Hence in a matter of (10*2) + (10*2) = 40ms, all tasks (all tasks
of both the task groups) will get the chance to run.

But what is the scheduling period in this scenario? Is it 40ms (extended
sysctl_sched_latency), which is the scheduling period for each of the
runqueues with 10 tasks in it?
Or is it 80ms, which is the total of the scheduling periods of each of
the run queues with 10 tasks? Either way all tasks seem to get scheduled
at least once within the scheduling period, so we appear to be safe,
although the sched_slice < sched_min_granularity.

With your above lower bound of sysctl_sched_min_granularity, each task
of each tg gets 4ms as its sched_slice. So in a matter of
(10*4) + (10*4) = 80ms, all tasks get to run. With the above question
being put forth here as well, we don't appear to be safe if the
scheduling_period is considered to be 40ms; otherwise it appears fine to
me, because it ensures the sched_slice is at least sched_min_granularity
no matter what.
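
To make the arithmetic concrete, here is a minimal userspace model of the
numbers above (only a sketch of the calculation, not kernel code; it assumes
the values implied by this mail: sysctl_sched_min_granularity = 4ms, two
equal-weight tgs of 10 equal-weight tasks each, and nr_running >
sched_nr_latency so the period is nr_running * min_granularity):

/* Minimal userspace model of the slice arithmetic -- not kernel code. */
#include <stdio.h>

#define MIN_GRAN_MS     4.0
#define NR_TGS          2
#define TASKS_PER_TG    10

int main(void)
{
        /* period of one tg's cfs_rq: 10 tasks * 4ms = 40ms */
        double period = TASKS_PER_TG * MIN_GRAN_MS;

        /* slice inside the tg, then scaled by the tg's share of the rq */
        double slice = period / TASKS_PER_TG;   /* 4ms */
        slice = slice / NR_TGS;                 /* 2ms after hierarchy scaling */

        double bounded = slice < MIN_GRAN_MS ? MIN_GRAN_MS : slice;
        int nr_tasks = NR_TGS * TASKS_PER_TG;

        printf("per-task slice, current code:      %.1f ms\n", slice);
        printf("per-task slice, with lower bound:  %.1f ms\n", bounded);
        printf("time for all %d tasks to run once: %.0f ms vs %.0f ms\n",
               nr_tasks, nr_tasks * slice, nr_tasks * bounded);
        return 0;
}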


Thank you

Regards
Preeti U Murthy

2013-04-02 08:10:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> Now checking that this cpu is appropriate to balance is embedded into
> update_sg_lb_stats() and this checking has no direct relationship to
> this
> function.
>
> There is not enough reason to place this checking at
> update_sg_lb_stats(),
> except saving one iteration for sched_group_cpus.

It's only one iteration if there are only 2 groups, but there can be more
than 2; take any desktop Intel i7, it will have 4-8 cores, each with
HT, thus the CPU domain will have 4-8 groups.

And note that local_group is always the first group of a domain, so
we'd stop the balance at the first group and avoid touching the other
3-7, avoiding touching cachelines on 6-14 cpus.

So this short-cut does make sense... it's not pretty, granted, but
killing it doesn't seem right.

2013-04-02 09:26:43

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Preeti.

On Tue, Apr 02, 2013 at 10:25:23AM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
> On 04/02/2013 07:55 AM, Joonsoo Kim wrote:
> > Hello, Preeti.
> >
> > On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> >> Hi Joonsoo,
> >>
> >> On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> >>> Hello, Preeti.
> >>>
> >>
> >>>>
> >>>> Ideally the children's cpu share must add upto the parent's share.
> >>>>
> >>>
> >>> I don't think so.
> >>>
> >>> We should schedule out the parent tg if 5ms is over. As we do so, we can
> >>> fairly distribute time slice to every tg within short term. If we add
> >>> the children's cpu share upto the parent's, the parent tg may have
> >>> large time slice, so it cannot be preempted easily. There may be a latency
> >>> problem if there are many tgs.
> >>
> >> In the case where the #running < sched_nr_latency, the children's
> >> sched_slices add up to the parent's.
> >>
> >> A rq with two tgs,each with 3 tasks.
> >>
> >> Each of these tasks have a sched slice of
> >> [(sysctl_sched_latency / 3) / 2] as of the present implementation.
> >>
> >> The sum of the above sched_slice of all tasks of a tg will lead to the
> >> sched_slice of its parent: sysctl_sched_latency / 2
> >>
> >> This breaks when the nr_running on each tg > sched_nr_latency. However I
> >> don't know if this is a good thing or a bad thing.
> >
> > Ah.. Now I get your point. Yes, you are right and it may be good thing.
> > With that property, all tasks in the system can be scheduled at least once
> > in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> > so my patch may be wrong. With my patch, all tasks in the system cannot be
> > scheduled at least once in sysctl_sched_latency. Instead, it schedule
> > all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> > no other tgs.
>
> Exactly. You have got all the above points right.
>
> >
> > I think that it is real problem that sysctl_sched_min_granularity is not
> > guaranteed for each task.
> > Instead of this patch, how about considering low bound?
> >
> > if (slice < sysctl_sched_min_granularity)
> > slice = sysctl_sched_min_granularity;
>
> Consider the below scenario.
>
> A runqueue has two task groups,each with 10 tasks.
>
> With the current implementation,each of these tasks get a sched_slice of
> 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
> of both the task groups) will get the chance to run.
>
> But what is the scheduling period in this scenario? Is it 40ms (extended
> sysctl_sched_latency), which is the scheduling period for each of the
> runqueues with 10 tasks in it?
> Or is it 80ms which is the total of the scheduling periods of each of
> the run queues with 10 tasks.Either way all tasks seem to get scheduled
> atleast once within the scheduling period.So we appear to be safe.
> Although the sched_slice < sched_min_granularity.
>
> With your above lower bound of sysctl_sched_min_granularity, each task
> of each tg gets 4ms as its sched_slice.So in a matter of
> (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
> being put forth here as well, we don't appear to be safe if the
> scheduling_period is considered to be 40ms, otherwise it appears fine to
> me, because it ensures the sched_slice is atleast sched_min_granularity
> no matter what.

So, you mean that we should guarantee to schedule each task at least once
in sysctl_sched_latency? But this is not guaranteed in current code, and
this is why I made this patch. Please refer to the patch description.

Thanks.

>
> Thank you
>
> Regards
> Preeti U Murthy

2013-04-02 09:35:26

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Mike.

On Tue, Apr 02, 2013 at 04:35:26AM +0200, Mike Galbraith wrote:
> On Tue, 2013-04-02 at 11:25 +0900, Joonsoo Kim wrote:
> > Hello, Preeti.
> >
> > On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> > > Hi Joonsoo,
> > >
> > > On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> > > > Hello, Preeti.
> > > >
> > >
> > > >>
> > > >> Ideally the children's cpu share must add upto the parent's share.
> > > >>
> > > >
> > > > I don't think so.
> > > >
> > > > We should schedule out the parent tg if 5ms is over. As we do so, we can
> > > > fairly distribute time slice to every tg within short term. If we add
> > > > the children's cpu share upto the parent's, the parent tg may have
> > > > large time slice, so it cannot be preempted easily. There may be a latency
> > > > problem if there are many tgs.
> > >
> > > In the case where the #running < sched_nr_latency, the children's
> > > sched_slices add up to the parent's.
> > >
> > > A rq with two tgs,each with 3 tasks.
> > >
> > > Each of these tasks have a sched slice of
> > > [(sysctl_sched_latency / 3) / 2] as of the present implementation.
> > >
> > > The sum of the above sched_slice of all tasks of a tg will lead to the
> > > sched_slice of its parent: sysctl_sched_latency / 2
> > >
> > > This breaks when the nr_running on each tg > sched_nr_latency. However I
> > > don't know if this is a good thing or a bad thing.
> >
> > Ah.. Now I get your point. Yes, you are right and it may be good thing.
> > With that property, all tasks in the system can be scheduled at least once
> > in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> > so my patch may be wrong. With my patch, all tasks in the system cannot be
> > scheduled at least once in sysctl_sched_latency. Instead, it schedule
> > all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> > no other tgs.
> >
> > I think that it is real problem that sysctl_sched_min_granularity is not
> > guaranteed for each task.
> > Instead of this patch, how about considering low bound?
> >
> > if (slice < sysctl_sched_min_granularity)
> > slice = sysctl_sched_min_granularity;
>
> How many SCHED_IDLE or +nice tasks will fit in that?

It is more related to how many running tasks are in the cfs_rq and how many
tgs are in the system. If we have two tgs, each with more than
sched_nr_latency tasks, all these tasks hit this condition in the current
implementation.
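
For reference, sched_nr_latency comes into play via __sched_period() in
kernel/sched/fair.c, which currently looks roughly like below; once
nr_running exceeds sched_nr_latency, the period is stretched to
nr_running * sysctl_sched_min_granularity:

static u64 __sched_period(unsigned long nr_running)
{
        u64 period = sysctl_sched_latency;
        unsigned long nr_latency = sched_nr_latency;

        if (unlikely(nr_running > nr_latency)) {
                period = sysctl_sched_min_granularity;
                period *= nr_running;
        }

        return period;
}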

Thanks.

>
> -Mike

2013-04-02 09:50:23

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

Hello, Peter.

On Tue, Apr 02, 2013 at 10:10:06AM +0200, Peter Zijlstra wrote:
> On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> > Now checking that this cpu is appropriate to balance is embedded into
> > update_sg_lb_stats() and this checking has no direct relationship to
> > this
> > function.
> >
> > There is not enough reason to place this checking at
> > update_sg_lb_stats(),
> > except saving one iteration for sched_group_cpus.
>
> Its only one iteration if there's only 2 groups, but there can be more
> than 2, take any desktop Intel i7, it will have 4-8 cores, each with
> HT; thus the CPU domain will have 4-8 groups.
>
> And note that local_group is always the first group of a domain, so
> we'd stop the balance at the first group and avoid touching the other
> 3-7, avoiding touching cachelines on 6-14 cpus.
>
> So this short-cut does make sense.. its not pretty, granted, but
> killing it doesn't seem right.

It seems that there is some misunderstanding about this patch.
In this patch, we don't iterate over all groups. Instead, we iterate over
the cpus of the local sched_group only, so there is no penalty like the
one you mentioned.

In summary, the net effect is below (a rough sketch of the resulting flow
follows the list).

* For cpus which are not appropriate to balance:
reduced function calls,
reduced memset

* For the cpu which should balance:
one extra iteration over the cpus of the local sched_group in should_we_balance(),
fewer branches, so better optimization is expected
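
The sketch (illustration only, not the posted diff; the elided parts of
load_balance() are unchanged and *balance is its existing out-parameter):

        /* at the top of load_balance(), before any stats are gathered */
        if (!should_we_balance(&env)) {
                *balance = 0;
                goto out_balanced;
        }

        group = find_busiest_group(&env);
        ...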

Thanks.


2013-04-02 10:00:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote:
>
> It seems that there is some misunderstanding about this patch.
> In this patch, we don't iterate all groups. Instead, we iterate on
> cpus of local sched_group only. So there is no penalty you mentioned.

OK, I'll go stare at it again..

2013-04-02 10:29:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

On Tue, 2013-04-02 at 12:00 +0200, Peter Zijlstra wrote:
> On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote:
> >
> > It seems that there is some misunderstanding about this patch.
> > In this patch, we don't iterate all groups. Instead, we iterate on
> > cpus of local sched_group only. So there is no penalty you mentioned.
>
> OK, I'll go stare at it again..

Ah, I see, you're doing should_we_balance() _before_
find_busiest_group() and instead you're doing another for_each_cpu() in
there.

I'd write the thing like:

static bool should_we_balance(struct lb_env *env)
{
        struct sched_group *sg = env->sd->groups;
        struct cpumask *sg_cpus, *sg_mask;
        int cpu, balance_cpu = -1;

        if (env->idle == CPU_NEWLY_IDLE)
                return true;

        sg_cpus = sched_group_cpus(sg);
        sg_mask = sched_group_mask(sg);

        for_each_cpu_and(cpu, sg_cpus, env->cpus) {
                if (!cpumask_test_cpu(cpu, sg_mask))
                        continue;

                if (!idle_cpu(cpu))
                        continue;

                balance_cpu = cpu;
                break;
        }

        if (balance_cpu == -1)
                balance_cpu = group_balance_cpu(sg);

        return balance_cpu == env->dst_cpu;
}

I also considered doing the group_balance_cpu() first to avoid having
to do the idle_cpu() scan, but that's a slight behavioural change
afaict.

2013-04-02 17:34:22

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hi Joonsoo,


>>> I think that it is real problem that sysctl_sched_min_granularity is not
>>> guaranteed for each task.
>>> Instead of this patch, how about considering low bound?
>>>
>>> if (slice < sysctl_sched_min_granularity)
>>> slice = sysctl_sched_min_granularity;
>>
>> Consider the below scenario.
>>
>> A runqueue has two task groups,each with 10 tasks.
>>
>> With the current implementation,each of these tasks get a sched_slice of
>> 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
>> of both the task groups) will get the chance to run.
>>
>> But what is the scheduling period in this scenario? Is it 40ms (extended
>> sysctl_sched_latency), which is the scheduling period for each of the
>> runqueues with 10 tasks in it?
>> Or is it 80ms which is the total of the scheduling periods of each of
>> the run queues with 10 tasks.Either way all tasks seem to get scheduled
>> atleast once within the scheduling period.So we appear to be safe.
>> Although the sched_slice < sched_min_granularity.
>>
>> With your above lower bound of sysctl_sched_min_granularity, each task
>> of each tg gets 4ms as its sched_slice.So in a matter of
>> (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
>> being put forth here as well, we don't appear to be safe if the
>> scheduling_period is considered to be 40ms, otherwise it appears fine to
>> me, because it ensures the sched_slice is atleast sched_min_granularity
>> no matter what.
>
> So, you mean that we should guarantee to schedule each task atleast once
> in sysctl_sched_latency?

I would rather say all tasks get to run at least once in a sched_period,
which could be just sysctl_sched_latency or more depending on the number
of tasks in the current implementation.

> But this is not guaranteed in current code,
> this is why I made this patch. Please refer a patch description.

Ok, take the example of a runqueue with 2 task groups, each with 10
tasks, same as your previous example. Can you explain how your patch
ensures that all 20 tasks get to run at least once in a sched_period?

Regards
Preeti U Murthy

2013-04-04 00:42:14

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Preeti.

On Tue, Apr 02, 2013 at 11:02:43PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
>
> >>> I think that it is real problem that sysctl_sched_min_granularity is not
> >>> guaranteed for each task.
> >>> Instead of this patch, how about considering low bound?
> >>>
> >>> if (slice < sysctl_sched_min_granularity)
> >>> slice = sysctl_sched_min_granularity;
> >>
> >> Consider the below scenario.
> >>
> >> A runqueue has two task groups,each with 10 tasks.
> >>
> >> With the current implementation,each of these tasks get a sched_slice of
> >> 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
> >> of both the task groups) will get the chance to run.
> >>
> >> But what is the scheduling period in this scenario? Is it 40ms (extended
> >> sysctl_sched_latency), which is the scheduling period for each of the
> >> runqueues with 10 tasks in it?
> >> Or is it 80ms which is the total of the scheduling periods of each of
> >> the run queues with 10 tasks.Either way all tasks seem to get scheduled
> >> atleast once within the scheduling period.So we appear to be safe.
> >> Although the sched_slice < sched_min_granularity.
> >>
> >> With your above lower bound of sysctl_sched_min_granularity, each task
> >> of each tg gets 4ms as its sched_slice.So in a matter of
> >> (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
> >> being put forth here as well, we don't appear to be safe if the
> >> scheduling_period is considered to be 40ms, otherwise it appears fine to
> >> me, because it ensures the sched_slice is atleast sched_min_granularity
> >> no matter what.
> >
> > So, you mean that we should guarantee to schedule each task atleast once
> > in sysctl_sched_latency?
>
> I would rather say all tasks get to run atleast once in a sched_period,
> which could be just sysctl_sched_latency or more depending on the number
> of tasks in the current implementation.
>
> > But this is not guaranteed in current code,
> > this is why I made this patch. Please refer a patch description.
>
> Ok,take the example of a runqueue with 2 task groups,each with 10
> tasks.Same as your previous example. Can you explain how your patch
> ensures that all 20 tasks get to run atleast once in a sched_period?

My patch doesn't ensure that :)
I just wanted to point out, through the patch description, a problem of the
current implementation: it can't ensure that every task runs at least once
in a sched_period.

So, how about extending the sched_period with rq->nr_running instead of
cfs_rq->nr_running? It is just a quick thought, but I think we can ensure
that every task runs at least once within this extended sched_period.
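
In patch form this would be roughly a one-line change in sched_slice()
(only a sketch, untested; rq_of() is the existing helper in fair.c):

-	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+	u64 slice = __sched_period(rq_of(cfs_rq)->nr_running + !se->on_rq);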

And, do we leave a problem if we cannot guarantee the at-least-once property?

Thanks.

>
> Regards
> Preeti U Murthy

2013-04-04 00:55:01

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] sched: factor out code to should_we_balance()

Hello, Peter.

On Tue, Apr 02, 2013 at 12:29:42PM +0200, Peter Zijlstra wrote:
> On Tue, 2013-04-02 at 12:00 +0200, Peter Zijlstra wrote:
> > On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote:
> > >
> > > It seems that there is some misunderstanding about this patch.
> > > In this patch, we don't iterate all groups. Instead, we iterate on
> > > cpus of local sched_group only. So there is no penalty you mentioned.
> >
> > OK, I'll go stare at it again..
>
> Ah, I see, you're doing should_we_balance() _before_
> find_busiest_group() and instead you're doing another for_each_cpu() in
> there.
>
> I'd write the thing like:
>
> static bool should_we_balance(struct lb_env *env)
> {
>         struct sched_group *sg = env->sd->groups;
>         struct cpumask *sg_cpus, *sg_mask;
>         int cpu, balance_cpu = -1;
>
>         if (env->idle == CPU_NEWLY_IDLE)
>                 return true;
>
>         sg_cpus = sched_group_cpus(sg);
>         sg_mask = sched_group_mask(sg);
>
>         for_each_cpu_and(cpu, sg_cpus, env->cpus) {
>                 if (!cpumask_test_cpu(cpu, sg_mask))
>                         continue;
>
>                 if (!idle_cpu(cpu))
>                         continue;
>
>                 balance_cpu = cpu;
>                 break;
>         }
>
>         if (balance_cpu == -1)
>                 balance_cpu = group_balance_cpu(sg);
>
>         return balance_cpu == env->dst_cpu;
> }

Okay. It looks nice.

>
> I also considered doing the group_balance_cpu() first to avoid having
> to do the idle_cpu() scan, but that's a slight behavioural change
> afaict.

As a quick thought, we can avoid it in the way below.

balance_cpu = group_balance_cpu(sg);
if (idle_cpu(balance_cpu))
        return balance_cpu == env->dst_cpu;
else
        do the idle_cpu() scan loop

Is it your idea? If not, please let me know your idea.
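
Written out, that variant might look something like the code below (just a
sketch to illustrate the idea, not a tested patch; when no cpu in the group
is idle, it falls back to group_balance_cpu() like your version does):

static bool should_we_balance(struct lb_env *env)
{
        struct sched_group *sg = env->sd->groups;
        struct cpumask *sg_cpus, *sg_mask;
        int cpu, balance_cpu;

        if (env->idle == CPU_NEWLY_IDLE)
                return true;

        /* Try the designated balance cpu first; skip the scan if it is idle. */
        balance_cpu = group_balance_cpu(sg);
        if (idle_cpu(balance_cpu))
                return balance_cpu == env->dst_cpu;

        sg_cpus = sched_group_cpus(sg);
        sg_mask = sched_group_mask(sg);

        /* Otherwise the first idle cpu in the group gets to balance. */
        for_each_cpu_and(cpu, sg_cpus, env->cpus) {
                if (!cpumask_test_cpu(cpu, sg_mask))
                        continue;

                if (!idle_cpu(cpu))
                        continue;

                return cpu == env->dst_cpu;
        }

        /* No idle cpu at all: fall back to the group balance cpu. */
        return balance_cpu == env->dst_cpu;
}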

Thanks.


2013-04-04 06:50:59

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hi Joonsoo,

On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
> Hello, Preeti.

>
> So, how about extending a sched_period with rq->nr_running, instead of
> cfs_rq->nr_running? It is my quick thought and I think that we can ensure
> to run atleast once in this extending sched_period.

Yeah, this seems to be correct. This would ensure sched_min_granularity
also. So then, in the scenario where there are 2 tgs in a runqueue with
10 tasks each, when we calculate the sched_slice of any task,
__sched_period() would return 4*20 = 80ms.

The sched_slice of each of the tasks would be 80/20 = 4ms. But what about
the sched_slice of each task group? How would that be calculated then?

Let us take the above example and walk through this problem. This would
probably help us spot the issues involved with it.

> And, do we leave a problem if we cannot guaranteed atleast once property?

This would depend on the results of the benchmarks with the changes. I am
unable to comment on this off the top of my head.

Thank you

Regards
Preeti U Murthy



2013-04-05 02:06:22

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

Hello, Preeti.

On Thu, Apr 04, 2013 at 12:18:32PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
>
> On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
> > Hello, Preeti.
>
> >
> > So, how about extending a sched_period with rq->nr_running, instead of
> > cfs_rq->nr_running? It is my quick thought and I think that we can ensure
> > to run atleast once in this extending sched_period.
>
> Yeah this seems to be correct.This would ensure sched_min_granularity
> also. So then in the scenarion where there are 2 tgs in a runqueue with
> 10 tasks each,when we calculate the sched_slice of any task,the
> __sched_period() would return 4*20 = 80ms.
>
> The sched_slice of each of the task would be 80/20 = 4ms. But what about
> the sched_slice of each task group? How would that be calculated then?

Ah... Okay.
I will think more deeply about this issue.

>
> Let us take the above example and walk through this problem.This would
> probably help us spot the issues involved with this.
>
> > And, do we leave a problem if we cannot guaranteed atleast once property?
>
> This would depend on the results of the benchmarks with the changes.I am
> unable to comment on this off the top of my head.

Okay. :)

Thanks for your kind review!!

>
> Thank you
>
> Regards
> Preeti U Murthy