Hi,
We're seeing a livelock where two CPUs both loop with interrupts
disabled in pick_next_task_fair() / idle_balance() and continuously
fetch all tasks from each other. This has been observed on a 3.18
kernel but the relevant code paths appear to be the same in current
mainline.
The problem situation appears to be this:
cpu0                                 cpu1

pick_next_task
 take rq0->lock
 pick_next_task_fair
  running=0
  idle_balance()
   drop rq0->lock
                                     wakeup A,B on cpu0

                                     pick_next_task
                                      take rq1->lock
                                      pick_next_task_fair
                                       running=0
                                       idle_balance()
                                        drop rq1->lock
                                        load_balance()
                                         busiest=rq0
                                         take rq0->lock
                                         detach A,B
                                         drop rq0->lock
                                         take rq1->lock
                                         attach A,B
                                         pulled_task = 2
                                        drop rq1->lock
   load_balance()
    busiest=rq1
    take rq1->lock
    detach A,B
    drop rq1->lock
    take rq0->lock
    attach A,B
    pulled_task = 2
   drop rq0->lock
                                       running=0
                                       idle_balance()
                                        busiest=rq0, pull A,B, etc.
  running=0
  load_balance()
   busiest=rq1, pull A,B, etc.
And this goes on, with interrupts disabled on both CPUs, for at least
100 ms (which is when a hardware watchdog hits).
The conditions needed, apart from the right timing, are:
- cgroups. One of the tasks, say A, needs to be in a CPU cgroup. When
the problem occurs, A's ->se has zero load_avg_contrib and
task_h_load(A) is zero. However, the se->parent->parent of A has a
(relatively) high load_avg_contrib, so cpu0's cfs_rq has a relatively
high runnable_load_avg. find_busiest_group() therefore detects an
imbalance, and detach_tasks() detaches all tasks (a worked example
follows this list).
- PREEMPT=n. Otherwise, the code under #ifdef in detach_tasks() would
ensure that we'd only ever pull a maximum of one task during idle
balancing.
- cpu0 and cpu1 are threads on the same core (cpus_share_cache() ==
true). Otherwise, cpu1 will not be able to wake up tasks on cpu0
while cpu0 has interrupts disabled (since an IPI would be required).
Turning off the default TTWU_QUEUE feature would also provide the
same effect.
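To make the cgroup condition concrete, here is roughly how the numbers can
line up (illustrative values, in the spirit of the crash dump discussed
later in the thread):

  task_h_load(A) = A->se.avg.load_avg_contrib * cfs_rq->h_load
                   / (cfs_rq->runnable_load_avg + 1)
                 = 0 * cfs_rq->h_load / (cfs_rq->runnable_load_avg + 1)
                 = 0

while the busiest-group selection works from rq0's cfs_rq->runnable_load_avg,
which still contains the group contribution of A's se->parent->parent (say
~118). env->imbalance therefore ends up well above sum(task_h_load(p)) over
the tasks actually on rq0, and detach_tasks() only stops once the task list
is empty.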
I see two simple ways to prevent the livelock. One is to just remove
the #ifdef:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
detached++;
env->imbalance -= load;
-#ifdef CONFIG_PREEMPT
/*
* NEWIDLE balancing is a source of latency, so preemptible
* kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
*/
if (env->idle == CPU_NEWLY_IDLE)
break;
-#endif
/*
* We only want to steal up to the prescribed amount of
Or this should work too:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
*/
if (this_rq->cfs.h_nr_running && !pulled_task)
pulled_task = 1;
+ else if (!this_rq->cfs.h_nr_running && pulled_task)
+ pulled_task = 0;
out:
/* Move the next balance forward */
But it seems that the real problem is that detach_tasks() can remove all
tasks, when cgroups are involved as described above?
Thanks.
/Rabin
On Tue, 2015-06-30 at 16:30 +0200, Rabin Vincent wrote:
> Hi,
>
> We're seeing a livelock where two CPUs both loop with interrupts
> disabled in pick_next_task_fair() / idle_balance() and continuously
> fetch all tasks from each other...
Hm. Does the below help? Looks to me like we are over-balancing.
homer:/sys/kernel/debug/tracing # echo > trace;massive_intr 12 10 > /dev/null;cat trace|grep OINK | wc -l;tail trace
480
massive_intr-8967 [003] d... 404.285682: load_balance: OINK - imbal: 1 load: 83 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 83
massive_intr-8960 [001] d.s. 404.293331: load_balance: OINK - imbal: 1 load: 84 run: 1 det: 1 sload_was: 180 sload_is: 99 dload: 174
massive_intr-8962 [002] d.s. 404.317572: load_balance: OINK - imbal: 1 load: 83 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 161
massive_intr-8967 [003] d... 404.318296: load_balance: OINK - imbal: 1 load: 83 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 83
massive_intr-8960 [005] d... 404.341049: load_balance: OINK - imbal: 1 load: 84 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 84
massive_intr-8962 [002] d.s. 404.381549: load_balance: OINK - imbal: 1 load: 83 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 161
massive_intr-8971 [005] d... 404.417148: load_balance: OINK - imbal: 1 load: 84 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 84
massive_intr-8964 [006] d.s. 404.418536: load_balance: OINK - imbal: 1 load: 72 run: 1 det: 1 sload_was: 144 sload_is: 83 dload: 149
massive_intr-8968 [003] d... 404.437861: load_balance: OINK - imbal: 1 load: 83 run: 1 det: 1 sload_was: 180 sload_is: 83 dload: 83
massive_intr-8970 [001] d.s. 404.485263: load_balance: OINK - imbal: 1 load: 71 run: 1 det: 1 sload_was: 154 sload_is: 83 dload: 148
homer:/sys/kernel/debug/tracing # echo > trace;hackbench > /dev/null;cat trace|grep OINK | wc -l;tail trace
51
hackbench-9079 [006] d... 425.982722: load_balance: OINK - imbal: 54 load: 42 run: 4 det: 4 sload_was: 130 sload_is: 62 dload: 109
hackbench-9231 [002] d... 425.982974: load_balance: OINK - imbal: 14 load: 23 run: 1 det: 1 sload_was: 30 sload_is: 15 dload: 23
hackbench-9328 [006] d... 425.983037: load_balance: OINK - imbal: 16 load: 72 run: 2 det: 1 sload_was: 44 sload_is: 32 dload: 72
hackbench-9197 [002] d... 425.984416: load_balance: OINK - imbal: 62 load: 21 run: 3 det: 8 sload_was: 232 sload_is: 78 dload: 119
hackbench-9196 [004] d... 425.984507: load_balance: OINK - imbal: 45 load: 43 run: 1 det: 1 sload_was: 44 sload_is: 22 dload: 43
hackbench-9201 [004] d... 425.984648: load_balance: OINK - imbal: 15 load: 44 run: 1 det: 2 sload_was: 71 sload_is: 25 dload: 73
hackbench-9235 [002] d... 425.984789: load_balance: OINK - imbal: 5 load: 32 run: 2 det: 1 sload_was: 65 sload_is: 42 dload: 54
hackbench-9327 [000] d... 425.985424: load_balance: OINK - imbal: 1 load: 95 run: 1 det: 1 sload_was: 49 sload_is: 25 dload: 95
hackbench-9193 [003] d... 425.988701: load_balance: OINK - imbal: 22 load: 94 run: 1 det: 1 sload_was: 128 sload_is: 66 dload: 94
hackbench-9197 [003] d... 425.988712: load_balance: OINK - imbal: 56 load: 92 run: 1 det: 1 sload_was: 118 sload_is: 66 dload: 92
homer:/sys/kernel/debug/tracing # tail trace
kwin-4654 [002] d... 460.627311: load_balance: kglobalaccel:4597 is non-contributor - count as 37
konsole-4707 [005] d... 460.627639: load_balance: kded4:4594 is non-contributor - count as 40
konsole-4707 [005] d... 460.627640: load_balance: kactivitymanage:4611 is non-contributor - count as 40
kwin-4654 [005] d... 460.627712: load_balance: kactivitymanage:4611 is non-contributor - count as 41
kactivitymanage-4611 [005] d... 460.627726: load_balance: kded4:4594 is non-contributor - count as 40
kwin-4654 [005] d... 460.828628: load_balance: OINK - imbal: 3 load: 618 run: 1 det: 1 sload_was: 1024 sload_is: 0 dload: 618
plasma-desktop-4665 [000] d... 461.886746: load_balance: baloo_file:4666 is non-contributor - count as 141
systemd-journal-397 [001] d... 466.209790: load_balance: pulseaudio:4718 is non-contributor - count as 5
systemd-1 [007] d... 466.209868: load_balance: kmix:4704 is non-contributor - count as 13
gnome-keyring-d-9455 [002] d... 466.209902: load_balance: krunner:4702 is non-contributor - count as 8
---
kernel/sched/fair.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
- unsigned long load;
+ unsigned long load, d_load = 0, s_load = env->src_rq->load.weight;
int detached = 0;
lockdep_assert_held(&env->src_rq->lock);
@@ -5936,6 +5936,11 @@ static int detach_tasks(struct lb_env *e
detached++;
env->imbalance -= load;
+ if (!load) {
+ load = min_t(unsigned long, env->imbalance, p->se.load.weight);
+ trace_printk("%s:%d is non-contributor - count as %ld\n", p->comm, p->pid, load);
+ }
+ d_load += load;
#ifdef CONFIG_PREEMPT
/*
@@ -5954,6 +5959,18 @@ static int detach_tasks(struct lb_env *e
if (env->imbalance <= 0)
break;
+ /*
+ * We don't want to bleed busiest_rq dry either. Weighted load
+ * and/or imbalance may be dinky, load contribution can even be
+ * zero, perhaps causing us to over balance had we not assigned
+ * it above.
+ */
+ if (env->src_rq->load.weight <= env->dst_rq->load.weight + d_load) {
+ trace_printk("OINK - imbal: %ld load: %ld run: %d det: %d sload_was: %ld sload_is: %ld dload: %ld\n",
+ env->imbalance, load, env->src_rq->nr_running, detached, s_load, env->src_rq->load.weight, env->dst_rq->load.weight+d_load);
+ break;
+ }
+
continue;
next:
list_move_tail(&p->se.group_node, tasks);
On Wed, Jul 01, 2015 at 07:36:35AM +0200, Mike Galbraith wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> struct task_struct *p;
> - unsigned long load;
> + unsigned long load, d_load = 0, s_load = env->src_rq->load.weight;
> int detached = 0;
>
> lockdep_assert_held(&env->src_rq->lock);
> @@ -5936,6 +5936,11 @@ static int detach_tasks(struct lb_env *e
>
> detached++;
> env->imbalance -= load;
> + if (!load) {
> + load = min_t(unsigned long, env->imbalance, p->se.load.weight);
> + trace_printk("%s:%d is non-contributor - count as %ld\n", p->comm, p->pid, load);
> + }
> + d_load += load;
>
> #ifdef CONFIG_PREEMPT
> /*
> @@ -5954,6 +5959,18 @@ static int detach_tasks(struct lb_env *e
> if (env->imbalance <= 0)
> break;
>
> + /*
> + * We don't want to bleed busiest_rq dry either. Weighted load
> + * and/or imbalance may be dinky, load contribution can even be
> + * zero, perhaps causing us to over balance had we not assigned
> + * it above.
> + */
> + if (env->src_rq->load.weight <= env->dst_rq->load.weight + d_load) {
> + trace_printk("OINK - imbal: %ld load: %ld run: %d det: %d sload_was: %ld sload_is: %ld dload: %ld\n",
> + env->imbalance, load, env->src_rq->nr_running, detached, s_load, env->src_rq->load.weight, env->dst_rq->load.weight+d_load);
> + break;
> + }
> +
> continue;
> next:
> list_move_tail(&p->se.group_node, tasks);
>
I've tried to analyse how your patch would affect the situation in one
of the crash dumps which I have of the problem.
In this dump, cpu0 is in the middle of dequeuing all tasks from cpu1.
rcu_sched has already been detached and there are two tasks left, one of
which is being processed by dequeue_entity_load_avg() called from
dequeue_task() at the time the watchdog hits. lb_env is the following.
imbalance is, as you can see, 60.
crash> struct lb_env 8054fd50
struct lb_env {
sd = 0x8fc13e00,
src_rq = 0x81297200,
src_cpu = 1,
dst_cpu = 0,
dst_rq = 0x8128e200,
dst_grpmask = 0x0,
new_dst_cpu = 0,
idle = CPU_NEWLY_IDLE,
imbalance = 60,
cpus = 0x8128d238,
flags = 0,
loop = 2,
loop_break = 32,
loop_max = 3,
fbq_type = all,
tasks = {
next = 0x8fc4c6ec,
prev = 0x8fc4c6ec
}
}
Weights of the runqueues:
crash> struct rq.load.weight runqueues:0,1
[0]: 8128e200
load.weight = 0,
[1]: 81297200
load.weight = 1935,
The only running tasks on the system are these three:
crash> foreach RU ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 0 0 0 8056d8b0 RU 0.0 0 0 [swapper/0]
> 0 0 1 8fc5cd18 RU 0.0 0 0 [swapper/1]
> 0 0 2 8fc5c6b0 RU 0.0 0 0 [swapper/2]
> 0 0 3 8fc5c048 RU 0.0 0 0 [swapper/3]
7 2 0 8fc4c690 RU 0.0 0 0 [rcu_sched]
30 2 1 8fd26108 RU 0.0 0 0 [kswapd0]
413 1 1 8edda408 RU 0.6 1900 416 rngd
And the load.weight and load_avg_contribs for them and their parent SEs:
crash> foreach 7 30 413 load
PID: 7 TASK: 8fc4c690 CPU: 0 COMMAND: "rcu_sched"
task_h_load(): 325 [ = (load_avg_contrib { 5} * cfs_rq->h_load { 65}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
SE: 8fc4c6d8 load_avg_contrib: 5 load.weight: 1024 PARENT: 00000000 GROUPNAME: (null)
PID: 30 TASK: 8fd26108 CPU: 1 COMMAND: "kswapd0"
task_h_load(): 10 [ = (load_avg_contrib { 10} * cfs_rq->h_load { 133}) / (cfs_rq->runnable_load_avg { 128} + 1) ]
SE: 8fd26150 load_avg_contrib: 10 load.weight: 1024 PARENT: 00000000 GROUPNAME: (null)
PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: [email protected]
SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
Given the above, we can see that with your patch:
- dst_rq->load.weight is 0 and will not change in this loop.
- src_rq->load.weight was 1935 + 1024 before the loop. It will go down
to 1935 (already has), 1024, and then 0.
- d_load will be 325*, 335, and then 395 (see the breakdown below).
(* - probably not exactly since rcu_sched has already had set_task_rq() called
[cfs_rq switched] on it, but I guess it's actually going to be much lower based
on the other dumps I see where rcu_sched hasn't been switched yet).
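(That is: 325 for rcu_sched's task_h_load(), plus 10 for kswapd0, plus ~60
for rngd, whose zero task_h_load() the patch replaces with
min(env->imbalance, p->se.load.weight).)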
So, we will not hit the "if (env->src_rq->load.weight <=
env->dst_rq->load.weight + d_load)" condition to break out of the loop until we
actually move all tasks. So the patch will not have any effect on this case.
Or am I missing something?
We'll set up a test anyway with the patch; the problem usually takes a
couple of days to reproduce.
/Rabin
On Wed, 2015-07-01 at 16:55 +0200, Rabin Vincent wrote:
> So, we will not hit the "if (env->src_rq->load.weight <=
> env->dst_rq->load.weight + d_load)" condition to break out of the loop until we
> actually move all tasks. So the patch will not have any effect on this case.
> Or am I missing something?
Probably not. I did have it breaking if dst_rq would pass
src_rq->nr_running, which would certainly stop it, but thought I'd try to
let it watch weights.
Either way, task_h_load(p) returning 0 is not very wonderful.
-Mike
On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
> task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
> SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
> SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: [email protected]
> SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
So there's two problems there... the first we can (and should) fix, the
second I'm not sure there's anything we can do about.
Firstly, a group (parent) load_avg_contrib should never be less than
that of its constituent parts, therefore the top 3 SEs should have at
least 118 too.
Now it's been a while since I looked at the per entity load tracking
stuff so some of the details have left me, but while it looks like we
add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
propagate that into the corresponding (group) se.
This means the se->avg.load_avg_contrib is accounted per cpu without
migration benefits. So if our task just got migrated onto a cpu that
hasn't run the group in a while, the group will not have accumulated
runtime.
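A concrete illustration (numbers made up): a task with
se.avg.load_avg_contrib = 100 in group G migrates from cpu1 to cpu0. At
enqueue its 100 is added to G's cfs_rq->runnable_load_avg on cpu0, but the
group se for G on cpu0 still reports whatever load_avg_contrib it last
accumulated locally, possibly ~0 if G hasn't run there recently. Since
task_h_load() derives cfs_rq->h_load from those group se contributions, the
freshly migrated task can end up with task_h_load() ~= 0, just like rngd in
the dump quoted above.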
A quick fix would be something like the below; although I think we want
to do something else, like maybe propagate the load_avg_contrib up the
hierarchy etc. But I need to think more about that.
The second problem is that your second SE has a weight of 2; that puts a
factor of 1/512 into task_h_load(), which will flatten pretty much
anything below it down to almost nothing. This is per configuration, so
there's really not much we can or should do about that.
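Spelled out: that weight of 2 presumably means cpu.shares = 2 on that group,
against the default of 1024, so even when the group is fully runnable it
contributes at most ~2 to its parent; everything below it is scaled by
roughly 2/1024 = 1/512 on the way down, which is why a task in there can
only ever report a tiny task_h_load().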
Untested, uncompiled hackery following, mostly for discussion.
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 1 +
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0ca0a6..95d0ba249c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,7 +6082,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
unsigned long now = jiffies;
- unsigned long load;
+ unsigned long load, load_avg_contrib = 0;
if (cfs_rq->last_h_load_update == now)
return;
@@ -6090,6 +6090,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
cfs_rq->h_load_next = NULL;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_load_avg_contrib = load_avg_contrib =
+ max(load_avg_contrib, se->avg.load_avg_contrib);
cfs_rq->h_load_next = se;
if (cfs_rq->last_h_load_update == now)
break;
@@ -6102,7 +6104,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
while ((se = cfs_rq->h_load_next) != NULL) {
load = cfs_rq->h_load;
- load = div64_ul(load * se->avg.load_avg_contrib,
+ load = div64_ul(load * cfs_rq->h_load_avg_contrib,
cfs_rq->runnable_load_avg + 1);
cfs_rq = group_cfs_rq(se);
cfs_rq->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 885889190a1f..7738e3b301b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ struct cfs_rq {
* this group.
*/
unsigned long h_load;
+ unsigned long h_load_avg_contrib;
u64 last_h_load_update;
struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
Hello,
We have two basic kinds of load balancing: idle balancing and periodic
balancing. It takes a tick for periodic balancing to happen again, so the
livelock could not be from it.
Idle balancing, on the contrary, occurs as needed at arbitrary times, so
the livelock could happen there.
And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
tasks from the other, makes the other idle, and this iterates...
That being said, it is also obvious to prevent the livelock from happening:
idle pulling until the source rq's nr_running is 1, because otherwise we
just avoid idleness by making another idleness.
Hopefully the patch at the end works.
On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> > PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
> > task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
> > SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
> > SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: [email protected]
> > SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> > SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
>
> Firstly, a group (parent) load_avg_contrib should never be less than
> that of its constituent parts, therefore the top 3 SEs should have at
> least 118 too.
I think going downward in that listing is toward the parent, so in this
case "parent is bigger than child" holds.
But if the child is the only contributor to the parent's load (which is
probably the case here), then the load_avg_contrib should not jump suddenly
from 0 to 118.
So this might be due to __update_group_entity_contrib(); I did not look into
the details of this complex function, but a glimpse suggests it is at least
not consistent, if not arbitrary, so it likely will not satisfy "parent is
bigger than child".
But this patch series http://comments.gmane.org/gmane.linux.kernel/1981970
should be very consistent in computing all SEs' load averages, and thus
satisfy that universal truth...
> Now it's been a while since I looked at the per entity load tracking
> stuff so some of the details have left me, but while it looks like we
> add the se->avg.load_avg_contrib to its cfs->runnable_load, we do not
> propagate that into the corresponding (group) se.
>
> This means the se->avg.load_avg_contrib is accounted per cpu without
> migration benefits. So if our task just got migrated onto a cpu that
> hasn't run the group in a while, the group will not have accumulated
> runtime.
Yes, this is true: the migration has nothing to do with either the source
or the destination group SEs. Down this road, if we want the addition and
subtraction after migration, we need to do it bottom-up along the SE
hierarchy, and do it without the rq->lock (?). Need to think about it for a
while.
Thanks,
Yuyang
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..f7cc1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
return 0;
while (!list_empty(tasks)) {
+
+ if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
+ break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
env->loop++;
Hi Mike,
On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
>
> > That being said, it is also obvious to prevent the livelock from happening:
> > idle pulling until the source rq's nr_running is 1, because otherwise we
> > just avoid idleness by making another idleness.
>
> Yeah, but that's just the symptom, not the disease. Better for the idle
> balance symptom may actually be to only pull one when idle balancing.
> After all, the immediate goal is to find something better to do than
> idle, not to achieve continual perfect (is the enemy of good) balance.
>
Symptom? :)
You mean "pull one and stop, can't be greedy"? Right, but still need to
assure you don't make another idle CPU (meaning until nr_running == 1), which
is the cure to the disease.
I am ok with at most "pull one", but probably we stick to the load_balance()
by pulling a fair amount, assuming load_balance() magically computes the
right imbalance, otherwise you may have to do multiple "pull one"s.
Thanks,
Yuyang
On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> That being said, it is also obvious to prevent the livelock from happening:
> idle pulling until the source rq's nr_running is 1, because otherwise we
> just avoid idleness by making another idleness.
Yeah, but that's just the symptom, not the disease. Better for the idle
balance symptom may actually be to only pull one when idle balancing.
After all, the immediate goal is to find something better to do than
idle, not to achieve continual perfect (is the enemy of good) balance.
-Mike
On Thu, 2015-07-02 at 09:05 +0800, Yuyang Du wrote:
>
> Symptom? :)
Yeah, my fingers have a collection of those :)
-Mike
On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> tasks from the other, makes the other idle, and this iterates...
>
> That being said, it is also obvious to prevent the livelock from happening:
> idle pulling until the source rq's nr_running is 1, because otherwise we
> just avoid idleness by making another idleness.
Well, ideally the imbalance calculation would be so that it would avoid
this from happening in the first place. It's a 'balance' operation, not a
'steal everything'.
We want to take work -- as we have none -- but we want to ensure that
afterwards we have equal work, ie we're balanced.
So clearly that all is hosed. Now Morten was looking into simplifying
calculate_imbalance() recently.
> On Wed, Jul 01, 2015 at 10:44:04PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
> > > PID: 413 TASK: 8edda408 CPU: 1 COMMAND: "rngd"
> > > task_h_load(): 0 [ = (load_avg_contrib { 0} * cfs_rq->h_load { 0}) / (cfs_rq->runnable_load_avg { 0} + 1) ]
> > > SE: 8edda450 load_avg_contrib: 0 load.weight: 1024 PARENT: 8fffbd00 GROUPNAME: (null)
> > > SE: 8fffbd00 load_avg_contrib: 0 load.weight: 2 PARENT: 8f531f80 GROUPNAME: [email protected]
> > > SE: 8f531f80 load_avg_contrib: 0 load.weight: 1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
> > > SE: 8f456e00 load_avg_contrib: 118 load.weight: 911 PARENT: 00000000 GROUPNAME: system.slice
> >
> > Firstly, a group (parent) load_avg_contrib should never be less than
> > that of its constituent parts, therefore the top 3 SEs should have at
> > least 118 too.
>
> I think going downward in that listing is toward the parent,
Ugh, I cannot read. Let me blame it on the heat.
On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
> Hi Mike,
>
> On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> > On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> >
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, because otherwise we
> > > just avoid idleness by making another idleness.
> >
> > Yeah, but that's just the symptom, not the disease. Better for the idle
> > balance symptom may actually be to only pull one when idle balancing.
> > After all, the immediate goal is to find something better to do than
> > idle, not to achieve continual perfect (is the enemy of good) balance.
> >
> Symptom? :)
>
> You mean "pull one and stop, can't be greedy"? Right, but still need to
> assure you don't make another idle CPU (meaning until nr_running == 1), which
> is the cure to the disease.
>
> I am ok with at most "pull one", but probably we stick to the load_balance()
> by pulling a fair amount, assuming load_balance() magically computes the
> right imbalance, otherwise you may have to do multiple "pull one"s.
Talking about the disease and looking at the debug data that Rabin has
provided, I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().
We have three entities in the root cfs_rq on cpu1:
1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
413 further down the hierarchy with task_h_load() = 0. The 118 comes
from the blocked load contribution in the system.slice task group.
calculate_imbalance() figures out the average loads are:
cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133
domain: load/capacity = (0 + 133)*1024/(2*1024) = 62
env->imbalance = 62
Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it's pretty
close ;-)
detach_tasks() will attempt to pull 62 based on the tasks' task_h_load(), but
the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
empty the src_rq.
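Spelled out: 62 > 5 + 10 + 0 = 15, so the env->imbalance <= 0 break in
detach_tasks() never fires, and with PREEMPT=n the CPU_NEWLY_IDLE break is
compiled out, which in practice leaves running out of tasks as the only way
out of the loop.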
IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
imbalance includes blocked load and hence env->imbalance >=
sum(task_h_load(p)) for all tasks p on the rq. Which leads to
detach_tasks() emptying the rq completely in the reported scenario where
blocked load > runnable load.
Whether emptying the src_rq is the right thing to do depends on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively
correct to not empty the rq and hence you could consider env->imbalance
to be too big.
I think we will see more problems of this kind if we include
weighted_cpuload() as well. Parts of the imbalance calculation code are
quite old and could use some attention first.
A short term fix could be what Yuyang proposes: stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance where we may want to migrate the last task, as active load
balance doesn't use detach_tasks().
Morten
On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> > tasks from the other, makes the other idle, and this iterates...
> >
> > That being said, it is also obvious to prevent the livelock from happening:
> > idle pulling until the source rq's nr_running is 1, because otherwise we
> > just avoid idleness by making another idleness.
>
> Well, ideally the imbalance calculation would be so that it would avoid
> this from happening in the first place. It's a 'balance' operation, not a
> 'steal everything'.
>
> We want to take work -- as we have none -- but we want to ensure that
> afterwards we have equal work, ie we're balanced.
Agreed, I think this is the true problem. See my other reply.
>
> So clearly that all is hosed. Now Morten was looking into simplifying
> calculate_imbalance() recently.
Yes. I'm held up doing other stuff at the moment, but I think
calculate_imbalance() needs some attention and I'm planning on looking at
that next.
On Thu, Jul 02, 2015 at 12:44:55PM +0100, Morten Rasmussen wrote:
> On Thu, Jul 02, 2015 at 12:53:59PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > And obviously, the idle balancing livelock SHOULD happen: one CPU pulls
> > > tasks from the other, makes the other idle, and this iterates...
> > >
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, because otherwise we
> > > just avoid idleness by making another idleness.
> >
> > Well, ideally the imbalance calculation would be so that it would avoid
> > this from happening in the first place. It's a 'balance' operation, not a
> > 'steal everything'.
> >
> > We want to take work -- as we have none -- but we want to ensure that
> > afterwards we have equal work, ie we're balanced.
>
> Agreed, I think this is the true problem. See my other reply.
Yes, this is agreed at all times. Like I said, load_balance() (for idle
balancing) should compute the right imbalance and move a fair amount, so that
we end up balanced. Whatever is wrong in how the "right imbalance" is computed
and moved should be fixed anyway.
But still, I think, even with the above, in idle balancing, pulling until the source
rq's nr_running == 1 is not just "a short term fix", but should be there permanently
acting like a last guard with no overhead, why not.
>
> >
> > So clearly that all is hosed. Now Morten was looking into simplifying
> > calculate_imbalance() recently.
>
> Yes. I'm held up doing other stuff at the moment, but I think
> calculate_imbalance() needs some attention and I'm planning on looking at
> that next.
Thanks,
Yuyang
Hi Morten,
On Thu, Jul 02, 2015 at 12:40:32PM +0100, Morten Rasmussen wrote:
> detach_tasks() will attempt to pull 62 based on the tasks' task_h_load(), but
> the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
> empty the src_rq.
>
> IOW, since task groups include blocked load in the load_avg_contrib (see
> __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> imbalance includes blocked load and hence env->imbalance >=
> sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> detach_tasks() emptying the rq completely in the reported scenario where
> blocked load > runnable load.
Whenever I want to know the load avg concerning task group, I need to
walk through the complete codes again, I prefer not to do it this time.
But it should not be that simple to say "the 118 comes from the blocked load".
Anyway, with blocked load, yes, we definitely can't move (or even find) some
amount of the imbalance if we only look at the tasks on the queue. But this
may or may not be a problem.
Firstly, the question comes to whether we want blocked load anywhere.
This is just about a "now vs. average" question.
Secondly, if we stick to average, we just need to treat the blocked load
consistently, not that group SE has it, but task SE does not, or somewhere
has it, others not.
Thanks,
Yuyang
> Whether emptying the src_rq is the right thing to do depends on your
> point of view. Does balanced load (runnable+blocked) take priority over
> keeping cpus busy or not? For idle_balance() it seems intuitively
> correct to not empty the rq and hence you could consider env->imbalance
> to be too big.
>
> I think we will see more problems of this kind if we include
> weighted_cpuload() as well. Parts of the imbalance calculation code are
> quite old and could use some attention first.
>
> A short term fix could be what Yuyang proposes: stop pulling tasks when
> there is only one left in detach_tasks(). It won't affect active load
> balance where we may want to migrate the last task, as active load
> balance doesn't use detach_tasks().
>
> Morten
On Fri, 2015-07-03 at 02:42 +0800, Yuyang Du wrote:
> But still, I think, even with the above, in idle balancing, pulling until the source
> rq's nr_running == 1 is not just "a short term fix", but should be there permanently
> acting like a last guard with no overhead, why not.
Yeah, seems so. Searching for steal all samples...
(this is all with autogroup)
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 3 imb: 23 det_tasks: 2 det_load: 3 zeros: 1
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 0 imb: 32 det_tasks: 2 det_load: 0 zeros: 2
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 1 imb: 17 det_tasks: 2 det_load: 1 zeros: 1
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 37 imb: 22 det_tasks: 2 det_load: 37 zeros: 1
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 0 imb: 102 det_tasks: 2 det_load: 0 zeros: 2
load_balance: idle - s_run: 0 d_run: 1 s_load: 0 d_load: 93 imb: 47 det_tasks: 1 det_load: 93 zeros: 0
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 202 imb: 125 det_tasks: 2 det_load: 202 zeros: 0
load_balance: idle - s_run: 0 d_run: 2 s_load: 0 d_load: 243 imb: 188 det_tasks: 2 det_load: 243 zeros: 0
load_balance: idle - s_run: 0 d_run: 1 s_load: 0 d_load: 145 imb: 73 det_tasks: 1 det_load: 145 zeros: 0
load_balance: idle - s_run: 0 d_run: 1 s_load: 0 d_load: 46 imb: 24 det_tasks: 1 det_load: 46 zeros: 0
Both varieties of total pilferage (w/wo 0 load tasks involved) seem to
happen only during idle balance, never periodic (yet).
Oddity: make -j8 occasionally stacks/pulls piles of load=dinky.
homer:/sys/kernel/debug/tracing # for i in `seq 1 10`; do cat trace|grep "s_run: 1.*det_tasks: $i.*zeros: 0"|wc -l; done
71634
1567
79
15
1
3
0
2
3
0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 8.*zeros: 0"
<idle>-0 [002] dNs. 594.973783: load_balance: norm - s_run: 1 d_run: 9 s_load: 67 d_load: 1110 imb: 86 det_tasks: 8 det_load: 86 zeros: 0
<...>-10367 [007] d... 1456.477281: load_balance: idle - s_run: 1 d_run: 8 s_load: 805 d_load: 22 imb: 45 det_tasks: 8 det_load: 22 zeros: 0
homer:/sys/kernel/debug/tracing # cat trace|grep "s_run: 1.*det_tasks: 9.*zeros: 0"
<...>-23317 [004] d... 486.677925: load_balance: idle - s_run: 1 d_run: 9 s_load: 888 d_load: 27 imb: 47 det_tasks: 9 det_load: 27 zeros: 0
<...>-11485 [002] d... 573.411095: load_balance: idle - s_run: 1 d_run: 9 s_load: 124 d_load: 78 imb: 82 det_tasks: 9 det_load: 78 zeros: 0
<...>-23286 [000] d... 1510.378740: load_balance: idle - s_run: 1 d_run: 9 s_load: 102 d_load: 58 imb: 63 det_tasks: 9 det_load: 58 zeros: 0
On Fri, Jul 03, 2015 at 03:37:02AM +0800, Yuyang Du wrote:
> Hi Morten,
>
> On Thu, Jul 02, 2015 at 12:40:32PM +0100, Morten Rasmussen wrote:
> > detach_tasks() will attempt to pull 62 based on the tasks' task_h_load(), but
> > the task_h_load() sum is only 5 + 10 + 0 and hence detach_tasks() will
> > empty the src_rq.
> >
> > IOW, since task groups include blocked load in the load_avg_contrib (see
> > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > imbalance includes blocked load and hence env->imbalance >=
> > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > detach_tasks() emptying the rq completely in the reported scenario where
> > blocked load > runnable load.
>
> Whenever I want to know the load avg concerning task group, I need to
> walk through the complete codes again, I prefer not to do it this time.
> But it should not be that simple to say "the 118 comes from the blocked load".
But the whole hierarchy of group entities is updated each time we enqueue
or dequeue a task. I don't see how the group entity load_avg_contrib is
not up to date? Why do you need to update it again?
In any case, we have one task in the group hierarchy which has a
load_avg_contrib of 0 and the grand-grand parent group entity has a
load_avg_contrib of 118 and no additional tasks. That load contribution
must be from tasks which are no longer around on the rq? No?
> Anyway, with blocked load, yes, we definitely can't move (or even find) some
> amount of the imbalance if we only look at the tasks on the queue. But this
> may or may not be a problem.
>
> Firstly, the question comes to whether we want blocked load anywhere.
> This is just about a "now vs. average" question.
That is what I meant in the paragraph below. It is a scheduling policy
question.
> Secondly, if we stick to average, we just need to treat the blocked load
> consistently, not that group SE has it, but task SE does not, or somewhere
> has it, others not.
I agree that inconsistent use of blocked load will lead us into trouble.
The problem is that none of the load-balance logic was designed for
blocked load. It was written to deal with load that is currently on the rq
and slightly biased by average cpu load, not dealing with load
contribution of tasks which we can't migrate at the moment because they
are blocked. The load-balance code has to be updated to deal with
blocked load. We will run into all sorts of issues if we don't and roll
out use of blocked load everywhere.
However, before we can rework the load-balance code, we have to agree on
the now vs average balance policy.
Your proposed patch implements a policy somewhere in between. We try to
balance based on average, but we don't allow idle_balance() to empty a
cpu completely. A pure average balance policy would allow a cpu to go
idle even if we do have tasks waiting on other rqs, if the blocked
load indicates that other tasks will show up shortly on the cpu. A pure
"now" balance would balance based on runnable_load_avg for all entities
including groups ignoring all blocked load, but that goes against the
PELT group balancing design.
I'm not against having a policy that sits somewhere in between, we just
have to agree it is the right policy and clean up the load-balance code
such that the implemented policy is clear.
Morten
>
> Thanks,
> Yuyang
>
> > Whether emptying the src_rq is the right thing to do depends on your
> > point of view. Does balanced load (runnable+blocked) take priority over
> > keeping cpus busy or not? For idle_balance() it seems intuitively
> > correct to not empty the rq and hence you could consider env->imbalance
> > to be too big.
> >
> > I think we will see more problems of this kind if we include
> > weighted_cpuload() as well. Parts of the imbalance calculation code are
> > quite old and could use some attention first.
> >
> > A short term fix could be what Yuyang proposes: stop pulling tasks when
> > there is only one left in detach_tasks(). It won't affect active load
> > balance where we may want to migrate the last task, as active load
> > balance doesn't use detach_tasks().
> >
> > Morten
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > imbalance includes blocked load and hence env->imbalance >=
> > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > detach_tasks() emptying the rq completely in the reported scenario where
> > > blocked load > runnable load.
So IIRC we need the blocked load for groups for computing the per-cpu
slices of the total weight, avg works really well for that.
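(For reference, that slicing is roughly what calc_cfs_shares() computes,
something like:

  group se weight on cpu i ~= tg->shares * load(cfs_rq[i]) / total tg load

so with tg->shares = 1024 and per-cpu contributions of, say, 300 and 100 the
per-cpu group se weights come out around 1024*300/400 = 768 and
1024*100/400 = 256; using the averaged, blocked-load-including figures keeps
those slices from jumping around every time a task briefly sleeps.)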
> I'm not against having a policy that sits somewhere in between, we just
> have to agree it is the right policy and clean up the load-balance code
> such that the implemented policy is clear.
Right, for balancing it's a tricky question, but mixing them without
intent is, as you say, a bit of a mess.
So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
it does make some sense for the regular periodic balancing, because
there we really do care mostly about the averages, esp. so when we're
overloaded -- but there are issues there too.
Now we can't track them both (or rather we could, but overhead).
I like Yuyang's load tracking rewrite, but it changes exactly this part,
and I'm not sure I understand the full ramifications of that yet.
One way out would be to split the load balancer into 3 distinct regions;
1) get a task on every CPU, screw everything else.
2) get each CPU fully utilized, still ignoring 'load'
3) when everybody is fully utilized, consider load.
If we make find_busiest_foo() select one of these 3, and make
calculate_imbalance() invariant to the metric passed in, and have things
like cpu_load() and task_load() return different, but coherent, numbers
depending on which region we're in, this almost sounds 'simple'.
The devil is in the details, and the balancer is a hairy nest of details
which will make the above non-trivial.
But for 1) we could simply 'balance' on nr_running, for 2) we can
'balance' on runnable_avg and for 3) we'll 'balance' on load_avg (which
will then include blocked load).
Let me go play outside for a bit so that it can sink in what kind of
nonsense my heat addled brain has just sprouted :-)
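To make the three-region idea a bit more tangible, here is a rough, purely
illustrative user-space sketch -- all names and thresholds are invented, and
it only shows the metric selection, none of the real balancer machinery:

#include <stdio.h>

enum lb_region {
	LB_SPREAD_TASKS,	/* 1) some CPU idle: balance on nr_running */
	LB_FILL_CAPACITY,	/* 2) capacity left: balance on utilization */
	LB_BALANCE_LOAD,	/* 3) saturated: balance on load_avg (incl. blocked) */
};

struct cpu_stats {
	unsigned int nr_running;
	unsigned long util_avg;		/* 0..1024 */
	unsigned long load_avg;		/* includes blocked contributions */
};

static enum lb_region pick_region(const struct cpu_stats *cpus, int n)
{
	int any_idle = 0, all_saturated = 1;

	for (int i = 0; i < n; i++) {
		if (cpus[i].nr_running == 0)
			any_idle = 1;
		if (cpus[i].util_avg < 1024)
			all_saturated = 0;
	}
	if (any_idle)
		return LB_SPREAD_TASKS;
	if (!all_saturated)
		return LB_FILL_CAPACITY;
	return LB_BALANCE_LOAD;
}

/* One metric helper returning different, but coherent, numbers per region. */
static unsigned long cpu_metric(const struct cpu_stats *c, enum lb_region r)
{
	switch (r) {
	case LB_SPREAD_TASKS:	return c->nr_running;
	case LB_FILL_CAPACITY:	return c->util_avg;
	default:		return c->load_avg;
	}
}

int main(void)
{
	struct cpu_stats cpus[2] = {
		{ .nr_running = 0, .util_avg = 0,   .load_avg = 0   },
		{ .nr_running = 3, .util_avg = 700, .load_avg = 133 },
	};
	enum lb_region r = pick_region(cpus, 2);

	printf("region=%d, busiest metric=%lu\n", r, cpu_metric(&cpus[1], r));
	return 0;
}

The point being that find_busiest_foo() would consult a single
region-dependent metric while calculate_imbalance() stays invariant to which
metric it was handed.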
On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..f7cc1ef 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> return 0;
>
> while (!list_empty(tasks)) {
> +
> + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
Should we make that ->idle != CPU_NOT_IDLE ?
> + break;
> +
> p = list_first_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
Hi Morten,
On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > imbalance includes blocked load and hence env->imbalance >=
> > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > detach_tasks() emptying the rq completely in the reported scenario where
> > > blocked load > runnable load.
> >
> > Whenever I want to know the load avg concerning task group, I need to
> > walk through the complete codes again, I prefer not to do it this time.
> > But it should not be that simple to say "the 118 comes from the blocked load".
>
> But the whole hierarchy of group entities is updated each time we enqueue
> or dequeue a task. I don't see how the group entity load_avg_contrib is
> not up to date? Why do you need to update it again?
>
> In any case, we have one task in the group hierarchy which has a
> load_avg_contrib of 0 and the grand-grand parent group entity has a
> load_avg_contrib of 118 and no additional tasks. That load contribution
> must be from tasks which are no longer around on the rq? No?
load_avg_contrib has WEIGHT inside, so the most I can say is:
SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / (tg->load_avg + 1) * tg->shares
The tg->shares is probably 1024 (at least 911). So we are just left with:
cfs_rq / tg = 11.5%
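(That is, 118 / 1024 ~= 11.5%: cpu1's system.slice cfs_rq, runnable plus
blocked, accounts for roughly 11.5% of that task group's total load.)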
I myself did question the sudden jump from 0 to 118 (see a previous reply).
But anyway, this really is irrelevant to the discussion.
> > Anyway, with blocked load, yes, we definitely can't move (or even find) some
> > amount of the imbalance if we only look at the tasks on the queue. But this
> > may or may not be a problem.
> >
> > Firstly, the question comes to whether we want blocked load anywhere.
> > This is just about a "now vs. average" question.
>
> That is what I meant in the paragraph below. It is a scheduling policy
> question.
>
> > Secondly, if we stick to average, we just need to treat the blocked load
> > consistently, not that group SE has it, but task SE does not, or somewhere
> > has it, others not.
>
> I agree that inconsistent use of blocked load will lead us into trouble.
> The problem is that none of the load-balance logic was designed for
> blocked load. It was written to deal with load that is currently on the rq
> and slightly biased by average cpu load, not dealing with load
> contribution of tasks which we can't migrate at the moment because they
> are blocked. The load-balance code has to be updated to deal with
> blocked load. We will run into all sorts of issues if we don't and roll
> out use of blocked load everywhere.
>
> However, before we can rework the load-balance code, we have to agree on
> the now vs average balance policy.
>
> Your proposed patch implements a policy somewhere in between. We try to
> balance based on average, but we don't allow idle_balance() to empty a
> cpu completely. A pure average balance policy would allow a cpu to go
> idle even if we do have tasks waiting on other rqs, if the blocked
> load indicates that other tasks will show up shortly on the cpu. A pure
> "now" balance would balance based on runnable_load_avg for all entities
> including groups ignoring all blocked load, but that goes against the
> PELT group balancing design.
>
> I'm not against having a policy that sits somewhere in between, we just
> have to agree it is the right policy and clean up the load-balance code
> such that the implemented policy is clear.
The proposed patch sits in between? I agree, but would rather see it from
another perspective.
First, I don't think it merits a solution/policy. It is just a cheap
"last guard" to protect the "king" - no crash.
Second, a "pure average" policy is pretty fine in general, but it does not
mean we would simply allow a CPU to be pulled empty, that is because we are
making a bet from a prediction (average) here. By prediction, it basically
means sometimes it fails. As the failure could lead to a disaster, without
blaming the prediction, it is reasonable we make a sort of "damage control".
Thanks,
Yuyang
On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 40a7fcb..f7cc1ef 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > return 0;
> >
> > while (!list_empty(tasks)) {
> > +
> > + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
>
> Should we make that ->idle != CPU_NOT_IDLE ?
I think including CPU_IDLE is good.
--
Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing
In idle balancing where a CPU going idle pulls tasks from another CPU,
a livelock may happen if the CPU pulls all tasks from another, makes
it idle, and this iterates. So just avoid this.
Reported-by: Rabin Vincent <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40a7fcb..769d591 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
return 0;
while (!list_empty(tasks)) {
+ /*
+ * We don't want to steal all, otherwise we may be treated likewise,
+ * which could at worst lead to a livelock crash.
+ */
+ if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
env->loop++;
On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > I'm not against having a policy that sits somewhere in between, we just
> > have to agree it is the right policy and clean up the load-balance code
> > such that the implemented policy is clear.
>
> Right, for balancing it's a tricky question, but mixing them without
> intent is, as you say, a bit of a mess.
>
> So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> it does make some sense for the regular periodic balancing, because
> there we really do care mostly about the averages, esp. so when we're
> overloaded -- but there are issues there too.
>
> Now we can't track them both (or rather we could, but overhead).
>
> I like Yuyang's load tracking rewrite, but it changes exactly this part,
> and I'm not sure I understand the full ramifications of that yet.
Thanks. It would be a pure average policy, which is non-perfect like now,
and certainly needs a mixing like now, but it is worth a starter, because
it is simple and reasonable, and based on it, the other parts can be simple
and reasonable.
> One way out would be to split the load balancer into 3 distinct regions;
>
> 1) get a task on every CPU, screw everything else.
> 2) get each CPU fully utilized, still ignoring 'load'
> 3) when everybody is fully utilized, consider load.
>
> If we make find_busiest_foo() select one of these 3, and make
> calculate_imbalance() invariant to the metric passed in, and have things
> like cpu_load() and task_load() return different, but coherent, numbers
> depending on which region we're in, this almost sounds 'simple'.
>
> The devil is in the details, and the balancer is a hairy nest of details
> which will make the above non-trivial.
>
> But for 1) we could simply 'balance' on nr_running, for 2) we can
> 'balance' on runnable_avg and for 3) we'll 'balance' on load_avg (which
> will then include blocked load).
>
> Let me go play outside for a bit so that it can sink in what kind of
> nonsense my heat addled brain has just sprouted :-)
Hi Yuyang,
On 05/07/15 21:12, Yuyang Du wrote:
> Hi Morten,
>
> On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
>>>> IOW, since task groups include blocked load in the load_avg_contrib (see
>>>> __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
>>>> imbalance includes blocked load and hence env->imbalance >=
>>>> sum(task_h_load(p)) for all tasks p on the rq. Which leads to
>>>> detach_tasks() emptying the rq completely in the reported scenario where
>>>> blocked load > runnable load.
>>>
>>> Whenever I want to know the load avg concerning task group, I need to
>>> walk through the complete codes again, I prefer not to do it this time.
>>> But it should not be that simple to say "the 118 comes from the blocked load".
>>
>> But the whole hierarchy of group entities is updated each time we enqueue
>> or dequeue a task. I don't see how the group entity load_avg_contrib is
>> not up to date? Why do you need to update it again?
>>
>> In any case, we have one task in the group hierarchy which has a
>> load_avg_contrib of 0 and the grand-grand parent group entity has a
>> load_avg_contrib of 118 and no additional tasks. That load contribution
>> must be from tasks which are no longer around on the rq? No?
>
> load_avg_contrib has WEIGHT inside, so the most I can say is:
> SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / (tg->load_avg + 1) * tg->shares
>
> The tg->shares is probably 1024 (at least 911). So we are just left with:
>
> cfs_rq / tg = 11.5%
>
> I myself did question the sudden jump from 0 to 118 (see a previous reply).
Do you mean the jump from system-rngd.slice (0) (tg.css.id=3) to
system.slice (118) (tg.css.id=2)?
Maybe, the 118 might come from another tg hierarchy (w/ tg.css.id >= 3)
inside the system.slice group representing another service.
Rabin, could you share the content of your
/sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?
Whether 118 comes from the cfs_rq->blocked_load_avg of one of the tg
levels of one of the other system.slice tg hierarchies or it results
from not updating the se.avg.load_avg_contrib values of se's
representing tg's immediately is not that important I guess.
Even if we're able to sync both things (task en/dequeue and tg
se.avg.load_avg_contrib update) perfectly (by calling
update_cfs_rq_blocked_load() always w/ force_update=1 and immediately
after that update_entity_load_avg() for all tg se's in one hierarchy), we
would still have to deal w/ the blocked load part if the tg se
representing system.slice contributes to
cpu_rq(cpu)->cfs.runnable_load_avg.
-- Dietmar
>
> But anyway, this really is irrelevant to the discussion.
>
[...]
On Mon, Jul 06, 2015 at 07:36:56PM +0200, Dietmar Eggemann wrote:
> Rabin, could you share the content of your
> /sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?
Here's /proc/cgroups,
# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpu 2 98 1
cpuacct 2 98 1
and the contents of /sys/fs/cgroup/cpu/system.slice are available here:
https://drive.google.com/file/d/0B4tMLbMvJ-l6ZVBvZ09QOE15MU0/view
/Rabin
Hi,
I tested the patch on a setup with 7 devices, all running the same troublesome use-case in parallel (same use-case as we used to produce the crash dumps). This use-case was previously able to reproduce the problem about 21 times during 24 hours. After including the patch the setup ran perfectly for 48 hours. So to summarize, patch tested OK.
Tested-by: Stefan Ekenberg <[email protected]>
On Mon, Jul 06, 2015 at 12:11:51AM +0200, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 40a7fcb..f7cc1ef 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > > return 0;
> > >
> > > while (!list_empty(tasks)) {
> > > +
> > > + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
> >
> > Should we make that ->idle != CPU_NOT_IDLE ?
>
> I think including CPU_IDLE is good.
>
> --
> Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing
>
> In idle balancing where a CPU going idle pulls tasks from another CPU,
> a livelock may happen if the CPU pulls all tasks from another, makes
> it idle, and this iterates. So just avoid this.
>
> Reported-by: Rabin Vincent <[email protected]>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..769d591 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
> return 0;
>
> while (!list_empty(tasks)) {
> + /*
> + * We don't want to steal all, otherwise we may be treated likewise,
> + * which could at worst lead to a livelock crash.
> + */
> + if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + break;
> +
> p = list_first_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
On Mon, Jul 06, 2015 at 04:12:41AM +0800, Yuyang Du wrote:
> Hi Morten,
>
> On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > > imbalance includes blocked load and hence env->imbalance >=
> > > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > > detach_tasks() emptying the rq completely in the reported scenario where
> > > > blocked load > runnable load.
> > >
> > > Whenever I want to know the load avg concerning task group, I need to
> > > walk through the complete codes again, I prefer not to do it this time.
> > > But it should not be that simple to say "the 118 comes from the blocked load".
> >
> > But the whole hierarchy of group entities is updated each time we enqueue
> > or dequeue a task. I don't see how the group entity load_avg_contrib is
> > not up to date? Why do you need to update it again?
> >
> > In any case, we have one task in the group hierarchy which has a
> > load_avg_contrib of 0 and the grand-grand parent group entity has a
> > load_avg_contrib of 118 and no additional tasks. That load contribution
> > must be from tasks which are no longer around on the rq? No?
>
> load_avg_contrib has WEIGHT inside, so the most I can say is:
> SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / (tg->load_avg + 1) * tg->shares
>
> The tg->shares is probably 1024 (at least 911). So we are just left with:
>
> cfs_rq / tg = 11.5%
Yes, we also know that there is only one runnable task in the task group
hierarchy and its contribution is 0. Hence the rest must be from
non-runnable tasks belonging to some child group.
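For reference, here is the quoted formula with the observed numbers plugged in, assuming tg->shares is indeed 1024 (a throwaway user-space sketch, not kernel code):

/* Sanity check of the quoted numbers; tg->shares assumed to be 1024. */
#include <stdio.h>

int main(void)
{
	unsigned long shares = 1024;	/* assumed tg->shares */
	unsigned long contrib = 118;	/* observed group se load_avg_contrib */

	/*
	 * contrib = cfs_rq_load / (tg_load_avg + 1) * shares, so the
	 * cfs_rq's share of the tg load is roughly contrib / shares.
	 */
	printf("cfs_rq / tg ~= %.1f%%\n", 100.0 * contrib / shares);

	return 0;
}

This prints ~11.5%, matching the figure above.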
> I myself did question the sudden jump from 0 to 118 (see a previous reply).
>
> But anyway, this really is irrelevant to the discusstion.
Agreed.
>
> > > Anyway, with blocked load, yes, we definitely can't move (or even find) some
> > > ammount of the imbalance if we only look at the tasks on the queue. But this
> > > may or may not be a problem.
> > >
> > > Firstly, the question comes to whether we want blocked load anywhere.
> > > This is just about a "now vs. average" question.
> >
> > That is what I meant in the paragraph below. It is a scheduling policy
> > question.
> >
> > > Secondly, if we stick to average, we just need to treat the blocked load
> > > consistently, not that group SE has it, but task SE does not, or somewhere
> > > has it, others not.
> >
> > I agree that inconsistent use of blocked load will lead us into trouble.
> > The problem is that none of the load-balance logic was designed for
> > blocked load. It was written to deal load that is currently on the rq
> > and slightly biased by average cpu load, not dealing with load
> > contribution of tasks which we can't migrate at the moment because they
> > are blocked. The load-balance code has to be updated to deal with
> > blocked load. We will run into all sorts of issues if we don't and roll
> > out use of blocked load everywhere.
> >
> > However, before we can rework the load-balance code, we have to agree on
> > the now vs average balance policy.
> >
> > Your proposed patch implements a policy somewhere in between. We try to
> > balance based on average, but we don't allow idle_balance() to empty a
> > cpu completely. A pure average balance policy would allow a cpu to go
> > idle even if we could do have tasks waiting on other rqs if the blocked
> > load indicates that other task will show up shortly one the cpu. A pure
> > "now" balance would balance based on runnable_load_avg for all entities
> > including groups ignoring all blocked load, but that goes against the
> > PELT group balancing design.
> >
> > I'm not against having a policy that sits somewhere in between, we just
> > have to agree it is the right policy and clean up the load-balance code
> > such that the implemented policy is clear.
>
> The proposed patch sits in between? I agree, but would rather see it from
> another perspective.
>
> First, I don't think it merits a solution/policy. It is just a cheap
> "last guard" to protect the "king" - no crash.
It's fine for me to have checks that make sure we don't crash if we hit
some corner case in the load-balance code. I was more interested in
figuring out why we ended up in this situation and how we can implement
a more consistent policy.
> Second, a "pure average" policy is pretty fine in general, but it does not
> mean we would simply allow a CPU to be pulled empty, that is because we are
> making a bet from a prediction (average) here. By prediction, it basically
> means sometimes it fails. As the failure could lead to a disater, without
> blaming the prediction, it is reasonable we make a sort of "damage control".
I like the idea of balancing based on an average load/utilization that
includes blocked load/utilization. It ties in very nicely with driving DVFS,
for example. It is a fundamental change to how we perceive load for
load-balancing decisions though. It basically requires an update to the
load-balancing policy, as most of it doesn't consider blocked-load
scenarios like this one. This particular idle-balance problem is easily
solved by making sure not to pull the last task, which makes a lot of
sense. But I'm quite sure that we will face many more such cases.
For example, at task wake-up, do you put the task on a cpu which is idle
at this instant but has a high average load, or on a cpu which is busy at
this instant but has a low average load? Currently, we just put the task
on an idle cpu, ignoring any blocked load. If we implement a strict
average-view policy, we should choose the cpu with the lowest average load
even if it is currently busy.
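As a toy illustration of the two placement policies (user-space sketch, invented names, nothing like the real select_task_rq_fair()):

/* Toy contrast of "now" vs "average" wake-up placement; not kernel code. */
#include <stdio.h>

struct cpu_stat {
	int		nr_running;	/* instantaneous: busy right now? */
	unsigned long	load_avg;	/* average load, including blocked load */
};

/* "now" policy: any currently idle cpu wins, blocked load is ignored */
static int pick_cpu_now(const struct cpu_stat *cpu, int nr_cpus)
{
	int i;

	for (i = 0; i < nr_cpus; i++)
		if (cpu[i].nr_running == 0)
			return i;
	return 0;
}

/* "average" policy: lowest average load wins, even if that cpu is busy now */
static int pick_cpu_avg(const struct cpu_stat *cpu, int nr_cpus)
{
	int i, best = 0;

	for (i = 1; i < nr_cpus; i++)
		if (cpu[i].load_avg < cpu[best].load_avg)
			best = i;
	return best;
}

int main(void)
{
	struct cpu_stat cpu[2] = {
		{ .nr_running = 0, .load_avg = 900 },	/* idle now, high average load */
		{ .nr_running = 1, .load_avg = 100 },	/* busy now, low average load */
	};

	/* prints "now: cpu0, avg: cpu1" */
	printf("now: cpu%d, avg: cpu%d\n", pick_cpu_now(cpu, 2), pick_cpu_avg(cpu, 2));
	return 0;
}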
I would suggest that we revise the load-balance code before/while adding
blocked load, so we get a clear(er) policy and don't need too many (and
maybe not so straightforward) last guards to fix wrong balancing decisions.
It doesn't have to be perfect; I just want to get rid of code that nobody
around can explain, so we can tell whether we still make the right
predictions when adding blocked load.
On Mon, Jul 06, 2015 at 06:31:44AM +0800, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > > I'm not against having a policy that sits somewhere in between, we just
> > > have to agree it is the right policy and clean up the load-balance code
> > > such that the implemented policy is clear.
> >
> > Right, for balancing its a tricky question, but mixing them without
> > intent is, as you say, a bit of a mess.
> >
> > So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> > it does make some sense for the regular periodic balancing, because
> > there we really do care mostly about the averages, esp. so when we're
> > overloaded -- but there are issues there too.
> >
> > Now we can't track them both (or rather we could, but overhead).
> >
> > I like Yuyang's load tracking rewrite, but it changes exactly this part,
> > and I'm not sure I understand the full ramifications of that yet.
I don't think anybody does ;-) But I think we should try to make it
work.
> Thanks. It would be a pure average policy, which is non-perfect like now,
> and certainly needs a mixing like now, but it is worth a starter, because
> it is simple and reasaonble, and based on it, the other parts can be simple
> and reasonable.
I think we all agree on the benefits of taking blocked load into
account but also that there are some policy questions to be addressed.
> > One way out would be to split the load balancer into 3 distinct regions;
> >
> > 1) get a task on every CPU, screw everything else.
> > 2) get each CPU fully utilized, still ignoring 'load'
> > 3) when everybody is fully utilized, consider load.
Seems very reasonable to me. We more or less follow that idea in the
energy-model driven scheduling patches, at least 2) and 3).
The difficult bit is detecting when to transition between 2) and 3). If
you want to enforce smp_nice you have to start worrying about task
priority as soon as one cpu is fully utilized.
For example, say a fully utilized cpu has two high-priority tasks while all
other cpus are running low-priority tasks and are not fully utilized.
The utilization imbalance may be too small to cause any tasks to be
migrated, so we end up giving fewer cycles to the high-priority tasks.
> > If we make find_busiest_foo() select one of these 3, and make
> > calculate_imbalance() invariant to the metric passed in, and have things
> > like cpu_load() and task_load() return different, but coherent, numbers
> > depending on which region we're in, this almost sounds 'simple'.
> >
> > The devil is in the details, and the balancer is a hairy nest of details
> > which will make the above non-trivial.
Yes, but if we have an overall policy like the one you propose we can at
least make it complicated and claim that we think we know what it is
supposed to do ;-)
I agree that there is some work to be done in find_busiest_*() and
calculate_imbalance() + friends. Maybe step one should be to clean them
up a bit.
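To make the three-region idea a bit more concrete, a rough sketch of how such a classifier could look (all names and thresholds invented for illustration; this is not a proposed implementation):

/* Sketch of the proposed three balancing regions; names/thresholds invented. */
enum lb_region {
	LB_SPREAD,	/* 1) some cpu is idle: get a task on every cpu */
	LB_FILL,	/* 2) no idle cpu but spare capacity: fill up utilization */
	LB_LOAD,	/* 3) everybody fully utilized: balance on load */
};

struct domain_stat {
	int		nr_idle_cpus;
	unsigned long	total_util;		/* sum of per-cpu utilization */
	unsigned long	total_capacity;		/* sum of per-cpu capacity */
};

static enum lb_region lb_classify(const struct domain_stat *sd)
{
	if (sd->nr_idle_cpus > 0)
		return LB_SPREAD;
	if (sd->total_util < sd->total_capacity)
		return LB_FILL;
	return LB_LOAD;
}

find_busiest_group() and calculate_imbalance() would then pick their metric (nr_running, utilization, or load) depending on the region, which is roughly the "coherent numbers depending on which region we're in" part.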
On Thu, Jul 09, 2015 at 02:53:14PM +0100, Morten Rasmussen wrote:
> On Mon, Jul 06, 2015 at 04:12:41AM +0800, Yuyang Du wrote:
> > Hi Morten,
> >
> > On Fri, Jul 03, 2015 at 10:34:41AM +0100, Morten Rasmussen wrote:
> > > > > IOW, since task groups include blocked load in the load_avg_contrib (see
> > > > > __update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()) the
> > > > > imbalance includes blocked load and hence env->imbalance >=
> > > > > sum(task_h_load(p)) for all tasks p on the rq. Which leads to
> > > > > detach_tasks() emptying the rq completely in the reported scenario where
> > > > > blocked load > runnable load.
> > > >
> > > > Whenever I want to know the load avg concerning task group, I need to
> > > > walk through the complete codes again, I prefer not to do it this time.
> > > > But it should not be that simply to say "the 118 comes from the blocked load".
> > >
> > > But the whole hierarchy of group entities is updated each time we enqueue
> > > or dequeue a task. I don't see how the group entity load_avg_contrib is
> > > not up to date? Why do you need to update it again?
> > >
> > > In any case, we have one task in the group hierarchy which has a
> > > load_avg_contrib of 0 and the grand-grand parent group entity has a
> > > load_avg_contrib of 118 and no additional tasks. That load contribution
> > > must be from tasks which are no longer around on the rq? No?
> >
> > load_avg_contrib has WEIGHT inside, so the most I can say is:
> > SE: 8f456e00's load_avg_contrib 118 = (its cfs_rq's runnable + blocked) / (tg->load_avg + 1) * tg->shares
> >
> > The tg->shares is probably 1024 (at least 911). So we are just left with:
> >
> > cfs_rq / tg = 11.5%
>
> Yes, we also know that there is only one runnable task in the task group
> hierarchy and its contribution is 0. Hence the rest must be from
> non-runnable tasks belonging to some child group.
Agreed.
Hi,
On Thu, Jul 09, 2015 at 03:32:20PM +0100, Morten Rasmussen wrote:
> On Mon, Jul 06, 2015 at 06:31:44AM +0800, Yuyang Du wrote:
> > On Fri, Jul 03, 2015 at 06:38:31PM +0200, Peter Zijlstra wrote:
> > > > I'm not against having a policy that sits somewhere in between, we just
> > > > have to agree it is the right policy and clean up the load-balance code
> > > > such that the implemented policy is clear.
> > >
> > > Right, for balancing its a tricky question, but mixing them without
> > > intent is, as you say, a bit of a mess.
> > >
> > > So clearly blocked load doesn't make sense for (new)idle balancing. OTOH
> > > it does make some sense for the regular periodic balancing, because
> > > there we really do care mostly about the averages, esp. so when we're
> > > overloaded -- but there are issues there too.
> > >
> > > Now we can't track them both (or rather we could, but overhead).
> > >
> > > I like Yuyang's load tracking rewrite, but it changes exactly this part,
> > > and I'm not sure I understand the full ramifications of that yet.
>
> I don't think anybody does ;-) But I think we should try to make it
> work.
>
> > Thanks. It would be a pure average policy, which is non-perfect like now,
> > and certainly needs a mixing like now, but it is worth a starter, because
> > it is simple and reasaonble, and based on it, the other parts can be simple
> > and reasonable.
>
> I think we all agree on the benefits of taking blocked load into
> account but also that there are some policy questions to be addressed.
>
> > > One way out would be to split the load balancer into 3 distinct regions;
> > >
> > > 1) get a task on every CPU, screw everything else.
> > > 2) get each CPU fully utilized, still ignoring 'load'
> > > 3) when everybody is fully utilized, consider load.
>
> Seems very reasonable to me. We more or less follow that idea in the
> energy-model driven scheduling patches, at least 2) and 3).
>
> The difficult bit is detecting when to transition between 2) and 3). If
> you want to enforce smp_nice you have to start worrying about task
> priority as soon as one cpu is fully utilized.
>
> For example, a fully utilized cpu has two high priority tasks while all
> other cpus are running low priority tasks and are not fully utilized.
> The utilization imbalance may be too small to cause any tasks to be
> migrated, so we end up giving fewer cycles to the high priority tasks.
>
> > > If we make find_busiest_foo() select one of these 3, and make
> > > calculate_imbalance() invariant to the metric passed in, and have things
> > > like cpu_load() and task_load() return different, but coherent, numbers
> > > depending on which region we're in, this almost sounds 'simple'.
> > >
> > > The devil is in the details, and the balancer is a hairy nest of details
> > > which will make the above non-trivial.
>
> Yes, but if we have an overall policy like the one you propose we can at
> least make it complicated and claim that we think we know what it is
> supposed to do ;-)
>
> I agree that there is some work to be done in find_busiest_*() and
> calculate_imbalance() + friends. Maybe step one should be to clean them
> up a bit.
The consensus looks like we should move step-by-step and start working right now:
1) Based on the "Rewrite" patch, let me add cfs_rq->runnable_load_avg. Then
we will have everything up to date: load.weight, runnable_load_avg, and
load_avg (including runnable + blocked), from pure "now" to pure average
(roughly sketched below). The runnable_load_avg will be used the same way
as now, so there will not be a shred of ramification. As long as the code
is cleaned up and simplified, it is a win.
2) Let's clean up the load-balancing code a bit, and if needed make changes
to the obvious things, otherwise leave it unchanged.
3) Polish/complicate the policies :)
What do you think?
Thanks,
Yuyang
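To make step 1) above a bit more concrete, the three signals would sit side by side on the cfs_rq, roughly like this (a simplified sketch of the idea only, not the actual "Rewrite" patch; names are illustrative):

/* Simplified sketch of the three coexisting signals; not the actual patch. */
struct cfs_rq_signals {
	unsigned long load_weight;		/* pure "now": sum of queued se weights */
	unsigned long runnable_load_avg;	/* average of the runnable tasks only */
	unsigned long load_avg;			/* average incl. blocked contributions */
};

/* A balance path could then pick the signal matching its policy, e.g.: */
static unsigned long lb_source_load(const struct cfs_rq_signals *s, int newly_idle)
{
	/* newidle balancing cares about what is runnable right now */
	return newly_idle ? s->runnable_load_avg : s->load_avg;
}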
On 07/07/15 12:17, Rabin Vincent wrote:
> On Mon, Jul 06, 2015 at 07:36:56PM +0200, Dietmar Eggemann wrote:
>> Rabin, could you share the content of your
>> /sys/fs/cgroup/cpu/system.slice directory and of /proc/cgroups ?
>
> Here's /proc/cgroups,
>
> # cat /proc/cgroups
> #subsys_name hierarchy num_cgroups enabled
> cpu 2 98 1
> cpuacct 2 98 1
>
> and the contents of /sys/fs/cgroup/cpu/system.slice are available here:
> https://drive.google.com/file/d/0B4tMLbMvJ-l6ZVBvZ09QOE15MU0/view
>
> /Rabin
>
So why not maintain a runnable signal for the task group se's?
At least to figure out if the 118 is coming from blocked load.
-- >8 --
Subject: [PATCH] sched: Maintain a runnable version of tg->load_avg and
cfs_rq->tg_load_contrib
Including blocked load in the load average contribution of sched
entities (se->avg.load_avg_contrib) representing task groups can
lead to scenarios where the imbalance is greater than
sum(task_h_load(p)) for all tasks p on the src rq.
To avoid this, use cfs_rq->runnable_tg_load_contrib and
tg->runnable_load_avg to calculate se->avg.load_avg_contrib for
sched entities representing task groups.
Both runnable-based values are updated in lockstep with the
existing values.
The existing tg->load_avg and cfs_rq->tg_load_contrib are still
used to calculate task group weight.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++---
kernel/sched/sched.h | 2 ++
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 587a2f67ceb1..f2cfbaaf5700 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2647,7 +2647,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
int force_update)
{
struct task_group *tg = cfs_rq->tg;
- long tg_contrib;
+ long tg_contrib, runnable_tg_contrib;
tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
tg_contrib -= cfs_rq->tg_load_contrib;
@@ -2655,9 +2655,14 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
if (!tg_contrib)
return;
+ runnable_tg_contrib = cfs_rq->runnable_load_avg;
+ runnable_tg_contrib -= cfs_rq->runnable_tg_load_contrib;
+
if (force_update || abs(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
atomic_long_add(tg_contrib, &tg->load_avg);
cfs_rq->tg_load_contrib += tg_contrib;
+ atomic_long_add(runnable_tg_contrib, &tg->runnable_load_avg);
+ cfs_rq->runnable_tg_load_contrib += runnable_tg_contrib;
}
}
@@ -2690,9 +2695,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
u64 contrib;
- contrib = cfs_rq->tg_load_contrib * tg->shares;
+ contrib = cfs_rq->runnable_tg_load_contrib * tg->shares;
se->avg.load_avg_contrib = div_u64(contrib,
- atomic_long_read(&tg->load_avg) + 1);
+ atomic_long_read(&tg->runnable_load_avg) + 1);
/*
* For group entities we need to compute a correction term in the case
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84d48790bb6d..eed74e5efe91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -245,6 +245,7 @@ struct task_group {
#ifdef CONFIG_SMP
atomic_long_t load_avg;
+ atomic_long_t runnable_load_avg;
atomic_t runnable_avg;
#endif
#endif
@@ -386,6 +387,7 @@ struct cfs_rq {
/* Required to track per-cpu representation of a task_group */
u32 tg_runnable_contrib;
unsigned long tg_load_contrib;
+ unsigned long runnable_tg_load_contrib;
/*
* h_load = weight * f(tg)
--
1.9.1
Hi Peter,
What about this patch?
Adding:
Tested-by: Stefan Ekenberg <[email protected]>
Thanks,
Yuyang
On Mon, Jul 06, 2015 at 06:11:51AM +0800, Yuyang Du wrote:
> On Fri, Jul 03, 2015 at 06:39:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 02, 2015 at 07:25:11AM +0800, Yuyang Du wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 40a7fcb..f7cc1ef 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5898,6 +5898,10 @@ static int detach_tasks(struct lb_env *env)
> > > return 0;
> > >
> > > while (!list_empty(tasks)) {
> > > +
> > > + if (env->idle == CPU_NEWLY_IDLE && env->src_rq->nr_running <= 1)
> >
> > Should we make that ->idle != CPU_NOT_IDLE ?
>
> I think including CPU_IDLE is good.
>
> --
> Subject: [PATCH] sched: Avoid pulling all tasks in idle balancing
>
> In idle balancing where a CPU going idle pulls tasks from another CPU,
> a livelock may happen if the CPU pulls all tasks from another, makes
> it idle, and this iterates. So just avoid this.
>
> Reported-by: Rabin Vincent <[email protected]>
> Signed-off-by: Yuyang Du <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 40a7fcb..769d591 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5898,6 +5898,13 @@ static int detach_tasks(struct lb_env *env)
> return 0;
>
> while (!list_empty(tasks)) {
> + /*
> + * We don't want to steal all, otherwise we may be treated likewise,
> + * which could at worst lead to a livelock crash.
> + */
> + if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
> + break;
> +
> p = list_first_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
> --
Commit-ID: 985d3a4c11cd28251bcc7925aa2d7a9038910384
Gitweb: http://git.kernel.org/tip/985d3a4c11cd28251bcc7925aa2d7a9038910384
Author: Yuyang Du <[email protected]>
AuthorDate: Mon, 6 Jul 2015 06:11:51 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 3 Aug 2015 12:21:19 +0200
sched/fair: Avoid pulling all tasks in idle balancing
In idle balancing where a CPU going idle pulls tasks from another CPU,
a livelock may happen if the CPU pulls all tasks from another, makes
it idle, and this iterates. So just avoid this.
Reported-by: Rabin Vincent <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 587a2f6..8b384b8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5863,6 +5863,13 @@ static int detach_tasks(struct lb_env *env)
return 0;
while (!list_empty(tasks)) {
+ /*
+ * We don't want to steal all, otherwise we may be treated likewise,
+ * which could at worst lead to a livelock crash.
+ */
+ if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ break;
+
p = list_first_entry(tasks, struct task_struct, se.group_node);
env->loop++;