Date: Tue, 30 Jun 2015 16:30:58 +0200
From: Rabin Vincent
To: mingo@redhat.com, peterz@infradead.org
Cc: linux-kernel@vger.kernel.org
Subject: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150630143057.GA31689@axis.com>

Hi,

We're seeing a livelock where two CPUs both loop with interrupts
disabled in pick_next_task_fair() / idle_balance() and continuously
pull all tasks from each other.  This has been observed on a 3.18
kernel, but the relevant code paths appear to be the same in current
mainline.

The problem situation appears to be this:

 cpu0                               cpu1

 pick_next_task
  take rq0->lock
  pick_next_task_fair
   running = 0
   idle_balance()
    drop rq0->lock
                                    wakeup A,B on cpu0
                                    pick_next_task
                                     take rq1->lock
                                     pick_next_task_fair
                                      running = 0
                                      idle_balance()
                                       drop rq1->lock
                                       load_balance()
                                        busiest = rq0
                                        take rq0->lock
                                        detach A,B
                                        drop rq0->lock
                                        take rq1->lock
                                        attach A,B
                                        pulled_task = 2
                                        drop rq1->lock
    load_balance()
     busiest = rq1
     take rq1->lock
     detach A,B
     drop rq1->lock
     take rq0->lock
     attach A,B
     pulled_task = 2
     drop rq0->lock
                                      running = 0
                                      idle_balance()
                                       busiest = rq0, pull A,B, etc.
   running = 0
   load_balance()
    busiest = rq1, pull A,B, etc.

And this goes on, with interrupts disabled on both CPUs, for at least
100 ms (which is when a hardware watchdog hits).

The conditions needed, apart from the right timing, are:

 - cgroups.  One of the tasks, say A, needs to be in a CPU cgroup.
   When the problem occurs, A's ->se has a load_avg_contrib of zero
   and task_h_load(A) is zero, but A's se->parent->parent has a
   (relatively) high load_avg_contrib.  cpu0's cfs_rq therefore has a
   relatively high runnable_load_avg, so find_busiest_group() detects
   an imbalance, and detach_tasks() detaches all tasks (see the
   sketch after this list).

 - PREEMPT=n.  Otherwise, the code under #ifdef CONFIG_PREEMPT in
   detach_tasks() would ensure that we only ever pull a maximum of
   one task during idle balancing.

 - cpu0 and cpu1 are threads on the same core (cpus_share_cache() ==
   true).  Otherwise, cpu1 would not be able to wake up tasks on cpu0
   while cpu0 has interrupts disabled, since an IPI would be
   required.  Turning off the default TTWU_QUEUE feature would have
   the same effect.
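To make the "detach_tasks() detaches all tasks" condition concrete,
here is a minimal userspace sketch of that loop's termination logic
(detach_tasks_sketch() and its parameters are invented for
illustration; this is not the kernel function): when task_h_load() is
zero for every candidate, env->imbalance never decreases, so only
running out of tasks (or the CONFIG_PREEMPT break) ends the loop.

#include <stdio.h>

struct task { long h_load; };   /* stand-in for task_h_load(p) */

static int detach_tasks_sketch(const struct task *tasks, int nr,
                               long imbalance, int preempt_break)
{
        int detached = 0;

        for (int i = 0; i < nr && imbalance > 0; i++) {
                long load = tasks[i].h_load;

                /* skip tasks that would overshoot the imbalance;
                 * never taken when load == 0 */
                if (load / 2 > imbalance)
                        continue;

                detached++;
                imbalance -= load;      /* no progress when load == 0 */

                /* the code under #ifdef CONFIG_PREEMPT: stop after
                 * one task during NEWIDLE balancing */
                if (preempt_break)
                        break;
        }
        return detached;
}

int main(void)
{
        struct task ab[2] = { { 0 }, { 0 } };   /* A, B: task_h_load() == 0 */

        /* the group se's load_avg_contrib makes find_busiest_group()
         * report a large imbalance despite the zero per-task loads */
        printf("PREEMPT=n: detached %d of 2\n",
               detach_tasks_sketch(ab, 2, 1024, 0));
        printf("PREEMPT=y: detached %d of 2\n",
               detach_tasks_sketch(ab, 2, 1024, 1));
        return 0;
}

Built with any C compiler, this prints "detached 2 of 2" for the
PREEMPT=n case, i.e. the runqueue is emptied completely, which is
what lets the two CPUs ping-pong the tasks indefinitely.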
I see two simple ways to prevent the livelock.  One is to just remove
the #ifdef:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..74c94dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5937,7 +5937,6 @@ static int detach_tasks(struct lb_env *env)
 		detached++;
 		env->imbalance -= load;
 
-#ifdef CONFIG_PREEMPT
 		/*
 		 * NEWIDLE balancing is a source of latency, so preemptible
 		 * kernels will stop after the first task is detached to minimize
@@ -5945,7 +5944,6 @@ static int detach_tasks(struct lb_env *env)
 		 */
 		if (env->idle == CPU_NEWLY_IDLE)
 			break;
-#endif
 
 		/*
 		 * We only want to steal up to the prescribed amount of

Or this should work too:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0..13358cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7487,6 +7487,8 @@ static int idle_balance(struct rq *this_rq)
 	 */
 	if (this_rq->cfs.h_nr_running && !pulled_task)
 		pulled_task = 1;
+	else if (!this_rq->cfs.h_nr_running && pulled_task)
+		pulled_task = 0;
 
 out:
 	/* Move the next balance forward */

But it seems that the real problem is that detach_tasks() can remove
all tasks when cgroups are involved, as described above?

Thanks.

/Rabin