Subject: Re: sched: hang in migrate_swap
From: Rafael David Tinoco
Date: Mon, 15 Jun 2015 16:38:21 -0300
To: Peter Zijlstra, Sasha Levin
Cc: Kirill Tkhai, Michael wang, ktkhai@parallels.com, Ingo Molnar, LKML
In-Reply-To: <20140514102602.GJ30445@twins.programming.kicks-ass.net>
Message-Id: <4945CDF2-4666-4112-89F1-775E87B3EECD@canonical.com>

Peter, Sasha, coming back to this...

Not that this is happening frequently, or that I can easily reproduce it, but...

> On May 14, 2014, at 07:26 AM, Peter Zijlstra wrote:
>
> On Wed, May 14, 2014 at 02:21:04PM +0400, Kirill Tkhai wrote:
>>
>> 14.05.2014, 14:14, "Peter Zijlstra":
>>> On Wed, May 14, 2014 at 01:42:32PM +0400, Kirill Tkhai wrote:
>>>
>>>> Peter, do we have to queue the stop works in order?
>>>>
>>>> Isn't there a possibility that two pairs of works get queued in
>>>> different order on different cpus?
>>>>
>>>>  kernel/stop_machine.c | 10 ++++++++--
>>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
>>>> index b6b67ec..29e221b 100644
>>>> --- a/kernel/stop_machine.c
>>>> +++ b/kernel/stop_machine.c
>>>> @@ -250,8 +250,14 @@ struct irq_cpu_stop_queue_work_info {
>>>>  static void irq_cpu_stop_queue_work(void *arg)
>>>>  {
>>>>  	struct irq_cpu_stop_queue_work_info *info = arg;
>>>> -	cpu_stop_queue_work(info->cpu1, info->work1);
>>>> -	cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +
>>>> +	if (info->cpu1 < info->cpu2) {
>>>> +		cpu_stop_queue_work(info->cpu1, info->work1);
>>>> +		cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +	} else {
>>>> +		cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +		cpu_stop_queue_work(info->cpu1, info->work1);
>>>> +	}
>>>>  }
>>>
>>> I'm not sure; we already send the IPI to the first cpu of the pair, so
>>> supposing we have 4 cpus and get 4 pairs like:
>>>
>>> 0,1 1,2 2,3 3,0
>>>
>>> that would result in IPIs to 0, 1, 2, and 0 again, and since the IPI
>>> function is serialized I don't immediately see a way for this to
>>> deadlock.
>>
>> It's about stop_two_cpus(); I distrust the other users of the stop task:
>>
>> queue_stop_cpus_work() queues works sequentially:
>>
>> 0 1 2 4
>>
>> stop_two_cpus() may queue:
>>
>> 1 0
>>
>> Looks like the stop threads on cpu 0 and cpu 1 end up waiting for the
>> wrong works.
>
> So we serialize stop_cpus_work() vs stop_two_cpus() with an l/g lock.
>
> Ah, but stop_cpus_work() only holds the global lock over queueing; it
> doesn't wait for completion. That might indeed cause a problem.
>
> Also, since it's two different cpus queueing, the ordered queue doesn't
> really matter: you can still interleave the "all" and "two" sets and get
> into this state.

Do you think __stop_cpus() -> queue_stop_cpus_work() and stop_two_cpus()
might be stepping on each other because this global lock is held over
queueing only (and not over completion)?
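If I understand Kirill's patch right, it is the classic two-resource
ordering rule, with "being at the head of that cpu's stopper queue"
playing the role of a lock. To make that concrete, here is a minimal
userspace analogy (my own sketch, nothing kernel-specific: pthread
mutexes stand in for the two per-cpu stopper queues, and all the names
are made up):

/* Build with: cc -pthread order-demo.c -o order-demo
 *
 * Hypothetical analogy, not kernel code: each mutex models "my work is
 * at the head of one cpu's stopper queue". Taking both in ascending
 * order is the same rule as Kirill's patch; if stop_two() instead took
 * them in caller order, the two workers below would deadlock ABBA-style.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cpu_queue[2] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void stop_two(int cpu1, int cpu2)
{
	int lo = cpu1 < cpu2 ? cpu1 : cpu2;
	int hi = cpu1 < cpu2 ? cpu2 : cpu1;

	pthread_mutex_lock(&cpu_queue[lo]);	/* always lower cpu first */
	pthread_mutex_lock(&cpu_queue[hi]);
	/* both "stoppers" are held here: the pair of works runs */
	pthread_mutex_unlock(&cpu_queue[hi]);
	pthread_mutex_unlock(&cpu_queue[lo]);
}

static void *worker(void *arg)
{
	int reversed = *(int *)arg;
	int i;

	for (i = 0; i < 1000000; i++) {
		if (reversed)
			stop_two(1, 0);
		else
			stop_two(0, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	int fwd = 0, rev = 1;

	pthread_create(&a, NULL, worker, &fwd);
	pthread_create(&b, NULL, worker, &rev);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no deadlock");	/* always reached, thanks to the ordering */
	return 0;
}

Of course, as you point out above, queue_stop_cpus_work() touches all
the queues at once while holding the global lock over the queueing only,
so ordering stop_two_cpus() alone would not close the window against the
"all cpus" set.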
In the past I described to Sasha the following scenario, from one of my
3.13 kernels:

> -> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
>
> In my case, curstate is WAY different from the enum value of
> MULTI_STOP_EXIT (4).
>
> Registers totally messed up (probably after cpu_relax(), right where
> you were trapped -> after the pause instruction).
>
> My case:
>
> PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
> ...
> [exception RIP: multi_cpu_stop+0x64]
> RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
> RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
> RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
> RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
> R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
> R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
> ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
> --- ---
> #4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
>
> 208 } while (curstate != MULTI_STOP_EXIT);
> ---> RIP
>
> RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
> ---> CHECKING FOR MULTI_STOP_EXIT
>
> RDX: ffff883fd2907d98 -> does not make any sense
>
> ###
>
> If I'm reading this right:
>
> """
> CPU 05 - PID 14990
>
> do_numa_page
> task_numa_fault
> numa_migrate_preferred
> task_numa_migrate
> migrate_swap (curr: 14990, task: 14996)
> stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
> wait_for_completion
>
> 14990 - CPU05
> 14996 - CPU00
>
> stop_two_cpus:
> multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
> smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
> smp_call_function_single (ran on lowest CPU, 00 for this case)
> irq_cpu_stop_queue_work
> cpu_stop_queue_work(cpu1=05(14996)) # add work (multi_cpu_stop) to cpu 05 cpu_stopper queue
> cpu_stop_queue_work(cpu2=00(14990)) # add work (multi_cpu_stop) to cpu 00 cpu_stopper queue
> wait_for_completion() --> HERE
> """
>
> In my case, checking the task structs of the tasks sitting in
> wait_for_completion():
>
> PID 14990 CPU 05 -> PID 14996 CPU 00
> PID 14991 CPU 30 -> PID 14998 CPU 01
> PID 14992 CPU 30 -> PID 14998 CPU 01
> PID 14996 CPU 00 -> PID 14992 CPU 30
> PID 14998 CPU 01 -> PID 14990 CPU 05
>
> AND
>
> 102 2 6 ffff881fd2ea97f0 RU 0.0 0 0 [migration/6]
> 118 2 9 ffff883fd28ec7d0 RU 0.0 0 0 [migration/9]
> 143 2 14 ffff883fd29d47d0 RU 0.0 0 0 [migration/14]
> 148 2 15 ffff883fd29fc7d0 RU 0.0 0 0 [migration/15]
> 153 2 16 ffff881fd2f517f0 RU 0.0 0 0 [migration/16]
>
> THEN
>
> I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop works that
> are just scheduled (probably in the per-cpu queues of cpus 0, 1, 5, 30),
> not running yet.
>
> AND
>
> I don't have any wait_for_completion for those OLDER migration threads
> (6, 9, 14, 15 and 16). Probably done.completion got signalled before
> the race.
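For reference, the loop the dumped RIP sits in looks roughly like this.
This is condensed from my reading of 3.13's kernel/stop_machine.c, with
the irq-flags handling and the is_active/set_state details trimmed, so
treat it as a sketch rather than a verbatim copy:

/* Condensed shape of multi_cpu_stop() (3.13-era), for illustration. */
enum multi_stop_state {
	MULTI_STOP_NONE,	/* 0 */
	MULTI_STOP_PREPARE,	/* 1 */
	MULTI_STOP_DISABLE_IRQ,	/* 2 */
	MULTI_STOP_RUN,		/* 3 */
	MULTI_STOP_EXIT,	/* 4 <- the $0x4 in the cmp above */
};

struct multi_stop_data {
	int			(*fn)(void *);
	void			*data;
	unsigned int		num_threads;	/* works in this set */
	const struct cpumask	*active_cpus;
	enum multi_stop_state	state;
	atomic_t		thread_ack;
};

static void ack_state(struct multi_stop_data *msdata)
{
	/* The last stopper to ack advances ->state for everyone. */
	if (atomic_dec_and_test(&msdata->thread_ack))
		set_state(msdata, msdata->state + 1);	/* re-arms thread_ack */
}

static int multi_cpu_stop(void *data)
{
	struct multi_stop_data *msdata = data;
	enum multi_stop_state curstate = MULTI_STOP_NONE;
	int err = 0;

	do {
		cpu_relax();	/* pause; re-read msdata->state */
		if (msdata->state != curstate) {
			curstate = msdata->state;
			switch (curstate) {
			case MULTI_STOP_DISABLE_IRQ:
				local_irq_disable();
				break;
			case MULTI_STOP_RUN:
				if (is_active)
					err = msdata->fn(msdata->data);
				break;
			default:
				break;
			}
			ack_state(msdata);
		}
	} while (curstate != MULTI_STOP_EXIT);	/* <- RIP: cmp $0x4,%edx */

	return err;
}

So if one work of a pair never runs (stuck behind another stopper's work
in its per-cpu queue), thread_ack never drops to zero, ->state never
advances, and the partner spins at that cmp forever - which would match
migration/6, 9, 14, 15 and 16 spinning while the five works sit in the
per-cpu queues of cpus 0, 1, 5 and 30.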
And following this thread's discussion, and the commits below:

commit a1d9a3231eac4117cadaf4b6bba5b2902c15a33e
Author: Kirill Tkhai
Date:   Thu Apr 10 17:38:36 2014 +0400

    sched: Check for stop task appearance when balancing happens

commit 37e117c07b89194aae7062bc63bde1104c03db02
Author: Peter Zijlstra
Date:   Fri Feb 14 12:25:08 2014 +0100

    sched: Guarantee task priority in pick_next_task()

commit 38033c37faab850ed5d33bb675c4de6c66be84d8
Author: Peter Zijlstra
Date:   Thu Jan 23 20:32:21 2014 +0100

    sched: Push down pre_schedule() and idle_balance()

commit 606dba2e289446600a0b68422ed2019af5355c12
Author: Peter Zijlstra
Date:   Sat Feb 11 06:05:00 2012 +0100

    sched: Push put_prev_task() into pick_next_task()

The 3.13 kernel still had the old logic (before 3.15): no RETRY_TASK,
idle_balance() called before pick_next_task(), and no deadline scheduler
yet. So commit "a1d9a32" does not play a role in this panic.

I'm causing ~150 stop_two_cpus() calls/sec for task migration in a
32-node fake NUMA environment, and I am NOT able to reproduce this
lockup; but, still, the dump says it is there :\

For the 3.13 series this lockup was seen once; no info on other versions.

Any thoughts?

Thank you

-Rafael Tinoco