Subject: Re: sched: hang in migrate_swap
From: Rafael David Tinoco
Date: Mon, 15 Jun 2015 16:38:21 -0300
To: Peter Zijlstra, Sasha Levin
Cc: Kirill Tkhai, Michael wang, ktkhai@parallels.com, Ingo Molnar, LKML
In-Reply-To: <20140514102602.GJ30445@twins.programming.kicks-ass.net>
Message-Id: <4945CDF2-4666-4112-89F1-775E87B3EECD@canonical.com>

Peter, Sasha, coming back to this...

Not that this is happening frequently, or that I can easily reproduce it, but...

> On May 14, 2014, at 07:26 AM, Peter Zijlstra wrote:
>
> On Wed, May 14, 2014 at 02:21:04PM +0400, Kirill Tkhai wrote:
>>
>> 14.05.2014, 14:14, "Peter Zijlstra":
>>> On Wed, May 14, 2014 at 01:42:32PM +0400, Kirill Tkhai wrote:
>>>
>>>> Peter, do we have to queue the stop works in order?
>>>>
>>>> Isn't there a possibility that two pairs of works get queued in
>>>> different order on different cpus?
>>>>
>>>>  kernel/stop_machine.c | 10 ++++++++--
>>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
>>>> index b6b67ec..29e221b 100644
>>>> --- a/kernel/stop_machine.c
>>>> +++ b/kernel/stop_machine.c
>>>> @@ -250,8 +250,14 @@ struct irq_cpu_stop_queue_work_info {
>>>>  static void irq_cpu_stop_queue_work(void *arg)
>>>>  {
>>>>  	struct irq_cpu_stop_queue_work_info *info = arg;
>>>> -	cpu_stop_queue_work(info->cpu1, info->work1);
>>>> -	cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +
>>>> +	if (info->cpu1 < info->cpu2) {
>>>> +		cpu_stop_queue_work(info->cpu1, info->work1);
>>>> +		cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +	} else {
>>>> +		cpu_stop_queue_work(info->cpu2, info->work2);
>>>> +		cpu_stop_queue_work(info->cpu1, info->work1);
>>>> +	}
>>>>  }
>>>
>>> I'm not sure; we already send the IPI to the first cpu of the pair, so
>>> supposing we have 4 cpus and get 4 pairs like:
>>>
>>> 0,1 1,2 2,3 3,0
>>>
>>> that would result in IPIs to 0, 1, 2, and 0 again, and since the IPI
>>> function is serialized I don't immediately see a way for this to
>>> deadlock.
>>
>> It's about stop_two_cpus(); I distrust the other users of the stop task:
>>
>> queue_stop_cpus_work() queues works sequentially:
>>
>> 0 1 2 4
>>
>> stop_two_cpus() may queue:
>>
>> 1 0
>>
>> Looks like the stop threads on cpu 0 and cpu 1 end up waiting for the
>> wrong works.
>
> So we serialize stop_cpus_work() vs stop_two_cpus() with an l/g lock.
>
> Ah, but stop_cpus_work() only holds the global lock over queueing; it
> doesn't wait for completion. That might indeed cause a problem.
>
> Also, since it's two different cpus queueing, the ordered queue doesn't
> really matter: you can still interleave the "all" and "two" sets and get
> into this state.

Do you think __stop_cpus() -> queue_stop_cpus_work() and stop_two_cpus()
might be stepping on each other because this global lock is held over
queueing only (and not over completion)?
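If I understand Kirill's patch right, it is the classic two-resource
ordering rule, with "being at the head of that cpu's stopper queue"
playing the role of a lock. To make that concrete, here is a minimal
userspace analogy (my own sketch, nothing kernel-specific: pthread
mutexes stand in for the two per-cpu stopper queues, and all the names
are made up):

/* Build with: cc -pthread order-demo.c -o order-demo
 *
 * Hypothetical analogy, not kernel code: each mutex models "my work is
 * at the head of one cpu's stopper queue". Taking both in ascending
 * order is the same rule as Kirill's patch; if stop_two() instead took
 * them in caller order, the two workers below would deadlock ABBA-style.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cpu_queue[2] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static void stop_two(int cpu1, int cpu2)
{
	int lo = cpu1 < cpu2 ? cpu1 : cpu2;
	int hi = cpu1 < cpu2 ? cpu2 : cpu1;

	pthread_mutex_lock(&cpu_queue[lo]);	/* always lower cpu first */
	pthread_mutex_lock(&cpu_queue[hi]);
	/* both "stoppers" are held here: the pair of works runs */
	pthread_mutex_unlock(&cpu_queue[hi]);
	pthread_mutex_unlock(&cpu_queue[lo]);
}

static void *worker(void *arg)
{
	int reversed = *(int *)arg;
	int i;

	for (i = 0; i < 1000000; i++) {
		if (reversed)
			stop_two(1, 0);
		else
			stop_two(0, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	int fwd = 0, rev = 1;

	pthread_create(&a, NULL, worker, &fwd);
	pthread_create(&b, NULL, worker, &rev);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no deadlock");	/* always reached, thanks to the ordering */
	return 0;
}

Of course, as you point out above, queue_stop_cpus_work() touches all
the queues at once while holding the global lock over the queueing only,
so ordering stop_two_cpus() alone would not close the window against the
"all cpus" set.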
In the past I described to Sasha the following scenario, from one of my
3.13 kernels:

> -> multi_cpu_stop -> do { } while (curstate != MULTI_STOP_EXIT);
>
> In my case, curstate is WAY different from the enum value of
> MULTI_STOP_EXIT (4).
>
> Registers totally messed up (probably after cpu_relax(), right where
> you were trapped -> after the pause instruction).
>
> My case:
>
> PID: 118 TASK: ffff883fd28ec7d0 CPU: 9 COMMAND: "migration/9"
> ...
> [exception RIP: multi_cpu_stop+0x64]
> RIP: ffffffff810f5944 RSP: ffff883fd2907d98 RFLAGS: 00000246
> RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000246
> RDX: ffff883fd2907d98 RSI: 0000000000000000 RDI: 0000000000000001
> RBP: ffffffff810f5944 R8: ffffffff810f5944 R9: 0000000000000000
> R10: ffff883fd2907d98 R11: 0000000000000246 R12: ffffffffffffffff
> R13: ffff883f55d01b48 R14: 0000000000000000 R15: 0000000000000001
> ORIG_RAX: 0000000000000001 CS: 0010 SS: 0000
> --- ---
> #4 [ffff883fd2907d98] multi_cpu_stop+0x64 at ffffffff810f5944
>
> 208 } while (curstate != MULTI_STOP_EXIT);
> ---> RIP
>
> RIP 0xffffffff810f5944 <+100>: cmp $0x4,%edx
> ---> CHECKING FOR MULTI_STOP_EXIT
>
> RDX: ffff883fd2907d98 -> does not make any sense
>
> ###
>
> If I'm reading this right:
>
> """
> CPU 05 - PID 14990
>
> do_numa_page
> task_numa_fault
> numa_migrate_preferred
> task_numa_migrate
> migrate_swap (curr: 14990, task: 14996)
> stop_two_cpus (cpu1=05(14996), cpu2=00(14990))
> wait_for_completion
>
> 14990 - CPU05
> 14996 - CPU00
>
> stop_two_cpus:
> multi_stop_data (msdata->state = MULTI_STOP_PREPARE)
> smp_call_function_single (min=cpu2=00, irq_cpu_stop_queue_work, wait=1)
> smp_call_function_single (ran on lowest CPU, 00 for this case)
> irq_cpu_stop_queue_work
> cpu_stop_queue_work(cpu1=05(14996)) # add work (multi_cpu_stop) to cpu 05 cpu_stopper queue
> cpu_stop_queue_work(cpu2=00(14990)) # add work (multi_cpu_stop) to cpu 00 cpu_stopper queue
> wait_for_completion() --> HERE
> """
>
> In my case, checking the task structs of the tasks sitting in
> wait_for_completion():
>
> PID 14990 CPU 05 -> PID 14996 CPU 00
> PID 14991 CPU 30 -> PID 14998 CPU 01
> PID 14992 CPU 30 -> PID 14998 CPU 01
> PID 14996 CPU 00 -> PID 14992 CPU 30
> PID 14998 CPU 01 -> PID 14990 CPU 05
>
> AND
>
> 102 2 6 ffff881fd2ea97f0 RU 0.0 0 0 [migration/6]
> 118 2 9 ffff883fd28ec7d0 RU 0.0 0 0 [migration/9]
> 143 2 14 ffff883fd29d47d0 RU 0.0 0 0 [migration/14]
> 148 2 15 ffff883fd29fc7d0 RU 0.0 0 0 [migration/15]
> 153 2 16 ffff881fd2f517f0 RU 0.0 0 0 [migration/16]
>
> THEN
>
> I am still waiting for 5 cpu_stopper_thread -> multi_cpu_stop works that
> are just scheduled (probably in the per-cpu queues of cpus 0, 1, 5, 30),
> not running yet.
>
> AND
>
> I don't have any wait_for_completion for those OLDER migration threads
> (6, 9, 14, 15 and 16). Probably done.completion got signalled before
> the race.
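For reference, the loop the dumped RIP sits in looks roughly like this.
This is condensed from my reading of 3.13's kernel/stop_machine.c, with
the irq-flags handling and the is_active/set_state details trimmed, so
treat it as a sketch rather than a verbatim copy:

/* Condensed shape of multi_cpu_stop() (3.13-era), for illustration. */
enum multi_stop_state {
	MULTI_STOP_NONE,	/* 0 */
	MULTI_STOP_PREPARE,	/* 1 */
	MULTI_STOP_DISABLE_IRQ,	/* 2 */
	MULTI_STOP_RUN,		/* 3 */
	MULTI_STOP_EXIT,	/* 4 <- the $0x4 in the cmp above */
};

struct multi_stop_data {
	int			(*fn)(void *);
	void			*data;
	unsigned int		num_threads;	/* works in this set */
	const struct cpumask	*active_cpus;
	enum multi_stop_state	state;
	atomic_t		thread_ack;
};

static void ack_state(struct multi_stop_data *msdata)
{
	/* The last stopper to ack advances ->state for everyone. */
	if (atomic_dec_and_test(&msdata->thread_ack))
		set_state(msdata, msdata->state + 1);	/* re-arms thread_ack */
}

static int multi_cpu_stop(void *data)
{
	struct multi_stop_data *msdata = data;
	enum multi_stop_state curstate = MULTI_STOP_NONE;
	int err = 0;

	do {
		cpu_relax();	/* pause; re-read msdata->state */
		if (msdata->state != curstate) {
			curstate = msdata->state;
			switch (curstate) {
			case MULTI_STOP_DISABLE_IRQ:
				local_irq_disable();
				break;
			case MULTI_STOP_RUN:
				if (is_active)
					err = msdata->fn(msdata->data);
				break;
			default:
				break;
			}
			ack_state(msdata);
		}
	} while (curstate != MULTI_STOP_EXIT);	/* <- RIP: cmp $0x4,%edx */

	return err;
}

So if one work of a pair never runs (stuck behind another stopper's work
in its per-cpu queue), thread_ack never drops to zero, ->state never
advances, and the partner spins at that cmp forever - which would match
migration/6, 9, 14, 15 and 16 spinning while the five works sit in the
per-cpu queues of cpus 0, 1, 5 and 30.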
And following this thread's discussion, and the commits below:

commit a1d9a3231eac4117cadaf4b6bba5b2902c15a33e
Author: Kirill Tkhai
Date:   Thu Apr 10 17:38:36 2014 +0400

    sched: Check for stop task appearance when balancing happens

commit 37e117c07b89194aae7062bc63bde1104c03db02
Author: Peter Zijlstra
Date:   Fri Feb 14 12:25:08 2014 +0100

    sched: Guarantee task priority in pick_next_task()

commit 38033c37faab850ed5d33bb675c4de6c66be84d8
Author: Peter Zijlstra
Date:   Thu Jan 23 20:32:21 2014 +0100

    sched: Push down pre_schedule() and idle_balance()

commit 606dba2e289446600a0b68422ed2019af5355c12
Author: Peter Zijlstra
Date:   Sat Feb 11 06:05:00 2012 +0100

    sched: Push put_prev_task() into pick_next_task()

The 3.13 kernel still had the old logic (before 3.15): no RETRY_TASK,
idle_balance() called before pick_next_task(), and no deadline scheduler
yet. So commit "a1d9a32" does not play a role in this panic.

I'm causing ~150 stop_two_cpus() calls/sec for task migration in a
32-node fake NUMA environment, and I am NOT able to reproduce this
lockup; but, still, the dump says it is there :\

For the 3.13 series this lockup was seen once; no info on other versions.

Any thoughts?

Thank you

-Rafael Tinoco