Date: Fri, 07 Jul 2017 00:47:29 -0700
From: Vikram Mulukutla
To: Thomas Gleixner
Cc: Rusty Russell, Tejun Heo, Andrew Morton, LKML, Peter Zijlstra,
    Sebastian Siewior, linux-kernel-owner@vger.kernel.org
Subject: Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the control CPU

On 2017-07-04 13:20, Thomas Gleixner wrote:
> Vikram reported the following backtrace:
>
>    BUG: scheduling while atomic: swapper/7/0/0x00000002
>    CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
>    schedule
>    schedule_hrtimeout_range_clock
>    schedule_hrtimeout
>    wait_task_inactive
>    __kthread_bind_mask
>    __kthread_bind
>    __kthread_unpark
>    kthread_unpark
>    cpuhp_online_idle
>    cpu_startup_entry
>    secondary_start_kernel
>
> He analyzed correctly that a parked cpu hotplug thread of an offlined
> CPU was still on the runqueue when the CPU came back online and tried
> to unpark it. This causes the thread which invoked kthread_unpark() to
> call wait_task_inactive() and subsequently schedule() with preemption
> disabled.
>
> His proposed workaround was to "make sure" that a parked thread has
> scheduled out when the CPU goes offline, so the situation cannot
> happen.
>
> But that's still wrong, because the root cause is not the fact that the
> percpu thread is still on the runqueue, nor that preemption is disabled
> (which could be simply solved by enabling preemption before calling
> kthread_unpark()).
>
> The real issue is that the calling thread is the idle task of the
> upcoming CPU, which is not supposed to call anything which might sleep.
> The moron, who wrote that code, missed completely that kthread_unpark()
> might end up in schedule().
>

Agreed. Presumably we are still OK with the cpuhp hotplug thread being
migrated off to random CPUs, and with its unfinished kthread_parkme()
racing against a subsequent unpark? The cpuhp/X thread ends up on a
random runqueue if it cannot call schedule() in time, because
migration/X yanks it off of the dying CPU X. Apart from slightly
disconcerting traces showing cpuhp/X momentarily executing on CPU Y,
there is no problem that I can see, of course. It is probably worth
mentioning that the following patch from Junaid Shahid seems to
indicate that such a race doesn't always work out as desired:

https://lkml.org/lkml/2017/4/28/755

> The solution is simpler than expected. The thread which controls the
> hotplug operation is waiting for the CPU to call complete() on the
> hotplug state completion. So the idle task of the upcoming CPU can set
> its state to CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn
> wakes the control task on a different CPU, which then can safely do
> the unpark and kick the now unparked hotplug thread of the upcoming
> CPU to complete the bringup to the final target state.
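For anyone following along, the handoff described above can be sketched
in plain userspace C with POSIX threads. This is only an illustrative
analogy of my own, not kernel code: the struct, the helpers, and names
like ap_idle()/control_side() stand in for the kernel's completion API,
cpuhp_online_idle() and bringup_wait_for_ap() respectively. The point it
shows is that the side which must not sleep only flags and wakes, while
the side which is allowed to sleep does the waiting and the unpark.

```c
/*
 * Userspace sketch (pthreads) of the completion handoff in the patch:
 * the "AP idle" side only signals a completion and never sleeps, and
 * the control side -- which may sleep -- performs the "unpark".
 * All names here are illustrative, not kernel APIs.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Safe from a context that must not sleep: set a flag and wake the waiter. */
static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* May block; only ever called from the control side. */
static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

#define AP_ONLINE_IDLE 1

static struct completion st_done;
static int  state;            /* stands in for st->state */
static bool thread_unparked;  /* stands in for the hotplug thread's state */

/* Models cpuhp_online_idle(): runs as the upcoming CPU's idle task. */
static void *ap_idle(void *unused)
{
	(void)unused;
	state = AP_ONLINE_IDLE;
	complete(&st_done);   /* wake the control side; do NOT sleep here */
	return NULL;
}

/* Models bringup_wait_for_ap(): runs on the control CPU, may sleep. */
static void control_side(void)
{
	wait_for_completion(&st_done);
	thread_unparked = true;  /* kthread_unpark() would happen here */
}
```

Run the two sides on separate threads and the control side always does
the unpark after the AP side has reached its idle state, which is exactly
the ordering the patch enforces.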
>
> Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully up")
> Reported-by: Vikram Mulukutla
> Signed-off-by: Thomas Gleixner
> Cc: Peter Zijlstra
> Cc: Sebastian Siewior
> Cc: rusty@rustcorp.com.au
> Cc: tj@kernel.org
> Cc: akpm@linux-foundation.org
> Cc: stable@vger.kernel.org
>
> ---
>  kernel/cpu.c | 30 ++++++++++++++++--------------
>  1 file changed, 16 insertions(+), 14 deletions(-)
>
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -271,11 +271,25 @@ void cpu_hotplug_enable(void)
>  EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
>  #endif	/* CONFIG_HOTPLUG_CPU */
>
> +static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st);
> +
>  static int bringup_wait_for_ap(unsigned int cpu)
>  {
>  	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>
> +	/* Wait for the CPU to reach CPUHP_AP_ONLINE_IDLE */
>  	wait_for_completion(&st->done);
> +	BUG_ON(!cpu_online(cpu));
> +
> +	/* Unpark the stopper thread and the hotplug thread of the target cpu */
> +	stop_machine_unpark(cpu);
> +	kthread_unpark(st->thread);
> +
> +	/* Should we go further up ? */
> +	if (st->target > CPUHP_AP_ONLINE_IDLE) {
> +		__cpuhp_kick_ap_work(st);
> +		wait_for_completion(&st->done);
> +	}
>  	return st->result;
>  }
>
> @@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu)
>  	irq_unlock_sparse();
>  	if (ret)
>  		return ret;
> -	ret = bringup_wait_for_ap(cpu);
> -	BUG_ON(!cpu_online(cpu));
> -	return ret;
> +	return bringup_wait_for_ap(cpu);
>  }
>
>  /*
> @@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cpu)
>  void cpuhp_online_idle(enum cpuhp_state state)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> -	unsigned int cpu = smp_processor_id();
>
>  	/* Happens for the boot cpu */
>  	if (state != CPUHP_AP_ONLINE_IDLE)
>  		return;
>
>  	st->state = CPUHP_AP_ONLINE_IDLE;
> -
> -	/* Unpark the stopper thread and the hotplug thread of this cpu */
> -	stop_machine_unpark(cpu);
> -	kthread_unpark(st->thread);
> -
> -	/* Should we go further up ? */
> -	if (st->target > CPUHP_AP_ONLINE_IDLE)
> -		__cpuhp_kick_ap_work(st);
> -	else
> -		complete(&st->done);
> +	complete(&st->done);
>  }
>
>  /* Requires cpu_add_remove_lock to be held */

Nice and clean otherwise. Channagoud was instrumental in collecting
data, theorizing with me, and testing your fix, so if the concern I've
raised above doesn't matter, please add:

Tested-by: Channagoud Kadabi

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project