Date: Mon, 20 Jul 2015 16:46:19 +0200
From: Borislav Petkov
To: Joerg Roedel
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", x86@kernel.org,
	linux-kernel@vger.kernel.org, Joerg Roedel
Subject: Re: [PATCH] x86/smpboot: Check for cpu_active on cpu initialization
Message-ID: <20150720144619.GA9361@nazgul.tnic>
References: <1437038237-16741-1-git-send-email-joro@8bytes.org>
In-Reply-To: <1437038237-16741-1-git-send-email-joro@8bytes.org>

On Thu, Jul 16, 2015 at 11:17:17AM +0200, Joerg Roedel wrote:
> From: Joerg Roedel
>
> Currently the code to bring up secondary CPUs only checks
> for cpu_online before it proceeds with launching the per-cpu
> threads for the freshly booted remote CPU.
>
> But the code to move these threads to the new CPU checks for
> cpu_active to do so. If this check fails the threads end up
> on the wrong CPU, causing warnings and bugs like:
>
> WARNING: CPU: 0 PID: 1 at ../kernel/workqueue.c:4417 workqueue_cpu_up_callback
>
> and/or:
>
> kernel BUG at ../kernel/smpboot.c:135!
>
> The reason is that the cpu_active bit for the new CPU
> becomes visible significantly later than the cpu_online bit.

I see

void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
	} else {

which is called in start_secondary().

Do you mean that setting the bit in cpu_active_mask gets delayed so
much? Because it comes right after setting the bit in cpu_online_mask.

> The reasons could be that the kernel runs in a KVM guest,
> where the vCPU thread gets preempted when the cpu_online bit
> is set, but with cpu_active still clear.
>
> But this could also happen on bare-metal systems with lots
> of CPUs. We have observed this issue on an 88 core x86
> system on bare-metal.
>
> To fix this issue, wait before the remote CPU is online
> *and* active before launching the per-cpu threads.
>
> Signed-off-by: Joerg Roedel
> ---
>  arch/x86/kernel/smpboot.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index d3010aa..30b7b8b 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1006,7 +1006,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct *tidle)
>  	check_tsc_sync_source(cpu);
>  	local_irq_restore(flags);
>
> -	while (!cpu_online(cpu)) {
> +	while (!cpu_online(cpu) || !cpu_active(cpu)) {
>  		cpu_relax();
>  		touch_nmi_watchdog();

Maybe we should just swap the calls in set_cpu_online() instead? I.e.,

	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
	}

?

I see cpu_online() being called much more than cpu_active()...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
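
For illustration only, the window discussed above can be modelled in
userspace. This is a rough sketch, not kernel code: every model_* name is
hypothetical, C11 atomics stand in for cpumask_set_cpu(), and the two
stores are split into separate calls so that "being preempted between
them" can be reproduced deterministically. It says nothing about the
ordering guarantees the real cpumask helpers provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 64u

/* Hypothetical stand-ins for cpu_online_bits / cpu_active_bits. */
static atomic_ullong model_online_bits;
static atomic_ullong model_active_bits;

/* Which of the two bits the booting CPU sets first. */
enum model_order { MODEL_ONLINE_FIRST, MODEL_ACTIVE_FIRST };

static unsigned long long model_bit(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return 1ULL << cpu;
}

static void model_reset(void)
{
	atomic_store(&model_online_bits, 0);
	atomic_store(&model_active_bits, 0);
}

static bool model_cpu_online(unsigned int cpu)
{
	return (atomic_load(&model_online_bits) & model_bit(cpu)) != 0;
}

static bool model_cpu_active(unsigned int cpu)
{
	return (atomic_load(&model_active_bits) & model_bit(cpu)) != 0;
}

/*
 * First of the two stores. Stopping after this call models a CPU (or a
 * vCPU thread) that is preempted between the two stores.
 */
static void model_first_store(unsigned int cpu, enum model_order order)
{
	atomic_ullong *mask = (order == MODEL_ONLINE_FIRST) ?
			      &model_online_bits : &model_active_bits;

	atomic_fetch_or(mask, model_bit(cpu));
}

/* Second of the two stores: sets whichever bit the first call did not. */
static void model_second_store(unsigned int cpu, enum model_order order)
{
	atomic_ullong *mask = (order == MODEL_ONLINE_FIRST) ?
			      &model_active_bits : &model_online_bits;

	atomic_fetch_or(mask, model_bit(cpu));
}

/*
 * A waiter that, like the unpatched loop in native_cpu_up(), polls only
 * the online bit before moving on.
 */
static bool model_waiter_proceeds(unsigned int cpu)
{
	return model_cpu_online(cpu);
}
```

With MODEL_ONLINE_FIRST, a waiter checked right after model_first_store()
proceeds while the active bit is still clear. With MODEL_ACTIVE_FIRST, the
same waiter keeps spinning until model_second_store() has run, at which
point both bits are set in this model.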