From: Stewart Smith
To: Michael Ellerman, linuxppc-dev@ozlabs.org
Cc: mingo@kernel.org, tglx@linutronix.de, Anton Blanchard, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] powerpc/smp: Wait until secondaries are active & online
In-Reply-To: <1424761082-29938-1-git-send-email-mpe@ellerman.id.au>
References: <1424761082-29938-1-git-send-email-mpe@ellerman.id.au>
Date: Thu, 26 Feb 2015 10:13:16 +1100
X-Mailing-List: linux-kernel@vger.kernel.org

Michael Ellerman writes:
> Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
> "kernel BUG at kernel/smpboot.c:134!" issue during boot:
>
>   BUG_ON(td->cpu != smp_processor_id());
>
> Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
> output confirms it:
>
>   CPU: 0
>   Comm: watchdog/130
>
> The problem is that we aren't ensuring the CPU active bit is set for the
> secondary before allowing the master to continue on. The master unparks
> the secondary CPU's kthreads and the scheduler looks for a CPU to run
> on. It calls select_task_rq() and realises the suggested CPU is not in
> the cpus_allowed mask. It then ends up in select_fallback_rq(), and
> since the active bit isn't set we choose some other CPU to run on.
>
> This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug
> vs. set_cpus_allowed_ptr()", which changed from setting active before
> online to setting active after online. However that was in turn fixing a
> bug where other code assumed an active CPU was also online, so we can't
> just revert that fix.
>
> The simplest fix is just to spin waiting for both active & online to be
> set. We already have a barrier prior to set_cpu_online() (which also
> sets active), to ensure all other setup is completed before online &
> active are set.
>
> Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
> Signed-off-by: Michael Ellerman
> Signed-off-by: Anton Blanchard

By building a gcov enabled skiboot, which makes OPAL_START_CPU a whole
bunch slower (because gcov), I could really *really* reliably reproduce
this. With this patch, I cannot.

Tested-by: Stewart Smith
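
For anyone following the thread, the race described in the quoted commit
message can be illustrated with a toy user-space model. This is only a
sketch under stated assumptions, not the patch and not kernel code: the
names cpu_masks, pick_cpu(), secondary_step() and master_wait() are
hypothetical, the masks are plain 64-bit integers, and the secondary's
bring-up is modelled as discrete steps (online first, active one step
later) rather than as a real second CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: one bit per CPU in each mask (hypothetical type). */
struct cpu_masks {
    uint64_t online;
    uint64_t active;
};

static uint64_t cpu_bit(int cpu)
{
    assert(cpu >= 0 && cpu < 64);
    return UINT64_C(1) << cpu;
}

/*
 * Hypothetical stand-in for the scheduler's fallback choice: keep the
 * wanted CPU only if it is active, otherwise take the lowest active CPU.
 */
static int pick_cpu(const struct cpu_masks *m, int wanted)
{
    if (m->active & cpu_bit(wanted))
        return wanted;
    for (int cpu = 0; cpu < 64; cpu++)
        if (m->active & cpu_bit(cpu))
            return cpu;
    return -1;
}

/* Secondary bring-up as discrete steps: online first, active afterwards. */
static void secondary_step(struct cpu_masks *m, int cpu, int step)
{
    if (step == 0)
        m->online |= cpu_bit(cpu);
    else if (step == 1)
        m->active |= cpu_bit(cpu);
}

/*
 * Master side. With wait_for_active == false it continues as soon as the
 * secondary is online; with true it waits for both online and active.
 * Returns the CPU a kthread bound to `cpu` would land on when unparked.
 */
static int master_wait(struct cpu_masks *m, int cpu, bool wait_for_active)
{
    int step = 0;

    for (;;) {
        bool online = (m->online & cpu_bit(cpu)) != 0;
        bool active = (m->active & cpu_bit(cpu)) != 0;

        if (online && (active || !wait_for_active))
            break;
        /* Stands in for the secondary making progress while we spin. */
        secondary_step(m, cpu, step++);
    }
    return pick_cpu(m, cpu);
}
```

Starting from masks where only CPU 0 is online and active, calling
master_wait() for CPU 5 with wait_for_active set to false lands the
thread on CPU 0 (the wrong CPU, as in the oops), while passing true lands
it on CPU 5.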