Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757896Ab2BMUQu (ORCPT ); Mon, 13 Feb 2012 15:16:50 -0500 Received: from mga09.intel.com ([134.134.136.24]:15407 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755978Ab2BMUQt (ORCPT ); Mon, 13 Feb 2012 15:16:49 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.67,351,1309762800"; d="scan'208";a="109558690" Message-ID: <4F396FA9.90606@linux.intel.com> Date: Mon, 13 Feb 2012 12:16:41 -0800 From: Arjan van de Ven User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20111222 Thunderbird/9.0.1 MIME-Version: 1.0 To: Andrew Morton CC: Michael Neuling , Stephen Rothwell , LKML , gregkh@linuxfoundation.org, linux-next@vger.kernel.org, ppc-dev , Milton Miller Subject: Re: Boot failure with next-20120208 References: <20120212113805.c7e5d902c95a9d0f4037e12c@canb.auug.org.au> <16788.1329102254@neuling.org> <4F391BBA.5020506@linux.intel.com> <20120213120549.eab7e2b9.akpm@linux-foundation.org> In-Reply-To: <20120213120549.eab7e2b9.akpm@linux-foundation.org> X-Enigmail-Version: 1.3.5 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5708 Lines: 121 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/13/2012 12:05 PM, Andrew Morton wrote: > On Mon, 13 Feb 2012 06:18:34 -0800 Arjan van de Ven > wrote: > >> On 2/12/2012 7:04 PM, Michael Neuling wrote: >>>> Just a quick note to say I got a boot OOPs with next-20120208 >>>> and 9 on a Power7 blade (my other PowerPC boot tests are ok. >>>> I'll investigate this further on Monday. > > Thanks for testing linux-next. Very useful. > >>>> The line referenced below is: >>>> >>>> BUG_ON(!kobj || !kobj->sd || !attr); >>>> >>>> in sysfs_create_file(). > > Yes, this is exactly why we should never use BUG_ON(a || b). We > don't know which of those three expressions triggered. > >>>> calling .topology_init+0x0/0x1ac @ 1 initcall >>>> 7_.async_cpu_up+0x0/0x40 returned 0 after 9765 usecs >>>> async_continuing @ 20 after 9765 usec ------------[ cut here >>>> ]------------ kernel BUG at fs/sysfs/file.c:573! Oops: >>>> Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=32 NUMA >>>> pSeries Modules linked in: NIP: c00000000024a35c LR: >>>> c0000000004ee050 CTR: c00000000083ca24 REGS: c0000003fd9e7560 >>>> TRAP: 0700 Not tainted (3.3.0-rc2-autokern1) MSR: >>>> 8000000000029032 CR: 88002082 XER: >>>> 0000000f CFAR: c00000000024a370 TASK = c0000003fd9e8000[20] >>>> 'kworker/u:6' THREAD: c0000003fd9e4000 CPU: 0 GPR00: >>>> 0000000000000001 c0000003fd9e77e0 c000000000d19bb8 >>>> 0000000000000000 GPR04: c000000000bf37a8 0000000000000008 >>>> 8000000002096400 0000000000000000 GPR08: 0000000000000000 >>>> c000000000f80028 c000000000d52bd8 0000000000000000 GPR12: >>>> 0000000048002088 c00000000f33b000 0000000001affa78 >>>> 00000000009aa000 GPR16: 0000000000e1f3c8 0000000002d517f0 >>>> 0000000001aff984 0000000000000060 GPR20: 0000000000000000 >>>> ffffffffffffffff 0000000000000000 c000000000c45128 GPR24: >>>> 0000000000000000 0000000000000008 0000000000000000 >>>> c000000000c44200 GPR28: c000000000f80028 0000000000000008 >>>> c000000000c85038 0000000000000002 NIP [c00000000024a35c] >>>> .sysfs_create_file+0x1c/0x40 LR [c0000000004ee050] >>>> .device_create_file+0x20/0x40 Call Trace: [c0000003fd9e77e0] >>>> [c0000003fd9e78a0] 0xc0000003fd9e78a0 (unreliable) >>>> [c0000003fd9e7850] [c00000000083c9a4] >>>> .register_cpu_online+0x1d0/0x250 [c0000003fd9e7900] >>>> [c00000000083ca8c] .sysfs_cpu_notify+0x68/0x28c >>>> [c0000003fd9e79b0] [c00000000083769c] >>>> .notifier_call_chain+0x9c/0x100 [c0000003fd9e7a50] >>>> [c0000000000a5878] .__cpu_notify+0x38/0x80 [c0000003fd9e7ad0] >>>> [c00000000083e124] ._cpu_up+0x10c/0x178 [c0000003fd9e7b90] >>>> [c00000000083e2c8] .cpu_up+0x138/0x164 [c0000003fd9e7c20] >>>> [c000000000ba46d0] .async_cpu_up+0x28/0x40 [c0000003fd9e7ca0] >>>> [c0000000000d81ec] .async_run_entry_fn+0xbc/0x1f0 >>>> [c0000003fd9e7d50] [c0000000000c7cbc] >>>> .process_one_work+0x19c/0x590 [c0000003fd9e7e10] >>>> [c0000000000c8618] .worker_thread+0x188/0x4b0 >>>> [c0000003fd9e7ed0] [c0000000000ce57c] .kthread+0xbc/0xd0 >>>> [c0000003fd9e7f90] [c000000000021448] >>>> .kernel_thread+0x54/0x70 Instruction dump: 7fa3eb78 ebe1fff8 >>>> eba1ffe8 7c0803a6 4e800020 2c230000 41820024 e8630030 >>>> 7c800074 7800d182 2fa30000 419e0014 <0b000000> 38a00002 >>>> 4bfffebc e8630030 ---[ end trace 31fd0ba7d8756001 ]--- >>>> initcall .topology_init+0x0/0x1ac returned 0 after 0 usecs >>>> calling .pcibios_init+0x0/0xe8 @ 1 PCI: Probing PCI >>>> hardware PCI: Probing PCI hardware done initcall >>>> .pcibios_init+0x0/0xe8 returned 0 after 0 usecs calling >>>> .add_system_ram_resources+0x0/0x140 @ 1 initcall >>>> .add_system_ram_resources+0x0/0x140 returned 0 after 0 usecs >>>> calling >>>> .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8 >>>> @ 1 initcall >>>> .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8 >>>> returned 0 after 0 usecs calling .opal_init+0x0/0x1cc @ 1 >>>> opal: Node not found initcall .opal_init+0x0/0x1cc returned >>>> -19 after 0 usecs calling >>>> .__machine_initcall_pseries_ioei_init+0x0/0xa0 @ 1 >>> >>> Reverting "smp: start up non-boot CPUs asynchronously" >>> (8de7a96405 from next-20120208) fixes this problem for me. >>> >> if that fixes it, it means PPC has a race somewhere in the cpu >> hotplug code, since all the patch does is hotplug the cpus one by >> one (which the normal kernel also does, just not in parallel with >> other work) > > The bug looks pretty generic, nothing very PPC-specific there. It > might affect other architectures - we won't know until we find out > wht caused it. well one half of the race looks pretty generic... ..... doesn't mean the other half of the race is though.... > > Ho hum, I suppose I should pull the patch out of linux-next, to > avoid disrupting other testing. This means it's going to be hard > to get the bug fixed. it means losing this one big PPC machine indeed.... until they hit that same race some other way with regular real cpu hotplug ;-( -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJPOW+pAAoJEEHdSxh4DVnEJXEIAKHFQwrGw1R66bEjTMxfgOpH ZlUwjciyNlo2KpqgaUgulHDHNzWQa29nzjPfuk6qG7mHGKnbgyn/IBCnUD5uJqri 6yu7Md0vekwJnoilZQvEuvF6qHrOYOcaWvW60x2y3W+fBesa5zxpqwDLbKj4Qvu3 YAbxeMaAr0W/d7pKEubKds3YJnr1S06qbK8Jw7DF92YEd7xDTBTSpuSF3vA6BlLQ jEV1xwxVwnLEa6A/DnkvQ67Kayj0zfC2CSCqlt2T2BiDl81XkCPC/U3yHuHb/ISI ykwFHy4L2VH54n5dQH8S6qfzNS1vh8IQ0xzVtJH6CHh8XXK7T97T07jZZ/LWMWI= =GQrj -----END PGP SIGNATURE----- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/