2008-08-12 21:55:43

by Langsdorf, Mark

Subject: Warning in during hotplug on 2.6.27-rc2-git5

I'm seeing the following error message when I hotunplug and replug
a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
immediately afterwards.

------------[ cut here ]------------
WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
sysfs: duplicate filename 'machinecheck4' can not be created
Modules linked in: cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table sr_mod af_packet button battery ac loop dm_mod usb_storage usbhid ff_memless tg3 libphy ide_pci_generic shpchp ehci_hcd i2c_piix4 i2c_core pci_hotplug ohci_hcd usbcore ide_cd_mod cdrom floppy mptctl ext3 jbd edd fan thermal processor mptsas mptscsih mptbase scsi_transport_sas sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core
Pid: 4838, comm: bash Not tainted 2.6.27-rc2-git5-pn_test #2

Call Trace:
[<ffffffff8023194f>] warn_slowpath+0xb4/0xde
[<ffffffff80304cf7>] rb_insert_color+0x61/0xda
[<ffffffff80304cf7>] rb_insert_color+0x61/0xda
[<ffffffff80306c27>] vsnprintf+0x568/0x5b1
[<ffffffff80226c3f>] hrtick_start_fair+0x10d/0x171
[<ffffffff80301b45>] idr_get_empty_slot+0x164/0x243
[<ffffffff80301d1a>] ida_get_new_above+0xf6/0x182
[<ffffffff802a038a>] find_inode+0x28/0x6d
[<ffffffff802d272c>] sysfs_ilookup_test+0x0/0xf
[<ffffffff802d2928>] sysfs_find_dirent+0x1b/0x2f
[<ffffffff802d29de>] sysfs_add_one+0x33/0x39
[<ffffffff802d2ee3>] create_dir+0x4f/0x87
[<ffffffff802d2f50>] sysfs_create_dir+0x35/0x4a
[<ffffffff80302757>] kobject_get+0x12/0x17
[<ffffffff80302890>] kobject_add_internal+0xcf/0x18a
[<ffffffff80302a66>] kobject_init_and_add+0x5b/0x68
[<ffffffff8022d5ab>] set_cpus_allowed_ptr+0x119/0x126
[<ffffffff8038397a>] sysdev_register+0x5a/0xb5
[<ffffffff804172df>] mce_create_device+0xb4/0x156
[<ffffffff804173ac>] mce_cpu_callback+0x2b/0x9b
[<ffffffff8041fb7d>] notifier_call_chain+0x29/0x4c
[<ffffffff8041a2e7>] _cpu_up+0xc8/0x102
[<ffffffff8041a375>] cpu_up+0x54/0x77
[<ffffffff8040ffee>] store_online+0x43/0x67
[<ffffffff802d20a1>] sysfs_write_file+0xd2/0x110
[<ffffffff8028f1a5>] vfs_write+0xad/0x156
[<ffffffff8028f692>] sys_write+0x45/0x6e
[<ffffffff8020bdeb>] system_call_fastpath+0x16/0x1b

---[ end trace 48036b92036180e0 ]---
kobject_add_internal failed for machinecheck4 with -EEXIST, don't try to register things with the same name in the same directory.
Pid: 4838, comm: bash Tainted: G W 2.6.27-rc2-git5-pn_test #2

Call Trace:
[<ffffffff8030290c>] kobject_add_internal+0x14b/0x18a
[<ffffffff80302a66>] kobject_init_and_add+0x5b/0x68
[<ffffffff8022d5ab>] set_cpus_allowed_ptr+0x119/0x126
[<ffffffff8038397a>] sysdev_register+0x5a/0xb5
[<ffffffff804172df>] mce_create_device+0xb4/0x156
[<ffffffff804173ac>] mce_cpu_callback+0x2b/0x9b
[<ffffffff8041fb7d>] notifier_call_chain+0x29/0x4c
[<ffffffff8041a2e7>] _cpu_up+0xc8/0x102
[<ffffffff8041a375>] cpu_up+0x54/0x77
[<ffffffff8040ffee>] store_online+0x43/0x67
[<ffffffff802d20a1>] sysfs_write_file+0xd2/0x110
[<ffffffff8028f1a5>] vfs_write+0xad/0x156
[<ffffffff8028f692>] sys_write+0x45/0x6e
[<ffffffff8020bdeb>] system_call_fastpath+0x16/0x1b

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
PGD 22a87c067 PUD 22c9ef067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table sr_mod af_packet button battery ac loop dm_mod usb_storage usbhid ff_memless tg3 libphy ide_pci_generic shpchp ehci_hcd i2c_piix4 i2c_core pci_hotplug ohci_hcd usbcore ide_cd_mod cdrom floppy mptctl ext3 jbd edd fan thermal processor mptsas mptscsih mptbase scsi_transport_sas sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core
Pid: 4838, comm: bash Tainted: G W 2.6.27-rc2-git5-pn_test #2
RIP: 0010:[<ffffffff802d2a11>] [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
RSP: 0018:ffff88022d0dbb38 EFLAGS: 00010246
RAX: ffff880028047710 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000acb2c4dfcc0 RSI: 0000000000000000 RDI: ffff88022fc3b000
RBP: ffff88022d0dbb58 R08: 0000000000000080 R09: ffff88022d0dba58
R10: ffff880028047710 R11: 00000acb2c4dfcc0 R12: 00000000fffffff4
R13: 0000000000000000 R14: ffff88022d0dbbb0 R15: 0000000000000004
FS: 00007f106d7496d0(0000) GS:ffff88022f0938c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000038 CR3: 000000022c8d0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 4838, threadinfo ffff88022d0da000, task ffff88022d6b4d90)
Stack: 00000000fffffff4 ffff88022e3dfc80 ffff88022dc0c870 ffffffff802d2ed8
0000000000000000 0000000000000000 0000000000000000 0000000000000000
ffff88022e3dfc80 ffff88022d0dbd18 00000000fffffffe 0000000000000004
Call Trace:
[<ffffffff802d2ed8>] ? create_dir+0x44/0x87
[<ffffffff802d2f50>] ? sysfs_create_dir+0x35/0x4a
[<ffffffff80302757>] ? kobject_get+0x12/0x17
[<ffffffff80302890>] ? kobject_add_internal+0xcf/0x18a
[<ffffffff80302d57>] ? kobject_add+0x74/0x7c
[<ffffffff8041bfc5>] ? thread_return+0x3e/0xa3
[<ffffffff8030254e>] ? kobject_init_internal+0x12/0x2c
[<ffffffff803025c6>] ? kobject_init+0x41/0x69
[<ffffffff80302619>] ? kobject_create+0x2b/0x30
[<ffffffff80302d8d>] ? kobject_create_and_add+0x2e/0x5b
[<ffffffff804178c3>] ? threshold_create_device+0x1aa/0x32a
[<ffffffff80417aa2>] ? threshold_cpu_callback+0x5f/0x2ca
[<ffffffff804172df>] ? mce_create_device+0xb4/0x156
[<ffffffff8041fb7d>] ? notifier_call_chain+0x29/0x4c
[<ffffffff8041a2e7>] ? _cpu_up+0xc8/0x102
[<ffffffff8041a375>] ? cpu_up+0x54/0x77
[<ffffffff8040ffee>] ? store_online+0x43/0x67
[<ffffffff802d20a1>] ? sysfs_write_file+0xd2/0x110
[<ffffffff8028f1a5>] ? vfs_write+0xad/0x156
[<ffffffff8028f692>] ? sys_write+0x45/0x6e
[<ffffffff8020bdeb>] ? system_call_fastpath+0x16/0x1b


Code: c0 b9 08 00 00 00 fc 53 48 89 fd 48 89 f3 48 83 ec 08 f3 ab 48 89 75 00 48 c7 c7 e0 e7 51 80 e8 c2 9c 14 00 48 8b 3d 57 45 37 00 <48> 8b 73 38 48 89 d9 48 c7 c2 2c 27 2d 80 e8 1e db fc ff 48 85
RIP [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
RSP <ffff88022d0dbb38>
CR2: 0000000000000038
---[ end trace 48036b92036180e0 ]---
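A note on reading the oops above: the faulting instruction is marked in the Code: line as `<48> 8b 73 38`. Decoded by hand, that looks like a 64-bit load from offset 0x38 off RBX. RBX is zero in the register dump, which agrees with the reported fault address (CR2) of 0x38. The sketch below is plain userspace C, not kernel code; the struct and function names are invented for illustration, and it only handles the ModRM form with an 8-bit displacement and no SIB byte.

```c
#include <stdint.h>

/* Hypothetical helper type: the fields of one ModRM byte plus its disp8. */
struct modrm_disp8 {
    unsigned mod;   /* 1 means [base register + 8-bit displacement] */
    unsigned reg;   /* register operand (6 is RSI when REX.R is clear) */
    unsigned rm;    /* base register (3 is RBX when REX.B is clear) */
    int8_t disp;    /* signed 8-bit displacement */
};

/* Split a ModRM byte into its three bit fields and attach the displacement. */
struct modrm_disp8 decode_modrm_disp8(uint8_t modrm, uint8_t disp)
{
    struct modrm_disp8 d;

    d.mod = modrm >> 6;
    d.reg = (modrm >> 3) & 7;
    d.rm = modrm & 7;
    d.disp = (int8_t)disp;
    return d;
}

/* Address touched by [base + disp8]; the displacement is sign-extended. */
uint64_t effective_address(uint64_t base, int8_t disp)
{
    return base + (uint64_t)(int64_t)disp;
}
```

Feeding in the bytes 0x73 and 0x38 from the trace, with a base of 0 for RBX, gives the same address the oops reports. This suggests a NULL structure pointer whose member at offset 0x38 was read; which structure that is cannot be told from the bytes alone.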



2008-08-14 15:51:13

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> I'm seeing the following error message when I hotunplug and replug
> a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> immediately afterwards.

Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
please?


> ------------[ cut here ]------------
> WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> sysfs: duplicate filename 'machinecheck4' can not be created
> Modules linked in: cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table sr_mod af_packet button battery ac loop dm_mod usb_storage usbhid ff_memless tg3 libphy ide_pci_generic shpchp ehci_hcd i2c_piix4 i2c_core pci_hotplug ohci_hcd usbcore ide_cd_mod cdrom floppy mptctl ext3 jbd edd fan thermal processor mptsas mptscsih mptbase scsi_transport_sas sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core
> Pid: 4838, comm: bash Not tainted 2.6.27-rc2-git5-pn_test #2
>
> Call Trace:
> [<ffffffff8023194f>] warn_slowpath+0xb4/0xde
> [<ffffffff80304cf7>] rb_insert_color+0x61/0xda
> [<ffffffff80304cf7>] rb_insert_color+0x61/0xda
> [<ffffffff80306c27>] vsnprintf+0x568/0x5b1
> [<ffffffff80226c3f>] hrtick_start_fair+0x10d/0x171
> [<ffffffff80301b45>] idr_get_empty_slot+0x164/0x243
> [<ffffffff80301d1a>] ida_get_new_above+0xf6/0x182
> [<ffffffff802a038a>] find_inode+0x28/0x6d
> [<ffffffff802d272c>] sysfs_ilookup_test+0x0/0xf
> [<ffffffff802d2928>] sysfs_find_dirent+0x1b/0x2f
> [<ffffffff802d29de>] sysfs_add_one+0x33/0x39
> [<ffffffff802d2ee3>] create_dir+0x4f/0x87
> [<ffffffff802d2f50>] sysfs_create_dir+0x35/0x4a
> [<ffffffff80302757>] kobject_get+0x12/0x17
> [<ffffffff80302890>] kobject_add_internal+0xcf/0x18a
> [<ffffffff80302a66>] kobject_init_and_add+0x5b/0x68
> [<ffffffff8022d5ab>] set_cpus_allowed_ptr+0x119/0x126
> [<ffffffff8038397a>] sysdev_register+0x5a/0xb5
> [<ffffffff804172df>] mce_create_device+0xb4/0x156
> [<ffffffff804173ac>] mce_cpu_callback+0x2b/0x9b
> [<ffffffff8041fb7d>] notifier_call_chain+0x29/0x4c
> [<ffffffff8041a2e7>] _cpu_up+0xc8/0x102
> [<ffffffff8041a375>] cpu_up+0x54/0x77
> [<ffffffff8040ffee>] store_online+0x43/0x67
> [<ffffffff802d20a1>] sysfs_write_file+0xd2/0x110
> [<ffffffff8028f1a5>] vfs_write+0xad/0x156
> [<ffffffff8028f692>] sys_write+0x45/0x6e
> [<ffffffff8020bdeb>] system_call_fastpath+0x16/0x1b
>
> ---[ end trace 48036b92036180e0 ]---
> kobject_add_internal failed for machinecheck4 with -EEXIST, don't try to register things with the same name in the same directory.
> Pid: 4838, comm: bash Tainted: G W 2.6.27-rc2-git5-pn_test #2
>
> Call Trace:
> [<ffffffff8030290c>] kobject_add_internal+0x14b/0x18a
> [<ffffffff80302a66>] kobject_init_and_add+0x5b/0x68
> [<ffffffff8022d5ab>] set_cpus_allowed_ptr+0x119/0x126
> [<ffffffff8038397a>] sysdev_register+0x5a/0xb5
> [<ffffffff804172df>] mce_create_device+0xb4/0x156
> [<ffffffff804173ac>] mce_cpu_callback+0x2b/0x9b
> [<ffffffff8041fb7d>] notifier_call_chain+0x29/0x4c
> [<ffffffff8041a2e7>] _cpu_up+0xc8/0x102
> [<ffffffff8041a375>] cpu_up+0x54/0x77
> [<ffffffff8040ffee>] store_online+0x43/0x67
> [<ffffffff802d20a1>] sysfs_write_file+0xd2/0x110
> [<ffffffff8028f1a5>] vfs_write+0xad/0x156
> [<ffffffff8028f692>] sys_write+0x45/0x6e
> [<ffffffff8020bdeb>] system_call_fastpath+0x16/0x1b
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> IP: [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
> PGD 22a87c067 PUD 22c9ef067 PMD 0
> Oops: 0000 [1] SMP
> CPU 1
> Modules linked in: cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table sr_mod af_packet button battery ac loop dm_mod usb_storage usbhid ff_memless tg3 libphy ide_pci_generic shpchp ehci_hcd i2c_piix4 i2c_core pci_hotplug ohci_hcd usbcore ide_cd_mod cdrom floppy mptctl ext3 jbd edd fan thermal processor mptsas mptscsih mptbase scsi_transport_sas sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core
> Pid: 4838, comm: bash Tainted: G W 2.6.27-rc2-git5-pn_test #2
> RIP: 0010:[<ffffffff802d2a11>] [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
> RSP: 0018:ffff88022d0dbb38 EFLAGS: 00010246
> RAX: ffff880028047710 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 00000acb2c4dfcc0 RSI: 0000000000000000 RDI: ffff88022fc3b000
> RBP: ffff88022d0dbb58 R08: 0000000000000080 R09: ffff88022d0dba58
> R10: ffff880028047710 R11: 00000acb2c4dfcc0 R12: 00000000fffffff4
> R13: 0000000000000000 R14: ffff88022d0dbbb0 R15: 0000000000000004
> FS: 00007f106d7496d0(0000) GS:ffff88022f0938c0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000038 CR3: 000000022c8d0000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process bash (pid: 4838, threadinfo ffff88022d0da000, task ffff88022d6b4d90)
> Stack: 00000000fffffff4 ffff88022e3dfc80 ffff88022dc0c870 ffffffff802d2ed8
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> ffff88022e3dfc80 ffff88022d0dbd18 00000000fffffffe 0000000000000004
> Call Trace:
> [<ffffffff802d2ed8>] ? create_dir+0x44/0x87
> [<ffffffff802d2f50>] ? sysfs_create_dir+0x35/0x4a
> [<ffffffff80302757>] ? kobject_get+0x12/0x17
> [<ffffffff80302890>] ? kobject_add_internal+0xcf/0x18a
> [<ffffffff80302d57>] ? kobject_add+0x74/0x7c
> [<ffffffff8041bfc5>] ? thread_return+0x3e/0xa3
> [<ffffffff8030254e>] ? kobject_init_internal+0x12/0x2c
> [<ffffffff803025c6>] ? kobject_init+0x41/0x69
> [<ffffffff80302619>] ? kobject_create+0x2b/0x30
> [<ffffffff80302d8d>] ? kobject_create_and_add+0x2e/0x5b
> [<ffffffff804178c3>] ? threshold_create_device+0x1aa/0x32a
> [<ffffffff80417aa2>] ? threshold_cpu_callback+0x5f/0x2ca
> [<ffffffff804172df>] ? mce_create_device+0xb4/0x156
> [<ffffffff8041fb7d>] ? notifier_call_chain+0x29/0x4c
> [<ffffffff8041a2e7>] ? _cpu_up+0xc8/0x102
> [<ffffffff8041a375>] ? cpu_up+0x54/0x77
> [<ffffffff8040ffee>] ? store_online+0x43/0x67
> [<ffffffff802d20a1>] ? sysfs_write_file+0xd2/0x110
> [<ffffffff8028f1a5>] ? vfs_write+0xad/0x156
> [<ffffffff8028f692>] ? sys_write+0x45/0x6e
> [<ffffffff8020bdeb>] ? system_call_fastpath+0x16/0x1b
>
>
> Code: c0 b9 08 00 00 00 fc 53 48 89 fd 48 89 f3 48 83 ec 08 f3 ab 48 89 75 00 48 c7 c7 e0 e7 51 80 e8 c2 9c 14 00 48 8b 3d 57 45 37 00 <48> 8b 73 38 48 89 d9 48 c7 c2 2c 27 2d 80 e8 1e db fc ff 48 85
> RIP [<ffffffff802d2a11>] sysfs_addrm_start+0x2d/0x99
> RSP <ffff88022d0dbb38>
> CR2: 0000000000000038
> ---[ end trace 48036b92036180e0 ]---

2008-08-14 16:07:01

by Langsdorf, Mark

Subject: RE: Warning in during hotplug on 2.6.27-rc2-git5

> On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > I'm seeing the following error message when I hotunplug and replug
> > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > immediately afterwards.
>
> Hm, it seems that MCE is involved somehow. Andi, can you
> have a look at this, please?

Disabling MCE did eliminate the error message.

-Mark Langsdorf
Operating System Research Center
AMD

2008-08-14 16:33:40

by Andi Kleen

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > I'm seeing the following error message when I hotunplug and replug
> > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > immediately afterwards.
>
> Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> please?

FWIW the mce code here actually hasn't changed for a long time.


> > ------------[ cut here ]------------
> > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > sysfs: duplicate filename 'machinecheck4' can not be created

The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
is not properly balanced.

-Andi

2008-08-14 18:36:14

by Langsdorf, Mark

Subject: RE: Warning in during hotplug on 2.6.27-rc2-git5

> On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> > On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > > I'm seeing the following error message when I hotunplug and replug
> > > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > > immediately afterwards.
> >
> > Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> > please?
>
> FWIW the mce code here actually hasn't changed for a long time.
>
>
> > > ------------[ cut here ]------------
> > > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > > sysfs: duplicate filename 'machinecheck4' can not be created
>
> The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> is not properly balanced.

I'm still seeing it on 2.6.27-rc2, even with the
patch here http://lkml.org/lkml/2008/7/30/171 and the
wbinvd_halt code patch applied. Maybe something else
broke in some of the recent hotplug changes?

-Mark Langsdorf
Operating System Research Center
AMD

2008-08-16 19:25:29

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Thursday, 14 of August 2008, Langsdorf, Mark wrote:
> > On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > > > I'm seeing the following error message when I hotunplug and replug
> > > > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > > > immediately afterwards.
> > >
> > > Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> > > please?
> >
> > FWIW the mce code here actually hasn't changed for a long time.
> >
> >
> > > > ------------[ cut here ]------------
> > > > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > > > sysfs: duplicate filename 'machinecheck4' can not be created
> >
> > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > is not properly balanced.
>
> I'm still seeing it on 2.6.27-rc2, even with the
> patch here http://lkml.org/lkml/2008/7/30/171 and the
> wbinvd_halt code patch applied. Maybe something else
> broke in some of the recent hotplug changes?

My guess is that MCE does something that is not allowed by sysfs any more.

Thanks,
Rafael

2008-08-16 20:21:27

by Greg KH

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Sat, Aug 16, 2008 at 09:28:24PM +0200, Rafael J. Wysocki wrote:
> On Thursday, 14 of August 2008, Langsdorf, Mark wrote:
> > > On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> > > > On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > > > > I'm seeing the following error message when I hotunplug and replug
> > > > > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > > > > immediately afterwards.
> > > >
> > > > Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> > > > please?
> > >
> > > FWIW the mce code here actually hasn't changed for a long time.
> > >
> > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > > > > sysfs: duplicate filename 'machinecheck4' can not be created
> > >
> > > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > > is not properly balanced.
> >
> > I'm still seeing it on 2.6.27-rc2, even with the
> > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > wbinvd_halt code patch applied. Maybe something else
> > broke in some of the recent hotplug changes?
>
> My guess is that MCE does something that is not allowed by sysfs any more.

Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.

thanks,

greg k-h

2008-08-16 21:33:19

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Saturday, 16 of August 2008, Greg KH wrote:
> On Sat, Aug 16, 2008 at 09:28:24PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, 14 of August 2008, Langsdorf, Mark wrote:
> > > > On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> > > > > On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > > > > > I'm seeing the following error message when I hotunplug and replug
> > > > > > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > > > > > immediately afterwards.
> > > > >
> > > > > Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> > > > > please?
> > > >
> > > > FWIW the mce code here actually hasn't changed for a long time.
> > > >
> > > >
> > > > > > ------------[ cut here ]------------
> > > > > > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > > > > > sysfs: duplicate filename 'machinecheck4' can not be created
> > > >
> > > > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > > > is not properly balanced.
> > >
> > > I'm still seeing it on 2.6.27-rc2, even with the
> > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > wbinvd_halt code patch applied. Maybe something else
> > > broke in some of the recent hotplug changes?
> >
> > My guess is that MCE does something that is not allowed by sysfs any more.
>
> Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.

Hmm. Mark, what kind of a system is this? Is it a 2 quad-core CPU system
or similar ('machinecheck4' in your trace seems to imply something like this)?

Rafael

2008-08-17 02:21:59

by Andi Kleen

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

> > > I'm still seeing it on 2.6.27-rc2, even with the
> > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > wbinvd_halt code patch applied. Maybe something else
> > > broke in some of the recent hotplug changes?
> >
> > My guess is that MCE does something that is not allowed by sysfs any more.
>
> Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.

mce hasn't either in this regard. My current theory is that the CPU
up/down notifiers are not balanced anymore (as in duplicated up events)

Would need to add printk to verify that.

-Andi
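The "duplicated up events" theory can be pictured with a toy model. This is a userspace sketch, not the kernel's MCE code: every name in it is invented, and the boolean table merely stands in for the per-CPU `machinecheckN` sysfs entries. It shows only that a create step which rejects duplicates returns -EEXIST as soon as two online events arrive with no dead event in between.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Toy stand-in for the per-CPU "machinecheckN" entries. */
static bool model_registered[MODEL_NR_CPUS];

enum model_event { MODEL_CPU_ONLINE, MODEL_CPU_DEAD };

/* Create path: refuses a second registration for the same CPU. */
int model_create_device(unsigned int cpu)
{
    if (cpu >= MODEL_NR_CPUS)
        return -EINVAL;
    if (model_registered[cpu])
        return -EEXIST;
    model_registered[cpu] = true;
    return 0;
}

/* Remove path: forgets the entry so a later create can succeed. */
void model_remove_device(unsigned int cpu)
{
    if (cpu < MODEL_NR_CPUS)
        model_registered[cpu] = false;
}

/* Dispatches hotplug events the way a notifier callback would. */
int model_cpu_callback(enum model_event ev, unsigned int cpu)
{
    switch (ev) {
    case MODEL_CPU_ONLINE:
        return model_create_device(cpu);
    case MODEL_CPU_DEAD:
        model_remove_device(cpu);
        return 0;
    }
    return -EINVAL;
}
```

In this model a balanced ONLINE, DEAD, ONLINE sequence for CPU 4 succeeds, while an extra ONLINE fails with -EEXIST. A skipped or ineffective remove would produce the same failure, so the model does not distinguish between those causes.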

2008-08-17 02:24:31

by Andi Kleen

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

> I'm still seeing it on 2.6.27-rc2, even with the
> patch here http://lkml.org/lkml/2008/7/30/171 and the
> wbinvd_halt code patch applied. Maybe something else
> broke in some of the recent hotplug changes?

Mark, can you please put a printk

printk("mce_cpu_callback action %lu cpu %u\n", action, cpu);

into arch/x86/kernel/cpu/mcheck/mce_64.c:mce_cpu_callback()
and post the log including the warning?

That would verify my theory.

-Andi
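The printk above prints the notifier action as a bare number, so a small decoder can help when reading the resulting log. The sketch below is userspace C with invented names. The numeric values are the CPU notifier constants as I recall them from include/linux/notifier.h of that era; treat them as an assumption and check them against the tree being tested.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed values; verify against include/linux/notifier.h in the tested tree. */
#define SK_CPU_ONLINE        0x0002UL
#define SK_CPU_UP_PREPARE    0x0003UL
#define SK_CPU_UP_CANCELED   0x0004UL
#define SK_CPU_DOWN_PREPARE  0x0005UL
#define SK_CPU_DOWN_FAILED   0x0006UL
#define SK_CPU_DEAD          0x0007UL
#define SK_CPU_DYING         0x0008UL
#define SK_CPU_POSTDEAD      0x0009UL
#define SK_CPU_TASKS_FROZEN  0x0010UL  /* flag OR-ed in during suspend/resume */

/* Map an action value to its name, ignoring the "frozen" flag bit. */
const char *sk_action_name(unsigned long action)
{
    switch (action & ~SK_CPU_TASKS_FROZEN) {
    case SK_CPU_ONLINE:       return "CPU_ONLINE";
    case SK_CPU_UP_PREPARE:   return "CPU_UP_PREPARE";
    case SK_CPU_UP_CANCELED:  return "CPU_UP_CANCELED";
    case SK_CPU_DOWN_PREPARE: return "CPU_DOWN_PREPARE";
    case SK_CPU_DOWN_FAILED:  return "CPU_DOWN_FAILED";
    case SK_CPU_DEAD:         return "CPU_DEAD";
    case SK_CPU_DYING:        return "CPU_DYING";
    case SK_CPU_POSTDEAD:     return "CPU_POSTDEAD";
    default:                  return "unknown";
    }
}

/* Render one log entry in a readable form; returns snprintf's result. */
int sk_format_line(char *buf, size_t len, unsigned long action,
                   unsigned int cpu)
{
    const char *frozen = (action & SK_CPU_TASKS_FROZEN) ? ", frozen" : "";

    return snprintf(buf, len, "action %lu (%s%s) cpu %u",
                    action, sk_action_name(action), frozen, cpu);
}
```

With this mapping, two "action 2" lines for the same CPU with no "action 7" between them would be the duplicated online event the theory predicts.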

2008-08-17 17:22:39

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Sunday, 17 of August 2008, Andi Kleen wrote:
> > > > I'm still seeing it on 2.6.27-rc2, even with the
> > > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > > wbinvd_halt code patch applied. Maybe something else
> > > > broke in some of the recent hotplug changes?
> > >
> > > > My guess is that MCE does something that is not allowed by sysfs any more.
> >
> > Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.
>
> mce hasn't either in this regard. My current theory is that the CPU
> up/down notifiers are not balanced anymore (as in duplicated up events)

It doesn't look like this is the case. Moreover, had that been the case, we'd
have had many reports from people doing suspend/hibernation, but it doesn't
happen.

I think that cpu_down() fails for some reason and that causes the subsequent
onlining to fail. I'd like to find out what's the root cause of that.

Thanks,
Rafael

2008-08-17 19:20:14

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Sunday, 17 of August 2008, Rafael J. Wysocki wrote:
> On Sunday, 17 of August 2008, Andi Kleen wrote:
> > > > > I'm still seeing it on 2.6.27-rc2, even with the
> > > > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > > > wbinvd_halt code patch applied. Maybe something else
> > > > > broke in some of the recent hotplug changes?
> > > >
> > > > My guess is that MCE does something that is not allowed by sysfs any more.
> > >
> > > Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.
> >
> > mce hasn't either in this regard. My current theory is that the CPU
> > up/down notifiers are not balanced anymore (as in duplicated up events)
>
> It doesn't look like this is the case. Moreover, had that been the case, we'd
> have had many reports from people doing suspend/hibernation, but it doesn't
> happen.
>
> I think that cpu_down() fails for some reason and that causes the subsequent
> onlining to fail.

Well, no. If my understanding of the CPU hotplug code is correct, this is not
possible.

The next possibility is that for some 'i' mce_attributes[i] is NULL, although
there are non-NULL values for some j > i. In that case, mce_remove_device()
would fail to remove device_mce for given CPU and the subsequent
mce_create_device() would cause the observed failure.

Thanks,
Rafael
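The NULL-hole scenario described above can be illustrated in isolation. The sketch is userspace C with invented names and placeholder attribute strings; that the MCE removal path walks its table as a NULL-terminated array is an assumption taken from the message, not something checked against the source here. It shows only that such a walk stops at the first NULL and never reaches later non-NULL entries.

```c
#include <stddef.h>

/* Counts entries the way a NULL-terminated walk sees them. */
size_t count_until_null(const char *const *attrs)
{
    size_t n = 0;

    while (attrs[n] != NULL)
        n++;
    return n;
}

/* Counts every non-NULL entry in a table whose length is known. */
size_t count_non_null(const char *const *attrs, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (attrs[i] != NULL)
            n++;
    return n;
}
```

For a six-slot table holding "attr_a", "attr_b", NULL, "attr_c", "attr_d", NULL, the first function reports 2 while the second reports 4. Any cleanup driven by the first kind of walk would skip the two entries after the hole.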

2008-08-17 21:50:36

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Thursday, 14 of August 2008, Langsdorf, Mark wrote:
> > On Thu, Aug 14, 2008 at 05:53:57PM +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, 13 of August 2008, Mark Langsdorf wrote:
> > > > I'm seeing the following error message when I hotunplug and replug
> > > > a cpu in 2.6.27-rc2-git5. The system becomes unstable almost
> > > > immediately afterwards.
> > >
> > > Hm, it seems that MCE is involved somehow. Andi, can you have a look at this,
> > > please?
> >
> > FWIW the mce code here actually hasn't changed for a long time.
> >
> >
> > > > ------------[ cut here ]------------
> > > > WARNING: at fs/sysfs/dir.c:463 sysfs_add_one+0x33/0x39()
> > > > sysfs: duplicate filename 'machinecheck4' can not be created
> >
> > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > is not properly balanced.
>
> I'm still seeing it on 2.6.27-rc2, even with the
> patch here http://lkml.org/lkml/2008/7/30/171 and the
> wbinvd_halt code patch applied. Maybe something else
> broke in some of the recent hotplug changes?

Mark, have you tried to test with commit 34ae7f35a21694aa5cb8829dc5142c39d73d6ba0
(your "preregister support for powernow-k8" patch) reverted and MCE enabled?

Rafael

2008-08-18 13:11:51

by Langsdorf, Mark

Subject: RE: Warning in during hotplug on 2.6.27-rc2-git5

> > > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > > is not properly balanced.
> >
> > I'm still seeing it on 2.6.27-rc2, even with the
> > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > wbinvd_halt code patch applied. Maybe something else
> > broke in some of the recent hotplug changes?
>
> Mark, have you tried to test with commit
> 34ae7f35a21694aa5cb8829dc5142c39d73d6ba0
> (your "preregister support for powernow-k8" patch) reverted
> and MCE enabled?

Yes, it's a failure on a clean 2.6.27-rc2 installation.

-Mark Langsdorf
Operating System Research Center
AMD

2008-08-18 14:35:38

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Monday, 18 of August 2008, Langsdorf, Mark wrote:
> > > > The only way I could see that happening is if CPU_DEAD/CPU_ONLINE
> > > > is not properly balanced.
> > >
> > > I'm still seeing it on 2.6.27-rc2, even with the
> > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > wbinvd_halt code patch applied. Maybe something else
> > > broke in some of the recent hotplug changes?
> >
> > Mark, have you tried to test with commit
> > 34ae7f35a21694aa5cb8829dc5142c39d73d6ba0
> > (your "preregister support for powernow-k8" patch) reverted
> > and MCE enabled?
>
> Yes, it's a failure on a clean 2.6.27-rc2 installation.

Ah, the commit was done after -rc2, sorry.

I have a couple of questions, then:
- Does it fail identically for all CPUs or for CPU4 and above only?
- Did previous kernels, most importantly 2.6.26, work correctly?

Also, can you please attach dmesg output, including the offlining of a CPU
that would later lead to the problem with cpu_up(), to the Bugzilla entry at
http://bugzilla.kernel.org/show_bug.cgi?id=11337 ?

Thanks,
Rafael
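The offline/online cycle requested above goes through the per-CPU `online` file in sysfs, which is the `store_online` to `cpu_up` path visible in the original trace. The sketch below is one way to drive that cycle from a small C program instead of a shell. The function names are invented, it has to run as root on a real system, and it does nothing to capture dmesg.

```c
#include <stddef.h>
#include <stdio.h>

/* Builds the sysfs control path for one CPU; returns -1 if buf is too small. */
int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
    int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Writes "0" or "1" to the given control file; returns 0 on success. */
int write_online_state(const char *path, int online)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(online ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)     /* a failed write may only surface at flush time */
        ok = 0;
    return ok ? 0 : -1;
}

/* Takes one CPU offline and then brings it back online. */
int cycle_cpu(unsigned int cpu)
{
    char path[64];

    if (cpu_online_path(path, sizeof path, cpu) != 0)
        return -1;
    if (write_online_state(path, 0) != 0)
        return -1;
    return write_online_state(path, 1);
}
```

Calling `cycle_cpu(4)` would exercise the CPU named in the warning. Since the thread reports that the system becomes unstable afterwards, this is only sensible on a test machine.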

2008-08-18 15:33:24

by Langsdorf, Mark

Subject: RE: Warning in during hotplug on 2.6.27-rc2-git5

> > > My guess is that MCE does something that is not allowed by sysfs any more.
> >
> > Hm, sysfs hasn't changed any in 2.6.27-rcX that I know of.
>
> Hmm. Mark, what kind of a system is this? Is it a 2
> quad-core CPU system or similar ('machinecheck4' in your
> trace seems to imply something like this)?

It's a commercial Tyan 2 socket motherboard with a modified
BIOS to support Family 10h processors. It is a dual socket
with quad-cores in it.

-Mark Langsdorf
Operating System Research Center
AMD

2008-08-18 16:29:37

by Langsdorf, Mark

Subject: RE: Warning in during hotplug on 2.6.27-rc2-git5

> > > > I'm still seeing it on 2.6.27-rc2, even with the
> > > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > > wbinvd_halt code patch applied. Maybe something else
> > > > broke in some of the recent hotplug changes?
>
> I have a couple of questions, then:
> - Does it fail identically for all CPUs or for CPU4 and above only?
> - Did previous kernels, most importantly 2.6.26, work correctly?

It turns out 2.6.26 also fails. It is only failing for CPU4+,
I'm not sure why that would be significant.

> Also, can you please attach dmesg output, including the
> offlining of a CPU that would later lead to the problem
> with cpu_up(), to the Bugzilla entry at
> http://bugzilla.kernel.org/show_bug.cgi?id=11337 ?

I did.

-Mark Langsdorf
Operating System Research Center
AMD

2008-08-18 16:39:22

by Rafael J. Wysocki

Subject: Re: Warning in during hotplug on 2.6.27-rc2-git5

On Monday, 18 of August 2008, Langsdorf, Mark wrote:
> > > > > I'm still seeing it on 2.6.27-rc2, even with the
> > > > > patch here http://lkml.org/lkml/2008/7/30/171 and the
> > > > > wbinvd_halt code patch applied. Maybe something else
> > > > > broke in some of the recent hotplug changes?
> >
> > I have a couple of questions, then:
> > - Does it fail identically for all CPUs or for CPU4 and above only?
> > - Did previous kernels, most importantly 2.6.26, work correctly?
>
> It turns out 2.6.26 also fails.

OK, so I'll drop the bug from the list of recent regressions.

> It is only failing for CPU4+, I'm not sure why that would be significant.

Because I cannot reproduce it on a single-socket AMD quad-core. :-)

> > Also, can you please attach dmesg output, including the
> > offlining of a CPU that would later lead to the problem
> > with cpu_up(), to the Bugzilla entry at
> > http://bugzilla.kernel.org/show_bug.cgi?id=11337 ?
>
> I did.

Thanks. I think we can further debug this using Bugzilla if you don't mind.

Best,
Rafael