2004-06-08 22:57:41

by Norman Weathers

Subject: 2.4.26 SMP lockup problem


Hello All.

During a round of kernel updates, I ran into a very interesting
problem. I have several hundred nodes in a cluster that I am currently
updating from kernel 2.4.21 to 2.4.26. These nodes are all running RedHat
7.3 (old, I know, but this is the OS that our software currently works on).
During this round of updates, I have updated about 150 PIII 800 MHz nodes,
all of which are currently in use and work just fine (1 GB RAM, e100
ethernet driver, IDE drives, fairly generic). I also have a few PIII 1260
nodes (Tyan motherboard, 2 GB RAM, e100 ethernet driver, again fairly
generic) that have been updated and run fine. I have even started
testing fairly new P4 3060 IBM blades. They also seem to work just fine.

Now to the problem. I have several hundred Tyan Thunder motherboards (older
AMD 760MP chipset). I have rebooted ~200 of these nodes with the new 2.4.26
kernel, and about half of them have suffered a hard lockup during
bootup. The lockup is hard enough that I cannot even issue SysRq keys
over serial or at the local keyboard to make them reboot or output a
trace. These nodes have 2 GB of RAM, dual 3Com 100 Mb NICs, and IDE drives.
Again, fairly generic for a cluster. I had a vanilla + trond-patched 2.4.21
kernel running on these boxes just fine. (The new 2.4.26 kernel also has the
trond patches for 2.4.26.) Has anyone seen this happen?

Here is some info on the kernel config for the 2.4.26 kernel:

#
# Automatically generated by make menuconfig: don't edit
#
CONFIG_X86=y
# CONFIG_SBUS is not set
CONFIG_UID16=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODVERSIONS=y
CONFIG_KMOD=y

#
# Processor type and features
#
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MELAN is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
# CONFIG_EDD is not set
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
CONFIG_HIGHIO=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_NR_CPUS=32
# CONFIG_X86_NUMA is not set
# CONFIG_X86_TSC_DISABLE is not set
CONFIG_X86_TSC=y
CONFIG_HAVE_DEC_LOCK=y

#
# General setup
#
CONFIG_NET=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_ISA=y
CONFIG_PCI_NAMES=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
CONFIG_HOTPLUG=y
---- Rest cut -------

I pass the noapic option on the lilo boot prompt line; otherwise we get
the APIC error after about a month or two in service.

We tried to make the kernel somewhat generic because we want this kernel to
boot on the largest hardware base possible. Is there something obvious that
I have missed? (I used these same options on the 2.4.21 kernel that we ran on
all of the nodes, with the exception of the 64 GB memory option.)

Any help would be appreciated. If any dumps need to be made (or attempted),
great, as I have about 200 nodes right now that are candidates for
testing.

Please contact me at the email listed below, as I am not on the list.


Email: [email protected]


Thanks in advance.

--

Norman Weathers
SIP Linux Cluster
TCE UNIX
ConocoPhillips
Houston, TX

Office: LO2003
Phone: ETN 639-2727
or (281) 293-2727


2004-06-08 23:07:44

by Steven Dake

Subject: Re: 2.4.26 SMP lockup problem

Norman,

A kernel traceback of the lockup would be helpful.

To do this, add nmi_watchdog=1 to the kernel command line (lilo or
pxe boot append option). This will cause the NMI watchdog handler to
go off when you hit your deadlock.

Run the output through ksymoops and post that to the list.

Thanks
-steve

2004-06-08 23:30:38

by Willy Tarreau

Subject: Re: 2.4.26 SMP lockup problem

Hi,

Do you have ACPI enabled? I don't see it in your partial config. I believe
it was changed in 2.4.22.

Regards,
Willy

2004-06-09 13:15:02

by Norman Weathers

Subject: Re: 2.4.26 SMP lockup problem


That's the funny thing about this lockup... I can't even get a traceback from
the serial console. Here is my boot command line:

auto BOOT_IMAGE=2.4.26 ro root=301 noapic console=tty0 console=ttyS0,115200n8
panic=60 nmi_watchdog=1

I have the watchdog turned on (built in to the kernel, not a module), and have
been using the above command line, but still get no oops and cannot get a
traceback...

On Tuesday 08 June 2004 06:06 pm, Steven Dake wrote:
> Norman,
>
> A kernel traceback of the lockup would be helpful.
>
> To do this, add nmi_watchdog=1 to the kernel command line (lilo or
> pxe boot append option). This will cause the NMI watchdog handler to
> go off when you hit your deadlock.
>
> Run the output through ksymoops and post that to the list.
>
> Thanks
> -steve

2004-06-09 13:15:01

by Norman Weathers

Subject: Re: 2.4.26 SMP lockup problem


No, ACPI is disabled in the kernel on this build.

On Tuesday 08 June 2004 06:30 pm, Willy Tarreau wrote:
> Hi,
>
> do you have ACPI enabled, I don't see it in your partial config. I believe
> it was changed in 2.4.22.
>
> Regards,
> Willy

2004-06-09 13:24:48

by Norman Weathers

Subject: Re: 2.4.26 SMP lockup problem




On Tuesday 08 June 2004 08:00 pm, Mark Hahn wrote:
> > CONFIG_X86_LOCAL_APIC=y
>
> that's the first thing I'd try turning off...
>

I have it disabled on the lilo prompt with noapic. If that is not enough to
keep it disabled on these nodes, then I will turn it off completely.

> > make), great as I have about 200 nodes right now that are candidates for
> > testing.
>
> heh. I'm a cluster admin myself, much smaller now, but looking
> at adding 512-768 duals by the end of the year. gulp!

Just went through that with a series of IBM blades. Don't envy ya.

2004-06-09 14:05:19

by Sam Gill

Subject: Re: 2.4.26 SMP lockup problem


I would run diff on
the config from the working chipset against the
current config you are trying to use.

If you have the wrong chipset/processor selected
it could cause you problems similar to the ones
you are having.
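
A minimal sketch of that comparison, assuming both configs are available
as plain files and a POSIX shell with mktemp is at hand. The function name
config_diff is made up for illustration. It strips the section banners so
that only real option lines are compared:

```sh
#!/bin/sh
# Compare only the option lines of two kernel .config files.
# Output lines start with "<" (only in the first file) or ">" (only in the second).
config_diff() {
    old_cfg=$1
    new_cfg=$2
    tmp_old=$(mktemp) || return 2
    tmp_new=$(mktemp) || { rm -f "$tmp_old"; return 2; }

    # Keep "CONFIG_FOO=y" and "# CONFIG_FOO is not set"; drop banners and blank lines.
    grep -E '^(# )?CONFIG_' "$old_cfg" | sort > "$tmp_old"
    grep -E '^(# )?CONFIG_' "$new_cfg" | sort > "$tmp_new"

    # diff exits non-zero when the files differ, which is the expected case here.
    diff "$tmp_old" "$tmp_new" | grep '^[<>]' || true

    rm -f "$tmp_old" "$tmp_new"
}
```

Something like "config_diff config-2.4.21 config-2.4.26" (both file names
hypothetical) would then list every option that changed between the two
builds. The processor family lines are where a mismatch of the kind
described above would show up.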


-Sam


2004-06-09 20:46:03

by Willy Tarreau

Subject: Re: 2.4.26 SMP lockup problem

Hi,

On Wed, Jun 09, 2004 at 08:09:08AM -0500, Norman Weathers wrote:
>
> That's the funny thing about this lockup... I can't even get a traceback from
> the serial console. Here is my boot command line:
>
> auto BOOT_IMAGE=2.4.26 ro root=301 noapic console=tty0 console=ttyS0,115200n8
> panic=60 nmi_watchdog=1
>
> I have the watchdog turned on (built in to the kernel, not a module), and have
> been using the above command line, but still get no oops and cannot get a
> traceback...

For an as yet unknown reason, my board (asus a7m266d) ignores nmi_watchdog=1
but is fine with 2. Might be worth trying.
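
On a node that did boot, one way to see whether a given nmi_watchdog
setting is being honoured is to check that the NMI counters in
/proc/interrupts keep increasing. A rough sketch, assuming a POSIX shell
and awk; the function names are made up for illustration, and this only
helps on nodes that survive bootup:

```sh
#!/bin/sh
# Sum the per-CPU counters on the "NMI:" line of /proc/interrupts
# (or of a saved copy passed as $1). Non-numeric fields are skipped.
nmi_total() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) total += $i
             found = 1
         }
         END { if (found) print total + 0; else exit 1 }' "${1:-/proc/interrupts}"
}

# Succeeds only if the NMI total grew during the interval (default 5 s).
nmi_watchdog_ticking() {
    src=${1:-/proc/interrupts}
    interval=${2:-5}
    before=$(nmi_total "$src") || return 2
    sleep "$interval"
    after=$(nmi_total "$src") || return 2
    [ "$after" -gt "$before" ]
}
```

Running nmi_watchdog_ticking after booting with nmi_watchdog=1 and again
with nmi_watchdog=2 should show which mode, if either, is generating
NMIs on that board.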

Regards
Willy

2004-06-10 10:09:19

by Herbert Xu

Subject: Re: 2.4.26 SMP lockup problem

Norman Weathers <[email protected]> wrote:
>
> On Tuesday 08 June 2004 08:00 pm, Mark Hahn wrote:
>> > CONFIG_X86_LOCAL_APIC=y
>>
>> that's the first thing I'd try turning off...
>
> I have it disabled on the lilo prompt with noapic. If that is not enough to
> keep it disabled on these nodes, then I will turn it off completely.

You want nolapic to disable the local APIC.
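
To confirm which of the two options a node was actually booted with,
the running command line can be checked. A small sketch, assuming a POSIX
shell; has_boot_param is a hypothetical helper name, and it only tests
whether the word is present, not whether the kernel acted on it:

```sh
#!/bin/sh
# Return 0 if the boot parameter $1 appears on the kernel command line,
# either bare ("noapic") or with a value ("nmi_watchdog=1").
# The command line is taken from $2 if given, else from /proc/cmdline.
has_boot_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$param"|"$param"=*) return 0 ;;
        esac
    done
    return 1
}
```

With the command line posted earlier in the thread, has_boot_param noapic
succeeds and has_boot_param nolapic fails, since the match is on whole
words.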
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt