Hello,
We are having a very strange issue on some 64bit systems. We have a 32 node
cluster of EM64T's (supermicro boards). We are using our node restore
software to propagate a linux install onto them. We do a pxe boot to a
kernel and initrd image. The initrd has some config info, a basic root
filesystem, and a restore script. The kernel is passed init=/restore (the
restore script itself). The script runs dhcp, gets an ip, then nfs mounts
the master node of the cluster. The backup image is stored on the master
node's nfs mount. The script then applies a backed up partition table and
then mkfs's the partitions, mounts them, untars a backup tar to the drive,
and then makes it bootable with grub.
On these systems, we are getting ext2 errors from the initrd during the
untarring. Soon after, we start getting seg faults on random things (looks
like stuff caused by the still running dhcp client), and then a continuous
stream of segfaults on the restore script itself (restore[1]).
The systems being restored are dual em64t's with 2G of ram and 200G sata
drives. If we up the memory to 4G, the restores complete without error. If
we reduce down to 512M, the segfaults start at the mkfs stage instead of the
untar stage. We've tried different sata drives and controllers without
change. Switching to ide drives works. Switching to reiserfs instead of
ext3 for the destination drives works too. We've tried enabling the scsi
debug stuff as well as the jbd debug stuff for ext3 without getting any more
info. We enabled the kernel debug options as well. We've also tried using
the deprecated ide based sata drivers instead of the scsi based ones without
success. We have tried restoring to Intel's Jarell EM64T systems as well as
an Arima HDAMA opteron with the same errors. We've also tried adding swap
space ASAP in the initrd image.
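The observations above can be laid out as a small table to see which single change from the failing setup makes the restore succeed. This is only a sketch: the field names are illustrative, and it assumes the IDE and reiserfs tests were run on the same 2G configuration as the failing baseline, which the report does not state explicitly.

```python
# Observations transcribed from the report; field names are illustrative
# and not part of the restore software.
BASELINE = {"ram": "2G", "disk": "sata", "fs": "ext3"}

OBSERVATIONS = [
    # (configuration, did the restore complete?)
    ({"ram": "2G", "disk": "sata", "fs": "ext3"}, False),
    ({"ram": "4G", "disk": "sata", "fs": "ext3"}, True),
    ({"ram": "512M", "disk": "sata", "fs": "ext3"}, False),
    # Assumption: the next two were tested with the baseline 2G of RAM.
    ({"ram": "2G", "disk": "ide", "fs": "ext3"}, True),
    ({"ram": "2G", "disk": "sata", "fs": "reiserfs"}, True),
]


def changes_that_fix(baseline, observations):
    """Return the single-factor changes from the baseline that succeeded."""
    fixes = []
    for config, ok in observations:
        # Keep only the factors that differ from the failing baseline.
        diff = {k: v for k, v in config.items() if baseline[k] != v}
        if ok and len(diff) == 1:
            fixes.append(diff)
    return fixes


if __name__ == "__main__":
    for fix in changes_that_fix(BASELINE, OBSERVATIONS):
        print(fix)
```

Seen this way, more RAM, a different disk type, or a different filesystem each avoids the failure on its own, which suggests the trigger is the combination rather than any one component.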
This problem is really baffling us and we're not quite sure what to check
into next. Any ideas?
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
I forgot to mention the kernels that have been tried: 2.6.8.1, 2.6.11.7,
2.6.12-rc3, and a Red Hat 2.6.9.
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
On Iau, 2005-04-28 at 17:16, Rick Warner wrote:
> On these systems, we are getting ext2 errors from the initrd during the
> untarring. Soon after, we start getting seg faults on random things (looks
> like stuff caused by the still running dhcp client), and then a continuous
> stream of segfaults on the restore script itself (restore[1]).
This sounds almost like the pxe/boot code is still using ram that the
kernel has now used (eg the PXE layer or pxe booter forgot to close the
client and it's still DMAing happily into the kernel).
On Thursday 28 April 2005 06:48 pm, Alan Cox wrote:
> On Iau, 2005-04-28 at 17:16, Rick Warner wrote:
> > On these systems, we are getting ext2 errors from the initrd during the
> > untarring. Soon after, we start getting seg faults on random things
> > (looks like stuff caused by the still running dhcp client), and then a
> > continuous stream of segfaults on the restore script itself (restore[1]).
>
> This sounds almost like the pxe/boot code is still using ram that the
> kernel has now used (eg the PXE layer or pxe booter forgot to close the
> client and it's still DMAing happily into the kernel)
This morning, we tried updating to a newer pxelinux (3.07) and had the same
results. We then tried using etherboot with a mknbi tagged image and also
had the same results. Since we are getting the same problem on 3 different
motherboards with 2 different network adapters, I have not looked into
updating the boot rom on the nics. Should I?
What should I look into next? I have attached a serial console log of the
system and errors. The slashes and pipes you see are from a spinning bar
thing. If you want output that is cleaned up without that, I can provide it.
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
Just sending out a ping on this.. anyone have any ideas?
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
On Mer, 2005-05-04 at 20:29, Rick Warner wrote:
> Just sending out a ping on this.. anyone have any ideas?
The best I can think of right now going forward is to check:
  - 32 v 64 bit kernel
  - 32bit highmem aware kernel v 32bit non highmem (1GB limit) kernel
  - PATA boot v SATA boot v network boot
just to try and find any patterns.
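One way to make sure no combination is skipped is to enumerate the test matrix up front. The sketch below is an interpretation, not a prescribed procedure: it folds the two kernel comparisons into a single axis of three kernel builds and crosses it with the three boot methods. The labels are made up for illustration.

```python
from itertools import product

# Assumed axes, derived from the suggestions above; the labels are illustrative.
KERNELS = ["64-bit", "32-bit highmem", "32-bit non-highmem (1GB limit)"]
BOOT_METHODS = ["PATA", "SATA", "network"]


def build_matrix(kernels=KERNELS, boot_methods=BOOT_METHODS):
    """Return every (kernel, boot method) pair to be tested."""
    return list(product(kernels, boot_methods))


if __name__ == "__main__":
    for kernel, boot in build_matrix():
        # Record the outcome of each run by hand next to the printed line.
        print(f"{kernel:32s} {boot}")
```

Recording pass or fail against each printed pair should make any pattern across kernel type or boot method easier to spot.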
Rick Warner writes:
> This morning, we tried updating to a newer pxelinux (3.07) and had the same
> results. We then tried using etherboot with a mknbi tagged image and also
> had the same results. Since we are getting the same problem on 3 different
> motherboards with 2 different network adapters, I have not looked into
> updating the boot rom on the nics. Should I?
I remember I had memory corruption problems with an old version of
Etherboot a few years ago. The machines were mostly AMD K6 based,
network cards were SMC EPIC100 (Etherpower II) and/or RTL 8139.
Memtest86 (downloaded with Etherboot) complained about random errors.
I think Linux didn't show any such illness.
This was Etherboot 4.something. Upgrading to 5.something fixed the
problem.
I suspect you're using Etherboot newer than 4.x though. I'd probably
give memtest86 loaded from network a try.
--
Krzysztof Halasa
On Thursday 05 May 2005 05:37 pm, Krzysztof Halasa wrote:
> Rick Warner writes:
> > This morning, we tried updating to a newer pxelinux (3.07) and had the
> > same results. We then tried using etherboot with a mknbi tagged image
> > and also had the same results. Since we are getting the same problem on
> > 3 different motherboards with 2 different network adapters, I have not
> > looked into updating the boot rom on the nics. Should I?
>
> I remember I had memory corruption problems with an old version of
> Etherboot few years ago. The machines were mostly AMD K6 based,
> network cards were SMC EPIC100 (Etherpower II) and/or RTL 8139.
>
> Memtest86 (downloaded with Etherboot) complained about random errors.
> I think Linux didn't show any such illness.
> This was Etherboot 4.something. Upgrading to 5.something fixed the
> problem.
>
> I suspect you're using Etherboot newer than 4.x though. I'd probably
> give memtest86 loaded from network a try.
We actually run memtest86 from the network regularly. This cluster had run
dozens of passes of memtest booted over the network before doing any of this.
We also did an md5sum of our initrd from the network boot server, and then
had the initrd do an md5sum of itself on the network boot. They matched.
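For reference, the same kind of check can be sketched in Python. This is not the method used on the cluster (that was the md5sum tool); the function names are hypothetical, and the expected digest is assumed to come from running md5sum on the boot server.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large image is never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with one recorded elsewhere (e.g. on the server)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A match only shows that the image arrived intact at the time it was read; it says nothing about memory being overwritten later during the restore.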
Thanks for the advice though! I appreciate it.
--
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517