LinuxLists.cc - Update: SMP 2.2.15 #2 kernel, lock ups...

2000-10-30 23:27:03

Subject: Update: SMP 2.2.15 #2 kernel, lock ups...

I spent the entire day working on this problem... as per Alan's
suggestion, I attempted to upgrade to 2.2.17.

Ugh, I had nothing but disaster.... First, the kernel would not
auto-recognize I had 1 gig of memory... it would only boot saying I had 64
meg. So I added the MEM=1024M line to the lilo config (I believe that is
the correct line, don't have it in front of me). Whenever I booted the
machine under 2.2.17, I would get errors during the boot process... here
is part of one (if you need more details let me know, I had to copy these
down on scrap paper):

Swap_Free trying to free nonexistent swap page

Zap_Pte_range: bad pmd (371b10b7)

Unable to handle Kernel Null Ptr deref at virt addr 000001a9

Then it goes on and finally lists an oops for a process "top100" (it's an
Apache process I have running).

The machine has 1 gig of memory, a Mylex ExtremeRaid 1100, and dual 700 MHz
Pentium IIIs. To refresh, the original problem I had was a random lockup
every 24-48 hours, with no warning or errors.

As suggested here (along with upgrading to .17), I also checked into
BIOS upgrades for the motherboard -- I found mine was the most current. I
also upgraded the Mylex BIOS to the latest version and moved the cards
around to different PCI slots. As for the lockups, I don't know if I have
resolved them yet or not, but I do know I am having horrible problems
besides the lockups since upgrading to .17.

I am starting to wonder if I am having memory problems. I noticed that
when I was running at 64 megs (by accident, the system was not detecting
my full memory for some reason) the machine seemed to work perfectly, but
once I said MEM=1024M, all hell broke loose... that's when I started
getting errors. One thing that consistently happened with .17 was that
after I had an error and had to reboot, fsck had to run. fsck would find
all of these bad time header things and would work at fixing them, then
after about 1 minute of crunching it would just lock up. I could hit
return on the keyboard and see a blank line appear on the screen, but that
was it. The drives stopped running and there was no further processing. If
I kept rebooting, this occurred over and over. Once I dropped back to an
older kernel (via a kernel boot disk), the fsck would work perfectly and
complete the boot process...

Does this sound like a .17 problem or a memory problem or both? I have
had 4 machines with similar hardware (dual processor, Mylex RAID cards, 1
gig) and not had any problems like this before. (The other machines had
slower processors or older Mylex RAID cards.) I am about to boot this
machine out the door.

Shouldn't my machine be auto-detecting how much memory I have without
using the MEM= line in lilo? I believe it had in the past.

I have ordered a new gig of memory overnight so I can drop it in and see
if it resolves the problem(s)... if you need any more info, such as more
details on those errors, etc. please let me know.

Thanks in advance,
-John

2000-10-30 23:43:14

by Alan

Subject: Re: Update: SMP 2.2.15 #2 kernel, lock ups...

> Ugh, I had nothing but disaster.... First, the kernel would not
> auto-recognize I had 1 gig of memory... it would only boot saying I had 64

BIOS error. Ask the vendor to fix E801 sizing. Could be your old kernels had
the hack to try E820 (Windows uses this so the BIOS writing morons have to
get it right) [sorry, the quality of BIOS QA is on my rant list; it appears
to be 'boot Windows and ship']

> meg. So I added the MEM=1024M line to the lilo config (I believe that is
> the correct line, don't have it in front of me). Whenever I booted the

Chances are it's not 1024 that is available. It'll only use about 900MB with
a 1 gig sized kernel anyway. Try 900MB just for now.

Alan
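
For reference, the 900MB limit suggested above would normally be passed to
the kernel through an append= line in /etc/lilo.conf. This is only a sketch:
the image path, label, and root device are hypothetical placeholders (the
thread does not give them), and note that the kernel option itself is
spelled in lowercase, mem=.

```
# /etc/lilo.conf (fragment) -- image, label and root are placeholders,
# not values taken from the thread
image=/boot/vmlinuz-2.2.17
    label=linux-2217
    root=/dev/rd/c0d0p1
    read-only
    # cap the memory the kernel will use, as suggested in the reply
    append="mem=900M"
```

After editing the file, /sbin/lilo has to be rerun so the boot map picks up
the change; until then the old settings stay in effect.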