2004-04-09 17:56:47

by Chris Meadors

Subject: 2.6.x oops on x86_64 server

Ever since I put this server into production I've been seeing oopses, on
average once or twice a week. During testing with the last of the 2.5
series, and the first two 2.6s, the machine seemed stable. But the
first night live it oopsed. While I did put a pretty good load on it
during testing, I think the real world is a bit more rough. It sees
about 10 e-mails every second, and runs SpamAssassin and ClamAV on each.

I run just a plain kernel plus the current patch from
ftp://ftp.x86-64.org/pub/linux/v2.6/ . I've tried without the x86_64
patch, and it still oopses just the same.

I've attached the two most recent backtraces that made it into the
syslog; the final oops doesn't make it to disk. The backtraces are
always very similar, with the bad state in free_hot_cold_page and
prep_new_page. I've also included my current config that the machine is
running with 2.6.5+x86_64.

I've posted to the x86-64.org discussion list. I was asking if they
thought it could be a hardware problem, or if it was more likely to be
software. Andi Kleen said that it wasn't likely to be hardware since it
always seems to be a corrupted mem_map.

I do run the LSI Logic MegaRAID 320-0 controller, which I believe still
has some issues with 64-bit machines. But I thought the problems were
limited to machines with more than 2 GB of RAM. This machine has
just 2 GB.

A bit more information on the hardware: the motherboard is a Tyan
S2880 with two Opteron 240s and four 512 MB Corsair PC2700 registered
ECC DIMMs. The MegaRAID 320-0 is the ZCR (Zero Channel RAID) card,
which makes use of the two SCSI channels present on the motherboard.
Each channel is connected to an SCA backplane with four Seagate Cheetah
36ES drives, for a total of eight drives: two (one from each channel)
in RAID1 (system), and the other six in RAID5 (home).

As I said this is a production server now, but I am able to take it down
for short periods of time for testing. If anyone has any ideas, I'm
willing to give them a shot. If you think it is hardware, say so and
I'll try to get things swapped out. If there is anything else you want
to know about the machine or the kernel running on it, let me know and
I'll get the info to you.

Thanks.

--
Chris


Attachments:
0406-backtrace.txt (4.77 kB)
0409-backtrace.txt (4.45 kB)
prime_config.txt (3.00 kB)