2005-10-25 19:00:28

by Chase Venters

Subject: Oops in do_page_fault

Greetings,

Please forgive me in advance for the length of this description - I don't
want to leave out any important details.

About two weeks ago I came home from work to find that the fan on my XFX
GeForce 6800GT PCI-E had failed. My computer, which was previously playing
music, was simply playing the last buffer-sized frame of audio repeatedly
as if it were a skipping CD. I cursed, got frustrated, and ordered a
7800GT to replace it.

While I was waiting, I took advantage of my Asus P5GDC-V Deluxe's onboard
Intel 915 graphics to get by in X. This was stable for a few days (prior
to the graphics card failure, this system was stable for a year).
Eventually, though, I was ripping a CD while listening to music and doing
some other minor things when I noticed that the system started crashing in
a very odd way.

I managed to run dmesg from a remote shell that was open, and saw some
really strange traces that I didn't manage to save. Pretty soon I realized that
every process that tried to touch the disk would go into a
TASK_UNINTERRUPTIBLE sleep and freeze. The system took about 30 seconds to
crap out - amusingly enough, my music continued to play until the song was
over; then that died too.

Writing it off as a possible bug in the video driver, I rebooted and
noticed ReiserFS doing lots of cleanup. I continued on my way until the
system crashed again an hour later. Stability gradually grew much worse -
I went from a year of stability, to days, to hours, to minutes...
Eventually, I decided the best option would be to leave it alone until
replacement parts arrived.

In the meantime I ran memtest86 exhaustively to verify that my value RAM
wasn't on the fritz.

The replacement card arrived, and annoyed by what seemed to be excessive
corruption on my partition, I used a LiveCD to set one disk in the RAID10
to faulty, removed it, made an ext2 partition on it, moved my data to it
(which thankfully fit), rebuilt the ReiserFS partition on the RAID, moved
all the data back over, and resynced the disk.

The system seemed to work perfectly for days. I was happy to have fixed
the problem. Then, though, I noticed that my brand new fresh partition was
kicking up very similar errors (I think I remember seeing something about
vs-7000 nesting filesystem, as well as complaints about free space
calculations). It took 5 minutes before the system froze during an "emerge
traceroute". This time, the behavior got bad really fast. I could reliably
reproduce the behavior by running "emerge traceroute" (the last thing I
ever saw before death was portage checking /usr/share/doc). Twice it
froze, and twice it actually *immediately* rebooted without even a
visible panic.

I replaced the motherboard with an identical motherboard, upgraded to a
better cooler (CPU is a 540J Prescott 3.2GHz), and went from a 380 watt to a
500 watt dual 12V-rail supply. Strangely enough, after these changes, I
can now reproduce the crash reliably, but I'm getting (depending on the
kernel version I boot) different but consistent behavior each time.

In 2.6.13, I get an Oops (translated by hand, sorry for inexact formatting):

Oops 0000 #1
PREEMPT SMP
(list of modules linked in includes some alsa modules, nvidia.ko and
sk98lin.ko)

CPU 0
EIP 0060:[<c01182c3>] Tainted: P VLI
EFLAGS: 00010086 (2.6.13)
EIP is at do_page_fault+0xa3/0x5db
eax: f5e50000 ebx: 0000000b ecx: 0000000d edx: 0000000d
esi: 0000000e edi: c0567451 ebp: 00000000 esp: f5e5a10c

ds: 007b es: 007b ss: 0068

2.6.13 will oops reproducibly as above upon completion of "emerge
traceroute". Every 2.6.13 oops happens at do_page_fault+0xa3/0x5db.
ebx and ecx are also observed to be constant.
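
For anyone reading the EIP line above: the "symbol+0xOFFSET/0xSIZE" notation gives the faulting offset inside the function and the function's total size. A minimal sketch of unpacking it follows, assuming the line looks exactly like the hand-copied one; the helper names parse_symbol_offset and function_start are made up for illustration and are not part of any kernel tool.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" notation on the "EIP is at" line.
SYMBOL_RE = re.compile(
    r"(?P<symbol>[A-Za-z_]\w*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_symbol_offset(line):
    """Return (symbol, offset, size) from an oops line (hypothetical helper)."""
    match = SYMBOL_RE.search(line)
    if match is None:
        raise ValueError("no symbol+offset/size found in: %r" % line)
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


def function_start(eip, offset):
    """Address where the function begins: faulting EIP minus the offset."""
    return eip - offset


if __name__ == "__main__":
    symbol, offset, size = parse_symbol_offset("EIP is at do_page_fault+0xa3/0x5db")
    # EIP value copied from the oops above.
    start = function_start(0xC01182C3, offset)
    print(f"{symbol} starts at {start:#010x}; fault at byte {offset} of {size}")
```

With the values from this oops, the function start works out to 0xc0118220, which can be compared against the do_page_fault entry in the System.map of the same kernel build.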

Oddly enough, the second Oops I got on 2.6.13 reported a CPU # 2949119.
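
To put that CPU number in perspective, here is a small sketch that prints it in hex and checks it against an upper bound. The bound of 32 is only an assumed placeholder (the real limit depends on the kernel's CONFIG_NR_CPUS setting), and describe_cpu_field is a made-up helper name.

```python
def describe_cpu_field(value, max_cpus=32):
    """Return (hex string, plausible?) for a CPU number printed in an oops.

    max_cpus is an assumed upper bound for illustration only; the real
    limit comes from the kernel configuration.
    """
    return f"{value:#010x}", 0 <= value < max_cpus


if __name__ == "__main__":
    hex_value, plausible = describe_cpu_field(2949119)
    print(hex_value, "plausible" if plausible else "not a plausible CPU index")
```

The value is 0x002cffff, nowhere near a real CPU index on this machine, so the field was presumably read from memory that no longer held what the oops code expected. That is an inference, not something the oops output itself states.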

I also tested "emerge traceroute" on the same partition by booting
2.6.11.7 and 2.6.12.4. Both of these kernels failed to Oops / panic, but
simply froze.

My next step will be to try to replace the CPU (though I would really
appreciate any comments as to whether I'm likely looking at a hardware
problem anymore). I ordered a replacement CPU and got sent a 478 instead
of a 775, so it looks like I'm going to have to go pick one up locally.

I tried to rebuild my kernel with SysRQ and a serial console to be of
better help; unfortunately, I can't seem to do enough IO before crashing
to succeed.

Thanks,
Chase Venters


2005-10-25 20:19:13

by Lee Revell

Subject: Re: Oops in do_page_fault

On Tue, 2005-10-25 at 15:00 -0400, Chase Venters wrote:
> Oops 0000 #1
> PREEMPT SMP
> (list of modules linked in includes some alsa modules, nvidia.ko and
> sk98lin.ko)
>

You need to reproduce this with an untainted kernel AKA without nvidia
loaded.

> CPU 0
> EIP 0060:[<c01182c3>] Tainted: P VLI
> EFLAGS: 00010086 (2.6.13)
> EIP is at do_page_fault+0xa3/0x5db
> eax: f5e50000 ebx: 0000000b ecx: 0000000d edx: 0000000d
> esi: 0000000e edi: c0567451 ebp: 00000000 esp: f5e5a10c
>
> ds: 007b es: 007b ss: 0068
>

You left out the most important part of the Oops, the stack trace. It
should have been printed immediately after the registers.

Lee

2005-10-25 20:20:15

by John Stoffel

Subject: Re: Oops in do_page_fault


chase> Please forgive me in advance for the length of this
chase> description - I don't want to leave out any important details.

[ edited down ]

chase> In 2.6.13, I get an Oops (translated by hand, sorry for inexact formatting):

chase> Oops 0000 #1
chase> PREEMPT SMP
chase> (list of modules linked in includes some alsa modules, nvidia.ko and
chase> sk98lin.ko)

chase> CPU 0
chase> EIP 0060:[<c01182c3>] Tainted: P VLI
chase> EFLAGS: 00010086 (2.6.13)
chase> EIP is at do_page_fault+0xa3/0x5db
chase> eax: f5e50000 ebx: 0000000b ecx: 0000000d edx: 0000000d
chase> esi: 0000000e edi: c0567451 ebp: 00000000 esp: f5e5a10c

chase> ds: 007b es: 007b ss: 0068

Please remove the binary only nvidia and sk98lin module(s) you have on
the system so that the kernel isn't tainted any more, and then see if
you can reproduce this problem. You'll need to not only remove the
module, but then reboot to make sure it comes up cleanly without any
corruption. We can't help debug your issues if you're using binary
only modules. Sorry.
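
One way to confirm after the reboot that the kernel really is untainted is to read /proc/sys/kernel/tainted, which should be 0. The sketch below decodes that bitmask; it is only an illustration, the decode_taint name is made up, and the flag table is deliberately partial (only the P and F bits are listed, anything else is reported by bit number).

```python
import os

# Partial table: only the two flags relevant here. Other bits exist and are
# reported generically as "bit N" by decode_taint below.
TAINT_FLAGS = {
    0: "P",  # a proprietary module was loaded
    1: "F",  # a module was force-loaded
}

TAINT_PATH = "/proc/sys/kernel/tainted"


def decode_taint(value):
    """Return a list of flag names for every bit set in the taint mask."""
    flags = []
    bit = 0
    while value >> bit:
        if (value >> bit) & 1:
            flags.append(TAINT_FLAGS.get(bit, f"bit {bit}"))
        bit += 1
    return flags


if __name__ == "__main__":
    if os.path.exists(TAINT_PATH):
        with open(TAINT_PATH) as handle:
            mask = int(handle.read().strip())
        print("tainted:", decode_taint(mask) or "no")
    else:
        print(TAINT_PATH, "not available on this system")
```

A "P" in the result corresponds to the "Tainted: P" shown in the oops quoted above.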

John

2005-10-26 02:36:22

by Chase Venters

Subject: (was: Oops in do_page_fault) ReiserFS problems... (Now with full trace)

On Tuesday 25 October 2005 02:58 pm, Lee Revell wrote:
> You need to reproduce this with an untainted kernel AKA without nvidia
> loaded.

My apologies. To retest, I ensured that no modules at all were dynamically
linked. I also took the liberty of rebuilding the kernel on another system,
then transferring it over (thankfully I was successful). In the rebuilt
kernel, I enabled ReiserFS debugging, Magic SysRq, and a serial console (I
also bought a null modem cable).

> You left out the most important part of the Oops, the stack trace. It
> should have been printed immediately after the registers.

Actually, the Oops didn't contain a stack trace either time I produced the bug
on my existing 2.6.13! Subsequent attempts to reproduce the problem on that
production kernel resulted in a freeze.

I've attached the following files:

* boot-dmesg.txt
  dmesg right after booting my newly built 2.6.13 debug kernel.
* config-debug.txt
  .config from the debug kernel.
* cpuinfo.txt
  Contents of /proc/cpuinfo.
* lspci.txt
  Output of lspci -vvv, in case anyone finds it relevant.
* emerge.txt
  Serial console capture of me running "emerge traceroute" to cause the bug,
  along with some logs of ReiserFS sweating.
* crash.txt
  ReiserFS's panic, followed by the full traces produced by the kernel.

Just a few more points about my problem...

#1 - I've never seen any of the disks in this raid10 (md1 mounted on /)
produce any CRC errors (though correct me if I'm wrong, I don't see any
reason the kernel should BUG()/oops/panic because of corrupted filesystems)
#2 - This ReiserFS partition is literally days old.
#3 - When I had strange stability issues before, I had been using 2.6.13 for
some time successfully. Stability went from perfect to nonexistent.
#4 - I successfully ran a CPU burn-in program I don't have source for. I've
also run memtest86 extensively, replaced the motherboard, power supply and
GPU. I no longer believe this to be a hardware issue.
#5 - I disabled all swap partitions to eliminate that as a variable.

Thanks a bunch for any ideas. If anyone needs more information from me, I'm
willing and able to produce whatever is asked for to help debug this.

Thanks,
Chase Venters


Attachments:
(No filename) (2.15 kB)
boot-dmesg.txt (20.02 kB)
config-debug.txt (33.45 kB)
emerge.txt (12.43 kB)
lspci.txt (14.61 kB)
cpuinfo.txt (511.00 B)
crash.txt (5.83 kB)

2005-10-26 03:12:21

by Lee Revell

Subject: Re: (was: Oops in do_page_fault) ReiserFS problems... (Now with full trace)

On Tue, 2005-10-25 at 21:35 -0500, Chase Venters wrote:
> #1 - I've never seen any of the disks in this raid10 (md1 mounted
> on /) produce any CRC errors (though correct me if I'm wrong, I don't
> see any reason the kernel should BUG()/oops/panic because of corrupted
> filesystems)

Reiser3 is notoriously bad at handling error conditions like this. I
personally stopped using reiser3 when I had a power loss that caused
chunks of files to end up in other files. I was able to fix the
critical ones with a text editor but who knows what other damage it did.

Lee