2002-01-05 18:46:00

by Eric

Subject: 2.4.17 oops - ext2/ext3 fs corruption (?)

I seem to be having a recurring problem with my Red Hat 7.2 system
running kernel 2.4.17. Four times now, I have seen the kernel generate an
oops. After the oops, I find that one of the file systems is no longer sane.
The effect that I see is a Segmentation Fault when running things like ls
or du on some directory (the directory is never the same). Also, when the
system is going down for a reboot, it is unable to umount the file system.
The umount command returns a "bad lseek" error.

The first time this happened, it resulted in catastrophic corruption of
/usr and I had to reinstall. At this time, /usr was an ext2 file system.
When I reinstalled, I took the opportunity to reformat all the file
systems, except /home, as ext3.

The second and third times I saw this problem on my /home (ext2) file
system. Fortunately, there was no corruption of the file system after
rebooting the box. I do not have the oops generated by the kernel when
these events occurred.

The fourth time was on /usr (ext3) and did not result in corruption,
although a read-only check of the file system before rebooting showed
files with duplicate blocks. I collected the 2 oops logs I found in dmesg
and ran them through ksymoops. There are quite a few warnings, so I don't
know if they are any good.

I also included dmesg in case that is helpful. If any other information
is required, I would be glad to provide it. Any suggestions on what I
could do to fix the problem would be greatly appreciated.

Thanks,

Eric


Attachments:
dmesg (5.53 kB)
dmesg (no oops)
oops1 (7.25 kB)
oops #1
oops2 (7.20 kB)
oops #2

2002-01-05 20:03:20

by Andrew Morton

Subject: Re: 2.4.17 oops - ext2/ext3 fs corruption (?)

Eric wrote:
>
> I seem to be having a reoccurring problem with my Red Hat 7.2 system
> running kernel 2.4.17. Four times now, I have seen the kernel generate an
> oops. After the oops, I find that one of file systems is no longer sane.
> The effect that I see is a Segmentation Fault when things like ls or du
> some directory (the directory is never the same). Also, when the system
> is going down for a reboot, it is unable to umount the file system. The
> umount command returns a "bad lseek" error.
>
> The first time this happened, it resulted in catastrophic corruption of
> /usr and I had to reinstall. At this time, /usr was an ext2 file system.
> When I reinstalled, I took the opportunity to reformat all the file
> systems, except /home, as ext3.
>

Everything here points at failing hardware. Probably memory errors.
People say that memtest86 is good at detecting these things. Another
way to verify this is to move the same setup onto a different computer...

-

2002-01-05 23:31:32

by Eric

Subject: Re: 2.4.17 oops - ext2/ext3 fs corruption (?)

On Sat, 5 Jan 2002, Andrew Morton wrote:

> Eric wrote:
> >
> > I seem to be having a reoccurring problem with my Red Hat 7.2 system
> > running kernel 2.4.17. Four times now, I have seen the kernel generate an
> > oops. After the oops, I find that one of file systems is no longer sane.
> > The effect that I see is a Segmentation Fault when things like ls or du
> > some directory (the directory is never the same). Also, when the system
> > is going down for a reboot, it is unable to umount the file system. The
> > umount command returns a "bad lseek" error.
>
> Everything here points at failing hardware. Probably memory errors.
> People say that memtest86 is good at detecting these things. Another
> way to verify this is to move the same setup onto a different computer...

I ran memtest86 on the system and let it complete 4 passes before I
stopped it. It found no errors. Unfortunately, I do not have another
system available to test this on. Are there any other diagnostics I can
run to determine if this is truly a hardware problem?

Thanks,

Eric


2002-01-06 10:12:26

by Daniel Phillips

Subject: Re: 2.4.17 oops - ext2/ext3 fs corruption (?)

On January 6, 2002 12:31 am, Eric wrote:
> On Sat, 5 Jan 2002, Andrew Morton wrote:
> > Eric wrote:
> > >
> > > I seem to be having a reoccurring problem with my Red Hat 7.2 system
> > > running kernel 2.4.17. Four times now, I have seen the kernel generate an
> > > oops. After the oops, I find that one of file systems is no longer sane.
> > > The effect that I see is a Segmentation Fault when things like ls or du
> > > some directory (the directory is never the same). Also, when the system
> > > is going down for a reboot, it is unable to umount the file system. The
> > > umount command returns a "bad lseek" error.
> >
> > Everything here points at failing hardware. Probably memory errors.
> > People say that memtest86 is good at detecting these things. Another
> > way to verify this is to move the same setup onto a different computer...
>
> I ran memtest86 on the system and let it complete 4 passes before I
> stopped it. It found no errors. Unfortunately, I do not have another
> system available to test this on. Are there any other diagnostics I can
> run to determine if this is truly a hardware problem?

This doesn't smell like hardware to me, since your two backtraces are identical:

>>EIP; c013ee54 <d_lookup+64/120> <=====
Trace; c0136d40 <cached_lookup+10/50>
Trace; c013740a <link_path_walk+4ea/730>
Trace; c0136b1f <getname+5f/a0>
Trace; c01379d3 <__user_walk+33/50>
Trace; c0134bb4 <sys_lstat64+14/70>
Trace; c0106e04 <error_code+34/40>
Trace; c0106cf3 <system_call+33/40>
Code; c013ee54 <d_lookup+64/120>
00000000 <_EIP>:
Code; c013ee54 <d_lookup+64/120> <=====
0: 8b 6d 00 mov 0x0(%ebp),%ebp <=====
Code; c013ee57 <d_lookup+67/120>
3: 39 53 44 cmp %edx,0x44(%ebx)
Code; c013ee5a <d_lookup+6a/120>
6: 0f 85 90 00 00 00 jne 9c <_EIP+0x9c> c013eef0 <d_lookup+100/120>
Code; c013ee60 <d_lookup+70/120>
c: 8b 44 24 24 mov 0x24(%esp,1),%eax
Code; c013ee64 <d_lookup+74/120>
10: 39 43 0c cmp %eax,0xc(%ebx)
Code; c013ee67 <d_lookup+77/120>
13: 0f 00 00 sldt (%eax)
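
The "identical backtraces" comparison can also be done mechanically. The
following is only a sketch, not something from the original posts: it
assumes the two ksymoops outputs are available as text (for example the
oops1 and oops2 attachments) and that their lines follow the
">>EIP; addr <symbol+off/len>" and "Trace; addr <symbol+off/len>" layout
shown above. The names call_chain and same_crash_site are made up for
illustration.

```python
import re

# Matches ksymoops lines such as:
#   >>EIP; c013ee54 <d_lookup+64/120> <=====
#   Trace; c0136d40 <cached_lookup+10/50>
# "Code;" lines are deliberately not matched.
FRAME_RE = re.compile(r"^(?:>>EIP|Trace);\s+[0-9a-f]+\s+<([^>]+)>")


def call_chain(ksymoops_text):
    """Return the symbol+offset of the EIP line and each Trace line, in order."""
    chain = []
    for line in ksymoops_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            chain.append(match.group(1))
    return chain


def same_crash_site(oops_a, oops_b):
    """True if both oopses have a non-empty, identical call chain."""
    chain_a = call_chain(oops_a)
    chain_b = call_chain(oops_b)
    return bool(chain_a) and chain_a == chain_b
```

Random memory errors would be expected to produce oopses at scattered
locations, so two reports with the same chain (same function, same offset)
are at least a hint that something reproducible is going on.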

--
Daniel

2002-01-12 14:25:22

by Eric

Subject: Re: 2.4.17 oops - ext2/ext3 fs corruption (?)

On Sun, 6 Jan 2002, Daniel Phillips wrote:

> On January 6, 2002 12:31 am, Eric wrote:
> > On Sat, 5 Jan 2002, Andrew Morton wrote:
> > > Eric wrote:
> > > >
> > > > I seem to be having a reoccurring problem with my Red Hat 7.2 system
> > > > running kernel 2.4.17. Four times now, I have seen the kernel generate an
> > > > oops. After the oops, I find that one of file systems is no longer sane.
> > > > The effect that I see is a Segmentation Fault when things like ls or du
> > > > some directory (the directory is never the same). Also, when the system
> > > > is going down for a reboot, it is unable to umount the file system. The
> > > > umount command returns a "bad lseek" error.
> > >
> > > Everything here points at failing hardware. Probably memory errors.
> > > People say that memtest86 is good at detecting these things. Another
> > > way to verify this is to move the same setup onto a different computer...
> >
> > I ran memtest86 on the system and let it complete 4 passes before I
> > stopped it. It found no errors. Unfortunately, I do not have another
> > system available to test this on. Are there any other diagnostics I can
> > run to determine if this is truly a hardware problem?
>
> This doesn't smell like hardware to me, since your two backtraces are identical:

In an attempt to get more information about what is going on, I
tried compiling a new kernel with the options CONFIG_DEBUG_KERNEL and
CONFIG_DEBUG_BUGVERBOSE turned on. However, this seems to have made
things worse. Now, within about 6-12 hours of boot-up, the system will hang
completely with no information from the kernel on the console. The only
way to get out of it is to use the reset button on the front of the box.

Is there anything else I can do to try and determine if this is a kernel
problem? Or at least get more information about what is going on?

Thanks,

Eric


2002-01-12 14:53:16

by nada

Subject: Re: Re: 2.4.17 oops - ext2/ext3 fs corruption (?)

> On Sun, 6 Jan 2002, Daniel Phillips wrote:
>
> > On January 6, 2002 12:31 am, Eric wrote:
> > > On Sat, 5 Jan 2002, Andrew Morton wrote:
> > > > Eric wrote:
> > > > >
> > > > > I seem to be having a reoccurring problem with my Red Hat 7.2 system
> > > > > running kernel 2.4.17. Four times now, I have seen the kernel generate an
> > > > > oops. After the oops, I find that one of file systems is no longer sane.
> > > > > The effect that I see is a Segmentation Fault when things like ls or du
> > > > > some directory (the directory is never the same). Also, when the system
> > > > > is going down for a reboot, it is unable to umount the file system. The
> > > > > umount command returns a "bad lseek" error.
> > > >
> > > > Everything here points at failing hardware. Probably memory errors.
> > > > People say that memtest86 is good at detecting these things. Another
> > > > way to verify this is to move the same setup onto a different computer...
> > >
> > > I ran memtest86 on the system and let it complete 4 passes before I
> > > stopped it. It found no errors. Unfortunately, I do not have another
> > > system available to test this on. Are there any other diagnostics I can
> > > run to determine if this is truly a hardware problem?
> >
> > This doesn't smell like hardware to me, since your two backtraces are identical:
>
> In an attempt to try a get more information about what is going on, I
> tried compiling a new kernel with the options CONFIG_DEBUG_KERNEL and
> CONFIG_DEBUG_BUGVERBOSE turned on. However, this seems to have made
> things worse. Now, in about 6-12 hours from boot-up, the system will hang
> completely with no information from the kernel on the console. The only
> way to get out it is to use the reset button on the front of the box.
>
> Is there anything else I can do to try and determine if this is a kernel
> problem? Or at least get more information about what is going on?
>
> Thanks,
>
> Eric

I had the same problem with ext3. The problem showed up when trying to
READ from the partition, not when trying to write, which was interesting.
I copied a lot of files to my ext3 partition (5 GB), and when I tried to
copy them back to the other partition (vfat) there were some input/output
errors and I lost all the data. I had to format the ext3 partition and
install Linux again.
So the problem does not seem to be a hardware problem.
I am traveling now and away from home, so I can't test it again right
away. By the time I'm back there will probably be a fixed kernel, but if
not, I'll test it again anyway.

Marco