LinuxLists.cc - 2.5: ext3 bug or dying drive?

2002-12-05 21:21:51

Subject: 2.5: ext3 bug or dying drive?

Overnight, 2.5.50-mm1 took a big stinky shit:

EXT3-fs error (device sd(8,1)): ext3_readdir: bad entry in directory #243371: rec_len % 4 != 0 - offset=1688, inode=243681, rec_len=109, name_len=27
Aborting journal on device sd(8,1).
ext3_abort called.
EXT3-fs abort (device sd(8,1)): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sd(8,1)) in start_transaction: Journal has aborted
EXT3-fs error (device sd(8,1)): ext3_readdir: bad entry in directory #243371: rec_len % 4 != 0 - offset=1688, inode=243681, rec_len=109, name_len=27
EXT3-fs error (device sd(8,1)) in ext3_setattr: Journal has aborted
EXT3-fs error (device sd(8,1)) in start_transaction: Journal has aborted
EXT3-fs error (device sd(8,1)) in start_transaction: Journal has aborted
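
For readers unfamiliar with the first message: it comes from the sanity
check ext3 applies to each directory entry as it reads it. Below is a
minimal Python sketch of that kind of check, loosely modeled on the
kernel's checks and not a copy of them. The function names and the
4096-byte block size are assumptions made for illustration; the 8-byte
entry header and the 4-byte alignment rule are what the message itself
is about.

```python
def dir_rec_len(name_len):
    """Smallest legal record length: 8-byte header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(offset, rec_len, name_len, block_size=4096):
    """Return an error string for a bad entry, or None if it looks sane.

    block_size is an assumed value; the thread does not state it.
    """
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if (offset % block_size) + rec_len > block_size:
        return "directory entry across blocks"
    return None


# Values taken from the dmesg output above.
print(check_dir_entry(offset=1688, rec_len=109, name_len=27))
```

With the logged values the sketch reports "rec_len % 4 != 0", since 109
is not a multiple of 4. A 27-character name needs at least 36 bytes.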

Nothing particularly interesting was going on (mostly idle X desktop).
I woke up and noticed the fs was mounted ro. The above was in dmesg.

Rebooted and ext3 replayed the journal and said a manual check was
needed due to I/O error on the journal. Ran fsck manually, it found a
whole bunch of orphan inodes including some scary errors like "inode
part of corrupt orphan inode list" or similar.

Rebooted again to force another fsck to be sure, and sure enough it
found more problems. Ugh. I started thinking bad hard drive.

Back up in X, and the same dmesg error occurred again. Repeat above.

Now I am in 2.4 and all seems well. So perhaps not hard drive?

IBM U2W drive on a 2940U2W if it matters. UP kernel.

Robert Love

2002-12-05 21:56:10

by Andrew Morton

Subject: Re: 2.5: ext3 bug or dying drive?

Robert Love wrote:
>
> Overnight, 2.5.50-mm1 took a big stinky shit:
>
> ...
>
> Rebooted and ext3 replayed the journal and said a manual check was
> needed due to I/O error on the journal.

That'll be e2fsck saying that, when it tries to do journal replay.
I/O errors on the journal during replay are not good.

Were there no I/O error messages reported from the device driver,
block, buffer or pagecache layer? Generally everyone likes to have
a shout as one flies past.

> Ran fsck manually, it found a
> whole bunch of orphan inodes including some scary errors like "inode
> part of corrupt orphan inode list" or similar.
>
> Rebooted again to force another fsck to be sure, and sure enough it
> found more problems. Ugh. I started thinking bad hard drive.
>
> Back up in X, and the same dmesg error occurred again. Repeat above.
>
> Now I am in 2.4 and all seems well. So perhaps not hard drive?

Well. Changed driver, scsi layer, block layer, VFS and ext3. Could
be anywhere :(

> IBM U2W drive on a 2940U2W if it matters. UP kernel.

It would be useful to give the IO system a bit of a thrashing,
to narrow the problem down. Just a `cat /dev/sda[n] > /dev/null'
would suit.
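
A rough Python equivalent of that read test, for anyone who wants the
byte count and the exact error back instead of just cat's exit status.
This is only a sketch: read_through and its chunk size are made-up
names and values, and the device path in the usage note is an
assumption based on the sd(8,1) in the log.

```python
import os


def read_through(path, chunk_size=1 << 20):
    """Read path sequentially to the end.

    Returns (bytes_read, error), where error is None on a clean read or
    the OSError raised by the failing read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                chunk = os.read(fd, chunk_size)
            except OSError as exc:
                # The read failed partway; report how far we got.
                return total, exc
            if not chunk:
                # End of file or device.
                return total, None
            total += len(chunk)
    finally:
        os.close(fd)
```

Run as root, something like read_through("/dev/sda1") would walk the
whole partition. A non-None error together with the byte count gives a
rough idea of where on the disk the read failed.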

Bottom line: dunno.

2002-12-05 22:04:34

by Robert Love

Subject: Re: 2.5: ext3 bug or dying drive?

On Thu, 2002-12-05 at 17:03, Andrew Morton wrote:

> Were there no I/O error messages reported from the device driver,
> block, buffer or pagecache layer? Generally everyone likes to have
> a shout as one flies past.

Nope. Odd, eh?

Only log item of relevance was

(scsi0:A:0:0): Locking max tag count at 64

which I get every now and then anyhow.

> It would be useful to give the IO system a bit of a thrashing,
> to narrow the problem down. Just a `cat /dev/sda[n] > /dev/null'
> would suit.

2.4 survived this fine. Looking like it's not the disk, then. I will
try this in 2.5 once I back up some data and finish some work.

I should note I have been running this machine with 2.5 for about a
month now with no problems and my development machines have been 2.5
since, uh, 2.5.1 but they are all IDE not SCSI.

> Bottom line: dunno.

Me neither. Quite an anomaly.

Robert Love

2002-12-05 22:13:33

by Robert Love

Subject: Re: 2.5: ext3 bug or dying drive?

On Thu, 2002-12-05 at 17:12, Robert Love wrote:

> > Bottom line: dunno.
>
> Me neither. Quite an anomaly.

I should add I have some file corruption.

It is probably related to fsck cleaning house - it seems some of the
executables I had open during the nose dive are bad. Reinstalling the
RPM packages fixed that.

Poop.

Robert Love

2002-12-06 00:55:38

by Barry K. Nathan

Subject: Re: 2.5: ext3 bug or dying drive?

On Thu, Dec 05, 2002 at 04:27:40PM -0500, Robert Love wrote:
> IBM U2W drive on a 2940U2W if it matters. UP kernel.

http://www.storage.ibm.com/hdd/support/download.htm

Download the Drive Fitness Test "Linux disk creator" (which isn't
actually a "disk creator" like the Windows version, but simply a file
you dd onto a 1.44MB floppy), then boot off that and try the Quick test.
If that doesn't show anything wrong, try the Advanced test or whatever
the longer one is called.

If DFT doesn't fail outright and instead offers to erase part of the
drive for you to "repair" it, that means there are bad sectors.
Conversely, if the Advanced test shows Disposition Code 00, that means
the drive is probably OK. (I think another way of interpreting the
results is that OK results are text on a green background, and failures
are text on red.) Anyway, this stuff will seem more obvious once you try
it.

-Barry K. Nathan

2002-12-06 06:51:16

by Rolf Eike Beer

Subject: Re: 2.5: ext3 bug or dying drive?

Robert Love wrote:

> Overnight, 2.5.50-mm1 took a big stinky shit:

[bad things]

> Nothing particularly interesting was going on (mostly idle X desktop).
> I woke up and noticed the fs was mounted ro. The above was in dmesg.
>
> Rebooted and ext3 replayed the journal and said a manual check was
> needed due to I/O error on the journal. Ran fsck manually, it found a
> whole bunch of orphan inodes including some scary errors like "inode
> part of corrupt orphan inode list" or similar.

From at least (IIRC) 2.5.46 on, I have been getting wrong free block counts
in inodes when writing to disks. It seems to take more than just 2 or 20
files; a kernel compile is enough in most cases. It is happening on 2
different hosts, one with SCSI, the other with IDE. Nothing really bad
has happened so far. But if I can't create new files on a filesystem with
2 GB of free space, I know it's time for an "e2fsck -f" on it.

Eike