LinuxLists.cc - 2.4.28-bk3 ext3 oopses at boot

2004-11-27 06:43:31

Subject: 2.4.28-bk3 ext3 oopses at boot

Hi, Guys:

I keep getting pretty random oopses at boot which point at either ext3 or
the SATA. This is a Dell PE750 which Matt sent me, BTW. So, it's SATA-only
and I had to use Jeff's patches to get ____request_resource. That, I suppose,
makes the patches suspect, but if the box passes the boot, everything is peachy.
No filesystem corruption of any kind.

Failures are generally random, and they look like this (transcribed by hand):

VA NULL (0x0c)
EIP: write_one_revoke_record
journal_commit_transaction
scrup
vgacon_cursto
clear_selection
poke_blanked_console
vt_console_print
schedule
__call_console_drivers
release_console_sem
kjournald

VA 0x89e389d5
EIP: find_revoke_record
journal_cancel_revoke
do_get_write_access
journal_alloc_journal_head
ext3_mark_inode_dirty
journal_get_write_access
ext3_add_entry
start_this_handle
new_handle
ext3_add_nondir
ext3_mknod
vfs_mknod
unix_bind
sys_bind
dput
sys_socketcall

The EIP is always in something related to jbd.

A self-built 2.4.21-20.EL works dandy, so this ought to rule out
a toolchain problem (it's RHEL3 with gcc-3.2). There has been a small bunch
of changes in jbd and ext3 since then, but nothing in revoke.c.

I dunno if this report has any value, but thought I'd add a data point.

-- Pete

2004-11-27 07:06:13

by Stephen C. Tweedie


Subject: Re: 2.4.28-bk3 ext3 oopses at boot

Hi,

On Fri, 2004-11-26 at 05:28, Pete Zaitcev wrote:

> I keep getting pretty random oopses at boot which point at either ext3 or
> the SATA. This is a Dell PE750 which Matt sent me, BTW.

...

> Failures are generally random, and they look like this (transcribed by hand):
>
> VA NULL (0x0c)
> EIP: write_one_revoke_record

> VA 0x89e389d5
> EIP: find_revoke_record

Both of these occur while walking the revoke list. That's purely an
in-memory data structure, so it shouldn't be affected by anything on disk.
You could try a forced fsck just to be sure, but it doesn't sound like
there's corruption on disk here. (There *is* a variant of the revoke
list that is read from disk, but that's only used during log replay,
which isn't active in your oops traces.)
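
A minimal sketch of why a bad pointer on an in-memory hash chain can
show up as a fault at a small offset such as 0x0c. The struct below is a
hypothetical model of a revoke record on a 32-bit x86 kernel: the field
names, their order, and the helper blocknr_access_address() are
assumptions for illustration, not copied from revoke.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a revoke hash-chain record on 32-bit x86.
 * Field names and order are assumptions; 32-bit integers stand in for
 * pointers so the offsets are the same on any build host.
 */
struct model_revoke_record {
    uint32_t next;      /* hash-chain link to the next record */
    uint32_t prev;      /* hash-chain link to the previous record */
    uint32_t sequence;  /* transaction sequence number */
    uint32_t blocknr;   /* block number compared during a chain walk */
};

/* Under the assumed layout, blocknr sits 12 bytes into the record. */
_Static_assert(offsetof(struct model_revoke_record, blocknr) == 0x0c,
               "assumed layout places blocknr at offset 0x0c");

/*
 * Address the CPU would touch when a chain walk reads blocknr through
 * record_ptr. A NULL or wild record_ptr faults at record_ptr + offset.
 */
static uint32_t blocknr_access_address(uint32_t record_ptr)
{
    return record_ptr +
           (uint32_t)offsetof(struct model_revoke_record, blocknr);
}
```

Under these assumptions, blocknr_access_address(0) is 0x0c, which would
be consistent with the "VA NULL (0x0c)" report, and a wild record
pointer gives an equally wild fault address.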

> if the box passes the boot, everything is peachy.
> No filesystem corruption of any kind.

Sounds to me like you're getting corruption at a repeatable address
somewhere at boot time that just happens to be hitting the revoke list
early on.

I saw something like this once before, except it was on a buffer_head
list. While debugging, I found that the *previous* entry of
the list was always exactly at the 16MB mark in memory, and the list
pointers there were getting corrupted. It turned out to be a bad BIOS that
was trapping a PCI bus operation, setting up a PTE for some internal
workspace, and restoring it on exit back to the OS but forgetting to flush
the TLB. Adding a TLB flush after various PCI ops fixed it completely.
Similar footprint to yours: a crash shortly after boot in a specific
data structure but with pretty random VAs, and at random points in the
bh walking code.
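
One way to locate the damaged entry in a case like this is to walk the
list and check that each forward link is matched by the neighbour's back
link. The following is a minimal userspace sketch of that idea for a
circular doubly linked list; struct node and first_bad_link() are
hypothetical names, and a real kernel check would also have to validate
that each pointer is a mapped address before dereferencing it.

```c
#include <stddef.h>

/* Hypothetical circular doubly linked list node. */
struct node {
    struct node *next;
    struct node *prev;
};

/*
 * Walk the ring starting at head and return the first node whose next
 * pointer is NULL or is not mirrored by next->prev. Returns NULL if the
 * ring is consistent. max_nodes bounds the walk so that a corrupted
 * ring that never returns to head cannot loop forever; in that case the
 * node reached when the bound ran out is returned.
 */
static struct node *first_bad_link(struct node *head, size_t max_nodes)
{
    struct node *cur = head;

    for (size_t i = 0; i < max_nodes; i++) {
        struct node *nxt = cur->next;

        if (nxt == NULL || nxt->prev != cur)
            return cur;     /* cur holds the damaged forward link */
        if (nxt == head)
            return NULL;    /* back at the start: ring is intact */
        cur = nxt;
    }
    return cur;             /* bound exceeded: treat as corrupt */
}
```

The node returned is the entry holding the bad pointer, which is the one
just before the point where a walk would crash.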

The revoke code hasn't changed in *ages*, and I've never seen a report
of that list getting corrupted before. I suspect the fs is just an
innocent victim here.

Cheers,
Stephen