2006-12-15 23:00:23

by Bas van Schaik

[permalink] [raw]
Subject: EXT3 filesystem corruptions on AoE, RAID and LVM?

Hi all,

I'm maintaining two clusters, with machines running Debian Stable mixed
with Etch kernels in order to have AoE (ATA over Ethernet) support.
Machines in these clusters "export" their harddisks using AoE, and one
machine in the cluster imports those using the kernel "aoe"-module. On
top of those imported devices, multiple RAID5-arrays are created, and
LVM is running on top of RAID, ext3 on the LVM LV.

After a few days, I get EXT3 errors like this:
>> EXT3-fs: mounted filesystem with ordered data mode.
>> EXT3-fs error (device loop0): ext3_free_blocks_sb: bit already cleared for block 412186
>> Aborting journal on device loop0.
>> EXT3-fs error (device loop0) in ext3_free_blocks_sb: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_truncate: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_orphan_del: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_reserve_inode_write: Journal has aborted
>> EXT3-fs error (device loop0) in ext3_delete_inode: Journal has aborted
>> __journal_remove_journal_head: freeing b_committed_data
>> __journal_remove_journal_head: freeing b_committed_data

(...)

>> __journal_remove_journal_head: freeing b_committed_data
>> ext3_abort called.
>> EXT3-fs error (device loop0): ext3_journal_start_sb: Detected aborted journal
>> Remounting filesystem read-only
>> __journal_remove_journal_head: freeing b_committed_data


FSCK'ing the filesystem fixes those errors, but after a few days (or
weeks, depending on the fs load) the corruptions appear again. It might
be worth telling you that there are no other suspicious messages in my logs.

I saw some other discussions on the mailing list, but I don't think they're
related to my problems. I don't know whether I need to file a bug for this,
nor do I know which details you need to help me solve this problem.
So for now I just want to hear your thoughts. FYI:

Kernel information for cluster 1:
>> infinity:~# uname -a
>> Linux infinity 2.6.17-2-686 #1 SMP Wed Sep 13 16:34:10 UTC 2006 i686 GNU/Linux


And cluster 2:
>> dust:~# uname -a
>> Linux dust 2.6.18-3-686 #1 SMP Thu Nov 23 20:49:23 UTC 2006 i686 GNU/Linux

Note that these are not vanilla kernels, but Debian kernels. However,
AFAIK there are no Debian-specific patches to AoE, ext3, LVM or RAID.

Thanks for your replies!

Best regards,

-- Bas van Schaik


2006-12-21 13:40:43

by Bas van Schaik

[permalink] [raw]
Subject: Re: EXT3 filesystem corruptions on AoE, RAID and LVM?

Hi all,

Sorry for being so blunt as to reply to my own post, but is there
anyone out there who might be able to help me with this problem? Or at
least help me debug it? Today I discovered another type of
inconsistency:

> # ls -la mydir
> -r-S--s-wt 1 2245250426 3397732635 32768 1929-11-11 16:54 mydir
Obviously, there is no UID 2245250426 or GID 3397732635 on my
filesystem, and this file certainly wasn't created or modified on the
11th of November 1929. I cannot remove the directory; I get a
"permission denied" (as root!). Of course, I assume an fsck will
solve this problem (probably by just removing this entry from the
filesystem and attaching it to lost+found), but I'm already checking my
filesystem about every five days now! And each check means another 5 hours
of downtime with this special type of "cluster"...
Thanks in advance for all your thoughts!

-- Bas

