2002-01-17 20:41:28

by Patrick Scharrenberg

Subject: 2.4.17 strange ext2 error

Hi,

yesterday I got a very strange ext2 error on my linux machine..
The system has a 5-disk raid-5-software-raid and on top of this there is one
ext2 fs which was clean when mounted 1 week ago..

kernel 2.4.17 (since 1 week)
before it was 2.4.10

suddenly some directories were hidden and so I looked at the logs:

here is a part of /var/log/warn (because it was about 50k I put it on the
websrv):
http://webmail.koenigsnet.rwth-aachen.de/a/warn

after rebooting with kernel 2.4.10 and starting an fsck there were 10,8
billion errors
and now files are gone :-(((((((

..patrick


Attachments:
warn (50.63 kB)

2002-01-17 21:19:30

by Andreas Dilger

Subject: Re: 2.4.17 strange ext2 error

On Jan 17, 2002 21:40 +0100, Patrick Scharrenberg wrote:
> yesterday I got a very strange ext2 error on my linux machine..
> The system has a 5-disk raid-5-software-raid and on top of this there is one
> ext2 fs which was clean when mounted 1 week ago..
>
> kernel 2.4.17 (since 1 week)
> before it was 2.4.10

When you say it was "clean when mounted 1 week ago" does this mean that you
had run e2fsck on it at that time, or just that it did not report any
errors when you mounted it? Sometimes it is possible to have corruption
in your fs for a while without noticing it if you don't run fsck on it.

Jan 15 19:03:56 atlantis kernel: EXT2-fs error (device md(9,0)):
ext2_free_blocks: Freeing blocks in system zones - Block = 71, count = 1
Jan 15 19:14:45 atlantis kernel: EXT2-fs error (device md(9,0)):
ext2_new_block:Allocating block in system zone - block = 71

This is a problem with the ext2 code that I have a fix for. It is not
the _real_ cause of your problem, since this is already showing that
there was another error which caused this problem.

The issue that I have a fix for is that if there is a corrupt inode
somewhere, and you delete it, then it will happily mark metadata blocks
as unused, and as you can see it will also proceed to allocate that
block for something else.

My patch fixes both of these errors - if you try to free such a metadata
block, it will not clear it in the bitmap, and if you try to allocate a
metadata block that is marked free in the bitmap, it will mark it in use but
continue to look for a different block which can be allocated.
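
For illustration only (this is not the patch itself, and none of the names
below exist in the kernel): a minimal user-space sketch of the two guards
described above, assuming a toy 16-block bitmap in which blocks 0..3 stand in
for the metadata ("system zone") blocks. `toy_free_block`, `toy_new_block`,
`is_system_block`, and `NBLOCKS` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 16              /* hypothetical toy filesystem size */

bool bitmap[NBLOCKS];           /* true = block is marked in use */

/* Assumption: blocks 0..3 hold bitmaps/inode table ("system zone"). */
bool is_system_block(unsigned block)
{
    return block < 4;
}

/* Free one block. A system-zone block is reported and its bit is left
 * set, so a corrupt inode cannot release metadata. Returns true only
 * if the bit was actually cleared. */
bool toy_free_block(unsigned block)
{
    assert(block < NBLOCKS);
    if (is_system_block(block)) {
        fprintf(stderr, "Freeing block in system zone - block = %u\n", block);
        return false;
    }
    if (!bitmap[block]) {
        fprintf(stderr, "bit already cleared for block %u\n", block);
        return false;
    }
    bitmap[block] = false;
    return true;
}

/* Allocate the first usable free block. If a system-zone block shows up
 * as free, mark it in use again and keep searching instead of handing
 * it out. Returns the block number, or -1 if nothing is free. */
int toy_new_block(void)
{
    for (unsigned b = 0; b < NBLOCKS; b++) {
        if (bitmap[b])
            continue;
        if (is_system_block(b)) {
            fprintf(stderr, "Allocating block in system zone - block = %u\n", b);
            bitmap[b] = true;   /* repair the bitmap, then look elsewhere */
            continue;
        }
        bitmap[b] = true;
        return (int)b;
    }
    return -1;
}
```

In this sketch, a request to free block 2 is refused and the bit stays set,
while an allocation that stumbles over a wrongly "free" block 1 re-marks it
and returns the next ordinary free block instead.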

This change has implications for ext2/ext3 forward/backward compatibility,
but it is much more robust in the face of any errors in an inode's block
list, which could cause cascading errors if you proceed to scribble over
the itable or free blocks list. I'm CCing the ext2-devel list on this for
discussion, but it should strongly be considered for inclusion into the
official 2.4/2.5 ext2/ext3 code.

Cheers, Andreas

PS - as always, this patch is extracted from among other changes in my
ext2 code, so it may apply with offsets or minor context issues.
====================== ext2-2.4.17-sysblk.diff =============================
--- linux-2.4.17.orig/fs/ext2/balloc.c Thu Oct 25 02:02:41 2001
+++ linux/fs/ext2/balloc.c Thu Dec 13 12:11:47 2001
@@ -302,22 +300,20 @@
if (!gdp)
goto error_return;

- if (in_range (le32_to_cpu(gdp->bg_block_bitmap), block, count) ||
- in_range (le32_to_cpu(gdp->bg_inode_bitmap), block, count) ||
- in_range (block, le32_to_cpu(gdp->bg_inode_table),
- sb->u.ext2_sb.s_itb_per_group) ||
- in_range (block + count - 1, le32_to_cpu(gdp->bg_inode_table),
- sb->u.ext2_sb.s_itb_per_group))
- ext2_error (sb, "ext2_free_blocks",
- "Freeing blocks in system zones - "
- "Block = %lu, count = %lu",
- block, count);
+ for (i = 0; i < count; i++, block++) {
+ if (block == le32_to_cpu(gdp->bg_block_bitmap) ||
+ block == le32_to_cpu(gdp->bg_inode_bitmap) ||
+ in_range(block, le32_to_cpu(gdp->bg_inode_table),
+ sb->u.ext2_sb.s_itb_per_group)) {
+ ext2_error(sb, "ext2_free_blocks",
+ "Freeing block in system zone - block = %lu",
+ block);
+ continue;
+ }

- for (i = 0; i < count; i++) {
if (!ext2_clear_bit (bit + i, bh->b_data))
- ext2_error (sb, "ext2_free_blocks",
- "bit already cleared for block %lu",
- block + i);
+ ext2_error(sb, "ext2_free_blocks",
+ "bit already cleared for block %lu", block);
else {
DQUOT_FREE_BLOCK(inode, 1);
gdp->bg_free_blocks_count =
@@ -336,7 +332,6 @@
wait_on_buffer (bh);
}
if (overflow) {
- block += count;
count = overflow;
goto do_more;
}
@@ -522,8 +522,12 @@
in_range (tmp, le32_to_cpu(gdp->bg_inode_table),
sb->u.ext2_sb.s_itb_per_group))
ext2_error (sb, "ext2_new_block",
- "Allocating block in system zone - "
- "block = %u", tmp);
+ "Allocating block in system zone - block = %u",
+ tmp);
+ ext2_set_bit(j, bh->b_data);
+ DQUOT_FREE_BLOCK(inode, 1);
+ goto repeat;
+ }

if (ext2_set_bit (j, bh->b_data)) {
ext2_warning (sb, "ext2_new_block",
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-01-18 00:49:24

by Patrick Scharrenberg

Subject: Re: 2.4.17 strange ext2 error

Hi,

> On Jan 17, 2002 21:40 +0100, Patrick Scharrenberg wrote:
> > yesterday I got a very strange ext2 error on my linux machine..
> > The system has a 5-disk raid-5-software-raid and on top of this there is one
> > ext2 fs which was clean when mounted 1 week ago..
> >
> > kernel 2.4.17 (since 1 week)
> > before it was 2.4.10
>
> When you say it was "clean when mounted 1 week ago" does this mean that you
> had run e2fsck on it at that time, or just that it did not report any
> errors when you mounted it? Sometimes it is possible to have corruption
> in your fs for a while without noticing it if you don't run fsck on it.

I meant that I had run fsck one week ago, because the filesystem was
corrupt (but this was my fault).
At that same time (but after the fsck) I decided to switch to kernel 2.4.17.
But this shouldn't cause such errors with file-loss...

I don't know if it is safe for me to run 2.4.17. For now I switched back to
2.4.10.
Tomorrow I'll start a new backup of the data and then try 2.4.17 again.
If the error occurs again, I'll post again.... :-)

..patrick