2009-05-13 06:26:36

by Kevin Shanahan

Subject: More ext4 acl/xattr corruption - 4th occurrence now

Hi,

This fs really doesn't like whatever I am doing with it. Here is
another instance of corruption, again with the first symptom showing
up as corrupted acls and xattrs as the backup script crawls over the
directory structure:

getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK: Input/output error
getfacl: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.IF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK: Input/output error
getfattr: apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF: Input/output error
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.DF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.SMK" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.DF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.IF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.SMK" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.SMK" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLSOUR.DF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.IF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.SMK" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.IF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLORIG.DF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLOUTC.DF" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.SMK" failed: Input/output error (5)
rsync: readlink "/srv/samba/local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROB.DF" failed: Input/output error (5)
skipping non-regular file "apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLNATU.IF"

Again, here is the little breadcrumb left at the end of dmesg:

init_special_inode: bogus i_mode (36545)
attempt to access beyond end of device
dm-0: rw=0, want=256153956122632, limit=2147483648
attempt to access beyond end of device
dm-0: rw=0, want=256153956122632, limit=2147483648

And following the same formula as last time(s):

hermes:~# debugfs /dev/dm-0
debugfs: stat "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF"
invalid inode->i_extra_isize (8224)
Inode: 2542 Type: bad type Mode: 0043 Flags: 0x5849462f
Generation: 538970637 Version: 0x66663030:535f4445
User: 538980401 Group: 538993001 Size: 996566576
File ACL: 538976288 Directory ACL: 0
Links: 8812 Blockcount: 35322822674750
Fragment: Address: 538976288 Number: 0 Size: 0
ctime: 0x41462d54:65636166 -- Tue Sep 14 08:59:24 2004
atime: 0x4e4f4620:63206c61 -- Sat Aug 20 14:59:04 2011
mtime: 0x594c494d:6972413d -- Fri Jun 23 08:18:45 2017
crtime: 0x726f6c6f:3138233d -- Sun Nov 3 13:24:39 2030
dtime: 0x7241203a -- Sun Sep 29 09:35:14 2030
Size of extra inode fields: 8224
BLOCKS:

debugfs: imap "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF"
Inode 2542 is part of block group 0
located at block 447, offset 0x0d00

hermes:~# dd if=/dev/dm-0 of=block-447.dump bs=4k skip=447 count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0121164 s, 338 kB/s
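As a cross-check on the imap output, the in-block offset debugfs reported can be reproduced by simple arithmetic. This is only a sketch and assumes the common 4 KiB block size and 256-byte on-disk inodes (the actual inode size should be confirmed against dumpe2fs output):

```python
# Locate inode 2542 inside its inode-table block: with 4 KiB blocks
# and (assumed) 256-byte inodes, each table block holds 16 inodes.
BLOCK_SIZE = 4096
INODE_SIZE = 256                              # assumption; check dumpe2fs "Inode size"
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE   # 16

ino = 2542
index = (ino - 1) % INODES_PER_BLOCK          # 0-based slot within the block
offset = index * INODE_SIZE

print(hex(offset))   # 0xd00 -- matches debugfs: "offset 0x0d00"
```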

Now, this is (possibly) interesting - that block contains a bunch of
file data. It looks like an HTML email (I can tell it's email because of
the FIXED_ prefix added to the tags by the mail sanitizer).

If I can locate the source of that data, perhaps it will point to
where the corruption is coming from? Any tips on scanning for the
data? I'll start with simple find and grep and see how far I get.
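Incidentally, several of the garbage field values in the debugfs stat output decode directly to ASCII when repacked as the little-endian bytes they occupy on disk, which fits the theory that plain file data was written over the inode. A quick illustration, using values copied from the stat output above:

```python
import struct

# Field values from the corrupted "stat" output, dumped as the
# little-endian bytes they occupy on disk.
for name, val in [("File ACL / Fragment Address", 538976288),
                  ("Flags", 0x5849462f),
                  ("Generation", 538970637)]:
    print(f"{name}: {struct.pack('<I', val)!r}")
# -> b'    ' (four spaces), b'/FIX' (a FIXED_-style tag fragment),
#    and b'\r\n  ' (a CRLF plus indentation) -- all plausible email text.
```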

Regards,
Kevin.


Attachments:
(No filename) (6.24 kB)
block-447.dump (4.00 kB)

2009-05-13 23:56:18

by Kevin Shanahan

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote:
> And following the same formula as last time(s):
>
> hermes:~# debugfs /dev/dm-0
> debugfs: stat "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF"
> invalid inode->i_extra_isize (8224)
> Inode: 2542 Type: bad type Mode: 0043 Flags: 0x5849462f
> Generation: 538970637 Version: 0x66663030:535f4445
> User: 538980401 Group: 538993001 Size: 996566576
> File ACL: 538976288 Directory ACL: 0
> Links: 8812 Blockcount: 35322822674750
> Fragment: Address: 538976288 Number: 0 Size: 0
> ctime: 0x41462d54:65636166 -- Tue Sep 14 08:59:24 2004
> atime: 0x4e4f4620:63206c61 -- Sat Aug 20 14:59:04 2011
> mtime: 0x594c494d:6972413d -- Fri Jun 23 08:18:45 2017
> crtime: 0x726f6c6f:3138233d -- Sun Nov 3 13:24:39 2030
> dtime: 0x7241203a -- Sun Sep 29 09:35:14 2030
> Size of extra inode fields: 8224
> BLOCKS:
>
> debugfs: imap "local/apps/OLD-APPS/APPS/NWAPPS/OAIII/OATEMP/F_CLPROF.IF"
> Inode 2542 is part of block group 0
> located at block 447, offset 0x0d00
>
> hermes:~# dd if=/dev/dm-0 of=block-447.dump bs=4k skip=447 count=1
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB) copied, 0.0121164 s, 338 kB/s
>
> Now, this is (possibly) interesting - that block contains a bunch of
> file data. It looks like an HTML email (I can tell it's email because of
> the FIXED_ prefix added to the tags by the mail sanitizer).
>
> If I can locate the source of that data, perhaps it will point to
> where the corruption is coming from? Any tips on scanning for the
> data? I'll start with simple find and grep and see how far I get.

I didn't find this in any file in the current directory structure. I
guess it could be old data that hadn't been zeroed out.

However, the only binary data I can see seems to be at offset 0x0155
(341), nowhere near offset 0x0d00 (unless I misunderstand the imap
output above). Does that little blob of binary data make any sense as
an inode?
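For what it's worth, one rough way to test whether a blob in the dumped block plausibly parses as an on-disk inode is to unpack the first couple of fields at that offset and sanity-check them. This is only a heuristic sketch (field offsets per the classic ext2/3/4 inode layout; text data can still produce false positives), not a real verifier:

```python
import struct

def looks_like_inode(block: bytes, offset: int) -> bool:
    """Heuristic: does a 256-byte region resemble an ext4 on-disk inode?

    Checks the little-endian i_mode (offset 0) for a known file-type
    nibble and i_links_count (offset 26) for a sane value.  Purely a
    coarse filter -- real verification needs e2fsck.
    """
    raw = block[offset:offset + 256]
    if len(raw) < 256:
        return False
    i_mode, = struct.unpack_from("<H", raw, 0)
    i_links, = struct.unpack_from("<H", raw, 26)
    file_type = i_mode & 0xF000
    known = {0x1000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000}
    return file_type in known and 0 < i_links < 65000

# Usage against the dump produced by the dd command earlier:
# block = open("block-447.dump", "rb").read()
# print(looks_like_inode(block, 0x0d00), looks_like_inode(block, 0x0155))
```

Note that with 256-byte inodes the table slots are 256-byte aligned, so a structure starting at 0x0155 could at best be a fragment of one.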

Cheers,
Kevin.

2009-05-14 04:40:16

by Theodore Ts'o

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote:
>
> Now, this is (possibly) interesting - that block contains a bunch of
> file data. It looks like an HTML email (I can tell it's email because of
> the FIXED_ prefix added to the tags by the mail sanitizer).

That is interesting. Knowing that it is file data means we might be
able to track things down this way.

Are you able to apply the following patch to your kernel? If so,
hopefully we'll be able to catch whatever is causing the problem in
the act.

- Ted

commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13
Author: Theodore Ts'o <[email protected]>
Date: Thu May 14 00:39:48 2009 -0400

ext4: Add a block validity check to ext4_get_blocks_wrap()

A few users with very large disks have been reporting low block number
filesystem corruptions, potentially zapping the block group
descriptors or inodes in the first inode table block. It's not clear
what is causing this, but most recently, it appears that whatever is
trashing the filesystem metadata appears to be file data. So let's
try to set a trap for the corruption in ext4_get_blocks_wrap(), which
is where logical blocks in an inode are mapped to physical blocks.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4e7f363..4fad4fa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
ext4_discard_preallocations(inode);
}

+static int check_block_validity(struct inode *inode, sector_t logical,
+ sector_t phys, int len)
+{
+ ext4_fsblk_t valid_block, itable_block;
+ struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;
+ struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0);
+ unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group;
+
+ valid_block = le32_to_cpu(es->s_first_data_block) +
+ EXT4_SB(inode->i_sb)->s_gdb_count;
+ itable_block = ext4_inode_table(inode->i_sb, gdp);
+
+ if (unlikely((phys <= valid_block) ||
+ ((phys + len - 1) > ext4_blocks_count(es)) ||
+ ((phys >= itable_block) &&
+ (phys <= itable_block + itable_len)) ||
+ ((phys + len - 1 >= itable_block) &&
+ (phys + len - 1 <= itable_block + itable_len)))) {
+ ext4_error(inode->i_sb, "check_block_validity",
+ "inode #%lu logical block %llu mapped to %llu "
+ "(size %d)", inode->i_ino,
+ (unsigned long long) logical,
+ (unsigned long long) phys, len);
+ WARN_ON(1);
+ return -EIO;
+ }
+ return 0;
+}
+
/*
* The ext4_get_blocks_wrap() function try to look up the requested blocks,
* and returns if the blocks are already mapped.
@@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
}
up_read((&EXT4_I(inode)->i_data_sem));

+ if (retval > 0 && buffer_mapped(bh)) {
+ int ret = check_block_validity(inode, block,
+ bh->b_blocknr, retval);
+ if (ret != 0)
+ return ret;
+ }
+
/* If it is only a block(s) look up */
if (!create)
return retval;
@@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
}

up_write((&EXT4_I(inode)->i_data_sem));
+ if (retval > 0 && buffer_mapped(bh)) {
+ int ret = check_block_validity(inode, block,
+ bh->b_blocknr, retval);
+ if (ret != 0)
+ return ret;
+ }
return retval;
}


2009-05-14 11:07:03

by Kevin Shanahan

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote:
> On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote:
> >
> > Now, this is (possibly) interesting - that block contains a bunch of
> > file data. It looks like an HTML email (I can tell it's email because of
> > the FIXED_ prefix added to the tags by the mail sanitizer).
>
> That is interesting. Knowing that it is file data means we might be
> able to track things down this way.
>
> Are you able to apply the following patch to your kernel? If so,
> hopefully we'll be able to catch whatever is causing the problem in
> the act.

Sure - now running with 2.6.29.3 + your patch.

patching file fs/ext4/inode.c
Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines).
Hunk #2 succeeded at 1113 (offset -81 lines).
Hunk #3 succeeded at 1184 (offset -93 lines).

I'll report any hits for "check_block_validity" in syslog.

Thank you, much appreciated.

Cheers,
Kevin.

> [patch snipped]

2009-05-14 11:17:33

by Manish Katiyar

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 4:37 PM, Kevin Shanahan <[email protected]> wrote:
> On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote:
>> On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote:
>> >
>> > Now, this is (possibly) interesting - that block contains a bunch of
>> > file data. It looks like an HTML email (I can tell it's email because of
>> > the FIXED_ prefix added to the tags by the mail sanitizer).
>>
>> That is interesting. Knowing that it is file data means we might be
>> able to track things down this way.
>>
>> Are you able to apply the following patch to your kernel?

Hi Ted,

If we have the possibility of applying patches, why not apply this one
too and try http://markmail.org/message/e3uwzxrd42isxczg? After all, it
originated from this very bug.

Thanks -
Manish


>> If so,
>> hopefully we'll be able to catch whatever is causing the problem in
>> the act.
>
> Sure - now running with 2.6.29.3 + your patch.
>
> patching file fs/ext4/inode.c
> Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines).
> Hunk #2 succeeded at 1113 (offset -81 lines).
> Hunk #3 succeeded at 1184 (offset -93 lines).
>
> I'll report any hits for "check_block_validity" in syslog.
>
> Thank you, much appreciated.
>
> Cheers,
> Kevin.
>
>> [patch snipped]



--
Thanks -
Manish

2009-05-14 12:30:50

by Theodore Ts'o

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 04:47:14PM +0530, Manish Katiyar wrote:
>
> Hi Ted,
>
> If we have the possibility of applying patches, why not apply this one
> too and try http://markmail.org/message/e3uwzxrd42isxczg? After all, it
> originated from this very bug.

Because I implemented this functionality already in
fs/ext4/inode.c:ext4_iget(). Check out commit 563bdd61 in the kernel
mainline. (It's not yet in stable release, as it was merged after the
2.6.29 merge window.)

- Ted

2009-05-14 13:25:08

by Kevin Shanahan

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote:
> Sure - now running with 2.6.29.3 + your patch.
>
> patching file fs/ext4/inode.c
> Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines).
> Hunk #2 succeeded at 1113 (offset -81 lines).
> Hunk #3 succeeded at 1184 (offset -93 lines).
>
> I'll report any hits for "check_block_validity" in syslog.

That didn't take long:

May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1)
May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8.
May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30
May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1
May 14 22:49:17 hermes kernel: Call Trace:
May 14 22:49:17 hermes kernel: [<ffffffff802fdce3>] ext4_da_writepages+0x211/0x427
May 14 22:49:17 hermes kernel: [<ffffffff8055b44f>] thread_return+0x3d/0xc3
May 14 22:49:17 hermes kernel: [<ffffffff802fef56>] ext4_da_get_block_write+0x0/0x12d
May 14 22:49:17 hermes kernel: [<ffffffff802757df>] do_writepages+0x20/0x2d
May 14 22:49:17 hermes kernel: [<ffffffff802aa359>] __writeback_single_inode+0x196/0x3a4
May 14 22:49:17 hermes kernel: [<ffffffff802542aa>] del_timer_sync+0xc/0x16
May 14 22:49:17 hermes kernel: [<ffffffff802aa9b9>] generic_sync_sb_inodes+0x2a7/0x439
May 14 22:49:17 hermes kernel: [<ffffffff802aacfa>] writeback_inodes+0x7d/0xcc
May 14 22:49:17 hermes kernel: [<ffffffff8027591e>] wb_kupdate+0x9f/0x116
May 14 22:49:17 hermes kernel: [<ffffffff802764bf>] pdflush+0x125/0x1d2
May 14 22:49:17 hermes kernel: [<ffffffff8027587f>] wb_kupdate+0x0/0x116
May 14 22:49:17 hermes kernel: [<ffffffff8027639a>] pdflush+0x0/0x1d2
May 14 22:49:17 hermes kernel: [<ffffffff8025c8cc>] kthread+0x47/0x73
May 14 22:49:17 hermes kernel: [<ffffffff80225f7a>] child_rip+0xa/0x20
May 14 22:49:17 hermes kernel: [<ffffffff8025c885>] kthread+0x0/0x73
May 14 22:49:17 hermes kernel: [<ffffffff80225f70>] child_rip+0x0/0x20
May 14 22:49:18 hermes kernel: Remounting filesystem read-only
May 14 22:49:18 hermes kernel: ------------[ cut here ]------------
May 14 22:49:18 hermes kernel: WARNING: at fs/ext4/inode.c:1066 check_block_validity+0xd8/0xff()
May 14 22:49:18 hermes kernel: Hardware name: S3210SH
May 14 22:49:18 hermes kernel: Pid: 3351, comm: smbd Not tainted 2.6.29.3 #1
May 14 22:49:18 hermes kernel: Call Trace:
May 14 22:49:18 hermes kernel: [<ffffffff8024c0b0>] warn_slowpath+0xd3/0x10d
May 14 22:49:18 hermes kernel: [<ffffffff8025cc0e>] wake_bit_function+0x0/0x23
May 14 22:49:18 hermes kernel: [<ffffffff802afbec>] sync_dirty_buffer+0x5f/0x97
May 14 22:49:18 hermes kernel: [<ffffffff8030589e>] ext4_commit_super+0xbc/0x103
May 14 22:49:18 hermes kernel: [<ffffffff80307f92>] ext4_handle_error+0x83/0xa9
May 14 22:49:18 hermes kernel: [<ffffffff803081cb>] ext4_error+0x83/0x90
May 14 22:49:18 hermes kernel: [<ffffffff8051c5ab>] ipt_do_table+0x308/0x32d
May 14 22:49:18 hermes kernel: [<ffffffff80245f5c>] dequeue_task_fair+0x45/0x167
May 14 22:49:18 hermes kernel: [<ffffffff802fd222>] check_block_validity+0xd8/0xff
May 14 22:49:18 hermes kernel: [<ffffffff802febe1>] ext4_get_blocks_wrap+0xb5/0x24f
May 14 22:49:18 hermes kernel: [<ffffffff802fee2d>] ext4_da_get_block_prep+0x58/0x181
May 14 22:49:18 hermes kernel: [<ffffffff802b0187>] __block_prepare_write+0x150/0x2c4
May 14 22:49:18 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181
May 14 22:49:18 hermes kernel: [<ffffffff802701e4>] add_to_page_cache_locked+0x73/0x95
May 14 22:49:18 hermes kernel: [<ffffffff802b0447>] block_write_begin+0x7a/0xc7
May 14 22:49:18 hermes kernel: [<ffffffff802fce53>] ext4_da_write_begin+0x166/0x1ed
May 14 22:49:18 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181
May 14 22:49:18 hermes kernel: [<ffffffff80270b94>] generic_file_buffered_write+0x12c/0x2e8
May 14 22:49:18 hermes kernel: [<ffffffff80271145>] __generic_file_aio_write_nolock+0x25f/0x293
May 14 22:49:18 hermes kernel: [<ffffffff80271928>] generic_file_aio_write+0x64/0xc4
May 14 22:49:18 hermes kernel: [<ffffffff802f8d33>] ext4_file_write+0x93/0x118
May 14 22:49:18 hermes kernel: [<ffffffff80292f9c>] do_sync_write+0xce/0x113
May 14 22:49:18 hermes kernel: [<ffffffff802bc5a0>] vfs_test_lock+0x2e/0x32
May 14 22:49:18 hermes kernel: [<ffffffff8025cbe0>] autoremove_wake_function+0x0/0x2e
May 14 22:49:18 hermes kernel: [<ffffffff802bd116>] fcntl_setlk+0x24b/0x25c
May 14 22:49:18 hermes kernel: [<ffffffff8029373f>] vfs_write+0xa5/0xde
May 14 22:49:18 hermes kernel: [<ffffffff802937cf>] sys_pwrite64+0x57/0x77
May 14 22:49:18 hermes kernel: [<ffffffff8029d8c9>] sys_fcntl+0x312/0x31e
May 14 22:49:18 hermes kernel: [<ffffffff8022501b>] system_call_fastpath+0x16/0x1b
May 14 22:49:18 hermes kernel: ---[ end trace df4d7250236f7465 ]---
May 14 22:49:18 hermes kernel: EXT4-fs error (device dm-0) in ext4_da_write_begin: IO failure

Any clues there? I don't think I'll be able to run this during the day
if it's going to trigger and remount the fs read-only as easily as
this.

Cheers,
Kevin.

2009-05-14 14:07:34

by Theodore Ts'o

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 10:55:06PM +0930, Kevin Shanahan wrote:
> On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote:
> > Sure - now running with 2.6.29.3 + your patch.
> >
> > patching file fs/ext4/inode.c
> > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines).
> > Hunk #2 succeeded at 1113 (offset -81 lines).
> > Hunk #3 succeeded at 1184 (offset -93 lines).
> >
> > I'll report any hits for "check_block_validity" in syslog.
>
> That didn't take long:
>
> May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1)
> May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8.
> May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30
> May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1
> May 14 22:49:17 hermes kernel: Call Trace:

The reason why I put in the ext4_error() was because I figured it was
better to stop the filesystem from stomping on the inode table (since
that way would lie data loss). It would also freeze the filesystem so
we could see what was happening.

Could you run debugfs on the filesystem, and then send me the results
of these commands:

debugfs: stat <759>
debugfs: ncheck 759

Also, could you try run e2fsck on the filesystem and send me the
output? The ext4_error() should have stopped the kernel from doing
any damage (like stomping on block 529, which was part of the inode
table or block group descriptors). If send me the dumpe2fs output
about group 0, we'll know for sure (i.e.):

Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
Checksum 0x3e19, unused inodes 8169
Primary superblock at 0, Group descriptors at 1-5
Reserved GDT blocks at 6-32
Block bitmap at 33 (+33), Inode bitmap at 49 (+49)
Inode table at 65-576 (+65)

(Your numbers will be different; this was from a 80GB hard drive).
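Plugging the numbers from that example layout into the check is just a range test. Using the example figures above (inode table at blocks 65-576), the physical block 529 from the check_block_validity report would indeed fall inside the inode table:

```python
# Example-layout numbers from the dumpe2fs excerpt above (80 GB disk);
# the real filesystem's group-0 numbers will differ.
itable_start, itable_end = 65, 576   # "Inode table at 65-576 (+65)"
phys = 529                           # block reported by check_block_validity

in_inode_table = itable_start <= phys <= itable_end
print(in_inode_table)   # True -- a write there would stomp on inodes
```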

> Any clues there? I don't think I'll be able to run this during the day
> if it's going to trigger and remount the fs read-only as easily as
> this.

Yes, this is *huge*. Finding this may have been just what we needed
to determine what has been going on. Thank you!!

What I would suggest doing is to mount the filesystem with the mount
option -o nodelalloc. This would be useful for two fronts: (1) it
would determine whether or not the problem shows up if we suppress
delayed allocation, and (2) if it works, hopefully it will prevent you
from further filesystem corruption incidents. (At least as a
workaround; turning off delayed allocation will result in a
performance decrease, but better that than data loss, I always say!)

The fact that this triggered so quickly is actually a bit scary, so
hopefully we'll be able to track down exactly what happened fairly
quickly from here, and then fix the problem for real.

Thanks,

- Ted

2009-05-14 14:30:18

by Kevin Shanahan

Subject: Re: More ext4 acl/xattr corruption - 4th occurrence now

On Thu, May 14, 2009 at 10:07:32AM -0400, Theodore Tso wrote:
> On Thu, May 14, 2009 at 10:55:06PM +0930, Kevin Shanahan wrote:
> > On Thu, May 14, 2009 at 08:37:00PM +0930, Kevin Shanahan wrote:
> > > Sure - now running with 2.6.29.3 + your patch.
> > >
> > > patching file fs/ext4/inode.c
> > > Hunk #1 succeeded at 1040 with fuzz 1 (offset -80 lines).
> > > Hunk #2 succeeded at 1113 (offset -81 lines).
> > > Hunk #3 succeeded at 1184 (offset -93 lines).
> > >
> > > I'll report any hits for "check_block_validity" in syslog.
> >
> > That didn't take long:
> >
> > May 14 22:49:17 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 1741329 mapped to 529 (size 1)
> > May 14 22:49:17 hermes kernel: Aborting journal on device dm-0:8.
> > May 14 22:49:17 hermes kernel: ext4_da_writepages: jbd2_start: 293 pages, ino 759; err -30
> > May 14 22:49:17 hermes kernel: Pid: 374, comm: pdflush Not tainted 2.6.29.3 #1
> > May 14 22:49:17 hermes kernel: Call Trace:
>
> The reason why I put in the ext4_error() was because I figured it was
> better to stop the filesystem from stomping on the inode table (since
> that way would lie data loss). It would also freeze the filesystem so
> we could see what was happening.

Sure, makes sense. I guess I have gotten complacent as I have been
getting away with it for several months now. :)

> Could you run debugfs on the filesystem, and then send me the results
> of these commands:
>
> debugfs: stat <759>

hermes:~# debugfs /dev/dm-0
debugfs 1.41.3 (12-Oct-2008)
debugfs: stat <759>

Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
Generation: 3979120103 Version: 0x00000000:00000001
User: 0 Group: 10140 Size: 14615630848
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 28546168
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
= "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44)
DOSATTRIB = "0x20" (4)
BLOCKS:
(IND):1081467, (0-24575):999424-1023999, (24576-26623):1028096-1030143, (26624-43007):1032192-1048575, (43008-45055):1087488-1089535, (45056-51199):1550336-1556479, (51200-55295):1560576-1564671, (55296-63487):1540096-1548287, (63488-65535):2985984-2988031, (65536-92159):2996224-3022847, (92160-100351):3024896-3033087, (100352-112639):3035136-3047423, (112640-155647):3049472-3092479, (155648-157695):3100672-3102719, (157696-165887):3092480-3100671, (165888-206847):3104768-3145727, (206848-208895):3186688-3188735, (208896-681983):3196928-3670015, (681984-1042431):3735552-4095999, (1042432-1048575):4155392-4161535, (1048576-1050623):4182016-4184063, (1050624-1052671):4186112-4188159, (1052672-1056767):4177920-4182015, (1056768-1060863):4190208-4194303, (1060864-1124351):4655104-4718591, (112
4352-1126399):5390336-5392383, (1126400-1142783):5423104-5439487, (1142784-1155071):5478400-5490687, (1155072-1159167):5472256-5476351, (1159168-1433599):5492736-5767167, (1433600-1437695):6057984-6062079, (1437696-1634303):6094848-6291455, (1634304-1636351):6432768-6434815, (1636352-1640447):6483968-6488063, (1640448-1646591):6498304-6504447, (1646592-1662975):6510592-6526975, (1662976-1718271):6533120-6588415, (1718272-1943551):6590464-6815743, (1943552-2203647):6914048-7174143, (2203648-2205695):7182336-7184383, (2205696-2207743):7192576-7194623, (2207744-2211839):7188480-7192575, (2211840-2355199):7196672-7340031, (2355200-2357247):7380992-7383039, (2357248-2359295):7387136-7389183, (2359296-2463743):7444480-7548927, (2463744-2531327):7567360-7634943, (2531328-2746367):7649280-786431
9, (2746368-2811903):7897088-7962623, (2811904-2813951):7979008-7981055, (2813952-2869247):7995392-8050687, (2869248-2879487):8052736-8062975, (2879488-2881535):8081408-8083455, (2881536-2897919):8065024-8081407, (2897920-2906111):8085504-8093695, (2906112-2908159):8110080-8112127, (2908160-2910207):8114176-8116223, (2910208-2920447):8120320-8130559, (2920448-3026943):8136704-8243199, (3026944-3028991):8249344-8251391, (3028992-3031039):8253440-8255487, (3031040-3035135):8245248-8249343, (3035136-3037183):8292352-8294399, (3037184-3041279):8298496-8302591, (3041280-3043327):8398848-8400895, (3043328-3045375):8402944-8404991, (3045376-3049471):8407040-8411135, (3049472-3055615):8417280-8423423, (3055616-3094527):8501248-8540159, (3094528-3096575):8710144-8712191, (3096576-3110911):8726528
-8740863, (3110912-3112959):8779776-8781823, (3112960-3115007):8806400-8808447, (3115008-3119103):8790016-8794111, (3119104-3162111):8818688-8861695, (3162112-3164159):8863744-8865791, (3164160-3166207):8923136-8925183, (3166208-3235839):11603968-11673599, (3235840-3276799):12998656-13039615, (3276800-3278847):13070336-13072383, (3278848-3282943):13058048-13062143, (3282944-3287039):13066240-13070335, (3287040-3303423):13041664-13058047, (3303424-3307519):16058368-16062463, (3307520-3311615):16084992-16089087, (3311616-3334143):16119808-16142335, (3334144-3336191):16150528-16152575, (3336192-3340287):16146432-16150527, (3340288-3342335):16154624-16156671, (3342336-3348479):16160768-16166911, (3348480-3350527):16199680-16201727, (3350528-3352575):16203776-16205823, (3352576-3354623):16226
304-16228351, (3354624-3356671):16359424-16361471, (3356672-3360767):16369664-16373759, (3360768-3362815):16377856-16379903, (3362816-3364863):16388096-16

> debugfs: ncheck 759

debugfs: ncheck 759
Inode Pathname
759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf

By the way, that filename on disk is just
"exchange-2000-UCWB-KVM-18.bkf". Not sure if the way it appears to
repeat itself at the end there is indicative of a problem.

> Also, could you try run e2fsck on the filesystem and send me the
> output? The ext4_error() should have stopped the kernel from doing
> any damage (like stomping on block 529, which was part of the inode
> table or block group descriptors). If send me the dumpe2fs output
> about group 0, we'll know for sure (i.e.):
>
> Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
> Checksum 0x3e19, unused inodes 8169
> Primary superblock at 0, Group descriptors at 1-5
> Reserved GDT blocks at 6-32
> Block bitmap at 33 (+33), Inode bitmap at 49 (+49)
> Inode table at 65-576 (+65)
>
> (Your numbers will be different; this was from a 80GB hard drive).

hermes:~# e2fsck -pfv /dev/dm-0
/dev/dm-0: recovering journal

331810 inodes used (0.49%)
16080 non-contiguous inodes (4.8%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 331411/329
46723278 blocks used (17.41%)
0 bad blocks
3 large files

305051 regular files
26686 directories
0 character device files
0 block device files
0 fifos
0 links
64 symbolic links (60 fast symbolic links)
0 sockets
--------
331801 files

hermes:~# dumpe2fs /dev/dm-0
...
Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
Checksum 0x0eb9, unused inodes 0
Primary superblock at 0, Group descriptors at 1-64
Reserved GDT blocks at 65-256
Block bitmap at 257 (+257), Inode bitmap at 273 (+273)
Inode table at 289-800 (+289)
3443 free blocks, 0 free inodes, 571 directories
Free blocks: 9067, 9108-9111, 9116-9119, 9276-9279, 9308-9311, 9687, 9720-9727, 9736-9743, 9848-9879, 9888-9895, 9936-9951, 10032-10047, 10072-10079, 10200-10207, 10352-10367, 10584-10591, 10619-10623, 10661-10671, 10740-10751, 11476-11487, 11512-11519, 11637-11647, 11669-11679, 11711, 11772-11775, 11890-11903, 11963-11967, 12187-12191, 12213-12223, 12249-12255, 12272-12280, 12282-12287, 12322-12351, 12381-12383, 12405-12415, 12437-12447, 12478-12479, 12542-12543, 12664-12671, 12727-12735, 12766-12767, 12851-12863, 12890-12895, 12927, 12970, 12980-12991, 13047-13055, 13246-13247, 13290-13295, 13312-13317, 13406-13407, 13433, 13439, 13456-13463, 13490-13503, 13537-13567, 13658-13695, 13770-13823, 13850-13855, 13886-13887, 13908-13919, 13951, 13971-13983, 14010-14015, 14056-14079, 14204-14
207, 14249-14271, 14316-14350, 14358-14361, 14375, 14487-14495, 14526-14527, 14550-14584, 14605, 14619, 14630-14635, 14647-14655, 14660, 14674-14687, 14705-14719, 14744-14774, 14785-14815, 14844-14847, 14889-14926, 14933-14944, 14948, 14965-14975, 14992-15000, 15006-15007, 15035-15039, 15069-15087, 15097-15102, 15114-15127, 15141-15156, 15158-15164, 15176-15184, 15192-15193, 15213, 15223-15230, 15238, 15241-15256, 15264-15281, 15289-15297, 15305-15325, 15333-15341, 15350-15359, 15427-15455, 15482-15487, 15599-15615, 15705-15711, 15732-15743, 15798-15807, 15860-15871, 15923-15935, 15956-15967, 15989-15999, 16040-16063, 16090-16095, 16123-16127, 16171-16191, 16219-16223, 16246-16255, 16309-16319, 16343-16351, 16379-16383, 16565-16575, 16624-16639, 16688-16703, 16721-16735, 16763-16767, 168
37-16863, 16881-16895, 17036-17055, 17075-17087, 17147-17151, 17220-17247, 17267-17279, 17307-17311, 17339-17343, 17365-17375, 17403-17407, 17453-17471, 17531-17535, 17554-17567, 17599, 17647-17663, 17718-17727, 17756-17759, 17812-17823, 17844-17855, 17904-17919, 18103-18111, 18131-18143, 18162-18175, 18277-18303, 18331-18335, 18363-18367, 18393-18399, 18427-18431, 18598-18623, 18676-18687, 18706-18719, 18738-18751, 18805-18815, 18844-18847, 18876-18879, 18908-18911, 18933-18943, 19135, 19198-19199, 19249-19263, 19286-19295, 19323-19327, 19416-19423, 19455, 19527-19551, 19582-19583, 19704-19711, 19756-19775, 19835-19839, 19887-19903, 19925-19935, 19959-19967, 20148-20159, 20206-20223, 20271-20287, 20336-20351, 20422-20447, 20476-20479, 20671, 20727-20735, 20758-20767, 20795-20799, 20818-
20831, 20849-20863, 20891-20895, 20918-20927, 20955-20959, 20980-20991, 21133-21151, 21172-21183, 21221-21247, 21265-21279, 21307-21311, 21345-21375, 21425-21439, 21476-21503, 21531-21553, 21560-21596, 21604, 21611, 21630-21631, 21662-21666, 21672-21673, 21683, 21695-21725, 21732-21755, 21765-21782, 21795-21842, 21848-21875, 21890-21919, 21947-21951, 22005-22030, 22038-22052, 22062-22076, 22091-22092, 22107-22118, 22131-22140, 22154-22183, 22196-22198, 22569-22591, 22619-22623, 22651-22655, 22760-22783, 22803-22815, 22847, 22888-22911, 23002-23007, 23035-23039, 23153-23167, 23189-23199, 23260-23263, 23284-23295, 23336-23359, 23402-23423, 23462-23487, 23512-23519, 23544-23551, 23572-23583, 23644-23647, 23673-23679, 23768-23804, 23806-23807, 23841-23871, 23930-23935, 24011-24031, 24053-240
76, 24103-24127, 24186-24191, 24287, 24347-24351, 24371-24383, 24434-24447, 24475-24479, 24501-24511, 24566-24575, 26431, 26472-26495, 26562-26591, 26619-26623, 26903-26905, 26909-26911, 26932-26943, 26984-27007, 27134-27135, 27182-27199, 27238-27263, 27291-27295, 27323-27327, 27376-27391, 27507-27519, 27641-27647, 27979-27999, 28024-28031, 28103-28127, 28255, 28323-28351, 28373-28383, 28414-28415, 28572-28575, 28598-28607, 28668-28671
Free inodes:

> > Any clues there? I don't think I'll be able to run this during the day
> > if it's going to trigger and remount the fs read-only as easily as
> > this.
>
> Yes, this is *huge*. Finding this may have been just what we needed
> to determine what has been going on. Thank you!!
>
> What I would suggest doing is to mount the filesystem with the mount
> option -o nodelalloc. This would be useful on two fronts: (1) it
> would determine whether or not the problem shows up if we suppress
> delayed allocation, and (2) if it works, hopefully it will protect
> you from further filesystem corruption incidents. (At least as a
> workaround; turning off delayed allocation will result in a
> performance decrease, but better that than data loss, I always say!)

Ok, thanks - I've added the mount option. I'll let you know if the
error still triggers.

> The fact that this triggered so quickly is actually a bit scary, so
> hopefully we'll be able to track down exactly what happened fairly
> quickly from here, and then fix the problem for real.

Yeah, that would be great.

Cheers,
Kevin.

2009-05-14 15:44:13

by Eric Sandeen

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Kevin Shanahan wrote:

> hermes:~# debugfs /dev/dm-0
> debugfs 1.41.3 (12-Oct-2008)
> debugfs: stat <759>
>
> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
> Generation: 3979120103 Version: 0x00000000:00000001
> User: 0 Group: 10140 Size: 14615630848
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 28546168
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> Size of extra inode fields: 28
> Extended attributes stored in inode body:
> = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44)
> DOSATTRIB = "0x20" (4)

Apologies if I've asked before: DOSATTRIB is a samba xattr; how does
samba come into play in your setup, again?

I'm wondering what that other odd xattr value is too... (the one that
didn't even print a name)

-Eric

2009-05-14 16:13:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote:
> > debugfs: stat <759>
>
> hermes:~# debugfs /dev/dm-0
> debugfs 1.41.3 (12-Oct-2008)
> debugfs: stat <759>
>
> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
> Generation: 3979120103 Version: 0x00000000:00000001
> User: 0 Group: 10140 Size: 14615630848
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 28546168
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009

> Inode Pathname
> 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf

Do you know how the system was likely writing into
/local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkf? Was this a
backup via rsync or tar? Was this some application writing into a
pre-existing file via NFS, or via local disk access?

Given the ctime/atime fields, I'm inclined to guess the latter, but it
would be good to know.

The stat dump for the inode 759 does *not* show logical block 1741329
getting mapped to physical block 529. So the question is how did that
happen?

I've started looking, and one thing popped up at me. I need to check
in with the Lustre folks who originally donated the code, but I don't
see any spinlock or mutexes protecting the inode's extent cache. So
if you are on an SMP machine, this could potentially have caused the
problem. How many CPUs or cores do you have? What does
/proc/cpuinfo report? Also, would it be correct to assume this file
is getting served up via Samba? My theory is that we might be running
into problems when two threads are simultaneously trying to read and
write a single file at the same time.

Hmm, what is accessing your files on this system? Are you just doing
backups? Is it just a backup server? Or are you serving up files
using Samba and there are clients which are accessing those files?

So if this is the problem, the following experiment should be able to
confirm it, by seeing if the problem goes away
when we short-circuit the inode's extent cache. In fs/ext4/extents.c,
try inserting a "return" statement at the top of ext4_ext_put_in_cache():

static void
ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
			__u32 len, ext4_fsblk_t start, int type)
{
	struct ext4_ext_cache *cex;

	return;		/* <---- insert this line */
	BUG_ON(len == 0);
	cex = &EXT4_I(inode)->i_cached_extent;
	cex->ec_type = type;
	cex->ec_block = block;
	cex->ec_len = len;
	cex->ec_start = start;
}

This should short circuit the i_cached_extent cache, and this may be
enough to make your problem go away. (If this theory is correct,
using mount -o nodelalloc probably won't make a difference, although
it might change the timing enough to make the bug harder to see.)

If that solves the problem, the right long-term fix will be to drop
in a spinlock to protect i_cached_extent.

- Ted

2009-05-14 21:02:49

by Kevin Shanahan

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 12:12:54PM -0400, Theodore Tso wrote:
> On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote:
> > > debugfs: stat <759>
> >
> > hermes:~# debugfs /dev/dm-0
> > debugfs 1.41.3 (12-Oct-2008)
> > debugfs: stat <759>
> >
> > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
> > Generation: 3979120103 Version: 0x00000000:00000001
> > User: 0 Group: 10140 Size: 14615630848
> > File ACL: 0 Directory ACL: 0
> > Links: 1 Blockcount: 28546168
> > Fragment: Address: 0 Number: 0 Size: 0
> > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
>
> > Inode Pathname
> > 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf
>
> Do you know how the system was likely writing into
> /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkf? Was this a
> backup via rsync or tar? Was this some application writing into a
> pre-existing file via NFS, or via local disk access?

This file was getting written by a windows server via samba. The
backup script on the windows side deletes the file before writing a
new copy each night. It's a fairly large file, about 14GB.

> Given the ctime/atime fields, I'm inclined to guess the latter, but it
> would be good to know.
>
> The stat dump for the inode 759 does *not* show logical block 1741329
> getting mapped to physical block 529. So the question is how did that
> happen?
>
> I've started looking, and one thing popped up at me. I need to check
> in with the Lustre folks who originally donated the code, but I don't
> see any spinlock or mutexes protecting the inode's extent cache. So
> if you are on an SMP machine, this could potentially have caused the
> problem. How many CPUs or cores do you have? What does
> /proc/cpuinfo report? Also, would it be correct to assume this file
> is getting served up via Samba? My theory is that we might be running
> into problems when two threads are simultaneously trying to read and
> write a single file at the same time.

Yes, it's SMP (4 cores):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X3350 @ 2.66GHz
stepping : 7
cpu MHz : 2666.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5322.33
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

> Hmm, what is accessing your files on this system? Are you just doing
> backups? Is it just a backup server? Or are you serving up files
> using Samba and there are clients which are accessing those files?

This is a Samba server serving up files to clients and also a backup
server. That particular file is unlikely to have seen concurrent
access from multiple clients - the folder is only accessible to that
windows server to dump its exchange store overnight.

> So if this is the problem, the following experiment should be able to
> confirm it, by seeing if the problem goes away
> when we short-circuit the inode's extent cache. In fs/ext4/extents.c,
> try inserting a "return" statement at the top of ext4_ext_put_in_cache():
>
> static void
> ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
> 			__u32 len, ext4_fsblk_t start, int type)
> {
> 	struct ext4_ext_cache *cex;
>
> 	return;		/* <---- insert this line */
> 	BUG_ON(len == 0);
> 	cex = &EXT4_I(inode)->i_cached_extent;
> 	cex->ec_type = type;
> 	cex->ec_block = block;
> 	cex->ec_len = len;
> 	cex->ec_start = start;
> }
>
> This should short circuit the i_cached_extent cache, and this may be
> enough to make your problem go away. (If this theory is correct,
> using mount -o nodelalloc probably won't make a difference, although
> it might change the timing enough to make the bug harder to see.)
>
> If that solves the problem, the right long-term fix will be to drop
> in a spinlock to protect i_cached_extent.

Okay, nodelalloc was enough to get through the backup process again at
least once (wrote the whole exchange-2000-UCWB-KVM-18.bkf file out in
full) without triggering the check_block_validity error.

To make sure I had something reproducible on the current kernel
(2.6.29.3 + check_block_validity patch), I re-mounted without
nodelalloc again and tried running just that exchange backup - the bug
did trigger again.

May 15 05:30:55 hermes kernel: EXT4-fs error (device dm-0): check_block_validity: inode #759 logical block 2509617 mapped to 0 (size 1)
May 15 05:30:55 hermes kernel: Aborting journal on device dm-0:8.
May 15 05:30:55 hermes kernel: ext4_da_writepages: jbd2_start: 775 pages, ino 759; err -30
May 15 05:30:55 hermes kernel: Pid: 5026, comm: pdflush Tainted: G W 2.6.29.3 #1
May 15 05:30:55 hermes kernel: Call Trace:
May 15 05:30:55 hermes kernel: [<ffffffff802fdce3>] ext4_da_writepages+0x211/0x427
May 15 05:30:55 hermes kernel: [<ffffffff8055b44f>] thread_return+0x3d/0xc3
May 15 05:30:55 hermes kernel: [<ffffffff802fef56>] ext4_da_get_block_write+0x0/0x12d
May 15 05:30:55 hermes kernel: [<ffffffff802757df>] do_writepages+0x20/0x2d
May 15 05:30:55 hermes kernel: [<ffffffff802aa359>] __writeback_single_inode+0x196/0x3a4
May 15 05:30:55 hermes kernel: [<ffffffff802542aa>] del_timer_sync+0xc/0x16
May 15 05:30:55 hermes kernel: [<ffffffff802aa9b9>] generic_sync_sb_inodes+0x2a7/0x439
May 15 05:30:55 hermes kernel: [<ffffffff802aacfa>] writeback_inodes+0x7d/0xcc
May 15 05:30:55 hermes kernel: [<ffffffff8027591e>] wb_kupdate+0x9f/0x116
May 15 05:30:55 hermes kernel: [<ffffffff802764bf>] pdflush+0x125/0x1d2
May 15 05:30:55 hermes kernel: [<ffffffff8027587f>] wb_kupdate+0x0/0x116
May 15 05:30:55 hermes kernel: [<ffffffff8027639a>] pdflush+0x0/0x1d2
May 15 05:30:55 hermes kernel: [<ffffffff8025c8cc>] kthread+0x47/0x73
May 15 05:30:55 hermes kernel: [<ffffffff80225f7a>] child_rip+0xa/0x20
May 15 05:30:55 hermes kernel: [<ffffffff8025c885>] kthread+0x0/0x73
May 15 05:30:55 hermes kernel: [<ffffffff80225f70>] child_rip+0x0/0x20
May 15 05:30:55 hermes kernel: Remounting filesystem read-only
May 15 05:30:56 hermes kernel: ------------[ cut here ]------------
May 15 05:30:56 hermes kernel: WARNING: at fs/ext4/inode.c:1066 check_block_validity+0xd8/0xff()
May 15 05:30:56 hermes kernel: Hardware name: S3210SH
May 15 05:30:56 hermes kernel: Pid: 6196, comm: smbd Tainted: G W 2.6.29.3 #1
May 15 05:30:56 hermes kernel: Call Trace:
May 15 05:30:56 hermes kernel: [<ffffffff8024c0b0>] warn_slowpath+0xd3/0x10d
May 15 05:30:56 hermes kernel: [<ffffffff8025cc0e>] wake_bit_function+0x0/0x23
May 15 05:30:56 hermes kernel: [<ffffffff802afbec>] sync_dirty_buffer+0x5f/0x97
May 15 05:30:56 hermes kernel: [<ffffffff8030589e>] ext4_commit_super+0xbc/0x103
May 15 05:30:56 hermes kernel: [<ffffffff80307f92>] ext4_handle_error+0x83/0xa9
May 15 05:30:56 hermes kernel: [<ffffffff803081cb>] ext4_error+0x83/0x90
May 15 05:30:56 hermes kernel: [<ffffffff8040c772>] scsi_done+0x0/0xc
May 15 05:30:56 hermes kernel: [<ffffffff80245f5c>] dequeue_task_fair+0x45/0x167
May 15 05:30:56 hermes kernel: [<ffffffff802fd222>] check_block_validity+0xd8/0xff
May 15 05:30:56 hermes kernel: [<ffffffff802febe1>] ext4_get_blocks_wrap+0xb5/0x24f
May 15 05:30:56 hermes kernel: [<ffffffff802fee2d>] ext4_da_get_block_prep+0x58/0x181
May 15 05:30:56 hermes kernel: [<ffffffff802b0187>] __block_prepare_write+0x150/0x2c4
May 15 05:30:56 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181
May 15 05:30:56 hermes kernel: [<ffffffff802701e4>] add_to_page_cache_locked+0x73/0x95
May 15 05:30:56 hermes kernel: [<ffffffff802b0447>] block_write_begin+0x7a/0xc7
May 15 05:30:56 hermes kernel: [<ffffffff802fce53>] ext4_da_write_begin+0x166/0x1ed
May 15 05:30:56 hermes kernel: [<ffffffff802fedd5>] ext4_da_get_block_prep+0x0/0x181
May 15 05:30:56 hermes kernel: [<ffffffff80270b94>] generic_file_buffered_write+0x12c/0x2e8
May 15 05:30:56 hermes kernel: [<ffffffff80271145>] __generic_file_aio_write_nolock+0x25f/0x293
May 15 05:30:56 hermes kernel: [<ffffffff80271928>] generic_file_aio_write+0x64/0xc4
May 15 05:30:56 hermes kernel: [<ffffffff802f8d33>] ext4_file_write+0x93/0x118
May 15 05:30:56 hermes kernel: [<ffffffff80292f9c>] do_sync_write+0xce/0x113
May 15 05:30:56 hermes kernel: [<ffffffff802bc5a0>] vfs_test_lock+0x2e/0x32
May 15 05:30:56 hermes kernel: [<ffffffff8025cbe0>] autoremove_wake_function+0x0/0x2e
May 15 05:30:56 hermes kernel: [<ffffffff802bd116>] fcntl_setlk+0x24b/0x25c
May 15 05:30:56 hermes kernel: [<ffffffff8029373f>] vfs_write+0xa5/0xde
May 15 05:30:56 hermes kernel: [<ffffffff802937cf>] sys_pwrite64+0x57/0x77
May 15 05:30:56 hermes kernel: [<ffffffff8029d8c9>] sys_fcntl+0x312/0x31e
May 15 05:30:56 hermes kernel: [<ffffffff8022501b>] system_call_fastpath+0x16/0x1b
May 15 05:30:56 hermes kernel: ---[ end trace df4d7250236f7466 ]---
May 15 05:30:56 hermes kernel: EXT4-fs error (device dm-0) in ext4_da_write_begin: IO failure
May 15 05:30:56 hermes kernel: __journal_remove_journal_head: freeing b_frozen_data

debugfs: stat <759>

Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
Generation: 2963762165 Version: 0x00000000:00000001
User: 0 Group: 10140 Size: 14814860288
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 28935288
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4a0c363e:77c84a60 -- Fri May 15 00:48:22 2009
atime: 0x4a0c2ce8:55358578 -- Fri May 15 00:08:32 2009
mtime: 0x4a0c363e:77c84a60 -- Fri May 15 00:48:22 2009
crtime: 0x4a0c2ce8:55358578 -- Fri May 15 00:08:32 2009
Size of extra inode fields: 28
Extended attributes stored in inode body:
= "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44)
DOSATTRIB = "0x20" (4)
BLOCKS:
(IND):6603263, (0):6604364, (1-511):6602752-6603262, (512-1023):6586368-6586879, (1024-2047):6587392-6588415, (2048-14335):6590464-6602751, (14336-188415):6641664-6815743, (188416-417791):6914048-7143423, (417792-419839):7172096-7174143, (419840-423935):7168000-7172095, (423936-448511):7143424-7167999, (448512-450559):7182336-7184383, (450560-452607):7192576-7194623, (452608-456703):7188480-7192575, (456704-600063):7196672-7340031, (600064-602111):7380992-7383039, (602112-604159):7387136-7389183, (604160-696319):7444480-7536639, (696320-763903):7567360-7634943, (763904-978943):7649280-7864319, (978944-1044479):7897088-7962623, (1044480-1046527):7979008-7981055, (1046528-1079295):7995392-8028159, (1079296-1081343):8048640-8050687, (1081344-1085439):8044544-8048639, (1085440-1101823):8028160
-8044543, (1101824-1112063):8052736-8062975, (1112064-1114111):8081408-8083455, (1114112-1118207):8065024-8069119, (1118208-1122303):8077312-8081407, (1122304-1130495):8069120-8077311, (1130496-1138687):8085504-8093695, (1138688-1140735):8110080-8112127, (1140736-1142783):8114176-8116223, (1142784-1148927):8120320-8126463, (1148928-1150975):8136704-8138751, (1150976-1155071):8126464-8130559, (1155072-1241087):8138752-8224767, (1241088-1243135):8241152-8243199, (1243136-1245183):8249344-8251391, (1245184-1247231):8253440-8255487, (1247232-1251327):8245248-8249343, (1251328-1267711):8224768-8241151, (1267712-1269759):8292352-8294399, (1269760-1273855):8298496-8302591, (1273856-1275903):8398848-8400895, (1275904-1277951):8402944-8404991, (1277952-1282047):8407040-8411135, (1282048-1288191):
8417280-8423423, (1288192-1306623):8501248-8519679, (1306624-1310719):8536064-8540159, (1310720-1327103):8519680-8536063, (1327104-1329151):8710144-8712191, (1329152-1331199):8718336-8720383, (1331200-1345535):8726528-8740863, (1345536-1347583):8749056-8751103, (1347584-1349631):8761344-8763391, (1349632-1351679):8779776-8781823, (1351680-1353727):8806400-8808447, (1353728-1357823):8790016-8794111, (1357824-1386495):8818688-8847359, (1386496-1388543):8859648-8861695, (1388544-1390591):8863744-8865791, (1390592-1394687):8855552-8859647, (1394688-1402879):8847360-8855551, (1402880-1404927):8923136-8925183, (1404928-1474559):11603968-11673599, (1474560-1484799):12998656-13008895, (1484800-1486847):13037568-13039615, (1486848-1490943):13033472-13037567, (1490944-1515519):13008896-13033471, (
1515520-1517567):13070336-13072383, (1517568-1521663):13058048-13062143, (1521664-1525759):13066240-13070335, (1525760-1542143):13041664-13058047, (1542144-1546239):16058368-16062463, (1546240-1550335):16084992-16089087, (1550336-1552383):16119808-16121855, (1552384-1554431):16150528-16152575, (1554432-1558527):16138240-16142335, (1558528-1562623):16146432-16150527, (1562624-1579007):16121856-16138239, (1579008-1581055):16154624-16156671, (1581056-1587199):16160768-16166911, (1587200-1589247):16199680-16201727, (1589248-1591295):16203776-16205823, (1591296-1593343):16226304-16228351, (1593344-1595391):16359424-16361471, (1595392-1599487):16369664-16373759, (1599488-1601535):16377856-16379903, (1601536-1603583):16388096-16390143, (1603584-1605631):16394240-16396287, (1605632-1607679):1640
0384-16402431, (1607680-1611775):16396288-16400383, (1611776-1615871):16455680-16459775, (1615872-1619967):16467968-16472063, (1619968-1624063):16461824-16465919, (1624064-1626111):16574464-16576511, (1626112-1628159):16578560-16580607, (1628160-1630207):16609280-16611327, (1630208-1638399):16601088-16609279, (1638400-1646591):16580608-16588799, (1646592-1650687):16652288-16656383, (1650688-1652735):16662528-16664575, (1652736-1656831):18202624-18206719, (1656832-1658879):18262016-18264063, (1658880-1660927):18276352-18278399, (1660928-1673215):18264064-18276351, (1673216-1677311):18296832-18300927, (1677312-1689599):18305024-18317311, (1689600-1693695):18706432-18710527, (1693696-1695743):18712576-18714623, (1695744-1699839):18728960-18733055, (1699840-1701887):18735104-18737151, (17018
88-1714175):18714624-18726911, (1714176-1716223):18753536-18755583, (1716224-1718271):18757632-18759679, (1718272-1720319):18767872-18769919, (1720320-1728511):18759680-18767871, (1728512-1789951):18780160-18841599, (1789952-1794047):18857984-18862079, (1794048-1810431):18841600-18857983, (1810432-1818623):18866176-18874367, (1818624-1820671):18884608-18886655, (1820672-2035711):18888704-19103743, (2035712-2037759):19132416-19134463, (2037760-2041855):19128320-19132415, (2041856-2066431):19103744-19128319, (2066432-2328575):19136512-19398655, (2328576-2330623):19408896-19410943, (2330624-2332671):19419136-19421183, (2332672-2340863):19410944-19419135, (2340864-2447359):19423232-19529727, (2447360-2478079):19531776-19562495, (2478080-2482175):19576832-19580927, (2482176-2486271):19593216-
19597311, (2486272-2488319):19601408-19603455, (2488320-2783231):19628032-19922943, (2783232-2838527):19933184-19988479, (2838528-2842623):19998720-20002815, (2842624-2846719):20006912-20011007, (2846720-2854911):19988480-19996671, (2854912-3248127):20054016-20447231, (3248128-3252223):20457472-20461567, (3252224-3254271):20471808-20473855, (3254272-3262463):20463616-20471807, (3262464-3264511):20484096-20486143, (3264512-3266559):20502528-20504575, (3266560-3268607):20523008-20525055, (3268608-3270655):20533248-20535295, (3270656-3616909):20578304-20924557
TOTAL: 3616911

debugfs: ncheck 759
Inode Pathname
759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkfB-KVM-18.bkf

hermes:~# e2fsck -pfv /dev/dm-0
/dev/dm-0: recovering journal
/dev/dm-0: Inode 7 has illegal block(s).

/dev/dm-0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)

hermes:~# e2fsck -fv /dev/dm-0
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s). Clear<y>? yes

Illegal block #181260 (1109068081) in inode 7. CLEARED.
Illegal block #181261 (2437571651) in inode 7. CLEARED.
Illegal block #181262 (892263544) in inode 7. CLEARED.
Illegal block #181263 (1023380360) in inode 7. CLEARED.
Illegal block #181264 (1152724232) in inode 7. CLEARED.
Illegal block #181265 (2602060959) in inode 7. CLEARED.
Illegal block #181266 (754287188) in inode 7. CLEARED.
Illegal block #181267 (2833437713) in inode 7. CLEARED.
Illegal block #181268 (1495914360) in inode 7. CLEARED.
Illegal block #181269 (1583358807) in inode 7. CLEARED.
Illegal block #181271 (1514245352) in inode 7. CLEARED.
Too many illegal blocks in inode 7.
Clear inode<y>? yes

Restarting e2fsck from the beginning...
Resize inode not valid. Recreate<y>? yes

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (4275, counted=4276).
Fix<y>? yes

Free blocks count wrong for group #142 (30919, counted=199).
Fix<y>? yes

Free blocks count wrong for group #143 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #159 (13307, counted=11259).
Fix<y>? yes

Free blocks count wrong for group #164 (7197, counted=5149).
Fix<y>? yes

Free blocks count wrong for group #165 (18691, counted=2307).
Fix<y>? yes

Free blocks count wrong for group #167 (31325, counted=2653).
Fix<y>? yes

Free blocks count wrong for group #168 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #169 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #170 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #171 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #172 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #173 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #174 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #175 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #186 (14647, counted=2359).
Fix<y>? yes

Free blocks count wrong for group #187 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #188 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #189 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #190 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #191 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #196 (3955, counted=1907).
Fix<y>? yes

Free blocks count wrong for group #197 (4901, counted=805).
Fix<y>? yes

Free blocks count wrong for group #198 (26087, counted=1511).
Fix<y>? yes

Free blocks count wrong for group #199 (27516, counted=892).
Fix<y>? yes

Free blocks count wrong for group #200 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #201 (18382, counted=6094).
Fix<y>? yes

Free blocks count wrong for group #202 (32104, counted=1384).
Fix<y>? yes

Free blocks count wrong for group #203 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #204 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #205 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #206 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #207 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #211 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #212 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #213 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #214 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #215 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #216 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #217 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #218 (31158, counted=438).
Fix<y>? yes

Free blocks count wrong for group #219 (23522, counted=3042).
Fix<y>? yes

Free blocks count wrong for group #220 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #221 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #222 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #223 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #225 (14501, counted=10405).
Fix<y>? yes

Free blocks count wrong for group #227 (27863, counted=1239).
Fix<y>? yes

Free blocks count wrong for group #228 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #229 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #230 (2780, counted=732).
Fix<y>? yes

Free blocks count wrong for group #231 (32768, counted=0).
Fix<y>? yes

Free blocks count wrong for group #232 (32768, counted=2481).
Fix<y>? yes

Free blocks count wrong (222364420, counted=220943542).
Fix<y>? yes


/dev/dm-0: ***** FILE SYSTEM WAS MODIFIED *****

331810 inodes used (0.49%)
16104 non-contiguous inodes (4.9%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 331389/351
47491914 blocks used (17.69%)
0 bad blocks
2 large files

305051 regular files
26686 directories
0 character device files
0 block device files
0 fifos
0 links
64 symbolic links (60 fast symbolic links)
0 sockets
--------
331801 files

Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
short circuit i_cached_extent patch, mounted the fs without
nodelalloc. I was able to run the full exchange backup without
triggering the check_block_validity error.

Regards,
Kevin.

2009-05-14 21:07:11

by Kevin Shanahan

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 10:44:07AM -0500, Eric Sandeen wrote:
> Kevin Shanahan wrote:
>
> > hermes:~# debugfs /dev/dm-0
> > debugfs 1.41.3 (12-Oct-2008)
> > debugfs: stat <759>
> >
> > Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
> > Generation: 3979120103 Version: 0x00000000:00000001
> > User: 0 Group: 10140 Size: 14615630848
> > File ACL: 0 Directory ACL: 0
> > Links: 1 Blockcount: 28546168
> > Fragment: Address: 0 Number: 0 Size: 0
> > ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> > atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> > mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
> > crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
> > Size of extra inode fields: 28
> > Extended attributes stored in inode body:
> > = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44)
> > DOSATTRIB = "0x20" (4)
>
> Apologies if I've asked before: DOSATTRIB is a samba xattr; how does
> samba come into play in your setup, again?

This server is a Samba server. This file in particular is being
written to via Samba by a Windows 2000 server over the network.

> I'm wondering what that other odd xattr value is too... (the one that
> didn't even print a name)

I would assume that's the file ACL:

hermes:/srv/samba/local/dumps/exchange# getfacl --numeric exchange-2000-UCWB-KVM-18.bkf
# file: exchange-2000-UCWB-KVM-18.bkf
# owner: 0
# group: 10140
user::rw-
user:10002:rwx #effective:rw-
group::r-x #effective:r--
group:10071:r-x #effective:r--
group:10136:rwx #effective:rw-
mask::rw-
other::---

Cheers,
Kevin.

2009-05-14 21:08:32

by Eric Sandeen

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Kevin Shanahan wrote:
> On Thu, May 14, 2009 at 10:44:07AM -0500, Eric Sandeen wrote:

...

>>> Extended attributes stored in inode body:
>>> = "01 00 00 00 01 00 06 00 02 00 07 00 12 27 00 00 04 00 05 00 08 00 05 00 57 27 00 00 08 00 07 00 98 27 00 00 10 00 06 00 20 00 00 00 " (44)
>>> DOSATTRIB = "0x20" (4)
>> Apologies if I've asked before: DOSATTRIB is a samba xattr; how does
>> samba come into play in your setup, again?
>
> This server is a Samba server. This file in particular is being
> written to via Samba by a Windows 2000 server over the network.
>
>> I'm wondering what that other odd xattr value is too... (the one that
>> didn't even print a name)
>
> I would assume that's the file ACL:

Ah, sure. Ok thanks.

-Eric

2009-05-14 21:23:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> short circuit i_cached_extent patch, mounted the fs without
> nodelalloc. I was able to run the full exchange backup without
> triggering the check_block_validity error.

Great!

So here's the final fix (it replaces the short circuit i_cached_extent
patch) which I plan to push to Linus. It should be much less of a
performance hit than simply short-circuiting i_cached_extent...

Thanks so much for helping to track this down!!! If ever someone
deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
you....

- Ted

commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
Author: Theodore Ts'o <[email protected]>
Date: Thu May 14 17:09:37 2009 -0400

ext4: Fix race in ext4_inode_info.i_cached_extent

If one CPU is reading from a file while another CPU is writing to the
same file different locations, there is nothing protecting the
i_cached_extent structure from being used and updated at the same
time. This could potentially cause the wrong location on disk to be
read or written to, including potentially causing the corruption of
the block group descriptors and/or inode table.

Many thanks to Ken Shannah for helping to track down this problem.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 172656c..e3a55eb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
{
struct ext4_ext_cache *cex;
BUG_ON(len == 0);
+ spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
cex = &EXT4_I(inode)->i_cached_extent;
cex->ec_type = type;
cex->ec_block = block;
cex->ec_len = len;
cex->ec_start = start;
+ spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
}

/*
@@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
struct ext4_extent *ex)
{
struct ext4_ext_cache *cex;
+ int ret = EXT4_EXT_CACHE_NO;

+ /*
+ * We borrow i_block_reservation_lock to protect i_cached_extent
+ */
+ spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
cex = &EXT4_I(inode)->i_cached_extent;

/* has cache valid data? */
if (cex->ec_type == EXT4_EXT_CACHE_NO)
- return EXT4_EXT_CACHE_NO;
+ goto errout;

BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP &&
cex->ec_type != EXT4_EXT_CACHE_EXTENT);
@@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
ext_debug("%u cached by %u:%u:%llu\n",
block,
cex->ec_block, cex->ec_len, cex->ec_start);
- return cex->ec_type;
+ ret = cex->ec_type;
}

2009-05-14 21:33:15

by Kevin Shanahan

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > short circuit i_cached_extent patch, mounted the fs without
> > nodelalloc. I was able to run the full exchange backup without
> > triggering the check_block_validity error.
>
> Great!
>
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus. It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
>
> Thanks so much for helping to track this down!!! If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....

Hehe, no problem. Will do the final testing shortly (ran out of time
this morning, users are back on the system now). Just one little
correction to your patch below:

> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <[email protected]>
> Date: Thu May 14 17:09:37 2009 -0400
>
> ext4: Fix race in ext4_inode_info.i_cached_extent
>
> If one CPU is reading from a file while another CPU is writing to the
> same file different locations, there is nothing protecting the
> i_cached_extent structure from being used and updated at the same
> time. This could potentially cause the wrong location on disk to be
> read or written to, including potentially causing the corruption of
> the block group descriptors and/or inode table.
>
> Many thanks to Ken Shannah for helping to track down this problem.
^^^^^^^^^^^

Cheers,
Kevin Shanahan.

2009-05-15 01:21:14

by Eric Sandeen

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
>> Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
>> short circuit i_cached_extent patch, mounted the fs without
>> nodelalloc. I was able to run the full exchange backup without
>> triggering the check_block_validity error.
>
> Great!
>
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus. It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
>
> Thanks so much for helping to track this down!!! If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
>
> - Ted

So here's a fio job I thought would try to hit this ...

[global]
ioengine=libaio
iodepth=1
filesize=4g
bs=1m
norandommap
direct=1
loops=500
size=16g

[setup]
filename=file
rw=randwrite
loops=1

[thread1]
stonewall
filename=file
rw=randwrite

[thread2]
filename=file
rw=randread

it should lay out a 4g file in random 1m direct IOs to fragment it and
get a lot of extents, then launch 2 threads, one each doing random reads
and random writes of that same file.

I can't make this trip it, though ...

-Eric

2009-05-15 04:08:07

by Alex Tomas

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

When the extent cache was introduced, a single exclusive spinlock
protected the whole of ext3_ext_get_blocks, so there was no concurrency
at all. So I guess your theory is correct.

thanks, Alex

Theodore Tso wrote:
> On Fri, May 15, 2009 at 12:00:15AM +0930, Kevin Shanahan wrote:
>>> debugfs: stat <759>
>> hermes:~# debugfs /dev/dm-0
>> debugfs 1.41.3 (12-Oct-2008)
>> debugfs: stat <759>
>>
>> Inode: 759 Type: regular Mode: 0660 Flags: 0x80000
>> Generation: 3979120103 Version: 0x00000000:00000001
>> User: 0 Group: 10140 Size: 14615630848
>> File ACL: 0 Directory ACL: 0
>> Links: 1 Blockcount: 28546168
>> Fragment: Address: 0 Number: 0 Size: 0
>> ctime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
>> atime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
>> mtime: 0x4a0acdb5:2a88cbec -- Wed May 13 23:10:05 2009
>> crtime: 0x4a0ac45b:10899618 -- Wed May 13 22:30:11 2009
>
>> Inode Pathname
>> 759 /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkf
>
> Do you know how the system was likely writing into
> /local/dumps/exchange/exchange-2000-UCWB-KVM-18.bkf? Was this a
> backup via rsync or tar? Was this some application writing into a
> pre-existing file via NFS, or via local disk access?
>
> Given the ctime/atime fields, I'm inclined to guess the latter, but it
> would be good to know.
>
> The stat dump for the inode 759 does *not* show logical block 1741329
> getting mapped to physical block 529. So the question is how did that
> happen?
>
> I've started looking, and one thing popped up at me. I need to check
> in with the Lustre folks who originally donated the code, but I don't
> see any spinlock or mutexes protecting the inode's extent cache. So
> if you are on an SMP machine, this could potentially have caused the
> problem. How many CPU's or cores do you have? What does
> /proc/cpuinfo report? Also, would it be correct to assume this file
> is getting served up via Samba? My theory is that we might be running
> into problems when two threads are simultaneously trying read and
> write to a single file at the same time.
>
> Hmm, what is accessing your files on this system? Are you just doing
> backups? Is it just a backup server? Or are you serving up files
> using Samba and there are clients which are accessing those files?
>
> So if this is the problem, the following experiment should be able to
> confirm whether it's the problem, by seeing if the problem goes away
> if we short-circuit the inode's extent cache. In fs/ext4/extents.c,
> try inserting a "return" statement to in ext4_ext_put_in_cache():
>
> static void
> ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
> __u32 len, ext4_fsblk_t start, int type)
> {
> struct ext4_ext_cache *cex;
>
> return; <---- insert this line
> BUG_ON(len == 0);
> cex = &EXT4_I(inode)->i_cached_extent;
> cex->ec_type = type;
> cex->ec_block = block;
> cex->ec_len = len;
> cex->ec_start = start;
> }
>
> This should short circuit the i_cached_extent cache, and this may be
> enough to make your problem go away. (If this theory is correct,
> using mount -o nodelalloc probably won't make a difference, although
> it might change the timing enough to make the bug harder to see.)
>
> If that solves the problem, the right long-term fix will be to drop in
> a spinlock to protect i_cached_extent.
>
> - Ted


2009-05-15 04:55:28

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > short circuit i_cached_extent patch, mounted the fs without
> > nodelalloc. I was able to run the full exchange backup without
> > triggering the check_block_validity error.
>
> Great!
>
> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus. It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
>
> Thanks so much for helping to track this down!!! If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
>
> - Ted
>
> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <[email protected]>
> Date: Thu May 14 17:09:37 2009 -0400
>
> ext4: Fix race in ext4_inode_info.i_cached_extent
>
> If one CPU is reading from a file while another CPU is writing to the
> same file different locations, there is nothing protecting the
> i_cached_extent structure from being used and updated at the same
> time. This could potentially cause the wrong location on disk to be
> read or written to, including potentially causing the corruption of
> the block group descriptors and/or inode table.


It should be multiple readers. We don't allow concurrent read/write or
multiple writers via ext4_ext_get_blocks; &EXT4_I(inode)->i_data_sem is
supposed to protect against read/write races and multiple writers. What
it did allow was multiple readers (get_block calls with create = 0), and
those readers cached the extent information they read from disk. So the
fix is correct, but we need to update the commit message.

Reviewed-by: Aneesh Kumar K.V <[email protected]>


>
> Many thanks to Ken Shannah for helping to track down this problem.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 172656c..e3a55eb 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1841,11 +1841,13 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
> {
> struct ext4_ext_cache *cex;
> BUG_ON(len == 0);
> + spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
> cex = &EXT4_I(inode)->i_cached_extent;
> cex->ec_type = type;
> cex->ec_block = block;
> cex->ec_len = len;
> cex->ec_start = start;
> + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> }
>
> /*
> @@ -1902,12 +1904,17 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
> struct ext4_extent *ex)
> {
> struct ext4_ext_cache *cex;
> + int ret = EXT4_EXT_CACHE_NO;
>
> + /*
> + * We borrow i_block_reservation_lock to protect i_cached_extent
> + */
> + spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
> cex = &EXT4_I(inode)->i_cached_extent;
>
> /* has cache valid data? */
> if (cex->ec_type == EXT4_EXT_CACHE_NO)
> - return EXT4_EXT_CACHE_NO;
> + goto errout;
>
> BUG_ON(cex->ec_type != EXT4_EXT_CACHE_GAP &&
> cex->ec_type != EXT4_EXT_CACHE_EXTENT);
> @@ -1918,11 +1925,11 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t block,
> ext_debug("%u cached by %u:%u:%llu\n",
> block,
> cex->ec_block, cex->ec_len, cex->ec_start);
> - return cex->ec_type;
> + ret = cex->ec_type;
> }
> -
> - /* not in cache */
> - return EXT4_EXT_CACHE_NO;
> +errout:
> + spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
> + return ret;
> }
>
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-05-15 04:58:38

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 12:40:11AM -0400, Theodore Tso wrote:
> On Wed, May 13, 2009 at 03:56:34PM +0930, Kevin Shanahan wrote:
> >
> > Now, this is (possibly) interesting - that block contains a bunch of
> > file data. Looks like a html email (I can tell it's email because of
> > the FIXED_ prefix added to the tags by the mail sanitizer).
>
> That is interesting. Knowing that it is file data means we might be
> able to track things down this way.
>
> Are you able to apply the following patch to your kernel? If so,
> hopefully we'll be able to catch whatever is causing the problem in
> the act.
>
> - Ted
>
> commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13
> Author: Theodore Ts'o <[email protected]>
> Date: Thu May 14 00:39:48 2009 -0400
>
> ext4: Add a block validity check to ext4_get_blocks_wrap()
>
> A few users with very large disks have been reporting low block number
> filesystem corruptions, potentially zapping the block group
> descriptors or inodes in the first inode table block. It's not clear
> what is causing this, but most recently, it appears that whatever is
> trashing the filesystem metadata appears to be file data. So let's
> try to set a trap for the corruption in ext4_get_blocks_wrap(), which
> is where logical blocks in an inode are mapped to physical blocks.


We already do block validation in the allocator: in
ext4_mb_mark_diskspace_used we check whether the block allocator is
wrongly handing out blocks from the system zone. So I guess we don't
need this patch. Or maybe we need to remove the check in
ext4_mb_mark_diskspace_used and add the check in get_block so that it
catches the extent cache corruption found by this bug.

-aneesh



>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4e7f363..4fad4fa 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1120,6 +1120,35 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
> ext4_discard_preallocations(inode);
> }
>
> +static int check_block_validity(struct inode *inode, sector_t logical,
> + sector_t phys, int len)
> +{
> + ext4_fsblk_t valid_block, itable_block;
> + struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;
> + struct ext4_group_desc *gdp = ext4_get_group_desc(inode->i_sb, 0, 0);
> + unsigned int itable_len = EXT4_SB(inode->i_sb)->s_itb_per_group;
> +
> + valid_block = le32_to_cpu(es->s_first_data_block) +
> + EXT4_SB(inode->i_sb)->s_gdb_count;
> + itable_block = ext4_inode_table(inode->i_sb, gdp);
> +
> + if (unlikely((phys <= valid_block) ||
> + ((phys + len - 1) > ext4_blocks_count(es)) ||
> + ((phys >= itable_block) &&
> + (phys <= itable_block + itable_len)) ||
> + ((phys + len - 1 >= itable_block) &&
> + (phys + len - 1 <= itable_block + itable_len)))) {
> + ext4_error(inode->i_sb, "check_block_validity",
> + "inode #%lu logical block %llu mapped to %llu "
> + "(size %d)", inode->i_ino,
> + (unsigned long long) logical,
> + (unsigned long long) phys, len);
> + WARN_ON(1);
> + return -EIO;
> + }
> + return 0;
> +}
> +
> /*
> * The ext4_get_blocks_wrap() function try to look up the requested blocks,
> * and returns if the blocks are already mapped.
> @@ -1165,6 +1194,13 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> }
> up_read((&EXT4_I(inode)->i_data_sem));
>
> + if (retval > 0 && buffer_mapped(bh)) {
> + int ret = check_block_validity(inode, block,
> + bh->b_blocknr, retval);
> + if (ret != 0)
> + return ret;
> + }
> +
> /* If it is only a block(s) look up */
> if (!create)
> return retval;
> @@ -1241,6 +1277,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> }
>
> up_write((&EXT4_I(inode)->i_data_sem));
> + if (retval > 0 && buffer_mapped(bh)) {
> + int ret = check_block_validity(inode, block,
> + bh->b_blocknr, retval);
> + if (ret != 0)
> + return ret;
> + }
> return retval;
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-05-15 10:11:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote:
> > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> > Author: Theodore Ts'o <[email protected]>
> > Date: Thu May 14 17:09:37 2009 -0400
> >
> > ext4: Fix race in ext4_inode_info.i_cached_extent
> >
> > If one CPU is reading from a file while another CPU is writing to the
> > same file different locations, there is nothing protecting the
> > i_cached_extent structure from being used and updated at the same
> > time. This could potentially cause the wrong location on disk to be
> > read or written to, including potentially causing the corruption of
> > the block group descriptors and/or inode table.
>
>
> It should be multiple readers. We don't allow read/write or multiple
> writers via ext4_ext_get_blocks. &EXT4_I(inode)->i_data_sem is supposed
> to protect read/write and multiple writers. What it allowed was
> multiple readers(get_block call with create = 0). And readers did cache
> the extent information which it read from the disk. So the fix is
> correct, but we need to update the commit message.

Yes, you're right. The i_data_sem would have protected us from the
worst of these problems. The commit message isn't totally wrong,
though, since even a writer will call ext4_ext_get_blocks() with
create == 0. But we should describe the failure mode much more
accurately, and say that it is two callers of ext4_ext_get_blocks()
with create=0 racing against one another. It's possible, perhaps even
likely, that one of those callers came from a call of
ext4_get_blocks() with create=1 --- given that the people who were
reporting this were doing so while running backup jobs.

However, one could imagine multiple mysql (or DB2) threads writing
into random parts of a database, and if two CPU's try to bmap already
allocated (and initialized) blocks out of the database file, so that
ext4_ext_get_blocks() is always being called with create=0, you could
very easily end up with a situation where blocks would be written to
the wrong location on disk. <<shiver>>

We should make sure that all of the enterprise distributions that are
shipping ext4 as a "technology preview" get a copy of this patch as
the fix for a high-priority bug, since a database workload should trip this
pretty easily.

- Ted

2009-05-15 10:27:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 10:28:27AM +0530, Aneesh Kumar K.V wrote:
> > commit 8ff799da106e9fc4da9b2a3753b5b86caab27f13
> > Author: Theodore Ts'o <[email protected]>
> > Date: Thu May 14 00:39:48 2009 -0400
> >
> > ext4: Add a block validity check to ext4_get_blocks_wrap()
> >
> > A few users with very large disks have been reporting low block number
> > filesystem corruptions, potentially zapping the block group
> > descriptors or inodes in the first inode table block. It's not clear
> > what is causing this, but most recently, it appears that whatever is
> > trashing the filesystem metadata appears to be file data. So let's
> > try to set a trap for the corruption in ext4_get_blocks_wrap(), which
> > is where logical blocks in an inode are mapped to physical blocks.
>
>
> We already do block validation in the allocator: in
> ext4_mb_mark_diskspace_used we check whether the block allocator is
> wrongly handing out blocks from the system zone. So I guess we don't
> need this patch. Or maybe we need to remove the check in
> ext4_mb_mark_diskspace_used and add the check in get_block so that
> it catches the extent cache corruption found by this bug.

Yeah, I was debating what to do with this patch long term. It catches
problems that the sanity checks in ext4_mb_mark_diskspace_used() in
mballoc.c, or ext4_valid_extent() in ext4/extents.c, would miss.
However, (a) it's expensive to do these sorts of tests, and (b) it
only checks for problems in the primary block group descriptors and
first block group's inode tables, since this is where we had been
reporting problems.

The other problem is that ext4_mb_mark_diskspace_used()'s tests assume
!flex_bg, and with flex_bg, its checks are largely ineffective.

What I think the right approach probably will be is that we need to
build an rbtree of the "system zone" blocks at mount time. Especially
with flex_bg, in the normal case the block group descriptors and
bitmap blocks will form a very nice, contiguous set of extents that
should be relatively compactly stored in an rbtree. This will allow
us to make the tests done by ext4_valid_extent(),
ext4_mb_mark_diskspace_used(), and ext4_get_blocks() to be ***much***
more effective and comprehensive.

This still doesn't answer the question of where we should be doing the
tests. If we're trying to track down a bug, we probably want to be
doing these tests everywhere. For performance reasons, though, my
guess is that *most* of the tests should be disabled by default
unless a mount option is given. (Why a run-time mount option as
opposed to a compile-time test? So that if this bug happens in the
field, a Level 3 engineer from RHEL's, SLES's, or IBM's help desk can
tell the customer to mount with this option, and track it down without
requesting that the user install a custom kernel; having done some
field work, I can tell you this will save a huge amount of
time/effort.)

I would probably keep the tests in ext4_valid_extent() so we can test
for on-disk corruption, but we would probably want to turn off the
tests at allocation and ext4_get_blocks() for performance reasons ---
at least until someone has a chance to do some real benchmarking tests
and confirms whether or not the extra CPU utilization is visible on
say a TPC-H or TPC-C run.

- Ted

2009-05-15 12:50:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>
> it should lay out a 4g file in random 1m direct IOs to fragment it and
> get a lot of extents, then launch 2 threads, one each doing random reads
> and random writes of that same file.
>
> I can't make this trip it, though ...

If all of the blocks are in the page cache, you won't end up calling
ext4_get_blocks(). Try adding a shell script which runs in parallel
doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done".

- Ted

2009-05-15 12:58:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso wrote:
> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>> get a lot of extents, then launch 2 threads, one each doing random reads
>> and random writes of that same file.
>>
>> I can't make this trip it, though ...
>
> If all of the blocks are in the page cache, you won't end up calling
> ext4_get_blocks(). Try adding a shell script which runs in parallel
> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_cache; done".
>
> - Ted

I made sure it was a big enough file, and consumed enough memory on the
system before the test, that the entire file couldn't fit in memory.

I can try doing the dropping in the bg ... but it should have been going
to disk already.

-Eric

2009-05-15 13:07:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 10:25:08AM +0530, Aneesh Kumar K.V wrote:
> So the fix is correct, but we need to update the commit message.

Here's the updated commit message.

- Ted

ext4: Fix race in ext4_inode_info.i_cached_extent

If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time. This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development. Fortunately once the data is stored in the page
cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult. Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>

2009-05-15 15:24:21

by Eric Sandeen

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>>> get a lot of extents, then launch 2 threads, one each doing random reads
>>> and random writes of that same file.
>>>
>>> I can't make this trip it, though ...
>> If all of the blocks are in the page cache, you won't end up calling
>> ext4_get_blocks(). Try adding a shell script which runs in parallel
>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".
>>
>> - Ted
>
> I made sure it was a big enough file, and consumed enough memory on the
> system before the test, that the entire file couldn't fit in memory.
>
> I can try doing the dropping in the bg ... but it should have been going
> to disk already.
>
> -Eric

in a desperate attempt to show the window, I tried this in
ext4_ext_put_in_cache():

cex->ec_block = -1;
cex->ec_start = -1;
schedule_timeout_uninterruptible(HZ/2);
cex->ec_start = start;
cex->ec_block = block;

and this in ext4_ext_in_cache():

if (cex->ec_block == -1 || cex->ec_start == -1)
printk("%s got bad cache\n", __func__);

and it's not firing.

-Eric

2009-05-15 16:27:12

by Eric Sandeen

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Thu, May 14, 2009 at 08:21:05PM -0500, Eric Sandeen wrote:
>>>> it should lay out a 4g file in random 1m direct IOs to fragment it and
>>>> get a lot of extents, then launch 2 threads, one each doing random reads
>>>> and random writes of that same file.
>>>>
>>>> I can't make this trip it, though ...
>>> If all of the blocks are in the page cache, you won't end up calling
>>> ext4_get_blocks(). Try adding a shell script which runs in parallel
>>> doing a "while /bin/true ; do sleep 1; echo 3 > /proc/sys/vm/drop_caches; done".
>>>
>>> - Ted
>> I made sure it was a big enough file, and consumed enough memory on the
>> system before the test, that the entire file couldn't fit in memory.
>>
>> I can try doing the dropping in the bg ... but it should have been going
>> to disk already.
>>
>> -Eric
>
> in a desperate attempt to show the window, I tried this in
> ext4_ext_put_in_cache():
>
> cex->ec_block = -1;
> cex->ec_start = -1;
> schedule_timeout_uninterruptible(HZ/2);
> cex->ec_start = start;
> cex->ec_block = block;
>
> and this in ext4_ext_in_cache():
>
> if (cex->ec_block == -1 || cex->ec_start == -1)
> printk("%s got bad cache\n", __func__);
>
> and it's not firing.

I take it back, needed a different workload.

Sorry for being pedantic, but if this race is so blindingly obvious and
we're getting so few reports, I wanted to be sure we could hit it. With
my artificially wide window now I think I can see it, but I'm still not
winding up with any corruption or EIOs....

-Eric

2009-05-15 23:19:02

by Kevin Shanahan

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Fri, May 15, 2009 at 07:03:14AM +0930, Kevin Shanahan wrote:
> On Thu, May 14, 2009 at 05:23:25PM -0400, Theodore Tso wrote:
> > On Fri, May 15, 2009 at 06:32:45AM +0930, Kevin Shanahan wrote:
> > > Okay, so now I've booted into 2.6.29.3 + check_block_validity patch +
> > > short circuit i_cached_extent patch, mounted the fs without
> > > nodelalloc. I was able to run the full exchange backup without
> > > triggering the check_block_validity error.
> >
> > Great!
> >
> > So here's the final fix (it replaces the short circuit i_cached_extent
> > patch) which I plan to push to Linus. It should be much less of a
> > performance hit than simply short-circuiting i_cached_extent...
> >
> > Thanks so much for helping to track this down!!! If ever someone
> > deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> > you....
>
> Hehe, no problem. Will do the final testing shortly (ran out of time
> this morning, users are back on the system now). Just one little
> correction to your patch below:
>
> > commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> > Author: Theodore Ts'o <[email protected]>
> > Date: Thu May 14 17:09:37 2009 -0400
> >
> > ext4: Fix race in ext4_inode_info.i_cached_extent
> >
> > If one CPU is reading from a file while another CPU is writing to the
> > same file different locations, there is nothing protecting the
> > i_cached_extent structure from being used and updated at the same
> > time. This could potentially cause the wrong location on disk to be
> > read or written to, including potentially causing the corruption of
> > the block group descriptors and/or inode table.
> >
> > Many thanks to Ken Shannah for helping to track down this problem.
> ^^^^^^^^^^^

Tested-by: Kevin Shanahan <[email protected]>

Yes, this patch seems to have fixed the issue. I ran my "exchange
backup to samba share" test on 2.6.29.3 + check_block_validity patch +
the fix race patch with no problems.

Cheers,
Kevin.

2009-05-18 02:14:40

by Theodore Ts'o

Subject: [PATCH] ext4: Add a comprehensive block validity check to ext4_get_blocks() (Was: More ext4 acl/xattr corruption - 4th occurence now)

On Fri, May 15, 2009 at 06:27:19AM -0400, Theodore Tso wrote:
>
> The other problem is that ext4_mb_mark_diskspace_used()'s tests assume
> !flex_bg, and with flex_bg, its checks are largely ineffective.
>
> What I think the right approach probably will be is that we need to
> build an rbtree of the "system zone" blocks at mount time. Especially
> with flex_bg, in the normal case the block group descriptors and
> bitmap blocks will form a very nice, contiguous set of extents that
> should be relatively compactly stored in an rbtree. This will allow
> us to make the tests done by ext4_valid_extent(),
> ext4_mb_mark_diskspace_used(), and ext4_get_blocks() to be ***much***
> more effective and comprehensive.

Here's the patch which I've written up that allows
ext4_valid_extent(), ext4_mb_mark_diskspace_used(),
__ext4_check_blockref(), and ext4_get_blocks() to use the same
infrastructure. Normally, they will all just check to make sure that
the block is between s_first_data_block and s_blocks_count. But if the
mount option block_validity is enabled, the ext4 filesystem will build
an rbtree of all of the filesystem metadata blocks, and
ext4_data_block_valid() will do a full check to make sure the block is
valid.

I suspect we could afford to make block_validity the default, but it
does burn slightly more memory, and will take a little bit of CPU. It
would be nice to benchmark the results and see if it really is
noticeable.

- Ted

commit 93c1e4487e365bc637bf1fd38604817ad6cdf13a
Author: Theodore Ts'o <[email protected]>
Date: Sun May 17 15:38:01 2009 -0400

ext4: Add a comprehensive block validity check to ext4_get_blocks()

To catch filesystem bugs or corruption which could lead to the
filesystem getting severely damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree. This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index a8ff003..8a34710 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -5,8 +5,8 @@
obj-$(CONFIG_EXT4_FS) += ext4.o

ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
- ext4_jbd2.o migrate.o mballoc.o
+ ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
+ ext4_jbd2.o migrate.o mballoc.o block_validity.o

ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
new file mode 100644
index 0000000..50784ef
--- /dev/null
+++ b/fs/ext4/block_validity.c
@@ -0,0 +1,244 @@
+/*
+ * linux/fs/ext4/block_validity.c
+ *
+ * Copyright (C) 2009
+ * Theodore Ts'o ([email protected])
+ *
+ * Track which blocks in the filesystem are metadata blocks that
+ * should never be used as data blocks by files or directories.
+ */
+
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/quotaops.h>
+#include <linux/buffer_head.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/version.h>
+#include <linux/blkdev.h>
+#include <linux/mutex.h>
+#include "ext4.h"
+
+struct ext4_system_zone {
+ struct rb_node node;
+ ext4_fsblk_t start_blk;
+ unsigned int count;
+};
+
+static struct kmem_cache *ext4_system_zone_cachep;
+
+int __init init_ext4_system_zone(void)
+{
+ ext4_system_zone_cachep = KMEM_CACHE(ext4_system_zone,
+ SLAB_RECLAIM_ACCOUNT);
+ if (ext4_system_zone_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+void exit_ext4_system_zone(void)
+{
+ kmem_cache_destroy(ext4_system_zone_cachep);
+}
+
+static inline int can_merge(struct ext4_system_zone *entry1,
+ struct ext4_system_zone *entry2)
+{
+ if ((entry1->start_blk + entry1->count) == entry2->start_blk)
+ return 1;
+ return 0;
+}
+
+/*
+ * Mark a range of blocks as belonging to the "system zone" --- that
+ * is, filesystem metadata blocks which should never be used by
+ * inodes.
+ */
+static int add_system_zone(struct ext4_sb_info *sbi,
+ ext4_fsblk_t start_blk,
+ unsigned int count)
+{
+ struct ext4_system_zone *new_entry = NULL, *entry;
+ struct rb_node **n = &sbi->system_blks.rb_node, *node;
+ struct rb_node *parent = NULL, *new_node = NULL;
+
+ while (*n) {
+ parent = *n;
+ entry = rb_entry(parent, struct ext4_system_zone, node);
+ if (start_blk < entry->start_blk)
+ n = &(*n)->rb_left;
+ else if (start_blk >= (entry->start_blk + entry->count))
+ n = &(*n)->rb_right;
+ else {
+ if (start_blk + count > (entry->start_blk +
+ entry->count))
+ entry->count = (start_blk + count -
+ entry->start_blk);
+ new_node = *n;
+ new_entry = rb_entry(new_node, struct ext4_system_zone,
+ node);
+ break;
+ }
+ }
+
+ if (!new_entry) {
+ new_entry = kmem_cache_alloc(ext4_system_zone_cachep,
+ GFP_KERNEL);
+ if (!new_entry)
+ return -ENOMEM;
+ new_entry->start_blk = start_blk;
+ new_entry->count = count;
+ new_node = &new_entry->node;
+
+ rb_link_node(new_node, parent, n);
+ rb_insert_color(new_node, &sbi->system_blks);
+ }
+
+ /* Can we merge to the left? */
+ node = rb_prev(new_node);
+ if (node) {
+ entry = rb_entry(node, struct ext4_system_zone, node);
+ if (can_merge(entry, new_entry)) {
+ new_entry->start_blk = entry->start_blk;
+ new_entry->count += entry->count;
+ rb_erase(node, &sbi->system_blks);
+ kmem_cache_free(ext4_system_zone_cachep, entry);
+ }
+ }
+
+ /* Can we merge to the right? */
+ node = rb_next(new_node);
+ if (node) {
+ entry = rb_entry(node, struct ext4_system_zone, node);
+ if (can_merge(new_entry, entry)) {
+ new_entry->count += entry->count;
+ rb_erase(node, &sbi->system_blks);
+ kmem_cache_free(ext4_system_zone_cachep, entry);
+ }
+ }
+ return 0;
+}
+
+static void debug_print_tree(struct ext4_sb_info *sbi)
+{
+ struct rb_node *node;
+ struct ext4_system_zone *entry;
+ int first = 1;
+
+ printk(KERN_INFO "System zones: ");
+ node = rb_first(&sbi->system_blks);
+ while (node) {
+ entry = rb_entry(node, struct ext4_system_zone, node);
+ printk("%s%llu-%llu", first ? "" : ", ",
+ entry->start_blk, entry->start_blk + entry->count - 1);
+ first = 0;
+ node = rb_next(node);
+ }
+ printk("\n");
+}
+
+int ext4_setup_system_zone(struct super_block *sb)
+{
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_group_desc *gdp;
+ ext4_group_t i;
+ int flex_size = ext4_flex_bg_size(sbi);
+ int ret;
+
+ if (!test_opt(sb, BLOCK_VALIDITY)) {
+ if (EXT4_SB(sb)->system_blks.rb_node)
+ ext4_release_system_zone(sb);
+ return 0;
+ }
+ if (EXT4_SB(sb)->system_blks.rb_node)
+ return 0;
+
+ for (i=0; i < ngroups; i++) {
+ if (ext4_bg_has_super(sb, i) &&
+ ((i < 5) || ((i % flex_size) == 0)))
+ add_system_zone(sbi, ext4_group_first_block_no(sb, i),
+ sbi->s_gdb_count + 1);
+ gdp = ext4_get_group_desc(sb, i, NULL);
+ ret = add_system_zone(sbi, ext4_block_bitmap(sb, gdp), 1);
+ if (ret)
+ return ret;
+ ret = add_system_zone(sbi, ext4_inode_bitmap(sb, gdp), 1);
+ if (ret)
+ return ret;
+ ret = add_system_zone(sbi, ext4_inode_table(sb, gdp),
+ sbi->s_itb_per_group);
+ if (ret)
+ return ret;
+ }
+
+ if (test_opt(sb, DEBUG))
+ debug_print_tree(EXT4_SB(sb));
+ return 0;
+}
+
+/* Called when the filesystem is unmounted */
+void ext4_release_system_zone(struct super_block *sb)
+{
+ struct rb_node *n = EXT4_SB(sb)->system_blks.rb_node;
+ struct rb_node *parent;
+ struct ext4_system_zone *entry;
+
+ while (n) {
+ /* Do the node's children first */
+ if (n->rb_left) {
+ n = n->rb_left;
+ continue;
+ }
+ if (n->rb_right) {
+ n = n->rb_right;
+ continue;
+ }
+ /*
+ * The node has no children; free it, and then zero
+ * out parent's link to it. Finally go to the
+ * beginning of the loop and try to free the parent
+ * node.
+ */
+ parent = rb_parent(n);
+ entry = rb_entry(n, struct ext4_system_zone, node);
+ kmem_cache_free(ext4_system_zone_cachep, entry);
+ if (!parent)
+ EXT4_SB(sb)->system_blks.rb_node = NULL;
+ else if (parent->rb_left == n)
+ parent->rb_left = NULL;
+ else if (parent->rb_right == n)
+ parent->rb_right = NULL;
+ n = parent;
+ }
+ EXT4_SB(sb)->system_blks.rb_node = NULL;
+}
+
+/*
+ * Returns 1 if the passed-in block region (start_blk,
+ * start_blk+count) is valid; 0 if some part of the block region
+ * overlaps with filesystem metadata blocks.
+ */
+int ext4_data_block_valid(struct ext4_sb_info *sbi, ext4_fsblk_t start_blk,
+ unsigned int count)
+{
+ struct ext4_system_zone *entry;
+ struct rb_node *n = sbi->system_blks.rb_node;
+
+ if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
+ (start_blk + count > ext4_blocks_count(sbi->s_es)))
+ return 0;
+ while (n) {
+ entry = rb_entry(n, struct ext4_system_zone, node);
+ if (start_blk + count - 1 < entry->start_blk)
+ n = n->rb_left;
+ else if (start_blk >= (entry->start_blk + entry->count))
+ n = n->rb_right;
+ else
+ return 0;
+ }
+ return 1;
+}
+
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d164f12..4311cc8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -696,6 +696,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */
#define EXT4_MOUNT_DELALLOC 0x8000000 /* Delalloc support */
#define EXT4_MOUNT_DATA_ERR_ABORT 0x10000000 /* Abort on file data write */
+#define EXT4_MOUNT_BLOCK_VALIDITY 0x20000000 /* Block validity checking */

/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -887,6 +888,7 @@ struct ext4_sb_info {
int s_jquota_fmt; /* Format of quota to use */
#endif
unsigned int s_want_extra_isize; /* New inodes should reserve # bytes */
+ struct rb_root system_blks;

#ifdef EXTENTS_STATS
/* ext4 extents stats */
@@ -1618,6 +1620,15 @@ extern struct dentry *ext4_get_parent(struct dentry *child);
extern const struct inode_operations ext4_symlink_inode_operations;
extern const struct inode_operations ext4_fast_symlink_inode_operations;

+/* block_validity */
+extern void ext4_release_system_zone(struct super_block *sb);
+extern int ext4_setup_system_zone(struct super_block *sb);
+extern int __init init_ext4_system_zone(void);
+extern void exit_ext4_system_zone(void);
+extern int ext4_data_block_valid(struct ext4_sb_info *sbi,
+ ext4_fsblk_t start_blk,
+ unsigned int count);
+
/* extents.c */
extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 27c383c..d04b779 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -326,32 +326,18 @@ ext4_ext_max_entries(struct inode *inode, int depth)

static int ext4_valid_extent(struct inode *inode, struct ext4_extent *ext)
{
- ext4_fsblk_t block = ext_pblock(ext), valid_block;
+ ext4_fsblk_t block = ext_pblock(ext);
int len = ext4_ext_get_actual_len(ext);
- struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;

- valid_block = le32_to_cpu(es->s_first_data_block) +
- EXT4_SB(inode->i_sb)->s_gdb_count;
- if (unlikely(block <= valid_block ||
- ((block + len) > ext4_blocks_count(es))))
- return 0;
- else
- return 1;
+ return ext4_data_block_valid(EXT4_SB(inode->i_sb), block, len);
}

static int ext4_valid_extent_idx(struct inode *inode,
struct ext4_extent_idx *ext_idx)
{
- ext4_fsblk_t block = idx_pblock(ext_idx), valid_block;
- struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;
+ ext4_fsblk_t block = idx_pblock(ext_idx);

- valid_block = le32_to_cpu(es->s_first_data_block) +
- EXT4_SB(inode->i_sb)->s_gdb_count;
- if (unlikely(block <= valid_block ||
- (block >= ext4_blocks_count(es))))
- return 0;
- else
- return 1;
+ return ext4_data_block_valid(EXT4_SB(inode->i_sb), block, 1);
}

static int ext4_valid_extent_entries(struct inode *inode,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d7b7480..dadd3f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -372,20 +372,21 @@ static int ext4_block_to_path(struct inode *inode,
}

static int __ext4_check_blockref(const char *function, struct inode *inode,
- __le32 *p, unsigned int max) {
-
- unsigned int maxblocks = ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es);
+ __le32 *p, unsigned int max)
+{
__le32 *bref = p;
+ unsigned int blk;
+
while (bref < p+max) {
- if (unlikely(le32_to_cpu(*bref) >= maxblocks)) {
+ blk = le32_to_cpu(*bref++);
+ if (blk &&
+ unlikely(!ext4_data_block_valid(EXT4_SB(inode->i_sb),
+ blk, 1))) {
ext4_error(inode->i_sb, function,
- "block reference %u >= max (%u) "
- "in inode #%lu, offset=%d",
- le32_to_cpu(*bref), maxblocks,
- inode->i_ino, (int)(bref-p));
+ "invalid block reference %u "
+ "in inode #%lu", blk, inode->i_ino);
return -EIO;
}
- bref++;
}
return 0;
}
@@ -1125,6 +1126,21 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
ext4_discard_preallocations(inode);
}

+static int check_block_validity(struct inode *inode, sector_t logical,
+ sector_t phys, int len)
+{
+ if (!ext4_data_block_valid(EXT4_SB(inode->i_sb), phys, len)) {
+ ext4_error(inode->i_sb, "check_block_validity",
+ "inode #%lu logical block %llu mapped to %llu "
+ "(size %d)", inode->i_ino,
+ (unsigned long long) logical,
+ (unsigned long long) phys, len);
+ WARN_ON(1);
+ return -EIO;
+ }
+ return 0;
+}
+
/*
* The ext4_get_blocks() function tries to look up the requested blocks,
* and returns if the blocks are already mapped.
@@ -1170,6 +1186,13 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block,
}
up_read((&EXT4_I(inode)->i_data_sem));

+ if (retval > 0 && buffer_mapped(bh)) {
+ int ret = check_block_validity(inode, block,
+ bh->b_blocknr, retval);
+ if (ret != 0)
+ return ret;
+ }
+
/* If it is only a block(s) look up */
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
return retval;
@@ -1245,6 +1268,12 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block,
ext4_da_update_reserve_space(inode, retval);

up_write((&EXT4_I(inode)->i_data_sem));
+ if (retval > 0 && buffer_mapped(bh)) {
+ int ret = check_block_validity(inode, block,
+ bh->b_blocknr, retval);
+ if (ret != 0)
+ return ret;
+ }
return retval;
}

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 541bd9a..ed8482e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2961,15 +2961,10 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
+ le32_to_cpu(es->s_first_data_block);

len = ac->ac_b_ex.fe_len;
- if (in_range(ext4_block_bitmap(sb, gdp), block, len) ||
- in_range(ext4_inode_bitmap(sb, gdp), block, len) ||
- in_range(block, ext4_inode_table(sb, gdp),
- EXT4_SB(sb)->s_itb_per_group) ||
- in_range(block + len - 1, ext4_inode_table(sb, gdp),
- EXT4_SB(sb)->s_itb_per_group)) {
+ if (!ext4_data_block_valid(sbi, block, len)) {
ext4_error(sb, __func__,
- "Allocating block %llu in system zone of %d group\n",
- block, ac->ac_b_ex.fe_group);
+ "Allocating blocks %llu-%llu which overlap "
+ "fs metadata\n", block, block+len);
/* File system mounted not to panic on error
* Fix the bitmap and repeat the block allocation
* We leak some of the blocks here.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dc34ed3..600b7ad 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -568,6 +568,7 @@ static void ext4_put_super(struct super_block *sb)
struct ext4_super_block *es = sbi->s_es;
int i, err;

+ ext4_release_system_zone(sb);
ext4_mb_release(sb);
ext4_ext_release(sb);
ext4_xattr_put_super(sb);
@@ -1055,6 +1056,7 @@ enum {
Opt_ignore, Opt_barrier, Opt_nobarrier, Opt_err, Opt_resize,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc,
+ Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio
};

@@ -1114,6 +1116,8 @@ static const match_table_t tokens = {
{Opt_resize, "resize"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
+ {Opt_block_validity, "block_validity"},
+ {Opt_noblock_validity, "noblock_validity"},
{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
{Opt_journal_ioprio, "journal_ioprio=%u"},
{Opt_auto_da_alloc, "auto_da_alloc=%u"},
@@ -1508,6 +1512,12 @@ set_qf_format:
case Opt_delalloc:
set_opt(sbi->s_mount_opt, DELALLOC);
break;
+ case Opt_block_validity:
+ set_opt(sbi->s_mount_opt, BLOCK_VALIDITY);
+ break;
+ case Opt_noblock_validity:
+ clear_opt(sbi->s_mount_opt, BLOCK_VALIDITY);
+ break;
case Opt_inode_readahead_blks:
if (match_int(&args[0], &option))
return 0;
@@ -2826,6 +2836,13 @@ no_journal:
} else if (test_opt(sb, DELALLOC))
printk(KERN_INFO "EXT4-fs: delayed allocation enabled\n");

+ err = ext4_setup_system_zone(sb);
+ if (err) {
+ printk(KERN_ERR "EXT4-fs: failed to initialize system "
+ "zone (%d)\n", err);
+ goto failed_mount4;
+ }
+
ext4_ext_init(sb);
err = ext4_mb_init(sb, needs_recovery);
if (err) {
@@ -2875,6 +2892,7 @@ cantfind_ext4:

failed_mount4:
printk(KERN_ERR "EXT4-fs (device %s): mount failed\n", sb->s_id);
+ ext4_release_system_zone(sb);
if (sbi->s_journal) {
jbd2_journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
@@ -3515,6 +3533,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
sb->s_flags &= ~MS_RDONLY;
}
}
+ ext4_setup_system_zone(sb);
if (sbi->s_journal == NULL)
ext4_commit_super(sb, 1);

@@ -3927,13 +3946,16 @@ static int __init init_ext4_fs(void)
{
int err;

+ err = init_ext4_system_zone();
+ if (err)
+ return err;
ext4_kset = kset_create_and_add("ext4", NULL, fs_kobj);
if (!ext4_kset)
- return -ENOMEM;
+ goto out4;
ext4_proc_root = proc_mkdir("fs/ext4", NULL);
err = init_ext4_mballoc();
if (err)
- return err;
+ goto out3;

err = init_ext4_xattr();
if (err)
@@ -3958,6 +3980,11 @@ out1:
exit_ext4_xattr();
out2:
exit_ext4_mballoc();
+out3:
+ remove_proc_entry("fs/ext4", NULL);
+ kset_unregister(ext4_kset);
+out4:
+ exit_ext4_system_zone();
return err;
}

@@ -3972,6 +3999,7 @@ static void __exit exit_ext4_fs(void)
exit_ext4_mballoc();
remove_proc_entry("fs/ext4", NULL);
kset_unregister(ext4_kset);
+ exit_ext4_system_zone();
}

MODULE_AUTHOR("Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others");

2009-05-19 10:00:58

by Thierry Vignaud

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso <[email protected]> writes:

> So here's the final fix (it replaces the short circuit i_cached_extent
> patch) which I plan to push to Linus. It should be much less of a
> performance hit than simply short-circuiting i_cached_extent...
>
> Thanks so much for helping to track this down!!! If ever someone
> deserved an "Ext4 Baker Street Irregulars" T-shirt, it would be
> you....
>
> - Ted
>
> commit 039ed7a483fdcb2dbbc29f00cd0d74c101ab14c5
> Author: Theodore Ts'o <[email protected]>
> Date: Thu May 14 17:09:37 2009 -0400
>
> ext4: Fix race in ext4_inode_info.i_cached_extent
>
> If one CPU is reading from a file while another CPU is writing to the
> same file different locations, there is nothing protecting the
> i_cached_extent structure from being used and updated at the same
> time. This could potentially cause the wrong location on disk to be
> read or written to, including potentially causing the corruption of
> the block group descriptors and/or inode table.
>
> Many thanks to Ken Shannah for helping to track down this problem.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>

I wonder if that would explain the corruption I reported a couple weeks
ago.

Now I remember I wrongly got 2 parallel cp from the same source
directory to the same target directory.
Could this be the cause?

2009-05-19 11:36:59

by Theodore Ts'o

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Tue, May 19, 2009 at 12:00:56PM +0200, Thierry Vignaud wrote:
> I wonder if that would explain the corruption I reported a couple weeks
> ago.
>
> Now I remember I wrongly got 2 parallel cp from the same source
> directory to the same target directory.
>
> Could this be the cause?

Yes, it's possible. There could be a problem either if the two cp's
tried targeting the same file at the same time, or if the directory
was getting expanded at the same time by the two different processes.
Since we don't cache the logical->physical mapping for directories
(because of the journalling requirements, directories are stored in
the buffer cache, not the page cache), we are rather more likely to
run into problems with directories; more so since directories will
tend to be fragmented.

On the other hand, most of the time writes into the directory will
tend to be into pre-existing free space; but if you had two parallel
cp's copying a large number of files into the same directory, that
could certainly happen.

- Ted



2009-05-19 12:02:27

by Alex Tomas

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso wrote:
> On the other hand, most of the time writes into the directory will
> tend to be into pre-existing free space; but if you had two parallel
> cp's copying a large number of files into the same directory, that
> could certainly happen.

isn't access to a directory protected with i_mutex ?

thanks, Alex


2009-05-19 15:04:44

by Theodore Ts'o

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote:
> Theodore Tso wrote:
>> On the other hand, most of the time writes into the directory will
>> tend to be into pre-existing free space; but if you had two parallel
>> cp's copying a large number of files into the same directory, that
>> could certainly happen.
>
> isn't access to a directory protected with i_mutex ?

Hmm, good point. Yes, that should prevent problems with directories.
So there should only be a problem when two processes are writing to
the same file at the same time.

- Ted

2009-05-19 15:17:01

by Alex Tomas

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso wrote:
> On Tue, May 19, 2009 at 04:01:36PM +0400, Alex Tomas wrote:
>> Theodore Tso wrote:
>>> On the other hand, most of the time writes into the directory will
>>> tend to be into pre-existing free space; but if you had two parallel
>>> cp's copying a large number of files into the same directory, that
>>> could certainly happen.
>> isn't access to a directory protected with i_mutex ?
>
> Hmm, good point. Yes, that should prevent problems with directories.
> So there should only be a problem when two processes are writing to
> the same file at the same time.

I guess reading can corrupt it as well?

thanks, Alex

2009-05-19 15:18:08

by Thierry Vignaud

Subject: Re: More ext4 acl/xattr corruption - 4th occurence now

Theodore Tso <[email protected]> writes:

> >> On the other hand, most of the time writes into the directory will
> >> tend to be into pre-existing free space; but if you had two parallel
> >> cp's copying a large number of files into the same directory, that
> >> could certainly happen.
> >
> > isn't access to a directory protected with i_mutex ?
>
> Hmm, good point. Yes, that should prevent problems with directories.
> So there should only be a problem when two processes are writing to
> the same file at the same time.

I remember that happened...