2009-04-28 09:28:08

by bugzilla-daemon

Subject: [Bug 13201] New: kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201

Summary: kernel BUG at fs/ext4/extents.c:2737
Product: File System
Version: 2.5
Kernel Version: 2.6.29.1
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: ext4
AssignedTo: [email protected]
ReportedBy: [email protected]
Regression: No


I got this while testing ext4 on an external RAID system. The system has 4
identical RAID systems, each with a single 13TB filesystem. Only one of the 4
failed the test, which was simply to write 8GB files until the disk filled up.
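
For readers who want to try something similar, the test described above can be sketched roughly as follows. This is not the script used for the report (that was not posted); the function name fill_with_files, the file naming, the chunk size and the max_files safety limit are all made up for illustration, and "8GB" is taken to mean 8 GiB.

```python
import errno
import os


def fill_with_files(directory, file_size, chunk_size=1 << 20, max_files=None):
    """Write files of file_size bytes into directory until the disk is full.

    Returns the number of files written completely. max_files is a
    hypothetical safety limit for dry runs, not part of the original test.
    """
    chunk = b"\0" * chunk_size
    written = 0
    while max_files is None or written < max_files:
        path = os.path.join(directory, "fill-%06d.dat" % written)
        try:
            with open(path, "wb") as f:
                remaining = file_size
                while remaining > 0:
                    n = min(remaining, chunk_size)
                    f.write(chunk[:n])
                    remaining -= n
                # Push the data to the device so errors surface here.
                f.flush()
                os.fsync(f.fileno())
        except OSError as e:
            if e.errno == errno.ENOSPC:
                break  # disk full: the normal end of the test
            raise      # anything else (e.g. EIO) is a real failure
        written += 1
    return written
```

A full run on one of the filesystems from this report would be something like fill_with_files("/data143", 8 * 1024**3). An I/O error such as the -5 (EIO) in the log below would propagate as an exception instead of ending the run quietly.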

The complete messages file is attached.

Apr 25 01:59:38 echo19 kernel: EXT4-fs error (device dm-2):
__ext4_get_inode_loc: unable to read inode block - inode=761860,
block=3612686232
Apr 25 01:59:38 echo19 kernel: EXT4-fs error (device dm-2) in
ext4_reserve_inode_write: IO failure
Apr 25 01:59:38 echo19 kernel: mpage_da_map_blocks block allocation failed for
inode 761860 at logical offset 699276 with max blocks 1024 with error -5
Apr 25 01:59:38 echo19 kernel: This should not happen.!! Data will be lost
Apr 25 01:59:38 echo19 kernel: ------------[ cut here ]------------
Apr 25 01:59:38 echo19 kernel: kernel BUG at fs/ext4/extents.c:2737!

The filesystem was totally inaccessible so I reset the system. On reboot, the
filesystem couldn't be mounted - bad superblock.

I had to run fsck a few times before I could remount the filesystem. All the
directories were lost, but the test files were intact in lost+found.

I can't see any errors anywhere that might indicate that this is a hardware
problem but these are brand new systems using SAS host connections which we
haven't used before.

I've remade the broken filesystem and restarted the test on this and 11 other
identical filesystems; I'll let you know if the problem recurs.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


2009-04-29 01:06:10

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #2 from Franco Broi <[email protected]> 2009-04-29 01:06:09 ---
Of the 12 tests, 2 produced errors.

EXT4-fs error (device dm-5): ext4_mb_generate_buddy: EXT4-fs: group 9: 32768
blocks in bitmap, 1023 in gd

The filesystem seems OK, I can ls the test files.

EXT4-fs error (device dm-3): ext4_mb_generate_buddy: EXT4-fs: group 0: 32768
blocks in bitmap, 970 in gd
EXT4-fs error (device dm-3): ext4_mb_generate_buddy: EXT4-fs: group 0: 32768
blocks in bitmap, 32766 in gd
EXT4-fs error (device dm-3): ext4_init_block_bitmap: Checksum bad for group 1
EXT4-fs error (device dm-3): ext4_mb_generate_buddy: EXT4-fs: group 1: 0 blocks
in bitmap, 1023 in gd
EXT4-fs error (device dm-3): ext4_dx_find_entry: bad entry in directory #15:
directory entry across blocks - offset=28672, inode=0, rec_len=65536,
name_len=0
EXT4-fs error (device dm-3): ext4_add_entry: bad entry in directory #15:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0
EXT4-fs error (device dm-3): htree_dirblock_to_tree: bad entry in directory #2:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0
EXT4-fs error (device dm-3): htree_dirblock_to_tree: bad entry in directory #2:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0
EXT4-fs error (device dm-3): htree_dirblock_to_tree: bad entry in directory #2:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0
EXT4-fs error (device dm-3): htree_dirblock_to_tree: bad entry in directory #2:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0
EXT4-fs error (device dm-3): htree_dirblock_to_tree: bad entry in directory #2:
directory entry across blocks - offset=0, inode=0, rec_len=65536, name_len=0

Although df looks ok
Filesystem           1K-blocks        Used Available Use% Mounted on
/dev/mapper/vgdata--143-data143
                   13456415384 13232157624 224257760  99% /data143
# ls /data143

Produces no output.
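
When comparing many such logs across machines, the ext4_mb_generate_buddy lines can be pulled out with a small script. The sketch below is only an illustration: the pattern is derived from the message text quoted above, not from the kernel source, and the helper name find_mismatches is made up. The two numbers appear to be the free-block count derived from the block bitmap and the count stored in the group descriptor.

```python
import re

# Matches the mismatch message as it appears in the logs quoted above.
PATTERN = re.compile(
    r"\(device (?P<dev>[\w-]+)\): ext4_mb_generate_buddy: EXT4-fs: "
    r"group (?P<group>\d+): (?P<bitmap>\d+) blocks in bitmap, (?P<gd>\d+) in gd"
)


def find_mismatches(log_text):
    """Return (device, group, blocks_in_bitmap, blocks_in_gd) tuples."""
    # Mail clients wrap long lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m["dev"], int(m["group"]), int(m["bitmap"]), int(m["gd"]))
        for m in PATTERN.finditer(flat)
    ]


sample = """\
EXT4-fs error (device dm-3): ext4_mb_generate_buddy: EXT4-fs: group 1: 0 blocks
in bitmap, 1023 in gd
"""
for dev, group, bitmap, gd in find_mismatches(sample):
    print(dev, group, bitmap, gd)
```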

At this point I will need to switch back to ext3 so that I can get this disk
into production but I do have a small window to run some more tests if anyone
has any ideas.


2009-04-29 03:15:34

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201


Eric Sandeen <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #3 from Eric Sandeen <[email protected]> 2009-04-29 03:15:31 ---
Franco, sorry we haven't gotten back with suggestions on this. It looks like
you have hit a couple different end results. We've had a few reports of
corruption on larger filesystems which makes us wonder if there might be a
problem above 8T somewhere...
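
As a rough illustration of why 8T is a natural place to look: with 4 KiB blocks, the 8 TiB mark falls exactly at block number 2^31, the first value that no longer fits in a signed 32-bit integer. The arithmetic below is a sketch under assumptions. The 4 KiB block size is inferred from the 32768 blocks per group in the logs, not confirmed by the reporter, and nothing in this thread establishes that such a truncation actually exists in the code.

```python
BLOCK_SIZE = 4096   # assumption: 4 KiB blocks (32768 blocks per group)
TIB = 1 << 40

# Number of 4 KiB blocks that fit below the 8 TiB mark.
blocks_below_8t = 8 * TIB // BLOCK_SIZE

# Block number from the first error message in this report.
failing_block = 3612686232

print(blocks_below_8t == 1 << 31)        # is the 8 TiB boundary block 2**31?
print(failing_block > blocks_below_8t)   # does the failing block lie above it?
print(failing_block * BLOCK_SIZE / TIB)  # byte offset of that block, in TiB
```

This only shows that the block named in the first error lies above the 8 TiB mark; it says nothing about the cause.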

The current upstream git tree (or the 2.6.30-rc3-git5 prepatch) has more extent
validity checking in it; if you do have the time for another test, running on
that codebase may yield more info, depending on where the problem lies.

-Eric


2009-04-30 03:00:50

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #4 from Franco Broi <[email protected]> 2009-04-30 03:00:48 ---
I ran a test overnight using 2.6.30-rc3-git5 and it didn't fail. Not sure if
this is a good or bad thing?

I've deleted the files and started the test again.

By the way, deleting files with ext4 is lightning fast; it only takes about 5
minutes to delete 13TB! Again, not sure if this is a good or bad thing, as it
doesn't give you much time to hit Ctrl-C...


2009-04-30 09:39:22

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #5 from Franco Broi <[email protected]> 2009-04-30 09:39:21 ---
I've now got filesystem corruption with 2.6.30-rc3-git5; it looks pretty much
the same as before.

Apr 30 17:30:56 echo20 kernel: EXT4-fs error (device dm-3):
ext4_mb_generate_buddy: EXT4-fs: group 0: 32768 blocks in bitmap, 23495 in gd
Apr 30 17:30:56 echo20 kernel: EXT4-fs error (device dm-3):
ext4_mb_mark_diskspace_used: Allocating block 1024 in system zone of 0 group

When I do an ls in the test directory I get lots of Input/output errors:

EXT4-fs error (device dm-3): ext4_lookup: deleted inode referenced: 127
EXT4-fs error (device dm-3): ext4_lookup: deleted inode referenced: 358
EXT4-fs error (device dm-3): ext4_lookup: deleted inode referenced: 196

Anything you want me to try?


2009-05-19 18:04:06

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201


Theodore Tso <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #6 from Theodore Tso <[email protected]> 2009-05-19 18:04:05 ---
Could you try replicating this problem in 2.6.30-rc6? We fixed a race
condition in i_cached_extents that could very well have caused your problem. I'm
hoping it will close this and a few other mystery bug reports we've had over the
past couple of months. (The bug is an old one, but we had struggled to find a
reliable reproduction case.)


2009-05-23 10:10:52

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #7 from Franco Broi <[email protected]> 2009-05-23 10:10:50 ---
I won't be able to recreate the original test conditions, but I'll run a test
with a single large filesystem within a couple of weeks.


2009-06-05 00:51:52

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #8 from Franco Broi <[email protected]> 2009-06-05 00:51:52 ---
I haven't been able to recreate the problem using 2.6.30-rc8 but the test
conditions aren't identical to before. Would it make a difference that only a
single filesystem is being written to and not 4 simultaneously as in the
original tests?


2009-06-08 16:49:26

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #9 from Theodore Tso <[email protected]> 2009-06-08 16:49:28 ---
If this is the same problem as the one which we fixed with identical symptoms,
what matters is multiple processes/threads writing to the same file at the same
time. People using NFS or SAMBA on a backup server seemed to be the most common
scenario for triggering this (admittedly very hard to reproduce) bug. We
finally got lucky in that someone had a setup which allows for reliable
reproduction of the bug, so we could finally sink our teeth into it.

So if what you saw was the same as the bug we fixed in 2.6.30-rc6, no, it
shouldn't make a difference. If it is a completely different bug, then of
course all bets are off. In general, though, whether you are writing to one
filesystem or 4 filesystems shouldn't make a difference, except that it
might change the timing necessary to hit a race condition. (The bug that we
found and fixed was highly timing dependent; in fact, even after we found the
problem, we weren't able to come up with a reliable reproduction case, even
though the problem was obvious on paper and the one user who could reliably
reproduce it reported that it went away once the patch was applied. IIRC, Eric
finally put a delay into the code to widen the race window to the point where
he could replicate it.)


2009-06-09 05:10:30

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201





--- Comment #10 from Franco Broi <[email protected]> 2009-06-09 05:10:31 ---
(In reply to comment #9)
> If this is the same problem as the one which we fixed with identical symptoms,
> what matters is multiple processes/threads writing to the same file at the same
> time.

Then it doesn't sound like it's the same bug. My tests are very simple: they
just keep writing 8GB files until the disk fills up. There is no concurrent
access to files or even the filesystem, and the machines are completely
standalone.


2009-08-26 18:11:08

by bugzilla-daemon

Subject: [Bug 13201] kernel BUG at fs/ext4/extents.c:2737

http://bugzilla.kernel.org/show_bug.cgi?id=13201


Valerie Aurora <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #11 from Valerie Aurora <[email protected]> 2009-08-26 18:11:07 ---
Given that the bug appears to be fixed, and we can't reproduce the original
conditions or get more data on this bug, it seems like we should close this
bug.
