2009-03-05 16:43:12

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] New: filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821

Summary: filesystem corrupts on heavy I/O
Product: File System
Version: 2.5
KernelVersion: ubuntu-server-2.6.27-9 (smp)
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: ext4
AssignedTo: [email protected]
ReportedBy: [email protected]


Hi

Hoping for a bit of advice from here, since the Ubuntu Server team is
completely unresponsive.

Failing kernel version: 2.6.27-9
Distribution: Ubuntu Server 8.10
Hardware Environment: VMware ESX
Problem Description:

On one box with particularly heavy database I/O, we're seeing this in
the system log:

==================
[1099257.456522] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0),
depth 1(2)
[1099257.495979] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 328, max 340(0),
depth 1(2)
[1099257.505934] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0),
depth 1(2)
==================

... ad infinitum.

Kernel (version above) installed a couple months ago, never changed since the
box does not have internet access per default.

dumpe2fs for the filesystem:
===================
dumpe2fs 1.41.3 (12-Oct-2008)
Last mounted on: <not available>
Filesystem UUID: 7adcc7a6-5dd4-4fd5-b988-c92f9429a06c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 90177536
Block count: 360683091
Reserved block count: 18034154
Free blocks: 116980051
Free inodes: 90170042
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 938
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue Dec 2 17:23:58 2008
Last mount time: Wed Feb 18 21:44:25 2009
Last write time: Tue Mar 3 15:19:14 2009
Mount count: 5
Maximum mount count: 30
Last checked: Tue Dec 2 17:23:58 2008
Check interval: 15552000 (6 months)
Next check after: Sun May 31 18:23:58 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2180b9fc-09e3-445e-8170-302178b2eadd
Journal backup: inode blocks
Journal size: 128M
========================

(Deceased launchpad report: https://bugs.launchpad.net/bugs/337246)

What to do?


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


2009-03-05 17:26:49

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #1 from [email protected] 2009-03-05 09:26 -------
Is this something you can reproduce? I'm not remembering anything related
offhand other than an early bug in the extent code, which I think should be
fixed well before 2.6.27.

If you find inode #2621457, I expect that it is very fragmented, and I'd guess
that any access to it (reading it, that is, or perhaps extending it) will trip
this error.

If you can try a later kernel, maybe 2.6.27.x or 28.x or even .29 it'd let us
know if the problem still exists; if not maybe there is more debugging we can
do.

Providing an e2image -r of the filesystem would also let us see what kind of
layout that file has, though from your comments I'm guessing this may not be
possible due to security reasons.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 17:46:45

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #2 from [email protected] 2009-03-05 09:46 -------
It continually spews messages in syslog while it's powered on, but that's not
exactly "reproduction" I guess.

Tried to find the inode in question with debugfs, it said that this particular
inode does not belong to any file - maybe because the inode contains filesystem
metadata, not file data? Anyhow, no useful output - probably in no small part
because I have no clue what I'm doing with debugfs ;-)

New kernel: Okay, upgraded to the latest available Ubuntu Server kernel. Had
to yank the power; the sync would never finish (vmstat says it synced about
1kB/s). Looks like the filesystem is also super-slow for some reason.

(Possible that a logging daemon was putting error messages in the buffer cache
while the sync got farther and farther behind trying to sync those, not sure.
Did kill all loggers at one point, but didn't help much. Hmm.)

It came up with the new kernel, did a fsck, which finished off with "fsck died
with exit status 1".

Will try e2image in a moment..


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 18:06:53

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #3 from [email protected] 2009-03-05 10:06 -------
I am thinking that the header check in search_right, in the loop, is still
backwards but i'm in the middle of leaving on a road trip so may be thinking
about it ... backwards. If others don't pick this up I'll look more tomorrow.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 18:13:48

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #4 from [email protected] 2009-03-05 10:13 -------
The error messages you are seeing are the result of the filesystem being
corrupted. The kernel will continue to printk those messages until the
filesystem is fixed. The fact that the fsck died with exit status 1 probably
means you need to upgrade to e2fsprogs 1.41.4 (1.41.3 and earlier e2fsck had a
bug that would be triggered by certain types of corruption).

Are you willing to try a newer kernel? This looks familiar and I think it's
fixed in the latest 2.6.29-rc7 kernel. I'll need to go back and figure out
which bug fix this is most likely related to, and check whether it's been
backported to the 2.6.27.y stable series.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 18:59:53

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #5 from [email protected] 2009-03-05 10:59 -------
@sandeen: Thanks a *lot*. Have a nice trip!

@tytso: Same to you, info much appreciated! (I was wondering if the fsck thing
was an anomaly or not.)

> Are you willing to try a newer kernel?
> This looks familiar and I think it's
> fixed in the latest 2.6.29-rc7 kernel.

I'd rather not since it's production stuff, but then again if you know what the
problem is, and a shiny new kernel is the cure, then yes. I'll have to live on
the edge, bleeding. (or whatever it's called..)

FYI, e2image is coming up when it's done (running for about an hour now).


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 19:06:53

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #6 from [email protected] 2009-03-05 11:06 -------
If it's what I think it is, it's not an anomaly; e2fsck 1.41.3 had a bug which
would cause it to not fix certain types of corruption in the extent tree, and
it would just bail out with an exit(1). And the error messages that the
kernel printed are consistent with it. Did you try re-running a forced fsck on
the filesystem after you got that failure?


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-05 19:25:57

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #7 from [email protected] 2009-03-05 11:25 -------
> If it's what I think it is, it's not an anomaly;

(By "anomaly" I just meant that I have no clue whether "fsck died with exit
status 1" means "all OK" or "something bad happened". The console message
doesn't really say anything about whether fsck exit code 1 is a good, bad or
neutral thing. Just sort of states the fact, heh.)

> e2fsck 1.41.3 had a bug which would cause it to ... bail out with an exit(1).

Got it. No, I didn't try a forced fsck (didn't know it would be different from
an automated one), but I'll re-mount ro and do just that when e2image is
finished!

If it still breaks, I'll upgrade e2progs as suggested?


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-06 06:12:14

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #8 from [email protected] 2009-03-05 22:11 -------
Created an attachment (id=20443)
--> (http://bugzilla.kernel.org/attachment.cgi?id=20443&action=view)
potential fix

I have not yet tested this patch, but it may fix your problem.

you can wait 'til I've had a chance to test it, or give it a go; if I'm wrong,
the worst you should get is more corrupt header messages and EIO errors. :)

The ext4_ext_search_right() function is confusing; it uses a "depth" variable
which is 0 at the root and maximum at the leaves, but the on-disk metadata uses
a "depth" (actually eh_depth) which is opposite: maximum at the root, and 0 at
the leaves.

The ext4_ext_check_header() function expects the on-disk semantics, but we are
feeding it the opposite in the while loop in this function. We should be
feeding it the on-disk notion of "depth" which we can get with (p_depth -
depth) - and if you look, the last (more commonly hit) call to
ext4_ext_check_header() does just this.

For a depth 4 tree (p_depth 3) (which would be an incredible amount of extents,
so perhaps not your problem, but perhaps....), then, you'd have:

"depth" in function eh_depth p_depth - depth
0 3 3
1 2 2
2 1 <---- I think you are here 1
3 0 0

Your error messages:
[1099257.456522] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0),
depth 1(2)

indicate that it was asked to check that the header was depth (2) but it
actually found depth 1 in the disk metadata.

I've not yet crafted a test to hit this codepath at all, so have not tested
this yet, but by inspection I *think* it is correct.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-06 07:16:14

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #9 from [email protected] 2009-03-05 23:16 -------
Well, the first thing I would do before doing anything else is upgrade to
e2fsprogs 1.41.4, and then run '"script -c "e2fsck -fv /dev/sdb1" e2fsck.log',
and upload the e2fsck.log file to bugzilla. Then see if the problem goes away.
The fact that e2fsck exited with a status code one of 1 causes me to suspect
that you really did have an inconsistent filesystem, and it's not the problem
which Eric is chasing.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-06 14:18:51

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #10 from [email protected] 2009-03-06 06:18 -------
e2fsprogs 1.41.4ubuntu1 is available in Jaunty so you could pick those tools up
from the URL below:


http://archive.ubuntu.com/ubuntu/pool/main/e/e2fsprogs/e2fsprogs_1.41.4-1ubuntu1_amd64.deb

http://archive.ubuntu.com/ubuntu/pool/main/e/e2fsprogs/e2fsprogs_1.41.4-1ubuntu1_i386.deb


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-06 20:21:43

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #11 from [email protected] 2009-03-06 12:21 -------
Ok, at least w/ e2fsprogs-1.41.4 the fs doesn't look too bad:

# e2fsck -fn sdb1.e2i
e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (102518871, counted=100960243).
Fix? no

Free inodes count wrong (90170045, counted=90170006).
Fix? no


LargeData: ********** WARNING: Filesystem still has errors **********

LargeData: 7491/90177536 files (39.4% non-contiguous), 258164220/360683091
blocks

the inode in question:

debuge4fs: ncheck 2621457
Inode Pathname
2621457 /mysql/0004_rldb/fw1log20090302.MYI

is pretty fragmented:

# filefrag mysql/0004_rldb/fw1log20090302.MYI
File is stored in extents format
mysql/0004_rldb/fw1log20090302.MYI: 5820 extents found

There is a really large number of indirect (i.e. extent index) blocks in that
file:

# debugfs -c sdb1.e2i -R "stat <2621457>" | sed -e "s/, /\n/g" | grep IND | wc
-l
debugfs 1.41.4 (27-Jan-2009)
sdb1.e2i: catastrophic mode - not reading inode or group bitmaps
2730

I would still try my patch.

-Eric


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-06 20:30:47

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #12 from [email protected] 2009-03-06 12:30 -------
Created an attachment (id=20449)
--> (http://bugzilla.kernel.org/attachment.cgi?id=20449&action=view)
stat of the problematic inode in (hacked) debugfs

This is for the ext4 devs as much as anything; it's a hacked stat <inode> of
the inode that generated the messages.

It does have 3 levels, which would be sufficient to enter the code that
generated the warning, due to what I believe is your bug.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-10 10:03:41

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #13 from [email protected] 2009-03-10 03:03 -------
dd'd the problematic file to /dev/null just now. Yielded no complaints in the
syslog about unexpected eh_depths. Maybe the reboot+fsck had some effect.

In retrospect, I should probably have found a network mount (since write speed
suffered badly when the problem hit) and done the e2image before reboot/fsck,
so we'd have a pristine image. A quick HOWTO on the subject with best
practices for ad-hoc network mounting (using random kernel/distro) plus a copy
of relevant parts of the e2image man page would be helpful to others in the
same situation.

comment #8:
Priceless information. Thanks!

comment #11:
Transcripts are great; I'm learning a bit of ext4 debugging as we go along. I
take it that there's no on-disk problem (now), then. Confirmed by running fsck
1.41.4 as requested in comment #9:

e2fsck 1.41.4 (27-Jan-2009)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

7648 inodes used (0.01%)
2994 non-contiguous files (39.1%)
17 non-contiguous directories (0.2%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 3710/3891/35/2
273816895 blocks used (75.92%)
0 bad blocks
7 large files

7593 regular files
46 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
--------
7639 files

comment #10:
Thanks! Used above. There's some Ubuntu-related issues, I'll try my luck with
Launchpad for those *cross fingers*.

comment #11 (again):
Will try the patch ASAP, meaning once I've figured out how to best "inject" it
into the local Ubuntu eco-system without disrupting the package management,
updates etc.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-13 07:12:52

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #14 from [email protected] 2009-03-13 00:11 -------
I'm seeing a similar error with heavy read/write I/O on a 1TB ext4 volume. It's
not clear what behavior triggers the error for me. Occasionally I see the error
in dmesg...

[ 7829.004269] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491097: unexpected eh_depth - magic f30a, entries 78, max 340(0),
depth 1(2)
[ 7829.012197] mpage_da_map_blocks block allocation failed for inode 2491097 at
logical offset 2788227 with max blocks 7 with error -5
[ 7829.012220] This should not happen.!! Data will be lost

...and sync does not complete. It's not clear from the discussion if e2image
needs to be done before the error occurs on a clean mount, or afterward while
the system is up.

When this happens the reboot procedure does not complete, and I power cycle the
machine. After a power cycle, the journal replays and fsck completes, no
inconsistencies. The files affected by data loss are part of a BitTorrent
network download and after a Torrent data consistency check, I confirm that
data has been lost. If I leave the Torrent active downloading for more than an
hour or so, the ext4 errors occur.

So, "me too" with linux-image-2.6.28-8-generic 2.6.28-8-generic #28-Ubuntu SMP
Thu Mar 5 21:49:36 UTC 2009 i686 GNU/Linux. I'm willing to provide debug info
to an interested developer, privately.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-13 15:04:52

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #15 from [email protected] 2009-03-13 08:03 -------
(In reply to comment #14)
> I'm seeing a similar error with heavy read/write I/O on a 1TB ext4 volume. It's
> not clear what behavior triggers the error for me. Occasionally I see the error
> in dmesg...
>
> [ 7829.004269] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
> in inode #2491097: unexpected eh_depth - magic f30a, entries 78, max 340(0),
> depth 1(2)
> [ 7829.012197] mpage_da_map_blocks block allocation failed for inode 2491097 at
> logical offset 2788227 with max blocks 7 with error -5
> [ 7829.012220] This should not happen.!! Data will be lost
>
> ...and sync does not complete. It's not clear from the discussion if e2image
> needs to be done before the error occurs on a clean mount, or afterward while
> the system is up.

If you like, it's probably sufficient to point debugfs at the system and do

debugfs> stat <2491097>

to give us an idea of the layout of that file.

> When this happens the reboot procedure does not complete, and I power cycle the
> machine.

the fs may have gone readonly after the error, but I'm not sure it should have
hung up the system.

> After a power cycle, the journal replays and fsck completes, no
> inconsistencies. The files affected by data loss are part of a BitTorrent
> network download and after a Torrent data consistency check, I confirm that
> data has been lost. If I leave the Torrent active downloading for more than an
> hour or so, the ext4 errors occur.

Great, if you can reproduce it w/ bittorrent, can you please try the patch in
the attachments?

Thanks,
-Eric


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-15 04:57:30

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #16 from [email protected] 2009-03-14 21:56 -------
Will try patch soon. I mistakenly thought the patched kernel was running today,
it was not, and so I have some more information from When Bad Things Happen.

Without patch applied:

# dmesg output:
[17415.140240] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0),
depth 1(2)
[17415.158912] mpage_da_map_blocks block allocation failed for inode 2491101 at
logical offset 1032931 with max blocks 5 with error -5
[17415.158923] This should not happen.!! Data will be lost
[17461.031925] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0),
depth 1(2)
[17461.034989] mpage_da_map_blocks block allocation failed for inode 2491101 at
logical offset 1032967 with max blocks 9 with error -5
[17461.034996] This should not happen.!! Data will be lost
[18307.004556] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0),
depth 1(2)
[18307.017829] mpage_da_map_blocks block allocation failed for inode 2491101 at
logical offset 1032935 with max blocks 9 with error -5
[18307.017839] This should not happen.!! Data will be lost
[18426.839773] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0),
depth 1(2)
[18426.854493] mpage_da_map_blocks block allocation failed for inode 2491101 at
logical offset 1032943 with max blocks 5 with error -5
[18426.854503] This should not happen.!! Data will be lost
[25859.886794] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2491101: unexpected eh_depth - magic f30a, entries 229, max 340(0),
depth 1(2)
[25859.908981] mpage_da_map_blocks block allocation failed for inode 2491101 at
logical offset 1032924 with max blocks 4 with error -5
[25859.908992] This should not happen.!! Data will be lost

# debugfs output:
debugfs 1.41.4 (27-Jan-2009)
debugfs: stat 2491101
2491101: File not found by ext2_lookup

debugfs: ncheck 2491101
Inode Pathname
2491101 /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd

debugfs: icheck /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd
icheck: Bad block number - /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd
icheck: Invalid block number: /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd

# filefrag output:
filefrag $MOUNTPATH/Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd
(stuck / does not complete)


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-15 05:24:17

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #17 from [email protected] 2009-03-14 22:23 -------
(In reply to comment #16)

> # debugfs output:
> debugfs 1.41.4 (27-Jan-2009)
> debugfs: stat 2491101
> 2491101: File not found by ext2_lookup

Just FWIW, what you want is:

debugfs: stat <2491101>

to get that inode.

Thanks,
-Eric


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-17 06:08:20

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #18 from [email protected] 2009-03-16 23:07 -------
Created an attachment (id=20561)
--> (http://bugzilla.kernel.org/attachment.cgi?id=20561&action=view)
output of debugfs 'stat <2491101>' (split 1 of 2)

Having first applied the suggested patch, is debugfs output taken after file
completed and verified intact from BitTorrent download. Split and then
compressed to comply with 1000k bugzilla constraint: 1 of 2.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-17 06:08:28

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #19 from [email protected] 2009-03-16 23:07 -------
Created an attachment (id=20562)
--> (http://bugzilla.kernel.org/attachment.cgi?id=20562&action=view)
output of debugfs 'stat <2491101>' (split 2 of 2)

Having first applied the suggested patch, is debugfs output taken after file
completed and verified intact from BitTorrent download. Split and then
compressed to comply with 1000k bugzilla constraint: 2 of 2.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-17 06:31:47

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821





------- Comment #20 from [email protected] 2009-03-16 23:30 -------
As noted in attachments, I have applied the patch and completed my BitTorrent
download (which was problematic without the patch).

linux-image-2.6.28-8-generic (Ubuntu): Download unable to complete due to EXT4
error and non-responsive sync.

linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), first try: Computer
is found unresponsive and screen dark, power and capslock blinking.

linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), second try: Computer
is normal, no errors in dmesg about EXT4, download is successful.

The failure on first try is a new behavior I have not seen before with this
machine. I therefore am reporting inconclusively that the patch works for me.


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

2009-03-17 15:29:42

by bugme-daemon

[permalink] [raw]
Subject: [Bug 12821] filesystem corrupts on heavy I/O

http://bugzilla.kernel.org/show_bug.cgi?id=12821


[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|[email protected]|[email protected]
Status|NEW |ASSIGNED




------- Comment #21 from [email protected] 2009-03-17 08:28 -------
(In reply to comment #20)
> As noted in attachments, I have applied the patch and completed my BitTorrent
> download (which was problematic without the patch).
>
> linux-image-2.6.28-8-generic (Ubuntu): Download unable to complete due to EXT4
> error and non-responsive sync.
>
> linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), first try: Computer
> is found unresponsive and screen dark, power and capslock blinking.

That indicates a panic, but we'd need a lot more info, which you probably can't
gather at this time. Getting panics visible when you're in X is tricky,
netconsole or serial console or similar may be needed. So you paniced, but we
have no idea why, unfortunately.

> linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), second try: Computer
> is normal, no errors in dmesg about EXT4, download is successful.

well, that sounds good ;)

> The failure on first try is a new behavior I have not seen before with this
> machine. I therefore am reporting inconclusively that the patch works for me.

Ok, thanks for the testing.. It's headed upstream in any case, I think, as it
seems pretty well correct from inspection and review.

Oh, re: the stat of that inode, for a 10G file that's a metric ton of
extents... you should encourage whichever torrent client this is to use the
preallocation routines now available (fallocate(2)) - a syscall now, exposed by
glibc soon.

the file has 600+ extent tree *nodes* each of which holds an extent... you have
nearly 105000 extents. That's ... impressive ... :)


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
You are the assignee for the bug, or are watching the assignee.