LinuxLists.cc - 2.6.18-mm2: ext3 BUG?

2006-10-05 21:34:20

Subject: 2.6.18-mm2: ext3 BUG?

Hello,

while yum update-ing, yum crashed and this appeared in log:
[ 2840.688718] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
cleared for block 747938
[ 2840.688732] Aborting journal on device hda2.
[ 2840.688858] ext3_abort called.
[ 2840.688863] EXT3-fs error (device hda2): ext3_journal_start_sb: Detected
aborted journal
[ 2840.688869] Remounting filesystem read-only
[ 2840.819816] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
cleared for block 747939
[ 2840.819839] EXT3-fs error (device hda2) in ext3_free_blocks_sb: Journal has
aborted
[ 2840.819846] EXT3-fs error (device hda2) in ext3_free_blocks_sb: Journal has
aborted
[ 2840.819852] EXT3-fs error (device hda2) in ext3_free_blocks_sb: Journal has
aborted
[ 2840.819859] EXT3-fs error (device hda2) in ext3_reserve_inode_write: Journal
has aborted
[ 2840.819865] EXT3-fs error (device hda2) in ext3_truncate: Journal has aborted
[ 2840.819872] EXT3-fs error (device hda2) in ext3_reserve_inode_write: Journal
has aborted
[ 2840.819879] EXT3-fs error (device hda2) in ext3_orphan_del: Journal has aborted
[ 2840.819886] EXT3-fs error (device hda2) in ext3_reserve_inode_write: Journal
has aborted
[ 2841.422114] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422125] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422134] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422150] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422165] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422172] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422209] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422264] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422292] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422348] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422420] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422453] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422597] __journal_remove_journal_head: freeing b_committed_data
[ 2841.422651] journal commit I/O error
[ 2842.098946] attempt to access beyond end of device
[ 2842.098954] hda2: rw=0, want=1622222496, limit=19551105
[ 2842.098964] attempt to access beyond end of device
[ 2842.098970] hda2: rw=0, want=2862991560, limit=19551105
[ 2842.098980] attempt to access beyond end of device
[ 2842.098985] hda2: rw=0, want=2442477896, limit=19551105
[ 2842.098993] attempt to access beyond end of device
[ 2842.098999] hda2: rw=0, want=745051184, limit=19551105
[ 2842.099006] attempt to access beyond end of device
[ 2842.099012] hda2: rw=0, want=2876066312, limit=19551105
[ 2842.099019] attempt to access beyond end of device
[ 2842.099026] hda2: rw=0, want=2996155160, limit=19551105
[ 2842.099033] attempt to access beyond end of device
[ 2842.099038] hda2: rw=0, want=530091296, limit=19551105
[ 2842.099045] attempt to access beyond end of device
[ 2842.099050] hda2: rw=0, want=163997728, limit=19551105
[ 2842.099058] attempt to access beyond end of device
[ 2842.099064] hda2: rw=0, want=1193970848, limit=19551105
[ 2842.099071] attempt to access beyond end of device
[ 2842.099077] hda2: rw=0, want=1781229856, limit=19551105
[ 2842.099085] attempt to access beyond end of device
[ 2842.099092] hda2: rw=0, want=2353439768, limit=19551105
[ 2842.099099] attempt to access beyond end of device
[ 2842.099105] hda2: rw=0, want=608441032, limit=19551105
[ 2842.099112] attempt to access beyond end of device
[ 2842.099118] hda2: rw=0, want=3536917176, limit=19551105
[ 2842.099125] attempt to access beyond end of device
[ 2842.099130] hda2: rw=0, want=1079252368, limit=19551105
[ 2842.099138] attempt to access beyond end of device
[ 2842.099142] hda2: rw=0, want=1532974336, limit=19551105
[ 2842.099149] attempt to access beyond end of device
[ 2842.099154] hda2: rw=0, want=1781229856, limit=19551105
[ 2842.099161] attempt to access beyond end of device
[ 2842.099166] hda2: rw=0, want=1781229856, limit=19551105
[ 2842.099172] Reducing readahead size to 256K

I don't know how to reproduce it and really have no idea what version of -mm
could introduce it (if any).

regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-10-05 21:51:11

by Andrew Morton

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Thu, 05 Oct 2006 23:34:13 +0159
Jiri Slaby <[email protected]> wrote:

> Hello,
>
> while yum update-ing, yum crashed and this appeared in log:
> [ 2840.688718] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
> cleared for block 747938
> [ 2840.688732] Aborting journal on device hda2.
> [ 2840.688858] ext3_abort called.
>
> ...
>
> I don't know how to reproduce it and really have no idea what version of -mm
> could introduce it (if any).

I don't necessarily see a bug in there. The filesystem got a bit noisy but
did appropriately detect and handle the metadata inconsistency.

The next step would be to fsck that filesystem, see waht it says.

2006-10-05 23:16:29

by Jiri Slaby

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

Andrew Morton wrote:
> On Thu, 05 Oct 2006 23:34:13 +0159
> Jiri Slaby <[email protected]> wrote:
>
>> Hello,
>>
>> while yum update-ing, yum crashed and this appeared in log:
>> [ 2840.688718] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
>> cleared for block 747938
>> [ 2840.688732] Aborting journal on device hda2.
>> [ 2840.688858] ext3_abort called.
>>
>> ...
>>
>> I don't know how to reproduce it and really have no idea what version of -mm
>> could introduce it (if any).
>
> I don't necessarily see a bug in there. The filesystem got a bit noisy but
> did appropriately detect and handle the metadata inconsistency.

Perhaps, but why did it occur? S.m.a.r.t. doesn't tell me anything suspicious.

> The next step would be to fsck that filesystem, see waht it says.

Yup. I fscked it after reboot and fixed them all...

[went to gather some info from e2fsprogs sources what kind of errors it was (I
didn't note it and can't remember)]

block differences, incorrect block counts, orphaned entries, some (gnome-vfs2
stuff which has been updated) went to lost+found.

I unfotunately can't post more accurate info, because I am a... chump? Bite me
and shame on me...

regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-10-06 00:19:54

by Andrew Morton

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Fri, 06 Oct 2006 01:16:21 +0159
Jiri Slaby <[email protected]> wrote:

> Andrew Morton wrote:
> > On Thu, 05 Oct 2006 23:34:13 +0159
> > Jiri Slaby <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> while yum update-ing, yum crashed and this appeared in log:
> >> [ 2840.688718] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
> >> cleared for block 747938
> >> [ 2840.688732] Aborting journal on device hda2.
> >> [ 2840.688858] ext3_abort called.
> >>
> >> ...
> >>
> >> I don't know how to reproduce it and really have no idea what version of -mm
> >> could introduce it (if any).
> >
> > I don't necessarily see a bug in there. The filesystem got a bit noisy but
> > did appropriately detect and handle the metadata inconsistency.
>
> Perhaps, but why did it occur? S.m.a.r.t. doesn't tell me anything suspicious.

Don't know. The usual diagnosis for this sort of thing is "your disk shat
itself". Could be a bad disk, bad power supply, bad memory, some piece of
kernel code went and trashed some memory, bug in the driver. It's a
mystery, sorry.

2006-10-06 09:34:03

by Jiri Kosina

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Thu, 5 Oct 2006, Andrew Morton wrote:

> Don't know. The usual diagnosis for this sort of thing is "your disk
> shat itself". Could be a bad disk, bad power supply, bad memory, some
> piece of kernel code went and trashed some memory, bug in the driver.
> It's a mystery, sorry.

Actually I have also experienced some ext3 corruption (*) on my virtual
machine under qemu emulation with 2.6.18-mm3, but I thought my previous
other "strange" activities were guilty. Now it seems that there could be
indeed something rotten in ext3 driver. Will try to reproduce, and if I am
able to do so, I will bisect.

(*) Similar symptoms to the original poster - just fsck finding a few
orphaned entries and lost blocks, etc.

--
Jiri Kosina

2006-10-08 06:33:33

by Jan-Benedict Glaw

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Thu, 2006-10-05 17:14:28 -0700, Andrew Morton <[email protected]> wrote:
> On Fri, 06 Oct 2006 01:16:21 +0159
> Jiri Slaby <[email protected]> wrote:
> > Andrew Morton wrote:
> > > On Thu, 05 Oct 2006 23:34:13 +0159
> > > Jiri Slaby <[email protected]> wrote:
> > > > while yum update-ing, yum crashed and this appeared in log:
> > > > [ 2840.688718] EXT3-fs error (device hda2): ext3_free_blocks_sb: bit already
> > > > cleared for block 747938
> > > > [ 2840.688732] Aborting journal on device hda2.
> > > > [ 2840.688858] ext3_abort called.
> > > >
> > > > ...
> > > >
> > > > I don't know how to reproduce it and really have no idea what version of -mm
> > > > could introduce it (if any).
> > >
> > > I don't necessarily see a bug in there. The filesystem got a bit noisy but
> > > did appropriately detect and handle the metadata inconsistency.
> >
> > Perhaps, but why did it occur? S.m.a.r.t. doesn't tell me anything suspicious.
>
> Don't know. The usual diagnosis for this sort of thing is "your disk shat
> itself". Could be a bad disk, bad power supply, bad memory, some piece of
> kernel code went and trashed some memory, bug in the driver. It's a
> mystery, sorry.

Just to add, I've seen right this, too, on Debian's 2.6.17-2-686, with
a 00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE
(rev 01) (8086:7111) PATA controller with a ST3300822A disk. That's
healthy from smartmontool's point of view. The machine has 192MB RAM, an
Intel P3 processor and is idle during daytime, busy with fetching
backups at night. I'm using this filesystem with faubackup, lots of
small files, lots of hard links and a number of large files. Some of
the posts below mention large files, too. My impression would be that
it happens when unlink()ing large files. Oh, and it's a LV, not a
direct partition.

memtest86 didn't reveal anything, too (ran it for nearly 24h.)

http://www.issociate.de/board/post/157775/Linux_2.6.10_/_RAID1_problem.html
January 5, 2005
2.6.10, UP, no PREEMPT, 2.4.25 was a lot more stable. Problem
disappeared by changing the mainboard.

And a test case:
~ dd if=/dev/zero of=test0 bs=1M count=300
~ while :; do cp test0 test1; cp test1 test2; cp test2 test0; od test0; done
Problem is RAID1 related and doesn't show up on bare disks.

http://www.unixshell.com/forum/archive/index.php/t-208.html
April 26, 2005
Author thinks this was 2.6.11.

http://www.liangfok.com/blog/archives/2005/07/file_system_bec.html
July 4, 2005
No specific information given.

http://comments.gmane.org/gmane.linux.kernel/353757
December 10, 2005
2.6.15-rc5, happens when exiting qemu.

http://www.nabble.com/-BUG-2.6.15-rc5--EXT3-fs-error-and-soft-lockup-detected-t723639.html
December 10, 2005
2.6.15-rc5

http://www.ubuntuforums.org/archive/index.php/t-106260.html
Dec 20, 2005, plus Dec 21, 2005 (another guy with the same problem)
Disk works in one machine (P2), but not in a different box
(P3). Breaks Ubuntu installation.

http://lkml.org/lkml/2006/4/3/212
April 3, 2006
No specific information given.

http://ebergen.net/wordpress/2006/04/
April 12, 2006
kubuntu user, drive is 4 years old and probably PATA.

http://myrddin.org/2006/02/
February 14, 2006
The author sees a problem like this when deleting large
(1GB..2GB) files. ": > $file" before removing them helps. This
results in an open(x, O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE)

http://forum.hardware.fr/hardwarefr/OSAlternatifs/Resolu-Debian-Ext3-FS-Error-pour-partition-sujet-58230-1.htm
June 19, 2006

In one case, there was a test case mentioned. I'll run that on my
affected box in a non-productive LV, like this:

dd bs=1M count=200 if=/dev/zero of=test0
while :; do
echo "cp 0-1"; cp test0 test1 || break
echo "cp 1-2"; cp test1 test2 || break
echo "cp 2-3"; cp test2 test3 || break
echo "cp 3-4"; cp test3 test4 || break
echo "od 0" ; od test0 || break
echo "rm 1"; rm test1 || break
echo "rm 2"; rm test2 || break
echo "rm 3"; rm test3 || break
echo "rm 4"; rm test4 || break
done

After about 30h runtime, I got:

EXT3-fs error (device dm-5): ext3_free_blocks_sb: bit already cleared for block 194810
Aborting journal on device dm-5.
ext3_abort called.
EXT3-fs error (device dm-5): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-5) in ext3_truncate: Journal has aborted
EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-5) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data

Last echoes from the testcase above:

rm 1
rm 2
rm: cannot remove `test2': Read-only file system

kolbe34-backup:/mnt# dumpe2fs /dev/kolbe34_backup/ext3crash 2>/dev/null | grep features
Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file

kolbe34-backup:/mnt# e2fsck -fy /dev/kolbe34_backup/ext3crash
e2fsck 1.39 (29-May-2006)
/dev/kolbe34_backup/ext3crash: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 49154 has zero dtime. Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(100884--124927) -(178242--194174)
Fix? yes

Free blocks count wrong for group #3 (1037, counted=25081).
Fix? yes

Free blocks count wrong for group #5 (2432, counted=18366).
Fix? yes

Free blocks count wrong (15190753, counted=15230731).
Fix? yes

Inode bitmap differences: -49154
Fix? yes

Free inodes count wrong for group #3 (16382, counted=16383).
Fix? yes

Free inodes count wrong (7864304, counted=7864305).
Fix? yes

/dev/kolbe34_backup/ext3crash: ***** FILE SYSTEM WAS MODIFIED *****
/dev/kolbe34_backup/ext3crash: 15/7864320 files (6.7% non-contiguous), 497909/15728640 blocks
kolbe34-backup:/mnt# mount /dev/kolbe34_backup/ext3crash test/
kolbe34-backup:/mnt# ls -li test/
total 820032
11 drwx------ 2 root root 16384 2006-10-06 10:42 lost+found
49153 -rw-r--r-- 1 root root 209715200 2006-10-07 16:42 test0
147457 -rw-r--r-- 1 root root 209715200 2006-10-08 02:21 test2
147458 -rw-r--r-- 1 root root 209715200 2006-10-08 02:21 test3
147459 -rw-r--r-- 1 root root 209715200 2006-10-08 02:22 test4

MfG, JBG

--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: Eine Freie Meinung in einem Freien Kopf
the second : für einen Freien Staat voll Freier Bürger.

Attachments:

(No filename) (6.65 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2006-10-08 07:14:36

by Jan-Benedict Glaw

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> Just to add, I've seen right this, too, on Debian's 2.6.17-2-686, with
> a 00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE
> (rev 01) (8086:7111) PATA controller with a ST3300822A disk. That's
> healthy from smartmontool's point of view. The machine has 192MB RAM, an
> Intel P3 processor and is idle during daytime, busy with fetching
> backups at night. I'm using this filesystem with faubackup, lots of
> small files, lots of hard links and a number of large files. Some of
> the posts below mention large files, too. My impression would be that
> it happens when unlink()ing large files. Oh, and it's a LV, not a
> direct partition.

Another thing to add: I don't think this corruption is related to the
PIIX4 controller. For some days (when we put the machines that were
backed-up into production), we tried to work with an external USB HDD.
(The backup box is off-site and only has limited bandwidth, so the
idea was to move the USB HDD to the main site if we were on fire
there. We gave up this idea due to too small USB performance.)

However, I've seen this problem twice with the USB-attached disk, too.
It's the exact same disk, we just threw away the case.

MfG, JBG

--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: http://catb.org/~esr/faqs/smart-questions.html
the second :

Attachments:

(No filename) (1.42 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2006-10-08 08:48:20

by Jan-Benedict Glaw

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> dd bs=1M count=200 if=/dev/zero of=test0
> while :; do
> echo "cp 0-1"; cp test0 test1 || break
> echo "cp 1-2"; cp test1 test2 || break
> echo "cp 2-3"; cp test2 test3 || break
> echo "cp 3-4"; cp test3 test4 || break
> echo "od 0" ; od test0 || break
> echo "rm 1"; rm test1 || break
> echo "rm 2"; rm test2 || break
> echo "rm 3"; rm test3 || break
> echo "rm 4"; rm test4 || break
> done

Just tested again and got *exactly* the same error message, bit
already cleared for block 194810, but this time after only 20min:

> EXT3-fs error (device dm-5): ext3_free_blocks_sb: bit already cleared for block 194810
> Aborting journal on device dm-5.
> ext3_abort called.
> EXT3-fs error (device dm-5): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device dm-5) in ext3_truncate: Journal has aborted
> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device dm-5) in ext3_orphan_del: Journal has aborted
> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
> __journal_remove_journal_head: freeing b_committed_data
> __journal_remove_journal_head: freeing b_committed_data
> __journal_remove_journal_head: freeing b_committed_data
>
>
> Last echoes from the testcase above:
>
> rm 1
> rm 2
> rm: cannot remove `test2': Read-only file system

...and with exactly the same position it broke again.

> kolbe34-backup:/mnt# dumpe2fs /dev/kolbe34_backup/ext3crash 2>/dev/null | grep features
> Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file

However, fsck looks a bit different this time:

kolbe34-backup:/mnt# e2fsck -jy /dev/kolbe34_backup/ext3crash
e2fsck 1.39 (29-May-2006)
/dev/kolbe34_backup/ext3crash: recovering journal
/dev/kolbe34_backup/ext3crash contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 49154 has zero dtime. Fix<y>?

/dev/kolbe34_backup/ext3crash: e2fsck canceled.

/dev/kolbe34_backup/ext3crash: ***** FILE SYSTEM WAS MODIFIED *****

/dev/kolbe34_backup/ext3crash: ********** WARNING: Filesystem still has errors **********

kolbe34-backup:/mnt# e2fsck -fy /dev/kolbe34_backup/ext3crash
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 49154 has zero dtime. Fix? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(107533--124927) -(178242--194673)
Fix? yes

Free blocks count wrong for group #3 (7686, counted=25081).
Fix? yes

Free blocks count wrong for group #5 (1933, counted=18366).
Fix? yes

Free blocks count wrong (15196903, counted=15230731).
Fix? yes

Inode bitmap differences: -49154
Fix? yes

Free inodes count wrong for group #3 (16379, counted=16380).
Fix? yes

Free inodes count wrong (7864304, counted=7864305).
Fix? yes

/dev/kolbe34_backup/ext3crash: ***** FILE SYSTEM WAS MODIFIED *****
/dev/kolbe34_backup/ext3crash: 15/7864320 files (6.7% non-contiguous), 497909/15728640 blocks

MfG, JBG

--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
the second :

Attachments:

(No filename) (3.41 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2006-10-08 09:53:34

by Jiri Slaby

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

Jan-Benedict Glaw wrote:
> On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
>> dd bs=1M count=200 if=/dev/zero of=test0
>> while :; do
>> echo "cp 0-1"; cp test0 test1 || break
>> echo "cp 1-2"; cp test1 test2 || break
>> echo "cp 2-3"; cp test2 test3 || break
>> echo "cp 3-4"; cp test3 test4 || break
>> echo "od 0" ; od test0 || break
>> echo "rm 1"; rm test1 || break
>> echo "rm 2"; rm test2 || break
>> echo "rm 3"; rm test3 || break
>> echo "rm 4"; rm test4 || break
>> done
>
> Just tested again and got *exactly* the same error message, bit
> already cleared for block 194810, but this time after only 20min:
>
>> EXT3-fs error (device dm-5): ext3_free_blocks_sb: bit already cleared for block 194810
>> Aborting journal on device dm-5.
>> ext3_abort called.
>> EXT3-fs error (device dm-5): ext3_journal_start_sb: Detected aborted journal
>> Remounting filesystem read-only
>> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
>> EXT3-fs error (device dm-5) in ext3_truncate: Journal has aborted
>> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
>> EXT3-fs error (device dm-5) in ext3_orphan_del: Journal has aborted
>> EXT3-fs error (device dm-5) in ext3_reserve_inode_write: Journal has aborted
>> __journal_remove_journal_head: freeing b_committed_data
>> __journal_remove_journal_head: freeing b_committed_data
>> __journal_remove_journal_head: freeing b_committed_data
>>
>>
>> Last echoes from the testcase above:
>>
>> rm 1
>> rm 2
>> rm: cannot remove `test2': Read-only file system
>
> ...and with exactly the same position it broke again.
>
>> kolbe34-backup:/mnt# dumpe2fs /dev/kolbe34_backup/ext3crash 2>/dev/null | grep features
>> Filesystem features: has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
>
> However, fsck looks a bit different this time:
>
> kolbe34-backup:/mnt# e2fsck -jy /dev/kolbe34_backup/ext3crash
> e2fsck 1.39 (29-May-2006)
> /dev/kolbe34_backup/ext3crash: recovering journal
> /dev/kolbe34_backup/ext3crash contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 49154 has zero dtime. Fix<y>?
>
> /dev/kolbe34_backup/ext3crash: e2fsck canceled.
>
> /dev/kolbe34_backup/ext3crash: ***** FILE SYSTEM WAS MODIFIED *****
>
> /dev/kolbe34_backup/ext3crash: ********** WARNING: Filesystem still has errors **********
>
> kolbe34-backup:/mnt# e2fsck -fy /dev/kolbe34_backup/ext3crash
> e2fsck 1.39 (29-May-2006)
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 49154 has zero dtime. Fix? yes
>
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -(107533--124927) -(178242--194673)
> Fix? yes
>
> Free blocks count wrong for group #3 (7686, counted=25081).
> Fix? yes
>
> Free blocks count wrong for group #5 (1933, counted=18366).
> Fix? yes
>
> Free blocks count wrong (15196903, counted=15230731).
> Fix? yes
>
> Inode bitmap differences: -49154
> Fix? yes
>
> Free inodes count wrong for group #3 (16379, counted=16380).
> Fix? yes
>
> Free inodes count wrong (7864304, counted=7864305).
> Fix? yes
>
>
> /dev/kolbe34_backup/ext3crash: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/kolbe34_backup/ext3crash: 15/7864320 files (6.7% non-contiguous), 497909/15728640 blocks

Just to confirm: these are errors I got.

regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-10-10 07:09:36

by Jan-Benedict Glaw

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> On Thu, 2006-10-05 17:14:28 -0700, Andrew Morton <[email protected]> wrote:

> In one case, there was a test case mentioned. I'll run that on my
> affected box in a non-productive LV, like this:
>
> dd bs=1M count=200 if=/dev/zero of=test0
> while :; do
> echo "cp 0-1"; cp test0 test1 || break
> echo "cp 1-2"; cp test1 test2 || break
> echo "cp 2-3"; cp test2 test3 || break
> echo "cp 3-4"; cp test3 test4 || break
> echo "od 0" ; od test0 || break
> echo "rm 1"; rm test1 || break
> echo "rm 2"; rm test2 || break
> echo "rm 3"; rm test3 || break
> echo "rm 4"; rm test4 || break
> done

While I could reproduce it with a 200MB file, it seems I can't break
it with a 10MB file.

MfG, JBG

--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: Lauf nicht vor Deinem Glück davon:
the second : Es könnte hinter Dir stehen!

Attachments:

(No filename) (0.99 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2006-10-11 10:42:19

by Jan Kara

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

> On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> > On Thu, 2006-10-05 17:14:28 -0700, Andrew Morton <[email protected]> wrote:
>
> > In one case, there was a test case mentioned. I'll run that on my
> > affected box in a non-productive LV, like this:
> >
> > dd bs=1M count=200 if=/dev/zero of=test0
> > while :; do
> > echo "cp 0-1"; cp test0 test1 || break
> > echo "cp 1-2"; cp test1 test2 || break
> > echo "cp 2-3"; cp test2 test3 || break
> > echo "cp 3-4"; cp test3 test4 || break
> > echo "od 0" ; od test0 || break
> > echo "rm 1"; rm test1 || break
> > echo "rm 2"; rm test2 || break
> > echo "rm 3"; rm test3 || break
> > echo "rm 4"; rm test4 || break
> > done
>
> While I could reproduce it with a 200MB file, it seems I can't break
> it with a 10MB file.
Hmm, I was running the test for several ours without any problem...
The kernel is 2.6.17.6, ext3 in ordered data mode, standard SATA disk. I'm
now running it again and trying my luck ;). What is your testing environment?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-10-23 08:13:58

by Jan-Benedict Glaw

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

On Wed, 2006-10-11 12:42:02 +0200, Jan Kara <[email protected]> wrote:
> > On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> > While I could reproduce it with a 200MB file, it seems I can't break
> > it with a 10MB file.
> Hmm, I was running the test for several ours without any problem...
> The kernel is 2.6.17.6, ext3 in ordered data mode, standard SATA disk. I'm
> now running it again and trying my luck ;). What is your testing environment?

kolbe34-backup:/mnt# uname -a
Linux kolbe34-backup 2.6.17-2-686 #1 SMP Wed Sep 13 16:34:10 UTC 2006 i686 GNU/Linux
kolbe34-backup:/mnt# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Pentium III (Katmai)
stepping : 3
cpu MHz : 448.674
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse up
bogomips : 898.38
kolbe34-backup:/mnt# grep -i preem /boot/config-2.6.17-2-686
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
kolbe34-backup:/mnt# lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:0e.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 16)
01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage Pro AGP 1X/2X (rev 5c)
kolbe34-backup:/mnt# lspci -n
00:00.0 0600: 8086:7190 (rev 03)
00:01.0 0604: 8086:7191 (rev 03)
00:07.0 0601: 8086:7110 (rev 02)
00:07.1 0101: 8086:7111 (rev 01)
00:07.2 0c03: 8086:7112 (rev 01)
00:07.3 0680: 8086:7113 (rev 02)
00:0e.0 0200: 1022:2000 (rev 16)
01:00.0 0300: 1002:4742 (rev 5c)
kolbe34-backup:~# hdparm -i /dev/hdb

/dev/hdb:

Model=ST3300822A, FwRev=3.AAE, SerialNo=5NF24YCN
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: Unspecified: ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7

* signifies the current active mode

Still running Debian's 2.6.17-2-686, I'm now tracking down the file
size when I start to see this type of corruption. Right now, it seems
I never get it with a 16384 KB (16 MB) large file, but I get it with a
21504 KB (21 MB) file.

Is there something important that changes handling of file contents in
the 16..21 MB range?

dumpe2fs output at http://lug-owl.de/~jbglaw/ext3-dumpe2fs.txt for
that filesystem. I'll now run with a 18.5 MB file...

MfG, JBG

--
Jan-Benedict Glaw [email protected] +49-172-7608481
Signature of: Wenn ich wach bin, träume ich.
the second :

Attachments:

(No filename) (3.55 kB)
signature.asc (189.00 B)
Digital signature Download all attachments

2006-10-23 10:40:55

by Jan Kara

[permalink] [raw]

Subject: Re: 2.6.18-mm2: ext3 BUG?

Hello,

> On Wed, 2006-10-11 12:42:02 +0200, Jan Kara <[email protected]> wrote:
> > > On Sun, 2006-10-08 08:33:30 +0200, Jan-Benedict Glaw <[email protected]> wrote:
> > > While I could reproduce it with a 200MB file, it seems I can't break
> > > it with a 10MB file.
> > Hmm, I was running the test for several ours without any problem...
> > The kernel is 2.6.17.6, ext3 in ordered data mode, standard SATA disk. I'm
> > now running it again and trying my luck ;). What is your testing environment?
>
> kolbe34-backup:/mnt# uname -a
Thanks for info. This looks pretty similar to what I have (only that
I have Athlon).

> Still running Debian's 2.6.17-2-686, I'm now tracking down the file
> size when I start to see this type of corruption. Right now, it seems
> I never get it with a 16384 KB (16 MB) large file, but I get it with a
> 21504 KB (21 MB) file.
>
> Is there something important that changes handling of file contents in
> the 16..21 MB range?
Umm, I've checked and found nothing obvious. We already have to use
double-indirect block at 16MB, maybe reservation code does some
distiction. Could you mount the filesystem with -o 'noreservation' and
see whether you can still reproduce the problem? Also it may be useful
to find out, whether you see the failure also with some older kernels...

Honza
>
> dumpe2fs output at http://lug-owl.de/~jbglaw/ext3-dumpe2fs.txt for
> that filesystem. I'll now run with a 18.5 MB file...
>
> MfG, JBG
>
> --
> Jan-Benedict Glaw [email protected] +49-172-7608481
> Signature of: Wenn ich wach bin, tr?ume ich.
> the second :

--
Jan Kara <[email protected]>
SuSE CR Labs