2008-12-03 21:38:54

by bugme-daemon

Subject: [Bug 12151] New: Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151

Summary: Unexplained fsck errors on a ext4 filesystem
Product: File System
Version: 2.5
KernelVersion: kernel-2.6.27.5-117.fc10.x86_64
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: ext4
AssignedTo: [email protected]
ReportedBy: [email protected]


Distribution: Fedora 10
Hardware Environment:

Processor:
Q6600 2.4 GHz

Memory:
4 GB

dmidecode:
Manufacturer: ASUSTeK Computer INC.
Product Name: P5B-Deluxe

lspci:
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
03:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
06:01.0 Multimedia audio controller: Creative Labs SB Audigy (rev 04)
06:01.1 Input device controller: Creative Labs SB Audigy Game Port (rev 04)
06:01.2 FireWire (IEEE 1394): Creative Labs SB Audigy FireWire Port (rev 04)
06:02.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 20)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
06:04.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 14)


Software Environment:
e2fsprogs-1.41.3-2.fc10.x86_64
rsync-3.0.4-0.fc10.x86_64

Problem Description:
I rebooted and received the errors below from fsck. The nature of the errors
suggests to me a race condition or an off-by-one bug. In all but one case the
count is off by one; in the one exception it is off by two.

I didn't receive complaints from fsck on previous reboots after the creation of
the filesystem. The shutdown before the startup that resulted in these errors
seemed to have gone normally.

I do remember rsync complaining about at least mpc1211 during the rsync that
copied the data across the network from another system, but I don't remember
what the complaint was.

Most of the files on the filesystem are video files in the 100 MB+ range. There
are also other large files like ISOs, virtualization images, etc. All the files
complained about are very small files.

The underlying layer is Linux software RAID5 running on six 1 TB hard drives.
Other arrays using the same drives are RAID1 and RAID10.

mkfs command used to make the filesystem:
mkfs.ext4 -j -b 4096 -i 524288 -m 0 -E stride=256 -O extents /dev/md3
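The mkfs options above fix most of the on-disk geometry, which helps when reading the fsck output below. As a minimal sketch, assuming the mke2fs default of 8 * block-size blocks per group and one drive's worth of parity for RAID5 (neither is stated in the report), the numbers work out as follows. The function name is hypothetical; `stripe_width` is the mke2fs `-E` option that would normally accompany `stride`, and the report's command did not set it.

```python
def ext4_geometry(block_size, bytes_per_inode, stride, n_drives, parity_drives=1):
    """Derive group and RAID geometry from mkfs.ext4 -b/-i/-E stride values.

    Assumptions: default blocks-per-group (one bitmap block = 8 bits per byte),
    and RAID5 parity consuming `parity_drives` drives' worth of capacity.
    """
    blocks_per_group = 8 * block_size              # 32768 for 4 KiB blocks
    group_bytes = blocks_per_group * block_size    # bytes covered by one group
    inodes_per_group = group_bytes // bytes_per_inode
    chunk_bytes = stride * block_size              # stride = RAID chunk in fs blocks
    data_disks = n_drives - parity_drives
    stripe_width = stride * data_disks             # candidate -E stripe_width= value
    return {
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "chunk_bytes": chunk_bytes,
        "stripe_width": stripe_width,
    }

geo = ext4_geometry(block_size=4096, bytes_per_inode=524288, stride=256, n_drives=6)
print(geo)
# {'blocks_per_group': 32768, 'inodes_per_group': 256, 'chunk_bytes': 1048576, 'stripe_width': 1280}
```

So `stride=256` implies a 1 MiB RAID chunk, and `-i 524288` leaves only 256 inodes in each 128 MiB block group.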

Other messages that may relate:

EXT4-fs: barriers enabled
EXT4-fs: barriers enabled
EXT4-fs: barriers enabled
JBD: barrier-based sync failed on md1:8 - disabling barriers
JBD: barrier-based sync failed on md2:8 - disabling barriers
JBD: barrier-based sync failed on md3:8 - disabling barriers

df -h output:
/dev/md1 32G 5.3G 25G 18% /
/dev/md0 198M 14M 174M 8% /boot
/dev/md2 288G 60G 229G 21% /home
/dev/md3 4.1T 2.3T 1.8T 56% /home/data



An automatic fsck ran at boot and reported errors.

Group descriptor 374 has invalid unused inodes count 1
Group descriptor 375 has invalid unused inodes count 1
Group descriptor 588 has invalid unused inodes count 1
Group descriptor 940 has invalid unused inodes count 1
Group descriptor 1230 has invalid unused inodes count 1
Group descriptor 1486 has invalid unused inodes count 1
Group descriptor 1834 has invalid unused inodes count 1
Group descriptor 2444 has invalid unused inodes count 1
Group descriptor 2854 has invalid unused inodes count 1
Group descriptor 3066 has invalid unused inodes count 1
Group descriptor 3210 has invalid unused inodes count 1
Group descriptor 3933 has invalid unused inodes count 1
Group descriptor 4656 has invalid unused inodes count 1
Extended attribute block 12255232 has reference count 3 should be 1
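These first messages concern the per-group count of never-used inode-table entries. As we understand e2fsck's check (this is a sketch, not its actual code), that count is invalid when it exceeds either the group's free-inode count or its inodes per group. Group 374's descriptor recorded 0 free inodes (see Pass 5) alongside an unused count of 1, which fails the first condition. The value of 256 inodes per group is derived from the mkfs options under the default of 32768 blocks per group and is an assumption; the function name is hypothetical.

```python
def itable_unused_is_valid(itable_unused, free_inodes, inodes_per_group):
    # Never-used inode slots are a subset of free inodes, and no group can
    # have more unused slots than it has inodes in total.
    return itable_unused <= free_inodes and itable_unused <= inodes_per_group

# Group 374 as reported: unused count 1, descriptor free-inode count 0.
print(itable_unused_is_valid(itable_unused=1, free_inodes=0, inodes_per_group=256))  # False
```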

Pass 1
Extended attribute block 12255232 has reference count 3 should be 1

Pass 2
Entry '..' in ??? (314882) has incorrect filetype (was 2, should be 1).
Entry 'txdps.tex' in
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw
(453383) has incorrect filetype (was 1, should be 2).
Entry 'mpc1211' in
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/src/kernels/2.6.14-1.1637_FC4-x86_64/arch/sh/boards
(469489) is a link to directory
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw/txdps.tex (469505).
Entry 'gencfg.c' in /backup/home/builder/mozilla/nsprpub/pr/include (995609)
has an incorrect filetype (was 1, should be 2).
Entry 'CVS' in
/backup/home/builder/mozilla/toolkit/themes/pinstripe/mozapps/extensions
(1006847) is a link to directory
/backup/home/builder/mozilla/nsprpub/pr/include/gencfg.c (1006849).
Entry 'lost+found' in /video/movies (181) has incorrect filetype (was 2, should be 1).
Entry 'text_italic.png' in
/backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions
(613940) has incorrect filetype (was 1, should be 2).
Entry 'ko' in /backup/home/backup/11-26-2006/usr/share/locale (621673) is a link
to directory
/backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions/text_italic.png
(625665).
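In Pass 2, "was 1, should be 2" compares the file-type byte stored in a directory entry with the type of the inode it points to. The table below uses the standard ext2/3/4 `EXT2_FT_*` directory-entry codes; the helper function is purely illustrative. Read this way, 'txdps.tex' is an entry that claims to be a regular file while its inode (469505) is actually a directory, which is consistent with the Pass 3 lines about that inode's '..' entry.

```python
# Directory-entry file_type codes used by ext2/3/4 (EXT2_FT_* values).
EXT2_FT = {
    0: "unknown",
    1: "regular file",
    2: "directory",
    3: "character device",
    4: "block device",
    5: "FIFO",
    6: "socket",
    7: "symbolic link",
}

def describe_filetype_fix(was, should_be):
    # 'was' is what the directory entry records; 'should_be' comes from the inode.
    return f"entry says {EXT2_FT[was]}, inode is a {EXT2_FT[should_be]}"

print(describe_filetype_fix(1, 2))  # entry says regular file, inode is a directory
```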

Pass 3
Unconnected directory inode 314882 (???)
'..' in
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw/txdps.tex
(469505) is
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/src/kernels/2.6.14-1.1637_FC4-x86_64/arch/sh/boards
(469489), should be
/backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw
(453383).
'..' in
/backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions/text_italic.png
(625665) is /backup/home/backup/11-26-2006/usr/share/locale (621673), should be
/backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions (613940).
'..' in /backup/home/builder/mozilla/nsprpub/pr/include/gencfg.c (1006849) is
/backup/home/builder/mozilla/toolkit/themes/pinstripe/mozapps/extensions
(1006847), should be /backup/home/builder/mozilla/nsprpub/pr/include (995609).
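Pass 3 identifies directories by inode number. With the geometry implied by the mkfs options (4 KiB blocks, `-i 524288`, and the default 32768 blocks per group, giving 256 inodes per group; the per-group figure is a derivation, not something the report states), an inode's group is `(ino - 1) // 256`. Under that assumption the directories involved land in groups 1230, 1834, 2444 and 3933, all of which appear in the "invalid unused inodes count" list at the top of the fsck output. The helper name is hypothetical.

```python
INODES_PER_GROUP = 256  # assumption: derived from -b 4096 -i 524288, 32768 blocks/group

def inode_group(ino, inodes_per_group=INODES_PER_GROUP):
    # ext inode numbers start at 1, so inode 1 is slot 0 of group 0.
    return (ino - 1) // inodes_per_group

# Directory inodes named in Pass 3 above.
for ino in (314882, 469505, 625665, 1006849):
    print(ino, inode_group(ino))
```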

Pass 4
Inode 181 ref count is 12, should be 11.
Inode 82141 ref count is 1, should be 2.
Inode 95745 ref count is 1, should be 2.
Inode 96001 ref count is 1, should be 2.
Inode 150529 ref count is 1, should be 2.
Inode 240641 ref count is 1, should be 2.
Inode 314016 ref count is 347, should be 346.
Inode 314881 ref count is 0, should be 2.
Inode 314882 ref count is 3, should be 2.
Inode 380416 ref count is 4, should be 3.
Inode 380417 ref count is 1, should be 2.
Inode 730596 ref count is 14, should be 13.
Inode 730625 ref count is 1, should be 2.
Inode 730723 ref count is 284, should be 283.
Inode 784897 ref count is 1, should be 2.
Inode 821761 ref count is 1, should be 2.
Inode 1191927 ref count is 7, should be 6.
Inode 1191937 ref count is 1, should be 2.
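Pass 4 compares each inode's stored link count with the number of directory entries that actually reference it. For a directory, that is its name in the parent, its own '.', and one '..' from each subdirectory, so a single misdirected '..' entry, like those repaired in Pass 3, shifts the expected counts of two directories by one each. The toy model below uses hypothetical inode numbers and helper names (it is not e2fsck's code) and produces a message in the same format.

```python
from collections import Counter

def count_links(dir_entries):
    """dir_entries: (parent_ino, name, child_ino) tuples, including '.' and '..',
    as a toy stand-in for the directory walk that precedes Pass 4."""
    return Counter(child for _parent, _name, child in dir_entries)

def pass4_report(stored_links, dir_entries):
    counted = count_links(dir_entries)
    return [
        f"Inode {ino} ref count is {stored}, should be {counted[ino]}."
        for ino, stored in sorted(stored_links.items())
        if stored != counted[ino]
    ]

# Hypothetical tree: root 2 -> dir 12 -> {dir 13, file 14}.
entries = [
    (2, ".", 2), (2, "..", 2), (2, "a", 12),
    (12, ".", 12), (12, "..", 2), (12, "b", 13), (12, "f", 14),
    (13, ".", 13), (13, "..", 12),
]
stored = {2: 3, 12: 4, 13: 2, 14: 1}  # inode 12 carries one stale extra link
print(pass4_report(stored, entries))
```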

Pass 5
Block bitmap differences: -40304640 -48693248 -59540393 -79796520 -93519872
-105185280 -(128714059--128714061) -152568096
Free blocks count wrong for group #1230 (1845, counted=1846).
Free blocks count wrong for group #1486 (1851, counted=1852).
Free blocks count wrong for group #1817 (910, counted=911).
Free blocks count wrong for group #2454 (1149, counted=1150).
Free blocks count wrong for group #3210 (1845, counted=1846).
Free blocks count wrong for group #3928 (31706, counted=31709).
Free blocks count wrong for group #4656 (1540, counted=1541).
Free blocks count wrong (494216298, counted=494216308).
Free inodes count wrong for group #374 (0, counted=1).
Free inodes count wrong for group #375 (0, counted=1).
Free inodes count wrong for group #588 (0, counted=1).
Free inodes count wrong for group #940 (0, counted=1).
Free inodes count wrong for group #1230 (0, counted=1).
Directories count wrong for group #1230 (194, counted=193).
Free inodes count wrong for group #1486 (0, counted=1).
Directories count wrong for group #1486 (197, counted=196).
Free inodes count wrong for group #1834 (0, counted=1).
Free inodes count wrong for group #2444 (0, counted=1).
Free inodes count wrong for group #2854 (0, counted=1).
Directories count wrong for group #2854 (67, counted=66).
Free inodes count wrong for group #3066 (0, counted=1).
Free inodes count wrong for group #3210 (0, counted=1).
Directories count wrong for group #3210 (201, counted=200).
Free inodes count wrong for group #3933 (0, counted=1).
Free inodes count wrong for group #4656 (0, counted=1).
Directories count wrong for group #4656 (104, counted=103).
Free inodes count wrong (6899829, counted=6899842).
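The Pass 5 numbers can be cross-checked by mapping each block listed under "Block bitmap differences" (blocks the on-disk bitmap marked in use but fsck found unused) to its group, assuming the default of 32768 blocks per group for 4 KiB blocks. The ten blocks match the overall free-blocks correction (494216308 - 494216298 = 10). By this arithmetic, block 79796520 falls in group 2435 and block 93519872 in group 2854, while the per-group lines as transcribed show #2454 and have no line for #2854, possibly a copying slip. The helper name is hypothetical.

```python
from collections import Counter

BLOCKS_PER_GROUP = 32768  # assumption: default for 4 KiB blocks, first data block 0

def block_group(blk, blocks_per_group=BLOCKS_PER_GROUP):
    return blk // blocks_per_group

# Blocks from the "Block bitmap differences" line; 128714059-128714061 is 3 blocks.
freed = [40304640, 48693248, 59540393, 79796520, 93519872, 105185280,
         128714059, 128714060, 128714061, 152568096]

per_group = Counter(block_group(b) for b in freed)
print(sorted(per_group.items()))
print(len(freed))  # should equal the total free-blocks correction
```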


--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


2008-12-11 02:33:12

by bugme-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151


[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




------- Comment #1 from [email protected] 2008-12-10 18:33 -------
Was this a one-time occurrence? Have you encountered this again?



2009-01-18 02:24:32

by bugme-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





------- Comment #2 from [email protected] 2009-01-17 18:24 -------
Any updates on this bug report? Has this happened again for you?



2009-01-18 21:31:02

by bugme-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





------- Comment #3 from [email protected] 2009-01-18 13:30 -------
At least a related problem seems to have happened on another reboot. This
reboot was a hard reset, because the system had gone into a mostly hung
state. It still responded somewhat a few times: ping still worked, ssh would
return the comment string, and I even managed to log in once, but it hung
before the shell appeared.

On the next boot I found that I seem to have a corrupt superblock. It sounds a
lot like the problem described here:

http://kerneltrap.org/mailarchive/linux-ext4/2009/1/5/4598534



2009-03-04 05:41:38

by bugme-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





------- Comment #4 from [email protected] 2009-03-03 21:41 -------
If you can try the 2.6.29 kernels out of koji and see if you still hit this,
that would be great. As I mentioned on IRC, I have found another race with
inode alloc/free, but it's not likely to lead to as much damage as your fsck
above found.

Have you been hitting this reliably?



2009-03-04 05:43:29

by bugme-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





------- Comment #5 from [email protected] 2009-03-03 21:43 -------
Oh, and a boot-time fsck on Fedora root filesystems probably means the fs was
marked with errors prior to the shutdown. You might look in your logs to see if
there's anything there.
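The "marked with errors" state is recorded in the superblock: the kernel sets an error flag in `s_state` when it hits a filesystem error, and `s_errors` holds the on-error behaviour (continue, remount read-only, or panic), which `dumpe2fs -h` reports as "Filesystem state" and "Errors behavior". The sketch below reads the same fields directly, assuming the standard ext2/3/4 layout (primary superblock at byte 1024; `s_magic`, `s_state` and `s_errors` as little-endian 16-bit fields at offsets 56, 58 and 60). The function names are hypothetical, and reading a raw device normally requires root.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1 KiB into the device
EXT2_SUPER_MAGIC = 0xEF53

STATE_FLAGS = {0x0001: "clean (valid)", 0x0002: "errors detected",
               0x0004: "orphans being recovered"}
ERRORS_BEHAVIOR = {1: "continue", 2: "remount read-only", 3: "panic"}

def parse_state(sb: bytes):
    """Return (list of state flags, errors behaviour) from a raw superblock."""
    magic, state, errors = struct.unpack_from("<HHH", sb, 56)
    if magic != EXT2_SUPER_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04x}")
    flags = [name for bit, name in STATE_FLAGS.items() if state & bit]
    return flags, ERRORS_BEHAVIOR.get(errors, f"unknown ({errors})")

def read_superblock(path):
    # e.g. read_superblock("/dev/md1") as root; any image file works too.
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return f.read(1024)
```

A filesystem that hit an error while mounted with the "continue" behaviour would show up here as `(["errors detected"], "continue")`.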



2009-04-27 13:40:56

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151


Alan <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
Kernel Version|kernel-2.6.27.5-117.fc10.x8 |2.6.27.5
|6_64 |
Regression|--- |No





2009-04-30 09:10:43

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151


Florian Engelhardt <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #6 from Florian Engelhardt <[email protected]> 2009-04-30 09:10:41 ---
Same problem here with 2.6.29.2.
I have tried ext4 on a Linux software RAID 0; it worked flawlessly for about a
week. I booted the computer yesterday and had exactly the same errors as
described above. Running e2fsck from an Arch Linux boot stick took about one
hour and left me with a mountable ext4 filesystem, but one stripped of every
kind of directory structure.
I am running that RAID 0 on two 500 GB SATA disks. I can provide you with more
information this evening.


2009-04-30 20:56:58

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





--- Comment #7 from Nathan Grennan <[email protected]> 2009-04-30 20:56:56 ---
I have been running Fedora's kernels the whole time, and I haven't seen this
issue with 2.6.29 kernels. I don't think I saw it with 2.6.28 kernels either. I
wonder if Fedora has been putting more patches for ext4 in over what goes into
Linus's tree.


2009-05-01 13:52:35

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151


Theodore Tso <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #8 from Theodore Tso <[email protected]> 2009-05-01 13:52:35 ---
We're desperately looking for a reliable reproduction case for this problem.
My suggestion at this point is if you have a large filesystem (the reports for
this seem to come from users with > 1TB filesystems) to take a periodic e2image
backup of your filesystem before the corruption, and save it on some other
filesystem so there is a backup of your filesystem metadata. This will help
recover the filesystem after the corruption.
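A periodic metadata backup along the lines suggested above could be scripted roughly as follows. This is a sketch: it assumes `e2image` from e2fsprogs is installed and is called in its basic form (`e2image device image-file`), and the file-naming scheme and function names are hypothetical. As recommended, the destination directory should live on a different filesystem; the script could then be run from cron.

```python
import datetime
import subprocess
from pathlib import Path

def e2image_command(device, backup_dir, now=None):
    """Build the argv for a metadata image of `device` (hypothetical naming)."""
    if now is None:
        now = datetime.datetime.now()
    name = Path(device).name + now.strftime("-%Y%m%d-%H%M%S.e2i")
    return ["e2image", device, str(Path(backup_dir) / name)]

def take_snapshot(device, backup_dir):
    cmd = e2image_command(device, backup_dir)
    # Needs e2fsprogs and read access to the block device (usually root).
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

Keeping the command construction separate from execution makes it easy to log or dry-run the exact invocation before scheduling it.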

If you can reproduce this reliably, please let us know. We haven't been able
to get this problem reproduced yet.


2009-05-01 15:19:38

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





--- Comment #9 from Eric Sandeen <[email protected]> 2009-05-01 15:19:38 ---
(In reply to comment #7)
> I have been running Fedora's kernels the whole time, and I haven't seen this
> issue with 2.6.29 kernels. I don't think I saw it with 2.6.28 kernels either. I
> wonder if Fedora has been putting more patches for ext4 in over what goes into
> Linus's tree.

Nathan: well, not really. I'll never put something in Fedora that hasn't been
sent upstream; that's not how we work.

One difference may be that the 2.6.27 kernels in F10 did have the ext4 "stable"
backports that Ted was doing...

If you're running .29 kernels from Fedora, they should be equivalent to what's
upstream. The only changes in F11, for example, are:

Patch2920: linux-2.6-ext4-flush-on-close.patch
Patch2921: linux-2.6-ext4-really-print-warning-once.patch


2009-05-02 16:29:30

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





--- Comment #10 from Nathan Grennan <[email protected]> 2009-05-02 16:29:29 ---
Here is my basic experience with ext4. I had two basic problems.

One was the system simply going off into a hang. The other was this issue,
which went away when I moved to a 2.6.28+ kernel. Backported patches didn't
fix it for me: at the time you were telling me there were no new patches, yet
I would hit the issue with 2.6.27 kernels but not with 2.6.28 kernels. Later
cebbert said there was something nasty in the 2.6.28.1 kernels, so I upgraded
to 2.6.29. I have had zero issues with ext4 since upgrading to 2.6.29.

I just looked through my IRC logs and found the errors that I think caused
this problem. Sandeen, I have mentioned these to you before. I think it went
like this: I would get one of these errors, and the system would continue,
because that is the crazy default. A few days later, not having noticed these
errors, I would reboot the system and get the fsck issue above. From what I
remember reading, this issue was fixed.

Feb 16 12:03:19 proton kernel: EXT4-fs error (device md3):
ext4_mb_generate_buddy: EXT4-fs: group
EXT4-fs error (device md3): mb_free_blocks: double-free of inode 0's block
321550248(bit 30632 in group 9812)
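The double-free message encodes the same block-to-group arithmetic as the fsck output: assuming 32768 blocks per group, block 321550248 is bit 30632 of group 9812 (9812 * 32768 + 30632 = 321550248), so the message is internally consistent. The parser below is a hypothetical helper written for this check, not part of the kernel or e2fsprogs.

```python
import re

BLOCKS_PER_GROUP = 32768  # assumption: default for 4 KiB blocks, first data block 0

MSG = "double-free of inode 0's block 321550248(bit 30632 in group 9812)"
PATTERN = re.compile(r"block (\d+)\(bit (\d+) in group (\d+)\)")

def check_double_free(msg, blocks_per_group=BLOCKS_PER_GROUP):
    block, bit, group = map(int, PATTERN.search(msg).groups())
    # Consistent iff block == group * blocks_per_group + bit.
    return divmod(block, blocks_per_group) == (group, bit)

print(check_double_free(MSG))  # True
```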


2009-05-02 20:24:57

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151





--- Comment #11 from Theodore Tso <[email protected]> 2009-05-02 20:24:56 ---
Hi guys (and gals), please don't assume that your problem is the same as a
previously reported bug --- especially if the bug report title is as vague as
"unexplained fsck errors". That could mean software bugs, or hardware bugs
--- and just because you have an "unexplained fsck error", please don't assume
your problem is the same as another person's. It could be, if the kernel
versions are the same, and the symptoms are exactly the same, and especially if
the way to reproduce it is the same.

The original bug report dated from a 2.6.27 kernel, and there have been a huge
number of bugs fixed since then. To be honest, not all bug fixes have been
backported to the 2.6.27.x series, either. In some cases it was just way too
difficult to do. So Nathan, if you're at 2.6.29, and you're not seeing any
problem, then we're probably better off closing this bug.

Florian, my guess is that whatever problem Nathan reported back in the 2.6.27
kernel is very different from what you're seeing. May I suggest that you open a
new bugzilla entry for your problems, and please give us as much detail as
possible?

Many thanks.


2009-05-02 22:14:25

by bugzilla-daemon

Subject: [Bug 12151] Unexplained fsck errors on a ext4 filesystem

http://bugzilla.kernel.org/show_bug.cgi?id=12151


Nathan Grennan <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |CODE_FIX



