Hi,
FiSC (our FS checker) issues a warning on ext2, complaining that a crash
after fsync corrupts the file system. The FS is corrupted in two different
ways: 1. a file contains illegal blocks (such as block # -2) 2. one block
is owned by two different files.
I diagnosed the warning a little bit and it appears that this warning can
be triggered by the following steps:
1. a file is truncated, so several blocks are freed
2. a new file is created, and the blocks freed in step 1 are reused
3. fsync on the new file
4. crash and run fsck to recover.
fsync should guarantee that a specific file is persistent on disk.
Presumably, operations on other files should not mess up the file we
just fsynced (true?). However, I also understand that ext2 by default
relies on e2fsck to provide file system consistency. Do you guys consider
the above warning a bug or not? Any clarification on this will be very
helpful.
To reproduce the warning, please download the test case at
http://fisc.stanford.edu/bug3/crash.tar.bz2, untar, compile and run the
executable:
./crash <disk partition> <mount point>
This test case is semi-automatically generated. It may contain more FS
operations than are needed to trigger the warning. **NOTE**: it'll run
mke2fs on <disk partition> and reboot your machine!
e2fsck output:
e2fsck 1.36 (05-Feb-2005)
/dev/ide/host0/bus0/target0/lun0/part9 was not cleanly unmounted, check
forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 12 has illegal block(s). Clear? yes
Illegal block #-2 (2305145833) in inode 12. CLEARED.
Inode 12, i_blocks is 24, should be 16. Fix? yes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 12: 24
Duplicate/bad block(s) in inode 13: 24
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 2 inodes containing duplicate/bad blocks.)
File ... (inode #12, mod time Mon Mar 7 01:27:12 2005)
has 1 duplicate block(s), shared with 1 file(s):
... (inode #13, mod time Mon Mar 7 01:27:14 2005)
Clone duplicate/bad blocks? yes
File ... (inode #13, mod time Mon Mar 7 01:27:14 2005)
has 1 duplicate block(s), shared with 1 file(s):
... (inode #12, mod time Mon Mar 7 01:27:12 2005)
Duplicated blocks already reassigned or cloned.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 12
Connect to /lost+found? yes
Inode 12 ref count is 2, should be 1. Fix? yes
Unattached inode 13
Connect to /lost+found? yes
Inode 13 ref count is 2, should be 1. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: +(21--22) +(29--31)
Fix? yes
Free blocks count wrong for group #0 (37, counted=31).
Fix? yes
Free blocks count wrong (37, counted=31).
Fix? yes
/dev/ide/host0/bus0/target0/lun0/part9: ***** FILE SYSTEM WAS MODIFIED
*****
/dev/ide/host0/bus0/target0/lun0/part9: 13/16 files (7.7% non-contiguous),
29/60 blocks
running cmd "sudo mount -t ext2 /dev/ide/host0/bus0/target0/lun0/part9
/mnt/sbd1
-Junfeng
On Mon, Mar 07 2005, Junfeng Yang wrote:
>
> Hi,
>
> FiSC (our FS checker) issues a warning on ext2, complaining that crash
> after fsync causes file system to corrupt. FS corrupts in two different
> ways: 1. file contains illegal blocks (such as block # -2) 2. one block
> owned by two different files.
>
> I diagnosed the warning a little bit and it appears that this warning can
> be triggered by the following steps:
>
> 1. a file is truncated, so several blocks are freed
> 2. a new file is created, and the blocks freed in step 1 are reused
> 3. fsync on the new file
> 4. crash and run fsck to recover.
>
> fsync should guarantee that a specific file is persistent on disk.
> Presumably, operations on other files should not mess up with the file we
> just fsync (true ?) However, I also understand that ext2 by default
> relies on e2fsck to provide file system consistency. Do you guys consider
> the above warning as a bug or not? Any clarification on this will be very
> helpful.
fsync on ext2 only really guarantees that the data has reached
the disk; what the disk does is outside the realm of the fs.
If the ide drive has write back caching enabled, the data
might only be in the drive's cache. If the power is removed right after
fsync returns, the drive might not get a chance to actually commit the
write to platter.
This isn't a patch suitable for inclusion, but you could try
and repeat with it applied. Does it fix the file corruption for
you?
===== fsync.c 1.9 vs edited =====
--- 1.9/fs/ext2/fsync.c 2004-04-12 19:54:19 +02:00
+++ edited/fsync.c 2005-03-07 11:44:25 +01:00
@@ -47,5 +47,7 @@
 	err = ext2_sync_inode(inode);
 	if (ret == 0)
 		ret = err;
+
+	blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
 	return ret;
 }
--
Jens Axboe
> fsync on ext2 only really guarantees that the data has reached
> the disk, what the disk does it outside the realm of the fs.
> If the ide drive has write back caching enabled, the data just
> might only be in cache. If the power is removed right after fsync
> returns, the drive might not get a chance to actually commit the
> write to platter.
Hi Jens,
Thanks for the reply. I tried your patch, and also setting hdparm -W0.
The warning is still there. This warning and the previous ones I reported
should be irrelevant to IDE drivers, as FiSC (our FS checker) doesn't
actually crash the machine but simulates a crash using a ramdisk.
It appears to me that this warning can be triggered by the following steps:
1. create a file A with several data blocks. fsync(A) to disk
2. truncate A to a smaller size, causing a few blocks to be freed.
However, they are only freed in memory. The corresponding changes in
bitmaps haven't yet hit the disk.
3. create a file B with several data blocks. ext2 will re-use the freed
blocks from step 2.
4. fsync(B). Once fsync returns, crash.
At this moment, the truncate in step 2 hasn't reached the disk yet, so the
file A on disk still contains pointers to the freed blocks. However, the
fsync(B) in step 4 flushes B's inode and other metadata to disk. Now we
end up with a file system where a block is shared by two files.
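For reference, the four steps above boil down to a system call sequence
roughly like the following. This is only a minimal sketch, not the actual
FiSC test case: the helper names (write_fill, run_sequence), the file names
A and B, and the sizes are all made up for illustration, and the crash
itself is not simulated.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `len` bytes of `fill` to fd; returns 0 on success, -1 on error. */
static int write_fill(int fd, unsigned char fill, size_t len)
{
	char buf[4096];

	memset(buf, fill, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t n = write(fd, buf, chunk);

		if (n <= 0)
			return -1;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: run steps 1-4 in directory `dir`.
 * Returns 0 if every call succeeded, -1 otherwise.
 */
int run_sequence(const char *dir)
{
	char a[512], b[512];
	int fa, fb, ret = -1;

	assert(dir != NULL);
	snprintf(a, sizeof(a), "%s/A", dir);
	snprintf(b, sizeof(b), "%s/B", dir);

	fa = open(a, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fa < 0)
		return -1;
	/* step 1: give A several data blocks and fsync them to disk */
	if (write_fill(fa, 'a', 64 * 1024) || fsync(fa))
		goto out_a;
	/* step 2: shrink A; the blocks are only freed in memory so far */
	if (ftruncate(fa, 4096))
		goto out_a;

	fb = open(b, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fb < 0)
		goto out_a;
	/* step 3: B may be handed the blocks A just gave up */
	/* step 4: fsync(B); a crash right after this is the bad moment */
	if (write_fill(fb, 'b', 64 * 1024) == 0 && fsync(fb) == 0)
		ret = 0;
	close(fb);
out_a:
	close(fa);
	return ret;
}
```

Calling run_sequence("/mnt/sbd1") on an ext2 mount and cutting power (or
snapshotting the ramdisk) right after it returns is the point at which the
on-disk image would be handed to e2fsck.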
I'm not sure how the invalid block number warning is triggered.
-Junfeng
On Mar 07, 2005 14:55 -0800, Junfeng Yang wrote:
> > fsync on ext2 only really guarantees that the data has reached
> > the disk, what the disk does it outside the realm of the fs.
> > If the ide drive has write back caching enabled, the data just
> > might only be in cache. If the power is removed right after fsync
> > returns, the drive might not get a chance to actually commit the
> > write to platter.
>
> Thanks for the reply. I tried your patch, and also setting hdparm -W0.
> The warning is still there. This warning and the previous ones I reported
> should be irrelevant to IDE drivers, as FiSC (our FS checker) doesn't
> actually crash the machine but simulates a crash using a ramdisk.
>
> It appears to me that this warning can be triggered by the following steps:
>
> 1. create a file A with several data blocks. fsync(A) to disk
>
> 2. truncate A to a smaller size, causing a few blocks to be freed.
> However, they are only freed in memory. The corresponding changes in
> bitmaps haven't yet hit the disk.
>
> 3. create a file B with several data blocks. ext2 will re-use the freed
> blocks from step 2.
>
> 4. fsync(B). Once fsync returns, crash.
In ext3 this case is handled because the filesystem won't reallocate the
metadata blocks freed from file A before they have been committed to disk.
Also, the operations on file A are guaranteed to complete before or with
operations on file B so fsync(B) will also cause the changes from A to
be flushed to disk at the same time (this is guaranteed to complete before
fsync(B) returns).
I'm not sure how easy it would be to fix this in ext2 without introducing
at least some of the mechanisms from ext3, nor whether there is desire to
do this given the presence of ext3.
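To illustrate the "don't reallocate before commit" rule, here is a toy
in-memory model. It is emphatically not the ext3/jbd code: the names
(toy_alloc, toy_free, toy_alloc_block, toy_commit) and the three block
states are assumptions invented for this sketch. A freed block is parked
in a pending state and only becomes allocatable once a commit has been
modelled.

```c
#include <assert.h>

#define NBLOCKS 8

enum blk_state { BLK_FREE, BLK_USED, BLK_PENDING_FREE };

struct toy_alloc {
	enum blk_state state[NBLOCKS];
};

/* Free a block in memory only; it stays unusable until toy_commit(). */
void toy_free(struct toy_alloc *t, int blk)
{
	assert(blk >= 0 && blk < NBLOCKS);
	assert(t->state[blk] == BLK_USED);
	t->state[blk] = BLK_PENDING_FREE;
}

/* Return the lowest free block, or -1; pending-free blocks are skipped. */
int toy_alloc_block(struct toy_alloc *t)
{
	int i;

	for (i = 0; i < NBLOCKS; i++) {
		if (t->state[i] == BLK_FREE) {
			t->state[i] = BLK_USED;
			return i;
		}
	}
	return -1;
}

/* Model a commit: every pending free is now on disk and may be reused. */
void toy_commit(struct toy_alloc *t)
{
	int i;

	for (i = 0; i < NBLOCKS; i++)
		if (t->state[i] == BLK_PENDING_FREE)
			t->state[i] = BLK_FREE;
}
```

With this rule, a block freed by truncating A cannot be handed to B until
the truncate itself has been committed, so the on-disk image never has
both files pointing at it.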
> At this moment, the truncate in step 2 hasn't reached the disk yet, so the
> file A on disk still contains pointers to the freed blocks. However, the
> fsync(B) in step 4 flushes B's inode and other metadata to disk. Now we
> end up with a file system where a block is shared by two files.
>
> I'm not sure how the invalid block number warning is triggered.
If file B was larger than 48kB, and you are filling it with e.g. 0xffffffe,
then the data in file B could overwrite file A's indirect block.
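A small sketch of the arithmetic behind this: an ext2 inode has 12 direct
block pointers, so the 48kB figure corresponds to 4kB blocks. The code
below also shows why data written by B looks like garbage when it is read
back as A's indirect block. The function names are made up for this
example, the legality check is simplified (nonzero and beyond the end of
the filesystem), and pointers are read in host byte order rather than the
on-disk little-endian format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDIR_BLOCKS 12 /* direct block pointers in an ext2 inode */

/* Bytes a file can hold before it needs its first indirect block. */
size_t direct_bytes(size_t block_size)
{
	return NDIR_BLOCKS * block_size;
}

/*
 * Read `block` as an array of 32-bit block pointers, the way an indirect
 * block is interpreted, and count the entries that are nonzero and point
 * past the end of a filesystem with `blocks_count` blocks.
 */
size_t count_illegal_ptrs(const unsigned char *block, size_t block_size,
			  uint32_t blocks_count)
{
	size_t i, bad = 0;

	assert(block != NULL);
	for (i = 0; i + sizeof(uint32_t) <= block_size; i += sizeof(uint32_t)) {
		uint32_t ptr;

		memcpy(&ptr, block + i, sizeof(ptr)); /* host byte order */
		if (ptr != 0 && ptr >= blocks_count)
			bad++;
	}
	return bad;
}
```

On the 60-block test filesystem from the e2fsck output, any block filled
with large values would consist entirely of illegal pointers when
interpreted this way.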
Cheers, Andreas
--
Andreas Dilger
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/
Thanks a lot Andreas. Your message clarifies everything.
> In ext3 this case is handled because the filesystem won't reallocate the
> metadata blocks freed from file A before they have been committed to disk.
> Also, the operations on file A are guaranteed to complete before or with
> operations on file B so fsync(B) will also cause the changes from A to
> be flushed to disk at the same time (this is guaranteed to complete before
> fsync(B) returns).
In other words, each fsync essentially triggers a jbd commit, right?
-Junfeng
Jens Axboe wrote:
> fsync on ext2 only really guarantees that the data has reached
> the disk, what the disk does it outside the realm of the fs.
> If the ide drive has write back caching enabled, the data just
> might only be in cache. If the power is removed right after fsync
> returns, the drive might not get a chance to actually commit the
> write to platter.
Is this really the behavior in the current kernel? If so this seems
quite wrong to me - if the application did an fsync, I think the kernel
should be sending cache flush commands to the drive before the call
completes..
--
Robert Hancock Saskatoon, SK, Canada
Home Page: http://www.roberthancock.com/
On Tue, Mar 08 2005, Pavel Machek wrote:
> On Po 07-03-05 11:45:13, Jens Axboe wrote:
> > On Mon, Mar 07 2005, Junfeng Yang wrote:
> > >
> > > Hi,
> > >
> > > FiSC (our FS checker) issues a warning on ext2, complaining that crash
> > > after fsync causes file system to corrupt. FS corrupts in two different
> > > ways: 1. file contains illegal blocks (such as block # -2) 2. one block
> > > owned by two different files.
> > >
> > > I diagnosed the warning a little bit and it appears that this warning can
> > > be triggered by the following steps:
> > >
> > > 1. a file is truncated, so several blocks are freed
> > > 2. a new file is created, and the blocks freed in step 1 are reused
> > > 3. fsync on the new file
> > > 4. crash and run fsck to recover.
> > >
> > > fsync should guarantee that a specific file is persistent on disk.
> > > Presumably, operations on other files should not mess up with the file we
> > > just fsync (true ?) However, I also understand that ext2 by default
> > > relies on e2fsck to provide file system consistency. Do you guys consider
> > > the above warning as a bug or not? Any clarification on this will be very
> > > helpful.
> >
> > fsync on ext2 only really guarantees that the data has reached
> > the disk, what the disk does it outside the realm of the fs.
> > If the ide drive has write back caching enabled, the data just
> > might only be in cache. If the power is removed right after fsync
> > returns, the drive might not get a chance to actually commit the
> > write to platter.
>
> I *think* they are using emulation for their checker, so drivers
> lying about writes should not be problem.
Yes, Junfeng informed me of that later.
--
Jens Axboe
On Po 07-03-05 11:45:13, Jens Axboe wrote:
> On Mon, Mar 07 2005, Junfeng Yang wrote:
> >
> > Hi,
> >
> > FiSC (our FS checker) issues a warning on ext2, complaining that crash
> > after fsync causes file system to corrupt. FS corrupts in two different
> > ways: 1. file contains illegal blocks (such as block # -2) 2. one block
> > owned by two different files.
> >
> > I diagnosed the warning a little bit and it appears that this warning can
> > be triggered by the following steps:
> >
> > 1. a file is truncated, so several blocks are freed
> > 2. a new file is created, and the blocks freed in step 1 are reused
> > 3. fsync on the new file
> > 4. crash and run fsck to recover.
> >
> > fsync should guarantee that a specific file is persistent on disk.
> > Presumably, operations on other files should not mess up with the file we
> > just fsync (true ?) However, I also understand that ext2 by default
> > relies on e2fsck to provide file system consistency. Do you guys consider
> > the above warning as a bug or not? Any clarification on this will be very
> > helpful.
>
> fsync on ext2 only really guarantees that the data has reached
> the disk, what the disk does it outside the realm of the fs.
> If the ide drive has write back caching enabled, the data just
> might only be in cache. If the power is removed right after fsync
> returns, the drive might not get a chance to actually commit the
> write to platter.
I *think* they are using emulation for their checker, so drivers
lying about writes should not be problem.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
On Mon, Mar 07, 2005 at 01:57:10AM -0800, Junfeng Yang wrote:
> FiSC (our FS checker) issues a warning on ext2, complaining that crash
> after fsync causes file system to corrupt. FS corrupts in two different
> ways: 1. file contains illegal blocks (such as block # -2) 2. one block
> owned by two different files.
>
> I diagnosed the warning a little bit and it appears that this warning can
> be triggered by the following steps:
>
> 1. a file is truncated, so several blocks are freed
> 2. a new file is created, and the blocks freed in step 1 are reused
> 3. fsync on the new file
> 4. crash and run fsck to recover.
>
> fsync should guarantee that a specific file is persistent on disk.
> Presumably, operations on other files should not mess up with the file we
> just fsync (true ?) However, I also understand that ext2 by default
> relies on e2fsck to provide file system consistency. Do you guys consider
> the above warning as a bug or not? Any clarification on this will be very
> helpful.
Whether or not it is a bug is debatable. (Talking to certain *BSD
folks on this subject will cause them to jump up and down and froth at
the mouth.) It is *expected* behaviour, yes, and it is mitigated by
two factors. (1) Metadata for ext2 is synced out every 5 seconds,
while data is synced out every 60, so the max window for this race
described above is 5 seconds, and in practice rarely shows up if you
are not using fsync. (2) Unlike BSD's fsck, when a block is owned by
two different files, we offer an option to clone the affected files so
data isn't lost, while BSD's fsck shoots both files and asks questions
later.
I believe the warning should go away if you mount -o sync (but then
the filesystem will perform very slowly :-).
As I had alluded to earlier, this has historically been a bone of
contention with the *BSD folks, since ext2 was much, much faster than
the BSD FFS. The main reason is that the FFS had all sorts of
logic to make sure metadata blocks would be synced in a certain order to
make life easier for their fsck. Given that most people only pay
attention to benchmarks, this upset them. Ext2's approach relied on a
simpler and more performant kernel implementation, and a more
intelligent fsck.
Also, as I had pointed out to the BSD folks, BSD 4.3/4.4's FFS also
had the property that they did just enough write ordering to guarantee
that fsck -p would run smoothly, but not enough to make any guarantees
about recently written files, and only a filesystem weenie would care
that fsck was clean when user data files were silently corrupted after
a crash. (This was all pre-soft updates, by the way.)
The tradeoff as far as ext2 was concerned was that if you got unlucky
and managed to trigger this warning it did require a manual fsck
instead of an automatic fsck. In actual practice, in the real world
this happened extremely rarely.
Should we fix it today? Given that we have ext3, I'd probably answer
no. It's a known property of ext2; we've lived with it for over ten
years, and fixing it would just slow down ext2 (which often gets used
as a benchmark standard to aspire to), and make the ext2 codebase
more complicated.
- Ted
> I believe the warning should go away if you mount -o sync (but then
> the filesystem will perform very slowly :-).
>
I do agree with you, Andreas and other people that this is expected
behavior on ext2, and that ext3 should be chosen over ext2 when such
corruption is a concern.
However, mount -o sync won't fix the problem for ext2 either :) I sent a
report last week showing that ext2 doesn't actually sync writes even if an
ext2 partition is mounted -o sync,dirsync. Andrew confirmed that ext2 has
MS_SYNCHRONOUS holes (and possibly O_SYNC holes).
Check out http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.0/1252.html
or google "Junfeng mount sync".
Thanks,
-Junfeng
> the mouth.) It is *expected* behaviour, yes, and it is mitigated by
> two factors. (1) Metadata for ext2 is synced out every 5 seconds,
> while data is synced out every 60, so the max window for this race
> described above is 5 seconds, and in practice rarely shows up if you
> are not using fsync. (2) Unlike BSD's fsck, when a block is owned by
> two different files, we offer an option to clone the affected files so
> data isn't lost, while BSD's fsck shoots both files and asks questions
> later.
FiSC detects another related warning: a file's data are not what they
should be after fsync. It turns out that e2fsck -y clears invalid indirect
blocks before it clones the shared blocks, causing a file to lose data when
these invalid blocks are also shared blocks. Consider the following
case:
file A has block 100 as an indirect block on disk
file B has block 100 as a data block, and the user writes garbage to
block 100, then fsync(B).
Now clearing block 100 before cloning in e2fsck would cause file B to
lose its data block. A possible fix would be to clone duplicate blocks
before clearing invalid blocks. Any thoughts? Or does the user have to
run e2fsck twice in this case?
To reproduce the warning, get http://fisc.stanford.edu/bug5/crash.c. It
uses the fixed mount point /mnt/sbd0.
e2fsck output:
e2fsck 1.36 (05-Feb-2005)
/dev/ide/host0/bus0/target0/lun0/part9 was not cleanly unmounted, check
forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 12 has illegal block(s). Clear? yes
Illegal block #-2 (2517328384) in inode 12. CLEARED.
Inode 12, i_blocks is 24, should be 16. Fix? yes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 12: 22 25 26 27
Duplicate/bad block(s) in inode 13: 22
Duplicate/bad block(s) in inode 16: 25 26 27
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 3 inodes containing duplicate/bad blocks.)
File ... (inode #12, mod time Tue Mar 8 21:32:38 2005)
has 4 duplicate block(s), shared with 2 file(s):
... (inode #16, mod time Tue Mar 8 21:32:40 2005)
... (inode #13, mod time Tue Mar 8 21:32:39 2005)
Clone duplicate/bad blocks? yes
File ... (inode #13, mod time Tue Mar 8 21:32:39 2005)
has 1 duplicate block(s), shared with 1 file(s):
... (inode #12, mod time Tue Mar 8 21:32:38 2005)
Duplicated blocks already reassigned or cloned.
File ... (inode #16, mod time Tue Mar 8 21:32:40 2005)
has 3 duplicate block(s), shared with 1 file(s):
... (inode #12, mod time Tue Mar 8 21:32:38 2005)
Duplicated blocks already reassigned or cloned.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 12
Connect to /lost+found? yes
Inode 12 ref count is 2, should be 1. Fix? yes
Unattached inode 13
Connect to /lost+found? yes
Inode 13 ref count is 2, should be 1. Fix? yes
Unattached inode 16
Connect to /lost+found? yes
Inode 16 ref count is 2, should be 1. Fix? yes
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (28, counted=24).
Fix? yes
Free blocks count wrong (28, counted=24).
Fix? yes
Inode bitmap differences: -(14--15)
Fix? yes
Free inodes count wrong for group #0 (0, counted=2).
Fix? yes
Directories count wrong for group #0 (4, counted=2).
Fix? yes
Free inodes count wrong (0, counted=2).
Fix? yes
-Junfeng
Hello Ted,
In article <[email protected]> you wrote:
> Should we fix it today? Given that we have ext3, I'd probably answer
> no. It's a known property of ext2; we've lived with it for over ten
> years, and to add this would just slow down ext2 (which gets used
> often as benchmark standard to aspire to), and make the ext2 codebase
> more complicated.
I am not too deep into FS design; however, I have heard from some admins of
a pretty busy server that the allocation method of placing file content
close to the directories causes a lot of fragmentation there. So I wonder if
a simple change in allocation policy could reduce the problems with
fragmentation and (especially) lower the chance of blocks being reused too
often.
I might be able to get the patch; however, I am not sure if it will solve
or reduce the problem the checker group found.
The performance impact of such a changed allocation policy is, however,
positive on the given system (due to decreased fragmentation).
BTW: I was not directly involved in the decision process here about which
FS to use; however, I am sure it is related to both performance and
recoverability, because all recent (journalling) filesystems (XFS, Reiser
and ext3) very often failed with big data loss in that environment, whereas
ext2 could most often be recovered very well. So from my point of view,
support for ext2 should not be dropped.
Greetings
Bernd
PS: I have a before/after screenshot of the filesystem in this German
article. It is a POP3 server:
http://itblog.eckenfels.net/archives/8-Fragmentierung.html