LinuxLists.cc - data corrupting bug in 2.4.20 ext3, data=journal

2002-12-01 08:04:23

Subject: data corrupting bug in 2.4.20 ext3, data=journal

In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
which can very easily cause file data corruption at unmount time. This
was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
released, and three months after the bug was merged. Unfortunate timing)

This only affects filesystems which were mounted with the `data=journal'
option. Or files which are operating under `chattr -j'. So most people
are unaffected. The problem is not present in 2.5 kernels.

The symptoms are that any file data which was written within the thirty
seconds prior to the unmount may not make it to disk. A workaround is
to run `sync' before unmounting.

The optimisation was intended to avoid writing out and waiting on the
inode's buffers when the subsequent commit would do that anyway. This
optimisation was applied to both data=journal and data=ordered modes.
But it is only valid for data=ordered mode.

In data=journal mode the data is left dirty in memory and the unmount
will silently discard it.

The fix is to only apply the optimisation to inodes which are operating
under data=ordered.

--- linux-akpm/fs/ext3/fsync.c~ext3-fsync-fix Sat Nov 30 23:37:33 2002
+++ linux-akpm-akpm/fs/ext3/fsync.c Sat Nov 30 23:39:30 2002
@@ -63,10 +63,12 @@ int ext3_sync_file(struct file * file, s
*/
ret = fsync_inode_buffers(inode);

- /* In writeback mode, we need to force out data buffers too. In
- * the other modes, ext3_force_commit takes care of forcing out
- * just the right data blocks. */
- if (test_opt(inode->i_sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)
+ /*
+ * If the inode is under ordered-data writeback it is not necessary to
+ * sync its data buffers here - commit will do that, with potentially
+ * better IO merging
+ */
+ if (!ext3_should_order_data(inode))
ret |= fsync_inode_data_buffers(inode);

ext3_force_commit(inode->i_sb);

_

2002-12-01 08:45:04

by Andrew Morton

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

Andrew Morton wrote:
>
> ...
> The fix is to only apply the optimisation to inodes which are operating
> under data=ordered.
>

That "fix" didn't fix it. Sorry about that.

Please avoid ext3/data=journal until it is sorted out.

2002-12-01 12:33:27

by Nick Piggin

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

Andrew Morton wrote:

>In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
>which can very easily cause file data corruption at unmount time. This
>was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
>released, and three months after the bug was merged. Unfortunate timing)
>
In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
released if that is any help to you. 2.4.19 and 2.4.20 are affected
and I haven't tested previous releases. I was going to re-report it
sometime, but Alan brought it to light just the other day.

Nick

2002-12-02 07:09:41

by Andrew Morton

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

Nick Piggin wrote:
>
> Andrew Morton wrote:
>
> >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
> >which can very easily cause file data corruption at unmount time. This
> >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
> >released, and three months after the bug was merged. Unfortunate timing)
> >
> In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
> released if that is any help to you. 2.4.19 and 2.4.20 are affected
> and I haven't tested previous releases. I was going to re-report it
> sometime, but Alan brought it to light just the other day.
>

Are you sure? I can't make it happen on 2.4.19. And disabling the new
BH_Freed logic (which went into 2.4.20-pre5) makes it go away.

--- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002
+++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002
@@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
* use in a different page. */
if (__buffer_state(bh, Freed)) {
clear_bit(BH_Freed, &bh->b_state);
- clear_bit(BH_JBDDirty, &bh->b_state);
+// clear_bit(BH_JBDDirty, &bh->b_state);
}

if (buffer_jdirty(bh)) {

_

2002-12-02 08:18:15

by Nick Piggin

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

Andrew Morton wrote:

>Nick Piggin wrote:
>>
>> Andrew Morton wrote:
>>
>> >In 2.4.20-pre5 an optimisation was made to the ext3 fsync function
>> >which can very easily cause file data corruption at unmount time. This
>> >was first reported by Nick Piggin on November 29th (one day after 2.4.20 was
>> >released, and three months after the bug was merged. Unfortunate timing)
>> >
>> In fact it was reported on lkml on 18th July IIRC before 2.4.19 was
>> released if that is any help to you. 2.4.19 and 2.4.20 are affected
>> and I haven't tested previous releases. I was going to re-report it
>> sometime, but Alan brought it to light just the other day.
>>
>
>Are you sure? I can't make it happen on 2.4.19. And disabling the new
>BH_Freed logic (which went into 2.4.20-pre5) makes it go away.
>
>
>--- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002
>+++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002
>@@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
> * use in a different page. */
> if (__buffer_state(bh, Freed)) {
> clear_bit(BH_Freed, &bh->b_state);
>- clear_bit(BH_JBDDirty, &bh->b_state);
>+// clear_bit(BH_JBDDirty, &bh->b_state);
> }
>
> if (buffer_jdirty(bh)) {
>
I reported the bug for 2.4.19-rc1 and 2 but I can't remember if I tested
2.4.19
when it was released. It has an external journal on a seperate disk. I can't
really do any testing with the machine unfortunately.

Regards,
Nick

2002-12-02 12:07:20

by Stephen C. Tweedie

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

On Mon, 2002-12-02 at 07:17, Andrew Morton wrote:

> Are you sure? I can't make it happen on 2.4.19. And disabling the new
> BH_Freed logic (which went into 2.4.20-pre5) makes it go away.
>
> --- linux-akpm/fs/jbd/commit.c~a Sun Dec 1 23:10:12 2002
> +++ linux-akpm-akpm/fs/jbd/commit.c Sun Dec 1 23:10:27 2002
> @@ -695,7 +695,7 @@ skip_commit: /* The journal should be un
> - clear_bit(BH_JBDDirty, &bh->b_state);
> +// clear_bit(BH_JBDDirty, &bh->b_state);

Argh.

That's not the right fix --- it reintroduces the bug that BH_Freed was
introduced to solve in the first place.

The problem is that ext3 is expecting that truncate_inode_pages() (and
hence ext3_flushpage) is only called during a truncate. That's what the
function is named for, after all, and it's the hint we need to indicate
that future writeback on the data we're discarding should be disabled
(so that we don't get old data written on top of new data should the
block get deallocated.)

But kill_supers() eventually calls truncate_inode_pages() too when we're
doing the invalidate_inodes(). And ext3 is reacting just the way it
would for a normal truncate --- the data still gets written to the
journal (correct, if we reboot before the truncate commits then the old
data is preserved in the journal) but is not queued for writeback.

The solution is to set BH_Freed in ext3_flushpage IFF we're being called
from the truncate, but to avoid it if we're in an umount. I'm not sure
of the best way to do that right now, but there are some trivial but
hacky methods possible (eg. see if we're in a nested transaction; if so,
it's a truncate, if not, it's a umount.) MS_ACTIVE might be a possible
flag to test, but I'll need to double-check whether that is 100% safe
--- we can't afford to skip the BH_Freed setting if we're in a truncate
and the filesystem is not yet completely quiesced.

Cheers,
Stephen

2002-12-02 15:29:37

by Stephen C. Tweedie

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

--- linux-2.4-ext3merge/fs/ext3/super.c.=K0027=.orig 2002-12-02 15:35:13.000000000 +0000
+++ linux-2.4-ext3merge/fs/ext3/super.c 2002-12-02 15:35:14.000000000 +0000
@@ -1640,7 +1640,12 @@
sb->s_dirt = 0;
target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);

- if (do_sync_supers) {
+ /*
+ * Tricky --- if we are unmounting, the write really does need
+ * to be synchronous. We can detect that by looking for NULL in
+ * sb->s_root.
+ */
+ if (do_sync_supers || !sb->s_root) {
unlock_super(sb);
log_wait_commit(EXT3_SB(sb)->s_journal, target);
lock_super(sb);

Attachments:

4201-umount-sync-super.patch (582.00 B)

2002-12-02 17:37:43

by Andrew Morton

[permalink] [raw]

Subject: Re: data corrupting bug in 2.4.20 ext3, data=journal

"Stephen C. Tweedie" wrote:
>
> ...
> The problem is that ext3 is expecting that truncate_inode_pages() (and
> hence ext3_flushpage) is only called during a truncate.

That's OK, because there shouldn't be any dirty data attached to the
inodes at that time. But there is, because the commit which write_super()
started hasn't finished yet:

static int do_sync_supers = 0;
...
target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);

if (do_sync_supers) {
unlock_super(sb);
log_wait_commit(EXT3_SB(sb)->s_journal, target);

We need to do a full sync at unmount.

I assume that in other journalling modes the buffers are dirty
anyway, so sync_buffers() gets them.

And indeed enabling do_sync_supers fixes it up in both 2.4 and 2.5 (2.5
doesn't have sync_buffers(), but sync_inodes_sb() gets everything).

But are we sure that ->write_super() will always be called?

int fsync_super(struct super_block *sb)
{
...
if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);

I suspect that if s_dirt is not set, and we have dirty inodes,
write_super is not called and nothing will force a commit anywhere
in the unmount process. Which could explain the similar failure
in 2.4.19-rc1 which Nick reports.

We need to be able to distinguish between a periodic flush of the
superblock and a real sync. The write_super overload has always
been uncomfortable.

Google for "write_super is not for syncing" ;) I think Chris's
patch is the way to fix all this up.