LinuxLists.cc - [patch] fix the ext3 data=journal unmount bug

2002-12-06 05:45:27

Subject: [patch] fix the ext3 data=journal unmount bug

This patch fixes the data loss which can occur when unmounting a
data=journal ext3 filesystem.

The core problem is that the VFS doesn't tell the filesystem enough
about what is happening. ext3 _needs_ to know the difference between
regular memory-cleansing writeback and sync-for-data-integrity
purposes.

(These two operations are really quite distinct, and the kernel has got
it wrong for ages. Even now, kupdate is running filemap_fdatawait()
quite needlessly)

In the early days, ext3 would assume that a write_super() call meant
"sync". That worked OK.

But that slowed down the kupdate function - it doesn't need to wait on
the writeout. So we took the `wait' out of ext3_write_super(). And
that worked OK too, because the VFS would later write back all the
dirty data for us.

But then an unrelated optimisation to the truncate path caused that to
not work any more, and we were exposed.

This patch adds a new super_block operation `sync_fs', whose mandate is
to "sync the filesystem" for data-integrity purposes. That is, it is a
synchronous writeout, whereas write_super is an asynchronous flush.
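
The split between the two operations can be sketched in user space. This is
only an illustration under stated assumptions: the toy_* names are
hypothetical, the journal is reduced to a pair of counters rather than the
real JBD structures, and "waiting" is simulated by marking the commit durable.

```c
#include <assert.h>

/* Toy journal, not the real JBD structures: commits are just counters. */
struct toy_journal {
	unsigned int next_tid;      /* id the next started commit receives */
	unsigned int committed_tid; /* highest commit id known to be durable */
};

struct toy_sb {
	int s_dirt;
	struct toy_journal journal;
};

/* Start a commit and return its id without waiting for it. */
unsigned int toy_start_commit(struct toy_journal *j)
{
	return j->next_tid++;
}

/* Pretend to block until commit `tid` is durable. */
void toy_wait_commit(struct toy_journal *j, unsigned int tid)
{
	assert(tid < j->next_tid); /* can only wait on a started commit */
	if (j->committed_tid < tid)
		j->committed_tid = tid;
}

/* Nonzero while some started commit has not yet been waited for. */
int toy_commit_pending(const struct toy_journal *j)
{
	return j->next_tid - 1 > j->committed_tid;
}

/* Asynchronous flush: kick off the writeout and return immediately. */
void toy_write_super(struct toy_sb *sb)
{
	sb->s_dirt = 0;
	toy_start_commit(&sb->journal);
}

/* Data-integrity sync: kick off the writeout, then wait for it. */
int toy_sync_fs(struct toy_sb *sb)
{
	unsigned int target;

	sb->s_dirt = 0;
	target = toy_start_commit(&sb->journal);
	toy_wait_commit(&sb->journal, target);
	return 0;
}
```

With a journal starting at next_tid = 1 and committed_tid = 0, a call to
toy_write_super() leaves a commit pending, while toy_sync_fs() returns only
once nothing is pending.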

It is a minimal fix. Really all the `sync' code in the VFS needs a
rethink. It is _very_ ext2-centric, and needs to be redesigned to
provide more information to sophisticated filesystems about what is
going on.

But that's not a 2.4 project. And it's not looking like a 2.5 project
either - I shall be proposing the same fix for 2.5.

fs/buffer.c | 6 ++++--
fs/ext3/super.c | 25 +++++++++++++------------
fs/super.c | 6 +++++-
include/linux/fs.h | 3 ++-
4 files changed, 24 insertions(+), 16 deletions(-)

--- linux-akpm/fs/buffer.c~sync_fs Thu Dec 5 21:33:56 2002
+++ linux-akpm-akpm/fs/buffer.c Thu Dec 5 21:33:56 2002
@@ -327,6 +327,8 @@ int fsync_super(struct super_block *sb)
 	lock_super(sb);
 	if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
 		sb->s_op->write_super(sb);
+	if (sb->s_op && sb->s_op->sync_fs)
+		sb->s_op->sync_fs(sb);
 	unlock_super(sb);
 	unlock_kernel();

@@ -346,7 +348,7 @@ int fsync_dev(kdev_t dev)
 	lock_kernel();
 	sync_inodes(dev);
 	DQUOT_SYNC(dev);
-	sync_supers(dev);
+	sync_supers(dev, 1);
 	unlock_kernel();

 	return sync_buffers(dev, 1);
@@ -2833,7 +2835,7 @@ static int sync_old_buffers(void)
 {
 	lock_kernel();
 	sync_unlocked_inodes();
-	sync_supers(0);
+	sync_supers(0, 0);
 	unlock_kernel();

 	for (;;) {
--- linux-akpm/include/linux/fs.h~sync_fs Thu Dec 5 21:33:56 2002
+++ linux-akpm-akpm/include/linux/fs.h Thu Dec 5 21:33:56 2002
@@ -894,6 +894,7 @@ struct super_operations {
 	void (*delete_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
 	void (*write_super) (struct super_block *);
+	int (*sync_fs) (struct super_block *);
 	void (*write_super_lockfs) (struct super_block *);
 	void (*unlockfs) (struct super_block *);
 	int (*statfs) (struct super_block *, struct statfs *);
@@ -1240,7 +1241,7 @@ static inline int fsync_inode_data_buffe
 extern int inode_has_buffers(struct inode *);
 extern int filemap_fdatasync(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
-extern void sync_supers(kdev_t);
+extern void sync_supers(kdev_t dev, int wait);
 extern int bmap(struct inode *, int);
 extern int notify_change(struct dentry *, struct iattr *);
 extern int permission(struct inode *, int);
--- linux-akpm/fs/super.c~sync_fs Thu Dec 5 21:33:56 2002
+++ linux-akpm-akpm/fs/super.c Thu Dec 5 21:33:56 2002
@@ -445,7 +445,7 @@ static inline void write_super(struct su
  * hold up the sync while mounting a device. (The newly
  * mounted device won't need syncing.)
  */
-void sync_supers(kdev_t dev)
+void sync_supers(kdev_t dev, int wait)
 {
 	struct super_block * sb;

@@ -454,6 +454,8 @@ void sync_supers(kdev_t dev)
 		if (sb) {
 			if (sb->s_dirt)
 				write_super(sb);
+			if (wait && sb->s_op && sb->s_op->sync_fs)
+				sb->s_op->sync_fs(sb);
 			drop_super(sb);
 		}
 		return;
@@ -467,6 +469,8 @@ restart:
 			spin_unlock(&sb_lock);
 			down_read(&sb->s_umount);
 			write_super(sb);
+			if (wait && sb->s_op && sb->s_op->sync_fs)
+				sb->s_op->sync_fs(sb);
 			drop_super(sb);
 			goto restart;
 		} else
--- linux-akpm/fs/ext3/super.c~sync_fs Thu Dec 5 21:33:56 2002
+++ linux-akpm-akpm/fs/ext3/super.c Thu Dec 5 21:33:56 2002
@@ -47,6 +47,8 @@ static void ext3_mark_recovery_complete(
 static void ext3_clear_journal_err(struct super_block * sb,
 				   struct ext3_super_block * es);

+static int ext3_sync_fs(struct super_block * sb);
+
 #ifdef CONFIG_JBD_DEBUG
 int journal_no_write[2];

@@ -454,6 +456,7 @@ static struct super_operations ext3_sops
 	delete_inode:	ext3_delete_inode,	/* BKL not held. We take it */
 	put_super:	ext3_put_super,		/* BKL held */
 	write_super:	ext3_write_super,	/* BKL held */
+	sync_fs:	ext3_sync_fs,
 	write_super_lockfs: ext3_write_super_lockfs, /* BKL not held. Take it */
 	unlockfs:	ext3_unlockfs,		/* BKL not held. We take it */
 	statfs:		ext3_statfs,		/* BKL held */
@@ -1577,24 +1580,22 @@ int ext3_force_commit(struct super_block
  * This implicitly triggers the writebehind on sync().
  */

-static int do_sync_supers = 0;
-MODULE_PARM(do_sync_supers, "i");
-MODULE_PARM_DESC(do_sync_supers, "Write superblocks synchronously");
-
 void ext3_write_super (struct super_block * sb)
 {
+	if (down_trylock(&sb->s_lock) == 0)
+		BUG();
+	sb->s_dirt = 0;
+	log_start_commit(EXT3_SB(sb)->s_journal, NULL);
+}
+
+static int ext3_sync_fs(struct super_block *sb)
+{
 	tid_t target;

-	if (down_trylock(&sb->s_lock) == 0)
-		BUG();	/* aviro detector */
 	sb->s_dirt = 0;
 	target = log_start_commit(EXT3_SB(sb)->s_journal, NULL);
-
-	if (do_sync_supers) {
-		unlock_super(sb);
-		log_wait_commit(EXT3_SB(sb)->s_journal, target);
-		lock_super(sb);
-	}
+	log_wait_commit(EXT3_SB(sb)->s_journal, target);
+	return 0;
 }

 /*

_

2002-12-06 17:54:56

by Chris Mason

Subject: Re: [patch] fix the ext3 data=journal unmount bug

On Fri, 2002-12-06 at 00:52, Andrew Morton wrote:
>
>
> This patch fixes the data loss which can occur when unmounting a
> data=journal ext3 filesystem.
>
> The core problem is that the VFS doesn't tell the filesystem enough
> about what is happening. ext3 _needs_ to know the difference between
> regular memory-cleansing writeback and sync-for-data-integrity
> purposes.
>

What happens when the user does a sync() immediately after kupdate
triggers a write_super?

Since ext3_write_super just clears s_dirt, I don't see how sync_fs()
will get called.

-chris
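
The scenario in this question can be reproduced with a small user-space
model. It is a sketch under assumptions, not kernel code: the toy_* names are
hypothetical, and the walk is assumed to visit only superblocks whose s_dirt
is set, which is what the question implies about the all-devices path of
sync_supers().

```c
#include <assert.h>
#include <stddef.h>

struct toy_sb {
	int s_dirt;        /* set while the filesystem has unwritten state */
	int sync_fs_calls; /* how many times the sync hook has run */
};

/* Models ext3_write_super after the patch: clears s_dirt, does not wait. */
void toy_write_super(struct toy_sb *sb)
{
	sb->s_dirt = 0;
}

/* Stand-in for the new sync_fs hook; only counts its invocations. */
void toy_sync_fs(struct toy_sb *sb)
{
	sb->sync_fs_calls++;
}

/* Walk all superblocks; clean ones are skipped entirely (assumption). */
void toy_sync_supers(struct toy_sb *sbs, size_t n, int wait)
{
	size_t i;

	assert(sbs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!sbs[i].s_dirt)
			continue;
		toy_write_super(&sbs[i]);
		if (wait)
			toy_sync_fs(&sbs[i]);
	}
}
```

Calling toy_sync_supers(&sb, 1, 0) to model kupdate and then
toy_sync_supers(&sb, 1, 1) to model sync() leaves sync_fs_calls at zero,
because the first pass already cleared s_dirt.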

2002-12-06 19:05:01

by Andrew Morton

Subject: Re: [patch] fix the ext3 data=journal unmount bug

Chris Mason wrote:
>
> On Fri, 2002-12-06 at 00:52, Andrew Morton wrote:
> >
> >
> > This patch fixes the data loss which can occur when unmounting a
> > data=journal ext3 filesystem.
> >
> > The core problem is that the VFS doesn't tell the filesystem enough
> > about what is happening. ext3 _needs_ to know the difference between
> > regular memory-cleansing writeback and sync-for-data-integrity
> > purposes.
> >
>
> What happens when the user does a sync() immediately after kupdate
> triggers a write_super?
>
> Since ext3_write_super just clears s_dirt, I don't see how sync_fs()
> will get called.
>

It won't. There isn't really a sane way of doing this properly unless
we do something like:

1) Add a new flag to the superblock
2) Set that flag against all r/w superblocks before starting the sync
3) Use that flag inside the superblock walk.

That would provide a reasonable solution, but I don't believe we
need to go to those lengths in 2.4, do you?
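
The three steps above might look roughly like the following user-space
sketch. Everything here is hypothetical: the s_need_sync field, the toy_*
helpers and the array-based walk stand in for whatever the real superblock
list and flag would be.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sb {
	int read_only;
	int s_dirt;
	int s_need_sync;   /* step 1: hypothetical new per-superblock flag */
	int sync_fs_calls; /* how many times the sync hook has run */
};

void toy_write_super(struct toy_sb *sb)
{
	sb->s_dirt = 0;
}

void toy_sync_fs(struct toy_sb *sb)
{
	sb->sync_fs_calls++;
}

void toy_sync_all(struct toy_sb *sbs, size_t n)
{
	size_t i;

	assert(sbs != NULL || n == 0);

	/* Step 2: flag every read-write superblock before starting. */
	for (i = 0; i < n; i++)
		if (!sbs[i].read_only)
			sbs[i].s_need_sync = 1;

	/* Step 3: the walk keys off the flag, not off s_dirt. */
	for (i = 0; i < n; i++) {
		if (!sbs[i].s_need_sync)
			continue;
		sbs[i].s_need_sync = 0;
		if (sbs[i].s_dirt)
			toy_write_super(&sbs[i]);
		toy_sync_fs(&sbs[i]);
	}
}
```

In this model a read-write superblock gets its sync hook called even when an
earlier write_super has already cleared s_dirt, while read-only superblocks
are left alone.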

2002-12-06 19:26:54

by Chris Mason

Subject: Re: [patch] fix the ext3 data=journal unmount bug

On Fri, 2002-12-06 at 14:12, Andrew Morton wrote:
>
> It won't. There isn't really a sane way of doing this properly unless
> we do something like:
>
> 1) Add a new flag to the superblock
> 2) Set that flag against all r/w superblocks before starting the sync
> 3) Use that flag inside the superblock walk.
>
> That would provide a reasonable solution, but I don't believe we
> need to go to those lengths in 2.4, do you?

Grin, I'm partial to changing sync_supers to allow the FS to leave
s_dirt set in its write_super call.

I see what ext3 gains from your current patch in the unmount case, but
the sync case is really unchanged because of interaction with kupdate.

Other filesystems trying to use the sync_fs() call might think adding
one is enough to always get called on sync, and I think that will lead
to unreliable sync implementations.

-chris

2002-12-06 19:37:53

by Andrew Morton

Subject: Re: [patch] fix the ext3 data=journal unmount bug

Chris Mason wrote:
>
> On Fri, 2002-12-06 at 14:12, Andrew Morton wrote:
> >
> > It won't. There isn't really a sane way of doing this properly unless
> > we do something like:
> >
> > 1) Add a new flag to the superblock
> > 2) Set that flag against all r/w superblocks before starting the sync
> > 3) Use that flag inside the superblock walk.
> >
> > That would provide a reasonable solution, but I don't believe we
> > need to go to those lengths in 2.4, do you?
>
> Grin, I'm partial to changing sync_supers to allow the FS to leave
> s_dirt set in its write_super call.

That doesn't sound like a simplification ;)

> I see what ext3 gains from your current patch in the unmount case, but
> the sync case is really unchanged because of interaction with kupdate.

True. And I'd like /bin/sync to _really_ be synchronous because
I use `reboot -f' all the time. Even though SuS-or-POSIX say that
sync() only needs to _start_ the IO. That's rather silly.

> Other filesystems trying to use the sync_fs() call might think adding
> one is enough to always get called on sync, and I think that will lead
> to unreliable sync implementations.

OK. How about we do it that way in 2.5 and then look at a backport?
With the steps I propose above, filesystems which don't implement
sync_fs would see no changes, so it should be safe.

2002-12-06 19:48:35

by Stephen C. Tweedie

Subject: Re: [patch] fix the ext3 data=journal unmount bug

Hi,

On Fri, 2002-12-06 at 19:45, Andrew Morton wrote:

> > I see what ext3 gains from your current patch in the unmount case, but
> > the sync case is really unchanged because of interaction with kupdate.
>
> True. And I'd like /bin/sync to _really_ be synchronous because
> I use `reboot -f' all the time. Even though SuS-or-POSIX say that
> sync() only needs to _start_ the IO. That's rather silly.

But at the same time I'd like to avoid sync becoming serialised on its
writes. If you've got a lot of filesystems mounted, doing each
filesystem's sync sequentially and synchronously is going to be a lot
slower than allowing async syncs. In other words, for sync(2) we really
want async commit submission followed by a synchronous wait for
completion. And that's probably more churn than I'd like to see at this
stage for 2.4.

Cheers,
Stephen
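
The "async commit submission followed by a synchronous wait" pattern can be
sketched as a two-pass loop. This is a user-space illustration only: the
toy_* names and the TOY_MAX_FS limit are hypothetical, and each filesystem's
journal is reduced to two counters.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FS 16 /* arbitrary cap for this sketch */

struct toy_fs {
	unsigned int next_tid;      /* id the next started commit receives */
	unsigned int committed_tid; /* highest commit id known to be durable */
};

/* Submit a commit and return its id without waiting. */
unsigned int toy_start_commit(struct toy_fs *fs)
{
	return fs->next_tid++;
}

/* Pretend to block until commit `tid` is durable. */
void toy_wait_commit(struct toy_fs *fs, unsigned int tid)
{
	assert(tid < fs->next_tid);
	if (fs->committed_tid < tid)
		fs->committed_tid = tid;
}

/* Returns 0 on success, -1 if there are more filesystems than the cap. */
int toy_sync_two_pass(struct toy_fs *fss, size_t n)
{
	unsigned int targets[TOY_MAX_FS];
	size_t i;

	if (n > TOY_MAX_FS)
		return -1;

	/* Pass 1: submit a commit on every filesystem, waiting on none. */
	for (i = 0; i < n; i++)
		targets[i] = toy_start_commit(&fss[i]);

	/* Pass 2: only now wait, so the commits could overlap. */
	for (i = 0; i < n; i++)
		toy_wait_commit(&fss[i], targets[i]);

	return 0;
}
```

The point of the shape is that no filesystem's wait delays another
filesystem's submission; in this toy model the overlap itself is not
simulated, only the ordering of the two passes.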

2002-12-06 20:26:28

by Chris Mason

Subject: Re: [patch] fix the ext3 data=journal unmount bug

On Fri, 2002-12-06 at 14:57, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2002-12-06 at 19:45, Andrew Morton wrote:
>
> > > I see what ext3 gains from your current patch in the unmount case, but
> > > the sync case is really unchanged because of interaction with kupdate.
> >
> > True. And I'd like /bin/sync to _really_ be synchronous because
> > I use `reboot -f' all the time. Even though SuS-or-POSIX say that
> > sync() only needs to _start_ the IO. That's rather silly.
>
> But at the same time I'd like to avoid sync becoming serialised on its
> writes. If you've got a lot of filesystems mounted, doing each
> filesystem's sync sequentially and synchronously is going to be a lot
> slower than allowing async syncs. In other words, for sync(2) we really
> want async commit submission followed by a synchronous wait for
> completion. And that's probably more churn than I'd like to see at this
> stage for 2.4.

The bulk of the sync(2) will be async though, since most of the io is
actually writing dirty data buffers out. We already do that in two
stages.

For 2.5, if an FS really wanted a two stage sync for its non-data
pages, it could put a locked page onto one of the lists that
sync_inodes(1) will catch and wait for. There are lots of other ways of
course, but there's already a framework for sync to wait on pages.

For 2.4, an FS async sync function could toss a locked buffer head into
the locked buffer lru list, and unlock when the commit is complete.

Neither idea is as clean as a real aio interface, but all the
infrastructure is already there.

Also, I think async sync support is a different feature than allowing
the FS to know the difference between kupdate periodic writes and
syncs/unmounts.

-chris

2002-12-06 21:13:25

by Stephen C. Tweedie

Subject: Re: [patch] fix the ext3 data=journal unmount bug

Hi,

On Fri, 2002-12-06 at 20:34, Chris Mason wrote:

> The bulk of the sync(2) will be async though, since most of the io is
> actually writing dirty data buffers out. We already do that in two
> stages.

Not with data journaling. That's the whole point: the VFS assumes too
much about where the data is being written, when.

> For 2.5, if an FS really wanted a two stage sync for its non-data
> pages

But it's data that is the problem. For sync() semantics,
data-journaling only requires that the pages have hit the journal. For
umount, it is critical that we complete the final writeback before
destroying the inode lists.

Cheers,
Stephen

2002-12-06 21:59:12

by Chris Mason

Subject: Re: [patch] fix the ext3 data=journal unmount bug

On Fri, 2002-12-06 at 16:22, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2002-12-06 at 20:34, Chris Mason wrote:
>
> > The bulk of the sync(2) will be async though, since most of the io is
> > actually writing dirty data buffers out. We already do that in two
> > stages.
>
> Not with data journaling. That's the whole point: the VFS assumes too
> much about where the data is being written, when.

But with data journaling, there's a limited amount of data pending that
needs to be sent to the log. It isn't like the data pages in
data=writeback, where there might be gigs and gigs worth of pages.

Most data=journal setups are for synchronous writes, where the
transactions will be small, so sending things to the log won't take
long.

>
> > For 2.5, if an FS really wanted a two stage sync for its non-data
> > pages
>
> But it's data that is the problem. For sync() semantics,
> data-journaling only requires that the pages have hit the journal. For
> umount, it is critical that we complete the final writeback before
> destroying the inode lists.

Well, I was trying to find a word for pages involved w/the journal and
failed ;-) My only real point is we can add an async sync without
changing the way supers get processed.

It seems like a natural progression to start adding journal address
spaces to deal with this instead of extra stuff in the super code, where
locking and super flag semantics make things sticky.

-chris

2002-12-06 22:15:44

by Stephen C. Tweedie

Subject: Re: [patch] fix the ext3 data=journal unmount bug

Hi,

On Fri, 2002-12-06 at 22:07, Chris Mason wrote:

> But with data journaling, there's a limited amount of data pending that
> needs to be sent to the log. It isn't like the data pages in
> data=writeback, where there might be gigs and gigs worth of pages.

That's true right now, but it may not be for other cases. For example,
a phase-tree type of filesystem may have huge amounts of data
accumulated behind the commit, and any filesystem doing deferred block
allocation will also have a lot of data which needs to be synced
intelligently, not just by the VM walking the dirty buffer lists itself.

> It seems like a natural progression to start adding journal address
> spaces to deal with this instead of extra stuff in the super code, where
> locking and super flag semantics make things sticky.

Absolutely, and I think an entirely separate ->sync_fs method is the way
to go, as it doesn't assume any specific semantics about what data
structure is getting locked in what fashion.

--Stephen

2002-12-07 14:47:07

by Matthias Andree

Subject: Re: [patch] fix the ext3 data=journal unmount bug

On Thu, 05 Dec 2002, Andrew Morton wrote:

> fs/buffer.c | 6 ++++--
> fs/ext3/super.c | 25 +++++++++++++------------
> fs/super.c | 6 +++++-
> include/linux/fs.h | 3 ++-
> 4 files changed, 24 insertions(+), 16 deletions(-)
>
> --- linux-akpm/fs/buffer.c~sync_fs Thu Dec 5 21:33:56 2002
> +++ linux-akpm-akpm/fs/buffer.c Thu Dec 5 21:33:56 2002
> @@ -327,6 +327,8 @@ int fsync_super(struct super_block *sb)
>  	lock_super(sb);
>  	if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
>  		sb->s_op->write_super(sb);
> +	if (sb->s_op && sb->s_op->sync_fs)
> +		sb->s_op->sync_fs(sb);
>  	unlock_super(sb);
>  	unlock_kernel();

Against what kernel version is this?