2009-11-06 22:33:10

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

commit a71ce8c6c9bf269b192f352ea555217815cf027e updated
ext4_statfs() to update the on-disk superblock counters,
but modified this buffer directly without any journaling of
the change. This is one of the accesses that was causing
the crc errors in journal replay as seen in kernel.org
bugzilla #14354.

Signed-off-by: Eric Sandeen <[email protected]>
---


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 08370ae..b02daf6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3605,6 +3605,8 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
u64 fsid;
+ handle_t *handle;
+ int err = 0;

if (test_opt(sb, MINIX_DF)) {
sbi->s_overhead_last = 0;
@@ -3650,20 +3652,33 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
- ext4_free_blocks_count_set(es, buf->f_bfree);
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
- es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT4_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
le64_to_cpup((void *)es->s_uuid + sizeof(u64));
buf->f_fsid.val[0] = fsid & 0xFFFFFFFFUL;
buf->f_fsid.val[1] = (fsid >> 32) & 0xFFFFFFFFUL;

- return 0;
+ handle = ext4_journal_start_sb(sb, 1);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto out;
+ }
+ err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
+ if (err)
+ goto journal_stop;
+ es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
+ ext4_free_blocks_count_set(es, buf->f_bfree);
+ err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+
+journal_stop:
+ ext4_journal_stop(handle);
+out:
+ return err;
}

/* Helper function for writing quotas on sync - we need to start transaction



2009-11-07 00:26:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On 2009-11-06, at 15:33, Eric Sandeen wrote:
> commit a71ce8c6c9bf269b192f352ea555217815cf027e updated
> ext4_statfs() to update the on-disk superblock counters,
> but modified this buffer directly without any journaling of
> the change. This is one of the accesses that was causing the crc
> errors in journal replay as seen in kernel.org bugzilla #14354.
>
> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
>
> @@ -3650,20 +3652,33 @@ static int ext4_statfs(struct dentry
> *dentry, struct kstatfs *buf)
> + handle = ext4_journal_start_sb(sb, 1);
> + if (IS_ERR(handle)) {
> + err = PTR_ERR(handle);
> + goto out;
> + }
> + err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
> + if (err)
> + goto journal_stop;
> + es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
> + ext4_free_blocks_count_set(es, buf->f_bfree);
> + err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);


I admit to being the instigator of this change.

The intention is that we want to update the on-disk superblock block/
inode
counters from the per-cpu data periodically, since they are never
updated
anymore (only the group summaries are updated, to avoid contention).
However,
this isn't critical work, since it is only useful for read-only e2fsck
not
reporting spurious errors on the filesystem and dumpe2fs/debugfs
having some
chance at reporting a reasonable value for the filesystem space usage.

Starting a transaction as part of statfs is really counter-productive
to making
that code efficient, which was the whole point of the original patch
to remove
the per-call "overhead" calculation.

The intention was that the in-memory superblock would be updated
whenever statfs
is called (this doesn't cost anything, since we've already computed
the value
for statfs), and if the superblock is written to disk for some other
reason they
will go along for the ride.

If the choice is between adding a proper transaction here, or not
doing this at
all, I'd rather just not do it at all. Of course, I'd like to work
out some
kind of compromise, like only updating the superblock when there is
already a shadow BH that is being used to write to the journal, or
similar.

If there is a desire to keep a transaction here and update the
superblock
counters, it _definitely_ doesn't need to be done on every statfs, but
at most
once every 30 seconds or whatever.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-07 01:08:47

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

Andreas Dilger wrote:
> On 2009-11-06, at 15:33, Eric Sandeen wrote:
>> commit a71ce8c6c9bf269b192f352ea555217815cf027e updated
>> ext4_statfs() to update the on-disk superblock counters, but
>> modified this buffer directly without any journaling of the change.
>> This is one of the accesses that was causing the crc errors in
>> journal replay as seen in kernel.org bugzilla #14354.
>>
>> Signed-off-by: Eric Sandeen <[email protected]> ---

...

> I admit to being the instigator of this change.
>
> The intention is that we want to update the on-disk superblock
> block/inode counters from the per-cpu data periodically, since they
> are never updated anymore (only the group summaries are updated, to
> avoid contention). However, this isn't critical work, since it is
> only useful for read-only e2fsck not reporting spurious errors on the
> filesystem and dumpe2fs/debugfs having some chance at reporting a
> reasonable value for the filesystem space usage.
>
> Starting a transaction as part of statfs is really counter-productive
> to making that code efficient, which was the whole point of the
> original patch to remove the per-call "overhead" calculation.
>
> The intention was that the in-memory superblock would be updated
> whenever statfs is called (this doesn't cost anything, since we've
> already computed the value for statfs), and if the superblock is
> written to disk for some other reason they will go along for the
> ride.
>
> If the choice is between adding a proper transaction here, or not
> doing this at all, I'd rather just not do it at all. Of course, I'd
> like to work out some kind of compromise, like only updating the
> superblock when there is already a shadow BH that is being used to
> write to the journal, or similar.
>
> If there is a desire to keep a transaction here and update the
> superblock counters, it _definitely_ doesn't need to be done on every
> statfs, but at most once every 30 seconds or whatever.

You know, I think I thought about all that, and I wrote the patch anyway
somehow; blame a late friday evening for that one. :)

I'll think of a better route to take.

Thanks,
-Eric

2009-11-08 21:48:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
> If the choice is between adding a proper transaction here, or not
> doing this at all, I'd rather just not do it at all. Of course, I'd
> like to work out some kind of compromise, like only updating the
> superblock when there is already a shadow BH that is being used to
> write to the journal, or similar.

In practice, the superblock is never going to modified in normal
operations, unless a resize happens to be happening. Since we already
force the superblock summary counters to be correct during an unmount
or file system freeze, we really only need this so that it's correct
after a file system crash.

I don't think people generally end up calling statfs() all that
frequently, so it's not clear how much adding a 30 second throttle
would help. Maybe we should just not bother trying to update the
superblock at all on a statfs()?

Hmm...

- Ted

2009-11-08 22:09:41

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

Theodore Tso wrote:
> On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
>> If the choice is between adding a proper transaction here, or not
>> doing this at all, I'd rather just not do it at all. Of course, I'd
>> like to work out some kind of compromise, like only updating the
>> superblock when there is already a shadow BH that is being used to
>> write to the journal, or similar.
>
> In practice, the superblock is never going to modified in normal
> operations, unless a resize happens to be happening. Since we already
> force the superblock summary counters to be correct during an unmount
> or file system freeze, we really only need this so that it's correct
> after a file system crash.
>
> I don't think people generally end up calling statfs() all that
> frequently, so it's not clear how much adding a 30 second throttle
> would help. Maybe we should just not bother trying to update the
> superblock at all on a statfs()?

for now maybe that's better....

But don't we journal the superblock sometimes, not others ... for
example write_super -> ext4_write_super -> ext4_commit_super does no
journaling of superblock modifications. ext4_orphan_add, however, does.
This would likely lead to trouble w/ the debugging patch ... though I
didn't see it ... ?

So I was premature w/ this patch, I think.

Maybe we could unconditionally do the copy-out in
jbd2_journal_write_metadata_buffer() ...?

-Eric

> Hmm...
>
> - Ted


2009-11-09 04:41:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On 2009-11-08, at 14:48, Theodore Tso wrote:
> On Fri, Nov 06, 2009 at 05:26:51PM -0700, Andreas Dilger wrote:
>> If the choice is between adding a proper transaction here, or not
>> doing this at all, I'd rather just not do it at all. Of course, I'd
>> like to work out some kind of compromise, like only updating the
>> superblock when there is already a shadow BH that is being used to
>> write to the journal, or similar.
>
> In practice, the superblock is never going to modified in normal
> operations, unless a resize happens to be happening.

Actually, Eric had a important case where the superblock is modified
is whenever any file is added to the orphan list, so this basically
happens all the time. When the orphan code was added, it made sense
to put the head of the orphan list on the superblock because it was
updated for every truncate to change the free block counters.

> Since we already force the superblock summary counters to be correct
> during an unmount or file system freeze, we really only need this so
> that it's correct after a file system crash.
>
> I don't think people generally end up calling statfs() all that
> frequently, so it's not clear how much adding a 30 second throttle
> would help. Maybe we should just not bother trying to update the
> superblock at all on a statfs()?


The reason we added this was for running a read-only e2fsck on a
filesystem without getting spurious errors just because the superblock
summaries were incorrect. The other alternative is to change e2fsck
so that it doesn't consider just a block/inode summary an error.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-09 12:53:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote:
>
> But don't we journal the superblock sometimes, not others ... for
> example write_super -> ext4_write_super -> ext4_commit_super does no
> journaling of superblock modifications. ext4_orphan_add, however, does.
> This would likely lead to trouble w/ the debugging patch ... though I
> didn't see it ... ?

Ah, I had forgotten about ext4_orphan_add(); that is indeed the one
place where we would be updating the super block under normal
operations, besides online-resize.

I've been looking at the write_super() paths, and from what I can tell
it's only used in two places. The generic fsync() handler,
file_fsync(), which we do't use, and sync_supers(), which will indeed
call write_super() -> ext4_write_super() if sb->s_dirt is set. That
led me to examine the places where we set s_dirt, and it's in a lot of
places where we're no longer modifying the superblock any more, but
we're still setting sb->s_dirt. I don't know why you didn't see
problems with the debugging patch; the only thing I can think of is
that since the actual superblock update is deferred to a
timer-triggered callback, you were getting consistently lucky ---
which is hard for me to believe, but I don't have a better suggestion.

What I think we do need to do is eliminate all of the places where we
set sb->s_dirt, and if we need to update the superblock, we do it
ourselves, under journaling control.

That leaves places which call ext4_commit_super() directly, which is
at mount and unmount time (which should be OK, as long as it's before
or after journalling is active) and when we freeze the filesystem,
which might be OK, but we need to take a careful look at it.

- Ted

2009-11-09 17:55:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On 2009-11-09, at 05:53, Theodore Tso wrote:
> On Sun, Nov 08, 2009 at 04:09:40PM -0600, Eric Sandeen wrote:
>> But don't we journal the superblock sometimes, not others ... for
>> example write_super -> ext4_write_super -> ext4_commit_super does no
>> journaling of superblock modifications. ext4_orphan_add, however,
>> does.
>> This would likely lead to trouble w/ the debugging patch ... though I
>> didn't see it ... ?
>
> Ah, I had forgotten about ext4_orphan_add(); that is indeed the one
> place where we would be updating the super block under normal
> operations, besides online-resize.
>
> I've been looking at the write_super() paths, and from what I can tell
> it's only used in two places. The generic fsync() handler,
> file_fsync(), which we do't use, and sync_supers(), which will indeed
> call write_super() -> ext4_write_super() if sb->s_dirt is set. That
> led me to examine the places where we set s_dirt, and it's in a lot of
> places where we're no longer modifying the superblock any more, but
> we're still setting sb->s_dirt. I don't know why you didn't see
> problems with the debugging patch; the only thing I can think of is
> that since the actual superblock update is deferred to a
> timer-triggered callback, you were getting consistently lucky ---
> which is hard for me to believe, but I don't have a better suggestion.

I suspect this is because the only thing that changes in the superblock
these days is the orphan list, so out-of-order writes to the superblock
will at worst result in a few entries added/missing from the orphan
list.
I do recall that there are "inodes from a corrupt orphan list found"
messages seen occasionally during full e2fsck runs, but it has never
been important enough to investigate.

> What I think we do need to do is eliminate all of the places where we
> set sb->s_dirt, and if we need to update the superblock, we do it
> ourselves, under journaling control.

We have to ensure that writeout of the superblock is still being done
correctly during non-journal mode operation.

> That leaves places which call ext4_commit_super() directly, which is
> at mount and unmount time (which should be OK, as long as it's before
> or after journalling is active) and when we freeze the filesystem,
> which might be OK, but we need to take a careful look at it.


We also write out the superblock directly in ext4_error(), so that the
EXT4_ERROR_FS flag is written to disk (if at all possible) rather than
putting the superblock into a journal transaction that will not be
replayed (due to the transaction never committing after the journal is
aborted or the node panics). Since that will be in the last transaction
anyways (unless errors=continue is used) I don't see it as a major
problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-15 03:29:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On Sun, Nov 08, 2009 at 09:41:28PM -0700, Andreas Dilger wrote:
>
> The reason we added this was for running a read-only e2fsck on a
> filesystem without getting spurious errors just because the superblock
> summaries were incorrect. The other alternative is to change e2fsck
> so that it doesn't consider just a block/inode summary an error.
>

We've been doing this for a while --- e2fsprogs 1.34, since April
2003. In e2fsck/super.c:check_super_block():

/*
* Update the global counts from the block group counts. This
* is needed for an experimental patch which eliminates
* locking the entire filesystem when allocating blocks or
* inodes; if the filesystem is not unmounted cleanly, the
* global counts may not be accurate.
*/
if ((free_blocks != sb->s_free_blocks_count) ||
(free_inodes != sb->s_free_inodes_count)) {
if (ctx->options & E2F_OPT_READONLY)
ext2fs_unmark_valid(fs);
else {
sb->s_free_blocks_count = free_blocks;
sb->s_free_inodes_count = free_inodes;
ext2fs_mark_super_dirty(fs);
}
}

I suppose I should edit out the bit about "experimental patch", which
might have been true in 2003, but it's certainly not true any more.

Hence, I think it's safe to eliminate the updates in ext4_statfs()
altogether.

- Ted

commit 48efef97e03811e4492607ed7a7aefb35e2c0c00
Author: Theodore Ts'o <[email protected]>
Date: Sat Nov 14 21:37:52 2009 -0500

ext4: Don't update the superblock in ext4_statfs()

commit a71ce8c6c9bf269b192f352ea555217815cf027e updated ext4_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change. This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8662b2e..3cd519b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3668,13 +3668,11 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
- ext4_free_blocks_count_set(es, buf->f_bfree);
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
- es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
buf->f_namelen = EXT4_NAME_LEN;
fsid = le64_to_cpup((void *)es->s_uuid) ^
le64_to_cpup((void *)es->s_uuid + sizeof(u64));

2009-11-16 23:38:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On 2009-11-14, at 19:29, Theodore Tso wrote:
> On Sun, Nov 08, 2009 at 09:41:28PM -0700, Andreas Dilger wrote:
>>
>> The reason we added this was for running a read-only e2fsck on a
>> filesystem without getting spurious errors just because the
>> superblock
>> summaries were incorrect. The other alternative is to change e2fsck
>> so that it doesn't consider just a block/inode summary an error.
>
> We've been doing this for a while --- e2fsprogs 1.34, since April
> 2003. In e2fsck/super.c:check_super_block():
>
> if ((free_blocks != sb->s_free_blocks_count) ||
> (free_inodes != sb->s_free_inodes_count)) {
> if (ctx->options & E2F_OPT_READONLY)
> ext2fs_unmark_valid(fs);
> else {

The problem is that if you do "e2fsck -fn" it will still report this
as an error in the filesystem, even though "e2fsck -fp" will silently
fix it. I just repeated this test and still see errors, even 30
minutes after a file was modified, even after multiple syncs.

[adilger@webber ~]$ sync; sleep 10; sync
[adilger@webber ~]$ e2fsck -fn /dev/dm-0
e2fsck 1.41.6.sun1 (30-May-2009)
Warning! /dev/dm-0 is mounted.
Warning: skipping journal recovery because doing a read-only
filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 884739 has zero dtime. Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -1784645
Fix? no

Inode bitmap differences: -884739
Fix? no


home: ********** WARNING: Filesystem still has errors **********

home: 72709/2621440 files (17.6% non-contiguous), 4730591/5242880 blocks
[adilger@webber ~]$ echo $?
4

> Hence, I think it's safe to eliminate the updates in ext4_statfs()
> altogether.


Not yet. I'd be happy if the "-n" e2fsck treated these fields in the
same
way as it does for the real e2fsck.

The other thing that comes to mind is that we don't recover the journal
for a read-only e2fsck, but we DO recover it on a read-only mount
seems inconsistent. It wouldn't be hard to have e2fsck -n read the
journal and
persistently cache the journal blocks in its internal cache (i.e. flag
them so they can't be discarded from cache) before it runs the rest of
the
e2fsck.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-19 20:32:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On Mon, Nov 16, 2009 at 03:38:16PM -0800, Andreas Dilger wrote:
>
> The problem is that if you do "e2fsck -fn" it will still report this
> as an error in the filesystem, even though "e2fsck -fp" will
> silently fix it. I just repeated this test and still see errors,
> even 30 minutes after a file was modified, even after multiple
> syncs.

Sure, but running e2fsck -fn on a mounted file system will always
potentially show problems. In fact, in your demonstration:

> [adilger@webber ~]$ sync; sleep 10; sync
> [adilger@webber ~]$ e2fsck -fn /dev/dm-0
> e2fsck 1.41.6.sun1 (30-May-2009)
> Warning! /dev/dm-0 is mounted.
> Warning: skipping journal recovery because doing a read-only
> filesystem check.
...
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 884739 has zero dtime. Fix? no
...
> Pass 5: Checking group summary information
> Block bitmap differences: -1784645
> Fix? no
>
> Inode bitmap differences: -884739
> Fix? no

.... neither of these errors would be fixed by the hacking of updating
the summary free blocks and inode counts.

If the concern is what happens when someone runs e2fsck -fn on a
mountd file system, I have a very hard time getting excited about
that....

> The other thing that comes to mind is that we don't recover the journal
> for a read-only e2fsck, but we DO recover it on a read-only mount
> seems inconsistent. It wouldn't be hard to have e2fsck -n read the
> journal and
> persistently cache the journal blocks in its internal cache (i.e. flag
> them so they can't be discarded from cache) before it runs the rest
> of the
> e2fsck.

Eventually it would be nice if we did the same thing in both kernel
and userspace when doing a read-only mount/check: build a redirection
table that maps specific physical blocks to the block in the journal,
and whenever the system tries to access a specific physical block, we
look up the proper block to use instead in the redirection block.

The one tricky bit about doing this in the kernel is that we would
still have to replay the journal in the case of the read-only root.
Why? Because otherwise older e2fsck's would get confused and replay
the journal, and that would lead to some potentially serious
confusion. Even if we fix this in future versions of e2fsck, we still
need to be careful dealing with remounting a r/o filesystem to be
read/write, especially in the journal=data mode.

The simple way of handling journaled data blocks is to hack the
bmap() function to use the redirection block, but the problem with
doing that is the journal block will be left in the buffer heads in
the page cache. If the file system is remounted r/w without first
flushing these buffer heads, future attempts to modify these pages in
the page cache could result in a random block in the journalling
getting corrupted by an update, instead of updating the proper final
location on disk for that data block.

If we have someone who is at least some basic experience in kernel
coding, but and an entry-level project getting involved with ext4,
this would be an ideal, self-contained thing to try doing. I'd
suggest implementing it in userspace first, using the userspace/kernel
API framework that allows e2fsck/recovery.c to be roughly kept in sync
with fs/jbd[2]/recovery.c, and avoiding the hair of r/o roots by
always replaying the journal in the case of the root file system.
Anyone interested? If so, let me know...

- Ted

2009-11-23 11:57:40

by Duane Griffin

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

2009/11/19 <[email protected]>:
> On Mon, Nov 16, 2009 at 03:38:16PM -0800, Andreas Dilger wrote:
>> The other thing that comes to mind is that we don't recover the journal
>> for a read-only e2fsck, but we DO recover it on a read-only mount
>> seems inconsistent. It wouldn't be hard to have e2fsck -n read the
>> journal and
>> persistently cache the journal blocks in its internal cache (i.e. flag
>> them so they can't be discarded from cache) before it runs the rest
>> of the
>> e2fsck.
>
> Eventually it would be nice if we did the same thing in both kernel
> and userspace when doing a read-only mount/check: build a redirection
> table that maps specific physical blocks to the block in the journal,
> and whenever the system tries to access a specific physical block, we
> look up the proper block to use instead in the redirection block.

Unfortunately you can't just blindly give back the journalled block:
it may have been escaped. So you need to read in the block from the
journal, unescape it if required, then give it back.

> The one tricky bit about doing this in the kernel is that we would
> still have to replay the journal in the case of the read-only root.
> Why? Because otherwise older e2fsck's would get confused and replay
> the journal, and that would lead to some potentially serious
> confusion. Even if we fix this in future versions of e2fsck, we still
> need to be careful dealing with remounting a r/o filesystem to be
> read/write, especially in the journal=data mode.

Hmm. The e2fsck confusion is an interesting wrinkle.

> The simple way of handling journaled data blocks is to hack the
> bmap() function to use the redirection block, but the problem with
> doing that is the journal block will be left in the buffer heads in
> the page cache. If the file system is remounted r/w without first
> flushing these buffer heads, future attempts to modify these pages in
> the page cache could result in a random block in the journalling
> getting corrupted by an update, instead of updating the proper final
> location on disk for that data block.

Yes, they certainly need to be flushed.

> If we have someone who is at least some basic experience in kernel
> coding, but and an entry-level project getting involved with ext4,
> this would be an ideal, self-contained thing to try doing. I'd
> suggest implementing it in userspace first, using the userspace/kernel
> API framework that allows e2fsck/recovery.c to be roughly kept in sync
> with fs/jbd[2]/recovery.c, and avoiding the hair of r/o roots by
> always replaying the journal in the case of the root file system.
> Anyone interested? If so, let me know...

I am (still) interested in this. I'll have a look at the userspace
side of things.

> - Ted

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2009-11-23 14:27:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

On Mon, Nov 23, 2009 at 11:57:44AM +0000, Duane Griffin wrote:
> Unfortunately you can't just blindly give back the journalled block:
> it may have been escaped. So you need to read in the block from the
> journal, unescape it if required, then give it back.

Good point; this is going to make supporting a read-only mount that
doesn't replay the journal very difficult to implement for
data=journal mode, since it would mean intercepting the actual block
I/O read functions for data reads, which is outside of fs/ext4 in the
generic fs/ and mm/ functions. Doing it for metadata blocks will be
annoying, but at least it's all inside fs/ext4.

- Ted

2009-11-23 14:40:47

by Duane Griffin

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext4: journal superblock modifications in ext4_statfs()

2009/11/23 <[email protected]>:
> On Mon, Nov 23, 2009 at 11:57:44AM +0000, Duane Griffin wrote:
>> Unfortunately you can't just blindly give back the journalled block:
>> it may have been escaped. So you need to read in the block from the
>> journal, unescape it if required, then give it back.
>
> Good point; this is going to make supporting a read-only mount that
> doesn't replay the journal very difficult to implement for
> data=journal mode, since it would mean intercepting the actual block
> I/O read functions for data reads, which is outside of fs/ext4 in the
> generic fs/ and mm/ functions. Doing it for metadata blocks will be
> annoying, but at least it's all inside fs/ext4.

Yes, that is the sticking point I got to with my last attempt at this :(

I was kinda hoping that someone would have a bright idea that doesn't
involve hacking an IO completion callback into get_block_t...

> - Ted

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan