2008-03-04 18:52:02

by Josef Bacik

Subject: [RFC PATCH 1/1] add a jbd option to force an unclean journal state

Hello,

I'm fixing to go through and muck with some of the transaction handling stuff
in jbd, and I want a way to verify that I'm not screwing anything up in the
process; this is what I came up with. Basically this option would only be
used in the case where someone mounts an ext3 image or fs, does a specific IO
operation (create 100 files, write data to a few files, etc.), unmounts the
fs, and remounts so that jbd does its journal recovery, and then checks the
status of the fs to make sure it's exactly the way it's expected to be. I'm
not entirely sure how useful an option like this would be (or if I did it
right :) ), but I thought I'd throw it out there in case anybody thinks it
may be useful, and in case there is some case I'm missing, so I can fix it
and better make sure I don't mess anything up while doing stuff. Basically
this patch keeps us from resetting the journal's tail/transaction sequence
when we destroy the journal, so when we mount the fs again it will look like
we didn't unmount properly and recovery will occur. Any comments are much
appreciated,

Josef

Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c
+++ linux-2.6/fs/jbd/journal.c
@@ -1146,9 +1146,12 @@ void journal_destroy(journal_t *journal)
 	J_ASSERT(journal->j_checkpoint_transactions == NULL);
 	spin_unlock(&journal->j_list_lock);
 
-	/* We can now mark the journal as empty. */
-	journal->j_tail = 0;
-	journal->j_tail_sequence = ++journal->j_transaction_sequence;
+	if (!(journal->j_flags & JFS_UNCLEAN)) {
+		/* We can now mark the journal as empty. */
+		journal->j_tail = 0;
+		journal->j_tail_sequence = ++journal->j_transaction_sequence;
+	}
+
 	if (journal->j_sb_buffer) {
 		journal_update_superblock(journal, 1);
 		brelse(journal->j_sb_buffer);
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -825,6 +825,7 @@ struct journal_s
 #define JFS_FLUSHED	0x008	/* The journal superblock has been flushed */
 #define JFS_LOADED	0x010	/* The journal superblock has been loaded */
 #define JFS_BARRIER	0x020	/* Use IDE barriers */
+#define JFS_UNCLEAN	0x040	/* Don't clean up the journal on umount */
 
 /*
  * Function declarations for the journaling transaction and buffer
Index: linux-2.6/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.orig/fs/jbd/checkpoint.c
+++ linux-2.6/fs/jbd/checkpoint.c
@@ -423,10 +423,14 @@ int cleanup_journal_tail(journal_t *jour
 	} else if ((transaction = journal->j_running_transaction) != NULL) {
 		first_tid = transaction->t_tid;
 		blocknr = journal->j_head;
-	} else {
+	} else if (!(journal->j_flags & JFS_UNCLEAN)) {
 		first_tid = journal->j_transaction_sequence;
 		blocknr = journal->j_head;
+	} else {
+		first_tid = journal->j_tail_sequence;
+		blocknr = journal->j_tail;
 	}
+
 	spin_unlock(&journal->j_list_lock);
 	J_ASSERT(blocknr != 0);
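
For illustration only, here is a minimal sketch of how the flag might
actually get set at mount time. The option name and the ext3 wiring are made
up (an Opt_test_unclean token and an EXT3_MOUNT_TEST_UNCLEAN bit would have
to be defined); none of this is part of the patch above:

/* Hypothetical: in ext3's parse_options(), recognize a debug-only option. */
	case Opt_test_unclean:
		set_opt(sbi->s_mount_opt, TEST_UNCLEAN);
		break;

/* Hypothetical: in ext3_load_journal(), once the journal is loaded, leave
 * the journal looking dirty at umount so recovery runs on remount. */
	if (test_opt(sb, TEST_UNCLEAN))
		journal->j_flags |= JFS_UNCLEAN;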



2008-03-04 19:01:12

by Jan Kara

Subject: Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state

Hi,

On Tue 04-03-08 13:39:41, Josef Bacik wrote:
> I'm fixing to go through and muck with some of the transaction handling
> stuff in jbd, and I want a way to verify that I'm not screwing anything up
> in the process; this is what I came up with. Basically this option would
> only be used in the case where someone mounts an ext3 image or fs, does a
> specific IO operation (create 100 files, write data to a few files, etc.),
> unmounts the fs, and remounts so that jbd does its journal recovery, and
> then checks the status of the fs to make sure it's exactly the way it's
> expected to be. I'm not entirely sure how useful an option like this would
> be (or if I did it right :) ), but I thought I'd throw it out there in
> case anybody thinks it may be useful, and in case there is some case I'm
> missing, so I can fix it and better make sure I don't mess anything up
> while doing stuff. Basically this patch keeps us from resetting the
> journal's tail/transaction sequence when we destroy the journal, so when
> we mount the fs again it will look like we didn't unmount properly and
> recovery will occur. Any comments are much appreciated,
Actually, there is a different way we've done checking like this (and I
think also a more useful one), at least for ext3. Basically you mounted a
filesystem with some timeout, and after the timeout the device was forced
read-only. Then you checked that the fs was consistent after journal replay.
I think Andrew had the patches somewhere...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-04 23:59:14

by Andrew Morton

Subject: Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state

On Tue, 4 Mar 2008 20:01:09 +0100
Jan Kara <[email protected]> wrote:

> Hi,
>
> On Tue 04-03-08 13:39:41, Josef Bacik wrote:
> > I'm fixing to go through and muck with some of the transaction handling
> > stuff in jbd, and I want a way to verify that I'm not screwing anything
> > up in the process; this is what I came up with. Basically this option
> > would only be used in the case where someone mounts an ext3 image or fs,
> > does a specific IO operation (create 100 files, write data to a few
> > files, etc.), unmounts the fs, and remounts so that jbd does its journal
> > recovery, and then checks the status of the fs to make sure it's exactly
> > the way it's expected to be. I'm not entirely sure how useful an option
> > like this would be (or if I did it right :) ), but I thought I'd throw
> > it out there in case anybody thinks it may be useful, and in case there
> > is some case I'm missing, so I can fix it and better make sure I don't
> > mess anything up while doing stuff. Basically this patch keeps us from
> > resetting the journal's tail/transaction sequence when we destroy the
> > journal, so when we mount the fs again it will look like we didn't
> > unmount properly and recovery will occur. Any comments are much
> > appreciated,
> Actually, there is a different way we've done checking like this (and I
> think also a more useful one), at least for ext3. Basically you mounted a
> filesystem with some timeout, and after the timeout the device was forced
> read-only. Then you checked that the fs was consistent after journal
> replay. I think Andrew had the patches somewhere...

About a billion years ago...

But the idea was (I think) good:

- mount the filesystem with `-o ro_after=100'

- the fs arms a timer to go off in 100 seconds

- now you start running some filesystem stress test

- the timer goes off. At timer-interrupt time, flags are set which cause
the low-level driver layer to start silently ignoring all writes to the
device which backs the filesystem.

This simulates a crash or poweroff.

- Now up in userspace we

- kill off the stresstest
- unmount the fs
- mount the fs (to run recovery)
- unmount the fs
- fsck it
- mount the fs
- check the data content of the files which the stresstest was writing:
look for uninitialised blocks, incorrect data, etc.
- unmount the fs

- start it all again.


So it's 100% scriptable and can be left running overnight, etc. It found
quite a few problems with ext3/jbd recovery which I doubt could be found by
other means. This was 6-7 years ago and I'd expect that new recovery bugs
have crept in since then which it can expose.

I think we should implement this in a formal, mergeable fashion, as there
are numerous filesystems which could and should use this sort of testing
infrastructure.
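
Something like this for the kernel half, as a rough sketch only: the
ro_after parsing is assumed to have happened already, the names are made up,
and dev_set_rdonly() stands in for whatever hook makes the driver layer drop
writes (Andreas posts one implementation of such a hook later in this
thread):

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/timer.h>

static struct timer_list ro_after_timer;
static struct block_device *ro_after_bdev;

/* Timer handler: from here on, writes to the device are silently dropped,
 * simulating a crash or poweroff. */
static void ro_after_fire(unsigned long data)
{
	dev_set_rdonly(ro_after_bdev);
}

/* Called at mount time with the parsed ro_after value, in seconds. */
static void ro_after_arm(struct super_block *sb, unsigned long secs)
{
	ro_after_bdev = sb->s_bdev;
	setup_timer(&ro_after_timer, ro_after_fire, 0);
	mod_timer(&ro_after_timer, jiffies + secs * HZ);
}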


2008-03-05 02:46:21

by Eric Sandeen

Subject: Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state

Andrew Morton wrote:

> So it's 100% scriptable and can be left running overnight, etc. It found
> quite a few problems with ext3/jbd recovery which I doubt could be found by
> other means. This was 6-7 years ago and I'd expect that new recovery bugs
> have crept in since then which it can expose.
>
> I think we should implement this in a formal, mergeable fashion, as there
> are numerous filesystems which could and should use this sort of testing
> infrastructure.


FWIW, xfs has something vaguely similar: the XFS_IOC_GOINGDOWN ioctl, which
invokes xfs_fs_goingdown() and takes a few flags:

/*
 * Flags for going down operation
 */
#define XFS_FSOP_GOING_FLAGS_DEFAULT	0x0	/* going down */
#define XFS_FSOP_GOING_FLAGS_LOGFLUSH	0x1	/* flush log but not data */
#define XFS_FSOP_GOING_FLAGS_NOLOGFLUSH	0x2	/* don't flush log nor data */


but it ultimately calls xfs_force_shutdown(), which is sort of roughly
similar to ext3_abort()...

The xfs qa tests make use of this ioctl.
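
Driving it from userspace is just an ioctl on any fd in the filesystem; a
rough sketch of what the qa helper boils down to (the header location is the
one xfsprogs installs, error handling trimmed):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* XFS_IOC_GOINGDOWN and the flags */

int main(int argc, char **argv)
{
	uint32_t flags = XFS_FSOP_GOING_FLAGS_NOLOGFLUSH;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* any file/dir on the fs */
	if (fd < 0 || ioctl(fd, XFS_IOC_GOINGDOWN, &flags) < 0) {
		perror("XFS_IOC_GOINGDOWN");
		return 1;
	}
	return 0;
}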

-Eric


2008-03-05 07:34:49

by Andreas Dilger

Subject: Re: [RFC PATCH 1/1] add a jbd option to force an unclean journal state

On Mar 04, 2008 15:58 -0800, Andrew Morton wrote:
> - mount the filesystem with `-o ro_after=100'
>
> - the fs arms a timer to go off in 100 seconds
>
> - now you start running some filesystem stress test
>
> - the timer goes off. At timer-interrupt time, flags are set which cause
> the low-level driver layer to start silently ignoring all writes to the
> device which backs the filesystem.
>
> This simulates a crash or poweroff.
>
> - Now up in userspace we
>
> - kill off the stresstest
> - unmount the fs
> - mount the fs (to run recovery)
> - unmount the fs
> - fsck it
> - mount the fs
> - check the data content of the files which the stresstest was writing:
> look for uninitialised blocks, incorrect data, etc.
> - unmount the fs
>
> - start it all again.
>
>
> So it's 100% scriptable and can be left running overnight, etc. It found
> quite a few problems with ext3/jbd recovery which I doubt could be found by
> other means. This was 6-7 years ago and I'd expect that new recovery bugs
> have crept in since then which it can expose.
>
> I think we should implement this in a formal, mergeable fashion, as there
> are numerous filesystems which could and should use this sort of testing
> infrastructure.

We use a patch which is a distant ancestor of Andrew's original patch to do
very similar testing for Lustre + ext3, allowing us to simulate node
crashes. This patch is against 2.6.22, but the relevant code doesn't appear
to have changed significantly in newer kernels. YMMV.

The major difference between this code and Andrew's original code is
that this allows multiple devices to be turned read-only at one time
(e.g. ext3 filesystem + external journal), while the original code wasn't
very robust in that area.

There is no mechanism to enable this from userspace or the filesystem,
since there is Lustre code in the kernel that calls dev_set_rdonly()
on one or more devices when we hit a failure trigger, but adding the
timer code or a /proc or /sys entry for this would be easy.
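
For instance, a /proc trigger could look roughly like this (entirely
hypothetical and not part of the patch below; it uses the old write_proc
interface and lookup_bdev() to translate a path into a block device):

#include <linux/err.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <asm/uaccess.h>

/* "echo /dev/sdb > /proc/dev_rdonly" turns that device read-only. */
static int rdonly_write(struct file *file, const char __user *buf,
			unsigned long count, void *data)
{
	char path[64];
	struct block_device *bdev;

	if (count == 0 || count >= sizeof(path))
		return -EINVAL;
	if (copy_from_user(path, buf, count))
		return -EFAULT;
	path[count] = '\0';
	if (path[count - 1] == '\n')
		path[count - 1] = '\0';

	bdev = lookup_bdev(path);
	if (IS_ERR(bdev))
		return PTR_ERR(bdev);
	dev_set_rdonly(bdev);
	bdput(bdev);
	return count;
}

static int __init rdonly_proc_init(void)
{
	struct proc_dir_entry *p;

	p = create_proc_entry("dev_rdonly", 0200, NULL);
	if (!p)
		return -ENOMEM;
	p->write_proc = rdonly_write;
	return 0;
}
module_init(rdonly_proc_init);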

The device is reset only when the last reference to it is removed, in
kill_bdev(), so that there isn't a race where writes are re-enabled while
there are still dirty buffers outstanding, which would certainly corrupt the
filesystem. That's why dev_clear_rdonly() is not exported.

Signed-off-by: Andreas Dilger <[email protected]>

Index: linux-2.6.22.5/block/ll_rw_blk.c
===================================================================
--- linux-2.6.22.5.orig/block/ll_rw_blk.c 2007-08-22 17:23:54.000000000 -0600
+++ linux-2.6.22.5/block/ll_rw_blk.c 2008-02-21 01:07:16.000000000 -0700
@@ -3101,6 +3101,8 @@
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+int dev_check_rdonly(struct block_device *bdev);
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio: The bio describing the location in memory and on the device.
@@ -3185,6 +3187,12 @@ static inline void __generic_make_request
 
 		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
 			goto end_io;
+
+		if (unlikely(bio->bi_rw == WRITE &&
+			     dev_check_rdonly(bio->bi_bdev))) {
+			bio_endio(bio, bio->bi_size, 0);
+			break;
+		}
 
 		if (should_fail_request(bio))
 			goto end_io;
@@ -3850,6 +3858,100 @@ void swap_io_context(struct io_context
 	*ioc2 = temp;
 }
 EXPORT_SYMBOL(swap_io_context);
+
+/*
+ * Debug code for turning block devices "read-only" (will discard writes
+ * silently). This is for filesystem crash/recovery testing.
+ */
+struct deventry {
+	dev_t dev;
+	struct deventry *next;
+};
+
+static struct deventry *devlist = NULL;
+static spinlock_t devlock = SPIN_LOCK_UNLOCKED;
+
+int dev_check_rdonly(struct block_device *bdev)
+{
+	struct deventry *cur;
+
+	if (!bdev)
+		return 0;
+
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			spin_unlock(&devlock);
+			return 1;
+		}
+		cur = cur->next;
+	}
+	spin_unlock(&devlock);
+
+	return 0;
+}
+
+void dev_set_rdonly(struct block_device *bdev)
+{
+	struct deventry *newdev, *cur;
+
+	if (!bdev)
+		return;
+
+	newdev = kmalloc(sizeof(struct deventry), GFP_KERNEL);
+	if (!newdev)
+		return;
+
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			spin_unlock(&devlock);
+			kfree(newdev);
+			return;
+		}
+		cur = cur->next;
+	}
+	newdev->dev = bdev->bd_dev;
+	newdev->next = devlist;
+	devlist = newdev;
+	spin_unlock(&devlock);
+
+	printk(KERN_WARNING "Turning device %s (%#x) read-only\n",
+	       bdev->bd_disk ? bdev->bd_disk->disk_name : "", bdev->bd_dev);
+}
+
+void dev_clear_rdonly(struct block_device *bdev)
+{
+	struct deventry *cur, *last = NULL;
+
+	if (!bdev)
+		return;
+
+	spin_lock(&devlock);
+	cur = devlist;
+	while (cur) {
+		if (bdev->bd_dev == cur->dev) {
+			if (last)
+				last->next = cur->next;
+			else
+				devlist = cur->next;
+			spin_unlock(&devlock);
+			kfree(cur);
+			printk(KERN_WARNING "Removing read-only on %s (%#x)\n",
+			       bdev->bd_disk ? bdev->bd_disk->disk_name :
+					       "unknown block", bdev->bd_dev);
+			return;
+		}
+		last = cur;
+		cur = cur->next;
+	}
+	spin_unlock(&devlock);
+}
+
+EXPORT_SYMBOL(dev_set_rdonly);
+EXPORT_SYMBOL(dev_check_rdonly);
 
 /*
  * sysfs parts below
Index: linux-2.6.22.5/fs/block_dev.c
===================================================================
--- linux-2.6.22.5.orig/fs/block_dev.c 2007-08-22 17:23:54.000000000 -0600
+++ linux-2.6.22.5/fs/block_dev.c 2008-02-21 01:07:16.000000000 -0700
@@ -63,6 +63,7 @@ static void kill_bdev(struct block_device *bdev)
 		return;
 	invalidate_bh_lrus();
 	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+	dev_clear_rdonly(bdev);
 }
 
 int set_blocksize(struct block_device *bdev, int size)
Index: linux-2.6.22.5/include/linux/fs.h
===================================================================
--- linux-2.6.22.5.orig/include/linux/fs.h 2008-02-21 00:58:18.000000000 -0700
+++ linux-2.6.22.5/include/linux/fs.h 2008-02-21 01:07:16.000000000 -0700
@@ -1744,6 +1744,10 @@
 extern void submit_bio(int, struct bio *);
 extern int bdev_read_only(struct block_device *);
 #endif
+#define HAVE_CLEAR_RDONLY_ON_PUT
+extern void dev_set_rdonly(struct block_device *bdev);
+extern int dev_check_rdonly(struct block_device *bdev);
+extern void dev_clear_rdonly(struct block_device *bdev);
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.