Ext3's internal debugging has always assumed that it was illegal for there to
be parallel IO on a buffer-head which it is trying to modify. That's
reasonable --- if there is an IO collision, we end up with IOs hitting disk
out-of-order wrt the journal, so we lose recovery guarantees.
However, there are two cases where the test is a little over-zealous. If
user space is performing inherently non-transactional writes (eg. tune2fs
adding a label to a live filesystem and writing to the buffered device
superblock location) then we can hit the ext3 assertion.
More seriously, since 2.4.11 the page cache can lock a buffer_head for read
even if the bh is already under journal control. The tune2fs bug is very
rare: there have been no reports of it in Bugzilla or ext3-users lists, and
only one on 2.5 on linux-kernel. But now, a dump(8) on a live filesystem
can also give rise to the same condition, and in testing, dump + fs activity
reproduces the assertion-failure VERY rapidly.
This patch changes the jbd get-write-access code to take the buffer_head
lock before testing the uptodate and dirty state of a bh, and relaxes the
handling of unexpectedly-dirty buffers to be a printk warning, not a
fatal error. The lock will cure the dump(8) interaction, and the warning
means that we will still spot out-of-order writes, while not taking the
whole kernel down if we collide with a tune2fs(8).
The patch also removes a small potential hole in the recovery guarantees. It
is not safe for a transaction to steal buffers from checkpoint mode until after
that transaction has committed. Otherwise, a reboot at the wrong moment might
find the old copy of the buffer in the journal had been removed from the
recovery set before the new copy was written.
--- linux-ext3-2.4merge-2/include/linux/jbd.h.=K0000=.orig Thu Aug 22 22:54:34 2002
+++ linux-ext3-2.4merge-2/include/linux/jbd.h Thu Aug 22 23:43:16 2002
@@ -32,6 +32,14 @@
#define journal_oom_retry 1
+/*
+ * Define JBD_PARANOID_WRITES to cause a kernel BUG() check if ext3
+ * finds a buffer unexpectedly dirty. This is useful for debugging, but
+ * can cause spurious kernel panics if there are applications such as
+ * tune2fs modifying our buffer_heads behind our backs.
+ */
+#undef JBD_PARANOID_WRITES
+
#ifdef CONFIG_JBD_DEBUG
/*
* Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
@@ -730,6 +738,10 @@
schedule(); \
} while (1)
+extern void __jbd_unexpected_dirty_buffer(char *, int, struct journal_head *);
+#define jbd_unexpected_dirty_buffer(jh) \
+ __jbd_unexpected_dirty_buffer(__FUNCTION__, __LINE__, (jh))
+
/*
* is_journal_abort
*
--- linux-ext3-2.4merge-2/fs/jbd/transaction.c.=K0000=.orig Thu Aug 22 22:53:28 2002
+++ linux-ext3-2.4merge-2/fs/jbd/transaction.c Thu Aug 22 23:43:16 2002
@@ -539,76 +539,67 @@
static int
do_get_write_access(handle_t *handle, struct journal_head *jh, int force_copy)
{
+ struct buffer_head *bh;
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
int error;
char *frozen_buffer = NULL;
int need_copy = 0;
-
+ int locked;
+
jbd_debug(5, "buffer_head %p, force_copy %d\n", jh, force_copy);
JBUFFER_TRACE(jh, "entry");
repeat:
+ bh = jh2bh(jh);
+
/* @@@ Need to check for errors here at some point. */
/*
- * AKPM: neither bdflush nor kupdate run with the BKL. There's
- * nothing we can do to prevent them from starting writeout of a
- * BUF_DIRTY buffer at any time. And checkpointing buffers are on
- * BUF_DIRTY. So. We no longer assert that the buffer is unlocked.
- *
- * However. It is very wrong for us to allow ext3 to start directly
- * altering the ->b_data of buffers which may at that very time be
- * undergoing writeout to the client filesystem. This can leave
- * the filesystem in an inconsistent, transient state if we crash.
- * So what we do is to steal the buffer if it is in checkpoint
- * mode and dirty. The journal lock will keep out checkpoint-mode
- * state transitions within journal_remove_checkpoint() and the buffer
- * is locked to keep bdflush/kupdate/whoever away from it as well.
- *
* AKPM: we have replaced all the lock_journal_bh_wait() stuff with a
* simple lock_journal(). This code here will care for locked buffers.
*/
- /*
- * The buffer_locked() || buffer_dirty() tests here are simply an
- * optimisation tweak. If anyone else in the system decides to
- * lock this buffer later on, we'll blow up. There doesn't seem
- * to be a good reason why they should do this.
- */
- if (jh->b_cp_transaction &&
- (buffer_locked(jh2bh(jh)) || buffer_dirty(jh2bh(jh)))) {
+ locked = test_and_set_bit(BH_Lock, &bh->b_state);
+ if (locked) {
+ /* We can't reliably test the buffer state if we found
+ * it already locked, so just wait for the lock and
+ * retry. */
unlock_journal(journal);
- lock_buffer(jh2bh(jh));
- spin_lock(&journal_datalist_lock);
- if (jh->b_cp_transaction && buffer_dirty(jh2bh(jh))) {
- /* OK, we need to steal it */
- JBUFFER_TRACE(jh, "stealing from checkpoint mode");
- J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
- J_ASSERT_JH(jh, jh->b_frozen_data == NULL);
-
- J_ASSERT(handle->h_buffer_credits > 0);
- handle->h_buffer_credits--;
-
- /* This will clear BH_Dirty and set BH_JBDDirty. */
- JBUFFER_TRACE(jh, "file as BJ_Reserved");
- __journal_file_buffer(jh, transaction, BJ_Reserved);
-
- /* And pull it off BUF_DIRTY, onto BUF_CLEAN */
- refile_buffer(jh2bh(jh));
+ __wait_on_buffer(bh);
+ lock_journal(journal);
+ goto repeat;
+ }
+
+ /* We now hold the buffer lock so it is safe to query the buffer
+ * state. Is the buffer dirty?
+ *
+ * If so, there are two possibilities. The buffer may be
+ * non-journaled, and undergoing a quite legitimate writeback.
+ * Otherwise, it is journaled, and we don't expect dirty buffers
+ * in that state (the buffers should be marked JBD_Dirty
+ * instead.) So either the IO is being done under our own
+ * control and this is a bug, or it's a third party IO such as
+ * dump(8) (which may leave the buffer scheduled for read ---
+ * ie. locked but not dirty) or tune2fs (which may actually have
+ * the buffer dirtied, ugh.) */
- /*
- * The buffer is now hidden from bdflush. It is
- * metadata against the current transaction.
- */
- JBUFFER_TRACE(jh, "steal from cp mode is complete");
+ if (buffer_dirty(bh)) {
+ spin_lock(&journal_datalist_lock);
+ /* First question: is this buffer already part of the
+ * current transaction or the existing committing
+ * transaction? */
+ if (jh->b_transaction) {
+ J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+ jh->b_transaction == journal->j_committing_transaction);
+ if (jh->b_next_transaction)
+ J_ASSERT_JH(jh, jh->b_next_transaction == transaction);
+ JBUFFER_TRACE(jh, "Unexpected dirty buffer");
+ jbd_unexpected_dirty_buffer(jh);
}
spin_unlock(&journal_datalist_lock);
- unlock_buffer(jh2bh(jh));
- lock_journal(journal);
- goto repeat;
}
- J_ASSERT_JH(jh, !buffer_locked(jh2bh(jh)));
+ unlock_buffer(bh);
error = -EROFS;
if (is_handle_aborted(handle))
@@ -1978,6 +1969,10 @@
__blist_add_buffer(list, jh);
jh->b_jlist = jlist;
+ /* The following list of buffer states needs to be consistent
+ * with __jbd_unexpected_dirty_buffer()'s handling of dirty
+ * state. */
+
if (jlist == BJ_Metadata || jlist == BJ_Reserved ||
jlist == BJ_Shadow || jlist == BJ_Forget) {
if (atomic_set_buffer_clean(jh2bh(jh))) {
--- linux-ext3-2.4merge-2/fs/jbd/journal.c.=K0000=.orig Thu Aug 22 22:53:04 2002
+++ linux-ext3-2.4merge-2/fs/jbd/journal.c Thu Aug 22 23:43:16 2002
@@ -1485,6 +1485,49 @@
unlock_journal(journal);
}
+
+/*
+ * Report any unexpected dirty buffers which turn up. Normally those
+ * indicate an error, but they can occur if the user is running (say)
+ * tune2fs to modify the live filesystem, so we need the option of
+ * continuing as gracefully as possible. #
+ *
+ * The caller should already hold the journal lock and
+ * journal_datalist_lock spinlock: most callers will need those anyway
+ * in order to probe the buffer's journaling state safely.
+ */
+void __jbd_unexpected_dirty_buffer(char *function, int line,
+ struct journal_head *jh)
+{
+ struct buffer_head *bh = jh2bh(jh);
+ int jlist;
+
+ if (buffer_dirty(bh)) {
+ printk ("%sUnexpected dirty buffer encountered at "
+ "%s:%d (%s blocknr %lu)\n",
+ KERN_WARNING, function, line,
+ kdevname(bh->b_dev), bh->b_blocknr);
+#ifdef JBD_PARANOID_WRITES
+ J_ASSERT_BH (bh, !buffer_dirty(bh));
+#endif
+
+ /* If this buffer is one which might reasonably be dirty
+ * --- ie. data, or not part of this journal --- then
+ * we're OK to leave it alone, but otherwise we need to
+ * move the dirty bit to the journal's own internal
+ * JBDDirty bit. */
+ jlist = jh->b_jlist;
+
+ if (jlist == BJ_Metadata || jlist == BJ_Reserved ||
+ jlist == BJ_Shadow || jlist == BJ_Forget) {
+ if (atomic_set_buffer_clean(jh2bh(jh))) {
+ set_bit(BH_JBDDirty, &jh2bh(jh)->b_state);
+ }
+ }
+ }
+}
+
+
int journal_blocks_per_page(struct inode *inode)
{
return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
On Fri, Aug 23, 2002 at 12:19:36AM +0100, Stephen Tweedie wrote:
> Ext3's internal debugging has always assumed that it was illegal for there to
> be parallel IO on a buffer-head which it is trying to modify. That's
> reasonable --- if there is an IO collision, we end up with IOs hitting disk
> out-of-order wrt the journal, so we lose recovery guarantees.
>
> However, there are two cases where the test is a little over-zealous. If
> user space is performing inherently non-transactional writes (eg. tune2fs
> adding a label to a live filesystem and writing to the buffered device
> superblock location) then we can hit the ext3 assertion.
>
> More seriously, since 2.4.11 the page cache can lock a buffer_head for read
> even if the bh is already under journal control. The tune2fs bug is very
> rare: there have been no reports of it in Bugzilla or ext3-users lists, and
> only one on 2.5 on linux-kernel. But now, a dump(8) on a live filesystem
> can also give rise to the same condition, and in testing, dump + fs activity
> reproduces the assertion-failure VERY rapidly.
Btw, is the following patch still aimed for inclusion?
Date: Sat, 13 Apr 2002 23:59:48 +0100
From: "Stephen C. Tweedie" <[email protected]>
To: Andrew Morton <[email protected]>,
Alexander Viro <[email protected]>,
Linus Torvalds <[email protected]>,
Andrea Arcangeli <[email protected]>
Cc: Stephen Tweedie <[email protected]>, [email protected]
Subject: [RFC] Patch: aliasing bug in blockdev-in-pagecache?
Message-ID: <[email protected]>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
X-Spam-Status: No, hits=-5.0 required=5.0 tests=SUBJ_ENDS_IN_Q_MARK,UNIFIED_PATCH version=2.11
Hi all,
I think there's a data-corruption possible in both ext2 and ext3 (and
in fact any filesystem which uses bread) if user space is reading from
the filesystem's buffered block device during live fs activity.
I've recently been seeing an ext3 assert failure where the filesystem
comes across a buffer which is unexpectedly locked, while running a
dump(8) on the live filesystem. Now, I do _not_ want to start another
debate about whether it's sane to do that or not! But the mounted
kernel filesystem should still survive if the user is only reading
from the buffered device.
The problem turns out to be in block_read_full_page(). The page cache
IO code assumes that IO is synchronised at the page level, but any
kernel code using getblk() (or sb_read() or related functions) to
access the buffers at the same time will bypass the page locking.
So if block_read_full_page() encounters a page which bread() is
already reading into cache, it will see the page unlocked but the
buffer_head locked and !uptodate. However, it ignores the bh lock,
seeing only that the buffer is !uptodate, and so it then locks the bh
itself and submits a read IO.
Unfortunately, if the bread() wins the race for the wait_on_buffer
after the initial IO, we can already have started to modify the buffer
contents by the time that block_read_full_page() starts the new IO.
So we read stale contents from disk on top of the modified contents in
cache.
To solve this, we really do need to have block_read_full_page() test
the uptodate state under protection of the buffer_head lock. We
already go through 3 stages in block_read_full_page(): gather the
buffers needing IO, then lock them, then submit the IO. To be safe,
we need a final test for buffer_uptodate() *after* we have locked the
required buffers.
I've verified that the scenario above is definitely happening, and I'm
currently testing the patch below. I'm using 2.4 for the testing
right now, but 2.5 seems to have the same problem at first glance.
Comments?
--Stephen
--- linux/fs/buffer.c.~1~ Fri Apr 12 17:59:09 2002
+++ linux/fs/buffer.c Sat Apr 13 21:09:36 2002
@@ -1902,9 +1902,14 @@
}
/* Stage 3: start the IO */
- for (i = 0; i < nr; i++)
- submit_bh(READ, arr[i]);
-
+ for (i = 0; i < nr; i++) {
+ struct buffer_head * bh = arr[i];
+ if (buffer_uptodate(bh))
+ end_buffer_io_async(bh, 1);
+ else
+ submit_bh(READ, bh);
+ }
+
return 0;
}
Hi,
On Fri, Aug 23, 2002 at 10:59:33AM +0100, Christoph Hellwig wrote:
> Btw, is the following patch still aimed for inclusion?
> --- linux/fs/buffer.c.~1~ Fri Apr 12 17:59:09 2002
> +++ linux/fs/buffer.c Sat Apr 13 21:09:36 2002
> /* Stage 3: start the IO */
> - for (i = 0; i < nr; i++)
> - submit_bh(READ, arr[i]);
> -
> + for (i = 0; i < nr; i++) {
> + struct buffer_head * bh = arr[i];
> + if (buffer_uptodate(bh))
> + end_buffer_io_async(bh, 1);
> + else
> + submit_bh(READ, bh);
> + }
> +
Yes, it's in my next batch.
--Stephen