LinuxLists.cc - [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

2006-09-01 15:47:16

by Badari Pulavarty

[permalink] [raw]

Subject: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi Andrew,

I have been running into following bug while running fsx
tests on 1k (ext3) filesystem all the time.

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/buffer.c:2791
invalid opcode: 0000 [1] SMP

Its complaining about BUG_ON(!buffer_mapped(bh)).

It was hard to track it down, needed lots of debug - but here
is the problem & fix. Since the fix is in __set_page_buffer_dirty()
code - I am wondering how it would effect others :(

With this fix fsx tests ran for more than 16 hours (and still
running).

Please let me know, what you think.

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

If ll_rw_block() runs now and sees the dirty bit it does
submit_bh() on those buffer_heads and triggers the assert.

Fix is to check if the buffer is mapped before setting its
dirty bit in __set_page_dirty_buffers().

Signed-off-by: Badari Pulavarty <[email protected]>
---
fs/buffer.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc5/fs/buffer.c
===================================================================
--- linux-2.6.18-rc5.orig/fs/buffer.c 2006-09-01 08:20:51.000000000 -0700
+++ linux-2.6.18-rc5/fs/buffer.c 2006-09-01 08:41:01.000000000 -0700
@@ -846,7 +846,13 @@ int __set_page_dirty_buffers(struct page
struct buffer_head *bh = head;

do {
- set_buffer_dirty(bh);
+ /*
+ * Its possible that, not all buffers attached to
+ * this page are mapped (cleaned up by truncate).
+ * If so, skip them.
+ */
+ if (buffer_mapped(bh))
+ set_buffer_dirty(bh);
bh = bh->b_this_page;
} while (bh != head);
}

2006-09-01 16:12:57

by Anton Altaparmakov

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi,

On Fri, 1 Sep 2006, Badari Pulavarty wrote:
>
> I have been running into following bug while running fsx
> tests on 1k (ext3) filesystem all the time.
>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/buffer.c:2791
> invalid opcode: 0000 [1] SMP
>
> Its complaining about BUG_ON(!buffer_mapped(bh)).
>
> It was hard to track it down, needed lots of debug - but here
> is the problem & fix. Since the fix is in __set_page_buffer_dirty()
> code - I am wondering how it would effect others :(

This will breaks NTFS and probably a lot of other file systems I would
think.

For example all buffer based file systems in their writepage
implementations will create buffers if none are present and those will not
be mapped. If for whatever reason writepage now breaks out before mapping
the buffers (e.g. because the buffers do not need to be written or due to
an error) you are left with a page containing unmapped buffers.

Then a page dirty comes in caused by a mmapped write for example.

__set_page_dirty_bufferes() runs by default and with your patch does not
set the unmapped buffers dirty thus they never get written out and you
have data corruption.

It is the caller of submit_bh() that is doing the stupidity of submitting
unmapped buffers for i/o or even the caller of the caller, etc...
Somewhere up in that chain buffers should have been mapped before being
submitted for i/o otherwise it is a BUG() (as correctly identified by
submit_bh()).

Perhaps the real fix is not to have ext3 use ll_rw_block() and instead
make it use submit_bh() directly and only submit buffers inside the file
size and/or make it map buffers before calling ll_rw_block() and if they
are outside the file size just clean them without submitting them...

Best regards,

Anton

> With this fix fsx tests ran for more than 16 hours (and still
> running).
>
> Please let me know, what you think.
>
> Thanks,
> Badari
>
> Patch to fix: Kernel BUG at fs/buffer.c:2791
> on 1k (2k) filesystems while running fsx.
>
> journal_commit_transaction collects lots of dirty buffer from
> and does a single ll_rw_block() to write them out. ll_rw_block()
> locks the buffer and checks to see if they are dirty and submits
> them for IO.
>
> In the mean while, journal_unmap_buffers() as part of
> truncate can unmap the buffer and throw it away. Since its
> a 1k (2k) filesystem - each page (4k) will have more than
> one buffer_head attached to the page and and we can't free
> up buffer_heads attached to the page (if we are not
> invalidating the whole page).
>
> Now, any call to set_page_dirty() (like msync_interval)
> could end up setting all the buffer heads attached to
> this page again dirty, including the ones those got
> cleaned up :(
>
> If ll_rw_block() runs now and sees the dirty bit it does
> submit_bh() on those buffer_heads and triggers the assert.
>
> Fix is to check if the buffer is mapped before setting its
> dirty bit in __set_page_dirty_buffers().
>
> Signed-off-by: Badari Pulavarty <[email protected]>
> ---
> fs/buffer.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.18-rc5/fs/buffer.c
> ===================================================================
> --- linux-2.6.18-rc5.orig/fs/buffer.c 2006-09-01 08:20:51.000000000 -0700
> +++ linux-2.6.18-rc5/fs/buffer.c 2006-09-01 08:41:01.000000000 -0700
> @@ -846,7 +846,13 @@ int __set_page_dirty_buffers(struct page
> struct buffer_head *bh = head;
>
> do {
> - set_buffer_dirty(bh);
> + /*
> + * Its possible that, not all buffers attached to
> + * this page are mapped (cleaned up by truncate).
> + * If so, skip them.
> + */
> + if (buffer_mapped(bh))
> + set_buffer_dirty(bh);
> bh = bh->b_this_page;
> } while (bh != head);
> }

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2006-09-01 16:29:08

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Fri, 2006-09-01 at 17:12 +0100, Anton Altaparmakov wrote:
> Hi,
>
> On Fri, 1 Sep 2006, Badari Pulavarty wrote:
> >
> > I have been running into following bug while running fsx
> > tests on 1k (ext3) filesystem all the time.
> >
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at fs/buffer.c:2791
> > invalid opcode: 0000 [1] SMP
> >
> > Its complaining about BUG_ON(!buffer_mapped(bh)).
> >
> > It was hard to track it down, needed lots of debug - but here
> > is the problem & fix. Since the fix is in __set_page_buffer_dirty()
> > code - I am wondering how it would effect others :(
>
> This will breaks NTFS and probably a lot of other file systems I would
> think.

Well, it can happen only with fileystems with blocksize < pagesize.
So, that should limit the scope :)

>
> For example all buffer based file systems in their writepage
> implementations will create buffers if none are present and those will not
> be mapped. If for whatever reason writepage now breaks out before mapping
> the buffers (e.g. because the buffers do not need to be written or due to
> an error) you are left with a page containing unmapped buffers.
>
> Then a page dirty comes in caused by a mmapped write for example.
>
> __set_page_dirty_bufferes() runs by default and with your patch does not
> set the unmapped buffers dirty thus they never get written out and you
> have data corruption.

I was wondering, is it *okay* to have a dirty buffer but not mapped to
disk ? I guess, so ..

>
> It is the caller of submit_bh() that is doing the stupidity of submitting
> unmapped buffers for i/o or even the caller of the caller, etc...
> Somewhere up in that chain buffers should have been mapped before being
> submitted for i/o otherwise it is a BUG() (as correctly identified by
> submit_bh()).
>
> Perhaps the real fix is not to have ext3 use ll_rw_block() and instead
> make it use submit_bh() directly and only submit buffers inside the file
> size and/or make it map buffers before calling ll_rw_block() and if they
> are outside the file size just clean them without submitting them...

Yeah. I considered doing it in ll_rw_block() - but then I thought
fixing it in set_page_buffer_dirty() may be a valid fix :(

Let me try other options, then.

Thanks,
Badari

2006-09-01 17:18:01

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Fri, 01 Sep 2006 09:32:22 -0700
Badari Pulavarty <[email protected]> wrote:

> > > Kernel BUG at fs/buffer.c:2791
> > > invalid opcode: 0000 [1] SMP
> > >
> > > Its complaining about BUG_ON(!buffer_mapped(bh)).

I need to have a little think about this, remember what _should_ be
happening in this situation.

We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
filesystems. We've done something to make this start happening. Part of
resolving this bug will be working out what that was.

2006-09-01 17:43:45

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> On Fri, 01 Sep 2006 09:32:22 -0700
> Badari Pulavarty <[email protected]> wrote:
>
> > > > Kernel BUG at fs/buffer.c:2791
> > > > invalid opcode: 0000 [1] SMP
> > > >
> > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
>
> I need to have a little think about this, remember what _should_ be
> happening in this situation.

Agreed. I used to run fsx on regular basis - never saw the problem
before. I will try to go back few kernel versions and verify.

>
> We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
> filesystems. We've done something to make this start happening. Part of
> resolving this bug will be working out what that was.
>

Here is the other fix, I did to avoid the problem (basically, have
jbd manage its own submit function and skip unmapped buffers) - not
elegant. Thats why I tried to address at the root cause..

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[email protected]>
---
fs/jbd/commit.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.000000000 -0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c 2006-09-01 10:36:16.000000000 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
}

+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct buffer_head *bh = bhs[i];
+
+ lock_buffer(bh);
+
+ /*
+ * case 1: Buffer could have been flushed by now,
+ * if so nothing to do for us.
+ * case 2: Buffer could have been unmapped up by
+ * journal_unmap_buffer - but still attached to the
+ * page. Any calls to set_page_dirty() would dirty
+ * the buffer even though its not mapped. If so,
+ * we need to skip them.
+ */
+ if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_write_sync;
+ get_bh(bh);
+ submit_bh(WRITE, bh);
+ continue;
+ }
+ unlock_buffer(bh);
+ }
+}
+
/*
* journal_commit_transaction
*
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffer(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:

if (bufs) {
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
spin_lock(&journal->j_list_lock);
}

2006-09-01 21:01:35

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Andrew Morton wrote:
> On Fri, 01 Sep 2006 09:32:22 -0700
> Badari Pulavarty <[email protected]> wrote:
>
>
>>>> Kernel BUG at fs/buffer.c:2791
>>>> invalid opcode: 0000 [1] SMP
>>>>
>>>> Its complaining about BUG_ON(!buffer_mapped(bh)).
>>>>
>
> I need to have a little think about this, remember what _should_ be
> happening in this situation.
>
> We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
> filesystems. We've done something to make this start happening. Part of
> resolving this bug will be working out what that was.
>

Here is the progress in tracking this down so far.

I am able to reproduce the problem on following kernel versions.

2.6.18-rc5
2.6.18-rc4
2.6.17.11
2.6.16.28
2.6.15.7
2.6.14.7

I am yet to find a latest kernel version - where this works :(
I am going to try older versions of the kernel.

Thanks,
Badari

--
VGER BF report: H 3.60822e-15

2006-09-05 16:08:24

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> On Fri, 01 Sep 2006 09:32:22 -0700
> Badari Pulavarty <[email protected]> wrote:
>
> > > > Kernel BUG at fs/buffer.c:2791
> > > > invalid opcode: 0000 [1] SMP
> > > >
> > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
>
> I need to have a little think about this, remember what _should_ be
> happening in this situation.
>
> We (mainly I) used to do a huge amount of fsx-linux testing on 1k blocksize
> filesystems. We've done something to make this start happening. Part of
> resolving this bug will be working out what that was.
>

It took a while to track this down. 2.6.13 is the last kernel
which runs my fsx tests fine (72+ hours).

Here is the change that seems to cause the problem. Jana Kara
introduced a new mode "SWRITE" for ll_rw_block() - where it
waits for buffer to be unlocked (WRITE will skip locked
buffers) + jbd journal_commit_transaction() has been changed
to make use of SWRITE.

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2

Theoritically same race (between journal_commit_transaction() and
journal_unmap_buffer()+set_page_dirty()) could exist before his changes
- which should trigger bug in submit_bh(). But I can't seem to hit
it without his changes. My guess is ll_rw_block() is always skipping
those cleaned up buffers, before page gets dirtied again ..

Andrew, what should we do ? Do you suggest handling this in jbd
itself (like this patch) ?

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[email protected]>
---
fs/jbd/commit.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.000000000 -0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c 2006-09-01 10:43:54.000000000 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
}

+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct buffer_head *bh = bhs[i];
+
+ lock_buffer(bh);
+
+ /*
+ * case 1: Buffer could have been flushed by now,
+ * if so nothing to do for us.
+ * case 2: Buffer could have been unmapped up by
+ * journal_unmap_buffer - but still attached to the
+ * page. Any calls to set_page_dirty() would dirty
+ * the buffer even though its not mapped. If so,
+ * we need to skip them.
+ */
+ if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_write_sync;
+ get_bh(bh);
+ submit_bh(WRITE, bh);
+ continue;
+ }
+ unlock_buffer(bh);
+ }
+}
+
/*
* journal_commit_transaction
*
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:

if (bufs) {
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
spin_lock(&journal->j_list_lock);
}

2006-09-06 12:47:19

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi,

> On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> > > > > Kernel BUG at fs/buffer.c:2791
> > > > > invalid opcode: 0000 [1] SMP
> > > > >
> > > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
>
> Here is the change that seems to cause the problem. Jana Kara
> introduced a new mode "SWRITE" for ll_rw_block() - where it
> waits for buffer to be unlocked (WRITE will skip locked
> buffers) + jbd journal_commit_transaction() has been changed
> to make use of SWRITE.
>
> http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2
>
> Theoritically same race (between journal_commit_transaction() and
> journal_unmap_buffer()+set_page_dirty()) could exist before his changes
> - which should trigger bug in submit_bh(). But I can't seem to hit
> it without his changes. My guess is ll_rw_block() is always skipping
> those cleaned up buffers, before page gets dirtied again ..
I think that the change to ll_rw_block() just widens the window much
more...

> Andrew, what should we do ? Do you suggest handling this in jbd
> itself (like this patch) ?
Actually that part of commit code needs rewrite anyway (and after that
rewrite you get rid of ll_rw_block()) because of other problems - the
code assumes that whenever buffer is locked, it is being written to disk
which is not true... I have some preliminary patches for that but they
are not very nice and so far I didn't have enough time to find a nice
solution.

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-06 15:09:19

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> Hi,
>
> > On Fri, 2006-09-01 at 10:18 -0700, Andrew Morton wrote:
> > > > > > Kernel BUG at fs/buffer.c:2791
> > > > > > invalid opcode: 0000 [1] SMP
> > > > > >
> > > > > > Its complaining about BUG_ON(!buffer_mapped(bh)).
> >
> > Here is the change that seems to cause the problem. Jana Kara
> > introduced a new mode "SWRITE" for ll_rw_block() - where it
> > waits for buffer to be unlocked (WRITE will skip locked
> > buffers) + jbd journal_commit_transaction() has been changed
> > to make use of SWRITE.
> >
> > http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112109788702895&w=2
> >
> > Theoritically same race (between journal_commit_transaction() and
> > journal_unmap_buffer()+set_page_dirty()) could exist before his changes
> > - which should trigger bug in submit_bh(). But I can't seem to hit
> > it without his changes. My guess is ll_rw_block() is always skipping
> > those cleaned up buffers, before page gets dirtied again ..
> I think that the change to ll_rw_block() just widens the window much
> more...
>

Yes.

> > Andrew, what should we do ? Do you suggest handling this in jbd
> > itself (like this patch) ?
> Actually that part of commit code needs rewrite anyway (and after that
> rewrite you get rid of ll_rw_block()) because of other problems - the
> code assumes that whenever buffer is locked, it is being written to disk
> which is not true... I have some preliminary patches for that but they
> are not very nice and so far I didn't have enough time to find a nice
> solution.

Are you okay with current not-so-elegant fix ?

Thanks,
Badari

Patch to fix: Kernel BUG at fs/buffer.c:2791
on 1k (2k) filesystems while running fsx.

journal_commit_transaction collects lots of dirty buffer from
and does a single ll_rw_block() to write them out. ll_rw_block()
locks the buffer and checks to see if they are dirty and submits
them for IO.

In the mean while, journal_unmap_buffers() as part of
truncate can unmap the buffer and throw it away. Since its
a 1k (2k) filesystem - each page (4k) will have more than
one buffer_head attached to the page and and we can't free
up buffer_heads attached to the page (if we are not
invalidating the whole page).

Now, any call to set_page_dirty() (like msync_interval)
could end up setting all the buffer heads attached to
this page again dirty, including the ones those got
cleaned up :(

So, -not-so-elegant fix would be to have make jbd skip all
the buffers that are not mapped.

Signed-off-by: Badari Pulavarty <[email protected]>
---
fs/jbd/commit.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

Index: linux-2.6.18-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.18-rc5.orig/fs/jbd/commit.c 2006-08-27 20:41:48.000000000 -0700
+++ linux-2.6.18-rc5/fs/jbd/commit.c 2006-09-01 10:43:54.000000000 -0700
@@ -160,6 +160,34 @@ static int journal_write_commit_record(j
return (ret == -EIO);
}

+static void jbd_write_buffers(int nr, struct buffer_head *bhs[])
+{
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ struct buffer_head *bh = bhs[i];
+
+ lock_buffer(bh);
+
+ /*
+ * case 1: Buffer could have been flushed by now,
+ * if so nothing to do for us.
+ * case 2: Buffer could have been unmapped up by
+ * journal_unmap_buffer - but still attached to the
+ * page. Any calls to set_page_dirty() would dirty
+ * the buffer even though its not mapped. If so,
+ * we need to skip them.
+ */
+ if (buffer_mapped(bh) && test_clear_buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_write_sync;
+ get_bh(bh);
+ submit_bh(WRITE, bh);
+ continue;
+ }
+ unlock_buffer(bh);
+ }
+}
+
/*
* journal_commit_transaction
*
@@ -356,7 +384,7 @@ write_out_data:
jbd_debug(2, "submit %d writes\n",
bufs);
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
bufs = 0;
goto write_out_data;
@@ -379,7 +407,7 @@ write_out_data:

if (bufs) {
spin_unlock(&journal->j_list_lock);
- ll_rw_block(SWRITE, bufs, wbuf);
+ jbd_write_buffers(bufs, wbuf);
journal_brelse_array(wbuf, bufs);
spin_lock(&journal->j_list_lock);
}

2006-09-06 15:34:49

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
>
> > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > itself (like this patch) ?
> > Actually that part of commit code needs rewrite anyway (and after that
> > rewrite you get rid of ll_rw_block()) because of other problems - the
> > code assumes that whenever buffer is locked, it is being written to disk
> > which is not true... I have some preliminary patches for that but they
> > are not very nice and so far I didn't have enough time to find a nice
> > solution.
>
> Are you okay with current not-so-elegant fix ?
Actually I don't quite understand how it can happen what you describe
(so probably I missed something). How it can happen that some buffers
are unmapped while we are committing them? journal_unmap_buffers()
checks whether we are not committing truncated buffers and if so, it
does not do anything to such buffers...
Bye
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-06 16:19:05

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> >
> > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > itself (like this patch) ?
> > > Actually that part of commit code needs rewrite anyway (and after that
> > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > code assumes that whenever buffer is locked, it is being written to disk
> > > which is not true... I have some preliminary patches for that but they
> > > are not very nice and so far I didn't have enough time to find a nice
> > > solution.
> >
> > Are you okay with current not-so-elegant fix ?
> Actually I don't quite understand how it can happen what you describe
> (so probably I missed something). How it can happen that some buffers
> are unmapped while we are committing them? journal_unmap_buffers()
> checks whether we are not committing truncated buffers and if so, it
> does not do anything to such buffers...
> Bye
> Honza

Yep. I spent lot of time trying to understand - why they are not
getting skipped :(

But my debug clearly shows that we are clearing the buffer, while
we haven't actually submitted to ll_rw_block() code. (I added "track"
flag to bh and set it in journal_commit_transaction() when we add
them to wbuf[] and clear it in ll_rw_block() after submit. I checked
for this flag in journal_unmap_buffer() while clearing the buffer).
Here is what my debug shows:

buffer is tracked bh ffff8101686ea850 size 1024

Call Trace:
[<ffffffff8020b395>] show_trace+0xb5/0x370
[<ffffffff8020b665>] dump_stack+0x15/0x20
[<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
[<ffffffff802fe948>] ext3_invalidatepage+0x38/0x40
[<ffffffff80282750>] do_invalidatepage+0x20/0x30
[<ffffffff80260613>] truncate_complete_page+0x23/0x50
[<ffffffff8026070d>] truncate_inode_pages_range+0xcd/0x300
[<ffffffff80260950>] truncate_inode_pages+0x10/0x20
[<ffffffff802686ff>] vmtruncate+0x5f/0x100
[<ffffffff8029d880>] inode_setattr+0x30/0x140
[<ffffffff802ff8cb>] ext3_setattr+0x1bb/0x230
[<ffffffff8029daee>] notify_change+0x15e/0x320
[<ffffffff8027f973>] do_truncate+0x53/0x80
[<ffffffff802800f8>] sys_ftruncate+0xf8/0x130
[<ffffffff80209d5a>] system_call+0x7e/0x83

Thanks,
Badari

2006-09-06 16:27:23

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > >
> > > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > > itself (like this patch) ?
> > > > Actually that part of commit code needs rewrite anyway (and after that
> > > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > > code assumes that whenever buffer is locked, it is being written to disk
> > > > which is not true... I have some preliminary patches for that but they
> > > > are not very nice and so far I didn't have enough time to find a nice
> > > > solution.
> > >
> > > Are you okay with current not-so-elegant fix ?
> > Actually I don't quite understand how it can happen what you describe
> > (so probably I missed something). How it can happen that some buffers
> > are unmapped while we are committing them? journal_unmap_buffers()
> > checks whether we are not committing truncated buffers and if so, it
> > does not do anything to such buffers...
> > Bye
> > Honza
>
> Yep. I spent lot of time trying to understand - why they are not
> getting skipped :(
>
> But my debug clearly shows that we are clearing the buffer, while
> we haven't actually submitted to ll_rw_block() code. (I added "track"
> flag to bh and set it in journal_commit_transaction() when we add
> them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> for this flag in journal_unmap_buffer() while clearing the buffer).
> Here is what my debug shows:
>
> buffer is tracked bh ffff8101686ea850 size 1024
>
> Call Trace:
> [<ffffffff8020b395>] show_trace+0xb5/0x370
> [<ffffffff8020b665>] dump_stack+0x15/0x20
> [<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
I see just journal_invalidatepage() here. That is fine. It calls
journal_unmap_buffer() which should do nothing return 0. If it does
not it would be IMO bug.. If the buffer is really unmapped here, in what
state it is (i.e. which list is it on?).

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-06 16:43:51

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > > >
> > > > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > > > itself (like this patch) ?
> > > > > Actually that part of commit code needs rewrite anyway (and after that
> > > > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > > > code assumes that whenever buffer is locked, it is being written to disk
> > > > > which is not true... I have some preliminary patches for that but they
> > > > > are not very nice and so far I didn't have enough time to find a nice
> > > > > solution.
> > > >
> > > > Are you okay with current not-so-elegant fix ?
> > > Actually I don't quite understand how it can happen what you describe
> > > (so probably I missed something). How it can happen that some buffers
> > > are unmapped while we are committing them? journal_unmap_buffers()
> > > checks whether we are not committing truncated buffers and if so, it
> > > does not do anything to such buffers...
> > > Bye
> > > Honza
> >
> > Yep. I spent lot of time trying to understand - why they are not
> > getting skipped :(
> >
> > But my debug clearly shows that we are clearing the buffer, while
> > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > flag to bh and set it in journal_commit_transaction() when we add
> > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > for this flag in journal_unmap_buffer() while clearing the buffer).
> > Here is what my debug shows:
> >
> > buffer is tracked bh ffff8101686ea850 size 1024
> >
> > Call Trace:
> > [<ffffffff8020b395>] show_trace+0xb5/0x370
> > [<ffffffff8020b665>] dump_stack+0x15/0x20
> > [<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
> I see just journal_invalidatepage() here. That is fine. It calls
> journal_unmap_buffer() which should do nothing return 0. If it does
> not it would be IMO bug.. If the buffer is really unmapped here, in what
> state it is (i.e. which list is it on?).
>
Acutally, I added dump_stack() in journal_unmap_buffer() when it
does clear_buffer_mapped(). gcc must of pulled in the function ..
I will add more debug to track the list bh came from.

- Badari

2006-09-06 17:03:30

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > >
> > > But my debug clearly shows that we are clearing the buffer, while
> > > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > > flag to bh and set it in journal_commit_transaction() when we add
> > > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > > for this flag in journal_unmap_buffer() while clearing the buffer).
> > > Here is what my debug shows:
> > >
> > > buffer is tracked bh ffff8101686ea850 size 1024
> > >
> > > Call Trace:
> > > [<ffffffff8020b395>] show_trace+0xb5/0x370
> > > [<ffffffff8020b665>] dump_stack+0x15/0x20
> > > [<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
> > I see just journal_invalidatepage() here. That is fine. It calls
> > journal_unmap_buffer() which should do nothing return 0. If it does
> > not it would be IMO bug.. If the buffer is really unmapped here, in what
> > state it is (i.e. which list is it on?).
> >
> Acutally, I added dump_stack() in journal_unmap_buffer() when it
> does clear_buffer_mapped(). gcc must of pulled in the function ..
> I will add more debug to track the list bh came from.
Ah, ok. My guess is that the buffer is in BJ_Locked list. But we don't
write buffers from BJ_Locked list (as they are supposedly already
written). What probably happens is:
Process 1 Process 2
We start scanning t_sync_datalist.
Add buffer to wbuf[].
Locks the buffer (it can be some other
writer).
Restarts scanning the list
Finds buffer is locked -> moves to
BJ_Locked list (actually this is BUG
as we have not written the buffer yet)
journal_unmap_buffer() happens

Maybe what you could verify is, that we are moving the buffer added to
wbuf[] to BJ_Locked list in the write_out_data: loop.
If it really happens, what I've described above, the fix should be
different. We shouldn't have moved the buffer to BJ_Locked list...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-06 17:13:37

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > On Wed, 2006-09-06 at 17:34 +0200, Jan Kara wrote:
> > > > On Wed, 2006-09-06 at 14:47 +0200, Jan Kara wrote:
> > > >
> > > > > > Andrew, what should we do ? Do you suggest handling this in jbd
> > > > > > itself (like this patch) ?
> > > > > Actually that part of commit code needs rewrite anyway (and after that
> > > > > rewrite you get rid of ll_rw_block()) because of other problems - the
> > > > > code assumes that whenever buffer is locked, it is being written to disk
> > > > > which is not true... I have some preliminary patches for that but they
> > > > > are not very nice and so far I didn't have enough time to find a nice
> > > > > solution.
> > > >
> > > > Are you okay with current not-so-elegant fix ?
> > > Actually I don't quite understand how it can happen what you describe
> > > (so probably I missed something). How it can happen that some buffers
> > > are unmapped while we are committing them? journal_unmap_buffers()
> > > checks whether we are not committing truncated buffers and if so, it
> > > does not do anything to such buffers...
> > > Bye
> > > Honza
> >
> > Yep. I spent lot of time trying to understand - why they are not
> > getting skipped :(
> >
> > But my debug clearly shows that we are clearing the buffer, while
> > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > flag to bh and set it in journal_commit_transaction() when we add
> > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > for this flag in journal_unmap_buffer() while clearing the buffer).
> > Here is what my debug shows:
> >
> > buffer is tracked bh ffff8101686ea850 size 1024
> >
> > Call Trace:
> > [<ffffffff8020b395>] show_trace+0xb5/0x370
> > [<ffffffff8020b665>] dump_stack+0x15/0x20
> > [<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
> I see just journal_invalidatepage() here. That is fine. It calls
> journal_unmap_buffer() which should do nothing return 0. If it does
> not it would be IMO bug.. If the buffer is really unmapped here, in what
> state it is (i.e. which list is it on?).

Okay.. here is the path its taking according to my debug ..
Remember, the issue is: after the buffer is cleaned - they are
still (left) attached to the page (since a page can have 4
buffer heads and we partially truncated the page). After
we clean up the buffers any subsequent call to set_page_dirty()
would end up marking *all* the buffers dirty. If ll_rw_block()
happens after this, we will run into the assert. If no
set_page_dirty() happens before ll_rw_block() happens, things
would be fine - as the buffer won't be dirty and be skipped.

journal_unmap_buffer()
{
.....
} else {
/* Good, the buffer belongs to the running transaction.
* We are writing our own transaction's data, not any
* previous one's, so it is safe to throw it away
* (remember that we expect the filesystem to have set
* i_size already for this truncate so recovery will not
* expose the disk blocks we are discarding here.) */
J_ASSERT_JH(jh, transaction == journal-
>j_running_transaction);
may_free = __dispose_buffer(jh, transaction);
}
zap_buffer:
journal_put_journal_head(jh);
zap_buffer_no_jh:
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
spin_unlock(&journal->j_state_lock);
zap_buffer_unlocked:
clear_buffer_dirty(bh);
J_ASSERT_BH(bh, !buffer_jbddirty(bh));
clear_buffer_mapped(bh);
clear_buffer_req(bh);
clear_buffer_new(bh);
bh->b_bdev = NULL;
return may_free;
}

2006-09-06 17:27:27

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> On Wed, 2006-09-06 at 18:27 +0200, Jan Kara wrote:
> > > But my debug clearly shows that we are clearing the buffer, while
> > > we haven't actually submitted to ll_rw_block() code. (I added "track"
> > > flag to bh and set it in journal_commit_transaction() when we add
> > > them to wbuf[] and clear it in ll_rw_block() after submit. I checked
> > > for this flag in journal_unmap_buffer() while clearing the buffer).
> > > Here is what my debug shows:
> > >
> > > buffer is tracked bh ffff8101686ea850 size 1024
> > >
> > > Call Trace:
> > > [<ffffffff8020b395>] show_trace+0xb5/0x370
> > > [<ffffffff8020b665>] dump_stack+0x15/0x20
> > > [<ffffffff8030d474>] journal_invalidatepage+0x314/0x3b0
> > I see just journal_invalidatepage() here. That is fine. It calls
> > journal_unmap_buffer() which should do nothing return 0. If it does
> > not it would be IMO bug.. If the buffer is really unmapped here, in what
> > state it is (i.e. which list is it on?).
>
> Okay.. here is the path its taking according to my debug ..
> Remember, the issue is: after the buffer is cleaned - they are
> still (left) attached to the page (since a page can have 4
> buffer heads and we partially truncated the page). After
> we clean up the buffers any subsequent call to set_page_dirty()
> would end up marking *all* the buffers dirty. If ll_rw_block()
> happens after this, we will run into the assert. If no
> set_page_dirty() happens before ll_rw_block() happens, things
> would be fine - as the buffer won't be dirty and be skipped.
Yes, OK.

> journal_unmap_buffer()
> {
> .....
> } else {
> /* Good, the buffer belongs to the running transaction.
> * We are writing our own transaction's data, not any
> * previous one's, so it is safe to throw it away
> * (remember that we expect the filesystem to have set
> * i_size already for this truncate so recovery will not
> * expose the disk blocks we are discarding here.) */
> J_ASSERT_JH(jh, transaction == journal-
> >j_running_transaction);
> may_free = __dispose_buffer(jh, transaction);
> }
> zap_buffer:
> journal_put_journal_head(jh);
> zap_buffer_no_jh:
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> zap_buffer_unlocked:
> clear_buffer_dirty(bh);
> J_ASSERT_BH(bh, !buffer_jbddirty(bh));
> clear_buffer_mapped(bh);
> clear_buffer_req(bh);
> clear_buffer_new(bh);
> bh->b_bdev = NULL;
> return may_free;
> }
Ugh! Are you sure? For this path the buffer must be attached (only) to
the running transaction. But then how the commit code comes to it?
Somebody would have to even manage to refile the buffer from the
committing transaction to the running one while the buffer is in wbuf[].
Could you check whether someone does __journal_refile_buffer() on your
marked buffers, please? Or whether we move buffer to BJ_Locked list in
the write_out_data: loop? Thanks.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-07 02:15:14

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 6 Sep 2006 19:27:33 +0200
Jan Kara <[email protected]> wrote:

> Ugh! Are you sure? For this path the buffer must be attached (only) to
> the running transaction. But then how the commit code comes to it?
> Somebody would have to even manage to refile the buffer from the
> committing transaction to the running one while the buffer is in wbuf[].
> Could you check whether someone does __journal_refile_buffer() on your
> marked buffers, please? Or whether we move buffer to BJ_Locked list in
> the write_out_data: loop? Thanks.

The ext3-debug patch will help here. It records, within the bh, the inputs
from the last 32 BUFFER_TRACE()s which were run against this bh. If a
J_ASSERT fails then you get a nice trace of the last 32 "things" which
happened to this bh, including the bh's state at that transition. It
basically tells you everything you need to know to find the bug.

It's worth spending the time to become familiar with it - I used it a lot in
early ext3 development.

I've been unable to reproduce this crash, btw. Is there some magic
incantation apat from running `fsx-linux'?

drivers/ide/ide-io.c | 22 ++
fs/Kconfig | 4
fs/Makefile | 2
fs/buffer.c | 100 ++++++++++++
fs/direct-io.c | 1
fs/ext3/inode.c | 7
fs/ext3/namei.c | 40 ++++-
fs/ext3/super.c | 81 ++++++++++
fs/ext3/xattr.c | 4
fs/jbd-kernel.c | 256 +++++++++++++++++++++++++++++++++
fs/jbd/commit.c | 19 +-
fs/jbd/transaction.c | 46 +++--
include/linux/buffer-trace.h | 85 ++++++++++
include/linux/buffer_head.h | 4
include/linux/jbd.h | 26 +--
15 files changed, 652 insertions(+), 45 deletions(-)

diff -puN drivers/ide/ide-io.c~ext3-debug drivers/ide/ide-io.c
--- a/drivers/ide/ide-io.c~ext3-debug
+++ a/drivers/ide/ide-io.c
@@ -980,6 +980,28 @@ static ide_startstop_t start_request (id
ide_startstop_t startstop;
sector_t block;

+#ifdef CONFIG_JBD_DEBUG
+ /*
+ * Silently stop writing to this disk to simulate a crash.
+ */
+ extern struct block_device *journal_no_write[2];
+ int i;
+
+ if (!(rq->flags & REQ_RW))
+ goto its_a_read;
+
+ for (i = 0; i < 2; i++) {
+ struct block_device *bdev = journal_no_write[i];
+ if (bdev && bdev_get_queue(bdev) == rq->q) {
+ printk("%s: write suppressed\n", __FUNCTION__);
+ ide_end_request(drive, 1, rq->nr_sectors);
+ return ide_stopped;
+ }
+ }
+its_a_read:
+ ;
+#endif
+
BUG_ON(!(rq->flags & REQ_STARTED));

#ifdef DEBUG
diff -puN fs/buffer.c~ext3-debug fs/buffer.c
--- a/fs/buffer.c~ext3-debug
+++ a/fs/buffer.c
@@ -35,7 +35,9 @@
#include <linux/hash.h>
#include <linux/suspend.h>
#include <linux/buffer_head.h>
+#include <linux/buffer-trace.h>
#include <linux/bio.h>
+#include <linux/jbd.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -52,6 +54,7 @@ init_buffer(struct buffer_head *bh, bh_e
{
bh->b_end_io = handler;
bh->b_private = private;
+ buffer_trace_init(&bh->b_history);
}

static int sync_buffer(void *word)
@@ -60,6 +63,7 @@ static int sync_buffer(void *word)
struct buffer_head *bh
= container_of(word, struct buffer_head, b_state);

+ BUFFER_TRACE(bh, "enter");
smp_mb();
bd = bh->b_bdev;
if (bd)
@@ -104,6 +108,7 @@ static void buffer_io_error(struct buffe
{
char b[BDEVNAME_SIZE];

+ BUFFER_TRACE(bh, "I/O error");
printk(KERN_ERR "Buffer I/O error on device %s, logical block %Lu\n",
bdevname(bh->b_bdev, b),
(unsigned long long)bh->b_blocknr);
@@ -115,10 +120,12 @@ static void buffer_io_error(struct buffe
*/
void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
{
+ BUFFER_TRACE(bh, "enter");
if (uptodate) {
set_buffer_uptodate(bh);
} else {
/* This happens, due to failed READA attempts. */
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -130,8 +137,10 @@ void end_buffer_write_sync(struct buffer
char b[BDEVNAME_SIZE];

if (uptodate) {
+ BUFFER_TRACE(bh, "uptodate");
set_buffer_uptodate(bh);
} else {
+ BUFFER_TRACE(bh, "Not uptodate");
if (!buffer_eopnotsupp(bh) && printk_ratelimit()) {
buffer_io_error(bh);
printk(KERN_WARNING "lost page write due to "
@@ -139,6 +148,7 @@ void end_buffer_write_sync(struct buffer
bdevname(bh->b_bdev, b));
}
set_buffer_write_io_error(bh);
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -514,12 +524,15 @@ static void end_buffer_async_read(struct
struct page *page;
int page_uptodate = 1;

+ BUFFER_TRACE(bh, "enter");
+
BUG_ON(!buffer_async_read(bh));

page = bh->b_page;
if (uptodate) {
set_buffer_uptodate(bh);
} else {
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
if (printk_ratelimit())
buffer_io_error(bh);
@@ -576,6 +589,8 @@ static void end_buffer_async_write(struc
struct buffer_head *tmp;
struct page *page;

+ BUFFER_TRACE(bh, "enter");
+
BUG_ON(!buffer_async_write(bh));

page = bh->b_page;
@@ -589,6 +604,7 @@ static void end_buffer_async_write(struc
bdevname(bh->b_bdev, b));
}
set_bit(AS_EIO, &page->mapping->flags);
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
SetPageError(page);
}
@@ -796,6 +812,7 @@ void mark_buffer_dirty_inode(struct buff
struct address_space *mapping = inode->i_mapping;
struct address_space *buffer_mapping = bh->b_page->mapping;

+ BUFFER_TRACE(bh, "mark dirty");
mark_buffer_dirty(bh);
if (!mapping->assoc_mapping) {
mapping->assoc_mapping = buffer_mapping;
@@ -846,7 +863,25 @@ int __set_page_dirty_buffers(struct page
struct buffer_head *bh = head;

do {
- set_buffer_dirty(bh);
+ if (buffer_uptodate(bh)) {
+ BUFFER_TRACE(bh, "set dirty");
+ /* The following test can only succeed
+ * if jbd is built in, not if it's a
+ * module... */
+#ifdef CONFIG_JBD
+ if (buffer_jbd(bh)) {
+ struct journal_head *jh;
+
+ jh = journal_add_journal_head(bh);
+ WARN_ON(jh->b_jlist == BJ_Metadata);
+ journal_put_journal_head(jh);
+ }
+#endif
+ set_buffer_dirty(bh);
+ } else {
+ printk("%s: !uptodate buffer\n", __FUNCTION__);
+ buffer_assertion_failure(bh);
+ }
bh = bh->b_this_page;
} while (bh != head);
}
@@ -1037,6 +1072,7 @@ no_grow:
do {
bh = head;
head = head->b_this_page;
+ bh->b_this_page = NULL;
free_buffer_head(bh);
} while (head);
}
@@ -1251,6 +1287,7 @@ __getblk_slow(struct block_device *bdev,
*/
void fastcall mark_buffer_dirty(struct buffer_head *bh)
{
+ BUFFER_TRACE(bh, "entry");
if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
__set_page_dirty_nobuffers(bh->b_page);
}
@@ -1264,12 +1301,14 @@ void fastcall mark_buffer_dirty(struct b
*/
void __brelse(struct buffer_head * buf)
{
+ BUFFER_TRACE(buf, "entry");
if (atomic_read(&buf->b_count)) {
put_bh(buf);
return;
}
printk(KERN_ERR "VFS: brelse: Trying to free free buffer\n");
WARN_ON(1);
+ buffer_assertion_failure(buf);
}

/*
@@ -1278,6 +1317,7 @@ void __brelse(struct buffer_head * buf)
*/
void __bforget(struct buffer_head *bh)
{
+ BUFFER_TRACE(bh, "enter");
clear_buffer_dirty(bh);
if (!list_empty(&bh->b_assoc_buffers)) {
struct address_space *buffer_mapping = bh->b_page->mapping;
@@ -1540,6 +1580,7 @@ EXPORT_SYMBOL(set_bh_page);
*/
static void discard_buffer(struct buffer_head * bh)
{
+ BUFFER_TRACE(bh, "enter");
lock_buffer(bh);
clear_buffer_dirty(bh);
bh->b_bdev = NULL;
@@ -1697,6 +1738,7 @@ void unmap_underlying_metadata(struct bl

old_bh = __find_get_block_slow(bdev, block);
if (old_bh) {
+ BUFFER_TRACE(old_bh, "alias: unmap it");
clear_buffer_dirty(old_bh);
wait_on_buffer(old_bh);
clear_buffer_req(old_bh);
@@ -1777,6 +1819,7 @@ static int __block_write_full_page(struc
/*
* The buffer was zeroed by block_write_full_page()
*/
+ BUFFER_TRACE(bh, "outside last_block");
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
} else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
@@ -1785,6 +1828,8 @@ static int __block_write_full_page(struc
if (err)
goto recover;
if (buffer_new(bh)) {
+ BUFFER_TRACE(bh, "new: call "
+ "unmap_underlying_metadata");
/* blockdev mappings never come here */
clear_buffer_new(bh);
unmap_underlying_metadata(bh->b_bdev,
@@ -1812,6 +1857,11 @@ static int __block_write_full_page(struc
continue;
}
if (test_clear_buffer_dirty(bh)) {
+ if (!buffer_uptodate(bh)) {
+ printk("%s: dirty non-uptodate buffer\n",
+ __FUNCTION__);
+ buffer_assertion_failure(bh);
+ }
mark_buffer_async_write(bh);
} else {
unlock_buffer(bh);
@@ -1873,6 +1923,7 @@ recover:
/* Recovery: lock and submit the mapped buffers */
do {
if (buffer_mapped(bh) && buffer_dirty(bh)) {
+ BUFFER_TRACE(bh, "lock it");
lock_buffer(bh);
mark_buffer_async_write(bh);
} else {
@@ -1939,9 +1990,12 @@ static int __block_prepare_write(struct
if (err)
break;
if (buffer_new(bh)) {
+ BUFFER_TRACE(bh, "new: call "
+ "unmap_underlying_metadata");
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
+ BUFFER_TRACE(bh, "setting uptodate");
set_buffer_uptodate(bh);
continue;
}
@@ -1962,12 +2016,14 @@ static int __block_prepare_write(struct
}
}
if (PageUptodate(page)) {
+ BUFFER_TRACE(bh, "setting uptodate");
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
continue;
}
if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
(block_start < from || block_end > to)) {
+ BUFFER_TRACE(bh, "reading");
ll_rw_block(READ, 1, &bh);
*wait_bh++=bh;
}
@@ -2010,6 +2066,7 @@ static int __block_prepare_write(struct
memset(kaddr+block_start, 0, bh->b_size);
kunmap_atomic(kaddr, KM_USER0);
set_buffer_uptodate(bh);
+ BUFFER_TRACE(bh, "mark dirty");
mark_buffer_dirty(bh);
}
next_bh:
@@ -2038,6 +2095,7 @@ static int __block_commit_write(struct i
partial = 1;
} else {
set_buffer_uptodate(bh);
+ BUFFER_TRACE(bh, "mark dirty");
mark_buffer_dirty(bh);
}
}
@@ -2364,6 +2422,7 @@ static void end_buffer_read_nobh(struct
set_buffer_uptodate(bh);
} else {
/* This happens, due to failed READA attempts. */
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -2456,6 +2515,7 @@ int nobh_prepare_write(struct page *page
bh->b_data = (char *)(long)block_start;
bh->b_bdev = map_bh.b_bdev;
bh->b_private = NULL;
+ buffer_trace_init(&bh->b_history);
read_bh[nr_reads++] = bh;
}
}
@@ -2563,6 +2623,10 @@ int nobh_writepage(struct page *page, ge
* they may have been added in ext3_writepage(). Make them
* freeable here, so the page does not leak.
*/
+ if (page_has_buffers(page)) {
+ struct buffer_head *b = page_buffers(page);
+ BUFFER_TRACE(b, "EIO");
+ }
#if 0
/* Not really sure about this - do we need this ? */
if (page->mapping->a_ops->invalidatepage)
@@ -2572,6 +2636,11 @@ int nobh_writepage(struct page *page, ge
return 0; /* don't care */
}

+ if (page_has_buffers(page)) {
+ struct buffer_head *b = page_buffers(page);
+ BUFFER_TRACE(b, "partial");
+ }
+
/*
* The page straddles i_size. It must be zeroed out on each and every
* writepage invocation because it may be mmapped. "A file is mapped
@@ -2700,6 +2769,7 @@ int block_truncate_page(struct address_s
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);

+ BUFFER_TRACE(bh, "mark dirty");
mark_buffer_dirty(bh);
err = 0;

@@ -2758,8 +2828,10 @@ sector_t generic_block_bmap(struct addre
{
struct buffer_head tmp;
struct inode *inode = mapping->host;
+
tmp.b_state = 0;
tmp.b_blocknr = 0;
+ tmp.b_page = NULL;
tmp.b_size = 1 << inode->i_blkbits;
get_block(inode, block, &tmp, 0);
return tmp.b_blocknr;
@@ -2787,9 +2859,11 @@ int submit_bh(int rw, struct buffer_head
struct bio *bio;
int ret = 0;

- BUG_ON(!buffer_locked(bh));
- BUG_ON(!buffer_mapped(bh));
- BUG_ON(!bh->b_end_io);
+ BUFFER_TRACE(bh, "enter");
+
+ J_ASSERT_BH(bh, buffer_locked(bh));
+ J_ASSERT_BH(bh, buffer_mapped(bh));
+ J_ASSERT_BH(bh, bh->b_end_io != NULL);

if (buffer_ordered(bh) && (rw == WRITE))
rw = WRITE_BARRIER;
@@ -2801,6 +2875,11 @@ int submit_bh(int rw, struct buffer_head
if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER))
clear_buffer_write_io_error(bh);

+ if (rw == WRITE)
+ BUFFER_TRACE(bh, "write");
+ else
+ BUFFER_TRACE(bh, "read");
+
/*
* from here on down, it's all bio -- do the initial mapping,
* submit_bio -> generic_make_request may further map this bio around
@@ -3005,6 +3084,7 @@ out:

do {
struct buffer_head *next = bh->b_this_page;
+ bh->b_this_page = NULL;
free_buffer_head(bh);
bh = next;
} while (bh != buffers_to_free);
@@ -3090,6 +3170,7 @@ struct buffer_head *alloc_buffer_head(gf
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
put_cpu_var(bh_accounting);
+ buffer_trace_init(&ret->b_history);
}
return ret;
}
@@ -3097,7 +3178,16 @@ EXPORT_SYMBOL(alloc_buffer_head);

void free_buffer_head(struct buffer_head *bh)
{
- BUG_ON(!list_empty(&bh->b_assoc_buffers));
+ J_ASSERT_BH(bh, bh->b_this_page == NULL);
+ J_ASSERT_BH(bh, list_empty(&bh->b_assoc_buffers));
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ if (buffer_jbd(bh)) {
+ J_ASSERT_BH(bh, bh2jh(bh)->b_transaction == 0);
+ J_ASSERT_BH(bh, bh2jh(bh)->b_next_transaction == 0);
+ J_ASSERT_BH(bh, bh2jh(bh)->b_frozen_data == 0);
+ J_ASSERT_BH(bh, bh2jh(bh)->b_committed_data == 0);
+ }
+#endif
kmem_cache_free(bh_cachep, bh);
get_cpu_var(bh_accounting).nr--;
recalc_bh_state();
diff -puN fs/direct-io.c~ext3-debug fs/direct-io.c
--- a/fs/direct-io.c~ext3-debug
+++ a/fs/direct-io.c
@@ -535,6 +535,7 @@ static int get_more_blocks(struct dio *d

map_bh->b_state = 0;
map_bh->b_size = fs_count << dio->inode->i_blkbits;
+ map_bh->b_page = 0;

create = dio->rw & WRITE;
if (dio->lock_type == DIO_LOCKING) {
diff -puN fs/ext3/inode.c~ext3-debug fs/ext3/inode.c
--- a/fs/ext3/inode.c~ext3-debug
+++ a/fs/ext3/inode.c
@@ -1006,6 +1006,7 @@ struct buffer_head *ext3_getblk(handle_t

dummy.b_state = 0;
dummy.b_blocknr = -1000;
+ dummy.b_page = NULL;
buffer_trace_init(&dummy.b_history);
err = ext3_get_blocks_handle(handle, inode, block, 1,
&dummy, create, 1);
@@ -1444,6 +1445,11 @@ static int ext3_ordered_writepage(struct
walk_page_buffers(handle, page_bufs, 0,
PAGE_CACHE_SIZE, NULL, bget_one);

+ {
+ struct buffer_head *b = page_buffers(page);
+ BUFFER_TRACE(b, "call block_write_full_page");
+ }
+
ret = block_write_full_page(page, ext3_get_block, wbc);

/*
@@ -1845,6 +1851,7 @@ static int ext3_block_truncate_page(hand
} else {
if (ext3_should_order_data(inode))
err = ext3_journal_dirty_data(handle, bh);
+ BUFFER_TRACE(bh, "mark dirty");
mark_buffer_dirty(bh);
}

diff -puN fs/ext3/namei.c~ext3-debug fs/ext3/namei.c
--- a/fs/ext3/namei.c~ext3-debug
+++ a/fs/ext3/namei.c
@@ -62,6 +62,7 @@ static struct buffer_head *ext3_append(h
EXT3_I(inode)->i_disksize = inode->i_size;
ext3_journal_get_write_access(handle,bh);
}
+ BUFFER_TRACE(bh, "exit");
return bh;
}

@@ -342,6 +343,7 @@ dx_probe(struct dentry *dentry, struct i
dir = dentry->d_parent->d_inode;
if (!(bh = ext3_bread (NULL,dir, 0, 0, err)))
goto fail;
+ BUFFER_TRACE(bh, "bread it");
root = (struct dx_root *) bh->b_data;
if (root->info.hash_version != DX_HASH_TEA &&
root->info.hash_version != DX_HASH_HALF_MD4 &&
@@ -416,18 +418,21 @@ dx_probe(struct dentry *dentry, struct i

at = p - 1;
dxtrace(printk(" %x->%u\n", at == entries? 0: dx_get_hash(at), dx_get_block(at)));
+ BUFFER_TRACE(bh, "add to frame");
frame->bh = bh;
frame->entries = entries;
frame->at = at;
if (!indirect--) return frame;
if (!(bh = ext3_bread (NULL,dir, dx_get_block(at), 0, err)))
goto fail2;
+ BUFFER_TRACE(bh, "read it");
at = entries = ((struct dx_node *) bh->b_data)->entries;
assert (dx_get_limit(entries) == dx_node_limit (dir));
frame++;
}
fail2:
while (frame >= frame_in) {
+ BUFFER_TRACE(bh, "brelse it");
brelse(frame->bh);
frame--;
}
@@ -440,8 +445,11 @@ static void dx_release (struct dx_frame
if (frames[0].bh == NULL)
return;

- if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels)
+ if (((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels) {
+ BUFFER_TRACE(frames[1].bh, "brelse it");
brelse(frames[1].bh);
+ }
+ BUFFER_TRACE(frames[0].bh, "brelse it");
brelse(frames[0].bh);
}

@@ -965,6 +973,8 @@ static struct buffer_head * ext3_dx_find
dx_release (frames);
return bh;
}
+ BUFFER_TRACE(bh, "brelse it");
+ BUFFER_TRACE(bh, "brelse it");
brelse (bh);
/* Check to see if we should continue to search */
retval = ext3_htree_next_block(dir, hash, frame,
@@ -999,6 +1009,7 @@ static struct dentry *ext3_lookup(struct
inode = NULL;
if (bh) {
unsigned long ino = le32_to_cpu(de->inode);
+ BUFFER_TRACE(bh, "brelse it");
brelse (bh);
if (!ext3_valid_inum(dir->i_sb, ino)) {
ext3_error(dir->i_sb, "ext3_lookup",
@@ -1128,16 +1139,20 @@ static struct ext3_dir_entry_2 *do_split

bh2 = ext3_append (handle, dir, &newblock, error);
if (!(bh2)) {
+ BUFFER_TRACE(*bh, "brelse it");
brelse(*bh);
*bh = NULL;
goto errout;
}

+ BUFFER_TRACE(bh2, "using it");
BUFFER_TRACE(*bh, "get_write_access");
err = ext3_journal_get_write_access(handle, *bh);
if (err) {
journal_error:
+ BUFFER_TRACE(*bh, "brelse");
brelse(*bh);
+ BUFFER_TRACE(bh2, "brelse");
brelse(bh2);
*bh = NULL;
ext3_std_error(dir->i_sb, err);
@@ -1173,6 +1188,8 @@ static struct ext3_dir_entry_2 *do_split
/* Which block gets the new entry? */
if (hinfo->hash >= hash2)
{
+ BUFFER_TRACE(*bh, "swapping");
+ BUFFER_TRACE(bh2, "swapping");
swap(*bh, bh2);
de = de2;
}
@@ -1181,8 +1198,11 @@ static struct ext3_dir_entry_2 *do_split
if (err)
goto journal_error;
err = ext3_journal_dirty_metadata (handle, frame->bh);
- if (err)
+ if (err) {
+ BUFFER_TRACE(bh2, "error");
goto journal_error;
+ }
+ BUFFER_TRACE(*bh, "brelse");
brelse (bh2);
dxtrace(dx_show_index ("frame", frame->entries));
errout:
@@ -1225,6 +1245,7 @@ static int add_dirent_to_buf(handle_t *h
return -EIO;
}
if (ext3_match (namelen, name, de)) {
+ BUFFER_TRACE(bh, "brelse");
brelse (bh);
return -EEXIST;
}
@@ -1282,6 +1303,7 @@ static int add_dirent_to_buf(handle_t *h
err = ext3_journal_dirty_metadata(handle, bh);
if (err)
ext3_std_error(dir->i_sb, err);
+ BUFFER_TRACE(bh, "brelse");
brelse(bh);
return 0;
}
@@ -1312,6 +1334,7 @@ static int make_indexed_dir(handle_t *ha

blocksize = dir->i_sb->s_blocksize;
dxtrace(printk("Creating index\n"));
+ BUFFER_TRACE(bh, "get write access");
retval = ext3_journal_get_write_access(handle, bh);
if (retval) {
ext3_std_error(dir->i_sb, retval);
@@ -1356,6 +1379,7 @@ static int make_indexed_dir(handle_t *ha
frame = frames;
frame->entries = entries;
frame->at = entries;
+ BUFFER_TRACE(bh, "adding to frame");
frame->bh = bh;
bh = bh2;
de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
@@ -1411,6 +1435,7 @@ static int ext3_add_entry (handle_t *han
bh = ext3_bread(handle, dir, block, 0, &retval);
if(!bh)
return retval;
+ BUFFER_TRACE(bh, "read it");
retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
if (retval != -ENOSPC)
return retval;
@@ -1425,6 +1450,7 @@ static int ext3_add_entry (handle_t *han
bh = ext3_append(handle, dir, &block, &retval);
if (!bh)
return retval;
+ BUFFER_TRACE(bh, "append returned it");
de = (struct ext3_dir_entry_2 *) bh->b_data;
de->inode = 0;
de->rec_len = cpu_to_le16(blocksize);
@@ -1489,6 +1515,7 @@ static int ext3_dx_add_entry(handle_t *h
bh2 = ext3_append (handle, dir, &newblock, &err);
if (!(bh2))
goto cleanup;
+ BUFFER_TRACE(bh2, "append returned it");
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
@@ -1518,6 +1545,8 @@ static int ext3_dx_add_entry(handle_t *h
if (at - entries >= icount1) {
frame->at = at = at - entries - icount1 + entries2;
frame->entries = entries = entries2;
+ BUFFER_TRACE(frame->bh, "swap");
+ BUFFER_TRACE(bh2, "swap");
swap(frame->bh, bh2);
}
dx_insert_block (frames + 0, hash2, newblock);
@@ -1527,6 +1556,7 @@ static int ext3_dx_add_entry(handle_t *h
err = ext3_journal_dirty_metadata(handle, bh2);
if (err)
goto journal_error;
+ BUFFER_TRACE(bh2, "brelse");
brelse (bh2);
} else {
dxtrace(printk("Creating second level index...\n"));
@@ -1543,6 +1573,8 @@ static int ext3_dx_add_entry(handle_t *h
frame = frames + 1;
frame->at = at = at - entries + entries2;
frame->entries = entries = entries2;
+ BUFFER_TRACE(frame->bh, "overwrite");
+ BUFFER_TRACE(bh2, "add to frame");
frame->bh = bh2;
err = ext3_journal_get_write_access(handle,
frame->bh);
@@ -1561,8 +1593,10 @@ static int ext3_dx_add_entry(handle_t *h
journal_error:
ext3_std_error(dir->i_sb, err);
cleanup:
- if (bh)
+ if (bh) {
+ BUFFER_TRACE(bh, "brelse");
brelse(bh);
+ }
dx_release(frames);
return err;
}
diff -puN fs/ext3/super.c~ext3-debug fs/ext3/super.c
--- a/fs/ext3/super.c~ext3-debug
+++ a/fs/ext3/super.c
@@ -42,6 +42,10 @@
#include "acl.h"
#include "namei.h"

+#ifdef CONFIG_JBD_DEBUG
+static int ext3_ro_after; /* Make fs read-only after this many jiffies */
+#endif
+
static int ext3_load_journal(struct super_block *, struct ext3_super_block *,
unsigned long journal_devnum);
static int ext3_create_journal(struct super_block *, struct ext3_super_block *,
@@ -62,6 +66,65 @@ static void ext3_unlockfs(struct super_b
static void ext3_write_super (struct super_block * sb);
static void ext3_write_super_lockfs(struct super_block *sb);

+#ifdef CONFIG_JBD_DEBUG
+struct block_device *journal_no_write[2];
+
+/*
+ * Debug code for turning filesystems "read-only" after a specified
+ * amount of time. This is for crash/recovery testing.
+ */
+
+static void make_rdonly(struct block_device *bdev,
+ struct block_device **no_write)
+{
+ char b[BDEVNAME_SIZE];
+
+ if (bdev) {
+ printk(KERN_WARNING "Turning device %s read-only\n",
+ bdevname(bdev, b));
+ *no_write = bdev;
+ }
+}
+
+static void turn_fs_readonly(unsigned long arg)
+{
+ struct super_block *sb = (struct super_block *)arg;
+
+ make_rdonly(sb->s_bdev, &journal_no_write[0]);
+ make_rdonly(EXT3_SB(sb)->s_journal->j_dev, &journal_no_write[1]);
+ wake_up(&EXT3_SB(sb)->ro_wait_queue);
+}
+
+static void setup_ro_after(struct super_block *sb)
+{
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ init_timer(&sbi->turn_ro_timer);
+ if (ext3_ro_after) {
+ printk(KERN_DEBUG "fs will go read-only in %d jiffies\n",
+ ext3_ro_after);
+ init_waitqueue_head(&sbi->ro_wait_queue);
+ journal_no_write[0] = NULL;
+ journal_no_write[1] = NULL;
+ sbi->turn_ro_timer.function = turn_fs_readonly;
+ sbi->turn_ro_timer.data = (unsigned long)sb;
+ sbi->turn_ro_timer.expires = jiffies + ext3_ro_after;
+ ext3_ro_after = 0;
+ add_timer(&sbi->turn_ro_timer);
+ }
+}
+
+static void clear_ro_after(struct super_block *sb)
+{
+ del_timer_sync(&EXT3_SB(sb)->turn_ro_timer);
+ journal_no_write[0] = NULL;
+ journal_no_write[1] = NULL;
+ ext3_ro_after = 0;
+}
+#else
+#define setup_ro_after(sb) do {} while (0)
+#define clear_ro_after(sb) do {} while (0)
+#endif
+
/*
* Wrappers for journal_start/end.
*
@@ -430,6 +493,7 @@ static void ext3_put_super (struct super
invalidate_bdev(sbi->journal_bdev, 0);
ext3_blkdev_remove(sbi);
}
+ clear_ro_after(sb);
sb->s_fs_info = NULL;
kfree(sbi);
return;
@@ -628,6 +692,7 @@ enum {
Opt_bsd_df, Opt_minix_df, Opt_grpid, Opt_nogrpid,
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
Opt_nouid32, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
+ Opt_ro_after,
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
@@ -657,6 +722,7 @@ static match_table_t tokens = {
{Opt_debug, "debug"},
{Opt_oldalloc, "oldalloc"},
{Opt_orlov, "orlov"},
+ {Opt_ro_after, "ro_after"},
{Opt_user_xattr, "user_xattr"},
{Opt_nouser_xattr, "nouser_xattr"},
{Opt_acl, "acl"},
@@ -790,6 +856,13 @@ static int parse_options (char *options,
case Opt_orlov:
clear_opt (sbi->s_mount_opt, OLDALLOC);
break;
+#ifdef CONFIG_JBD_DEBUG
+ case Opt_ro_after:
+ if (match_int(&args[0], &option))
+ return 0;
+ ext3_ro_after = option;
+ break;
+#endif
#ifdef CONFIG_EXT3_FS_XATTR
case Opt_user_xattr:
set_opt (sbi->s_mount_opt, XATTR_USER);
@@ -1125,6 +1198,7 @@ static int ext3_setup_super(struct super
} else {
printk("internal journal\n");
}
+ setup_ro_after(sb);
return res;
}

@@ -1359,6 +1433,9 @@ static int ext3_fill_super (struct super
int needs_recovery;
__le32 features;

+#ifdef CONFIG_JBD_DEBUG
+ ext3_ro_after = 0;
+#endif
sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
return -ENOMEM;
@@ -1367,6 +1444,7 @@ static int ext3_fill_super (struct super
sbi->s_mount_opt = 0;
sbi->s_resuid = EXT3_DEF_RESUID;
sbi->s_resgid = EXT3_DEF_RESGID;
+ setup_ro_after(sb);

unlock_kernel();

@@ -2109,6 +2187,8 @@ static void ext3_clear_journal_err(struc

journal = EXT3_SB(sb)->s_journal;

+ clear_ro_after(sb);
+
/*
* Now check for any error status which may have been recorded in the
* journal by a prior ext3_error() or ext3_abort()
@@ -2175,6 +2255,7 @@ static int ext3_sync_fs(struct super_blo
if (wait)
log_wait_commit(EXT3_SB(sb)->s_journal, target);
}
+ setup_ro_after(sb);
return 0;
}

diff -puN fs/ext3/xattr.c~ext3-debug fs/ext3/xattr.c
--- a/fs/ext3/xattr.c~ext3-debug
+++ a/fs/ext3/xattr.c
@@ -482,12 +482,14 @@ ext3_xattr_release_block(handle_t *handl
ce = mb_cache_entry_get(ext3_xattr_cache, bh->b_bdev, bh->b_blocknr);
if (BHDR(bh)->h_refcount == cpu_to_le32(1)) {
ea_bdebug(bh, "refcount now=0; freeing");
+ BUFFER_TRACE(bh, "xattr delete");
if (ce)
mb_cache_entry_free(ce);
ext3_free_blocks(handle, inode, bh->b_blocknr, 1);
get_bh(bh);
ext3_forget(handle, 1, inode, bh, bh->b_blocknr);
} else {
+ BUFFER_TRACE(bh, "xattr deref");
if (ext3_journal_get_write_access(handle, bh) == 0) {
lock_buffer(bh);
BHDR(bh)->h_refcount = cpu_to_le32(
@@ -758,6 +760,7 @@ ext3_xattr_block_set(handle_t *handle, s
inserted:
if (!IS_LAST_ENTRY(s->first)) {
new_bh = ext3_xattr_cache_find(inode, header(s->base), &ce);
+ BUFFER_TRACE2(new_bh, bs->bh, "xattr cache looked up");
if (new_bh) {
/* We found an identical block in the cache. */
if (new_bh == bs->bh)
@@ -803,6 +806,7 @@ inserted:
ea_idebug(inode, "creating block %d", block);

new_bh = sb_getblk(sb, block);
+ BUFFER_TRACE(new_bh, "new xattr block");
if (!new_bh) {
getblk_failed:
ext3_free_blocks(handle, inode, block, 1);
diff -puN fs/jbd/commit.c~ext3-debug fs/jbd/commit.c
--- a/fs/jbd/commit.c~ext3-debug
+++ a/fs/jbd/commit.c
@@ -28,10 +28,12 @@
static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
{
BUFFER_TRACE(bh, "");
- if (uptodate)
+ if (uptodate) {
set_buffer_uptodate(bh);
- else
+ } else {
+ BUFFER_TRACE(bh, "clear uptodate");
clear_buffer_uptodate(bh);
+ }
unlock_buffer(bh);
}

@@ -339,7 +341,7 @@ write_out_data:
BUFFER_TRACE(bh, "locked");
if (!inverted_lock(journal, bh))
goto write_out_data;
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);
jbd_unlock_bh_state(bh);
@@ -365,7 +367,7 @@ write_out_data:
BUFFER_TRACE(bh, "writeout complete: unfile");
if (!inverted_lock(journal, bh))
goto write_out_data;
- __journal_unfile_buffer(jh);
+ __journal_unfile_buffer(journal, jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
put_bh(bh);
@@ -406,7 +408,7 @@ write_out_data:
continue;
}
if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
- __journal_unfile_buffer(jh);
+ __journal_unfile_buffer(journal, jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
put_bh(bh);
@@ -778,6 +780,7 @@ restart_loop:
* behind for writeback and gets reallocated for another
* use in a different page. */
if (buffer_freed(bh)) {
+ BUFFER_TRACE(bh, "buffer_freed");
clear_buffer_freed(bh);
clear_buffer_jbddirty(bh);
}
@@ -786,9 +789,11 @@ restart_loop:
JBUFFER_TRACE(jh, "add to new checkpointing trans");
__journal_insert_checkpoint(jh, commit_transaction);
JBUFFER_TRACE(jh, "refile for checkpoint writeback");
- __journal_refile_buffer(jh);
+ __journal_refile_buffer(journal, jh);
+ JBUFFER_TRACE(jh, "did writepage hack");
jbd_unlock_bh_state(bh);
} else {
+ BUFFER_TRACE(bh, "not jbddirty");
J_ASSERT_BH(bh, !buffer_dirty(bh));
/* The buffer on BJ_Forget list and not jbddirty means
* it has been freed by this transaction and hence it
@@ -798,7 +803,7 @@ restart_loop:
* disk and before we process the buffer on BJ_Forget
* list. */
JBUFFER_TRACE(jh, "refile or unfile freed buffer");
- __journal_refile_buffer(jh);
+ __journal_refile_buffer(journal, jh);
if (!jh->b_transaction) {
jbd_unlock_bh_state(bh);
/* needs a brelse */
diff -puN /dev/null fs/jbd-kernel.c
--- /dev/null
+++ a/fs/jbd-kernel.c
@@ -0,0 +1,256 @@
+/*
+ * fs/jbd-kernel.c
+ *
+ * Support code for the Journalling Block Device layer.
+ * This file contains things which have to be in-kernel when
+ * JBD is a module.
+ *
+ * 15 May 2001 Andrew Morton <[email protected]>
+ * Created
+ */
+
+#include <linux/config.h>
+#include <linux/fs.h>
+#include <linux/jbd.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+/*
+ * Some sanity testing which is called from mark_buffer_clean(),
+ * and must be present in the main kernel.
+ */
+
+void jbd_preclean_buffer_check(struct buffer_head *bh)
+{
+ if (buffer_jbd(bh)) {
+ struct journal_head *jh = bh2jh(bh);
+
+ transaction_t *transaction = jh->b_transaction;
+ journal_t *journal;
+
+ if (jh->b_jlist == 0 && transaction == NULL)
+ return;
+
+ J_ASSERT_JH(jh, transaction != NULL);
+ /* The kernel may be unmapping old data. We expect it
+ * to be dirty in that case, unless the buffer has
+ * already been forgotten by a transaction. */
+ if (jh->b_jlist != BJ_Forget) {
+#if 1
+ if (!buffer_dirty(bh)) {
+ printk("%s: clean of clean buffer\n",
+ __FUNCTION__);
+ print_buffer_trace(bh);
+ return;
+ }
+#endif
+ J_ASSERT_BH(bh, buffer_dirty(bh));
+ }
+
+ journal = transaction->t_journal;
+ J_ASSERT_JH(jh,
+ transaction == journal->j_running_transaction ||
+ transaction == journal->j_committing_transaction);
+ }
+}
+EXPORT_SYMBOL(jbd_preclean_buffer_check);
+
+/*
+ * Support functions for BUFFER_TRACE()
+ */
+
+static spinlock_t trace_lock = SPIN_LOCK_UNLOCKED;
+
+void buffer_trace(const char *function, struct buffer_head *dest,
+ struct buffer_head *src, char *info)
+{
+ struct buffer_history_item *bhist_i;
+ struct page *page;
+ unsigned long flags;
+
+ if (dest == 0 || src == 0)
+ return;
+
+ spin_lock_irqsave(&trace_lock, flags);
+
+ /*
+ * Sometimes we don't initialise the ring pointers. (locally declared
+ * temp buffer_heads). Feebly attempt to detect and correct that here.
+ */
+ if ((dest->b_history.b_history_head - dest->b_history.b_history_tail >
+ BUFFER_HISTORY_SIZE)) {
+ dest->b_history.b_history_head = 0;
+ dest->b_history.b_history_tail = 0;
+ }
+ bhist_i = dest->b_history.b +
+ (dest->b_history.b_history_head & (BUFFER_HISTORY_SIZE - 1));
+ bhist_i->function = function;
+ bhist_i->info = info;
+ bhist_i->b_state = src->b_state;
+ page = src->b_page;
+ if (page)
+ bhist_i->pg_dirty = !!PageDirty(page);
+ else
+ bhist_i->pg_dirty = 0;
+
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ bhist_i->b_trans_is_running = 0;
+ bhist_i->b_trans_is_committing = 0;
+ bhist_i->b_blocknr = src->b_blocknr;
+ if (buffer_jbd(src)) {
+ struct journal_head *jh;
+ journal_t *journal;
+ transaction_t *transaction;
+
+ /* Footwork to avoid racing with journal_remove_journal_head */
+ jh = src->b_private;
+ if (jh == 0)
+ goto raced;
+ transaction = jh->b_transaction;
+ if (src->b_private == 0)
+ goto raced;
+ bhist_i->b_jcount = jh->b_jcount;
+ bhist_i->b_jbd = 1;
+ bhist_i->b_jlist = jh->b_jlist;
+ bhist_i->b_frozen_data = jh->b_frozen_data;
+ bhist_i->b_committed_data = jh->b_committed_data;
+ bhist_i->b_transaction = !!jh->b_transaction;
+ bhist_i->b_next_transaction = !!jh->b_next_transaction;
+ bhist_i->b_cp_transaction = !!jh->b_cp_transaction;
+
+ if (transaction) {
+ journal = transaction->t_journal;
+ bhist_i->b_trans_is_running = transaction ==
+ journal->j_running_transaction;
+ bhist_i->b_trans_is_committing = transaction ==
+ journal->j_committing_transaction;
+ }
+ } else {
+raced:
+ bhist_i->b_jcount = 0;
+ bhist_i->b_jbd = 0;
+ bhist_i->b_jlist = 0;
+ bhist_i->b_frozen_data = 0;
+ bhist_i->b_committed_data = 0;
+ bhist_i->b_transaction = 0;
+ bhist_i->b_next_transaction = 0;
+ bhist_i->b_cp_transaction = 0;
+ }
+#endif /* defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE) */
+
+ bhist_i->cpu = smp_processor_id();
+ bhist_i->b_count = atomic_read(&src->b_count);
+
+ dest->b_history.b_history_head++;
+ if (dest->b_history.b_history_head - dest->b_history.b_history_tail >
+ BUFFER_HISTORY_SIZE)
+ dest->b_history.b_history_tail =
+ dest->b_history.b_history_head - BUFFER_HISTORY_SIZE;
+
+ spin_unlock_irqrestore(&trace_lock, flags);
+}
+
+static const char *b_jlist_to_string(unsigned int b_list)
+{
+ switch (b_list) {
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ case BJ_None: return "BJ_None";
+ case BJ_SyncData: return "BJ_SyncData";
+ case BJ_Metadata: return "BJ_Metadata";
+ case BJ_Forget: return "BJ_Forget";
+ case BJ_IO: return "BJ_IO";
+ case BJ_Shadow: return "BJ_Shadow";
+ case BJ_LogCtl: return "BJ_LogCtl";
+ case BJ_Reserved: return "BJ_Reserved";
+ case BJ_Locked: return "BJ_Locked";
+#endif
+ default: return "Bad b_jlist";
+ }
+}
+
+static void print_one_hist(struct buffer_history_item *bhist_i)
+{
+ printk(" %s():%s\n", bhist_i->function, bhist_i->info);
+ printk(" b_state:0x%lx b_jlist:%s cpu:%d b_count:%d b_blocknr:%lu\n",
+ bhist_i->b_state,
+ b_jlist_to_string(bhist_i->b_jlist),
+ bhist_i->cpu,
+ bhist_i->b_count,
+ bhist_i->b_blocknr);
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ printk(" b_jbd:%u b_frozen_data:%p b_committed_data:%p\n",
+ bhist_i->b_jbd,
+ bhist_i->b_frozen_data,
+ bhist_i->b_committed_data);
+ printk(" b_transaction:%u b_next_transaction:%u "
+ "b_cp_transaction:%u b_trans_is_running:%u\n",
+ bhist_i->b_transaction,
+ bhist_i->b_next_transaction,
+ bhist_i->b_cp_transaction,
+ bhist_i->b_trans_is_running);
+ printk(" b_trans_is_comitting:%u b_jcount:%u pg_dirty:%u",
+ bhist_i->b_trans_is_committing,
+ bhist_i->b_jcount,
+ bhist_i->pg_dirty);
+#endif
+ printk("\n");
+}
+
+void print_buffer_fields(struct buffer_head *bh)
+{
+ printk("b_blocknr:%llu b_count:%d\n",
+ (unsigned long long)bh->b_blocknr, atomic_read(&bh->b_count));
+ printk("b_this_page:%p b_data:%p b_page:%p\n",
+ bh->b_this_page, bh->b_data, bh->b_page);
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ if (buffer_jbd(bh)) {
+ struct journal_head *jh = bh2jh(bh);
+
+ printk("b_jlist:%u b_frozen_data:%p b_committed_data:%p\n",
+ jh->b_jlist, jh->b_frozen_data, jh->b_committed_data);
+ printk(" b_transaction:%p b_next_transaction:%p "
+ "b_cp_transaction:%p\n",
+ jh->b_transaction, jh->b_next_transaction,
+ jh->b_cp_transaction);
+ printk("b_cpnext:%p b_cpprev:%p\n",
+ jh->b_cpnext, jh->b_cpprev);
+ }
+#endif
+}
+
+void print_buffer_trace(struct buffer_head *bh)
+{
+ unsigned long idx, count;
+ unsigned long flags;
+
+ printk("buffer trace for buffer at 0x%p (I am CPU %d)\n",
+ bh, smp_processor_id());
+ BUFFER_TRACE(bh, ""); /* Record state now */
+
+ spin_lock_irqsave(&trace_lock, flags);
+ for ( idx = bh->b_history.b_history_tail, count = 0;
+ idx < bh->b_history.b_history_head &&
+ count < BUFFER_HISTORY_SIZE;
+ idx++, count++)
+ print_one_hist(bh->b_history.b +
+ (idx & (BUFFER_HISTORY_SIZE - 1)));
+
+ print_buffer_fields(bh);
+ spin_unlock_irqrestore(&trace_lock, flags);
+ dump_stack();
+ printk("\n");
+}
+
+static struct buffer_head *failed_buffer_head; /* For access with debuggers */
+
+void buffer_assertion_failure(struct buffer_head *bh)
+{
+ console_verbose();
+ failed_buffer_head = bh;
+ print_buffer_trace(bh);
+}
+EXPORT_SYMBOL(buffer_trace);
+EXPORT_SYMBOL(print_buffer_trace);
+EXPORT_SYMBOL(buffer_assertion_failure);
+EXPORT_SYMBOL(print_buffer_fields);
diff -puN fs/jbd/transaction.c~ext3-debug fs/jbd/transaction.c
--- a/fs/jbd/transaction.c~ext3-debug
+++ a/fs/jbd/transaction.c
@@ -1035,7 +1035,7 @@ int journal_dirty_data(handle_t *handle,
/* journal_clean_data_list() may have got there first */
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
/* It still points to the committing
* transaction; move it to this one so
* that the refile assert checks are
@@ -1054,7 +1054,7 @@ int journal_dirty_data(handle_t *handle,
if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "not on correct data list: unfile");
J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
jh->b_transaction = handle->h_transaction;
JBUFFER_TRACE(jh, "file as data");
__journal_file_buffer(jh, handle->h_transaction,
@@ -1250,10 +1250,10 @@ int journal_forget (handle_t *handle, st
*/

if (jh->b_cp_transaction) {
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
__journal_file_buffer(jh, transaction, BJ_Forget);
} else {
- __journal_unfile_buffer(jh);
+ __journal_unfile_buffer(journal, jh);
journal_remove_journal_head(bh);
__brelse(bh);
if (!buffer_jbd(bh)) {
@@ -1442,6 +1442,8 @@ int journal_force_commit(journal_t *jour
static inline void
__blist_add_buffer(struct journal_head **list, struct journal_head *jh)
{
+ J_ASSERT_JH(jh, jh->b_tnext == NULL);
+ J_ASSERT_JH(jh, jh->b_tprev == NULL);
if (!*list) {
jh->b_tnext = jh->b_tprev = jh;
*list = jh;
@@ -1466,6 +1468,8 @@ __blist_add_buffer(struct journal_head *
static inline void
__blist_del_buffer(struct journal_head **list, struct journal_head *jh)
{
+ J_ASSERT_JH(jh, jh->b_tnext != NULL);
+ J_ASSERT_JH(jh, jh->b_tprev != NULL);
if (*list == jh) {
*list = jh->b_tnext;
if (*list == jh)
@@ -1473,6 +1477,8 @@ __blist_del_buffer(struct journal_head *
}
jh->b_tprev->b_tnext = jh->b_tnext;
jh->b_tnext->b_tprev = jh->b_tprev;
+ jh->b_tprev = NULL;
+ jh->b_tnext = NULL;
}

/*
@@ -1486,16 +1492,19 @@ __blist_del_buffer(struct journal_head *
*
* Called under j_list_lock. The journal may not be locked.
*/
-void __journal_temp_unlink_buffer(struct journal_head *jh)
+void __journal_temp_unlink_buffer(journal_t *journal, struct journal_head *jh)
{
struct journal_head **list = NULL;
transaction_t *transaction;
struct buffer_head *bh = jh2bh(jh);

+ JBUFFER_TRACE(jh, "entry");
J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
transaction = jh->b_transaction;
- if (transaction)
+ if (transaction) {
+ J_ASSERT_JH(jh, jh->b_transaction->t_journal == journal);
assert_spin_locked(&transaction->t_journal->j_list_lock);
+ }

J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
if (jh->b_jlist != BJ_None)
@@ -1538,9 +1547,9 @@ void __journal_temp_unlink_buffer(struct
mark_buffer_dirty(bh); /* Expose it to the VM */
}

-void __journal_unfile_buffer(struct journal_head *jh)
+void __journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
{
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
jh->b_transaction = NULL;
}

@@ -1548,7 +1557,7 @@ void journal_unfile_buffer(journal_t *jo
{
jbd_lock_bh_state(jh2bh(jh));
spin_lock(&journal->j_list_lock);
- __journal_unfile_buffer(jh);
+ __journal_unfile_buffer(journal, jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(jh2bh(jh));
}
@@ -1576,7 +1585,7 @@ __journal_try_to_free_buffer(journal_t *
if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
/* A written-back ordered data buffer */
JBUFFER_TRACE(jh, "release data");
- __journal_unfile_buffer(jh);
+ __journal_unfile_buffer(journal, jh);
journal_remove_journal_head(bh);
__brelse(bh);
}
@@ -1681,7 +1690,8 @@ static int __dispose_buffer(struct journ
int may_free = 1;
struct buffer_head *bh = jh2bh(jh);

- __journal_unfile_buffer(jh);
+ JBUFFER_TRACE(jh, "unfile");
+ __journal_unfile_buffer(transaction->t_journal, jh);

if (jh->b_cp_transaction) {
JBUFFER_TRACE(jh, "on running+cp transaction");
@@ -1934,6 +1944,7 @@ void __journal_file_buffer(struct journa
int was_dirty = 0;
struct buffer_head *bh = jh2bh(jh);

+ BUFFER_TRACE(bh, "entry");
J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
assert_spin_locked(&transaction->t_journal->j_list_lock);

@@ -1956,7 +1967,7 @@ void __journal_file_buffer(struct journa
}

if (jh->b_transaction)
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(transaction->t_journal, jh);
jh->b_transaction = transaction;

switch (jlist) {
@@ -1996,6 +2007,7 @@ void __journal_file_buffer(struct journa

if (was_dirty)
set_buffer_jbddirty(bh);
+ BUFFER_TRACE(bh, "filed");
}

void journal_file_buffer(struct journal_head *jh,
@@ -2018,18 +2030,20 @@ void journal_file_buffer(struct journal_
*
* Called under jbd_lock_bh_state(jh2bh(jh))
*/
-void __journal_refile_buffer(struct journal_head *jh)
+void __journal_refile_buffer(journal_t *journal, struct journal_head *jh)
{
int was_dirty;
struct buffer_head *bh = jh2bh(jh);

+ JBUFFER_TRACE(jh, "entry");
J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
if (jh->b_transaction)
assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock);

/* If the buffer is now unused, just drop it. */
if (jh->b_next_transaction == NULL) {
- __journal_unfile_buffer(jh);
+ BUFFER_TRACE(bh, "unfile");
+ __journal_unfile_buffer(journal, jh);
return;
}

@@ -2039,7 +2053,7 @@ void __journal_refile_buffer(struct jour
*/

was_dirty = test_clear_buffer_jbddirty(bh);
- __journal_temp_unlink_buffer(jh);
+ __journal_temp_unlink_buffer(journal, jh);
jh->b_transaction = jh->b_next_transaction;
jh->b_next_transaction = NULL;
__journal_file_buffer(jh, jh->b_transaction,
@@ -2071,7 +2085,7 @@ void journal_refile_buffer(journal_t *jo
jbd_lock_bh_state(bh);
spin_lock(&journal->j_list_lock);

- __journal_refile_buffer(jh);
+ __journal_refile_buffer(journal, jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);

diff -puN fs/Kconfig~ext3-debug fs/Kconfig
--- a/fs/Kconfig~ext3-debug
+++ a/fs/Kconfig
@@ -170,6 +170,10 @@ config JBD_DEBUG
generated. To turn debugging off again, do
"echo 0 > /proc/sys/fs/jbd-debug".

+config BUFFER_DEBUG
+ bool "buffer-layer tracing"
+ depends on JBD
+
config FS_MBCACHE
# Meta block cache for Extended Attributes (ext2/ext3)
tristate
diff -puN fs/Makefile~ext3-debug fs/Makefile
--- a/fs/Makefile~ext3-debug
+++ a/fs/Makefile
@@ -24,6 +24,8 @@ obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout
obj-$(CONFIG_BINFMT_EM86) += binfmt_em86.o
obj-$(CONFIG_BINFMT_MISC) += binfmt_misc.o

+obj-$(CONFIG_BUFFER_DEBUG) += jbd-kernel.o
+
# binfmt_script is always there
obj-y += binfmt_script.o

diff -puN fs/mpage.c~ext3-debug fs/mpage.c
diff -puN include/linux/buffer_head.h~ext3-debug include/linux/buffer_head.h
--- a/include/linux/buffer_head.h~ext3-debug
+++ a/include/linux/buffer_head.h
@@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/wait.h>
#include <asm/atomic.h>
+#include <linux/buffer-trace.h>

enum bh_state_bits {
BH_Uptodate, /* Contains valid data */
@@ -68,6 +69,9 @@ struct buffer_head {
void *b_private; /* reserved for b_end_io */
struct list_head b_assoc_buffers; /* associated with another mapping */
atomic_t b_count; /* users using this buffer_head */
+#ifdef CONFIG_BUFFER_DEBUG
+ struct buffer_history b_history;
+#endif
};

/*
diff -puN /dev/null include/linux/buffer-trace.h
--- /dev/null
+++ a/include/linux/buffer-trace.h
@@ -0,0 +1,85 @@
+/*
+ * include/linux/buffer-trace.h
+ *
+ * Debugging support for recording buffer_head state transitions
+ *
+ * May 2001, akpm
+ * Created
+ */
+
+#ifndef BUFFER_TRACE_H_INCLUDED
+#define BUFFER_TRACE_H_INCLUDED
+
+#include <linux/config.h>
+
+#ifdef CONFIG_BUFFER_DEBUG
+
+/* The number of records per buffer_head. Must be a power of two */
+#define BUFFER_HISTORY_SIZE 32
+
+struct buffer_head;
+
+/* This gets embedded in struct buffer_head */
+struct buffer_history {
+ struct buffer_history_item {
+ const char *function;
+ char *info;
+ unsigned long b_state;
+ unsigned b_list:3;
+ unsigned b_jlist:4;
+ unsigned pg_dirty:1;
+ unsigned cpu:3;
+ unsigned b_count:8;
+ unsigned long b_blocknr; /* For src != dest */
+#if defined(CONFIG_JBD) || defined(CONFIG_JBD_MODULE)
+ unsigned b_jcount:4;
+ unsigned b_jbd:1;
+ unsigned b_transaction:1;
+ unsigned b_next_transaction:1;
+ unsigned b_cp_transaction:1;
+ unsigned b_trans_is_running:1;
+ unsigned b_trans_is_committing:1;
+ void *b_frozen_data;
+ void *b_committed_data;
+#endif
+ } b[BUFFER_HISTORY_SIZE];
+ unsigned long b_history_head; /* Next place to write */
+ unsigned long b_history_tail; /* Oldest valid entry */
+};
+
+static inline void buffer_trace_init(struct buffer_history *bhist)
+{
+ bhist->b_history_head = 0;
+ bhist->b_history_tail = 0;
+}
+extern void buffer_trace(const char *function, struct buffer_head *dest,
+ struct buffer_head *src, char *info);
+extern void print_buffer_fields(struct buffer_head *bh);
+extern void print_buffer_trace(struct buffer_head *bh);
+
+#define BUFFER_STRINGIFY2(X) #X
+#define BUFFER_STRINGIFY(X) BUFFER_STRINGIFY2(X)
+
+#define BUFFER_TRACE2(dest, src, info) \
+ do { \
+ buffer_trace(__FUNCTION__, (dest), (src), \
+ "["__FILE__":" \
+ BUFFER_STRINGIFY(__LINE__)"] " info); \
+ } while (0)
+
+#define BUFFER_TRACE(bh, info) BUFFER_TRACE2(bh, bh, info)
+#define JBUFFER_TRACE(jh, info) BUFFER_TRACE(jh2bh(jh), info)
+
+#else /* CONFIG_BUFFER_DEBUG */
+
+#define buffer_trace_init(bh) do {} while (0)
+#define print_buffer_fields(bh) do {} while (0)
+#define print_buffer_trace(bh) do {} while (0)
+#define BUFFER_TRACE(bh, info) do {} while (0)
+#define BUFFER_TRACE2(bh, bh2, info) do {} while (0)
+#define JBUFFER_TRACE(jh, info) do {} while (0)
+
+#endif /* CONFIG_BUFFER_DEBUG */
+
+#endif /* BUFFER_TRACE_H_INCLUDED */
+
diff -puN include/linux/jbd.h~ext3-debug include/linux/jbd.h
--- a/include/linux/jbd.h~ext3-debug
+++ a/include/linux/jbd.h
@@ -43,7 +43,7 @@
* hardware _can_ fail, but for debugging purposes when running tests on
* known-good hardware we may want to trap these errors.
*/
-#undef JBD_PARANOID_IOFAIL
+#define JBD_PARANOID_IOFAIL

/*
* The default maximum commit age, in seconds.
@@ -57,7 +57,9 @@
* CONFIG_JBD_DEBUG is on.
*/
#define JBD_EXPENSIVE_CHECKING
+
extern int journal_enable_debug;
+extern struct block_device *journal_no_write[2];

#define jbd_debug(n, f, a...) \
do { \
@@ -270,6 +272,7 @@ void buffer_assertion_failure(struct buf
#else
#define J_ASSERT_BH(bh, expr) J_ASSERT(expr)
#define J_ASSERT_JH(jh, expr) J_ASSERT(expr)
+#define buffer_assertion_failure(bh) do { } while (0)
#endif

#else
@@ -277,9 +280,9 @@ void buffer_assertion_failure(struct buf
#endif /* JBD_ASSERTIONS */

#if defined(JBD_PARANOID_IOFAIL)
-#define J_EXPECT(expr, why...) J_ASSERT(expr)
-#define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr)
-#define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr)
+#define J_EXPECT(expr, why...) ({ J_ASSERT(expr); 1; })
+#define J_EXPECT_BH(bh, expr, why...) ({ J_ASSERT_BH(bh, expr); 1; })
+#define J_EXPECT_JH(jh, expr, why...) ({ J_ASSERT_JH(jh, expr); 1; })
#else
#define __journal_expect(expr, why...) \
({ \
@@ -839,10 +842,10 @@ struct journal_s
*/

/* Filing buffers */
-extern void __journal_temp_unlink_buffer(struct journal_head *jh);
+extern void __journal_temp_unlink_buffer(journal_t *, struct journal_head *jh);
extern void journal_unfile_buffer(journal_t *, struct journal_head *);
-extern void __journal_unfile_buffer(struct journal_head *);
-extern void __journal_refile_buffer(struct journal_head *);
+extern void __journal_unfile_buffer(journal_t *, struct journal_head *);
+extern void __journal_refile_buffer(journal_t *, struct journal_head *);
extern void journal_refile_buffer(journal_t *, struct journal_head *);
extern void __journal_file_buffer(struct journal_head *, transaction_t *, int);
extern void __journal_free_buffer(struct journal_head *bh);
@@ -1071,6 +1074,8 @@ static inline int jbd_space_needed(journ
* Definitions which augment the buffer_head layer
*/

+/* JBD additions */
+
/* journaling buffer types */
#define BJ_None 0 /* Not journaled */
#define BJ_SyncData 1 /* Normal data: flush before commit */
@@ -1087,13 +1092,6 @@ extern int jbd_blocks_per_page(struct in

#ifdef __KERNEL__

-#define buffer_trace_init(bh) do {} while (0)
-#define print_buffer_fields(bh) do {} while (0)
-#define print_buffer_trace(bh) do {} while (0)
-#define BUFFER_TRACE(bh, info) do {} while (0)
-#define BUFFER_TRACE2(bh, bh2, info) do {} while (0)
-#define JBUFFER_TRACE(jh, info) do {} while (0)
-
#endif /* __KERNEL__ */

#endif /* _LINUX_JBD_H */
_

2006-09-07 03:04:31

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Andrew Morton wrote:
> On Wed, 6 Sep 2006 19:27:33 +0200
> Jan Kara <[email protected]> wrote:
>
>
>> Ugh! Are you sure? For this path the buffer must be attached (only) to
>> the running transaction. But then how the commit code comes to it?
>> Somebody would have to even manage to refile the buffer from the
>> committing transaction to the running one while the buffer is in wbuf[].
>> Could you check whether someone does __journal_refile_buffer() on your
>> marked buffers, please? Or whether we move buffer to BJ_Locked list in
>> the write_out_data: loop? Thanks.
>>
>
> The ext3-debug patch will help here. It records, within the bh, the inputs
> from the last 32 BUFFER_TRACE()s which were run against this bh. If a
> J_ASSERT fails then you get a nice trace of the last 32 "things" which
> happened to this bh, including the bh's state at that transition. It
> basically tells you everything you need to know to find the bug.
>
> It's worth spending the time to become familiar with it - I used it a lot in
> early ext3 development.
>

I will try the patch. Unfortunately, adding more debug is causing the
problem reproduction
difficult.
> I've been unable to reproduce this crash, btw. Is there some magic
> incantation apat from running `fsx-linux'?
>
All I do is on a single 1k filesystem, run 4 copies of fsx (on 4
different files, ofcourse).
I hit the assert anywhere between 10min-2hours.

Thanks,
Badari

2006-09-07 03:35:46

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 06 Sep 2006 20:04:27 -0700
Badari Pulavarty <[email protected]> wrote:

> > I've been unable to reproduce this crash, btw. Is there some magic
> > incantation apat from running `fsx-linux'?
> >
> All I do is on a single 1k filesystem, run 4 copies of fsx (on 4
> different files, ofcourse).
> I hit the assert anywhere between 10min-2hours.

hm, OK.

It could be that running

while (1) {
write(fd, "", 1);
fsync(fd);
}

will speed it up: cause a storm of commits, increase the probability.

2006-09-07 15:11:17

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> Ugh! Are you sure? For this path the buffer must be attached (only) to
> the running transaction. But then how the commit code comes to it?
> Somebody would have to even manage to refile the buffer from the
> committing transaction to the running one while the buffer is in wbuf[].
> Could you check whether someone does __journal_refile_buffer() on your
> marked buffers, please? Or whether we move buffer to BJ_Locked list in
> the write_out_data: loop? Thanks.
>
>

I added more debug in __journal_refile_buffer() to see if the marked
buffers are getting refiled. I am able to reproduce the problem,
but I don't see any debug including my original prints. (It looks as
if none of my debug code exists) - its really confusing.

I will keep looking and get back to you.

I may try Andrew's buffer debug patch - if I get desperate.

Thanks,
Badari

2006-09-07 20:47:56

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi,

> > Ugh! Are you sure? For this path the buffer must be attached (only) to
> > the running transaction. But then how the commit code comes to it?
> > Somebody would have to even manage to refile the buffer from the
> > committing transaction to the running one while the buffer is in wbuf[].
> > Could you check whether someone does __journal_refile_buffer() on your
> > marked buffers, please? Or whether we move buffer to BJ_Locked list in
> > the write_out_data: loop? Thanks.
> >
> >
>
> I added more debug in __journal_refile_buffer() to see if the marked
> buffers are getting refiled. I am able to reproduce the problem,
> but I don't see any debug including my original prints. (It looks as
> if none of my debug code exists) - its really confusing.
>
> I will keep looking and get back to you.
>
> I may try Andrew's buffer debug patch - if I get desperate.
Ho, hum, I think I see what is hapenning. I think we steal the buffer
from commit in journal_dirty_data() (called e.g. from ext3_writepage()).
There we also take care of writing out the buffer so it's mostly OK.
Except that the commit code still keeps the reference to the buffer in
wbuf[]. So the right solution along the lines of your one would be to
check that each buffer in wbuf[] is buffer_jbd(), belongs to the right
transaction and list and only in that case run submit_bh() on it...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-07 22:30:37

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

>
> > Ugh! Are you sure? For this path the buffer must be attached (only) to
> > the running transaction. But then how the commit code comes to it?
> > Somebody would have to even manage to refile the buffer from the
> > committing transaction to the running one while the buffer is in wbuf[].
> > Could you check whether someone does __journal_refile_buffer() on your
> > marked buffers, please? Or whether we move buffer to BJ_Locked list in
> > the write_out_data: loop? Thanks.
> >
> >
>
> I added more debug in __journal_refile_buffer() to see if the marked
> buffers are getting refiled. I am able to reproduce the problem,
> but I don't see any debug including my original prints. (It looks as
> if none of my debug code exists) - its really confusing.
>
> I will keep looking and get back to you.
I've been looking more at the code and I have revived my patch fixing
this part of the code. I've mildly tested the patch. Could you also give
it a try? Thanks.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

Attachments:

(No filename) (1.02 kB)
jbd-2.6.18-rc6-1-orderedwrite.diff (6.75 kB)
Download all attachments

2006-09-08 04:33:56

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Jan Kara wrote:
>>> Ugh! Are you sure? For this path the buffer must be attached (only) to
>>> the running transaction. But then how the commit code comes to it?
>>> Somebody would have to even manage to refile the buffer from the
>>> committing transaction to the running one while the buffer is in wbuf[].
>>> Could you check whether someone does __journal_refile_buffer() on your
>>> marked buffers, please? Or whether we move buffer to BJ_Locked list in
>>> the write_out_data: loop? Thanks.
>>>
>>>
>>>
>> I added more debug in __journal_refile_buffer() to see if the marked
>> buffers are getting refiled. I am able to reproduce the problem,
>> but I don't see any debug including my original prints. (It looks as
>> if none of my debug code exists) - its really confusing.
>>
>> I will keep looking and get back to you.
>>
> I've been looking more at the code and I have revived my patch fixing
> this part of the code. I've mildly tested the patch. Could you also give
> it a try? Thanks.
>
> Honza
>
> ------------------------------------------------------------------------
>
> Original commit code assumes, that when a buffer on BJ_SyncData list is locked,
> it is being written to disk. But this is not true and hence it can lead to a
> potential data loss on crash. Also the code didn't count with the fact that
> journal_dirty_data() can steal buffers from committing transaction and hence
> could write buffers that no longer belong to the committing transaction.
> Finally it could possibly happen that we tried writing out one buffer several
> times.
>
> The patch below tries to solve these problems by a complete rewrite of the data
> commit code. We go through buffers on t_sync_datalist, lock buffers needing
> write out and store them in an array. Buffers are also immediately refiled to
> BJ_Locked list or unfiled (if the write out is completed). When the array is
> full or we have to block on buffer lock, we submit all accumulated buffers for
> IO.
>
> Signed-off-by: Jan Kara <[email protected]>
>
>
I have been running 4+ hours with this patch and seems to work fine. I
haven't hit any
assert yet :)

I will let it run till tomorrow. I will let you know, how it goes.

Thanks,
Badari

2006-09-08 08:25:32

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi,

> Jan Kara wrote:
> > I've been looking more at the code and I have revived my patch fixing
> >this part of the code. I've mildly tested the patch. Could you also give
> >it a try? Thanks.
> >
> > Honza
> >
> >------------------------------------------------------------------------
> >
> >Original commit code assumes, that when a buffer on BJ_SyncData list is
> >locked,
> >it is being written to disk. But this is not true and hence it can lead to
> >a
> >potential data loss on crash. Also the code didn't count with the fact that
> >journal_dirty_data() can steal buffers from committing transaction and
> >hence
> >could write buffers that no longer belong to the committing transaction.
> >Finally it could possibly happen that we tried writing out one buffer
> >several
> >times.
> >
> >The patch below tries to solve these problems by a complete rewrite of the
> >data
> >commit code. We go through buffers on t_sync_datalist, lock buffers needing
> >write out and store them in an array. Buffers are also immediately refiled
> >to
> >BJ_Locked list or unfiled (if the write out is completed). When the array
> >is
> >full or we have to block on buffer lock, we submit all accumulated buffers
> >for
> >IO.
> >
> >Signed-off-by: Jan Kara <[email protected]>
> >
> >
> I have been running 4+ hours with this patch and seems to work fine. I
> haven't hit any
> assert yet :)
>
> I will let it run till tomorrow. I will let you know, how it goes.
Great, thanks. BTW: Do you have any performance tests handy? The
changes are big enough to cause some unexpected performance regressions,
livelocks... If you don't have anything ready, I can setup and run
something myself. Just that I don't like this testing too much ;).

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-08 14:35:32

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Jan Kara wrote:
> Hi,
>
>
>> Jan Kara wrote:
>>
>>> I've been looking more at the code and I have revived my patch fixing
>>> this part of the code. I've mildly tested the patch. Could you also give
>>> it a try? Thanks.
>>>
>>> Honza
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Original commit code assumes, that when a buffer on BJ_SyncData list is
>>> locked,
>>> it is being written to disk. But this is not true and hence it can lead to
>>> a
>>> potential data loss on crash. Also the code didn't count with the fact that
>>> journal_dirty_data() can steal buffers from committing transaction and
>>> hence
>>> could write buffers that no longer belong to the committing transaction.
>>> Finally it could possibly happen that we tried writing out one buffer
>>> several
>>> times.
>>>
>>> The patch below tries to solve these problems by a complete rewrite of the
>>> data
>>> commit code. We go through buffers on t_sync_datalist, lock buffers needing
>>> write out and store them in an array. Buffers are also immediately refiled
>>> to
>>> BJ_Locked list or unfiled (if the write out is completed). When the array
>>> is
>>> full or we have to block on buffer lock, we submit all accumulated buffers
>>> for
>>> IO.
>>>
>>> Signed-off-by: Jan Kara <[email protected]>
>>>
>>>
>>>
>> I have been running 4+ hours with this patch and seems to work fine. I
>> haven't hit any
>> assert yet :)
>>
>> I will let it run till tomorrow. I will let you know, how it goes.
>>
> Great, thanks. BTW: Do you have any performance tests handy? The
> changes are big enough to cause some unexpected performance regressions,
> livelocks... If you don't have anything ready, I can setup and run
> something myself. Just that I don't like this testing too much ;).
>
Tests are still running fine.

I don't have any performance tests handy. We have some automated tests I
can schedule
to run to verify the stability aspects.

Thanks,
Badari

2006-09-11 09:47:06

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Hi,

> >>>Original commit code assumes, that when a buffer on BJ_SyncData list is
> >>>locked,
> >>>it is being written to disk. But this is not true and hence it can lead
> >>>to a
> >>>potential data loss on crash. Also the code didn't count with the fact
> >>>that
> >>>journal_dirty_data() can steal buffers from committing transaction and
> >>>hence
> >>>could write buffers that no longer belong to the committing transaction.
> >>>Finally it could possibly happen that we tried writing out one buffer
> >>>several
> >>>times.
> >>>
> >>>The patch below tries to solve these problems by a complete rewrite of
> >>>the data
> >>>commit code. We go through buffers on t_sync_datalist, lock buffers
> >>>needing
> >>>write out and store them in an array. Buffers are also immediately
> >>>refiled to
> >>>BJ_Locked list or unfiled (if the write out is completed). When the
> >>>array is
> >>>full or we have to block on buffer lock, we submit all accumulated
> >>>buffers for
> >>>IO.
> >>>
> >>>Signed-off-by: Jan Kara <[email protected]>
> >>>
> >>>
> >>>
> >>I have been running 4+ hours with this patch and seems to work fine. I
> >>haven't hit any
> >>assert yet :)
> >>
> >>I will let it run till tomorrow. I will let you know, how it goes.
> >>
> > Great, thanks. BTW: Do you have any performance tests handy? The
> >changes are big enough to cause some unexpected performance regressions,
> >livelocks... If you don't have anything ready, I can setup and run
> >something myself. Just that I don't like this testing too much ;).
> >
> Tests are still running fine.
>
> I don't have any performance tests handy. We have some automated tests I
> can schedule to run to verify the stability aspects.
OK. I've run IOZONE rewrite throughput test on my computer with
iozone -t 10 -i 0 -s 10M -e
2.6.18-rc6 and the same kernel + my patch seem to give almost the same
results. The strange thing was that both in vanilla and patched kernel there
were several runs where a write througput (when iozone was creating the file)
was suddenly 10% of the usual value (18MB/s vs. 2MB/s). The rewrite numbers
were always fine. Maybe that has something to do with block allocation
code. Anyway, it is not a regression of my patch so unless your test
finds some problem I think the patch should be ready for inclusion into
-mm...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-11 20:42:08

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Mon, 2006-09-11 at 11:46 +0200, Jan Kara wrote:
...
> >
> > I don't have any performance tests handy. We have some automated tests I
> > can schedule to run to verify the stability aspects.
> OK. I've run IOZONE rewrite throughput test on my computer with
> iozone -t 10 -i 0 -s 10M -e
> 2.6.18-rc6 and the same kernel + my patch seem to give almost the same
> results. The strange thing was that both in vanilla and patched kernel there
> were several runs where a write througput (when iozone was creating the file)
> was suddenly 10% of the usual value (18MB/s vs. 2MB/s). The rewrite numbers
> were always fine. Maybe that has something to do with block allocation
> code. Anyway, it is not a regression of my patch so unless your test
> finds some problem I think the patch should be ready for inclusion into
> -mm...

Your patch seems to be working fine. I haven't found any major
regression yet.

I spent lot of time trying to reproduce the problem with buffer-debug
Andrew sent out - I really wanted to get to bottom of whats really
happening here (since your patch made it go away).

Yes. Your theory is correct. journal_dirty_data() is moving the
buffer-head from commited transaction to current one and
journal_unmap_buffer() is discarding and cleaning up the buffer-head.
Later set_page_dirty() dirties the buffer-head there by causing
BUG() in submit_bh().

Here is the buffer-trace-debug output to confirm it. I can sleep better
now :) Now we can figure out, if your fix is the right one or not ..

Thanks,
Badari

Attachments:

buffer-trace.out (13.23 kB)

2006-09-11 20:52:33

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

> On Mon, 2006-09-11 at 11:46 +0200, Jan Kara wrote:
> ...
> > >
> > > I don't have any performance tests handy. We have some automated tests I
> > > can schedule to run to verify the stability aspects.
> > OK. I've run IOZONE rewrite throughput test on my computer with
> > iozone -t 10 -i 0 -s 10M -e
> > 2.6.18-rc6 and the same kernel + my patch seem to give almost the same
> > results. The strange thing was that both in vanilla and patched kernel there
> > were several runs where a write througput (when iozone was creating the file)
> > was suddenly 10% of the usual value (18MB/s vs. 2MB/s). The rewrite numbers
> > were always fine. Maybe that has something to do with block allocation
> > code. Anyway, it is not a regression of my patch so unless your test
> > finds some problem I think the patch should be ready for inclusion into
> > -mm...
>
> Your patch seems to be working fine. I haven't found any major
> regression yet.
>
> I spent lot of time trying to reproduce the problem with buffer-debug
> Andrew sent out - I really wanted to get to bottom of whats really
> happening here (since your patch made it go away).
>
> Yes. Your theory is correct. journal_dirty_data() is moving the
> buffer-head from commited transaction to current one and
> journal_unmap_buffer() is discarding and cleaning up the buffer-head.
> Later set_page_dirty() dirties the buffer-head there by causing
> BUG() in submit_bh().
>
> Here is the buffer-trace-debug output to confirm it. I can sleep better
> now :) Now we can figure out, if your fix is the right one or not ..
OK, good to hear :). My patch should be prone at least to this problem
(I'm not saying it could not have introduced any other ;). It locks the
buffer and if it needed to drop JBD spin locks, it also checks whether
the buffer remained in the BJ_SyncData list. Hence if the
journal_dirty_data() steals the buffer while we are locking it we find
that out and forget about the buffer. Once the buffer is locked,
journal_dirty_data() won't touch it.
I guess the patch is good enough to send it to Andrew...

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2006-09-13 20:25:19

by Dave Kleikamp

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Fri, 2006-09-08 at 00:30 +0200, Jan Kara wrote:
> diff -rupX /home/jack/.kerndiffexclude
> linux-2.6.18-rc6/fs/jbd/commit.c
> linux-2.6.18-rc6-1-orderedwrite/fs/jbd/commit.c
> --- linux-2.6.18-rc6/fs/jbd/commit.c 2006-09-06 18:20:48.000000000
> +0200
> +++ linux-2.6.18-rc6-1-orderedwrite/fs/jbd/commit.c 2006-09-08
> 01:05:35.000000000 +0200
> @@ -160,6 +160,117 @@ static int journal_write_commit_record(j
> return (ret == -EIO);
> }
>
> +void journal_do_submit_data(struct buffer_head **wbuf, int bufs)

Is there any reason this couldn't be static?

> +{
> + int i;
> +
> + for (i = 0; i < bufs; i++) {
> + wbuf[i]->b_end_io = end_buffer_write_sync;
> + /* We use-up our safety reference in submit_bh() */
> + submit_bh(WRITE, wbuf[i]);
> + }
> +}

I'm rebasing the ext4 work on the latest -mm tree and would like to
avoid renaming this function in the jbd2 clone.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center

2006-09-14 03:38:17

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

On Wed, 13 Sep 2006 15:25:19 -0500
Dave Kleikamp <[email protected]> wrote:

> > +void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
>
> Is there any reason this couldn't be static?

Nope.

> > +{
> > + int i;
> > +
> > + for (i = 0; i < bufs; i++) {
> > + wbuf[i]->b_end_io = end_buffer_write_sync;
> > + /* We use-up our safety reference in submit_bh() */
> > + submit_bh(WRITE, wbuf[i]);
> > + }
> > +}
>
> I'm rebasing the ext4 work on the latest -mm tree and would like to
> avoid renaming this function in the jbd2 clone.

<edits the diff>

2006-09-27 22:01:25

by Eric Sandeen

[permalink] [raw]

Subject: Re: [RFC][PATCH] set_page_buffer_dirty should skip unmapped buffers

Andrew Morton wrote:
> On Wed, 13 Sep 2006 15:25:19 -0500
> Dave Kleikamp <[email protected]> wrote:
>
>>> +void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
>> Is there any reason this couldn't be static?
>
> Nope.

With this change, journal_brelse_array can also be made static in
recovery.c, and removed from jbd.h, I think.

-Eric

2006-09-28 15:04:19

by Dave Kleikamp

[permalink] [raw]

Subject: [PATCH] JBD: Make journal_do_submit_data static

On Wed, 2006-09-27 at 17:01 -0500, Eric Sandeen wrote:
> Andrew Morton wrote:
> > On Wed, 13 Sep 2006 15:25:19 -0500
> > Dave Kleikamp <[email protected]> wrote:
> >
> >>> +void journal_do_submit_data(struct buffer_head **wbuf, int bufs)
> >> Is there any reason this couldn't be static?
> >
> > Nope.
>
> With this change, journal_brelse_array can also be made static in
> recovery.c, and removed from jbd.h, I think.

Looks like it. Here's a patch to do that:

JBD: Make journal_brelse_array static

It's always good to make symbols static when we can, and this also
eliminates the need to rename the function in jbd2

suggested by Eric Sandeen

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Eric Sandeen <[email protected]>

diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 445eed6..11563fe 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -46,7 +46,7 @@ static int scan_revoke_records(journal_t
#ifdef __KERNEL__

/* Release readahead buffers after use */
-void journal_brelse_array(struct buffer_head *b[], int n)
+static void journal_brelse_array(struct buffer_head *b[], int n)
{
while (--n >= 0)
brelse (b[n]);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index a6d9daa..fe89444 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -977,7 +977,6 @@ extern void journal_write_revoke_reco
extern int journal_set_revoke(journal_t *, unsigned long, tid_t);
extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
extern void journal_clear_revoke(journal_t *);
-extern void journal_brelse_array(struct buffer_head *b[], int n);
extern void journal_switch_revoke_table(journal_t *journal);

/*

--
David Kleikamp
IBM Linux Technology Center