LinuxLists.cc - [RFC] ext3/jbd race: releasing in-use journal

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

"Stephen C. Tweedie" <[email protected]> wrote:
>
> For the past few months there has been a slow but steady trickle of
> reports of oopses in kjournald.

Yes, really tenuous stuff. Very glad if this is the fix!

> Recently I got a couple of reports that
> were repeatable enough to rerun with extra debugging code.
>
> It turns out that we're releasing a journal_head while it is still
> linked onto the transaction's t_locked_list. The exact location is in
> journal_unmap_buffer(). On several exit paths, that does:
>
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> journal_put_journal_head(jh);
>
> releasing the jh *after* dropping the j_list_lock and j_state_lock.
>
> kjournald can then be doing journal_commit_transaction():
>
> spin_lock(&journal->j_list_lock);
> ...
> if (buffer_locked(bh)) {
> BUFFER_TRACE(bh, "locked");
> if (!inverted_lock(journal, bh))
> goto write_out_data;
> __journal_unfile_buffer(jh);
> __journal_file_buffer(jh, commit_transaction,
> BJ_Locked);
> jbd_unlock_bh_state(bh);
>
> The problem happens if journal_unmap_buffer()'s own put_journal_head()
> manages to get in between kjournald's *unfile_buffer and the following
> *file_buffer. Because journal_unmap_buffer() has dropped its bh_state
> lock by this point, there's nothing to prevent this, leading to a
> variety of unpleasant situations. In particular, the jh is unfiled at
> this point, so there's nothing to stop the put_journal_head() from
> freeing the memory we're just about to link onto the BJ_Locked list.

Right. I don't know why journal_put_journal_head() looks at
->b_transaction, really. Should have made presence on a list contribute to
b_jcount. Oh well, it's been that way since 2.5.0 or older..

Don't we have the same race anywhere where we're doing a
journal_refile_buffer() (or equiv) in parallel with a
journal_put_journal_head() outside locks? There seem to be many such.

Perhaps we could also fix this by elevating b_jcount whenever the jh is
being moved between lists?

2005-03-05 00:36:39

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [Ext2-devel] [RFC] ext3/jbd race: releasing in-use journal_heads

Stephen,

I looked at few journalling bugs recently on RHEL4 testing here.
I am wondering if your patch fixes this following BUG also ?
I never got to bottom of some of these journal panics -
since they are not easily reproducible + I don't understand
journal code well enough :(

Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
kernel BUG in journal_commit_transaction at fs/jbd/commit.c:790!
cpu 0x8: Vector: 700 (Program Check) at [c00000001fcc3870]
pc: d0000000000a7cd4: .journal_commit_transaction+0x1378/0x155c [jbd]
lr: d0000000000a7cd0: .journal_commit_transaction+0x1374/0x155c [jbd]
sp: c00000001fcc3af0
msr: 8000000000029032
current = 0xc00000002fd86b30
paca = 0xc0000000003de000
pid = 272, comm = kjournald
enter ? for help

Thanks,
Badari

On Fri, 2005-03-04 at 11:54, Stephen C. Tweedie wrote:
> Hi all,
>
> For the past few months there has been a slow but steady trickle of
> reports of oopses in kjournald. Recently I got a couple of reports that
> were repeatable enough to rerun with extra debugging code.
>
> It turns out that we're releasing a journal_head while it is still
> linked onto the transaction's t_locked_list. The exact location is in
> journal_unmap_buffer(). On several exit paths, that does:
>
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> journal_put_journal_head(jh);
>
> releasing the jh *after* dropping the j_list_lock and j_state_lock.
>
> kjournald can then be doing journal_commit_transaction():
>
> spin_lock(&journal->j_list_lock);
> ...
> if (buffer_locked(bh)) {
> BUFFER_TRACE(bh, "locked");
> if (!inverted_lock(journal, bh))
> goto write_out_data;
> __journal_unfile_buffer(jh);
> __journal_file_buffer(jh, commit_transaction,
> BJ_Locked);
> jbd_unlock_bh_state(bh);
>
> The problem happens if journal_unmap_buffer()'s own put_journal_head()
> manages to get in between kjournald's *unfile_buffer and the following
> *file_buffer. Because journal_unmap_buffer() has dropped its bh_state
> lock by this point, there's nothing to prevent this, leading to a
> variety of unpleasant situations. In particular, the jh is unfiled at
> this point, so there's nothing to stop the put_journal_head() from
> freeing the memory we're just about to link onto the BJ_Locked list.
>
> I _think_ that the attached patch deals with this, but I'm still
> awaiting further testing to be sure. I thought I might as well get some
> other ext3 eyes on it while I wait for that -- I'll let you know as soon
> as I hear back from the other testing.
>
> The patch works by making sure that the various exits from
> journal_unmap_buffer() always call journal_put_journal_head() *before*
> unlocking the j_list_lock. This is correct according to the documented
> lock ranking, and it also matches the order in the existing main exit
> path at the end of the function.
>
> Cheers,
> Stephen
>
>
> ______________________________________________________________________
>
> --- linux-2.6-ext3/fs/jbd/transaction.c.=K0000=.orig
> +++ linux-2.6-ext3/fs/jbd/transaction.c
> @@ -1775,10 +1775,10 @@ static int journal_unmap_buffer(journal_
> JBUFFER_TRACE(jh, "checkpointed: add to BJ_Forget");
> ret = __dispose_buffer(jh,
> journal->j_running_transaction);
> + journal_put_journal_head(jh);
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> - journal_put_journal_head(jh);
> return ret;
> } else {
> /* There is no currently-running transaction. So the
> @@ -1789,10 +1789,10 @@ static int journal_unmap_buffer(journal_
> JBUFFER_TRACE(jh, "give to committing trans");
> ret = __dispose_buffer(jh,
> journal->j_committing_transaction);
> + journal_put_journal_head(jh);
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> - journal_put_journal_head(jh);
> return ret;
> } else {
> /* The orphan record's transaction has
> @@ -1813,10 +1813,10 @@ static int journal_unmap_buffer(journal_
> journal->j_running_transaction);
> jh->b_next_transaction = NULL;
> }
> + journal_put_journal_head(jh);
> spin_unlock(&journal->j_list_lock);
> jbd_unlock_bh_state(bh);
> spin_unlock(&journal->j_state_lock);
> - journal_put_journal_head(jh);
> return 0;
> } else {
> /* Good, the buffer belongs to the running transaction.

2005-03-07 14:28:42

[permalink] [raw]

Subject: Re: [Ext2-devel] [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Fri, 2005-03-04 at 23:17, Badari Pulavarty wrote:

> I looked at few journalling bugs recently on RHEL4 testing here.
> I am wondering if your patch fixes this following BUG also ?
> I never got to bottom of some of these journal panics -
> since they are not easily reproducible

Right...

> + I don't understand
> journal code well enough :(

Fortunately this one seems to be a simple locking violation, nothing
subtle in the jbd layers.

> Assertion failure in journal_commit_transaction() at fs/jbd/commit.c:790: "jh->b_next_transaction == ((void *)0)"
> kernel BUG in journal_commit_transaction at fs/jbd/commit.c:790!

I'm assuming that's the line
J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
in the t_forget loop. That's not one of the most common footprints I
saw due to this bug, but it's certainly a possible one. We saw one case
where a bh from the wrong transaction was linked onto the j_locked_list,
due to it being freed then reused while in use; and if that can happen,
pretty much anything can go wrong afterwards!

--Stephen

2005-03-07 14:50:15

by Jan Kara

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

> "Stephen C. Tweedie" <[email protected]> wrote:
> >
> > For the past few months there has been a slow but steady trickle of
> > reports of oopses in kjournald.
>
> Yes, really tenuous stuff. Very glad if this is the fix!
>
> > Recently I got a couple of reports that
> > were repeatable enough to rerun with extra debugging code.
> >
> > It turns out that we're releasing a journal_head while it is still
> > linked onto the transaction's t_locked_list. The exact location is in
> > journal_unmap_buffer(). On several exit paths, that does:
> >
> > spin_unlock(&journal->j_list_lock);
> > jbd_unlock_bh_state(bh);
> > spin_unlock(&journal->j_state_lock);
> > journal_put_journal_head(jh);
> >
> > releasing the jh *after* dropping the j_list_lock and j_state_lock.
> >
> > kjournald can then be doing journal_commit_transaction():
> >
> > spin_lock(&journal->j_list_lock);
> > ...
> > if (buffer_locked(bh)) {
> > BUFFER_TRACE(bh, "locked");
> > if (!inverted_lock(journal, bh))
> > goto write_out_data;
> > __journal_unfile_buffer(jh);
> > __journal_file_buffer(jh, commit_transaction,
> > BJ_Locked);
> > jbd_unlock_bh_state(bh);
> >
> > The problem happens if journal_unmap_buffer()'s own put_journal_head()
> > manages to get in between kjournald's *unfile_buffer and the following
> > *file_buffer. Because journal_unmap_buffer() has dropped its bh_state
> > lock by this point, there's nothing to prevent this, leading to a
> > variety of unpleasant situations. In particular, the jh is unfiled at
> > this point, so there's nothing to stop the put_journal_head() from
> > freeing the memory we're just about to link onto the BJ_Locked list.
>
> Right. I don't know why journal_put_journal_head() looks at
> ->b_transaction, really. Should have made presence on a list contribute to
> b_jcount. Oh well, it's been that way since 2.5.0 or older..
>
> Don't we have the same race anywhere where we're doing a
> journal_refile_buffer() (or equiv) in parallel with a
> journal_put_journal_head() outside locks? There seem to be many such.
I believe the other places should be safe (mostly by luck) as the
caller has made sure that __journal_remove_journal_head() won't do
anything (e.g. set b_transaction, b_next_transaction or such).
Anyway it doesn't seem too safe to me...

Honza

2005-03-07 16:03:24

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Mon, 2005-03-07 at 14:50, Jan Kara wrote:

> I believe the other places should be safe (mostly by luck) as the
> caller has made sure that __journal_remove_journal_head() won't do
> anything (e.g. set b_transaction, b_next_transaction or such).

Right; I've been looking through all the journal_put_journal_head()
callers and most of the instances are places like journal_get_*_access,
which imply that the jh is still on a list. The problem is races
against places like journal_unmap_buffer() where we can be removing the
bh from those lists as soon as we've lost the journal lock.

> Anyway it doesn't seem too safe to me...

Quite.

I think I agree with Andrew here --- the only real solution is to make
sure that whenever anybody is clearing jh->b_transaction, they protect
themselves against journal_put_journal_head() by either elevating
j_count, or taking the jbd_lock_bh_journal_head() lock.

The current stop-gap may actually work, but I'd be more comfortable with
a robust scheme in place.

--Stephen

2005-03-07 16:41:32

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Sat, 2005-03-05 at 00:04, Andrew Morton wrote:

> Perhaps we could also fix this by elevating b_jcount whenever the jh is
> being moved between lists?

Possible. But jcount isn't atomic, and it requires the bh_journal_head
lock to modify. Taking and dropping the lock twice around the
__unfile/__refile pair, once to inc jcount and once again to drop it, is
probably unnecessarily expensive. We can probably do with just holding
the bh_journal_head lock alone in most cases.

The specific case we're stumbling on here is the t_locked_list. The
problem manifests itself when we walk that list, but we think it is
actually caused when we first add it to the list:

__journal_unfile_buffer(jh);
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);

__journal_unfile_buffer does no locking of its own. We're calling this
point with the buffer locked, j_list_lock held and the bh_state lock
held.

Trouble is, that's not enough; journal_put_journal_head() can nuke the
buffer with merely the bh_journal_head lock held. In the code above it
would be enough to take the journal_head_lock over the unfile/file pair.

Andrew, can you remember why we ended up with both of those locks in the
first place? If we can do it, the efficient way out here is to abandon
the journal_head_lock and use the bh_state_lock for both. We already
hold that over many of the key refile spots, and this would avoid the
need to take yet another lock in those paths.

--Stephen

2005-03-07 17:05:31

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Mon, 2005-03-07 at 16:40, Stephen C. Tweedie wrote:

> Andrew, can you remember why we ended up with both of those locks in the
> first place? If we can do it, the efficient way out here is to abandon
> the journal_head_lock and use the bh_state_lock for both. We already
> hold that over many of the key refile spots, and this would avoid the
> need to take yet another lock in those paths.

Actually, I realised there's an easier way: provide a variant of
__journal_unfile_buffer() which doesn't clear jh->b_transaction.

That's a subtle change in the logic, but everywhere that is calling
this, we already hold various locks --- except for that elusive
bh_journal_head lock.

But since everywhere else is using _other_ locks, the transient state
where we're on the transaction but not on a list won't be visible ---
we'll have refiled before we drop the lock.

I think things should work just fine with this change.
__journal_unfile_buffer() already handles the case where we're called
with b_transaction set but no b_jlist, for example.

--Stephen

2005-03-07 21:25:37

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

"Stephen C. Tweedie" <[email protected]> wrote:
>
> Trouble is, that's not enough; journal_put_journal_head() can nuke the
> buffer with merely the bh_journal_head lock held. In the code above it
> would be enough to take the journal_head_lock over the unfile/file pair.
>
> Andrew, can you remember why we ended up with both of those locks in the
> first place? If we can do it, the efficient way out here is to abandon
> the journal_head_lock and use the bh_state_lock for both. We already
> hold that over many of the key refile spots, and this would avoid the
> need to take yet another lock in those paths.

Oh gosh. It was a transformation from the global journal_datalist_lock and
jh_splice_lock locks. jbd_lock_bh_journal_head() is supposed to be a
finegrained innermost lock whose mandate is purely for atomicity of adding
and removing the journal_head and the b_jcount refcounting. I don't recall
there being any deeper meaning than that.

The original changelog says:

buffer_heads and journal_heads are joined at the hip. We need a lock to
protect the joint and its refcounts.

JBD is currently using a global spinlock for that. Change it to use one bit
in bh->b_state.

It could be that we can optimise jbd_lock_bh_journal_head() away, as you
mention. If we have an assertion in there to check that
jbd_lock_bh_state() is held...

2005-03-07 21:43:56

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Mon, 2005-03-07 at 20:31, Andrew Morton wrote:

> jbd_lock_bh_journal_head() is supposed to be a
> finegrained innermost lock whose mandate is purely for atomicity of adding
> and removing the journal_head and the b_jcount refcounting. I don't recall
> there being any deeper meaning than that.

> It could be that we can optimise jbd_lock_bh_journal_head() away, as you
> mention. If we have an assertion in there to check that
> jbd_lock_bh_state() is held...

Right, that was what I was thinking might be possible. But for now I've
just done the simple patch --- make sure we don't clear
jh->b_transaction when we're just refiling buffers from one list to
another. That should have the desired effect here without dangerous
messing around with locks.

I'm having trouble testing it, though --- I seem to be getting livelocks
in O_DIRECT running 400 fsstress processes in parallel; ring any bells?
I get it with today's bk tree without any ext3 changes too, so at least
it's not a regression.

--Stephen

2005-03-07 21:54:21

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

"Stephen C. Tweedie" <[email protected]> wrote:
>
> I'm having trouble testing it, though --- I seem to be getting livelocks
> in O_DIRECT running 400 fsstress processes in parallel; ring any bells?

Nope. I dont think anyone has been that cruel to ext3 for a while.
I assume this workload used to succeed?

2005-03-07 23:04:38

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Mon, 2005-03-07 at 21:11, Andrew Morton wrote:

> > I'm having trouble testing it, though --- I seem to be getting livelocks
> > in O_DIRECT running 400 fsstress processes in parallel; ring any bells?
>
> Nope. I dont think anyone has been that cruel to ext3 for a while.
> I assume this workload used to succeed?

I think so. I remember getting a lockup a couple of months ago that I
couldn't get to the bottom of and that looked very similar, so it may
just be a matter of it being much more reproducible now. But I seem to
be able to livelock the directIO bits in minutes now, quite reliably.
I'll try to narrow things down.

altgr-scrlck is showing a range of EIPs all in ext3_direct_IO->
invalidate_inode_pages2_range(). I'm seeing

invalidate_inode_pages2_range()->pagevec_lookup()->find_get_pages()

as by far the most common trace, but also

invalidate_inode_pages2_range()->__might_sleep()
invalidate_inode_pages2_range()->unlock_page()
and a few other similar paths.

invalidate_inode_pages2_range() does do a cond_resched(), so the machine
carries on despite having 316 fsstress tasks in system-mode R state at
the moment. :-) The scheduler is doing a rather impressive job.

--Stephen

2005-03-07 23:57:30

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

"Stephen C. Tweedie" <[email protected]> wrote:
>
> In invalidate_inode_pages2_range(), what happens if we lookup a pagevec,
> get a bunch of pages back, but all the pages in the vec are beyond the
> end of the range we want?

hmm, yes. Another one :(

> @@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
> int was_dirty;
>
> lock_page(page);
> + if (page->mapping == mapping)
> + next = page->index + 1;
> if (page->mapping != mapping || page->index > end) {
> unlock_page(page);
> continue;
> }
> wait_on_page_writeback(page);
> - next = page->index + 1;
> if (next == 0)
> wrapped = 1;
> while (page_mapped(page)) {

truncate_inode_pages_range() seems to dtrt here. Can we do it in the same
manner in invalidate_inode_pages2_range()?

Something like:

diff -puN mm/truncate.c~invalidate_inode_pages2_range-livelock-fix mm/truncate.c
--- 25/mm/truncate.c~invalidate_inode_pages2_range-livelock-fix Mon Mar 7 15:47:25 2005
+++ 25-akpm/mm/truncate.c Mon Mar 7 15:49:09 2005
@@ -305,15 +305,22 @@ int invalidate_inode_pages2_range(struct
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
+ pgoff_t page_index;
int was_dirty;

lock_page(page);
- if (page->mapping != mapping || page->index > end) {
+ page_index = page->index;
+ if (page_index > end) {
+ next = page_index;
+ unlock_page(page);
+ break;
+ }
+ if (page->mapping != mapping) {
unlock_page(page);
continue;
}
wait_on_page_writeback(page);
- next = page->index + 1;
+ next = page_index + 1;
if (next == 0)
wrapped = 1;
while (page_mapped(page)) {
@@ -323,7 +330,7 @@ int invalidate_inode_pages2_range(struct
*/
unmap_mapping_range(mapping,
page->index << PAGE_CACHE_SHIFT,
- (end - page->index + 1)
+ (end - page_index + 1)
<< PAGE_CACHE_SHIFT,
0);
did_range_unmap = 1;
@@ -332,7 +339,7 @@ int invalidate_inode_pages2_range(struct
* Just zap this page
*/
unmap_mapping_range(mapping,
- page->index << PAGE_CACHE_SHIFT,
+ page_index << PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE, 0);
}
}
_

2005-03-07 23:38:26

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
int was_dirty;

lock_page(page);
+ if (page->mapping == mapping)
+ next = page->index + 1;
if (page->mapping != mapping || page->index > end) {
unlock_page(page);
continue;
}
wait_on_page_writeback(page);
- next = page->index + 1;
if (next == 0)
wrapped = 1;
while (page_mapped(page)) {

Attachments:

invalidate-livelock.patch (497.00 B)

2005-03-08 06:20:17

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

yup, looks like the same issue we hit in wait_on_page_writeback_range
during AIO work - probably want to break out of the outer loop as well
when this happens.

>From the old changelog:
>>
>> wait_on_page_writeback_range shouldn't wait for pages beyond the
>> specified range. Ideally, the radix-tree-lookup could accept an
>> end_index parameter so that it doesn't return the extra pages
>> in the first place, but for now we just add a few extra checks
>> to skip such pages.
>>

How hard would it be to add an end_index parameter to the radix tree
lookup, since we seem to be hitting this in multiple places ?

Regards
Suparna

On Mon, Mar 07, 2005 at 03:50:01PM -0800, Andrew Morton wrote:
> "Stephen C. Tweedie" <[email protected]> wrote:
> >
> > In invalidate_inode_pages2_range(), what happens if we lookup a pagevec,
> > get a bunch of pages back, but all the pages in the vec are beyond the
> > end of the range we want?
>
> hmm, yes. Another one :(
>
> > @@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
> > int was_dirty;
> >
> > lock_page(page);
> > + if (page->mapping == mapping)
> > + next = page->index + 1;
> > if (page->mapping != mapping || page->index > end) {
> > unlock_page(page);
> > continue;
> > }
> > wait_on_page_writeback(page);
> > - next = page->index + 1;
> > if (next == 0)
> > wrapped = 1;
> > while (page_mapped(page)) {
>
> truncate_inode_pages_range() seems to dtrt here. Can we do it in the same
> manner in invalidate_inode_pages2_range()?
>
>
> Something like:
>
>
> diff -puN mm/truncate.c~invalidate_inode_pages2_range-livelock-fix mm/truncate.c
> --- 25/mm/truncate.c~invalidate_inode_pages2_range-livelock-fix Mon Mar 7 15:47:25 2005
> +++ 25-akpm/mm/truncate.c Mon Mar 7 15:49:09 2005
> @@ -305,15 +305,22 @@ int invalidate_inode_pages2_range(struct
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> + pgoff_t page_index;
> int was_dirty;
>
> lock_page(page);
> - if (page->mapping != mapping || page->index > end) {
> + page_index = page->index;
> + if (page_index > end) {
> + next = page_index;
> + unlock_page(page);
> + break;
> + }
> + if (page->mapping != mapping) {
> unlock_page(page);
> continue;
> }
> wait_on_page_writeback(page);
> - next = page->index + 1;
> + next = page_index + 1;
> if (next == 0)
> wrapped = 1;
> while (page_mapped(page)) {
> @@ -323,7 +330,7 @@ int invalidate_inode_pages2_range(struct
> */
> unmap_mapping_range(mapping,
> page->index << PAGE_CACHE_SHIFT,
> - (end - page->index + 1)
> + (end - page_index + 1)
> << PAGE_CACHE_SHIFT,
> 0);
> did_range_unmap = 1;
> @@ -332,7 +339,7 @@ int invalidate_inode_pages2_range(struct
> * Just zap this page
> */
> unmap_mapping_range(mapping,
> - page->index << PAGE_CACHE_SHIFT,
> + page_index << PAGE_CACHE_SHIFT,
> PAGE_CACHE_SIZE, 0);
> }
> }
> _
>
>
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Ext2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ext2-devel

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 06:30:02

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Just as an idea of what I meant (dug up an old WIP patch):

--- radix-tree.c 2004-04-01 10:32:15.384556136 +0530
+++ radix-tree.c.end 2004-04-01 11:11:07.176069944 +0530
@@ -562,7 +562,8 @@ EXPORT_SYMBOL(radix_tree_gang_lookup);
*/
static unsigned int
__lookup_tag(struct radix_tree_root *root, void **results, unsigned long index,
- unsigned int max_items, unsigned long *next_index, int tag)
+ unsigned long max_index, unsigned int max_items,
+ unsigned long *next_index, int tag)
{
unsigned int nr_found = 0;
unsigned int shift;
@@ -596,7 +597,8 @@ __lookup_tag(struct radix_tree_root *roo
if (tag_get(slot, tag, j)) {
BUG_ON(slot->slots[j] == NULL);
results[nr_found++] = slot->slots[j];
- if (nr_found == max_items)
+ if ((nr_found == max_items) ||
+ (index > max_index))
goto out;
}
}
@@ -624,12 +626,15 @@ out:
*/
unsigned int
radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
- unsigned long first_index, unsigned int max_items, int tag)
+ unsigned long first_index, unsigned long end_index,
+ unsigned int max_items, int tag)
{
const unsigned long max_index = radix_tree_maxindex(root->height);
unsigned long cur_index = first_index;
unsigned int ret = 0;

+ if (max_index > end_index)
+ max_index = end_index;
while (ret < max_items) {
unsigned int nr_found;
unsigned long next_index; /* Index of next search */
@@ -637,7 +642,8 @@ radix_tree_gang_lookup_tag(struct radix_
if (cur_index > max_index)
break;
nr_found = __lookup_tag(root, results + ret, cur_index,
- max_items - ret, &next_index, tag);
+ max_index, max_items - ret,
+ &next_index, tag);
ret += nr_found;
if (next_index == 0)
break;

On Tue, Mar 08, 2005 at 11:58:27AM +0530, Suparna Bhattacharya wrote:
>
> yup, looks like the same issue we hit in wait_on_page_writeback_range
> during AIO work - probably want to break out of the outer loop as well
> when this happens.
>
> From the old changelog:
> >>
> >> wait_on_page_writeback_range shouldn't wait for pages beyond the
> >> specified range. Ideally, the radix-tree-lookup could accept an
> >> end_index parameter so that it doesn't return the extra pages
> >> in the first place, but for now we just add a few extra checks
> >> to skip such pages.
> >>
>
> How hard would it be to add an end_index parameter to the radix tree
> lookup, since we seem to be hitting this in multiple places ?
>
> Regards
> Suparna
>
>
> On Mon, Mar 07, 2005 at 03:50:01PM -0800, Andrew Morton wrote:
> > "Stephen C. Tweedie" <[email protected]> wrote:
> > >
> > > In invalidate_inode_pages2_range(), what happens if we lookup a pagevec,
> > > get a bunch of pages back, but all the pages in the vec are beyond the
> > > end of the range we want?
> >
> > hmm, yes. Another one :(
> >
> > > @@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
> > > int was_dirty;
> > >
> > > lock_page(page);
> > > + if (page->mapping == mapping)
> > > + next = page->index + 1;
> > > if (page->mapping != mapping || page->index > end) {
> > > unlock_page(page);
> > > continue;
> > > }
> > > wait_on_page_writeback(page);
> > > - next = page->index + 1;
> > > if (next == 0)
> > > wrapped = 1;
> > > while (page_mapped(page)) {
> >
> > truncate_inode_pages_range() seems to dtrt here. Can we do it in the same
> > manner in invalidate_inode_pages2_range()?
> >
> >
> > Something like:
> >
> >
> > diff -puN mm/truncate.c~invalidate_inode_pages2_range-livelock-fix mm/truncate.c
> > --- 25/mm/truncate.c~invalidate_inode_pages2_range-livelock-fix Mon Mar 7 15:47:25 2005
> > +++ 25-akpm/mm/truncate.c Mon Mar 7 15:49:09 2005
> > @@ -305,15 +305,22 @@ int invalidate_inode_pages2_range(struct
> > min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> > + pgoff_t page_index;
> > int was_dirty;
> >
> > lock_page(page);
> > - if (page->mapping != mapping || page->index > end) {
> > + page_index = page->index;
> > + if (page_index > end) {
> > + next = page_index;
> > + unlock_page(page);
> > + break;
> > + }
> > + if (page->mapping != mapping) {
> > unlock_page(page);
> > continue;
> > }
> > wait_on_page_writeback(page);
> > - next = page->index + 1;
> > + next = page_index + 1;
> > if (next == 0)
> > wrapped = 1;
> > while (page_mapped(page)) {
> > @@ -323,7 +330,7 @@ int invalidate_inode_pages2_range(struct
> > */
> > unmap_mapping_range(mapping,
> > page->index << PAGE_CACHE_SHIFT,
> > - (end - page->index + 1)
> > + (end - page_index + 1)
> > << PAGE_CACHE_SHIFT,
> > 0);
> > did_range_unmap = 1;
> > @@ -332,7 +339,7 @@ int invalidate_inode_pages2_range(struct
> > * Just zap this page
> > */
> > unmap_mapping_range(mapping,
> > - page->index << PAGE_CACHE_SHIFT,
> > + page_index << PAGE_CACHE_SHIFT,
> > PAGE_CACHE_SIZE, 0);
> > }
> > }
> > _
> >
> >
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Ext2-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/ext2-devel
>
> --
> Suparna Bhattacharya ([email protected])
> Linux Technology Center
> IBM Software Lab, India
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 07:12:34

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Suparna Bhattacharya <[email protected]> wrote:
>
>
> yup, looks like the same issue we hit in wait_on_page_writeback_range
> during AIO work - probably want to break out of the outer loop as well
> when this happens.

The `next = page_index' before breaking will do that for us.

>
> How hard would it be to add an end_index parameter to the radix tree
> lookup, since we seem to be hitting this in multiple places ?

Really easy if you do it ;)

Let's wait for this particular peiece of code to settle down, do a cleanup
sometime?

2005-03-08 07:19:34

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

On Mon, Mar 07, 2005 at 10:46:18PM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> >
> >
> > yup, looks like the same issue we hit in wait_on_page_writeback_range
> > during AIO work - probably want to break out of the outer loop as well
> > when this happens.
>
> The `next = page_index' before breaking will do that for us.
>
> >
> > How hard would it be to add an end_index parameter to the radix tree
> > lookup, since we seem to be hitting this in multiple places ?
>
> Really easy if you do it ;)
>
> Let's wait for this particular peiece of code to settle down, do a cleanup
> sometime?
>

OK - we can postpone this (let me know if the interface in the patch
I just posted seems like the right direction to use when we go for the
cleanup).

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 07:38:24

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Suparna Bhattacharya <[email protected]> wrote:
>
> (let me know if the interface in the patch
> I just posted seems like the right direction to use when we go for the
> cleanup)

Well what are the semantics? Pass in an inclusive max_index and the gang
lookup functions terminate when they hit an item whose index is greater
than max_index? And return the number of items thus found?

Seems sensible, and all the comparisons do the right thing if max_index = -1.

2005-03-08 08:05:57

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] ext3/jbd race: releasing in-use journal_heads

On Mon, Mar 07, 2005 at 11:37:42PM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > (let me know if the interface in the patch
> > I just posted seems like the right direction to use when we go for the
> > cleanup)
>
> Well what are the semantics? Pass in an inclusive max_index and the gang
> lookup functions terminate when they hit an item whose index is greater
> than max_index? And return the number of items thus found?

Yes. (end_index or max_items, whichever it hits first)

>
> Seems sensible, and all the comparisons do the right thing if max_index = -1.

end_index is unsigned long - so -1 would imply highest index possible
... i.e. all items, subject to the radix_tree_maxindex (max_index used
internally in the radix tree code).

Regards
Suparna

>
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Ext2-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ext2-devel

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 09:28:51

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Mon, 2005-03-07 at 23:50, Andrew Morton wrote:

> truncate_inode_pages_range() seems to dtrt here. Can we do it in the same
> manner in invalidate_inode_pages2_range()?
>
>
> Something like:

> - if (page->mapping != mapping || page->index > end) {
> + page_index = page->index;
> + if (page_index > end) {
> + next = page_index;
> + unlock_page(page);
> + break;
> + }
> + if (page->mapping != mapping) {
> unlock_page(page);
> continue;
> }

Yes, breaking early seems fine for this. But don't we need to test
page->mapping == mapping again with the page lock held before we can
reliably break on page_index?

I think it should be OK just to move the page->mapping != mapping test
above the page>index > end test. Sure, if all the pages have been
stolen by the time we see them, then we'll repeat without advancing
"next"; but we're still making progress in that case because pages that
were previously in this mapping *have* been removed, just by a different
process. If we're really concerned about that case we can play the
trick that invalidate_mapping_pages() tries and do a "next++" in that
case.

--Stephen

2005-03-08 12:42:37

[permalink] [raw]

Subject: [PATCH] invalidate/o_direct livelock {was Re: [RFC] ext3/jbd race: releasing in-use journal_heads}

Fix livelock in invalidate_inode_pages2_range

invalidate_inode_pages2_range can loop indefinitely if pagevec_lookup
returns only pages beyond the end of the range being invalidated. Fix
it by advancing the restart pointer, "next", every time we see a page on
the correct mapping, not just if the page is within the range; and break
out early when we encounter such a page. Based on a proposed fix by
akpm.

Signed-off-by: Stephen Tweedie <[email protected]>

--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -268,25 +268,31 @@ int invalidate_inode_pages2_range(struct
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
+ pgoff_t page_index;
int was_dirty;

lock_page(page);
- if (page->mapping != mapping || page->index > end) {
+ if (page->mapping != mapping) {
unlock_page(page);
continue;
}
- wait_on_page_writeback(page);
- next = page->index + 1;
+ page_index = page->index;
+ next = page_index + 1;
if (next == 0)
wrapped = 1;
+ if (page_index > end) {
+ unlock_page(page);
+ break;
+ }
+ wait_on_page_writeback(page);
while (page_mapped(page)) {
if (!did_range_unmap) {
/*
* Zap the rest of the file in one hit.
*/
unmap_mapping_range(mapping,
- page->index << PAGE_CACHE_SHIFT,
- (end - page->index + 1)
+ page_index << PAGE_CACHE_SHIFT,
+ (end - page_index + 1)
<< PAGE_CACHE_SHIFT,
0);
did_range_unmap = 1;
@@ -295,7 +301,7 @@ int invalidate_inode_pages2_range(struct
* Just zap this page
*/
unmap_mapping_range(mapping,
- page->index << PAGE_CACHE_SHIFT,
+ page_index << PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE, 0);
}
}

Attachments:

invalidate-livelock.patch (1.87 kB)

2005-03-08 12:55:36

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Fix destruction of in-use journal_head

journal_put_journal_head() can destroy a journal_head at any time as
long as the jh's b_jcount is zero and b_transaction is NULL. It has no
locking protection against the rest of the journaling code, as the lock
it uses to protect b_jcount and bh->b_private is not used elsewhere in
jbd.

However, there are small windows where b_transaction is getting set
temporarily to NULL during normal operations; typically this is
happening in

__journal_unfile_buffer(jh);
__journal_file_buffer(jh, ...);

call pairs, as __journal_unfile_buffer() will set b_transaction to NULL
and __journal_file_buffer() re-sets it afterwards. A truncate running
in parallel can lead to journal_unmap_buffer() destroying the jh if it
occurs between these two calls.

Fix this by adding a variant of __journal_unfile_buffer() which is only
used for these temporary jh unlinks, and which leaves the b_transaction
field intact so that we never leave a window open where b_transaction is
NULL.

Additionally, trap this error if it does occur, by checking against
jh->b_jlist being non-null when we destroy a jh.

Signed-off-by: Stephen Tweedie <[email protected]>

--- linux-2.6-ext3/fs/jbd/commit.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/commit.c
@@ -258,7 +258,7 @@ write_out_data:
BUFFER_TRACE(bh, "locked");
if (!inverted_lock(journal, bh))
goto write_out_data;
- __journal_unfile_buffer(jh);
+ __journal_temp_unlink_buffer(jh);
__journal_file_buffer(jh, commit_transaction,
BJ_Locked);
jbd_unlock_bh_state(bh);
--- linux-2.6-ext3/fs/jbd/journal.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/journal.c
@@ -1767,6 +1767,7 @@ static void __journal_remove_journal_hea
if (jh->b_transaction == NULL &&
jh->b_next_transaction == NULL &&
jh->b_cp_transaction == NULL) {
+ J_ASSERT_JH(jh, jh->b_jlist == BJ_None);
J_ASSERT_BH(bh, buffer_jbd(bh));
J_ASSERT_BH(bh, jh2bh(jh) == bh);
BUFFER_TRACE(bh, "remove journal_head");
--- linux-2.6-ext3/fs/jbd/transaction.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/transaction.c
@@ -1044,7 +1044,12 @@ int journal_dirty_data(handle_t *handle,
/* journal_clean_data_list() may have got there first */
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
- __journal_unfile_buffer(jh);
+ __journal_temp_unlink_buffer(jh);
+ /* It still points to the committing
+ * transaction; move it to this one so
+ * that the refile assert checks are
+ * happy. */
+ jh->b_transaction = handle->h_transaction;
}
/* The buffer will be refiled below */

@@ -1058,7 +1063,8 @@ int journal_dirty_data(handle_t *handle,
if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "not on correct data list: unfile");
J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
- __journal_unfile_buffer(jh);
+ __journal_temp_unlink_buffer(jh);
+ jh->b_transaction = handle->h_transaction;
JBUFFER_TRACE(jh, "file as data");
__journal_file_buffer(jh, handle->h_transaction,
BJ_SyncData);
@@ -1233,8 +1239,6 @@ int journal_forget (handle_t *handle, st

JBUFFER_TRACE(jh, "belongs to current transaction: unfile");

- __journal_unfile_buffer(jh);
-
/*
* We are no longer going to journal this buffer.
* However, the commit of this transaction is still
@@ -1248,8 +1252,10 @@ int journal_forget (handle_t *handle, st
*/

if (jh->b_cp_transaction) {
+ __journal_temp_unlink_buffer(jh);
__journal_file_buffer(jh, transaction, BJ_Forget);
} else {
+ __journal_unfile_buffer(jh);
journal_remove_journal_head(bh);
__brelse(bh);
if (!buffer_jbd(bh)) {
@@ -1469,7 +1475,7 @@ __blist_del_buffer(struct journal_head *
*
* Called under j_list_lock. The journal may not be locked.
*/
-void __journal_unfile_buffer(struct journal_head *jh)
+void __journal_temp_unlink_buffer(struct journal_head *jh)
{
struct journal_head **list = NULL;
transaction_t *transaction;
@@ -1486,7 +1492,7 @@ void __journal_unfile_buffer(struct jour

switch (jh->b_jlist) {
case BJ_None:
- goto out;
+ return;
case BJ_SyncData:
list = &transaction->t_sync_datalist;
break;
@@ -1519,7 +1525,11 @@ void __journal_unfile_buffer(struct jour
jh->b_jlist = BJ_None;
if (test_clear_buffer_jbddirty(bh))
mark_buffer_dirty(bh); /* Expose it to the VM */
-out:
+}
+
+void __journal_unfile_buffer(struct journal_head *jh)
+{
+ __journal_temp_unlink_buffer(jh);
jh->b_transaction = NULL;
}

@@ -1929,7 +1939,7 @@ void __journal_file_buffer(struct journa
}

if (jh->b_transaction)
- __journal_unfile_buffer(jh);
+ __journal_temp_unlink_buffer(jh);
jh->b_transaction = transaction;

switch (jlist) {
@@ -2012,7 +2022,7 @@ void __journal_refile_buffer(struct jour
*/

was_dirty = test_clear_buffer_jbddirty(bh);
- __journal_unfile_buffer(jh);
+ __journal_temp_unlink_buffer(jh);
jh->b_transaction = jh->b_next_transaction;
jh->b_next_transaction = NULL;
__journal_file_buffer(jh, jh->b_transaction, BJ_Metadata);

Attachments:

jbd-temp-unlink.patch (5.08 kB)

2005-03-08 15:13:50

by Jan Kara

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hello,

> On Mon, 2005-03-07 at 21:08, Stephen C. Tweedie wrote:
>
> > Right, that was what I was thinking might be possible. But for now I've
> > just done the simple patch --- make sure we don't clear
> > jh->b_transaction when we're just refiling buffers from one list to
> > another. That should have the desired effect here without dangerous
> > messing around with locks.
>
<snip>

> Fix destruction of in-use journal_head
>
> journal_put_journal_head() can destroy a journal_head at any time as
> long as the jh's b_jcount is zero and b_transaction is NULL. It has no
> locking protection against the rest of the journaling code, as the lock
> it uses to protect b_jcount and bh->b_private is not used elsewhere in
> jbd.
>
> However, there are small windows where b_transaction is getting set
> temporarily to NULL during normal operations; typically this is
> happening in
>
> __journal_unfile_buffer(jh);
> __journal_file_buffer(jh, ...);
>
> call pairs, as __journal_unfile_buffer() will set b_transaction to NULL
> and __journal_file_buffer() re-sets it afterwards. A truncate running
> in parallel can lead to journal_unmap_buffer() destroying the jh if it
> occurs between these two calls.
Isn't also the following scenario dangerous?

__journal_unfile_buffer(jh);
journal_remove_journal_head(bh);

If journal_unmap_buffer() decides to destroy the buffer before we call
journal_remove_journal_head(), we get an assertion failure in
__journal_remove_journal_head()...
I believe the right fix really is that anyone having a pointer to
bh/jh should hold a reference counted in b_jcount - so basically the
idea Andrew proposed.
It could work like follows: When you first create jh you get a
reference in b_jcount (that is already happening). If you file a jh in
some transaction list, you will 'delegate' this reference to the
transaction and hence you won't do journal_put_journal_head() after
you're done. Now if you unfile a buffer from the list you 'get back' it's
reference (and you have to journal_put_journal_head() if you're not
going to file the buffer back again). The only thing you have to assure is
the proper list locking so that two different callers cannot 'get back' one
reference but that should be already done...
This solution should be doable without additional locking or big code
changes but maybe the reference policy is a bit too complicated...

> Fix this by adding a variant of __journal_unfile_buffer() which is only
> used for these temporary jh unlinks, and which leaves the b_transaction
> field intact so that we never leave a window open where b_transaction is
> NULL.
>
> Additionally, trap this error if it does occur, by checking against
> jh->b_jlist being non-null when we destroy a jh.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2005-03-09 13:13:57

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

Hi,

On Tue, 2005-03-08 at 15:12, Jan Kara wrote:

> Isn't also the following scenario dangerous?
>
> __journal_unfile_buffer(jh);
> journal_remove_journal_head(bh);

It depends. I think the biggest problem here is that there's really no
written rule protecting this stuff universally. But in testing, we just
don't see this triggering.

Why not? We actually have a raft of other locks protecting us in
various places. As long as there's some overlap in the locks, we're
OK. But because it's not written down, not everybody relies on the same
set of locks; it's only because journal_unmap_buffer() was dropping the
jh without enough locks that we saw a problem there.

So the scenario:

__journal_unfile_buffer(jh);
journal_remove_journal_head(bh);

*might* be dangerous, if called carelessly; but in practice it works out
OK. Remember, that journal_remove_journal_head() call still takes the
bh_journal_head lock and still checks b_transaction with that held.

I think it's time I went and worked out *why* it works out OK, though,
so that we can write it down and make sure it stays working! And we may
well be able to simplify the locking when we do this; collapsing the two
bh state locks, for example, may help improve the robustness as well as
improving performance through removing some redundant locking
operations.

Fortunately, there are really only three classes of operation that can
remove a jh. There are the normal VFS/VM operations, like writepage,
try_to_free_buffer, etc; these are all called with the page lock. There
are metadata updates, which take a liberal dose of buffer, bh_state and
journal locks.

Finally there are the commit/checkpoint functions, which run
asynchronously to the rest of the journaling and which don't relate to
the current transaction, but to previous ones. This is the danger area,
I think. But as long as at least the bh_state lock is held, we can
protect these operations against the data/metadata update operations.

--Stephen

2005-03-09 13:28:49

by Jan Kara

[permalink] [raw]

Subject: Re: [RFC] ext3/jbd race: releasing in-use journal_heads

> On Tue, 2005-03-08 at 15:12, Jan Kara wrote:
>
> > Isn't also the following scenario dangerous?
> >
> > __journal_unfile_buffer(jh);
> > journal_remove_journal_head(bh);
>
> It depends. I think the biggest problem here is that there's really no
> written rule protecting this stuff universally. But in testing, we just
> don't see this triggering.
>
> Why not? We actually have a raft of other locks protecting us in
> various places. As long as there's some overlap in the locks, we're
> OK. But because it's not written down, not everybody relies on the same
> set of locks; it's only because journal_unmap_buffer() was dropping the
> jh without enough locks that we saw a problem there.
>
> So the scenario:
>
> __journal_unfile_buffer(jh);
> journal_remove_journal_head(bh);
>
> *might* be dangerous, if called carelessly; but in practice it works out
> OK. Remember, that journal_remove_journal_head() call still takes the
> bh_journal_head lock and still checks b_transaction with that held.
Hmm. I see for example a place at jbd/commit.c, line 287 (which you
did not change in your patch) which does this and doesn't seem to be
protected against journal_unmap_buffer() (but maybe I miss something).
Not that I'd find that race probable but in theory...

> I think it's time I went and worked out *why* it works out OK, though,
> so that we can write it down and make sure it stays working! And we may
> well be able to simplify the locking when we do this; collapsing the two
> bh state locks, for example, may help improve the robustness as well as
> improving performance through removing some redundant locking
> operations.
>
> Fortunately, there are really only three classes of operation that can
> remove a jh. There are the normal VFS/VM operations, like writepage,
> try_to_free_buffer, etc; these are all called with the page lock. There
> are metadata updates, which take a liberal dose of buffer, bh_state and
> journal locks.
>
> Finally there are the commit/checkpoint functions, which run
> asynchronously to the rest of the journaling and which don't relate to
> the current transaction, but to previous ones. This is the danger area,
> I think. But as long as at least the bh_state lock is held, we can
> protect these operations against the data/metadata update operations.
I agree that only the metadata/data updates can race with checkpoint/commit
because other combinations are already serialized by 'higher-level'
locks. But I think we agree that the simplier (and 'local') the locking rules
are the less probable the bugs are..

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2005-03-09 15:12:59