2011-08-11 13:32:32

by Josef Bacik

[permalink] [raw]
Subject: Can a metadata buffer end up in journal_unmap_buffer?

Hello,

I have this weird bug that has been plaguing me for a while where
t_outstanding_credits will end up less than t_nr_buffers. I have done
all sorts of things to try and catch when it happens but nothing seems
to catch it. At some point I had thought that we were screwing up in
journal_unmap_buffer. If a buffer is not on a transaction but is still
a part of a checkpoint we will do a journal_file_buffer() onto the
current running transaction's forget list. The thing is we can still
have b_modified set since we only clear it on
do_get_write_access/journal_get_create_access if it isn't a part of the
transaction yet. So if we do the journal_file_buffer() before anybody
calls do_get_write_access/journal_get_create_access we will short
circuit these checks and b_modified will never be cleared and so when we
do journal_dirty_metadata we won't account for the new buffer and it
will end up inc'ing t_nr_buffers but not t_outstanding_credits.

I had thought this was the problem before and put in a jh->b_modified =
0 in __dispose_buffer, but apparently the problem still happened. But
that support person/customer were not entirely reliable so I'm back to
thinking this is what happened and they just didn't run with my patch.

The problem is I don't think we can call journal_unmap_buffer() on just
a normal metadata block (that is with data=ordered), it only gets called
by ext3_invalidatepage() which is only called on pages on the inodes
address space, so not metadata. However, Jan had a patch to delay the
free'ing of buffers for orphan reasons, with commit

86963918965eb8fe0c8ae009e7c1b4c630f533d5

which makes it seem like metadata can come through
journal_unmap_buffer()? Does anybody know for sure one way or the
other? And if you happen to have a theory on the actual problem itself
I would _love_ to hear it :). Thanks,

Josef


2011-08-11 15:28:24

by Jan Kara

[permalink] [raw]
Subject: Re: Can a metadata buffer end up in journal_unmap_buffer?

Hello,

On Thu 11-08-11 09:32:22, Josef Bacik wrote:
> I have this weird bug that has been plaguing me for a while where
> t_outstanding_credits will end up less than t_nr_buffers. I have done
> all sorts of things to try and catch when it happens but nothing seems
> to catch it. At some point I had thought that we were screwing up in
> journal_unmap_buffer. If a buffer is not on a transaction but is still
> a part of a checkpoint we will do a journal_file_buffer() onto the
> current running transaction's forget list. The thing is we can still
> have b_modified set since we only clear it on
> do_get_write_access/journal_get_create_access if it isn't a part of the
> transaction yet. So if we do the journal_file_buffer() before anybody
> calls do_get_write_access/journal_get_create_access we will short
> circuit these checks and b_modified will never be cleared and so when we
> do journal_dirty_metadata we won't account for the new buffer and it
> will end up inc'ing t_nr_buffers but not t_outstanding_credits.
Good spotting!

> I had thought this was the problem before and put in a jh->b_modified =
> 0 in __dispose_buffer, but apparently the problem still happened. But
> that support person/customer were not entirely reliable so I'm back to
> thinking this is what happened and they just didn't run with my patch.
Umm, I think there's one more way how buffer b_modified == 1 can get
to other transaction's forget list. In journal_unmap_buffer(), transaction
== journal->j_committing_transaction case we do set_buffer_freed() and
set b_next_transaction to the running transaction. So when the currently
committing transaction finishes, it refiles the buffer to BJ_Forget list
of the running transaction. b_modified handling seems to be really fragile
in this regard. I guess the rule is that whenever we are going to change
b_transaction or b_next_transaction, we should clear b_modified.

> The problem is I don't think we can call journal_unmap_buffer() on just
> a normal metadata block (that is with data=ordered), it only gets called
> by ext3_invalidatepage() which is only called on pages on the inodes
> address space, so not metadata. However, Jan had a patch to delay the
> free'ing of buffers for orphan reasons, with commit
>
> 86963918965eb8fe0c8ae009e7c1b4c630f533d5
>
> which makes it seem like metadata can come through
> journal_unmap_buffer()? Does anybody know for sure one way or the
> other? And if you happen to have a theory on the actual problem itself
> I would _love_ to hear it :). Thanks,
It can happen either in data=journal mode or for long symlinks. BTW I
think Kyle Moffet was probably hitting this bug with ext4
(http://lists.debian.org/debian-kernel/2011/04/msg00145.html) and I was
also trying to find the culprit for some time without success so I'm glad
you found it in the end ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-11 16:16:52

by Josef Bacik

[permalink] [raw]
Subject: Re: Can a metadata buffer end up in journal_unmap_buffer?

On 08/11/2011 11:28 AM, Jan Kara wrote:
> Hello,
>
> On Thu 11-08-11 09:32:22, Josef Bacik wrote:
>> I have this weird bug that has been plaguing me for a while where
>> t_outstanding_credits will end up less than t_nr_buffers. I have done
>> all sorts of things to try and catch when it happens but nothing seems
>> to catch it. At some point I had thought that we were screwing up in
>> journal_unmap_buffer. If a buffer is not on a transaction but is still
>> a part of a checkpoint we will do a journal_file_buffer() onto the
>> current running transaction's forget list. The thing is we can still
>> have b_modified set since we only clear it on
>> do_get_write_access/journal_get_create_access if it isn't a part of the
>> transaction yet. So if we do the journal_file_buffer() before anybody
>> calls do_get_write_access/journal_get_create_access we will short
>> circuit these checks and b_modified will never be cleared and so when we
>> do journal_dirty_metadata we won't account for the new buffer and it
>> will end up inc'ing t_nr_buffers but not t_outstanding_credits.
> Good spotting!
>
>> I had thought this was the problem before and put in a jh->b_modified =
>> 0 in __dispose_buffer, but apparently the problem still happened. But
>> that support person/customer were not entirely reliable so I'm back to
>> thinking this is what happened and they just didn't run with my patch.
> Umm, I think there's one more way how buffer b_modified == 1 can get
> to other transaction's forget list. In journal_unmap_buffer(), transaction
> == journal->j_committing_transaction case we do set_buffer_freed() and
> set b_next_transaction to the running transaction. So when the currently
> committing transaction finishes, it refiles the buffer to BJ_Forget list
> of the running transaction. b_modified handling seems to be really fragile
> in this regard. I guess the rule is that whenever we are going to change
> b_transaction or b_next_transaction, we should clear b_modified.
>

Well this is happening on RHEL5, where we have

set_buffer_freed();
if (jh->b_next_transaction)
jh->b_next_transaction = NULL;

so the only way this happens if it goes through __dispose_buffer.

And the more I look at this I can't see how it would happen exactly. I
can definitely get a modified buffer to show up on the forget list, but
I can't see how I would then re-modify the thing to get it to show up on
BJ_Metadata. On data=journal mode I can definitely see how to do it,
but not with data=ordered mode. The only way to go through
journal_unmap_buffer is to truncate the inode, and for a symlink the
only way to make that happen is to delete it. So I don't see how I
could then make it get dirtied again? Thanks,

Josef

2011-08-11 16:21:33

by Josef Bacik

[permalink] [raw]
Subject: Re: Can a metadata buffer end up in journal_unmap_buffer?

On 08/11/2011 12:16 PM, Josef Bacik wrote:
> On 08/11/2011 11:28 AM, Jan Kara wrote:
>> Hello,
>>
>> On Thu 11-08-11 09:32:22, Josef Bacik wrote:
>>> I have this weird bug that has been plaguing me for a while where
>>> t_outstanding_credits will end up less than t_nr_buffers. I have done
>>> all sorts of things to try and catch when it happens but nothing seems
>>> to catch it. At some point I had thought that we were screwing up in
>>> journal_unmap_buffer. If a buffer is not on a transaction but is still
>>> a part of a checkpoint we will do a journal_file_buffer() onto the
>>> current running transaction's forget list. The thing is we can still
>>> have b_modified set since we only clear it on
>>> do_get_write_access/journal_get_create_access if it isn't a part of the
>>> transaction yet. So if we do the journal_file_buffer() before anybody
>>> calls do_get_write_access/journal_get_create_access we will short
>>> circuit these checks and b_modified will never be cleared and so when we
>>> do journal_dirty_metadata we won't account for the new buffer and it
>>> will end up inc'ing t_nr_buffers but not t_outstanding_credits.
>> Good spotting!
>>
>>> I had thought this was the problem before and put in a jh->b_modified =
>>> 0 in __dispose_buffer, but apparently the problem still happened. But
>>> that support person/customer were not entirely reliable so I'm back to
>>> thinking this is what happened and they just didn't run with my patch.
>> Umm, I think there's one more way how buffer b_modified == 1 can get
>> to other transaction's forget list. In journal_unmap_buffer(), transaction
>> == journal->j_committing_transaction case we do set_buffer_freed() and
>> set b_next_transaction to the running transaction. So when the currently
>> committing transaction finishes, it refiles the buffer to BJ_Forget list
>> of the running transaction. b_modified handling seems to be really fragile
>> in this regard. I guess the rule is that whenever we are going to change
>> b_transaction or b_next_transaction, we should clear b_modified.
>>
>
> Well this is happening on RHEL5, where we have
>
> set_buffer_freed();
> if (jh->b_next_transaction)
> jh->b_next_transaction = NULL;
>
> so the only way this happens if it goes through __dispose_buffer.
>
> And the more I look at this I can't see how it would happen exactly. I
> can definitely get a modified buffer to show up on the forget list, but
> I can't see how I would then re-modify the thing to get it to show up on
> BJ_Metadata. On data=journal mode I can definitely see how to do it,
> but not with data=ordered mode. The only way to go through
> journal_unmap_buffer is to truncate the inode, and for a symlink the
> only way to make that happen is to delete it. So I don't see how I
> could then make it get dirtied again? Thanks,

Bah ignore me I'm an idiot, if the thing gets evicted from cache it will
call truncate_inode_pages which will do all that work. I've been
staring at jbd entirely too long :(. Thanks,

Josef

2011-08-11 20:39:07

by Josef Bacik

[permalink] [raw]
Subject: Re: Can a metadata buffer end up in journal_unmap_buffer?

On 08/11/2011 11:28 AM, Jan Kara wrote:
> Hello,
>
> On Thu 11-08-11 09:32:22, Josef Bacik wrote:
>> I have this weird bug that has been plaguing me for a while where
>> t_outstanding_credits will end up less than t_nr_buffers. I have done
>> all sorts of things to try and catch when it happens but nothing seems
>> to catch it. At some point I had thought that we were screwing up in
>> journal_unmap_buffer. If a buffer is not on a transaction but is still
>> a part of a checkpoint we will do a journal_file_buffer() onto the
>> current running transaction's forget list. The thing is we can still
>> have b_modified set since we only clear it on
>> do_get_write_access/journal_get_create_access if it isn't a part of the
>> transaction yet. So if we do the journal_file_buffer() before anybody
>> calls do_get_write_access/journal_get_create_access we will short
>> circuit these checks and b_modified will never be cleared and so when we
>> do journal_dirty_metadata we won't account for the new buffer and it
>> will end up inc'ing t_nr_buffers but not t_outstanding_credits.
> Good spotting!
>
>> I had thought this was the problem before and put in a jh->b_modified =
>> 0 in __dispose_buffer, but apparently the problem still happened. But
>> that support person/customer were not entirely reliable so I'm back to
>> thinking this is what happened and they just didn't run with my patch.
> Umm, I think there's one more way how buffer b_modified == 1 can get
> to other transaction's forget list. In journal_unmap_buffer(), transaction
> == journal->j_committing_transaction case we do set_buffer_freed() and
> set b_next_transaction to the running transaction. So when the currently
> committing transaction finishes, it refiles the buffer to BJ_Forget list
> of the running transaction. b_modified handling seems to be really fragile
> in this regard. I guess the rule is that whenever we are going to change
> b_transaction or b_next_transaction, we should clear b_modified.
>
>> The problem is I don't think we can call journal_unmap_buffer() on just
>> a normal metadata block (that is with data=ordered), it only gets called
>> by ext3_invalidatepage() which is only called on pages on the inodes
>> address space, so not metadata. However, Jan had a patch to delay the
>> free'ing of buffers for orphan reasons, with commit
>>
>> 86963918965eb8fe0c8ae009e7c1b4c630f533d5
>>
>> which makes it seem like metadata can come through
>> journal_unmap_buffer()? Does anybody know for sure one way or the
>> other? And if you happen to have a theory on the actual problem itself
>> I would _love_ to hear it :). Thanks,
> It can happen either in data=journal mode or for long symlinks. BTW I
> think Kyle Moffet was probably hitting this bug with ext4
> (http://lists.debian.org/debian-kernel/2011/04/msg00145.html) and I was
> also trying to find the culprit for some time without success so I'm glad
> you found it in the end ;).
>

So I'm back to thinking this particular case can't happen with
data=ordered mode, at least the horrible part where the buffer gets
dirtied without h_buffer_credit getting toggled for it. Once we create
the symlink we don't ever touch the page again except to get rid of it
when we delete the symlink. This will make it show back up on the
running transactions forget list, but there's no way it could end up
modified again and end up on it's metadata list. So it's back to debug
patches because I have no other ideas. Thanks,

Josef