2005-04-04 09:04:20

by Jan Kara

[permalink] [raw]
Subject: Problem in log_do_checkpoint()?

Hello,

I've been looking through the JBD code when trying to understand the
assertion failure in log_do_checkpoint() (it was on old SUSE 2.6.5 kernel
though the reporter claims to be able to get the failure even with the
Stephen's patch fixing a race with journal_put_journal_head()) and I've
spotted one place where I think could be a race (the code around there
seems to be the same in latest kernels):
In log_do_checkpoint() we go through the t_checkpoint_list of a
transaction and call __flush_buffer() on each buffer. Suppose there is
just one buffer on the list and it is dirty. __flush_buffer() sees it and
puts it to an array of buffers for flushing. Then the loop finishes,
retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
drop all locks and sleep. While we are sleeping somebody else comes and
makes the buffer dirty again (OK, that is not probable, but I think it
could be possible). Now we wake up and call __cleanup_transaction().
It's not able to do anything and returns 0. And we fail on the assertion
J_ASSERT(drop_count != 0 || cleanup_ret != 0).
Am I missing something? In my opinion we should set retry=1 after we
call __flush_batch() even if we call it outside of the "__flush_buffer-loop"...

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs


2005-04-08 14:06:43

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

Hi,

On Mon, 2005-04-04 at 10:04, Jan Kara wrote:

> In log_do_checkpoint() we go through the t_checkpoint_list of a
> transaction and call __flush_buffer() on each buffer. Suppose there is
> just one buffer on the list and it is dirty. __flush_buffer() sees it and
> puts it to an array of buffers for flushing. Then the loop finishes,
> retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> drop all locks and sleep. While we are sleeping somebody else comes and
> makes the buffer dirty again (OK, that is not probable, but I think it
> could be possible).

Yes, there's no reason why not at that point.

> Now we wake up and call __cleanup_transaction().
> It's not able to do anything and returns 0.

I think the _right_ answer here is to have two separate checkpoint
lists: the current one, plus one for which the checkpoint write has
already been submitted. That way, we can wait for IO completion on
submitted writes without (a) getting conned into doing multiple rewrites
if there's somebody else dirtying the buffer; or (b) getting confused
about how much progress we're making. Buffers on the pre-write list get
written; buffers on the post-write list get waited for; and both count
as progress (eliminating the false assert-failure when we failed to
detect progress).

The prevention of multiple writes in this case should also improve
performance a little.

That ought to be pretty straightforward, I think. The existing cases
where we remove buffers from a checkpoint shouldn't have to care about
which list_head we're removing from; those cases already handle buffers
in both states. It's only when doing the flush/wait that we have to
distinguish the two.

Sounds good?

--Stephen

2005-04-08 17:25:03

by Badari Pulavarty

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

I get OOPs in log_do_checkpoint() while using ext3 quotas.
Is this anyway related to what you are working on ?

Unable to handle kernel NULL pointer dereference at virtual address
00000000
printing eip:
801aeee1
*pde = 52b31001
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in:
CPU: 3
EIP: 0060:[<801aeee1>] Not tainted VLI
EFLAGS: 00010213 (2.6.11-22)
EIP is at log_do_checkpoint+0x91/0x220
eax: 00000002 ebx: b7d09e0c ecx: 00000001 edx: e24a2000
esi: 00000000 edi: c4bac47c ebp: cceb726c esp: e24a2d18
ds: 007b es: 007b ss: 0068
Process rm (pid: 8694, threadinfo=e24a2000 task=f7b79040)
Stack: f7dc70e4 a1d60b3c e24a2d44 e24a2d3c e24a2d40 e24a2000 00004df4
a6062200
f7dc70e4 00000000 00000000 95447db0 95447e4c ec6c1d7c b52210e4
ec032b40
ec032b0c 936a5800 e5a262b8 95447cac eb4c4354 936a57cc 936a5798
ac0e93bc
Call Trace:
[<801ae94f>] __log_wait_for_space+0x9f/0xc0
[<801a9b42>] start_this_handle+0x132/0x3f0
[<8012f720>] autoremove_wake_function+0x0/0x60
[<8012f720>] autoremove_wake_function+0x0/0x60
[<801a9efd>] journal_start+0xad/0xe0
[<801a68b1>] ext3_dquot_initialize+0x51/0x70
[<801a2d0d>] ext3_rmdir+0x4d/0x1c0
[<8031df76>] _spin_lock+0x16/0x90
[<80168aa9>] vfs_rmdir+0x189/0x230
[<80168be9>] sys_rmdir+0x99/0xf0
[<8010272f>] syscall_call+0x7/0xb
Code: 8b 54 24 1c 89 5c 24 28 8b 40 04 89 44 24 18 8b 5a 28 8b 6b 2c 89
df 8d 76 00 89 fb b8 01 00 00 00 8b 7f 28 8b 33 e8 cf 76 f6 ff <f0> 0f
ba 2e 13 19 c0 85 c0 0f 85 3f 01 00 00 89 5c 24 04 8d 44

Thanks,
Badari



On Mon, 2005-04-04 at 02:04, Jan Kara wrote:
> Hello,
>
> I've been looking through the JBD code when trying to understand the
> assertion failure in log_do_checkpoint() (it was on old SUSE 2.6.5 kernel
> though the reporter claims to be able to get the failure even with the
> Stephen's patch fixing a race with journal_put_journal_head()) and I've
> spotted one place where I think could be a race (the code around there
> seems to be the same in latest kernels):
> In log_do_checkpoint() we go through the t_checkpoint_list of a
> transaction and call __flush_buffer() on each buffer. Suppose there is
> just one buffer on the list and it is dirty. __flush_buffer() sees it and
> puts it to an array of buffers for flushing. Then the loop finishes,
> retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> drop all locks and sleep. While we are sleeping somebody else comes and
> makes the buffer dirty again (OK, that is not probable, but I think it
> could be possible). Now we wake up and call __cleanup_transaction().
> It's not able to do anything and returns 0. And we fail on the assertion
> J_ASSERT(drop_count != 0 || cleanup_ret != 0).
> Am I missing something? In my opinion we should set retry=1 after we
> call __flush_batch() even if we call it outside of the "__flush_buffer-loop"...
>
> Honza

2005-04-11 11:13:13

by Jan Kara

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

> I get OOPs in log_do_checkpoint() while using ext3 quotas.
> Is this anyway related to what you are working on ?
Nope, it does not seem to be the same problem. In theory it could be a
bug Stephen fixed some time ago - could you try to reproduce the problem
with 2.6.12-rc2 (it contains the fix)? If you're still able to get the
oops, could you please find out where exactly in log_do_checkpoint() it
occured?

>
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000
> printing eip:
> 801aeee1
> *pde = 52b31001
> Oops: 0002 [#1]
> PREEMPT SMP
> Modules linked in:
> CPU: 3
> EIP: 0060:[<801aeee1>] Not tainted VLI
> EFLAGS: 00010213 (2.6.11-22)
> EIP is at log_do_checkpoint+0x91/0x220
> eax: 00000002 ebx: b7d09e0c ecx: 00000001 edx: e24a2000
> esi: 00000000 edi: c4bac47c ebp: cceb726c esp: e24a2d18
> ds: 007b es: 007b ss: 0068
> Process rm (pid: 8694, threadinfo=e24a2000 task=f7b79040)
> Stack: f7dc70e4 a1d60b3c e24a2d44 e24a2d3c e24a2d40 e24a2000 00004df4
> a6062200
> f7dc70e4 00000000 00000000 95447db0 95447e4c ec6c1d7c b52210e4
> ec032b40
> ec032b0c 936a5800 e5a262b8 95447cac eb4c4354 936a57cc 936a5798
> ac0e93bc
> Call Trace:
> [<801ae94f>] __log_wait_for_space+0x9f/0xc0
> [<801a9b42>] start_this_handle+0x132/0x3f0
> [<8012f720>] autoremove_wake_function+0x0/0x60
> [<8012f720>] autoremove_wake_function+0x0/0x60
> [<801a9efd>] journal_start+0xad/0xe0
> [<801a68b1>] ext3_dquot_initialize+0x51/0x70
> [<801a2d0d>] ext3_rmdir+0x4d/0x1c0
> [<8031df76>] _spin_lock+0x16/0x90
> [<80168aa9>] vfs_rmdir+0x189/0x230
> [<80168be9>] sys_rmdir+0x99/0xf0
> [<8010272f>] syscall_call+0x7/0xb
> Code: 8b 54 24 1c 89 5c 24 28 8b 40 04 89 44 24 18 8b 5a 28 8b 6b 2c 89
> df 8d 76 00 89 fb b8 01 00 00 00 8b 7f 28 8b 33 e8 cf 76 f6 ff <f0> 0f
> ba 2e 13 19 c0 85 c0 0f 85 3f 01 00 00 89 5c 24 04 8d 44

Thanks for report
Honza

2005-04-11 11:36:26

by Jan Kara

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

Hello,

> On Mon, 2005-04-04 at 10:04, Jan Kara wrote:
>
> > In log_do_checkpoint() we go through the t_checkpoint_list of a
> > transaction and call __flush_buffer() on each buffer. Suppose there is
> > just one buffer on the list and it is dirty. __flush_buffer() sees it and
> > puts it to an array of buffers for flushing. Then the loop finishes,
> > retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we
> > drop all locks and sleep. While we are sleeping somebody else comes and
> > makes the buffer dirty again (OK, that is not probable, but I think it
> > could be possible).
>
> Yes, there's no reason why not at that point.
>
> > Now we wake up and call __cleanup_transaction().
> > It's not able to do anything and returns 0.
>
> I think the _right_ answer here is to have two separate checkpoint
> lists: the current one, plus one for which the checkpoint write has
> already been submitted. That way, we can wait for IO completion on
> submitted writes without (a) getting conned into doing multiple rewrites
> if there's somebody else dirtying the buffer; or (b) getting confused
> about how much progress we're making. Buffers on the pre-write list get
> written; buffers on the post-write list get waited for; and both count
> as progress (eliminating the false assert-failure when we failed to
> detect progress).
Yes, this seems to be a better long-term solution. A hotfix (retrying
after __flush_batch()) is attached if somebody is interested - it should
be safe and is lightly tested.

> The prevention of multiple writes in this case should also improve
> performance a little.
>
> That ought to be pretty straightforward, I think. The existing cases
> where we remove buffers from a checkpoint shouldn't have to care about
> which list_head we're removing from; those cases already handle buffers
> in both states. It's only when doing the flush/wait that we have to
> distinguish the two.
Yes, AFAICS the changes should remain local to the checkpointing code
(plus __unlink_buffer()). Should I write the patch or will you?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs


Attachments:
(No filename) (2.09 kB)
jbd-2.6.12-rc2-checkpoint.diff (686.00 B)
Download all attachments

2005-04-11 11:38:25

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

Hi,

On Fri, 2005-04-08 at 18:14, Badari Pulavarty wrote:
> I get OOPs in log_do_checkpoint() while using ext3 quotas.
> Is this anyway related to what you are working on ?
>
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000

Doesn't look like it, no. If we understand it right, Jan's problem
would only ever manifest itself as an assert failure, not as an oops.

--Stephen

2005-04-12 14:41:26

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: Problem in log_do_checkpoint()?

Hi,

On Mon, 2005-04-11 at 12:36, Jan Kara wrote:

> > The prevention of multiple writes in this case should also improve
> > performance a little.
> >
> > That ought to be pretty straightforward, I think. The existing cases
> > where we remove buffers from a checkpoint shouldn't have to care about
> > which list_head we're removing from; those cases already handle buffers
> > in both states. It's only when doing the flush/wait that we have to
> > distinguish the two.
> Yes, AFAICS the changes should remain local to the checkpointing code
> (plus __unlink_buffer()). Should I write the patch or will you?

Feel free, but please let me know if you start. I'm doing a bit of
chasing of leaks and dealing with that O_SYNC thing for 2.4 right now,
but I'll get back to the checkpoint code after that if you haven't
started by then.

Cheers,
Stephen