2007-08-14 15:19:15

by Josef Bacik

[permalink] [raw]
Subject: [PATCH] fix panic in jbd by adding locks

Hello,

It is possible to panic the box by a race condition that exists in the
journalling code where we do not take the j_revoke_lock when traversing the
journal's revoked record list. This patch has been tested and we haven't seen
the issue yet, its a rather straightforward and correct (at least I think so :)
fix. Thank you,

Signed-off-by: Josef Bacik <[email protected]>


diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index 62e13c8..317f598 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -518,6 +518,7 @@ void journal_write_revoke_records(journal_t *journal,
for (i = 0; i < revoke->hash_size; i++) {
hash_list = &revoke->hash_table[i];

+ spin_lock(&journal->j_revoke_lock);
while (!list_empty(hash_list)) {
record = (struct jbd_revoke_record_s *)
hash_list->next;
@@ -528,6 +529,7 @@ void journal_write_revoke_records(journal_t *journal,
list_del(&record->hash);
kmem_cache_free(revoke_record_cache, record);
}
+ spin_unlock(&journal->j_revoke_lock);
}
if (descriptor)
flush_descriptor(journal, descriptor, offset);
@@ -694,10 +696,12 @@ void journal_clear_revoke(journal_t *journal)

for (i = 0; i < revoke->hash_size; i++) {
hash_list = &revoke->hash_table[i];
+ spin_lock(&journal->j_revoke_lock);
while (!list_empty(hash_list)) {
record = (struct jbd_revoke_record_s*) hash_list->next;
list_del(&record->hash);
kmem_cache_free(revoke_record_cache, record);
}
+ spin_unlock(&journal->j_revoke_lock);
}
}


2007-08-15 12:03:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

Hello,

> It is possible to panic the box by a race condition that exists in the
> journalling code where we do not take the j_revoke_lock when traversing the
> journal's revoked record list. This patch has been tested and we haven't seen
> the issue yet, its a rather straightforward and correct (at least I think so :)
> fix. Thank you,
In principle, the patch looks fine. The only thing I'm wondering about
is how that panic can happen... Journal write_revoke_records() is called
from journal_commit_transaction() when revoke table for the committing
transaction shouldn't see any further changes. So maybe the patch is
masking a different problem.
Do you have a way of reproducing the problem? Any stack trace
available?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-08-15 12:13:58

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Wed, Aug 15, 2007 at 01:37:37PM +0200, Jan Kara wrote:
> Hello,
>
> > It is possible to panic the box by a race condition that exists in the
> > journalling code where we do not take the j_revoke_lock when traversing the
> > journal's revoked record list. This patch has been tested and we haven't seen
> > the issue yet, its a rather straightforward and correct (at least I think so :)
> > fix. Thank you,
> In principle, the patch looks fine. The only thing I'm wondering about
> is how that panic can happen... Journal write_revoke_records() is called
> from journal_commit_transaction() when revoke table for the committing
> transaction shouldn't see any further changes. So maybe the patch is
> masking a different problem.
> Do you have a way of reproducing the problem? Any stack trace
> available?

Reproducing the problem is tricky as theres no sure way to make it happen,
basically we run the box with alot of memory pressure while doing alot
operations that require journalling. Here is the backtrace of the panic (note
this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)

<1>Unable to handle kernel paging request at virtual address 60000000002b1110
<4>kjournald[1310]: Oops 11012296146944 [1]
<4>Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
<4>
<4>Pid: 1310, CPU 0, comm: kjournald
<4>psr : 0000121008026018 ifs : 8000000000000c9c ip : [<a000000200045161>]
Tainted: P
<4>ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
<4>unat: 0000000000000000 pfs : 0000000000000c9c rsc : 0000000000000003
<4>rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a681
<4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
<4>csd : 0000000000000000 ssd : 0000000000000000
<4>b0 : a000000200045270 b6 : a00000020026a240 b7 : a00000010000ee90
<4>f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c00000000
<4>f8 : 1000e86fb000000000000 f9 : 100029000000000000000
<4>f10 : 1000aeff71c71b9e6e61a f11 : 1003e0000000000000eff
<4>r1 : a000000200234000 r2 : 000000000000048c r3 : e0000002791a7a90
<4>r8 : 0000000000000000 r9 : e0000002791a0400 r10 : 0000000000000000
<4>r11 : e000000001000000 r12 : e0000002791a7b00 r13 : e0000002791a0000
<4>r14 : e00000027b7ee6c0 r15 : e0000002791a7b00 r16 : e000000272d48018
<4>r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0009804c8a70033f
<4>r20 : 60000000002b1118 r21 : a00000010006ad70 r22 : 0000000000000019
<4>r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000019
<4>r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000006a41
<4>r29 : 0000000000000000 r30 : 0000000000000000 r31 : e00000027b7ee5a4
<4>
<4>Call Trace:
<4> [<a000000100016da0>] show_stack+0x80/0xa0
<4> sp=e0000002791a7690 bsp=e0000002791a1170
<4> [<a0000001000176b0>] show_regs+0x890/0x8c0
<4> sp=e0000002791a7860 bsp=e0000002791a1128
<4> [<a00000010003e910>] die+0x150/0x240
<4> sp=e0000002791a7880 bsp=e0000002791a10e8
<4> [<a0000001000644c0>] ia64_do_page_fault+0x8c0/0xbc0
<4> sp=e0000002791a7880 bsp=e0000002791a1080
<4> [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
<4> sp=e0000002791a7930 bsp=e0000002791a1080
<4> [<a000000200045160>] journal_write_revoke_records+0x220/0x4e0 [jbd]
<4> sp=e0000002791a7b00 bsp=e0000002791a0f98
<4> [<a00000020003d940>] journal_commit_transaction+0xf80/0x3080 [jbd]
<4> sp=e0000002791a7b10 bsp=e0000002791a0ea0
<4> [<a0000002000458d0>] kjournald+0x170/0x580 [jbd]
<4> sp=e0000002791a7d80 bsp=e0000002791a0e38
<4> [<a000000100018c70>] kernel_thread_helper+0x30/0x60
<4> sp=e0000002791a7e30 bsp=e0000002791a0e10
<4> [<a000000100008c60>] start_kernel_thread+0x20/0x40
<4> sp=e0000002791a7e30 bsp=e0000002791a0e10

While analyzing the problem, Hitachi came up with this explanation for the race
condition


PID: 31401 TASK: e00000004fb30000 CPU: 1 COMMAND: "GET"
#0 [BSP:e00000004fb314d8] context_switch at a00000010006ab90
#1 [BSP:e00000004fb313b8] schedule at a000000100590f40
#2 [BSP:e00000004fb31340] do_get_write_access at a0000002000388e0
#3 [BSP:e00000004fb31300] journal_get_write_access at a000000200039680
#4 [BSP:e00000004fb312b8] ext3_reserve_inode_write at a00000020013f180
#5 [BSP:e00000004fb31290] ext3_mark_inode_dirty at a00000020013f2a0
#6 [BSP:e00000004fb31260] ext3_dirty_inode at a000000200144310
#7 [BSP:e00000004fb31210] __mark_inode_dirty at a000000100178200
#8 [BSP:e00000004fb311e8] update_atime at a000000100165cc0
#9 [BSP:e00000004fb31128] do_generic_mapping_read at a0000001000d40e0
#10 [BSP:e00000004fb310c0] __generic_file_aio_read at a0000001000d8b40
#11 [BSP:e00000004fb31088] generic_file_aio_read at a0000001000d8f60
#12 [BSP:e00000004fb31048] do_sync_read at a0000001001238a0
#13 [BSP:e00000004fb30ff0] vfs_read at a000000100123b90
#14 [BSP:e00000004fb30f78] sys_read at a0000001001241d0

do_get_write_access() can call journal_revoke_cancel().
revoke_record_cache variable is operated in journal_revoke_cancel().
journal_revoke_cancel() is called in user process context, and
journal_write_revoke_records() is called in kjournald context.
Therefore, there is a race condition between journal_revoke_cancel() and
journal_write_revoke_records() because journal_write_revoke_records()
operates revoke_record_cache variable without lock.

I hope this is helpful. Thank you,

Josef

2007-08-16 16:08:36

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

Hello,

> > > It is possible to panic the box by a race condition that exists in the
> > > journalling code where we do not take the j_revoke_lock when traversing the
> > > journal's revoked record list. This patch has been tested and we haven't seen
> > > the issue yet, its a rather straightforward and correct (at least I think so :)
> > > fix. Thank you,
> > In principle, the patch looks fine. The only thing I'm wondering about
> > is how that panic can happen... Journal write_revoke_records() is called
> > from journal_commit_transaction() when revoke table for the committing
> > transaction shouldn't see any further changes. So maybe the patch is
> > masking a different problem.
> > Do you have a way of reproducing the problem? Any stack trace
> > available?
>
> Reproducing the problem is tricky as theres no sure way to make it happen,
> basically we run the box with alot of memory pressure while doing alot
> operations that require journalling. Here is the backtrace of the panic (note
> this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)
OK.

> <1>Unable to handle kernel paging request at virtual address 60000000002b1110
> <4>kjournald[1310]: Oops 11012296146944 [1]
> <4>Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
> bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
> <4>
> <4>Pid: 1310, CPU 0, comm: kjournald
> <4>psr : 0000121008026018 ifs : 8000000000000c9c ip : [<a000000200045161>]
> Tainted: P
> <4>ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
> <4>unat: 0000000000000000 pfs : 0000000000000c9c rsc : 0000000000000003
> <4>rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a681
> <4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> <4>csd : 0000000000000000 ssd : 0000000000000000
> <4>b0 : a000000200045270 b6 : a00000020026a240 b7 : a00000010000ee90
> <4>f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c00000000
> <4>f8 : 1000e86fb000000000000 f9 : 100029000000000000000
> <4>f10 : 1000aeff71c71b9e6e61a f11 : 1003e0000000000000eff
> <4>r1 : a000000200234000 r2 : 000000000000048c r3 : e0000002791a7a90
> <4>r8 : 0000000000000000 r9 : e0000002791a0400 r10 : 0000000000000000
> <4>r11 : e000000001000000 r12 : e0000002791a7b00 r13 : e0000002791a0000
> <4>r14 : e00000027b7ee6c0 r15 : e0000002791a7b00 r16 : e000000272d48018
> <4>r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0009804c8a70033f
> <4>r20 : 60000000002b1118 r21 : a00000010006ad70 r22 : 0000000000000019
> <4>r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000019
> <4>r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000006a41
> <4>r29 : 0000000000000000 r30 : 0000000000000000 r31 : e00000027b7ee5a4
> <4>
> <4>Call Trace:
> <4> [<a000000100016da0>] show_stack+0x80/0xa0
> <4> sp=e0000002791a7690 bsp=e0000002791a1170
> <4> [<a0000001000176b0>] show_regs+0x890/0x8c0
> <4> sp=e0000002791a7860 bsp=e0000002791a1128
> <4> [<a00000010003e910>] die+0x150/0x240
> <4> sp=e0000002791a7880 bsp=e0000002791a10e8
> <4> [<a0000001000644c0>] ia64_do_page_fault+0x8c0/0xbc0
> <4> sp=e0000002791a7880 bsp=e0000002791a1080
> <4> [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
> <4> sp=e0000002791a7930 bsp=e0000002791a1080
> <4> [<a000000200045160>] journal_write_revoke_records+0x220/0x4e0 [jbd]
> <4> sp=e0000002791a7b00 bsp=e0000002791a0f98
> <4> [<a00000020003d940>] journal_commit_transaction+0xf80/0x3080 [jbd]
> <4> sp=e0000002791a7b10 bsp=e0000002791a0ea0
> <4> [<a0000002000458d0>] kjournald+0x170/0x580 [jbd]
> <4> sp=e0000002791a7d80 bsp=e0000002791a0e38
> <4> [<a000000100018c70>] kernel_thread_helper+0x30/0x60
> <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
> <4> [<a000000100008c60>] start_kernel_thread+0x20/0x40
> <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
Do you know (or could you find out) where exactly in the code is
journal_write_revoke_records+0x221/0x4e0?

> While analyzing the problem, Hitachi came up with this explanation for the race
> condition
>
> PID: 31401 TASK: e00000004fb30000 CPU: 1 COMMAND: "GET"
> #0 [BSP:e00000004fb314d8] context_switch at a00000010006ab90
> #1 [BSP:e00000004fb313b8] schedule at a000000100590f40
> #2 [BSP:e00000004fb31340] do_get_write_access at a0000002000388e0
> #3 [BSP:e00000004fb31300] journal_get_write_access at a000000200039680
> #4 [BSP:e00000004fb312b8] ext3_reserve_inode_write at a00000020013f180
> #5 [BSP:e00000004fb31290] ext3_mark_inode_dirty at a00000020013f2a0
> #6 [BSP:e00000004fb31260] ext3_dirty_inode at a000000200144310
> #7 [BSP:e00000004fb31210] __mark_inode_dirty at a000000100178200
> #8 [BSP:e00000004fb311e8] update_atime at a000000100165cc0
> #9 [BSP:e00000004fb31128] do_generic_mapping_read at a0000001000d40e0
> #10 [BSP:e00000004fb310c0] __generic_file_aio_read at a0000001000d8b40
> #11 [BSP:e00000004fb31088] generic_file_aio_read at a0000001000d8f60
> #12 [BSP:e00000004fb31048] do_sync_read at a0000001001238a0
> #13 [BSP:e00000004fb30ff0] vfs_read at a000000100123b90
> #14 [BSP:e00000004fb30f78] sys_read at a0000001001241d0
>
> do_get_write_access() can call journal_revoke_cancel().
I guess you mean journal_cancel_revoke()...

> revoke_record_cache variable is operated in journal_revoke_cancel().
> journal_revoke_cancel() is called in user process context, and
> journal_write_revoke_records() is called in kjournald context.
> Therefore, there is a race condition between journal_revoke_cancel() and
> journal_write_revoke_records() because journal_write_revoke_records()
> operates revoke_record_cache variable without lock.
Thanks for details. I'm still not convinced. What they essentially
write is that slab cache revoke_record_cache is not guarded by any spin
lock. It's not and that should be fine as slab caches are SMP safe by
themselves.
What could be a problem are hash tables of these revoke records kept
for the running and committing transaction. But as I've written above,
at the moment when journal_write_revoke_records() is called the currently
running transaction is locked, all handles of it are already released
and so nothing can touch the revoke table...
I'm not saying there's not a bug somewhere but your patch does not
seem to help it in any way. Personally, what I find more suspicious is
that journal_cancel_revoke() is called from do_get_write_access()
without any locks. Similarly I don't see any locks around journal_revoke()
when called from ext3. Thus interesting things could possibly happen
when these two meet on the same buffer. I have to investigate more in
detail though.
Honza

2007-08-16 16:34:52

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Thu, Aug 16, 2007 at 06:08:35PM +0200, Jan Kara wrote:
> Hello,
>
> > > > It is possible to panic the box by a race condition that exists in the
> > > > journalling code where we do not take the j_revoke_lock when traversing the
> > > > journal's revoked record list. This patch has been tested and we haven't seen
> > > > the issue yet, its a rather straightforward and correct (at least I think so :)
> > > > fix. Thank you,
> > > In principle, the patch looks fine. The only thing I'm wondering about
> > > is how that panic can happen... Journal write_revoke_records() is called
> > > from journal_commit_transaction() when revoke table for the committing
> > > transaction shouldn't see any further changes. So maybe the patch is
> > > masking a different problem.
> > > Do you have a way of reproducing the problem? Any stack trace
> > > available?
> >
> > Reproducing the problem is tricky as theres no sure way to make it happen,
> > basically we run the box with alot of memory pressure while doing alot
> > operations that require journalling. Here is the backtrace of the panic (note
> > this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)
> OK.
>
> > <1>Unable to handle kernel paging request at virtual address 60000000002b1110
> > <4>kjournald[1310]: Oops 11012296146944 [1]
> > <4>Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
> > bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
> > <4>
> > <4>Pid: 1310, CPU 0, comm: kjournald
> > <4>psr : 0000121008026018 ifs : 8000000000000c9c ip : [<a000000200045161>]
> > Tainted: P
> > <4>ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
> > <4>unat: 0000000000000000 pfs : 0000000000000c9c rsc : 0000000000000003
> > <4>rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a681
> > <4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> > <4>csd : 0000000000000000 ssd : 0000000000000000
> > <4>b0 : a000000200045270 b6 : a00000020026a240 b7 : a00000010000ee90
> > <4>f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c00000000
> > <4>f8 : 1000e86fb000000000000 f9 : 100029000000000000000
> > <4>f10 : 1000aeff71c71b9e6e61a f11 : 1003e0000000000000eff
> > <4>r1 : a000000200234000 r2 : 000000000000048c r3 : e0000002791a7a90
> > <4>r8 : 0000000000000000 r9 : e0000002791a0400 r10 : 0000000000000000
> > <4>r11 : e000000001000000 r12 : e0000002791a7b00 r13 : e0000002791a0000
> > <4>r14 : e00000027b7ee6c0 r15 : e0000002791a7b00 r16 : e000000272d48018
> > <4>r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0009804c8a70033f
> > <4>r20 : 60000000002b1118 r21 : a00000010006ad70 r22 : 0000000000000019
> > <4>r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000019
> > <4>r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000006a41
> > <4>r29 : 0000000000000000 r30 : 0000000000000000 r31 : e00000027b7ee5a4
> > <4>
> > <4>Call Trace:
> > <4> [<a000000100016da0>] show_stack+0x80/0xa0
> > <4> sp=e0000002791a7690 bsp=e0000002791a1170
> > <4> [<a0000001000176b0>] show_regs+0x890/0x8c0
> > <4> sp=e0000002791a7860 bsp=e0000002791a1128
> > <4> [<a00000010003e910>] die+0x150/0x240
> > <4> sp=e0000002791a7880 bsp=e0000002791a10e8
> > <4> [<a0000001000644c0>] ia64_do_page_fault+0x8c0/0xbc0
> > <4> sp=e0000002791a7880 bsp=e0000002791a1080
> > <4> [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
> > <4> sp=e0000002791a7930 bsp=e0000002791a1080
> > <4> [<a000000200045160>] journal_write_revoke_records+0x220/0x4e0 [jbd]
> > <4> sp=e0000002791a7b00 bsp=e0000002791a0f98
> > <4> [<a00000020003d940>] journal_commit_transaction+0xf80/0x3080 [jbd]
> > <4> sp=e0000002791a7b10 bsp=e0000002791a0ea0
> > <4> [<a0000002000458d0>] kjournald+0x170/0x580 [jbd]
> > <4> sp=e0000002791a7d80 bsp=e0000002791a0e38
> > <4> [<a000000100018c70>] kernel_thread_helper+0x30/0x60
> > <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
> > <4> [<a000000100008c60>] start_kernel_thread+0x20/0x40
> > <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
> Do you know (or could you find out) where exactly in the code is
> journal_write_revoke_records+0x221/0x4e0?
>

Yeah sorry

500 void journal_write_revoke_records(journal_t *journal,
501 transaction_t *transaction)
502 {
503 struct journal_head *descriptor;
504 struct jbd_revoke_record_s *record;
505 struct jbd_revoke_table_s *revoke;
506 struct list_head *hash_list;
507 int i, offset, count;
508
509 descriptor = NULL;
510 offset = 0;
511 count = 0;
512
513 /* select revoke table for committing transaction */
514 revoke = journal->j_revoke == journal->j_revoke_table[0] ?
515 journal->j_revoke_table[1] : journal->j_revoke_table[0];
516
517 for (i = 0; i < revoke->hash_size; i++) {
518 hash_list = &revoke->hash_table[i];
519
520 while (!list_empty(hash_list)) {
521 record = (struct jbd_revoke_record_s *)
522 hash_list->next;
523 write_one_revoke_record(journal, transaction,
524 &descriptor, &offset,
525 record);
526 count++;
527 list_del(&record->hash); <-- here

> Thanks for details. I'm still not convinced. What they essentially
> write is that slab cache revoke_record_cache is not guarded by any spin
> lock. It's not and that should be fine as slab caches are SMP safe by
> themselves.

No its the list_del thats not gaurded, so the hash list gets screwed up outside
of a lock. If there are other problems that need to be addressed then ok, but I
still think that we should be protecting all of the list traversal/changing
should be protected by the lock. Thank you,

Josef

2007-08-20 15:02:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Thu 16-08-07 12:37:45, Josef Bacik wrote:
> On Thu, Aug 16, 2007 at 06:08:35PM +0200, Jan Kara wrote:
> > > > > It is possible to panic the box by a race condition that exists in the
> > > > > journalling code where we do not take the j_revoke_lock when traversing the
> > > > > journal's revoked record list. This patch has been tested and we haven't seen
> > > > > the issue yet, its a rather straightforward and correct (at least I think so :)
> > > > > fix. Thank you,
> > > > In principle, the patch looks fine. The only thing I'm wondering about
> > > > is how that panic can happen... Journal write_revoke_records() is called
> > > > from journal_commit_transaction() when revoke table for the committing
> > > > transaction shouldn't see any further changes. So maybe the patch is
> > > > masking a different problem.
> > > > Do you have a way of reproducing the problem? Any stack trace
> > > > available?
> > >
> > > Reproducing the problem is tricky as theres no sure way to make it happen,
> > > basically we run the box with alot of memory pressure while doing alot
> > > operations that require journalling. Here is the backtrace of the panic (note
> > > this is on a RHEL4 kernel so 2.6.9, but the same problem exists upstream)
> > OK.
> >
> > > <1>Unable to handle kernel paging request at virtual address 60000000002b1110
> > > <4>kjournald[1310]: Oops 11012296146944 [1]
> > > <4>Modules linked in: ltd(U) vfat fat dm_mod button uhci_hcd shpchp e1000
> > > bond1(U) bond0(U) ext3 jbd hfcldd(U) hfcldd_conf(U) sd_mod scsi_mod
> > > <4>
> > > <4>Pid: 1310, CPU 0, comm: kjournald
> > > <4>psr : 0000121008026018 ifs : 8000000000000c9c ip : [<a000000200045161>]
> > > Tainted: P
> > > <4>ip is at journal_write_revoke_records+0x221/0x4e0 [jbd]
> > > <4>unat: 0000000000000000 pfs : 0000000000000c9c rsc : 0000000000000003
> > > <4>rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000000a681
> > > <4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> > > <4>csd : 0000000000000000 ssd : 0000000000000000
> > > <4>b0 : a000000200045270 b6 : a00000020026a240 b7 : a00000010000ee90
> > > <4>f6 : 0fffbe38e38e381b23800 f7 : 0ffe9edc3d22c00000000
> > > <4>f8 : 1000e86fb000000000000 f9 : 100029000000000000000
> > > <4>f10 : 1000aeff71c71b9e6e61a f11 : 1003e0000000000000eff
> > > <4>r1 : a000000200234000 r2 : 000000000000048c r3 : e0000002791a7a90
> > > <4>r8 : 0000000000000000 r9 : e0000002791a0400 r10 : 0000000000000000
> > > <4>r11 : e000000001000000 r12 : e0000002791a7b00 r13 : e0000002791a0000
> > > <4>r14 : e00000027b7ee6c0 r15 : e0000002791a7b00 r16 : e000000272d48018
> > > <4>r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0009804c8a70033f
> > > <4>r20 : 60000000002b1118 r21 : a00000010006ad70 r22 : 0000000000000019
> > > <4>r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000019
> > > <4>r26 : 0000000000000000 r27 : 0000000000000000 r28 : 0000000000006a41
> > > <4>r29 : 0000000000000000 r30 : 0000000000000000 r31 : e00000027b7ee5a4
> > > <4>
> > > <4>Call Trace:
> > > <4> [<a000000100016da0>] show_stack+0x80/0xa0
> > > <4> sp=e0000002791a7690 bsp=e0000002791a1170
> > > <4> [<a0000001000176b0>] show_regs+0x890/0x8c0
> > > <4> sp=e0000002791a7860 bsp=e0000002791a1128
> > > <4> [<a00000010003e910>] die+0x150/0x240
> > > <4> sp=e0000002791a7880 bsp=e0000002791a10e8
> > > <4> [<a0000001000644c0>] ia64_do_page_fault+0x8c0/0xbc0
> > > <4> sp=e0000002791a7880 bsp=e0000002791a1080
> > > <4> [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
> > > <4> sp=e0000002791a7930 bsp=e0000002791a1080
> > > <4> [<a000000200045160>] journal_write_revoke_records+0x220/0x4e0 [jbd]
> > > <4> sp=e0000002791a7b00 bsp=e0000002791a0f98
> > > <4> [<a00000020003d940>] journal_commit_transaction+0xf80/0x3080 [jbd]
> > > <4> sp=e0000002791a7b10 bsp=e0000002791a0ea0
> > > <4> [<a0000002000458d0>] kjournald+0x170/0x580 [jbd]
> > > <4> sp=e0000002791a7d80 bsp=e0000002791a0e38
> > > <4> [<a000000100018c70>] kernel_thread_helper+0x30/0x60
> > > <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
> > > <4> [<a000000100008c60>] start_kernel_thread+0x20/0x40
> > > <4> sp=e0000002791a7e30 bsp=e0000002791a0e10
> > Do you know (or could you find out) where exactly in the code is
> > journal_write_revoke_records+0x221/0x4e0?
> >
>
> Yeah sorry
>
> 500 void journal_write_revoke_records(journal_t *journal,
> 501 transaction_t *transaction)
> 502 {
> 503 struct journal_head *descriptor;
> 504 struct jbd_revoke_record_s *record;
> 505 struct jbd_revoke_table_s *revoke;
> 506 struct list_head *hash_list;
> 507 int i, offset, count;
> 508
> 509 descriptor = NULL;
> 510 offset = 0;
> 511 count = 0;
> 512
> 513 /* select revoke table for committing transaction */
> 514 revoke = journal->j_revoke == journal->j_revoke_table[0] ?
> 515 journal->j_revoke_table[1] : journal->j_revoke_table[0];
> 516
> 517 for (i = 0; i < revoke->hash_size; i++) {
> 518 hash_list = &revoke->hash_table[i];
> 519
> 520 while (!list_empty(hash_list)) {
> 521 record = (struct jbd_revoke_record_s *)
> 522 hash_list->next;
> 523 write_one_revoke_record(journal, transaction,
> 524 &descriptor, &offset,
> 525 record);
> 526 count++;
> 527 list_del(&record->hash); <-- here
OK, thanks. So record probably points to an already freed memory which has
been overwritten by garbage...

> > Thanks for details. I'm still not convinced. What they essentially
> > write is that slab cache revoke_record_cache is not guarded by any spin
> > lock. It's not and that should be fine as slab caches are SMP safe by
> > themselves.
>
> No its the list_del thats not gaurded, so the hash list gets screwed up outside
> of a lock. If there are other problems that need to be addressed then ok, but I
> still think that we should be protecting all of the list traversal/changing
> should be protected by the lock. Thank you,
But the traversal in journal_write_revoke_records() *is* in fact guarded
by the commit logic in journal_commit_transaction() handling code - it
doesn't allow anybody to mess with revoke lists when a transaction is
committing. So there's no need to guard the hash list again in
journal_write_revoke_records() by the spinlock. And if the logic does not
work and lets somebody modify revoke lists during commit, we have more
serious problems than hash list corruption. That's why I'm trying to find
out where's the real culprit of the Oops. But so far I cannot find out how
the corruption can happen...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-08-21 15:41:26

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Mon, Aug 20, 2007 at 05:20:21PM +0200, Jan Kara wrote:
> OK, thanks. So record probably points to an already freed memory which has
> been overwritten by garbage...
>
> > > Thanks for details. I'm still not convinced. What they essentially
> > > write is that slab cache revoke_record_cache is not guarded by any spin
> > > lock. It's not and that should be fine as slab caches are SMP safe by
> > > themselves.
> >
> > No its the list_del thats not gaurded, so the hash list gets screwed up outside
> > of a lock. If there are other problems that need to be addressed then ok, but I
> > still think that we should be protecting all of the list traversal/changing
> > should be protected by the lock. Thank you,
> But the traversal in journal_write_revoke_records() *is* in fact guarded
> by the commit logic in journal_commit_transaction() handling code - it
> doesn't allow anybody to mess with revoke lists when a transaction is
> committing. So there's no need to guard the hash list again in
> journal_write_revoke_records() by the spinlock. And if the logic does not
> work and lets somebody modify revoke lists during commit, we have more
> serious problems than hash list corruption. That's why I'm trying to find
> out where's the real culprit of the Oops. But so far I cannot find out how
> the corruption can happen...

I should note here I'm not trying to be argumentative, I just want to
understand. Ok so journal_commit_transaction() will make sure all the
handle_t's are removed and such before processing the revoke lists, but right
before we process the revoke lists we set the journals running transaction to
NULL, which means we can continue on our merry way. AFAICS the revoke lists are
per journal, not per transaction, so once we give up the j_state_lock after
having made sure the handle_t's had done their thing, we set the
running_transaction to NULL letting people continue to do their thing, and since
the revoke table is on a per journal basis, its completely valid for a new
transaction to be started, a handle to be added to it, journal_cancel_revoke()
to be run against that handle while still in journal_commit_transaction(). It
wouldn't necessarily be a handle_t from the transaction we are in the middle of
committing, it would be from a new transaction, and since the revoke list is per
journal, both the transaction thats currently being committed and the new
transaction would have access to the same revoke list, hence the race. Is this
correct? If not let me know because I want to understand this code better.
Thank you,

Josef

2007-08-21 16:29:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Tue 21-08-07 11:43:12, Josef Bacik wrote:
> On Mon, Aug 20, 2007 at 05:20:21PM +0200, Jan Kara wrote:
> > OK, thanks. So record probably points to an already freed memory which has
> > been overwritten by garbage...
> >
> > > > Thanks for details. I'm still not convinced. What they essentially
> > > > write is that slab cache revoke_record_cache is not guarded by any spin
> > > > lock. It's not and that should be fine as slab caches are SMP safe by
> > > > themselves.
> > >
> > > No its the list_del thats not gaurded, so the hash list gets screwed up outside
> > > of a lock. If there are other problems that need to be addressed then ok, but I
> > > still think that we should be protecting all of the list traversal/changing
> > > should be protected by the lock. Thank you,
> > But the traversal in journal_write_revoke_records() *is* in fact guarded
> > by the commit logic in journal_commit_transaction() handling code - it
> > doesn't allow anybody to mess with revoke lists when a transaction is
> > committing. So there's no need to guard the hash list again in
> > journal_write_revoke_records() by the spinlock. And if the logic does not
> > work and lets somebody modify revoke lists during commit, we have more
> > serious problems than hash list corruption. That's why I'm trying to find
> > out where's the real culprit of the Oops. But so far I cannot find out how
> > the corruption can happen...
>
> I should note here I'm not trying to be argumentative, I just want to
> understand. Ok so journal_commit_transaction() will make sure all the
I also just want to understand how the oops can happen :).

> handle_t's are removed and such before processing the revoke lists, but right
> before we process the revoke lists we set the journals running transaction to
> NULL, which means we can continue on our merry way. AFAICS the revoke lists are
> per journal, not per transaction, so once we give up the j_state_lock after
> having made sure the handle_t's had done their thing, we set the
> running_transaction to NULL letting people continue to do their thing, and since
> the revoke table is on a per journal basis, its completely valid for a new
> transaction to be started, a handle to be added to it, journal_cancel_revoke()
> to be run against that handle while still in journal_commit_transaction(). It
> wouldn't necessarily be a handle_t from the transaction we are in the middle of
> committing, it would be from a new transaction, and since the revoke list is per
> journal, both the transaction thats currently being committed and the new
> transaction would have access to the same revoke list, hence the race. Is this
> correct? If not let me know because I want to understand this code better.
The trick is, there are in fact two revoke tables. So the commit code
does the following:
1) It waits until all handles of the running transaction are released
(this is the while loop waiting checking t_updates).
2) It does some cleanup of unused buffers.
3) Switches revoke tables - i.e. journal->j_revoke now points to a
freshly initialized table, the table of the committing transaction is kept
hidden.
4) Transaction state is changed to FLUSH, journal->j_running_transaction
is set to NULL, etc.
5) Data writeout is performed.
6) Saved revoke hash table is written.

After 3), journal_revoke() and journal_revoke_cancel() access the new
hash table and thus have no influence on journal_write_revoke_records(). It
could possibly be that the pointer to the old hash table would be stored in
some local variable - but all the places where we store a pointer to the
hash table seem to be contained inside journal_start(), journal_stop()
pairs (all functions working with hash tables take a transaction handle
as an argument) and because all handles were released at some point in time,
we know that this should not happen either.
I hope it is clearer now. If you still have any questions, please ask.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-08-21 17:43:16

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Tue, Aug 21, 2007 at 06:48:15PM +0200, Jan Kara wrote:
> On Tue 21-08-07 11:43:12, Josef Bacik wrote:
> > On Mon, Aug 20, 2007 at 05:20:21PM +0200, Jan Kara wrote:
> > > OK, thanks. So record probably points to an already freed memory which has
> > > been overwritten by garbage...
> > >
> > > > > Thanks for details. I'm still not convinced. What they essentially
> > > > > write is that slab cache revoke_record_cache is not guarded by any spin
> > > > > lock. It's not and that should be fine as slab caches are SMP safe by
> > > > > themselves.
> > > >
> > > > No its the list_del thats not gaurded, so the hash list gets screwed up outside
> > > > of a lock. If there are other problems that need to be addressed then ok, but I
> > > > still think that we should be protecting all of the list traversal/changing
> > > > should be protected by the lock. Thank you,
> > > But the traversal in journal_write_revoke_records() *is* in fact guarded
> > > by the commit logic in journal_commit_transaction() handling code - it
> > > doesn't allow anybody to mess with revoke lists when a transaction is
> > > committing. So there's no need to guard the hash list again in
> > > journal_write_revoke_records() by the spinlock. And if the logic does not
> > > work and lets somebody modify revoke lists during commit, we have more
> > > serious problems than hash list corruption. That's why I'm trying to find
> > > out where's the real culprit of the Oops. But so far I cannot find out how
> > > the corruption can happen...
> >
> > I should note here I'm not trying to be argumentative, I just want to
> > understand. Ok so journal_commit_transaction() will make sure all the
> I also just want to understand how the oops can happen :).
>
> > handle_t's are removed and such before processing the revoke lists, but right
> > before we process the revoke lists we set the journals running transaction to
> > NULL, which means we can continue on our merry way. AFAICS the revoke lists are
> > per journal, not per transaction, so once we give up the j_state_lock after
> > having made sure the handle_t's had done their thing, we set the
> > running_transaction to NULL letting people continue to do their thing, and since
> > the revoke table is on a per journal basis, its completely valid for a new
> > transaction to be started, a handle to be added to it, journal_cancel_revoke()
> > to be run against that handle while still in journal_commit_transaction(). It
> > wouldn't necessarily be a handle_t from the transaction we are in the middle of
> > committing, it would be from a new transaction, and since the revoke list is per
> > journal, both the transaction thats currently being committed and the new
> > transaction would have access to the same revoke list, hence the race. Is this
> > correct? If not let me know because I want to understand this code better.
> The trick is, there are in fact two revoke tables. So the commit code
> does the following:
> 1) It waits until all handles of the running transaction are released
> (this is the while loop waiting checking t_updates).
> 2) It does some cleanup of unused buffers.
> 3) Switches revoke tables - i.e. journal->j_revoke now points to a
> freshly initialized table, the table of the committing transaction is kept
> hidden.
> 4) Transaction state is changed to FLUSH, journal->j_running_transaction
> is set to NULL, etc.
> 5) Data writeout is performed.
> 6) Saved revoke hash table is written.
>
> After 3), journal_revoke() and journal_revoke_cancel() access the new
> hash table and thus have no influence on journal_write_revoke_records(). It
> could possibly be that the pointer to the old hash table would be stored in
> some local variable - but all the places where we store a pointer to the
> hash table seem to be contained inside journal_start(), journal_stop()
> pairs (all functions working with hash tables take a transaction handle
> as an argument) and because all handles were released at some point in time,
> we know that this should not happen either.
> I hope it is clearer now. If you still have any questions, please ask.
>

Ahh the switch is what I was overlooking, thank you that makes more sense.
Course now I have to figure out what really is going on here :(. I'll see if I
can't sort this out. Thanks much for your help,

Josef

2007-08-23 18:40:25

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] fix panic in jbd by adding locks

On Tue, Aug 21, 2007 at 01:45:02PM -0400, Josef Bacik wrote:
> > > handle_t's are removed and such before processing the revoke lists, but right
> > > before we process the revoke lists we set the journals running transaction to
> > > NULL, which means we can continue on our merry way. AFAICS the revoke lists are
> > > per journal, not per transaction, so once we give up the j_state_lock after
> > > having made sure the handle_t's had done their thing, we set the
> > > running_transaction to NULL letting people continue to do their thing, and since
> > > the revoke table is on a per journal basis, its completely valid for a new
> > > transaction to be started, a handle to be added to it, journal_cancel_revoke()
> > > to be run against that handle while still in journal_commit_transaction(). It
> > > wouldn't necessarily be a handle_t from the transaction we are in the middle of
> > > committing, it would be from a new transaction, and since the revoke list is per
> > > journal, both the transaction thats currently being committed and the new
> > > transaction would have access to the same revoke list, hence the race. Is this
> > > correct? If not let me know because I want to understand this code better.
> > The trick is, there are in fact two revoke tables. So the commit code
> > does the following:
> > 1) It waits until all handles of the running transaction are released
> > (this is the while loop waiting checking t_updates).
> > 2) It does some cleanup of unused buffers.
> > 3) Switches revoke tables - i.e. journal->j_revoke now points to a
> > freshly initialized table, the table of the committing transaction is kept
> > hidden.
> > 4) Transaction state is changed to FLUSH, journal->j_running_transaction
> > is set to NULL, etc.
> > 5) Data writeout is performed.
> > 6) Saved revoke hash table is written.
> >
> > After 3), journal_revoke() and journal_revoke_cancel() access the new
> > hash table and thus have no influence on journal_write_revoke_records(). It
> > could possibly be that the pointer to the old hash table would be stored in
> > some local variable - but all the places where we store a pointer to the
> > hash table seem to be contained inside journal_start(), journal_stop()
> > pairs (all functions working with hash tables take a transaction handle
> > as an argument) and because all handles were released at some point in time,
> > we know that this should not happen either.
> > I hope it is clearer now. If you still have any questions, please ask.
> >
>
> Ahh the switch is what I was overlooking, thank you that makes more sense.
> Course now I have to figure out what really is going on here :(. I'll see if I
> can't sort this out. Thanks much for your help,
>

I'm having some difficulty thinking of a good way to troubleshoot this, list
poisening is on by default, and I don't think any other kind of quick switches
that would help flush this bug out. One thing I was thinking about, what if
somebody comes right behind us trying to commit a separate transaction? I think
this could be possible as by the time we get to the
journal_write_revoke_records() we aren't holding any locks on the journal
itself, so we could come behind the transaction thats currently running, and
switch the revoke record tables on it while its doing its thing. Do you have
any suggestions on a good way to approach this problem? And as for my newest
theory, what have I not thought of :). Thank you,

Josef