From: Jan Kara Subject: Re: next-20090310: ext4 hangs Date: Wed, 25 Mar 2009 20:43:17 +0100 Message-ID: <20090325194316.GQ23439@duck.suse.cz> References: <20090310154745.GF23075@mit.edu> <20090325151122.GA14881@atrey.karlin.mff.cuni.cz> <20090325151516.GB14881@atrey.karlin.mff.cuni.cz> <20090325152234.GN23439@duck.suse.cz> <20090325161556.GP23439@duck.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Theodore Tso , "linux-next@vger.kernel.org" , linux-ext4@vger.kernel.org, LKML , sparclinux@vger.kernel.org To: Alexander Beregalov Return-path: Content-Disposition: inline In-Reply-To: Sender: sparclinux-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Wed 25-03-09 20:07:46, Alexander Beregalov wrote: > 2009/3/25 Jan Kara : > > On Wed 25-03-09 18:29:10, Alexander Beregalov wrote: > >> 2009/3/25 Jan Kara : > >> > On Wed 25-03-09 18:18:43, Alexander Beregalov wrote: > >> >> 2009/3/25 Jan Kara : > >> >> >> > So, I think I need to try it on 2.6.29-rc7 again. > >> >> >> =A0 I've looked into this. Obviously, what's happenning is t= hat we delete > >> >> >> an inode and jbd2_journal_release_jbd_inode() finds inode is= just under > >> >> >> writeout in transaction commit and thus it waits. But it get= s never woken > >> >> >> up and because it has a handle from the transaction, every o= ne eventually > >> >> >> blocks on waiting for a transaction to finish. > >> >> >> =A0 But I don't really see how that can happen. The code is = really > >> >> >> straightforward and everything happens under j_list_lock... = Strange. > >> >> > =A0BTW: Is the system SMP? > >> >> No, it is UP system. > >> > =A0Even stranger. And do you have CONFIG_PREEMPT set? > >> > > >> >> The bug exists even in 2.6.29, I posted it with a new topic. > >> > =A0OK, I've sort-of expected this. > >> > >> CONFIG_PREEMPT_RCU=3Dy > >> CONFIG_PREEMPT_RCU_TRACE=3Dy > >> # CONFIG_PREEMPT_NONE is not set > >> # CONFIG_PREEMPT_VOLUNTARY is not set > >> CONFIG_PREEMPT=3Dy > >> CONFIG_DEBUG_PREEMPT=3Dy > >> # CONFIG_PREEMPT_TRACER is not set > >> > >> config is attached. > > =A0Thanks for the data. I still don't see how the wakeup can get lo= st. The > > process even cannot be preempted when we are in the section protect= ed by > > j_list_lock... Can you send me a disassembly of functions > > jbd2_journal_release_jbd_inode() and journal_submit_data_buffers() = so that > > I can see whether the compiler has not reordered something unexpect= edly? Thanks for the disassembly... > By default gcc inlines journal_submit_data_buffers() > Here is -fno-inline version. Default version is in attach. > =3D=3D=3D=3D >=20 > static int journal_submit_data_buffers(journal_t *journal, > transaction_t *commit_transaction) > { > 9c: 9d e3 bf 40 save %sp, -192, %sp > a0: 11 00 00 00 sethi %hi(0), %o0 > struct jbd2_inode *jinode; > int err, ret =3D 0; > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > a4: a4 06 25 70 add %i0, 0x570, %l2 > * our inode list. We use JI_COMMIT_RUNNING flag to protect inode we = currently > * operate on from being released while we write out pages. > */ > static int journal_submit_data_buffers(journal_t *journal, > transaction_t *commit_transaction) > { > a8: 90 12 20 00 mov %o0, %o0 > ac: 40 00 00 00 call ac > b0: b0 10 20 00 clr %i0 > struct jbd2_inode *jinode; > int err, ret =3D 0; > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > list_for_each_entry(jinode, &commit_transaction->t_inode_list= , i_list) { > b4: a6 06 60 60 add %i1, 0x60, %l3 > { > struct jbd2_inode *jinode; > int err, ret =3D 0; > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > b8: 40 00 00 00 call b8 > bc: 90 10 00 12 mov %l2, %o0 > list_for_each_entry(jinode, &commit_transaction->t_inode_list= , i_list) { > c0: 10 68 00 1d b %xcc, 134 > c4: c2 5e 60 60 ldx [ %i1 + 0x60 ], %g1 > mapping =3D jinode->i_vfs_inode->i_mapping; > jinode->i_flags |=3D JI_COMMIT_RUNNING; > spin_unlock(&journal->j_list_lock); > c8: 90 10 00 12 mov %l2, %o0 > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > list_for_each_entry(jinode, &commit_transaction->t_inode_list= , i_list) { > mapping =3D jinode->i_vfs_inode->i_mapping; > jinode->i_flags |=3D JI_COMMIT_RUNNING; > cc: c2 04 60 28 ld [ %l1 + 0x28 ], %g1 Here we load jbd2_inode->i_flags into %g1. > int err, ret =3D 0; > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > list_for_each_entry(jinode, &commit_transaction->t_inode_list= , i_list) { > mapping =3D jinode->i_vfs_inode->i_mapping; > d0: e0 58 a1 e0 ldx [ %g2 + 0x1e0 ], %l0 > jinode->i_flags |=3D JI_COMMIT_RUNNING; > d4: 82 10 60 01 or %g1, 1, %g1 Here we set JI_COMMIT_RUNNING. > spin_unlock(&journal->j_list_lock); > d8: 40 00 00 00 call d8 Here we seem to call preempt_disable() (it would be useful if we coul= d confirm that - easiest option I know is compiling JBD2 into a kernel bu= t some object file trickery should be able to find it out as well...) > dc: c2 24 60 28 st %g1, [ %l1 + 0x28 ] And here we store the register back to memory - but we could be alrea= dy preempted here which could cause bugs... > * submit the inode data buffers. We use writepage > * instead of writepages. Because writepages can do > * block allocation with delalloc. We need to write > * only allocated blocks here. > */ > err =3D journal_submit_inode_data_buffers(mapping); > e0: 7f ff ff d3 call 2c > e4: 90 10 00 10 mov %l0, %o0 > if (!ret) > e8: 80 a6 20 00 cmp %i0, 0 > ec: b1 64 40 08 move %icc, %o0, %i0 > ret =3D err; > spin_lock(&journal->j_list_lock); > f0: 40 00 00 00 call f0 > f4: 90 10 00 12 mov %l2, %o0 > J_ASSERT(jinode->i_transaction =3D=3D commit_transact= ion); > f8: c2 5c 40 00 ldx [ %l1 ], %g1 > fc: 80 a0 40 19 cmp %g1, %i1 > 100: 22 68 00 07 be,a %xcc, 11c > > 104: c2 04 60 28 ld [ %l1 + 0x28 ], %g1 Again, here we load jinode->i_flags. > 108: 11 00 00 00 sethi %hi(0), %o0 > 10c: 92 10 21 04 mov 0x104, %o1 > 110: 40 00 00 00 call 110 > 114: 90 12 20 00 mov %o0, %o0 > 118: 91 d0 20 05 ta 5 > jinode->i_flags &=3D ~JI_COMMIT_RUNNING; > wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING); > 11c: 90 04 60 28 add %l1, 0x28, %o0 > 120: 92 10 20 00 clr %o1 > err =3D journal_submit_inode_data_buffers(mapping); > if (!ret) > ret =3D err; > spin_lock(&journal->j_list_lock); > J_ASSERT(jinode->i_transaction =3D=3D commit_transact= ion); > jinode->i_flags &=3D ~JI_COMMIT_RUNNING; > 124: 82 08 7f fe and %g1, -2, %g1 Here we go &=3D ~JI_COMMIT_RUNNING > wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING); > 128: 40 00 00 00 call 128 > 12c: c2 24 60 28 st %g1, [ %l1 + 0x28 ] And only here we store it back to memory... > struct jbd2_inode *jinode; > int err, ret =3D 0; > struct address_space *mapping; >=20 > spin_lock(&journal->j_list_lock); > list_for_each_entry(jinode, &commit_transaction->t_inode_list= , i_list) { > 130: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1 > 134: a2 00 7f f0 add %g1, -16, %l1 > * prefetches into the prefetch-cache which only is accessibl= e > * by floating point operations in UltraSPARC-III and later. > * By contrast, "#one_write" prefetches into the L2 cache > * in shared state. > */ > __asm__ __volatile__("prefetch [%0], #one_write" > 138: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1 > 13c: c7 68 40 00 prefetch [ %g1 ], #one_write > 140: 82 04 60 10 add %l1, 0x10, %g1 > 144: 80 a4 c0 01 cmp %l3, %g1 > 148: 32 6f ff e0 bne,a %xcc, c8 > > 14c: c4 5c 60 20 ldx [ %l1 + 0x20 ], %g2 > spin_lock(&journal->j_list_lock); > J_ASSERT(jinode->i_transaction =3D=3D commit_transact= ion); > wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING); > } > spin_unlock(&journal->j_list_lock); > 150: 90 10 00 12 mov %l2, %o0 > 154: 40 00 00 00 call 154 > 158: b1 3e 20 00 sra %i0, 0, %i0 > return ret; > } > 15c: 81 cf e0 08 rett %i7 + 8 > 160: 01 00 00 00 nop So the compiled code looks a bit suspitious to me. Having the disasse= mbly with symbols properly resolved would help confirm it. I'm adding sparc = list to CC just in case someone sees the problem... Honza --=20 Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe sparclinux" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html