2008-02-04 10:12:39

by Aneesh Kumar K.V

[permalink] [raw]
Subject: jbd2_handle and i_data_sem circular locking dependency detected

Hi,

This is with the new ext3 -> ext4 migrate code added. The recently added
lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
on the ext3 inode during migration to prevent walking the ext3 inode
when it is being converted to ext4 format. Also we want to avoid
file truncation and new blocks being added while converting to ext4.
Also we dont want to reserve large number of credits for journal.
Any idea how to fix this ?


-aneesh

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.24-rc8 #6
-------------------------------------------------------
rm/2401 is trying to acquire lock:
(&ei->i_data_sem){----}, at: [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108

but task is already holding lock:
(jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (jbd2_handle){--..}:
[<c0143a5c>] __lock_acquire+0xa31/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c01fc4ca>] jbd2_journal_start+0xf5/0xff
[<c01e3539>] ext4_journal_start_sb+0x48/0x4a
[<c01eb980>] ext4_ext_migrate+0x7d/0x535
[<c01df328>] ext4_ioctl+0x528/0x56c
[<c0177700>] do_ioctl+0x50/0x67
[<c017794e>] vfs_ioctl+0x237/0x24a
[<c0177992>] sys_ioctl+0x31/0x4b
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5
[<ffffffff>] 0xffffffff

-> #0 (&ei->i_data_sem){----}:
[<c014394c>] __lock_acquire+0x921/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c044f247>] down_read+0x42/0x79
[<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
[<c01dcba1>] ext4_getblk+0x62/0x1c4
[<c01e0de9>] ext4_find_entry+0x350/0x5b7
[<c01e200c>] ext4_unlink+0x6e/0x1a4
[<c017449e>] vfs_unlink+0x49/0x89
[<c0175f02>] do_unlinkat+0x96/0x12c
[<c0175fa8>] sys_unlink+0x10/0x12
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5
[<ffffffff>] 0xffffffff

other info that might help us debug this:

3 locks held by rm/2401:
#0: (&type->i_mutex_dir_key#5/1){--..}, at: [<c0175eca>] do_unlinkat+0x5e/0x12c
#1: (&sb->s_type->i_mutex_key#8){--..}, at: [<c017448b>] vfs_unlink+0x36/0x89
#2: (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff

stack backtrace:
Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
[<c0106017>] show_trace_log_lvl+0x1a/0x2f
[<c0106893>] show_trace+0x12/0x14
[<c0106b89>] dump_stack+0x6c/0x72
[<c0141b26>] print_circular_bug_tail+0x5f/0x68
[<c014394c>] __lock_acquire+0x921/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c044f247>] down_read+0x42/0x79
[<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
[<c01dcba1>] ext4_getblk+0x62/0x1c4
[<c01e0de9>] ext4_find_entry+0x350/0x5b7
[<c01e200c>] ext4_unlink+0x6e/0x1a4
[<c017449e>] vfs_unlink+0x49/0x89
[<c0175f02>] do_unlinkat+0x96/0x12c
[<c0175fa8>] sys_unlink+0x10/0x12
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5



2008-02-04 15:35:55

by Josef Bacik

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Monday 04 February 2008 5:12:28 am Aneesh Kumar K.V wrote:
> Hi,
>
> This is with the new ext3 -> ext4 migrate code added. The recently added
> lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> on the ext3 inode during migration to prevent walking the ext3 inode
> when it is being converted to ext4 format. Also we want to avoid
> file truncation and new blocks being added while converting to ext4.
> Also we dont want to reserve large number of credits for journal.
> Any idea how to fix this ?
>

Hello,

Seems you should be taking the i_data_sem after starting the journal in
ext4_ext_migrate. I haven't looked at this stuff too much, but everywhere I
see i_data_sem taken its within a journal_start/journal_stop. Here's the
patch. Let me know if I'm way off btw, like I said I'm new to this :). Thank
you,

Josef

Index: linux-2.6/fs/ext4/migrate.c
===================================================================
--- linux-2.6.orig/fs/ext4/migrate.c
+++ linux-2.6/fs/ext4/migrate.c
@@ -414,7 +414,6 @@ int ext4_ext_migrate(struct inode *inode
if ((EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
return -EINVAL;

- down_write(&EXT4_I(inode)->i_data_sem);
handle = ext4_journal_start(inode,
EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
@@ -424,6 +423,9 @@ int ext4_ext_migrate(struct inode *inode
retval = PTR_ERR(handle);
goto err_out;
}
+
+ down_write(&EXT4_I(inode)->i_data_sem);
+
tmp_inode = ext4_new_inode(handle,
inode->i_sb->s_root->d_inode,
S_IFREG);
@@ -549,10 +551,10 @@ err_out:
*/
tmp_inode->i_nlink = 0;

- ext4_journal_stop(handle);
-
up_write(&EXT4_I(inode)->i_data_sem);

+ ext4_journal_stop(handle);
+
if (tmp_inode)
iput(tmp_inode);


2008-02-04 15:37:56

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon, Feb 04, 2008 at 10:23:16AM -0500, Josef Bacik wrote:
> On Monday 04 February 2008 5:12:28 am Aneesh Kumar K.V wrote:
> > Hi,
> >
> > This is with the new ext3 -> ext4 migrate code added. The recently added
> > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > on the ext3 inode during migration to prevent walking the ext3 inode
> > when it is being converted to ext4 format. Also we want to avoid
> > file truncation and new blocks being added while converting to ext4.
> > Also we dont want to reserve large number of credits for journal.
> > Any idea how to fix this ?
> >
>
> Hello,
>
> Seems you should be taking the i_data_sem after starting the journal in
> ext4_ext_migrate. I haven't looked at this stuff too much, but everywhere I
> see i_data_sem taken its within a journal_start/journal_stop. Here's the
> patch. Let me know if I'm way off btw, like I said I'm new to this :). Thank
> you,

But that doesn't help. Because we call ext4_journal_restart after taking
i_data_sem. That implies we can sleep waiting for other running
transaction to commit. And if that transaction (in the above example
rm) is waiting for i_data_sem we deadlock.

-aneesh

2008-02-04 16:31:58

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

Hi,

On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> This is with the new ext3 -> ext4 migrate code added. The recently added
> lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> on the ext3 inode during migration to prevent walking the ext3 inode
> when it is being converted to ext4 format. Also we want to avoid
> file truncation and new blocks being added while converting to ext4.
> Also we dont want to reserve large number of credits for journal.
> Any idea how to fix this ?
Hmm, while briefly looking at the code - why do you introduce i_data_sem
and not use i_alloc_sem which is already in VFS inode? That is aimed
exactly at the serialization of truncates, writes and similar users.
That doesn't solve problems with lock ordering but I was just wondering...
Another problem - ext4_fallocate() has the same lock ordering problem as
the migration code and maybe there are others...
One (stupid) solution to your problem is to make i_data_sem be
always locked before the transaction is started. It could possibly have
negative performance impact because you'd have to hold the semaphore for
a longer time and thus a writer would block readers for longer time. So one
would have to measure how big difference that would make.
Another possibility is to start a single transaction for migration and
extend it as long as you can (as truncate does it). And when you can't
extend any more, you drop the i_data_sem and start a new transaction and
acquire the semaphore again. This has the disadvantage that after dropping
the semaphore you have to resync your original inode with the temporary
one your are building which probably ends up being ugly as night... Hmm,
but maybe we could get rid of this - hold i_mutex to protect against all
writes (that ranks outside of transaction start so you can hold it for the
whole migration time - maybe you even hold it if you are called from the
write path...). After dropping i_data_sem you let some readers proceed
but writers still wait on i_mutex so the file shouldn't change under you
(but I suggest adding some BUG_ONs to verify that the file really doesn't
change :).

Honza

> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.24-rc8 #6
> -------------------------------------------------------
> rm/2401 is trying to acquire lock:
> (&ei->i_data_sem){----}, at: [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
>
> but task is already holding lock:
> (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (jbd2_handle){--..}:
> [<c0143a5c>] __lock_acquire+0xa31/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c01fc4ca>] jbd2_journal_start+0xf5/0xff
> [<c01e3539>] ext4_journal_start_sb+0x48/0x4a
> [<c01eb980>] ext4_ext_migrate+0x7d/0x535
> [<c01df328>] ext4_ioctl+0x528/0x56c
> [<c0177700>] do_ioctl+0x50/0x67
> [<c017794e>] vfs_ioctl+0x237/0x24a
> [<c0177992>] sys_ioctl+0x31/0x4b
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
> [<ffffffff>] 0xffffffff
>
> -> #0 (&ei->i_data_sem){----}:
> [<c014394c>] __lock_acquire+0x921/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c044f247>] down_read+0x42/0x79
> [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
> [<c01dcba1>] ext4_getblk+0x62/0x1c4
> [<c01e0de9>] ext4_find_entry+0x350/0x5b7
> [<c01e200c>] ext4_unlink+0x6e/0x1a4
> [<c017449e>] vfs_unlink+0x49/0x89
> [<c0175f02>] do_unlinkat+0x96/0x12c
> [<c0175fa8>] sys_unlink+0x10/0x12
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
> [<ffffffff>] 0xffffffff
>
> other info that might help us debug this:
>
> 3 locks held by rm/2401:
> #0: (&type->i_mutex_dir_key#5/1){--..}, at: [<c0175eca>] do_unlinkat+0x5e/0x12c
> #1: (&sb->s_type->i_mutex_key#8){--..}, at: [<c017448b>] vfs_unlink+0x36/0x89
> #2: (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff
>
> stack backtrace:
> Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
> [<c0106017>] show_trace_log_lvl+0x1a/0x2f
> [<c0106893>] show_trace+0x12/0x14
> [<c0106b89>] dump_stack+0x6c/0x72
> [<c0141b26>] print_circular_bug_tail+0x5f/0x68
> [<c014394c>] __lock_acquire+0x921/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c044f247>] down_read+0x42/0x79
> [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
> [<c01dcba1>] ext4_getblk+0x62/0x1c4
> [<c01e0de9>] ext4_find_entry+0x350/0x5b7
> [<c01e200c>] ext4_unlink+0x6e/0x1a4
> [<c017449e>] vfs_unlink+0x49/0x89
> [<c0175f02>] do_unlinkat+0x96/0x12c
> [<c0175fa8>] sys_unlink+0x10/0x12
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-02-04 17:12:20

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> Hi,
>
> On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > This is with the new ext3 -> ext4 migrate code added. The recently added
> > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > on the ext3 inode during migration to prevent walking the ext3 inode
> > when it is being converted to ext4 format. Also we want to avoid
> > file truncation and new blocks being added while converting to ext4.
> > Also we dont want to reserve large number of credits for journal.
> > Any idea how to fix this ?
> Hmm, while briefly looking at the code - why do you introduce i_data_sem
> and not use i_alloc_sem which is already in VFS inode? That is aimed
> exactly at the serialization of truncates, writes and similar users.

How about read ? We are changing the format of inode. We don't want even
the read to go through.


> That doesn't solve problems with lock ordering but I was just wondering...
> Another problem - ext4_fallocate() has the same lock ordering problem as
> the migration code and maybe there are others...


I will look at the same when fixing this.

> One (stupid) solution to your problem is to make i_data_sem be
> always locked before the transaction is started. It could possibly have
> negative performance impact because you'd have to hold the semaphore for
> a longer time and thus a writer would block readers for longer time. So one
> would have to measure how big difference that would make.
> Another possibility is to start a single transaction for migration and
> extend it as long as you can (as truncate does it). And when you can't
> extend any more, you drop the i_data_sem and start a new transaction and
> acquire the semaphore again. This has the disadvantage that after dropping
> the semaphore you have to resync your original inode with the temporary
> one your are building which probably ends up being ugly as night... Hmm,
> but maybe we could get rid of this - hold i_mutex to protect against all
> writes (that ranks outside of transaction start so you can hold it for the
> whole migration time - maybe you even hold it if you are called from the
> write path...). After dropping i_data_sem you let some readers proceed
> but writers still wait on i_mutex so the file shouldn't change under you
> (but I suggest adding some BUG_ONs to verify that the file really doesn't
> change :).

A quick look says truncate can happen even when we hold i_mutex ??

But of all this looks like a workable solution. Will try this out.


-aneesh

2008-02-04 17:40:39

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon 04-02-08 22:42:08, Aneesh Kumar K.V wrote:
> On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> > Hi,
> >
> > On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > > This is with the new ext3 -> ext4 migrate code added. The recently added
> > > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > > on the ext3 inode during migration to prevent walking the ext3 inode
> > > when it is being converted to ext4 format. Also we want to avoid
> > > file truncation and new blocks being added while converting to ext4.
> > > Also we dont want to reserve large number of credits for journal.
> > > Any idea how to fix this ?
> > Hmm, while briefly looking at the code - why do you introduce i_data_sem
> > and not use i_alloc_sem which is already in VFS inode? That is aimed
> > exactly at the serialization of truncates, writes and similar users.
>
> How about read ? We are changing the format of inode. We don't want even
> the read to go through.
I just meant that the code could use i_alloc_sem instead of i_data_sem in
all the places, remove i_data_sem and you safe some memory... It's just a
suggestion for a cleanup.

> > One (stupid) solution to your problem is to make i_data_sem be
> > always locked before the transaction is started. It could possibly have
> > negative performance impact because you'd have to hold the semaphore for
> > a longer time and thus a writer would block readers for longer time. So one
> > would have to measure how big difference that would make.
> > Another possibility is to start a single transaction for migration and
> > extend it as long as you can (as truncate does it). And when you can't
> > extend any more, you drop the i_data_sem and start a new transaction and
> > acquire the semaphore again. This has the disadvantage that after dropping
> > the semaphore you have to resync your original inode with the temporary
> > one your are building which probably ends up being ugly as night... Hmm,
> > but maybe we could get rid of this - hold i_mutex to protect against all
> > writes (that ranks outside of transaction start so you can hold it for the
> > whole migration time - maybe you even hold it if you are called from the
> > write path...). After dropping i_data_sem you let some readers proceed
> > but writers still wait on i_mutex so the file shouldn't change under you
> > (but I suggest adding some BUG_ONs to verify that the file really doesn't
> > change :).
>
> A quick look says truncate can happen even when we hold i_mutex ??
No, it shouldn't happen - do_truncate() in fs/open.c acquires i_mutex
before it calls notify_change().

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-02-05 12:23:52

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> Hi,
>
> On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > This is with the new ext3 -> ext4 migrate code added. The recently added
> > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > on the ext3 inode during migration to prevent walking the ext3 inode
> > when it is being converted to ext4 format. Also we want to avoid
> > file truncation and new blocks being added while converting to ext4.
> > Also we dont want to reserve large number of credits for journal.
> > Any idea how to fix this ?
> Hmm, while briefly looking at the code - why do you introduce i_data_sem
> and not use i_alloc_sem which is already in VFS inode? That is aimed
> exactly at the serialization of truncates, writes and similar users.
> That doesn't solve problems with lock ordering but I was just wondering...
> Another problem - ext4_fallocate() has the same lock ordering problem as
> the migration code and maybe there are others...
> One (stupid) solution to your problem is to make i_data_sem be
> always locked before the transaction is started. It could possibly have
> negative performance impact because you'd have to hold the semaphore for
> a longer time and thus a writer would block readers for longer time. So one
> would have to measure how big difference that would make.
> Another possibility is to start a single transaction for migration and
> extend it as long as you can (as truncate does it). And when you can't
> extend any more, you drop the i_data_sem and start a new transaction and
> acquire the semaphore again. This has the disadvantage that after dropping
> the semaphore you have to resync your original inode with the temporary
> one your are building which probably ends up being ugly as night... Hmm,
> but maybe we could get rid of this - hold i_mutex to protect against all
> writes (that ranks outside of transaction start so you can hold it for the
> whole migration time - maybe you even hold it if you are called from the
> write path...). After dropping i_data_sem you let some readers proceed
> but writers still wait on i_mutex so the file shouldn't change under you
> (but I suggest adding some BUG_ONs to verify that the file really doesn't
> change :).
>

How about the patch below. I did the below testing
a) migrate a file
b) run fs_inode fsstres fsx_linux.

The intention was to find out whether the new locking is breaking any of
the other expected hierarchy. It seems to fine. I didn't get any lockdep
warning.

ext4: Fix circular locking dependency with migrate and rm.

From: Aneesh Kumar K.V <[email protected]>

We now take inode->i_mutex lock to prevent any update of the inode i_data
field. Before we switch the inode format we take i_data_sem to prevent
parallel read.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.24-rc8 #6
-------------------------------------------------------
rm/2401 is trying to acquire lock:
(&ei->i_data_sem){----}, at: [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108

but task is already holding lock:
(jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (jbd2_handle){--..}:
[<c0143a5c>] __lock_acquire+0xa31/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c01fc4ca>] jbd2_journal_start+0xf5/0xff
[<c01e3539>] ext4_journal_start_sb+0x48/0x4a
[<c01eb980>] ext4_ext_migrate+0x7d/0x535
[<c01df328>] ext4_ioctl+0x528/0x56c
[<c0177700>] do_ioctl+0x50/0x67
[<c017794e>] vfs_ioctl+0x237/0x24a
[<c0177992>] sys_ioctl+0x31/0x4b
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5
[<ffffffff>] 0xffffffff

-> #0 (&ei->i_data_sem){----}:
[<c014394c>] __lock_acquire+0x921/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c044f247>] down_read+0x42/0x79
[<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
[<c01dcba1>] ext4_getblk+0x62/0x1c4
[<c01e0de9>] ext4_find_entry+0x350/0x5b7
[<c01e200c>] ext4_unlink+0x6e/0x1a4
[<c017449e>] vfs_unlink+0x49/0x89
[<c0175f02>] do_unlinkat+0x96/0x12c
[<c0175fa8>] sys_unlink+0x10/0x12
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5
[<ffffffff>] 0xffffffff

other info that might help us debug this:

3 locks held by rm/2401:
#0: (&type->i_mutex_dir_key#5/1){--..}, at: [<c0175eca>] do_unlinkat+0x5e/0x12c
#1: (&sb->s_type->i_mutex_key#8){--..}, at: [<c017448b>] vfs_unlink+0x36/0x89
#2: (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff

stack backtrace:
Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
[<c0106017>] show_trace_log_lvl+0x1a/0x2f
[<c0106893>] show_trace+0x12/0x14
[<c0106b89>] dump_stack+0x6c/0x72
[<c0141b26>] print_circular_bug_tail+0x5f/0x68
[<c014394c>] __lock_acquire+0x921/0xc1a
[<c0143cbf>] lock_acquire+0x7a/0x94
[<c044f247>] down_read+0x42/0x79
[<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
[<c01dcba1>] ext4_getblk+0x62/0x1c4
[<c01e0de9>] ext4_find_entry+0x350/0x5b7
[<c01e200c>] ext4_unlink+0x6e/0x1a4
[<c017449e>] vfs_unlink+0x49/0x89
[<c0175f02>] do_unlinkat+0x96/0x12c
[<c0175fa8>] sys_unlink+0x10/0x12
[<c0104f8a>] sysenter_past_esp+0x5f/0xa5

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---

fs/ext4/migrate.c | 26 ++++++++++++++++----------
1 files changed, 16 insertions(+), 10 deletions(-)


diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 9ee1f7c..f97c993 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -317,6 +317,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
goto err_out;
}

+ down_write(&EXT4_I(inode)->i_data_sem);
/*
* We have the extent map build with the tmp inode.
* Now copy the i_data across
@@ -336,6 +337,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
spin_lock(&inode->i_lock);
inode->i_blocks += tmp_inode->i_blocks;
spin_unlock(&inode->i_lock);
+ up_write(&EXT4_I(inode)->i_data_sem);

ext4_mark_inode_dirty(handle, inode);
err_out:
@@ -420,7 +422,6 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
*/
return retval;

- down_write(&EXT4_I(inode)->i_data_sem);
handle = ext4_journal_start(inode,
EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
@@ -454,13 +455,6 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
ext4_orphan_add(handle, tmp_inode);
ext4_journal_stop(handle);

- ei = EXT4_I(inode);
- i_data = ei->i_data;
- memset(&lb, 0, sizeof(lb));
-
- /* 32 bit block address 4 bytes */
- max_entries = inode->i_sb->s_blocksize >> 2;
-
/*
* start with one credit accounted for
* superblock modification.
@@ -469,7 +463,20 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
* trascation that created the inode. Later as and
* when we add extents we extent the journal
*/
+ /*
+ * inode_mutex prevent write and truncate on the file. Read still goes
+ * through. We take i_data_sem in ext4_ext_swap_inode_data before we
+ * switch the inode format to prevent read.
+ */
+ mutex_lock(&(inode->i_mutex));
handle = ext4_journal_start(inode, 1);
+
+ ei = EXT4_I(inode);
+ i_data = ei->i_data;
+ memset(&lb, 0, sizeof(lb));
+
+ /* 32 bit block address 4 bytes */
+ max_entries = inode->i_sb->s_blocksize >> 2;
for (i = 0; i < EXT4_NDIR_BLOCKS; i++, blk_count++) {
if (i_data[i]) {
retval = update_extent_range(handle, tmp_inode,
@@ -556,8 +563,7 @@ err_out:
tmp_inode->i_nlink = 0;

ext4_journal_stop(handle);
-
- up_write(&EXT4_I(inode)->i_data_sem);
+ mutex_unlock(&(inode->i_mutex));

if (tmp_inode)
iput(tmp_inode);

2008-02-05 13:42:30

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
> On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> > Hi,
> >
> > On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > > This is with the new ext3 -> ext4 migrate code added. The recently added
> > > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > > on the ext3 inode during migration to prevent walking the ext3 inode
> > > when it is being converted to ext4 format. Also we want to avoid
> > > file truncation and new blocks being added while converting to ext4.
> > > Also we dont want to reserve large number of credits for journal.
> > > Any idea how to fix this ?
> > Hmm, while briefly looking at the code - why do you introduce i_data_sem
> > and not use i_alloc_sem which is already in VFS inode? That is aimed
> > exactly at the serialization of truncates, writes and similar users.
> > That doesn't solve problems with lock ordering but I was just wondering...
> > Another problem - ext4_fallocate() has the same lock ordering problem as
> > the migration code and maybe there are others...
> > One (stupid) solution to your problem is to make i_data_sem be
> > always locked before the transaction is started. It could possibly have
> > negative performance impact because you'd have to hold the semaphore for
> > a longer time and thus a writer would block readers for longer time. So one
> > would have to measure how big difference that would make.
> > Another possibility is to start a single transaction for migration and
> > extend it as long as you can (as truncate does it). And when you can't
> > extend any more, you drop the i_data_sem and start a new transaction and
> > acquire the semaphore again. This has the disadvantage that after dropping
> > the semaphore you have to resync your original inode with the temporary
> > one your are building which probably ends up being ugly as night... Hmm,
> > but maybe we could get rid of this - hold i_mutex to protect against all
> > writes (that ranks outside of transaction start so you can hold it for the
> > whole migration time - maybe you even hold it if you are called from the
> > write path...). After dropping i_data_sem you let some readers proceed
> > but writers still wait on i_mutex so the file shouldn't change under you
> > (but I suggest adding some BUG_ONs to verify that the file really doesn't
> > change :).
> >
>
> How about the patch below. I did the below testing
> a) migrate a file
> b) run fs_inode fsstres fsx_linux.
>
> The intention was to find out whether the new locking is breaking any of
> the other expected hierarchy. It seems to fine. I didn't get any lockdep
> warning.
I think there's a problem in the patch. I don't think you can call
free_ind_block() while readers are accessing the file. That could make them
think the file contains holes when they aren't there (or even worse read
freed blocks or so). So you need to lock i_data_sem before this call (that
means you have to move journal_extend() as well). Actually, I don't quite
get why ext4_journal_extend() is called in that function. You can just
count with the 1 credit in ext4_ext_migrate() when you call
ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
probably because free_ind_block() could extend the transaction (which they
don't do now as far as I can see).
BTW: That freeing code should really take into account that it can modify
bitmaps in different block groups. It's even not that hard to do. Just
before each ext4_free_blocks() in free_ind_block() you check whether you
have still enough credits in the handle (use h_buffer_credits) and if not,
extend it by some amount.
Maybe you could do some test like writing a big file with some data and then
while migration is running read it in a loop and compare that MD5SUM is the
same all the time. Also run some memory-pressure during this test so that
data blocks aren't cached for the whole time of the test... That should
reasonably stress the migration code.

> ext4: Fix circular locking dependency with migrate and rm.
>
> From: Aneesh Kumar K.V <[email protected]>
>
> We now take inode->i_mutex lock to prevent any update of the inode i_data
> field. Before we switch the inode format we take i_data_sem to prevent
> parallel read.
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.24-rc8 #6
> -------------------------------------------------------
> rm/2401 is trying to acquire lock:
> (&ei->i_data_sem){----}, at: [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
>
> but task is already holding lock:
> (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (jbd2_handle){--..}:
> [<c0143a5c>] __lock_acquire+0xa31/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c01fc4ca>] jbd2_journal_start+0xf5/0xff
> [<c01e3539>] ext4_journal_start_sb+0x48/0x4a
> [<c01eb980>] ext4_ext_migrate+0x7d/0x535
> [<c01df328>] ext4_ioctl+0x528/0x56c
> [<c0177700>] do_ioctl+0x50/0x67
> [<c017794e>] vfs_ioctl+0x237/0x24a
> [<c0177992>] sys_ioctl+0x31/0x4b
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
> [<ffffffff>] 0xffffffff
>
> -> #0 (&ei->i_data_sem){----}:
> [<c014394c>] __lock_acquire+0x921/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c044f247>] down_read+0x42/0x79
> [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
> [<c01dcba1>] ext4_getblk+0x62/0x1c4
> [<c01e0de9>] ext4_find_entry+0x350/0x5b7
> [<c01e200c>] ext4_unlink+0x6e/0x1a4
> [<c017449e>] vfs_unlink+0x49/0x89
> [<c0175f02>] do_unlinkat+0x96/0x12c
> [<c0175fa8>] sys_unlink+0x10/0x12
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
> [<ffffffff>] 0xffffffff
>
> other info that might help us debug this:
>
> 3 locks held by rm/2401:
> #0: (&type->i_mutex_dir_key#5/1){--..}, at: [<c0175eca>] do_unlinkat+0x5e/0x12c
> #1: (&sb->s_type->i_mutex_key#8){--..}, at: [<c017448b>] vfs_unlink+0x36/0x89
> #2: (jbd2_handle){--..}, at: [<c01fc4a7>] jbd2_journal_start+0xd2/0xff
>
> stack backtrace:
> Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
> [<c0106017>] show_trace_log_lvl+0x1a/0x2f
> [<c0106893>] show_trace+0x12/0x14
> [<c0106b89>] dump_stack+0x6c/0x72
> [<c0141b26>] print_circular_bug_tail+0x5f/0x68
> [<c014394c>] __lock_acquire+0x921/0xc1a
> [<c0143cbf>] lock_acquire+0x7a/0x94
> [<c044f247>] down_read+0x42/0x79
> [<c01dca58>] ext4_get_blocks_wrap+0x21/0x108
> [<c01dcba1>] ext4_getblk+0x62/0x1c4
> [<c01e0de9>] ext4_find_entry+0x350/0x5b7
> [<c01e200c>] ext4_unlink+0x6e/0x1a4
> [<c017449e>] vfs_unlink+0x49/0x89
> [<c0175f02>] do_unlinkat+0x96/0x12c
> [<c0175fa8>] sys_unlink+0x10/0x12
> [<c0104f8a>] sysenter_past_esp+0x5f/0xa5
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
>
> fs/ext4/migrate.c | 26 ++++++++++++++++----------
> 1 files changed, 16 insertions(+), 10 deletions(-)
>
>
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index 9ee1f7c..f97c993 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -317,6 +317,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> goto err_out;
> }
>
> + down_write(&EXT4_I(inode)->i_data_sem);
> /*
> * We have the extent map build with the tmp inode.
> * Now copy the i_data across
> @@ -336,6 +337,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> spin_lock(&inode->i_lock);
> inode->i_blocks += tmp_inode->i_blocks;
> spin_unlock(&inode->i_lock);
> + up_write(&EXT4_I(inode)->i_data_sem);
>
> ext4_mark_inode_dirty(handle, inode);
> err_out:
> @@ -420,7 +422,6 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
> */
> return retval;
>
> - down_write(&EXT4_I(inode)->i_data_sem);
> handle = ext4_journal_start(inode,
> EXT4_DATA_TRANS_BLOCKS(inode->i_sb) +
> EXT4_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> @@ -454,13 +455,6 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
> ext4_orphan_add(handle, tmp_inode);
> ext4_journal_stop(handle);
>
> - ei = EXT4_I(inode);
> - i_data = ei->i_data;
> - memset(&lb, 0, sizeof(lb));
> -
> - /* 32 bit block address 4 bytes */
> - max_entries = inode->i_sb->s_blocksize >> 2;
> -
> /*
> * start with one credit accounted for
> * superblock modification.
> @@ -469,7 +463,20 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
> * trascation that created the inode. Later as and
> * when we add extents we extent the journal
> */
> + /*
> + * inode_mutex prevent write and truncate on the file. Read still goes
> + * through. We take i_data_sem in ext4_ext_swap_inode_data before we
> + * switch the inode format to prevent read.
> + */
> + mutex_lock(&(inode->i_mutex));
> handle = ext4_journal_start(inode, 1);
> +
> + ei = EXT4_I(inode);
> + i_data = ei->i_data;
> + memset(&lb, 0, sizeof(lb));
> +
> + /* 32 bit block address 4 bytes */
> + max_entries = inode->i_sb->s_blocksize >> 2;
> for (i = 0; i < EXT4_NDIR_BLOCKS; i++, blk_count++) {
> if (i_data[i]) {
> retval = update_extent_range(handle, tmp_inode,
> @@ -556,8 +563,7 @@ err_out:
> tmp_inode->i_nlink = 0;
>
> ext4_journal_stop(handle);
> -
> - up_write(&EXT4_I(inode)->i_data_sem);
> + mutex_unlock(&(inode->i_mutex));
>
> if (tmp_inode)
> iput(tmp_inode);

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-02-05 16:27:08

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote:
> On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
> >
> > How about the patch below. I did the below testing
> > a) migrate a file
> > b) run fs_inode fsstres fsx_linux.
> >
> > The intention was to find out whether the new locking is breaking any of
> > the other expected hierarchy. It seems to fine. I didn't get any lockdep
> > warning.
> I think there's a problem in the patch. I don't think you can call
> free_ind_block() while readers are accessing the file. That could make them
> think the file contains holes when they aren't there (or even worse read
> freed blocks or so). So you need to lock i_data_sem before this call (that
> means you have to move journal_extend() as well). Actually, I don't quite
> get why ext4_journal_extend() is called in that function. You can just
> count with the 1 credit in ext4_ext_migrate() when you call
> ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
> probably because free_ind_block() could extend the transaction (which they
> don't do now as far as I can see).

ext4_journal_extend is called to extend the journal by one credit to
take care of writing the block containing inode. I moved the journal
extend before taking i_data_sem lock.

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index f97c993..dc6617a 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);

- retval = free_ind_block(handle, inode);
- if (retval)
- goto err_out;
-
/*
* One credit accounted for writing the
* i_data field of the original inode
@@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
}

down_write(&EXT4_I(inode)->i_data_sem);
+ retval = free_ind_block(handle, inode);
+ if (retval)
+ goto err_out;
+
/*
* We have the extent map build with the tmp inode.
* Now copy the i_data across

The above change also make sure we don't fail after we free the indirect
blocks.


> BTW: That freeing code should really take into account that it can modify
> bitmaps in different block groups. It's even not that hard to do. Just
> before each ext4_free_blocks() in free_ind_block() you check whether you
> have still enough credits in the handle (use h_buffer_credits) and if not,
> extend it by some amount.


I have a FIXME at migrate.c:524 documenting exactly that. The
difficult question was by how much we should extent the journal. ? But
in reality we might have accumulated enough journal credits, I never
really ran across a case where we are running out of the journal credit.


> Maybe you could do some test like writing a big file with some data and then
> while migration is running read it in a loop and compare that MD5SUM is the
> same all the time. Also run some memory-pressure during this test so that
> data blocks aren't cached for the whole time of the test... That should
> reasonably stress the migration code.

I have tested migrate by booting with mem= and doing parallel
read/write and migrate. I didn't do the MDSUM compare. Will do that this
time.


Thanks for all the help with review.
-aneesh
>

2008-02-05 16:34:06

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote:
> On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote:
> > On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
> > >
> > > How about the patch below. I did the below testing
> > > a) migrate a file
> > > b) run fs_inode fsstres fsx_linux.
> > >
> > > The intention was to find out whether the new locking is breaking any of
> > > the other expected hierarchy. It seems to fine. I didn't get any lockdep
> > > warning.
> > I think there's a problem in the patch. I don't think you can call
> > free_ind_block() while readers are accessing the file. That could make them
> > think the file contains holes when they aren't there (or even worse read
> > freed blocks or so). So you need to lock i_data_sem before this call (that
> > means you have to move journal_extend() as well). Actually, I don't quite
> > get why ext4_journal_extend() is called in that function. You can just
> > count with the 1 credit in ext4_ext_migrate() when you call
> > ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
> > probably because free_ind_block() could extend the transaction (which they
> > don't do now as far as I can see).
>
> ext4_journal_extend is called to extend the journal by one credit to
> take care of writing the block containing inode. I moved the journal
> extend before taking i_data_sem lock.
>
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index f97c993..dc6617a 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> struct ext4_inode_info *ei = EXT4_I(inode);
> struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
>
> - retval = free_ind_block(handle, inode);
> - if (retval)
> - goto err_out;
> -
> /*
> * One credit accounted for writing the
> * i_data field of the original inode
> @@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> }
>
> down_write(&EXT4_I(inode)->i_data_sem);
> + retval = free_ind_block(handle, inode);
> + if (retval)
> + goto err_out;
> +
> /*
> * We have the extent map build with the tmp inode.
> * Now copy the i_data across
>
> The above change also make sure we don't fail after we free the indirect
> blocks.
Yeah, now it looks fine.

> > BTW: That freeing code should really take into account that it can modify
> > bitmaps in different block groups. It's even not that hard to do. Just
> > before each ext4_free_blocks() in free_ind_block() you check whether you
> > have still enough credits in the handle (use h_buffer_credits) and if not,
> > extend it by some amount.
>
>
> I have a FIXME at migrate.c:524 documenting exactly that. The
> difficult question was by how much we should extent the journal. ? But
> in reality we might have accumulated enough journal credits, I never
> really ran across a case where we are running out of the journal credit.
Yes, I don't think it is likely to happen in reality but if somebody
would trigger this, it would be almost impossible to track down so one
should be quite careful with these things...
And as I described, doing it failsafe is easy - just look how
try_to_extend_transaction() in ext4/inode.c handles similar problems with
truncate.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-02-05 20:06:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Tue, Feb 05, 2008 at 05:34:04PM +0100, Jan Kara wrote:
> On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote:
> >
> > I have a FIXME at migrate.c:524 documenting exactly that. The
> > difficult question was by how much we should extent the journal. ? But
> > in reality we might have accumulated enough journal credits, I never
> > really ran across a case where we are running out of the journal credit.
> Yes, I don't think it is likely to happen in reality but if somebody
> would trigger this, it would be almost impossible to track down so one
> should be quite careful with these things...
> And as I described, doing it failsafe is easy - just look how
> try_to_extend_transaction() in ext4/inode.c handles similar problems with
> truncate.
>

I moved the indirect block freeing after i update the original inode.
That makes sure even if we fail to free the indirect blocks we have the
original inode converted. Now since i don't need atomicity with freeing of
blocks i extend the journal for each block freed. Below is the diff i have
right now.

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 1712844..1b00587 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -233,11 +233,14 @@ static int free_dind_blocks(handle_t *handle,

tmp_idata = (__le32 *)bh->b_data;
for (i = 0; i < max_entries; i++) {
- if (tmp_idata[i])
+ if (tmp_idata[i]) {
+ extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode,
le32_to_cpu(tmp_idata[i]), 1, 1);
+ }
}
put_bh(bh);
+ extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);
return 0;
}
@@ -266,29 +269,32 @@ static int free_tind_blocks(handle_t *handle,
}
}
put_bh(bh);
+ extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);
return 0;
}

-static int free_ind_block(handle_t *handle, struct inode *inode)
+static int free_ind_block(handle_t *handle, __le32 *i_data)
{
int retval;
- struct ext4_inode_info *ei = EXT4_I(inode);

- if (ei->i_data[EXT4_IND_BLOCK])
+ /* ei->i_data[EXT4_IND_BLOCK] */
+ if (i_data[0]) {
+ extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode,
- le32_to_cpu(ei->i_data[EXT4_IND_BLOCK]), 1, 1);
+ le32_to_cpu(i_data[0]), 1, 1);
+ }

- if (ei->i_data[EXT4_DIND_BLOCK]) {
- retval = free_dind_blocks(handle, inode,
- ei->i_data[EXT4_DIND_BLOCK]);
+ /* ei->i_data[EXT4_DIND_BLOCK] */
+ if (i_data[1]) {
+ retval = free_dind_blocks(handle, inode, i_data[1]);
if (retval)
return retval;
}

- if (ei->i_data[EXT4_TIND_BLOCK]) {
- retval = free_tind_blocks(handle, inode,
- ei->i_data[EXT4_TIND_BLOCK]);
+ /* ei->i_data[EXT4_TIND_BLOCK] */
+ if (i_data[2]) {
+ retval = free_tind_blocks(handle, inode, i_data[2]);
if (retval)
return retval;
}
@@ -299,6 +305,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
struct inode *tmp_inode)
{
int retval;
+ __le32 i_data[3];
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);

@@ -313,11 +320,11 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
goto err_out;
}

- down_write(&EXT4_I(inode)->i_data_sem);
- retval = free_ind_block(handle, inode);
- if (retval)
- goto err_out;
+ i_data[0] = ei->i_data[EXT4_IND_BLOCK];
+ i_data[1] = ei->i_data[EXT4_DIND_BLOCK];
+ i_data[2] = ei->i_data[EXT4_TIND_BLOCK];

+ down_write(&EXT4_I(inode)->i_data_sem);
/*
* We have the extent map build with the tmp inode.
* Now copy the i_data across
@@ -338,8 +345,12 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
inode->i_blocks += tmp_inode->i_blocks;
spin_unlock(&inode->i_lock);
up_write(&EXT4_I(inode)->i_data_sem);
-
ext4_mark_inode_dirty(handle, inode);
+
+ /* Now free the indirec block of the inode */
+ retval = free_ind_block(handle, i_data);
+ if (retval)
+ goto err_out;
err_out:
return retval;
}
@@ -367,6 +378,7 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
}
}
put_bh(bh);
+ extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, block, 1, 1);
return retval;
}
@@ -394,6 +406,25 @@ static int free_ext_block(handle_t *handle, struct inode *inode)
return retval;

}
+static int extend_blkdelete_credit(handle_t *handle, struct inode *inode)
+{
+ int retval, needed;
+
+ if (handle->h_buffer_credits > EXT4_RESERVE_TRANS_BLOCKS)
+ return 0;
+ /*
+ * We are freeing a blocks. During this we touch
+ * superblock, group descriptor and block bitmap.
+ * So allocate a credit of 3. We may update
+ * quota (user and group).
+ */
+ needed = 3 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
+
+ if (ext4_journal_extend(handle, needed) != 0)
+ retval = ext4_journal_restart(handle, needed);
+
+ return retval;
+}

int ext4_ext_migrate(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
@@ -514,19 +545,6 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
*/
retval = finish_range(handle, tmp_inode, &lb);
err_out:
- /*
- * We are either freeing extent information or indirect
- * blocks. During this we touch superblock, group descriptor
- * and block bitmap. Later we mark the tmp_inode dirty
- * via ext4_ext_tree_init. So allocate a credit of 4
- * We may update quota (user and group).
- *
- * FIXME!! we may be touching bitmaps in different block groups.
- */
- if (ext4_journal_extend(handle,
- 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)) != 0)
- ext4_journal_restart(handle,
- 4 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
if (retval)
/*
* Failure case delete the extent information with the
@@ -537,6 +555,10 @@ err_out:
retval = ext4_ext_swap_inode_data(handle, inode,
tmp_inode);

+ /* We mark the tmp_inode dirty via ext4_ext_tree_init. */
+ if (ext4_journal_extend(handle, 1) != 0)
+ ext4_journal_restart(handle, 1);
+
/*
* Mark the tmp_inode as of size zero
*/

2008-03-05 20:12:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> Hi,
>
> On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > This is with the new ext3 -> ext4 migrate code added. The recently added
> > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > on the ext3 inode during migration to prevent walking the ext3 inode
> > when it is being converted to ext4 format. Also we want to avoid
> > file truncation and new blocks being added while converting to ext4.
> > Also we dont want to reserve large number of credits for journal.
> > Any idea how to fix this ?
> Hmm, while briefly looking at the code - why do you introduce i_data_sem
> and not use i_alloc_sem which is already in VFS inode? That is aimed
> exactly at the serialization of truncates, writes and similar users.
> That doesn't solve problems with lock ordering but I was just wondering...
> Another problem - ext4_fallocate() has the same lock ordering problem as
> the migration code and maybe there are others...
> One (stupid) solution to your problem is to make i_data_sem be
> always locked before the transaction is started. It could possibly have
> negative performance impact because you'd have to hold the semaphore for
> a longer time and thus a writer would block readers for longer time. So one
> would have to measure how big difference that would make.
> Another possibility is to start a single transaction for migration and
> extend it as long as you can (as truncate does it). And when you can't
> extend any more, you drop the i_data_sem and start a new transaction and
> acquire the semaphore again. This has the disadvantage that after dropping
> the semaphore you have to resync your original inode with the temporary
> one your are building which probably ends up being ugly as night... Hmm,
> but maybe we could get rid of this - hold i_mutex to protect against all
> writes (that ranks outside of transaction start so you can hold it for the
> whole migration time - maybe you even hold it if you are called from the
> write path...). After dropping i_data_sem you let some readers proceed
> but writers still wait on i_mutex so the file shouldn't change under you
> (but I suggest adding some BUG_ONs to verify that the file really doesn't
> change :).
>

This one came up again when i was doing the mmap ENOSPC patch. Now the
current code with migrate is taking i_mutex to protech against all
writes. But it seems a write to a mmap area mapping a hole can still go
through fine. And that path cannot take i_mutex.

So i guess the migrate locking need to be fixed. Any suggestion ?

-aneesh

2008-03-10 09:37:20

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Thu 06-03-08 01:42:34, Aneesh Kumar K.V wrote:
> On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> > Hi,
> >
> > On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > > This is with the new ext3 -> ext4 migrate code added. The recently added
> > > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > > on the ext3 inode during migration to prevent walking the ext3 inode
> > > when it is being converted to ext4 format. Also we want to avoid
> > > file truncation and new blocks being added while converting to ext4.
> > > Also we dont want to reserve large number of credits for journal.
> > > Any idea how to fix this ?
> > Hmm, while briefly looking at the code - why do you introduce i_data_sem
> > and not use i_alloc_sem which is already in VFS inode? That is aimed
> > exactly at the serialization of truncates, writes and similar users.
> > That doesn't solve problems with lock ordering but I was just wondering...
> > Another problem - ext4_fallocate() has the same lock ordering problem as
> > the migration code and maybe there are others...
> > One (stupid) solution to your problem is to make i_data_sem be
> > always locked before the transaction is started. It could possibly have
> > negative performance impact because you'd have to hold the semaphore for
> > a longer time and thus a writer would block readers for longer time. So one
> > would have to measure how big difference that would make.
> > Another possibility is to start a single transaction for migration and
> > extend it as long as you can (as truncate does it). And when you can't
> > extend any more, you drop the i_data_sem and start a new transaction and
> > acquire the semaphore again. This has the disadvantage that after dropping
> > the semaphore you have to resync your original inode with the temporary
> > one your are building which probably ends up being ugly as night... Hmm,
> > but maybe we could get rid of this - hold i_mutex to protect against all
> > writes (that ranks outside of transaction start so you can hold it for the
> > whole migration time - maybe you even hold it if you are called from the
> > write path...). After dropping i_data_sem you let some readers proceed
> > but writers still wait on i_mutex so the file shouldn't change under you
> > (but I suggest adding some BUG_ONs to verify that the file really doesn't
> > change :).
> >
>
> This one came up again when i was doing the mmap ENOSPC patch. Now the
> current code with migrate is taking i_mutex to protech against all
> writes. But it seems a write to a mmap area mapping a hole can still go
> through fine. And that path cannot take i_mutex.
Ah, yeah. I forgot about this case. Yes, mmaped write can happen anytime
and when we writeout these dirty data through writepage() we are not
guarded by i_mutex :(.

> So i guess the migrate locking need to be fixed. Any suggestion ?
Not a simple one. Will think about it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-10 14:24:28

by Jan Kara

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Thu 06-03-08 01:42:34, Aneesh Kumar K.V wrote:
> On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
> > Hi,
> >
> > On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
> > > This is with the new ext3 -> ext4 migrate code added. The recently added
> > > lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
> > > on the ext3 inode during migration to prevent walking the ext3 inode
> > > when it is being converted to ext4 format. Also we want to avoid
> > > file truncation and new blocks being added while converting to ext4.
> > > Also we dont want to reserve large number of credits for journal.
> > > Any idea how to fix this ?
> > Hmm, while briefly looking at the code - why do you introduce i_data_sem
> > and not use i_alloc_sem which is already in VFS inode? That is aimed
> > exactly at the serialization of truncates, writes and similar users.
> > That doesn't solve problems with lock ordering but I was just wondering...
> > Another problem - ext4_fallocate() has the same lock ordering problem as
> > the migration code and maybe there are others...
> > One (stupid) solution to your problem is to make i_data_sem be
> > always locked before the transaction is started. It could possibly have
> > negative performance impact because you'd have to hold the semaphore for
> > a longer time and thus a writer would block readers for longer time. So one
> > would have to measure how big difference that would make.
> > Another possibility is to start a single transaction for migration and
> > extend it as long as you can (as truncate does it). And when you can't
> > extend any more, you drop the i_data_sem and start a new transaction and
> > acquire the semaphore again. This has the disadvantage that after dropping
> > the semaphore you have to resync your original inode with the temporary
> > one your are building which probably ends up being ugly as night... Hmm,
> > but maybe we could get rid of this - hold i_mutex to protect against all
> > writes (that ranks outside of transaction start so you can hold it for the
> > whole migration time - maybe you even hold it if you are called from the
> > write path...). After dropping i_data_sem you let some readers proceed
> > but writers still wait on i_mutex so the file shouldn't change under you
> > (but I suggest adding some BUG_ONs to verify that the file really doesn't
> > change :).
> >
>
> This one came up again when i was doing the mmap ENOSPC patch. Now the
> current code with migrate is taking i_mutex to protech against all
> writes. But it seems a write to a mmap area mapping a hole can still go
> through fine. And that path cannot take i_mutex.
>
> So i guess the migrate locking need to be fixed. Any suggestion ?
Hmm, thinking about it a bit more. One possibility is that we could just
use i_mutex to protect against ordinary writes, and before swapping blocks
for extents we'd check whether some holes were not filled in the mean time.
If yes, we can retry the migrate, or fail it and retry later.
Another possibility would be to make ext4 use page_mkwrite to fill in
holes. There we could safely acquire i_mutex and be done.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-11 05:45:04

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: jbd2_handle and i_data_sem circular locking dependency detected

On Mon, Mar 10, 2008 at 03:24:26PM +0100, Jan Kara wrote:
> > This one came up again when i was doing the mmap ENOSPC patch. Now the
> > current code with migrate is taking i_mutex to protech against all
> > writes. But it seems a write to a mmap area mapping a hole can still go
> > through fine. And that path cannot take i_mutex.
> >
> > So i guess the migrate locking need to be fixed. Any suggestion ?
> Hmm, thinking about it a bit more. One possibility is that we could just
> use i_mutex to protect against ordinary writes, and before swapping blocks
> for extents we'd check whether some holes were not filled in the mean time.
> If yes, we can retry the migrate, or fail it and retry later.
> Another possibility would be to make ext4 use page_mkwrite to fill in
> holes. There we could safely acquire i_mutex and be done.
>

page_mkwrite is called with mmap_sem help and we can't take inode->i_mutex
in page_mkwrite. The DIO write have inode->i_mutex -> mmap_sem
locking order.

I "fixed" it by introducing i_migrate_sem

http://article.gmane.org/gmane.comp.file-systems.ext4/5402

-aneesh