2012-09-20 03:23:13

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

In ext4_nonda_switch(), if the file system is getting full we used to
call writeback_inodes_sb_if_idle(). The problem is that we can be
holding i_mutex already, and this causes a potential deadlock when
writeback_inodes_sb_if_idle() when it tries to take s_umount. (See
lockdep output below).

As it turns out we don't need need to hold s_umount; the fact that we
are in the middle of the write(2) system call will keep the superblock
pinned. So we can just call writeback_inodes_sb() directly without
taking s_umount first.

[ INFO: possible circular locking dependency detected ]
3.6.0-rc1-00042-gce894ca #367 Not tainted
-------------------------------------------------------
dd/8298 is trying to acquire lock:
(&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46

but task is already holding lock:
(&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

which lock already depends on the new lock.

2 locks held by dd/8298:
#0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
#1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3

stack backtrace:
Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
Call Trace:
[<c015b79c>] ? console_unlock+0x345/0x372
[<c06d62a1>] print_circular_bug+0x190/0x19d
[<c019906c>] __lock_acquire+0x86d/0xb6c
[<c01999db>] ? mark_held_locks+0x5c/0x7b
[<c0199724>] lock_acquire+0x66/0xb9
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c06db935>] down_read+0x28/0x58
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
[<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
[<c0271ece>] ext4_da_write_begin+0x27/0x193
[<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
[<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
[<c01ddce7>] generic_file_aio_write+0x78/0xd3
[<c026d336>] ext4_file_write+0x480/0x4a6
[<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
[<c0180944>] ? sched_clock_cpu+0x11a/0x13e
[<c01967e9>] ? trace_hardirqs_off+0xb/0xd
[<c018099f>] ? local_clock+0x37/0x4e
[<c0209f2c>] do_sync_write+0x67/0x9d
[<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
[<c020a7b9>] vfs_write+0x7b/0xe6
[<c020a9a6>] sys_write+0x3b/0x64
[<c06dd4bd>] syscall_call+0x7/0xb

Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
---
fs/ext4/inode.c | 4 ++--
fs/fs-writeback.c | 1 +
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0a9a89e..d4c60f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2449,8 +2449,8 @@ static int ext4_nonda_switch(struct super_block *sb)
* Even if we don't switch but are nearing capacity,
* start pushing delalloc when 1/2 of free blocks are dirty.
*/
- if (free_blocks < 2 * dirty_blocks)
- writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
+ if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
+ writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

return 0;
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be3efc4..5602d73 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -63,6 +63,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
{
return test_bit(BDI_writeback_running, &bdi->state);
}
+EXPORT_SYMBOL(writeback_in_progress);

static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
--
1.7.12.rc0.22.gcdd159b



2012-09-21 22:12:10

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

On 9/19/12 10:23 PM, Theodore Ts'o wrote:
> In ext4_nonda_switch(), if the file system is getting full we used to
> call writeback_inodes_sb_if_idle(). The problem is that we can be
> holding i_mutex already, and this causes a potential deadlock when
> writeback_inodes_sb_if_idle() when it tries to take s_umount. (See
> lockdep output below).
>
> As it turns out we don't need need to hold s_umount; the fact that we
> are in the middle of the write(2) system call will keep the superblock
> pinned. So we can just call writeback_inodes_sb() directly without
> taking s_umount first.
>
> [ INFO: possible circular locking dependency detected ]
> 3.6.0-rc1-00042-gce894ca #367 Not tainted
> -------------------------------------------------------
> dd/8298 is trying to acquire lock:
> (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
>
> but task is already holding lock:
> (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
>
> which lock already depends on the new lock.
>
> 2 locks held by dd/8298:
> #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
> #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
>
> stack backtrace:
> Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
> Call Trace:
> [<c015b79c>] ? console_unlock+0x345/0x372
> [<c06d62a1>] print_circular_bug+0x190/0x19d
> [<c019906c>] __lock_acquire+0x86d/0xb6c
> [<c01999db>] ? mark_held_locks+0x5c/0x7b
> [<c0199724>] lock_acquire+0x66/0xb9
> [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
> [<c06db935>] down_read+0x28/0x58
> [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
> [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
> [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
> [<c0271ece>] ext4_da_write_begin+0x27/0x193
> [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
> [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
> [<c01ddce7>] generic_file_aio_write+0x78/0xd3
> [<c026d336>] ext4_file_write+0x480/0x4a6
> [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
> [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
> [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
> [<c018099f>] ? local_clock+0x37/0x4e
> [<c0209f2c>] do_sync_write+0x67/0x9d
> [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
> [<c020a7b9>] vfs_write+0x7b/0xe6
> [<c020a9a6>] sys_write+0x3b/0x64
> [<c06dd4bd>] syscall_call+0x7/0xb
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: [email protected]
> ---
> fs/ext4/inode.c | 4 ++--
> fs/fs-writeback.c | 1 +
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0a9a89e..d4c60f2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2449,8 +2449,8 @@ static int ext4_nonda_switch(struct super_block *sb)
> * Even if we don't switch but are nearing capacity,
> * start pushing delalloc when 1/2 of free blocks are dirty.
> */
> - if (free_blocks < 2 * dirty_blocks)
> - writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> + if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> + writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

Looks to me like this inverts the logic.

We used to write back if idle, now we fire it off if it's already underway.

Shouldn't it be:

+ if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
+ writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

-Eric

>
> return 0;
> }
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index be3efc4..5602d73 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -63,6 +63,7 @@ int writeback_in_progress(struct backing_dev_info *bdi)
> {
> return test_bit(BDI_writeback_running, &bdi->state);
> }
> +EXPORT_SYMBOL(writeback_in_progress);
>
> static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> {
>


2012-09-21 23:59:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

On Fri, Sep 21, 2012 at 05:12:05PM -0500, Eric Sandeen wrote:
> > - if (free_blocks < 2 * dirty_blocks)
> > - writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> > + if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> > + writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
>
> Looks to me like this inverts the logic.
>
> We used to write back if idle, now we fire it off if it's already underway.
>
> Shouldn't it be:
>
> + if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
> + writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);

Oops, nice catch. Thanks for the review!!

I've added the missing '!' to the patch.

- Ted

2012-09-27 17:18:35

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

On Fri, 21 Sep 2012 19:59:12 -0400, Theodore Ts'o <[email protected]> wrote:
> On Fri, Sep 21, 2012 at 05:12:05PM -0500, Eric Sandeen wrote:
> > > - if (free_blocks < 2 * dirty_blocks)
> > > - writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> > > + if ((free_blocks < 2 * dirty_blocks) && writeback_in_progress(sb->s_bdi))
> > > + writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
> >
> > Looks to me like this inverts the logic.
> >
> > We used to write back if idle, now we fire it off if it's already underway.
> >
> > Shouldn't it be:
> >
> > + if ((free_blocks < 2 * dirty_blocks) && !writeback_in_progress(sb->s_bdi))
> > + writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
>
> Oops, nice catch. Thanks for the review!!
>
> I've added the missing '!' to the patch.
Hmm... even this '!' patch is still not good. I've got that complain
WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()
Hardware name:
Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
sg xhci_hcd button ext3 jbd mbcache sd_mod crc_t10dif aesni_intel
ablk_helper cryptd aes_x86_64 aes_generic ahci libahci pata_acpi
ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 1896, comm: fio Not tainted 3.6.0-rc1+ #77
Call Trace:
[<ffffffff81069433>] warn_slowpath_common+0xc3/0xf0
[<ffffffff8106947a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81253a13>] writeback_inodes_sb_nr+0x1a3/0x200
[<ffffffff81253f4e>] writeback_inodes_sb+0x2e/0x40
[<ffffffffa03d73d9>] ext4_nonda_switch+0xe9/0x120 [ext4]
[<ffffffffa03dd2ea>] ext4_da_write_begin+0x4a/0x330 [ext4]
[<ffffffff811874ca>] generic_perform_write+0x11a/0x360
[<ffffffff810dc9f7>] ? current_kernel_time+0x97/0xb0
[<ffffffff81187771>] generic_file_buffered_write+0x61/0xd0
[<ffffffff8118b98a>] __generic_file_aio_write+0x6ca/0x820
[<ffffffff8118bb93>] generic_file_aio_write+0xb3/0x150
[<ffffffffa03d2785>] ext4_file_write+0x155/0x190 [ext4]
[<ffffffffa03d2630>] ? ext4_file_dio_write+0x550/0x550 [ext4]
[<ffffffff81281f6e>] aio_rw_vect_retry+0xce/0x200
[<ffffffff81281ea0>] ? aio_advance_iovec+0x130/0x130
[<ffffffff81281ea0>] ? aio_advance_iovec+0x130/0x130
[<ffffffff812832b6>] aio_run_iocb+0xd6/0x2e0
[<ffffffff817397c8>] io_submit_one+0x38a/0x3ff
[<ffffffff81284c8e>] do_io_submit+0x2be/0x3d0
[<ffffffff81284db0>] sys_io_submit+0x10/0x20
[<ffffffff8175c8e9>] system_call_fastpath+0x16/0x1b
---[ end trace 586bc928292ed61e ]---

>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-09-27 18:09:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

On Thu, Sep 27, 2012 at 09:18:30PM +0400, Dmitry Monakhov wrote:
> Hmm... even this '!' patch is still not good. I've got that complain
> WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()

This WARN_ON checking to make sure the s_umount mutex is grabbed.

What kernel are you applying this patch against? Commit 14da92001:
"fs: Protect write paths by sb_start_write - sb_end_write" adds a call
to sb_start_write() in generic_file_aio_write() which means by the
time we call ext4_nonda_switch(), s_umount is grabbed, which probably
explains why I'm not seeing this --- 14da92001 was added to mainline
as of 3.6-rc1, and I'm guessing you're using an older kernel?

This is something we need to consider, though, since "ext4: fix
potential deadlock in ext4_nonda_switch()" is currently marked for
stable, but commit 14da92001 is not marked for stable. So if this
commit gets backported w/o 14da92001, stable kernel users will see
this warning.

- Ted

2012-09-27 18:16:18

by Dmitry Monakhov

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

On Thu, 27 Sep 2012 14:09:49 -0400, Theodore Ts'o <[email protected]> wrote:
> On Thu, Sep 27, 2012 at 09:18:30PM +0400, Dmitry Monakhov wrote:
> > Hmm... even this '!' patch is still not good. I've got that complain
> > WARNING: at fs/fs-writeback.c:1316 writeback_inodes_sb_nr+0x1a3/0x200()
>
> This WARN_ON checking to make sure the s_umount mutex is grabbed.
>
> What kernel are you applying this patch against? Commit 14da92001:
I use b1725167c38ab current ext4.git/dev head.
> "fs: Protect write paths by sb_start_write - sb_end_write" adds a call
> to sb_start_write() in generic_file_aio_write() which means by the
> time we call ext4_nonda_switch(), s_umount is grabbed, which probably
> explains why I'm not seeing this --- 14da92001 was added to mainline
> as of 3.6-rc1, and I'm guessing you're using an older kernel?
>

> This is something we need to consider, though, since "ext4: fix
> potential deadlock in ext4_nonda_switch()" is currently marked for
> stable, but commit 14da92001 is not marked for stable. So if this
> commit gets backported w/o 14da92001, stable kernel users will see
> this warning.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-09-27 20:14:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] ext4: fix potential deadlock in ext4_nonda_switch()

Ah, you're right. I *can* easily see the problem; somehow I just
didn't notice it, but when I look back at my test logs, it's
definitely there. I thought sb_start_write() took s_umount, but it
doesn't, because it was trying to solve the exact same lock ordering
problem. Drat....

Unfortunately, there doesn't seem to be a good solution here. The two
options I see is either (a) create another version of
writeback_inodes_sb() which doesn't have the WARN_ON check in
fs/fs-writeback.c (the warning is a false positive, since there are
other mechanisms which protect the superblock from being unmounted
while the write system call is in progress), or (b) call
writeback_inodes_sb() out of a workqueue.

- Ted