2011-02-07 11:53:05

by Masayoshi MIZUMA

Subject: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi,

When I tested the freeze feature of the ext4 filesystem using the fsfreeze
command on 2.6.38-rc3, I got the following messages:

---------------------------------------------------------------------
Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
Feb 7 15:05:09 RX300S6 kernel: Call Trace:
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
...
Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
Feb 7 15:07:09 RX300S6 kernel: Call Trace:
Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
---------------------------------------------------------------------

I think the following deadlock problem happened:

              [flush-8:0:1409]              |          [fsfreeze:2104]
--------------------------------------------+--------------------------------
 writeback_inodes_wb                        |
  pin_sb_for_writeback                      |
   down_read_trylock(&sb->s_umount)         |
  writeback_sb_inodes                       |thaw_super
   writeback_single_inode                   | down_write(&sb->s_umount)
    do_writepages                           |  # stop until flush-8:0 releases
     ext4_da_writepages                     |  # read lock of sb->s_umount...
      ext4_journal_start_sb                 |
       vfs_check_frozen                     |
        wait_event((sb)->s_wait_unfrozen,   |
          ((sb)->s_frozen < (level)))       |
        # stop until being waked up by      |
        # fsfreeze...                       |
--------------------------------------------+--------------------------------

Could anyone check this problem?

Thanks,
Masayoshi Mizuma
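
As a rough illustration of the cycle in the diagram above, the two tasks can be put into a tiny wait-for graph in userspace C. This is only a sketch, assuming each blocked task waits on exactly one other task; TASK_FLUSHER, TASK_FSFREEZE, has_deadlock() and reported_scenario_deadlocks() are names made up for this example and are not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task ids for this model; these are not kernel identifiers. */
enum { TASK_FLUSHER, TASK_FSFREEZE, NR_TASKS };

/*
 * waits_for[t] is the task that t is blocked on, or -1 if t can run.
 * With at most one outgoing edge per task, a deadlock is a chain of
 * edges that leads back to the task it started from.
 */
bool has_deadlock(const int waits_for[], int nr_tasks)
{
    for (int start = 0; start < nr_tasks; start++) {
        int cur = start;

        for (int step = 0; step < nr_tasks; step++) {
            cur = waits_for[cur];
            if (cur < 0)
                break;          /* chain ends at a runnable task */
            assert(cur < nr_tasks);
            if (cur == start)
                return true;    /* came back to where we started */
        }
    }
    return false;
}

/* Build the wait-for edges of the reported scenario and check them. */
bool reported_scenario_deadlocks(bool flusher_holds_s_umount)
{
    int waits_for[NR_TASKS] = { -1, -1 };

    /* The flusher sleeps in vfs_check_frozen until fsfreeze thaws the fs. */
    waits_for[TASK_FLUSHER] = TASK_FSFREEZE;

    /*
     * fsfreeze sleeps in down_write(&sb->s_umount) only while the flusher
     * still holds the read side of s_umount.
     */
    if (flusher_holds_s_umount)
        waits_for[TASK_FSFREEZE] = TASK_FLUSHER;

    return has_deadlock(waits_for, NR_TASKS);
}
```

In this model, reported_scenario_deadlocks(true) finds the cycle, while reported_scenario_deadlocks(false) does not: if the flusher did not hold s_umount while waiting for the thaw, thaw_super could proceed and wake it up.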




2011-02-15 16:06:30

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 07-02-11 20:53:25, Masayoshi MIZUMA wrote:
> Hi,
>
> When I checked the freeze feature for ext4 filesystem using fsfreeze command
> at 2.6.38-rc3, I got the following messeges:
>
> ---------------------------------------------------------------------
> Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
> Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
> Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
> Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
> Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
> Feb 7 15:05:09 RX300S6 kernel: Call Trace:
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
> ...
> Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
> Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
> Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
> Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
> Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
> Feb 7 15:07:09 RX300S6 kernel: Call Trace:
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
> ---------------------------------------------------------------------
>
> I think the following deadlock problem happened:
>
> [flush-8:0:1409] | [fsfreeze:2104]
> --------------------------------------------+--------------------------------
> writeback_inodes_wb |
> pin_sb_for_writeback |
> down_read_trylock(&sb->s_umount) |
> writeback_sb_inodes |thaw_super
> writeback_single_inode | down_write(&sb->s_umount)
> do_writepages | # stop until flush-8:0 releases
> ext4_da_writepages | # read lock of sb->s_umount...
> ext4_journal_start_sb |
> vfs_check_frozen |
> wait_event((sb)->s_wait_unfrozen, |
> ((sb)->s_frozen < (level))) |
> # stop until being waked up by |
> # fsfreeze... |
> --------------------------------------------+--------------------------------
>
> Could anyone check this problem?
Thanks for the detailed analysis. Indeed this is a bug. Whenever we do IO
under the s_umount semaphore, we are prone to a deadlock like the one you
describe above.

Looking at the code, s_frozen acts as a lock but its lock ranking is
unclear. Logically, the only sane ranking seems to be to rank it above all
other VFS locks because we return to userspace with s_frozen held. But then
the need to wait for s_frozen from inside the filesystem violates this
ranking and causes the above deadlock.

Gosh, this is so broken. The whole thing is made even worse because
filesystems can take different locks in their unfreeze_fs callbacks and we
can possibly deadlock on these locks the same way as we do on s_umount
semaphore.

I have to think about how this could possibly be fixed...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-02-15 17:04:00

by Theodore Ts'o

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> under s_umount semaphore, we are prone to deadlock like the one you
> describe above.

One of the fundamental problems here is that the freeze and thaw
routines are using down_write(&sb->s_umount) for two purposes. The
first is to prevent the resume/thaw from racing with a umount (which
it could do just as well by taking a read lock), but the second is to
prevent the resume/thaw code from racing with itself. That's the core
fundamental problem here.

So I think we can solve this by introducing a new mutex, s_freeze, and
having the resume/thaw first take the s_freeze mutex and then
take a read lock on s_umount.

- Ted

2011-02-15 17:29:54

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > under s_umount semaphore, we are prone to deadlock like the one you
> > describe above.
>
> One of the fundamental problems here is that the freeze and thaw
> routines are using down_write(&sb->s_umount) for two purposes. The
> first is to prevent the resume/thaw from racing with a umount (which
> it could do just as well by taking a read lock), but the second is to
> prevent the resume/thaw code from racing with itself. That's the core
> fundamental problem here.
>
> So I think we can solve this by introduce a new mutex, s_freeze, and
> having the the resume/thaw first take the s_freeze mutex and then
> second take a read lock on the s_umount.
Sadly this does not quite work because even down_read(&sb->s_umount)
in thaw_super() can block if there is another process that tries to acquire
s_umount for writing - a situation like:
TASK 1 (e.g. flusher)          TASK 2 (e.g. remount)          TASK 3 (unfreeze)
down_read(&sb->s_umount)
  block on s_frozen
                               down_write(&sb->s_umount)
                                 -blocked
                                                              down_read(&sb->s_umount)
                                                                -blocked
                                                              behind the write access...

The only working solution I see is to check for frozen filesystem before
taking s_umount semaphore which seems rather ugly (but might be bearable if
we did so in some well described wrapper).

And in particular ext4 has another deadlock of this kind because it does
IO from ext4_remount(), e.g. when doing online resize (I know it's a bit
artificial but still ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
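
The three-task scenario above can be sketched with a small userspace model of a reader/writer semaphore in which new readers queue behind a waiting writer, which is the behaviour the argument relies on. This is a simplified model, not the kernel rwsem implementation; struct rwsem_model and the model_*() helpers are hypothetical names used only here.

```c
#include <stdbool.h>

/* Simplified state of a reader/writer semaphore (illustrative only). */
struct rwsem_model {
    int active_readers;     /* tasks currently holding the read side */
    bool writer_active;     /* a task currently holds the write side */
    int waiting_writers;    /* writers queued behind the readers */
};

/* A new reader must wait if a writer holds the lock or is already queued. */
bool model_read_would_block(const struct rwsem_model *s)
{
    return s->writer_active || s->waiting_writers > 0;
}

/* A writer must wait while anyone else holds the lock. */
bool model_write_would_block(const struct rwsem_model *s)
{
    return s->writer_active || s->active_readers > 0;
}

/*
 * Replay the scenario: would a thaw that only takes a read lock on
 * s_umount (TASK 3) still block?
 */
bool thaw_with_read_lock_blocks(bool remount_queued)
{
    struct rwsem_model s_umount = { 0, false, 0 };

    /* TASK 1 (flusher): down_read() succeeds, then it sleeps on s_frozen. */
    s_umount.active_readers++;

    /* TASK 2 (remount): down_write() cannot proceed, so it queues. */
    if (remount_queued && model_write_would_block(&s_umount))
        s_umount.waiting_writers++;

    /* TASK 3 (unfreeze): down_read() is stuck behind the queued writer. */
    return model_read_would_block(&s_umount);
}
```

Under these assumptions the read lock in the thaw path only helps while no writer is queued: thaw_with_read_lock_blocks(false) is false, but thaw_with_read_lock_blocks(true) is true, and the flusher holding the read side never wakes up because the thaw never runs.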

2011-02-15 18:04:42

by Theodore Ts'o

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue, Feb 15, 2011 at 06:29:54PM +0100, Jan Kara wrote:
> Sadly this does not quite work because even down_read(&sb->s_umount)
> in thaw_super() can block if there is another process that tries to acquire
> s_umount for writing - a situation like:
> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> down_read(&sb->s_umount)
> block on s_frozen
> down_write(&sb->s_umount)
> -blocked
> down_read(&sb->s_umount)
> -blocked
> behind the write access...

OK, sorry for being dense, but why does this cause a deadlock? What
are you imagining TASK 3 doing that would impede the flusher from
eventually resuming? Or how would TASK 3 prevent userspace from
completing whatever it needs to do (say, a device mapper ioctl)?

freeze_fs has always been inherently dangerous if the userspace does
not know what it's doing. If it freezes the root file system, and
then while the file system is frozen, userspace attempts to modify
/etc/mtab, it's going to lose. I've in the past argued for some kind
of safety timeout that prevents the system from wedging, but the
argument I've gotten back is (a) it's too complex, and (b) userspace
programmers aren't that stupid, and (c) it could cause the filesystem
to unfreeze when userspace wasn't expecting it. Oh, and (d) if the
system wedges up due to userspace being stupid, it's acceptable.

Obviously, if the kernel does something to itself that causes a
deadlock, we need to fix it, but userspace doing something stupid has
been explicitly ruled out of scope, at least in previous
discussions...

> And in particular ext4 has another deadlock of this kind because it does
> IO from ext4_remount() e.g. when doing online resize (I know it's a bit
> artifical but still ;).

OK, I'm being dense again. How do remount and online resize relate
to each other? And it's not I/O in general which is a problem, it's
writeback activity which causes a problem because it takes a read lock
on s_umount, right?

- Ted

2011-02-15 19:11:25

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue 15-02-11 13:04:35, Ted Ts'o wrote:
> On Tue, Feb 15, 2011 at 06:29:54PM +0100, Jan Kara wrote:
> > Sadly this does not quite work because even down_read(&sb->s_umount)
> > in thaw_super() can block if there is another process that tries to acquire
> > s_umount for writing - a situation like:
> > TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > down_read(&sb->s_umount)
> > block on s_frozen
> > down_write(&sb->s_umount)
> > -blocked
> > down_read(&sb->s_umount)
> > -blocked
> > behind the write access...
>
> OK, sorry for being dense, but why does this cause a deadlock? What
> are you imaging TASK 3 doing that would impede the flusher from
> eventually resuming? Or how would TASK 3 prevent userspace from
> completing whatever it needs to do (say, a device mapper ioctl)?
I was arguing that using down_read(&sb->s_umount) in thaw_super() instead
of down_write() does not solve anything. The deadlock as originally
reported can still happen, you just need another task (TASK 2 in the above
scheme) to block in down_write() before thaw_super() happens.

> freeze_fs has always been inherently dangerous if the userspace does
> not know what it's doing. If it freezes the root file system, and
> then while the file system is frozen, userspace attempts to modify
> /etc/mtab, it's going to lose. I've in the past argued for some kind
> of safety timeout that prevents the system from wedging, but the
> argument I've gotten back is (a) it's too complex, and (b) userspace
> programmers aren't that stupid, and (c) it could cause the filesystem
> to unfreeze when userspace wasn't expecting it. Oh, and (d) if the
> system wedges up due to userspace being stupid, it's acceptable.
>
> Obviously, if the kernel does something to itself that causes a
> deadlock, we need to fix it, but userspace doing something stupid has
> been explicitly ruled out of scope, at least in previous
> discussions...
>
> > And in particular ext4 has another deadlock of this kind because it does
> > IO from ext4_remount() e.g. when doing online resize (I know it's a bit
> > artifical but still ;).
>
> OK, I'm being dense again. How does remount and online resize relate
> with each other? and it's not I/O in general which is a problem, it's
> writeback activity which causes a problem because it takes a read lock
> on s_umount, right?
The problem is to start a transaction while holding s_umount semaphore,
or actually any lock that thaw_super() (including per-filesystem
->unfreeze_fs() callback) needs. For ext4 this seems to be sb->s_lock.
I was actually wrong with the ext4 online resizing using resize option
causing possible deadlocks because do_remount_sb() refuses to do anything
with the superblock while it is frozen... But still if we ever happen to
start a transaction in ext4 while sb->s_lock is held, the deadlock with
freezing code can happen and that's just subtle and ugly IMHO.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-02-15 23:17:46

by Toshiyuki Okajima

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

On Tue, 15 Feb 2011 18:29:54 +0100
Jan Kara <[email protected]> wrote:
> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > under s_umount semaphore, we are prone to deadlock like the one you
> > > describe above.
> >
> > One of the fundamental problems here is that the freeze and thaw
> > routines are using down_write(&sb->s_umount) for two purposes. The
> > first is to prevent the resume/thaw from racing with a umount (which
> > it could do just as well by taking a read lock), but the second is to
> > prevent the resume/thaw code from racing with itself. That's the core
> > fundamental problem here.
> >
> > So I think we can solve this by introduce a new mutex, s_freeze, and
> > having the the resume/thaw first take the s_freeze mutex and then
> > second take a read lock on the s_umount.
> Sadly this does not quite work because even down_read(&sb->s_umount)
> in thaw_super() can block if there is another process that tries to acquire
> s_umount for writing - a situation like:
> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> down_read(&sb->s_umount)
> block on s_frozen
> down_write(&sb->s_umount)
> -blocked
> down_read(&sb->s_umount)
> -blocked
> behind the write access...
>
> The only working solution I see is to check for frozen filesystem before
> taking s_umount semaphore which seems rather ugly (but might be bearable if
> we did so in some well described wrapper).
Yesterday I created the patch that you have in mind.

I also got a reproducer from Mizuma-san yesterday and ran it on a kernel
without the fix. After an hour, I confirmed that this deadlock happened.

However, on a kernel with the fix applied, the deadlock still has not
happened after 12 hours.

The patch for linux-2.6.38-rc4 is as follows:
---
 fs/fs-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59c6e49..1c9a05e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 	spin_unlock(&sb_lock);

 	if (down_read_trylock(&sb->s_umount)) {
-		if (sb->s_root)
+		if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
 			return true;
 		up_read(&sb->s_umount);
 	}
--

Best Regards,
Toshiyuki Okajima
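
The effect of the one-line change above can be sketched in userspace C. This is a model under stated assumptions, not the kernel function: struct model_sb, the MODEL_SB_* values (which mirror the kernel's freeze levels) and model_pin_sb_for_writeback() are hypothetical stand-ins, and the trylock is reduced to a check for an active writer.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's freeze levels (values assumed for the model). */
enum model_frozen {
    MODEL_SB_UNFROZEN     = 0,
    MODEL_SB_FREEZE_WRITE = 1,
    MODEL_SB_FREEZE_TRANS = 2,
};

/* Only the superblock state that the patched check looks at. */
struct model_sb {
    int s_frozen;           /* one of enum model_frozen */
    bool has_root;          /* stands in for sb->s_root != NULL */
    int umount_readers;     /* read holds on the modeled s_umount */
    bool umount_writer;     /* someone holds s_umount for writing */
};

/*
 * Patched logic: take the read lock if possible, but keep it only when
 * the filesystem is mounted and not frozen. Returns true when the caller
 * now holds the read side of s_umount and may start writeback.
 */
bool model_pin_sb_for_writeback(struct model_sb *sb)
{
    if (sb->umount_writer)
        return false;               /* models down_read_trylock() failing */

    sb->umount_readers++;           /* trylock succeeded */
    if (sb->s_frozen == MODEL_SB_UNFROZEN && sb->has_root)
        return true;

    sb->umount_readers--;           /* models up_read(): back off */
    return false;
}
```

The point of the sketch is that on a frozen superblock the flusher drops the read lock again instead of carrying it into a path that would sleep waiting for the thaw, so a later down_write() in the thaw path is not held up by this caller.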

2011-02-16 14:56:30

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> On Tue, 15 Feb 2011 18:29:54 +0100
> Jan Kara <[email protected]> wrote:
> > On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > under s_umount semaphore, we are prone to deadlock like the one you
> > > > describe above.
> > >
> > > One of the fundamental problems here is that the freeze and thaw
> > > routines are using down_write(&sb->s_umount) for two purposes. The
> > > first is to prevent the resume/thaw from racing with a umount (which
> > > it could do just as well by taking a read lock), but the second is to
> > > prevent the resume/thaw code from racing with itself. That's the core
> > > fundamental problem here.
> > >
> > > So I think we can solve this by introduce a new mutex, s_freeze, and
> > > having the the resume/thaw first take the s_freeze mutex and then
> > > second take a read lock on the s_umount.
> > Sadly this does not quite work because even down_read(&sb->s_umount)
> > in thaw_super() can block if there is another process that tries to acquire
> > s_umount for writing - a situation like:
> > TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > down_read(&sb->s_umount)
> > block on s_frozen
> > down_write(&sb->s_umount)
> > -blocked
> > down_read(&sb->s_umount)
> > -blocked
> > behind the write access...
> >
> > The only working solution I see is to check for frozen filesystem before
> > taking s_umount semaphore which seems rather ugly (but might be bearable if
> > we did so in some well described wrapper).
> I created the patch that you imagine yesterday.
>
> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>
> However, on the kernel with a fixed patch, this deadlock doesn't still happen
> after 12 hours passed.
>
> The patch for linux-2.6.38-rc4 is as follows:
> ---
> fs/fs-writeback.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 59c6e49..1c9a05e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> spin_unlock(&sb_lock);
>
> if (down_read_trylock(&sb->s_umount)) {
> - if (sb->s_root)
> + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> return true;
> up_read(&sb->s_umount);
So this is something along the lines I thought but it actually won't work
for example if sync(1) is run while the filesystem is frozen (that takes
s_umount semaphore in a different place). And generally, I'm not convinced
there are not other places that try to do IO while holding s_umount
semaphore...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-02-17 03:50:13

by Toshiyuki Okajima

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

(2011/02/16 23:56), Jan Kara wrote:
> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>> On Tue, 15 Feb 2011 18:29:54 +0100
>> Jan Kara<[email protected]> wrote:
>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>> describe above.
>>>>
>>>> One of the fundamental problems here is that the freeze and thaw
>>>> routines are using down_write(&sb->s_umount) for two purposes. The
>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>> it could do just as well by taking a read lock), but the second is to
>>>> prevent the resume/thaw code from racing with itself. That's the core
>>>> fundamental problem here.
>>>>
>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>> second take a read lock on the s_umount.
>>> Sadly this does not quite work because even down_read(&sb->s_umount)
>>> in thaw_super() can block if there is another process that tries to acquire
>>> s_umount for writing - a situation like:
>>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
>>> down_read(&sb->s_umount)
>>> block on s_frozen
>>> down_write(&sb->s_umount)
>>> -blocked
>>> down_read(&sb->s_umount)
>>> -blocked
>>> behind the write access...
>>>
>>> The only working solution I see is to check for frozen filesystem before
>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>> we did so in some well described wrapper).
>> I created the patch that you imagine yesterday.
>>
>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>
>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>> after 12 hours passed.
>>
>> The patch for linux-2.6.38-rc4 is as follows:
>> ---
>> fs/fs-writeback.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 59c6e49..1c9a05e 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>> spin_unlock(&sb_lock);
>>
>> if (down_read_trylock(&sb->s_umount)) {
>> - if (sb->s_root)
>> + if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
>> return true;
>> up_read(&sb->s_umount);

> So this is something along the lines I thought but it actually won't work
> for example if sync(1) is run while the filesystem is frozen (that takes
> s_umount semaphore in a different place). And generally, I'm not convinced
> there are not other places that try to do IO while holding s_umount
> semaphore...
OK. I understand.

This code only fixes the following path:
  writeback_inodes_wb
   -> ext4_da_writepages
    -> ext4_journal_start_sb
     -> vfs_check_frozen
It does not fix the other cases.

Must we modify the filesystem-specific code in order to fix all cases?

Regards,
Toshiyuki Okajima


2011-02-17 05:13:53

by Andreas Dilger

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
>>
>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>
>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>> after 12 hours passed.
>>>
>>> The patch for linux-2.6.38-rc4 is as follows:
>>> ---
>>> fs/fs-writeback.c | 2 +-
>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>> index 59c6e49..1c9a05e 100644
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>> spin_unlock(&sb_lock);
>>>
>>> if (down_read_trylock(&sb->s_umount)) {
>>> - if (sb->s_root)
>>> + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>> return true;
>>> up_read(&sb->s_umount);

This seems like a very low-risk fix.

>> So this is something along the lines I thought but it actually won't work
>> for example if sync(1) is run while the filesystem is frozen (that takes
>> s_umount semaphore in a different place). And generally, I'm not convinced
>> there are not other places that try to do IO while holding s_umount
>> semaphore...
>
> OK. I understand.
>
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
> -> ext4_journal_start_sb
> -> vfs_check_frozen
> But, the code doesn't fix the other cases.
>
> We must modify the local filesystem part in order to fix all cases...?

It seems worthwhile to implement the low-risk fix that covers the common case, and if/when someone hits the rare 3-process case and/or submits a patch for it then that one will be fixed also.

Cheers, Andreas






2011-02-17 10:41:26

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 16-02-11 22:13:53, Andreas Dilger wrote:
> On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> >>
> >>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>>
> >>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>> after 12 hours passed.
> >>>
> >>> The patch for linux-2.6.38-rc4 is as follows:
> >>> ---
> >>> fs/fs-writeback.c | 2 +-
> >>> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>>
> >>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>> index 59c6e49..1c9a05e 100644
> >>> --- a/fs/fs-writeback.c
> >>> +++ b/fs/fs-writeback.c
> >>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >>> spin_unlock(&sb_lock);
> >>>
> >>> if (down_read_trylock(&sb->s_umount)) {
> >>> - if (sb->s_root)
> >>> + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >>> return true;
> >>> up_read(&sb->s_umount);
>
> This seems like a very low-risk fix.
>
> >> So this is something along the lines I thought but it actually won't work
> >> for example if sync(1) is run while the filesystem is frozen (that takes
> >> s_umount semaphore in a different place). And generally, I'm not convinced
> >> there are not other places that try to do IO while holding s_umount
> >> semaphore...
> >
> > OK. I understand.
> >
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> > -> ext4_journal_start_sb
> > -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> >
> > We must modify the local filesystem part in order to fix all cases...?
>
> It seems worthwhile to implement the low-risk fix that covers the common
> case, and if/when someone hits the rare 3-process case and/or submits a
> patch for it then that one will be fixed also.
Yes, the fix is simple enough that I won't oppose it getting in as a
band aid and if we add this band aid to fs/sync.c:sync_one_sb(), it would
even be a reasonably reliable band aid. But that doesn't change the fact
that the locking is simply broken ;).

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-02-17 10:45:54

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
> >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>On Tue, 15 Feb 2011 18:29:54 +0100
> >>Jan Kara<[email protected]> wrote:
> >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> >>>>>describe above.
> >>>>
> >>>>One of the fundamental problems here is that the freeze and thaw
> >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> >>>>first is to prevent the resume/thaw from racing with a umount (which
> >>>>it could do just as well by taking a read lock), but the second is to
> >>>>prevent the resume/thaw code from racing with itself. That's the core
> >>>>fundamental problem here.
> >>>>
> >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> >>>>having the the resume/thaw first take the s_freeze mutex and then
> >>>>second take a read lock on the s_umount.
> >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> >>>in thaw_super() can block if there is another process that tries to acquire
> >>>s_umount for writing - a situation like:
> >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> >>>down_read(&sb->s_umount)
> >>> block on s_frozen
> >>> down_write(&sb->s_umount)
> >>> -blocked
> >>> down_read(&sb->s_umount)
> >>> -blocked
> >>>behind the write access...
> >>>
> >>>The only working solution I see is to check for frozen filesystem before
> >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> >>>we did so in some well described wrapper).
> >>I created the patch that you imagine yesterday.
> >>
> >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>
> >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>after 12 hours passed.
> >>
> >>The patch for linux-2.6.38-rc4 is as follows:
> >>---
> >> fs/fs-writeback.c | 2 +-
> >> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>index 59c6e49..1c9a05e 100644
> >>--- a/fs/fs-writeback.c
> >>+++ b/fs/fs-writeback.c
> >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >> spin_unlock(&sb_lock);
> >>
> >> if (down_read_trylock(&sb->s_umount)) {
> >>- if (sb->s_root)
> >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> >> return true;
> >> up_read(&sb->s_umount);
>
> > So this is something along the lines I thought but it actually won't work
> >for example if sync(1) is run while the filesystem is frozen (that takes
> >s_umount semaphore in a different place). And generally, I'm not convinced
> >there are not other places that try to do IO while holding s_umount
> >semaphore...
> OK. I understand.
>
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
> -> ext4_journal_start_sb
> -> vfs_check_frozen
> But, the code doesn't fix the other cases.
>
> We must modify the local filesystem part in order to fix all cases...?
Yes, possibly. But most importantly we should first find clear locking
rules for frozen filesystem that avoid deadlocks like the one above. And
the freezing / unfreezing code might become subtle for that reason, that's
fine, but it would be really good to avoid any complicated things for the
code in the rest of the VFS / filesystems.
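To make the three-task situation quoted above concrete, here is a minimal
userspace sketch. It is only a toy model of a fair reader/writer semaphore
(the toy_* names are made up for illustration and this is not the kernel's
rwsem code); it assumes that a queued writer blocks later readers.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore: NOT the kernel's rwsem code. */
struct toy_rwsem {
	int active_readers;	/* readers currently holding the lock */
	bool writer_active;	/* a writer currently holds the lock */
	int writers_waiting;	/* writers queued behind the active readers */
};

/* Returns true if the read lock is granted, false if the caller would block. */
bool toy_down_read(struct toy_rwsem *sem)
{
	/* A queued writer blocks new readers so that writers cannot starve. */
	if (sem->writer_active || sem->writers_waiting > 0)
		return false;
	sem->active_readers++;
	return true;
}

/* Returns true if the write lock is granted, false if the caller queues up. */
bool toy_down_write(struct toy_rwsem *sem)
{
	if (sem->writer_active || sem->active_readers > 0) {
		sem->writers_waiting++;
		return false;
	}
	sem->writer_active = true;
	return true;
}

/* Replays the three-task scenario; returns true if the thaw task would block. */
bool thaw_blocks_behind_writer(void)
{
	struct toy_rwsem s_umount = { 0, false, 0 };

	/* TASK 1 (flusher): takes s_umount for reading, then sleeps on s_frozen. */
	bool flusher_got_lock = toy_down_read(&s_umount);

	/* TASK 2 (remount): wants s_umount for writing, queues behind TASK 1. */
	bool remount_got_lock = toy_down_write(&s_umount);

	/* TASK 3 (unfreeze): even a read lock now queues behind TASK 2. */
	bool thaw_got_lock = toy_down_read(&s_umount);

	return flusher_got_lock && !remount_got_lock && !thaw_got_lock;
}
```

In this model the flusher never releases its read lock because it waits for
the thaw, and the thaw never gets the lock, which is why taking s_umount for
reading in the thaw path would not be enough.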

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-28 09:28:26

by Toshiyuki Okajima

Subject: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

On Thu, 17 Feb 2011 11:45:52 +0100
Jan Kara <[email protected]> wrote:
> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > >>Jan Kara<[email protected]> wrote:
> > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > >>>>>describe above.
> > >>>>
> > >>>>One of the fundamental problems here is that the freeze and thaw
> > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > >>>>it could do just as well by taking a read lock), but the second is to
> > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > >>>>fundamental problem here.
> > >>>>
> > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > >>>>second take a read lock on the s_umount.
> > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > >>>in thaw_super() can block if there is another process that tries to acquire
> > >>>s_umount for writing - a situation like:
> > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > >>>down_read(&sb->s_umount)
> > >>> block on s_frozen
> > >>> down_write(&sb->s_umount)
> > >>> -blocked
> > >>> down_read(&sb->s_umount)
> > >>> -blocked
> > >>>behind the write access...
> > >>>
> > >>>The only working solution I see is to check for frozen filesystem before
> > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > >>>we did so in some well described wrapper).
> > >>I created the patch that you imagine yesterday.
> > >>
> > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > >>
> > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > >>after 12 hours passed.
> > >>
> > >>The patch for linux-2.6.38-rc4 is as follows:
> > >>---
> > >> fs/fs-writeback.c | 2 +-
> > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > >>
> > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > >>index 59c6e49..1c9a05e 100644
> > >>--- a/fs/fs-writeback.c
> > >>+++ b/fs/fs-writeback.c
> > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > >> spin_unlock(&sb_lock);
> > >>
> > >> if (down_read_trylock(&sb->s_umount)) {
> > >>- if (sb->s_root)
> > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > >> return true;
> > >> up_read(&sb->s_umount);
> >
> > > So this is something along the lines I thought but it actually won't work
> > >for example if sync(1) is run while the filesystem is frozen (that takes
> > >s_umount semaphore in a different place). And generally, I'm not convinced
> > >there are not other places that try to do IO while holding s_umount
> > >semaphore...
> > OK. I understand.
> >
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> > -> ext4_journal_start_sb
> > -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> >
> > We must modify the local filesystem part in order to fix all cases...?
> Yes, possibly. But most importantly we should first find clear locking
> rules for frozen filesystem that avoid deadlocks like the one above. And
> the freezing / unfreezing code might become subtle for that reason, that's
> fine, but it would be really good to avoid any complicated things for the
> code in the rest of the VFS / filesystems.
I have continued to examine the root cause of this problem in depth, and
I have found it.

The cause is that we can write to memory that is mmapped to a file. The memory
then becomes dirty, so the flusher thread (e.g. wb_do_writeback) tries to
write it back.

Therefore, the root cause of this hang is not only the ext4 component (with
the delayed allocation feature) but also the writeback mechanism for mmap. On
other filesystems, you can still write to the filesystem even though you have
frozen it.

A sample program is attached to this mail. Run it and you can confirm that
data can be written to your filesystem while the filesystem is frozen.
(If you change the FS variable in go.sh from ext3 to ext4 and run
"fsfreeze -u mnt" manually at another prompt, you can also reproduce this deadlock.)

I think the best way to fix this problem would be to prevent users from
writing to memory that is mapped to a file while the filesystem is frozen.
However, it is very difficult to stop users from writing to memory that has
already been mapped to the file.

Therefore, I think the only practical way to resolve the mmap problem is to
stop the writeback thread. This fix also solves the original problem
(ext4 delayed write vs. unfreeze).

I created a patch for this problem. Please review it.

------------------------------------------------------------------------------
----------
reproducer
----------
[run script] go.sh
#!/bin/sh

FS=ext3
gcc -o ./write ./write.c
dd if=/dev/zero of=/tmp/loop.$$ bs=1k seek=64k count=1 > /dev/null 2>&1
/sbin/mkfs.$FS -Fq /tmp/loop.$$
/sbin/losetup /dev/loop7 /tmp/loop.$$
mkdir -p mnt
mount -t $FS /dev/loop7 mnt
dd if=/dev/zero of=mnt/file bs=4k count=100 > /dev/null 2>&1
./write mnt/file &
pid=$!
# write 0 then 1
/bin/kill -SIGUSR1 $pid
/bin/kill -SIGUSR1 $pid
/sbin/fsfreeze -f mnt
cp /tmp/loop.$$ /tmp/loop.$$.pre
/bin/kill -SIGUSR1 $pid
sync
sleep 30
cp /tmp/loop.$$ /tmp/loop.$$.post
/sbin/fsfreeze -u mnt
/bin/kill -SIGTERM $pid
umount mnt
/sbin/losetup -d /dev/loop7
/usr/bin/cmp /tmp/loop.$$.pre /tmp/loop.$$.post > /dev/null 2>&1
if [ $? -ne 0 ]; then
	echo "freeze doesn't work correctly!"
else
	echo "freeze works correctly!"
fi
rm -f /tmp/loop.$$*
exit 0

[program] write.c
#define _LARGEFILE64_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <fcntl.h>

int counter = 0;
char *mmap_addr;
int fd;

#define LOOP 100
#define UNIT 4096
#define MMAPSZ (UNIT*LOOP)
#define FILENAME "./mnt/file"

/* SIGUSR1: store the counter at the start of every page, dirtying each one */
void
write_inc(int sig)
{
	int i;

	for (i = 0; i < LOOP; i++)
		*((int*)(mmap_addr + UNIT*i)) = counter;
	counter ++;
}

/* SIGTERM: unmap the file and exit */
void
main_exit(int sig)
{
	munmap(mmap_addr, MMAPSZ);
	close(fd);
	exit(0);
}

int main(int argc, char *argv[])
{
	char *file = FILENAME;

	if ((fd = open(file, O_RDWR)) < 0) {
		perror("open error");
		exit(1);
	}
	if ((mmap_addr = mmap(0, MMAPSZ, PROT_WRITE, MAP_SHARED, fd, 0)) ==
	    MAP_FAILED) {
		perror("mmap error");
		close(fd);
		exit(2);
	}
	sigset(SIGTERM, main_exit);
	sigset(SIGUSR1, write_inc);
	while (1)
		pause();
}

[steps to reproduce]
# sh ./go.sh
------------------------------------------------------------------------------

[patch]
Currently, memory that is mapped to a file can be written while
the filesystem to which it belongs is frozen.
Therefore, the filesystem can be modified even while it is frozen.
This fix prevents the flusher thread from updating the filesystem.

Signed-off-by: Toshiyuki Okajima <[email protected]>
---
fs/fs-writeback.c | 2 +-
fs/super.c | 7 ++++++-
mm/page-writeback.c | 2 ++
3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b5ed541..2a60148 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -477,7 +477,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
spin_unlock(&sb_lock);

if (down_read_trylock(&sb->s_umount)) {
- if (sb->s_root)
+ if (sb->s_frozen == 0 && sb->s_root)
return true;
up_read(&sb->s_umount);
}
diff --git a/fs/super.c b/fs/super.c
index 8a06881..bac28c4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -432,8 +432,13 @@ void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
continue;
sb->s_count++;
spin_unlock(&sb_lock);
-
+retry:
down_read(&sb->s_umount);
+ if (sb->s_frozen > 0) {
+ up_read(&sb->s_umount);
+ cond_resched();
+ goto retry;
+ }
if (sb->s_root)
f(sb, arg);
up_read(&sb->s_umount);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31f6988..eb19642 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1058,7 +1058,9 @@ EXPORT_SYMBOL(generic_writepages);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
int ret;
+ struct super_block *sb = mapping->host->i_sb;

+ vfs_check_frozen(sb, SB_FREEZE_TRANS);
if (wbc->nr_to_write <= 0)
return 0;
if (mapping->a_ops->writepages)
--
1.5.5.6

2011-03-30 14:12:09

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hello,

On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <[email protected]> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<[email protected]> wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>> block on s_frozen
> > > >>> down_write(&sb->s_umount)
> > > >>> -blocked
> > > >>> down_read(&sb->s_umount)
> > > >>> -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >> fs/fs-writeback.c | 2 +-
> > > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >> spin_unlock(&sb_lock);
> > > >>
> > > >> if (down_read_trylock(&sb->s_umount)) {
> > > >>- if (sb->s_root)
> > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > > >> return true;
> > > >> up_read(&sb->s_umount);
> > >
> > > > So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > >
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > > -> ext4_journal_start_sb
> > > -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > >
> > > We must modify the local filesystem part in order to fix all cases...?
> > Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have deeply continued to examined the root cause of this problem, then
> I found it.
>
> It is that we can write a memory which is mmaped to a file. Then the memory
> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> "writeback" the memory.
>
> Therefore, the root cause of this hangup is not only ext4 component (with
> delayed allocation feature) but also writeback mechanism for mmap. If you
> use the other filesystem, you can write something to the filesystem though
> you have freezed the filesystem.
Well, you can write something only in the caches, not to the on disk
image. So it's not a problem as such.

> A sample problem is attached on this mail. Try to execute it then you can
> confirm that we can write some data to your filesystem while freezing the
> filesystem.
> (If you change FS variable in go.sh from ext3 to ext4 and you execute
> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>
> I think the best approach to fix this problem is to let users not to write
> memory which is mapped to a certain file while the filesystem is freezing.
> However, it is very difficult to control users not to write memory which has
> been already mapped to the file.
It is actually possible. In case of ext4, you could add a check (+ wait)
in ext4_page_mkwrite() whether the filesystem is frozen or in the process
of being frozen and if so, wait for it to get unfrozen. The only tough
problem here might be the locking as ext4_page_mkwrite() is called with
mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
But you'd have to fix all filesystems (and all paths possibly creating
dirty data) in this way.

> Therefore, I think there is only actual method that we stop writeback thread
> to resolve the mmap problem. Also, by this fix, the original problem
> (ext4 delayed write vs unfreeze) can be solved.
Hmm, I had a look at the code again and think we could fix the issue
cleanly (i.e. all possible users of s_umount) as follows: The lock
ordering will be
s_umount -> "fs frozen"
and there will be a new mutex s_freeze_mutex protecting changes of
s_frozen.

freeze_bdev() already observes this lock ordering, it will only take
s_freeze_mutex for the changes of s_frozen values. The only other code
that is relevant for the lock ordering is thaw_super() (the freezing
process is not expected to reenter kernel for the frozen filesystem).
In thaw_super() we could take s_freeze_mutex, do all the thawing work,
set s_frozen, release s_freeze_mutex and put superblock reference.

So something like the patch below - it seems to work for me, can you test
it please?

From 0939f4c2fd5d69d7d1bf7ece9a641bb561e9d0dd Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Wed, 30 Mar 2011 15:21:44 +0200
Subject: [PATCH] vfs: Fix deadlocks on frozen filesystem

When a filesystem is frozen and the flusher thread decides to do writeback
for the frozen filesystem (e.g. because pages were marked dirty by mmaped
write) we deadlock because we take s_umount semaphore and then try to write
dirty pages which blocks. In this situation there is no way to unfreeze
the filesystem because thawing code requires s_umount semaphore.

Fix the problem by removing the need to take s_umount from thawing code. Instead
we introduce a new s_freeze_mutex to provide the necessary exclusion.

Reported-by: Toshiyuki Okajima <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
include/linux/fs.h | 1 +
2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index e848649..4f74718 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,6 +77,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dentry_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
+ mutex_init(&s->s_freeze_mutex);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
/*
* The locking rules for s_lock are up to the
@@ -971,6 +972,24 @@ out:
* Syncs the super to make sure the filesystem is consistent and calls the fs's
* freeze_fs. Subsequent calls to this without first thawing the fs will return
* -EBUSY.
+ *
+ * Locking of freeze / thaw is tricky (if not messy). Freezing is protected by
+ * exclusively taking s_umount to avoid races with mount / remount / umount and
+ * also provide exclusion of concurrent freeze calls. Then we have
+ * s_freeze_mutex which protects changes to s_frozen and the call ->freeze_fs()
+ * against races with thawing code.
+ *
+ * Thawing code must not take s_umount before the filesystem is unfrozen
+ * because that would cause deadlocks (e.g. background flushing takes s_umount
+ * and then does writeback which blocks on a frozen filesystem). So we take
+ * only s_freeze_mutex, which provides us exclusion against concurrent
+ * freezing, and hold it until the thawing is finished. We are protected
+ * against superblock going away by holding an active sb reference and against
+ * remounting by the fact that the sb is frozen.
+ *
+ * Notes: s_freeze_mutex cannot be merged with bd_fsfreeze_mutex because we
+ * can freeze block devices without filesystems and also freeze filesystems
+ * not backed by block devices.
*/
int freeze_super(struct super_block *sb)
{
@@ -978,7 +997,9 @@ int freeze_super(struct super_block *sb)

atomic_inc(&sb->s_active);
down_write(&sb->s_umount);
- if (sb->s_frozen) {
+ mutex_lock(&sb->s_freeze_mutex);
+ if (sb->s_frozen != SB_UNFROZEN) {
+ mutex_unlock(&sb->s_freeze_mutex);
deactivate_locked_super(sb);
return -EBUSY;
}
@@ -986,15 +1007,18 @@ int freeze_super(struct super_block *sb)
if (sb->s_flags & MS_RDONLY) {
sb->s_frozen = SB_FREEZE_TRANS;
smp_wmb();
+ mutex_unlock(&sb->s_freeze_mutex);
up_write(&sb->s_umount);
return 0;
}

sb->s_frozen = SB_FREEZE_WRITE;
+ mutex_unlock(&sb->s_freeze_mutex);
smp_wmb();

sync_filesystem(sb);

+ mutex_lock(&sb->s_freeze_mutex);
sb->s_frozen = SB_FREEZE_TRANS;
smp_wmb();

@@ -1005,10 +1029,12 @@ int freeze_super(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem freeze failed\n");
sb->s_frozen = SB_UNFROZEN;
+ mutex_unlock(&sb->s_freeze_mutex);
deactivate_locked_super(sb);
return ret;
}
}
+ mutex_unlock(&sb->s_freeze_mutex);
up_write(&sb->s_umount);
return 0;
}
@@ -1019,14 +1045,15 @@ EXPORT_SYMBOL(freeze_super);
* @sb: the super to thaw
*
* Unlocks the filesystem and marks it writeable again after freeze_super().
+ * See freeze_super() for locking comments.
*/
int thaw_super(struct super_block *sb)
{
int error;

- down_write(&sb->s_umount);
- if (sb->s_frozen == SB_UNFROZEN) {
- up_write(&sb->s_umount);
+ mutex_lock(&sb->s_freeze_mutex);
+ if (sb->s_frozen != SB_FREEZE_TRANS) {
+ mutex_unlock(&sb->s_freeze_mutex);
return -EINVAL;
}

@@ -1039,7 +1066,7 @@ int thaw_super(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
sb->s_frozen = SB_FREEZE_TRANS;
- up_write(&sb->s_umount);
+ mutex_unlock(&sb->s_freeze_mutex);
return error;
}
}
@@ -1048,7 +1075,8 @@ out:
sb->s_frozen = SB_UNFROZEN;
smp_wmb();
wake_up(&sb->s_wait_unfrozen);
- deactivate_locked_super(sb);
+ mutex_unlock(&sb->s_freeze_mutex);
+ deactivate_super(sb);

return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7061a85..230892d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1382,6 +1382,7 @@ struct super_block {
struct dentry *s_root;
struct rw_semaphore s_umount;
struct mutex s_lock;
+ struct mutex s_freeze_mutex;
int s_count;
atomic_t s_active;
#ifdef CONFIG_SECURITY
--
1.7.1
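
The state checks in the patch above can be tried out in userspace with a
small model. This is only a sketch under stated assumptions: struct toy_sb,
toy_freeze() and toy_thaw() are hypothetical stand-ins, a pthread mutex plays
the role of s_freeze_mutex, and the sync, ->freeze_fs() and s_umount handling
of the real code is left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* State names follow the s_frozen states used in the thread. */
enum toy_frozen { TOY_UNFROZEN, TOY_FREEZE_WRITE, TOY_FREEZE_TRANS };

/* Hypothetical stand-in for struct super_block; only the fields needed here. */
struct toy_sb {
	pthread_mutex_t freeze_mutex;	/* plays the role of s_freeze_mutex */
	enum toy_frozen frozen;		/* plays the role of s_frozen */
};

/* Freeze: the real code would also hold s_umount for writing here. */
int toy_freeze(struct toy_sb *sb)
{
	int ret = 0;

	pthread_mutex_lock(&sb->freeze_mutex);
	if (sb->frozen != TOY_UNFROZEN)
		ret = -EBUSY;		/* already frozen or being frozen */
	else
		sb->frozen = TOY_FREEZE_TRANS;
	pthread_mutex_unlock(&sb->freeze_mutex);
	return ret;
}

/* Thaw: takes only the freeze mutex, never the umount lock. */
int toy_thaw(struct toy_sb *sb)
{
	int ret = 0;

	pthread_mutex_lock(&sb->freeze_mutex);
	if (sb->frozen != TOY_FREEZE_TRANS)
		ret = -EINVAL;		/* nothing to thaw */
	else
		sb->frozen = TOY_UNFROZEN;
	pthread_mutex_unlock(&sb->freeze_mutex);
	return ret;
}
```

Because the thaw path in this model depends on nothing but the freeze mutex,
a task sleeping with the umount lock held cannot keep it from running.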


2011-03-31 08:37:27

by Yongqiang Yang

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi everyone,

Amir hit a deadlock when he tested ext4 with snapshots. The deadlock
was reported at
https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
It is difficult to reproduce. Another deadlock is
reported at http://www.spinics.net/lists/linux-ext4/msg23018.html.
In fact, these two deadlocks have the same source.

Below is my analysis of the first one. Mail is not a good place to
describe parallel processes, so I have also submitted the analysis to
https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143,
where it is much more readable.

-- deadlock in ext4 with snapshot
      ext4 with snapshot calls freeze_super() to bring
      the fs into a clean state when a user takes a snapshot.

    freeze                  truncate               kjournald

                   |  ext4_ext_truncate      |
   freeze_super()  |   starts a handle       |
   sets s_frozen   |                         |
                   |  ext4_ext_truncate      |
                   |  holds i_data_sem       |
 ext4_freeze()     |                         |   commit_transaction()
  wait for updates |                         |   waits for i_data_sem
                   |  ext4_free_blocks       |
                   |  calls dquot_free_block |
                   |                         |
                   |  dquot_free_block calls |
                   |  ext4_dirty_inode       |
                   |                         |
                   |  ext4_dirty_inode       |
                   |  tries to start a handle|
                   |                         |
                   |  blocks due to s_frozen |

In ext3, ext3_freeze() prevents the journal from being updated by calling
lock_journal_updates(), and ext3_unfreeze() allows the journal to be updated
again by calling unlock_journal_updates().

In ext4, however, ext4_freeze() unlocks the journal before it returns, and
ext4 relies on s_frozen to prevent the journal from being updated. s_frozen
lives in an upper layer, so it is out of ext4's control and a deadlock can
easily happen.

Could someone explain why ext4 does this instead of following ext3?

Yongqiang.
--
Best Wishes
Yongqiang Yang

2011-03-31 08:48:40

by Yongqiang Yang

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu, Mar 31, 2011 at 4:37 PM, Yongqiang Yang <[email protected]> wrote:
> Hi everyone,
>
> Amir met a deadlock when he tested ext4 with snapshot. The deadlock
> was reported on
> https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
> It is difficult to reproduce the deadlock. There is a deadlock
> reported on http://www.spinics.net/lists/linux-ext4/msg23018.html.
> Actually, these two deadlocks come from a same source.
>
> Below are my analysis on the 1st one. Mail is not a good place to
> describe parallel processes. I have submitted the analysis to
> https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143.
> It is much more readable.
>
> -- deadlock in ext4 with snapshot
>       ext4 with snapshot calls freeze_super() to bring
>       a fs be in a clean state when a user takes a snapshot.
>
>     freeze                  truncate              kjournald
>
>                    |  ext4_ext_truncate     |
>    freeze_super()  |   starts a handle      |
>    sets s_frozen   |                        |
>                    |  ext4_ext_truncate     |
>                    |  holds i_data_sem      |
>  ext4_freeze()     |                        |   commit_transaction()
>   wait for updates |                        |   waits for i_data_sem
>                    |  ext4_free_blocks      |
>                    |  calls dquot_free_block|
>                    |                        |
>                    |  dquot_free_block call |
>                    |  ext4_dirty_inode      |
>                    |                        |
>                    |  ext4_dirty_inode      |
>                    |  trys to start a handle|
>                    |                        |
>                    |  block due to s_frozen |
>
> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
>
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.

Essentially, it is not right to block in ext4_journal_start_sb() before we
confirm that the current thread has no active handle. But that is what ext4
does, so a deadlock can easily happen. Right?
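
To illustrate the point, here is a minimal userspace sketch of a handle-start
routine that only waits for a thaw when the task has no open handle. All
names (toy_task, toy_handle, toy_journal_start) are hypothetical and this is
not ext4 code; instead of sleeping, the sketch just reports that the caller
would block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_start_result { TOY_STARTED, TOY_NESTED, TOY_WOULD_BLOCK };

/* Hypothetical journal handle with a nesting count. */
struct toy_handle {
	int refcount;
};

/* Hypothetical per-task state: a pointer to the open handle, if any. */
struct toy_task {
	struct toy_handle *handle;
	struct toy_handle storage;
};

enum toy_start_result toy_journal_start(struct toy_task *task, bool fs_frozen)
{
	/* Already inside a transaction: nest without checking the frozen state. */
	if (task->handle != NULL) {
		task->handle->refcount++;
		return TOY_NESTED;
	}

	/* Only a task with no open handle may safely wait for the thaw. */
	if (fs_frozen)
		return TOY_WOULD_BLOCK;

	task->storage.refcount = 1;
	task->handle = &task->storage;
	return TOY_STARTED;
}
```

With this ordering, the truncate task in the diagram would nest into its
running handle and finish, instead of sleeping on s_frozen while the freeze
waits for that same handle to complete.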

>
> Could someone explain why ext4 does like above but not follow ext3?
>
> Yongqiang.
> --
> Best Wishes
> Yongqiang Yang
>



--
Best Wishes
Yongqiang Yang

2011-03-31 12:02:28

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi, thanks for your review.

(2011/03/30 23:12), Jan Kara wrote:
> Hello,
>
> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<[email protected]> wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
<SNIP>
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>>
>> Therefore, the root cause of this hangup is not only ext4 component (with
>> delayed allocation feature) but also writeback mechanism for mmap. If you
>> use the other filesystem, you can write something to the filesystem though
>> you have freezed the filesystem.

> Well, you can write something only in the caches, not to the on disk
> image. So it's not a problem as such.
My reproducer uses a loopback device (/dev/loopX). With it, I have confirmed that
we can write not only to the caches but also to the loopback device. However,
I have not yet confirmed whether we can write to a real device (/dev/sdaX).

>
>> A sample problem is attached on this mail. Try to execute it then you can
>> confirm that we can write some data to your filesystem while freezing the
>> filesystem.
>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
> It is actually possible. In case of ext4, you could add a check (+ wait)
> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> of being frozen and if so, wait for it to get unfrozen. The only tough
> problem here might be the locking as ext4_page_mkwrite() is called with
> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> But you'd have to fix all filesystems (and all paths possibly creating
> dirty data) in this way.
>

>> Therefore, I think there is only actual method that we stop writeback thread
>> to resolve the mmap problem. Also, by this fix, the original problem
>> (ext4 delayed write vs unfreeze) can be solved.
> Hmm, I had a look at the code again and think we could fix the issue
> cleanly (i.e. all possible users of s_umount) as follows: The lock
> ordering will be
> s_umount -> "fs frozen"
> and there will be a new mutex s_freeze_mutex protecting changes of
> s_frozen.
>
> freeze_bdev() already observes this lock ordering, it will only take
> s_freeze_mutex for the changes of s_frozen values. The only other code
> that is relevant for the lock ordering is thaw_super() (the freezing
> process is not expected to reenter kernel for the frozen filesystem).
> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> set s_frozen, release s_freeze_mutex and put superblock reference.
>

> So something like the patch below - it seems to work for me, can you test
> it please?
Your patch looks good to me, and it seems to solve the original problem.
OK, I will test your patch.
I cannot test it this weekend, so I will reply next week.

Thanks,
Toshiyuki Okajima


2011-03-31 14:04:13

by Eric Sandeen

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 3/31/11 3:37 AM, Yongqiang Yang wrote:

> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
>
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.
>
> Could someone explain why ext4 does like above but not follow ext3?
>
> Yongqiang.

That was me, I think ...

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <[email protected]>
Date: Sun May 16 02:00:00 2010 -0400

ext4: don't return to userspace after freezing the fs with a mutex held

ext4_freeze() used jbd2_journal_lock_updates() which takes
the j_barrier mutex, and then returns to userspace. The
kernel does not like this:

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
#0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0

Use vfs_check_frozen() added to ext4_journal_start_sb() and
ext4_force_commit() instead.

Addresses-Red-Hat-Bugzilla: #568503


Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

2011-03-31 14:36:47

by Yongqiang Yang

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <[email protected]> wrote:
> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>
>> in ext3, ext3_freeze() prevents journal from being updated by
>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>> unlock_journal_updates().
>>
>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>> an upper layer, so it is out control of ext4 and deadlock is easy to
>> happen.
>>
>> Could someone explain why ext4 does like above but not follow ext3?
>>
>> Yongqiang.
>
> That was me, I think ...

Thank you, Eric.

I think ext4_journal_start() should check whether the current thread
already has an active handle before calling vfs_check_frozen(); if it
does, the current handle is returned. That way we can avoid the deadlock.

Do you agree with me? If I am right, I will send a patch.
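
A rough userspace model of the check I have in mind, not the real ext4
code: struct model_sb, struct model_handle, model_journal_start() and the
current_handle variable are hypothetical stand-ins for the super block,
handle_t, ext4_journal_start_sb() and the per-task journal handle. The
only point is the ordering: a nested start reuses the running handle
before the frozen check is reached, so a task that already holds a handle
never waits for a thaw.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real kernel types. */
enum model_frozen { MODEL_UNFROZEN = 0, MODEL_FROZEN = 1 };

struct model_handle {
	int refcount;		/* nesting depth of this handle */
};

struct model_sb {
	enum model_frozen s_frozen;
};

/* Models the handle the current task already holds, if any. */
static struct model_handle *current_handle;
static struct model_handle handle_storage;

/*
 * Returns a handle, or NULL at the point where the real code would
 * sleep in vfs_check_frozen() until the filesystem is thawed.
 */
static struct model_handle *model_journal_start(struct model_sb *sb)
{
	assert(sb != NULL);

	/* Nested start: reuse the active handle, skip the frozen check. */
	if (current_handle) {
		current_handle->refcount++;
		return current_handle;
	}

	/* Only a brand new transaction has to wait for the thaw. */
	if (sb->s_frozen != MODEL_UNFROZEN)
		return NULL;

	handle_storage.refcount = 1;
	current_handle = &handle_storage;
	return current_handle;
}
```

In this model a task that started its handle before the freeze can still
nest another start afterwards, while a task without a handle is the only
one that stops at the frozen check.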
>
> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
> Author: Eric Sandeen <[email protected]>
> Date: Sun May 16 02:00:00 2010 -0400
>
>     ext4: don't return to userspace after freezing the fs with a mutex held
>
>     ext4_freeze() used jbd2_journal_lock_updates() which takes
>     the j_barrier mutex, and then returns to userspace. The
>     kernel does not like this:
>
>     ================================================
>     [ BUG: lock held when returning to user space! ]
>     ------------------------------------------------
>     lvcreate/1075 is leaving the kernel with locks still held!
>     1 lock held by lvcreate/1075:
>      #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>     jbd2_journal_lock_updates+0xe1/0xf0
>
>     Use vfs_check_frozen() added to ext4_journal_start_sb() and
>     ext4_force_commit() instead.
>
>     Addresses-Red-Hat-Bugzilla: #568503
>
>
>     Signed-off-by: Eric Sandeen <[email protected]>
>     Signed-off-by: "Theodore Ts'o" <[email protected]>
>



--
Best Wishes
Yongqiang Yang

2011-03-31 15:25:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 3/31/11 9:36 AM, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <[email protected]> wrote:
>> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>>
>>> in ext3, ext3_freeze() prevents journal from being updated by
>>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>>> unlock_journal_updates().
>>>
>>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>>> an upper layer, so it is out control of ext4 and deadlock is easy to
>>> happen.
>>>
>>> Could someone explain why ext4 does like above but not follow ext3?
>>>
>>> Yongqiang.
>>
>> That was me, I think ...
>
> Thank you, Eric.
>
> I think ext4_journal_start() should check if current thread has an
> active handle before vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
>
> Do you agree with me? If I am right, I will send a patch.

If you have a testcase to test it with, sure. Plus, a patch would help me know for sure what you propose :)

Sorry for breaking it (if I did!). But holding a mutex and returning to userspace was pretty bad, too :(

Thanks,
-Eric

>>
>> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
>> Author: Eric Sandeen <[email protected]>
>> Date: Sun May 16 02:00:00 2010 -0400
>>
>> ext4: don't return to userspace after freezing the fs with a mutex held
>>
>> ext4_freeze() used jbd2_journal_lock_updates() which takes
>> the j_barrier mutex, and then returns to userspace. The
>> kernel does not like this:
>>
>> ================================================
>> [ BUG: lock held when returning to user space! ]
>> ------------------------------------------------
>> lvcreate/1075 is leaving the kernel with locks still held!
>> 1 lock held by lvcreate/1075:
>> #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>> jbd2_journal_lock_updates+0xe1/0xf0
>>
>> Use vfs_check_frozen() added to ext4_journal_start_sb() and
>> ext4_force_commit() instead.
>>
>> Addresses-Red-Hat-Bugzilla: #568503
>>
>>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> Signed-off-by: "Theodore Ts'o" <[email protected]>
>>
>
>
>


2011-03-31 16:28:43

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu 31-03-11 22:36:46, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <[email protected]> wrote:
> > On 3/31/11 3:37 AM, Yongqiang Yang wrote:
> >
> >> in ext3, ext3_freeze() prevents journal from being updated by
> >> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> >> unlock_journal_updates().
> >>
> >> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> >> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> >> an upper layer, so it is out control of ext4 and deadlock is easy to
> >> happen.
> >>
> >> Could someone explain why ext4 does like above but not follow ext3?
> >>
> >> Yongqiang.
> >
> > That was me, I think ...
>
> Thank you, Eric.
>
> I think ext4_journal_start() should check if current thread has an
> active handle before vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
>
> Do you agree with me? If I am right, I will send a patch.
Yes, definitely. This was exactly what I wanted to propose as well.
Thanks for looking into this.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-31 23:40:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> Hi.
>
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <[email protected]> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<[email protected]> wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>> block on s_frozen
> > > >>> down_write(&sb->s_umount)
> > > >>> -blocked
> > > >>> down_read(&sb->s_umount)
> > > >>> -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >> fs/fs-writeback.c | 2 +-
> > > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >> spin_unlock(&sb_lock);
> > > >>
> > > >> if (down_read_trylock(&sb->s_umount)) {
> > > >>- if (sb->s_root)
> > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > > >> return true;
> > > >> up_read(&sb->s_umount);
> > >
> > > > So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > >
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > > -> ext4_journal_start_sb
> > > -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > >
> > > We must modify the local filesystem part in order to fix all cases...?
> > Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have deeply continued to examined the root cause of this problem, then
> I found it.
>
> It is that we can write a memory which is mmaped to a file. Then the memory
> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> "writeback" the memory.

Then surely the issue is that .page_mkwrite is not checking that the
filesystem is frozen before allowing the page fault to continue and
dirty the page?

> I think the best approach to fix this problem is to let users not to write
> memory which is mapped to a certain file while the filesystem is freezing.
> However, it is very difficult to control users not to write memory which has
> been already mapped to the file.

If you don't allow the page to be dirtied in the first place, then
nothing needs to be done to the writeback path because there is
nothing dirty for it to write back.
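
As a userspace sketch of that ordering, with hypothetical names
throughout (model_sb, model_page, model_page_mkwrite() and
model_flusher_pass() are not kernel interfaces): it assumes the freeze
has already cleaned and write-protected every mapped page, so each later
write goes through the fault path first. The real fault handler would
presumably sleep until the thaw rather than return a retry code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real kernel types. */
enum model_frozen { MODEL_UNFROZEN = 0, MODEL_FROZEN = 1 };
enum model_fault { MODEL_FAULT_OK = 0, MODEL_FAULT_RETRY = 1 };

struct model_sb {
	enum model_frozen s_frozen;
};

struct model_page {
	struct model_sb *sb;
	bool dirty;
};

/* Write fault on a clean mapped page: refuse to dirty it while frozen. */
static enum model_fault model_page_mkwrite(struct model_page *page)
{
	assert(page != NULL && page->sb != NULL);

	if (page->sb->s_frozen != MODEL_UNFROZEN)
		return MODEL_FAULT_RETRY;	/* real code would wait for thaw */

	page->dirty = true;
	return MODEL_FAULT_OK;
}

/* Counts the pages a flusher pass would have to write back. */
static size_t model_flusher_pass(const struct model_page *pages, size_t n)
{
	size_t i, to_write = 0;

	for (i = 0; i < n; i++)
		if (pages[i].dirty)
			to_write++;
	return to_write;
}
```

With the check in the fault path, the flusher pass over a frozen
filesystem finds zero dirty pages in this model, so it never reaches a
journal start that could block.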

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-03-31 23:53:14

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 3/31/11 6:40 PM, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara <[email protected]> wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes. The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself. That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>>>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>> Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>> block on s_frozen
>>>>>>> down_write(&sb->s_umount)
>>>>>>> -blocked
>>>>>>> down_read(&sb->s_umount)
>>>>>>> -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>> fs/fs-writeback.c | 2 +-
>>>>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>> spin_unlock(&sb_lock);
>>>>>>
>>>>>> if (down_read_trylock(&sb->s_umount)) {
>>>>>> - if (sb->s_root)
>>>>>> + if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
>>>>>> return true;
>>>>>> up_read(&sb->s_umount);
>>>>
>>>>> So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> -> ext4_da_writepages
>>>> -> ext4_journal_start_sb
>>>> -> vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>> Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>
> If you don't allow the page to be dirtied in the fist place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.

I floated

[PATCH, RFC] check for frozen filesystems in the mmap path

a long time ago, but it went nowhere; maybe time to revive that approach.

-Eric

> Cheers,
>
> Dave.


2011-04-01 14:08:59

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > On Thu, 17 Feb 2011 11:45:52 +0100
> > Jan Kara <[email protected]> wrote:
> > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > >>Jan Kara<[email protected]> wrote:
> > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > >>>>>describe above.
> > > > >>>>
> > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > > > >>>>fundamental problem here.
> > > > >>>>
> > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > >>>>second take a read lock on the s_umount.
> > > > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > >>>s_umount for writing - a situation like:
> > > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > > > >>>down_read(&sb->s_umount)
> > > > >>> block on s_frozen
> > > > >>> down_write(&sb->s_umount)
> > > > >>> -blocked
> > > > >>> down_read(&sb->s_umount)
> > > > >>> -blocked
> > > > >>>behind the write access...
> > > > >>>
> > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > >>>we did so in some well described wrapper).
> > > > >>I created the patch that you imagine yesterday.
> > > > >>
> > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > >>
> > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > >>after 12 hours passed.
> > > > >>
> > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > >>---
> > > > >> fs/fs-writeback.c | 2 +-
> > > > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > >>
> > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > >>index 59c6e49..1c9a05e 100644
> > > > >>--- a/fs/fs-writeback.c
> > > > >>+++ b/fs/fs-writeback.c
> > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > >> spin_unlock(&sb_lock);
> > > > >>
> > > > >> if (down_read_trylock(&sb->s_umount)) {
> > > > >>- if (sb->s_root)
> > > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > > > >> return true;
> > > > >> up_read(&sb->s_umount);
> > > >
> > > > > So this is something along the lines I thought but it actually won't work
> > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > >there are not other places that try to do IO while holding s_umount
> > > > >semaphore...
> > > > OK. I understand.
> > > >
> > > > This code only fixes the case for the following path:
> > > > writeback_inodes_wb
> > > > -> ext4_da_writepages
> > > > -> ext4_journal_start_sb
> > > > -> vfs_check_frozen
> > > > But, the code doesn't fix the other cases.
> > > >
> > > > We must modify the local filesystem part in order to fix all cases...?
> > > Yes, possibly. But most importantly we should first find clear locking
> > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > the freezing / unfreezing code might become subtle for that reason, that's
> > > fine, but it would be really good to avoid any complicated things for the
> > > code in the rest of the VFS / filesystems.
> > I have deeply continued to examined the root cause of this problem, then
> > I found it.
> >
> > It is that we can write a memory which is mmaped to a file. Then the memory
> > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > "writeback" the memory.
>
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
And is this a bug? That isn't clear to me...

> > I think the best approach to fix this problem is to let users not to write
> > memory which is mapped to a certain file while the filesystem is freezing.
> > However, it is very difficult to control users not to write memory which has
> > been already mapped to the file.
>
> If you don't allow the page to be dirtied in the fist place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
Sure but that's only the problem he was able to hit. But generally,
there's a problem with needing s_umount for unfreezing because it isn't
clear there aren't other code paths which can block with s_umount held
waiting for fs to get unfrozen. And these code paths would cause the same
deadlock. That's why I chose to get rid of s_umount during thawing.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-05 10:24:13

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

(2011/03/31 21:03), Toshiyuki Okajima wrote:
> Hi, thanks for your reviewing.
>
> (2011/03/30 23:12), Jan Kara wrote:
>> Hello,
>>
>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>> Jan Kara<[email protected]> wrote:
>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> <SNIP>
>>> I have deeply continued to examined the root cause of this problem, then
>>> I found it.
>>>
>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>> "writeback" the memory.
>>>
>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>> use the other filesystem, you can write something to the filesystem though
>>> you have freezed the filesystem.
>
>> Well, you can write something only in the caches, not to the on disk
>> image. So it's not a problem as such.
> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> we can write in not only the caches but also the loopback device. However,
> I don't still confirm that we can write to the real device(/dev/sdaX).
>
>>
>>> A sample problem is attached on this mail. Try to execute it then you can
>>> confirm that we can write some data to your filesystem while freezing the
>>> filesystem.
>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>
>>> I think the best approach to fix this problem is to let users not to write
>>> memory which is mapped to a certain file while the filesystem is freezing.
>>> However, it is very difficult to control users not to write memory which has
>>> been already mapped to the file.
>> It is actually possible. In case of ext4, you could add a check (+ wait)
>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>> of being frozen and if so, wait for it to get unfrozen. The only tough
>> problem here might be the locking as ext4_page_mkwrite() is called with
>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>> But you'd have to fix all filesystems (and all paths possibly creating
>> dirty data) in this way.
>>
>
>>> Therefore, I think there is only actual method that we stop writeback thread
>>> to resolve the mmap problem. Also, by this fix, the original problem
>>> (ext4 delayed write vs unfreeze) can be solved.
>> Hmm, I had a look at the code again and think we could fix the issue
>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>> ordering will be
>> s_umount -> "fs frozen"
>> and there will be a new mutex s_freeze_mutex protecting changes of
>> s_frozen.
>>
>> freeze_bdev() already observes this lock ordering, it will only take
>> s_freeze_mutex for the changes of s_frozen values. The only other code
>> that is relevant for the lock ordering is thaw_super() (the freezing
>> process is not expected to reenter kernel for the frozen filesystem).
>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>
>
>> So something like the patch below - it seems to work for me, can you test
>> it please?
> I think your patch looks good, so, the original problem seems to be solved.
> OK, I will test your patch.
> This weekend I cannot test it. So, I will reply next week.
I have tested whether Mizuma-san's reproducer can still cause the deadlock
with your patch applied. No problems occurred while the reproducer was running.

I think your patch solves the original deadlock problem reported by
Mizuma-san.

> Reported-by: Toshiyuki Okajima <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
> include/linux/fs.h | 1 +
> 2 files changed, 35 insertions(+), 6 deletions(-)

However, I think the write that causes the deadlock comes from mmapped dirty
pages. So I guess we also need a fix in the mmap path for the frozen case.

> I floated
>
> [PATCH, RFC] check for frozen filesystems in the mmap path
>
> a long time ago, but it went nowhere; maybe time to revive that approach.

Thanks,
Toshiyuki Okajima


2011-04-05 10:43:19

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

(2011/04/01 8:40), Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<[email protected]> wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes. The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself. That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>>>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>> Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>> block on s_frozen
>>>>>>> down_write(&sb->s_umount)
>>>>>>> -blocked
>>>>>>> down_read(&sb->s_umount)
>>>>>>> -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>> fs/fs-writeback.c | 2 +-
>>>>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>> spin_unlock(&sb_lock);
>>>>>>
>>>>>> if (down_read_trylock(&sb->s_umount)) {
>>>>>> - if (sb->s_root)
>>>>>> + if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
>>>>>> return true;
>>>>>> up_read(&sb->s_umount);
>>>>
>>>>> So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> -> ext4_da_writepages
>>>> -> ext4_journal_start_sb
>>>> -> vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>> Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>

> If you don't allow the page to be dirtied in the first place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
OK. We can block new writes by not allowing the page to be dirtied in
the first place.

But we cannot easily stop later writes to a page which is *already mapped*.
Therefore I think writeback of such pages can be blocked only in the
flusher thread.

Thanks,
Toshiyuki Okajima


2011-04-05 22:54:33

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> (2011/03/31 21:03), Toshiyuki Okajima wrote:
> >Hi, thanks for your reviewing.
> >
> >(2011/03/30 23:12), Jan Kara wrote:
> >>Hello,
> >>
> >>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>Jan Kara<[email protected]> wrote:
> >>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>Jan Kara<[email protected]> wrote:
> >>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> ><SNIP>
> >>>I have deeply continued to examined the root cause of this problem, then
> >>>I found it.
> >>>
> >>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>"writeback" the memory.
> >>>
> >>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>use the other filesystem, you can write something to the filesystem though
> >>>you have freezed the filesystem.
> >
> >>Well, you can write something only in the caches, not to the on disk
> >>image. So it's not a problem as such.
> >My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >we can write in not only the caches but also the loopback device. However,
> >I don't still confirm that we can write to the real device(/dev/sdaX).
> >
> >>
> >>>A sample problem is attached on this mail. Try to execute it then you can
> >>>confirm that we can write some data to your filesystem while freezing the
> >>>filesystem.
> >>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>
> >>>I think the best approach to fix this problem is to let users not to write
> >>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>However, it is very difficult to control users not to write memory which has
> >>>been already mapped to the file.
> >>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>problem here might be the locking as ext4_page_mkwrite() is called with
> >>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>But you'd have to fix all filesystems (and all paths possibly creating
> >>dirty data) in this way.
> >>
> >
> >>>Therefore, I think there is only actual method that we stop writeback thread
> >>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>(ext4 delayed write vs unfreeze) can be solved.
> >>Hmm, I had a look at the code again and think we could fix the issue
> >>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>ordering will be
> >>s_umount -> "fs frozen"
> >>and there will be a new mutex s_freeze_mutex protecting changes of
> >>s_frozen.
> >>
> >>freeze_bdev() already observes this lock ordering, it will only take
> >>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>that is relevant for the lock ordering is thaw_super() (the freezing
> >>process is not expected to reenter kernel for the frozen filesystem).
> >>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>
> >
> >>So something like the patch below - it seems to work for me, can you test
> >>it please?
> >I think your patch looks good, so, the original problem seems to be solved.
> >OK, I will test your patch.
> >This weekend I cannot test it. So, I will reply next week.
> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> patch. And then any problems didn't hit while the reproducer was running.
>
> I think your patch solves the original deadlock problem which is reported by
> Mizuma-san.
Good. Thanks.

> >Reported-by: Toshiyuki Okajima <[email protected]>
> >Signed-off-by: Jan Kara <[email protected]>
> >---
> > fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
> > include/linux/fs.h | 1 +
> > 2 files changed, 35 insertions(+), 6 deletions(-)
>
> However, I think a write which causes the deadlock is from mmapped dirty
> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
Why? If you dirty a page, writeback thread can come and try to write it -
which blocks - but now that does not matter...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-06 05:09:14

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

(2011/04/06 7:54), Jan Kara wrote:
> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>> Hi, thanks for your reviewing.
>>>
>>> (2011/03/30 23:12), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>> Jan Kara<[email protected]> wrote:
>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>> <SNIP>
>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>> I found it.
>>>>>
>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>> "writeback" the memory.
>>>>>
>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>> use the other filesystem, you can write something to the filesystem though
>>>>> you have freezed the filesystem.
>>>
>>>> Well, you can write something only in the caches, not to the on disk
>>>> image. So it's not a problem as such.
>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>> we can write in not only the caches but also the loopback device. However,
>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>
>>>>
>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>> filesystem.
>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>
>>>>> I think the best approach to fix this problem is to let users not to write
>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>> However, it is very difficult to control users not to write memory which has
>>>>> been already mapped to the file.
>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>> dirty data) in this way.
>>>>
>>>
>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>> ordering will be
>>>> s_umount -> "fs frozen"
>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>> s_frozen.
>>>>
>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>
>>>
>>>> So something like the patch below - it seems to work for me, can you test
>>>> it please?
>>> I think your patch looks good, so, the original problem seems to be solved.
>>> OK, I will test your patch.
>>> This weekend I cannot test it. So, I will reply next week.
>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>> patch. And then any problems didn't hit while the reproducer was running.
>>
>> I think your patch solves the original deadlock problem which is reported by
>> Mizuma-san.
> Good. Thanks.
>
>>> Reported-by: Toshiyuki Okajima<[email protected]>
>>> Signed-off-by: Jan Kara<[email protected]>
>>> ---
>>> fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
>>> include/linux/fs.h | 1 +
>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>

>> However, I think a write which causes the deadlock is from mmapped dirty
>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> Why? If you dirty a page, writeback thread can come and try to write it -
> which blocks - but now that does not matter...
I do not understand the code around the writeback thread very well...
Could you tell me the concrete function which blocks the writes?

Mizuma-san's reproducer also writes data to memory which is mapped to a file (mmap).
The original problem happens after the fsfreeze operation is done.
I understand that the normal write operation (not mmap) can be blocked while
the filesystem is frozen. So, I guess we don't always block all write
operations while the filesystem is frozen.

Thanks
Toshiyuki Okajima


2011-04-06 05:40:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > Jan Kara <[email protected]> wrote:
> > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > >>Jan Kara<[email protected]> wrote:
> > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > >>>>>describe above.
> > > > > >>>>
> > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > > > > >>>>fundamental problem here.
> > > > > >>>>
> > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > > >>>>second take a read lock on the s_umount.
> > > > > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > >>>s_umount for writing - a situation like:
> > > > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > > > > >>>down_read(&sb->s_umount)
> > > > > >>> block on s_frozen
> > > > > >>> down_write(&sb->s_umount)
> > > > > >>> -blocked
> > > > > >>> down_read(&sb->s_umount)
> > > > > >>> -blocked
> > > > > >>>behind the write access...
> > > > > >>>
> > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > >>>we did so in some well described wrapper).
> > > > > >>I created the patch that you imagine yesterday.
> > > > > >>
> > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > >>
> > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > >>after 12 hours passed.
> > > > > >>
> > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > >>---
> > > > > >> fs/fs-writeback.c | 2 +-
> > > > > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > >>
> > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > >>index 59c6e49..1c9a05e 100644
> > > > > >>--- a/fs/fs-writeback.c
> > > > > >>+++ b/fs/fs-writeback.c
> > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > >> spin_unlock(&sb_lock);
> > > > > >>
> > > > > >> if (down_read_trylock(&sb->s_umount)) {
> > > > > >>- if (sb->s_root)
> > > > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > > > > >> return true;
> > > > > >> up_read(&sb->s_umount);
> > > > >
> > > > > > So this is something along the lines I thought but it actually won't work
> > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > >there are not other places that try to do IO while holding s_umount
> > > > > >semaphore...
> > > > > OK. I understand.
> > > > >
> > > > > This code only fixes the case for the following path:
> > > > > writeback_inodes_wb
> > > > > -> ext4_da_writepages
> > > > > -> ext4_journal_start_sb
> > > > > -> vfs_check_frozen
> > > > > But, the code doesn't fix the other cases.
> > > > >
> > > > > We must modify the local filesystem part in order to fix all cases...?
> > > > Yes, possibly. But most importantly we should first find clear locking
> > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > fine, but it would be really good to avoid any complicated things for the
> > > > code in the rest of the VFS / filesystems.
> > > I have deeply continued to examined the root cause of this problem, then
> > > I found it.
> > >
> > > It is that we can write a memory which is mmaped to a file. Then the memory
> > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > "writeback" the memory.
> >
> > Then surely the issue is that .page_mkwrite is not checking that the
> > filesystem is frozen before allowing the page fault to continue and
> > dirty the page?
> And is this a bug? That isn't clear to me...

Given the semantics of a frozen filesystem, letting any object be
dirtied while frozen (be it an inode, a page, a metadata block, etc)
is definitely a bug.

The way the freeze code is architected is that incoming dirtying
events are prevented so that the writeback side does not need to
care about the frozen state of the filesystem at all. The freeze
operation is supposed to block new dirtiers, then flush all dirty
objects resulting in everything being clean in the filesystem.

Hence if no objects are being dirtied, then there should never be
any need to block writeback threads due to the filesystem being
frozen because, by definition, there should be no work for them to
do. Hence if objects are being dirtied while the filesystem is
frozen, then that is a bug.
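
As a rough illustration of that design, here is a userspace toy model (not
kernel code; every name in it is hypothetical). It assumes a freeze that
first closes a gate for new dirtiers and then flushes, so the writeback
side only ever looks at the dirty count and never at the frozen state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_sb {
	bool frozen;        /* set once the freeze has started */
	int  dirty_objects; /* inodes, pages, metadata blocks, ... */
};

/* Every dirtier must pass this gate. The kernel would sleep here;
 * the model just reports that the caller would have to wait. */
static bool model_try_dirty(struct model_sb *sb)
{
	if (sb->frozen)
		return false;
	sb->dirty_objects++;
	return true;
}

/* Freeze: block new dirtiers first, then flush what is already dirty. */
static void model_freeze(struct model_sb *sb)
{
	sb->frozen = true;
	sb->dirty_objects = 0; /* stands in for synchronous writeback */
}

/* Writeback never reads sb->frozen; it only sees the dirty count. */
static int model_writeback_work(const struct model_sb *sb)
{
	return sb->dirty_objects;
}
```

In this model, any path that bumps dirty_objects without going through
model_try_dirty() is the bug: it hands the writeback side work that it
cannot complete while frozen.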

> > > I think the best approach to fix this problem is to let users not to write
> > > memory which is mapped to a certain file while the filesystem is freezing.
> > > However, it is very difficult to control users not to write memory which has
> > > been already mapped to the file.
> >
> > If you don't allow the page to be dirtied in the first place, then
> > nothing needs to be done to the writeback path because there is
> > nothing dirty for it to write back.
> Sure but that's only the problem he was able to hit. But generally,
> there's a problem with needing s_umount for unfreezing because it isn't
> clear there aren't other code paths which can block with s_umount held
> waiting for fs to get unfrozen. And these code paths would cause the same
> deadlock. That's why I chose to get rid of s_umount during thawing.

Holding the s_umount lock while checking if frozen and sleeping
is essentially an ABBA lock inversion bug that can bite in many more
places than just thawing the filesystem. Anywhere this is done
should be fixed, so I don't think just removing the s_umount lock
from the thaw path is sufficient to avoid problems.
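
To spell out the inversion, here is a single-threaded userspace sketch (a toy
model with hypothetical names, not kernel code). It assumes the thaw path
needs s_umount for write, as in the thaw_super() trace from the original
report, and uses trylock-style helpers so that "would block" shows up as a
return value instead of an actual hang:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer semaphore: just counters, no real blocking. */
struct model_rwsem {
	int  readers;
	bool writer;
};

static bool model_down_read_trylock(struct model_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

static bool model_down_write_trylock(struct model_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

static void model_up_read(struct model_rwsem *s)  { s->readers--; }
static void model_up_write(struct model_rwsem *s) { s->writer = false; }

struct model_sb {
	struct model_rwsem s_umount;
	bool frozen;
};

/* Flusher side: takes s_umount for read, then finds the fs frozen.
 * Returns true if it would now sleep with s_umount still held. */
static bool model_flusher_stuck(struct model_sb *sb)
{
	if (!model_down_read_trylock(&sb->s_umount))
		return false;
	if (sb->frozen)
		return true; /* A held, waiting for B (the thaw) */
	model_up_read(&sb->s_umount);
	return false;
}

/* Thaw side: needs s_umount for write before it can clear frozen.
 * Returns false if it would block in down_write(). */
static bool model_thaw(struct model_sb *sb)
{
	if (!model_down_write_trylock(&sb->s_umount))
		return false; /* B cannot happen until A is released */
	sb->frozen = false;
	model_up_write(&sb->s_umount);
	return true;
}
```

With the filesystem frozen, model_flusher_stuck() followed by model_thaw()
leaves each side waiting for the other, which is the cycle seen in the
report.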

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-06 05:57:13

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> (2011/04/06 7:54), Jan Kara wrote:
> >On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>Hi, thanks for your reviewing.
> >>>
> >>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>Jan Kara<[email protected]> wrote:
> >>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>Jan Kara<[email protected]> wrote:
> >>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>><SNIP>
> >>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>I found it.
> >>>>>
> >>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>"writeback" the memory.
> >>>>>
> >>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>you have freezed the filesystem.
> >>>
> >>>>Well, you can write something only in the caches, not to the on disk
> >>>>image. So it's not a problem as such.
> >>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>we can write in not only the caches but also the loopback device. However,
> >>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>
> >>>>
> >>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>filesystem.
> >>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>
> >>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>However, it is very difficult to control users not to write memory which has
> >>>>>been already mapped to the file.
> >>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>dirty data) in this way.
> >>>>
> >>>
> >>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>ordering will be
> >>>>s_umount -> "fs frozen"
> >>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>s_frozen.
> >>>>
> >>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>
> >>>
> >>>>So something like the patch below - it seems to work for me, can you test
> >>>>it please?
> >>>I think your patch looks good, so, the original problem seems to be solved.
> >>>OK, I will test your patch.
> >>>This weekend I cannot test it. So, I will reply next week.
> >>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>patch. And then any problems didn't hit while the reproducer was running.
> >>
> >>I think your patch solves the original deadlock problem which is reported by
> >>Mizuma-san.
> > Good. Thanks.
> >
> >>>Reported-by: Toshiyuki Okajima<[email protected]>
> >>>Signed-off-by: Jan Kara<[email protected]>
> >>>---
> >>>fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
> >>>include/linux/fs.h | 1 +
> >>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>
>
> >>However, I think a write which causes the deadlock is from mmapped dirty
> >>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> > Why? If you dirty a page, writeback thread can come and try to write it -
> >which blocks - but now that does not matter...
> I have not understood the code around writeback thread very much...
> Please explain me the concrete function name which blocks some writes?
It would block in ext4_da_writepages() function.

> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> The original problem happens after the fsfreeze operation is done.
> I understand the normal write operation (not mmap) can be blocked while
> fsfreezing. So, I guess we don't always block all the write operation
> while fsfreezing.
Technically speaking, we block all the transaction starts which means we
end up blocking all the writes from going to disk. But that does not mean
we block all the writes from going to in-memory cache - as you correctly
note, the mmap case is one such exception.
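
A small userspace sketch of that distinction (a toy model; all names are
hypothetical, and model_journal_start() only stands in for the frozen check
done on a transaction start):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one filesystem's cache and disk state. */
struct model_fs {
	bool frozen;
	int  dirty_pages;    /* pages dirty in the in-memory cache */
	int  blocks_on_disk; /* blocks that actually reached the disk */
};

/* A store through an existing mmap: dirties the cache and, in the
 * behaviour discussed in this thread, never checks the frozen state. */
static void model_mmap_store(struct model_fs *fs)
{
	fs->dirty_pages++;
}

/* Transaction start: refused while frozen (the kernel would sleep). */
static bool model_journal_start(const struct model_fs *fs)
{
	return !fs->frozen;
}

/* Write back one dirty page. Returns false if it would block. */
static bool model_writepage(struct model_fs *fs)
{
	if (fs->dirty_pages == 0)
		return true; /* nothing to do */
	if (!model_journal_start(fs))
		return false; /* blocked behind the freeze */
	fs->dirty_pages--;
	fs->blocks_on_disk++;
	return true;
}
```

While frozen, the store lands in the cache but nothing reaches the disk;
the page is written out only after the thaw.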

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-06 06:18:56

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > > Jan Kara <[email protected]> wrote:
> > > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > > >>Jan Kara<[email protected]> wrote:
> > > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > > >>>>>describe above.
> > > > > > >>>>
> > > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes. The
> > > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > > >>>>prevent the resume/thaw code from racing with itself. That's the core
> > > > > > >>>>fundamental problem here.
> > > > > > >>>>
> > > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > > > >>>>second take a read lock on the s_umount.
> > > > > > >>> Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > > >>>s_umount for writing - a situation like:
> > > > > > >>> TASK 1 (e.g. flusher) TASK 2 (e.g. remount) TASK 3 (unfreeze)
> > > > > > >>>down_read(&sb->s_umount)
> > > > > > >>> block on s_frozen
> > > > > > >>> down_write(&sb->s_umount)
> > > > > > >>> -blocked
> > > > > > >>> down_read(&sb->s_umount)
> > > > > > >>> -blocked
> > > > > > >>>behind the write access...
> > > > > > >>>
> > > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > > >>>we did so in some well described wrapper).
> > > > > > >>I created the patch that you imagine yesterday.
> > > > > > >>
> > > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > > >>
> > > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > > >>after 12 hours passed.
> > > > > > >>
> > > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > > >>---
> > > > > > >> fs/fs-writeback.c | 2 +-
> > > > > > >> 1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > > >>
> > > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > > >>index 59c6e49..1c9a05e 100644
> > > > > > >>--- a/fs/fs-writeback.c
> > > > > > >>+++ b/fs/fs-writeback.c
> > > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > > >> spin_unlock(&sb_lock);
> > > > > > >>
> > > > > > >> if (down_read_trylock(&sb->s_umount)) {
> > > > > > >>- if (sb->s_root)
> > > > > > >>+ if (sb->s_frozen == SB_UNFROZEN&& sb->s_root)
> > > > > > >> return true;
> > > > > > >> up_read(&sb->s_umount);
> > > > > >
> > > > > > > So this is something along the lines I thought but it actually won't work
> > > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > > >there are not other places that try to do IO while holding s_umount
> > > > > > >semaphore...
> > > > > > OK. I understand.
> > > > > >
> > > > > > This code only fixes the case for the following path:
> > > > > > writeback_inodes_wb
> > > > > > -> ext4_da_writepages
> > > > > > -> ext4_journal_start_sb
> > > > > > -> vfs_check_frozen
> > > > > > But, the code doesn't fix the other cases.
> > > > > >
> > > > > > We must modify the local filesystem part in order to fix all cases...?
> > > > > Yes, possibly. But most importantly we should first find clear locking
> > > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > > fine, but it would be really good to avoid any complicated things for the
> > > > > code in the rest of the VFS / filesystems.
> > > > I have deeply continued to examined the root cause of this problem, then
> > > > I found it.
> > > >
> > > > It is that we can write a memory which is mmaped to a file. Then the memory
> > > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > > "writeback" the memory.
> > >
> > > Then surely the issue is that .page_mkwrite is not checking that the
> > > filesystem is frozen before allowing the page fault to continue and
> > > dirty the page?
> > And is this a bug? That isn't clear to me...
>
> Given the semantics of a frozen filesystem, letting any object be
> dirtied while frozen (be it an inode, a page, a metadata block, etc)
> is definitely a bug.
>
> The way the freeze code is architected is that incoming dirtying
> events are prevented so that the writeback side does not need to
> care about the frozen state of the filesystem at all. The freeze
> operation is supposed to block new dirtiers, then flush all dirty
> objects resulting in everything being clean in the filesystem.
>
> Hence if no objects are being dirtied, then there should never be
> any need to block writeback threads due to the filesystem being
> frozen because, by definition, there should be no work for them to
> do. Hence if objects are being dirtied while the filesystem is
> frozen, then that is a bug.
OK, after some thought I am starting to agree with you that it would be nice
if we didn't allow the pages to be dirtied in the first place. Otherwise
things get a bit fragile as writing a data block does *not* need a
transaction start as such (we just happen to do it in all code paths)...

> > > > I think the best approach to fix this problem is to let users not to write
> > > > memory which is mapped to a certain file while the filesystem is freezing.
> > > > However, it is very difficult to control users not to write memory which has
> > > > been already mapped to the file.
> > >
> > > If you don't allow the page to be dirtied in the first place, then
> > > nothing needs to be done to the writeback path because there is
> > > nothing dirty for it to write back.
> > Sure but that's only the problem he was able to hit. But generally,
> > there's a problem with needing s_umount for unfreezing because it isn't
> > clear there aren't other code paths which can block with s_umount held
> > waiting for fs to get unfrozen. And these code paths would cause the same
> > deadlock. That's why I chose to get rid of s_umount during thawing.
>
> Holding the s_umount lock while checking if frozen and sleeping
> is essentially an ABBA lock inversion bug that can bite in many more
> places than just thawing the filesystem. Anywhere this is done should
> be fixed, so I don't think just removing the s_umount lock from the thaw
> path is sufficient to avoid problems.
That's easily said but hard to do - any transaction start in ext3/4 may
block on filesystem being frozen (this seems to be similar for XFS as I'm
looking into the code) and transaction start traditionally nests inside
s_umount (and basically there's no way around that since sync() calls your
fs code with s_umount held). So I'm afraid we are not going to get rid of
this ABBA dependency unless we declare that s_umount ranks above filesystem
being frozen - but surely I'm open to suggestions.

Another possibility is just to hide the problem e.g. by checking for frozen
filesystem whenever we try to get s_umount. But that looks a bit ugly to
me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-06 07:38:45

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

(2011/04/06 14:57), Jan Kara wrote:
> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>> (2011/04/06 7:54), Jan Kara wrote:
>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>> Hi, thanks for your reviewing.
>>>>>
>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> <SNIP>
>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>> I found it.
>>>>>>>
>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>> "writeback" the memory.
>>>>>>>
>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>> you have freezed the filesystem.
>>>>>
>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>> image. So it's not a problem as such.
>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>> we can write in not only the caches but also the loopback device. However,
>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>
>>>>>>
>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>> filesystem.
>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>
>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>> been already mapped to the file.
>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>> dirty data) in this way.
>>>>>>
>>>>>
>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>> ordering will be
>>>>>> s_umount -> "fs frozen"
>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>> s_frozen.
>>>>>>
>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>
>>>>>
>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>> it please?
>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>> OK, I will test your patch.
>>>>> This weekend I cannot test it. So, I will reply next week.
>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>
>>>> I think your patch solves the original deadlock problem which is reported by
>>>> Mizuma-san.
>>> Good. Thanks.
>>>
>>>>> Reported-by: Toshiyuki Okajima<[email protected]>
>>>>> Signed-off-by: Jan Kara<[email protected]>
>>>>> ---
>>>>> fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
>>>>> include/linux/fs.h | 1 +
>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>
>>
>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>> Why? If you dirty a page, writeback thread can come and try to write it -
>>> which blocks - but now that does not matter...

>> I have not understood the code around writeback thread very much...
>> Please explain me the concrete function name which blocks some writes?
> It would block in ext4_da_writepages() function.
I understand that it blocks in the ext4 delayed-allocation case.
(The original deadlock is exactly this case.)
But for ext4 without delayed allocation, or for other filesystems, which
function blocks the write?

>
>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>> The original problem happens after the fsfreeze operation is done.
>> I understand the normal write operation (not mmap) can be blocked while
>> fsfreezing. So, I guess we don't always block all the write operation
>> while fsfreezing.
> Technically speaking, we block all the transaction starts which means we
> end up blocking all the writes from going to disk. But that does not mean
> we block all the writes from going to in-memory cache - as you properly
> note the mmap case is one of such exceptions.
Hm, I also think we can allow writes to the in-memory cache but must not allow
writes to disk while the filesystem is frozen. My concern is that the mmap path
can write to disk while the filesystem is frozen, because this deadlock happens
after the fsfreeze operation has completed...

Thanks,
Toshiyuki Okajima


2011-04-06 11:21:41

by Dave Chinner

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > If you don't allow the page to be dirtied in the first place, then
> > > > nothing needs to be done to the writeback path because there is
> > > > nothing dirty for it to write back.
> > > Sure but that's only the problem he was able to hit. But generally,
> > > there's a problem with needing s_umount for unfreezing because it isn't
> > > clear there aren't other code paths which can block with s_umount held
> > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > deadlock. That's why I chose to get rid of s_umount during thawing.
> >
> > Holding the s_umount lock while checking if frozen and sleeping
> > is essentially an ABBA lock inversion bug that can bite in many more
> > places than just thawing the filesystem. Anywhere this is done should
> > be fixed, so I don't think just removing the s_umount lock from the thaw
> > path is sufficient to avoid problems.
> That's easily said but hard to do - any transaction start in ext3/4 may
> block on filesystem being frozen (this seems to be similar for XFS as I'm
> looking into the code) and transaction start traditionally nests inside
> s_umount (and basically there's no way around that since sync() calls your
> fs code with s_umount held).

Sure, but the question must be asked - why is ext3/4 even starting a
transaction on a clean filesystem during sync? A frozen filesystem,
by definition, is a clean filesystem, and therefore sync calls of any
kind should not be trying to write to the FS or start transactions.
XFS does this just fine, so I'd consider such behaviour on a frozen
filesystem a bug in ext3/4...

> So I'm afraid we are not going to get rid of
> this ABBA dependency unless we declare that s_umount ranks above filesystem
> being frozen - but surely I'm open to suggestions.

Not sure I understand what you are saying there - this is already
the case, isn't it? i.e. it has to be held exclusive to freeze a
filesystem...

> Another possibility is just to hide the problem e.g. by checking for frozen
> filesystem whenever we try to get s_umount. But that looks a bit ugly to
> me.

And not necessary, AFAICT.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-06 13:44:28

by Christoph Hellwig

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesystem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

XFS does have one special case for this. When writing the dummy log
record at the end of the freeze process we use _xfs_alloc_trans to
bypass the frozen filesystem check as we have to write out this record
when the filesystem already is frozen. But that's after the main
sync with its normal transactions.


2011-04-06 17:40:01

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > If you don't allow the page to be dirtied in the first place, then
> > > > > nothing needs to be done to the writeback path because there is
> > > > > nothing dirty for it to write back.
> > > > Sure but that's only the problem he was able to hit. But generally,
> > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > clear there aren't other code paths which can block with s_umount held
> > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > >
> > > Holding the s_umount lock while checking if frozen and sleeping
> > > is essentially an ABBA lock inversion bug that can bite in many more
> > > places than just thawing the filesystem. Anywhere this is done should
> > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > path is sufficient to avoid problems.
> > That's easily said but hard to do - any transaction start in ext3/4 may
> > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > looking into the code) and transaction start traditionally nests inside
> > s_umount (and basically there's no way around that since sync() calls your
> > fs code with s_umount held).
>
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesystem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...
But by this you are essentially agreeing that the lock inversion is there
in principle. We just hide it by relying on the fact that no code path
trying to change anything with s_umount held (which is the right lock
ordering) gets called while the fs is frozen. And that is fragile.
Actually, I've looked for a while and if you call quotactl(), it will get
s_umount and then tell filesystem to update quota information which blocks
inside the fs waiting for filesystem being unfrozen => deadlock. We can
change this code path to wait for frozen filesystem before taking s_umount
but essentially that just restates my point - it's fragile and IMHO we
need some more consistent way to handle this...

> > So I'm afraid we are not going to get rid of
> > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > being frozen - but surely I'm open to suggestions.
>
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
Not really. We freeze the fs under s_umount but freezing essentially
implements trylock semantics while setting s_frozen so that does not really
establish any lock dependency. What establishes lock dependency is the
thawing path which blocks on s_umount while the filesystem is still frozen.
And this dependency is the other way around - i.e., freezing above
s_umount. This is why I was messing with thawing code to fix this...
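
To make the two dependencies concrete, here is a minimal userspace sketch
(not kernel code). It records "holds A while waiting for B" edges in a tiny
graph and reports whether they form a cycle. The names lock_graph,
add_dependency() and has_inversion() are made up for illustration; only
s_umount and the frozen state come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace model; these are not kernel types. */
enum lock_class { S_UMOUNT, FS_FROZEN, NR_CLASSES };

struct lock_graph {
	/* edge[a][b]: some code path waits for b while holding a */
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record that a path holds 'held' while it waits for 'wanted'. */
static void add_dependency(struct lock_graph *g, enum lock_class held,
			   enum lock_class wanted)
{
	assert(held != wanted);
	g->edge[held][wanted] = true;
}

/* Depth-first search: can 'to' be reached from 'from' along edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (g->edge[from][to])
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reaches(g, next, to, seen))
			return true;
	}
	return false;
}

/* An inversion exists if any class can reach itself. */
static bool has_inversion(const struct lock_graph *g)
{
	for (int c = 0; c < NR_CLASSES; c++) {
		bool seen[NR_CLASSES] = { false };

		if (reaches(g, c, c, seen))
			return true;
	}
	return false;
}
```

Recording only the sync-style edge (s_umount held, waiting for thaw) gives no
cycle. Adding the thaw-path edge (fs frozen, waiting for s_umount) makes
has_inversion() return true, which is the ABBA situation being argued about.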

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-06 17:46:21

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hello,

On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
> (2011/04/06 14:57), Jan Kara wrote:
> >On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> >>(2011/04/06 7:54), Jan Kara wrote:
> >>>On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>>>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>>>Hi, thanks for your reviewing.
> >>>>>
> >>>>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>>>Hello,
> >>>>>>
> >>>>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>>>Jan Kara<[email protected]> wrote:
> >>>>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>>>Jan Kara<[email protected]> wrote:
> >>>>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>><SNIP>
> >>>>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>>>I found it.
> >>>>>>>
> >>>>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>>>"writeback" the memory.
> >>>>>>>
> >>>>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>>>you have freezed the filesystem.
> >>>>>
> >>>>>>Well, you can write something only in the caches, not to the on disk
> >>>>>>image. So it's not a problem as such.
> >>>>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>>>we can write in not only the caches but also the loopback device. However,
> >>>>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>>>
> >>>>>>
> >>>>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>>>filesystem.
> >>>>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>>>
> >>>>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>>>However, it is very difficult to control users not to write memory which has
> >>>>>>>been already mapped to the file.
> >>>>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>>>dirty data) in this way.
> >>>>>>
> >>>>>
> >>>>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>>>ordering will be
> >>>>>>s_umount -> "fs frozen"
> >>>>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>>>s_frozen.
> >>>>>>
> >>>>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>>>
> >>>>>
> >>>>>>So something like the patch below - it seems to work for me, can you test
> >>>>>>it please?
> >>>>>I think your patch looks good, so, the original problem seems to be solved.
> >>>>>OK, I will test your patch.
> >>>>>This weekend I cannot test it. So, I will reply next week.
> >>>>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>>>patch. And then any problems didn't hit while the reproducer was running.
> >>>>
> >>>>I think your patch solves the original deadlock problem which is reported by
> >>>>Mizuma-san.
> >>> Good. Thanks.
> >>>
> >>>>>Reported-by: Toshiyuki Okajima<[email protected]>
> >>>>>Signed-off-by: Jan Kara<[email protected]>
> >>>>>---
> >>>>>fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
> >>>>>include/linux/fs.h | 1 +
> >>>>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>>>
> >>
> >>>>However, I think a write which causes the deadlock is from mmapped dirty
> >>>>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> >>> Why? If you dirty a page, writeback thread can come and try to write it -
> >>>which blocks - but now that does not matter...
>
> >>I have not understood the code around writeback thread very much...
> >>Please explain me the concrete function name which blocks some writes?
> > It would block in ext4_da_writepages() function.
> In ext4 with delayed allocation case, I understand it blocks.
> (Original deadlock problem is just this case.)
> But in ext4 without delayed allocation or other filesystems case, which function
> can block writing?
For ext3 or ext4 without delayed allocation we block inside the writepage()
function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
get modified to block while minor-faulting the page on frozen fs because
when blocks are already allocated we may skip starting a transaction and so
we could possibly modify the filesystem.
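
A rough userspace model of that difference, only as a sketch: the state
names are borrowed from the s_frozen values, while struct model_sb,
would_wait() and both mkwrite_*() helpers are hypothetical and exist only
for this illustration. The first helper relies on transaction start to
notice the freeze; the second checks the frozen state before dirtying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Freeze levels; names follow s_frozen, the struct is a made-up model. */
enum model_frozen { SB_UNFROZEN = 0, SB_FREEZE_WRITE = 1, SB_FREEZE_TRANS = 2 };

struct model_sb {
	enum model_frozen s_frozen;
};

enum mkwrite_result {
	MKWRITE_DIRTIED,	/* page became writable and dirty */
	MKWRITE_MUST_WAIT,	/* fault would sleep until the fs is thawed */
};

/* A caller at 'level' has to wait while the fs is frozen to that level. */
static bool would_wait(const struct model_sb *sb, enum model_frozen level)
{
	assert(sb != NULL);
	return sb->s_frozen >= level;
}

/*
 * Model of relying on transaction start alone: if the blocks are already
 * allocated no transaction is started, so nothing notices the freeze.
 */
static enum mkwrite_result mkwrite_via_transaction(const struct model_sb *sb,
						   bool blocks_allocated)
{
	if (!blocks_allocated && would_wait(sb, SB_FREEZE_TRANS))
		return MKWRITE_MUST_WAIT;
	return MKWRITE_DIRTIED;
}

/* Model of the suggested change: check the frozen state up front. */
static enum mkwrite_result mkwrite_with_check(const struct model_sb *sb,
					      bool blocks_allocated)
{
	if (would_wait(sb, SB_FREEZE_WRITE))
		return MKWRITE_MUST_WAIT;
	return mkwrite_via_transaction(sb, blocks_allocated);
}
```

With s_frozen set to SB_FREEZE_TRANS and blocks already allocated, the first
helper reports the page as dirtied while the second reports that the fault
must wait.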

> >>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>The original problem happens after the fsfreeze operation is done.
> >>I understand the normal write operation (not mmap) can be blocked while
> >>fsfreezing. So, I guess we don't always block all the write operation
> >>while fsfreezing.
> > Technically speaking, we block all the transaction starts which means we
> >end up blocking all the writes from going to disk. But that does not mean
> >we block all the writes from going to in-memory cache - as you properly
> >note the mmap case is one of such exceptions.
> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> the writes to disk while fsfreezing. I am considering that mmap path can
> write to disk while fsfreezing because this deadlock problem happens after
> fsfreeze operation is done...
I'm sorry I don't understand now - are you speaking about the case above
when writepage() does not wait for filesystem being frozen or something
else?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-06 22:54:01

by Dave Chinner

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > If you don't allow the page to be dirtied in the first place, then
> > > > > > nothing needs to be done to the writeback path because there is
> > > > > > nothing dirty for it to write back.
> > > > > Sure but that's only the problem he was able to hit. But generally,
> > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > clear there aren't other code paths which can block with s_umount held
> > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > >
> > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > places than just thawing the filesystem. Anywhere this is done should
> > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > path is sufficient to avoid problems.
> > > That's easily said but hard to do - any transaction start in ext3/4 may
> > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > looking into the code) and transaction start traditionally nests inside
> > > s_umount (and basically there's no way around that since sync() calls your
> > > fs code with s_umount held).
> >
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesystem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
> But by this you are essentially agreeing that the lock inversion is there
> in principle. We just hide it by relying on the fact that no code path
> trying to change anything with s_umount held (which is the right lock
> ordering) gets called while the fs is frozen. And that is fragile.

It's just another lock ordering rule. i.e. don't sleep on a frozen
filesystem with s_umount held. It's no more fragile than the many
other lock ordering rules we have.

> Actually, I've looked for a while and if you call quotactl(), it will get
> s_umount and then tell filesystem to update quota information which blocks
> inside the fs waiting for filesystem being unfrozen => deadlock.

Which is a bug according to the above locking rule.

> We can
> change this code path to wait for frozen filesystem before taking s_umount
> but essentially that just restates my point - it's fragile and IMHO we
> need some more consistent way to handle this...
>
> > > So I'm afraid we are not going to get rid of
> > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > being frozen - but surely I'm open to suggestions.
> >
> > Not sure I understand what you are saying there - this is already
> > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > filesystem...
> Not really. We freeze the fs under s_umount but freezing essentially
> implements trylock semantics while setting s_frozen so that does not really
> establish any lock dependency. What establishes lock dependency is the
> thawing path which blocks on s_umount while the filesystem is still frozen.
> And this dependency is the other way around - i.e., freezing above
> s_umount. This is why I was messing with thawing code to fix this...

It's just the tip of the iceberg. If we allow s_umount to be held
while waiting on a frozen filesystem, we open ourselves up to all
manner of problems. Such as umount hanging on a frozen fs,
(which means a shutdown with a frozen filesystem will hang), it can
hang sync, it can hang memory reclaim, it can hang in any path that
takes s_umount and hence do all sorts of bad things.

Yes, thawing the filesystem will get things moving again with your
patch, but my point is that it simply does not address the problems
caused by the bad behaviour that has already occurred while the FS
is frozen. Fixing the thaw code in this way is like shooting the
messenger - it doesn't fix the problems being reported.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-06 22:59:37

by Dave Chinner

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed, Apr 06, 2011 at 09:44:28AM -0400, Christoph Hellwig wrote:
> On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesystem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
>
> XFS does have one special case for this. When writing the dummy log
> record at the end of the freeze process we use _xfs_alloc_trans to
> bypass the frozen filesystem check as we have to write out this record
> when the filesystem already is frozen. But that's after the main
> sync with its normal transactions.

Right, that is a special case in the _freeze process_ (i.e. before
we've declared the FS frozen), not a normal operation on a frozen
filesystem.

If you want to list exceptions (i.e. where we explicitly avoid
writes to frozen fs), look for xfs_fs_writeable(), which stops
various write operations from proceeding when the fs is either
frozen, read-only or shut down.
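
As a sketch of that kind of guard, assuming nothing about the real XFS
structures: struct model_mount, its fields, model_fs_writeable() and
model_background_update() are all hypothetical names for a userspace model.
The point is only that a background writer asks first and skips the work
instead of sleeping on a frozen filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount model; the field names are illustrative, not XFS's. */
struct model_mount {
	bool frozen;
	bool read_only;
	bool shut_down;
};

/* Writes may proceed only if none of the three conditions holds. */
static bool model_fs_writeable(const struct model_mount *mp)
{
	assert(mp != NULL);
	return !(mp->frozen || mp->read_only || mp->shut_down);
}

/*
 * Example caller: an optional background update that is skipped rather
 * than blocked when the filesystem is not writeable.
 * Returns 1 if the update was applied, 0 if it was skipped.
 */
static int model_background_update(const struct model_mount *mp, int *counter)
{
	if (!model_fs_writeable(mp))
		return 0;
	(*counter)++;
	return 1;
}
```

Skipping instead of waiting is what keeps such a path from ever holding a
lock while asleep on the frozen state.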

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-04-08 21:33:15

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu 07-04-11 08:54:01, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> > On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > > If you don't allow the page to be dirtied in the first place, then
> > > > > > > nothing needs to be done to the writeback path because there is
> > > > > > > nothing dirty for it to write back.
> > > > > > Sure but that's only the problem he was able to hit. But generally,
> > > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > > clear there aren't other code paths which can block with s_umount held
> > > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > > >
> > > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > > places than just thawing the filesystem. Anywhere this is done should
> > > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > > path is sufficient to avoid problems.
> > > > That's easily said but hard to do - any transaction start in ext3/4 may
> > > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > > looking into the code) and transaction start traditionally nests inside
> > > > s_umount (and basically there's no way around that since sync() calls your
> > > > fs code with s_umount held).
> > >
> > > Sure, but the question must be asked - why is ext3/4 even starting a
> > > transaction on a clean filesystem during sync? A frozen filesystem,
> > > by definition, is a clean filesystem, and therefore sync calls of any
> > > kind should not be trying to write to the FS or start transactions.
> > > XFS does this just fine, so I'd consider such behaviour on a frozen
> > > filesystem a bug in ext3/4...
> > But by this you are essentially agreeing that the lock inversion is there
> > in principle. We just hide it by relying on the fact that no code path
> > trying to change anything with s_umount held (which is the right lock
> > ordering) gets called while the fs is frozen. And that is fragile.
>
> It's just another lock ordering rule. i.e. don't sleep on a frozen
> filesystem with s_umount held. It's no more fragile than the many
> other lock ordering rules we have.
Except that for all the filesystems transaction start => sleep on a
frozen filesystem and in some code paths we have s_umount held while doing
a transaction start. So I don't buy the argument that it's just another
normal locking rule because normally we require that all the code paths
follow correct lock ordering. Now we have some paths (like sync) which do
not follow the correct lock ordering and we just make sure they are not
called if they could cause deadlocks by other means...

> > Actually, I've looked for a while and if you call quotactl(), it will get
> > s_umount and then tell filesystem to update quota information which blocks
> > inside the fs waiting for filesystem being unfrozen => deadlock.
>
> Which is a bug according to the above locking rule.
Yes, I was just trying to demonstrate that the locking rule changes
"block until the fs is unfrozen" into "kernel is deadlocked" in
unexpected places... fsync_bdev() is another case which deadlocks
currently.
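
The inversion being argued about here can be modelled in a few lines of userspace C. This is only a sketch, not kernel code: the lock classes, the graph structure and all function names are hypothetical, and "waiting for the fs to be thawed" is treated as if it were a lock, which is the simplification this discussion already makes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, used only for this illustration. */
enum lock_class { S_UMOUNT, FS_FROZEN, NR_CLASSES };

/* order[a][b] is true once some path waited on b while holding a. */
struct order_graph {
    bool order[NR_CLASSES][NR_CLASSES];
};

void graph_init(struct order_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Record "wanted is taken while held is held".
 * Returns true if the opposite order was already seen (ABBA). */
bool record_nested(struct order_graph *g, enum lock_class held,
                   enum lock_class wanted)
{
    g->order[held][wanted] = true;
    return g->order[wanted][held];
}

/* sync()/quotactl()-like path: holds s_umount, then sleeps until thaw. */
bool path_sync_like(struct order_graph *g)
{
    return record_nested(g, S_UMOUNT, FS_FROZEN);
}

/* Thaw path from the report: fs still frozen, then down_write(s_umount). */
bool path_thaw(struct order_graph *g)
{
    return record_nested(g, FS_FROZEN, S_UMOUNT);
}
```

Calling path_sync_like() and then path_thaw() on the same graph makes the second call report an inversion. Either order of the two calls does, which is the point: the two paths cannot both be considered correct.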

> > We can
> > change this code path to wait for frozen filesystem before taking s_umount
> > that essentially it just reinstates my point - it't fragile and IMHO we
> > need some more consistent way to handle this...
> >
> > > > So I'm afraid we are not going to get rid of
> > > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > > being frozen - but surely I'm open to suggestions.
> > >
> > > Not sure I understand what you are saying there - this is already
> > > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > > filesystem...
> > Not really. We freeze the fs under s_umount but freezing essentially
> > implements trylock semantics while setting s_frozen so that does not really
> > establish any lock dependency. What establishes lock dependency is the
> > thawing path which blocks on s_umount while the filesystem is still frozen.
> > And this dependency is the other way around - i.e., freezing above
> > s_umount. This is why I was messing with thawing code to fix this...
>
> It's just the tip of the iceberg. If we allow s_umount to be held
> while waiting on a frozen filesystem, we open ouselves up to all
> manner of problems. Such as umount hanging on a frozen fs,
> (which means a shutdown with a frozen filesystem will hang), it can
> hang sync, it can hang memory reclaim, it can hang in any path that
> takes s_umount and hence do all sorts of bad things.
I see. The umount hang (especially in the shutdown case) is not nice.
Direct reclaim won't be blocked AFAICS if we stop dirtying pages while the
fs is frozen (which, as I already wrote, I agree is not a good thing to do
after some thought). Since you can block while accessing the frozen
filesystem anyway (because of atime updates or just because of writing
process waiting with i_mutex held for fs to be unfrozen) I'm not sure how
much worse it would be if s_umount lock would be another lock with which
we can wait for fs to get unfrozen...

> Yes, unthawing the filesystem will get things moving again with your
> patch, but my point is that it simply does not address the problems
> caused by the bad behaviour that has already occurred while the FS
> is frozen. Fixing the thaw code in this way is like shooting the
> messenger - it doesn't fix the problems being reported.
I don't think there has been any really bad behavior - you tried to access a
frozen filesystem and you got blocked. But OK, I'll invest some more thought into
how to not block with s_umount held without sprinkling frozen checks over
the tree...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-15 13:37:26

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi, sorry for my late response.

(2011/04/07 2:46), Jan Kara wrote:
> Hello,
>
> On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
>> (2011/04/06 14:57), Jan Kara wrote:
>>> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>>>> (2011/04/06 7:54), Jan Kara wrote:
>>>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>>>> Hi, thanks for your reviewing.
>>>>>>>
>>>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>>>> Jan Kara<[email protected]> wrote:
>>>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>> <SNIP>
>>>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>>>> I found it.
>>>>>>>>>
>>>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>>>> "writeback" the memory.
>>>>>>>>>
>>>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>>>> you have freezed the filesystem.
>>>>>>>
>>>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>>>> image. So it's not a problem as such.
>>>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>>>> we can write in not only the caches but also the loopback device. However,
>>>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>>>
>>>>>>>>
>>>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>>>> filesystem.
>>>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>>>
>>>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>>>> been already mapped to the file.
>>>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>>>> dirty data) in this way.
>>>>>>>>
>>>>>>>
>>>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>>>> ordering will be
>>>>>>>> s_umount -> "fs frozen"
>>>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>>>> s_frozen.
>>>>>>>>
>>>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>>>
>>>>>>>
>>>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>>>> it please?
>>>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>>>> OK, I will test your patch.
>>>>>>> This weekend I cannot test it. So, I will reply next week.
>>>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>>>
>>>>>> I think your patch solves the original deadlock problem which is reported by
>>>>>> Mizuma-san.
>>>>> Good. Thanks.
>>>>>
>>>>>>> Reported-by: Toshiyuki Okajima<[email protected]>
>>>>>>> Signed-off-by: Jan Kara<[email protected]>
>>>>>>> ---
>>>>>>> fs/super.c | 40 ++++++++++++++++++++++++++++++++++------
>>>>>>> include/linux/fs.h | 1 +
>>>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>>>
>>>>
>>>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>>>> Why? If you dirty a page, writeback thread can come and try to write it -
>>>>> which blocks - but now that does not matter...
>>
>>>> I have not understood the code around writeback thread very much...
>>>> Please explain me the concrete function name which blocks some writes?
>>> It would block in ext4_da_writepages() function.
>> In ext4 with delayed allocation case, I understand it blocks.
>> (Original deadlock problem is just this case.)
>> But in ext4 without delayed allocation or other filesystems case, which function
>> can block writing?

> For ext3 or ext4 without delayed allocation we block inside writepage()
> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> get modified to block while minor-faulting the page on frozen fs because
> when blocks are already allocated we may skip starting a transaction and so
> we could possibly modify the filesystem.
OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.

(minor-pagefault)
  -> do_wp_page()
  -> page_mkwrite(= ext4_page_mkwrite())
  => BLOCK!

(major-pagefault)
  -> do_linear_fault()
  -> page_mkwrite(= ext4_page_mkwrite())
  => BLOCK!

>
>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>> The original problem happens after the fsfreeze operation is done.
>>>> I understand the normal write operation (not mmap) can be blocked while
>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>> while fsfreezing.
>>> Technically speaking, we block all the transaction starts which means we
>>> end up blocking all the writes from going to disk. But that does not mean
>>> we block all the writes from going to in-memory cache - as you properly
>>> note the mmap case is one of such exceptions.
>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>> the writes to disk while fsfreezing. I am considering that mmap path can
>> write to disk while fsfreezing because this deadlock problem happens after
>> fsfreeze operation is done...
> I'm sorry I don't understand now - are you speaking about the case above
> when writepage() does not wait for filesystem being frozen or something
> else?
Sorry, I did not understand the page fault path well.
I have now read the kernel source code around it, and I think I understand it better...

My worry is whether file data can be updated through mmap while the filesystem is frozen.
Of course, I understand that we can write to the in-memory cache, and that is not a
problem. However, if we can write to disk while the filesystem is frozen, it is a problem.
So, I summarize below the cases in which we can or cannot write to disk.

--------------------------------------------------------------------------
Cases (whether data written through an mmap of the file can reach the disk
while the filesystem is frozen)

[1] One of the mmapped pages is not bound, and
the page is not allocated yet. (major fault?)

(1) user dirties a page
(2) a page fault occurs (do_page_fault)
(3) __do_fault is called.
(4) ext4_page_mkwrite is called
(5) ext4_write_begin is called
(6) ext4_journal_start_sb => We can STOP!

[2] One of the mmapped pages is not bound. But
the page is already allocated, and the buffer_heads of the page
are not mapped (BH_Mapped). (minor fault?)

(1) user dirties a page
(2) a page fault occurs (do_page_fault)
(3) do_wp_page is called.
(4) ext4_page_mkwrite is called
(5) ext4_write_begin is called
(6) ext4_journal_start_sb => We can STOP!

[3] One of the mmapped pages is not bound. But
the page is already allocated, and the buffer_heads of the page
are mapped (BH_Mapped). (minor fault?)

(1) user dirties a page
(2) a page fault occurs (do_page_fault)
(3) do_wp_page is called.
(4) ext4_page_mkwrite is called
* Cannot block the dirty page from being written, because all bhs are mapped.
(5) user munmaps the page (munmap)
(6) zap_pte_range dirties the page (struct page) whose pte is dirty (pte_dirty).
(7) writeback thread writes the page (struct page) to disk
=> We cannot STOP!

[4] One of the mmapped pages is bound, and
the page is already allocated.

(1) user dirties a page
( ) no page fault occurs
(2) user munmaps the page (munmap)
(3) zap_pte_range dirties the page (struct page) whose pte is dirty (pte_dirty).
(4) writeback thread writes the page (struct page) to disk
=> We cannot STOP!
--------------------------------------------------------------------------

So, we can block cases [1] and [2].
But I think we cannot block cases [3] and [4] now.
If we fix page_mkwrite, we can also block case [3].
But case [4] is still not blocked, because no page fault occurs
when we dirty the mmapped page.

Therefore, to repair this problem, we need to fix cases [3] and [4].
I think we must modify the writeback thread to fix case [4].
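
The four cases above can be condensed into a small decision function. This is a userspace sketch only, and the type and function names are hypothetical. It assumes that "bound" means the PTE is already writable, so that a store does not fault, and that "mapped" means every buffer_head of the page is BH_Mapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Where (if anywhere) a store to an mmapped page is held back today. */
enum stop_point {
    STOP_AT_JOURNAL_START,   /* cases [1], [2]: ext4_journal_start_sb waits */
    STOP_NEEDS_MKWRITE_WAIT, /* case [3]: fault happens, nothing waits yet  */
    STOP_NONE                /* case [4]: no fault, so no hook runs at all  */
};

struct mmap_page_state {
    bool pte_writable;  /* "bound" in the list above (assumed meaning)   */
    bool blocks_mapped; /* all buffer_heads of the page are BH_Mapped    */
};

enum stop_point classify_store(struct mmap_page_state s)
{
    if (s.pte_writable)
        return STOP_NONE;               /* case [4] */
    if (s.blocks_mapped)
        return STOP_NEEDS_MKWRITE_WAIT; /* case [3] */
    return STOP_AT_JOURNAL_START;       /* cases [1] and [2] */
}
```

Cases [1] and [2] differ only in whether the page is already in the page cache, which does not change the outcome, so the model folds them into one branch.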

Thanks,
Toshiyuki Okajima


2011-04-15 17:13:13

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hello,

On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> > For ext3 or ext4 without delayed allocation we block inside writepage()
> >function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >get modified to block while minor-faulting the page on frozen fs because
> >when blocks are already allocated we may skip starting a transaction and so
> >we could possibly modify the filesystem.
> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>
> (minor-pagefault)
> -> do_wp_page()
> -> page_mkwrite(= ext4_mkwrite())
> => BLOCK!
>
> (major-pagefault)
> -> do_liner_fault()
> -> page_mkwrite(= ext4_mkwrite())
> => BLOCK!
>
> >
> >>>>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>The original problem happens after the fsfreeze operation is done.
> >>>>I understand the normal write operation (not mmap) can be blocked while
> >>>>fsfreezing. So, I guess we don't always block all the write operation
> >>>>while fsfreezing.
> >>> Technically speaking, we block all the transaction starts which means we
> >>>end up blocking all the writes from going to disk. But that does not mean
> >>>we block all the writes from going to in-memory cache - as you properly
> >>>note the mmap case is one of such exceptions.
> >>Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>the writes to disk while fsfreezing. I am considering that mmap path can
> >>write to disk while fsfreezing because this deadlock problem happens after
> >>fsfreeze operation is done...
> > I'm sorry I don't understand now - are you speaking about the case above
> >when writepage() does not wait for filesystem being frozen or something
> >else?
> Sorry, I didn't understand around the page fault path.
> So, I had read the kernel source code around it, then I maybe understand...
>
> I worry whether we can update the file data in mmap case while fsfreezing.
> Of course, I understand that we can write to in-memory cache, and it is not a
> problem. However, if we can write to disk while fsfreezing, it is a problem.
> So, I summarize the cases whether we can write to disk or not.
>
> --------------------------------------------------------------------------
> Cases (Whether we can write the data mmapped to the file on the disk
> while fsfreezing)
>
> [1] One of the page which has been mmapped is not bound. And
> the page is not allocated yet. (major fault?)
>
> (1) user dirtys a page
> (2) a page fault occurs (do_page_fault)
> (3) __do_falut is called.
> (4) ext4_page_mkwrite is called
> (5) ext4_write_begin is called
> (6) ext4_journal_start_sb => We can STOP!
>
> [2] One of the page which has been mmapped is not bound. But
> the page is already allocated, and the buffer_heads of the page
> are not mapped (BH_Mapped). (minor fault?)
>
> (1) user dirtys a page
> (2) a page fault occurs (do_page_fault)
> (3) do_wp_page is called.
> (4) ext4_page_mkwrite is called
> (5) ext4_write_begin is called
> (6) ext4_journal_start_sb => We can STOP!
>
> [3] One of the page which has been mmapped is not bound. But
> the page is already allocated, and the buffer_heads of the page
> are mapped (BH_Mapped). (minor fault?)
>
> (1) user dirtys a page
> (2) a page fault occurs (do_page_fault)
> (3) do_wp_page is called.
> (4) ext4_page_mkwrite is called
> * Cannot block the dirty page to be written because all bh is mapped.
> (5) user munmaps the page (munmap)
> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> (7) writeback thread writes the page (struct page) to disk
> => We cannot STOP!
>
> [4] One of the page which has been mmapped is bound. And
> the page is already allocated.
>
> (1) user dirtys a page
> ( ) no page fault occurs
> (2) user munmaps the page (munmap)
> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> (4) writeback thread writes the page (struct page) to disk
> => We cannot STOP!
> --------------------------------------------------------------------------
>
> So, we can block the cases [1], [2].
> But I think we cannot block the cases [3], [4] now.
> If fixing the page_mkwrite, we can also block the case [3].
> But the case [4] is not blocked because no page fault occurs
> when we dirty the mmapped page.
>
> Therefore, to repair this problem, we need to fix the cases [3], [4].
> I think we must modify the writeback thread to fix the case [4].
The trick here is that when we write a page to disk, we write-protect
the page (you seem to call this "the page is bound", I'm not sure why).
So we are guaranteed to receive a minor fault (case [3]) if the user tries to
modify a page after we finish writeback while freezing the filesystem.
So in principle all we need to do is just wait in ext4_page_mkwrite().
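
The argument can be checked against a toy model. The following is a userspace sketch with hypothetical names, not the kernel implementation. It assumes a page only becomes writable through the mkwrite path (so a writable page is always dirty) and it ignores a fault racing with the freeze itself.

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
    bool pte_writable; /* user can store without faulting     */
    bool dirty;        /* page cache copy differs from disk   */
};

struct fs_model {
    bool frozen;
};

enum store_result { STORE_DONE, STORE_WOULD_BLOCK };

/* Writeback cleans the page and write-protects its mapping. */
void writeback_page(struct page_model *p)
{
    p->dirty = false;
    p->pte_writable = false;
}

/* Freezing writes back every dirty page, then marks the fs frozen. */
void freeze_fs(struct fs_model *fs, struct page_model *pages, int n)
{
    for (int i = 0; i < n; i++)
        if (pages[i].dirty)
            writeback_page(&pages[i]);
    fs->frozen = true;
}

/* A user store: a write-protected page faults into the mkwrite hook. */
enum store_result user_store(struct fs_model *fs, struct page_model *p)
{
    if (!p->pte_writable) {
        if (fs->frozen)
            return STORE_WOULD_BLOCK; /* the proposed wait in page_mkwrite */
        p->pte_writable = true;
    }
    p->dirty = true;
    return STORE_DONE;
}
```

In this model no page is writable once freeze_fs() returns, so every later store goes through the hook and case [4] cannot arise on a frozen filesystem.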

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-15 17:17:17

by Eric Sandeen

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 4/15/11 12:13 PM, Jan Kara wrote:
> Hello,
>
> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>> get modified to block while minor-faulting the page on frozen fs because
>>> when blocks are already allocated we may skip starting a transaction and so
>>> we could possibly modify the filesystem.
>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>
>> (minor-pagefault)
>> -> do_wp_page()
>> -> page_mkwrite(= ext4_mkwrite())
>> => BLOCK!
>>
>> (major-pagefault)
>> -> do_liner_fault()
>> -> page_mkwrite(= ext4_mkwrite())
>> => BLOCK!
>>
>>>
>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>> while fsfreezing.
>>>>> Technically speaking, we block all the transaction starts which means we
>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>> note the mmap case is one of such exceptions.
>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>> fsfreeze operation is done...
>>> I'm sorry I don't understand now - are you speaking about the case above
>>> when writepage() does not wait for filesystem being frozen or something
>>> else?
>> Sorry, I didn't understand around the page fault path.
>> So, I had read the kernel source code around it, then I maybe understand...
>>
>> I worry whether we can update the file data in mmap case while fsfreezing.
>> Of course, I understand that we can write to in-memory cache, and it is not a
>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>> So, I summarize the cases whether we can write to disk or not.
>>
>> --------------------------------------------------------------------------
>> Cases (Whether we can write the data mmapped to the file on the disk
>> while fsfreezing)
>>
>> [1] One of the page which has been mmapped is not bound. And
>> the page is not allocated yet. (major fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) __do_falut is called.
>> (4) ext4_page_mkwrite is called
>> (5) ext4_write_begin is called
>> (6) ext4_journal_start_sb => We can STOP!
>>
>> [2] One of the page which has been mmapped is not bound. But
>> the page is already allocated, and the buffer_heads of the page
>> are not mapped (BH_Mapped). (minor fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) do_wp_page is called.
>> (4) ext4_page_mkwrite is called
>> (5) ext4_write_begin is called
>> (6) ext4_journal_start_sb => We can STOP!
>>
>> [3] One of the page which has been mmapped is not bound. But
>> the page is already allocated, and the buffer_heads of the page
>> are mapped (BH_Mapped). (minor fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) do_wp_page is called.
>> (4) ext4_page_mkwrite is called
>> * Cannot block the dirty page to be written because all bh is mapped.
>> (5) user munmaps the page (munmap)
>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>> (7) writeback thread writes the page (struct page) to disk
>> => We cannot STOP!
>>
>> [4] One of the page which has been mmapped is bound. And
>> the page is already allocated.
>>
>> (1) user dirtys a page
>> ( ) no page fault occurs
>> (2) user munmaps the page (munmap)
>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>> (4) writeback thread writes the page (struct page) to disk
>> => We cannot STOP!
>> --------------------------------------------------------------------------
>>
>> So, we can block the cases [1], [2].
>> But I think we cannot block the cases [3], [4] now.
>> If fixing the page_mkwrite, we can also block the case [3].
>> But the case [4] is not blocked because no page fault occurs
>> when we dirty the mmapped page.
>>
>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>> I think we must modify the writeback thread to fix the case [4].
> The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this that "the page is bound", I'm not sure why).
> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So principially all we need to do is just wait in ext4_page_mkwrite().

I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?

no frozen fs should be able to dirty & write pages via mmap, right?

-Eric

> Honza


2011-04-15 17:37:34

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Fri 15-04-11 12:17:06, Eric Sandeen wrote:
> On 4/15/11 12:13 PM, Jan Kara wrote:
> > Hello,
> >
> > On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>> For ext3 or ext4 without delayed allocation we block inside writepage()
> >>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>> get modified to block while minor-faulting the page on frozen fs because
> >>> when blocks are already allocated we may skip starting a transaction and so
> >>> we could possibly modify the filesystem.
> >> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>
> >> (minor-pagefault)
> >> -> do_wp_page()
> >> -> page_mkwrite(= ext4_mkwrite())
> >> => BLOCK!
> >>
> >> (major-pagefault)
> >> -> do_liner_fault()
> >> -> page_mkwrite(= ext4_mkwrite())
> >> => BLOCK!
> >>
> >>>
> >>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>> The original problem happens after the fsfreeze operation is done.
> >>>>>> I understand the normal write operation (not mmap) can be blocked while
> >>>>>> fsfreezing. So, I guess we don't always block all the write operation
> >>>>>> while fsfreezing.
> >>>>> Technically speaking, we block all the transaction starts which means we
> >>>>> end up blocking all the writes from going to disk. But that does not mean
> >>>>> we block all the writes from going to in-memory cache - as you properly
> >>>>> note the mmap case is one of such exceptions.
> >>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>> the writes to disk while fsfreezing. I am considering that mmap path can
> >>>> write to disk while fsfreezing because this deadlock problem happens after
> >>>> fsfreeze operation is done...
> >>> I'm sorry I don't understand now - are you speaking about the case above
> >>> when writepage() does not wait for filesystem being frozen or something
> >>> else?
> >> Sorry, I didn't understand around the page fault path.
> >> So, I had read the kernel source code around it, then I maybe understand...
> >>
> >> I worry whether we can update the file data in mmap case while fsfreezing.
> >> Of course, I understand that we can write to in-memory cache, and it is not a
> >> problem. However, if we can write to disk while fsfreezing, it is a problem.
> >> So, I summarize the cases whether we can write to disk or not.
> >>
> >> --------------------------------------------------------------------------
> >> Cases (Whether we can write the data mmapped to the file on the disk
> >> while fsfreezing)
> >>
> >> [1] One of the page which has been mmapped is not bound. And
> >> the page is not allocated yet. (major fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) __do_falut is called.
> >> (4) ext4_page_mkwrite is called
> >> (5) ext4_write_begin is called
> >> (6) ext4_journal_start_sb => We can STOP!
> >>
> >> [2] One of the page which has been mmapped is not bound. But
> >> the page is already allocated, and the buffer_heads of the page
> >> are not mapped (BH_Mapped). (minor fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) do_wp_page is called.
> >> (4) ext4_page_mkwrite is called
> >> (5) ext4_write_begin is called
> >> (6) ext4_journal_start_sb => We can STOP!
> >>
> >> [3] One of the page which has been mmapped is not bound. But
> >> the page is already allocated, and the buffer_heads of the page
> >> are mapped (BH_Mapped). (minor fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) do_wp_page is called.
> >> (4) ext4_page_mkwrite is called
> >> * Cannot block the dirty page to be written because all bh is mapped.
> >> (5) user munmaps the page (munmap)
> >> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >> (7) writeback thread writes the page (struct page) to disk
> >> => We cannot STOP!
> >>
> >> [4] One of the page which has been mmapped is bound. And
> >> the page is already allocated.
> >>
> >> (1) user dirtys a page
> >> ( ) no page fault occurs
> >> (2) user munmaps the page (munmap)
> >> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >> (4) writeback thread writes the page (struct page) to disk
> >> => We cannot STOP!
> >> --------------------------------------------------------------------------
> >>
> >> So, we can block the cases [1], [2].
> >> But I think we cannot block the cases [3], [4] now.
> >> If fixing the page_mkwrite, we can also block the case [3].
> >> But the case [4] is not blocked because no page fault occurs
> >> when we dirty the mmapped page.
> >>
> >> Therefore, to repair this problem, we need to fix the cases [3], [4].
> >> I think we must modify the writeback thread to fix the case [4].
> > The trick here is that when we write a page to disk, we write-protect
> > the page (you seem to call this that "the page is bound", I'm not sure why).
> > So we are guaranteed to receive a minor fault (case [3]) if user tries to
> > modify a page after we finish writeback while freezing the filesystem.
> > So principially all we need to do is just wait in ext4_page_mkwrite().
>
> I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?
>
> no frozen fs should be able to dirty & write pages via mmap, right?
I have not put that much thought into it but locking might be kind of
tricky in the generic code. We have to be sure that freezing waits for
the page which is just being faulted. That means we have to take the page lock
(now writepage() called during fs_freeze will wait for us) and check whether
the fs is frozen. If yes, unlock the page, do vfs_check_frozen(), and retry. This
call sequence is much better suited for block_page_mkwrite() than for
code in memory.c I think.
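
The lock / check / unlock / wait / retry sequence described above might look roughly like this. It is a single-threaded userspace model with hypothetical names, not a patch: wait_until_thawed() stands in for vfs_check_frozen(), and the thaw is simulated by a counter instead of another task.

```c
#include <assert.h>
#include <stdbool.h>

struct sb_model {
    bool frozen;
    int thaw_after_waits; /* hypothetical: thaw arrives after this many waits */
    int waits;
};

struct page_lock_model {
    bool locked;
};

/* Stand-in for vfs_check_frozen(): "sleeps" until the fs is thawed. */
void wait_until_thawed(struct sb_model *sb)
{
    sb->waits++;
    if (sb->waits >= sb->thaw_after_waits)
        sb->frozen = false;
}

/*
 * Returns with the page locked and the fs known not to be frozen.
 * The return value is the number of retries that were needed.
 */
int mkwrite_lock_page(struct sb_model *sb, struct page_lock_model *pg)
{
    int retries = 0;

    for (;;) {
        assert(!pg->locked);
        pg->locked = true;      /* page lock: freeze's writepage() waits for us */
        if (!sb->frozen)
            return retries;     /* safe to dirty the page under the lock */
        pg->locked = false;     /* never sleep on a frozen fs with the lock held */
        wait_until_thawed(sb);
        retries++;
    }
}
```

The frozen check has to be repeated after every wait because the filesystem may be frozen again between the wakeup and retaking the page lock.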

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-18 09:03:16

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi,

(2011/04/16 2:13), Jan Kara wrote:
> Hello,
>
> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>> get modified to block while minor-faulting the page on frozen fs because
>>> when blocks are already allocated we may skip starting a transaction and so
>>> we could possibly modify the filesystem.
>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>
>> (minor-pagefault)
>> -> do_wp_page()
>> -> page_mkwrite(= ext4_mkwrite())
>> => BLOCK!
>>
>> (major-pagefault)
>> -> do_liner_fault()
>> -> page_mkwrite(= ext4_mkwrite())
>> => BLOCK!
>>
>>>
>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>> while fsfreezing.
>>>>> Technically speaking, we block all the transaction starts which means we
>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>> note the mmap case is one of such exceptions.
>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>> fsfreeze operation is done...
>>> I'm sorry I don't understand now - are you speaking about the case above
>>> when writepage() does not wait for filesystem being frozen or something
>>> else?
>> Sorry, I didn't understand around the page fault path.
>> So, I had read the kernel source code around it, then I maybe understand...
>>
>> I worry whether we can update the file data in mmap case while fsfreezing.
>> Of course, I understand that we can write to in-memory cache, and it is not a
>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>> So, I summarize the cases whether we can write to disk or not.
>>
>> --------------------------------------------------------------------------
>> Cases (Whether we can write the data mmapped to the file on the disk
>> while fsfreezing)
>>
>> [1] One of the page which has been mmapped is not bound. And
>> the page is not allocated yet. (major fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) __do_fault is called.
>> (4) ext4_page_mkwrite is called
>> (5) ext4_write_begin is called
>> (6) ext4_journal_start_sb => We can STOP!
>>
>> [2] One of the page which has been mmapped is not bound. But
>> the page is already allocated, and the buffer_heads of the page
>> are not mapped (BH_Mapped). (minor fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) do_wp_page is called.
>> (4) ext4_page_mkwrite is called
>> (5) ext4_write_begin is called
>> (6) ext4_journal_start_sb => We can STOP!
>>
>> [3] One of the page which has been mmapped is not bound. But
>> the page is already allocated, and the buffer_heads of the page
>> are mapped (BH_Mapped). (minor fault?)
>>
>> (1) user dirtys a page
>> (2) a page fault occurs (do_page_fault)
>> (3) do_wp_page is called.
>> (4) ext4_page_mkwrite is called
>> * Cannot block the dirty page to be written because all bh is mapped.
>> (5) user munmaps the page (munmap)
>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>> (7) writeback thread writes the page (struct page) to disk
>> => We cannot STOP!
>>
>> [4] One of the page which has been mmapped is bound. And
>> the page is already allocated.
>>
>> (1) user dirtys a page
>> ( ) no page fault occurs
>> (2) user munmaps the page (munmap)
>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>> (4) writeback thread writes the page (struct page) to disk
>> => We cannot STOP!
>> --------------------------------------------------------------------------
>>
>> So, we can block the cases [1], [2].
>> But I think we cannot block the cases [3], [4] now.
>> If fixing the page_mkwrite, we can also block the case [3].
>> But the case [4] is not blocked because no page fault occurs
>> when we dirty the mmapped page.
>>
>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>> I think we must modify the writeback thread to fix the case [4].
> The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this that "the page is bound", I'm not sure why).
Hm, I would like to understand how the page gets write-protected while the
filesystem is being frozen.
But anyway, I understand that we don't need to consider case [4].

> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So in principle all we need to do is just wait in ext4_page_mkwrite().
OK. I understand.
Are there any concrete ideas for fixing this?
For ext4, we can handle case [3] by modifying ext4_page_mkwrite().
But for ext3 or other filesystems, do we have to implement ->page_mkwrite()
to prevent it?

Thanks,
Toshiyuki Okajima


2011-04-18 10:51:05

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>> For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>>get modified to block while minor-faulting the page on frozen fs because
> >>>when blocks are already allocated we may skip starting a transaction and so
> >>>we could possibly modify the filesystem.
> >>OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>
> >>(minor-pagefault)
> >>-> do_wp_page()
> >> -> page_mkwrite(= ext4_mkwrite())
> >> => BLOCK!
> >>
> >>(major-pagefault)
> >>-> do_linear_fault()
> >> -> page_mkwrite(= ext4_mkwrite())
> >> => BLOCK!
> >>
> >>>
> >>>>>>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>I understand the normal write operation (not mmap) can be blocked while
> >>>>>>fsfreezing. So, I guess we don't always block all the write operation
> >>>>>>while fsfreezing.
> >>>>> Technically speaking, we block all the transaction starts which means we
> >>>>>end up blocking all the writes from going to disk. But that does not mean
> >>>>>we block all the writes from going to in-memory cache - as you properly
> >>>>>note the mmap case is one of such exceptions.
> >>>>Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>>the writes to disk while fsfreezing. I am considering that mmap path can
> >>>>write to disk while fsfreezing because this deadlock problem happens after
> >>>>fsfreeze operation is done...
> >>> I'm sorry I don't understand now - are you speaking about the case above
> >>>when writepage() does not wait for filesystem being frozen or something
> >>>else?
> >>Sorry, I didn't understand around the page fault path.
> >>So, I had read the kernel source code around it, then I maybe understand...
> >>
> >>I worry whether we can update the file data in mmap case while fsfreezing.
> >>Of course, I understand that we can write to in-memory cache, and it is not a
> >>problem. However, if we can write to disk while fsfreezing, it is a problem.
> >>So, I summarize the cases whether we can write to disk or not.
> >>
> >>--------------------------------------------------------------------------
> >>Cases (Whether we can write the data mmapped to the file on the disk
> >>while fsfreezing)
> >>
> >>[1] One of the page which has been mmapped is not bound. And
> >> the page is not allocated yet. (major fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) __do_fault is called.
> >> (4) ext4_page_mkwrite is called
> >> (5) ext4_write_begin is called
> >> (6) ext4_journal_start_sb => We can STOP!
> >>
> >>[2] One of the page which has been mmapped is not bound. But
> >> the page is already allocated, and the buffer_heads of the page
> >> are not mapped (BH_Mapped). (minor fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) do_wp_page is called.
> >> (4) ext4_page_mkwrite is called
> >> (5) ext4_write_begin is called
> >> (6) ext4_journal_start_sb => We can STOP!
> >>
> >>[3] One of the page which has been mmapped is not bound. But
> >> the page is already allocated, and the buffer_heads of the page
> >> are mapped (BH_Mapped). (minor fault?)
> >>
> >> (1) user dirtys a page
> >> (2) a page fault occurs (do_page_fault)
> >> (3) do_wp_page is called.
> >> (4) ext4_page_mkwrite is called
> >> * Cannot block the dirty page to be written because all bh is mapped.
> >> (5) user munmaps the page (munmap)
> >> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >> (7) writeback thread writes the page (struct page) to disk
> >> => We cannot STOP!
> >>
> >>[4] One of the page which has been mmapped is bound. And
> >> the page is already allocated.
> >>
> >> (1) user dirtys a page
> >> ( ) no page fault occurs
> >> (2) user munmaps the page (munmap)
> >> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >> (4) writeback thread writes the page (struct page) to disk
> >> => We cannot STOP!
> >>--------------------------------------------------------------------------
> >>
> >>So, we can block the cases [1], [2].
> >>But I think we cannot block the cases [3], [4] now.
> >>If fixing the page_mkwrite, we can also block the case [3].
> >>But the case [4] is not blocked because no page fault occurs
> >>when we dirty the mmapped page.
> >>
> >>Therefore, to repair this problem, we need to fix the cases [3], [4].
> >>I think we must modify the writeback thread to fix the case [4].
> > The trick here is that when we write a page to disk, we write-protect
> >the page (you seem to call this that "the page is bound", I'm not sure why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
Look at what page_mkclean(), called from clear_page_dirty_for_io(), does...
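
The effect described here can be observed from userspace. What follows is a
minimal sketch, not either reproducer from this thread: it dirties a shared
file mapping, forces writeback with msync(), dirties the page again, and
reports how many minor faults the second store took. The function name and
the scratch-file path passed to it are hypothetical. On a filesystem that
write-protects pages at writeback the count is typically at least 1; on
filesystems without dirty accounting (tmpfs, for example) it may be 0.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) < 0)
		return -1;
	return ru.ru_minflt;
}

/*
 * Dirty a shared mapping, write it back, then dirty it again.
 * Returns the number of minor faults taken around the second store,
 * or -1 on error. The scratch file at `path` is created and removed.
 */
long redirty_after_writeback(const char *path)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	long before, after, ret = -1;
	volatile char *p;
	void *map;
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, pagesz) < 0)
		goto out_close;
	map = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;
	p = map;

	p[0] = 'a';			/* first store: page becomes dirty */
	if (msync(map, pagesz, MS_SYNC) < 0)	/* force writeback */
		goto out_unmap;

	before = minor_faults();
	p[0] = 'b';			/* second store: may fault again */
	after = minor_faults();
	if (before >= 0 && after >= 0)
		ret = after - before;
out_unmap:
	munmap(map, pagesz);
out_close:
	close(fd);
	unlink(path);
	return ret;
}
```

Calling redirty_after_writeback() with a path on the filesystem under test
and printing the result is enough to see whether the second store went
through the fault path, which is where ->page_mkwrite() gets its chance to
wait.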

> But, anyway, I understand we don't need to consider the case [4].
Yes.

> >So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >modify a page after we finish writeback while freezing the filesystem.
> >So in principle all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
Yes.

> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
Sadly I don't see a simple way to fix this issue for all filesystems at
once. Implementing a proper wait in block_page_mkwrite() should fix the
issue for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
separately, as ext4 will. For ext3, we'd have to add ->page_mkwrite()
support. I have had patches for this for some time, but I still have to get
to properly testing them in more exotic conditions like 64k pages...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-19 09:43:16

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi,

(2011/04/18 19:51), Jan Kara wrote:
> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>> we could possibly modify the filesystem.
>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>
>>>> (minor-pagefault)
>>>> -> do_wp_page()
>>>> -> page_mkwrite(= ext4_mkwrite())
>>>> => BLOCK!
>>>>
>>>> (major-pagefault)
>>>> -> do_linear_fault()
>>>> -> page_mkwrite(= ext4_mkwrite())
>>>> => BLOCK!
>>>>
>>>>>
>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>> while fsfreezing.
>>>>>>> Technically speaking, we block all the transaction starts which means we
>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>> note the mmap case is one of such exceptions.
>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>> fsfreeze operation is done...
>>>>> I'm sorry I don't understand now - are you speaking about the case above
>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>> else?
>>>> Sorry, I didn't understand around the page fault path.
>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>
>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>> So, I summarize the cases whether we can write to disk or not.
>>>>
>>>> --------------------------------------------------------------------------
>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>> while fsfreezing)
>>>>
>>>> [1] One of the page which has been mmapped is not bound. And
>>>> the page is not allocated yet. (major fault?)
>>>>
>>>> (1) user dirtys a page
>>>> (2) a page fault occurs (do_page_fault)
>>>> (3) __do_fault is called.
>>>> (4) ext4_page_mkwrite is called
>>>> (5) ext4_write_begin is called
>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>
>>>> [2] One of the page which has been mmapped is not bound. But
>>>> the page is already allocated, and the buffer_heads of the page
>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>
>>>> (1) user dirtys a page
>>>> (2) a page fault occurs (do_page_fault)
>>>> (3) do_wp_page is called.
>>>> (4) ext4_page_mkwrite is called
>>>> (5) ext4_write_begin is called
>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>
>>>> [3] One of the page which has been mmapped is not bound. But
>>>> the page is already allocated, and the buffer_heads of the page
>>>> are mapped (BH_Mapped). (minor fault?)
>>>>
>>>> (1) user dirtys a page
>>>> (2) a page fault occurs (do_page_fault)
>>>> (3) do_wp_page is called.
>>>> (4) ext4_page_mkwrite is called
>>>> * Cannot block the dirty page to be written because all bh is mapped.
>>>> (5) user munmaps the page (munmap)
>>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>> (7) writeback thread writes the page (struct page) to disk
>>>> => We cannot STOP!
>>>>
>>>> [4] One of the page which has been mmapped is bound. And
>>>> the page is already allocated.
>>>>
>>>> (1) user dirtys a page
>>>> ( ) no page fault occurs
>>>> (2) user munmaps the page (munmap)
>>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>> (4) writeback thread writes the page (struct page) to disk
>>>> => We cannot STOP!
>>>> --------------------------------------------------------------------------
>>>>
>>>> So, we can block the cases [1], [2].
>>>> But I think we cannot block the cases [3], [4] now.
>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>> But the case [4] is not blocked because no page fault occurs
>>>> when we dirty the mmapped page.
>>>>
>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>> I think we must modify the writeback thread to fix the case [4].
>>> The trick here is that when we write a page to disk, we write-protect
>>> the page (you seem to call this that "the page is bound", I'm not sure why).
>> Hm, I want to understand how to write-protect the page under fsfreezing.
> Look at what page_mkclean() called from clear_page_dirty_for_io() does...
Thanks. I'll read that.

>
>> But, anyway, I understand we don't need to consider the case [4].
> Yes.
>
>>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>>> modify a page after we finish writeback while freezing the filesystem.
>>> So in principle all we need to do is just wait in ext4_page_mkwrite().
>> OK. I understand.
>> Are there any concrete ideas to fix this?
>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> Yes.
>
>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
> Sadly I don't see a simple way to fix this issue for all filesystems at
> once. Implementing proper wait in block_page_mkwrite() should fix the issue
> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> have patches for this already for some time but I have to get to properly
> testing them in more exotic conditions like 64k pages...
OK. I understand the current status of your work on fixing the problem that
data can be written to disk via the mmap path while the filesystem is frozen.

Thanks,
Toshiyuki Okajima


2011-04-22 06:58:39

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi,

On Tue, 19 Apr 2011 18:43:16 +0900
Toshiyuki Okajima <[email protected]> wrote:
> Hi,
>
> (2011/04/18 19:51), Jan Kara wrote:
> > On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>>>> get modified to block while minor-faulting the page on frozen fs because
> >>>>> when blocks are already allocated we may skip starting a transaction and so
> >>>>> we could possibly modify the filesystem.
> >>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>>>
> >>>> (minor-pagefault)
> >>>> -> do_wp_page()
> >>>> -> page_mkwrite(= ext4_mkwrite())
> >>>> => BLOCK!
> >>>>
> >>>> (major-pagefault)
> >>>> -> do_linear_fault()
> >>>> -> page_mkwrite(= ext4_mkwrite())
> >>>> => BLOCK!
> >>>>
> >>>>>
> >>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>>>> The original problem happens after the fsfreeze operation is done.
> >>>>>>>> I understand the normal write operation (not mmap) can be blocked while
> >>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
> >>>>>>>> while fsfreezing.
> >>>>>>> Technically speaking, we block all the transaction starts which means we
> >>>>>>> end up blocking all the writes from going to disk. But that does not mean
> >>>>>>> we block all the writes from going to in-memory cache - as you properly
> >>>>>>> note the mmap case is one of such exceptions.
> >>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
> >>>>>> write to disk while fsfreezing because this deadlock problem happens after
> >>>>>> fsfreeze operation is done...
> >>>>> I'm sorry I don't understand now - are you speaking about the case above
> >>>>> when writepage() does not wait for filesystem being frozen or something
> >>>>> else?
> >>>> Sorry, I didn't understand around the page fault path.
> >>>> So, I had read the kernel source code around it, then I maybe understand...
> >>>>
> >>>> I worry whether we can update the file data in mmap case while fsfreezing.
> >>>> Of course, I understand that we can write to in-memory cache, and it is not a
> >>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
> >>>> So, I summarize the cases whether we can write to disk or not.
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> Cases (Whether we can write the data mmapped to the file on the disk
> >>>> while fsfreezing)
> >>>>
> >>>> [1] One of the page which has been mmapped is not bound. And
> >>>> the page is not allocated yet. (major fault?)
> >>>>
> >>>> (1) user dirtys a page
> >>>> (2) a page fault occurs (do_page_fault)
> >>>> (3) __do_fault is called.
> >>>> (4) ext4_page_mkwrite is called
> >>>> (5) ext4_write_begin is called
> >>>> (6) ext4_journal_start_sb => We can STOP!
> >>>>
> >>>> [2] One of the page which has been mmapped is not bound. But
> >>>> the page is already allocated, and the buffer_heads of the page
> >>>> are not mapped (BH_Mapped). (minor fault?)
> >>>>
> >>>> (1) user dirtys a page
> >>>> (2) a page fault occurs (do_page_fault)
> >>>> (3) do_wp_page is called.
> >>>> (4) ext4_page_mkwrite is called
> >>>> (5) ext4_write_begin is called
> >>>> (6) ext4_journal_start_sb => We can STOP!
> >>>>
> >>>> [3] One of the page which has been mmapped is not bound. But
> >>>> the page is already allocated, and the buffer_heads of the page
> >>>> are mapped (BH_Mapped). (minor fault?)
> >>>>
> >>>> (1) user dirtys a page
> >>>> (2) a page fault occurs (do_page_fault)
> >>>> (3) do_wp_page is called.
> >>>> (4) ext4_page_mkwrite is called
> >>>> * Cannot block the dirty page to be written because all bh is mapped.
> >>>> (5) user munmaps the page (munmap)
> >>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>> (7) writeback thread writes the page (struct page) to disk
> >>>> => We cannot STOP!
> >>>>
> >>>> [4] One of the page which has been mmapped is bound. And
> >>>> the page is already allocated.
> >>>>
> >>>> (1) user dirtys a page
> >>>> ( ) no page fault occurs
> >>>> (2) user munmaps the page (munmap)
> >>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>> (4) writeback thread writes the page (struct page) to disk
> >>>> => We cannot STOP!
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> So, we can block the cases [1], [2].
> >>>> But I think we cannot block the cases [3], [4] now.
> >>>> If fixing the page_mkwrite, we can also block the case [3].
> >>>> But the case [4] is not blocked because no page fault occurs
> >>>> when we dirty the mmapped page.
> >>>>
> >>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
> >>>> I think we must modify the writeback thread to fix the case [4].
> >>> The trick here is that when we write a page to disk, we write-protect
> >>> the page (you seem to call this that "the page is bound", I'm not sure why).
> >> Hm, I want to understand how to write-protect the page under fsfreezing.
> > Look at what page_mkclean() called from clear_page_dirty_for_io() does...
> Thanks. I'll read that.
>
> >
> >> But, anyway, I understand we don't need to consider the case [4].
> > Yes.
> >
> >>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >>> modify a page after we finish writeback while freezing the filesystem.
> >>> So in principle all we need to do is just wait in ext4_page_mkwrite().
> >> OK. I understand.
> >> Are there any concrete ideas to fix this?
> >> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> > Yes.
> >
> >> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
> > Sadly I don't see a simple way to fix this issue for all filesystems at
> > once. Implementing proper wait in block_page_mkwrite() should fix the issue
> > for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> > separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> > have patches for this already for some time but I have to get to properly
> > testing them in more exotic conditions like 64k pages...
> OK. I understand the current status of your works to fix the problem which
> can be written with some data at mmap path while fsfreezing.
I have confirmed that the following patch works fine while my reproducer or
Mizuma-san's is running. With the patch, writes to mmapped file data are
blocked in the page-fault path while the filesystem is frozen, so that data
is not written to disk.

I think this patch fixes the following two problems:
- A deadlock between ext4_da_writepages() (called from
  writeback_inodes_wb()) and thaw_super(). (reported by Mizuma-san)
- Data that is mmapped to a file can be written to disk while the
  filesystem is frozen (ext3/ext4). (reported by me)

Please review this patch.

Thanks,
Toshiyuki Okajima
---
fs/ext3/file.c | 19 ++++++++++++-
fs/ext3/inode.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/inode.c | 4 ++-
include/linux/ext3_fs.h | 1 +
4 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..6d376ef 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }

+static const struct vm_operations_struct ext3_file_vm_ops = {
+	.fault = filemap_fault,
+	.page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	if (!mapping->a_ops->readpage)
+		return -ENOEXEC;
+	file_accessed(file);
+	vma->vm_ops = &ext3_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
 const struct file_operations ext3_file_operations = {
 	.llseek = generic_file_llseek,
 	.read = do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = ext3_compat_ioctl,
 #endif
-	.mmap = generic_file_mmap,
+	.mmap = ext3_file_mmap,
 	.open = dquot_file_open,
 	.release = ext3_release_file,
 	.fsync = ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 68b2e43..66c31dd 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)

 	return err;
 }
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	loff_t size;
+	unsigned long len;
+	int ret = -EINVAL;
+	void *fsdata;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
+	 * get i_mutex because we are already holding mmap_sem.
+	 */
+	down_read(&inode->i_alloc_sem);
+	size = i_size_read(inode);
+	if (page->mapping != mapping || size <= page_offset(page)
+	    || !PageUptodate(page)) {
+		/* page got truncated from under us? */
+		goto out_unlock;
+	}
+	ret = 0;
+	if (PageMappedToDisk(page))
+		goto out_frozen;
+
+	if (page->index == size >> PAGE_CACHE_SHIFT)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+
+	lock_page(page);
+	/*
+	 * return if we have all the buffers mapped. This avoid
+	 * the need to call write_begin/write_end which does a
+	 * journal_start/journal_stop which can block and take
+	 * long time
+	 */
+	if (page_has_buffers(page)) {
+		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
+					buffer_unmapped)) {
+			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+			goto out_unlock;
+		}
+	}
+	unlock_page(page);
+	/*
+	 * OK, we need to fill the hole... Do write_begin write_end
+	 * to do block allocation/reservation.We are not holding
+	 * inode.i__mutex here. That allow * parallel write_begin,
+	 * write_end call. lock_page prevent this from happening
+	 * on the same page though
+	 */
+	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
+			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
+			len, len, page, fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = 0;
+out_unlock:
+	if (ret)
+		ret = VM_FAULT_SIGBUS;
+	up_read(&inode->i_alloc_sem);
+	return ret;
+}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..44979ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	ret = 0;
 	if (PageMappedToDisk(page))
-		goto out_unlock;
+		goto out_frozen;

 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
@@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
 			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 			goto out_unlock;
 		}
 	}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 85c1d30..a0e39ca 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

 /* ioctl.c */
 extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
--
1.5.5.6
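
The control-flow change the patch makes in the "all blocks already mapped"
fast path can be illustrated with a small userspace model. This is only a
sketch under stated assumptions, not kernel code: every name prefixed with
model_ or MODEL_ is hypothetical, the wait is modelled as a predicate loop
that hands control to a stand-in for "other tasks" (here, the thaw), and
nothing in it touches a real filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical freeze levels for the model; larger means "more frozen". */
enum { MODEL_UNFROZEN = 0, MODEL_FREEZE_WRITE = 1 };

struct model_sb {
	int frozen;				/* current freeze level */
	void (*other_tasks)(struct model_sb *);	/* stand-in for the scheduler */
	int sleeps;				/* how often a waiter slept */
};

/* Model of waiting on a frozen superblock: sleep until frozen < level. */
static void model_check_frozen(struct model_sb *sb, int level)
{
	while (sb->frozen >= level) {
		sb->sleeps++;
		sb->other_tasks(sb);	/* someone else runs, e.g. the thaw */
	}
}

/*
 * Model of the page_mkwrite fast path when all blocks are mapped.
 * Returns true if the handler returned (page became writable) while
 * the filesystem was still frozen.
 */
static bool model_mkwrite_mapped(struct model_sb *sb, bool patched)
{
	if (patched)
		model_check_frozen(sb, MODEL_FREEZE_WRITE);
	return sb->frozen >= MODEL_FREEZE_WRITE;
}

/* Stand-in for the task that eventually thaws the filesystem. */
static void model_thaw(struct model_sb *sb)
{
	sb->frozen = MODEL_UNFROZEN;
}

/* Returns 0 if the model behaves as described, 1 otherwise. */
int demo_freeze_model(void)
{
	struct model_sb sb = { MODEL_FREEZE_WRITE, model_thaw, 0 };
	bool leaked_unpatched, leaked_patched;

	/* Without the check, the fast path returns on a frozen fs. */
	leaked_unpatched = model_mkwrite_mapped(&sb, false);
	/* With the check, it sleeps once and returns only after the thaw. */
	leaked_patched = model_mkwrite_mapped(&sb, true);

	return (leaked_unpatched && !leaked_patched && sb.sleeps == 1) ? 0 : 1;
}
```

In the model, the unpatched path lets the fault complete while the
filesystem is still frozen, which corresponds to case [3] discussed earlier
in the thread; the patched path returns only once the freeze level has
dropped.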

2011-04-22 21:26:07

by Peter M. Petrakis

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi All,

On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> Hi,
>
> On Tue, 19 Apr 2011 18:43:16 +0900
> Toshiyuki Okajima <[email protected]> wrote:
>> Hi,
>>
>> (2011/04/18 19:51), Jan Kara wrote:
>>> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>>>> we could possibly modify the filesystem.
>>>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>>>
>>>>>> (minor-pagefault)
>>>>>> -> do_wp_page()
>>>>>> -> page_mkwrite(= ext4_mkwrite())
>>>>>> => BLOCK!
>>>>>>
>>>>>> (major-pagefault)
>>>>>> -> do_linear_fault()
>>>>>> -> page_mkwrite(= ext4_mkwrite())
>>>>>> => BLOCK!
>>>>>>
>>>>>>>
>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>>>> while fsfreezing.
>>>>>>>>> Technically speaking, we block all the transaction starts which means we
>>>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>>>> fsfreeze operation is done...
>>>>>>> I'm sorry I don't understand now - are you speaking about the case above
>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>> else?
>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>>>
>>>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>> while fsfreezing)
>>>>>>
>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>> the page is not allocated yet. (major fault?)
>>>>>>
>>>>>> (1) user dirtys a page
>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>> (3) __do_fault is called.
>>>>>> (4) ext4_page_mkwrite is called
>>>>>> (5) ext4_write_begin is called
>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>>>
>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>>
>>>>>> (1) user dirtys a page
>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>> (3) do_wp_page is called.
>>>>>> (4) ext4_page_mkwrite is called
>>>>>> (5) ext4_write_begin is called
>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>>>
>>>>>> [3] One of the page which has been mmapped is not bound. But
>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>> are mapped (BH_Mapped). (minor fault?)
>>>>>>
>>>>>> (1) user dirtys a page
>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>> (3) do_wp_page is called.
>>>>>> (4) ext4_page_mkwrite is called
>>>>>> * Cannot block the dirty page to be written because all bh is mapped.
>>>>>> (5) user munmaps the page (munmap)
>>>>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>> (7) writeback thread writes the page (struct page) to disk
>>>>>> => We cannot STOP!
>>>>>>
>>>>>> [4] One of the page which has been mmapped is bound. And
>>>>>> the page is already allocated.
>>>>>>
>>>>>> (1) user dirtys a page
>>>>>> ( ) no page fault occurs
>>>>>> (2) user munmaps the page (munmap)
>>>>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>> (4) writeback thread writes the page (struct page) to disk
>>>>>> => We cannot STOP!
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> So, we can block the cases [1], [2].
>>>>>> But I think we cannot block the cases [3], [4] now.
>>>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>>>> But the case [4] is not blocked because no page fault occurs
>>>>>> when we dirty the mmapped page.
>>>>>>
>>>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>>>> I think we must modify the writeback thread to fix the case [4].
>>>>> The trick here is that when we write a page to disk, we write-protect
>>>>> the page (you seem to call this "the page is bound"; I'm not sure why).
>>>> Hm, I want to understand how to write-protect the page under fsfreezing.
>>> Look at what page_mkclean() called from clear_page_dirty_for_io() does...
>> Thanks. I'll read that.
>>
>>>
>>>> But, anyway, I understand we don't need to consider the case [4].
>>> Yes.
>>>
>>>>> So we are guaranteed to receive a minor fault (case [3]) if the user tries to
>>>>> modify a page after we finish writeback while freezing the filesystem.
>>>>> So in principle all we need to do is wait in ext4_page_mkwrite().
>>>> OK. I understand.
>>>> Are there any concrete ideas to fix this?
>>>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
>>> Yes.
>>>
>>>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
>>> Sadly I don't see a simple way to fix this issue for all filesystems at
>>> once. Implementing proper wait in block_page_mkwrite() should fix the issue
>>> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
>>> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
>>> have patches for this already for some time but I have to get to properly
>>> testing them in more exotic conditions like 64k pages...
>> OK. I understand the current status of your work on the problem that
>> data can be written through the mmap path while the filesystem is frozen.
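
The write-protect argument in the exchange above can be sketched as a toy userspace model. This is only an illustration under stated assumptions: every name here (toy_page, toy_fs, toy_writeback, toy_user_store) is hypothetical, nothing is a kernel API, and the real kernel sleeps in ->page_mkwrite() rather than returning a status code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_page {
    bool pte_writable; /* can user space store without taking a fault? */
    bool dirty;        /* does the page still need writeback? */
};

struct toy_fs {
    bool frozen;       /* stands in for the frozen state of the superblock */
};

enum toy_result { TOY_WROTE, TOY_FAULTED_AND_WROTE, TOY_BLOCKED };

/*
 * Models writeback: clear_page_dirty_for_io() calls page_mkclean(),
 * which write-protects the mapping before the page goes to disk.
 */
static void toy_writeback(struct toy_page *p)
{
    p->pte_writable = false;
    p->dirty = false;
}

/* Models a user store to the mmapped page. */
static enum toy_result toy_user_store(struct toy_fs *fs, struct toy_page *p)
{
    if (p->pte_writable) {
        /* Case [4]: the mapping is writable, so no fault is taken. */
        p->dirty = true;
        return TOY_WROTE;
    }
    /* Write-protected: a minor fault reaches ->page_mkwrite(). */
    if (fs->frozen)
        return TOY_BLOCKED; /* the real kernel would sleep until thaw */
    p->pte_writable = true;
    p->dirty = true;
    return TOY_FAULTED_AND_WROTE;
}
```

In this model, once toy_writeback() has run during a freeze, every later store takes the fault path, which is why case [4] does not need separate handling.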
> I have confirmed that the following patch works fine while my reproducer
> or Mizuma-san's is running. With it, a write of data that is mmapped to
> a file is blocked at the page fault while the filesystem is frozen, so
> the data does not reach the disk.
>
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - Data that is mmapped to a file can still be written
> to disk while the filesystem is frozen (ext3/ext4).
> (reported by me)
>
> Please examine this patch.
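
For readers who want to exercise the mmap write path discussed above, here is a minimal userspace sketch. It is not the reproducer mentioned in this thread, which is not shown here. The helper name dirty_mmapped_page is hypothetical; it dirties one page through a shared mapping and forces writeback with msync(). On a frozen filesystem a call like this would be expected to block rather than return.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one page of `path` through a shared
 * mapping, then ask for synchronous writeback.
 * Returns 0 on success, -1 on error.
 */
static int dirty_mmapped_page(const char *path, char fill)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* Make sure the file covers exactly the page we map. */
    if (ftruncate(fd, page) < 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(p, fill, (size_t)page);             /* first store takes a fault */
    int rc = msync(p, (size_t)page, MS_SYNC);  /* push the page to the file */
    munmap(p, (size_t)page);
    close(fd);
    return rc;
}
```

Running this in a loop against a file on the filesystem under test, while another shell freezes and thaws it with fsfreeze, is one way to observe whether stores are held back during the freeze.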

We've recently identified the same root cause in 2.6.32 though the hit rate
is much much higher. The configuration is a SAN ALUA Active/Standby using
multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
when a path comes back into service, as a result of a kpartx invocation on
behalf of this udev rule.

/lib/udev/rules.d/95-kpartx.rules

# Create dm tables for partitions
ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"


Below are the logs of the current incarnation of the fault with your current patch against 2.6.38.
We are still working to obtain a viable crashdump.

[ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)

[ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)

[ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)

[ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)

[ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)

[ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)

[ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)

[ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)

[ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA

[ 1898.578635] device-mapper: multipath: Failing path 8:32.

[ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.

[ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.361891] kjournald D ffff88063acb9a90 0 595 2 0x00000000

[ 2041.369891] ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000

[ 2041.378416] 0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8

[ 2041.386954] ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0

[ 2041.395561] Call Trace:

[ 2041.398358] [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.404342] [<ffffffff815d3120>] io_schedule+0x70/0xc0

[ 2041.410227] [<ffffffff811923c5>] sync_buffer+0x45/0x50

[ 2041.416179] [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90

[ 2041.422258] [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.428275] [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90

[ 2041.435324] [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40

[ 2041.441958] [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30

[ 2041.448333] [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0

[ 2041.455873] [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10

[ 2041.463020] [<ffffffff8107443c>] ? lock_timer_base+0x3c/0x70

[ 2041.469514] [<ffffffff81074e33>] ? try_to_del_timer_sync+0x83/0xe0

[ 2041.476563] [<ffffffff8123df7d>] kjournald+0xed/0x250

[ 2041.482349] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.489624] [<ffffffff8123de90>] ? kjournald+0x0/0x250

[ 2041.495504] [<ffffffff810865e6>] kthread+0x96/0xa0

[ 2041.501003] [<ffffffff8100ce64>] kernel_thread_helper+0x4/0x10

[ 2041.507667] [<ffffffff81086550>] ? kthread+0x0/0xa0

[ 2041.513301] [<ffffffff8100ce60>] ? kernel_thread_helper+0x0/0x10

[ 2041.520247] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2041.527677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.536499] rsyslogd D ffff88063c513170 0 1854 1 0x00000000

[ 2041.544533] ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2041.553108] 0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2041.561691] ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2041.570323] Call Trace:

[ 2041.573108] [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2041.580447] [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2041.587496] [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2041.594489] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.601833] [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2041.608831] [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2041.615165] [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2041.621050] [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2041.628395] [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2041.635827] [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2041.641649] [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2041.647337] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.654091] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2041.661750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.670669] multipathd D ffff88063e3303b0 0 1337 1 0x00000000

[ 2041.678746] ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2041.687219] 0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2041.695818] ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2041.704369] Call Trace:

[ 2041.707128] [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2041.713679] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.719846] [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2041.726301] [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2041.732685] [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2041.739138] [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2041.746079] [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2041.752716] [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2041.759853] [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2041.766503] [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2041.773054] [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2041.778636] [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2041.785121] [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2041.791312] [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2041.797671] [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2041.804413] [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2041.810317] [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2041.816316] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2041.822184] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2041.828188] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2041.834250] [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2041.840219] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2041.845898] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.852639] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2041.859921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.868760] iozone D ffff880c3bc21a90 0 1871 1869 0x00000000

[ 2041.876728] ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2041.885177] 0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2041.893647] ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2041.902112] Call Trace:

[ 2041.906302] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.912494] [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2041.919718] [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2041.925719] [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2041.932690] [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2041.940116] [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2041.945990] [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2041.952149] [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2041.957763] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.980626] kpartx D ffff88063d05df30 0 1897 1896 0x00000000

[ 2041.988607] ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2041.997056] 0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2042.005496] ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2042.013939] Call Trace:

[ 2042.016702] [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2042.023089] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2042.030321] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.036584] [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2042.042552] [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2042.049133] [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2042.055423] [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2042.062944] [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2042.069430] [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2042.075690] [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2042.081754] [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2042.087625] [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2042.093495] [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2042.099948] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.105916] [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2042.111784] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.117753] [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2042.123621] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.129591] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2042.135493] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.141770] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2042.147739] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2042.153801] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2042.159478] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2161.971321] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2161.978798] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2161.987656] rsyslogd D ffff88063c513170 0 1854 1 0x00000000

[ 2161.995718] ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2162.004340] 0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2162.012932] ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2162.021481] Call Trace:

[ 2162.024290] [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2162.031627] [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2162.038711] [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2162.045662] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.053007] [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2162.059962] [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2162.066165] [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2162.072048] [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2162.079387] [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2162.086761] [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2162.092552] [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2162.098247] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.105042] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2162.112667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.121487] multipathd D ffff88063e3303b0 0 1337 1 0x00000000

[ 2162.129517] ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2162.138112] 0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2162.146688] ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2162.155253] Call Trace:

[ 2162.158073] [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2162.164639] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.170886] [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2162.177389] [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2162.183852] [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2162.190317] [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2162.197304] [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2162.203968] [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2162.211111] [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2162.217807] [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2162.224461] [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2162.230054] [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2162.236503] [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2162.242673] [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2162.249079] [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2162.255840] [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2162.261719] [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2162.267701] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.273621] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.279592] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.285710] [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2162.291694] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.297383] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.304169] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2162.311407] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.320229] iozone D ffff880c3bc21a90 0 1871 1869 0x00000000

[ 2162.328317] ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2162.336901] 0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2162.345415] ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2162.353887] Call Trace:

[ 2162.356650] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.362815] [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2162.370042] [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2162.376121] [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2162.383075] [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2162.390575] [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2162.396501] [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2162.402768] [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2162.408360] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.415159] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2162.422493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.431405] kpartx D ffff88063d05df30 0 1897 1896 0x00000000

[ 2162.439440] ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2162.448021] 0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2162.456468] ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2162.464962] Call Trace:

[ 2162.467724] [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2162.474088] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.481319] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.487577] [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2162.493548] [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2162.500107] [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2162.506415] [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2162.513947] [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2162.520514] [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2162.526783] [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2162.532896] [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2162.538819] [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2162.544705] [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2162.551174] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.557160] [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2162.563082] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.569102] [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2162.574987] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.581068] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.586954] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.593217] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.599190] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.605298] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.610990] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2191.336354] Uhhuh. NMI received for unknown reason 21 on CPU 0.

[ 2191.343064] Do you have a strange power saving mode enabled?

[ 2191.349476] Kernel panic - not syncing: NMI: Not continuing

[ 2191.355753] Pid: 0, comm: swapper Not tainted 2.6.38-8-server #43

[ 2191.362593] Call Trace:

[ 2191.365380] <NMI> [<ffffffff815d2083>] ? panic+0x91/0x19e

[ 2191.371779] [<ffffffff815d21f8>] ? printk+0x68/0x70

[ 2191.377381] [<ffffffff815d6333>] ? default_do_nmi+0x1f3/0x200

[ 2191.383929] [<ffffffff815d63c0>] ? do_nmi+0x80/0x90

[ 2191.389526] [<ffffffff815d5b50>] ? nmi+0x20/0x30

[ 2191.394816] [<ffffffff81332d74>] ? intel_idle+0x94/0x120

[ 2191.400897] <<EOE>> [<ffffffff814b3472>] ? cpuidle_idle_call+0xb2/0x1b0

[ 2191.408606] [<ffffffff8100b067>] ? cpu_idle+0xb7/0x110

[ 2191.414497] [<ffffffff815b7682>] ? rest_init+0x72/0x80

[ 2191.420367] [<ffffffff81ae2c95>] ? start_kernel+0x374/0x37b

[ 2191.426780] [<ffffffff81ae2346>] ? x86_64_start_reservations+0x131/0x135

[ 2191.434457] [<ffffffff81ae244d>] ? x86_64_start_kernel+0x103/0x112


Thanks.

Peter




>
> Thanks,
> Toshiyuki Okajima
> ---
> fs/ext3/file.c | 19 ++++++++++++-
> fs/ext3/inode.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/inode.c | 4 ++-
> include/linux/ext3_fs.h | 1 +
> 4 files changed, 93 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
> return 0;
> }
>
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> + .fault = filemap_fault,
> + .page_mkwrite = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct address_space *mapping = file->f_mapping;
> +
> + if (!mapping->a_ops->readpage)
> + return -ENOEXEC;
> + file_accessed(file);
> + vma->vm_ops = &ext3_file_vm_ops;
> + vma->vm_flags |= VM_CAN_NONLINEAR;
> + return 0;
> +}
> +
> const struct file_operations ext3_file_operations = {
> .llseek = generic_file_llseek,
> .read = do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = ext3_compat_ioctl,
> #endif
> - .mmap = generic_file_mmap,
> + .mmap = ext3_file_mmap,
> .open = dquot_file_open,
> .release = ext3_release_file,
> .fsync = ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>
> return err;
> }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct page *page = vmf->page;
> + loff_t size;
> + unsigned long len;
> + int ret = -EINVAL;
> + void *fsdata;
> + struct file *file = vma->vm_file;
> + struct inode *inode = file->f_path.dentry->d_inode;
> + struct address_space *mapping = inode->i_mapping;
> +
> + /*
> + * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> + * get i_mutex because we are already holding mmap_sem.
> + */
> + down_read(&inode->i_alloc_sem);
> + size = i_size_read(inode);
> + if (page->mapping != mapping || size <= page_offset(page)
> + || !PageUptodate(page)) {
> + /* page got truncated from under us? */
> + goto out_unlock;
> + }
> + ret = 0;
> + if (PageMappedToDisk(page))
> + goto out_frozen;
> +
> + if (page->index == size >> PAGE_CACHE_SHIFT)
> + len = size & ~PAGE_CACHE_MASK;
> + else
> + len = PAGE_CACHE_SIZE;
> +
> + lock_page(page);
> + /*
> + * Return if we have all the buffers mapped. This avoids
> + * the need to call write_begin/write_end, which does a
> + * journal_start/journal_stop that can block and take a
> + * long time.
> + */
> + if (page_has_buffers(page)) {
> + if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> + buffer_unmapped)) {
> + unlock_page(page);
> +out_frozen:
> + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> + goto out_unlock;
> + }
> + }
> + unlock_page(page);
> + /*
> + * OK, we need to fill the hole... Do write_begin write_end
> + * to do block allocation/reservation. We are not holding
> + * inode->i_mutex here. That allows parallel write_begin,
> + * write_end calls. lock_page prevents this from happening
> + * on the same page though.
> + */
> + ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> + len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> + if (ret < 0)
> + goto out_unlock;
> + ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> + len, len, page, fsdata);
> + if (ret < 0)
> + goto out_unlock;
> + ret = 0;
> +out_unlock:
> + if (ret)
> + ret = VM_FAULT_SIGBUS;
> + up_read(&inode->i_alloc_sem);
> + return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> }
> ret = 0;
> if (PageMappedToDisk(page))
> - goto out_unlock;
> + goto out_frozen;
>
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> unlock_page(page);
> +out_frozen:
> + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> goto out_unlock;
> }
> }
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
> extern void ext3_set_aops(struct inode *inode);
> extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> /* ioctl.c */
> extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
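
The length computation in the quoted ext3_page_mkwrite() (the `size >> PAGE_CACHE_SHIFT` and `size & ~PAGE_CACHE_MASK` lines) can be checked with a small worked example. This sketch assumes 4 KiB pages; the TOY_ macros and toy_mkwrite_len are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>

/* Assumed 4 KiB pages; stand-ins for PAGE_CACHE_SHIFT/SIZE/MASK. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/*
 * Number of bytes of the page at `index` that lie inside a file of
 * `i_size` bytes, following the patch: the page that contains EOF is
 * covered only up to EOF, every earlier page is covered in full.
 */
static unsigned long toy_mkwrite_len(unsigned long long i_size,
                                     unsigned long index)
{
    if (index == (unsigned long)(i_size >> TOY_PAGE_SHIFT))
        return (unsigned long)(i_size & ~TOY_PAGE_MASK);
    return TOY_PAGE_SIZE;
}
```

For a 10000-byte file, page 2 starts at offset 8192, so only 10000 - 8192 = 1808 bytes are covered, while pages 0 and 1 are covered in full. When i_size is an exact multiple of the page size the formula yields 0 for the page at EOF, but the patch never gets that far: the earlier `size <= page_offset(page)` check has already bailed out.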

2011-04-22 21:40:49

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hello,

On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> > On Tue, 19 Apr 2011 18:43:16 +0900
> > I have confirmed that the following patch works fine while my reproducer
> > or Mizuma-san's is running. With it, a write of data that is mmapped to
> > a file is blocked at the page fault while the filesystem is frozen, so
> > the data does not reach the disk.
> >
> > I think this patch fixes the following two problems:
> > - A deadlock occurs between ext4_da_writepages() (called from
> > writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> > - Data that is mmapped to a file can still be written
> > to disk while the filesystem is frozen (ext3/ext4).
> > (reported by me)
> >
> > Please examine this patch.
>
> We've recently identified the same root cause in 2.6.32 though the hit rate
> is much much higher. The configuration is a SAN ALUA Active/Standby using
> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
> when a path comes back into service, as a result of a kpartx invocation on
> behalf of this udev rule.
>
> /lib/udev/rules.d/95-kpartx.rules
>
> # Create dm tables for partitions
> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
> RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
Hmm, I don't think this is the same problem... See:

> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.361891] kjournald D ffff88063acb9a90 0 595 2 0x00000000
> [ 2041.369891] ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
> [ 2041.378416] 0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
> [ 2041.386954] ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
>
> [ 2041.395561] Call Trace:
> [ 2041.398358] [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.404342] [<ffffffff815d3120>] io_schedule+0x70/0xc0
> [ 2041.410227] [<ffffffff811923c5>] sync_buffer+0x45/0x50
> [ 2041.416179] [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
> [ 2041.422258] [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.428275] [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
> [ 2041.435324] [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
> [ 2041.441958] [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
> [ 2041.448333] [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
So kjournald is committing a transaction and waiting for IO to complete.
Which maybe never happens because of multipath being in transition? That
would be a bug...

> [ 2041.670669] multipathd D ffff88063e3303b0 0 1337 1 0x00000000
> [ 2041.678746] ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
> [ 2041.687219] 0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
> [ 2041.695818] ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
> [ 2041.704369] Call Trace:
> [ 2041.707128] [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
> [ 2041.713679] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
> [ 2041.719846] [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
> [ 2041.726301] [<ffffffff815d2436>] wait_for_common+0xd6/0x180
> [ 2041.732685] [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
> [ 2041.739138] [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
> [ 2041.746079] [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
> [ 2041.752716] [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
> [ 2041.759853] [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
> [ 2041.766503] [<ffffffff8107e060>] __request_module+0x190/0x210
> [ 2041.773054] [<ffffffff812e0c28>] ? sscanf+0x38/0x40
> [ 2041.778636] [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
> [ 2041.785121] [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
> [ 2041.791312] [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
> [ 2041.797671] [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
> [ 2041.804413] [<ffffffff8149de3a>] table_load+0xca/0x2f0
> [ 2041.810317] [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
> [ 2041.816316] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2041.822184] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2041.828188] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2041.834250] [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
> [ 2041.840219] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2041.845898] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
multipathd is hung waiting for module to be loaded? How come?

> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.980626] kpartx D ffff88063d05df30 0 1897 1896 0x00000000
> [ 2041.988607] ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
> [ 2041.997056] 0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
> [ 2042.005496] ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
> [ 2042.013939] Call Trace:
> [ 2042.016702] [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
> [ 2042.023089] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
> [ 2042.030321] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.036584] [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
> [ 2042.042552] [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
> [ 2042.049133] [<ffffffff81115391>] ? do_writepages+0x21/0x40
> [ 2042.055423] [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
> [ 2042.062944] [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
> [ 2042.069430] [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
> [ 2042.075690] [<ffffffff81166a85>] freeze_super+0x55/0x100
> [ 2042.081754] [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
> [ 2042.087625] [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
> [ 2042.093495] [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
> [ 2042.099948] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.105916] [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
> [ 2042.111784] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.117753] [<ffffffff8149e365>] dev_suspend+0x95/0xb0
> [ 2042.123621] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.129591] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2042.135493] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.141770] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2042.147739] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2042.153801] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2042.159478] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
kpartx is waiting for kjournald to finish transaction commit and it is
holding s_umount but that doesn't really seem to be a problem...

So as I say, find a reason why kjournald is not able to finish committing a
transaction and you should solve this riddle ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-22 22:10:28

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
> I have confirmed that the following patch works fine while my or
> Mizuma-san's reproducer is running. Therefore,
> we can block to write the data, which is mmapped to a file, into a disk
> by a page-fault while fsfreezing.
>
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - We can also write the data, which is mmapped to a file,
> into a disk while fsfreezing (ext3/ext4).
> (reported by me)
>
> Please examine this patch.
Thanks for the patch. The ext3 part is not as easy as this. You cannot
really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
held by page fault code and i_alloc_sem should be acquired before it (yes I
know, ext4 already has this bug which should be fixed when I get to it).
Also you'll find that performance of random writers via mmap (which is
relatively common) is going to be rather bad with this patch (because the
file will be heavily fragmented). We have to be more clever which is
exactly why it's taking me so long with my patch :) But tests are already
running so if everything goes fine, I should have patches to submit next
week.
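
The ordering constraint can be made concrete with a small user-space sketch
(not kernel code). The ranks, the toy_ names and the held-lock tracking are
all made up for illustration; the only thing taken from the discussion is
that i_alloc_sem is supposed to be acquired before mmap_sem, while the page
fault path already holds mmap_sem when ->page_mkwrite() runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: a lock with a lower rank must be taken first. */
enum toy_lock { TOY_I_ALLOC_SEM = 1, TOY_MMAP_SEM = 2 };

struct toy_task {
	int highest_held;	/* highest rank currently held, 0 if none */
};

/*
 * Returns true if taking 'lock' respects the ordering, false if it would
 * be an inversion (a lower-ranked lock taken after a higher-ranked one).
 */
static bool toy_acquire(struct toy_task *t, enum toy_lock lock)
{
	assert(lock == TOY_I_ALLOC_SEM || lock == TOY_MMAP_SEM);
	if ((int)lock <= t->highest_held)
		return false;
	t->highest_held = lock;
	return true;
}
```

A task that takes i_alloc_sem and then mmap_sem passes the check; a task
that starts out holding mmap_sem, as the fault handler does, fails it as
soon as it asks for i_alloc_sem.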

The ext4 part looks correct. I'd just also like to have some comments about
how freeze handling is done because it's kind of subtle.

Honza

> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
> return 0;
> }
>
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> + .fault = filemap_fault,
> + .page_mkwrite = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct address_space *mapping = file->f_mapping;
> +
> + if (!mapping->a_ops->readpage)
> + return -ENOEXEC;
> + file_accessed(file);
> + vma->vm_ops = &ext3_file_vm_ops;
> + vma->vm_flags |= VM_CAN_NONLINEAR;
> + return 0;
> +}
> +
> const struct file_operations ext3_file_operations = {
> .llseek = generic_file_llseek,
> .read = do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = ext3_compat_ioctl,
> #endif
> - .mmap = generic_file_mmap,
> + .mmap = ext3_file_mmap,
> .open = dquot_file_open,
> .release = ext3_release_file,
> .fsync = ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>
> return err;
> }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct page *page = vmf->page;
> + loff_t size;
> + unsigned long len;
> + int ret = -EINVAL;
> + void *fsdata;
> + struct file *file = vma->vm_file;
> + struct inode *inode = file->f_path.dentry->d_inode;
> + struct address_space *mapping = inode->i_mapping;
> +
> + /*
> + * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> + * get i_mutex because we are already holding mmap_sem.
> + */
> + down_read(&inode->i_alloc_sem);
> + size = i_size_read(inode);
> + if (page->mapping != mapping || size <= page_offset(page)
> + || !PageUptodate(page)) {
> + /* page got truncated from under us? */
> + goto out_unlock;
> + }
> + ret = 0;
> + if (PageMappedToDisk(page))
> + goto out_frozen;
> +
> + if (page->index == size >> PAGE_CACHE_SHIFT)
> + len = size & ~PAGE_CACHE_MASK;
> + else
> + len = PAGE_CACHE_SIZE;
> +
> + lock_page(page);
> + /*
> + * return if we have all the buffers mapped. This avoid
> + * the need to call write_begin/write_end which does a
> + * journal_start/journal_stop which can block and take
> + * long time
> + */
> + if (page_has_buffers(page)) {
> + if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> + buffer_unmapped)) {
> + unlock_page(page);
> +out_frozen:
> + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> + goto out_unlock;
> + }
> + }
> + unlock_page(page);
> + /*
> + * OK, we need to fill the hole... Do write_begin write_end
> + * to do block allocation/reservation.We are not holding
> + * inode.i__mutex here. That allow * parallel write_begin,
> + * write_end call. lock_page prevent this from happening
> + * on the same page though
> + */
> + ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> + len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> + if (ret < 0)
> + goto out_unlock;
> + ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> + len, len, page, fsdata);
> + if (ret < 0)
> + goto out_unlock;
> + ret = 0;
> +out_unlock:
> + if (ret)
> + ret = VM_FAULT_SIGBUS;
> + up_read(&inode->i_alloc_sem);
> + return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> }
> ret = 0;
> if (PageMappedToDisk(page))
> - goto out_unlock;
> + goto out_frozen;
>
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> unlock_page(page);
> +out_frozen:
> + vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> goto out_unlock;
> }
> }
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
> extern void ext3_set_aops(struct inode *inode);
> extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> /* ioctl.c */
> extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
> --
> 1.5.5.6
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-04-22 22:57:51

by Peter M. Petrakis

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock



On 04/22/2011 05:40 PM, Jan Kara wrote:
> Hello,
>
> On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
>> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
>>> On Tue, 19 Apr 2011 18:43:16 +0900
>>> I have confirmed that the following patch works fine while my or
>>> Mizuma-san's reproducer is running. Therefore,
>>> we can block to write the data, which is mmapped to a file, into a disk
>>> by a page-fault while fsfreezing.
>>>
>>> I think this patch fixes the following two problems:
>>> - A deadlock occurs between ext4_da_writepages() (called from
>>> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
>>> - We can also write the data, which is mmapped to a file,
>>> into a disk while fsfreezing (ext3/ext4).
>>> (reported by me)
>>>
>>> Please examine this patch.
>>
>> We've recently identified the same root cause in 2.6.32 though the hit rate
>> is much much higher. The configuration is a SAN ALUA Active/Standby using
>> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
>> when a path comes back into service, as a result of a kpartx invocation on
>> behalf of this udev rule.
>>
>> /lib/udev/rules.d/95-kpartx.rules
>>
>> # Create dm tables for partitions
>> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
>> RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
> Hmm, I don't think this is the same problem... See:

Figures :)


>> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
>> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
>> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
>> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
>> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
>> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
>> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
>> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
>> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
>> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
>> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
>> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.361891] kjournald D ffff88063acb9a90 0 595 2 0x00000000
>> [ 2041.369891] ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
>> [ 2041.378416] 0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
>> [ 2041.386954] ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
>>
>> [ 2041.395561] Call Trace:
>> [ 2041.398358] [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.404342] [<ffffffff815d3120>] io_schedule+0x70/0xc0
>> [ 2041.410227] [<ffffffff811923c5>] sync_buffer+0x45/0x50
>> [ 2041.416179] [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
>> [ 2041.422258] [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.428275] [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
>> [ 2041.435324] [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
>> [ 2041.441958] [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
>> [ 2041.448333] [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
> So kjournald is committing a transaction and waiting for IO to complete.
> Which maybe never happens because of multipath being in transition? That
> would be a bug...
>

and it would be a new one for us. It's entirely possible the original deadlock
is resolved, and this is new. With only the tracebacks to consult, and general
unfamiliarity with this area, it looked like the same fault to me.
In 2.6.32 it's a dead ringer per the thread parent:

http://permalink.gmane.org/gmane.comp.file-systems.ext4/23171

[Ubuntu 10.04 - 2.6.32 crashdump]

crash-5.0> ps | grep UN
992 2 7 ffff8802678a8000 UN 0.0 0 0 [flush-251:5]
17295 2537 2 ffff880267be0000 UN 0.2 47060 17368 iozone
17314 2477 5 ffff88026a952010 UN 0.2 47060 17364 iozone
17447 2573 0 ffff880268bd2010 UN 0.2 47060 17340 iozone
17460 1 13 ffff88026b3c4020 UN 0.0 191564 1992 rsyslogd
17606 17597 15 ffff880268420000 UN 0.0 10436 808 kpartx
17738 2268 13 ffff88016908a010 UN 0.0 17756 1616 dhclient-script
17747 2223 15 ffff88026a950000 UN 0.0 151460 1596 multipathd
17748 2284 1 ffff88016908c020 UN 0.0 49260 688 sshd
17749 2284 1 ffff880169088000 UN 0.0 49260 692 sshd
17750 2284 1 ffff88016a628000 UN 0.0 49260 688 sshd
17751 2284 0 ffff88026a3cc020 UN 0.0 49260 688 sshd
17752 2284 0 ffff88026a3ca010 UN 0.0 49260 688 sshd
17753 2284 0 ffff88026a3c8000 UN 0.0 49260 688 sshd
17754 2284 0 ffff880268f60000 UN 0.0 49260 692 sshd
17755 2284 0 ffff880268f62010 UN 0.0 49260 688 sshd
crash-5.0> bt 17606
PID: 17606 TASK: ffff880268420000 CPU: 15 COMMAND: "kpartx"
#0 [ffff88026aac3b18] schedule at ffffffff8158bcbd
#1 [ffff88026aac3bd0] rwsem_down_failed_common at ffffffff8158df2d
#2 [ffff88026aac3c30] rwsem_down_write_failed at ffffffff8158e0b3
#3 [ffff88026aac3c70] call_rwsem_down_write_failed at ffffffff812d9903
#4 [ffff88026aac3ce0] thaw_bdev at ffffffff81186d5a
#5 [ffff88026aac3d40] unlock_fs at ffffffff8145e46d
#6 [ffff88026aac3d60] dm_resume at ffffffff8145fb38
#7 [ffff88026aac3db0] do_resume at ffffffff81465c98
#8 [ffff88026aac3de0] dev_suspend at ffffffff81465d65
#9 [ffff88026aac3e20] ctl_ioctl at ffffffff814665f5
#10 [ffff88026aac3e90] dm_ctl_ioctl at ffffffff814666a3
#11 [ffff88026aac3ea0] vfs_ioctl at ffffffff81165e92
#12 [ffff88026aac3ee0] do_vfs_ioctl at ffffffff81166140
#13 [ffff88026aac3f30] sys_ioctl at ffffffff811664b1
#14 [ffff88026aac3f80] system_call_fastpath at ffffffff810131b2
RIP: 00007fa798b04197 RSP: 00007fff4cf1c6e8 RFLAGS: 00010202
RAX: 0000000000000010 RBX: ffffffff810131b2 RCX: 0000000000000000
RDX: 0000000000bcf310 RSI: 00000000c138fd06 RDI: 0000000000000004
RBP: 0000000000bcf340 R8: 00007fa798dc2528 R9: 00007fff4cf1c640
R10: 00007fa798dc1dc0 R11: 0000000000000246 R12: 00007fa798dc1dc0
R13: 0000000000004000 R14: 0000000000bce0f0 R15: 00007fa798dc1dc0
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
crash-5.0> bt 992

PID: 992 TASK: ffff8802678a8000 CPU: 7 COMMAND: "flush-251:5"
#0 [ffff880267bddb00] schedule at ffffffff8158bcbd
#1 [ffff880267bddbb8] ext4_force_commit at ffffffff8120b16d
#2 [ffff880267bddc18] ext4_write_inode at ffffffff811f29e5
#3 [ffff880267bddc68] writeback_single_inode at ffffffff81178964
#4 [ffff880267bddcb8] writeback_sb_inodes at ffffffff81178f09
#5 [ffff880267bddd18] wb_writeback at ffffffff8117995c
#6 [ffff880267bdddc8] wb_do_writeback at ffffffff81179b6b
#7 [ffff880267bdde58] bdi_writeback_task at ffffffff81179cc3
#8 [ffff880267bdde98] bdi_start_fn at ffffffff8111e816
#9 [ffff880267bddec8] kthread at ffffffff81088a06
#10 [ffff880267bddf48] kernel_thread at ffffffff810142ea

crash-5.0> super_block.s_frozen ffff880268a4e000
s_frozen = 0x2,

int ext4_force_commit(struct super_block *sb)
{
journal_t *journal;
int ret = 0;

if (sb->s_flags & MS_RDONLY)
return 0;

journal = EXT4_SB(sb)->s_journal;
if (journal) {
vfs_check_frozen(sb, SB_FREEZE_TRANS); <=== this is where it sleeps
ret = ext4_journal_force_commit(journal);
}

return ret;
}


I have tried the previous versions of the patch, backporting
to 2.6.32 without any success. I thought I would just go for it
this time with the latest.
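
For reference, the wait cycle can be modelled in a few lines of user-space C.
This is only a sketch under one assumption taken from the original report:
the writeback thread holds s_umount for read while it sleeps waiting for the
filesystem to be thawed. The toy_ names and the struct are hypothetical and
do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_sb {
	int  umount_readers;	/* tasks holding s_umount for read */
	bool frozen;		/* stands in for s_frozen != SB_UNFROZEN */
};

/*
 * Writeback side: take s_umount for read, then check the frozen state.
 * Returns true if the flusher goes to sleep while still holding s_umount.
 */
static bool toy_flusher_enter(struct toy_sb *sb)
{
	sb->umount_readers++;
	if (sb->frozen)
		return true;	/* sleeps in the vfs_check_frozen() analogue */
	sb->umount_readers--;	/* work done, lock dropped */
	return false;
}

/*
 * Thaw side: needs s_umount for write before it may clear 'frozen'.
 * Returns true if the thaw went through, false if it is stuck behind a
 * reader.
 */
static bool toy_thaw(struct toy_sb *sb)
{
	assert(sb->umount_readers >= 0);
	if (sb->umount_readers > 0)
		return false;
	sb->frozen = false;
	return true;
}
```

With the filesystem frozen, one call to toy_flusher_enter() leaves a reader
parked on s_umount, and every later toy_thaw() fails: the flusher waits for
the thaw and the thaw waits for the flusher.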


>> [ 2041.670669] multipathd D ffff88063e3303b0 0 1337 1 0x00000000
>> [ 2041.678746] ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
>> [ 2041.687219] 0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
>> [ 2041.695818] ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
>> [ 2041.704369] Call Trace:
>> [ 2041.707128] [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
>> [ 2041.713679] [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
>> [ 2041.719846] [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
>> [ 2041.726301] [<ffffffff815d2436>] wait_for_common+0xd6/0x180
>> [ 2041.732685] [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
>> [ 2041.739138] [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
>> [ 2041.746079] [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
>> [ 2041.752716] [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
>> [ 2041.759853] [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
>> [ 2041.766503] [<ffffffff8107e060>] __request_module+0x190/0x210
>> [ 2041.773054] [<ffffffff812e0c28>] ? sscanf+0x38/0x40
>> [ 2041.778636] [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
>> [ 2041.785121] [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
>> [ 2041.791312] [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
>> [ 2041.797671] [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
>> [ 2041.804413] [<ffffffff8149de3a>] table_load+0xca/0x2f0
>> [ 2041.810317] [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
>> [ 2041.816316] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2041.822184] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2041.828188] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2041.834250] [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
>> [ 2041.840219] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2041.845898] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
> multipathd is hung waiting for module to be loaded? How come?

It shouldn't, dh_alua is already loaded.


>> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
>> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.980626] kpartx D ffff88063d05df30 0 1897 1896 0x00000000
>> [ 2041.988607] ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
>> [ 2041.997056] 0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
>> [ 2042.005496] ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
>> [ 2042.013939] Call Trace:
>> [ 2042.016702] [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
>> [ 2042.023089] [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
>> [ 2042.030321] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.036584] [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
>> [ 2042.042552] [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
>> [ 2042.049133] [<ffffffff81115391>] ? do_writepages+0x21/0x40
>> [ 2042.055423] [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
>> [ 2042.062944] [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
>> [ 2042.069430] [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
>> [ 2042.075690] [<ffffffff81166a85>] freeze_super+0x55/0x100
>> [ 2042.081754] [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
>> [ 2042.087625] [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
>> [ 2042.093495] [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
>> [ 2042.099948] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.105916] [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
>> [ 2042.111784] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.117753] [<ffffffff8149e365>] dev_suspend+0x95/0xb0
>> [ 2042.123621] [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.129591] [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2042.135493] [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.141770] [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2042.147739] [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2042.153801] [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2042.159478] [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
> kpartx is waiting for kjournald to finish transaction commit and it is
> holding s_umount but that doesn't really seem to be a problem...
>
> So as I say, find a reason why kjournald is not able to finish committing a
> transaction and you should solve this riddle ;).

Cool, thanks!

Peter

>
> Honza

2011-04-25 08:03:33

by Toshiyuki Okajima

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi.

On Sat, 23 Apr 2011 00:10:25 +0200
Jan Kara <[email protected]> wrote:
> On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
> > I have confirmed that the following patch works fine while my or
> > Mizuma-san's reproducer is running. Therefore,
> > we can block to write the data, which is mmapped to a file, into a disk
> > by a page-fault while fsfreezing.
> >
> > I think this patch fixes the following two problems:
> > - A deadlock occurs between ext4_da_writepages() (called from
> > writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> > - We can also write the data, which is mmapped to a file,
> > into a disk while fsfreezing (ext3/ext4).
> > (reported by me)
> >
> > Please examine this patch.
> Thanks for the patch. The ext3 part is not as easy as this. You cannot
> really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
> held by page fault code and i_alloc_sem should be acquired before it (yes I
> know, ext4 already has this bug which should be fixed when I get to it).
> Also you'll find that performance of random writers via mmap (which is
> relatively common) is going to be rather bad with this patch (because the
> file will be heavily fragmented). We have to be more clever which is
> exactly why it's taking me so long with my patch :) But tests are already
> running so if everything goes fine, I should have patches to submit next
> week.
OK, I'll wait for your patch. :)

>
> The ext4 part looks correct. I'd just also like to have some comments about
> how freeze handling is done because it's kind of subtle.

How about this?

Thanks,
Toshiyuki Okajima

----------------------------------------------------------------------------------------------------
Subject: [PATCH] ext4: prevent the mmapped page flushing to disk while fsfreezing

Signed-off-by: Toshiyuki Okajima <[email protected]>
---
fs/ext4/inode.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..411b177 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
}
ret = 0;
if (PageMappedToDisk(page))
- goto out_unlock;
+ goto out_frozen;

if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
@@ -5830,6 +5830,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
unlock_page(page);
+out_frozen:
+ /*
+ * We must wait here while the filesystem is being
+ * frozen otherwise a flushing thread can write this
+ * page to the disk (we can update the filesystem even
+ * if it is frozen).
+ */
+ vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
goto out_unlock;
}
}
--
1.5.5.6

2011-05-02 09:08:16

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Hi,

On 04/06/2011 02:21 PM, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>> nothing needs to be done to the writeback path because there is
>>>>> nothing dirty for it to write back.
>>>> Sure but that's only the problem he was able to hit. But generally,
>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>> clear there aren't other code paths which can block with s_umount held
>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>
>>> Holding the s_umount lock while checking if frozen and sleeping
>>> is essentially an ABBA lock inversion bug that can bite in many more
>>> places that just thawing the filesystem. Any where this is done should
>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>> path is sufficient to avoid problems.
>> That's easily said but hard to do - any transaction start in ext3/4 may
>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>> looking into the code) and transaction start traditionally nests inside
>> s_umount (and basically there's no way around that since sync() calls your
>> fs code with s_umount held).
>
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesytem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

I had a look at the XFS code to see how this is done.
xfs_file_aio_write()
xfs_wait_for_freeze()
vfs_check_frozen()
So xfs_file_aio_write() writes to buffers when the FS is not frozen.

Now, I want to know what stops the following scenario from happening:
--------------------
xfs_file_aio_write()
xfs_wait_for_freeze()
vfs_check_frozen()
At this point the FS was not frozen, so the next instruction in
xfs_file_aio_write() will be executed.
However, at this point (i.e. after checking whether the FS is frozen) the write
process gets pre-empted and, say, the _freeze_ process gets control.

Now the FS freezes and the write process gets control back. And so
we end up writing to the page cache while the FS is frozen.
--------------------

Can anyone please enlighten me on how & why this pre-emption is _not_
possible?

If this pre-emption is _possible_, then can we use sb->s_umount to
prevent a freeze from happening while a write to the page cache buffers
is going on? E.g.:

* Before writing to the buffers in the page cache:

down_write(&sb->s_umount);
if (sb->s_frozen == SB_FREEZE_WRITE) {
// do not sleep while holding the sb->s_umount semaphore.
up_write(&sb->s_umount);
vfs_check_frozen(sb, SB_FREEZE_WRITE);
// if you are here then the fs is now thawed.
down_write(&sb->s_umount);
}


Thanks!


Warm Regards,
Surbhi.




>
>> So I'm afraid we are not going to get rid of
>> this ABBA dependency unless we declare that s_umount ranks above filesystem
>> being frozen - but surely I'm open to suggestions.
>
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
>
>> Another possibility is just to hide the problem e.g. by checking for frozen
>> filesystem whenever we try to get s_umount. But that looks a bit ugly to
>> me.
>
> And not necessary, AFAICT.
>
> Cheers,
>
> Dave.


2011-05-02 10:56:33

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>nothing needs to be done to the writeback path because there is
> >>>>>nothing dirty for it to write back.
> >>>> Sure but that's only the problem he was able to hit. But generally,
> >>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>clear there aren't other code paths which can block with s_umount held
> >>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>
> >>>Holding the s_umount lock while checking if frozen and sleeping
> >>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>places that just thawing the filesystem. Any where this is done should
> >>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>path is sufficient to avoid problems.
> >> That's easily said but hard to do - any transaction start in ext3/4 may
> >>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>looking into the code) and transaction start traditionally nests inside
> >>s_umount (and basically there's no way around that since sync() calls your
> >>fs code with s_umount held).
> >
> >Sure, but the question must be asked - why is ext3/4 even starting a
> >transaction on a clean filesystem during sync? A frozen filesystem,
> >by definition, is a clean filesytem, and therefore sync calls of any
> >kind should not be trying to write to the FS or start transactions.
> >XFS does this just fine, so I'd consider such behaviour on a frozen
> >filesystem a bug in ext3/4...
>
> I had a look at the xfs code for seeing how this is done.
> xfs_file_aio_write()
> xfs_wait_for_freeze()
> vfs_check_frozen()
> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>
> Now, I want to know what stops the following scenario from happening:
> --------------------
> xfs_file_aio_write()
> xfs_wait_for_freeze()
> vfs_check_frozen()
> At this point F.S was not frozen, so the next instruction in the
> xfs_file_aio_write() will be executed next.
> However at this point (i.e after checking if F.S is frozen) the
> write process gets pre-empted and say the _freeze_ process gets
> control.
>
> Now the F.S freezes and the write process gets the control back. And
> so we end up writing to the page cache when the F.S is frozen.
> --------------------
>
> Can anyone please enlighten me on how & why this premption is _not_
> possible?
XFS works similarly to ext4 in this regard, I believe. They have the log
frozen in xfs_freeze(), so if the race you describe above happens, either
the writing process gets caught waiting for the log to unfreeze, or it manages
to start a transaction and then the freezing process waits for that transaction
to finish before it can proceed with freezing. I'm not sure why the check in
xfs_file_aio_write() is there...
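
The two outcomes can be sketched as a small user-space state machine. This
is a toy model, not the XFS or jbd2 code: the struct and the toy_ names are
hypothetical, and the mapping to what the real log or journal does when it
is frozen is an assumption. The point is only that the early frozen check in
the write path does not need to be race-free as long as starting a
transaction and freezing the journal exclude each other.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_journal {
	bool barrier;		/* set by the freezer: no new handles */
	int  active_handles;	/* transaction handles still running */
};

/*
 * Writer side. Returns true if a handle was started, false if the writer
 * has to wait for the journal to be unfrozen.
 */
static bool toy_start_handle(struct toy_journal *j)
{
	if (j->barrier)
		return false;
	j->active_handles++;
	return true;
}

static void toy_stop_handle(struct toy_journal *j)
{
	assert(j->active_handles > 0);
	j->active_handles--;
}

/* Freezer side, step 1: raise the barrier so no new handle can start. */
static void toy_freeze_begin(struct toy_journal *j)
{
	j->barrier = true;
}

/* Freezer side, step 2: the freeze completes only once handles drained. */
static bool toy_freeze_can_complete(const struct toy_journal *j)
{
	return j->barrier && j->active_handles == 0;
}
```

If the writer wins the race, toy_freeze_can_complete() stays false until
toy_stop_handle() runs; if the freezer wins, toy_start_handle() refuses the
writer. In neither ordering does a modification reach the journal after the
freeze has completed.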

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 11:27:51

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 01:56 PM, Jan Kara wrote:
> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>> nothing dirty for it to write back.
>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>> places that just thawing the filesystem. Any where this is done should
>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>> path is sufficient to avoid problems.
>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>> looking into the code) and transaction start traditionally nests inside
>>>> s_umount (and basically there's no way around that since sync() calls your
>>>> fs code with s_umount held).
>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>> by definition, is a clean filesytem, and therefore sync calls of any
>>> kind should not be trying to write to the FS or start transactions.
>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>> filesystem a bug in ext3/4...
>> I had a look at the xfs code for seeing how this is done.
>> xfs_file_aio_write()
>> xfs_wait_for_freeze()
>> vfs_check_frozen()
>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>
>> Now, I want to know what stops the following scenario from happening:
>> --------------------
>> xfs_file_aio_write()
>> xfs_wait_for_freeze()
>> vfs_check_frozen()
>> At this point F.S was not frozen, so the next instruction in the
>> xfs_file_aio_write() will be executed next.
>> However at this point (i.e after checking if F.S is frozen) the
>> write process gets pre-empted and say the _freeze_ process gets
>> control.
>>
>> Now the F.S freezes and the write process gets the control back. And
>> so we end up writing to the page cache when the F.S is frozen.
>> --------------------
>>
>> Can anyone please enlighten me on how& why this premption is _not_
>> possible?
Thanks for your reply.
> XFS works similarly as ext4 in this regard I believe. They have the log
> frozen in xfs_freeze() so if the race you describe above happens, either
> the writing process gets caught waiting for log to unfreeze
Agreed.
> or it manages
> to start a transaction and then freezing process waits for transaction to
> finish before it can proceed with freezing. I'm not sure why is there the
> check in xfs_file_aio_write()...
>
>
I am sorry, but I don't understand how this will happen, i.e. I can't
understand what stops freeze_super() (or ext4_freeze()) from freezing a
superblock, given that the write process stopped just before writing
anything for this transaction and has not taken any locks.

Thanks!

Warm Regards,
Surbhi.

2011-05-02 12:06:55

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 02:27 PM, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>> nothing dirty for it to write back.
>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>> there's a problem with needing s_umount for unfreezing because it
>>>>>>> isn't
>>>>>>> clear there aren't other code paths which can block with s_umount
>>>>>>> held
>>>>>>> waiting for fs to get unfrozen. And these code paths would cause
>>>>>>> the same
>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>> places that just thawing the filesystem. Any where this is done
>>>>>> should
>>>>>> be fixed, so I don't think just removing the s_umount lock from
>>>>>> the thaw
>>>>>> path is sufficient to avoid problems.
>>>>> That's easily said but hard to do - any transaction start in ext3/4
>>>>> may
>>>>> block on filesystem being frozen (this seems to be similar for XFS
>>>>> as I'm
>>>>> looking into the code) and transaction start traditionally nests
>>>>> inside
>>>>> s_umount (and basically there's no way around that since sync()
>>>>> calls your
>>>>> fs code with s_umount held).
>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>> kind should not be trying to write to the FS or start transactions.
>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>> filesystem a bug in ext3/4...
>>> I had a look at the xfs code for seeing how this is done.
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>
>>> Now, I want to know what stops the following scenario from happening:
>>> --------------------
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> At this point F.S was not frozen, so the next instruction in the
>>> xfs_file_aio_write() will be executed next.
>>> However at this point (i.e after checking if F.S is frozen) the
>>> write process gets pre-empted and say the _freeze_ process gets
>>> control.
>>>
>>> Now the F.S freezes and the write process gets the control back. And
>>> so we end up writing to the page cache when the F.S is frozen.
>>> --------------------
>>>
>>> Can anyone please enlighten me on how& why this premption is _not_
>>> possible?
> Thanks for your reply.
>> XFS works similarly as ext4 in this regard I believe. They have the log
>> frozen in xfs_freeze() so if the race you describe above happens, either
>> the writing process gets caught waiting for log to unfreeze
> Agreed.
>> or it manages
>> to start a transaction and then freezing process waits for transaction to
>> finish before it can proceed with freezing. I'm not sure why is there the
>> check in xfs_file_aio_write()...
>>
>>
> I am sorry, but I don't understand how this will happen - i.e I can't
> understand what stops freeze_super() (or ext4_freeze) from freezing a
> superblock (as the write process stopped just before writing anything
> for this transaction and has not taken any locks?)

To make myself a little more coherent:

freeze_super()
  ext4_freeze()
    1) jbd2_journal_lock_updates(journal)
    2) jbd2_journal_flush(journal)
    3) jbd2_journal_unlock_updates(journal)
    4) return

Say the fs write process stopped just after checking that the fs is not
frozen (i.e. it's thawed), so it's ready to write to the page cache. Just
after it has finished this vfs_check_frozen() and before it starts any
write (or transaction), say the write process gets pre-empted and then
the freeze process freezes the superblock. Won't ext4_freeze() simply
lock the current transactions, flush them to the log and then unlock the
transactions (so that new handles/transactions can be accepted later)?

So then, after fsfreeze finishes freezing the F.S, say the write
process gets control back. The write process assumes that once it's
out of vfs_check_frozen(), the fs is thawed (or unfrozen), whereas in
this case it is not.

So I don't understand _what_ stops the writing process from starting a
transaction in this case, when the F.S is already frozen,
and how fsfreeze can wait for the write process when it
has not yet started the write.
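The interleaving described above can be replayed deterministically in user space. The sketch below is only a model under stated assumptions: every `race_*` name is a hypothetical stand-in rather than a kernel API, and the freezer is reduced to setting a flag once its lock/flush/unlock sequence is done.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of the race_* names exist in the kernel. */
struct race_sb {
    bool frozen;              /* stands in for "s_frozen is not SB_UNFROZEN" */
    int  writes_while_frozen; /* counts writes that slipped past the check */
};

/* Writer, step 1: the unlocked test that vfs_check_frozen() performs. */
static bool race_check_not_frozen(const struct race_sb *sb)
{
    return !sb->frozen;
}

/* Freezer: lock updates, flush, unlock updates; only the flag stays set. */
static void race_freeze(struct race_sb *sb)
{
    sb->frozen = true;
}

/* Writer, step 2: dirty the page cache or start a handle. */
static void race_write(struct race_sb *sb)
{
    if (sb->frozen)
        sb->writes_while_frozen++;
}

/* Replay: check passes, writer is preempted, freeze completes, writer resumes. */
static int race_replay(void)
{
    struct race_sb sb = { .frozen = false, .writes_while_frozen = 0 };
    bool may_write = race_check_not_frozen(&sb);

    race_freeze(&sb);       /* runs inside the preemption window */
    if (may_write)
        race_write(&sb);    /* acts on a stale answer */
    return sb.writes_while_frozen;
}
```

In this model `race_replay()` reports one write landing on a frozen filesystem, because nothing ties the check to the write that follows it.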

Warm Regards,
Surbhi.






2011-05-02 12:20:58

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
> >On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>nothing dirty for it to write back.
> >>>>>> Sure but that's only the problem he was able to hit. But generally,
> >>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>places that just thawing the filesystem. Any where this is done should
> >>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>path is sufficient to avoid problems.
> >>>> That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>looking into the code) and transaction start traditionally nests inside
> >>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>fs code with s_umount held).
> >>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>kind should not be trying to write to the FS or start transactions.
> >>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>filesystem a bug in ext3/4...
> >>I had a look at the xfs code for seeing how this is done.
> >>xfs_file_aio_write()
> >> xfs_wait_for_freeze()
> >> vfs_check_frozen()
> >>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>
> >>Now, I want to know what stops the following scenario from happening:
> >>--------------------
> >>xfs_file_aio_write()
> >> xfs_wait_for_freeze()
> >> vfs_check_frozen()
> >>At this point F.S was not frozen, so the next instruction in the
> >>xfs_file_aio_write() will be executed next.
> >>However at this point (i.e after checking if F.S is frozen) the
> >>write process gets pre-empted and say the _freeze_ process gets
> >>control.
> >>
> >>Now the F.S freezes and the write process gets the control back. And
> >>so we end up writing to the page cache when the F.S is frozen.
> >>--------------------
> >>
> >>Can anyone please enlighten me on how& why this premption is _not_
> >>possible?
> Thanks for your reply.
> > XFS works similarly as ext4 in this regard I believe. They have the log
> >frozen in xfs_freeze() so if the race you describe above happens, either
> >the writing process gets caught waiting for log to unfreeze
> Agreed.
> > or it manages
> >to start a transaction and then freezing process waits for transaction to
> >finish before it can proceed with freezing. I'm not sure why is there the
> >check in xfs_file_aio_write()...
> >
> >
> I am sorry, but I don't understand how this will happen - i.e I
> can't understand what stops freeze_super() (or ext4_freeze) from
> freezing a superblock (as the write process stopped just before
> writing anything for this transaction and has not taken any locks?)
So ext4_freeze() does
  jbd2_journal_lock_updates(journal)
which waits for all running transactions to finish and updates
j_barrier_count, which stops any new ones from proceeding (check
function start_this_handle()).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 12:30:37

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 03:20 PM, Jan Kara wrote:
> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>> nothing dirty for it to write back.
>>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>> places that just thawing the filesystem. Any where this is done should
>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>> path is sufficient to avoid problems.
>>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>> fs code with s_umount held).
>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>> kind should not be trying to write to the FS or start transactions.
>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>> filesystem a bug in ext3/4...
>>>> I had a look at the xfs code for seeing how this is done.
>>>> xfs_file_aio_write()
>>>> xfs_wait_for_freeze()
>>>> vfs_check_frozen()
>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>
>>>> Now, I want to know what stops the following scenario from happening:
>>>> --------------------
>>>> xfs_file_aio_write()
>>>> xfs_wait_for_freeze()
>>>> vfs_check_frozen()
>>>> At this point F.S was not frozen, so the next instruction in the
>>>> xfs_file_aio_write() will be executed next.
>>>> However at this point (i.e after checking if F.S is frozen) the
>>>> write process gets pre-empted and say the _freeze_ process gets
>>>> control.
>>>>
>>>> Now the F.S freezes and the write process gets the control back. And
>>>> so we end up writing to the page cache when the F.S is frozen.
>>>> --------------------
>>>>
>>>> Can anyone please enlighten me on how& why this premption is _not_
>>>> possible?
>> Thanks for your reply.
>>> XFS works similarly as ext4 in this regard I believe. They have the log
>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>> the writing process gets caught waiting for log to unfreeze
>> Agreed.
>>> or it manages
>>> to start a transaction and then freezing process waits for transaction to
>>> finish before it can proceed with freezing. I'm not sure why is there the
>>> check in xfs_file_aio_write()...
>>>
>>>
>> I am sorry, but I don't understand how this will happen - i.e I
>> can't understand what stops freeze_super() (or ext4_freeze) from
>> freezing a superblock (as the write process stopped just before
>> writing anything for this transaction and has not taken any locks?)
> So ext4_freeze() does
> jbd2_journal_lock_updates(journal)
> which waits for all running transactions to finish and updates
> j_barrier_count which stops any news ones from proceeding (check
> function start_this_handle()).
>
Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal)
before it returns, which decrements the j_barrier_count (previously
incremented in jbd2_journal_lock_updates()). So after this call a new
transaction/handle can be accepted/started.

A comment in ext4_freeze(), just before the call to
jbd2_journal_unlock_updates(), says:
/* we rely on s_frozen to stop further updates */
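To make the barrier argument concrete, here is a minimal user-space sketch of the counter behaviour being discussed. It is an assumption-laden model, not jbd2 code: the `bar_*` names are hypothetical, blocking is represented by a `false` return, and the wait for running handles inside the real lock call is only noted in a comment.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the journal barrier; not kernel code. */
struct bar_journal {
    int barrier_count; /* models journal->j_barrier_count */
    int open_handles;  /* models handles attached to the running transaction */
};

/* Models the gate in start_this_handle(): false means "would sleep". */
static bool bar_try_start_handle(struct bar_journal *j)
{
    if (j->barrier_count > 0)
        return false;
    j->open_handles++;
    return true;
}

static void bar_stop_handle(struct bar_journal *j)
{
    j->open_handles--;
}

/* Models jbd2_journal_lock_updates(): raise the barrier. The real call
 * then sleeps until the open handles have drained. */
static void bar_lock_updates(struct bar_journal *j)
{
    j->barrier_count++;
}

/* Models jbd2_journal_unlock_updates(): drop the barrier again. */
static void bar_unlock_updates(struct bar_journal *j)
{
    j->barrier_count--;
}

/* Models the freeze sequence discussed here: lock, flush, unlock. */
static void bar_freeze(struct bar_journal *j)
{
    bar_lock_updates(j);
    /* the journal flush would happen here, with no handles open */
    bar_unlock_updates(j);
}
```

While the barrier is raised, `bar_try_start_handle()` refuses new handles; once `bar_freeze()` has returned, the count is back to zero and the same call succeeds, which is the gap pointed out above.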

Warm Regards,
Surbhi.

2011-05-02 13:16:19

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
> On 05/02/2011 03:20 PM, Jan Kara wrote:
> >On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> >>On 05/02/2011 01:56 PM, Jan Kara wrote:
> >>>On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>>>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>>>nothing dirty for it to write back.
> >>>>>>>> Sure but that's only the problem he was able to hit. But generally,
> >>>>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>>>places that just thawing the filesystem. Any where this is done should
> >>>>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>>>path is sufficient to avoid problems.
> >>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>>>looking into the code) and transaction start traditionally nests inside
> >>>>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>>>fs code with s_umount held).
> >>>>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>>>kind should not be trying to write to the FS or start transactions.
> >>>>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>>>filesystem a bug in ext3/4...
> >>>>I had a look at the xfs code for seeing how this is done.
> >>>>xfs_file_aio_write()
> >>>> xfs_wait_for_freeze()
> >>>> vfs_check_frozen()
> >>>>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>>>
> >>>>Now, I want to know what stops the following scenario from happening:
> >>>>--------------------
> >>>>xfs_file_aio_write()
> >>>> xfs_wait_for_freeze()
> >>>> vfs_check_frozen()
> >>>>At this point F.S was not frozen, so the next instruction in the
> >>>>xfs_file_aio_write() will be executed next.
> >>>>However at this point (i.e after checking if F.S is frozen) the
> >>>>write process gets pre-empted and say the _freeze_ process gets
> >>>>control.
> >>>>
> >>>>Now the F.S freezes and the write process gets the control back. And
> >>>>so we end up writing to the page cache when the F.S is frozen.
> >>>>--------------------
> >>>>
> >>>>Can anyone please enlighten me on how& why this premption is _not_
> >>>>possible?
> >>Thanks for your reply.
> >>> XFS works similarly as ext4 in this regard I believe. They have the log
> >>>frozen in xfs_freeze() so if the race you describe above happens, either
> >>>the writing process gets caught waiting for log to unfreeze
> >>Agreed.
> >>> or it manages
> >>>to start a transaction and then freezing process waits for transaction to
> >>>finish before it can proceed with freezing. I'm not sure why is there the
> >>>check in xfs_file_aio_write()...
> >>>
> >>>
> >>I am sorry, but I don't understand how this will happen - i.e I
> >>can't understand what stops freeze_super() (or ext4_freeze) from
> >>freezing a superblock (as the write process stopped just before
> >>writing anything for this transaction and has not taken any locks?)
> > So ext4_freeze() does
> >jbd2_journal_lock_updates(journal)
> > which waits for all running transactions to finish and updates
> >j_barrier_count which stops any news ones from proceeding (check
> >function start_this_handle()).
> >
> Yes, but ext4_freeze() also calls
> jbd2_journal_unlock_updates(journal) which decrements the
> j_barrier_count (which was previously updated/incremented in
> jbd2_journal_lock_updates) ? before it returns. So after this call a
> new transaction/handle can be accepted/started.
>
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())
Ah, drat, you're right. I've missed this other part. It's the problem
that if you expect to see something, you'll see it regardless of the real
code ;).

The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
it's still racy (although the race window is relatively small) because the
filesystem can become frozen the instant after we check vfs_check_frozen().
Commit 6b0310fb broke it for ext4.

I guess the code was mostly copied from XFS, which seems to have had the
same problem in xfs_trans_alloc() since the beginning of git history. I see
two ways to fix this. One is to fix ext4/xfs to check s_frozen after
starting a transaction: if the filesystem is being frozen, we stop the
transaction, wait for the fs to get unfrozen, and restart. The other is
to create analogous logic using an atomic counter of write ops in the vfs
that could be used by all filesystems. We'd just have to replace
vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
appropriate places...
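The second option can be sketched in user space with C11 atomics. This is a hedged illustration only: vfs_start_write() and vfs_stop_write() are names proposed in this mail, not existing functions, and the `wc_*` helpers below are hypothetical. Sleeping is modelled by returning `false`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a per-superblock write counter; not kernel code. */
struct wc_sb {
    atomic_int  writers;  /* write operations currently in flight */
    atomic_bool freezing; /* set from the start of a freeze until thaw */
};

static void wc_init(struct wc_sb *sb)
{
    atomic_init(&sb->writers, 0);
    atomic_init(&sb->freezing, false);
}

/* Would-be vfs_start_write(): announce the write first, then test the flag.
 * Returns false where a real implementation would sleep until thaw. */
static bool wc_try_start_write(struct wc_sb *sb)
{
    atomic_fetch_add(&sb->writers, 1);
    if (atomic_load(&sb->freezing)) {
        atomic_fetch_sub(&sb->writers, 1); /* back off, freeze wins */
        return false;
    }
    return true;
}

/* Would-be vfs_stop_write(): the write operation is finished. */
static void wc_stop_write(struct wc_sb *sb)
{
    atomic_fetch_sub(&sb->writers, 1);
}

/* Freezer, step 1: set the flag so that no new writer gets in. */
static void wc_begin_freeze(struct wc_sb *sb)
{
    atomic_store(&sb->freezing, true);
}

/* Freezer, step 2: a real freezer would sleep until this holds. */
static bool wc_writers_drained(struct wc_sb *sb)
{
    return atomic_load(&sb->writers) == 0;
}

static void wc_thaw(struct wc_sb *sb)
{
    atomic_store(&sb->freezing, false);
}
```

The ordering is the point of the design: the writer increments before it reads the flag, and the freezer sets the flag before it reads the counter, so with sequentially consistent atomics at least one side notices the other.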

Dave, Christoph, any opinions on this?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 13:22:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> Dave, Christoph, any opinions on this?

The busyloop in xfs_quiesce_attr which waits for all active transactions
to finish is supposed to fix this issue.

Note that XFS traditionally expects a two-stage freeze process where
we first freeze new VFS-level writes, then flush the caches and then
stop transactions, wait for them to finish and do the remainder of
the freeze process, but I really messed that process up when moving
the sequence to generic code. Funnily enough it seems to work
nevertheless.
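The two-stage sequence can be pictured as two freeze levels that different callers test against. The sketch below is a user-space model with hypothetical `stage_*` names; the assumption that the levels line up with the write and transaction levels of s_frozen in kernels of this period is mine, not something stated in this mail.

```c
#include <stdbool.h>

/* Model freeze levels; assumed to mirror the s_frozen levels of that era. */
enum stage_level {
    STAGE_UNFROZEN = 0,
    STAGE_FREEZE_WRITE = 1, /* new VFS-level writes are held off */
    STAGE_FREEZE_TRANS = 2, /* new transactions are held off as well */
};

struct stage_sb {
    enum stage_level frozen;
};

/* True when a caller checking at `level` may proceed. A real check in the
 * style of vfs_check_frozen(sb, level) would sleep instead of failing. */
static bool stage_may_proceed(const struct stage_sb *sb, enum stage_level level)
{
    return sb->frozen < level;
}

/* Stage 1: stop new writes; dirty caches can then be flushed. */
static void stage_freeze_writes(struct stage_sb *sb)
{
    sb->frozen = STAGE_FREEZE_WRITE;
}

/* Stage 2: stop new transactions, then wait for the running ones. */
static void stage_freeze_trans(struct stage_sb *sb)
{
    sb->frozen = STAGE_FREEZE_TRANS;
}
```

Between the two stages, write-level callers are already held off while transaction-level callers still get through, which is what lets the cache flush complete before transactions are stopped.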


2011-05-02 13:22:58

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 04:16 PM, Jan Kara wrote:
> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>> places that just thawing the filesystem. Any where this is done should
>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>> fs code with s_umount held).
>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>> filesystem a bug in ext3/4...
>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>> xfs_file_aio_write()
>>>>>> xfs_wait_for_freeze()
>>>>>> vfs_check_frozen()
>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>
>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>> --------------------
>>>>>> xfs_file_aio_write()
>>>>>> xfs_wait_for_freeze()
>>>>>> vfs_check_frozen()
>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>> xfs_file_aio_write() will be executed next.
>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>> control.
>>>>>>
>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>> --------------------
>>>>>>
>>>>>> Can anyone please enlighten me on how& why this premption is _not_
>>>>>> possible?
>>>> Thanks for your reply.
>>>>> XFS works similarly as ext4 in this regard I believe. They have the log
>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>> the writing process gets caught waiting for log to unfreeze
>>>> Agreed.
>>>>> or it manages
>>>>> to start a transaction and then freezing process waits for transaction to
>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>> check in xfs_file_aio_write()...
>>>>>
>>>>>
>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>> freezing a superblock (as the write process stopped just before
>>>> writing anything for this transaction and has not taken any locks?)
>>> So ext4_freeze() does
>>> jbd2_journal_lock_updates(journal)
>>> which waits for all running transactions to finish and updates
>>> j_barrier_count which stops any news ones from proceeding (check
>>> function start_this_handle()).
>>>
>> Yes, but ext4_freeze() also calls
>> jbd2_journal_unlock_updates(journal) which decrements the
>> j_barrier_count (which was previously updated/incremented in
>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>> new transaction/handle can be accepted/started.
>>
>> A comment in ext4_freeze() says:
>> /* we rely on s_frozen to stop further updates */
>> (before calling jbd2_journal_unlock_updates())
> Ah, drat, you're right. I've missed this other part. It's the problem
> that if you expect to see something, you'll see it regardless of the real
> code ;).
>
> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
> it's still racy (although the race window is relatively small) because the
> filesystem can become frozen the instant after we check vfs_check_frozen().
> Commit 6b0310fb broke it for ext4.
>
> I guess the code was mostly copied from XFS which seems to have the same
> problem in xfs_trans_alloc() since the git history beginning. I see two
> ways to fix this - either fix ext4/xfs to check s_frozen after starting
> a transaction and if the filesystem is being frozen, we stop the
> transaction, wait for fs to get unfrozen, and restart. Another option is
> to create an analogous logic using a atomic counter of write ops in vfs
> that could be used by all filesystems. We'd just have to replace
> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
> appropriate places...
How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal)
from ext4_unfreeze()?

That way no transactions can be started before unfreeze is called.

This has another advantage: it rightfully does not let you update
the access time when the F.S is frozen (touch_atime called from a read
path when the F.S is frozen). Otherwise we also need to fix this path.
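A rough user-space model of this proposal follows, with hypothetical `prop_*` names and `false` standing in for "would sleep". It also tracks, as an assumption about how the lock call works, that the barrier mutex stays held for as long as the barrier is raised; that is the cost raised elsewhere in this thread about returning to userspace with a mutex held.

```c
#include <stdbool.h>

/* Hypothetical model of "unlock updates at unfreeze time"; not kernel code. */
struct prop_journal {
    int  barrier_count;      /* models journal->j_barrier_count */
    bool barrier_mutex_held; /* assumed: held while the barrier is raised */
};

/* False means a new handle would sleep on the barrier. */
static bool prop_try_start_handle(const struct prop_journal *j)
{
    return j->barrier_count == 0;
}

/* Proposed freeze: lock updates and flush, but do NOT unlock on return. */
static void prop_freeze(struct prop_journal *j)
{
    j->barrier_count++;
    j->barrier_mutex_held = true;
    /* the journal flush would happen here */
}

/* Proposed unfreeze: only now drop the barrier. */
static void prop_unfreeze(struct prop_journal *j)
{
    j->barrier_mutex_held = false;
    j->barrier_count--;
}
```

In the model no handle can start between `prop_freeze()` and `prop_unfreeze()`, but the mutex flag is also set for that whole window, so the two concerns trade off against each other.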

Warm Regards,
Surbhi.

> Dave, Christoph, any opinions on this?
> Honza


2011-05-02 13:24:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> This has another advantage, that it rightfully does not let you
> update the access time when the F.S is frozen (touch_atime called
> from a read path when the F.S is frozen) Otherwise we also need to
> fix this path.

In most filesystems atime updates aren't transactional. They just
get written into inode->i_atime, and at some later point when the
VFS tries to clean the inode it gets written back, either through
a transaction or not.


2011-05-02 13:27:49

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
>> This has another advantage, that it rightfully does not let you
>> update the access time when the F.S is frozen (touch_atime called
>> from a read path when the F.S is frozen) Otherwise we also need to
>> fix this path.
> In most filesystens atime updates aren't transactional. They just
> get written into inode->i_atime, and at some later point when the
> VFS tries to clean the inode it gets writtent back, either through
> a transaction or not.
>
Yes, agreed. But then when a F.S is frozen the inode should not be
dirtied, right? So this has to be fixed?
Also, in ext4, I think that updating atime starts a transaction.

Warm Regards,
Surbhi


2011-05-02 14:02:06

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 5/2/11 7:30 AM, Surbhi Palande wrote:

...

> Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal) which decrements the j_barrier_count (which was previously updated/incremented in jbd2_journal_lock_updates) ? before it returns. So after this call a new transaction/handle can be accepted/started.
>
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())

that was me;

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <[email protected]>
Date: Sun May 16 02:00:00 2010 -0400

ext4: don't return to userspace after freezing the fs with a mutex held


otherwise we return to userspace holding a mutex :(

-Eric

2011-05-02 14:04:23

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 5/2/11 8:22 AM, Surbhi Palande wrote:
> On 05/02/2011 04:16 PM, Jan Kara wrote:
>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>> places than just thawing the filesystem. Anywhere this is done should
>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>> fs code with s_umount held).
>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>> by definition, is a clean filesystem, and therefore sync calls of any
>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>> filesystem a bug in ext3/4...
>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>> xfs_file_aio_write()
>>>>>>> xfs_wait_for_freeze()
>>>>>>> vfs_check_frozen()
>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>
>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>> --------------------
>>>>>>> xfs_file_aio_write()
>>>>>>> xfs_wait_for_freeze()
>>>>>>> vfs_check_frozen()
>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>> control.
>>>>>>>
>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>> --------------------
>>>>>>>
>>>>>>> Can anyone please enlighten me on how & why this preemption is _not_
>>>>>>> possible?
>>>>> Thanks for your reply.
>>>>>> XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>> Agreed.
>>>>>> or it manages
>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>> check in xfs_file_aio_write()...
>>>>>>
>>>>>>
>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>> freezing a superblock (as the write process stopped just before
>>>>> writing anything for this transaction and has not taken any locks?)
>>>> So ext4_freeze() does
>>>> jbd2_journal_lock_updates(journal)
>>>> which waits for all running transactions to finish and updates
>>>> j_barrier_count which stops any new ones from proceeding (check
>>>> function start_this_handle()).
>>>>
>>> Yes, but ext4_freeze() also calls
>>> jbd2_journal_unlock_updates(journal) which decrements the
>>> j_barrier_count (which was previously updated/incremented in
>>> jbd2_journal_lock_updates) before it returns. So after this call a
>>> new transaction/handle can be accepted/started.
>>>
>>> A comment in ext4_freeze() says:
>>> /* we rely on s_frozen to stop further updates */
>>> (before calling jbd2_journal_unlock_updates())
>> Ah, drat, you're right. I've missed this other part. It's the problem
>> that if you expect to see something, you'll see it regardless of the real
>> code ;).
>>
>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>> it's still racy (although the race window is relatively small) because the
>> filesystem can become frozen the instant after we check vfs_check_frozen().
>> Commit 6b0310fb broke it for ext4.
>>
>> I guess the code was mostly copied from XFS which seems to have the same
>> problem in xfs_trans_alloc() since the git history beginning. I see two
>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>> a transaction and if the filesystem is being frozen, we stop the
>> transaction, wait for fs to get unfrozen, and restart. Another option is
>> to create an analogous logic using an atomic counter of write ops in vfs
>> that could be used by all filesystems. We'd just have to replace
>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>> appropriate places...
> How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> from ext4_unfreeze()?

we used to have that, but holding it locked until then means we exit the kernel
with a mutex held, which is pretty icky.

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
#0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0


-Eric

> So that indeed no transactions can be started before unfreeze is called.
>
> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>
> Warm Regards,
> Surbhi.
>
>> Dave, Christoph, any opinions on this?
>> Honza
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-05-02 14:20:55

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 09:22:04, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> > Dave, Christoph, any opinions on this?
>
> The busyloop in xfs_quiesce_attr which waits for all active transactions
> to finish is supposed to fix this issue.
Hmm, but what prevents the following race?

Thread 1                                Thread 2
..
xfs_trans_alloc()
  xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
                                        freeze_super()
                                          ...
                                          xfs_fs_freeze()
                                            ...
                                            xfs_quiesce_attr()
                                            ...
  _xfs_trans_alloc()
    atomic_inc(&mp->m_active_trans);
  ... goes on modifying the filesystem

It seems to be a similar problem as in ext4 - the atomic_inc() and
vfs_check_frozen() are in the wrong order...

> Note that XFS traditionally expects a two stage freeze process where
> we first freeze new VFS-level writes, then flush the caches and then
> stop transactions, wait for them to finish and do the remainder of
> the freeze process, but I really messed that process up when moving
> the sequence to generic code. Funnily enough it seems to work
> nevertheless.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 14:26:17

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 16:27:39, Surbhi Palande wrote:
> On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> >On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> >>This has another advantage, that it rightfully does not let you
> >>update the access time when the F.S is frozen (touch_atime called
> >>from a read path when the F.S is frozen) Otherwise we also need to
> >>fix this path.
> >In most filesystems atime updates aren't transactional. They just
> >get written into inode->i_atime, and at some later point when the
> >VFS tries to clean the inode it gets written back, either through
> >a transaction or not.
> >
> Yes, agreed. But then when a F.S is frozen the inode should not be
> dirtied? Right? So this has to be fixed?
> Also, in ext4, I think that updating atime starts a transaction.
Yes, it does. Any mark_inode_dirty() call causes a transaction update.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 14:42:03

by Christoph Hellwig

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
> Hmm, but what prevents the following race?
>
> Thread 1                                Thread 2
> ..
> xfs_trans_alloc()
>   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>                                         freeze_super()

                                          sb->s_frozen = SB_FREEZE_TRANS;

>                                           ...
>                                           xfs_fs_freeze()
>                                             ...
>                                             xfs_quiesce_attr()

                                          waits for all active
                                          transactions

>                                             ...

xfs_trans_alloc
  -> blocks in xfs_wait_for_freeze
     (thus doesn't get to _xfs_trans_alloc)

>   _xfs_trans_alloc()
>     atomic_inc(&mp->m_active_trans);
>   ... goes on modifying the filesystem
>
> It seems to be a similar problem as in ext4 - the atomic_inc() and
> vfs_check_frozen() are in the wrong order...

I can't see the problem in this scheme. Note that we want
_xfs_trans_alloc to be able to create a transaction for
xfs_fs_log_dummy, so that we can write the dummy log record after
freezing out all other transactions, so that one is special cased
and doesn't do the xfs_wait_for_freeze.


2011-05-02 16:23:37

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon 02-05-11 10:41:55, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
> > Hmm, but what prevents the following race?
> >
> > Thread 1                                Thread 2
> > ..
> > xfs_trans_alloc()
> >   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >                                         freeze_super()
>
>                                           sb->s_frozen = SB_FREEZE_TRANS;
> >                                           ...
> >                                           xfs_fs_freeze()
> >                                             ...
> >                                             xfs_quiesce_attr()
>
>                                           waits for all active
>                                           transactions
>
> >                                             ...
>
> xfs_trans_alloc
>   -> blocks in xfs_wait_for_freeze
But why should it block when xfs_wait_for_freeze() gets called before
freeze_super() gets called? The other thread calls freeze_super() just
after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
called. Or am I missing some serialization somewhere?

>      (thus doesn't get to _xfs_trans_alloc)
>
> >   _xfs_trans_alloc()
> >     atomic_inc(&mp->m_active_trans);
> >   ... goes on modifying the filesystem
> >
> > It seems to be a similar problem as in ext4 - the atomic_inc() and
> > vfs_check_frozen() are in the wrong order...
>
> I can't see the problem in this scheme. Note that we want
> _xfs_trans_alloc to be able to create a transaction for
> xfs_fs_log_dummy, so that we can write the dummy log record after
> freezing out all other transactions, so that one is special cased
> and doesn't do the xfs_wait_for_freeze.
OK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-02 16:38:56

by Christoph Hellwig

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Mon, May 02, 2011 at 06:23:34PM +0200, Jan Kara wrote:
> But why should it block when xfs_wait_for_freeze() gets called before
> freeze_super() gets called? The other thread calls freeze_super() just
> after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
> called. Or am I missing some serialization somewhere?

Oh, now I get the race window you mean. It's the single instruction
window between doing the frozen check and incrementing m_active_trans.

Yes, that one looks real, although very unlikely to hit. Could be fixed
relatively easily by moving the check after the increment.
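
A rough user-space model of that reordering, using C11 atomics, might look like the following. The names (toy_mount, toy_trans_start and so on) are invented for the illustration and are not XFS interfaces, and the real code would sleep until thaw and retry where the model returns false. The point is only the ordering: the writer increments the counter before it looks at the frozen flag, and the freezer sets the flag before it looks at the counter, so at least one side always notices the other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the mount; field names are illustrative only. */
struct toy_mount {
	atomic_bool frozen;		/* models s_frozen >= SB_FREEZE_TRANS */
	atomic_int  active_trans;	/* models mp->m_active_trans */
};

static void toy_mount_init(struct toy_mount *mp)
{
	atomic_init(&mp->frozen, false);
	atomic_init(&mp->active_trans, 0);
}

/*
 * Count first, then check. If a freeze has begun, back the count out
 * again so the freezer is not left waiting on a transaction that will
 * never run. Returns true if the transaction may proceed.
 */
static bool toy_trans_start(struct toy_mount *mp)
{
	atomic_fetch_add(&mp->active_trans, 1);
	if (atomic_load(&mp->frozen)) {
		atomic_fetch_sub(&mp->active_trans, 1);
		return false;	/* real code would wait for thaw and retry */
	}
	return true;
}

static void toy_trans_end(struct toy_mount *mp)
{
	int prev = atomic_fetch_sub(&mp->active_trans, 1);

	assert(prev > 0);
	(void)prev;
}

/*
 * Freezer side: publish the frozen state, then report how many
 * transactions are still active and must be waited for.
 */
static int toy_freeze_begin(struct toy_mount *mp)
{
	atomic_store(&mp->frozen, true);
	return atomic_load(&mp->active_trans);
}
```

A transaction that got in before the freeze shows up in the value returned by toy_freeze_begin(); one that arrives afterwards sees the flag and backs out, leaving the counter at its previous value.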

2011-05-03 07:27:19

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/02/2011 05:04 PM, Eric Sandeen wrote:
> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>> How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> from ext4_unfreeze()?
> we used to have that, but holding it locked until then means we exit the kernel
> with a mutex held, which is pretty icky.
>
> ================================================
> [ BUG: lock held when returning to user space! ]
> ------------------------------------------------
> lvcreate/1075 is leaving the kernel with locks still held!
> 1 lock held by lvcreate/1075:
> #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
> jbd2_journal_lock_updates+0xe1/0xf0
>
>
> -Eric
Should this not be reverted? I think that it's a lot easier to stop a
transaction between a freeze and a thaw that way! If you agree, can I
send a patch for the same?

Thanks!

Warm Regards,
Surbhi.




2011-05-03 08:06:59

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 04/25/2011 09:28 AM, Toshiyuki Okajima wrote:
> Hi.
>
> On Sat, 23 Apr 2011 00:10:25 +0200
> Jan Kara<[email protected]> wrote:
>> On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
>>> I have confirmed that the following patch works fine while my or
>>> Mizuma-san's reproducer is running. Therefore,
>>> we can block to write the data, which is mmapped to a file, into a disk
>>> by a page-fault while fsfreezing.
>>>
>>> I think this patch fixes the following two problems:
>>> - A deadlock occurs between ext4_da_writepages() (called from
>>> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
>>> - We can also write the data, which is mmapped to a file,
>>> into a disk while fsfreezing (ext3/ext4).
>>> (reported by me)
>>>
>>> Please examine this patch.
>> Thanks for the patch. The ext3 part is not as easy as this. You cannot
>> really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
>> held by page fault code and i_alloc_sem should be acquired before it (yes I
>> know, ext4 already has this bug which should be fixed when I get to it).
>> Also you'll find that performance of random writers via mmap (which is
>> relatively common) is going to be rather bad with this patch (because the
>> file will be heavily fragmented). We have to be more clever which is
>> exactly why it's taking me so long with my patch :) But tests are already
>> running so if everything goes fine, I should have patches to submit next
>> week.
> OK, I'll wait your patch. :)
>
>>
>> The ext4 part looks correct. I'd just also like to have some comments about
>> how freeze handling is done because it's kind of subtle.
>
> How about this?


We can have a race here too - since we are only checking if the F.S is
in a frozen state or not at _that_ point. We are _not_ preventing a F.S
freeze from happening _after_ this point. So here is what can happen:

Key:
(tx: time at xth unit)

Scenario:

Task 1: mmapped write - (case: page mapped to disk and is in page cache)
t1) ext4_page_mkwrite()
t2) vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
t3) ---- Preempted ----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed
execution.

Task1: mmapped write - (case: page mapped to disk and is in page cache)
tn+1) ext4_page_mkwrite() returns 0. The write to the mmapped page
continues with writing to a page in the page cache when the F.S is
frozen! So after the vfs_check_frozen() we _are_ susceptible to
"dirtying the page cache when F.S is frozen"

In this case we are not protected by a transaction. Are we?
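
The timeline above can be replayed step by step in a small user-space model. Everything here is a made-up stand-in (toy_fs and the toy_* helpers are not kernel interfaces), and the steps run sequentially in the order t1..tn+1 rather than on real threads. It only confirms that a one-off check gives no protection once the check has passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the filesystem state; illustrative only. */
struct toy_fs {
	bool frozen;		/* models s_frozen != SB_UNFROZEN */
	int  dirty_pages;	/* pages dirtied and not yet written back */
};

/* t2: the one-off check; true means the caller may carry on. */
static bool toy_check_not_frozen(const struct toy_fs *fs)
{
	return !fs->frozen;
}

/* t4..tn: freeze marks the fs frozen and writes everything back. */
static void toy_freeze(struct toy_fs *fs)
{
	fs->frozen = true;
	fs->dirty_pages = 0;
}

/* tn+1: the fault handler has returned and the page gets dirtied. */
static void toy_dirty_page(struct toy_fs *fs)
{
	fs->dirty_pages++;
}

/* The property a completed freeze is meant to guarantee. */
static bool toy_frozen_and_clean(const struct toy_fs *fs)
{
	return fs->frozen && fs->dirty_pages == 0;
}

/* Replays t1..tn+1 in order; returns true if the guarantee survives. */
static bool toy_replay_timeline(void)
{
	struct toy_fs fs = { .frozen = false, .dirty_pages = 0 };

	if (!toy_check_not_frozen(&fs))	/* t2: passes, fs is not frozen */
		return true;		/* the writer would have blocked */
	/* t3: the writer is preempted here */
	toy_freeze(&fs);		/* t4..tn */
	assert(toy_frozen_and_clean(&fs));	/* holds right after freeze */
	toy_dirty_page(&fs);		/* tn+1 */
	return toy_frozen_and_clean(&fs);
}
```

toy_replay_timeline() returns false: the model ends up frozen with one dirty page, which is the situation described above.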

Warm Regards,
Surbhi.




>
> Thanks,
> Toshiyuki Okajima
>
> ----------------------------------------------------------------------------------------------------
> Subject: [PATCH] ext4: prevent the mmapped page flushing to disk while fsfreezing
>
> Signed-off-by: Toshiyuki Okajima<[email protected]>
> ---
> fs/ext4/inode.c | 10 +++++++++-
> 1 files changed, 9 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..411b177 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	ret = 0;
>  	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>
>  	if (page->index == size >> PAGE_CACHE_SHIFT)
>  		len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
>  			unlock_page(page);
> +out_frozen:
> +			/*
> +			 * We must wait here while the filesystem is being
> +			 * frozen otherwise a flushing thread can write this
> +			 * page to the disk (we can update the filesystem even
> +			 * if it is frozen).
> +			 */
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>  			goto out_unlock;
>  		}
>  	}


2011-05-03 11:02:00

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> Hi,
>
> (2011/04/16 2:13), Jan Kara wrote:
>> Hello,
>>
>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>> probably
>>>> get modified to block while minor-faulting the page on frozen fs
>>>> because
>>>> when blocks are already allocated we may skip starting a transaction
>>>> and so
>>>> we could possibly modify the filesystem.
>>> OK. I think ->page_mkwrite() should also block writing the
>>> minor-faulting pages.
>>>
>>> (minor-pagefault)
>>> -> do_wp_page()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>> (major-pagefault)
>>> -> do_linear_fault()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>>>
>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>> file (mmap).
>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>> while
>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>> operation
>>>>>>> while fsfreezing.
>>>>>> Technically speaking, we block all the transaction starts which
>>>>>> means we
>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>> not mean
>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>> properly
>>>>>> note the mmap case is one of such exceptions.
>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>> can't allow
>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>> path can
>>>>> write to disk while fsfreezing because this deadlock problem
>>>>> happens after
>>>>> fsfreeze operation is done...
>>>> I'm sorry I don't understand now - are you speaking about the case
>>>> above
>>>> when writepage() does not wait for filesystem being frozen or something
>>>> else?
>>> Sorry, I didn't understand around the page fault path.
>>> So, I had read the kernel source code around it, then I maybe
>>> understand...
>>>
>>> I worry whether we can update the file data in mmap case while
>>> fsfreezing.
>>> Of course, I understand that we can write to in-memory cache, and it
>>> is not a
>>> problem. However, if we can write to disk while fsfreezing, it is a
>>> problem.
>>> So, I summarize the cases whether we can write to disk or not.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Cases (Whether we can write the data mmapped to the file on the disk
>>> while fsfreezing)
>>>
>>> [1] One of the page which has been mmapped is not bound. And
>>> the page is not allocated yet. (major fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) __do_fault is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!
>>>
>>> [2] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are not mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!

What happens in the case as follows:

Task 1: Mmapped writes
t1) ext4_page_mkwrite()
t2) ext4_write_begin() (FS is thawed so we proceed)
t3) ext4_write_end() (journal is stopped now)
-----Pre-empted-----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed
execution.

Task 1: Mmapped writes
tn+1) ext4_page_mkwrite() returns 0.
tn+2) __do_fault() gets control, code gets executed.
tn+3) __do_fault() marks the page dirty if the intent is to write to a
file based page which faulted.

So you end up dirtying the page cache when the F.S is frozen? No?


Warm Regards,
Surbhi.







>>>
>>> [3] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> * Cannot block the dirty page to be written because all bh is mapped.
>>> (5) user munmaps the page (munmap)
>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (7) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>>
>>> [4] One of the page which has been mmapped is bound. And
>>> the page is already allocated.
>>>
>>> (1) user dirtys a page
>>> ( ) no page fault occurs
>>> (2) user munmaps the page (munmap)
>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (4) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>> --------------------------------------------------------------------------
>>>
>>>
>>> So, we can block the cases [1], [2].
>>> But I think we cannot block the cases [3], [4] now.
>>> If fixing the page_mkwrite, we can also block the case [3].
>>> But the case [4] is not blocked because no page fault occurs
>>> when we dirty the mmapped page.
>>>
>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>> I think we must modify the writeback thread to fix the case [4].
>> The trick here is that when we write a page to disk, we write-protect
>> the page (you seem to call this that "the page is bound", I'm not sure
>> why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
> But, anyway, I understand we don't need to consider the case [4].
>
>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>> modify a page after we finish writeback while freezing the filesystem.
>> So in principle all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent
> it?
>
> Thanks,
> Toshiyuki Okajima
>


2011-05-03 13:08:37

by Surbhi Palande

Subject: [PATCH] Prevent dirtying a page when ext4 F.S is frozen

Prevent dirtying a page when ext4 F.S is frozen. Also take the write semaphore
sb->s_umount to prevent a F.S freeze from racing with the page dirtying
process. Without this we can end up dirtying a page while a F.S freeze
happened because of preemption.

Signed-off-by: Surbhi Palande <[email protected]>
---
fs/ext4/inode.c | 35 ++++++++++++++++++++++++++++++++++-
1 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..db3f99d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3827,8 +3827,41 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
  */
 static int ext4_journalled_set_page_dirty(struct page *page)
 {
+	int ret=0;
+	struct inode * inode = NULL;
+	struct super_block * sb = NULL;
+
+	if(likely((page->mapping) && (page->mapping->host))){
+		inode = page->mapping->host;
+		if(likely(inode->i_sb)){
+			sb = inode->i_sb;
+			/* we do not want a freeze to start now if F.S is not
+			 * already frozen*/
+			down_write(&sb->s_umount);
+			if(sb->s_frozen != SB_UNFROZEN) {
+				/* F.S is frozen.
+				 * we dont want to sleep with s_umount held.
+				 * Or else we might race with thaw_super */
+				up_write(&sb->s_umount);
+				vfs_check_frozen(sb, SB_FREEZE_WRITE);
+				/* F.S is no more frozen. We do not want the
+				 * FS freeze to begin after this point
+				 */
+				down_write(&sb->s_umount);
+			}
+		}
+	}
 	SetPageChecked(page);
-	return __set_page_dirty_nobuffers(page);
+	ret = __set_page_dirty_nobuffers(page);
+	if(likely((page->mapping) && (page->mapping->host))){
+		if(likely(inode->i_sb)){
+			up_write(&sb->s_umount);
+			/* If we freeze after this point, the dirtied page can
+			 * be flushed out!
+			 */
+		}
+	}
+	return ret;
 }

 static const struct address_space_operations ext4_ordered_aops = {
--
1.7.1


2011-05-03 13:08:49

by Surbhi Palande

Subject: (unknown)


On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
Toshiyuki pointed out.

zap_pte_range()
mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)

So, I think that it is here that we should do the checking for a ext4 F.S
frozen state and also prevent a parallel ext4 F.S freeze from happening.

Attaching a patch for initial review. Please do let me know your thoughts!

Thanks a lot!

Warm Regards,
Surbhi.



2011-05-03 13:46:36

by Jan Kara

Subject: Re: your mail

On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> Toshiyuki pointed out.
>
> zap_pte_range()
> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>
> So, I think that it is here that we should do the checking for a ext4 F.S
> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>
> Attaching a patch for initial review. Please do let me know your thoughts!
This is definitely the wrong place. ->set_page_dirty() callbacks are
called with various locks held and the page need not be locked (thus
dereferencing page->mapping is oopsable). Moreover this particular callback
is called only in data=journal mode.

Believe me, the right place is page_mkwrite() - you have to catch the
read-only => read-write page transition. Once the page is mapped
read-write, you've already lost the race.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-03 13:56:57

by Surbhi Palande

Subject: Re: your mail

On 05/03/2011 04:46 PM, Jan Kara wrote:
> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:

Sorry for missing the subject line :(
>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>> Toshiyuki pointed out.
>>
>> zap_pte_range()
>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>
>> So, I think that it is here that we should do the checking for a ext4 F.S
>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>
>> Attaching a patch for initial review. Please do let me know your thoughts!
> This is definitely the wrong place. ->set_page_dirty() callbacks are
> called with various locks held and the page need not be locked (thus
> dereferencing page->mapping is oopsable). Moreover this particular callback
> is called only in data=journal mode.
Ok! Thanks for that!

>
> Believe me, the right place is page_mkwrite() - you have to catch the
> read-only => read-write page transition. Once the page is mapped
> read-write, you've already lost the race.

My only point is:
1) something should prevent the freeze from happening. We can't merely
check vfs_check_frozen()?

And this should be done where the page is marked dirty. Also, I thought
that the page is marked read-write only in the page table in
__do_page_fault()? i.e. zap_pte_range() marks them dirty in the page
cache? Is this understanding right?

IMHO, whatever code dirties the page in the page cache should call a F.S
specific function and let it _prevent_ a fsfreeze while the page is
getting dirtied, so that a freeze called after this point flushes this page!

Warm Regards,
Surbhi.

>
> Honza


2011-05-03 15:19:51

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >(2011/04/16 2:13), Jan Kara wrote:
> >>Hello,
> >>
> >>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>probably
> >>>>get modified to block while minor-faulting the page on frozen fs
> >>>>because
> >>>>when blocks are already allocated we may skip starting a transaction
> >>>>and so
> >>>>we could possibly modify the filesystem.
> >>>OK. I think ->page_mkwrite() should also block writing the
> >>>minor-faulting pages.
> >>>
> >>>(minor-pagefault)
> >>>-> do_wp_page()
> >>>-> page_mkwrite(= ext4_page_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>(major-pagefault)
> >>>-> do_linear_fault()
> >>>-> page_mkwrite(= ext4_page_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>>
> >>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>file (mmap).
> >>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>while
> >>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>operation
> >>>>>>>while fsfreezing.
> >>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>means we
> >>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>not mean
> >>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>properly
> >>>>>>note the mmap case is one of such exceptions.
> >>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>can't allow
> >>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>path can
> >>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>happens after
> >>>>>fsfreeze operation is done...
> >>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>above
> >>>>when writepage() does not wait for filesystem being frozen or something
> >>>>else?
> >>>Sorry, I didn't understand around the page fault path.
> >>>So, I had read the kernel source code around it, then I maybe
> >>>understand...
> >>>
> >>>I worry whether we can update the file data in mmap case while
> >>>fsfreezing.
> >>>Of course, I understand that we can write to in-memory cache, and it
> >>>is not a
> >>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>problem.
> >>>So, I summarize the cases whether we can write to disk or not.
> >>>
> >>>--------------------------------------------------------------------------
> >>>
> >>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>while fsfreezing)
> >>>
> >>>[1] One of the page which has been mmapped is not bound. And
> >>>the page is not allocated yet. (major fault?)
> >>>
> >>>(1) user dirties a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) __do_fault is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
> >>>
> >>>[2] One of the page which has been mmapped is not bound. But
> >>>the page is already allocated, and the buffer_heads of the page
> >>>are not mapped (BH_Mapped). (minor fault?)
> >>>
> >>>(1) user dirties a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) do_wp_page is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
>
> What happens in the case as follows:
>
> Task 1: Mmapped writes
> t1)ext4_page_mkwrite()
> t2) ext4_write_begin() (FS is thawed so we proceed)
> t3) ext4_write_end() (journal is stopped now)
> -----Pre-empted-----
>
>
> Task 2: Freeze Task
> t4) freezes the super block...
> ...(continues)....
> tn) the page cache is clean and the F.S is frozen. Freeze has
> completed execution.
>
> Task 1: Mmapped writes
> tn+1) ext4_page_mkwrite() returns 0.
> tn+2) __do_fault() gets control, code gets executed.
> tn+3) __do_fault() marks the page dirty if the intent is to write to
> a file based page which faulted.
>
> So you end up dirtying the page cache when the F.S is frozen? No?
You are right, ext4_page_mkwrite() as currently implemented has problems.
You have to return the page locked (and check for frozen fs with page lock
held) to avoid races.

If you check for frozen fs with page lock held, you are guaranteed that
freezing code must wait for the page to get unlocked before proceeding. And
before the page is unlocked, it is marked dirty by the pagefault code which
makes freezing code write the page and writeprotect it again. So everything
will be safe.
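
The ordering described above can be replayed in a small user-space toy
model. This is only a sketch: every name in it (toy_page, toy_sb,
toy_mkwrite_begin and so on) is hypothetical and none of it is kernel
code. It assumes a single page and models "waiting" as a function that
returns false and has to be called again later.

```c
#include <stdbool.h>

/* Toy user-space model; every name here is hypothetical, not a kernel API. */
struct toy_page {
    bool locked;    /* stands in for the page lock */
    bool dirty;     /* stands in for the dirty bit in struct page */
    bool writable;  /* stands in for a writable PTE */
};

struct toy_sb {
    bool frozen;
};

/* Fault side, step 1: take the page lock, then check the freeze state.
 * Returns false if the caller would have to wait for a thaw. */
static bool toy_mkwrite_begin(struct toy_sb *sb, struct toy_page *pg)
{
    pg->locked = true;
    if (sb->frozen) {
        pg->locked = false;
        return false;
    }
    return true;            /* the page is returned locked */
}

/* Fault side, step 2: map writable, dirty the page, only then unlock. */
static void toy_fault_finish(struct toy_page *pg)
{
    pg->writable = true;
    pg->dirty = true;
    pg->locked = false;
}

/* Freeze side: mark the fs frozen, then write the page back.
 * Returns false if it has to wait because the page is still locked. */
static bool toy_freeze_writeback(struct toy_sb *sb, struct toy_page *pg)
{
    sb->frozen = true;
    if (pg->locked)
        return false;           /* must wait for the fault path */
    if (pg->dirty) {
        pg->dirty = false;      /* "written to disk" */
        pg->writable = false;   /* write-protected again */
    }
    return true;
}

/* Replays the racy interleaving; returns true if the page ends up
 * clean and write-protected on the frozen filesystem. */
static bool toy_replay_race(void)
{
    struct toy_sb sb = { .frozen = false };
    struct toy_page pg = { false, false, false };

    if (!toy_mkwrite_begin(&sb, &pg))       /* check passes: not frozen */
        return false;
    if (toy_freeze_writeback(&sb, &pg))     /* freeze races in, must wait */
        return false;
    toy_fault_finish(&pg);                  /* dirty first, then unlock */
    if (!toy_freeze_writeback(&sb, &pg))    /* freeze retries and flushes */
        return false;
    return sb.frozen && !pg.dirty && !pg.writable;
}
```

Calling toy_replay_race() from a small main() walks through the race in
the order the freeze could hit it. The point of the model is only the
ordering: the freeze side cannot finish between the frozen check and the
dirtying, because both happen under the same page lock.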

Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
stable pages during writeback need that as well so it's a reasonable thing
to do). So something like attached patches should do what's needed - it's
lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (4.94 kB)
0001-fs-Create-__block_page_mkwrite-helper-passing-error-.patch (2.45 kB)
0002-ext4-Rewrite-ext4_page_mkwrite-to-return-locked-page.patch (4.76 kB)
0003-ext4-Block-mmapped-writes-while-the-fs-is-frozen.patch (4.05 kB)

2011-05-03 15:26:08

by Surbhi Palande

Subject: Re: your mail

On 05/03/2011 04:56 PM, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
>>> On munmap() zap_pte_range() is called which dirties the PTE dirty
>>> pages as
>>> Toshiyuki pointed out.
>>>
>>> zap_pte_range()
>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>
>>> So, I think that it is here that we should do the checking for a ext4
>>> F.S
>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>
>>> Attaching a patch for initial review. Please do let me know your
>>> thoughts!
>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>> called with various locks held and the page need not be locked (thus
>> dereferencing page->mapping is oopsable). Moreover this particular
>> callback
>> is called only in data=journal mode.
> Ok! Thanks for that!
>
>>
>> Believe me, the right place is page_mkwrite() - you have to catch the
>> read-only => read-write page transition. Once the page is mapped
>> read-write, you've already lost the race.
Also, we then need to prevent a munmap()/zap_pte_range() call from
dirtying a mmapped file page when the F.S is frozen?

Warm Regards,
Surbhi.

>
> My only point is:
> 1) something should prevent the freeze from happening. We cant merely
> check the vfs_check_frozen()?
>
> And this should be done where the page is marked dirty. Also, I thought
> that the page is marked read-write only in the page table in the
> __do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
> cache? Is this understanding right?
>
> IMHO, whatever code dirties the page in the page cache should call a F.S
> specific function and let it _prevent_ a fsfreeze while the page is
> getting dirtied, so that a freeze called after this point flushes this
> page!
>
> Warm Regards,
> Surbhi.
>
>>
>> Honza
>


2011-05-03 15:36:32

by Jan Kara

Subject: Re: your mail

On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
> >>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>Toshiyuki pointed out.
> >>
> >>zap_pte_range()
> >> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>
> >>So, I think that it is here that we should do the checking for a ext4 F.S
> >>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>
> >>Attaching a patch for initial review. Please do let me know your thoughts!
> > This is definitely the wrong place. ->set_page_dirty() callbacks are
> >called with various locks held and the page need not be locked (thus
> >dereferencing page->mapping is oopsable). Moreover this particular callback
> >is called only in data=journal mode.
> Ok! Thanks for that!
>
> >
> >Believe me, the right place is page_mkwrite() - you have to catch the
> >read-only => read-write page transition. Once the page is mapped
> >read-write, you've already lost the race.
>
> My only point is:
> 1) something should prevent the freeze from happening. We cant
> merely check the vfs_check_frozen()?
Yes, I agree - see my other email with patches.

> And this should be done where the page is marked dirty. Also, I
> thought that the page is marked read-write only in the page table in
> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> the page cache? Is this understanding right?
The page can become dirty either because it was written via standard
write - write_begin is responsible for reliable check here - or it was
written via mmap - here we rely on page_mkwrite to do a reliable check -
it is analogous to write_begin callback. There should be no other way
to dirty a page.

With dirty bits it is a bit complicated. We have two of them in fact. One
in page table entry maintained by mmu and one in page structure maintained
by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
from page table into struct page. This is a lazy process so page can in
principle have new data without a dirty bit set in struct page because we
have not yet copied the dirty bit from page table. Only at moments where it
is important (like when we want to unmap the page, or throw away the page,
or so), we make sure struct page and page table bits are in sync.

Another subtle thing you may not be aware of is that when we clear page
dirty bit, we also writeprotect the page. So we are guaranteed to get a
page fault when the page is written to again.
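
A toy user-space model of the two dirty bits may make this easier to
follow. It is only a sketch under loud assumptions: toy_pte, toy_page
and the three helpers are hypothetical names, the "MMU" is just a
function call, and one page has exactly one mapping.

```c
#include <stdbool.h>

/* Toy user-space model; all names are hypothetical, not kernel APIs. */
struct toy_pte {
    bool writable;  /* mapping allows stores without a fault */
    bool dirty;     /* set by the "MMU" on a store */
};

struct toy_page {
    bool dirty;     /* the kernel's own dirty flag in struct page */
};

/* A store through the mapping. Returns false when it would fault,
 * which is where page_mkwrite would get its chance to run. */
static bool toy_store(struct toy_pte *pte)
{
    if (!pte->writable)
        return false;
    pte->dirty = true;
    return true;
}

/* Lazy propagation, as done on unmap: copy the PTE bit into the page. */
static void toy_sync_dirty(struct toy_pte *pte, struct toy_page *pg)
{
    if (pte->dirty) {
        pg->dirty = true;
        pte->dirty = false;
    }
}

/* Cleaning a page for writeback also write-protects the mapping. */
static void toy_clean_for_writeback(struct toy_pte *pte, struct toy_page *pg)
{
    toy_sync_dirty(pte, pg);
    pg->dirty = false;
    pte->writable = false;
}
```

In this model a store on a writable mapping sets only the PTE bit, so
the page can hold new data while its own dirty flag is still clear.
After toy_clean_for_writeback() the next toy_store() returns false,
which corresponds to the guaranteed page fault.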

> IMHO, whatever code dirties the page in the page cache should call a
> F.S specific function and let it _prevent_ a fsfreeze while the page
> is getting dirtied, so that a freeze called after this point flushes
> this page!
Agreed, that's what code in write_begin() and page_mkwrite() should
achieve.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-03 15:43:57

by Surbhi Palande

Subject: Re: your mail

On 05/03/2011 06:36 PM, Jan Kara wrote:
> On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
>> On 05/03/2011 04:46 PM, Jan Kara wrote:
>>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>>
>> Sorry for missing the subject line :(
>>>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>>>> Toshiyuki pointed out.
>>>>
>>>> zap_pte_range()
>>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>>
>>>> So, I think that it is here that we should do the checking for a ext4 F.S
>>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>>
>>>> Attaching a patch for initial review. Please do let me know your thoughts!
>>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>>> called with various locks held and the page need not be locked (thus
>>> dereferencing page->mapping is oopsable). Moreover this particular callback
>>> is called only in data=journal mode.
>> Ok! Thanks for that!
>>
>>>
>>> Believe me, the right place is page_mkwrite() - you have to catch the
>>> read-only => read-write page transition. Once the page is mapped
>>> read-write, you've already lost the race.
>>
>> My only point is:
>> 1) something should prevent the freeze from happening. We cant
>> merely check the vfs_check_frozen()?
> Yes, I agree - see my other email with patches.
>
>> And this should be done where the page is marked dirty. Also, I
>> thought that the page is marked read-write only in the page table in
>> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
>> the page cache? Is this understanding right?
> The page can become dirty either because it was written via standard
> write - write_begin is responsible for reliable check here - or it was
> written via mmap - here we rely on page_mkwrite to do a reliable check -
> it is analogous to write_begin callback. There should be no other way
> to dirty a page.
>
> With dirty bits it is a bit complicated. We have two of them in fact. One
> in page table entry maintained by mmu and one in page structure maintained
> by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> from page table into struct page. This is a lazy process so page can in
> principle have new data without a dirty bit set in struct page because we
> have not yet copied the dirty bit from page table. Only at moments where it
> is important (like when we want to unmap the page, or throw away the page,
> or so), we make sure struct page and page table bits are in sync.
>
> Another subtle thing you may not be aware of is that when we clear page
> dirty bit, we also writeprotect the page. So we are guaranteed to get a
> page fault when the page is written to again.
>
>> IMHO, whatever code dirties the page in the page cache should call a
>> F.S specific function and let it _prevent_ a fsfreeze while the page
>> is getting dirtied, so that a freeze called after this point flushes
>> this page!
> Agreed, that's what code in write_begin() and page_mkwrite() should
> achieve.
> Honza
Thanks a lot for the wonderful explanation :)

How about the revert, i.e. calling jbd2_journal_unlock_updates() from
ext4_unfreeze() instead of from ext4_freeze()? Do you agree to that?


Warm Regards,
Surbhi.


2011-05-03 20:15:03

by Eric Sandeen

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 5/3/11 2:27 AM, Surbhi Palande wrote:
> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>> places than just thawing the filesystem. Anywhere this is done should
>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>> by definition, is a clean filesystem, and therefore sync calls of any
>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>> xfs_wait_for_freeze()
>>>>>>>>> vfs_check_frozen()
>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>
>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>> --------------------
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>> xfs_wait_for_freeze()
>>>>>>>>> vfs_check_frozen()
>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>> control.
>>>>>>>>>
>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>> --------------------
>>>>>>>>>
>>>>>>>>> Can anyone please enlighten me on how & why this preemption is _not_
>>>>>>>>> possible?
>>>>>>> Thanks for your reply.
>>>>>>>> XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>> Agreed.
>>>>>>>> or it manages
>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>
>>>>>>>>
>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>> So ext4_freeze() does
>>>>>> jbd2_journal_lock_updates(journal)
>>>>>> which waits for all running transactions to finish and updates
>>>>>> j_barrier_count which stops any new ones from proceeding (check
>>>>>> function start_this_handle()).
>>>>>>
>>>>> Yes, but ext4_freeze() also calls
>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>> j_barrier_count (which was previously updated/incremented in
>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>> new transaction/handle can be accepted/started.
>>>>>
>>>>> A comment in ext4_freeze() says:
>>>>> /* we rely on s_frozen to stop further updates */
>>>>> (before calling jbd2_journal_unlock_updates())
>>>> Ah, drat, you're right. I've missed this other part. It's the problem
>>>> that if you expect to see something, you'll see it regardless of the real
>>>> code ;).
>>>>
>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>> it's still racy (although the race window is relatively small) because the
>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>> Commit 6b0310fb broke it for ext4.
>>>>
>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>> a transaction and if the filesystem is being frozen, we stop the
>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>> to create an analogous logic using a atomic counter of write ops in vfs
>>>> that could be used by all filesystems. We'd just have to replace
>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>> appropriate places...
>>> How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>> from ext4_unfreeze()?
>> we used to have that, but holding it locked until then means we exit the kernel
>> with a mutex held, which is pretty icky.
>>
>> ================================================
>> [ BUG: lock held when returning to user space! ]
>> ------------------------------------------------
>> lvcreate/1075 is leaving the kernel with locks still held!
>> 1 lock held by lvcreate/1075:
>> #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>> jbd2_journal_lock_updates+0xe1/0xf0
>>
>>
>> -Eric
> Should this not be reverted? I think that it's a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?

Only if you want the kernel to start spewing "BUG!" messages again...

-Eric
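
For reference, the counter-based alternative that Jan floats in the
quoted text (a vfs_start_write()/vfs_stop_write() pair around write
operations) could look roughly like the user-space sketch below. No such
helpers exist in the kernel discussed here; the toy_* names, the struct
layout and the "return false means wait and retry" convention are all
assumptions made for illustration. The one idea it tries to show is the
ordering: a writer announces itself before checking the frozen flag, and
the freezer sets the flag before checking for writers, so the two cannot
both miss each other.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy user-space model; all names are hypothetical, not kernel APIs. */
struct toy_sb {
    atomic_bool frozen;   /* set while a freeze is requested or active */
    atomic_int  writers;  /* write operations currently in flight */
};

/* Returns true if the write may proceed. False means the caller has to
 * wait for a thaw and try again; no lock is held while it waits. */
static bool toy_start_write(struct toy_sb *sb)
{
    atomic_fetch_add(&sb->writers, 1);      /* announce first ... */
    if (atomic_load(&sb->frozen)) {         /* ... then check */
        atomic_fetch_sub(&sb->writers, 1);
        return false;
    }
    return true;
}

static void toy_stop_write(struct toy_sb *sb)
{
    atomic_fetch_sub(&sb->writers, 1);
}

/* Returns true once the fs is frozen with no writer in flight.
 * False means the freezer has to wait for writers and call again. */
static bool toy_try_freeze(struct toy_sb *sb)
{
    atomic_store(&sb->frozen, true);        /* block new writers first */
    return atomic_load(&sb->writers) == 0;  /* then wait for old ones */
}

static void toy_thaw(struct toy_sb *sb)
{
    atomic_store(&sb->frozen, false);
}
```

Unlike keeping the journal barrier held from freeze until thaw, this
scheme holds no mutex across the return to user space; the frozen state
lives in a flag and a counter only.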

> Thanks!
>
> Warm Regards,
> Surbhi.
>
>
>>> So that indeed no transactions can be started before unfreeze is called.
>>>
>>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>>
>>> Warm Regards,
>>> Surbhi.
>>>
>>>> Dave, Christoph, any opinions on this?
>>>> Honza
>


2011-05-04 10:23:04

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/03/2011 11:14 PM, Eric Sandeen wrote:
> On 5/3/11 2:27 AM, Surbhi Palande wrote:
>> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the first place, then
>>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>>> places than just thawing the filesystem. Anywhere this is done should
>>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>>> That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>>> by definition, is a clean filesystem, and therefore sync calls of any
>>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>> xfs_wait_for_freeze()
>>>>>>>>>> vfs_check_frozen()
>>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>>
>>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>>> --------------------
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>> xfs_wait_for_freeze()
>>>>>>>>>> vfs_check_frozen()
>>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>>> control.
>>>>>>>>>>
>>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>>> --------------------
>>>>>>>>>>
>>>>>>>>>> Can anyone please enlighten me on how & why this preemption is _not_
>>>>>>>>>> possible?
>>>>>>>> Thanks for your reply.
>>>>>>>>> XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>>> Agreed.
>>>>>>>>> or it manages
>>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>>> So ext4_freeze() does
>>>>>>> jbd2_journal_lock_updates(journal)
>>>>>>> which waits for all running transactions to finish and updates
>>>>>>> j_barrier_count which stops any new ones from proceeding (check
>>>>>>> function start_this_handle()).
>>>>>>>
>>>>>> Yes, but ext4_freeze() also calls
>>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>>> j_barrier_count (which was previously updated/incremented in
>>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>>> new transaction/handle can be accepted/started.
>>>>>>
>>>>>> A comment in ext4_freeze() says:
>>>>>> /* we rely on s_frozen to stop further updates */
>>>>>> (before calling jbd2_journal_unlock_updates())
>>>>> Ah, drat, you're right. I've missed this other part. It's the problem
>>>>> that if you expect to see something, you'll see it regardless of the real
>>>>> code ;).
>>>>>
>>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>>> it's still racy (although the race window is relatively small) because the
>>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>>> Commit 6b0310fb broke it for ext4.
>>>>>
>>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>>> a transaction and if the filesystem is being frozen, we stop the
>>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>>> to create analogous logic using an atomic counter of write ops in vfs
>>>>> that could be used by all filesystems. We'd just have to replace
>>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>>> appropriate places...
>>>> How about calling jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>>> from ext4_unfreeze()?
>>> we used to have that, but holding it locked until then means we exit the kernel
>>> with a mutex held, which is pretty icky.
>>>
>>> ================================================
>>> [ BUG: lock held when returning to user space! ]
>>> ------------------------------------------------
>>> lvcreate/1075 is leaving the kernel with locks still held!
>>> 1 lock held by lvcreate/1075:
>>> #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>> jbd2_journal_lock_updates+0xe1/0xf0
>>>
>>>
>>> -Eric
>> Should this not be reverted? I think that it's a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?
> Only if you want the kernel to start spewing "BUG!" messages again...
>
> -Eric
But then you need a much more complicated way to stop accepting
transactions and writes between the freeze and the thaw (in both the
write path and the read path). Is this not much simpler?

Warm Regards,
Surbhi.

>> Thanks!
>>
>> Warm Regards,
>> Surbhi.
>>
>>
>>>> So that indeed no transactions can be started before unfreeze is called.
>>>>
>>>> This has another advantage: it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen). Otherwise we also need to fix this path.
>>>>
>>>> Warm Regards,
>>>> Surbhi.
>>>>
>>>>> Dave, Christoph, any opinions on this?
>>>>> Honza


2011-05-04 12:09:49

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/03/2011 06:19 PM, Jan Kara wrote:
> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>> (2011/04/16 2:13), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>> probably
>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>> because
>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>> and so
>>>>>> we could possibly modify the filesystem.
>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>> minor-faulting pages.
>>>>>
>>>>> (minor-pagefault)
>>>>> -> do_wp_page()
>>>>> -> page_mkwrite(= ext4_page_mkwrite())
>>>>> => BLOCK!
>>>>>
>>>>> (major-pagefault)
>>>>> -> do_linear_fault()
>>>>> -> page_mkwrite(= ext4_page_mkwrite())
>>>>> => BLOCK!
>>>>>
>>>>>>
>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>> file (mmap).
>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>> while
>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>> operation
>>>>>>>>> while fsfreezing.
>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>> means we
>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>> not mean
>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>> properly
>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>> can't allow
>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>> path can
>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>> happens after
>>>>>>> fsfreeze operation is done...
>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>> above
>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>> else?
>>>>> Sorry, I didn't understand around the page fault path.
>>>>> So, I had read the kernel source code around it, then I maybe
>>>>> understand...
>>>>>
>>>>> I worry whether we can update the file data in mmap case while
>>>>> fsfreezing.
>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>> is not a
>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>> problem.
>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>> while fsfreezing)
>>>>>
>>>>> [1] One of the mmapped pages is not bound, and
>>>>> the page is not allocated yet. (major fault?)
>>>>>
>>>>> (1) user dirties a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) __do_fault is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>>
>>>>> [2] One of the mmapped pages is not bound, but
>>>>> the page is already allocated, and the buffer_heads of the page
>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>
>>>>> (1) user dirties a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) do_wp_page is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb => We can STOP!
>>
>> What happens in the case as follows:
>>
>> Task 1: Mmapped writes
>> t1)ext4_page_mkwrite()
>> t2) ext4_write_begin() (FS is thawed so we proceed)
>> t3) ext4_write_end() (journal is stopped now)
>> -----Pre-empted-----
>>
>>
>> Task 2: Freeze Task
>> t4) freezes the super block...
>> ...(continues)....
>> tn) the page cache is clean and the F.S is frozen. Freeze has
>> completed execution.
>>
>> Task 1: Mmapped writes
>> tn+1) ext4_page_mkwrite() returns 0.
>> tn+2) __do_fault() gets control, code gets executed.
>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>> a file based page which faulted.
>>
>> So you end up dirtying the page cache when the F.S is frozen? No?
> You are right, ext4_page_mkwrite() as currently implemented has problems.
> You have to return the page locked (and check for frozen fs with page lock
> held) to avoid races.
>
> If you check for frozen fs with page lock held, you are guaranteed that
> freezing code must wait for the page to get unlocked before proceeding. And
> before the page is unlocked, it is marked dirty by the pagefault code which
> makes freezing code write the page and writeprotect it again. So everything
> will be safe.
For the locked page to be a part of the freeze initiated sync, should
its owner inode not be dirtied? The page fault handler dirties the page,
but who ensures that the inode is dirtied at this point?

Thanks!

Warm Regards,
Surbhi.



>
> Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
> stable pages during writeback need that as well so it's a reasonable thing
> to do). So something like attached patches should do what's needed - it's
> lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.
>
> Honza


2011-05-04 14:30:27

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 5/4/11 3:26 AM, Surbhi Palande wrote:
> On 05/03/2011 11:14 PM, Eric Sandeen wrote:
>> On 5/3/11 2:27 AM, Surbhi Palande wrote:

...

>>> Should this not be reverted? I think that it's a lot easier to
>>> stop a transaction between a freeze and a thaw that way! If you
>>> agree, can I send a patch for the same?
>> Only if you want the kernel to start spewing "BUG!" messages
>> again...
>>
>> -Eric
> But, then you need a much more complicated way to stop accepting the
> transactions and the writes between the freeze and the thaw? (in the
> write path and the read path)? Is this not much simpler?

I just cannot see how a solution which leads to:

>> ================================================
>> [ BUG: lock held when returning to user space! ]
>> ------------------------------------------------
>> lvcreate/1075 is leaving the kernel with locks still held!
>> 1 lock held by lvcreate/1075:
>> #0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>> jbd2_journal_lock_updates+0xe1/0xf0


can be considered viable.

You are welcome to send the patch, and if other ext4 devs concur with it then I'll be outvoted. :)

-Eric


2011-05-04 19:19:16

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> On 05/03/2011 06:19 PM, Jan Kara wrote:
> >On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >>>(2011/04/16 2:13), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>>>probably
> >>>>>>get modified to block while minor-faulting the page on frozen fs
> >>>>>>because
> >>>>>>when blocks are already allocated we may skip starting a transaction
> >>>>>>and so
> >>>>>>we could possibly modify the filesystem.
> >>>>>OK. I think ->page_mkwrite() should also block writing the
> >>>>>minor-faulting pages.
> >>>>>
> >>>>>(minor-pagefault)
> >>>>>-> do_wp_page()
> >>>>>-> page_mkwrite(= ext4_page_mkwrite())
> >>>>>=> BLOCK!
> >>>>>
> >>>>>(major-pagefault)
> >>>>>-> do_linear_fault()
> >>>>>-> page_mkwrite(= ext4_page_mkwrite())
> >>>>>=> BLOCK!
> >>>>>
> >>>>>>
> >>>>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>>>file (mmap).
> >>>>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>>>while
> >>>>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>>>operation
> >>>>>>>>>while fsfreezing.
> >>>>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>>>means we
> >>>>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>>>not mean
> >>>>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>>>properly
> >>>>>>>>note the mmap case is one of such exceptions.
> >>>>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>>>can't allow
> >>>>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>>>path can
> >>>>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>>>happens after
> >>>>>>>fsfreeze operation is done...
> >>>>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>>>above
> >>>>>>when writepage() does not wait for filesystem being frozen or something
> >>>>>>else?
> >>>>>Sorry, I didn't understand around the page fault path.
> >>>>>So, I had read the kernel source code around it, then I maybe
> >>>>>understand...
> >>>>>
> >>>>>I worry whether we can update the file data in mmap case while
> >>>>>fsfreezing.
> >>>>>Of course, I understand that we can write to in-memory cache, and it
> >>>>>is not a
> >>>>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>>>problem.
> >>>>>So, I summarize the cases whether we can write to disk or not.
> >>>>>
> >>>>>--------------------------------------------------------------------------
> >>>>>
> >>>>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>>>while fsfreezing)
> >>>>>
> >>>>>[1] One of the mmapped pages is not bound, and
> >>>>>the page is not allocated yet. (major fault?)
> >>>>>
> >>>>>(1) user dirties a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) __do_fault is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb => We can STOP!
> >>>>>
> >>>>>[2] One of the mmapped pages is not bound, but
> >>>>>the page is already allocated, and the buffer_heads of the page
> >>>>>are not mapped (BH_Mapped). (minor fault?)
> >>>>>
> >>>>>(1) user dirties a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) do_wp_page is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb => We can STOP!
> >>
> >>What happens in the case as follows:
> >>
> >>Task 1: Mmapped writes
> >>t1)ext4_page_mkwrite()
> >> t2) ext4_write_begin() (FS is thawed so we proceed)
> >> t3) ext4_write_end() (journal is stopped now)
> >>-----Pre-empted-----
> >>
> >>
> >>Task 2: Freeze Task
> >>t4) freezes the super block...
> >>...(continues)....
> >>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>completed execution.
> >>
> >>Task 1: Mmapped writes
> >>tn+1) ext4_page_mkwrite() returns 0.
> >>tn+2) __do_fault() gets control, code gets executed.
> >>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>a file based page which faulted.
> >>
> >>So you end up dirtying the page cache when the F.S is frozen? No?
> > You are right, ext4_page_mkwrite() as currently implemented has problems.
> >You have to return the page locked (and check for frozen fs with page lock
> >held) to avoid races.
> >
> >If you check for frozen fs with page lock held, you are guaranteed that
> >freezing code must wait for the page to get unlocked before proceeding. And
> >before the page is unlocked, it is marked dirty by the pagefault code which
> >makes freezing code write the page and writeprotect it again. So everything
> >will be safe.
> For the locked page to be a part of the freeze initiated sync,
> should its owner inode not be dirtied? The page fault handler
> dirties the page, but who ensures that the inode is dirtied at this
> point?
Follow the path from set_page_dirty() -> __set_page_dirty_buffers()
-> __set_page_dirty() -> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

More code reading would save you (and me) some typing ;).
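
The call chain above can be pictured with a small userspace model. This is only a sketch, not kernel code: every `demo_*` name and the flag value are invented for illustration, and each function merely stands in for the kernel function named in its comment. The point it shows is that dirtying a page through this path also marks the owning inode dirty.

```c
#include <stdbool.h>

/* Illustrative flag, standing in for the kernel's I_DIRTY_PAGES. */
#define DEMO_I_DIRTY_PAGES 0x4u

struct demo_inode   { unsigned int state; };
struct demo_mapping { struct demo_inode *host; };
struct demo_page    { bool dirty; struct demo_mapping *mapping; };

/* Stands in for __mark_inode_dirty(): record the dirty flags on the inode. */
static void demo_mark_inode_dirty(struct demo_inode *inode, unsigned int flags)
{
    inode->state |= flags;
}

/* Stands in for __set_page_dirty(): tag the page, then dirty its owner. */
static void demo_set_page_dirty_inner(struct demo_page *page)
{
    page->dirty = true;
    demo_mark_inode_dirty(page->mapping->host, DEMO_I_DIRTY_PAGES);
}

/* Stands in for __set_page_dirty_buffers(). */
static void demo_set_page_dirty_buffers(struct demo_page *page)
{
    demo_set_page_dirty_inner(page);
}

/* Stands in for set_page_dirty(): returns true if the page was newly dirtied. */
bool demo_set_page_dirty(struct demo_page *page)
{
    if (page->dirty)
        return false;
    demo_set_page_dirty_buffers(page);
    return true;
}
```

In this model, a first call to `demo_set_page_dirty()` on a clean page leaves both the page and its host inode dirty; a second call is a no-op.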

Honza
--
Jan Kara
SUSE Labs, CR

2011-05-04 19:24:39

by Jan Kara

[permalink] [raw]
Subject: Re: your mail

On Tue 03-05-11 18:43:48, Surbhi Palande wrote:
> On 05/03/2011 06:36 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> >>On 05/03/2011 04:46 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> >>
> >>Sorry for missing the subject line :(
> >>>>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>>>Toshiyuki pointed out.
> >>>>
> >>>>zap_pte_range()
> >>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>>>
> >>>>So, I think that it is here that we should do the checking for a ext4 F.S
> >>>>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>>>
> >>>>Attaching a patch for initial review. Please do let me know your thoughts!
> >>> This is definitely the wrong place. ->set_page_dirty() callbacks are
> >>>called with various locks held and the page need not be locked (thus
> >>>dereferencing page->mapping is oopsable). Moreover this particular callback
> >>>is called only in data=journal mode.
> >>Ok! Thanks for that!
> >>
> >>>
> >>>Believe me, the right place is page_mkwrite() - you have to catch the
> >>>read-only => read-write page transition. Once the page is mapped
> >>>read-write, you've already lost the race.
> >>
> >>My only point is:
> >>1) something should prevent the freeze from happening. We can't
> >>merely check vfs_check_frozen()?
> > Yes, I agree - see my other email with patches.
> >
> >>And this should be done where the page is marked dirty. Also, I
> >>thought that the page is marked read-write only in the page table in
> >>the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> >>the page cache? Is this understanding right?
> > The page can become dirty either because it was written via standard
> >write - write_begin is responsible for reliable check here - or it was
> >written via mmap - here we rely on page_mkwrite to do a reliable check -
> >it is analogous to write_begin callback. There should be no other way
> >to dirty a page.
> >
> >With dirty bits it is a bit complicated. We have two of them in fact. One
> >in page table entry maintained by mmu and one in page structure maintained
> >by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> >from page table into struct page. This is a lazy process so page can in
> >principle have new data without a dirty bit set in struct page because we
> >have not yet copied the dirty bit from page table. Only at moments where it
> >is important (like when we want to unmap the page, or throw away the page,
> >or so), we make sure struct page and page table bits are in sync.
> >
> >Another subtle thing you may not be aware of is that when we clear page
> >dirty bit, we also writeprotect the page. So we are guaranteed to get a
> >page fault when the page is written to again.
> >
> >>IMHO, whatever code dirties the page in the page cache should call a
> >>F.S specific function and let it _prevent_ a fsfreeze while the page
> >>is getting dirtied, so that a freeze called after this point flushes
> >>this page!
> > Agreed, that's what code in write_begin() and page_mkwrite() should
> >achieve.
> > Honza
> Thanks a lot for the wonderful explanation :)
>
> How about the revert : i.e calling jbd2_journal_unlock_updates()
> from ext4_unfreeze() instead of the ext4_freeze()? Do you agree to
> that?
Sorry, I don't agree with the revert. We could talk about changing
jbd2_journal_lock_updates() to not return with the mutex held (and handle
synchronization of locked journal operations differently) as an alternative
to doing "freeze" reference counting. But returning to user space with a
mutex held is a no-go. It causes problems in lockdep, violates kernel
locking rules, and is generally bad programming ;).
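
To make the "freeze" reference counting alternative more concrete, here is a rough userspace sketch using C11 atomics. It is an assumption about what such a scheme could look like, loosely following the vfs_start_write()/vfs_stop_write() idea floated earlier in the thread; all `cnt_*` names are hypothetical, and a real implementation would sleep and wait where this one simply reports failure.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-superblock state: in-flight writers plus a frozen flag. */
struct cnt_sb {
    atomic_int  writers;
    atomic_bool frozen;
};

void cnt_sb_init(struct cnt_sb *sb)
{
    atomic_init(&sb->writers, 0);
    atomic_init(&sb->frozen, false);
}

/*
 * Announce the write first, then check the frozen flag. Checking after the
 * increment closes the window that a plain "check, then start" leaves open.
 * Returns false if the caller has to wait for a thaw and retry.
 */
bool cnt_start_write(struct cnt_sb *sb)
{
    atomic_fetch_add(&sb->writers, 1);
    if (atomic_load(&sb->frozen)) {
        atomic_fetch_sub(&sb->writers, 1);
        return false;
    }
    return true;
}

void cnt_stop_write(struct cnt_sb *sb)
{
    atomic_fetch_sub(&sb->writers, 1);
}

/*
 * Set the flag first, then look at the counter. This sketch backs off when
 * writers are still in flight; a real freeze would wait for them to drain.
 */
bool cnt_try_freeze(struct cnt_sb *sb)
{
    atomic_store(&sb->frozen, true);
    if (atomic_load(&sb->writers) != 0) {
        atomic_store(&sb->frozen, false);
        return false;
    }
    return true;
}

void cnt_thaw(struct cnt_sb *sb)
{
    atomic_store(&sb->frozen, false);
}
```

No lock is held between the freeze and the thaw in this model: the frozen state lives in a flag and a counter, so nothing resembling a mutex is carried back to user space.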

Honza

--
Jan Kara
SUSE Labs, CR

2011-05-04 21:34:51

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/04/2011 10:19 PM, Jan Kara wrote:
> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>>>> (2011/04/16 2:13), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>>>> probably
>>>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>>>> because
>>>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>>>> and so
>>>>>>>> we could possibly modify the filesystem.
>>>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>>>> minor-faulting pages.
>>>>>>>
>>>>>>> (minor-pagefault)
>>>>>>> -> do_wp_page()
>>>>>>> -> page_mkwrite(= ext4_page_mkwrite())
>>>>>>> => BLOCK!
>>>>>>>
>>>>>>> (major-pagefault)
>>>>>>> -> do_linear_fault()
>>>>>>> -> page_mkwrite(= ext4_page_mkwrite())
>>>>>>> => BLOCK!
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>>>> file (mmap).
>>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>>>> while
>>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>>>> operation
>>>>>>>>>>> while fsfreezing.
>>>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>>>> means we
>>>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>>>> not mean
>>>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>>>> properly
>>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>>>> can't allow
>>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>>>> path can
>>>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>>>> happens after
>>>>>>>>> fsfreeze operation is done...
>>>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>>>> above
>>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>>> else?
>>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>>> So, I had read the kernel source code around it, then I maybe
>>>>>>> understand...
>>>>>>>
>>>>>>> I worry whether we can update the file data in mmap case while
>>>>>>> fsfreezing.
>>>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>>>> is not a
>>>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>>>> problem.
>>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>>> while fsfreezing)
>>>>>>>
>>>>>>> [1] One of the mmapped pages is not bound, and
>>>>>>> the page is not allocated yet. (major fault?)
>>>>>>>
>>>>>>> (1) user dirties a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) __do_fault is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>>>>
>>>>>>> [2] One of the mmapped pages is not bound, but
>>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>>>
>>>>>>> (1) user dirties a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) do_wp_page is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb => We can STOP!
>>>>
>>>> What happens in the case as follows:
>>>>
>>>> Task 1: Mmapped writes
>>>> t1)ext4_page_mkwrite()
>>>> t2) ext4_write_begin() (FS is thawed so we proceed)
>>>> t3) ext4_write_end() (journal is stopped now)
>>>> -----Pre-empted-----
>>>>
>>>>
>>>> Task 2: Freeze Task
>>>> t4) freezes the super block...
>>>> ...(continues)....
>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>> completed execution.
>>>>
>>>> Task 1: Mmapped writes
>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>> tn+2) __do_fault() gets control, code gets executed.
>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>> a file based page which faulted.
>>>>
>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>> You are right, ext4_page_mkwrite() as currently implemented has problems.
>>> You have to return the page locked (and check for frozen fs with page lock
>>> held) to avoid races.
>>>
>>> If you check for frozen fs with page lock held, you are guaranteed that
>>> freezing code must wait for the page to get unlocked before proceeding. And
>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>> makes freezing code write the page and writeprotect it again. So everything
>>> will be safe.
>> For the locked page to be a part of the freeze initiated sync,
>> should its owner inode not be dirtied? The page fault handler
>> dirties the page, but who ensures that the inode is dirtied at this
>> point?
Well, I mean it as follows:

Doesn't the writeback code (invoked via sync_filesystem(sb)) write all
the dirty pages of all the _dirty_ inodes of a superblock?

So in the window between ext4_page_mkwrite() returning to __do_fault()
and the inode being marked dirty (in __mark_inode_dirty()), you can race
with freeze: if a freeze happens in that window, the sync it initiates
will not consider this locked page, because the owner inode is still
_clean_ (not yet dirtied) at that point?

Key: tx: time at unit x

P1: mmapped writes
t1) __do_page_fault()
t2) ext4_page_mkwrite()
// owner inode of the page is in _clean_ state - not yet dirtied
--- pre-empted---

P2: Freeze_super
tn) freeze_super gets control
freezes the F.S, skips the owner inode as it is in the clean state.
syncs all the other dirty inodes. page cache is now clean.


P1: mmapped writes (resume)
tn+x)__do_page_fault() gets control back:
tn+x+1) set_page_dirty()
tn+x+2) __set_page_dirty_buffers()
tn+x+3) __set_page_dirty()
tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)

So don't we end up dirtying the page cache when the F.S is frozen?
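
The interleaving above can be replayed in a small single-threaded model. This is a sketch under simplifying assumptions, not kernel code: the `toy_*` names are made up, one page and one inode stand in for the whole filesystem, and "sync" is reduced to cleaning pages of inodes that are already dirty.

```c
#include <stdbool.h>

struct toy_inode { bool dirty; };
struct toy_page  { bool dirty; struct toy_inode *host; };
struct toy_fs    { bool frozen; struct toy_inode inode; struct toy_page page; };

static void toy_fs_init(struct toy_fs *fs)
{
    fs->frozen = false;
    fs->inode.dirty = false;
    fs->page.dirty = false;
    fs->page.host = &fs->inode;
}

/* t2: the frozen check inside page_mkwrite, done before anything is dirty. */
static bool toy_mkwrite_check(const struct toy_fs *fs)
{
    return !fs->frozen;
}

/* tn: freeze marks the fs frozen and writes back pages of dirty inodes only. */
static void toy_freeze(struct toy_fs *fs)
{
    fs->frozen = true;
    if (fs->inode.dirty) {
        fs->page.dirty = false;
        fs->inode.dirty = false;
    }
}

/* tn+x: the fault handler dirties the page, and through it the inode. */
static void toy_fault_finish(struct toy_fs *fs)
{
    fs->page.dirty = true;
    fs->page.host->dirty = true;
}

/* Replays P1 -> P2 -> P1; returns true if a dirty page exists on a frozen fs. */
bool toy_replay_race(void)
{
    struct toy_fs fs;

    toy_fs_init(&fs);
    if (!toy_mkwrite_check(&fs))   /* P1: check passes, fs not frozen yet */
        return false;
    toy_freeze(&fs);               /* P2: inode is clean, so sync skips it */
    toy_fault_finish(&fs);         /* P1: resumes and dirties the page */
    return fs.frozen && fs.page.dirty;
}
```

In this model the replay ends with the filesystem frozen and the page dirty, which is the outcome the timeline describes.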

Again, apologies if I understood the writeback code or something else wrong!

Warm Regards,
Surbhi.

> Follow the path from set_page_dirty() -> __set_page_dirty_buffers()
> -> __set_page_dirty() -> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);


>
> More code reading would save you (and me) some typing ;).
P/S: Sorry about that!

>
> Honza


2011-05-04 22:48:24

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> On 05/04/2011 10:19 PM, Jan Kara wrote:
> >On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>What happens in the case as follows:
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>t1)ext4_page_mkwrite()
> >>>> t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>> t3) ext4_write_end() (journal is stopped now)
> >>>>-----Pre-empted-----
> >>>>
> >>>>
> >>>>Task 2: Freeze Task
> >>>>t4) freezes the super block...
> >>>>...(continues)....
> >>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>completed execution.
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>>>a file based page which faulted.
> >>>>
> >>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>> You are right, ext4_page_mkwrite() as currently implemented has problems.
> >>>You have to return the page locked (and check for frozen fs with page lock
> >>>held) to avoid races.
> >>>
> >>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>makes freezing code write the page and writeprotect it again. So everything
> >>>will be safe.
> >>For the locked page to be a part of the freeze initiated sync,
> >>should its owner inode not be dirtied? The page fault handler
> >>dirties the page, but who ensures that the inode is dirtied at this
> >>point?
> Well, I mean it as follows:
>
> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> all the dirty pages of all the _dirty_ inodes of a superblock?
>
> So in the window from the point where ext4_page_mkwrite returns to
> __do_fault() _till_ you mark the inode dirty (in
> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
> happens meanwhile, then the sync initiated by freeze will not
> consider this locked page as the owner inode is _clean_ (or not
> dirtied yet) at that point?
Ah, I see. That's actually a good point! Thanks for your persistence. So we
should also dirty the page before checking for frozen fs.
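
A toy model of that ordering (lock the page, dirty it, and only then check for a frozen fs) might look like the following. This is a sketch of the idea only, not the actual patch: the `fix_*` names are hypothetical, a single page stands in for the page cache, and a blocked sync is modelled by returning false instead of sleeping on the page lock.

```c
#include <stdbool.h>

enum fix_result { FIX_OK, FIX_RETRY };

struct fix_fs {
    bool frozen;
    bool inode_dirty;
    bool page_dirty;
    bool page_locked;
};

/*
 * Fault side: take the page lock, dirty page and inode, then check frozen.
 * On FIX_OK the page is returned locked; on FIX_RETRY the caller is expected
 * to wait for a thaw and try again.
 */
enum fix_result fix_page_mkwrite(struct fix_fs *fs)
{
    fs->page_locked = true;
    fs->page_dirty = true;
    fs->inode_dirty = true;
    if (fs->frozen) {
        fs->page_locked = false;
        return FIX_RETRY;
    }
    return FIX_OK;
}

/* The fault handler is done with the page and drops the page lock. */
void fix_fault_done(struct fix_fs *fs)
{
    fs->page_locked = false;
}

/* Freeze side, step 1: mark the filesystem frozen. */
void fix_freeze_begin(struct fix_fs *fs)
{
    fs->frozen = true;
}

/*
 * Freeze side, step 2: write back dirty data. Returns false where a real
 * sync would block on the page lock; the caller retries after the unlock.
 */
bool fix_freeze_sync(struct fix_fs *fs)
{
    if (!fs->inode_dirty)
        return true;
    if (fs->page_locked)
        return false;
    fs->page_dirty = false;
    fs->inode_dirty = false;
    return true;
}
```

Because the inode is already dirty by the time the frozen check runs, a freeze that starts after the check finds the inode during its sync and has to wait for the page lock, so in this model the page is clean once the sync completes.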

> Key: tx: time at unit x
>
> P1: mmapped writes
> t1) __do_page_fault()
> t2) ext4_page_mkwrite()
> // owner inode of the page is in _clean_ state - not yet dirtied
> --- pre-empted---
>
> P2: Freeze_super
> tn) freeze_super gets control
> freezes the F.S, skips the owner inode as it is in the clean state.
> syncs all the other dirty inodes. page cache is now clean.
>
>
> P1: mmapped writes (resume)
> tn+x)__do_page_fault() gets control back:
> tn+x+1) set_page_dirty()
> tn+x+2) __set_page_dirty_buffers()
> tn+x+3) __set_page_dirty()
> tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
>
> So don't we end up dirtying the page cache when the F.S is frozen?
>
> Again, apologies if I understood the writeback code or something else wrong!
No, you understood it right. Your previous email was just too generic, so
I had not thought about this particular race.

Honza
--
Jan Kara
SUSE Labs, CR

2011-05-05 06:06:39

by Surbhi Palande

[permalink] [raw]
Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/05/2011 01:48 AM, Jan Kara wrote:
> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>> What happens in the case as follows:
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> t1)ext4_page_mkwrite()
>>>>>> t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>> t3) ext4_write_end() (journal is stopped now)
>>>>>> -----Pre-empted-----
>>>>>>
>>>>>>
>>>>>> Task 2: Freeze Task
>>>>>> t4) freezes the super block...
>>>>>> ...(continues)....
>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>> completed execution.
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>>>> a file based page which faulted.
>>>>>>
>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>> You are right, ext4_page_mkwrite() as currently implemented has problems.
>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>> held) to avoid races.
>>>>>
>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>> will be safe.
>>>> For the locked page to be a part of the freeze initiated sync,
>>>> should its owner inode not be dirtied? The page fault handler
>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>> point?
>> Well, I mean it as follows:
>>
>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>
>> So in the window from the point where ext4_page_mkwrite returns to
>> __do_fault() _till_ you mark the inode dirty (in
>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>> happens meanwhile, then the sync initiated by freeze will not
>> consider this locked page as the owner inode is _clean_ (or not
>> dirtied yet) at that point?
> Ah, I see. That's actually a good point! Thanks for persistence. So we
> should also dirty the page before checking for frozen fs.

Should we not also dirty the inode? IMHO, marking the inode dirty will be
racy as well!

Warm Regards,
Surbhi.

>
>> Key: tx: time at unit x
>>
>> P1: mmapped writes
>> t1) __do_page_fault()
>> t2) ext4_page_mkwrite()
>> // owner inode of the page is in _clean_ state - not yet dirtied
>> --- pre-empted---
>>
>> P2: Freeze_super
>> tn) freeze_super gets control
>> freezes the F.S, skips the owner inode as it is in the clean state.
>> syncs all the other dirty inodes. page cache is now clean.
>>
>>
>> P1: mmapped writes (resume)
>> tn+x)__do_page_fault() gets control back:
>> tn+x+1) set_page_dirty()
>> tn+x+2) __set_page_dirty_buffers()
>> tn+x+3) __set_page_dirty()
>> tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
>>
>> So don't we end up dirtying the page cache when the F.S is frozen?
>>
>> Again, apologies if I understood the writeback code or something else wrong!
> No, you understood it right. Just your previous email was too generic so
> I have not thought about this particular race.
>
> Honza


2011-05-05 11:18:09

by Jan Kara

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
> On 05/05/2011 01:48 AM, Jan Kara wrote:
> >On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> >>On 05/04/2011 10:19 PM, Jan Kara wrote:
> >>>On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>>>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>>>What happens in the case as follows:
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>t1)ext4_page_mkwrite()
> >>>>>> t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>>>> t3) ext4_write_end() (journal is stopped now)
> >>>>>>-----Pre-empted-----
> >>>>>>
> >>>>>>
> >>>>>>Task 2: Freeze Task
> >>>>>>t4) freezes the super block...
> >>>>>>...(continues)....
> >>>>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>>>completed execution.
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>>>tn+3) _do_fault() marks the page dirty if the intent is to write to
> >>>>>>a file based page which faulted.
> >>>>>>
> >>>>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>>>> You are right ext4_page_mkrite() as currently implemented has problems.
> >>>>>You have to return the page locked (and check for frozen fs with page lock
> >>>>>held) to avoid races.
> >>>>>
> >>>>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>>>makes freezing code write the page and writeprotect it again. So everything
> >>>>>will be safe.
> >>>>For the locked page to be a part of the freeze initiated sync,
> >>>>should its owner inode not be dirtied? The page fault handler
> >>>>dirties the page, but who ensures that the inode is dirtied at this
> >>>>point?
> >>Well, I mean it as follows:
> >>
> >>Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> >>all the dirty pages of all the _dirty_ inodes of a superblock?
> >>
> >>So in the window from the point where ext4_page_mkwrite returns to
> >>__do_fault() _till_ you mark the inode dirty (in
> >>__mark_inode_dirty()), you can have a race with freeze i.e if freeze
> >>happens meanwhile, then the sync initiated by freeze will not
> >>consider this locked page as the owner inode is _clean_ (or not
> >>dirtied yet) at that point?
> > Ah, I see. That's actually a good point! Thanks for persistence. So we
> >should also dirty the page before checking for frozen fs.
>
> Should we not also dirty the inode? IMHO, marking an inode will be
> racy as well!
Marking the page dirty marks the inode dirty as well, as I've explained in my
previous emails. So I'm missing what you are concerned about...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
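
The ordering argued over in this exchange can be walked through with a toy,
single-threaded model. This is only a sketch under stated assumptions: the
struct and every function name below are hypothetical, the freeze is collapsed
into one step (sync, then mark frozen), and "sleeping on the page lock" is
modelled as the freeze attempt simply returning false. It is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: one filesystem with a single page. All names are hypothetical. */
struct toy_fs {
	bool frozen;      /* filesystem is frozen */
	bool page_locked; /* page lock is held by the fault path */
	bool page_dirty;  /* page is dirty in the page cache */
};

/*
 * Freeze-time sync, heavily simplified: it cannot get past a locked page.
 * Returns false where the real freezer would sleep on the page lock.
 */
static bool toy_fs_freeze(struct toy_fs *fs)
{
	if (fs->page_locked)
		return false;
	fs->page_dirty = false; /* sync writes the page back and write-protects it */
	fs->frozen = true;
	return true;
}

/* Frozen check done without the page lock; the page is dirtied afterwards. */
static bool unlocked_check_leaves_dirty_page(void)
{
	struct toy_fs fs = { false, false, false };
	bool may_write = !fs.frozen;        /* mkwrite: fs looks thawed */
	bool froze = toy_fs_freeze(&fs);    /* freeze runs to completion */

	if (may_write)
		fs.page_dirty = true;       /* fault path dirties the page */
	/* true means: frozen filesystem with a dirty page, i.e. the race hit */
	return froze && fs.frozen && fs.page_dirty;
}

/* Page locked and dirtied first, frozen check done under the page lock. */
static bool locked_check_stays_clean(void)
{
	struct toy_fs fs = { false, false, false };
	bool may_write, blocked, froze;

	fs.page_locked = true;              /* mkwrite takes the page lock */
	fs.page_dirty = true;               /* dirty before checking for freeze */
	may_write = !fs.frozen;             /* check under the page lock */
	blocked = !toy_fs_freeze(&fs);      /* freezer has to wait for the page */
	fs.page_locked = false;             /* fault path finishes and unlocks */
	froze = toy_fs_freeze(&fs);         /* sync now sees and cleans the page */
	/* true means: the freeze completed with a clean page cache */
	return may_write && blocked && froze && fs.frozen && !fs.page_dirty;
}
```

In this model the first function reports the problem case (a dirty page on a
frozen filesystem) and the second reports that the freeze only completes after
the page has been written back.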

2011-05-05 14:01:20

by Surbhi Palande

Subject: Re: [RFC][PATCH] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On 05/05/2011 02:18 PM, Jan Kara wrote:
> On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
>> On 05/05/2011 01:48 AM, Jan Kara wrote:
>>> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>>>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>>>> What happens in the case as follows:
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> t1)ext4_page_mkwrite()
>>>>>>>> t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>>>> t3) ext4_write_end() (journal is stopped now)
>>>>>>>> -----Pre-empted-----
>>>>>>>>
>>>>>>>>
>>>>>>>> Task 2: Freeze Task
>>>>>>>> t4) freezes the super block...
>>>>>>>> ...(continues)....
>>>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>>>> completed execution.
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>>>> tn+3) _do_fault() marks the page dirty if the intent is to write to
>>>>>>>> a file based page which faulted.
>>>>>>>>
>>>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>>>> You are right ext4_page_mkrite() as currently implemented has problems.
>>>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>>>> held) to avoid races.
>>>>>>>
>>>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>>>> will be safe.
>>>>>> For the locked page to be a part of the freeze initiated sync,
>>>>>> should its owner inode not be dirtied? The page fault handler
>>>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>>>> point?
>>>> Well, I mean it as follows:
>>>>
>>>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>>>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>>>
>>>> So in the window from the point where ext4_page_mkwrite returns to
>>>> __do_fault() _till_ you mark the inode dirty (in
>>>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>>>> happens meanwhile, then the sync initiated by freeze will not
>>>> consider this locked page as the owner inode is _clean_ (or not
>>>> dirtied yet) at that point?
>>> Ah, I see. That's actually a good point! Thanks for persistence. So we
>>> should also dirty the page before checking for frozen fs.
>>
>> Should we not also dirty the inode? IMHO, marking an inode will be
>> racy as well!
> Marking the page dirty marks the inode dirty as well as I've explained in my
> previous emails. So I'm missing what you are concerned about...

Yes you are right! There is no other concern - setting the page dirty
will be racy.

Warm Regards,
Surbhi.

2011-05-06 15:20:45

by Surbhi Palande

Subject: [RFC][PATCH] Do not accept a new handle when the F.S is frozen

This patch adds a flag that indicates whether the journal is frozen, and
introduces two new functions, jbd2_journal_freeze() and jbd2_journal_thaw(),
which should be called when the F.S freezes and thaws respectively.
A new handle can now be started only when the barrier count is 0 and the
journal is not frozen. While the journal is frozen, a process that tries to
start a new handle is put on a wait queue. Thawing the journal wakes up all
the processes waiting on this wait queue.

I have lightly tested this patch. Sending it here for initial review. Please
do let me know your inputs. Thanks a lot!

Warm Regards,
Surbhi.



2011-05-06 15:20:31

by Surbhi Palande

Subject: [PATCH] Adding support to freeze and unfreeze a journal

The journal should be frozen when a F.S freezes. This means that until
the F.S is thawed again, no new transactions should be accepted by the
journal. When the F.S thaws, it should in turn thaw the journal, which
allows the journal to resume accepting new transactions.
While the F.S has frozen the journal, clients of the journal that call
jbd2_journal_start() will sleep on a wait queue. Thawing the journal wakes
up the sleeping clients and journalling can progress normally.

Signed-off-by: Surbhi Palande <[email protected]>
---
fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 9 +++++++++
4 files changed, 52 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..ad5a5df 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,13 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* dont let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the jflags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ jbd2_check_frozen(journal);

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -489,6 +496,35 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags = journal->j_flags | JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..31d6c23 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -835,6 +835,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -1013,7 +1016,11 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
+

+#define jbd2_check_frozen(journal) \
+ wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))
/*
* Function declarations for the journaling transaction and buffer
* management
@@ -1121,6 +1128,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.1


2011-05-06 20:56:45

by Andreas Dilger

Subject: Re: [PATCH] Adding support to freeze and unfreeze a journal

On May 6, 2011, at 09:20, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, inturn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
>
> Signed-off-by: Surbhi Palande <[email protected]>

I think this is not a good patch as-is, see below.

> ---
> fs/ext4/super.c | 20 ++++++--------------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 36 ++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 9 +++++++++
> 4 files changed, 52 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error < 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error >= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..ad5a5df 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,13 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* dont let a new handle start when a journal is frozen.
> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the jflags indicate that the journal is frozen. So if the
> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + jbd2_check_frozen(journal);

This is called with read_lock(&journal->j_state_lock) held, so it seems that
the thread entering start_this_handle() will hold the j_state_lock the entire
time the journal is frozen. But, see below in jbd2_journal_thaw()...

> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -489,6 +496,35 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error >= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags = journal->j_flags | JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);

... this code needs to get a write lock j_state_lock in order to unfreeze the
journal. It seems that this would deadlock as soon as you actually had some
situation where a transaction is being started while the journal is frozen?
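
To make the lock ordering concern concrete, here is a toy single-threaded model
of reader/writer lock bookkeeping. It is only a sketch: toy_rwlock and all the
helpers are hypothetical stand-ins, not the kernel rwlock_t API, and "blocking
forever" is modelled as a trylock that fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer lock bookkeeping; not the kernel rwlock_t. */
struct toy_rwlock {
	int readers;  /* number of current read holders */
	bool writer;  /* true while a writer holds the lock */
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	l->readers--;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	l->writer = false;
}

/* The starter sleeps waiting for the thaw while still holding the read lock. */
static bool thaw_succeeds_with_reader_asleep(void)
{
	struct toy_rwlock lock = { 0, false };
	bool frozen = true;

	toy_read_trylock(&lock);        /* handle start takes the read lock */
	/* ... and goes to sleep waiting for the thaw, lock still held ... */
	if (!toy_write_trylock(&lock))  /* thaw needs the write lock */
		return false;           /* thaw can never run: deadlock */
	frozen = false;
	toy_write_unlock(&lock);
	return !frozen;
}

/* The starter drops the read lock before it goes to sleep. */
static bool thaw_succeeds_after_unlock(void)
{
	struct toy_rwlock lock = { 0, false };
	bool frozen = true;

	toy_read_trylock(&lock);
	toy_read_unlock(&lock);         /* drop the lock, then sleep */
	if (!toy_write_trylock(&lock))
		return false;
	frozen = false;                 /* thaw clears the flag */
	toy_write_unlock(&lock);
	return !frozen;
}
```

In the model, the first scenario never gets to clear the flag, while the second
does, which is the difference between sleeping with and without the lock held.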

> + wake_up(&journal->j_wait_frozen);
> +}
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..31d6c23 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -835,6 +835,9 @@ struct journal_s
> /* Wait queue to wait for updates to complete */
> wait_queue_head_t j_wait_updates;
>
> + /* Wait queue to wait for journal to thaw*/
> + wait_queue_head_t j_wait_frozen;
> +
> /* Semaphore for locking against concurrent checkpoints */
> struct mutex j_checkpoint_mutex;
>
> @@ -1013,7 +1016,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> * data write error in ordered
> * mode */
> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
> +
>
> +#define jbd2_check_frozen(journal) \
> + wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))

It seems that what is needed here is to check the JBD2_FROZEN state under
lock (so that it is not racing with the flag being set) but drop the lock,
wait, and retry if the journal was actually frozen. Since "goto label" is
ugly from within a macro, this probably needs to be open-coded at the caller
in start_this_handle(), something like:

	if (journal->j_flags & JBD2_FROZEN) {
		read_unlock(&journal->j_state_lock);
		wait_event(journal->j_wait_frozen,
			   !(journal->j_flags & JBD2_FROZEN));
		goto repeat;
	}

This raises the question of whether it is SMP safe to check j_flags without
any kind of barrier or lock held. I think in this case we are fine, since the
above code jumps back to "repeat" to re-verify j_flags under j_state_lock, so
there is no race on the state.
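
The "check under the lock, drop the lock, sleep, then re-check" pattern can be
tried out in userspace with a rough pthread analogue. This is a sketch under
assumptions, not the jbd2 code: toy_journal and its fields are hypothetical
stand-ins for j_state_lock, j_wait_frozen and JBD2_FROZEN, and a mutex plus
condition variable replaces the kernel rwlock and wait queue. Here
pthread_cond_wait() releases the lock while sleeping and retakes it before
returning, so the while loop plays the role of "goto repeat".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the journal state. */
struct toy_journal {
	pthread_mutex_t state_lock;  /* plays the role of j_state_lock */
	pthread_cond_t wait_frozen;  /* plays the role of j_wait_frozen */
	bool frozen;                 /* plays the role of JBD2_FROZEN */
	int handles_started;
};

static void toy_journal_freeze(struct toy_journal *j)
{
	pthread_mutex_lock(&j->state_lock);
	j->frozen = true;
	pthread_mutex_unlock(&j->state_lock);
}

static void toy_journal_thaw(struct toy_journal *j)
{
	pthread_mutex_lock(&j->state_lock);
	j->frozen = false;
	pthread_mutex_unlock(&j->state_lock);
	pthread_cond_broadcast(&j->wait_frozen); /* wake every waiter */
}

static void toy_start_handle(struct toy_journal *j)
{
	pthread_mutex_lock(&j->state_lock);
	/* The flag is only ever tested with the lock held; the wait drops it. */
	while (j->frozen)
		pthread_cond_wait(&j->wait_frozen, &j->state_lock);
	j->handles_started++;
	pthread_mutex_unlock(&j->state_lock);
}

static void *toy_worker(void *arg)
{
	toy_start_handle(arg);
	return NULL;
}

/* Freeze, let one thread try to start a handle, thaw, count the handles. */
static int toy_freeze_thaw_demo(void)
{
	struct toy_journal j;
	pthread_t t;
	int started;

	pthread_mutex_init(&j.state_lock, NULL);
	pthread_cond_init(&j.wait_frozen, NULL);
	j.frozen = false;
	j.handles_started = 0;

	toy_journal_freeze(&j);
	if (pthread_create(&t, NULL, toy_worker, &j) != 0)
		return -1;
	toy_journal_thaw(&j);   /* needs the lock, so a waiter must not hold it */
	pthread_join(t, NULL);

	started = j.handles_started;
	pthread_cond_destroy(&j.wait_frozen);
	pthread_mutex_destroy(&j.state_lock);
	return started;
}
```

Whether the worker reaches the check before or after the thaw, it either sees
the flag already clear or is woken by the broadcast, so the demo should start
exactly one handle. Older toolchains may need -pthread when building it.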

> * Function declarations for the journaling transaction and buffer
> * management
> @@ -1121,6 +1128,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned long);
> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int jbd2_journal_stop(handle_t *);
> +extern int jbd2_journal_freeze(journal_t *);
> +extern void jbd2_journal_thaw(journal_t *);
> extern int jbd2_journal_flush (journal_t *);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);
> --
> 1.7.1
>


Cheers, Andreas

2011-05-07 20:04:32

by Surbhi Palande

Subject: [PATCH v2] Adding support to freeze and unfreeze a journal

On freezing the F.S, the journal should be frozen as well. This means that no
new transactions can be started on a frozen journal. Similarly, thawing the F.S
should thaw the journal, which should then start accepting new transactions
again. While the F.S is frozen, any request to start a new handle should end up
on a wait queue until the F.S is thawed again.

Add support to freeze and thaw a journal, and use this support when
freezing and thawing ext4.

Signed-off-by: Surbhi Palande <[email protected]>
---
Changes since v1:
* Check for the j_flag and if frozen release the j_state_lock before sleeping
on wait queue
* Export the freeze, thaw routines
* Added a barrier after clearing JBD2_FROZEN in j_flags (in jbd2_journal_thaw())

fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 9 +++++++++
4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..3283c77 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+	/* dont let a new handle start when a journal is frozen.
+	 * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+	 * the jflags indicate that the journal is frozen. So if the
+	 * j_barrier_count is 0, then check if this was made 0 by the freezing
+	 * process
+	 */
+	if (journal->j_flags & JBD2_FROZEN) {
+		read_unlock(&journal->j_state_lock);
+		jbd2_check_frozen(journal);
+		goto repeat;
+	}

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+	int error = 0;
+	/* Now we set up the journal barrier. */
+	jbd2_journal_lock_updates(journal);
+
+	/*
+	 * Don't clear the needs_recovery flag if we failed to flush
+	 * the journal.
+	 */
+	error = jbd2_journal_flush(journal);
+	if (error >= 0) {
+		write_lock(&journal->j_state_lock);
+		journal->j_flags = journal->j_flags | JBD2_FROZEN;
+		write_unlock(&journal->j_state_lock);
+	}
+	jbd2_journal_unlock_updates(journal);
+	return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+	write_lock(&journal->j_state_lock);
+	journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
+	write_unlock(&journal->j_state_lock);
+	smp_wmb();
+	wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..31d6c23 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -835,6 +835,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -1013,7 +1016,11 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
+

+#define jbd2_check_frozen(journal) \
+ wait_event(journal->j_wait_frozen, ((journal->j_flags & JBD2_FROZEN) != JBD2_FROZEN))
/*
* Function declarations for the journaling transaction and buffer
* management
@@ -1121,6 +1128,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.1


2011-05-08 08:24:25

by Marco Stornelli

Subject: Re: [PATCH v2] Adding support to freeze and unfreeze a journal

Il 07/05/2011 22:04, Surbhi Palande ha scritto:
> On freezing the F.S, the journal should be frozen as well. This implies not
> being able to start any new transactions on a frozen journal. Similarly,
> thawing a F.S should thaw a journal and this should conversely start accepting
> new transactions. While the F.S is frozen any request to start a new
> handle should end up on a wait queue till the F.S is thawed back again.
>
> Adding support to freeze and thaw a journal and leveraging this support in
> freezing and thawing ext4.
>
> Signed-off-by: Surbhi Palande<[email protected]>
> ---
> Changes since v1:
> * Check for the j_flag and if frozen release the j_state_lock before sleeping
> on wait queue
> * Export the freeze, thaw routines
> * Added a barrier after setting the j_flags = JBD2_FROZEN
>
> fs/ext4/super.c | 20 ++++++--------------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 9 +++++++++
> 4 files changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error< 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error>= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..3283c77 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* dont let a new handle start when a journal is frozen.
> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the jflags indicate that the journal is frozen. So if the
> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + if (journal->j_flags& JBD2_FROZEN) {
> + read_unlock(&journal->j_state_lock);
> + jbd2_check_frozen(journal);
> + goto repeat;
> + }
>
> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error>= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags = journal->j_flags | JBD2_FROZEN;

Better journal->j_flags |= JBD2_FROZEN;

> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags = journal->j_flags&= ~JBD2_FROZEN;

What? Maybe journal->j_flags &= ~JBD2_FROZEN;

> + write_unlock(&journal->j_state_lock);
> + smp_wmb();

It'd be better to add a comment here for this barrier.

> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +

Marco
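
For reference, the compound-assignment forms suggested above look like this as a
small standalone sketch. The TOY_ names are hypothetical; the values mirror the
0x040 and 0x080 flag values in the patch, and the helpers work on a plain
unsigned long rather than on journal->j_flags.

```c
#include <assert.h>

#define TOY_ABORT_ON_ERR 0x040UL /* stands in for a neighbouring flag */
#define TOY_FROZEN       0x080UL /* same value the patch gives JBD2_FROZEN */

/* Set the frozen bit and leave every other bit untouched. */
static unsigned long set_frozen(unsigned long flags)
{
	flags |= TOY_FROZEN;
	return flags;
}

/* Clear the frozen bit and leave every other bit untouched. */
static unsigned long clear_frozen(unsigned long flags)
{
	flags &= ~TOY_FROZEN;
	return flags;
}

/* Returns 1 if the frozen bit is set, 0 otherwise. */
static int is_frozen(unsigned long flags)
{
	return (flags & TOY_FROZEN) != 0;
}
```

Writing "x = x &= ~FLAG" assigns the same value twice, so the single compound
assignment is the tidier form.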

2011-05-09 09:04:53

by Surbhi Palande

Subject: Re: [PATCH v2] Adding support to freeze and unfreeze a journal

On 05/08/2011 11:24 AM, Marco Stornelli wrote:

Thanks for your review!
> Il 07/05/2011 22:04, Surbhi Palande ha scritto:
>> On freezing the F.S, the journal should be frozen as well. This
>> implies not
>> being able to start any new transactions on a frozen journal. Similarly,
>> thawing a F.S should thaw a journal and this should conversely start
>> accepting
>> new transactions. While the F.S is frozen any request to start a new
>> handle should end up on a wait queue till the F.S is thawed back again.
>>
>> Adding support to freeze and thaw a journal and leveraging this
>> support in
>> freezing and thawing ext4.
>>
>> Signed-off-by: Surbhi Palande<[email protected]>
>> ---
>> Changes since v1:
>> * Check for the j_flag and if frozen release the j_state_lock before
>> sleeping
>> on wait queue
>> * Export the freeze, thaw routines
>> * Added a barrier after setting the j_flags = JBD2_FROZEN
>>
>> fs/ext4/super.c | 20 ++++++--------------
>> fs/jbd2/journal.c | 1 +
>> fs/jbd2/transaction.c | 43 +++++++++++++++++++++++++++++++++++++++++++
>> include/linux/jbd2.h | 9 +++++++++
>> 4 files changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 8553dfb..796aa4c 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>
>> journal = EXT4_SB(sb)->s_journal;
>>
>> - /* Now we set up the journal barrier. */
>> - jbd2_journal_lock_updates(journal);
>> -
>> + error = jbd2_journal_freeze(journal);
>> /*
>> - * Don't clear the needs_recovery flag if we failed to flush
>> + * Don't clear the needs_recovery flag if we failed to freeze
>> * the journal.
>> */
>> - error = jbd2_journal_flush(journal);
>> - if (error< 0)
>> - goto out;
>> -
>> - /* Journal blocked and flushed, clear needs_recovery flag. */
>> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> - error = ext4_commit_super(sb, 1);
>> -out:
>> - /* we rely on s_frozen to stop further updates */
>> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> + if (error>= 0) {
>> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> + error = ext4_commit_super(sb, 1);
>> + }
>> return error;
>> }
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index e0ec3db..5e46333 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>> init_waitqueue_head(&journal->j_wait_checkpoint);
>> init_waitqueue_head(&journal->j_wait_commit);
>> init_waitqueue_head(&journal->j_wait_updates);
>> + init_waitqueue_head(&journal->j_wait_frozen);
>> mutex_init(&journal->j_barrier);
>> mutex_init(&journal->j_checkpoint_mutex);
>> spin_lock_init(&journal->j_revoke_lock);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 05fa77a..3283c77 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -171,6 +171,17 @@ repeat:
>> journal->j_barrier_count == 0);
>> goto repeat;
>> }
>> + /* dont let a new handle start when a journal is frozen.
>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>> + * the jflags indicate that the journal is frozen. So if the
>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>> + * process
>> + */
>> + if (journal->j_flags& JBD2_FROZEN) {
>> + read_unlock(&journal->j_state_lock);
>> + jbd2_check_frozen(journal);
>> + goto repeat;
>> + }
>>
>> if (!journal->j_running_transaction) {
>> read_unlock(&journal->j_state_lock);
>> @@ -489,6 +500,38 @@ int jbd2_journal_restart(handle_t *handle, int
>> nblocks)
>> }
>> EXPORT_SYMBOL(jbd2_journal_restart);
>>
>> +int jbd2_journal_freeze(journal_t *journal)
>> +{
>> + int error = 0;
>> + /* Now we set up the journal barrier. */
>> + jbd2_journal_lock_updates(journal);
>> +
>> + /*
>> + * Don't clear the needs_recovery flag if we failed to flush
>> + * the journal.
>> + */
>> + error = jbd2_journal_flush(journal);
>> + if (error>= 0) {
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags = journal->j_flags | JBD2_FROZEN;
>
> Better journal->j_flags |= JBD2_FROZEN;

I was wondering why this is actually better than the longer form? Is
there any technical reason other than preference of coding style here?

>
>> + write_unlock(&journal->j_state_lock);
>> + }
>> + jbd2_journal_unlock_updates(journal);
>> + return error;
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>> +
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags = journal->j_flags&= ~JBD2_FROZEN;
>
> What? Maybe journal->j_flags &= ~JBD2_FROZEN;

This definitely is a typo and needs a change. Again, why do you
recommend the shorter form?

>
>> + write_unlock(&journal->j_state_lock);
>> + smp_wmb();
>
> It'd be better to add a comment here for this barrier.
Ok!
>
>> + wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>> +
>> +
>
> Marco
Warm Regards,
Surbhi.


2011-05-09 09:24:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] Adding support to freeze and unfreeze a journal

On Mon 09-05-11 12:04:45, Surbhi Palande wrote:
> On 05/08/2011 11:24 AM, Marco Stornelli wrote:
> >>+ error = jbd2_journal_flush(journal);
> >>+ if (error>= 0) {
> >>+ write_lock(&journal->j_state_lock);
> >>+ journal->j_flags = journal->j_flags | JBD2_FROZEN;
> >
> >Better journal->j_flags |= JBD2_FROZEN;
>
> I was wondering why this is actually better than the longer form? Is
> there any technical reason other than preference of coding style
> here?
It's just coding style, but that's kind of important as well. With |= you
don't have to spend effort checking whether the right hand side names the
same variable as the left hand side. So please use |=.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
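
The style point above can be sketched in user space. This is a minimal illustration, not code from the patch: DEMO_FROZEN and the demo_* helpers are hypothetical names, and the 0x080 value only mirrors the bit the patch defines. Both ways of setting the bit give the same result; the compound form just names the target once.

```c
#include <assert.h>

/* Hypothetical flag for illustration; mirrors the 0x080 bit in the patch. */
#define DEMO_FROZEN 0x080u

/* Long form: the reader must confirm both sides name the same variable. */
unsigned int demo_set_long(unsigned int flags)
{
	flags = flags | DEMO_FROZEN;
	return flags;
}

/* Compound form: the target appears once, so there is nothing to compare. */
unsigned int demo_set_short(unsigned int flags)
{
	flags |= DEMO_FROZEN;
	return flags;
}

/* Clearing the bit with the compound form. */
unsigned int demo_clear(unsigned int flags)
{
	flags &= ~DEMO_FROZEN;
	/* The frozen bit must be gone, whatever else was set. */
	assert((flags & DEMO_FROZEN) == 0);
	return flags;
}
```

The compiler treats the two setters identically, so the choice is purely about how quickly a reviewer can read the line.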

2011-05-09 09:53:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] Adding support to freeze and unfreeze a journal

On Sat 07-05-11 23:04:22, Surbhi Palande wrote:
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags = journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + smp_wmb();
Why is the smp_wmb() here? The write is inside a rw-lock so it cannot be
reordered. Also wake_up() is protected by queue->lock so I don't see the
need for a barrier.

> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-09 13:49:24

by Surbhi Palande

[permalink] [raw]
Subject: Re: [PATCH v2] Adding support to freeze and unfreeze a journal

On 05/09/2011 12:53 PM, Jan Kara wrote:
> On Sat 07-05-11 23:04:22, Surbhi Palande wrote:
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags = journal->j_flags&= ~JBD2_FROZEN;
>> + write_unlock(&journal->j_state_lock);
>> + smp_wmb();
> Why is the smp_wmb() here? The write is inside a rw-lock so it cannot be
> reordered. Also wake_up() is protected by queue->lock so I don't see the
> need for a barrier.

Ok, thanks for letting me know. I was under the impression that a
reorder was possible in case of SMP. I will rewrite the patch with this
change and the one that Marco Stornelli suggested as well.

Thanks a lot!

Warm Regards,
Surbhi.

>
>> + wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>
> Honza


2011-05-09 14:51:43

by Surbhi Palande

[permalink] [raw]
Subject: [PATCH v3] Adding support to freeze and unfreeze a journal

The journal should be frozen when a F.S freezes. What this means is that till
the F.S is thawed again, no new transactions should be accepted by the
journal. When the F.S thaws, in turn it should thaw the journal and this should
allow the journal to resume accepting new transactions.
While the F.S has frozen the journal, the clients of journal on calling
jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
up the sleeping clients and journalling can progress normally.

Signed-off-by: Surbhi Palande <[email protected]>
---
Changes since the last patch:
* Changed to the shorter forms of expressions eg: x |= y
* removed the unnecessary barrier

fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 10 ++++++++++
4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b040293 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* dont let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the jflags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ if (journal->j_flags & JBD2_FROZEN) {
+ read_unlock(&journal->j_state_lock);
+ jbd2_check_frozen(journal);
+ goto repeat;
+ }

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..c7885b2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
* @j_wait_checkpoint: Wait queue to trigger checkpointing
* @j_wait_commit: Wait queue to trigger commit
* @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
* @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
* @j_head: Journal head - identifies the first unused block in the journal
* @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -1013,7 +1017,11 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
+

+#define jbd2_check_frozen(journal) \
+ wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
/*
* Function declarations for the journaling transaction and buffer
* management
@@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.1
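
The blocking behaviour described in the commit message can be sketched in user space with pthreads. This is only an analogue under stated assumptions, not the jbd2 implementation: struct demo_journal and the demo_* functions are hypothetical, the mutex stands in for j_state_lock and the condition variable for j_wait_frozen. In the kernel, wait_event() returns as soon as its condition evaluates true; the loop below keeps waiting while the frozen flag is set, which is the behaviour the commit message asks for.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the journal; not the jbd2 structure. */
struct demo_journal {
	pthread_mutex_t lock;    /* plays the role of j_state_lock */
	pthread_cond_t thawed;   /* plays the role of j_wait_frozen */
	bool frozen;             /* plays the role of the JBD2_FROZEN bit */
	int handles;             /* number of handles started so far */
};

void demo_init(struct demo_journal *j)
{
	pthread_mutex_init(&j->lock, NULL);
	pthread_cond_init(&j->thawed, NULL);
	j->frozen = false;
	j->handles = 0;
}

/* Mark the journal frozen; later handle starts will block. */
void demo_freeze(struct demo_journal *j)
{
	pthread_mutex_lock(&j->lock);
	j->frozen = true;
	pthread_mutex_unlock(&j->lock);
}

/* Clear the flag under the lock, then wake every sleeping starter. */
void demo_thaw(struct demo_journal *j)
{
	pthread_mutex_lock(&j->lock);
	j->frozen = false;
	pthread_mutex_unlock(&j->lock);
	pthread_cond_broadcast(&j->thawed);
}

/* Start a handle, sleeping for as long as the journal is frozen. */
int demo_start_handle(struct demo_journal *j)
{
	int n;

	pthread_mutex_lock(&j->lock);
	while (j->frozen)                 /* re-check after every wakeup */
		pthread_cond_wait(&j->thawed, &j->lock);
	assert(!j->frozen);
	n = ++j->handles;
	pthread_mutex_unlock(&j->lock);
	return n;
}
```

Because the flag is written under the lock and the unlock orders that write before the wakeup, the sketch needs no separate barrier, which matches the review comment that led to dropping smp_wmb().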


2011-05-09 15:08:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3] Adding support to freeze and unfreeze a journal

On Mon 09-05-11 17:51:32, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, in turn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
The patch looks fine. I'd just add here a diagram of the race which can
happen if we don't really freeze the journal and rely on the VFS... You can add:
Acked-by: Jan Kara <[email protected]>

Honza
>
> Signed-off-by: Surbhi Palande <[email protected]>
> ---
> Changes since the last patch:
> * Changed to the shorter forms of expressions eg: x |= y
> * removed the unnecessary barrier
>
> fs/ext4/super.c | 20 ++++++--------------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 10 ++++++++++
> 4 files changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error < 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error >= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* dont let a new handle start when a journal is frozen.
> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the jflags indicate that the journal is frozen. So if the
> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + if (journal->j_flags & JBD2_FROZEN) {
> + read_unlock(&journal->j_state_lock);
> + jbd2_check_frozen(journal);
> + goto repeat;
> + }
>
> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error >= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags |= JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> * @j_wait_checkpoint: Wait queue to trigger checkpointing
> * @j_wait_commit: Wait queue to trigger commit
> * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> * @j_head: Journal head - identifies the first unused block in the journal
> * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
> /* Wait queue to wait for updates to complete */
> wait_queue_head_t j_wait_updates;
>
> + /* Wait queue to wait for journal to thaw*/
> + wait_queue_head_t j_wait_frozen;
> +
> /* Semaphore for locking against concurrent checkpoints */
> struct mutex j_checkpoint_mutex;
>
> @@ -1013,7 +1017,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> * data write error in ordered
> * mode */
> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
> +
>
> +#define jbd2_check_frozen(journal) \
> + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
> /*
> * Function declarations for the journaling transaction and buffer
> * management
> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned long);
> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int jbd2_journal_stop(handle_t *);
> +extern int jbd2_journal_freeze(journal_t *);
> +extern void jbd2_journal_thaw(journal_t *);
> extern int jbd2_journal_flush (journal_t *);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);
> --
> 1.7.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-09 15:23:32

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3] Adding support to freeze and unfreeze a journal

On 5/9/11 9:51 AM, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes. What this means is that till
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, in turn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.

Can I ask how this was tested?

Ideally anything you found useful for testing should probably be integrated
into the xfstests test suite so that we don't regress in the future.

thanks,
-Eric

> Signed-off-by: Surbhi Palande <[email protected]>
> ---
> Changes since the last patch:
> * Changed to the shorter forms of expressions eg: x |= y
> * removed the unnecessary barrier
>
> fs/ext4/super.c | 20 ++++++--------------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 10 ++++++++++
> 4 files changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error < 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error >= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* dont let a new handle start when a journal is frozen.
> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the jflags indicate that the journal is frozen. So if the
> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + if (journal->j_flags & JBD2_FROZEN) {
> + read_unlock(&journal->j_state_lock);
> + jbd2_check_frozen(journal);
> + goto repeat;
> + }
>
> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error >= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags |= JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> * @j_wait_checkpoint: Wait queue to trigger checkpointing
> * @j_wait_commit: Wait queue to trigger commit
> * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> * @j_head: Journal head - identifies the first unused block in the journal
> * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
> /* Wait queue to wait for updates to complete */
> wait_queue_head_t j_wait_updates;
>
> + /* Wait queue to wait for journal to thaw*/
> + wait_queue_head_t j_wait_frozen;
> +
> /* Semaphore for locking against concurrent checkpoints */
> struct mutex j_checkpoint_mutex;
>
> @@ -1013,7 +1017,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> * data write error in ordered
> * mode */
> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
> +
>
> +#define jbd2_check_frozen(journal) \
> + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
> /*
> * Function declarations for the journaling transaction and buffer
> * management
> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned long);
> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int jbd2_journal_stop(handle_t *);
> +extern int jbd2_journal_freeze(journal_t *);
> +extern void jbd2_journal_thaw(journal_t *);
> extern int jbd2_journal_flush (journal_t *);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);


2011-05-10 15:08:02

by Surbhi Palande

[permalink] [raw]
Subject: [PATCH] Adding support to freeze and unfreeze a journal

The journal should be frozen when a F.S freezes. What this means is that till
the F.S is thawed again, no new transactions should be accepted by the
journal. When the F.S thaws, in turn it should thaw the journal and this should
allow the journal to resume accepting new transactions.
While the F.S has frozen the journal, the clients of journal on calling
jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
up the sleeping clients and journalling can progress normally.

An example of the race condition that can happen without this patch is as
follows:

Say the F.S is thawed when we begin. Let tx be the time at unit x

P1: Process doing an aio write
t1) ext4_file_write()
t2) generic_file_aio_write()
t3) __generic_file_aio_write()
// F.S is not frozen, so we do not block in the next check.
t4) vfs_check_frozen()
t5) generic_write_checks()
----------------- Preempted ------------------

P2: Process that does fs freeze

t6) freeze_super()
t7) sync_filesystem()
t8) sync_blockdev()
t9) sb->s_op->freeze_fs() (= ext4_freeze)
t10) jbd2_journal_lock_updates()
t11) jbd2_journal_flush()
// Need to unlock the journal before returning to user space.
t12) jbd2_journal_unlock_updates()
// Journal is unlocked and so we can start accepting new transactions now.

// freezing process completes execution. Page cache is now clean and should
// remain clean as long as the F.S is frozen.
--------------------------------------------

P1: writing process gets the control back
t13) generic_file_buffered_write()
t14) generic_perform_write()
t15) a_ops->write_begin() (= ext4_write_begin)
t16) ext4_journal_start()
// New handle is started. We do not block here! Write continues
// dirtying the page cache while the F.S is frozen!

Signed-off-by: Surbhi Palande <[email protected]>
Acked-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 10 ++++++++++
4 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b040293 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* dont let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the jflags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ if (journal->j_flags & JBD2_FROZEN) {
+ read_unlock(&journal->j_state_lock);
+ jbd2_check_frozen(journal);
+ goto repeat;
+ }

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..c7885b2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
* @j_wait_checkpoint: Wait queue to trigger checkpointing
* @j_wait_commit: Wait queue to trigger commit
* @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
* @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
* @j_head: Journal head - identifies the first unused block in the journal
* @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -1013,7 +1017,11 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
+

+#define jbd2_check_frozen(journal) \
+ wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))
/*
* Function declarations for the journaling transaction and buffer
* management
@@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.1


2011-05-10 21:07:49

by Andreas Dilger

Subject: Re: [PATCH] Adding support to freeze and unfreeze a journal

Mostly minor cleanups to the commit message and comments.

On May 10, 2011, at 09:07, Surbhi Palande wrote:
> The journal should be frozen when a F.S freezes.

s/F.S/filesystem/g

> What this means is that till

s/till/until/
> the F.S is thawed again, no new transactions should be accepted by the
> journal. When the F.S thaws, inturn it should thaw the journal and this should
> allow the journal to resume accepting new transactions.
> While the F.S has frozen the journal, the clients of journal on calling
> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
> up the sleeping clients and journalling can progress normally.
>
> An example of the race condition that can happen without this patch is as
> follows:
>
> Say the F.S is thawed when we begin. Let tx be the time at unit x
>
> P1: Process doing an aio write
> t1) ext4_file_write()
> t2) generic_file_aio_write()
> t3) __generic_file_aio_write()
> // F.S is not frozen, so we do not block in the next check.
> t4) vfs_check_frozen()
> t5) generic_write_checks()
> ----------------- Prempted------------------
>
> P2: Process that does fs freeze
>
> t6) freeze_super()
> t7) sync_filesystem()
> t8) sync_blockdev()
> t9) sb->s_op->freeze_fs() (= ext4_freeze)
> t10) jbd2_journal_lock_updates()
> t11) jbd2_journal_flush()
> // Need to unlock the journal before returning to user space.
> t12) jbd2_journal_unlock_updates()
> // Journal is unlocked and so we can start accepting new transactions now.
>
> // freezing process completes execution. Page cache is now clean and should
> // remain clean till the F.S is frozen.
> --------------------------------------------
>
> P1: writing process gets the control back
> t13) generic_file_buffered_write()
> t14) generic_perform_write()
> t15) a_ops->write_begin() (= ext4_write_begin)
> t16) ext4_journal_start()
> // New handle is started. We do not block here! Write continues
> // dirtying the page cache while the F.S is frozen!
>
> Signed-off-by: Surbhi Palande <[email protected]>
> Acked-by: Jan Kara <[email protected]>
> ---
> fs/ext4/super.c | 20 ++++++--------------
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 10 ++++++++++
> 4 files changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8553dfb..796aa4c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error < 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error >= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0ec3db..5e46333 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 05fa77a..b040293 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -171,6 +171,17 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* dont let a new handle start when a journal is frozen.

s/dont/Don't/ or s/dont/Do not/

> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the jflags indicate that the journal is frozen. So if the

s/jflags/j_flags/

> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + if (journal->j_flags & JBD2_FROZEN) {
> + read_unlock(&journal->j_state_lock);
> + jbd2_check_frozen(journal);
> + goto repeat;
> + }
>
> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error >= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags |= JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index a32dcae..c7885b2 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> * @j_wait_checkpoint: Wait queue to trigger checkpointing
> * @j_wait_commit: Wait queue to trigger commit
> * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> * @j_head: Journal head - identifies the first unused block in the journal
> * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -835,6 +836,9 @@ struct journal_s
> /* Wait queue to wait for updates to complete */
> wait_queue_head_t j_wait_updates;
>
> + /* Wait queue to wait for journal to thaw*/
> + wait_queue_head_t j_wait_frozen;
> +
> /* Semaphore for locking against concurrent checkpoints */
> struct mutex j_checkpoint_mutex;
>
> @@ -1013,7 +1017,11 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> * data write error in ordered
> * mode */
> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */

Need to wrap this to 80 columns:

+#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */

> +#define jbd2_check_frozen(journal) \
> + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN))

Having this macro, which is now only used in one place, isn't really clarifying the code because the name "check_frozen" doesn't really imply "wait until the journal is unfrozen". It would be better to just put the wait_event() inline at the one callsite and remove this macro entirely.
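
For illustration only, the "wait until the journal is unfrozen" behaviour can be modelled in userspace with a mutex and a condition variable. This is a minimal sketch, not the kernel code: struct model_journal and the model_* functions are hypothetical names, and pthread primitives stand in for j_state_lock and j_wait_frozen. Note that wait_event() sleeps until its condition evaluates true, so a wait for the thaw would test that the JBD2_FROZEN bit is clear; the loop in model_journal_start() mirrors that by sleeping while the flag is set.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the journal; every name here is hypothetical. */
struct model_journal {
	pthread_mutex_t lock;	/* plays the role of j_state_lock */
	pthread_cond_t thawed;	/* plays the role of j_wait_frozen */
	bool frozen;		/* plays the role of the JBD2_FROZEN bit */
	int handles;		/* handles started so far */
};

static void model_journal_init(struct model_journal *j)
{
	pthread_mutex_init(&j->lock, NULL);
	pthread_cond_init(&j->thawed, NULL);
	j->frozen = false;
	j->handles = 0;
}

/* Mark the journal frozen; later starters will sleep. */
static void model_journal_freeze(struct model_journal *j)
{
	pthread_mutex_lock(&j->lock);
	j->frozen = true;
	pthread_mutex_unlock(&j->lock);
}

/* Clear the flag, then wake every sleeper (compare wake_up()). */
static void model_journal_thaw(struct model_journal *j)
{
	pthread_mutex_lock(&j->lock);
	j->frozen = false;
	pthread_mutex_unlock(&j->lock);
	pthread_cond_broadcast(&j->thawed);
}

/* Start a handle, sleeping for as long as the journal is frozen. */
static void model_journal_start(struct model_journal *j)
{
	pthread_mutex_lock(&j->lock);
	while (j->frozen)	/* wait until the flag is clear */
		pthread_cond_wait(&j->thawed, &j->lock);
	assert(!j->frozen);	/* a handle never starts on a frozen journal */
	j->handles++;
	pthread_mutex_unlock(&j->lock);
}

/* Thread entry point for a writer that wants one handle. */
static void *model_writer(void *arg)
{
	model_journal_start(arg);
	return NULL;
}
```

A caller would freeze the model, start a writer thread with pthread_create(), and observe that handles stays at 0 until model_journal_thaw() runs. Because the flag is checked under the lock, a writer cannot slip in between the check and the sleep.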

> /*
> * Function declarations for the journaling transaction and buffer
> * management
> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned long);
> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int jbd2_journal_stop(handle_t *);
> +extern int jbd2_journal_freeze(journal_t *);
> +extern void jbd2_journal_thaw(journal_t *);
> extern int jbd2_journal_flush (journal_t *);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);

Once these minor changes have been made you can add:

Reviewed-by: Andreas Dilger <[email protected]>


Cheers, Andreas






2011-05-11 19:17:35

by Surbhi Palande

Subject: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

While the fsstress background writes are busy dirtying the page cache, if an
fsfreeze happens then the background writes should stall. A sync should then
not have any data to sync to the filesystem. If it does have data to sync,
then sync will cause a deadlock by holding the s_umount write semaphore and
waiting in the wait queue for the filesystem to thaw, whereas the filesystem
can never thaw without getting the s_umount write semaphore.

Signed-off-by: Surbhi Palande <[email protected]>
---
068 | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/068 b/068
index 82c1a4e..b9ac58d 100755
--- a/068
+++ b/068
@@ -101,6 +101,11 @@ do
tee -a $seq.full
sleep 2

+ # there should be nothing to sync at this point. This may hang in case
+ # of fsstress background writes dirtying the page cache while the F.S is frozen
+ sync &
+ sleep 2
+
echo "*** thawing \$SCRATCH_MNT" | tee -a $seq.full
xfs_freeze -u "$SCRATCH_MNT" | tee -a $seq.full
[ $? != 0 ] && echo xfs_freeze -u "$SCRATCH_MNT" failed | \
--
1.7.1


2011-05-11 19:17:36

by Surbhi Palande

Subject: [PATCH] Adding support to freeze and unfreeze a journal

The journal should be frozen when a filesystem freezes. What this means is
that until the filesystem is thawed again, no new transactions should be
accepted by the journal. When the filesystem thaws, it should in turn thaw the
journal and this should allow the journal to resume accepting new
transactions. While the filesystem has frozen the journal, the clients of
the journal on calling jbd2_journal_start() will sleep on a wait queue.
Thawing the journal will wake up the sleeping clients and journalling can
progress normally.

An example of the race condition that can happen without this patch is as
follows:

Say the filesystem is thawed when we begin. Let tx be the time at unit x

P1: Process doing an aio write
t1) ext4_file_write()
t2) generic_file_aio_write()
t3) __generic_file_aio_write()
// filesystem is not frozen, so we do not block in the next check.
t4) vfs_check_frozen()
t5) generic_write_checks()
----------------- Preempted ----------------

P2: Process that does filesystem freeze

t6) freeze_super()
t7) sync_filesystem()
t8) sync_blockdev()
t9) sb->s_op->freeze_fs() (= ext4_freeze)
t10) jbd2_journal_lock_updates()
t11) jbd2_journal_flush()
// Need to unlock the journal before returning to user space.
t12) jbd2_journal_unlock_updates()
// Journal is unlocked and so we can start accepting new transactions now.

// freezing process completes execution. Page cache is now clean and should
// remain clean while the filesystem is frozen.
--------------------------------------------

P1: writing process gets the control back
t13) generic_file_buffered_write()
t14) generic_perform_write()
t15) a_ops->write_begin() (= ext4_write_begin)
t16) ext4_journal_start()
// New handle is started. We do not block here! Write continues
// dirtying the page cache while the filesystem is frozen!

Signed-off-by: Surbhi Palande <[email protected]>
Acked-by: Jan Kara <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 7 +++++++
4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8553dfb..796aa4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0ec3db..5e46333 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 05fa77a..b111642 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -171,6 +171,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* Don't let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the j_flags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ if (journal->j_flags & JBD2_FROZEN) {
+ read_unlock(&journal->j_state_lock);
+ wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
+ goto repeat;
+ }

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a32dcae..22b76de 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
* @j_wait_checkpoint: Wait queue to trigger checkpointing
* @j_wait_commit: Wait queue to trigger commit
* @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
* @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
* @j_head: Journal head - identifies the first unused block in the journal
* @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -835,6 +836,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -1013,6 +1017,7 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */

/*
* Function declarations for the journaling transaction and buffer
@@ -1121,6 +1126,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.1


2011-05-11 07:06:43

by Surbhi Palande

Subject: Re: [PATCH v3] Adding support to freeze and unfreeze a journal

Hi Eric,

On 05/09/2011 06:23 PM, Eric Sandeen wrote:
> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>> The journal should be frozen when a F.S freezes. What this means is that till
>> the F.S is thawed again, no new transactions should be accepted by the
>> journal. When the F.S thaws, inturn it should thaw the journal and this should
>> allow the journal to resume accepting new transactions.
>> While the F.S has frozen the journal, the clients of journal on calling
>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>> up the sleeping clients and journalling can progress normally.
>
> Can I ask how this was tested?

Yes! I did the following on an ext4 fs mount:
1. fsfreeze -f $MNT
2. dd if=/dev/zero of=$MNT/file count=10 bs=1024 &
3. sync
4. fsfreeze -u $MNT

If the dd blocks on the start_handle, then the page cache is clean and
sync should have nothing to write and everything will work fine. But
otherwise this sequence should create a deadlock.
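
The freeze and thaw steps of that sequence can also be issued from C rather than through the fsfreeze command. The following is only a sketch under stated assumptions: it uses the FIFREEZE and FITHAW ioctls from <linux/fs.h>, the helper names fs_freeze_ioctl(), freeze_mount() and thaw_mount() are made up for illustration, and the caller is assumed to have the privileges that freezing a filesystem requires.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the mount point and issue one freeze or
 * thaw request. Returns 0 on success, -1 on failure with errno set.
 */
static int fs_freeze_ioctl(const char *mnt, unsigned long request)
{
	int fd, ret, saved_errno;

	assert(mnt != NULL);

	fd = open(mnt, O_RDONLY);
	if (fd < 0)
		return -1;	/* errno set by open() */

	ret = ioctl(fd, request, 0);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}

/* Roughly what "fsfreeze -f $MNT" asks the kernel to do. */
static int freeze_mount(const char *mnt)
{
	return fs_freeze_ioctl(mnt, FIFREEZE);
}

/* Roughly what "fsfreeze -u $MNT" asks the kernel to do. */
static int thaw_mount(const char *mnt)
{
	return fs_freeze_ioctl(mnt, FITHAW);
}
```

A test program built on these helpers would call freeze_mount(), start the writer and sync in child processes, and then call thaw_mount(). On a kernel with the race, the thaw step is where the hang would be expected to show up, so such a test should only be run on a scratch filesystem.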

I have attempted to create a patch for xfstests. I shall send it out as a
reply to this email soon!

Warm Regards,
Surbhi.



>
> Ideally anything you found useful for testing should probably be integrated
> into the xfstests test suite so that we don't regresss in the future.
>
> thanks,
> -Eric
>
>> Signed-off-by: Surbhi Palande<[email protected]>
>> ---
>> Changes since the last patch:
>> * Changed to the shorter forms of expressions eg: x |= y
>> * removed the unnecessary barrier
>>
>> fs/ext4/super.c | 20 ++++++--------------
>> fs/jbd2/journal.c | 1 +
>> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/jbd2.h | 10 ++++++++++
>> 4 files changed, 59 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 8553dfb..796aa4c 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>
>> journal = EXT4_SB(sb)->s_journal;
>>
>> - /* Now we set up the journal barrier. */
>> - jbd2_journal_lock_updates(journal);
>> -
>> + error = jbd2_journal_freeze(journal);
>> /*
>> - * Don't clear the needs_recovery flag if we failed to flush
>> + * Don't clear the needs_recovery flag if we failed to freeze
>> * the journal.
>> */
>> - error = jbd2_journal_flush(journal);
>> - if (error< 0)
>> - goto out;
>> -
>> - /* Journal blocked and flushed, clear needs_recovery flag. */
>> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> - error = ext4_commit_super(sb, 1);
>> -out:
>> - /* we rely on s_frozen to stop further updates */
>> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> + if (error>= 0) {
>> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>> + error = ext4_commit_super(sb, 1);
>> + }
>> return error;
>> }
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index e0ec3db..5e46333 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>> init_waitqueue_head(&journal->j_wait_checkpoint);
>> init_waitqueue_head(&journal->j_wait_commit);
>> init_waitqueue_head(&journal->j_wait_updates);
>> + init_waitqueue_head(&journal->j_wait_frozen);
>> mutex_init(&journal->j_barrier);
>> mutex_init(&journal->j_checkpoint_mutex);
>> spin_lock_init(&journal->j_revoke_lock);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 05fa77a..b040293 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -171,6 +171,17 @@ repeat:
>> journal->j_barrier_count == 0);
>> goto repeat;
>> }
>> + /* dont let a new handle start when a journal is frozen.
>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>> + * the jflags indicate that the journal is frozen. So if the
>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>> + * process
>> + */
>> + if (journal->j_flags& JBD2_FROZEN) {
>> + read_unlock(&journal->j_state_lock);
>> + jbd2_check_frozen(journal);
>> + goto repeat;
>> + }
>>
>> if (!journal->j_running_transaction) {
>> read_unlock(&journal->j_state_lock);
>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>> }
>> EXPORT_SYMBOL(jbd2_journal_restart);
>>
>> +int jbd2_journal_freeze(journal_t *journal)
>> +{
>> + int error = 0;
>> + /* Now we set up the journal barrier. */
>> + jbd2_journal_lock_updates(journal);
>> +
>> + /*
>> + * Don't clear the needs_recovery flag if we failed to flush
>> + * the journal.
>> + */
>> + error = jbd2_journal_flush(journal);
>> + if (error>= 0) {
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags |= JBD2_FROZEN;
>> + write_unlock(&journal->j_state_lock);
>> + }
>> + jbd2_journal_unlock_updates(journal);
>> + return error;
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>> +
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags&= ~JBD2_FROZEN;
>> + write_unlock(&journal->j_state_lock);
>> + wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>> +
>> +
>> /**
>> * void jbd2_journal_lock_updates () - establish a transaction barrier.
>> * @journal: Journal to establish a barrier on.
>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>> index a32dcae..c7885b2 100644
>> --- a/include/linux/jbd2.h
>> +++ b/include/linux/jbd2.h
>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>> * @j_wait_checkpoint: Wait queue to trigger checkpointing
>> * @j_wait_commit: Wait queue to trigger commit
>> * @j_wait_updates: Wait queue to wait for updates to complete
>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>> * @j_head: Journal head - identifies the first unused block in the journal
>> * @j_tail: Journal tail - identifies the oldest still-used block in the
>> @@ -835,6 +836,9 @@ struct journal_s
>> /* Wait queue to wait for updates to complete */
>> wait_queue_head_t j_wait_updates;
>>
>> + /* Wait queue to wait for journal to thaw*/
>> + wait_queue_head_t j_wait_frozen;
>> +
>> /* Semaphore for locking against concurrent checkpoints */
>> struct mutex j_checkpoint_mutex;
>>
>> @@ -1013,7 +1017,11 @@ struct journal_s
>> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
>> * data write error in ordered
>> * mode */
>> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
>> +
>>
>> +#define jbd2_check_frozen(journal) \
>> + wait_event(journal->j_wait_frozen, (journal->j_flags& JBD2_FROZEN))
>> /*
>> * Function declarations for the journaling transaction and buffer
>> * management
>> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
>> struct page *, unsigned long);
>> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>> extern int jbd2_journal_stop(handle_t *);
>> +extern int jbd2_journal_freeze(journal_t *);
>> +extern void jbd2_journal_thaw(journal_t *);
>> extern int jbd2_journal_flush (journal_t *);
>> extern void jbd2_journal_lock_updates (journal_t *);
>> extern void jbd2_journal_unlock_updates (journal_t *);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-05-11 23:50:55

by Andreas Dilger

Subject: Re: [PATCH v3] Adding support to freeze and unfreeze a journal

On 2011-05-11, at 1:06 AM, Surbhi Palande <[email protected]> wrote:

> On 05/09/2011 06:23 PM, Eric Sandeen wrote:
>> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>>> The journal should be frozen when a F.S freezes. What this means is that till
>>> the F.S is thawed again, no new transactions should be accepted by the
>>> journal. When the F.S thaws, inturn it should thaw the journal and this should
>>> allow the journal to resume accepting new transactions.
>>> While the F.S has frozen the journal, the clients of journal on calling
>>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>>> up the sleeping clients and journalling can progress normally.
>>
>> Can I ask how this was tested?
>
> Yes! I did the following on an ext4 fs mount:
> 1. fsfreeze -f $MNT
> 2. dd if=/dev/zero of=$MNT/file count=10 bs=1024 &
> 3. sync
> 4. fsfreeze -u $MNT
>
> If the dd blocks on the start_handle, then the page cache is clean and sync should have nothing to write and everything will work fine. But otherwise this should sequence should create a deadlock.

Sorry to ask the obvious question, but presumably this test fails without your patch? It isn't clear from your comment that this is the case.

> I have attempted to create a patch for xfs-test. Shall send it out as a reply to this email soon!
>
> Warm Regards,
> Surbhi.
>
>
>
>>
>> Ideally anything you found useful for testing should probably be integrated
>> into the xfstests test suite so that we don't regresss in the future.
>>
>> thanks,
>> -Eric
>>
>>> Signed-off-by: Surbhi Palande<[email protected]>
>>> ---
>>> Changes since the last patch:
>>> * Changed to the shorter forms of expressions eg: x |= y
>>> * removed the unnecessary barrier
>>>
>>> fs/ext4/super.c | 20 ++++++--------------
>>> fs/jbd2/journal.c | 1 +
>>> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/jbd2.h | 10 ++++++++++
>>> 4 files changed, 59 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index 8553dfb..796aa4c 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>>
>>> journal = EXT4_SB(sb)->s_journal;
>>>
>>> - /* Now we set up the journal barrier. */
>>> - jbd2_journal_lock_updates(journal);
>>> -
>>> + error = jbd2_journal_freeze(journal);
>>> /*
>>> - * Don't clear the needs_recovery flag if we failed to flush
>>> + * Don't clear the needs_recovery flag if we failed to freeze
>>> * the journal.
>>> */
>>> - error = jbd2_journal_flush(journal);
>>> - if (error< 0)
>>> - goto out;
>>> -
>>> - /* Journal blocked and flushed, clear needs_recovery flag. */
>>> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>> - error = ext4_commit_super(sb, 1);
>>> -out:
>>> - /* we rely on s_frozen to stop further updates */
>>> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>> + if (error>= 0) {
>>> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>> + error = ext4_commit_super(sb, 1);
>>> + }
>>> return error;
>>> }
>>>
>>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>>> index e0ec3db..5e46333 100644
>>> --- a/fs/jbd2/journal.c
>>> +++ b/fs/jbd2/journal.c
>>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>>> init_waitqueue_head(&journal->j_wait_checkpoint);
>>> init_waitqueue_head(&journal->j_wait_commit);
>>> init_waitqueue_head(&journal->j_wait_updates);
>>> + init_waitqueue_head(&journal->j_wait_frozen);
>>> mutex_init(&journal->j_barrier);
>>> mutex_init(&journal->j_checkpoint_mutex);
>>> spin_lock_init(&journal->j_revoke_lock);
>>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>>> index 05fa77a..b040293 100644
>>> --- a/fs/jbd2/transaction.c
>>> +++ b/fs/jbd2/transaction.c
>>> @@ -171,6 +171,17 @@ repeat:
>>> journal->j_barrier_count == 0);
>>> goto repeat;
>>> }
>>> + /* dont let a new handle start when a journal is frozen.
>>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>>> + * the jflags indicate that the journal is frozen. So if the
>>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>>> + * process
>>> + */
>>> + if (journal->j_flags& JBD2_FROZEN) {
>>> + read_unlock(&journal->j_state_lock);
>>> + jbd2_check_frozen(journal);
>>> + goto repeat;
>>> + }
>>>
>>> if (!journal->j_running_transaction) {
>>> read_unlock(&journal->j_state_lock);
>>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>>> }
>>> EXPORT_SYMBOL(jbd2_journal_restart);
>>>
>>> +int jbd2_journal_freeze(journal_t *journal)
>>> +{
>>> + int error = 0;
>>> + /* Now we set up the journal barrier. */
>>> + jbd2_journal_lock_updates(journal);
>>> +
>>> + /*
>>> + * Don't clear the needs_recovery flag if we failed to flush
>>> + * the journal.
>>> + */
>>> + error = jbd2_journal_flush(journal);
>>> + if (error>= 0) {
>>> + write_lock(&journal->j_state_lock);
>>> + journal->j_flags |= JBD2_FROZEN;
>>> + write_unlock(&journal->j_state_lock);
>>> + }
>>> + jbd2_journal_unlock_updates(journal);
>>> + return error;
>>> +}
>>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>>> +
>>> +void jbd2_journal_thaw(journal_t * journal)
>>> +{
>>> + write_lock(&journal->j_state_lock);
>>> + journal->j_flags&= ~JBD2_FROZEN;
>>> + write_unlock(&journal->j_state_lock);
>>> + wake_up(&journal->j_wait_frozen);
>>> +}
>>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>>> +
>>> +
>>> /**
>>> * void jbd2_journal_lock_updates () - establish a transaction barrier.
>>> * @journal: Journal to establish a barrier on.
>>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>>> index a32dcae..c7885b2 100644
>>> --- a/include/linux/jbd2.h
>>> +++ b/include/linux/jbd2.h
>>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>>> * @j_wait_checkpoint: Wait queue to trigger checkpointing
>>> * @j_wait_commit: Wait queue to trigger commit
>>> * @j_wait_updates: Wait queue to wait for updates to complete
>>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>>> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>>> * @j_head: Journal head - identifies the first unused block in the journal
>>> * @j_tail: Journal tail - identifies the oldest still-used block in the
>>> @@ -835,6 +836,9 @@ struct journal_s
>>> /* Wait queue to wait for updates to complete */
>>> wait_queue_head_t j_wait_updates;
>>>
>>> + /* Wait queue to wait for journal to thaw*/
>>> + wait_queue_head_t j_wait_frozen;
>>> +
>>> /* Semaphore for locking against concurrent checkpoints */
>>> struct mutex j_checkpoint_mutex;
>>>
>>> @@ -1013,7 +1017,11 @@ struct journal_s
>>> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
>>> * data write error in ordered
>>> * mode */
>>> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
>>> +
>>>
>>> +#define jbd2_check_frozen(journal) \
>>> + wait_event(journal->j_wait_frozen, (journal->j_flags& JBD2_FROZEN))
>>> /*
>>> * Function declarations for the journaling transaction and buffer
>>> * management
>>> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
>>> struct page *, unsigned long);
>>> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>>> extern int jbd2_journal_stop(handle_t *);
>>> +extern int jbd2_journal_freeze(journal_t *);
>>> +extern void jbd2_journal_thaw(journal_t *);
>>> extern int jbd2_journal_flush (journal_t *);
>>> extern void jbd2_journal_lock_updates (journal_t *);
>>> extern void jbd2_journal_unlock_updates (journal_t *);
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2011-05-12 09:40:40

by Surbhi Palande

[permalink] [raw]
Subject: Re: [PATCH v3] Adding support to freeze and unfreeze a journal

On 05/11/2011 12:05 PM, Andreas Dilger wrote:
> On 2011-05-11, at 1:06 AM, Surbhi Palande<[email protected]> wrote:
>
>> On 05/09/2011 06:23 PM, Eric Sandeen wrote:
>>> On 5/9/11 9:51 AM, Surbhi Palande wrote:
>>>> The journal should be frozen when a F.S freezes. What this means is that till
>>>> the F.S is thawed again, no new transactions should be accepted by the
>>>> journal. When the F.S thaws, in turn it should thaw the journal and this should
>>>> allow the journal to resume accepting new transactions.
>>>> While the F.S has frozen the journal, the clients of journal on calling
>>>> jbd2_journal_start() will sleep on a wait queue. Thawing the journal will wake
>>>> up the sleeping clients and journalling can progress normally.
>>>
>>> Can I ask how this was tested?
>>
>> Yes! I did the following on an ext4 fs mount:
>> 1. fsfreeze -f $MNT
>> 2. dd if=/dev/zero of=$MNT/file count=10 bs=1024&
>> 3. sync
>> 4. fsfreeze -u $MNT
>>
>> If the dd blocks on the start_handle, then the page cache is clean and sync should have nothing to write and everything will work fine. But otherwise this sequence should create a deadlock.
>
> Sorry to ask the obvious question, but presumably this test fails without your patch? It isn't clear from your comment that this is the case.

Actually, since this is a race, it's very difficult to see this
deterministically. The deadlock is apparently seen regularly when running
iozone on multipath, when a path comes back into service.

I imagined that this could be simulated by running a lot of I/O in the
background and running fsfreeze and unfreeze in parallel (multiple times).
Unfortunately it's not easy to hit the bug; it's never deterministic. I
really just smoke tested my patch using this method:

{
dd i/o in a loop (10000 times) & (background process)
touch some file & in the same loop (background process)

}

{
freeze &, sync &, unfreeze &, sleep in a loop & (100 times)
(background processes)
}

and saw that there was not any deadlock in this case. But I have not
tested/seen the deadlock with this script without the patch.

Sorry for not being clear before :(

Warm Regards,
Surbhi.

>
>> I have attempted to create a patch for xfs-test. Shall send it out as a reply to this email soon!
>>
>> Warm Regards,
>> Surbhi.
>>
>>
>>
>>>
>>> Ideally anything you found useful for testing should probably be integrated
>>> into the xfstests test suite so that we don't regress in the future.
>>>
>>> thanks,
>>> -Eric
>>>
>>>> Signed-off-by: Surbhi Palande<[email protected]>
>>>> ---
>>>> Changes since the last patch:
>>>> * Changed to the shorter forms of expressions eg: x |= y
>>>> * removed the unnecessary barrier
>>>>
>>>> fs/ext4/super.c | 20 ++++++--------------
>>>> fs/jbd2/journal.c | 1 +
>>>> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>> include/linux/jbd2.h | 10 ++++++++++
>>>> 4 files changed, 59 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 8553dfb..796aa4c 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -4179,23 +4179,15 @@ static int ext4_freeze(struct super_block *sb)
>>>>
>>>> journal = EXT4_SB(sb)->s_journal;
>>>>
>>>> - /* Now we set up the journal barrier. */
>>>> - jbd2_journal_lock_updates(journal);
>>>> -
>>>> + error = jbd2_journal_freeze(journal);
>>>> /*
>>>> - * Don't clear the needs_recovery flag if we failed to flush
>>>> + * Don't clear the needs_recovery flag if we failed to freeze
>>>> * the journal.
>>>> */
>>>> - error = jbd2_journal_flush(journal);
>>>> - if (error< 0)
>>>> - goto out;
>>>> -
>>>> - /* Journal blocked and flushed, clear needs_recovery flag. */
>>>> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>>> - error = ext4_commit_super(sb, 1);
>>>> -out:
>>>> - /* we rely on s_frozen to stop further updates */
>>>> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>>> + if (error>= 0) {
>>>> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>>>> + error = ext4_commit_super(sb, 1);
>>>> + }
>>>> return error;
>>>> }
>>>>
>>>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>>>> index e0ec3db..5e46333 100644
>>>> --- a/fs/jbd2/journal.c
>>>> +++ b/fs/jbd2/journal.c
>>>> @@ -842,6 +842,7 @@ static journal_t * journal_init_common (void)
>>>> init_waitqueue_head(&journal->j_wait_checkpoint);
>>>> init_waitqueue_head(&journal->j_wait_commit);
>>>> init_waitqueue_head(&journal->j_wait_updates);
>>>> + init_waitqueue_head(&journal->j_wait_frozen);
>>>> mutex_init(&journal->j_barrier);
>>>> mutex_init(&journal->j_checkpoint_mutex);
>>>> spin_lock_init(&journal->j_revoke_lock);
>>>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>>>> index 05fa77a..b040293 100644
>>>> --- a/fs/jbd2/transaction.c
>>>> +++ b/fs/jbd2/transaction.c
>>>> @@ -171,6 +171,17 @@ repeat:
>>>> journal->j_barrier_count == 0);
>>>> goto repeat;
>>>> }
>>>> + /* dont let a new handle start when a journal is frozen.
>>>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>>>> + * the jflags indicate that the journal is frozen. So if the
>>>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>>>> + * process
>>>> + */
>>>> + if (journal->j_flags& JBD2_FROZEN) {
>>>> + read_unlock(&journal->j_state_lock);
>>>> + jbd2_check_frozen(journal);
>>>> + goto repeat;
>>>> + }
>>>>
>>>> if (!journal->j_running_transaction) {
>>>> read_unlock(&journal->j_state_lock);
>>>> @@ -489,6 +500,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>>>> }
>>>> EXPORT_SYMBOL(jbd2_journal_restart);
>>>>
>>>> +int jbd2_journal_freeze(journal_t *journal)
>>>> +{
>>>> + int error = 0;
>>>> + /* Now we set up the journal barrier. */
>>>> + jbd2_journal_lock_updates(journal);
>>>> +
>>>> + /*
>>>> + * Don't clear the needs_recovery flag if we failed to flush
>>>> + * the journal.
>>>> + */
>>>> + error = jbd2_journal_flush(journal);
>>>> + if (error>= 0) {
>>>> + write_lock(&journal->j_state_lock);
>>>> + journal->j_flags |= JBD2_FROZEN;
>>>> + write_unlock(&journal->j_state_lock);
>>>> + }
>>>> + jbd2_journal_unlock_updates(journal);
>>>> + return error;
>>>> +}
>>>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>>>> +
>>>> +void jbd2_journal_thaw(journal_t * journal)
>>>> +{
>>>> + write_lock(&journal->j_state_lock);
>>>> + journal->j_flags&= ~JBD2_FROZEN;
>>>> + write_unlock(&journal->j_state_lock);
>>>> + wake_up(&journal->j_wait_frozen);
>>>> +}
>>>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>>>> +
>>>> +
>>>> /**
>>>> * void jbd2_journal_lock_updates () - establish a transaction barrier.
>>>> * @journal: Journal to establish a barrier on.
>>>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>>>> index a32dcae..c7885b2 100644
>>>> --- a/include/linux/jbd2.h
>>>> +++ b/include/linux/jbd2.h
>>>> @@ -718,6 +718,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>>>> * @j_wait_checkpoint: Wait queue to trigger checkpointing
>>>> * @j_wait_commit: Wait queue to trigger commit
>>>> * @j_wait_updates: Wait queue to wait for updates to complete
>>>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>>>> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>>>> * @j_head: Journal head - identifies the first unused block in the journal
>>>> * @j_tail: Journal tail - identifies the oldest still-used block in the
>>>> @@ -835,6 +836,9 @@ struct journal_s
>>>> /* Wait queue to wait for updates to complete */
>>>> wait_queue_head_t j_wait_updates;
>>>>
>>>> + /* Wait queue to wait for journal to thaw*/
>>>> + wait_queue_head_t j_wait_frozen;
>>>> +
>>>> /* Semaphore for locking against concurrent checkpoints */
>>>> struct mutex j_checkpoint_mutex;
>>>>
>>>> @@ -1013,7 +1017,11 @@ struct journal_s
>>>> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
>>>> * data write error in ordered
>>>> * mode */
>>>> +#define JBD2_FROZEN 0x080 /* Journal thread is frozen as the filesystem is frozen */
>>>> +
>>>>
>>>> +#define jbd2_check_frozen(journal) \
>>>> + wait_event(journal->j_wait_frozen, (journal->j_flags& JBD2_FROZEN))
>>>> /*
>>>> * Function declarations for the journaling transaction and buffer
>>>> * management
>>>> @@ -1121,6 +1129,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
>>>> struct page *, unsigned long);
>>>> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>>>> extern int jbd2_journal_stop(handle_t *);
>>>> +extern int jbd2_journal_freeze(journal_t *);
>>>> +extern void jbd2_journal_thaw(journal_t *);
>>>> extern int jbd2_journal_flush (journal_t *);
>>>> extern void jbd2_journal_lock_updates (journal_t *);
>>>> extern void jbd2_journal_unlock_updates (journal_t *);
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>


2011-05-12 14:23:09

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

On 5/11/11 2:10 AM, Surbhi Palande wrote:
> While the fsstress background writes are busy dirtying the page cache, if a
> fsfreeze happens then the background writes should stall. A sync should then
> not have any data to sync to the FS. If it does have any data to sync then
> sync will cause a deadlock by holding the s_umount write semaphore and waiting
> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
> getting the s_umount write semaphore.
>
> Signed-off-by: Surbhi Palande <[email protected]>

Seems ok to me. In the future, when sending xfstests patches,
if you can add "xfstests" to the subject, and cc: the xfs list,
it'd be great.

I presume that this test does fail for you without your fixes?

I'll see if anyone on the xfs list has comments and if not, I can check this in.

Thanks,
-Eric

> ---
> 068 | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/068 b/068
> index 82c1a4e..b9ac58d 100755
> --- a/068
> +++ b/068
> @@ -101,6 +101,11 @@ do
> tee -a $seq.full
> sleep 2
>
> + # there should be nothing to sync at this point. This may hang in case
> + # of fsstress background writes dirtying the page cache while the F.S is frozen
> + sync &
> + sleep 2
> +
> echo "*** thawing \$SCRATCH_MNT" | tee -a $seq.full
> xfs_freeze -u "$SCRATCH_MNT" | tee -a $seq.full
> [ $? != 0 ] && echo xfs_freeze -u "$SCRATCH_MNT" failed | \


2011-05-24 21:42:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

On Wed, May 11, 2011 at 10:10:41AM +0300, Surbhi Palande wrote:
> While the fsstress background writes are busy dirtying the page cache, if a
> fsfreeze happens then the background writes should stall. A sync should then
> not have any data to sync to the FS. If it does have any data to sync then
> sync will cause a deadlock by holding the s_umount write semaphore and waiting
> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
> getting the s_umount write semaphore.
>
> Signed-off-by: Surbhi Palande <[email protected]>

Hi Surbhi,

Have you tried out Jan Kara's patches?

[1/3] fs: Create __block_page_mkwrite() helper passing error values back
[2/3] vfs: Block mmapped writes while the fs is frozen
[3/3] ext4: Rewrite ext4_page_mkwrite() to return locked page

Do these patches fix the problem you've been trying to fix with your
patches? I believe they should, but I would appreciate confirmation
that with these patches, you're no longer able to reproduce the
problem you've been concerned about.

Thanks, regards,

- Ted

2011-05-25 12:00:23

by Surbhi Palande

[permalink] [raw]
Subject: Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

Hi Ted,


On 05/25/2011 12:42 AM, Ted Ts'o wrote:
> On Wed, May 11, 2011 at 10:10:41AM +0300, Surbhi Palande wrote:
>> While the fsstress background writes are busy dirtying the page cache, if a
>> fsfreeze happens then the background writes should stall. A sync should then
>> not have any data to sync to the FS. If it does have any data to sync then
>> sync will cause a deadlock by holding the s_umount write semaphore and waiting
>> in the wait queue for the FS to thaw, whereas the F.S can never thaw without
>> getting the s_umount write semaphore.
>>
>> Signed-off-by: Surbhi Palande<[email protected]>
>
> Hi Surbhi,
>
> Have you tried out Jan Kara's patches?
>
> [1/3] fs: Create __block_page_mkwrite() helper passing error values back
> [2/3] vfs: Block mmapped writes while the fs is frozen
> [3/3] ext4: Rewrite ext4_page_mkwrite() to return locked page

Yes! We have tried these patches and we still see the same
deadlock/hang. The following is the reason for it:


// let's assume the inode is clean and so are its pages.
P1: process that tries mmap write
t1) __do_fault()
t2) ext4_page_mkwrite()
t3) block_page_mkwrite()
t4) vfs_check_frozen()
// filesystem is not frozen so control falls through.
t5) __block_page_mkwrite()
t6) set_page_dirty()
t7) __set_page_dirty()
t8) radix_tree_tag_set(PAGECACHE_TAG_DIRTY)
// page is dirtied, but inode is still clean.
---------------------- Pre-empted-----------------
P2: freeze process

t9) freeze_super()
t10) sync_filesystem()
// page cache now clean! no inode is dirty.
// however we have a dirty page belonging to a clean inode.
----------------------Freeze process finishes, filesystem frozen!----


P1: process that tries mmap write gets control.
t11) __set_page_dirty() // gets control back
t12) __mark_inode_dirty()
// inode is now dirty and it has a dirty page.
// though in reality no write has occurred.
t13) if (inode->i_sb->s_frozen != SB_UNFROZEN)
// __block_page_mkwrite() gets control back
t14) unlock_page()
t15) __block_page_mkwrite() returns -EAGAIN
t16) block_page_mkwrite() returns VM_FAULT_RETRY

---------------------------
// now we see the original deadlock reported.
P3: sync a filesystem
t17) down_read(s_umount)
t18) sync_filesystem()
t19) sb->s_op->sync_fs() // =ext4_sync_fs()
t20) vfs_check_frozen() // now blocks for thaw.
// so thaw cannot happen because sync process sleeps with s_umount!

This deadlock can occur whenever the freeze happens after the
vfs_check_frozen() but before the __mark_inode_dirty().
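
The interleaving above can be replayed in user space as a minimal sketch. Every name below (struct model_sb, the model_* functions, the MODEL_SB_* levels) is hypothetical and only mirrors steps t4 to t12; none of it is kernel code, and the "sync" inside the freeze step is reduced to clearing the dirty flags of an inode that is already marked dirty.

```c
#include <stdbool.h>

/* Hypothetical freeze levels, loosely mirroring sb->s_frozen. */
enum { MODEL_SB_UNFROZEN = 0, MODEL_SB_FREEZE_WRITE = 1, MODEL_SB_FREEZE_TRANS = 2 };

struct model_sb {
    int  s_frozen;     /* current freeze level */
    bool page_dirty;   /* page is tagged dirty in the page cache */
    bool inode_dirty;  /* inode is on the writeback dirty list */
};

/* t4: the check passes only while the filesystem is not frozen. */
bool model_check_frozen(const struct model_sb *sb)
{
    return sb->s_frozen == MODEL_SB_UNFROZEN;
}

/* t8: the page is tagged dirty; the inode is not yet marked dirty. */
void model_tag_page_dirty(struct model_sb *sb)
{
    sb->page_dirty = true;
}

/* t9-t10: freeze only writes back inodes that are already marked dirty. */
void model_freeze_super(struct model_sb *sb)
{
    sb->s_frozen = MODEL_SB_FREEZE_WRITE;
    if (sb->inode_dirty) {
        sb->inode_dirty = false;
        sb->page_dirty = false;
    }
    sb->s_frozen = MODEL_SB_FREEZE_TRANS;
}

/* t12: the inode is marked dirty without rechecking the freeze level. */
void model_mark_inode_dirty(struct model_sb *sb)
{
    sb->inode_dirty = true;
}

/*
 * Replays the sequence. With freeze_inside_window the freeze lands between
 * the frozen check and the inode being marked dirty; otherwise it completes
 * before the check. Returns true if a frozen fs ends up with a dirty inode.
 */
bool model_frozen_with_dirty_inode(bool freeze_inside_window)
{
    struct model_sb sb = { MODEL_SB_UNFROZEN, false, false };

    if (!freeze_inside_window)
        model_freeze_super(&sb);   /* freeze completes before t4 */
    if (!model_check_frozen(&sb))
        return false;              /* writer would block here; nothing dirtied */
    model_tag_page_dirty(&sb);
    if (freeze_inside_window)
        model_freeze_super(&sb);   /* P2 runs while P1 is pre-empted */
    model_mark_inode_dirty(&sb);   /* P1 resumes */
    return sb.s_frozen != MODEL_SB_UNFROZEN && sb.inode_dirty;
}
```

In this model only the freeze that lands inside the window leaves a dirty inode behind on a frozen filesystem, which is the state the later sync then trips over.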

We see blocked sync processes every time we do the following:

1) execute iozone on multipath, and
2) run the script that Toshiyuki sent, which I have modified and am
attaching here. This script reproduces the bug faster when executed with
iozone.
(Note that since this is a race, this script _may not_ always reproduce
it on its own.)


I also found one more missing piece in the "Add support to freeze and
unfreeze journal" patch:
1) Call jbd2_journal_thaw() from ext4_unfreeze() to restart the
transactions.

I shall send a patch for the same as a reply to this email again.
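
The intended freeze/thaw pairing can be sketched in user space as follows. This is only an illustration: struct model_journal and the model_* functions are hypothetical names, and the single flag bit reuses the 0x080 value that the patch gives JBD2_FROZEN. Since wait_event() sleeps until its condition becomes true, the predicate a blocked handle would presumably wait on is "the frozen flag is clear", which is what model_handle_may_start() returns.

```c
#include <stdbool.h>

/* Same bit value the patch assigns to JBD2_FROZEN; the name is hypothetical. */
#define MODEL_JBD2_FROZEN 0x080UL

struct model_journal {
    unsigned long j_flags;
};

/* Freeze side: after this, no new handle should be allowed to start. */
void model_journal_freeze(struct model_journal *j)
{
    j->j_flags |= MODEL_JBD2_FROZEN;
}

/* Thaw side: without this call the flag stays set and handles never restart. */
void model_journal_thaw(struct model_journal *j)
{
    j->j_flags &= ~MODEL_JBD2_FROZEN;
}

/* Condition a sleeping handle waits for: true once the journal is thawed. */
bool model_handle_may_start(const struct model_journal *j)
{
    return !(j->j_flags & MODEL_JBD2_FROZEN);
}
```

If the unfreeze path never calls the thaw function, model_handle_may_start() stays false in this model, which corresponds to transactions never resuming.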

Thanks!

Warm Regards,
Surbhi.

>
> Do these patches fix the problem you've been trying to fix with your
> patches? I believe they should, but I would appreciate confirmation
> that with these patches, you're no longer able to reproduce the
> problem you've been concerned about.
>
> Thanks, regards,
>
> - Ted


Attachments:
test.sh (2.68 kB)

2011-05-25 12:12:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

Hi Surbhi,

Just as a request --- could you start a new thread (this one is getting so long it's hard to follow)?

And could you also include a reliable reproduction case?

Many thanks!!

-- Ted



2011-05-27 16:28:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] Attempt to sync the fsstress writes to a frozen F.S

Ted,

On Wed 25-05-11 08:12:15, Ted Tso wrote:
> Just as a request --- could you start a new thread (this one is getting
> so long it's hard to follow)?
>
> And could you also include a reliable reproduction case?
Just a quick note - this patch series was not really meant to fix the
deadlocks. It is meant to make freezing reliable in combination with
mmapped writes. As a side effect, it makes the deadlock Surbhi describes
less probable, but I'm aware it's still there.

I plan to have another look at how the deadlock could be fixed (the first
attempt was rejected by Dave Chinner) but currently I'm busy with other
stuff...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-12-09 01:56:39

by Masayoshi MIZUMA

[permalink] [raw]
Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock


(2011/02/07 20:53), Masayoshi MIZUMA wrote:

> Hi,
>
> When I checked the freeze feature for ext4 filesystem using fsfreeze command
> at 2.6.38-rc3, I got the following messeges:

Hi,

I checked the freeze function using the test program below on 3.2.0-rc4,
and I got the following messages and the test program hung.
I think this bug is still in 3.2.0-rc4...

The test program:
-----------------------------------------------------------
#!/bin/bash

DEV_1=/dev/sda5
MNT_1=/tmp/sda5
LOOP=500

if [[ ! -d $MNT_1 ]]
then
mkdir -p $MNT_1
fi

mkfs -t ext4 $DEV_1
mount $DEV_1 $MNT_1

./fsstress -d $MNT_1/tmp -n 10000 -p 100 > /dev/null 2>&1 &
PID=$!

for ((i=0; i<LOOP; i++))
do
echo LOOP: $i
fsfreeze -f $MNT_1
fsfreeze -u $MNT_1
done

kill $PID
-----------------------------------------------------------

The messages I got when I ran the test program are below.
-------------------------------------------------------------
INFO: task flush-8:0:720 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0 D 0000000100521461 0 720 2 0x00000000
ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
Call Trace:
[<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
[<ffffffff814ee3ff>] schedule+0x3f/0x60
[<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
[<ffffffff81086820>] ? wake_up_bit+0x40/0x40
[<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
[<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
[<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
[<ffffffff81112bb1>] do_writepages+0x21/0x40
[<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
[<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
[<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
[<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
[<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
[<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
[<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
[<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
[<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
[<ffffffff810861a6>] kthread+0x96/0xa0
[<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
[<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
[<ffffffff814fa5b0>] ? gs_change+0x13/0x13

INFO: task fsstress:4376 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
fsstress D ffff88009b52dda8 0 4376 4364 0x00000080
ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
Call Trace:
[<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
[<ffffffff814ee3ff>] schedule+0x3f/0x60
[<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
[<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
[<ffffffff810127e4>] ? __switch_to+0x194/0x320
[<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
[<ffffffff814ee26d>] wait_for_common+0x11d/0x190
[<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
[<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
[<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
[<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
[<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
[<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
[<ffffffff811938ef>] sync_one_sb+0x1f/0x30
[<ffffffff811695da>] iterate_supers+0x7a/0xd0
[<ffffffff81193934>] sys_sync+0x34/0x70
[<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
-------------------------------------------------------------

The test program for xfstests is below.
-------------------------------------------------------------
#! /bin/bash
# FSQA Test No. 277
#
# Run fsstress and freeze/unfreeze in parallel
#
#-----------------------------------------------------------------------
# Copyright (c) 2006 Silicon Graphics, Inc. All Rights Reserved.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
#
#-----------------------------------------------------------------------
#
# creator
[email protected]

seq=`basename $0`
echo "QA output created by $seq"

here=`pwd`
tmp=/tmp/$$
status=0 # success is the default!
trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15

# get standard environment, filters and checks
. ./common.rc
. ./common.filter

_workout()
{
echo ""
echo "Run fsstress"
echo ""
num_iterations=500
out=$SCRATCH_MNT/fsstress.$$
args="-p100 -n10000 -d $out"
echo "fsstress $args" >> $here/$seq.full
$FSSTRESS_PROG $args > /dev/null 2>&1 &
pid=$!
echo "Run xfs_freeze in parallel"
for ((i=0; i < num_iterations; i++))
do
xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
done
kill $pid 2> /dev/null
wait $pid
}

# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch

_scratch_mkfs >> $seq.full 2>&1
_scratch_mount

if ! _workout; then
umount $SCRATCH_DEV 2>/dev/null
exit
fi

if ! _scratch_unmount; then
echo "failed to umount"
status=1
exit
fi
_check_scratch_fs
status=$?
exit
-------------------------------------------------------------

Thanks,
Masayoshi Mizuma

>
> ---------------------------------------------------------------------
> Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
> Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
> Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
> Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
> Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
> Feb 7 15:05:09 RX300S6 kernel: Call Trace:
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
> Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
> ...
> Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
> Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
> Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
> Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
> Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
> Feb 7 15:07:09 RX300S6 kernel: Call Trace:
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
> Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
> ---------------------------------------------------------------------
>
> I think the following deadlock problem happened:
>
> [flush-8:0:1409] | [fsfreeze:2104]
> --------------------------------------------+--------------------------------
> writeback_inodes_wb |
> pin_sb_for_writeback |
> down_read_trylock(&sb->s_umount) |
> writeback_sb_inodes |thaw_super
> writeback_single_inode | down_write(&sb->s_umount)
> do_writepages | # stop until flush-8:0 releases
> ext4_da_writepages | # read lock of sb->s_umount...
> ext4_journal_start_sb |
> vfs_check_frozen |
> wait_event((sb)->s_wait_unfrozen, |
> ((sb)->s_frozen < (level))) |
> # stop until being waked up by |
> # fsfreeze... |
> --------------------------------------------+--------------------------------
>
> Could anyone check this problem?
>
> Thanks,
> Masayoshi Mizuma
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html




2011-12-15 12:41:22

by Masayoshi MIZUMA

[permalink] [raw]
Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock


(2011/12/09 10:56), Masayoshi MIZUMA wrote:

>
> (2011/02/07 20:53), Masayoshi MIZUMA wrote:
>
> > Hi,
> >
> > When I checked the freeze feature for ext4 filesystem using fsfreeze command
> > at 2.6.38-rc3, I got the following messeges:
>
> Hi,
>
> I checked freeze function with using below test program at 3.2.0-rc4,
> then, I got following messeages and the test program hanged up.
> I think this bug is still in 3.2.0-rc4...

I think the problem is as follows.
When a race between ext4_page_mkwrite() and freeze_super() occurs,
ext4_page_mkwrite() can add an inode to the list of inodes needing
writeback (bdi_writeback.b_dirty) even though sb->s_frozen is
SB_FREEZE_WRITE or SB_FREEZE_TRANS.

process A | process B
------------------------------+-----------------------------------------------
ext4_page_mkwrite() |
=> vfs_check_frozen() |
| freeze_super()
| sb->s_frozen = SB_FREEZE_WRITE
=>__block_page_mkwrite() | => sync_filesystem()
: | # write inodes which are in the list.
: | sb->s_frozen = SB_FREEZE_TRANS
: |
=>__mark_inode_dirty |
# add inode to the list. |
------------------------------+-----------------------------------------------

As a result, if the "flush" kthread writes back the inode which was
added by ext4_page_mkwrite() while thaw_super() runs concurrently, the
deadlock will happen.
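
The resulting dependency can be written down as a tiny wait-for graph, as a sketch only. The task names and model_* functions are hypothetical. The two edges encode the assumption from the original report that the flusher holds the read side of sb->s_umount while sleeping until thaw, and that thaw_super() needs the write side of the same semaphore.

```c
#include <stdbool.h>

/* Hypothetical task ids; each task waits on at most one other (-1 = none). */
enum { MODEL_TASK_FLUSH = 0, MODEL_TASK_THAW = 1, MODEL_TASK_COUNT = 2 };

/* Follows wait-for edges from 'start'; coming back to 'start' is a deadlock. */
bool model_has_wait_cycle(const int waits_for[], int ntasks, int start)
{
    int cur = waits_for[start];

    for (int steps = 0; steps < ntasks && cur >= 0; steps++) {
        if (cur == start)
            return true;
        cur = waits_for[cur];
    }
    return false;
}

/*
 * Builds the graph for the reported case. If the flusher is blocked on the
 * frozen filesystem it waits for the thaw wake-up, while the thaw task
 * waits for the flusher to drop s_umount.
 */
bool model_freeze_deadlock(bool flusher_blocked_on_frozen_fs)
{
    int waits_for[MODEL_TASK_COUNT] = { -1, -1 };

    if (flusher_blocked_on_frozen_fs) {
        waits_for[MODEL_TASK_FLUSH] = MODEL_TASK_THAW;  /* needs wake-up from thaw */
        waits_for[MODEL_TASK_THAW]  = MODEL_TASK_FLUSH; /* needs s_umount released */
    }
    return model_has_wait_cycle(waits_for, MODEL_TASK_COUNT, MODEL_TASK_FLUSH);
}
```

The cycle only exists when the flusher has work to do on the frozen filesystem, which is why the dirty inode added during the race is the trigger.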

Thanks,
Masayoshi Mizuma

>
> The test program:
> -----------------------------------------------------------
> #!/bin/bash
>
> DEV_1=/dev/sda5
> MNT_1=/tmp/sda5
> LOOP=500
>
> if [[ ! -d $MNT_1 ]]
> then
> mkdir -p $MNT_1
> fi
>
> mkfs -t ext4 $DEV_1
> mount $DEV_1 $MNT_1
>
> ./fsstress -d $MNT_1/tmp -n 10000 -p 100 > /dev/null 2>&1 &
> PID=$!
>
> for ((i=0; i<LOOP; i++))
> do
> echo LOOP: $i
> fsfreeze -f $MNT_1
> fsfreeze -u $MNT_1
> done
>
> kill $PID
> -----------------------------------------------------------
>
> The messages I got when I ran the test program is below.
> -------------------------------------------------------------
> INFO: task flush-8:0:720 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-8:0 D 0000000100521461 0 720 2 0x00000000
> ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
> 0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
> ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
> Call Trace:
> [<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
> [<ffffffff814ee3ff>] schedule+0x3f/0x60
> [<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
> [<ffffffff81086820>] ? wake_up_bit+0x40/0x40
> [<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
> [<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
> [<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
> [<ffffffff81112bb1>] do_writepages+0x21/0x40
> [<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
> [<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
> [<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
> [<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
> [<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
> [<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
> [<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
> [<ffffffff810861a6>] kthread+0x96/0xa0
> [<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
> [<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
> [<ffffffff814fa5b0>] ? gs_change+0x13/0x13
>
> INFO: task fsstress:4376 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> fsstress D ffff88009b52dda8 0 4376 4364 0x00000080
> ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
> 0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
> ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
> Call Trace:
> [<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
> [<ffffffff814ee3ff>] schedule+0x3f/0x60
> [<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
> [<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
> [<ffffffff810127e4>] ? __switch_to+0x194/0x320
> [<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
> [<ffffffff814ee26d>] wait_for_common+0x11d/0x190
> [<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
> [<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
> [<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
> [<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
> [<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
> [<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
> [<ffffffff811938ef>] sync_one_sb+0x1f/0x30
> [<ffffffff811695da>] iterate_supers+0x7a/0xd0
> [<ffffffff81193934>] sys_sync+0x34/0x70
> [<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
> -------------------------------------------------------------
>
> The test program for xfstests is below.
> -------------------------------------------------------------
> #! /bin/bash
> # FSQA Test No. 277
> #
> # Run fsstress and freeze/unfreeze in parallel
> #
> #-----------------------------------------------------------------------
> # Copyright (c) 2006 Silicon Graphics, Inc. All Rights Reserved.
> #
> # This program is free software; you can redistribute it and/or
> # modify it under the terms of the GNU General Public License as
> # published by the Free Software Foundation.
> #
> # This program is distributed in the hope that it would be useful,
> # but WITHOUT ANY WARRANTY; without even the implied warranty of
> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> # GNU General Public License for more details.
> #
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write the Free Software Foundation,
> # Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> #
> #-----------------------------------------------------------------------
> #
> # creator
> [email protected]
>
> seq=`basename $0`
> echo "QA output created by $seq"
>
> here=`pwd`
> tmp=/tmp/$$
> status=0 # success is the default!
> trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15
>
> # get standard environment, filters and checks
> . ./common.rc
> . ./common.filter
>
> _workout()
> {
> echo ""
> echo "Run fsstress"
> echo ""
> num_iterations=500
> out=$SCRATCH_MNT/fsstress.$$
> args="-p100 -n10000 -d $out"
> echo "fsstress $args" >> $here/$seq.full
> $FSSTRESS_PROG $args > /dev/null 2>&1 &
> pid=$!
> echo "Run xfs_freeze in parallel"
> for ((i=0; i < num_iterations; i++))
> do
> xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
> xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
> done
> kill $pid 2> /dev/null
> wait $pid
> }
>
> # real QA test starts here
> _supported_fs generic
> _supported_os Linux
> _need_to_be_root
> _require_scratch
>
> _scratch_mkfs >> $seq.full 2>&1
> _scratch_mount
>
> if ! _workout; then
> umount $SCRATCH_DEV 2>/dev/null
> exit
> fi
>
> if ! _scratch_unmount; then
> echo "failed to umount"
> status=1
> exit
> fi
> _check_scratch_fs
> status=$?
> exit
> -------------------------------------------------------------
>
> Thanks,
> Masayoshi Mizuma
>
> >
> > ---------------------------------------------------------------------
> > Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
> > Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
> > Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
> > Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
> > Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
> > Feb 7 15:05:09 RX300S6 kernel: Call Trace:
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
> > ...
> > Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
> > Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
> > Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
> > Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
> > Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
> > Feb 7 15:07:09 RX300S6 kernel: Call Trace:
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
> > ---------------------------------------------------------------------
> >
> > I think the following deadlock problem happened:
> >
> > [flush-8:0:1409] | [fsfreeze:2104]
> > --------------------------------------------+--------------------------------
> > writeback_inodes_wb |
> > pin_sb_for_writeback |
> > down_read_trylock(&sb->s_umount) |
> > writeback_sb_inodes |thaw_super
> > writeback_single_inode | down_write(&sb->s_umount)
> > do_writepages | # stop until flush-8:0 releases
> > ext4_da_writepages | # read lock of sb->s_umount...
> > ext4_journal_start_sb |
> > vfs_check_frozen |
> > wait_event((sb)->s_wait_unfrozen, |
> > ((sb)->s_frozen < (level))) |
> > # stop until being waked up by |
> > # fsfreeze... |
> > --------------------------------------------+--------------------------------
> >
> > Could anyone check this problem?
> >
> > Thanks,
> > Masayoshi Mizuma
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html




2013-11-29 04:58:51

by Yongqiang Yang

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

How was the bug fixed in the end? I cannot find the accepted patch.


Thanks,
Yongqiang.

On Thu, Dec 15, 2011 at 8:41 PM, Masayoshi MIZUMA
<[email protected]> wrote:
>
> (2011/12/09 10:56), Masayoshi MIZUMA wrote:
>
>>
>> (2011/02/07 20:53), Masayoshi MIZUMA wrote:
>>
>> > Hi,
>> >
>> > When I checked the freeze feature of the ext4 filesystem using the fsfreeze command
>> > on 2.6.38-rc3, I got the following messages:
>>
>> Hi,
>>
>> I checked the freeze function using the test program below on 3.2.0-rc4,
>> and got the following messages; the test program then hung.
>> I think this bug is still present in 3.2.0-rc4...
>
> I think the problem is as follows.
> When ext4_page_mkwrite() races with freeze_super(),
> ext4_page_mkwrite() can add an inode to the writeback list
> (bdi_writeback.b_dirty) even though sb->s_frozen is already
> SB_FREEZE_WRITE or SB_FREEZE_TRANS.
>
> process A | process B
> ------------------------------+-----------------------------------------------
> ext4_page_mkwrite() |
> => vfs_check_frozen() |
> | freeze_super()
> | sb->s_frozen = SB_FREEZE_WRITE
> =>__block_page_mkwrite() | => sync_filesystem()
> : | # write inodes which are in the list.
> : | sb->s_frozen = SB_FREEZE_TRANS
> : |
> =>__mark_inode_dirty |
> # add inode to the list. |
> ------------------------------+-----------------------------------------------
>
> As a result, if the "flush" kthread writes back the inode that was
> added by ext4_page_mkwrite() while thaw_super() runs concurrently, the
> deadlock occurs.
>
> Thanks,
> Masayoshi Mizuma
>
>>
>> The test program:
>> -----------------------------------------------------------
>> #!/bin/bash
>>
>> DEV_1=/dev/sda5
>> MNT_1=/tmp/sda5
>> LOOP=500
>>
>> if [[ ! -d $MNT_1 ]]
>> then
>> mkdir -p $MNT_1
>> fi
>>
>> mkfs -t ext4 $DEV_1
>> mount $DEV_1 $MNT_1
>>
>> ./fsstress -d $MNT_1/tmp -n 10000 -p 100 > /dev/null 2>&1 &
>> PID=$!
>>
>> for ((i=0; i<LOOP; i++))
>> do
>> echo LOOP: $i
>> fsfreeze -f $MNT_1
>> fsfreeze -u $MNT_1
>> done
>>
>> kill $PID
>> -----------------------------------------------------------
>>
>> The messages I got when I ran the test program are below.
>> -------------------------------------------------------------
>> INFO: task flush-8:0:720 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> flush-8:0 D 0000000100521461 0 720 2 0x00000000
>> ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
>> 0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
>> ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
>> Call Trace:
>> [<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
>> [<ffffffff814ee3ff>] schedule+0x3f/0x60
>> [<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
>> [<ffffffff81086820>] ? wake_up_bit+0x40/0x40
>> [<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
>> [<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
>> [<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
>> [<ffffffff81112bb1>] do_writepages+0x21/0x40
>> [<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
>> [<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
>> [<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
>> [<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
>> [<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
>> [<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
>> [<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
>> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
>> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
>> [<ffffffff810861a6>] kthread+0x96/0xa0
>> [<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
>> [<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
>> [<ffffffff814fa5b0>] ? gs_change+0x13/0x13
>>
>> INFO: task fsstress:4376 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> fsstress D ffff88009b52dda8 0 4376 4364 0x00000080
>> ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
>> 0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
>> ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
>> Call Trace:
>> [<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
>> [<ffffffff814ee3ff>] schedule+0x3f/0x60
>> [<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
>> [<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
>> [<ffffffff810127e4>] ? __switch_to+0x194/0x320
>> [<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
>> [<ffffffff814ee26d>] wait_for_common+0x11d/0x190
>> [<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
>> [<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
>> [<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
>> [<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
>> [<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
>> [<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
>> [<ffffffff811938ef>] sync_one_sb+0x1f/0x30
>> [<ffffffff811695da>] iterate_supers+0x7a/0xd0
>> [<ffffffff81193934>] sys_sync+0x34/0x70
>> [<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
>> -------------------------------------------------------------
>>
>> The test program for xfstests is below.
>> -------------------------------------------------------------
>> #! /bin/bash
>> # FSQA Test No. 277
>> #
>> # Run fsstress and freeze/unfreeze in parallel
>> #
>> #-----------------------------------------------------------------------
>> # Copyright (c) 2006 Silicon Graphics, Inc. All Rights Reserved.
>> #
>> # This program is free software; you can redistribute it and/or
>> # modify it under the terms of the GNU General Public License as
>> # published by the Free Software Foundation.
>> #
>> # This program is distributed in the hope that it would be useful,
>> # but WITHOUT ANY WARRANTY; without even the implied warranty of
>> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> # GNU General Public License for more details.
>> #
>> # You should have received a copy of the GNU General Public License
>> # along with this program; if not, write the Free Software Foundation,
>> # Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
>> #
>> #-----------------------------------------------------------------------
>> #
>> # creator
>> [email protected]
>>
>> seq=`basename $0`
>> echo "QA output created by $seq"
>>
>> here=`pwd`
>> tmp=/tmp/$$
>> status=0 # success is the default!
>> trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15
>>
>> # get standard environment, filters and checks
>> . ./common.rc
>> . ./common.filter
>>
>> _workout()
>> {
>> echo ""
>> echo "Run fsstress"
>> echo ""
>> num_iterations=500
>> out=$SCRATCH_MNT/fsstress.$$
>> args="-p100 -n10000 -d $out"
>> echo "fsstress $args" >> $here/$seq.full
>> $FSSTRESS_PROG $args > /dev/null 2>&1 &
>> pid=$!
>> echo "Run xfs_freeze in parallel"
>> for ((i=0; i < num_iterations; i++))
>> do
>> xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
>> xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
>> done
>> kill $pid 2> /dev/null
>> wait $pid
>> }
>>
>> # real QA test starts here
>> _supported_fs generic
>> _supported_os Linux
>> _need_to_be_root
>> _require_scratch
>>
>> _scratch_mkfs >> $seq.full 2>&1
>> _scratch_mount
>>
>> if ! _workout; then
>> umount $SCRATCH_DEV 2>/dev/null
>> exit
>> fi
>>
>> if ! _scratch_unmount; then
>> echo "failed to umount"
>> status=1
>> exit
>> fi
>> _check_scratch_fs
>> status=$?
>> exit
>> -------------------------------------------------------------
>>
>> Thanks,
>> Masayoshi Mizuma
>>
>> >
>> > ---------------------------------------------------------------------
>> > Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
>> > Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
>> > Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
>> > Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
>> > Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
>> > Feb 7 15:05:09 RX300S6 kernel: Call Trace:
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
>> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
>> > ...
>> > Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
>> > Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
>> > Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
>> > Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
>> > Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
>> > Feb 7 15:07:09 RX300S6 kernel: Call Trace:
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
>> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
>> > ---------------------------------------------------------------------
>> >
>> > I think the following deadlock problem happened:
>> >
>> > [flush-8:0:1409] | [fsfreeze:2104]
>> > --------------------------------------------+--------------------------------
>> > writeback_inodes_wb |
>> > pin_sb_for_writeback |
>> > down_read_trylock(&sb->s_umount) |
>> > writeback_sb_inodes |thaw_super
>> > writeback_single_inode | down_write(&sb->s_umount)
>> > do_writepages | # stop until flush-8:0 releases
>> > ext4_da_writepages | # read lock of sb->s_umount...
>> > ext4_journal_start_sb |
>> > vfs_check_frozen |
>> > wait_event((sb)->s_wait_unfrozen, |
>> > ((sb)->s_frozen < (level))) |
>> > # stop until being waked up by |
>> > # fsfreeze... |
>> > --------------------------------------------+--------------------------------
>> >
>> > Could anyone check this problem?
>> >
>> > Thanks,
>> > Masayoshi Mizuma
>> >



--
Best Wishes
Yongqiang Yang

2013-11-29 08:00:30

by Jan Kara

Subject: Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

On Fri 29-11-13 12:58:29, Yongqiang Yang wrote:
> How was the bug fixed in the end? I cannot find the accepted patch.
It was fixed by rewriting the handling of freezing in the VFS. That was a
rather large patch series, but the core of the change is commit
5accdf82ba25cacefd6c1867f1704beb4d244cdd.

Honza
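
For readers following the thread, the lock ordering from the quoted report
can be modelled in user space. The sketch below is only an illustration
under stated assumptions; it is neither kernel code nor the fix. A lock file
stands in for sb->s_umount, a flag file stands in for the frozen state, and
flock(1) from util-linux supplies the shared/exclusive semantics. The
function name simulate_thaw and both timeouts are hypothetical: the real
tasks wait with no timeout at all, which is why they hang.

```shell
#!/bin/bash
# User-space model of the reported lock ordering. It touches no filesystem
# state beyond a private temporary directory. All names are hypothetical.
#
# usage: simulate_thaw HOLD_SECS THAW_WAIT_SECS
simulate_thaw() {
    local hold_secs=$1 thaw_wait=$2
    local dir lock frozen ready flusher
    dir=$(mktemp -d) || return 2
    lock=$dir/s_umount       # stands in for sb->s_umount
    frozen=$dir/frozen       # exists while the "filesystem" is frozen
    ready=$dir/ready         # created once the flusher holds the lock
    : > "$lock"
    : > "$frozen"

    # "flush": take s_umount shared, then wait for the frozen flag to clear.
    # The wait is capped at hold_secs only so that the model terminates.
    flock -s "$lock" bash -c '
        : > "$3"
        end=$((SECONDS + $2))
        while [ -e "$1" ] && [ "$SECONDS" -lt "$end" ]; do sleep 0.1; done
    ' _ "$frozen" "$hold_secs" "$ready" &
    flusher=$!
    until [ -e "$ready" ]; do sleep 0.1; done

    # "thaw": needs s_umount exclusively before it may clear the flag, but
    # the flusher will not drop its shared lock until the flag is cleared.
    if flock -x -w "$thaw_wait" "$lock" rm -f "$frozen"; then
        echo "thawed"
    else
        echo "thaw blocked"
    fi
    wait "$flusher"
    rm -rf "$dir"
}
```

With the flusher holding its shared lock longer than the thaw side is
willing to wait (simulate_thaw 3 1), the function prints "thaw blocked",
which mirrors the cycle in the diagram. If the flusher gives up immediately
(simulate_thaw 0 1), the exclusive lock is granted and it prints "thawed".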

> On Thu, Dec 15, 2011 at 8:41 PM, Masayoshi MIZUMA
> <[email protected]> wrote:
> >
> > (2011/12/09 10:56), Masayoshi MIZUMA wrote:
> >
> >>
> >> (2011/02/07 20:53), Masayoshi MIZUMA wrote:
> >>
> >> > Hi,
> >> >
> >> > When I checked the freeze feature of the ext4 filesystem using the fsfreeze command
> >> > on 2.6.38-rc3, I got the following messages:
> >>
> >> Hi,
> >>
> >> I checked the freeze function using the test program below on 3.2.0-rc4,
> >> and got the following messages; the test program then hung.
> >> I think this bug is still present in 3.2.0-rc4...
> >
> > I think the problem is as follows.
> > When ext4_page_mkwrite() races with freeze_super(),
> > ext4_page_mkwrite() can add an inode to the writeback list
> > (bdi_writeback.b_dirty) even though sb->s_frozen is already
> > SB_FREEZE_WRITE or SB_FREEZE_TRANS.
> >
> > process A | process B
> > ------------------------------+-----------------------------------------------
> > ext4_page_mkwrite() |
> > => vfs_check_frozen() |
> > | freeze_super()
> > | sb->s_frozen = SB_FREEZE_WRITE
> > =>__block_page_mkwrite() | => sync_filesystem()
> > : | # write inodes which are in the list.
> > : | sb->s_frozen = SB_FREEZE_TRANS
> > : |
> > =>__mark_inode_dirty |
> > # add inode to the list. |
> > ------------------------------+-----------------------------------------------
> >
> > As a result, if the "flush" kthread writes back the inode that was
> > added by ext4_page_mkwrite() while thaw_super() runs concurrently, the
> > deadlock occurs.
> >
> > Thanks,
> > Masayoshi Mizuma
> >
> >>
> >> The test program:
> >> -----------------------------------------------------------
> >> #!/bin/bash
> >>
> >> DEV_1=/dev/sda5
> >> MNT_1=/tmp/sda5
> >> LOOP=500
> >>
> >> if [[ ! -d $MNT_1 ]]
> >> then
> >> mkdir -p $MNT_1
> >> fi
> >>
> >> mkfs -t ext4 $DEV_1
> >> mount $DEV_1 $MNT_1
> >>
> >> ./fsstress -d $MNT_1/tmp -n 10000 -p 100 > /dev/null 2>&1 &
> >> PID=$!
> >>
> >> for ((i=0; i<LOOP; i++))
> >> do
> >> echo LOOP: $i
> >> fsfreeze -f $MNT_1
> >> fsfreeze -u $MNT_1
> >> done
> >>
> >> kill $PID
> >> -----------------------------------------------------------
> >>
> >> The messages I got when I ran the test program are below.
> >> -------------------------------------------------------------
> >> INFO: task flush-8:0:720 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> flush-8:0 D 0000000100521461 0 720 2 0x00000000
> >> ffff8800b4c41a40 0000000000000046 0000000000000000 0000000000000000
> >> 0000000000013440 ffff8800b4c41fd8 ffff8800b4c40010 0000000000013440
> >> ffff8800b4c41fd8 0000000000013440 ffffffff81a0d020 ffff8800b464d4e0
> >> Call Trace:
> >> [<ffffffff81086b4e>] ? prepare_to_wait+0x5e/0x90
> >> [<ffffffff814ee3ff>] schedule+0x3f/0x60
> >> [<ffffffffa041e485>] ext4_journal_start_sb+0x145/0x1b0 [ext4]
> >> [<ffffffff81086820>] ? wake_up_bit+0x40/0x40
> >> [<ffffffffa0401bc5>] ? ext4_meta_trans_blocks+0xb5/0xc0 [ext4]
> >> [<ffffffffa0406c9d>] ext4_da_writepages+0x29d/0x620 [ext4]
> >> [<ffffffff81227a18>] ? blk_finish_plug+0x18/0x50
> >> [<ffffffff81112bb1>] do_writepages+0x21/0x40
> >> [<ffffffff8118e380>] writeback_single_inode+0x180/0x3b0
> >> [<ffffffff8118e971>] writeback_sb_inodes+0x1a1/0x260
> >> [<ffffffff8118ec6e>] wb_writeback+0xde/0x2b0
> >> [<ffffffff810739c6>] ? try_to_del_timer_sync+0x86/0xe0
> >> [<ffffffff8118eee6>] wb_do_writeback+0xa6/0x260
> >> [<ffffffff81072ef0>] ? lock_timer_base+0x70/0x70
> >> [<ffffffff8118f14a>] bdi_writeback_thread+0xaa/0x270
> >> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
> >> [<ffffffff8118f0a0>] ? wb_do_writeback+0x260/0x260
> >> [<ffffffff810861a6>] kthread+0x96/0xa0
> >> [<ffffffff814fa5b4>] kernel_thread_helper+0x4/0x10
> >> [<ffffffff81086110>] ? kthread_worker_fn+0x1a0/0x1a0
> >> [<ffffffff814fa5b0>] ? gs_change+0x13/0x13
> >>
> >> INFO: task fsstress:4376 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> fsstress D ffff88009b52dda8 0 4376 4364 0x00000080
> >> ffff88009b52dcb8 0000000000000082 ffffffff810d7e09 ffff88009b52dcc0
> >> 0000000000013440 ffff88009b52dfd8 ffff88009b52c010 0000000000013440
> >> ffff88009b52dfd8 0000000000013440 ffff88009b4d54e0 ffff8800a1481560
> >> Call Trace:
> >> [<ffffffff810d7e09>] ? trace_clock_local+0x9/0x10
> >> [<ffffffff814ee3ff>] schedule+0x3f/0x60
> >> [<ffffffff814ee89d>] schedule_timeout+0x1fd/0x2e0
> >> [<ffffffff810e5e43>] ? trace_nowake_buffer_unlock_commit+0x43/0x60
> >> [<ffffffff810127e4>] ? __switch_to+0x194/0x320
> >> [<ffffffff8104d623>] ? ftrace_raw_event_sched_switch+0x103/0x110
> >> [<ffffffff814ee26d>] wait_for_common+0x11d/0x190
> >> [<ffffffff8105a970>] ? try_to_wake_up+0x2b0/0x2b0
> >> [<ffffffff814ee3bd>] wait_for_completion+0x1d/0x20
> >> [<ffffffff8118daef>] writeback_inodes_sb_nr+0x7f/0xa0
> >> [<ffffffff8118dbdf>] writeback_inodes_sb+0x5f/0x80
> >> [<ffffffff811938d0>] ? __sync_filesystem+0x90/0x90
> >> [<ffffffff8119388e>] __sync_filesystem+0x4e/0x90
> >> [<ffffffff811938ef>] sync_one_sb+0x1f/0x30
> >> [<ffffffff811695da>] iterate_supers+0x7a/0xd0
> >> [<ffffffff81193934>] sys_sync+0x34/0x70
> >> [<ffffffff814f8442>] system_call_fastpath+0x16/0x1b
> >> -------------------------------------------------------------
> >>
> >> The test program for xfstests is below.
> >> -------------------------------------------------------------
> >> #! /bin/bash
> >> # FSQA Test No. 277
> >> #
> >> # Run fsstress and freeze/unfreeze in parallel
> >> #
> >> #-----------------------------------------------------------------------
> >> # Copyright (c) 2006 Silicon Graphics, Inc. All Rights Reserved.
> >> #
> >> # This program is free software; you can redistribute it and/or
> >> # modify it under the terms of the GNU General Public License as
> >> # published by the Free Software Foundation.
> >> #
> >> # This program is distributed in the hope that it would be useful,
> >> # but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> >> # GNU General Public License for more details.
> >> #
> >> # You should have received a copy of the GNU General Public License
> >> # along with this program; if not, write the Free Software Foundation,
> >> # Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> >> #
> >> #-----------------------------------------------------------------------
> >> #
> >> # creator
> >> [email protected]
> >>
> >> seq=`basename $0`
> >> echo "QA output created by $seq"
> >>
> >> here=`pwd`
> >> tmp=/tmp/$$
> >> status=0 # success is the default!
> >> trap "rm -f $tmp.*; exit \$status" 0 1 2 3 15
> >>
> >> # get standard environment, filters and checks
> >> . ./common.rc
> >> . ./common.filter
> >>
> >> _workout()
> >> {
> >> echo ""
> >> echo "Run fsstress"
> >> echo ""
> >> num_iterations=500
> >> out=$SCRATCH_MNT/fsstress.$$
> >> args="-p100 -n10000 -d $out"
> >> echo "fsstress $args" >> $here/$seq.full
> >> $FSSTRESS_PROG $args > /dev/null 2>&1 &
> >> pid=$!
> >> echo "Run xfs_freeze in parallel"
> >> for ((i=0; i < num_iterations; i++))
> >> do
> >> xfs_freeze -f $SCRATCH_MNT | tee -a $seq.full
> >> xfs_freeze -u $SCRATCH_MNT | tee -a $seq.full
> >> done
> >> kill $pid 2> /dev/null
> >> wait $pid
> >> }
> >>
> >> # real QA test starts here
> >> _supported_fs generic
> >> _supported_os Linux
> >> _need_to_be_root
> >> _require_scratch
> >>
> >> _scratch_mkfs >> $seq.full 2>&1
> >> _scratch_mount
> >>
> >> if ! _workout; then
> >> umount $SCRATCH_DEV 2>/dev/null
> >> exit
> >> fi
> >>
> >> if ! _scratch_unmount; then
> >> echo "failed to umount"
> >> status=1
> >> exit
> >> fi
> >> _check_scratch_fs
> >> status=$?
> >> exit
> >> -------------------------------------------------------------
> >>
> >> Thanks,
> >> Masayoshi Mizuma
> >>
> >> >
> >> > ---------------------------------------------------------------------
> >> > Feb 7 15:05:09 RX300S6 kernel: INFO: task fsfreeze:2104 blocked for more than 120 seconds.
> >> > Feb 7 15:05:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > Feb 7 15:05:09 RX300S6 kernel: fsfreeze D ffff880076d5f040 0 2104 2018 0x00000000
> >> > Feb 7 15:05:09 RX300S6 kernel: ffff88005a9f3d98 0000000000000086 ffff88005a9f3d38 ffffffff00000000
> >> > Feb 7 15:05:09 RX300S6 kernel: 0000000000014d40 ffff880076d5eab0 ffff880076d5f040 ffff88005a9f3fd8
> >> > Feb 7 15:05:09 RX300S6 kernel: ffff880076d5f048 0000000000014d40 ffff88005a9f2010 0000000000014d40
> >> > Feb 7 15:05:09 RX300S6 kernel: Call Trace:
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa5f5>] rwsem_down_failed_common+0xb5/0x140
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814aa693>] rwsem_down_write_failed+0x13/0x20
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8122f1a3>] call_rwsem_down_write_failed+0x13/0x20
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff814a9c12>] ? down_write+0x32/0x40
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81155b48>] thaw_super+0x28/0xd0
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81164338>] do_vfs_ioctl+0x368/0x560
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff81157c73>] ? sys_newfstat+0x33/0x40
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff811645d1>] sys_ioctl+0xa1/0xb0
> >> > Feb 7 15:05:09 RX300S6 kernel: [<ffffffff8100bf82>] system_call_fastpath+0x16/0x1b
> >> > ...
> >> > Feb 7 15:07:09 RX300S6 kernel: INFO: task flush-8:0:1409 blocked for more than 120 seconds.
> >> > Feb 7 15:07:09 RX300S6 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> > Feb 7 15:07:09 RX300S6 kernel: flush-8:0 D ffff880037777a30 0 1409 2 0x00000000
> >> > Feb 7 15:07:09 RX300S6 kernel: ffff880037c95a80 0000000000000046 ffff88007c8037a0 0000000000000000
> >> > Feb 7 15:07:09 RX300S6 kernel: 0000000000014d40 ffff8800377774a0 ffff880037777a30 ffff880037c95fd8
> >> > Feb 7 15:07:09 RX300S6 kernel: ffff880037777a38 0000000000014d40 ffff880037c94010 0000000000014d40
> >> > Feb 7 15:07:09 RX300S6 kernel: Call Trace:
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa00abb85>] ext4_journal_start_sb+0x75/0x130 [ext4]
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082fc0>] ? autoremove_wake_function+0x0/0x40
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffffa0097f0a>] ext4_da_writepages+0x27a/0x640 [ext4]
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81102c91>] do_writepages+0x21/0x40
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff811776b8>] writeback_single_inode+0x98/0x240
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81177cfe>] writeback_sb_inodes+0xce/0x170
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178709>] writeback_inodes_wb+0x99/0x160
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178a8b>] wb_writeback+0x2bb/0x430
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e2c>] wb_do_writeback+0x22c/0x280
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178f32>] bdi_writeback_thread+0xb2/0x260
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81178e80>] ? bdi_writeback_thread+0x0/0x260
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff81082936>] kthread+0x96/0xa0
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc4>] kernel_thread_helper+0x4/0x10
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff810828a0>] ? kthread+0x0/0xa0
> >> > Feb 7 15:07:09 RX300S6 kernel: [<ffffffff8100cdc0>] ? kernel_thread_helper+0x0/0x10
> >> > ---------------------------------------------------------------------
> >> >
> >> > I think the following deadlock problem happened:
> >> >
> >> > [flush-8:0:1409] | [fsfreeze:2104]
> >> > --------------------------------------------+--------------------------------
> >> > writeback_inodes_wb |
> >> > pin_sb_for_writeback |
> >> > down_read_trylock(&sb->s_umount) |
> >> > writeback_sb_inodes |thaw_super
> >> > writeback_single_inode | down_write(&sb->s_umount)
> >> > do_writepages | # stop until flush-8:0 releases
> >> > ext4_da_writepages | # read lock of sb->s_umount...
> >> > ext4_journal_start_sb |
> >> > vfs_check_frozen |
> >> > wait_event((sb)->s_wait_unfrozen, |
> >> > ((sb)->s_frozen < (level))) |
> >> > # stop until being waked up by |
> >> > # fsfreeze... |
> >> > --------------------------------------------+--------------------------------
> >> >
> >> > Could anyone check this problem?
> >> >
> >> > Thanks,
> >> > Masayoshi Mizuma
> >> >
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> > the body of a message to [email protected]
> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Wishes
> Yongqiang Yang
--
Jan Kara <[email protected]>
SUSE Labs, CR
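
The circular wait in the quoted diagram can be checked mechanically. What follows is only a toy user-space model, not kernel code: the `find_deadlock` helper and the resource name "unfrozen" (standing in for the wake-up on s_wait_unfrozen) are made up for illustration. It assumes the situation exactly as the diagram describes it: flush-8:0 holds the read side of sb->s_umount and waits to be woken, while fsfreeze is the only task that can deliver that wake-up and itself waits for s_umount.

```python
# Toy wait-for model of the two tasks in the diagram; not kernel code.
# "s_umount" is the rw-semaphore; "unfrozen" is a hypothetical name for the
# wake-up on s_wait_unfrozen, which only the thawing task can deliver.

def find_deadlock(holds, waits_for):
    """Return the tasks forming a wait-for cycle, or [] if there is none."""
    # Map each resource to the task that holds it (or must provide it).
    owner = {res: task for task, resources in holds.items() for res in resources}
    for start in waits_for:
        path, task = [], start
        # Follow "task waits for resource owned by next task" until the
        # chain ends or revisits a task already on the path.
        while task is not None and task not in path:
            path.append(task)
            task = owner.get(waits_for.get(task))
        if task is not None:
            return path[path.index(task):]
    return []

holds = {
    "flush-8:0": {"s_umount"},  # read lock from pin_sb_for_writeback
    "fsfreeze": {"unfrozen"},   # only thaw_super can wake s_wait_unfrozen
}
waits_for = {
    "flush-8:0": "unfrozen",    # vfs_check_frozen -> wait_event(...)
    "fsfreeze": "s_umount",     # thaw_super -> down_write(&sb->s_umount)
}

cycle = find_deadlock(holds, waits_for)
print(cycle)
```

A non-empty result means each listed task is waiting on something only the next one can release, which is the situation both hung-task traces show. Dropping either edge from the model (for example, not holding s_umount across the wait) makes the cycle disappear.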