2006-11-07 18:35:30

by Alasdair G Kergon

Subject: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

From: Srinivasa Ds <[email protected]>

On debugging I found out that "dmsetup suspend <device name>" calls
freeze_bdev(), which locks "bd_mount_mutex" to make sure that no new mounts
happen on the bdev until thaw_bdev() is called. thaw_bdev() is called when
we resume the device through "dmsetup resume <device-name>". Hence we have
two processes, one of which locks "bd_mount_mutex" (dmsetup suspend) and
another (dmsetup resume) which unlocks it.

Signed-off-by: Srinivasa DS <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Eric Sandeen <[email protected]>
Cc: [email protected]

Index: linux-2.6.19-rc4/fs/block_dev.c
===================================================================
--- linux-2.6.19-rc4.orig/fs/block_dev.c 2006-11-07 17:06:20.000000000 +0000
+++ linux-2.6.19-rc4/fs/block_dev.c 2006-11-07 17:26:04.000000000 +0000
@@ -263,7 +263,7 @@ static void init_once(void * foo, kmem_c
{
memset(bdev, 0, sizeof(*bdev));
mutex_init(&bdev->bd_mutex);
- mutex_init(&bdev->bd_mount_mutex);
+ sema_init(&bdev->bd_mount_sem, 1);
INIT_LIST_HEAD(&bdev->bd_inodes);
INIT_LIST_HEAD(&bdev->bd_list);
#ifdef CONFIG_SYSFS
Index: linux-2.6.19-rc4/fs/buffer.c
===================================================================
--- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
+++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
@@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
{
struct super_block *sb;

- mutex_lock(&bdev->bd_mount_mutex);
+ if (down_trylock(&bdev->bd_mount_sem))
+ return -EBUSY;
+
sb = get_super(bdev);
if (sb && !(sb->s_flags & MS_RDONLY)) {
sb->s_frozen = SB_FREEZE_WRITE;
@@ -230,7 +232,7 @@ void thaw_bdev(struct block_device *bdev
drop_super(sb);
}

- mutex_unlock(&bdev->bd_mount_mutex);
+ up(&bdev->bd_mount_sem);
}
EXPORT_SYMBOL(thaw_bdev);

Index: linux-2.6.19-rc4/fs/super.c
===================================================================
--- linux-2.6.19-rc4.orig/fs/super.c 2006-11-07 17:06:20.000000000 +0000
+++ linux-2.6.19-rc4/fs/super.c 2006-11-07 17:26:04.000000000 +0000
@@ -735,9 +735,9 @@ int get_sb_bdev(struct file_system_type
* will protect the lockfs code from trying to start a snapshot
* while we are mounting
*/
- mutex_lock(&bdev->bd_mount_mutex);
+ down(&bdev->bd_mount_sem);
s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
- mutex_unlock(&bdev->bd_mount_mutex);
+ up(&bdev->bd_mount_sem);
if (IS_ERR(s))
goto error_s;

Index: linux-2.6.19-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.19-rc4.orig/include/linux/fs.h 2006-11-07 17:06:20.000000000 +0000
+++ linux-2.6.19-rc4/include/linux/fs.h 2006-11-07 17:26:04.000000000 +0000
@@ -456,7 +456,7 @@ struct block_device {
struct inode * bd_inode; /* will die */
int bd_openers;
struct mutex bd_mutex; /* open/close mutex */
- struct mutex bd_mount_mutex; /* mount mutex */
+ struct semaphore bd_mount_sem;
struct list_head bd_inodes;
void * bd_holder;
int bd_holders;


2006-11-07 20:18:12

by Mike Snitzer

Subject: Re: [dm-devel] [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On 11/7/06, Alasdair G Kergon <[email protected]> wrote:
> From: Srinivasa Ds <[email protected]>
>
> On debugging I found out that "dmsetup suspend <device name>" calls
> freeze_bdev(), which locks "bd_mount_mutex" to make sure that no new mounts
> happen on the bdev until thaw_bdev() is called. thaw_bdev() is called when
> we resume the device through "dmsetup resume <device-name>". Hence we have
> two processes, one of which locks "bd_mount_mutex" (dmsetup suspend) and
> another (dmsetup resume) which unlocks it.

Srinivasa's description of the patch just speaks to how freeze_bdev
and thaw_bdev are used by DM but completely skips justification for
switching from mutex to semaphore. Why is it beneficial and/or
necessary to use a semaphore instead of a mutex here?

Mike

2006-11-07 20:23:19

by Eric Sandeen

Subject: Re: [dm-devel] [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Mike Snitzer wrote:
> On 11/7/06, Alasdair G Kergon <[email protected]> wrote:
>> From: Srinivasa Ds <[email protected]>
>>
>> On debugging I found out that "dmsetup suspend <device name>" calls
>> freeze_bdev(), which locks "bd_mount_mutex" to make sure that no new mounts
>> happen on the bdev until thaw_bdev() is called. thaw_bdev() is called when
>> we resume the device through "dmsetup resume <device-name>". Hence we have
>> two processes, one of which locks "bd_mount_mutex" (dmsetup suspend) and
>> another (dmsetup resume) which unlocks it.
>
> Srinivasa's description of the patch just speaks to how freeze_bdev
> and thaw_bdev are used by DM but completely skips justification for
> switching from mutex to semaphore. Why is it beneficial and/or
> necessary to use a semaphore instead of a mutex here?

Because mutexes are not supposed to be released by anything other than
the thread that took them, as enforced by the various checking code and
noted in the docs...

"The stricter mutex API means you cannot use mutexes the same way you
can use semaphores: e.g. they cannot be used from an interrupt context,
nor can they be unlocked from a different context that which acquired
it."

this particular resource can sometimes be locked & unlocked from 2
different userspace threads.
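
A minimal sketch of the usage pattern Eric describes, using the semaphore
from the patch (illustrative only - the acquire and release happen in two
different processes):

	/* process A, "dmsetup suspend", inside freeze_bdev(): */
	down(&bdev->bd_mount_sem);	/* acquire, then return to userspace */

	/* ...arbitrary time later... */

	/* process B, "dmsetup resume", inside thaw_bdev(): */
	up(&bdev->bd_mount_sem);	/* legal for a semaphore; a mutex
					 * unlocked here would trip
					 * CONFIG_DEBUG_MUTEXES */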

-Eric

2006-11-07 20:29:09

by Andrew Morton

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tue, 7 Nov 2006 18:34:59 +0000
Alasdair G Kergon <[email protected]> wrote:

> From: Srinivasa Ds <[email protected]>
>
> On debugging I found out that "dmsetup suspend <device name>" calls
> freeze_bdev(), which locks "bd_mount_mutex" to make sure that no new mounts
> happen on the bdev until thaw_bdev() is called. thaw_bdev() is called when
> we resume the device through "dmsetup resume <device-name>". Hence we have
> two processes, one of which locks "bd_mount_mutex" (dmsetup suspend) and
> another (dmsetup resume) which unlocks it.

So... what does this have to do with switching from mutex to semaphore?

Perhaps this works around the debugging code which gets offended if a mutex
is unlocked by a process which didn't do the lock?

If so, it's a bit sad to switch to semaphore just because of some errant
debugging code. Perhaps it would be better to create a new
mutex_unlock_stfu() which suppresses the warning?


> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
> {
> struct super_block *sb;
>
> - mutex_lock(&bdev->bd_mount_mutex);
> + if (down_trylock(&bdev->bd_mount_sem))
> + return -EBUSY;
> +

This is a functional change which isn't described in the changelog. What's
happening here?

2006-11-07 22:45:48

by Eric Sandeen

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Andrew Morton wrote:

>> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
>> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
>> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
>> {
>> struct super_block *sb;
>>
>> - mutex_lock(&bdev->bd_mount_mutex);
>> + if (down_trylock(&bdev->bd_mount_sem))
>> + return -EBUSY;
>> +
>
> This is a functional change which isn't described in the changelog. What's
> happening here?

Only allow one bdev-freezer in at a time, rather than queueing them up?

-Eric

2006-11-07 23:01:18

by Andrew Morton

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tue, 07 Nov 2006 16:45:07 -0600
Eric Sandeen <[email protected]> wrote:

> Andrew Morton wrote:
>
> >> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
> >> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
> >> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
> >> {
> >> struct super_block *sb;
> >>
> >> - mutex_lock(&bdev->bd_mount_mutex);
> >> + if (down_trylock(&bdev->bd_mount_sem))
> >> + return -EBUSY;
> >> +
> >
> > This is a functional change which isn't described in the changelog. What's
> > happening here?
>
> Only allow one bdev-freezer in at a time, rather than queueing them up?
>

You're asking me? ;)

Guys, I'm going to park this patch pending a full description of what it
does, a description of what the above hunk is doing and pending an
examination of whether we'd be better off changing the mutex-debugging code
rather than switching to semaphores.

2006-11-07 23:08:12

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tuesday, 7 November 2006 23:45, Eric Sandeen wrote:
> Andrew Morton wrote:
>
> >> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
> >> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
> >> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
> >> {
> >> struct super_block *sb;
> >>
> >> - mutex_lock(&bdev->bd_mount_mutex);
> >> + if (down_trylock(&bdev->bd_mount_sem))
> >> + return -EBUSY;
> >> +
> >
> > This is a functional change which isn't described in the changelog. What's
> > happening here?
>
> Only allow one bdev-freezer in at a time, rather than queueing them up?

But freeze_bdev() is supposed to return the result of get_super(bdev)
_unconditionally_. Moreover, in its current form freeze_bdev() _cannot_
_fail_, so I don't see how this change can avoid breaking existing code.

For example freeze_filesystems() (recently added to -mm) will be broken
if the down_trylock() is unsuccessful.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-07 23:18:45

by Eric Sandeen

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Rafael J. Wysocki wrote:
> On Tuesday, 7 November 2006 23:45, Eric Sandeen wrote:
>> Andrew Morton wrote:
>>
>>>> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
>>>> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
>>>> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
>>>> {
>>>> struct super_block *sb;
>>>>
>>>> - mutex_lock(&bdev->bd_mount_mutex);
>>>> + if (down_trylock(&bdev->bd_mount_sem))
>>>> + return -EBUSY;
>>>> +
>>> This is a functional change which isn't described in the changelog. What's
>>> happening here?
>> Only allow one bdev-freezer in at a time, rather than queueing them up?
>
> But freeze_bdev() is supposed to return the result of get_super(bdev)
> _unconditionally_. Moreover, in its current form freeze_bdev() _cannot_
> _fail_, so I don't see how this change can avoid breaking existing code.

Well, it could return NULL. Is that a failure?

But, nobody is checking for an outright error, certainly. Especially
when the error hasn't been ERR_PTR'd. :) So I agree, that's not so good.
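
For reference, a sketch of what the hunk would need to become for callers to
be able to detect the failure, using the standard ERR_PTR()/IS_ERR() idiom
(this is not part of the posted patch):

	if (down_trylock(&bdev->bd_mount_sem))
		return ERR_PTR(-EBUSY);	/* freeze_bdev() returns a struct
					 * super_block *, so callers would
					 * then test with IS_ERR() */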

But, how is a stampede of fs-freezers -supposed- to work? I could
imagine something like a freezer count, and the filesystem is only
unfrozen after everyone has thawed? Or should only one freezer be
active at a time... which is what we have now I guess.

-Eric

2006-11-07 23:23:15

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tue, Nov 07, 2006 at 12:28:37PM -0800, Andrew Morton wrote:
> So... what does this have to do with switching from mutex to semaphore?

I should have summarised the discussion, sorry.

This has always been a semaphore, but it got changed to a mutex as part
of a global switch recently - and this indeed generated bug reports
from people who turn debugging on and report the error messages.

So the quick fix in this patch was to put things back how they were.


Now mutex.h states:
* - only the owner can unlock the mutex
* - task may not exit with mutex held

* These semantics are fully enforced when DEBUG_MUTEXES is
* enabled.

Which raises the question of whether the stated semantics are stricter than
is actually necessary.

> If so, it's a bit sad to switch to semaphore just because of some errant
> debugging code. Perhaps it would be better to create a new
> mutex_unlock_stfu() which suppresses the warning?

Ingo?

> > + if (down_trylock(&bdev->bd_mount_sem))
> > + return -EBUSY;
> > +

> This is a functional change which isn't described in the changelog. What's
> happening here?

Srinivasa added it as a precaution.
Device-mapper should never fall foul of it, but confirming that is not
trivial unless you know the code well, so it's a useful check to prevent a
bug creeping back in.

Alasdair
--
[email protected]

2006-11-07 23:34:37

by Alasdair G Kergon

Subject: Re: [dm-devel] [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tue, Nov 07, 2006 at 03:18:08PM -0500, Mike Snitzer wrote:
> Srinivasa's description of the patch just speaks to how freeze_bdev
> and thaw_bdev are used by DM but completely skips justification for
> switching from mutex to semaphore. Why is it beneficial and/or
> necessary to use a semaphore instead of a mutex here?

This used to be a semaphore; someone changed it to a mutex; we started
getting device-mapper bug reports from people because the usage breaks the
stated semantics for a mutex; this patch puts things back while we consider
if there's a better way of doing things.

Alasdair
--
[email protected]

2006-11-07 23:40:51

by Ingo Molnar

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex


* Andrew Morton <[email protected]> wrote:

> If so, it's a bit sad to switch to semaphore just because of some
> errant debugging code. Perhaps it would be better to create a new
> mutex_unlock_stfu() which suppresses the warning?

the code was not using the semaphore as a pure mutex. For example,
unlocking by a non-owner is not legal for a mutex. AFAICS the code returns
to userspace with a held in-kernel mutex. That makes it hard for the kernel
to determine, for example, whether the lock has been 'forgotten' or kept
like that intentionally. Alasdair, what is the motivation for doing it
like that?

Ingo

2006-11-07 23:44:17

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 00:18, Eric Sandeen wrote:
> Rafael J. Wysocki wrote:
> > On Tuesday, 7 November 2006 23:45, Eric Sandeen wrote:
> >> Andrew Morton wrote:
> >>
> >>>> --- linux-2.6.19-rc4.orig/fs/buffer.c 2006-11-07 17:06:20.000000000 +0000
> >>>> +++ linux-2.6.19-rc4/fs/buffer.c 2006-11-07 17:26:04.000000000 +0000
> >>>> @@ -188,7 +188,9 @@ struct super_block *freeze_bdev(struct b
> >>>> {
> >>>> struct super_block *sb;
> >>>>
> >>>> - mutex_lock(&bdev->bd_mount_mutex);
> >>>> + if (down_trylock(&bdev->bd_mount_sem))
> >>>> + return -EBUSY;
> >>>> +
> >>> This is a functional change which isn't described in the changelog. What's
> >>> happening here?
> >> Only allow one bdev-freezer in at a time, rather than queueing them up?
> >
> > But freeze_bdev() is supposed to return the result of get_super(bdev)
> > _unconditionally_. Moreover, in its current form freeze_bdev() _cannot_
> > _fail_, so I don't see how this change can avoid breaking existing code.
>
> Well, it could return NULL. Is that a failure?

It will only fail if get_super(bdev) returns NULL, but if you call
freeze_bdev(sb->s_bdev), then it can't do that.

> But, nobody is checking for an outright error, certainly. Especially
> when the error hasn't been ERR_PTR'd. :) So I agree, that's not so good.
>
> But, how is a stampede of fs-freezers -supposed- to work? I could
> imagine something like a freezer count, and the filesystem is only
> unfrozen after everyone has thawed? Or should only one freezer be
> active at a time... which is what we have now I guess.

I think it shouldn't be possible to freeze an fs more than once.

2006-11-07 23:50:18

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 12:05:49AM +0100, Rafael J. Wysocki wrote:
> But freeze_bdev() is supposed to return the result of get_super(bdev)
> _unconditionally_. Moreover, in its current form freeze_bdev() _cannot_
> _fail_, so I don't see how this change can avoid breaking existing code.
> For example freeze_filesystems() (recently added to -mm) will be broken
> if the down_trylock() is unsuccessful.

I hadn't noticed that -mm patch. I'll take a look. Up to now, device-mapper
(via dmsetup) and xfs (via xfs_freeze, which dates from before device-mapper
handled this automatically) were the only users. Only one freeze should be
issued at once. A freeze is a temporary thing, normally used while creating a
snapshot. (One problem we still have is lots of old documentation on the web
advising people to run xfs_freeze before creating device-mapper snapshots.)

You're right that the down_trylock idea is more trouble than it's worth and
should be scrapped.

Alasdair
--
[email protected]

2006-11-08 00:02:32

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 00:49, Alasdair G Kergon wrote:
> On Wed, Nov 08, 2006 at 12:05:49AM +0100, Rafael J. Wysocki wrote:
> > But freeze_bdev() is supposed to return the result of get_super(bdev)
> > _unconditionally_. Moreover, in its current form freeze_bdev() _cannot_
> > _fail_, so I don't see how this change can avoid breaking existing code.
> > For example freeze_filesystems() (recently added to -mm) will be broken
> > if the down_trylock() is unsuccessful.
>
> I hadn't noticed that -mm patch. I'll take a look. Up to now, device-mapper
> (via dmsetup) and xfs (via xfs_freeze, which dates from before device-mapper
> handled this automatically) were the only users. Only one freeze should be
> issued at once. A freeze is a temporary thing, normally used while creating a
> snapshot. (One problem we still have is lots of old documentation on the web
> advising people to run xfs_freeze before creating device-mapper snapshots.)
>
> You're right that the down_trylock idea is more trouble than it's worth and
> should be scrapped.

Well, having looked at it once again I think I was wrong that this change would
break freeze_filesystems(), because it only calls freeze_bdev() after checking
if sb->s_frozen is not set to SB_FREEZE_TRANS (freeze_filesystems() is only
called after all of the userspace processes have been frozen).

However, XFS_IOC_FREEZE happily returns 0 after calling freeze_bdev(),
apparently assuming that it won't fail.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-08 00:01:44

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 12:42:02AM +0100, Rafael J. Wysocki wrote:
> On Wednesday, 8 November 2006 00:18, Eric Sandeen wrote:
> > But, how is a stampede of fs-freezers -supposed- to work? I could
> > imagine something like a freezer count, and the filesystem is only
> > unfrozen after everyone has thawed? Or should only one freezer be
> > active at a time... which is what we have now I guess.
> I think it shouldn't be possible to freeze an fs more than once.

In device-mapper today, the only way to get more than one freeze on the
same device is to use xfs and issue xfs_freeze before creating an lvm snapshot
(or issuing the dmsetup equivalent), and at the moment we tell people not to do
that any more. The device-mapper API does not permit multiple freezes of the
same device. (The interesting question is actually whether the request should
be cascaded in any way when devices depend on other devices.)

Now if someone's introducing a new use for freeze_bdev, perhaps now's the time
to revisit the semantics and allow for concurrent freeze requests.

Alasdair
--
[email protected]

2006-11-08 02:31:24

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Tue, Nov 07, 2006 at 11:49:51PM +0000, Alasdair G Kergon wrote:
> I hadn't noticed that -mm patch. I'll take a look.

swsusp-freeze-filesystems-during-suspend-rev-2.patch

I think you need to give more thought to device-mapper
interactions here. If an underlying device is suspended
by device-mapper without freezing the filesystem (the
normal state) and you issue a freeze_bdev on a device
above it, the freeze_bdev may never return if it attempts
any synchronous I/O (as it should).

Try:
while process generating I/O to filesystem on LVM
issue dmsetup suspend --nolockfs (which the lvm2 tools often do)
try your freeze_filesystems()

Maybe: don't allow freeze_filesystems() to run when the system is in that
state; or, use device-mapper suspend instead of freeze_bdev directly where
dm is involved; or skip dm devices that are already frozen - all with
appropriate dependency tracking to process devices in the right order.

Alasdair
--
[email protected]

2006-11-08 03:35:18

by David Chinner

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 01:00:17AM +0100, Rafael J. Wysocki wrote:
>
> However, XFS_IOC_FREEZE happily returns 0 after calling freeze_bdev(),
> apparently assuming that it won't fail.

Because it _can't_ fail at this point. We've got an active superblock,
because we've had to open a fd on the filesystem to get to the point
where we can issue the ioctl.

Hence the get_super() call will always succeed and the freeze
will be executed if we are not read only. The superblock
state changes if the freeze goes ahead, and XFS uses this state
change to determine what to do when the thaw command is sent.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-08 08:29:21

by David Chinner

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 12:01:09AM +0000, Alasdair G Kergon wrote:
> On Wed, Nov 08, 2006 at 12:42:02AM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, 8 November 2006 00:18, Eric Sandeen wrote:
> > > But, how is a stampede of fs-freezers -supposed- to work? I could
> > > imagine something like a freezer count, and the filesystem is only
> > > unfrozen after everyone has thawed? Or should only one freezer be
> > > active at a time... which is what we have now I guess.
> > I think it shouldn't be possible to freeze an fs more than once.
>
> In device-mapper today, the only way to get more than one freeze on the
> same device is to use xfs and issue xfs_freeze before creating an lvm snapshot
> (or issuing the dmsetup equivalent), and at the moment we tell people not to do
> that any more.

But it's trivial to detect this condition - if (sb->s_frozen != SB_UNFROZEN)
then the filesystem is already frozen and you shouldn't try to freeze
it again. It's simple to do, and the whole problem then just goes away....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-08 09:54:58

by Arjan van de Ven

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex


>
> You're asking me? ;)
>
> Guys, I'm going to park this patch pending a full description of what it
> does, a description of what the above hunk is doing and pending an
> examination of whether we'd be better off changing the mutex-debugging code
> rather than switching to semaphores.

It's not used as a mutex. Sad but true. And it's not so easy to say "just
shut up the debug code": the interface allows a double "unlock", which is
fine for semaphores, for example. There fundamentally is no "owner" for
this case, and all the mutex concepts assume that there is an owner. If
the owner goes away, pointers to its task struct, for example, are no
longer valid (these are used by lockdep and the other debugging parts).
That's what makes the difference between a mutex and a semaphore: a mutex
has an owner, and several semantics follow from that. These semantics
allow a more efficient implementation (no multiple "owners" possible),
but once you go away from that fundamental property, soon we'll see "oh,
and it needs <this extra code> to cover the wider semantics correctly"...
and soon you have a semaphore again.

Let true semaphores be semaphores, and make all real mutexes mutexes.
But let's not make actual semaphores use mutex code...


--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-11-08 12:12:33

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 03:30, Alasdair G Kergon wrote:
> On Tue, Nov 07, 2006 at 11:49:51PM +0000, Alasdair G Kergon wrote:
> > I hadn't noticed that -mm patch. I'll take a look.
>
> swsusp-freeze-filesystems-during-suspend-rev-2.patch
>
> I think you need to give more thought to device-mapper
> interactions here. If an underlying device is suspended
> by device-mapper without freezing the filesystem (the
> normal state) and you issue a freeze_bdev on a device
> above it, the freeze_bdev may never return if it attempts
> any synchronous I/O (as it should).

Well, it looks like the interactions with dm add quite a bit of
complexity here.

> Try:
> while process generating I/O to filesystem on LVM
> issue dmsetup suspend --nolockfs (which the lvm2 tools often do)
> try your freeze_filesystems()

Okay, I will.

> Maybe: don't allow freeze_filesystems() to run when the system is in that
> state;

I'd like to avoid that (we may be running out of battery power at this point).

> or, use device-mapper suspend instead of freeze_bdev directly where
> dm is involved;

How do I check if dm is involved?

> or skip dm devices that are already frozen - all with
> appropriate dependency tracking to process devices in the right order.

I'd prefer this one, but probably the previous one is simpler to start with.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-08 14:26:04

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 07:27:22PM +1100, David Chinner wrote:
> But it's trivial to detect this condition - if (sb->s_frozen != SB_UNFROZEN)
> then the filesystem is already frozen and you shouldn't try to freeze
> it again. It's simple to do, and the whole problem then just goes away....

So is that another vote in support of explicitly supporting multiple concurrent
freeze requests, letting them all succeed, and only thawing after the last one
has requested its thaw? (It's not enough just to check SB_UNFROZEN - we also
need to track whether there are any other outstanding requests, to avoid the
risk of the filesystem getting unfrozen while something independent believes
it is still frozen.)
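
One shape the counting could take - a sketch only; bd_freeze_mutex,
bd_freeze_count, __freeze_super() and __thaw_super() are hypothetical names,
not existing 2.6.19 interfaces (and note this lock is taken and released in
the same context, so a plain mutex is fine for it):

	void freeze_bdev(struct block_device *bdev)
	{
		mutex_lock(&bdev->bd_freeze_mutex);
		if (bdev->bd_freeze_count++ == 0)
			__freeze_super(bdev);	/* first freezer does the work */
		mutex_unlock(&bdev->bd_freeze_mutex);
	}

	void thaw_bdev(struct block_device *bdev)
	{
		mutex_lock(&bdev->bd_freeze_mutex);
		if (bdev->bd_freeze_count && --bdev->bd_freeze_count == 0)
			__thaw_super(bdev);	/* last thaw really unfreezes */
		mutex_unlock(&bdev->bd_freeze_mutex);
	}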

Alasdair
--
[email protected]

2006-11-08 14:46:04

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 15:25, Alasdair G Kergon wrote:
> On Wed, Nov 08, 2006 at 07:27:22PM +1100, David Chinner wrote:
> > But it's trivial to detect this condition - if (sb->s_frozen != SB_UNFROZEN)
> > then the filesystem is already frozen and you shouldn't try to freeze
> > it again. It's simple to do, and the whole problem then just goes away....
>
> So is that another vote in support of explicitly supporting multiple concurrent
> freeze requests, letting them all succeed, and only thawing after the last one
> has requested its thaw? (It's not enough just to check SB_UNFROZEN - we also
> need to track whether there are any other outstanding requests, to avoid the
> risk of the filesystem getting unfrozen while something independent believes
> it is still frozen.)

So, I think, we need the following patch to fix freeze_filesystems().

Will it be enough to cover the interactions with dm?

Rafael


Signed-off-by: Rafael J. Wysocki <[email protected]>
---
fs/buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.19-rc5-mm1/fs/buffer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/fs/buffer.c
+++ linux-2.6.19-rc5-mm1/fs/buffer.c
@@ -264,7 +264,7 @@ void freeze_filesystems(void)
*/
list_for_each_entry_reverse(sb, &super_blocks, s_list) {
if (!sb->s_root || !sb->s_bdev ||
- (sb->s_frozen == SB_FREEZE_TRANS) ||
+ (sb->s_frozen != SB_UNFROZEN) ||
(sb->s_flags & MS_RDONLY) ||
(sb->s_flags & MS_FROZEN))
continue;

2006-11-08 15:26:36

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 08, 2006 at 03:43:26PM +0100, Rafael J. Wysocki wrote:
> Will it be enough to cover the interactions with dm?

There are cases where you *cannot* freeze the filesystem (unless
you wait for userspace processes to finish what they are doing) -
and only dm can tell you that.

The freeze_filesystems() code ought to do its best in any given
circumstances within the constraints.

So:
Get the filesystem's block device.
Check the full tree of devices that that block device depends upon
and for any device that belongs to device-mapper check if it is suspended.
If it is, skip the original device.

struct mapped_device *dm_get_md(dev_t dev);
int dm_suspended(struct mapped_device *md);
void dm_put(struct mapped_device *md);

Handling the tree is the difficult bit, but sysfs could help.
[You can get the device-mapper dependencies with:
struct mapped_device *dm_get_md(dev_t dev);
struct dm_table *dm_get_table(struct mapped_device *md);
struct list_head *dm_table_get_devices(struct dm_table *t);
void dm_table_put(struct dm_table *t);
void dm_put(struct mapped_device *md);
]
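
A minimal sketch of the suspended-device test using only the interfaces
listed above (the helper name is made up, and the dependency-tree walk
itself is omitted):

	/* returns non-zero if dev is a suspended device-mapper device;
	 * a NULL from dm_get_md() answers "is dm involved?" with no */
	static int dm_dev_is_suspended(dev_t dev)
	{
		struct mapped_device *md = dm_get_md(dev);
		int suspended;

		if (!md)
			return 0;
		suspended = dm_suspended(md);
		dm_put(md);
		return suspended;
	}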

Consider that you could have an already-frozen filesystem containing a loop
device containing a device-mapper device containing a not-frozen filesystem.
You won't be able to freeze that top filesystem because the I/O would
queue lower down the stack. (Similar problem if the device-mapper device
in the stack was suspended.)

Alasdair
--
[email protected]

2006-11-08 20:48:28

by Nigel Cunningham

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi.

On Wed, 2006-11-08 at 13:10 +0100, Rafael J. Wysocki wrote:
> On Wednesday, 8 November 2006 03:30, Alasdair G Kergon wrote:
> > On Tue, Nov 07, 2006 at 11:49:51PM +0000, Alasdair G Kergon wrote:
> > > I hadn't noticed that -mm patch. I'll take a look.
> >
> > swsusp-freeze-filesystems-during-suspend-rev-2.patch
> >
> > I think you need to give more thought to device-mapper
> > interactions here. If an underlying device is suspended
> > by device-mapper without freezing the filesystem (the
> > normal state) and you issue a freeze_bdev on a device
> > above it, the freeze_bdev may never return if it attempts
> > any synchronous I/O (as it should).
>
> Well, it looks like the interactions with dm add quite a bit of
> complexity here.
>
> > Try:
> > while process generating I/O to filesystem on LVM
> > issue dmsetup suspend --nolockfs (which the lvm2 tools often do)
> > try your freeze_filesystems()
>
> Okay, I will.
>
> > Maybe: don't allow freeze_filesystems() to run when the system is in that
> > state;
>
> I'd like to avoid that (we may be running out of battery power at this point).
>
> > or, use device-mapper suspend instead of freeze_bdev directly where
> > dm is involved;
>
> How do I check if dm is involved?
>
> > or skip dm devices that are already frozen - all with
> > appropriate dependency tracking to process devices in the right order.
>
> I'd prefer this one, but probably the previous one is simpler to start with.

Shouldn't we just go for the right thing to begin with? Otherwise we'll
just make more problems for ourselves later.

If we do this last one, I guess we want to do something like I was doing
before (creating a list of the devices we've frozen)?

Regards,

Nigel

2006-11-08 21:10:47

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 21:48, Nigel Cunningham wrote:
> Hi.
>
> On Wed, 2006-11-08 at 13:10 +0100, Rafael J. Wysocki wrote:
> > On Wednesday, 8 November 2006 03:30, Alasdair G Kergon wrote:
> > > On Tue, Nov 07, 2006 at 11:49:51PM +0000, Alasdair G Kergon wrote:
> > > > I hadn't noticed that -mm patch. I'll take a look.
> > >
> > > swsusp-freeze-filesystems-during-suspend-rev-2.patch
> > >
> > > I think you need to give more thought to device-mapper
> > > interactions here. If an underlying device is suspended
> > > by device-mapper without freezing the filesystem (the
> > > normal state) and you issue a freeze_bdev on a device
> > > above it, the freeze_bdev may never return if it attempts
> > > any synchronous I/O (as it should).
> >
> > Well, it looks like the interactions with dm add quite a bit of
> > complexity here.
> >
> > > Try:
> > > while process generating I/O to filesystem on LVM
> > > issue dmsetup suspend --nolockfs (which the lvm2 tools often do)
> > > try your freeze_filesystems()
> >
> > Okay, I will.
> >
> > > Maybe: don't allow freeze_filesystems() to run when the system is in that
> > > state;
> >
> > I'd like to avoid that (we may be running out of battery power at this point).
> >
> > > or, use device-mapper suspend instead of freeze_bdev directly where
> > > dm is involved;
> >
> > How do I check if dm is involved?
> >
> > > or skip dm devices that are already frozen - all with
> > > appropriate dependency tracking to process devices in the right order.
> >
> > I'd prefer this one, but probably the previous one is simpler to start with.
>
> Shouldn't we just go for the right thing to begin with? Otherwise we'll
> just make more problems for ourselves later.
>
> If we do this last one, I guess we want to do something like I was doing
> before (creating a list of the devices we've frozen)?

Well, the frozen superblocks have MS_FROZEN set. What is the other list
needed for?

Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-08 23:08:37

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 8 November 2006 16:25, Alasdair G Kergon wrote:
> On Wed, Nov 08, 2006 at 03:43:26PM +0100, Rafael J. Wysocki wrote:
> > Will it be enough to cover the interactions with dm?
>
> There are cases where you *cannot* freeze the filesystem (unless
> you wait for userspace processes to finish what they are doing) -
> and only dm can tell you that.
>
> The freeze_filesystems() code ought to do its best in any given
> circumstances within the constraints.
>
> So:
> Get the filesystem's block device.
> Check the full tree of devices that that block device depends upon
> and for any device that belongs to device-mapper check if it is suspended.
> If it is, skip the original device.
>
> struct mapped_device *dm_get_md(dev_t dev);
> int dm_suspended(struct mapped_device *md);
> void dm_put(struct mapped_device *md);
>
> Handling the tree is the difficult bit, but sysfs could help.
> [You can get the device-mapper dependencies with:
> struct mapped_device *dm_get_md(dev_t dev);
> struct dm_table *dm_get_table(struct mapped_device *md);
> struct list_head *dm_table_get_devices(struct dm_table *t);
> void dm_table_put(struct dm_table *t);
> void dm_put(struct mapped_device *md);
> ]
>
> Consider that you could have an already-frozen filesystem containing a loop
> device containing a device-mapper device containing a not-frozen filesystem.

I think the last point is handled correctly by freezing the filesystems in the
reverse order - unless the fs below the loop has been frozen before by
someone else, but I guess that would lead to problems anyway.

> You won't be able to freeze that top filesystem because the I/O would
> queue lower down the stack. (Similar problem if the device-mapper device
> in the stack was suspended.)

The case of a suspended dm device in the stack is not handled, however.

Is there any list of all dm devices in the system?

Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-09 09:12:20

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > swsusp-freeze-filesystems-during-suspend-rev-2.patch
> >
> > I think you need to give more thought to device-mapper
> > interactions here. If an underlying device is suspended
> > by device-mapper without freezing the filesystem (the
> > normal state) and you issue a freeze_bdev on a device
> > above it, the freeze_bdev may never return if it attempts
> > any synchronous I/O (as it should).
>
> Well, it looks like the interactions with dm add quite a bit of
> complexity here.

What about just fixing xfs (thou shall not write to disk when kernel
threads are frozen), and getting rid of blockdev freezing?

Pavel
--
Thanks for all the (sleeping) penguins.

2006-11-09 15:54:58

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Wednesday, 8 November 2006 19:09, Pavel Machek wrote:
> Hi!
>
> > > swsusp-freeze-filesystems-during-suspend-rev-2.patch
> > >
> > > I think you need to give more thought to device-mapper
> > > interactions here. If an underlying device is suspended
> > > by device-mapper without freezing the filesystem (the
> > > normal state) and you issue a freeze_bdev on a device
> > > above it, the freeze_bdev may never return if it attempts
> > > any synchronous I/O (as it should).
> >
> > Well, it looks like the interactions with dm add quite a bit of
> > complexity here.
>
> What about just fixing xfs (thou shall not write to disk when kernel
> threads are frozen), and getting rid of blockdev freezing?

Well, first I must admit you were absolutely right to be suspicious about
this stuff.

OTOH I have no idea _how_ we can tell xfs that the processes have been
frozen. Should we introduce a global flag for that or something?

Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-09 16:00:18

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > Well, it looks like the interactions with dm add quite a bit of
> > > complexity here.
> >
> > What about just fixing xfs (thou shall not write to disk when kernel
> > threads are frozen), and getting rid of blockdev freezing?
>
> Well, first I must admit you were absolutely right to be suspicious about
> this stuff.

(OTOH your patch found real bugs in suspend.c, so...)

> OTOH I have no idea _how_ we can tell xfs that the processes have been
> frozen. Should we introduce a global flag for that or something?

I guess XFS should just do all the writes from process context, and
refuse any writing when its threads are frozen... I actually still
believe it is doing the right thing, because you can't really write to
disk from timer.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-09 20:02:14

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Thursday, 9 November 2006 17:00, Pavel Machek wrote:
> Hi!
>
> > > > Well, it looks like the interactions with dm add quite a bit of
> > > > complexity here.
> > >
> > > What about just fixing xfs (thou shall not write to disk when kernel
> > > threads are frozen), and getting rid of blockdev freezing?
> >
> > Well, first I must admit you were absolutely right to be suspicious about
> > this stuff.
>
> (OTOH your patch found real bugs in suspend.c, so...)
>
> > OTOH I have no idea _how_ we can tell xfs that the processes have been
> > frozen. Should we introduce a global flag for that or something?
>
> I guess XFS should just do all the writes from process context, and
> refuse any writing when its threads are frozen... I actually still
> believe it is doing the right thing, because you can't really write to
> disk from timer.

This is from a work queue, so in fact from a process context, but from
a process that is running with PF_NOFREEZE.

And I don't think we can forbid filesystems from using work queues. IMO it's
a legitimate thing for an fs to do.

_But_.

Alasdair, do I think correctly that if there's a suspended device-mapper
device below an non-frozen filesystem, then sys_sync() would block just
as well as freeze_bdev() on this filesystem?

Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-09 21:17:41

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > OTOH I have no idea _how_ we can tell xfs that the processes have been
> > > frozen. Should we introduce a global flag for that or something?
> >
> > I guess XFS should just do all the writes from process context, and
> > refuse any writing when its threads are frozen... I actually still
> > believe it is doing the right thing, because you can't really write to
> > disk from timer.
>
> This is from a work queue, so in fact from a process context, but from
> a process that is running with PF_NOFREEZE.

Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
are free to use threads/work queues/whatever, but refrigerator should
mean "no writes to filesystem" for them...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-09 21:21:42

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Thursday, 9 November 2006 22:17, Pavel Machek wrote:
> Hi!
>
> > > > OTOH I have no idea _how_ we can tell xfs that the processes have been
> > > > frozen. Should we introduce a global flag for that or something?
> > >
> > > I guess XFS should just do all the writes from process context, and
> > > refuse any writing when its threads are frozen... I actually still
> > > believe it is doing the right thing, because you can't really write to
> > > disk from timer.
> >
> > This is from a work queue, so in fact from a process context, but from
> > a process that is running with PF_NOFREEZE.
>
> Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> are free to use threads/work queues/whatever, but refrigerator should
> mean "no writes to filesystem" for them...

But how do we differentiate the worker threads used by filesystems from
the other ones?

BTW, I think that worker_threads run with PF_NOFREEZE for a reason,
but what exactly is it?

Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-09 21:42:19

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > This is from a work queue, so in fact from a process context, but from
> > > a process that is running with PF_NOFREEZE.
> >
> > Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> > are free to use threads/work queues/whatever, but refrigerator should
> > mean "no writes to filesystem" for them...
>
> But how do we differentiate the worker threads used by filesystems from
> the other ones?

I'd expect filesystems to do &~ PF_NOFREEZE by hand.
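
In code, the worker thread would do something like this (a sketch;
do_pending_work() is a stand-in for the thread's real work-loop body):

	current->flags &= ~PF_NOFREEZE;	/* opt back in to the refrigerator */

	while (!kthread_should_stop()) {
		try_to_freeze();	/* park here while processes are frozen */
		do_pending_work();
	}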

> BTW, I think that worker_threads run with PF_NOFREEZE for a reason,
> but what exactly is it?

I do not think we had particularly good reasons...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-09 22:24:23

by Rafael J. Wysocki

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Thursday, 9 November 2006 22:41, Pavel Machek wrote:
> Hi!
>
> > > > This is from a work queue, so in fact from a process context, but from
> > > > a process that is running with PF_NOFREEZE.
> > >
> > > Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> > > are free to use threads/work queues/whatever, but refrigerator should
> > > mean "no writes to filesystem" for them...
> >
> > But how do we differentiate the worker threads used by filesystems from
> > the other ones?
>
> I'd expect filesystems to do &~ PF_NOFREEZE by hand.
>
> > BTW, I think that worker_threads run with PF_NOFREEZE for a reason,
> > but what exactly is it?
>
> I do not think we had particularly good reasons...

Well, it looks like quite a lot of drivers depend on them, including libata.

I think we can add a flag to __create_workqueue() that will indicate if
this one is to be running with PF_NOFREEZE and a corresponding macro like
create_freezable_workqueue() to be used wherever we want the worker thread
to freeze (in which case it should be calling try_to_freeze() somewhere).
Then, we can teach filesystems to use this macro instead of
create_workqueue().
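
One possible shape for those macros (a sketch only - in 2.6.19
__create_workqueue() takes (name, singlethread), so the freezable flag here
is a hypothetical extra argument):

	#define create_workqueue(name) \
		__create_workqueue((name), 0, 0)
	#define create_freezable_workqueue(name) \
		__create_workqueue((name), 0, 1)

	/* a freezable worker would clear PF_NOFREEZE and call try_to_freeze() */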

Having done that we'd be able to drop the freezing of bdevs patch and forget
about the dm-related complexity.

[Still I wonder if the sys_sync() in freeze_processes() is actually safe if
there's a suspended dm device somewhere in the stack, because if it isn't,
the freezing of bdevs would be no more dangerous than the thing that we're
already doing.]

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-09 23:12:06

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > > > This is from a work queue, so in fact from a process context, but from
> > > > > a process that is running with PF_NOFREEZE.
> > > >
> > > > Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> > > > are free to use threads/work queues/whatever, but refrigerator should
> > > > mean "no writes to filesystem" for them...
> > >
> > > But how do we differentiate the worker threads used by filesystems from
> > > the other ones?
> >
> > I'd expect filesystems to do &~ PF_NOFREEZE by hand.
> >
> > > BTW, I think that worker_threads run with PF_NOFREEZE for a reason,
> > > but what exactly is it?
> >
> > I do not think we had particularly good reasons...
>
> Well, it looks like quite a lot of drivers depend on them, including libata.
>
> I think we can add a flag to __create_workqueue() that will indicate if
> this one is to be running with PF_NOFREEZE and a corresponding macro like
> create_freezable_workqueue() to be used wherever we want the worker thread
> to freeze (in which case it should be calling try_to_freeze() somewhere).
> Then, we can teach filesystems to use this macro instead of
> create_workqueue().

Works for me.

> Having done that we'd be able to drop the freezing of bdevs patch and forget
> about the dm-related complexity.

yes.

> [Still I wonder if the sys_sync() in freeze_processes() is actually safe if
> there's a suspended dm device somewhere in the stack, because if it isn't,
> the freezing of bdevs would be no more dangerous than the thing that we're
> already doing.]

? Not sure if I quite understand, but if dm breaks sync... something
is terribly wrong with dm. And we do a simple sys_sync()... so I do not
think we have a problem.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-09 23:25:27

by Alasdair G Kergon

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri, Nov 10, 2006 at 12:11:46AM +0100, Pavel Machek wrote:
> ? Not sure if I quite understand, but if dm breaks sync... something
> is terribly wrong with dm. And we do a simple sys_sync()... so I do not
> think we have a problem.

If you want to handle arbitrary kernel state, you might have a device-mapper
device somewhere lower down the stack of devices that is queueing any I/O
that reaches it. So anything waiting for I/O completion will wait until
the dm process that suspended that device has finished whatever it is doing
- and that might be a quick thing carried out by a userspace lvm tool, or
a long thing carried out by an administrator using dmsetup.

I'm guessing you need a way of detecting such state lower down the stack,
and then optionally either aborting the operation, telling the user it can't
be done at present; waiting for however long it takes (perhaps forever, if
the admin disappeared); or, more probably, skipping those devices on a
'best endeavours' basis.

I'm suggesting sysfs or something built on bd_claim might offer a mechanism
for traversing the stack.

Alasdair
--
[email protected]

2006-11-09 23:33:27

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> On Fri, Nov 10, 2006 at 12:11:46AM +0100, Pavel Machek wrote:
> > ? Not sure if I quite understand, but if dm breaks sync... something
> > is terribly wrong with dm. And we do a simple sys_sync()... so I do not
> > think we have a problem.
>
> If you want to handle arbitrary kernel state, you might have a device-mapper
> device somewhere lower down the stack of devices that is queueing any I/O
> that reaches it. So anything waiting for I/O completion will wait until
> the dm process that suspended that device has finished whatever it is doing
> - and that might be a quick thing carried out by a userspace lvm tool, or
> a long thing carried out by an administrator using dmsetup.
>
> I'm guessing you need a way of detecting such state lower down the stack,
> and then optionally either aborting the operation, telling the user it can't
> be done at present; waiting for however long it takes (perhaps forever, if
> the admin disappeared); or, more probably, skipping those devices on a
> 'best endeavours' basis.

Okay, so you claim that sys_sync can stall, waiting for the administrator?

In such a case we can simply do one sys_sync() before we start freezing
userspace... or just move the only sys_sync() there. That way, the admin
has a chance to unlock his system.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-10 00:34:45

by David Chinner

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Thu, Nov 09, 2006 at 05:00:03PM +0100, Pavel Machek wrote:
> > > > Well, it looks like the interactions with dm add quite a bit of
> > > > complexity here.
> > >
> > > What about just fixing xfs (thou shall not write to disk when kernel
> > > threads are frozen), and getting rid of blockdev freezing?
> >
> > Well, first I must admit you were absolutely right to be suspicious about
> > this stuff.
>
> (OTOH your patch found real bugs in suspend.c, so...)
>
> > OTOH I have no idea _how_ we can tell xfs that the processes have been
> > frozen. Should we introduce a global flag for that or something?
>
> I guess XFS should just do all the writes from process context, and
> refuse any writing when its threads are frozen... I actually still
> believe it is doing the right thing, because you can't really write to
> disk from timer.

As per the recent thread about this, XFS threads suspend correctly
and XFS doesn't issue I/O from timers.

The problem appears to be per-cpu workqueues that don't get
suspended because suspend does not shut down workqueue threads. You
haven't quiesced the filesystem, you haven't shut down work
queues, but you're expecting the filesystem to magically stop
using them when suspend starts up?

You can't lay the blame on any subsystem for using an interface that
suspend doesn't quiesce when you *haven't told the subsystem it
should suspend itself*. This is true regardless of whether the
subsystem is a device driver, part of the network stack or a
filesystem.

e.g. A struct device has suspend and resume callouts so that each
device can be safely suspended and resumed. Suspend uses these and
you sure as hell don't expect a device to work properly after a
resume if you don't use these interfaces.

So why do you think filesystems should be treated differently?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-10 00:55:21

by David Chinner

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Thu, Nov 09, 2006 at 10:17:22PM +0100, Pavel Machek wrote:
> Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> are free to use threads/work queues/whatever, but refrigerator should
> mean "no writes to filesystem" for them...

Freezing the filesystem is the way to tell the filesystem "no more
writes to the filesystem".

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-10 00:58:27

by David Chinner

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> I think we can add a flag to __create_workqueue() that will indicate if
> this one is to be running with PF_NOFREEZE and a corresponding macro like
> create_freezable_workqueue() to be used wherever we want the worker thread
> to freeze (in which case it should be calling try_to_freeze() somewhere).
> Then, we can teach filesystems to use this macro instead of
> create_workqueue().

At what point does the workqueue get frozen? i.e. how does this
guarantee an unfrozen filesystem will end up in a consistent
state?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-10 10:21:28

by Alan

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Thursday, 2006-11-09 at 22:17 +0100, Pavel Machek wrote:
> Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> are free to use threads/work queues/whatever, but refrigerator should
> mean "no writes to filesystem" for them...

You can't go around altering the flags of another process - what locking
are you relying upon for this?


2006-11-10 10:37:19

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

HI!

> > Why not simply &~ PF_NOFREEZE on that particular process? Filesystems
> > are free to use threads/work queues/whatever, but refrigerator should
> > mean "no writes to filesystem" for them...
>
> You can't go around altering the flags of another process - what locking
> are you relying upon for this?

Well, my idea was for the process to &~ PF_NOFREEZE on itself.

We are currently (kernel/power/process.c) using &p->sighand->siglock
for some accesses, and no locking for others. There were some attempts
to fix this, but they led to problems, and we are not hitting any
problems... part of the reason is that we hot-unplug non-boot CPUs before
freezing processes, so we are running UP at that point.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-10 10:39:06

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > OTOH I have no idea _how_ we can tell xfs that the processes have been
> > > frozen. Should we introduce a global flag for that or something?
> >
> > I guess XFS should just do all the writes from process context, and
> > refuse any writing when its threads are frozen... I actually still
> > believe it is doing the right thing, because you can't really write to
> > disk from timer.
>
> As per the recent thread about this, XFS threads suspend correctly
> and XFS doesn't issue I/O from timers.
>
> The problem appears to be per-cpu workqueues that don't get
> suspended because suspend does not shut down workqueue threads. You

Workqueues should be easy to handle: current->flags &= ~PF_NOFREEZE,
and the problem should be solved.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-10 10:40:06

by Pavel Machek

Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri 2006-11-10 11:57:49, David Chinner wrote:
> On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> > I think we can add a flag to __create_workqueue() that will indicate if
> > this one is to be running with PF_NOFREEZE and a corresponding macro like
> > create_freezable_workqueue() to be used wherever we want the worker thread
> > to freeze (in which case it should be calling try_to_freeze() somewhere).
> > Then, we can teach filesystems to use this macro instead of
> > create_workqueue().
>
> At what point does the workqueue get frozen? i.e. how does this
> guarantee an unfrozen filesystem will end up in a consistent
> state?

Snapshot is atomic; workqueue will be unfrozen with everyone else, but
as there were no writes in the meantime, there should be no problems.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-10 12:06:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Friday, 10 November 2006 00:32, Pavel Machek wrote:
> Hi!
>
> > On Fri, Nov 10, 2006 at 12:11:46AM +0100, Pavel Machek wrote:
> > > ? Not sure if I quite understand, but if dm breaks sync... something
> > > is terribly wrong with dm. And we do simple sys_sync()... so I do not
> > > think we have a problem.
> >
> > If you want to handle arbitrary kernel state, you might have a device-mapper
> > device somewhere lower down the stack of devices that is queueing any I/O
> > that reaches it. So anything waiting for I/O completion will wait until
> > the dm process that suspended that device has finished whatever it is doing
> > - and that might be a quick thing carried out by a userspace lvm tool, or
> > a long thing carried out by an administrator using dmsetup.
> >
> > I'm guessing you need a way of detecting such state lower down the stack
> > then optionally either aborting the operation telling the user it can't be
> > done at present; waiting for however long it takes (perhaps for ever if
> > the admin disappeared); or more probably skipping those devices on a
> > 'best endeavours' basis.
>
> Okay, so you claim that sys_sync can stall, waiting for administrator?
>
> In such case we can simply do one sys_sync() before we start freezing
> userspace... or just move the only sys_sync() there. That way, admin
> has chance to unlock his system.

Well, this is a different story.

My point is that if we call sys_sync() _anyway_ before calling
freeze_filesystems(), then freeze_filesystems() is _safe_ (either the
sys_sync() blocks, or it doesn't in which case freeze_filesystems() won't
block either).

This means, however, that we can leave the patch as is (well, with the minor
fix I have already posted), for now, because it doesn't make things any
worse, but:
(a) it prevents xfs from being corrupted and
(b) it prevents journaling filesystems in general from replaying journals
after a failing resume.

Still, there is a problem with the possibility of potential lock-up - either
with the bdevs-freezing patch or without it - due to a suspended dm device
down the stack and solving that is a _separate_ issue.

Now if we use the userland suspend, there's no problem at all, I think,
because s2disk calls sync() before it goes to suspend_system(), so the
admin will have a chance to unclock the system and everything is fine and
dandy (although it should be documented somewhere, IMHO).

However, if the built-in swsusp is used, then well ... <looks> ... we can put
a call to sys_sync() before prepare_processes() in pm_suspend_disk().
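
Roughly like this (a sketch against 2.6.19-era kernel/power/disk.c;
error handling and the rest of the sequence elided):

	static int pm_suspend_disk(void)
	{
		int error;

		/* flush now, while userspace -- and any admin tool holding
		 * a dm device suspended -- can still unblock us */
		sys_sync();

		error = prepare_processes();
		if (error)
			return error;

		/* ... remainder of the suspend sequence unchanged ... */
	}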

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-12 18:43:32

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > Okay, so you claim that sys_sync can stall, waiting for administrator?
> >
> > In such case we can simply do one sys_sync() before we start freezing
> > userspace... or just move the only sys_sync() there. That way, admin
> > has chance to unlock his system.
>
> Well, this is a different story.
>
> My point is that if we call sys_sync() _anyway_ before calling
> freeze_filesystems(), then freeze_filesystems() is _safe_ (either the
> sys_sync() blocks, or it doesn't in which case freeze_filesystems() won't
> block either).
>
> This means, however, that we can leave the patch as is (well, with the minor
> fix I have already posted), for now, because it doesn't make things any
> worse, but:
> (a) it prevents xfs from being corrupted and

I'd really prefer it to be fixed by 'freezeable workqueues'. Can you
point me to the sources -- which xfs workqueues are problematic?

(It would be nice to fix that for 2.6.19, and full bdev freezing looks
intrusive to me).

> (b) it prevents journaling filesystems in general from replaying journals
> after a failing resume.

I do not see b) as a useful goal.
Pavel

--
Thanks for all the (sleeping) penguins.

2006-11-12 21:56:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Sunday, 12 November 2006 19:43, Pavel Machek wrote:
> Hi!
>
> > > Okay, so you claim that sys_sync can stall, waiting for administrator?
> > >
> > > In such case we can simply do one sys_sync() before we start freezing
> > > userspace... or just move the only sys_sync() there. That way, admin
> > > has chance to unlock his system.
> >
> > Well, this is a different story.
> >
> > My point is that if we call sys_sync() _anyway_ before calling
> > freeze_filesystems(), then freeze_filesystems() is _safe_ (either the
> > sys_sync() blocks, or it doesn't in which case freeze_filesystems() won't
> > block either).
> >
> > This means, however, that we can leave the patch as is (well, with the minor
> > fix I have already posted), for now, because it doesn't make things any
> > worse, but:
> > (a) it prevents xfs from being corrupted and
>
> I'd really prefer it to be fixed by 'freezeable workqueues'. Can you
> point me to the sources -- which xfs workqueues are problematic?

I think these:

http://www.linux-m32r.org/lxr/http/source/fs/xfs/linux-2.6/xfs_buf.c?a=x86_64#L1829

But then, there's also this one

http://www.linux-m32r.org/lxr/http/source/fs/reiserfs/journal.c?a=x86_64#L2837

and some others that I had no time to trace.

> (It would be nice to fix that for 2.6.19, and full bdev freezing looks
> intrusive to me).

IMHO changing __create_workqueue() will be even more intrusive.

> > (b) it prevents journaling filesystems in general from replaying journals
> > after a failing resume.
>
> I do not see b) as a useful goal.

Okay, let's forget it.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-12 22:31:36

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri, Nov 10, 2006 at 11:39:42AM +0100, Pavel Machek wrote:
> On Fri 2006-11-10 11:57:49, David Chinner wrote:
> > On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> > > I think we can add a flag to __create_workqueue() that will indicate if
> > > this one is to be running with PF_NOFREEZE and a corresponding macro like
> > > create_freezable_workqueue() to be used wherever we want the worker thread
> > > to freeze (in which case it should be calling try_to_freeze() somewhere).
> > > Then, we can teach filesystems to use this macro instead of
> > > create_workqueue().
> >
> > At what point does the workqueue get frozen? i.e. how does this
> > guarantee an unfrozen filesystem will end up in a consistent
> > state?
>
> Snapshot is atomic; workqueue will be unfrozen with everyone else, but
> as there were no writes in the meantime, there should be no problems.

That doesn't answer my question - when in the sequence of freezing
do you propose disabling the workqueues? before the kernel threads,
after the kernel threads, before you sync the filesystem? And how does
freezing them at that point in time guarantee consistent filesystem state?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-12 22:45:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Sunday, 12 November 2006 23:30, David Chinner wrote:
> On Fri, Nov 10, 2006 at 11:39:42AM +0100, Pavel Machek wrote:
> > On Fri 2006-11-10 11:57:49, David Chinner wrote:
> > > On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> > > > I think we can add a flag to __create_workqueue() that will indicate if
> > > > this one is to be running with PF_NOFREEZE and a corresponding macro like
> > > > create_freezable_workqueue() to be used wherever we want the worker thread
> > > > to freeze (in which case it should be calling try_to_freeze() somewhere).
> > > > Then, we can teach filesystems to use this macro instead of
> > > > create_workqueue().
> > >
> > > At what point does the workqueue get frozen? i.e. how does this
> > > guarantee an unfrozen filesystem will end up in a consistent
> > > state?
> >
> > Snapshot is atomic; workqueue will be unfrozen with everyone else, but
> > as there were no writes in the meantime, there should be no problems.
>
> That doesn't answer my question - when in the sequence of freezing
> do you propose disabling the workqueues? before the kernel threads,
> after the kernel threads, before you sync the filesystem?

After the sync, along with the freezing of kernel threads.

> And how does freezing them at that point in time guarantee consistent
> filesystem state?

If the work queues are frozen, there won't be any fs-related activity _after_
we create the suspend image. The sync is done after the userland has been
frozen, so if the resume is unsuccessful, we'll be able to recover the state
of the fs right before the sync, and if the resume is successful, we'll be
able to continue (the state of memory will be the same as before the creation
of the suspend image and the state of the disk will be the same as before the
creation of the suspend image).

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-12 23:31:39

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Sun, Nov 12, 2006 at 06:43:10PM +0000, Pavel Machek wrote:
> Hi!
>
> > > Okay, so you claim that sys_sync can stall, waiting for administrator?
> > >
> > > In such case we can simply do one sys_sync() before we start freezing
> > > userspace... or just move the only sys_sync() there. That way, admin
> > > has chance to unlock his system.
> >
> > Well, this is a different story.
> >
> > My point is that if we call sys_sync() _anyway_ before calling
> > freeze_filesystems(), then freeze_filesystems() is _safe_ (either the
> > sys_sync() blocks, or it doesn't in which case freeze_filesystems() won't
> > block either).
> >
> > This means, however, that we can leave the patch as is (well, with the minor
> > fix I have already posted), for now, because it doesn't make things any
> > worse, but:
> > (a) it prevents xfs from being corrupted and
>
> I'd really prefer it to be fixed by 'freezeable workqueues'.

I'd prefer that you just freeze the filesystem and let the
filesystem do things correctly.

> Can you
> point me to the sources -- which xfs workqueues are problematic?

AFAIK, its the I/O completion workqueues that are causing problems.
(fs/xfs/linux-2.6/xfs_buf.c) However, thinking about it, I'm not
sure that the work queues being left unfrozen is the real problem.

i.e. after a sync there's still I/O outstanding (e.g. metadata in
the log but not on disk), and because the kernel threads are frozen
some time after the sync, we could have issued this delayed write
metadata to disk after the sync. With XFS, we can have a queue of
thousands of metadata buffers for delwri, and they are all issued
async and can take many seconds for the I/O to complete.

The I/O completion workqueues will continue to run until all I/O
stops, and metadata I/O completion will change the state of the
filesystem in memory.

However, even if you stop the workqueue processing, you're still
going to have to wait for all I/O completion to occur before
snapshotting memory because having any I/O complete changes memory
state. Hence I fail to see how freezing the workqueues really helps
at all here....

Given that the only way to track and block on these delwri metadata
buffers is to issue a sync flush rather than a async flush, suspend
has to do something different to guarantee that we block until all
those I/Os have completed. i.e. freeze the filesystem.
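
As a sketch of what that would mean on the suspend side (a
hypothetical caller; bdev is the device being quiesced,
create_suspend_image() is a stand-in, and with the trylock variant
from this series the caller would also have to handle an "already
frozen" error return):

	struct super_block *sb;
	int error;

	/* flush and quiesce the fs, waiting out the delwri I/O */
	sb = freeze_bdev(bdev);

	error = create_suspend_image();

	/* unblock writers again; journal recovery covers a failure */
	thaw_bdev(bdev, sb);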

So the problem, IMO, is suspend is not telling the filesystem
to stop doing stuff and so we are getting caught out by doing
stuff that suspend assumes won't happen but does nothing
to prevent.

> > (b) it prevents journaling filesystems in general from replaying journals
> > after a failing resume.

This is incorrect. Freezing an XFS filesystem _ensures_ that log
replay occurs on thaw or a failed resume. XFS specifically dirties
the log after a freeze down to a consistent state so that the
unlinked inode lists get processed by recovery on thaw/next mount.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-13 05:45:00

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > On Fri, Nov 10, 2006 at 11:39:42AM +0100, Pavel Machek wrote:
> > > On Fri 2006-11-10 11:57:49, David Chinner wrote:
> > > > On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> > > > > I think we can add a flag to __create_workqueue() that will indicate if
> > > > > this one is to be running with PF_NOFREEZE and a corresponding macro like
> > > > > create_freezable_workqueue() to be used wherever we want the worker thread
> > > > > to freeze (in which case it should be calling try_to_freeze() somewhere).
> > > > > Then, we can teach filesystems to use this macro instead of
> > > > > create_workqueue().
> > > >
> > > > At what point does the workqueue get frozen? i.e. how does this
> > > > guarantee an unfrozen filesystem will end up in a consistent
> > > > state?
> > >
> > > Snapshot is atomic; workqueue will be unfrozen with everyone else, but
> > > as there were no writes in the meantime, there should be no problems.
> >
> > That doesn't answer my question - when in the sequence of freezing
> > do you propose disabling the workqueues? before the kernel threads,
> > after the kernel threads, before you sync the filesystem?
>
> After the sync, along with the freezing of kernel threads.

Before or after freezing of the kthreads? Order _could_ be
important, and different filesystems might require different
orders. What then?

> > And how does freezing them at that point in time guarantee consistent
> > filesystem state?
>
> If the work queues are frozen, there won't be any fs-related activity _after_
> we create the suspend image.

What about if there is still I/O in progress (i.e. kthread wins race and
issues async I/O after the sync but before it's frozen) - freezing the
workqueues does not prevent this activity and memory state will continue to
change as long as there is I/O completing...

> The sync is done after the userland has been
> frozen, so if the resume is unsuccessful, we'll be able to recover the state
> of the fs right before the sync,

Yes, in most cases.

> and if the resume is successful, we'll be
> able to continue (the state of memory will be the same as before the creation
> of the suspend image and the state of the disk will be the same as before the
> creation of the suspend image).

Assuming that you actually suspended an idle filesystem, which sync does not
guarantee you. Rather than assuming the filesystem is idle, why not guarantee
that it is idle by freezing it?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-13 07:39:13

by Stefan Seyfried

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Sun, Nov 12, 2006 at 06:43:10PM +0000, Pavel Machek wrote:
> Hi!

> > (b) it prevents journaling filesystems in general from replaying journals
> > after a failing resume.
>
> > I do not see b) as a useful goal.

I'm not sure, but I guess it also solves the "GRUB takes a minute to load kernel
and initrd from /boot on suspended reiserfs" problem, which I see as a _very_
useful goal.

Often, most of the time needed for resume is spent by GRUB loading the
kernel/initrd from a journaled FS.
--
Stefan Seyfried
QA / R&D Team Mobile Devices | "Any ideas, John?"
SUSE LINUX Products GmbH, Nürnberg | "Well, surrounding them's out."

2006-11-13 16:25:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Monday, 13 November 2006 00:30, David Chinner wrote:
> On Sun, Nov 12, 2006 at 06:43:10PM +0000, Pavel Machek wrote:
> > Hi!
> >
> > > > Okay, so you claim that sys_sync can stall, waiting for administrator?
> > > >
> > > > In such case we can simply do one sys_sync() before we start freezing
> > > > userspace... or just move the only sys_sync() there. That way, admin
> > > > has chance to unlock his system.
> > >
> > > Well, this is a different story.
> > >
> > > My point is that if we call sys_sync() _anyway_ before calling
> > > freeze_filesystems(), then freeze_filesystems() is _safe_ (either the
> > > sys_sync() blocks, or it doesn't in which case freeze_filesystems() won't
> > > block either).
> > >
> > > This means, however, that we can leave the patch as is (well, with the minor
> > > fix I have already posted), for now, because it doesn't make things any
> > > worse, but:
> > > (a) it prevents xfs from being corrupted and
> >
> > I'd really prefer it to be fixed by 'freezeable workqueues'.
>
> I'd prefer that you just freeze the filesystem and let the
> filesystem do things correctly.

In fact _I_ agree with you and that's what we have in -mm now. However, Pavel
apparently thinks it's too invasive, so we are considering (theoretically, for
now) a "less invasive" alternative.

> > Can you
> > point me to the sources -- which xfs workqueues are problematic?
>
> AFAIK, its the I/O completion workqueues that are causing problems.
> (fs/xfs/linux-2.6/xfs_buf.c) However, thinking about it, I'm not
> sure that the work queues being left unfrozen is the real problem.
>
> i.e. after a sync there's still I/O outstanding (e.g. metadata in
> the log but not on disk), and because the kernel threads are frozen
> some time after the sync, we could have issued this delayed write
> metadata to disk after the sync. With XFS, we can have a queue of
> thousands of metadata buffers for delwri, and they are all issued
> async and can take many seconds for the I/O to complete.
>
> The I/O completion workqueues will continue to run until all I/O
> stops, and metadata I/O completion will change the state of the
> filesystem in memory.
>
> However, even if you stop the workqueue processing, you're still
> going to have to wait for all I/O completion to occur before
> snapshotting memory because having any I/O complete changes memory
> state. Hence I fail to see how freezing the workqueues really helps
> at all here....

The changes of the memory state by themselves are handled correctly anyway.

The problem is the state of memory after the resume must be consistent with
the data and metadata on disk and it will be consistent if there are no
changes to the on-disk data and metadata made during the suspend
_after_ the suspend image has been created.

Also, the on-disk data and metadata should be sufficient to recover the
filesystem in case the resume fails (but that's obvious).

> Given that the only way to track and block on these delwri metadata
> buffers is to issue a sync flush rather than a async flush, suspend
> has to do something different to guarantee that we block until all
> those I/Os have completed. i.e. freeze the filesystem.
>
> So the problem, IMO, is suspend is not telling the filesystem
> to stop doing stuff and so we are getting caught out by doing
> stuff that suspend assumes won't happen but does nothing
> to prevent.
>
> > > (b) it prevents journaling filesystems in general from replaying journals
> > > after a failing resume.
>
> This is incorrect. Freezing an XFS filesystem _ensures_ that log
> replay occurs on thaw or a failed resume. XFS specifically dirties
> the log after a freeze down to a consistent state so that the
> unlinked inode lists get processed by recovery on thaw/next mount.

Okay, so I was referring to ext3 and reiserfs.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-13 16:25:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Monday, 13 November 2006 06:43, David Chinner wrote:
> On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> > On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > > On Fri, Nov 10, 2006 at 11:39:42AM +0100, Pavel Machek wrote:
> > > > On Fri 2006-11-10 11:57:49, David Chinner wrote:
> > > > > On Thu, Nov 09, 2006 at 11:21:46PM +0100, Rafael J. Wysocki wrote:
> > > > > > I think we can add a flag to __create_workqueue() that will indicate if
> > > > > > this one is to be running with PF_NOFREEZE and a corresponding macro like
> > > > > > create_freezable_workqueue() to be used wherever we want the worker thread
> > > > > > to freeze (in which case it should be calling try_to_freeze() somewhere).
> > > > > > Then, we can teach filesystems to use this macro instead of
> > > > > > create_workqueue().
> > > > >
> > > > > At what point does the workqueue get frozen? i.e. how does this
> > > > > guarantee an unfrozen filesystem will end up in a consistent
> > > > > state?
> > > >
> > > > Snapshot is atomic; workqueue will be unfrozen with everyone else, but
> > > > as there were no writes in the meantime, there should be no problems.
> > >
> > > That doesn't answer my question - when in the sequence of freezing
> > > do you propose disabling the workqueues? before the kernel threads,
> > > after the kernel threads, before you sync the filesystem?
> >
> > After the sync, along with the freezing of kernel threads.
>
> Before or after freezing of the kthreads? Order _could_ be
> important, and different filesystems might require different
> orders. What then?

Well, I don't really think the order is important. If the freezing of work
queues is done by the freezing of their respective worker threads, the
other threads won't even know they have been frozen.

> > > And how does freezing them at that point in time guarantee consistent
> > > filesystem state?
> >
> > If the work queues are frozen, there won't be any fs-related activity _after_
> > we create the suspend image.
>
> What about if there is still I/O in progress (i.e. kthread wins race and
> issues async I/O after the sync but before it's frozen) - freezing the
> workqueues does not prevent this activity and memory state will continue to
> change as long as there is I/O completing...
>
> > The sync is done after the userland has been
> > frozen, so if the resume is unsuccessful, we'll be able to recover the state
> > of the fs right before the sync,
>
> Yes, in most cases.
>
> > and if the resume is successful, we'll be
> > able to continue (the state of memory will be the same as before the creation
> > of the suspend image and the state of the disk will be the same as before the
> > creation of the suspend image).
>
> Assuming that you actually suspended an idle filesystem, which sync does not
> guarantee you.

Even if it's not idle, we are safe as long as the I/O activity doesn't
continue after the suspend image has been created.

> Rather than assuming the filesystem is idle, why not guarantee
> that it is idle by freezing it?

Well, _I_ personally think that the freezing of filesystems is the right thing
to do, although it may lead to some complications down the road.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-14 00:11:28

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Mon, Nov 13, 2006 at 05:22:54PM +0100, Rafael J. Wysocki wrote:
> On Monday, 13 November 2006 06:43, David Chinner wrote:
> > On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> > Before or after freezing of the kthreads? Order _could_ be
> > important, and different filesystems might require different
> > orders. What then?
>
> Well, I don't really think the order is important. If the freezing of work
> queues is done by the freezing of their respective worker threads, the
> other threads won't even know they have been frozen.

True - I just think that some defined ordering semantics might be
helpful for people trying to use workqueues and kthreads in
the future. It would also help us determine if the current usage
is safe as well...

> > > and if the resume is successful, we'll be
> > > able to continue (the state of memory will be the same as before the creation
> > > of the suspend image and the state of the disk will be the same as before the
> > > creation of the suspend image).
> >
> > Assuming that you actually suspended an idle filesystem, which sync does not
> > guarantee you.
>
> Even if it's not idle, we are safe as long as the I/O activity doesn't
> continue after the suspend image has been created.

And that is my great concern - there really is nothing to prevent
a fs from having I/O outstanding while the suspend image is being
created.

> > Rather than assuming the filesystem is idle, why not guarantee
> > that it is idle by freezing it?
>
> Well, _I_ personally think that the freezing of filesystems is the right thing
> to do, although it may lead to some complications down the road.

*nod*

I would also prefer to start with something we know is safe and work from
there.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-15 18:50:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > This means, however, that we can leave the patch as is (well, with the minor
> > > fix I have already posted), for now, because it doesn't make things any
> > > worse, but:
> > > (a) it prevents xfs from being corrupted and
> >
> > I'd really prefer it to be fixed by 'freezeable workqueues'.
>
> I'd prefer that you just freeze the filesystem and let the
> filesystem do things correctly.

Well, I'd prefer filesystems not to know about suspend, and current
"freeze the filesystem" does not really nest properly.

> > Can you
> > point me to the sources -- which xfs workqueues are problematic?
>
> AFAIK, its the I/O completion workqueues that are causing problems.
> (fs/xfs/linux-2.6/xfs_buf.c) However, thinking about it, I'm not
> sure that the work queues being left unfrozen is the real problem.
>
> i.e. after a sync there's still I/O outstanding (e.g. metadata in
> the log but not on disk), and because the kernel threads are frozen
> some time after the sync, we could have issued this delayed write
> metadata to disk after the sync. With XFS, we can have a queue of

That's okay, snapshot is atomic. As long as data are safely in the
journal, we should be okay.

> However, even if you stop the workqueue processing, you're still
> going to have to wait for all I/O completion to occur before
> snapshotting memory because having any I/O complete changes memory
> state. Hence I fail to see how freezing the workqueues really helps
> at all here....

It is okay to change memory state, just on-disk state may not change
after atomic snapshot.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 19:59:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Wednesday, 15 November 2006 19:50, Pavel Machek wrote:
> Hi!
>
> > > > This means, however, that we can leave the patch as is (well, with the minor
> > > > fix I have already posted), for now, because it doesn't make things any
> > > > worse, but:
> > > > (a) it prevents xfs from being corrupted and
> > >
> > > I'd really prefer it to be fixed by 'freezeable workqueues'.
> >
> > I'd prefer that you just freeze the filesystem and let the
> > filesystem do things correctly.
>
> Well, I'd prefer filesystems not to know about suspend, and current
> "freeze the filesystem" does not really nest properly.
>
> > > Can you
> > > point me to the sources -- which xfs workqueues are problematic?
> >
> > AFAIK, its the I/O completion workqueues that are causing problems.
> > (fs/xfs/linux-2.6/xfs_buf.c) However, thinking about it, I'm not
> > sure that the work queues being left unfrozen is the real problem.
> >
> > i.e. after a sync there's still I/O outstanding (e.g. metadata in
> > the log but not on disk), and because the kernel threads are frozen
> > some time after the sync, we could have issued this delayed write
> > metadata to disk after the sync. With XFS, we can have a queue of
>
> That's okay, snapshot is atomic. As long as data are safely in the
> journal, we should be okay.
>
> > However, even if you stop the workqueue processing, you're still
> > going to have to wait for all I/O completion to occur before
> > snapshotting memory because having any I/O complete changes memory
> > state. Hence I fail to see how freezing the workqueues really helps
> > at all here....
>
> It is okay to change memory state, just on-disk state may not change
> after atomic snapshot.

There's one more thing, actually. If the on-disk data and metadata are
changed _after_ the sync we do and _before_ we create the snapshot image,
and the subsequent resume fails, there may be problems with recovering
the filesystem. That is, if I correctly understand what David has told us so
far.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-15 20:03:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wednesday, 15 November 2006 20:56, Rafael J. Wysocki wrote:
> Hi,
>
> On Wednesday, 15 November 2006 19:50, Pavel Machek wrote:
> > Hi!
> >
> > > > > This means, however, that we can leave the patch as is (well, with the minor
> > > > > fix I have already posted), for now, because it doesn't make things any
> > > > > worse, but:
> > > > > (a) it prevents xfs from being corrupted and
> > > >
> > > > I'd really prefer it to be fixed by 'freezeable workqueues'.
> > >
> > > I'd prefer that you just freeze the filesystem and let the
> > > filesystem do things correctly.
> >
> > Well, I'd prefer filesystems not to know about suspend, and current
> > "freeze the filesystem" does not really nest properly.
> >
> > > > Can you
> > > > point me to the sources -- which xfs workqueues are problematic?
> > >
> > > AFAIK, its the I/O completion workqueues that are causing problems.
> > > (fs/xfs/linux-2.6/xfs_buf.c) However, thinking about it, I'm not
> > > sure that the work queues being left unfrozen is the real problem.
> > >
> > > i.e. after a sync there's still I/O outstanding (e.g. metadata in
> > > the log but not on disk), and because the kernel threads are frozen
> > > some time after the sync, we could have issued this delayed write
> > > metadata to disk after the sync. With XFS, we can have a queue of
> >
> > That's okay, snapshot is atomic. As long as data are safely in the
> > journal, we should be okay.
> >
> > > However, even if you stop the workqueue processing, you're still
> > > going to have to wait for all I/O completion to occur before
> > > snapshotting memory because having any I/O complete changes memory
> > > state. Hence I fail to see how freezing the workqueues really helps
> > > at all here....
> >
> > It is okay to change memory state, just on-disk state may not change
> > after atomic snapshot.
>
> There's one more thing, actually. If the on-disk data and metadata are
> changed _after_ the sync we do and _before_ we create the snapshot image,
> and the subsequent resume fails,

Well, but this is equivalent to a power failure immediately after the sync, so
there _must_ be a way to recover the filesystem from that, no?

I think I'll prepare a patch for freezing the work queues and we'll see what
to do next.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-11-15 20:24:07

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > There's one more thing, actually. If the on-disk data and metadata are
> > changed _after_ the sync we do and _before_ we create the snapshot image,
> > and the subsequent resume fails,
>
> Well, but this is equivalent to a power failure immediately after the sync, so
> there _must_ be a way to recover the filesystem from that, no?

Exactly.

> I think I'll prepare a patch for freezing the work queues and we'll see what
> to do next.

Thanks!
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-15 22:01:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi,

On Wednesday, 15 November 2006 21:23, Pavel Machek wrote:
> Hi!
>
> > > There's one more thing, actually. If the on-disk data and metadata are
> > > changed _after_ the sync we do and _before_ we create the snapshot image,
> > > and the subsequent resume fails,
> >
> > Well, but this is equivalent to a power failure immediately after the sync, so
> > there _must_ be a way to recover the filesystem from that, no?
>
> Exactly.
>
> > I think I'll prepare a patch for freezing the work queues and we'll see what
> > to do next.
>
> Thanks!

Okay, the patch follows.

I've been running it for some time on my boxes and it doesn't seem to break
anything. However, I don't use XFS, so well ...

Greetings,
Rafael


Signed-off-by: Rafael J. Wysocki <[email protected]>
---
fs/xfs/linux-2.6/xfs_buf.c | 4 ++--
include/linux/workqueue.h | 8 +++++---
kernel/workqueue.c | 20 ++++++++++++++------
3 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6.19-rc5-mm2/include/linux/workqueue.h
===================================================================
--- linux-2.6.19-rc5-mm2.orig/include/linux/workqueue.h 2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.19-rc5-mm2/include/linux/workqueue.h 2006-11-15 21:58:40.000000000 +0100
@@ -55,9 +55,11 @@ struct execute_work {
} while (0)

extern struct workqueue_struct *__create_workqueue(const char *name,
- int singlethread);
-#define create_workqueue(name) __create_workqueue((name), 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1)
+ int singlethread,
+ int freezeable);
+#define create_workqueue(name) __create_workqueue((name), 0, 0)
+#define create_freezeable_workqueue(name) __create_workqueue((name), 0, 1)
+#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)

extern void destroy_workqueue(struct workqueue_struct *wq);

Index: linux-2.6.19-rc5-mm2/kernel/workqueue.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/kernel/workqueue.c 2006-11-15 21:32:13.000000000 +0100
+++ linux-2.6.19-rc5-mm2/kernel/workqueue.c 2006-11-15 22:27:40.000000000 +0100
@@ -31,6 +31,7 @@
#include <linux/mempolicy.h>
#include <linux/kallsyms.h>
#include <linux/debug_locks.h>
+#include <linux/freezer.h>

/*
* The per-CPU workqueue (if single thread, we always use the first
@@ -57,6 +58,8 @@ struct cpu_workqueue_struct {
struct task_struct *thread;

int run_depth; /* Detect run_workqueue() recursion depth */
+
+ int freezeable; /* Freeze the thread during suspend */
} ____cacheline_aligned;

/*
@@ -251,7 +254,8 @@ static int worker_thread(void *__cwq)
struct k_sigaction sa;
sigset_t blocked;

- current->flags |= PF_NOFREEZE;
+ if (!cwq->freezeable)
+ current->flags |= PF_NOFREEZE;

set_user_nice(current, -5);

@@ -274,6 +278,9 @@ static int worker_thread(void *__cwq)

set_current_state(TASK_INTERRUPTIBLE);
while (!kthread_should_stop()) {
+ if (cwq->freezeable)
+ try_to_freeze();
+
add_wait_queue(&cwq->more_work, &wait);
if (list_empty(&cwq->worklist))
schedule();
@@ -350,7 +357,7 @@ void fastcall flush_workqueue(struct wor
EXPORT_SYMBOL_GPL(flush_workqueue);

static struct task_struct *create_workqueue_thread(struct workqueue_struct *wq,
- int cpu)
+ int cpu, int freezeable)
{
struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
struct task_struct *p;
@@ -360,6 +367,7 @@ static struct task_struct *create_workqu
cwq->thread = NULL;
cwq->insert_sequence = 0;
cwq->remove_sequence = 0;
+ cwq->freezeable = freezeable;
INIT_LIST_HEAD(&cwq->worklist);
init_waitqueue_head(&cwq->more_work);
init_waitqueue_head(&cwq->work_done);
@@ -375,7 +383,7 @@ static struct task_struct *create_workqu
}

struct workqueue_struct *__create_workqueue(const char *name,
- int singlethread)
+ int singlethread, int freezeable)
{
int cpu, destroy = 0;
struct workqueue_struct *wq;
@@ -395,7 +403,7 @@ struct workqueue_struct *__create_workqu
mutex_lock(&workqueue_mutex);
if (singlethread) {
INIT_LIST_HEAD(&wq->list);
- p = create_workqueue_thread(wq, singlethread_cpu);
+ p = create_workqueue_thread(wq, singlethread_cpu, freezeable);
if (!p)
destroy = 1;
else
@@ -403,7 +411,7 @@ struct workqueue_struct *__create_workqu
} else {
list_add(&wq->list, &workqueues);
for_each_online_cpu(cpu) {
- p = create_workqueue_thread(wq, cpu);
+ p = create_workqueue_thread(wq, cpu, freezeable);
if (p) {
kthread_bind(p, cpu);
wake_up_process(p);
@@ -657,7 +665,7 @@ static int __devinit workqueue_cpu_callb
mutex_lock(&workqueue_mutex);
/* Create a new workqueue thread for it. */
list_for_each_entry(wq, &workqueues, list) {
- if (!create_workqueue_thread(wq, hotcpu)) {
+ if (!create_workqueue_thread(wq, hotcpu, 0)) {
printk("workqueue for %i failed\n", hotcpu);
return NOTIFY_BAD;
}
Index: linux-2.6.19-rc5-mm2/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.19-rc5-mm2.orig/fs/xfs/linux-2.6/xfs_buf.c 2006-11-15 21:32:08.000000000 +0100
+++ linux-2.6.19-rc5-mm2/fs/xfs/linux-2.6/xfs_buf.c 2006-11-15 22:12:43.000000000 +0100
@@ -1826,11 +1826,11 @@ xfs_buf_init(void)
if (!xfs_buf_zone)
goto out_free_trace_buf;

- xfslogd_workqueue = create_workqueue("xfslogd");
+ xfslogd_workqueue = create_freezeable_workqueue("xfslogd");
if (!xfslogd_workqueue)
goto out_free_buf_zone;

- xfsdatad_workqueue = create_workqueue("xfsdatad");
+ xfsdatad_workqueue = create_freezeable_workqueue("xfsdatad");
if (!xfsdatad_workqueue)
goto out_destroy_xfslogd_workqueue;

2006-11-15 22:49:50

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > > There's one more thing, actually. If the on-disk data and metadata are
> > > > changed _after_ the sync we do and _before_ we create the snapshot image,
> > > > and the subsequent resume fails,
> > >
> > > Well, but this is equivalent to a power failure immediately after the sync, so
> > > there _must_ be a way to recover the filesystem from that, no?
> >
> > Exactly.
> >
> > > I think I'll prepare a patch for freezing the work queues and we'll see what
> > > to do next.
> >
> > Thanks!
>
> Okay, the patch follows.
>
> I've been running it for some time on my boxes and it doesn't seem to break
> anything. However, I don't use XFS, so well ...

Seems like a way to go, thanks!
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-16 23:22:34

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Wed, Nov 15, 2006 at 10:58:51PM +0100, Rafael J. Wysocki wrote:
> On Wednesday, 15 November 2006 21:23, Pavel Machek wrote:
> > > I think I'll prepare a patch for freezing the work queues and we'll see what
> > > to do next.
> >
> > Thanks!
>
> Okay, the patch follows.
>
> I've been running it for some time on my boxes and it doesn't seem to break
> anything. However, I don't use XFS, so well ...

So you haven't actually tested whether it fixes anything or whether
it introduces any regressions?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-16 23:24:24

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > And how does freezing them at that point in time guarantee consistent
> > filesystem state?
>
> If the work queues are frozen, there won't be any fs-related activity _after_
> we create the suspend image.

fs-related activity before or after the suspend image is captured is
not a problem - it's fs-related activity _during_ the suspend that
is an issue here. If we have async I/O completing during the suspend
image capture, we've got problems....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-16 23:39:12

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

Hi!

> > > > I think I'll prepare a patch for freezing the work queues and we'll see what
> > > > to do next.
> > >
> > > Thanks!
> >
> > Okay, the patch follows.
> >
> > I've been running it for some time on my boxes and it doesn't seem to break
> > anything. However, I don't use XFS, so well ...
>
> So you haven't actually tested whether it fixes anything or whether
> it introduces any regressions?

No one has a testcase for a problem; swsusp just does not eat
filesystems in practice. Feel free to test the patch, but I'm pretty
sure it will not break anything.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-16 23:41:11

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri 2006-11-17 10:23:49, David Chinner wrote:
> On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> > On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > > And how does freezing them at that point in time guarantee consistent
> > > filesystem state?
> >
> > If the work queues are frozen, there won't be any fs-related activity _after_
> > we create the suspend image.
>
> fs-related activity before or after the suspend image is captured is
> not a problem - it's fs-related activity _during_ the suspend that
> is an issue here. If we have async I/O completing during the suspend
> image capture, we've got problems....

fs-related activity _after_ image is captured definitely is a problem
-- it breaks swsusp invariants.

During image capture, any fs-related activity is not possible, as we
are running interrupts disabled, DMA disabled.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-11-17 01:42:45

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri, Nov 17, 2006 at 12:40:53AM +0100, Pavel Machek wrote:
> On Fri 2006-11-17 10:23:49, David Chinner wrote:
> > On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> > > On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > > > And how does freezing them at that point in time guarantee consistent
> > > > filesystem state?
> > >
> > > If the work queues are frozen, there won't be any fs-related activity _after_
> > > we create the suspend image.
> >
> > fs-related activity before or after the suspend image is captured is
> > not a problem - it's fs-related activity _during_ the suspend that
> > is an issue here. If we have async I/O completing during the suspend
> > image capture, we've got problems....
>
> fs-related activity _after_ image is captured definitely is a problem
> -- it breaks swsusp invariants.
>
> During image capture, any fs-related activity is not possible, as we
> are running interrupts disabled, DMA disabled.

Ok, so the I/O that finishes during the image capture won't be reported
until after the capture completes. That means we lose the capability
to even run the I/O completion on those buffers once we get to
resume.

They've already been issued, so will be locked and never issued
again because the suspend image says they've been issued, but they'll
never complete on resume because their completion status was
updated after the capture was taken and will never be recreated
after resume. IOWs, the filesystem will start to hang on the next
reference to any of these locked buffers that the filesystem
thinks are still under I/O....

Freezing the workqueues doesn't fix this.....
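
To spell out the sequence (an illustrative timeline in generic buffer
terms; XFS's own buffers behave analogously):

	submit I/O            buffer locked, write in flight
	...image captured...  image records the buffer as in flight
	I/O completes         buffer unlocked in live memory only
	power off
	resume from image     buffer is locked again in the restored
	                      image, the I/O is never re-issued and
	                      its completion never arrives
	next lock attempt     blocks forever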

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-17 15:14:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

On Fri 2006-11-17 12:40:51, David Chinner wrote:
> On Fri, Nov 17, 2006 at 12:40:53AM +0100, Pavel Machek wrote:
> > On Fri 2006-11-17 10:23:49, David Chinner wrote:
> > > On Sun, Nov 12, 2006 at 11:43:05PM +0100, Rafael J. Wysocki wrote:
> > > > On Sunday, 12 November 2006 23:30, David Chinner wrote:
> > > > > And how does freezing them at that point in time guarantee consistent
> > > > > filesystem state?
> > > >
> > > > If the work queues are frozen, there won't be any fs-related activity _after_
> > > > we create the suspend image.
> > >
> > > fs-related activity before or after the suspend image is captured is
> > > not a problem - it's fs-related activity _during_ the suspend that
> > > is an issue here. If we have async I/O completing during the suspend
> > > image capture, we've got problems....
> >
> > fs-related activity _after_ image is captured definitely is a problem
> > -- it breaks swsusp invariants.
> >
> > During image capture, any fs-related activity is not possible, as we
> > are running interrupts disabled, DMA disabled.
>
> Ok, so the I/O that finishes during the image capture won't be reported
> until after the capture completes. That means we lose the capability

There's no I/O in flight during image capture. Interrupts are
disabled, DMAs are stopped, and drivers were told to shut down (that
includes finishing any outstanding work, and drivers do that
currently; but perhaps docs should be more explicit about it).
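
For illustration, "finishing any outstanding work" at the driver
level might look like this (a hypothetical 2.6.19-era driver; all
names here are made up):

	static int mydrv_suspend(struct device *dev, pm_message_t state)
	{
		struct mydrv *md = dev_get_drvdata(dev);

		/* wait for every request already issued to the hardware */
		wait_event(md->idle_wait, atomic_read(&md->inflight) == 0);

		/* then quiesce the device so no new completions arrive */
		mydrv_stop_hw(md);
		return 0;
	}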
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-01-12 09:52:16

by Srinivasa Ds

[permalink] [raw]
Subject: Re: [PATCH 2.6.19 5/5] fs: freeze_bdev with semaphore not mutex

fs/block_dev.c | 2 +-
fs/buffer.c | 6 ++++--
fs/super.c | 4 ++--
include/linux/fs.h | 2 +-
4 files changed, 8 insertions(+), 6 deletions(-)

Index: linux-2.6.20-rc4/fs/block_dev.c
===================================================================
--- linux-2.6.20-rc4.orig/fs/block_dev.c
+++ linux-2.6.20-rc4/fs/block_dev.c
@@ -411,7 +411,7 @@ static void init_once(void * foo, struct
{
memset(bdev, 0, sizeof(*bdev));
mutex_init(&bdev->bd_mutex);
- mutex_init(&bdev->bd_mount_mutex);
+ sema_init(&bdev->bd_mount_sem, 1);
INIT_LIST_HEAD(&bdev->bd_inodes);
INIT_LIST_HEAD(&bdev->bd_list);
#ifdef CONFIG_SYSFS
Index: linux-2.6.20-rc4/fs/buffer.c
===================================================================
--- linux-2.6.20-rc4.orig/fs/buffer.c
+++ linux-2.6.20-rc4/fs/buffer.c
@@ -189,7 +189,9 @@ struct super_block *freeze_bdev(struct b
{
struct super_block *sb;

- mutex_lock(&bdev->bd_mount_mutex);
+ if (down_trylock(&bdev->bd_mount_sem))
+ return -EBUSY;
+
sb = get_super(bdev);
if (sb && !(sb->s_flags & MS_RDONLY)) {
sb->s_frozen = SB_FREEZE_WRITE;
@@ -231,7 +233,7 @@ void thaw_bdev(struct block_device *bdev
drop_super(sb);
}

- mutex_unlock(&bdev->bd_mount_mutex);
+ up(&bdev->bd_mount_sem);
}
EXPORT_SYMBOL(thaw_bdev);

Index: linux-2.6.20-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/fs.h
+++ linux-2.6.20-rc4/include/linux/fs.h
@@ -458,7 +458,7 @@ struct block_device {
struct inode * bd_inode; /* will die */
int bd_openers;
struct mutex bd_mutex; /* open/close mutex */
- struct mutex bd_mount_mutex; /* mount mutex */
+ struct semaphore bd_mount_sem;
struct list_head bd_inodes;
void * bd_holder;
int bd_holders;
Index: linux-2.6.20-rc4/fs/super.c
===================================================================
--- linux-2.6.20-rc4.orig/fs/super.c
+++ linux-2.6.20-rc4/fs/super.c
@@ -753,9 +753,9 @@ int get_sb_bdev(struct file_system_type
* will protect the lockfs code from trying to start a snapshot
* while we are mounting
*/
- mutex_lock(&bdev->bd_mount_mutex);
+ down(&bdev->bd_mount_sem);
s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
- mutex_unlock(&bdev->bd_mount_mutex);
+ up(&bdev->bd_mount_sem);
if (IS_ERR(s))
goto error_s;


Attachments:
dm.fix (2.32 kB)
