2006-10-23 04:12:23

by Nigel Cunningham

Subject: [PATCH] Freeze bdevs when freezing processes.

XFS can continue to submit I/O from a timer routine, even after
freezeable kernel and userspace threads are frozen. This doesn't seem to
be an issue for current swsusp code, but is definitely an issue for
Suspend2, where the pages being written could be overwritten by
Suspend2's atomic copy.

We can address this issue by freezing bdevs after stopping userspace
threads, and thawing them prior to thawing userspace.

Signed-off-by: Nigel Cunningham <[email protected]>

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 4a001fe..ddeeb50 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -13,6 +13,7 @@ #include <linux/interrupt.h>
#include <linux/suspend.h>
#include <linux/module.h>
#include <linux/syscalls.h>
+#include <linux/buffer_head.h>
#include <linux/freezer.h>

/*
@@ -20,6 +21,58 @@ #include <linux/freezer.h>
*/
#define TIMEOUT (20 * HZ)

+struct frozen_fs
+{
+ struct list_head fsb_list;
+ struct super_block *sb;
+};
+
+LIST_HEAD(frozen_fs_list);
+
+void freezer_make_fses_rw(void)
+{
+ struct frozen_fs *fs, *next_fs;
+
+ list_for_each_entry_safe(fs, next_fs, &frozen_fs_list, fsb_list) {
+ thaw_bdev(fs->sb->s_bdev, fs->sb);
+
+ list_del(&fs->fsb_list);
+ kfree(fs);
+ }
+}
+
+/*
+ * Done after userspace is frozen, so there should be no danger of
+ * fses being unmounted while we're in here.
+ */
+int freezer_make_fses_ro(void)
+{
+ struct frozen_fs *fs;
+ struct super_block *sb;
+
+ /* Generate the list */
+ list_for_each_entry(sb, &super_blocks, s_list) {
+ if (!sb->s_root || !sb->s_bdev ||
+ (sb->s_frozen == SB_FREEZE_TRANS) ||
+ (sb->s_flags & MS_RDONLY))
+ continue;
+
+ fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
+ if (!fs)
+ return 1;
+ fs->sb = sb;
+ list_add_tail(&fs->fsb_list, &frozen_fs_list);
+ };
+
+ /* Do the freezing in reverse order so filesystems dependent
+ * upon others are frozen in the right order. (e.g. loopback
+ * on ext3). */
+ list_for_each_entry_reverse(fs, &frozen_fs_list, fsb_list)
+ freeze_bdev(fs->sb->s_bdev);
+
+ return 0;
+}
+

static inline int freezeable(struct task_struct * p)
{
@@ -119,7 +172,7 @@ int freeze_processes(void)
read_unlock(&tasklist_lock);
todo += nr_user;
if (!user_frozen && !nr_user) {
- sys_sync();
+ freezer_make_fses_ro();
start_time = jiffies;
}
user_frozen = !nr_user;
@@ -174,6 +227,12 @@ void thaw_some_processes(int all)
"Strange, %s not stopped\n", p->comm );
} while_each_thread(g, p);

+ if (!pass) {
+ read_unlock(&tasklist_lock);
+ freezer_make_fses_rw();
+ read_lock(&tasklist_lock);
+ }
+
pass++;
} while (pass < 2 && all);




2006-10-23 10:37:45

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> XFS can continue to submit I/O from a timer routine, even after
> freezeable kernel and userspace threads are frozen. This doesn't seem to
> be an issue for current swsusp code,

So it doesn't look like we need the patch _now_.

> but is definitely an issue for Suspend2, where the pages being written could
> be overwritten by Suspend2's atomic copy.

And IMO that's a good reason why we shouldn't use RCU pages for storing the
image. XFS is one known example that breaks things if we do so and
there may be more such things that we don't know of. The fact that they
haven't appeared in testing so far doesn't mean they don't exist and
moreover some things like that may appear in the future.

> We can address this issue by freezing bdevs after stopping userspace
> threads, and thawing them prior to thawing userspace.

This will only address the issues related to filesystems, but the RCU pages
can also be modified from interrupt context by other code, AFAICT.

> Signed-off-by: Nigel Cunningham <[email protected]>
>
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 4a001fe..ddeeb50 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -13,6 +13,7 @@ #include <linux/interrupt.h>
> #include <linux/suspend.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/buffer_head.h>
> #include <linux/freezer.h>
>
> /*
> @@ -20,6 +21,58 @@ #include <linux/freezer.h>
> */
> #define TIMEOUT (20 * HZ)
>
> +struct frozen_fs
> +{
> + struct list_head fsb_list;
> + struct super_block *sb;
> +};
> +
> +LIST_HEAD(frozen_fs_list);
> +
> +void freezer_make_fses_rw(void)
> +{
> + struct frozen_fs *fs, *next_fs;
> +
> + list_for_each_entry_safe(fs, next_fs, &frozen_fs_list, fsb_list) {
> + thaw_bdev(fs->sb->s_bdev, fs->sb);
> +
> + list_del(&fs->fsb_list);
> + kfree(fs);
> + }
> +}
> +
> +/*
> + * Done after userspace is frozen, so there should be no danger of
> + * fses being unmounted while we're in here.
> + */
> +int freezer_make_fses_ro(void)
> +{
> + struct frozen_fs *fs;
> + struct super_block *sb;
> +
> + /* Generate the list */
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (!sb->s_root || !sb->s_bdev ||
> + (sb->s_frozen == SB_FREEZE_TRANS) ||
> + (sb->s_flags & MS_RDONLY))
> + continue;
> +
> + fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
> + if (!fs)
> + return 1;

You're still leaking memory here.

I have a simpler version of this patch without this leak, but we have decided
we won't apply it anyway.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-23 12:09:48

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-23 at 12:36 +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > XFS can continue to submit I/O from a timer routine, even after
> > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > be an issue for current swsusp code,
>
> So it doesn't look like we need the patch _now_.

I'm trying to prepare the patches to make swsusp into suspend2. Please
assume it's part of a stream of patches that belong together, rather
than an isolated modification. Yes, I'd really rather just rm -f swap.c
user.c swsusp.c, but I'm trying to do the incremental modification thing
for you.

> > but is definitely an issue for Suspend2, where the pages being written could
> > be overwritten by Suspend2's atomic copy.
>
> And IMO that's a good reason why we shouldn't use RCU pages for storing the
> image. XFS is one known example that breaks things if we do so and
> there may be more such things that we don't know of. The fact that they
> haven't appeared in testing so far doesn't mean they don't exist and
> moreover some things like that may appear in the future.

Ok. We can modify the selection of pages to be overwritten. I agree that
absence of proof isn't proof of absence, but on the latter part, we can't
rule some modification out now because something in the future might
break it.

> > We can address this issue by freezing bdevs after stopping userspace
> > threads, and thawing them prior to thawing userspace.
>
> This will only address the issues related to filesystems, but the RCU pages
> can also be modified from interrupt context by other code, AFAICT.

Agreed. We can find some other way to address that.

> > Signed-off-by: Nigel Cunningham <[email protected]>
> >
> > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > index 4a001fe..ddeeb50 100644
> > --- a/kernel/power/process.c
> > +++ b/kernel/power/process.c
> > @@ -13,6 +13,7 @@ #include <linux/interrupt.h>
> > #include <linux/suspend.h>
> > #include <linux/module.h>
> > #include <linux/syscalls.h>
> > +#include <linux/buffer_head.h>
> > #include <linux/freezer.h>
> >
> > /*
> > @@ -20,6 +21,58 @@ #include <linux/freezer.h>
> > */
> > #define TIMEOUT (20 * HZ)
> >
> > +struct frozen_fs
> > +{
> > + struct list_head fsb_list;
> > + struct super_block *sb;
> > +};
> > +
> > +LIST_HEAD(frozen_fs_list);
> > +
> > +void freezer_make_fses_rw(void)
> > +{
> > + struct frozen_fs *fs, *next_fs;
> > +
> > + list_for_each_entry_safe(fs, next_fs, &frozen_fs_list, fsb_list) {
> > + thaw_bdev(fs->sb->s_bdev, fs->sb);
> > +
> > + list_del(&fs->fsb_list);
> > + kfree(fs);
> > + }
> > +}
> > +
> > +/*
> > + * Done after userspace is frozen, so there should be no danger of
> > + * fses being unmounted while we're in here.
> > + */
> > +int freezer_make_fses_ro(void)
> > +{
> > + struct frozen_fs *fs;
> > + struct super_block *sb;
> > +
> > + /* Generate the list */
> > + list_for_each_entry(sb, &super_blocks, s_list) {
> > + if (!sb->s_root || !sb->s_bdev ||
> > + (sb->s_frozen == SB_FREEZE_TRANS) ||
> > + (sb->s_flags & MS_RDONLY))
> > + continue;
> > +
> > + fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
> > + if (!fs)
> > + return 1;
>
> You're still leaking memory here.

Ah. Good point.

> I have a simpler version of this patch without this leak, but we have decided
> we won't apply it anyway.

We?

You do realise that XFS can still be writing data while you're writing
your snapshot, so (depending on what it does), you might end up with a
messed up filesystem without something similar? (XFS may do
$UNDEFINED_WRITING_TO_DISK while writing the snapshot and potentially
different/inconsistent $UNDEFINED_WRITING_TO_DISK after the atomic
restore).

Nigel

2006-10-23 14:08:11

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 14:09, Nigel Cunningham wrote:
> Hi.
>
> On Mon, 2006-10-23 at 12:36 +0200, Rafael J. Wysocki wrote:
> > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > XFS can continue to submit I/O from a timer routine, even after
> > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > be an issue for current swsusp code,
> >
> > So it doesn't look like we need the patch _now_.
>
> I'm trying to prepare the patches to make swsusp into suspend2.

Oh, I see. Please don't do that.

> Please assume it's part of a stream of patches that belong together, rather
> than an isolated modification. Yes, I'd really rather just rm -f swap.c
> user.c swsusp.c,

And I'm against that, sorry. Also I don't think you can remove user.c just
like that.

> but I'm trying to do the incremental modification thing for you.
>
> > > but is definitely an issue for Suspend2, where the pages being written could
> > > be overwritten by Suspend2's atomic copy.
> >
> > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > image. XFS is one known example that breaks things if we do so and
> > there may be more such things that we don't know of. The fact that they
> > haven't appeared in testing so far doesn't mean they don't exist and
> > moreover some things like that may appear in the future.
>
> Ok. We can modify the selection of pages to be overwritten. I agree that
> absence of proof isn't proof of absence, but on the latter part, we can't
> rule some modification out now because something in the future might
> break it.

This case is a bit special. I don't think it would be right to require every
device driver writer to avoid modifying RCU pages from the interrupt
context, because that would break the suspend to disk ...

Besides, if there is an RCU page that we _know_ we can use to store the image
in it, we can just include this page in the image without copying. This
already gives us one extra free page for the rest of the image and we can
_avoid_ creating two images which suspend2 does and which adds a _lot_ of
complexity to the code.

> > > We can address this issue by freezing bdevs after stopping userspace
> > > threads, and thawing them prior to thawing userspace.
> >
> > This will only address the issues related to filesystems, but the RCU pages
> > can also be modified from interrupt context by other code, AFAICT.
>
> Agreed. We can find some other way to address that.

I think, to start with, we should look for pages that we're deadly sure will
_not_ be overwritten from the interrupt context (for example, the pages that
are mapped read-only by the user space look promising). Then, we should
estimate the number of these pages and consider if dealing with them in
a special way is worth the effort.

> > > Signed-off-by: Nigel Cunningham <[email protected]>
> > >
> > > diff --git a/kernel/power/process.c b/kernel/power/process.c
> > > index 4a001fe..ddeeb50 100644
> > > --- a/kernel/power/process.c
> > > +++ b/kernel/power/process.c
> > > @@ -13,6 +13,7 @@ #include <linux/interrupt.h>
> > > #include <linux/suspend.h>
> > > #include <linux/module.h>
> > > #include <linux/syscalls.h>
> > > +#include <linux/buffer_head.h>
> > > #include <linux/freezer.h>
> > >
> > > /*
> > > @@ -20,6 +21,58 @@ #include <linux/freezer.h>
> > > */
> > > #define TIMEOUT (20 * HZ)
> > >
> > > +struct frozen_fs
> > > +{
> > > + struct list_head fsb_list;
> > > + struct super_block *sb;
> > > +};
> > > +
> > > +LIST_HEAD(frozen_fs_list);
> > > +
> > > +void freezer_make_fses_rw(void)
> > > +{
> > > + struct frozen_fs *fs, *next_fs;
> > > +
> > > + list_for_each_entry_safe(fs, next_fs, &frozen_fs_list, fsb_list) {
> > > + thaw_bdev(fs->sb->s_bdev, fs->sb);
> > > +
> > > + list_del(&fs->fsb_list);
> > > + kfree(fs);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Done after userspace is frozen, so there should be no danger of
> > > + * fses being unmounted while we're in here.
> > > + */
> > > +int freezer_make_fses_ro(void)
> > > +{
> > > + struct frozen_fs *fs;
> > > + struct super_block *sb;
> > > +
> > > + /* Generate the list */
> > > + list_for_each_entry(sb, &super_blocks, s_list) {
> > > + if (!sb->s_root || !sb->s_bdev ||
> > > + (sb->s_frozen == SB_FREEZE_TRANS) ||
> > > + (sb->s_flags & MS_RDONLY))
> > > + continue;
> > > +
> > > + fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
> > > + if (!fs)
> > > + return 1;
> >
> > You're still leaking memory here.
>
> Ah. Good point.
>
> > I have a simpler version of this patch without this leak, but we have decided
> > we won't apply it anyway.
>
> We?

Well, Pavel and I have discussed that with the XFS people.

> You do realise that XFS can still be writing data while you're writing
> your snapshot,

According to the XFS people, this can't happen.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-23 14:15:25

by Nick Piggin

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Rafael J. Wysocki wrote:

> This case is a bit special. I don't think it would be right to require every
> device driver writer to avoid modifying RCU pages from the interrupt
> context, because that would break the suspend to disk ...
>
> Besides, if there is an RCU page that we _know_ we can use to store the image
> in it, we can just include this page in the image without copying. This
> already gives us one extra free page for the rest of the image and we can
> _avoid_ creating two images which suspend2 does and which adds a _lot_ of
> complexity to the code.

If you don't mind me asking... what are these RCU pages you speak of?
I couldn't work it out by grepping kernel/power/*

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-23 14:22:06

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 16:15, Nick Piggin wrote:
> Rafael J. Wysocki wrote:
>
> > This case is a bit special. I don't think it would be right to require every
> > device driver writer to avoid modifying RCU pages from the interrupt
> > context, because that would break the suspend to disk ...
> >
> > Besides, if there is an RCU page that we _know_ we can use to store the image
> > in it, we can just include this page in the image without copying. This
> > already gives us one extra free page for the rest of the image and we can
> > _avoid_ creating two images which suspend2 does and which adds a _lot_ of
> > complexity to the code.
>
> If you don't mind me asking... what are these RCU pages you speak of?
> I couldn't work it out by grepping kernel/power/*

Oops, s/RCU/LRU/g (shame, shame).

Sorry for the confusion.


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-23 16:55:47

by Andrew Morton

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

> On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > I'm trying to prepare the patches to make swsusp into suspend2.
>
> Oh, I see. Please don't do that.

Why not?

2006-10-23 17:14:56

by Pavel Machek

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > I'm trying to prepare the patches to make swsusp into suspend2.
> >
> > Oh, I see. Please don't do that.
>
> Why not?

Last time I checked, suspend2 was 15000 lines of code, including its
own plugin system and special user-kernel protocol for drawing
progress bar (netlink based). It also did parts of user interface from
kernel.

OTOH, that was half a year ago, but given that uswsusp can now do most
of the stuff suspend2 does (and without that 15000 lines of code), I
do not think we want to do complete rewrite of swsusp now.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-23 17:51:29

by Andrew Morton

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

> On Mon, 23 Oct 2006 19:14:50 +0200 Pavel Machek <[email protected]> wrote:
> On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > >
> > > Oh, I see. Please don't do that.
> >
> > Why not?
>
> Last time I checked, suspend2 was 15000 lines of code, including its
> own plugin system and special user-kernel protocol for drawing
> progress bar (netlink based). It also did parts of user interface from

That's different.

I don't know where these patches are leading, but thus far they look like
reasonable cleanups and generalisations. So I suggest we just take them
one at a time.

>
> OTOH, that was half a year ago, but given that uswsusp can now do most
> of the stuff suspend2 does (and without that 15000 lines of code), I
> do not think we want to do complete rewrite of swsusp now.

uswsusp seems like a bad idea to me. We'd be better off concentrating on a
simple, clean in-kernel thing which *works*. Right now the main problems
with swsusp are that it's slow and that there are driver problems.
(Actually these are both driver problems).

Fiddling with the top-level interfaces doesn't address either of these core
problems.

Apparently uswsusp has gained support for S3 while the in-kernel driver
does not support S3. That's disappointing.

2006-10-23 18:06:49

by Pavel Machek

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > > >
> > > > Oh, I see. Please don't do that.
> > >
> > > Why not?
> >
> > Last time I checked, suspend2 was 15000 lines of code, including its
> > own plugin system and special user-kernel protocol for drawing
> > progress bar (netlink based). It also did parts of user interface from
>
> That's different.
>
> I don't know where these patches are leading, but thus far they look like
> reasonable cleanups and generalisations. So I suggest we just take them
> one at a time.

Well, some do look okay, but for example this one complicates
freezing... and gains nothing now. It would allow some badly-designed
parts of suspend2 to be merged in future, but I doubt we want them.

> > OTOH, that was half a year ago, but given that uswsusp can now do most
> > of the stuff suspend2 does (and without that 15000 lines of code), I
> > do not think we want to do complete rewrite of swsusp now.
>
> uswsusp seems like a bad idea to me. We'd be better off concentrating on a
> simple, clean in-kernel thing which *works*. Right now the main problems
> with swsusp are that it's slow and that there are driver problems.
>
> Fiddling with the top-level interfaces doesn't address either of these core
> problems.

No amount of changes in kernel/power will fix driver problems, that's right.

> Apparently uswsusp has gained support for S3 while the in-kernel driver
> does not support S3. That's disappointing.

Well, uswsusp also gained support for compression, a splash screen and
RSA encryption. While S3 support for (kernel) swsusp would be
reasonably easy/non-intrusive, is there a point? I'd really prefer
people wanting to do swsusp+S3 to use uswsusp.

My goal is to keep in-kernel swsusp simple and reliable. Fast would be
nice, too. But having all the features is not the goal.

[swsusp+S3 would need some userland support, anyway, because userland
is needed for video card re-initialization. So you'd need a utility
similar to s2ram...]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-23 19:20:19

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 19:50, Andrew Morton wrote:
> > On Mon, 23 Oct 2006 19:14:50 +0200 Pavel Machek <[email protected]> wrote:
> > On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > > > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > > >
> > > > Oh, I see. Please don't do that.
> > >
> > > Why not?

Generally, we already do many things that suspend2 just duplicates and there's
one thing in suspend2 I really don't like, which is the use of LRU pages as
temporary storage.

> > Last time I checked, suspend2 was 15000 lines of code, including its
> > own plugin system and special user-kernel protocol for drawing
> > progress bar (netlink based). It also did parts of user interface from
>
> That's different.
>
> I don't know where these patches are leading, but thus far they look like
> reasonable cleanups and generalisations. So I suggest we just take them
> one at a time.

Well, the patch for the freezing of bdevs contains a memory leak that I have
already pointed out to Nigel three times, leads to some problems in the
error paths, AFAICT, and "solves" a problem that doesn't really exist in the
current code.

Some time ago I posted a very similar patch and we discussed it with the XFS
people. In conclusion we decided not to apply it.

The patch that causes kernel threads to be thawed temporarily before we
attempt to free some memory doesn't look bad, but it's not needed for the
reason given by Nigel. It may be needed for another reason, but I think we
should know why we apply it before we do so.

I have no specific objections with respect to the other patches except I'd
like to understand the patch that adds extents for the handling of swap before
I say I like it or not.

> > OTOH, that was half a year ago, but given that uswsusp can now do most
> > of the stuff suspend2 does (and without that 15000 lines of code), I
> > do not think we want to do complete rewrite of swsusp now.
>
> uswsusp seems like a bad idea to me.

Could you please explain why exactly?

> We'd be better off concentrating on a simple, clean in-kernel thing which
> *works*.

This depends on the point of view. I test both swsusp and uswsusp on a regular
basis on 5 different boxes and they both work flawlessly (except when some
rogue patch breaks them, but that's a different story).

> Right now the main problems with swsusp are that it's slow and that there
> are driver problems. (Actually these are both driver problems).

Well, suspend2 is not about drivers. It's all about the "core", ie. memory
management, saving/restoring the image etc.

As far as the drivers are concerned, we (I, Pavel, Nigel and even you) can't
fix all of them and it really is a challenge to do something that will cause
them to be fixed.

In the meantime we fix some things that _we_ can address, like the handling
of highmem, the PAE-related issue on i386 etc.

> Fiddling with the top-level interfaces doesn't address either of these core
> problems.

No, it doesn't, but for example it's the correct way of handling the resume
from an initrd image, IMHO.

> Apparently uswsusp has gained support for S3 while the in-kernel driver
> does not support S3. That's disappointing.

The problem with that is on _many_ boxes we have to handle the video in a
special way before and after S3, and it is only possible from the user
space. This was one of the reasons why we decided to develop the userland
suspend.


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-23 21:40:23

by Matthew Garrett

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Mon, Oct 23, 2006 at 10:50:22AM -0700, Andrew Morton wrote:

> Apparently uswsusp has gained support for S3 while the in-kernel driver
> does not support S3. That's disappointing.

I'm still not sure why that's an especially desirable feature. Every
laptop I've played with will automatically resume from S3 when the
battery level becomes critical, which gives you the opportunity to
suspend to disk. And when it doesn't, you can generally emulate it using
the ACPI alarm to wake up. Is there really a significant quantity of
hardware out there that doesn't support either of these?

--
Matthew Garrett | [email protected]

2006-10-23 22:13:36

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 23:39, Matthew Garrett wrote:
> On Mon, Oct 23, 2006 at 10:50:22AM -0700, Andrew Morton wrote:
>
> > Apparently uswsusp has gained support for S3 while the in-kernel driver
> > does not support S3. That's disappointing.
>
> I'm still not sure why that's an especially desirable feature.

Well, I know of at least one user of it (not me). ;-)

> Every laptop I've played with will automatically resume from S3 when the
> battery level becomes critical, which gives you the opportunity to
> suspend to disk. And when it doesn't, you can generally emulate it using
> the ACPI alarm to wake up. Is there really a significant quantity of
> hardware out there that doesn't support either of these?

I think there is.


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-23 22:52:49

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-23 at 21:19 +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 19:50, Andrew Morton wrote:
> > > On Mon, 23 Oct 2006 19:14:50 +0200 Pavel Machek <[email protected]> wrote:
> > > On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > > > > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > > > >
> > > > > Oh, I see. Please don't do that.
> > > >
> > > > Why not?
>
> Generally, we already do many things that suspend2 just duplicates and there's
> one thing in suspend2 I really don't like, which is the use of LRU pages as
> temporary storage.

The duplicated things obviously won't be merged (unless they're doing it
better somehow), so there's no issue there.

As to the use of LRU pages, this can be disabled with a simple sysfs
setting, so it doesn't need to be a big issue.

> > > Last time I checked, suspend2 was 15000 lines of code, including its
> > > own plugin system and special user-kernel protocol for drawing
> > > progress bar (netlink based). It also did parts of user interface from
> >
> > That's different.
> >
> > I don't know where these patches are leading, but thus far they look like
> > reasonable cleanups and generalisations. So I suggest we just take them
> > one at a time.
>
> Well, the patch for the freezing of bdevs contains a memory leak that I have
> already pointed out to Nigel three times, leads to some problems in the
> error paths, AFAICT, and "solves" a problem that doesn't really exist in the
> current code.

There's no memory leak. In Suspend2 (and I believe swsusp, but will
admit I haven't carefully checked), every call to freeze processes has a
matching call to thaw them. The thaw call will invoke make_fses_rw,
which will free the memory that was allocated. If there's an issue, it's
that in the failure path thaw_bdev can be called when freeze_bdev was
never invoked. Having just realised that, I've just fixed it.

Regarding the problem not existing, I'll try to find time to reproduce
it later in the day. Maybe the XFS guys have fixed it since I last
checked, but I do know there was a real issue (even if you haven't seen
it). That said, yesterday was my Red Hat day, so I'm not promising that
I'll be quick.

> Some time ago I posted a very similar patch and we discussed it with the XFS
> people. In conclusion we decided not to apply it.
>
> The patch that causes kernel threads to be thawed temporarily before we
> attempt to free some memory doesn't look bad, but it's not needed for the
> reason given by Nigel. It may be needed for another reason, but I think we
> should know why we apply it before we do so.

I'll seek to address this too.

> I have no specific objections with respect to the other patches except I'd
> like to understand the patch that adds extents for the handling of swap before
> I say I like it or not.
>
> > > OTOH, that was half a year ago, but given that uswsusp can now do most
> > > of the stuff suspend2 does (and without that 15000 lines of code), I
> > > do not think we want to do complete rewrite of swsusp now.
> >
> > uswsusp seems like a bad idea to me.
>
> Could you please explain why exactly?
>
> > We'd be better off concentrating on a simple, clean in-kernel thing which
> > *works*.
>
> This depends on the point of view. I test both swsusp and uswsusp on a regular
> basis on 5 different boxes and they both work flawlessly (except when some
> rogue patch breaks them, but that's a different story).
>
> > Right now the main problems with swsusp are that it's slow and that there
> > are driver problems. (Actually these are both driver problems).
>
> Well, suspend2 is not about drivers. It's all about the "core", ie. memory
> management, saving/restoring the image etc.
>
> As far as the drivers are concerned, we (I, Pavel, Nigel and even you) can't
> fix all of them and it really is a challenge to do something that will cause
> them to be fixed.
>
> In the meantime we fix some things that _we_ can address, like the handling
> of highmem, the PAE-related issue on i386 etc.

Agreed.

> > Fiddling with the top-level interfaces doesn't address either of these core
> > problems.
>
> No, it doesn't, but for example it's the correct way of handling the resume
> from an initrd image, IMHO.
>
> > Apparently uswsusp has gained support for S3 while the in-kernel driver
> > does not support S3. That's disappointing.
>
> The problem with that is on _many_ boxes we have to handle the video in a
> special way before and after S3, and it is only possible from the user
> space. This was one of the reasons why we decided to develop the userland
> suspend.

But that's a different issue - uswsusp is suspend to disk. Your s2ram
thing is suspend to ram (although I guess you're somehow invoking it as
a library or something for suspend to disk + ram).

Regards,

Nigel

2006-10-23 22:58:30

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-23 at 10:50 -0700, Andrew Morton wrote:
> > On Mon, 23 Oct 2006 19:14:50 +0200 Pavel Machek <[email protected]> wrote:
> > On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > > > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > > >
> > > > Oh, I see. Please don't do that.
> > >
> > > Why not?
> >
> > Last time I checked, suspend2 was 15000 lines of code, including its
> > own plugin system and special user-kernel protocol for drawing
> > progress bar (netlink based). It also did parts of user interface from
>
> That's different.

Let's judge those bits when we see them, rather than going on rumour and
innuendo :)

> I don't know where these patches are leading, but thus far they look like
> reasonable cleanups and generalisations. So I suggest we just take them
> one at a time.

I am seeking to produce patches to merge Suspend2 one bit at a time, as
has been requested in the past. I'd therefore ask that you do that,
because even if Pavel and Rafael don't like the overall thrust, it
doesn't mean you can't cherry pick what is good and useful along the
way.

Regards,

Nigel

2006-10-23 23:05:25

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-23 at 16:20 +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 16:15, Nick Piggin wrote:
> > Rafael J. Wysocki wrote:
> >
> > > This case is a bit special. I don't think it would be right to require every
> > > device driver writer to avoid modifying RCU pages from the interrupt
> > > context, because that would break the suspend to disk ...
> > >
> > > Besides, if there is an RCU page that we _know_ we can use to store the image
> > > in it, we can just include this page in the image without copying. This
> > > already gives us one extra free page for the rest of the image and we can
> > > _avoid_ creating two images which suspend2 does and which adds a _lot_ of
> > > complexity to the code.
> >
> > If you don't mind me asking... what are these RCU pages you speak of?
> > I couldn't work it out by grepping kernel/power/*
>
> Oops, s/RCU/LRU/g (shame, shame).
>
> Sorry for the confusion.

Ah. I wondered where RCU came into the picture, but was willing to go
with the theory that you knew something I didn't :)

Nigel

2006-10-23 23:22:22

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-23 at 16:07 +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 14:09, Nigel Cunningham wrote:
> > Hi.
> >
> > On Mon, 2006-10-23 at 12:36 +0200, Rafael J. Wysocki wrote:
> > > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > > XFS can continue to submit I/O from a timer routine, even after
> > > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > > be an issue for current swsusp code,
> > >
> > > So it doesn't look like we need the patch _now_.
> >
> > I'm trying to prepare the patches to make swsusp into suspend2.
>
> Oh, I see. Please don't do that.

(Yes, echoing Andrew..) Why not?

> > Please assume it's part of a stream of patches that belong together, rather
> > than an isolated modification. Yes, I'd really rather just rm -f swap.c
> > user.c swsusp.c,
>
> And I'm against that, sorry. Also I don't think you can remove user.c just
> like that.

Are you saying you're against a stream of patches that belong together?
Well, I'm sorry but that's the way things work sometimes. I know
everyone loves this evolutionary model, but it just doesn't work in
practice. We see sets of patches going through all the time in which
[2/4] prepares the foundations for [3/4] and so on. That's what I'm
doing, I just don't yet know how big the n is in [2/n].

Regarding removing user.c, I said I'd rather, not that I'm going to
submit a patch to do it :)

> > but I'm trying to do the incremental modification thing for you.
> >
> > > > but is definitely an issue for Suspend2, where the pages being written could
> > > > be overwritten by Suspend2's atomic copy.
> > >
> > > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > > image. XFS is one known example that breaks things if we do so and
> > > there may be more such things that we don't know of. The fact that they
> > > haven't appeared in testing so far doesn't mean they don't exist and
> > > moreover some things like that may appear in the future.
> >
> > Ok. We can modify the selection of pages to be overwritten. I agree that
> > absence of proof isn't proof of absence, but on the latter part, we can't
> > rule some modification out now because something in the future might
> > break it.

(s/RCU/LRU applied to avoid confusion)

> This case is a bit special. I don't think it would be right to require every
> device driver writer to avoid modifying LRU pages from the interrupt
> context, because that would break the suspend to disk ...

I'm not arguing for that.

> Besides, if there is an LRU page that we _know_ we can use to store the image
> in it, we can just include this page in the image without copying. This
> already gives us one extra free page for the rest of the image and we can
> _avoid_ creating two images which suspend2 does and which adds a _lot_ of
> complexity to the code.

Do you have a number for '_lot_'? It's easy to say, but it would be
better to be able to say "I counted n lines of code you could remove if
you didn't save a full image of memory".

> > > > We can address this issue by freezing bdevs after stopping userspace
> > > > threads, and thawing them prior to thawing userspace.
> > >
> > > This will only address the issues related to filesystems, but the RCU pages
> > > can also be modified from interrupt context by other code, AFAICT.
> >
> > Agreed. We can find some other way to address that.
>
> I think, to start with, we should look for pages that we're deadly sure will
> _not_ be overwritten from the interrupt context (for example, the pages that
> are mapped read-only by the user space look promising). Then, we should
> estimate the number of these pages and consider if dealing with them in
> a special way is worth the effort.

I was thinking of coming at it from the other direction, since I expect
the pages that might be modified from an interrupt context to be
minimal. The fact that Suspend2 has worked for years now using this
algorithm seems to confirm that.

Regards,

Nigel

2006-10-24 07:58:35

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi,

On Tuesday, 24 October 2006 00:52, Nigel Cunningham wrote:
> Hi.
>
> On Mon, 2006-10-23 at 21:19 +0200, Rafael J. Wysocki wrote:
> > On Monday, 23 October 2006 19:50, Andrew Morton wrote:
> > > > On Mon, 23 Oct 2006 19:14:50 +0200 Pavel Machek <[email protected]> wrote:
> > > > On Mon 2006-10-23 09:55:22, Andrew Morton wrote:
> > > > > > On Mon, 23 Oct 2006 16:07:16 +0200 "Rafael J. Wysocki" <[email protected]> wrote:
> > > > > > > I'm trying to prepare the patches to make swsusp into suspend2.
> > > > > >
> > > > > > Oh, I see. Please don't do that.
> > > > >
> > > > > Why not?
> >
> > Generally, we already do many things that suspend2 just duplicates and there's
> > one thing in suspend2 I really don't like, which is the use of LRU pages as
> > temporary storage.
>
> The duplicated things obviously won't be merged (unless they're doing it
> better somehow), so there's no issue there.

OK (as long as there are no doubts the new code is doing things better ;-))

> As to the use of LRU pages, this can be disabled with a simple sysfs
> setting, so it doesn't need to be a big issue.

I'm sorry, but I don't think it can be merged at all as long as we are not
100% sure it's safe. Currently, we aren't.

> > > > Last time I checked, suspend2 was 15000 lines of code, including its
> > > > own plugin system and special user-kernel protocol for drawing
> > > > progress bar (netlink based). It also did parts of user interface from
> > >
> > > That's different.
> > >
> > > I don't know where these patches are leading, but thus far they look like
> > > reasonable cleanups and generalisations. So I suggest we just take them
> > > one at a time.
> >
> > Well, the patch for the freezing of bdevs contains a memory leak that I have
> > already pointed out to Nigel three times, leads to some problems in the
> > error paths, AFAICT, and "solves" a problem that doesn't really exist in the
> > current code.
>
> There's no memory leak. In Suspend2 (and I believe swsusp, but will
> admit I haven't carefully checked), every call to freeze processes has a
> matching call to thaw them. The thaw call will invoke make_fses_rw,
> which will free the memory that was allocated. If there's an issue, it's
> that in the failure path thaw_bdev can be called when freeze_bdev was
> never invoked. Having just realised that, I've just fixed it.

I was talking about the leak in the error path, where you exit the function
without freeing the already allocated objects.
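
Just to illustrate what I mean (a sketch only, using the names from your
patch, not something I'm proposing to apply): since nothing has been frozen
yet when the allocation fails, the error branch could simply drop the list
entries that were already queued before returning:

	fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
	if (!fs) {
		struct frozen_fs *entry, *tmp;

		/* nothing has been frozen at this point, so only the
		 * list entries themselves need to be released */
		list_for_each_entry_safe(entry, tmp, &frozen_fs_list,
					 fsb_list) {
			list_del(&entry->fsb_list);
			kfree(entry);
		}
		return 1;
	}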

> Regarding the problem not existing, I'll try to find time to reproduce
> it later in the day. Maybe the XFS guys have fixed it since I last
> checked, but I do know there was a real issue (even if you haven't seen
> it). That said, yesterday was my Red Hat day, so I'm not promising that
> I'll be quick.

Of course if you are able to show there is a problem, I won't be against this
patch.

> > Some time ago I posted a very similar patch and we discussed it with the XFS
> > people. In conclusion we decided not to apply it.
> >
> > The patch that causes kernel threads to be thawed temporarily before we
> > attempt to free some memory doesn't look bad, but it's not needed for the
> > reason given by Nigel. It may be needed for another reason, but I think we
> > should know why we apply it before we do so.
>
> I'll seek to address this too.
>
> > I have no specific objections with respect to the other patches except I'd
> > like to understand the patch that adds extents for the handling of swap before
> > I say I like it or not.
> >
> > > > OTOH, that was half a year ago, but given that uswsusp can now do most
> > > > of the stuff suspend2 does (and without that 15000 lines of code), I
> > > > do not think we want to do complete rewrite of swsusp now.
> > >
> > > uswsusp seems like a bad idea to me.
> >
> > Could you please explain why exactly?
> >
> > > We'd be better off concentrating on a simple, clean in-kernel thing which
> > > *works*.
> >
> > This depends on the point of view. I test both swsusp and uswsusp on a regular
> > basis on 5 different boxes and they both work flawlessly (except when some
> > rogue patch breaks them, but that's a different story).
> >
> > > Right now the main problems with swsusp are that it's slow and that there
> > > are driver problems. (Actually these are both driver problems).
> >
> > Well, suspend2 is not about drivers. It's all about the "core", ie. memory
> > management, saving/restoring the image etc.
> >
> > As far as the drivers are concerned, we (I, Pavel, Nigel and even you) can't
> > fix all of them and it really is a challenge to do something that will cause
> > them to be fixed.
> >
> > In the meantime we fix some things that _we_ can address, like the handling
> > of highmem, the PAE-related issue on i386 etc.
>
> Agreed.
>
> > > Fiddling with the top-level interfaces doesn't address either of these core
> > > problems.
> >
> > No, it doesn't, but for example it's the correct way of handling the resume
> > from an initrd image, IMHO.
> >
> > > Apparently uswsusp has gained support for S3 while the in-kernel driver
> > > does not support S3. That's disappointing.
> >
> > The problem with that is on _many_ boxes we have to handle the video in a
> > special way before and after S3, and it is only possible from the user
> > space. This was one of the reasons why we decided to develop the userland
> > suspend.
>
> But that's a different issue - uswsusp is suspend to disk. Your s2ram
> thing is suspend to ram (although I guess you're somehow invoking it as
> a library or something for suspend to disk + ram).

Yes, we support the suspend to disk+RAM which also needs the video-related
black magic.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-24 07:59:14

by Pavel Machek

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > Apparently uswsusp has gained support for S3 while the in-kernel driver
> > does not support S3. That's disappointing.
>
> I'm still not sure why that's an especially desirable feature. Every
> laptop I've played with will automatically resume from S3 when the
> battery level becomes critical, which gives you the opportunity to
> suspend to disk. And when it doesn't, you can generally emulate it using
> the ACPI alarm to wake up. Is there really a significant quantity of
> hardware out there that doesn't support either of these?

Well, there are reasons *against* your implementation:

1) waking the laptop at a random time (737 taking off?) may be dangerous for
the laptop (and its surroundings)

2) old li-ions are probably able to keep a laptop suspended for 20
hours, but will die if you attempt to resume after 10 hours of sleep
without AC power.

That said, a patch for 's2ram, wakeup on battery low' for uswsusp would
be welcome.
Pavel
--
Thanks for all the (sleeping) penguins.

2006-10-24 08:02:03

by Pavel Machek

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > > Last time I checked, suspend2 was 15000 lines of code, including its
> > > own plugin system and special user-kernel protocol for drawing
> > > progress bar (netlink based). It also did parts of user interface from
> >
> > That's different.
>
> Let's judge those bits when we see them, rather than going on rumour and
> innuendo :)

Ok, but let's not merge patches 'because a shiny great future patch may
depend on them'.

> I am seeking to produce patches to merge Suspend2 one bit at a time, as
> has been requested in the past. I'd therefore ask that you do that,
> because even if Pavel and Rafael don't like the overall thrust, it
> doesn't mean you can't cherry pick what is good and useful along the
> way.

Yep, this is indeed the way to go.
Pavel
--
Thanks for all the (sleeping) penguins.

2006-10-24 08:21:43

by Nigel Cunningham

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Tue, 2006-10-24 at 09:57 +0200, Rafael J. Wysocki wrote:
> > There's no memory leak. In Suspend2 (and I believe swsusp, but will
> > admit I haven't carefully checked), every call to freeze processes has a
> > matching call to thaw them. The thaw call will invoke make_fses_rw,
> > which will free the memory that was allocated. If there's an issue, it's
> > that in the failure path thaw_bdev can be called when freeze_bdev was
> > never invoked. Having just realised that, I've just fixed it.
>
> I was talking about the leak in the error path, where you exit the function
> without freeing the already allocated objects.

I know. I'm saying that there's no memory leak because any frozen bdevs
(and memory allocated to record them) are unfrozen (and the memory
freed) when we thaw processes before returning control to the user.

Regards,

Nigel

2006-10-24 08:38:11

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tuesday, 24 October 2006 01:22, Nigel Cunningham wrote:
> Hi.
>
> On Mon, 2006-10-23 at 16:07 +0200, Rafael J. Wysocki wrote:
> > On Monday, 23 October 2006 14:09, Nigel Cunningham wrote:
> > > Hi.
> > >
> > > On Mon, 2006-10-23 at 12:36 +0200, Rafael J. Wysocki wrote:
> > > > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > > > XFS can continue to submit I/O from a timer routine, even after
> > > > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > > > be an issue for current swsusp code,
> > > >
> > > > So it doesn't look like we need the patch _now_.
> > >
> > > I'm trying to prepare the patches to make swsusp into suspend2.
> >
> > Oh, I see. Please don't do that.
>
> (Yes, echoing Andrew..) Why not?
>
> > > Please assume it's part of a stream of patches that belong together, rather
> > > than an isolated modification. Yes, I'd really rather just rm -f swap.c
> > > user.c swsusp.c,
> >
> > And I'm against that, sorry. Also I don't think you can remove user.c just
> > like that.
>
> Are you saying you're against a stream of patches that belong together?

No. I'm against a wholesale removal of existing code in order to replace
it with some new code that does the same thing in a slightly different way.

> Well, I'm sorry but that's the way things work sometimes. I know
> everyone loves this evolutionary model, but it just doesn't work in
> practice. We see sets of patches going through all the time in which
> [2/4] prepares the foundations for [3/4] and so on. That's what I'm
> doing, I just don't yet know how big the n is in [2/n].
>
> Regarding removing user.c, I said I'd rather, not that I'm going to
> submit a patch to do it :)

Oh, you could just say you didn't like it. Well, I won't argue with that. ;-)

> > > but I'm trying to do the incremental modification thing for you.
> > >
> > > > > but is definitely an issue for Suspend2, where the pages being written could
> > > > > be overwritten by Suspend2's atomic copy.
> > > >
> > > > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > > > image. XFS is one known example that breaks things if we do so and
> > > > there may be more such things that we don't know of. The fact that they
> > > > haven't appeared in testing so far doesn't mean they don't exist and
> > > > moreover some things like that may appear in the future.
> > >
> > > Ok. We can modify the selection of pages to be overwritten. I agree that
> > > absence of proof isn't proof of absence, but on the latter part, we can't
> > > rule some modification out now because something in the future might
> > > break it.
>
> (s/RCU/LRU applied to avoid confusion)

Thanks. ;-)

> > This case is a bit special. I don't think it would be right to require every
> > device driver writer to avoid modifying LRU pages from the interrupt
> > context, because that would break the suspend to disk ...
>
> I'm not arguing for that.
>
> > Besides, if there is an LRU page that we _know_ we can use to store the image
> > in it, we can just include this page in the image without copying. This
> > already gives us one extra free page for the rest of the image and we can
> > _avoid_ creating two images which suspend2 does and which adds a _lot_ of
> > complexity to the code.
>
> Do you have a number for '_lot_'? It's easy to say, but it would be
> better to be able to say "I counted n lines of code you could remove if
> you didn't save a full image of memory".

First, you don't save a full image of memory (at least not always), so please
don't say so. It's misleading.

Second, it's not only a matter of the number of lines of code, but also of
the design complexity involved. Something like "make a copy of each saveable
page in the system and save the copies" is conceptually simpler than
something like "save the contents of LRU pages, then copy the remaining
saveable pages into the LRU pages and if there are some saveable pages
left make the copies of them, then save all of the copies of saveable pages".

> > > > > We can address this issue by freezing bdevs after stopping userspace
> > > > > threads, and thawing them prior to thawing userspace.
> > > >
> > > > This will only address the issues related to filesystems, but the RCU pages
> > > > can also be modified from interrupt context by other code, AFAICT.
> > >
> > > Agreed. We can find some other way to address that.
> >
> > I think, to start with, we should look for pages that we're deadly sure will
> > _not_ be overwritten from the interrupt context (for example, the pages that
> > are mapped read-only by the user space look promising). Then, we should
> > estimate the number of these pages and consider if dealing with them in
> > a special way is worth the effort.
>
> I was thinking of coming at it from the other direction, since I expect
> the pages that might be modified from an interrupt context to be
> minimal.

I think there are only a few of them. However, the problem is we can't say
a priori _which_ pages they are.

> The fact that Suspend2 has worked for years now using this
> algorithm seems to confirm that.

Yes, it is a strong indication that such events are _rare_, but they also may
be very difficult to spot.

For example, if you copy some application data into an LRU page that's later
modified from the interrupt context, the data in the suspend image will be
damaged. Yet, this need not lead to any visible effect after the resume
(e.g. there can be a small glitch in the application or even something less
visible than that).

Such effects may only show up after a time. Worse yet, they need not be
noticeable at all. Still, unless we rule them out, we can't assume that the
system will be in a correct state after the resume. That's the point.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-24 14:46:34

by David Chinner

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Mon, Oct 23, 2006 at 12:36:53PM +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > XFS can continue to submit I/O from a timer routine, even after
> > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > be an issue for current swsusp code,
>
> So it doesn't look like we need the patch _now_.
>
> > but is definitely an issue for Suspend2, where the pages being written could
> > be overwritten by Suspend2's atomic copy.
>
> And IMO that's a good reason why we shouldn't use RCU pages for storing the
> image. XFS is one known example that breaks things if we do so and
> there may be more such things that we don't know of. The fact that they
> haven't appeared in testing so far doesn't mean they don't exist and
> moreover some things like that may appear in the future.

Could you please tell us which XFS bits are broken so we can get
them fixed? The XFS daemons should all be checking if they are
supposed to freeze (i.e. they call try_to_freeze() after they wake
up due to timer expiry) so I thought they were doing the right
thing.
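
For reference, the shape I mean is roughly the following (an illustrative
sketch only, not actual XFS code - the daemon parks itself in the freezer
each time it wakes, so it can't issue I/O while the system is suspending):

	#include <linux/kthread.h>
	#include <linux/freezer.h>
	#include <linux/sched.h>

	static int example_periodic_daemon(void *unused)
	{
		while (!kthread_should_stop()) {
			/* enter the refrigerator here if a freeze is in progress */
			try_to_freeze();

			/* ... do the periodic work, e.g. push dirty metadata ... */

			/* sleep for about a second before the next pass */
			schedule_timeout_interruptible(HZ);
		}
		return 0;
	}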

However, I have to say that I agree with freezing the filesystems
before suspend - at least XFS will be in a consistent state that can
be recovered from without corruption if your machine fails to
resume....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-24 15:31:19

by Rafael J. Wysocki

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tuesday, 24 October 2006 16:44, David Chinner wrote:
> On Mon, Oct 23, 2006 at 12:36:53PM +0200, Rafael J. Wysocki wrote:
> > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > XFS can continue to submit I/O from a timer routine, even after
> > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > be an issue for current swsusp code,
> >
> > So it doesn't look like we need the patch _now_.
> >
> > > but is definitely an issue for Suspend2, where the pages being written could
> > > be overwritten by Suspend2's atomic copy.
> >
> > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > image. XFS is one known example that breaks things if we do so and
> > there may be more such things that we don't know of. The fact that they
> > haven't appeared in testing so far doesn't mean they don't exist and
> > moreover some things like that may appear in the future.
>
> Could you please tell us which XFS bits are broken so we can get
> them fixed? The XFS daemons should all be checking if they are
> supposed to freeze (i.e. they call try_to_freeze() after they wake
> up due to timer expiry) so I thought they were doing the right
> thing.
>
> However, I have to say that I agree with freezing the filesystems
> before suspend - at least XFS will be in a consistent state that can
> be recovered from without corruption if your machine fails to
> resume....

Do you mean calling sys_sync() after the userspace has been frozen
may not be sufficient?

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-24 16:23:09

by Oleg Verych

Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On 2006-10-24, Rafael J. Wysocki wrote:
> On Tuesday, 24 October 2006 16:44, David Chinner wrote:
>> On Mon, Oct 23, 2006 at 12:36:53PM +0200, Rafael J. Wysocki wrote:
>> > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
>> > > XFS can continue to submit I/O from a timer routine, even after
>> > > freezeable kernel and userspace threads are frozen. This doesn't seem to
>> > > be an issue for current swsusp code,
>> >
>> > So it doesn't look like we need the patch _now_.
>> >
>> > > but is definitely an issue for Suspend2, where the pages being written could
>> > > be overwritten by Suspend2's atomic copy.
>> >
>> > And IMO that's a good reason why we shouldn't use RCU pages for storing the
>> > image. XFS is one known example that breaks things if we do so and
>> > there may be more such things that we don't know of. The fact that they
>> > haven't appeared in testing so far doesn't mean they don't exist and
>> > moreover some things like that may appear in the future.
>>
>> Could you please tell us which XFS bits are broken so we can get
>> them fixed? The XFS daemons should all be checking if they are
>> supposed to freeze (i.e. they call try_to_freeze() after they wake
>> up due to timer expiry) so I thought they were doing the right
>> thing.
>>
>> However, I have to say that I agree with freezing the filesystems
>> before suspend - at least XFS will be in a consistent state that can
>> be recovered from without corruption if your machine fails to
>> resume....
>
> Do you mean calling sys_sync() after the userspace has been frozen
> may not be sufficient?

Please see
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=317479;msg=105;att=0>

it's bottom of
<http://bugs.debian.org/317479>

IMHO it may be helpful.

-o--=O`C
#oo'L O
<___=E M

2006-10-24 16:37:26

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tue, Oct 24, 2006 at 05:29:59PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, 24 October 2006 16:44, David Chinner wrote:
> > On Mon, Oct 23, 2006 at 12:36:53PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > > XFS can continue to submit I/O from a timer routine, even after
> > > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > > be an issue for current swsusp code,
> > >
> > > So it doesn't look like we need the patch _now_.
> > >
> > > > but is definitely an issue for Suspend2, where the pages being written could
> > > > be overwritten by Suspend2's atomic copy.
> > >
> > > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > > image. XFS is one known example that breaks things if we do so and
> > > there may be more such things that we don't know of. The fact that they
> > > haven't appeared in testing so far doesn't mean they don't exist and
> > > moreover some things like that may appear in the future.
> >
> > Could you please tell us which XFS bits are broken so we can get
> > them fixed? The XFS daemons should all be checking if they are
> > supposed to freeze (i.e. they call try_to_freeze() after they wake
> > up due to timer expiry) so I thought they were doing the right
> > thing.
> >
> > However, I have to say that I agree with freezing the filesystems
> > before suspend - at least XFS will be in a consistent state that can
> > be recovered from without corruption if your machine fails to
> > resume....
>
> Do you mean calling sys_sync() after the userspace has been frozen
> may not be sufficient?

In most cases it probably is, but sys_sync() doesn't provide any
guarantees that the filesystem is not being used or written to after
it completes. Given that every so often I hear about an XFS filesystem
that was corrupted by suspend, I don't think this is sufficient...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-24 17:06:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tue, Oct 24, 2006 at 05:29:59PM +0200, Rafael J. Wysocki wrote:
> Do you mean calling sys_sync() after the userspace has been frozen
> may not be sufficient?

No, that's definitely not enough. You need freeze_bdev() to make sure
data is on disk in the place it's expected by the filesystem without
starting a log recovery.
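
To make the distinction concrete, here is a minimal sketch (not part of any
patch in this thread) of how a caller uses the freeze_bdev()/thaw_bdev() pair
from <linux/buffer_head.h>; the prototypes are the ones quoted later in the
thread, and 'bdev' is a hypothetical block device supplied by the caller:

#include <linux/buffer_head.h>

static void snapshot_one_bdev(struct block_device *bdev)
{
	struct super_block *sb;

	sb = freeze_bdev(bdev);		/* flush data and metadata, block new writers */
	/* ... take the atomic copy here; nothing can dirty bdev ... */
	thaw_bdev(bdev, sb);		/* let the filesystem run again */
}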

2006-10-24 17:09:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tue, Oct 24, 2006 at 04:20:31PM +0000, Oleg Verych wrote:
> > Do you mean calling sys_sync() after the userspace has been frozen
> > may not be sufficient?
>
> Please see
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=317479;msg=105;att=0>
>
> it's bottom of
> <http://bugs.debian.org/317479>
>
> IMHO it's may be helpful.

It's not. It's a step in the wrong direction. The only way to guarantee
a consistent on-disk state for a filesystem (not just xfs, _any_ filesystem -
xfs is just the most sensitive) is to call the write_super_lockfs method.
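
For readers who have not seen the hook: a hedged sketch (the examplefs_* names
and helpers are made up, not from any real filesystem) of the super_operations
methods involved. freeze_bdev() invokes ->write_super_lockfs() so the
filesystem can checkpoint its metadata to its final on-disk location and block
further transactions, and thaw_bdev() invokes ->unlockfs() to undo it:

static void examplefs_write_super_lockfs(struct super_block *sb)
{
	/* push journalled metadata to its home blocks and stop new transactions */
	examplefs_checkpoint_and_lock_journal(sb);
}

static void examplefs_unlockfs(struct super_block *sb)
{
	/* allow transactions again */
	examplefs_unlock_journal(sb);
}

static struct super_operations examplefs_sops = {
	/* ... the usual methods ... */
	.write_super_lockfs	= examplefs_write_super_lockfs,
	.unlockfs		= examplefs_unlockfs,
};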

2006-10-24 19:11:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tuesday, 24 October 2006 19:06, Christoph Hellwig wrote:
> On Tue, Oct 24, 2006 at 05:29:59PM +0200, Rafael J. Wysocki wrote:
> > Do you mean calling sys_sync() after the userspace has been frozen
> > may not be sufficient?
>
> No, that's definitly not enough. You need to freeze_bdev to make sure
> data is on disk in the place it's expected by the filesystem without
> starting a log recovery.

Thanks for the clarification.

Then it looks like we need the freezing of bdevs. Pavel?


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-24 20:16:42

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> XFS can continue to submit I/O from a timer routine, even after
> freezeable kernel and userspace threads are frozen. This doesn't seem to
> be an issue for current swsusp code, but is definitely an issue for
> Suspend2, where the pages being written could be overwritten by
> Suspend2's atomic copy.
>
> We can address this issue by freezing bdevs after stopping userspace
> threads, and thawing them prior to thawing userspace.

Okay, it turns out we'll probably need this, so a couple of comments follow.

> Signed-off-by: Nigel Cunningham <[email protected]>
>
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 4a001fe..ddeeb50 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -13,6 +13,7 @@ #include <linux/interrupt.h>
> #include <linux/suspend.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/buffer_head.h>
> #include <linux/freezer.h>
>
> /*
> @@ -20,6 +21,58 @@ #include <linux/freezer.h>
> */
> #define TIMEOUT (20 * HZ)
>
> +struct frozen_fs
> +{
> + struct list_head fsb_list;
> + struct super_block *sb;
> +};
> +
> +LIST_HEAD(frozen_fs_list);
> +
> +void freezer_make_fses_rw(void)

I'd call it thaw_filesystems()

> +{
> + struct frozen_fs *fs, *next_fs;
> +
> + list_for_each_entry_safe(fs, next_fs, &frozen_fs_list, fsb_list) {
> + thaw_bdev(fs->sb->s_bdev, fs->sb);
> +
> + list_del(&fs->fsb_list);
> + kfree(fs);
> + }
> +}
> +
> +/*
> + * Done after userspace is frozen, so there should be no danger of
> + * fses being unmounted while we're in here.
> + */
> +int freezer_make_fses_ro(void)

I'd call it freeze_filesystems()

> +{
> + struct frozen_fs *fs;
> + struct super_block *sb;
> +
> + /* Generate the list */
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (!sb->s_root || !sb->s_bdev ||
> + (sb->s_frozen == SB_FREEZE_TRANS) ||
> + (sb->s_flags & MS_RDONLY))
> + continue;
> +
> + fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
> + if (!fs)
> + return 1;

return -ENOMEM

> + fs->sb = sb;
> + list_add_tail(&fs->fsb_list, &frozen_fs_list);
> + };
> +
> + /* Do the freezing in reverse order so filesystems dependant
> + * upon others are frozen in the right order. (Eg loopback
> + * on ext3). */
> + list_for_each_entry_reverse(fs, &frozen_fs_list, fsb_list)
> + freeze_bdev(fs->sb->s_bdev);
> +
> + return 0;
> +}
> +
>
> static inline int freezeable(struct task_struct * p)
> {
> @@ -119,7 +172,7 @@ int freeze_processes(void)
> read_unlock(&tasklist_lock);
> todo += nr_user;
> if (!user_frozen && !nr_user) {
> - sys_sync();
> + freezer_make_fses_ro();
> start_time = jiffies;
> }
> user_frozen = !nr_user;
> @@ -174,6 +227,12 @@ void thaw_some_processes(int all)
> "Strange, %s not stopped\n", p->comm );
> } while_each_thread(g, p);
>
> + if (!pass) {
> + read_unlock(&tasklist_lock);
> + freezer_make_fses_rw();
> + read_lock(&tasklist_lock);
> + }
> +
> pass++;
> } while (pass < 2 && all);
>
>
>
>
>

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-24 20:38:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

> +/*
> + * Done after userspace is frozen, so there should be no danger of
> + * fses being unmounted while we're in here.
> + */
> +int freezer_make_fses_ro(void)

This should be called freeze_filesystems() or something along the line.
I also wonder whether it should be next to freeze_bdev instead of
in the suspend code.

> +{
> + struct frozen_fs *fs;
> + struct super_block *sb;
> +
> + /* Generate the list */
> + list_for_each_entry(sb, &super_blocks, s_list) {
> + if (!sb->s_root || !sb->s_bdev ||
> + (sb->s_frozen == SB_FREEZE_TRANS) ||
> + (sb->s_flags & MS_RDONLY))
> + continue;
> +
> + fs = kmalloc(sizeof(struct frozen_fs), GFP_ATOMIC);
> + if (!fs)
> + return 1;
> + fs->sb = sb;
> + list_add_tail(&fs->fsb_list, &frozen_fs_list);
> + };
> +
> + /* Do the freezing in reverse order so filesystems dependant
> + * upon others are frozen in the right order. (Eg loopback
> + * on ext3). */
> + list_for_each_entry_reverse(fs, &frozen_fs_list, fsb_list)
> + freeze_bdev(fs->sb->s_bdev);

I'd rather avoid this local list and operate directly on the super_blocks
list. To do that we'd need another flag in the superblock to flag a
filesystem as suspended by this routine, which seems just fine.

void freeze_filesystems(void)
{
	struct super_block *sb;

	/*
	 * Freeze in reverse order so filesystems dependent
	 * upon others are frozen in the right order.
	 * (E.g. loopback on ext3).
	 */
	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
		if (!sb->s_root || !sb->s_bdev ||
		    (sb->s_frozen == SB_FREEZE_TRANS) ||
		    (sb->s_flags & MS_RDONLY))
			continue;
		freeze_bdev(sb->s_bdev);
		sb->s_flags |= MS_SUSPENDED;	/* XXX find protection for s_flags */
	}
}
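
A matching thaw helper along the same lines (essentially what the patch posted
later in this thread ends up doing, with MS_FROZEN instead of the MS_SUSPENDED
name used above) might look like:

void thaw_filesystems(void)
{
	struct super_block *sb;

	list_for_each_entry(sb, &super_blocks, s_list) {
		if (sb->s_flags & MS_SUSPENDED) {
			sb->s_flags &= ~MS_SUSPENDED;
			thaw_bdev(sb->s_bdev, sb);
		}
	}
}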

2006-10-24 21:26:58

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

On Tue 2006-10-24 18:06:33, Christoph Hellwig wrote:
> On Tue, Oct 24, 2006 at 05:29:59PM +0200, Rafael J. Wysocki wrote:
> > Do you mean calling sys_sync() after the userspace has been frozen
> > may not be sufficient?
>
> No, that's definitly not enough. You need to freeze_bdev to make sure
> data is on disk in the place it's expected by the filesystem without
> starting a log recovery.

I believe log recovery is okay in this case.

It can only happen when the kernel dies during suspend or during
resume... And log recovery seems okay in that case. We even guarantee
that the user did not lose any data -- by using sys_sync() after userland
is stopped -- but let's not overdo the protections.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-24 21:33:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tue, Oct 24, 2006 at 11:26:48PM +0200, Pavel Machek wrote:
> > No, that's definitly not enough. You need to freeze_bdev to make sure
> > data is on disk in the place it's expected by the filesystem without
> > starting a log recovery.
>
> I believe log recovery is okay in this case.
>
> It can only happen when kernel dies during suspend or during
> resume... And log recovery seems okay in that case. We even guarantee
> that user did not loose any data -- by using sys_sync() after userland
> is stopped -- but let's not overdo over protections.

You're still entirely missing the problem.

Take a look at http://www.opengroup.org/onlinepubs/007908799/xsh/sync.html
and the linux sync(2) manpage. The only thing sync guarantees is writing
out all in-memory data to disk. It doesn't even guarantee completion,
although we've been synchronous in Linux for a while.

What it does not guarantee is where on disk the data is located. Now for
a journaling filesystem pushing everything to the log is the easiest way
to complete sync, and it's perfectly valid - if the system crashes after
the sync and before data is written back to its normal place on disk
the system notices it's not been unmounted cleanly and will do a log
recovery. In the suspend case, however, the system neither crashes nor
is unmounted - thus the filesystem doesn't know it has to recover the
log. We have two choices to fix this:

(1) force a log recovery of an already mounted and in use filesystem
(2) make sure data is in the right place before suspending

(1) is pretty nasty, and hard to do across filesystems. (2) is already
implemented and easily usable by the suspend code.
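
For reference, a minimal sketch of where option (2) ends up hooking into the
freezer, based on the hunks proposed in this thread (freeze_filesystems() is
the helper those patches add in place of the sys_sync() call); this is a
paraphrase, not a verbatim hunk:

	/* kernel/power/process.c, inside freeze_processes() */
	if (!user_frozen && !nr_user) {
		/*
		 * Userspace is now frozen.  Instead of sys_sync(), freeze
		 * every writable bdev-backed filesystem so its metadata is
		 * already in its final on-disk place when the atomic
		 * snapshot is taken.
		 */
		freeze_filesystems();
		start_time = jiffies;
	}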

2006-10-24 21:38:03

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > Do you mean calling sys_sync() after the userspace has been frozen
> > may not be sufficient?
>
> In most cases it probably is, but sys_sync() doesn't provide any
> guarantees that the filesystem is not being used or written to after
> it completes. Given that every so often I hear about an XFS filesystem
> that was corrupted by suspend, I don't think this is sufficient...

Userspace is frozen. There's no one that can write to the XFS
filesystem.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-24 21:43:40

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

On Tue 2006-10-24 22:33:42, Christoph Hellwig wrote:
> On Tue, Oct 24, 2006 at 11:26:48PM +0200, Pavel Machek wrote:
> > > No, that's definitly not enough. You need to freeze_bdev to make sure
> > > data is on disk in the place it's expected by the filesystem without
> > > starting a log recovery.
> >
> > I believe log recovery is okay in this case.
> >
> > It can only happen when kernel dies during suspend or during
> > resume... And log recovery seems okay in that case. We even guarantee
> > that user did not loose any data -- by using sys_sync() after userland
> > is stopped -- but let's not overdo over protections.
>
> You're still entirely missing the problem.
>
> Take a look at http://www.opengroup.org/onlinepubs/007908799/xsh/sync.html
> and the linux sync(2) manpage. The only thing sync guarantees is writing
> out all in-memory data to disk. It doesn't even gurantee completion,
> although we've been synchronous in Linux for a while.

Ok, I assume sys_sync is synchronous, but that's okay.

> What it does not gurantee is where on disk the data is located. Now for
> a journaling filesystem pushing everything to the log is the easiest way
> to complete sync, and it's perfectly valid - if the system crashes after
> the sync and before data is written back to it's normal place on disk
> the system notices it's not been unmounted cleanly and will do a log
> recovery. In the suspend case however the system neither crashes nor
> is unmounted - thus the filesystem doesn't know it has to recover the
> log. We have to choices to fix this:
>
> (1) force a log recovery of an already mounted and in use filesystem
> (2) make sure data is in the right place before suspending
>
> (1) is pretty nasty, and hard to do across filesystems. (2) is already
> implemented and easily useable by the suspend code.

No, there's no need to do either (1) or (2) in the "machine suspended and
resumed successfully" case. In that case, the machine just continues as if no
suspend had happened.

In fact I could remove sys_sync() from the freezer and the suspend code would
still be correct.

That sys_sync() only matters in the case where we suspend but the machine dies
during resume... and in that case we know we crashed, and journal
recovery is okay.

I do know how journaling works (from 10,000 feet, anyway), and it is
okay in this case.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-24 22:17:13

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Tue, 2006-10-24 at 22:16 +0200, Rafael J. Wysocki wrote:
> On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > XFS can continue to submit I/O from a timer routine, even after
> > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > be an issue for current swsusp code, but is definitely an issue for
> > Suspend2, where the pages being written could be overwritten by
> > Suspend2's atomic copy.
> >
> > We can address this issue by freezing bdevs after stopping userspace
> > threads, and thawing them prior to thawing userspace.
>
> Okay, it turns out we'll probably need this, so a couple of comments follow.

Ack. Revised version will come when I find/make the time to do it :)
Feel free to prod :>

Nigel

2006-10-24 22:19:29

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi David.

On Wed, 2006-10-25 at 00:44 +1000, David Chinner wrote:
> On Mon, Oct 23, 2006 at 12:36:53PM +0200, Rafael J. Wysocki wrote:
> > On Monday, 23 October 2006 06:12, Nigel Cunningham wrote:
> > > XFS can continue to submit I/O from a timer routine, even after
> > > freezeable kernel and userspace threads are frozen. This doesn't seem to
> > > be an issue for current swsusp code,
> >
> > So it doesn't look like we need the patch _now_.
> >
> > > but is definitely an issue for Suspend2, where the pages being written could
> > > be overwritten by Suspend2's atomic copy.
> >
> > And IMO that's a good reason why we shouldn't use RCU pages for storing the
> > image. XFS is one known example that breaks things if we do so and
> > there may be more such things that we don't know of. The fact that they
> > haven't appeared in testing so far doesn't mean they don't exist and
> > moreover some things like that may appear in the future.
>
> Could you please tell us which XFS bits are broken so we can get
> them fixed? The XFS daemons should all be checking if they are
> supposed to freeze (i.e. they call try_to_freeze() after they wake
> up due to timer expiry) so I thought they were doing the right
> thing.

The problem (in my experience) isn't the threads but a timer that
submits I/O even when the threads are frozen. It stops when the bdev is
frozen. The last report I've seen was before I added bdev freezing to
suspend2, which was 2.6.14, so you guys may have fixed it since then.

I can seek to get a trace if you like.

Regards,

Nigel

2006-10-25 00:14:15

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Tue, Oct 24, 2006 at 11:37:37PM +0200, Pavel Machek wrote:
> Hi!
>
> > > Do you mean calling sys_sync() after the userspace has been frozen
> > > may not be sufficient?
> >
> > In most cases it probably is, but sys_sync() doesn't provide any
> > guarantees that the filesystem is not being used or written to after
> > it completes. Given that every so often I hear about an XFS filesystem
> > that was corrupted by suspend, I don't think this is sufficient...
>
> Userspace is frozen. There's noone that can write to the XFS
> filesystem.

Sure, no new userspace processes can write data, but what about the
internal state of the filesystem?

All a sync guarantees is that the filesystem is consistent when the
sync returns and XFS provides this guarantee by writing all data and
ensuring all metadata changes are logged so if a crash occurs it can
be recovered (which provides the sync guarantee). Hence after a
sys_sync(), XFS will still have lots of dirty metadata that needs to
be written to disk at some time in the future so the transactions
can be removed from the log.

This dirty metadata can be flushed at any time, and the dirty state
is kept in XFS structures and not always in page structures (think
multipage metadata buffers). Hence I cannot see how suspend can
guarantee that it has saved all the dirty data in XFS, nor
restore it correctly on resume. Once you toss dirty metadata that
is currently in the log, further operations will result in that log
transaction being overwritten without it ever being written to disk.
That then means any subsequent operations after resume will corrupt
the filesystem....

Hence the only way to correctly rebuild the XFS state on resume is
to quiesce the filesystem on suspend and thaw it on resume so as to
trigger log recovery.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 08:10:14

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > > > Do you mean calling sys_sync() after the userspace has been frozen
> > > > may not be sufficient?
> > >
> > > In most cases it probably is, but sys_sync() doesn't provide any
> > > guarantees that the filesystem is not being used or written to after
> > > it completes. Given that every so often I hear about an XFS filesystem
> > > that was corrupted by suspend, I don't think this is sufficient...
> >
> > Userspace is frozen. There's noone that can write to the XFS
> > filesystem.
>
> Sure, no new userspace processes can write data, but what about the
> internal state of the filesystem?
>
> All a sync guarantees is that the filesystem is consistent when the
> sync returns and XFS provides this guarantee by writing all data and
> ensuring all metadata changes are logged so if a crash occurs it can
> be recovered (which provides the sync guarantee). hence after a
> sys_sync(), XFS will still have lots of dirty metadata that needs to
> be written to disk at some time in the future so the transactions
> can be removed from the log.
>
> This dirty metadata can be flushed at any time, and the dirty state
> is kept in XFS structures and not always in page structures (think
> multipage metadata buffers). Hence I cannot see how suspend can
> guarantee that it has saved all the dirty data in XFS, nor
> restore it correctly on resume. Once you toss dirty metadata that
> is currently in the log, further operations will result in that log
> transaction being overwritten without it ever being written to disk.
> That then means any subsequent operations after resume will corrupt
> the filesystem....
>
> Hence the only way to correctly rebuild the XFS state on resume is
> to quiesce the filesystem on suspend and thaw it on resume so as to
> trigger log recovery.

No, during suspend/resume the memory image is saved, and no state is
lost. We would not even have to do sys_sync(), and suspend/resume
would still work properly.

sys_sync() is there only to limit damage in case of suspend/resume
failure.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-25 08:14:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wednesday, 25 October 2006 02:13, David Chinner wrote:
> On Tue, Oct 24, 2006 at 11:37:37PM +0200, Pavel Machek wrote:
> > Hi!
> >
> > > > Do you mean calling sys_sync() after the userspace has been frozen
> > > > may not be sufficient?
> > >
> > > In most cases it probably is, but sys_sync() doesn't provide any
> > > guarantees that the filesystem is not being used or written to after
> > > it completes. Given that every so often I hear about an XFS filesystem
> > > that was corrupted by suspend, I don't think this is sufficient...
> >
> > Userspace is frozen. There's noone that can write to the XFS
> > filesystem.
>
> Sure, no new userspace processes can write data, but what about the
> internal state of the filesystem?
>
> All a sync guarantees is that the filesystem is consistent when the
> sync returns and XFS provides this guarantee by writing all data and
> ensuring all metadata changes are logged so if a crash occurs it can
> be recovered (which provides the sync guarantee). hence after a
> sys_sync(), XFS will still have lots of dirty metadata that needs to
> be written to disk at some time in the future so the transactions
> can be removed from the log.
>
> This dirty metadata can be flushed at any time, and the dirty state
> is kept in XFS structures and not always in page structures (think
> multipage metadata buffers).

Are the dirty metadata flushed by a kernel thread?

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-25 08:39:24

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wed, Oct 25, 2006 at 10:10:01AM +0200, Pavel Machek wrote:
> > Hence the only way to correctly rebuild the XFS state on resume is
> > to quiesce the filesystem on suspend and thaw it on resume so as to
> > trigger log recovery.
>
> No, during suspend/resume, memory image is saved, and no state is
> lost. We would not even have to do sys_sync(), and suspend/resume
> would still work properly.

It seems to me that you ensure the filesystem is synced to disk and
then at some point later you record the memory state of the
filesystem, but these happen at different times. That leaves a
window for things to get out of sync again, right?

Cheers,

Dave.

--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-25 08:47:24

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wed 2006-10-25 18:38:30, David Chinner wrote:
> On Wed, Oct 25, 2006 at 10:10:01AM +0200, Pavel Machek wrote:
> > > Hence the only way to correctly rebuild the XFS state on resume is
> > > to quiesce the filesystem on suspend and thaw it on resume so as to
> > > trigger log recovery.
> >
> > No, during suspend/resume, memory image is saved, and no state is
> > lost. We would not even have to do sys_sync(), and suspend/resume
> > would still work properly.
>
> It seems to me that you ensure the filesystem is synced to disk and
> then at some point later you record the memory state of the
> filesystem, but these happen at different times. That leaves a
> window for things to get out of sync again, right?

I DO NOT HAVE TO ENSURE THE FILESYSTEM IS SYNCED. That sys_sync() is
optional.

Recording of the memory state is atomic, and as long as no one writes to
the disk after the atomic snapshot, the memory image matches what is on disk.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-25 12:33:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wednesday, 25 October 2006 10:47, Pavel Machek wrote:
> On Wed 2006-10-25 18:38:30, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 10:10:01AM +0200, Pavel Machek wrote:
> > > > Hence the only way to correctly rebuild the XFS state on resume is
> > > > to quiesce the filesystem on suspend and thaw it on resume so as to
> > > > trigger log recovery.
> > >
> > > No, during suspend/resume, memory image is saved, and no state is
> > > lost. We would not even have to do sys_sync(), and suspend/resume
> > > would still work properly.
> >
> > It seems to me that you ensure the filesystem is synced to disk and
> > then at some point later you record the memory state of the
> > filesystem, but these happen at different times. That leaves a
> > window for things to get out of sync again, right?
>
> I DO NOT HAVE TO ENSURE FILESYSTEM IS SYNCED. That sys_sync() is
> optional.
>
> Recording of memory state is atomic, and as long as noone writes to
> the disk after atomic snapshot, memory image matches what is on disk.

Well, my impression is that this is exactly what happens here: Something
in the XFS code causes metadata to be written to disk _after_ the atomic
snapshot.

That's why I asked if the dirty XFS metadata were flushed by a kernel thread.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-25 13:23:45

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Wed, 2006-10-25 at 14:32 +0200, Rafael J. Wysocki wrote:
> On Wednesday, 25 October 2006 10:47, Pavel Machek wrote:
> > On Wed 2006-10-25 18:38:30, David Chinner wrote:
> > > On Wed, Oct 25, 2006 at 10:10:01AM +0200, Pavel Machek wrote:
> > > > > Hence the only way to correctly rebuild the XFS state on resume is
> > > > > to quiesce the filesystem on suspend and thaw it on resume so as to
> > > > > trigger log recovery.
> > > >
> > > > No, during suspend/resume, memory image is saved, and no state is
> > > > lost. We would not even have to do sys_sync(), and suspend/resume
> > > > would still work properly.
> > >
> > > It seems to me that you ensure the filesystem is synced to disk and
> > > then at some point later you record the memory state of the
> > > filesystem, but these happen at different times. That leaves a
> > > window for things to get out of sync again, right?
> >
> > I DO NOT HAVE TO ENSURE FILESYSTEM IS SYNCED. That sys_sync() is
> > optional.
> >
> > Recording of memory state is atomic, and as long as noone writes to
> > the disk after atomic snapshot, memory image matches what is on disk.
>
> Well, my impression is that this is exactly what happens here: Something
> in the XFS code causes metadata to be written to disk _after_ the atomic
> snapshot.
>
> That's why I asked if the dirty XFS metadata were flushed by a kernel thread.

When I first added bdev freezing it was because there was an XFS timer
doing writes.

Regards,

Nigel

2006-10-25 19:07:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wednesday, 25 October 2006 15:23, Nigel Cunningham wrote:
> Hi.
>
> On Wed, 2006-10-25 at 14:32 +0200, Rafael J. Wysocki wrote:
> > On Wednesday, 25 October 2006 10:47, Pavel Machek wrote:
> > > On Wed 2006-10-25 18:38:30, David Chinner wrote:
> > > > On Wed, Oct 25, 2006 at 10:10:01AM +0200, Pavel Machek wrote:
> > > > > > Hence the only way to correctly rebuild the XFS state on resume is
> > > > > > to quiesce the filesystem on suspend and thaw it on resume so as to
> > > > > > trigger log recovery.
> > > > >
> > > > > No, during suspend/resume, memory image is saved, and no state is
> > > > > lost. We would not even have to do sys_sync(), and suspend/resume
> > > > > would still work properly.
> > > >
> > > > It seems to me that you ensure the filesystem is synced to disk and
> > > > then at some point later you record the memory state of the
> > > > filesystem, but these happen at different times. That leaves a
> > > > window for things to get out of sync again, right?
> > >
> > > I DO NOT HAVE TO ENSURE FILESYSTEM IS SYNCED. That sys_sync() is
> > > optional.
> > >
> > > Recording of memory state is atomic, and as long as noone writes to
> > > the disk after atomic snapshot, memory image matches what is on disk.
> >
> > Well, my impression is that this is exactly what happens here: Something
> > in the XFS code causes metadata to be written to disk _after_ the atomic
> > snapshot.
> >
> > That's why I asked if the dirty XFS metadata were flushed by a kernel thread.
>
> When I first added bdev freezing it was because there was an XFS timer
> doing writes.

Yes, I noticed you said that, but I'd like someone from the XFS team to either
confirm or deny it.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-26 07:31:18

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Wed, Oct 25, 2006 at 09:05:56PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, 25 October 2006 15:23, Nigel Cunningham wrote:
> > >
> > > Well, my impression is that this is exactly what happens here: Something
> > > in the XFS code causes metadata to be written to disk _after_ the atomic
> > > snapshot.
> > >
> > > That's why I asked if the dirty XFS metadata were flushed by a kernel thread.
> >
> > When I first added bdev freezing it was because there was an XFS timer
> > doing writes.
>
> Yes, I noticed you said that, but I'd like someone from the XFS team to either
> confirm or deny it.

We have daemons running in the background that can definitely do stuff
after a sync. hmm - one does try_to_freeze() after a wakeup, the
other does:

	if (unlikely(freezing(current))) {
		set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
		refrigerator();
	} else {
		clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
	}

before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
can definitely wake up after the sync and do work, and the other could if
the kernel thread freeze occurs after the sync.

Another good question at this point - exactly how should we be putting
these threads to sleep? Are both of these valid methods for freezing them?
And should we be freezing when we wake up instead of before we go to
sleep? i.e. what are the rules we are supposed to be following?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
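
As a reference point for the question above about the rules: a hedged sketch
of the pattern freezeable kernel threads of this era are expected to follow
(try_to_freeze() from <linux/freezer.h>, checked once per loop iteration
before doing more work; exampled() and example_do_work() are made-up names):

#include <linux/freezer.h>
#include <linux/kthread.h>

static int exampled(void *data)
{
	while (!kthread_should_stop()) {
		/* Park in the refrigerator if a freeze has been requested. */
		try_to_freeze();

		example_do_work();	/* hypothetical periodic work */

		schedule_timeout_interruptible(HZ);
	}
	return 0;
}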

2006-10-26 08:18:35

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi Dave.

On Thu, 2006-10-26 at 17:30 +1000, David Chinner wrote:
> On Wed, Oct 25, 2006 at 09:05:56PM +0200, Rafael J. Wysocki wrote:
> > On Wednesday, 25 October 2006 15:23, Nigel Cunningham wrote:
> > > >
> > > > Well, my impression is that this is exactly what happens here: Something
> > > > in the XFS code causes metadata to be written to disk _after_ the atomic
> > > > snapshot.
> > > >
> > > > That's why I asked if the dirty XFS metadata were flushed by a kernel thread.
> > >
> > > When I first added bdev freezing it was because there was an XFS timer
> > > doing writes.
> >
> > Yes, I noticed you said that, but I'd like someone from the XFS team to either
> > confirm or deny it.
>
> We have daemons running in the background that can definitely do stuff
> after a sync. hmm - one does try_to_freeze() after a wakeup, the
> other does:
>
> if (unlikely(freezing(current))) {
> set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> refrigerator();
> } else {
> clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> }
>
> before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
> can definitely wake up after the sync and do work, and the other could if
> the kernel thread freeze occurs after the sync.
>
> Another good question at this point - exactly how should we be putting
> these thread to to sleep? Are both these valid methods for freezing them?
> And should we be freezing when we wake up instead of before we go to
> sleep? i.e. what are teh rules we are supposed to be following?

As you have them at the moment, the threads seem to be freezing fine.
The issue I've seen in the past related not to threads but to timer
based activity. Admittedly it was 2.6.14 when I last looked at it, but
there used to be a possibility for XFS to submit I/O from a timer when
the threads are frozen but the bdev isn't frozen. Has that changed?

Regards,

Nigel

2006-10-26 08:49:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Thursday, 26 October 2006 10:18, Nigel Cunningham wrote:
> Hi Dave.
>
> On Thu, 2006-10-26 at 17:30 +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 09:05:56PM +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, 25 October 2006 15:23, Nigel Cunningham wrote:
> > > > >
> > > > > Well, my impression is that this is exactly what happens here: Something
> > > > > in the XFS code causes metadata to be written to disk _after_ the atomic
> > > > > snapshot.
> > > > >
> > > > > That's why I asked if the dirty XFS metadata were flushed by a kernel thread.
> > > >
> > > > When I first added bdev freezing it was because there was an XFS timer
> > > > doing writes.
> > >
> > > Yes, I noticed you said that, but I'd like someone from the XFS team to either
> > > confirm or deny it.
> >
> > We have daemons running in the background that can definitely do stuff
> > after a sync. hmm - one does try_to_freeze() after a wakeup, the
> > other does:
> >
> > if (unlikely(freezing(current))) {
> > set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > refrigerator();
> > } else {
> > clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > }
> >
> > before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
> > can definitely wake up after the sync and do work,

Once it's entered the refrigerator, it won't return from it until the
tasks are thawed (usually during the resume).

> > and the other could if the kernel thread freeze occurs after the sync.

Even if it does anything after the sync, it's okay as long as that's before we
create the image. And it is, because both threads are frozen before we do it.

> > Another good question at this point - exactly how should we be putting
> > these thread to to sleep? Are both these valid methods for freezing them?
> > And should we be freezing when we wake up instead of before we go to
> > sleep? i.e. what are teh rules we are supposed to be following?
>
> As you have them at the moment, the threads seem to be freezing fine.

Yes. IMO they are freezing fine.

In fact the suspend would fail if they didn't freeze unless one of them had
PF_NOFREEZE set.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-26 08:58:14

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi Nigel,

On Thu, Oct 26, 2006 at 06:18:29PM +1000, Nigel Cunningham wrote:
> On Thu, 2006-10-26 at 17:30 +1000, David Chinner wrote:
> > We have daemons running in the background that can definitely do stuff
> > after a sync. hmm - one does try_to_freeze() after a wakeup, the
> > other does:
> >
> > if (unlikely(freezing(current))) {
> > set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > refrigerator();
> > } else {
> > clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > }
> >
> > before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
> > can definitely wake up after the sync and do work, and the other could if
> > the kernel thread freeze occurs after the sync.
> >
> > Another good question at this point - exactly how should we be putting
> > these thread to to sleep? Are both these valid methods for freezing them?
> > And should we be freezing when we wake up instead of before we go to
> > sleep? i.e. what are teh rules we are supposed to be following?
>
> As you have them at the moment, the threads seem to be freezing fine.
> The issue I've seen in the past related not to threads but to timer
> based activity. Admittedly it was 2.6.14 when I last looked at it, but
> there used to be a possibility for XFS to submit I/O from a timer when
> the threads are frozen but the bdev isn't frozen. Has that changed?

I didn't think we've ever done that - periodic or delayed operations
are passed off to the kernel threads to execute. A stack trace
(if you still have it) would really help here.

Hmmm - we have a couple of per-cpu work queues as well that are
used on I/O completion and that can, in some circumstances,
trigger new transactions. If we are only flushing metadata, then
I don't think that any more I/O will be issued, but I could be
wrong (maze of twisty passages).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-26 09:09:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Thursday, 26 October 2006 10:18, Nigel Cunningham wrote:
> Hi Dave.
>
> On Thu, 2006-10-26 at 17:30 +1000, David Chinner wrote:
> > On Wed, Oct 25, 2006 at 09:05:56PM +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, 25 October 2006 15:23, Nigel Cunningham wrote:
> > > > >
> > > > > Well, my impression is that this is exactly what happens here: Something
> > > > > in the XFS code causes metadata to be written to disk _after_ the atomic
> > > > > snapshot.
> > > > >
> > > > > That's why I asked if the dirty XFS metadata were flushed by a kernel thread.
> > > >
> > > > When I first added bdev freezing it was because there was an XFS timer
> > > > doing writes.
> > >
> > > Yes, I noticed you said that, but I'd like someone from the XFS team to either
> > > confirm or deny it.
> >
> > We have daemons running in the background that can definitely do stuff
> > after a sync. hmm - one does try_to_freeze() after a wakeup, the
> > other does:
> >
> > if (unlikely(freezing(current))) {
> > set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > refrigerator();
> > } else {
> > clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > }
> >
> > before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
> > can definitely wake up after the sync and do work, and the other could if
> > the kernel thread freeze occurs after the sync.
> >
> > Another good question at this point - exactly how should we be putting
> > these thread to to sleep? Are both these valid methods for freezing them?
> > And should we be freezing when we wake up instead of before we go to
> > sleep? i.e. what are teh rules we are supposed to be following?
>
> As you have them at the moment, the threads seem to be freezing fine.
> The issue I've seen in the past related not to threads but to timer
> based activity. Admittedly it was 2.6.14 when I last looked at it, but
> there used to be a possibility for XFS to submit I/O from a timer when
> the threads are frozen but the bdev isn't frozen.

Also there may be a problem if a workqueue is used for that, because
worker_threads run with PF_NOFREEZE set.

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller
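
To make the PF_NOFREEZE point concrete, a paraphrased sketch of the freezer's
eligibility test in kernel/power/process.c (from memory of the 2.6.19-era
code, so details may differ): any task carrying PF_NOFREEZE, including the
generic workqueue worker threads, is simply skipped, so work items queued by a
filesystem keep running after every freezeable task has been frozen.

static inline int freezeable(struct task_struct *p)
{
	if ((p == current) ||
	    (p->flags & PF_NOFREEZE) ||		/* e.g. worker_thread() */
	    (p->exit_state == EXIT_ZOMBIE) ||
	    (p->exit_state == EXIT_DEAD))
		return 0;
	return 1;
}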

2006-10-26 09:12:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi,

On Thursday, 26 October 2006 10:57, David Chinner wrote:
> Hi Nigel,
>
> On Thu, Oct 26, 2006 at 06:18:29PM +1000, Nigel Cunningham wrote:
> > On Thu, 2006-10-26 at 17:30 +1000, David Chinner wrote:
> > > We have daemons running in the background that can definitely do stuff
> > > after a sync. hmm - one does try_to_freeze() after a wakeup, the
> > > other does:
> > >
> > > if (unlikely(freezing(current))) {
> > > set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > > refrigerator();
> > > } else {
> > > clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> > > }
> > >
> > > before it goes to sleep. So that one (xfsbufd - metadata buffer flushing)
> > > can definitely wake up after the sync and do work, and the other could if
> > > the kernel thread freeze occurs after the sync.
> > >
> > > Another good question at this point - exactly how should we be putting
> > > these thread to to sleep? Are both these valid methods for freezing them?
> > > And should we be freezing when we wake up instead of before we go to
> > > sleep? i.e. what are teh rules we are supposed to be following?
> >
> > As you have them at the moment, the threads seem to be freezing fine.
> > The issue I've seen in the past related not to threads but to timer
> > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > there used to be a possibility for XFS to submit I/O from a timer when
> > the threads are frozen but the bdev isn't frozen. Has that changed?
>
> I didn't think we've ever done that - periodic or delayed operations
> are passed off to the kernel threads to execute. A stack trace
> (if you still have it) would be really help here.
>
> Hmmm - we have a couple of per-cpu work queues as well that are
> used on I/O completion and that can, in some circumstances,
> trigger new transactions. If we are only flush metadata, then
> I don't think that any more I/o will be issued, but I could be
> wrong (maze of twisty passages).

Well, I think this is exactly the problem, because worker_threads run with
PF_NOFREEZE set (as I've just said in another message).

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-26 09:19:10

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Thu, 2006-10-26 at 18:57 +1000, David Chinner wrote:
> I didn't think we've ever done that - periodic or delayed operations
> are passed off to the kernel threads to execute. A stack trace
> (if you still have it) would be really help here.

I don't, but I know how to get it again. Will give it a go shortly.

> Hmmm - we have a couple of per-cpu work queues as well that are
> used on I/O completion and that can, in some circumstances,
> trigger new transactions. If we are only flush metadata, then
> I don't think that any more I/o will be issued, but I could be
> wrong (maze of twisty passages).

I can understand that. I'm trying to learn Xgl programming at the mo :)

Nigel

2006-10-27 01:38:46

by David Chinner

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Thu, Oct 26, 2006 at 11:11:29AM +0200, Rafael J. Wysocki wrote:
> On Thursday, 26 October 2006 10:57, David Chinner wrote:
> > On Thu, Oct 26, 2006 at 06:18:29PM +1000, Nigel Cunningham wrote:
> > > As you have them at the moment, the threads seem to be freezing fine.
> > > The issue I've seen in the past related not to threads but to timer
> > > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > > there used to be a possibility for XFS to submit I/O from a timer when
> > > the threads are frozen but the bdev isn't frozen. Has that changed?
> >
> > I didn't think we've ever done that - periodic or delayed operations
> > are passed off to the kernel threads to execute. A stack trace
> > (if you still have it) would be really help here.
> >
> > Hmmm - we have a couple of per-cpu work queues as well that are
> > used on I/O completion and that can, in some circumstances,
> > trigger new transactions. If we are only flush metadata, then
> > I don't think that any more I/o will be issued, but I could be
> > wrong (maze of twisty passages).
>
> Well, I think this exactly is the problem, because worker_threads run with
> PF_NOFREEZE set (as I've just said in another message).

Ok, so freezing the filesystem is the only way you can prevent
this as the workqueues are flushed as part of quiescing the filesystem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-27 14:38:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

On Friday, 27 October 2006 03:38, David Chinner wrote:
> On Thu, Oct 26, 2006 at 11:11:29AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, 26 October 2006 10:57, David Chinner wrote:
> > > On Thu, Oct 26, 2006 at 06:18:29PM +1000, Nigel Cunningham wrote:
> > > > As you have them at the moment, the threads seem to be freezing fine.
> > > > The issue I've seen in the past related not to threads but to timer
> > > > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > > > there used to be a possibility for XFS to submit I/O from a timer when
> > > > the threads are frozen but the bdev isn't frozen. Has that changed?
> > >
> > > I didn't think we've ever done that - periodic or delayed operations
> > > are passed off to the kernel threads to execute. A stack trace
> > > (if you still have it) would be really help here.
> > >
> > > Hmmm - we have a couple of per-cpu work queues as well that are
> > > used on I/O completion and that can, in some circumstances,
> > > trigger new transactions. If we are only flush metadata, then
> > > I don't think that any more I/o will be issued, but I could be
> > > wrong (maze of twisty passages).
> >
> > Well, I think this exactly is the problem, because worker_threads run with
> > PF_NOFREEZE set (as I've just said in another message).
>
> Ok, so freezing the filesystem is the only way you can prevent
> this as the workqueues are flushed as part of quiescing the filesystem.

Yes, I think so.

Now at last I know what the problem actually is and why we need the freezing
of filesystems, so thanks for helping me understand that. :-)

Greetings,
Rafael


--
You never change things by fighting the existing reality.
R. Buckminster Fuller

2006-10-29 17:35:43

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi!

> > > > As you have them at the moment, the threads seem to be freezing fine.
> > > > The issue I've seen in the past related not to threads but to timer
> > > > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > > > there used to be a possibility for XFS to submit I/O from a timer when
> > > > the threads are frozen but the bdev isn't frozen. Has that changed?
> > >
> > > I didn't think we've ever done that - periodic or delayed operations
> > > are passed off to the kernel threads to execute. A stack trace
> > > (if you still have it) would be really help here.
> > >
> > > Hmmm - we have a couple of per-cpu work queues as well that are
> > > used on I/O completion and that can, in some circumstances,
> > > trigger new transactions. If we are only flush metadata, then
> > > I don't think that any more I/o will be issued, but I could be
> > > wrong (maze of twisty passages).
> >
> > Well, I think this exactly is the problem, because worker_threads run with
> > PF_NOFREEZE set (as I've just said in another message).
>
> Ok, so freezing the filesystem is the only way you can prevent
> this as the workqueues are flushed as part of quiescing the filesystem.

Well, an alternative is to teach XFS to sense that we are being frozen
and stop disk writes in that case.

OTOH freeze_bdevs is perhaps not that bad a solution...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-10-29 23:30:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi,

On Sunday, 29 October 2006 18:35, Pavel Machek wrote:
> Hi!
>
> > > > > As you have them at the moment, the threads seem to be freezing fine.
> > > > > The issue I've seen in the past related not to threads but to timer
> > > > > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > > > > there used to be a possibility for XFS to submit I/O from a timer when
> > > > > the threads are frozen but the bdev isn't frozen. Has that changed?
> > > >
> > > > I didn't think we've ever done that - periodic or delayed operations
> > > > are passed off to the kernel threads to execute. A stack trace
> > > > (if you still have it) would be really help here.
> > > >
> > > > Hmmm - we have a couple of per-cpu work queues as well that are
> > > > used on I/O completion and that can, in some circumstances,
> > > > trigger new transactions. If we are only flush metadata, then
> > > > I don't think that any more I/o will be issued, but I could be
> > > > wrong (maze of twisty passages).
> > >
> > > Well, I think this exactly is the problem, because worker_threads run with
> > > PF_NOFREEZE set (as I've just said in another message).
> >
> > Ok, so freezing the filesystem is the only way you can prevent
> > this as the workqueues are flushed as part of quiescing the filesystem.
>
> Well, alternative is to teach XFS to sense that we are being frozen
> and stop disk writes in such case.
>
> OTOH freeze_bdevs is perhaps not that bad solution...

Okay, appended is a patch that implements the freezing of bdevs in a slightly
different way than Nigel's patch did.

As Christoph suggested, I have put freeze_filesystems() and thaw_filesystems()
into fs/buffer.c and introduced the MS_FROZEN flag to mark frozen
filesystems.

It seems to work fine, except I get the following trace from lockdep during
the suspend on a regular basis (not 100% reproducible, though):

Stopping tasks...
=============================================
[ INFO: possible recursive locking detected ]
2.6.19-rc2-mm2 #15
---------------------------------------------
s2disk/5564 is trying to acquire lock:
(&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10

but task is already holding lock:
(&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10

other info that might help us debug this:
3 locks held by s2disk/5564:
#0: (&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10
#1: (&type->s_umount_key#16){----}, at: [<ffffffff80291647>] get_super+0x67/0xc0
#2: (&journal->j_barrier){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10

stack backtrace:

Call Trace:
[<ffffffff8020af79>] dump_trace+0xb9/0x430
[<ffffffff8020b333>] show_trace+0x43/0x60
[<ffffffff8020b635>] dump_stack+0x15/0x20
[<ffffffff8024a1d1>] __lock_acquire+0x881/0xc60
[<ffffffff8024a94d>] lock_acquire+0x8d/0xc0
[<ffffffff80475cd4>] __mutex_lock_slowpath+0xd4/0x270
[<ffffffff80475e79>] mutex_lock+0x9/0x10
[<ffffffff802b2bb6>] freeze_bdev+0x16/0x80
[<ffffffff802b3105>] freeze_filesystems+0x55/0x80
[<ffffffff80255942>] freeze_processes+0x1e2/0x360
[<ffffffff802592a3>] snapshot_ioctl+0x163/0x610
[<ffffffff8029cf0b>] do_ioctl+0x6b/0xa0
[<ffffffff8029d1eb>] vfs_ioctl+0x2ab/0x2d0
[<ffffffff8029d27a>] sys_ioctl+0x6a/0xa0
[<ffffffff80209c2e>] system_call+0x7e/0x83
[<00002afb13a4d8a9>]

done.
Shrinking memory... done (19126 pages freed)

Greetings,
Rafael


Signed-off-by: Rafael J. Wysocki <[email protected]>
---
fs/buffer.c | 38 +++++++++++++++++++++++++++++++++
include/linux/buffer_head.h | 2 +
include/linux/fs.h | 1
kernel/power/process.c | 50 +++++++++++++++++++++++++++++---------------
4 files changed, 74 insertions(+), 17 deletions(-)

Index: linux-2.6.19-rc2-mm2/kernel/power/process.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/kernel/power/process.c
+++ linux-2.6.19-rc2-mm2/kernel/power/process.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/syscalls.h>
#include <linux/freezer.h>
+#include <linux/buffer_head.h>

/*
* Timeout for stopping processes
@@ -119,7 +120,7 @@ int freeze_processes(void)
read_unlock(&tasklist_lock);
todo += nr_user;
if (!user_frozen && !nr_user) {
- sys_sync();
+ freeze_filesystems();
start_time = jiffies;
}
user_frozen = !nr_user;
@@ -156,28 +157,43 @@ int freeze_processes(void)
void thaw_some_processes(int all)
{
struct task_struct *g, *p;
- int pass = 0; /* Pass 0 = Kernel space, 1 = Userspace */

printk("Restarting tasks... ");
read_lock(&tasklist_lock);
- do {
- do_each_thread(g, p) {
- /*
- * is_user = 0 if kernel thread or borrowed mm,
- * 1 otherwise.
- */
- int is_user = !!(p->mm && !(p->flags & PF_BORROWED_MM));
- if (!freezeable(p) || (is_user != pass))
- continue;
- if (!thaw_process(p))
- printk(KERN_INFO
- "Strange, %s not stopped\n", p->comm);
- } while_each_thread(g, p);

- pass++;
- } while (pass < 2 && all);
+ do_each_thread(g, p) {
+ if (!freezeable(p))
+ continue;
+
+ /* Don't thaw userland processes, for now */
+ if (p->mm && !(p->flags & PF_BORROWED_MM))
+ continue;
+
+ if (!thaw_process(p))
+ printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
+ } while_each_thread(g, p);
+
+ read_unlock(&tasklist_lock);
+ if (!all)
+ goto Exit;
+
+ thaw_filesystems();
+ read_lock(&tasklist_lock);
+
+ do_each_thread(g, p) {
+ if (!freezeable(p))
+ continue;
+
+ /* Kernel threads should have been thawed already */
+ if (!p->mm || (p->flags & PF_BORROWED_MM))
+ continue;
+
+ if (!thaw_process(p))
+ printk(KERN_INFO " Strange, %s not stopped\n", p->comm );
+ } while_each_thread(g, p);

read_unlock(&tasklist_lock);
+Exit:
schedule();
printk("done.\n");
}
Index: linux-2.6.19-rc2-mm2/include/linux/buffer_head.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/linux/buffer_head.h
+++ linux-2.6.19-rc2-mm2/include/linux/buffer_head.h
@@ -170,6 +170,8 @@ wait_queue_head_t *bh_waitq_head(struct
int fsync_bdev(struct block_device *);
struct super_block *freeze_bdev(struct block_device *);
void thaw_bdev(struct block_device *, struct super_block *);
+void freeze_filesystems(void);
+void thaw_filesystems(void);
int fsync_super(struct super_block *);
int fsync_no_super(struct block_device *);
struct buffer_head *__find_get_block(struct block_device *, sector_t, int);
Index: linux-2.6.19-rc2-mm2/include/linux/fs.h
===================================================================
--- linux-2.6.19-rc2-mm2.orig/include/linux/fs.h
+++ linux-2.6.19-rc2-mm2/include/linux/fs.h
@@ -120,6 +120,7 @@ extern int dir_notify_enable;
#define MS_PRIVATE (1<<18) /* change to private */
#define MS_SLAVE (1<<19) /* change to slave */
#define MS_SHARED (1<<20) /* change to shared */
+#define MS_FROZEN (1<<21) /* Frozen by freeze_filesystems() */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

Index: linux-2.6.19-rc2-mm2/fs/buffer.c
===================================================================
--- linux-2.6.19-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.19-rc2-mm2/fs/buffer.c
@@ -244,6 +244,44 @@ void thaw_bdev(struct block_device *bdev
}
EXPORT_SYMBOL(thaw_bdev);

+/**
+ * freeze_filesystems - lock all filesystems and force them into a consistent
+ * state
+ */
+void freeze_filesystems(void)
+{
+ struct super_block *sb;
+
+ /*
+ * Freeze in reverse order so filesystems dependent upon others are
+ * frozen in the right order (e.g. loopback on ext3).
+ */
+ list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+ if (!sb->s_root || !sb->s_bdev ||
+ (sb->s_frozen == SB_FREEZE_TRANS) ||
+ (sb->s_flags & MS_RDONLY) ||
+ (sb->s_flags & MS_FROZEN))
+ continue;
+
+ freeze_bdev(sb->s_bdev);
+ sb->s_flags |= MS_FROZEN;
+ }
+}
+
+/**
+ * thaw_filesystems - unlock all filesystems
+ */
+void thaw_filesystems(void)
+{
+ struct super_block *sb;
+
+ list_for_each_entry(sb, &super_blocks, s_list)
+ if (sb->s_flags & MS_FROZEN) {
+ sb->s_flags &= ~MS_FROZEN;
+ thaw_bdev(sb->s_bdev, sb);
+ }
+}
+
/*
* Various filesystems appear to want __find_get_block to be non-blocking.
* But it's the page lock which protects the buffers. To get around this,
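
For reference, freeze_bdev() in kernels of this vintage looked roughly like the
sketch below (abridged, not verbatim). It illustrates two things: the
bd_mount_mutex taken here is held until the matching thaw_bdev(), and since
lockdep generally keys every bdev's bd_mount_mutex to the same lock class,
freezing a second filesystem while an earlier one is still frozen can be
reported as the "possible recursive locking" shown above; and the
write_super_lockfs() call is where a filesystem such as XFS quiesces itself,
flushing the workqueue-driven I/O that freezing threads alone cannot stop.

struct super_block *freeze_bdev(struct block_device *bdev)
{
	struct super_block *sb;

	mutex_lock(&bdev->bd_mount_mutex);	/* held until thaw_bdev() */
	sb = get_super(bdev);
	if (sb && !(sb->s_flags & MS_RDONLY)) {
		sb->s_frozen = SB_FREEZE_WRITE;
		smp_wmb();

		__fsync_super(sb);		/* flush dirty data and metadata */

		sb->s_frozen = SB_FREEZE_TRANS;
		smp_wmb();

		sync_blockdev(sb->s_bdev);

		/* let the filesystem quiesce itself; per the discussion in
		 * this thread, XFS flushes its workqueues at this point */
		if (sb->s_op->write_super_lockfs)
			sb->s_op->write_super_lockfs(sb);
	}

	sync_blockdev(bdev);
	return sb;	/* bd_mount_mutex is released by thaw_bdev() */
}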

2006-10-29 23:52:00

by Nigel Cunningham

[permalink] [raw]
Subject: Re: [PATCH] Freeze bdevs when freezing processes.

Hi.

On Mon, 2006-10-30 at 00:29 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> On Sunday, 29 October 2006 18:35, Pavel Machek wrote:
> > Hi!
> >
> > > > > > As you have them at the moment, the threads seem to be freezing fine.
> > > > > > The issue I've seen in the past related not to threads but to timer
> > > > > > based activity. Admittedly it was 2.6.14 when I last looked at it, but
> > > > > > there used to be a possibility for XFS to submit I/O from a timer when
> > > > > > the threads are frozen but the bdev isn't frozen. Has that changed?
> > > > >
> > > > > I didn't think we've ever done that - periodic or delayed operations
> > > > > are passed off to the kernel threads to execute. A stack trace
> > > > > (if you still have it) would be really help here.
> > > > >
> > > > > Hmmm - we have a couple of per-cpu work queues as well that are
> > > > > used on I/O completion and that can, in some circumstances,
> > > > > trigger new transactions. If we are only flush metadata, then
> > > > > I don't think that any more I/o will be issued, but I could be
> > > > > wrong (maze of twisty passages).
> > > >
> > > > Well, I think this exactly is the problem, because worker_threads run with
> > > > PF_NOFREEZE set (as I've just said in another message).
> > >
> > > Ok, so freezing the filesystem is the only way you can prevent
> > > this as the workqueues are flushed as part of quiescing the filesystem.
> >
> > Well, alternative is to teach XFS to sense that we are being frozen
> > and stop disk writes in such case.
> >
> > OTOH freeze_bdevs is perhaps not that bad solution...
>
> Okay, appended is a patch that implements the freezing of bdevs in a slightly
> different way than Nigel's patch did.
>
> As Christoph suggested, I have put freeze_filesystems() and thaw_filesystems()
> into fs/buffer.c and introduced the MS_FROZEN flag to mark frozen
> filesystems.
>
> It seems to work fine, except I get the following trace from lockdep during
> the suspend on a regular basis (not 100% reproducible, though):
>
> Stopping tasks...
> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.19-rc2-mm2 #15
> ---------------------------------------------
> s2disk/5564 is trying to acquire lock:
> (&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10
>
> but task is already holding lock:
> (&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10
>
> other info that might help us debug this:
> 3 locks held by s2disk/5564:
> #0: (&bdev->bd_mount_mutex){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10
> #1: (&type->s_umount_key#16){----}, at: [<ffffffff80291647>] get_super+0x67/0xc0
> #2: (&journal->j_barrier){--..}, at: [<ffffffff80475e79>] mutex_lock+0x9/0x10
>
> stack backtrace:
>
> Call Trace:
> [<ffffffff8020af79>] dump_trace+0xb9/0x430
> [<ffffffff8020b333>] show_trace+0x43/0x60
> [<ffffffff8020b635>] dump_stack+0x15/0x20
> [<ffffffff8024a1d1>] __lock_acquire+0x881/0xc60
> [<ffffffff8024a94d>] lock_acquire+0x8d/0xc0
> [<ffffffff80475cd4>] __mutex_lock_slowpath+0xd4/0x270
> [<ffffffff80475e79>] mutex_lock+0x9/0x10
> [<ffffffff802b2bb6>] freeze_bdev+0x16/0x80
> [<ffffffff802b3105>] freeze_filesystems+0x55/0x80
> [<ffffffff80255942>] freeze_processes+0x1e2/0x360
> [<ffffffff802592a3>] snapshot_ioctl+0x163/0x610
> [<ffffffff8029cf0b>] do_ioctl+0x6b/0xa0
> [<ffffffff8029d1eb>] vfs_ioctl+0x2ab/0x2d0
> [<ffffffff8029d27a>] sys_ioctl+0x6a/0xa0
> [<ffffffff80209c2e>] system_call+0x7e/0x83
> [<00002afb13a4d8a9>]
>
> done.
> Shrinking memory... done (19126 pages freed)
>
> Greetings,
> Rafael
>
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>

Heh. I've just prepared almost exactly the same patch (except for
unwinding the process thawing).

I haven't ever seen those mutex warnings - is that due to something new
in -mm or one of those gazillion new warnings you can enable in vanilla?

Apart from that, I'll add:

Signed-off-by: Nigel Cunningham <[email protected]>

Regards,

Nigel
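
For reference, the PF_NOFREEZE behaviour mentioned in the discussion above
comes from the workqueue code of the same era; a rough, abridged sketch (not
verbatim) of how the per-CPU worker threads opt out of the freezer:

static int worker_thread(void *__cwq)
{
	struct cpu_workqueue_struct *cwq = __cwq;
	DECLARE_WAITQUEUE(wait, current);

	/*
	 * PF_NOFREEZE makes freezeable() return 0 for this thread, so
	 * freeze_processes() never waits for it.  Work queued from I/O
	 * completion (including XFS metadata I/O) therefore keeps running
	 * after user space and the other kernel threads are frozen, which
	 * is why the bdevs have to be frozen as well.
	 */
	current->flags |= PF_NOFREEZE;

	set_user_nice(current, -5);

	/* ... signal setup and the run_workqueue() loop omitted ... */

	return 0;
}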