2015-05-08 07:08:56

by Len Brown

[permalink] [raw]
Subject: [PATCH 1/1] suspend: delete sys_sync()

From: Len Brown <[email protected]>

Remove sys_sync() from the kernel's suspend flow.

sys_sync() is extremely expensive in some configurations,
and so the kernel should not force users to pay this cost
on every suspend.

The user-space utilities s2ram and s2disk choose to invoke sync() today.
A user can invoke suspend directly via /sys/power/state to skip that cost.

Signed-off-by: Len Brown <[email protected]>
---
kernel/power/suspend.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index b7d6b3a..6d237bb 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -473,12 +473,6 @@ static int enter_state(suspend_state_t state)
if (state == PM_SUSPEND_FREEZE)
freeze_begin();

- trace_suspend_resume(TPS("sync_filesystems"), 0, true);
- printk(KERN_INFO "PM: Syncing filesystems ... ");
- sys_sync();
- printk("done.\n");
- trace_suspend_resume(TPS("sync_filesystems"), 0, false);
-
pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
error = suspend_prepare(state);
if (error)
--
2.4.0.rc1


2015-05-08 14:34:17

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 8 May 2015, Len Brown wrote:

> From: Len Brown <[email protected]>
>
> Remove sys_sync() from the kernel's suspend flow.
>
> sys_sync() is extremely expensive in some configurations,
> and so the kernel should not force users to pay this cost
> on every suspend.
>
> The user-space utilities s2ram and s2disk choose to invoke sync() today.
> A user can invoke suspend directly via /sys/power/state to skip that cost.

I can understand the motivation for this, but is it aimed in the right
direction?

Consider a system where sys_sync() is very expensive. Should such a
system be using system suspend in the first place? If there is a valid
use case, why does the extra overhead of sys_sync() matter?

To put it another way, if the system wants to go into a low-latency
suspend state, why doesn't it use an extreme version of runtime suspend
rather than system suspend?

This reminds me of the discussions we had with the Android developers
when they proposed adding wakelocks to the kernel -- they needed them
so that they could support system suspend when what they really wanted
to do was an extreme version of runtime suspend. But generally
speaking, sys_sync() is not tremendously expensive on devices running
Android.

Alan Stern

2015-05-08 16:36:08

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 8, 2015 at 10:34 AM, Alan Stern <[email protected]> wrote:
> On Fri, 8 May 2015, Len Brown wrote:
>
>> From: Len Brown <[email protected]>
>>
>> Remove sys_sync() from the kernel's suspend flow.
>>
>> sys_sync() is extremely expensive in some configurations,
>> and so the kernel should not force users to pay this cost
>> on every suspend.
>>
>> The user-space utilities s2ram and s2disk choose to invoke sync() today.
>> A user can invoke suspend directly via /sys/power/state to skip that cost.
>
> I can understand the motivation for this, but is it aimed in the right
> direction?
>
> Consider a system where sys_sync() is very expensive. Should such a
> system be using system suspend in the first place? If there is a valid
> use case, why does the extra overhead of sys_sync() matter?

There are four significant issues with sys_sync() here:

1. It is not determinitic
Some devices (and their caches) are quick on the 1st sync,
and then slow on the 2nd sync etc.

2. worst case latency is obscene, there are examples of some
syncs which take over 3,000 ms to complete.

3. It doesn't necessarily deliver on the assumed benefits.
Rafael tells me that it depends on the file system
exactly what sync does. Look no further than the fact
that for two sync's in a row, the 2nd one may be slower.

4. Whether to sync or not is a policy choice, and that choice
should be made by the user, not the kernel. Many systems today
want to wake for a packet, process that packet, and get back
to sleep as fast as possible. Jamming a sys_sync() in their
wake doesn't make sense.

> To put it another way, if the system wants to go into a low-latency
> suspend state, why doesn't it use an extreme version of runtime suspend
> rather than system suspend?

There are systems where systems suspend is FAST, except for the sync.
I can e-mail you the analyze_suspend output for such a system.

> This reminds me of the discussions we had with the Android developers
> when they proposed adding wakelocks to the kernel -- they needed them
> so that they could support system suspend when what they really wanted
> to do was an extreme version of runtime suspend. But generally
> speaking, sys_sync() is not tremendously expensive on devices running
> Android.

Unfortunately, sys_sync() can be a significant pain point,
even for systems that run Android.

Len Brown, Intel Open Source Technology Center

2015-05-08 19:13:30

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

> 2. worst case latency is obscene, there are examples of some
> syncs which take over 3,000 ms to complete.

ATA is pretty open ended on this. I believe the vendors use 7 seconds
just for the cache flush as their limit because after 7 seconds some non
Linux OS's blow up. However if my suspend/resume crashes (as still I'm
sorry to say happens far too often) I don't want my last ten minutes of
work trashed.

> Unfortunately, sys_sync() can be a significant pain point,
> even for systems that run Android.

Android devices often have slow I/O devices coupled with a lot of memory
so yes that is true.

There are however some very important reasons for using sync() in a
suspend

- I can read data off the suspended machines disk volumes even though I
can't write to them. People do this.

- Suspend requires the firmware, drivers and kernel all get it exactly
right. On a lot of machines therefore suspend is still a buggy pile of
crap. Sync is extremely valuable given that you can't be entirely
sure your system will resume.

- Users habitually do stupid things like removing USB dongles from
suspended boxes and thinking afterwards. Perception is that the device
is off therefore you can unplug it.

So I think its inappropriate to change the default. Allow users to turn
it off by all means, and I imagine many phones would use that.

Some of this however is crappy suspend/resume handling. If the suspend
subsystem was doing its job then for the cases of timeout triggered
suspend it would have triggered most of the disk writes ten seconds
before it tried to suspend properly ;-)

Dirty page management is at the end of the day partly a power management
issue. We already reflect that in other places like laptop mode.

Alan

2015-05-08 19:32:34

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 8, 2015 at 3:13 PM, One Thousand Gnomes
<[email protected]> wrote:
>> 2. worst case latency is obscene, there are examples of some
>> syncs which take over 3,000 ms to complete.
>
> ATA is pretty open ended on this. I believe the vendors use 7 seconds
> just for the cache flush as their limit because after 7 seconds some non
> Linux OS's blow up. However if my suspend/resume crashes (as still I'm
> sorry to say happens far too often) I don't want my last ten minutes of
> work trashed.
>
>> Unfortunately, sys_sync() can be a significant pain point,
>> even for systems that run Android.
>
> Android devices often have slow I/O devices coupled with a lot of memory
> so yes that is true.
>
> There are however some very important reasons for using sync() in a
> suspend
>
> - I can read data off the suspended machines disk volumes even though I
> can't write to them. People do this.
>
> - Suspend requires the firmware, drivers and kernel all get it exactly
> right. On a lot of machines therefore suspend is still a buggy pile of
> crap. Sync is extremely valuable given that you can't be entirely
> sure your system will resume.
>
> - Users habitually do stupid things like removing USB dongles from
> suspended boxes and thinking afterwards. Perception is that the device
> is off therefore you can unplug it.
>
> So I think its inappropriate to change the default. Allow users to turn
> it off by all means, and I imagine many phones would use that.

FWIW, 18-months ago, I proposed a patch to make the sys_sync() optional
"[PATCH 1/1] suspend: make sync() on suspend-to-RAM optional"
and feedback was that fewer choices would be better.

Note that user-space has full license both before and after this patch
to sync(). Indeed, the s2disk and s2ram utilities do exactly that.

> Some of this however is crappy suspend/resume handling. If the suspend
> subsystem was doing its job then for the cases of timeout triggered
> suspend it would have triggered most of the disk writes ten seconds
> before it tried to suspend properly ;-)

No problem, continue to use s2ram on your system -- and to the extent
that sync works, your data will be on disk. (sync reliability is a
different topic...)

Understand, however, there are systems which suspend/resume reliably
many times per second, making policy choice of having the kernel hard-code
a sys_sync() into the suspend path a bad idea.

thanks,
Len Brown, Intel Open Source Technology Center

2015-05-08 19:52:53

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

> > Some of this however is crappy suspend/resume handling. If the suspend
> > subsystem was doing its job then for the cases of timeout triggered
> > suspend it would have triggered most of the disk writes ten seconds
> > before it tried to suspend properly ;-)
>
> No problem, continue to use s2ram on your system -- and to the extent
> that sync works, your data will be on disk. (sync reliability is a
> different topic...)

Ok let me ask the other obvious question. For all the mainstream
distributions do their default tools and setup sync such that removing it
from the kernel won't actually be noticable by users ?

If the answer is yes, then I shall shut up and stop worrying 8)

> Understand, however, there are systems which suspend/resume reliably
> many times per second, making policy choice of having the kernel hard-code
> a sys_sync() into the suspend path a bad idea.

I'm aware of that - I have a very nice ASUS T100TA.

Alan

2015-05-08 20:05:15

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Friday, May 08, 2015 03:32:30 PM Len Brown wrote:
> On Fri, May 8, 2015 at 3:13 PM, One Thousand Gnomes
> <[email protected]> wrote:
> >> 2. worst case latency is obscene, there are examples of some
> >> syncs which take over 3,000 ms to complete.
> >
> > ATA is pretty open ended on this. I believe the vendors use 7 seconds
> > just for the cache flush as their limit because after 7 seconds some non
> > Linux OS's blow up. However if my suspend/resume crashes (as still I'm
> > sorry to say happens far too often) I don't want my last ten minutes of
> > work trashed.
> >
> >> Unfortunately, sys_sync() can be a significant pain point,
> >> even for systems that run Android.
> >
> > Android devices often have slow I/O devices coupled with a lot of memory
> > so yes that is true.
> >
> > There are however some very important reasons for using sync() in a
> > suspend
> >
> > - I can read data off the suspended machines disk volumes even though I
> > can't write to them. People do this.

Reportedly, for some journaling FSes that may trigger a jornal replay and
therefore on-disk data modifications to happen. I think we're talking about
FAT mostly here, though.

I guess the concern is about things that autosleep (eg. Android) and so may
not be able to sync() from user space before suspending, but for them we
can just move the sync() to the autosleep thread.

> > - Suspend requires the firmware, drivers and kernel all get it exactly
> > right. On a lot of machines therefore suspend is still a buggy pile of
> > crap. Sync is extremely valuable given that you can't be entirely
> > sure your system will resume.
> >
> > - Users habitually do stupid things like removing USB dongles from
> > suspended boxes and thinking afterwards. Perception is that the device
> > is off therefore you can unplug it.
> >
> > So I think its inappropriate to change the default. Allow users to turn
> > it off by all means, and I imagine many phones would use that.
>
> FWIW, 18-months ago, I proposed a patch to make the sys_sync() optional
> "[PATCH 1/1] suspend: make sync() on suspend-to-RAM optional"
> and feedback was that fewer choices would be better.
>
> Note that user-space has full license both before and after this patch
> to sync(). Indeed, the s2disk and s2ram utilities do exactly that.
>
> > Some of this however is crappy suspend/resume handling. If the suspend
> > subsystem was doing its job then for the cases of timeout triggered
> > suspend it would have triggered most of the disk writes ten seconds
> > before it tried to suspend properly ;-)
>
> No problem, continue to use s2ram on your system -- and to the extent
> that sync works, your data will be on disk. (sync reliability is a
> different topic...)
>
> Understand, however, there are systems which suspend/resume reliably
> many times per second, making policy choice of having the kernel hard-code
> a sys_sync() into the suspend path a bad idea.

My current view on that is that whether or not to do a sync() before suspending
ultimately is a policy decision and should belong to user space as such (modulo
the autosleep situation when user space may not know when the suspend is going
to happen).

Moreover, user space is free to do as many sync()s before suspending as it
wants to and the question here is whether or not the *kernel* should sync()
in the suspend code path.

Since we pretty much can demonstrate that having just one sync() in there is
not sufficient in general, should we put two of them in there? Or just
remove the existing one and leave it to user space entirely?

Rafael

2015-05-08 20:14:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Friday, May 08, 2015 08:52:33 PM One Thousand Gnomes wrote:
> > > Some of this however is crappy suspend/resume handling. If the suspend
> > > subsystem was doing its job then for the cases of timeout triggered
> > > suspend it would have triggered most of the disk writes ten seconds
> > > before it tried to suspend properly ;-)
> >
> > No problem, continue to use s2ram on your system -- and to the extent
> > that sync works, your data will be on disk. (sync reliability is a
> > different topic...)
>
> Ok let me ask the other obvious question. For all the mainstream
> distributions do their default tools and setup sync such that removing it
> from the kernel won't actually be noticable by users ?
>
> If the answer is yes, then I shall shut up and stop worrying 8)

The distros I'm familiar with do that. It has always been done traditionally
and I don't think anyone had enough guts to remove it from the scripts. :-)

We need to be more careful about Android as I said, but that can be addressed
by adding sync() to the autosleep thread.

Rafael

2015-05-09 19:59:50

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 8 May 2015, Rafael J. Wysocki wrote:

> My current view on that is that whether or not to do a sync() before suspending
> ultimately is a policy decision and should belong to user space as such (modulo
> the autosleep situation when user space may not know when the suspend is going
> to happen).
>
> Moreover, user space is free to do as many sync()s before suspending as it
> wants to and the question here is whether or not the *kernel* should sync()
> in the suspend code path.
>
> Since we pretty much can demonstrate that having just one sync() in there is
> not sufficient in general, should we put two of them in there? Or just
> remove the existing one and leave it to user space entirely?

I don't know about the advantages of one sync over two. But how about
adding a "syncs_before_suspend" (or just "syncs") sysfs attribute that
takes a small numeric value? The default can be 0, and the user could
set it to 1 or 2 (or higher).

Alan Stern

Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, 09 May 2015, Alan Stern wrote:
> On Fri, 8 May 2015, Rafael J. Wysocki wrote:
> > My current view on that is that whether or not to do a sync() before suspending
> > ultimately is a policy decision and should belong to user space as such (modulo
> > the autosleep situation when user space may not know when the suspend is going
> > to happen).
> >
> > Moreover, user space is free to do as many sync()s before suspending as it
> > wants to and the question here is whether or not the *kernel* should sync()
> > in the suspend code path.
> >
> > Since we pretty much can demonstrate that having just one sync() in there is
> > not sufficient in general, should we put two of them in there? Or just
> > remove the existing one and leave it to user space entirely?
>
> I don't know about the advantages of one sync over two. But how about
> adding a "syncs_before_suspend" (or just "syncs") sysfs attribute that
> takes a small numeric value? The default can be 0, and the user could
> set it to 1 or 2 (or higher).

IMO it would be much safer to both have that knob, and to set it to keep the
current behavior as the default. Userspace will adapt and change that knob
to whatever is sufficient based on what it does before signaling the kernel
to suspend.

A regression in sync-before-suspend is sure to cause data loss episodes,
after all. And, as far as bikeshedding goes, IMHO syncs_before_suspend is
self-explanatory, which would be a very good reason to use it instead of the
shorter requires-you-to-know-what-it-is-about "syncs".

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2015-05-11 01:45:07

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
> From: Len Brown <[email protected]>
>
> Remove sys_sync() from the kernel's suspend flow.
>
> sys_sync() is extremely expensive in some configurations,
> and so the kernel should not force users to pay this cost
> on every suspend.

Since when? Please explain what your use case is that makes this
so prohibitively expensive it needs to be removed.

>
> The user-space utilities s2ram and s2disk choose to invoke sync() today.
> A user can invoke suspend directly via /sys/power/state to skip that cost.

So, you want to have s2disk write all the dirty pages in memory to
the suspend image, rather than to the filesystem?

Either way you have to write that dirty data to disk, but if you
write it to the suspend image, it then has to be loaded again on
resume, and then written again to the filesystem the system has
resumed. This doesn't seem very efficient to me....

And, quite frankly, machines fail to resume from suspne dall the
time. e.g. run out of batteries when they are under s2ram
conditions, or s2disk fails because a kernel upgrade was done before
the s2disk and so can't be resumed. With your change, users lose all
the data that was buffered in memory before suspend, whereas right
now it is written to disk and so nothing is lost if the resume from
suspend fails for whatever reason.

IOWs, I can see several good reasons why the sys_sync() needs to
remain in the suspend code. User data safety and filesystem
integrity is far, far more important than a couple of seconds
improvement in suspend speed....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-11 20:22:30

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sun, May 10, 2015 at 9:44 PM, Dave Chinner <[email protected]> wrote:

> ...Please explain what your use case is that makes this
> so prohibitively expensive it needs to be removed.

wake on packet, process packet without turning on the display,
immediately suspend. Do this potentially several times per second.

>> The user-space utilities s2ram and s2disk choose to invoke sync() today.
>> A user can invoke suspend directly via /sys/power/state to skip that cost.
>
> So, you want to have s2disk write all the dirty pages in memory to
> the suspend image, rather than to the filesystem?

The s2disk utility is unchanged by this proposal, it already includes a sync().

cheers,
Len Brown, Intel Open Source Technology Center

2015-05-11 20:34:35

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, May 9, 2015 at 4:25 PM, Henrique de Moraes Holschuh
<[email protected]> wrote:
> On Sat, 09 May 2015, Alan Stern wrote:
>> On Fri, 8 May 2015, Rafael J. Wysocki wrote:
>> > My current view on that is that whether or not to do a sync() before suspending
>> > ultimately is a policy decision and should belong to user space as such (modulo
>> > the autosleep situation when user space may not know when the suspend is going
>> > to happen).
>> >
>> > Moreover, user space is free to do as many sync()s before suspending as it
>> > wants to and the question here is whether or not the *kernel* should sync()
>> > in the suspend code path.
>> >
>> > Since we pretty much can demonstrate that having just one sync() in there is
>> > not sufficient in general, should we put two of them in there? Or just
>> > remove the existing one and leave it to user space entirely?
>>
>> I don't know about the advantages of one sync over two. But how about
>> adding a "syncs_before_suspend" (or just "syncs") sysfs attribute that
>> takes a small numeric value? The default can be 0, and the user could
>> set it to 1 or 2 (or higher).
>
> IMO it would be much safer to both have that knob, and to set it to keep the
> current behavior as the default. Userspace will adapt and change that knob
> to whatever is sufficient based on what it does before signaling the kernel
> to suspend.
>
> A regression in sync-before-suspend is sure to cause data loss episodes,
> after all. And, as far as bikeshedding goes, IMHO syncs_before_suspend is
> self-explanatory, which would be a very good reason to use it instead of the
> shorter requires-you-to-know-what-it-is-about "syncs".
>

When I first thought about this, I had a similar view:
https://lkml.org/lkml/2014/1/23/45

But upon reflection, I do not believe that the kernel is adding value here,
instead it is imposing a policy, and that policy decision is sometimes
prohibitively expensive.
User-space can do this for itself (and in the case of desktop distros,
already does),
and so the kernel should butt-out.

thanks,
Len Brown, Intel Open Source Technology Center

2015-05-12 06:12:32

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon, 2015-05-11 at 16:34 -0400, Len Brown wrote:
> But upon reflection, I do not believe that the kernel is adding value
> here,
> instead it is imposing a policy, and that policy decision is sometimes
> prohibitively expensive.
> User-space can do this for itself (and in the case of desktop distros,
> already does),
> and so the kernel should butt-out.

There is however a race condition with doing it in user space.
So adding a parameter to select behavior would increase safety a bit.

Regards
Oliver

2015-05-12 22:34:41

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon, May 11, 2015 at 04:22:26PM -0400, Len Brown wrote:
> On Sun, May 10, 2015 at 9:44 PM, Dave Chinner <[email protected]> wrote:
>
> > ...Please explain what your use case is that makes this
> > so prohibitively expensive it needs to be removed.
>
> wake on packet, process packet without turning on the display,
> immediately suspend. Do this potentially several times per second.

The sync should almost always be a no-op because the filesystems
should be clean. In that case, where's the latency coming from? If
filesystems were dirtied during the wake period, then we do care
that they get sync'd properly before suspend.

FWIW, I don't think that s2ram or s2d were designed for this
sort of abuse - they really were designed around laptop use cases,
which is what I mostly care about...

> >> The user-space utilities s2ram and s2disk choose to invoke sync() today.
> >> A user can invoke suspend directly via /sys/power/state to skip that cost.
> >
> > So, you want to have s2disk write all the dirty pages in memory to
> > the suspend image, rather than to the filesystem?
>
> The s2disk utility is unchanged by this proposal, it already includes a sync().

Not everyone uses that utility.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-13 23:23:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:

> On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
> > From: Len Brown <[email protected]>
> >
> > Remove sys_sync() from the kernel's suspend flow.
> >
> > sys_sync() is extremely expensive in some configurations,
> > and so the kernel should not force users to pay this cost
> > on every suspend.
>
> Since when? Please explain what your use case is that makes this
> so prohibitively expensive it needs to be removed.
>
> >
> > The user-space utilities s2ram and s2disk choose to invoke sync() today.
> > A user can invoke suspend directly via /sys/power/state to skip that cost.
>
> So, you want to have s2disk write all the dirty pages in memory to
> the suspend image, rather than to the filesystem?
>
> Either way you have to write that dirty data to disk, but if you
> write it to the suspend image, it then has to be loaded again on
> resume, and then written again to the filesystem the system has
> resumed. This doesn't seem very efficient to me....
>
> And, quite frankly, machines fail to resume from suspne dall the
> time. e.g. run out of batteries when they are under s2ram
> conditions, or s2disk fails because a kernel upgrade was done before
> the s2disk and so can't be resumed. With your change, users lose all
> the data that was buffered in memory before suspend, whereas right
> now it is written to disk and so nothing is lost if the resume from
> suspend fails for whatever reason.
>
> IOWs, I can see several good reasons why the sys_sync() needs to
> remain in the suspend code. User data safety and filesystem
> integrity is far, far more important than a couple of seconds
> improvement in suspend speed....

To be honest, this sounds like superstition and fear, not science and fact.

"filesystem integrity" is not an issue for the fast majority of filesystems
which use journalling to ensure continued integrity even after a crash. I
think even XFS does that :-)

For those few toy filesystems which don't journal, like VFAT and ..... well,
VFAT, it may well make sense for VFAT to request a notification on suspend
and to do whatever is needed to ensure consistent on-storage structures.

"User data safety" is not very closely related to "sys_sync". That call
doesn't tell my emacs or my libre-office that now is a good time to auto-save.
All that "sys_sync" saves is data that has been written-but-not-fsynced.
That is only a small portion of "user data" and anyone who thinks it is
"safe" is already deluded (possibly deluded by over-exposure to ext3).

I am very much in favour of Len's patch. Not only does the sys_sync()
call have a real cost, but I believe it also has zero value.

Can you identify an actual case where the removal of sys_sync would introduce
a measurable cost?

NeilBrown


Attachments:
(No filename) (811.00 B)
OpenPGP digital signature

2015-05-14 23:54:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
> On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
> > > From: Len Brown <[email protected]>
> > >
> > > Remove sys_sync() from the kernel's suspend flow.
> > >
> > > sys_sync() is extremely expensive in some configurations,
> > > and so the kernel should not force users to pay this cost
> > > on every suspend.
> >
> > Since when? Please explain what your use case is that makes this
> > so prohibitively expensive it needs to be removed.
> >
> > >
> > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
> > > A user can invoke suspend directly via /sys/power/state to skip that cost.
> >
> > So, you want to have s2disk write all the dirty pages in memory to
> > the suspend image, rather than to the filesystem?
> >
> > Either way you have to write that dirty data to disk, but if you
> > write it to the suspend image, it then has to be loaded again on
> > resume, and then written again to the filesystem the system has
> > resumed. This doesn't seem very efficient to me....
> >
> > And, quite frankly, machines fail to resume from suspne dall the
> > time. e.g. run out of batteries when they are under s2ram
> > conditions, or s2disk fails because a kernel upgrade was done before
> > the s2disk and so can't be resumed. With your change, users lose all
> > the data that was buffered in memory before suspend, whereas right
> > now it is written to disk and so nothing is lost if the resume from
> > suspend fails for whatever reason.
> >
> > IOWs, I can see several good reasons why the sys_sync() needs to
> > remain in the suspend code. User data safety and filesystem
> > integrity is far, far more important than a couple of seconds
> > improvement in suspend speed....
>
> To be honest, this sounds like superstition and fear, not science and fact.
>
> "filesystem integrity" is not an issue for the fast majority of filesystems
> which use journalling to ensure continued integrity even after a crash. I
> think even XFS does that :-)

It has nothing to do with journalling, and everything to do with
bring filesystems to an *idle state* before suspend runs. We have a
long history of bug reports with XFS that go: suspend, resume, XFS
almost immediately detects corruption, shuts down.

The problem is that "sync" doesn't make the filesystem idle - XFs
has *lots* of background work going on, and if we aren't *real
careful* the filesystem is still doing work while the hardware gets
powerd down and the suspend image is being taken. the result is on
resume that the on-disk filesystem state does not match the memory
image pulled back from resume, and we get shutdowns.

sys_sync() does not guarantee a filesystem is idle - it guarantees
the data in memory is recoverable, butit doesn't stop the filesystem
from doing things like writing back metadata or running background
cleaup tasks. If those aren't stopped properly, then we get into
the state where in-memory and on-disk state get out of whack. And
s2ram can have these problems too, because if there is IO in flight
when the hardware is powered down, that IO is lost....

Every time some piece of generic infrastructure changes behaviour
w.r.t. suspend/resume, we get a new set of problems being reported
by users. It's extremely hard to test for these problems and it
might take months of occasional corruption reports from a user to
isolate it to being a suspend/resume problem. It's a game of
whack-a-mole, because quite often they come down to the fact that
something changed and nobody in the XFS world knew they had to now
set an different initialisation flag on some structure or workqueue
to make it work the way it needed to work.

Go back an look at the history of sys_sync() in suspend discussions
over the past 10 years. You'll find me saying exactly the same
thing again and again about sys_sync(): it does not guarantee the
filesystem is in an idle or coherent, unchanging state, and nothing
in the suspend code tells the filesystem to enter an idle or frozen
state. We actually have mechanisms for doing this - we use it in the
storage layers to idle the filesystem while we do things like *take
a snapshot*.

What is the mechanism suspend to disk uses? It *takes a snapshot* of
system state, written to disk. It's supposed to be consistent, and
the only way you can guarantee the state of an active, mounted
filesystem has consistent in-memory state and on-disk state and
that it won't get changed is to *freeze the filesystem*.

Removing the sync is only going to make this problem worse because
the delta between on-disk and in-memory state is going to be much,
much larger. There is also likely to be significant filesystem
activity occurring when the filesystem has all it's background
threads and work queues abruptly frozen with no warning or
co-ordination, which makes it impossible for anyone to test
suspend/resume reliably.

Sorry for the long rant, but I've been saying the same thing for 10
years, which is abotu as long as I've been dealing with filesystem
corruptions that have resulted from suspend/resume.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-05-15 00:09:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Friday, May 15, 2015 09:54:26 AM Dave Chinner wrote:
> ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
> > On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
> >
> > > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
> > > > From: Len Brown <[email protected]>
> > > >
> > > > Remove sys_sync() from the kernel's suspend flow.
> > > >
> > > > sys_sync() is extremely expensive in some configurations,
> > > > and so the kernel should not force users to pay this cost
> > > > on every suspend.
> > >
> > > Since when? Please explain what your use case is that makes this
> > > so prohibitively expensive it needs to be removed.
> > >
> > > >
> > > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
> > > > A user can invoke suspend directly via /sys/power/state to skip that cost.
> > >
> > > So, you want to have s2disk write all the dirty pages in memory to
> > > the suspend image, rather than to the filesystem?
> > >
> > > Either way you have to write that dirty data to disk, but if you
> > > write it to the suspend image, it then has to be loaded again on
> > > resume, and then written again to the filesystem the system has
> > > resumed. This doesn't seem very efficient to me....
> > >
> > > And, quite frankly, machines fail to resume from suspne dall the
> > > time. e.g. run out of batteries when they are under s2ram
> > > conditions, or s2disk fails because a kernel upgrade was done before
> > > the s2disk and so can't be resumed. With your change, users lose all
> > > the data that was buffered in memory before suspend, whereas right
> > > now it is written to disk and so nothing is lost if the resume from
> > > suspend fails for whatever reason.
> > >
> > > IOWs, I can see several good reasons why the sys_sync() needs to
> > > remain in the suspend code. User data safety and filesystem
> > > integrity is far, far more important than a couple of seconds
> > > improvement in suspend speed....
> >
> > To be honest, this sounds like superstition and fear, not science and fact.
> >
> > "filesystem integrity" is not an issue for the fast majority of filesystems
> > which use journalling to ensure continued integrity even after a crash. I
> > think even XFS does that :-)
>
> It has nothing to do with journalling, and everything to do with
> bring filesystems to an *idle state* before suspend runs. We have a
> long history of bug reports with XFS that go: suspend, resume, XFS
> almost immediately detects corruption, shuts down.
>
> The problem is that "sync" doesn't make the filesystem idle - XFs
> has *lots* of background work going on, and if we aren't *real
> careful* the filesystem is still doing work while the hardware gets
> powerd down and the suspend image is being taken. the result is on
> resume that the on-disk filesystem state does not match the memory
> image pulled back from resume, and we get shutdowns.
>
> sys_sync() does not guarantee a filesystem is idle - it guarantees
> the data in memory is recoverable, butit doesn't stop the filesystem
> from doing things like writing back metadata or running background
> cleaup tasks. If those aren't stopped properly, then we get into
> the state where in-memory and on-disk state get out of whack. And
> s2ram can have these problems too, because if there is IO in flight
> when the hardware is powered down, that IO is lost....
>
> Every time some piece of generic infrastructure changes behaviour
> w.r.t. suspend/resume, we get a new set of problems being reported
> by users. It's extremely hard to test for these problems and it
> might take months of occasional corruption reports from a user to
> isolate it to being a suspend/resume problem. It's a game of
> whack-a-mole, because quite often they come down to the fact that
> something changed and nobody in the XFS world knew they had to now
> set an different initialisation flag on some structure or workqueue
> to make it work the way it needed to work.
>
> Go back an look at the history of sys_sync() in suspend discussions
> over the past 10 years. You'll find me saying exactly the same
> thing again and again about sys_sync(): it does not guarantee the
> filesystem is in an idle or coherent, unchanging state, and nothing
> in the suspend code tells the filesystem to enter an idle or frozen
> state. We actually have mechanisms for doing this - we use it in the
> storage layers to idle the filesystem while we do things like *take
> a snapshot*.
>
> What is the mechanism suspend to disk uses? It *takes a snapshot* of
> system state, written to disk. It's supposed to be consistent, and
> the only way you can guarantee the state of an active, mounted
> filesystem has consistent in-memory state and on-disk state and
> that it won't get changed is to *freeze the filesystem*.
>
> Removing the sync is only going to make this problem worse because
> the delta between on-disk and in-memory state is going to be much,
> much larger. There is also likely to be significant filesystem
> activity occurring when the filesystem has all it's background
> threads and work queues abruptly frozen with no warning or
> co-ordination, which makes it impossible for anyone to test
> suspend/resume reliably.
>
> Sorry for the long rant, but I've been saying the same thing for 10
> years, which is abotu as long as I've been dealing with filesystem
> corruptions that have resulted from suspend/resume.

Well, the change proposed by Len is *only* about suspend-to-RAM and
similar. It is *not* about suspend-to-disk, so pretty please let's
not confuse things.

So what problems may arise specifically in the suspend-to-RAM case if
we remove the unconditional sys_sync() from its code path?

Cheers,
Rafael

2015-05-15 00:40:10

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 15, 2015 at 8:34 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday, May 15, 2015 09:54:26 AM Dave Chinner wrote:
>> ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
>> > On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
>> >
>> > > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
>> > > > From: Len Brown <[email protected]>
>> > > >
>> > > > Remove sys_sync() from the kernel's suspend flow.
>> > > >
>> > > > sys_sync() is extremely expensive in some configurations,
>> > > > and so the kernel should not force users to pay this cost
>> > > > on every suspend.
>> > >
>> > > Since when? Please explain what your use case is that makes this
>> > > so prohibitively expensive it needs to be removed.
>> > >
>> > > >
>> > > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
>> > > > A user can invoke suspend directly via /sys/power/state to skip that cost.
>> > >
>> > > So, you want to have s2disk write all the dirty pages in memory to
>> > > the suspend image, rather than to the filesystem?
>> > >
>> > > Either way you have to write that dirty data to disk, but if you
>> > > write it to the suspend image, it then has to be loaded again on
>> > > resume, and then written again to the filesystem the system has
>> > > resumed. This doesn't seem very efficient to me....
>> > >
>> > > And, quite frankly, machines fail to resume from suspne dall the
>> > > time. e.g. run out of batteries when they are under s2ram
>> > > conditions, or s2disk fails because a kernel upgrade was done before
>> > > the s2disk and so can't be resumed. With your change, users lose all
>> > > the data that was buffered in memory before suspend, whereas right
>> > > now it is written to disk and so nothing is lost if the resume from
>> > > suspend fails for whatever reason.
>> > >
>> > > IOWs, I can see several good reasons why the sys_sync() needs to
>> > > remain in the suspend code. User data safety and filesystem
>> > > integrity is far, far more important than a couple of seconds
>> > > improvement in suspend speed....
>> >
>> > To be honest, this sounds like superstition and fear, not science and fact.
>> >
>> > "filesystem integrity" is not an issue for the fast majority of filesystems
>> > which use journalling to ensure continued integrity even after a crash. I
>> > think even XFS does that :-)
>>
>> It has nothing to do with journalling, and everything to do with
>> bring filesystems to an *idle state* before suspend runs. We have a
>> long history of bug reports with XFS that go: suspend, resume, XFS
>> almost immediately detects corruption, shuts down.
>>
>> The problem is that "sync" doesn't make the filesystem idle - XFs
>> has *lots* of background work going on, and if we aren't *real
>> careful* the filesystem is still doing work while the hardware gets
>> powerd down and the suspend image is being taken. the result is on
>> resume that the on-disk filesystem state does not match the memory
>> image pulled back from resume, and we get shutdowns.
>>
>> sys_sync() does not guarantee a filesystem is idle - it guarantees
>> the data in memory is recoverable, butit doesn't stop the filesystem
>> from doing things like writing back metadata or running background
>> cleaup tasks. If those aren't stopped properly, then we get into
>> the state where in-memory and on-disk state get out of whack. And
>> s2ram can have these problems too, because if there is IO in flight
>> when the hardware is powered down, that IO is lost....
>>
>> Every time some piece of generic infrastructure changes behaviour
>> w.r.t. suspend/resume, we get a new set of problems being reported
>> by users. It's extremely hard to test for these problems and it
>> might take months of occasional corruption reports from a user to
>> isolate it to being a suspend/resume problem. It's a game of
>> whack-a-mole, because quite often they come down to the fact that
>> something changed and nobody in the XFS world knew they had to now
>> set an different initialisation flag on some structure or workqueue
>> to make it work the way it needed to work.
>>
>> Go back an look at the history of sys_sync() in suspend discussions
>> over the past 10 years. You'll find me saying exactly the same
>> thing again and again about sys_sync(): it does not guarantee the
>> filesystem is in an idle or coherent, unchanging state, and nothing
>> in the suspend code tells the filesystem to enter an idle or frozen
>> state. We actually have mechanisms for doing this - we use it in the
>> storage layers to idle the filesystem while we do things like *take
>> a snapshot*.
>>
>> What is the mechanism suspend to disk uses? It *takes a snapshot* of
>> system state, written to disk. It's supposed to be consistent, and
>> the only way you can guarantee the state of an active, mounted
>> filesystem has consistent in-memory state and on-disk state and
>> that it won't get changed is to *freeze the filesystem*.
>>
>> Removing the sync is only going to make this problem worse because
>> the delta between on-disk and in-memory state is going to be much,
>> much larger. There is also likely to be significant filesystem
>> activity occurring when the filesystem has all it's background
>> threads and work queues abruptly frozen with no warning or
>> co-ordination, which makes it impossible for anyone to test
>> suspend/resume reliably.
>>
>> Sorry for the long rant, but I've been saying the same thing for 10
>> years, which is abotu as long as I've been dealing with filesystem
>> corruptions that have resulted from suspend/resume.
>
> Well, the change proposed by Len is *only* about suspend-to-RAM and
> similar. It is *not* about suspend-to-disk, so pretty please let's
> not confuse things.
>
> So what problems may arise specifically in the suspend-to-RAM case if
> we remove the unconditional sys_sync() from its code path?

Data loss may be caused for hotplug storage(like USB), or all storage
when power is exhausted during suspend.

Is there obvious advantage to remove sys_sync() in the case?


Thanks,
Ming Lei

2015-05-15 00:59:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 15, 2015 at 2:40 AM, Ming Lei <[email protected]> wrote:
> On Fri, May 15, 2015 at 8:34 AM, Rafael J. Wysocki <[email protected]> wrote:
>> On Friday, May 15, 2015 09:54:26 AM Dave Chinner wrote:
>>> ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
>>> > On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
>>> >
>>> > > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
>>> > > > From: Len Brown <[email protected]>
>>> > > >
>>> > > > Remove sys_sync() from the kernel's suspend flow.
>>> > > >
>>> > > > sys_sync() is extremely expensive in some configurations,
>>> > > > and so the kernel should not force users to pay this cost
>>> > > > on every suspend.
>>> > >
>>> > > Since when? Please explain what your use case is that makes this
>>> > > so prohibitively expensive it needs to be removed.
>>> > >
>>> > > >
>>> > > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
>>> > > > A user can invoke suspend directly via /sys/power/state to skip that cost.
>>> > >
>>> > > So, you want to have s2disk write all the dirty pages in memory to
>>> > > the suspend image, rather than to the filesystem?
>>> > >
>>> > > Either way you have to write that dirty data to disk, but if you
>>> > > write it to the suspend image, it then has to be loaded again on
>>> > > resume, and then written again to the filesystem the system has
>>> > > resumed. This doesn't seem very efficient to me....
>>> > >
>>> > > And, quite frankly, machines fail to resume from suspne dall the
>>> > > time. e.g. run out of batteries when they are under s2ram
>>> > > conditions, or s2disk fails because a kernel upgrade was done before
>>> > > the s2disk and so can't be resumed. With your change, users lose all
>>> > > the data that was buffered in memory before suspend, whereas right
>>> > > now it is written to disk and so nothing is lost if the resume from
>>> > > suspend fails for whatever reason.
>>> > >
>>> > > IOWs, I can see several good reasons why the sys_sync() needs to
>>> > > remain in the suspend code. User data safety and filesystem
>>> > > integrity is far, far more important than a couple of seconds
>>> > > improvement in suspend speed....
>>> >
>>> > To be honest, this sounds like superstition and fear, not science and fact.
>>> >
>>> > "filesystem integrity" is not an issue for the fast majority of filesystems
>>> > which use journalling to ensure continued integrity even after a crash. I
>>> > think even XFS does that :-)
>>>
>>> It has nothing to do with journalling, and everything to do with
>>> bring filesystems to an *idle state* before suspend runs. We have a
>>> long history of bug reports with XFS that go: suspend, resume, XFS
>>> almost immediately detects corruption, shuts down.
>>>
>>> The problem is that "sync" doesn't make the filesystem idle - XFs
>>> has *lots* of background work going on, and if we aren't *real
>>> careful* the filesystem is still doing work while the hardware gets
>>> powerd down and the suspend image is being taken. the result is on
>>> resume that the on-disk filesystem state does not match the memory
>>> image pulled back from resume, and we get shutdowns.
>>>
>>> sys_sync() does not guarantee a filesystem is idle - it guarantees
>>> the data in memory is recoverable, butit doesn't stop the filesystem
>>> from doing things like writing back metadata or running background
>>> cleaup tasks. If those aren't stopped properly, then we get into
>>> the state where in-memory and on-disk state get out of whack. And
>>> s2ram can have these problems too, because if there is IO in flight
>>> when the hardware is powered down, that IO is lost....
>>>
>>> Every time some piece of generic infrastructure changes behaviour
>>> w.r.t. suspend/resume, we get a new set of problems being reported
>>> by users. It's extremely hard to test for these problems and it
>>> might take months of occasional corruption reports from a user to
>>> isolate it to being a suspend/resume problem. It's a game of
>>> whack-a-mole, because quite often they come down to the fact that
>>> something changed and nobody in the XFS world knew they had to now
>>> set an different initialisation flag on some structure or workqueue
>>> to make it work the way it needed to work.
>>>
>>> Go back an look at the history of sys_sync() in suspend discussions
>>> over the past 10 years. You'll find me saying exactly the same
>>> thing again and again about sys_sync(): it does not guarantee the
>>> filesystem is in an idle or coherent, unchanging state, and nothing
>>> in the suspend code tells the filesystem to enter an idle or frozen
>>> state. We actually have mechanisms for doing this - we use it in the
>>> storage layers to idle the filesystem while we do things like *take
>>> a snapshot*.
>>>
>>> What is the mechanism suspend to disk uses? It *takes a snapshot* of
>>> system state, written to disk. It's supposed to be consistent, and
>>> the only way you can guarantee the state of an active, mounted
>>> filesystem has consistent in-memory state and on-disk state and
>>> that it won't get changed is to *freeze the filesystem*.
>>>
>>> Removing the sync is only going to make this problem worse because
>>> the delta between on-disk and in-memory state is going to be much,
>>> much larger. There is also likely to be significant filesystem
>>> activity occurring when the filesystem has all it's background
>>> threads and work queues abruptly frozen with no warning or
>>> co-ordination, which makes it impossible for anyone to test
>>> suspend/resume reliably.
>>>
>>> Sorry for the long rant, but I've been saying the same thing for 10
>>> years, which is abotu as long as I've been dealing with filesystem
>>> corruptions that have resulted from suspend/resume.
>>
>> Well, the change proposed by Len is *only* about suspend-to-RAM and
>> similar. It is *not* about suspend-to-disk, so pretty please let's
>> not confuse things.
>>
>> So what problems may arise specifically in the suspend-to-RAM case if
>> we remove the unconditional sys_sync() from its code path?
>
> Data loss may be caused for hotplug storage(like USB), or all storage
> when power is exhausted during suspend.

Which also may very well happen at run time, right?

> Is there obvious advantage to remove sys_sync() in the case?

Yes, there is. It is not necessary to sync() every time you suspend
if you do that very often.

And it is done in such a place that everything needs to wait for it to complete.

2015-05-15 01:04:30

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 15 May 2015 09:54:26 +1000 Dave Chinner <[email protected]> wrote:

> ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
> > On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
> >
> > > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
> > > > From: Len Brown <[email protected]>
> > > >
> > > > Remove sys_sync() from the kernel's suspend flow.
> > > >
> > > > sys_sync() is extremely expensive in some configurations,
> > > > and so the kernel should not force users to pay this cost
> > > > on every suspend.
> > >
> > > Since when? Please explain what your use case is that makes this
> > > so prohibitively expensive it needs to be removed.
> > >
> > > >
> > > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
> > > > A user can invoke suspend directly via /sys/power/state to skip that cost.
> > >
> > > So, you want to have s2disk write all the dirty pages in memory to
> > > the suspend image, rather than to the filesystem?
> > >
> > > Either way you have to write that dirty data to disk, but if you
> > > write it to the suspend image, it then has to be loaded again on
> > > resume, and then written again to the filesystem the system has
> > > resumed. This doesn't seem very efficient to me....
> > >
> > > And, quite frankly, machines fail to resume from suspne dall the
> > > time. e.g. run out of batteries when they are under s2ram
> > > conditions, or s2disk fails because a kernel upgrade was done before
> > > the s2disk and so can't be resumed. With your change, users lose all
> > > the data that was buffered in memory before suspend, whereas right
> > > now it is written to disk and so nothing is lost if the resume from
> > > suspend fails for whatever reason.
> > >
> > > IOWs, I can see several good reasons why the sys_sync() needs to
> > > remain in the suspend code. User data safety and filesystem
> > > integrity is far, far more important than a couple of seconds
> > > improvement in suspend speed....
> >
> > To be honest, this sounds like superstition and fear, not science and fact.
> >
> > "filesystem integrity" is not an issue for the fast majority of filesystems
> > which use journalling to ensure continued integrity even after a crash. I
> > think even XFS does that :-)
>
> It has nothing to do with journalling, and everything to do with
> bring filesystems to an *idle state* before suspend runs. We have a
> long history of bug reports with XFS that go: suspend, resume, XFS
> almost immediately detects corruption, shuts down.
>
> The problem is that "sync" doesn't make the filesystem idle - XFs
> has *lots* of background work going on, and if we aren't *real
> careful* the filesystem is still doing work while the hardware gets
> powerd down and the suspend image is being taken. the result is on
> resume that the on-disk filesystem state does not match the memory
> image pulled back from resume, and we get shutdowns.
>
> sys_sync() does not guarantee a filesystem is idle - it guarantees
> the data in memory is recoverable, butit doesn't stop the filesystem
> from doing things like writing back metadata or running background
> cleaup tasks. If those aren't stopped properly, then we get into
> the state where in-memory and on-disk state get out of whack. And
> s2ram can have these problems too, because if there is IO in flight
> when the hardware is powered down, that IO is lost....

This seems to be the nub of your complaint - yes?

Some storage devices don't handle suspend as well as they should and lose
requests resulting in corruption. They should obviously be fixed, but it is
you who gets the problem reports and you are not in a position to fix them.
So you want a general solution that hides those problems.
sys_sync at suspend time is a sort-of solution because it flushes and waits
so there is less in-flight IO immediately after a sys_sync and so less
opportunity for a bad device to stuff up.
But you seem to suggest that sys_sync isn't a complete solution and it
doesn't guarantee that xfs is not doing some background metadata IO.

Maybe a sensible thing to do would be to hook the "disk" devices into suspend
and have them flush their queue and possibly send a CACHE_FLUSH command.
That would provide more of a guarantee for you, and less of a cost for Len,
would it not?

Thanks,
NeilBrown



>
> Every time some piece of generic infrastructure changes behaviour
> w.r.t. suspend/resume, we get a new set of problems being reported
> by users. It's extremely hard to test for these problems and it
> might take months of occasional corruption reports from a user to
> isolate it to being a suspend/resume problem. It's a game of
> whack-a-mole, because quite often they come down to the fact that
> something changed and nobody in the XFS world knew they had to now
> set an different initialisation flag on some structure or workqueue
> to make it work the way it needed to work.
>
> Go back an look at the history of sys_sync() in suspend discussions
> over the past 10 years. You'll find me saying exactly the same
> thing again and again about sys_sync(): it does not guarantee the
> filesystem is in an idle or coherent, unchanging state, and nothing
> in the suspend code tells the filesystem to enter an idle or frozen
> state. We actually have mechanisms for doing this - we use it in the
> storage layers to idle the filesystem while we do things like *take
> a snapshot*.
>
> What is the mechanism suspend to disk uses? It *takes a snapshot* of
> system state, written to disk. It's supposed to be consistent, and
> the only way you can guarantee the state of an active, mounted
> filesystem has consistent in-memory state and on-disk state and
> that it won't get changed is to *freeze the filesystem*.
>
> Removing the sync is only going to make this problem worse because
> the delta between on-disk and in-memory state is going to be much,
> much larger. There is also likely to be significant filesystem
> activity occurring when the filesystem has all it's background
> threads and work queues abruptly frozen with no warning or
> co-ordination, which makes it impossible for anyone to test
> suspend/resume reliably.
>
> Sorry for the long rant, but I've been saying the same thing for 10
> years, which is abotu as long as I've been dealing with filesystem
> corruptions that have resulted from suspend/resume.
>
> Cheers,
>
> Dave.


Attachments:
(No filename) (811.00 B)
OpenPGP digital signature

2015-05-15 05:13:46

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, May 15, 2015 at 8:59 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Fri, May 15, 2015 at 2:40 AM, Ming Lei <[email protected]> wrote:
>> On Fri, May 15, 2015 at 8:34 AM, Rafael J. Wysocki <[email protected]> wrote:
>>> On Friday, May 15, 2015 09:54:26 AM Dave Chinner wrote:
>>>> ng back On Thu, May 14, 2015 at 09:22:51AM +1000, NeilBrown wrote:
>>>> > On Mon, 11 May 2015 11:44:28 +1000 Dave Chinner <[email protected]> wrote:
>>>> >
>>>> > > On Fri, May 08, 2015 at 03:08:43AM -0400, Len Brown wrote:
>>>> > > > From: Len Brown <[email protected]>
>>>> > > >
>>>> > > > Remove sys_sync() from the kernel's suspend flow.
>>>> > > >
>>>> > > > sys_sync() is extremely expensive in some configurations,
>>>> > > > and so the kernel should not force users to pay this cost
>>>> > > > on every suspend.
>>>> > >
>>>> > > Since when? Please explain what your use case is that makes this
>>>> > > so prohibitively expensive it needs to be removed.
>>>> > >
>>>> > > >
>>>> > > > The user-space utilities s2ram and s2disk choose to invoke sync() today.
>>>> > > > A user can invoke suspend directly via /sys/power/state to skip that cost.
>>>> > >
>>>> > > So, you want to have s2disk write all the dirty pages in memory to
>>>> > > the suspend image, rather than to the filesystem?
>>>> > >
>>>> > > Either way you have to write that dirty data to disk, but if you
>>>> > > write it to the suspend image, it then has to be loaded again on
>>>> > > resume, and then written again to the filesystem the system has
>>>> > > resumed. This doesn't seem very efficient to me....
>>>> > >
>>>> > > And, quite frankly, machines fail to resume from suspne dall the
>>>> > > time. e.g. run out of batteries when they are under s2ram
>>>> > > conditions, or s2disk fails because a kernel upgrade was done before
>>>> > > the s2disk and so can't be resumed. With your change, users lose all
>>>> > > the data that was buffered in memory before suspend, whereas right
>>>> > > now it is written to disk and so nothing is lost if the resume from
>>>> > > suspend fails for whatever reason.
>>>> > >
>>>> > > IOWs, I can see several good reasons why the sys_sync() needs to
>>>> > > remain in the suspend code. User data safety and filesystem
>>>> > > integrity is far, far more important than a couple of seconds
>>>> > > improvement in suspend speed....
>>>> >
>>>> > To be honest, this sounds like superstition and fear, not science and fact.
>>>> >
>>>> > "filesystem integrity" is not an issue for the fast majority of filesystems
>>>> > which use journalling to ensure continued integrity even after a crash. I
>>>> > think even XFS does that :-)
>>>>
>>>> It has nothing to do with journalling, and everything to do with
>>>> bring filesystems to an *idle state* before suspend runs. We have a
>>>> long history of bug reports with XFS that go: suspend, resume, XFS
>>>> almost immediately detects corruption, shuts down.
>>>>
>>>> The problem is that "sync" doesn't make the filesystem idle - XFs
>>>> has *lots* of background work going on, and if we aren't *real
>>>> careful* the filesystem is still doing work while the hardware gets
>>>> powerd down and the suspend image is being taken. the result is on
>>>> resume that the on-disk filesystem state does not match the memory
>>>> image pulled back from resume, and we get shutdowns.
>>>>
>>>> sys_sync() does not guarantee a filesystem is idle - it guarantees
>>>> the data in memory is recoverable, butit doesn't stop the filesystem
>>>> from doing things like writing back metadata or running background
>>>> cleaup tasks. If those aren't stopped properly, then we get into
>>>> the state where in-memory and on-disk state get out of whack. And
>>>> s2ram can have these problems too, because if there is IO in flight
>>>> when the hardware is powered down, that IO is lost....
>>>>
>>>> Every time some piece of generic infrastructure changes behaviour
>>>> w.r.t. suspend/resume, we get a new set of problems being reported
>>>> by users. It's extremely hard to test for these problems and it
>>>> might take months of occasional corruption reports from a user to
>>>> isolate it to being a suspend/resume problem. It's a game of
>>>> whack-a-mole, because quite often they come down to the fact that
>>>> something changed and nobody in the XFS world knew they had to now
>>>> set an different initialisation flag on some structure or workqueue
>>>> to make it work the way it needed to work.
>>>>
>>>> Go back an look at the history of sys_sync() in suspend discussions
>>>> over the past 10 years. You'll find me saying exactly the same
>>>> thing again and again about sys_sync(): it does not guarantee the
>>>> filesystem is in an idle or coherent, unchanging state, and nothing
>>>> in the suspend code tells the filesystem to enter an idle or frozen
>>>> state. We actually have mechanisms for doing this - we use it in the
>>>> storage layers to idle the filesystem while we do things like *take
>>>> a snapshot*.
>>>>
>>>> What is the mechanism suspend to disk uses? It *takes a snapshot* of
>>>> system state, written to disk. It's supposed to be consistent, and
>>>> the only way you can guarantee the state of an active, mounted
>>>> filesystem has consistent in-memory state and on-disk state and
>>>> that it won't get changed is to *freeze the filesystem*.
>>>>
>>>> Removing the sync is only going to make this problem worse because
>>>> the delta between on-disk and in-memory state is going to be much,
>>>> much larger. There is also likely to be significant filesystem
>>>> activity occurring when the filesystem has all it's background
>>>> threads and work queues abruptly frozen with no warning or
>>>> co-ordination, which makes it impossible for anyone to test
>>>> suspend/resume reliably.
>>>>
>>>> Sorry for the long rant, but I've been saying the same thing for 10
>>>> years, which is abotu as long as I've been dealing with filesystem
>>>> corruptions that have resulted from suspend/resume.
>>>
>>> Well, the change proposed by Len is *only* about suspend-to-RAM and
>>> similar. It is *not* about suspend-to-disk, so pretty please let's
>>> not confuse things.
>>>
>>> So what problems may arise specifically in the suspend-to-RAM case if
>>> we remove the unconditional sys_sync() from its code path?
>>
>> Data loss may be caused for hotplug storage(like USB), or all storage
>> when power is exhausted during suspend.
>
> Which also may very well happen at run time, right?

I believe common people can understand the difference between
suspend(often done by docking laptop) and runtime wrt laptop usage

Yes, it may happen runtime for hotplug storage, but people know it is
their own fault and won't complain anyone. But if they find data loss is
caused during suspend and may figure out it is one regression by this
change, they may complain someone, :-)

Actually one common practice about my laptop usage for me is to always
use suspend, and seldom power down, then follows to plug all USB
thumb/HDD. from the laptop and put laptop in my backpack
,then go somewhere, and the change migh make trouble for me.

Also for power exhausting case, it is different with runtime
because distribution often detects low battery and puts system to
suspend when low battery happens. Then the deletion makes
a difference and cause trouble after power is exhausted.

Finally I guess it might affect some network based storage like iSCSI,
CIFS and NFS too, and hope guys in these fields can review the change.

>
>> Is there obvious advantage to remove sys_sync() in the case?
>
> Yes, there is. It is not necessary to sync() every time you suspend
> if you do that very often.
>
> And it is done in such a place that everything needs to wait for it to complete.

I mean Len said sys_sync() has been added to s2ram/s2disk utilities
already in some distribution, that means people think it is required in the
suspend path, so the kernel side sys_sync() shouldn't be expensive
like Dave explained because there are little dirty pages after userspace
have done the sys_sync().


Thanks,
Ming

2015-05-15 10:36:26

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

> > Data loss may be caused for hotplug storage(like USB), or all storage
> > when power is exhausted during suspend.
>
> Which also may very well happen at run time, right?

Intuitively users treat "suspended" as a bit like off and do remove
devices they've "finished with".

Technically yes the instant off/on on a phone is the same as the suspend
to RAM on a laptop but it doesn't mean the model in people's heads is the
same... not yet anyway.

> > Is there obvious advantage to remove sys_sync() in the case?
>
> Yes, there is. It is not necessary to sync() every time you suspend
> if you do that very often.

But if you do it very often you won't have any dirty pages to flush so it
will be very fast.

> And it is done in such a place that everything needs to wait for it to complete.

Only because the code deciding to trigger any automated suspend doesn't
do a sync a few seconds before. In the case the user goes to the menus
and does power->suspend then yes it's a delay. In the case where the OS
at some level has decided that it's 10 seconds from automatically
suspending to something the user space can issue a pre-emptive sync to
get the queue size down.

Alan

2015-05-15 14:19:26

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 15 May 2015, Dave Chinner wrote:

> It has nothing to do with journalling, and everything to do with
> bring filesystems to an *idle state* before suspend runs. We have a
> long history of bug reports with XFS that go: suspend, resume, XFS
> almost immediately detects corruption, shuts down.
>
> The problem is that "sync" doesn't make the filesystem idle - XFs
> has *lots* of background work going on, and if we aren't *real
> careful* the filesystem is still doing work while the hardware gets
> powerd down and the suspend image is being taken. the result is on
> resume that the on-disk filesystem state does not match the memory
> image pulled back from resume, and we get shutdowns.
>
> sys_sync() does not guarantee a filesystem is idle - it guarantees
> the data in memory is recoverable, butit doesn't stop the filesystem
> from doing things like writing back metadata or running background
> cleaup tasks. If those aren't stopped properly, then we get into
> the state where in-memory and on-disk state get out of whack. And
> s2ram can have these problems too, because if there is IO in flight
> when the hardware is powered down, that IO is lost....
>
> Every time some piece of generic infrastructure changes behaviour
> w.r.t. suspend/resume, we get a new set of problems being reported
> by users. It's extremely hard to test for these problems and it
> might take months of occasional corruption reports from a user to
> isolate it to being a suspend/resume problem. It's a game of
> whack-a-mole, because quite often they come down to the fact that
> something changed and nobody in the XFS world knew they had to now
> set an different initialisation flag on some structure or workqueue
> to make it work the way it needed to work.
>
> Go back an look at the history of sys_sync() in suspend discussions
> over the past 10 years. You'll find me saying exactly the same
> thing again and again about sys_sync(): it does not guarantee the
> filesystem is in an idle or coherent, unchanging state, and nothing
> in the suspend code tells the filesystem to enter an idle or frozen
> state. We actually have mechanisms for doing this - we use it in the
> storage layers to idle the filesystem while we do things like *take
> a snapshot*.
>
> What is the mechanism suspend to disk uses? It *takes a snapshot* of
> system state, written to disk. It's supposed to be consistent, and
> the only way you can guarantee the state of an active, mounted
> filesystem has consistent in-memory state and on-disk state and
> that it won't get changed is to *freeze the filesystem*.
>
> Removing the sync is only going to make this problem worse because
> the delta between on-disk and in-memory state is going to be much,
> much larger. There is also likely to be significant filesystem
> activity occurring when the filesystem has all it's background
> threads and work queues abruptly frozen with no warning or
> co-ordination, which makes it impossible for anyone to test
> suspend/resume reliably.
>
> Sorry for the long rant, but I've been saying the same thing for 10
> years, which is abotu as long as I've been dealing with filesystem
> corruptions that have resulted from suspend/resume.

Are you suggesting that filesystems should register a "freeze the
filesystem" routine, and the PM core should call this routine during
hibernation (aka suspend-to-disk)? With a corresponding "unfreeze"
hook for resume, of course.

That seems like a good idea to me. It would also present a logical
place to add some simple checks for changes to the on-disk image caused
by the user moving the storage device to another system, modifying the
contents, and then moving it back all while the current system was
hibernating (or suspended).

Alan Stern

2015-05-15 14:20:20

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 15 May 2015, NeilBrown wrote:

> Some storage devices don't handle suspend as well as they should and lose
> requests resulting in corruption. They should obviously be fixed, but it is
> you who gets the problem reports and you are not in a position to fix them.
> So you want a general solution that hides those problems.
> sys_sync at suspend time is a sort-of solution because it flushes and waits
> so there is less in-flight IO immediately after a sys_sync and so less
> opportunity for a bad device to stuff up.
> But you seem to suggest that sys_sync isn't a complete solution and it
> doesn't guarantee that xfs is not doing some background metadata IO.
>
> Maybe a sensible thing to do would be to hook the "disk" devices into suspend
> and have them flush their queue and possibly send a CACHE_FLUSH command.
> That would provide more of a guarantee for you, and less of a cost for Len,
> would it not?

The sd driver already does this.

Alan Stern

2015-05-15 14:32:08

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 15 May 2015, Alan Stern wrote:

> On Fri, 15 May 2015, NeilBrown wrote:
>
> > Some storage devices don't handle suspend as well as they should and lose
> > requests resulting in corruption. They should obviously be fixed, but it is
> > you who gets the problem reports and you are not in a position to fix them.
> > So you want a general solution that hides those problems.
> > sys_sync at suspend time is a sort-of solution because it flushes and waits
> > so there is less in-flight IO immediately after a sys_sync and so less
> > opportunity for a bad device to stuff up.
> > But you seem to suggest that sys_sync isn't a complete solution and it
> > doesn't guarantee that xfs is not doing some background metadata IO.
> >
> > Maybe a sensible thing to do would be to hook the "disk" devices into suspend
> > and have them flush their queue and possibly send a CACHE_FLUSH command.
> > That would provide more of a guarantee for you, and less of a cost for Len,
> > would it not?
>
> The sd driver already does this.

Sorry -- I meant that it does send a SYNCHRONIZE CACHE command. It
doesn't flush the I/O queue.

Alan Stern

2015-05-18 01:57:45

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, 15 May 2015 11:35:57 +0100 One Thousand Gnomes
<[email protected]> wrote:

> > > Data loss may be caused for hotplug storage(like USB), or all storage
> > > when power is exhausted during suspend.
> >
> > Which also may very well happen at run time, right?
>
> Intuitively users treat "suspended" as a bit like off and do remove
> devices they've "finished with".
>
> Technically yes the instant off/on on a phone is the same as the suspend
> to RAM on a laptop but it doesn't mean the model in people's heads is the
> same... not yet anyway.
>
> > > Is there obvious advantage to remove sys_sync() in the case?
> >
> > Yes, there is. It is not necessary to sync() every time you suspend
> > if you do that very often.
>
> But if you do it very often you won't have any dirty pages to flush so it
> will be very fast.

Several people have said things like this, and yet Len clearly saw a problem
so sometimes it isn't as fast as you think it should be. I guess we need
some data.

In fact there seem to be a number of open questions:

1/ Len has seen enough delays to bother sending a patch. Twice. Lots of
other people claim there shouldn't be much delay.

Len: can you provide numbers? How long does the sys_sync take
(min/max/mean....). I think an interesting number would be in a quick
"resume, do something, suspend" cycle, what percent of the time is spent
in sys_sync.
Maybe also report what filesystems are in use, and whether you expect
there to be any dirty data.

If I run "time sync; time sync; time sync" it reports about 50msec for
each sync. If I run "time sleep 0; time sleep 0; time sleep 0" it reports
about 2 msec. So sys_sync is taking at least 50msec.
Almost all of that is in sync_inodes_sb() and much of that is in
_raw_spin_lock.... though I might be misinterpreting perf. It seems to
wait for a BDI flusher thread to go off and do nothing.

Len: is 50msec enough to bother you?


2/ Is a 'sys_sync' really necessary. People claim that "users might lose
data". I claim that some user data could be still in the application
buffers so they would lose that anyway, and applications call fsync after
writing everything out, so sys_sync isn't necessary.
Does anyone have an example of an application which writes important
user data, but doesn't call "fsync" shortly after all data has been written
out of the application's internal buffers?

3/ Is a 'sys_sync' sufficient. Len claims that in some cases, when running
sync; sync
the second sync takes longer, suggesting that the first sync didn't do
everything that was expected. Prior to Linux 1.3.20, sys_sync didn't
wait for everything. Since then it seems to be designed to, but maybe
something isn't right there.
In that case, having sys_sync may not be helping anyway.

Len: can you reliably reproduce this? Can you provide a recipe?

4/ Which part of 'sync' is really important?
sys_sync does two things. It initiates writeout on dirty data and it
wait for writeout to complete. Even if the former isn't necessary, the
latter probably is. Do we need to reliably wait for all output queues
to flush? Where is the best place to do that?

Thanks,
NeilBrown



>
> > And it is done in such a place that everything needs to wait for it to complete.
>
> Only because the code deciding to trigger any automated suspend doesn't
> do a sync a few seconds before. In the case the user goes to the menus
> and does power->suspend then yes it's a delay. In the case where the OS
> at some level has decided that it's 10 seconds from automatically
> suspending to something the user space can issue a pre-emptive sync to
> get the queue size down.
>
> Alan


Attachments:
(No filename) (811.00 B)
OpenPGP digital signature

2015-06-19 01:10:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thu, Jun 18, 2015 at 02:14:31AM -0400, Len Brown wrote:
> On Sun, May 17, 2015 at 9:57 PM, NeilBrown <[email protected]> wrote:
> > On Fri, 15 May 2015 11:35:57 +0100 One Thousand Gnomes
> > <[email protected]> wrote:
> >
> >> > > Data loss may be caused for hotplug storage(like USB), or all storage
> >> > > when power is exhausted during suspend.
> >> >
> >> > Which also may very well happen at run time, right?
> >>
> >> Intuitively users treat "suspended" as a bit like off and do remove
> >> devices they've "finished with".
> >>
> >> Technically yes the instant off/on on a phone is the same as the suspend
> >> to RAM on a laptop but it doesn't mean the model in people's heads is the
> >> same... not yet anyway.
> >>
> >> > > Is there obvious advantage to remove sys_sync() in the case?
> >> >
> >> > Yes, there is. It is not necessary to sync() every time you suspend
> >> > if you do that very often.
> >>
> >> But if you do it very often you won't have any dirty pages to flush so it
> >> will be very fast.
> >
> > Several people have said things like this, and yet Len clearly saw a problem
> > so sometimes it isn't as fast as you think it should be. I guess we need
> > some data.
> >
> > In fact there seem to be a number of open questions:
> >
> > 1/ Len has seen enough delays to bother sending a patch. Twice. Lots of
> > other people claim there shouldn't be much delay.
> >
> > Len: can you provide numbers? How long does the sys_sync take
> > (min/max/mean....). I think an interesting number would be in a quick
> > "resume, do something, suspend" cycle, what percent of the time is spent
> > in sys_sync.
> > Maybe also report what filesystems are in use, and whether you expect
> > there to be any dirty data.
> >
> > If I run "time sync; time sync; time sync" it reports about 50msec for
> > each sync. If I run "time sleep 0; time sleep 0; time sleep 0" it reports
> > about 2 msec. So sys_sync is taking at least 50msec.
> > Almost all of that is in sync_inodes_sb() and much of that is in
> > _raw_spin_lock.... though I might be misinterpreting perf. It seems to
> > wait for a BDI flusher thread to go off and do nothing.
> >
> > Len: is 50msec enough to bother you?
>
> 50ms is not acceptable.
> 5ms is also not acceptable.
>
> If it were guaranteed to be under 1ms, it would not behigh on my
> "performance pain" list, but I would still require the option
> to delete it entirely.
>
> But sys_sync time is random on many systems,
> with a very large maximum duration.
>
> Attached is the output from analyze_suspend.py -x2
> on a desktop machine, which has just an inexpensive SSD for storage.
> It is a fresh install of Fedora 22 with a stock 4.0.4-301 kernel
> with no tweaks.

More information required. Mounted filesystems, mount options, etc
at minimum so we have some idea of what is being synced. We don't
even know what filesystem you are using....

> The 1st suspend starts with a sync that takes 6.6ms
> But the 2nd suspend starts with a sync that takes 451.8ms

The graph posted just has a single box labelled "sync_filesystems"
with no other information about what is actually taking that time.
We need to know what is consuming all that time *inside*
sync_filesystems() to make any progress here.

> So the reasoning that subsequent sys_sync's should be instantly quick
> because they are following a previous sys_sync is found to be simply
> erroneous by observations such as this.

Which indicates that either the system is not idle as expected, or
there's some kind of bug in one of the filesystems that is being
synced.

But we can't do any analysis of the problem you are seeing if the
only information you give us "it takes too long". Something like a
function trace that tells use exactly what functions are being
called and how long they take to execute would be really handy
here.

Alternatively, you don't need to suspend to test this - just run a
test program on an idle system that does two sync() calls 50ms apart
and see what happens. If you can't reproduce it outside of a suspend
operation, then we need to look at what suspend is actually causing
to be dirtied in those 50ms.

If you can reproduce it, you can then even narrow down the
filesystem that is causing the problem by using syncfs() instead of
sync() to test the filesystems one at a time, and once you've done
that take a writeback event trace so we can see exactly what
inodes/data is being written back, and hence identify what the first
sync is not cleaning.

FWIW, I'd also suggest trying the changes to the sync code here:

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling

As they avoid the need for sync to walk all the cached inodes
in the system needlessly and so should reduce the CPU overhead of
sys_sync() quite substantially.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-06-19 02:35:54

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thu, Jun 18, 2015 at 9:09 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Jun 18, 2015 at 02:14:31AM -0400, Len Brown wrote:
>> On Sun, May 17, 2015 at 9:57 PM, NeilBrown <[email protected]> wrote:
>> > On Fri, 15 May 2015 11:35:57 +0100 One Thousand Gnomes
>> > <[email protected]> wrote:
>> >
>> >> > > Data loss may be caused for hotplug storage(like USB), or all storage
>> >> > > when power is exhausted during suspend.
>> >> >
>> >> > Which also may very well happen at run time, right?
>> >>
>> >> Intuitively users treat "suspended" as a bit like off and do remove
>> >> devices they've "finished with".
>> >>
>> >> Technically yes the instant off/on on a phone is the same as the suspend
>> >> to RAM on a laptop but it doesn't mean the model in people's heads is the
>> >> same... not yet anyway.
>> >>
>> >> > > Is there obvious advantage to remove sys_sync() in the case?
>> >> >
>> >> > Yes, there is. It is not necessary to sync() every time you suspend
>> >> > if you do that very often.
>> >>
>> >> But if you do it very often you won't have any dirty pages to flush so it
>> >> will be very fast.
>> >
>> > Several people have said things like this, and yet Len clearly saw a problem
>> > so sometimes it isn't as fast as you think it should be. I guess we need
>> > some data.
>> >
>> > In fact there seem to be a number of open questions:
>> >
>> > 1/ Len has seen enough delays to bother sending a patch. Twice. Lots of
>> > other people claim there shouldn't be much delay.
>> >
>> > Len: can you provide numbers? How long does the sys_sync take
>> > (min/max/mean....). I think an interesting number would be in a quick
>> > "resume, do something, suspend" cycle, what percent of the time is spent
>> > in sys_sync.
>> > Maybe also report what filesystems are in use, and whether you expect
>> > there to be any dirty data.
>> >
>> > If I run "time sync; time sync; time sync" it reports about 50msec for
>> > each sync. If I run "time sleep 0; time sleep 0; time sleep 0" it reports
>> > about 2 msec. So sys_sync is taking at least 50msec.
>> > Almost all of that is in sync_inodes_sb() and much of that is in
>> > _raw_spin_lock.... though I might be misinterpreting perf. It seems to
>> > wait for a BDI flusher thread to go off and do nothing.
>> >
>> > Len: is 50msec enough to bother you?
>>
>> 50ms is not acceptable.
>> 5ms is also not acceptable.
>>
>> If it were guaranteed to be under 1ms, it would not behigh on my
>> "performance pain" list, but I would still require the option
>> to delete it entirely.
>>
>> But sys_sync time is random on many systems,
>> with a very large maximum duration.
>>
>> Attached is the output from analyze_suspend.py -x2
>> on a desktop machine, which has just an inexpensive SSD for storage.
>> It is a fresh install of Fedora 22 with a stock 4.0.4-301 kernel
>> with no tweaks.
>
> More information required. Mounted filesystems, mount options, etc
> at minimum so we have some idea of what is being synced. We don't
> even know what filesystem you are using....

In this case, I am using Fedora 22 defaults on a single SSD:

ext4 (rw,relatime,seclabel,data=ordered)


[lenb@d975xbx ~]$ mount
/dev/sda3 on / type ext4 (rw,relatime,seclabel,data=ordered)
devtmpfs on /dev type devtmpfs (rw,relatime,seclabel,size=3053436k,nr_inodes=76)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,rela)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=6)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,x)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime,seclabel)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cp)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatim)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,r)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blk)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,h)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relati)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,me)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,d)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,f)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=21,pgrp=1,tim)
debugfs on /sys/kernel/debug type debugfs (rw,relatime,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
tmpfs on /tmp type tmpfs (rw,seclabel)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
configfs on /sys/kernel/config type configfs (rw,relatime)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=6109)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)


>> The 1st suspend starts with a sync that takes 6.6ms
>> But the 2nd suspend starts with a sync that takes 451.8ms
>
> The graph posted just has a single box labelled "sync_filesystems"
> with no other information about what is actually taking that time.
> We need to know what is consuming all that time *inside*
> sync_filesystems() to make any progress here.

Just for grins, did 100 sys_sync instead of one.
They all ran about the same 5ms.
While they were all to slow, none of them were
O(500ms), so yes, there
does seem to be some state change
that causes the 2nd sync after a resume to be especially slow.

Unfortunately, I've not got an ftrace on the 500ms flavor yet.


>> So the reasoning that subsequent sys_sync's should be instantly quick
>> because they are following a previous sys_sync is found to be simply
>> erroneous by observations such as this.
>
> Which indicates that either the system is not idle as expected, or
> there's some kind of bug in one of the filesystems that is being
> synced.

the system is idle.

>
> But we can't do any analysis of the problem you are seeing if the
> only information you give us "it takes too long".

I'm happy to help debug why sync is sometimes extremely slow.
But the fact remains that even when it is fast, it is too slow,
and should not be hard-coded in the suspend path.

> Something like a
> function trace that tells use exactly what functions are being
> called and how long they take to execute would be really handy
> here.
>
> Alternatively, you don't need to suspend to test this - just run a
> test program on an idle system that does two sync() calls 50ms apart
> and see what happens. If you can't reproduce it outside of a suspend
> operation, then we need to look at what suspend is actually causing
> to be dirtied in those 50ms.

how about something like this:

>From single user mode:

[root@d975xbx ~]# cat ./do.sync
cd /sys/kernel/debug/tracing

echo 0 > tracing_on
echo global > trace_clock
echo 100000 > buffer_size_kb
echo function_graph > current_tracer
echo 1 > options/funcgraph-tail
echo 1 > options/funcgraph-abstime
echo 1 > options/funcgraph-proc

echo 8 > max_graph_depth
echo sys_sync > set_graph_function

for a in 0 1 2 3 4 5 6 7 8 9 ; do
echo > trace
sync
sleep .05
cp trace ~lenb/trace.$a
done

cd ~lenb
grep sys_sync trace.*

[root@d975xbx ~]# grep sys_sync trace.*
trace.0: 3353.150723 | 0) sync-9346 | | sys_sync() {
trace.0: 3353.170781 | 0) sync-9346 | # 20050.69 us | } /* sys_sync */
trace.1: 3353.358415 | 0) sync-9349 | | sys_sync() {
trace.1: 3353.446597 | 1) sync-9349 | # 88159.41 us | } /* sys_sync */
trace.2: 3353.639402 | 0) sync-9353 | | sys_sync() {
trace.2: 3353.728769 | 1) sync-9353 | # 89342.61 us | } /* sys_sync */
trace.3: 3353.914381 | 1) sync-9356 | | sys_sync() {
trace.3: 3354.003319 | 1) sync-9356 | # 88915.31 us | } /* sys_sync */
trace.4: 3354.199339 | 1) sync-9359 | | sys_sync() {
trace.4: 3354.287795 | 1) sync-9359 | # 88433.31 us | } /* sys_sync */
trace.5: 3354.479363 | 0) sync-9362 | | sys_sync() {
trace.5: 3354.568357 | 1) sync-9362 | # 88971.71 us | } /* sys_sync */
trace.6: 3354.762365 | 0) sync-9365 | | sys_sync() {
trace.6: 3354.851681 | 0) sync-9365 | # 89292.67 us | } /* sys_sync */
trace.7: 3355.038391 | 0) sync-9368 | | sys_sync() {
trace.7: 3355.127130 | 0) sync-9368 | # 88715.66 us | } /* sys_sync */
trace.8: 3355.311360 | 0) sync-9371 | | sys_sync() {
trace.8: 3355.398359 | 0) sync-9371 | # 86976.54 us | } /* sys_sync */
trace.9: 3355.588336 | 0) sync-9374 | | sys_sync() {
trace.9: 3355.676756 | 0) sync-9374 | # 88397.52 us | } /* sys_sync */

# grep \# trace.5
# tracer: function_graph
#
# TIME CPU TASK/PID DURATION FUNCTION CALLS
# | | | | | | | | | |
3354.479519 | 0) sync-9362 | # 1887.858 us |
__schedule();
3354.481409 | 0) sync-9362 | # 1889.459 us | } /*
schedule */
3354.481409 | 0) sync-9362 | # 1890.467 us | } /*
schedule_timeout */
3354.481411 | 0) sync-9362 | # 1894.576 us | } /*
wait_for_completion */
3354.481420 | 0) sync-9362 | # 13762.41 us |
__wait_on_bit();
3354.495187 | 0) sync-9362 | # 13763.64 us | } /*
wait_on_page_bit */
3354.495196 | 0) sync-9362 | # 16248.52 us |
__wait_on_bit();
3354.511449 | 0) sync-9362 | # 16249.50 us | } /*
wait_on_page_bit */
3354.511455 | 0) sync-9362 | # 17580.20 us |
__wait_on_bit();
3354.529041 | 0) sync-9362 | # 17581.19 us | } /*
wait_on_page_bit */
3354.529047 | 0) sync-9362 | # 16214.36 us |
__wait_on_bit();
3354.545266 | 0) sync-9362 | # 16215.31 us | } /*
wait_on_page_bit */
3354.545272 | 0) sync-9362 | # 6439.709 us |
__wait_on_bit();
3354.551715 | 0) sync-9362 | # 6440.856 us | } /*
wait_on_page_bit */
3354.551723 | 0) sync-9362 | # 70289.60 us | } /*
filemap_fdatawait_range */
3354.551723 | 0) sync-9362 | # 70290.64 us | } /*
filemap_fdatawait */
3354.567133 | 0) sync-9362 | # 87611.58 us | } /*
sync_inodes_sb */
3354.567133 | 0) sync-9362 | # 87612.32 us | } /*
sync_inodes_one_sb */
3354.567251 | 0) sync-9362 | # 87831.46 us | } /* iterate_supers */
3354.568357 | 1) sync-9362 | # 88971.71 us | } /* sys_sync */

That is where the time is going in the "normal" case.

cheers,
-Len

>
> If you can reproduce it, you can then even narrow down the
> filesystem that is causing the problem by using syncfs() instead of
> sync() to test the filesystems one at a time, and once you've done
> that take a writeback event trace so we can see exactly what
> inodes/data is being written back, and hence identify what the first
> sync is not cleaning.
>
> FWIW, I'd also suggest trying the changes to the sync code here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling
>
> As they avoid the need for sync to walk all the cached inodes
> in the system needlessly and so should reduce the CPU overhead of
> sys_sync() quite substantially.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]



--
Len Brown, Intel Open Source Technology Center

2015-06-19 04:32:11

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thu, Jun 18, 2015 at 10:35:44PM -0400, Len Brown wrote:
> On Thu, Jun 18, 2015 at 9:09 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Jun 18, 2015 at 02:14:31AM -0400, Len Brown wrote:
> >> On Sun, May 17, 2015 at 9:57 PM, NeilBrown <[email protected]> wrote:
> >> > 1/ Len has seen enough delays to bother sending a patch. Twice. Lots of
> >> > other people claim there shouldn't be much delay.
> >> >
> >> > Len: can you provide numbers? How long does the sys_sync take
> >> > (min/max/mean....). I think an interesting number would be in a quick
> >> > "resume, do something, suspend" cycle, what percent of the time is spent
> >> > in sys_sync.
> >> > Maybe also report what filesystems are in use, and whether you expect
> >> > there to be any dirty data.
> >> >
> >> > If I run "time sync; time sync; time sync" it reports about 50msec for
> >> > each sync. If I run "time sleep 0; time sleep 0; time sleep 0" it reports
> >> > about 2 msec. So sys_sync is taking at least 50msec.
> >> > Almost all of that is in sync_inodes_sb() and much of that is in
> >> > _raw_spin_lock.... though I might be misinterpreting perf. It seems to
> >> > wait for a BDI flusher thread to go off and do nothing.
> >> >
> >> > Len: is 50msec enough to bother you?
> >>
> >> 50ms is not acceptable.
> >> 5ms is also not acceptable.
> >>
> >> If it were guaranteed to be under 1ms, it would not behigh on my
> >> "performance pain" list, but I would still require the option
> >> to delete it entirely.
> >>
> >> But sys_sync time is random on many systems,
> >> with a very large maximum duration.
> >>
> >> Attached is the output from analyze_suspend.py -x2
> >> on a desktop machine, which has just an inexpensive SSD for storage.
> >> It is a fresh install of Fedora 22 with a stock 4.0.4-301 kernel
> >> with no tweaks.
> >
> > More information required. Mounted filesystems, mount options, etc
> > at minimum so we have some idea of what is being synced. We don't
> > even know what filesystem you are using....
>
> In this case, I am using Fedora 22 defaults on a single SSD:
>
> ext4 (rw,relatime,seclabel,data=ordered)

OK, nothing unusual.

> [lenb@d975xbx ~]$ mount
> /dev/sda3 on / type ext4 (rw,relatime,seclabel,data=ordered)
....
> /dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
....

And so effectively only a single filesystem that would do IO when
sync is called.

> > The graph posted just has a single box labelled "sync_filesystems"
> > with no other information about what is actually taking that time.
> > We need to know what is consuming all that time *inside*
> > sync_filesystems() to make any progress here.
>
> Just for grins, did 100 sys_sync instead of one.
> They all ran about the same 5ms.

Ok, so what's the overhead? On one of my test VMs, running on
spinning disk backed devices and a single ext3 root filesystem and
running that sync scalability patch series:

> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling

$ for i in `seq 0 1 100`; do time sleep 0 ; done 2>&1 | stats

real 0m0.002s
user 0m0.000000-0.004000(0.000168317+/-0.00058)s
sys 0m0.000000-0.003000(0.00154455+/-0.00074)s
$ for i in `seq 0 1 100`; do time sync ; done 2>&1 | stats

real 0m0.002s
user 0m0.000000-0.002000(0.000118812+/-0.00041)s
sys 0m0.000000-0.003000(0.00151485+/-0.0007)s
$

The sync runtime is exactly the same as running "sleep 0" 100 times.
Without the sync scalability patchset:

$ for i in `seq 0 1 100`; do time sync ; done 2>&1 | stats

real 0m0.007s
user 0m0.000000-0.004000(0.000316832+/-0.00096)s
sys 0m0.003000-0.008000(0.00690099+/-0.00099)s
$

There's 5ms extra time on every sync, and it is entirely system CPU
time. That will be the inode cache traversal overhead. To confirm,
drop the cache and rerun:

$ sudo grep ext3_inode /proc/slabinfo
ext3_inode_cache 23086 23086 960 17 4 : tunables 0 0 0 : slabdata 1358 1358 0
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches "
$ sudo grep ext3_inode /proc/slabinfo
ext3_inode_cache 3152 4488 960 17 4 : tunables 0 0 0 : slabdata 264 264 0
$ for i in `seq 0 1 100`; do time sync ; done 2>&1 | stats

real 0m0.002s
user 0m0.000000-0.005000(0.000227723+/-0.00074)s
sys 0m0.000000-0.002000(0.00182178+/-0.00057)s

Dropping the inode cache (from 23,000 inodes to 3,000) makes the 5ms
sync runtime disappear.

Can you repeat this test on your system, so that we can determine if
the 5ms ""sync time" is actually just the overhead of inode cache
traversal? If that is the case, the speed of sync on a clean
filesystem is already a solved problem - the patchset should be
merged in the 4.2 cycle....

> While they were all to slow, none of them were
> O(500ms), so yes, there
> does seem to be some state change
> that causes the 2nd sync after a resume to be especially slow.
>
> Unfortunately, I've not got an ftrace on the 500ms flavor yet.

This is the problem we really need to reproduce and track down.

> >> So the reasoning that subsequent sys_sync's should be instantly quick
> >> because they are following a previous sys_sync is found to be simply
> >> erroneous by observations such as this.
> >
> > Which indicates that either the system is not idle as expected, or
> > there's some kind of bug in one of the filesystems that is being
> > synced.
>
> the system is idle.

That doesn't mean the filesystem is idle (e.g. it could be cleaning
the log) or that nothing is dirtying the filesystem behind your back.

> > Something like a
> > function trace that tells use exactly what functions are being
> > called and how long they take to execute would be really handy
> > here.
> >
> > Alternatively, you don't need to suspend to test this - just run a
> > test program on an idle system that does two sync() calls 50ms apart
> > and see what happens. If you can't reproduce it outside of a suspend
> > operation, then we need to look at what suspend is actually causing
> > to be dirtied in those 50ms.
>
> how about something like this:
>
> From single user mode:
>
> [root@d975xbx ~]# cat ./do.sync
> cd /sys/kernel/debug/tracing
>
> echo 0 > tracing_on
> echo global > trace_clock
> echo 100000 > buffer_size_kb
> echo function_graph > current_tracer
> echo 1 > options/funcgraph-tail
> echo 1 > options/funcgraph-abstime
> echo 1 > options/funcgraph-proc
>
> echo 8 > max_graph_depth
> echo sys_sync > set_graph_function
>
> for a in 0 1 2 3 4 5 6 7 8 9 ; do
> echo > trace
> sync
> sleep .05
> cp trace ~lenb/trace.$a

This dirties the filesystem on every loop. Which means the second
and subseqent sync calls are not running on a clean filesystem and
are issuing and waiting on IO.

> done
>
> cd ~lenb
> grep sys_sync trace.*
>
> [root@d975xbx ~]# grep sys_sync trace.*
> trace.0: 3353.150723 | 0) sync-9346 | | sys_sync() {
> trace.0: 3353.170781 | 0) sync-9346 | # 20050.69 us | } /* sys_sync */

20ms, looks clean. Slow, probably due to the function tracer, but
otherwise seems fine.

> trace.1: 3353.358415 | 0) sync-9349 | | sys_sync() {
> trace.1: 3353.446597 | 1) sync-9349 | # 88159.41 us | } /* sys_sync */

88ms, and every subsequent run takes that same time, because:

> # tracer: function_graph
> #
> # TIME CPU TASK/PID DURATION FUNCTION CALLS
> # | | | | | | | | | |
> 3354.479519 | 0) sync-9362 | # 1887.858 us | __schedule();
> 3354.481409 | 0) sync-9362 | # 1889.459 us | } /* schedule */
> 3354.481409 | 0) sync-9362 | # 1890.467 us | } /* schedule_timeout */
> 3354.481411 | 0) sync-9362 | # 1894.576 us | } /* wait_for_completion */
> 3354.481420 | 0) sync-9362 | # 13762.41 us | __wait_on_bit();
> 3354.495187 | 0) sync-9362 | # 13763.64 us | } /* wait_on_page_bit */
> 3354.495196 | 0) sync-9362 | # 16248.52 us | __wait_on_bit();
> 3354.511449 | 0) sync-9362 | # 16249.50 us | } /* wait_on_page_bit */
> 3354.511455 | 0) sync-9362 | # 17580.20 us | __wait_on_bit();
> 3354.529041 | 0) sync-9362 | # 17581.19 us | } /* wait_on_page_bit */
> 3354.529047 | 0) sync-9362 | # 16214.36 us | __wait_on_bit();
> 3354.545266 | 0) sync-9362 | # 16215.31 us | } /* wait_on_page_bit */
> 3354.545272 | 0) sync-9362 | # 6439.709 us | __wait_on_bit();
> 3354.551715 | 0) sync-9362 | # 6440.856 us | } /* wait_on_page_bit */
> 3354.551723 | 0) sync-9362 | # 70289.60 us | } /* filemap_fdatawait_range */
> 3354.551723 | 0) sync-9362 | # 70290.64 us | } /* filemap_fdatawait */
> 3354.567133 | 0) sync-9362 | # 87611.58 us | } /* sync_inodes_sb */
> 3354.567133 | 0) sync-9362 | # 87612.32 us | } /* sync_inodes_one_sb */
> 3354.567251 | 0) sync-9362 | # 87831.46 us | } /* iterate_supers */
> 3354.568357 | 1) sync-9362 | # 88971.71 us | } /* sys_sync */
>
> That is where the time is going in the "normal" case.

... this clearly shows sync is waiting on a dirty data being written
to disk. Write your trace files to something hosted on tmpfs (e.g.
/tmp) rather than copying them to a filesystem that requires IO
during the next sync call.


Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-06-19 06:34:47

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

> Can you repeat this test on your system, so that we can determine if
> the 5ms ""sync time" is actually just the overhead of inode cache
> traversal? If that is the case, the speed of sync on a clean
> filesystem is already a solved problem - the patchset should be
> merged in the 4.2 cycle....

Yes, drop_caches does seem to help repeated sync on this system:
Exactly what patch series does this? I'm running ext4 (the default,
not btrfs)

[lenb@d975xbx ~]$ for i in `seq 0 1 10`; do time sleep 0 ; done

real 0m0.002s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.001s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.001s
sys 0m0.000s

real 0m0.001s
user 0m0.001s
sys 0m0.000s

real 0m0.001s
user 0m0.001s
sys 0m0.000s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.001s
sys 0m0.000s

real 0m0.001s
user 0m0.001s
sys 0m0.000s

real 0m0.001s
user 0m0.000s
sys 0m0.001s
[lenb@d975xbx ~]$ for i in `seq 0 1 10`; do time sync ; done

real 0m0.004s
user 0m0.000s
sys 0m0.003s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.003s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.002s
user 0m0.000s
sys 0m0.002s
[lenb@d975xbx ~]$ sudo grep ext4_inode /proc/slabinfo
ext4_inode_cache 3536 3536 1008 16 4 : tunables 0 0
0 : slabdata 221 221 0
[lenb@d975xbx ~]$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches "
[lenb@d975xbx ~]$ sudo grep ext4_inode /proc/slabinfo
ext4_inode_cache 553 1680 1008 16 4 : tunables 0 0
0 : slabdata 105 105 0
[lenb@d975xbx ~]$ for i in `seq 0 1 10`; do time sync ; done

real 0m0.002s
user 0m0.000s
sys 0m0.001s

real 0m0.002s
user 0m0.000s
sys 0m0.002s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.001s

real 0m0.001s
user 0m0.000s
sys 0m0.002s


>> While they were all to slow, none of them were
>> O(500ms), so yes, there
>> does seem to be some state change
>> that causes the 2nd sync after a resume to be especially slow.
>>
>> Unfortunately, I've not got an ftrace on the 500ms flavor yet.
>
> This is the problem we really need to reproduce and track down.

Putting a function trace on sys_sync and executing sync manually,
I was able to see it take 100ms,
though function trace itself could be contributing to that...

[lenb@d975xbx ~]$ grep \# trace.20150619_013702
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
374.665063 | 0) # 2229.612 us | } /* __schedule */
374.665064 | 0) # 2230.571 us | } /* schedule */
374.665064 | 0) # 2231.494 us | } /* schedule_timeout */
374.665065 | 0) # 2235.937 us | } /* wait_for_completion */
374.745616 | 0) # 80518.73 us | } /* __schedule */
374.745616 | 0) # 80519.47 us | } /* schedule */
374.745617 | 0) # 80520.28 us | } /*
schedule_timeout */
374.745621 | 0) # 80526.38 us | } /*
io_schedule_timeout */
374.745621 | 0) # 80527.23 us | } /* bit_wait_io */
374.745622 | 0) # 80531.04 us | } /* __wait_on_bit */
374.745623 | 0) # 80531.95 us | } /* wait_on_page_bit */
374.745644 | 0) # 80555.58 us | } /* filemap_fdatawait_range */
374.745644 | 0) # 80556.36 us | } /* filemap_fdatawait */
374.748029 | 0) # 1300.848 us | } /* __schedule */
374.748029 | 0) # 1301.376 us | } /* schedule */
374.748029 | 0) # 1301.923 us | } /*
schedule_timeout */
374.748032 | 0) # 1306.133 us | } /*
io_schedule_timeout */
374.748032 | 0) # 1306.651 us | } /* bit_wait_io */
374.748033 | 0) # 1309.298 us | } /* __wait_on_bit */
374.748033 | 0) # 1309.838 us | } /* wait_on_page_bit */
374.750502 | 0) # 1099.379 us | } /* __schedule */
374.750503 | 0) # 1100.102 us | } /* schedule */
374.750503 | 0) # 1100.882 us | } /*
schedule_timeout */
374.750509 | 0) # 1108.399 us | } /*
io_schedule_timeout */
374.750510 | 0) # 1109.160 us | } /* bit_wait_io */
374.750511 | 0) # 1112.541 us | } /* __wait_on_bit */
374.750512 | 0) # 1113.310 us | } /* wait_on_page_bit */
374.752063 | 0) # 5827.910 us | } /* filemap_fdatawait_range */
374.752063 | 0) # 5828.517 us | } /* filemap_fdatawait */
374.764753 | 0) # 101948.3 us | } /* sync_inodes_sb */
374.764754 | 0) # 101949.1 us | } /* sync_inodes_one_sb */
374.764903 | 0) # 102198.2 us | } /* iterate_supers */
374.767693 | 0) # 1094.872 us | } /* blk_flush_plug_list */
374.767693 | 0) # 1095.405 us | } /* blk_finish_plug */
374.767694 | 0) # 1780.430 us | } /* generic_writepages */
374.767694 | 0) # 1781.172 us | } /* do_writepages */
374.767694 | 0) # 1784.820 us | } /* __filemap_fdatawrite_range */
374.767695 | 0) # 1785.551 us | } /* filemap_fdatawrite */
374.767695 | 0) # 1786.357 us | } /* fdatawrite_one_bdev */
374.767698 | 0) # 1857.427 us | } /* iterate_bdevs */
374.767818 | 0) # 105179.2 us | } /* sys_sync */

running analyze_suspend.py after the slab tweak above didn't change much.
in one run sync was 20ms (out of a total suspend time of 60ms).

Curiously, in another run, sync ran at 15ms, but sd suspend exploded to 300ms.
I've seen that in some other results. Sometimes sync if fast, but sd
then more than makes up for it by being slow:-(

FYI,
I ran analyze_suspend.py -x2
from current directory /tmp, which is mounted on tmpfs,
but still found the 2nd sync was very slow -- 200ms
vs 6 - 20 ms for the sync preceding the 1st suspend.

thanks
Len Brown, Intel Open Source Technology Center

2015-06-19 23:07:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri, Jun 19, 2015 at 02:34:37AM -0400, Len Brown wrote:
> > Can you repeat this test on your system, so that we can determine if
> > the 5ms ""sync time" is actually just the overhead of inode cache
> > traversal? If that is the case, the speed of sync on a clean
> > filesystem is already a solved problem - the patchset should be
> > merged in the 4.2 cycle....
>
> Yes, drop_caches does seem to help repeated sync on this system:
> Exactly what patch series does this? I'm running ext4 (the default,
> not btrfs)

None. It's the current behaviour of sync that is ends up walking the
inode cache in it's entirity to find dirty inodes that need to be
waited on. That's what the sync scalability patch series I pointed
you at fixes - sync then keeps a "dirty inodes that need to be
waited on list" instead of doing a cache traversal to find them.
i.e. the "no cache" results you see will soon be the behaviour sync
has regardless of the size of the inode cache.

> [lenb@d975xbx ~]$ sudo grep ext4_inode /proc/slabinfo
> ext4_inode_cache 3536 3536 1008 16 4 : tunables 0 0
> 0 : slabdata 221 221 0

That's actually a really small cache to begin with.

> > This is the problem we really need to reproduce and track down.
>
> Putting a function trace on sys_sync and executing sync manually,
> I was able to see it take 100ms,
> though function trace itself could be contributing to that...

It would seem that way - you need to get the traces to dump to
something that has no sync overhead....

> running analyze_suspend.py after the slab tweak above didn't change much.
> in one run sync was 20ms (out of a total suspend time of 60ms).

Which may be because the inode cache was larger?

> Curiously, in another run, sync ran at 15ms, but sd suspend exploded to 300ms.
> I've seen that in some other results. Sometimes sync if fast, but sd
> then more than makes up for it by being slow:-(

Oh, I see that too. Normally That's because the filesystem hasn't
been told to enter an idle state and so is doing metadata writeback
IO after the sync. When that happens the sd suspend has wait for
request queues to drain, IO to complete and device caches to flush.
This simply cannot be avoided because suspend never tells the
filesytems to enter an idle state....

i.e. remember what I said initially in this thread about suspend
actually needing to freeze filesystems, not just sync them?

> FYI,
> I ran analyze_suspend.py -x2
> from current directory /tmp, which is mounted on tmpfs,
> but still found the 2nd sync was very slow -- 200ms
> vs 6 - 20 ms for the sync preceding the 1st suspend.

So where did that time go? As I pointed out previously, function
trace will only tell us if the delay is data writeback or not. We
seem to have confirmed that the delay is, indeed, writeback of dirty
data. Now we need to identify what the dirty data belongs to: we
need to trace individual writeback events to see what dirty inodes
are actually being written.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-06-20 05:26:30

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

>> Putting a function trace on sys_sync and executing sync manually,
>> I was able to see it take 100ms,
>> though function trace itself could be contributing to that...
>
> It would seem that way - you need to get the traces to dump to
> something that has no sync overhead....

I don't think that ftrace buffers have sync overhead --
at least not until they are copied into files.

>> Curiously, in another run, sync ran at 15ms, but sd suspend exploded to 300ms.
>> I've seen that in some other results. Sometimes sync if fast, but sd
>> then more than makes up for it by being slow:-(
>
> Oh, I see that too. Normally That's because the filesystem hasn't
> been told to enter an idle state and so is doing metadata writeback
> IO after the sync. When that happens the sd suspend has wait for
> request queues to drain, IO to complete and device caches to flush.
> This simply cannot be avoided because suspend never tells the
> filesytems to enter an idle state....

I captured a trace of a slow sd suspend.
Apparently, it does two operations -- first a sync_cache,
then a stop operation. The sync was fast. The stop command
was where all the time went.

I'll look at a more modern drive on the same system next week,
just for comparison.

> i.e. remember what I said initially in this thread about suspend
> actually needing to freeze filesystems, not just sync them?

I think with the complexity of file systems and the underlying
devices, yes, we need to think about how to efficiently
and safely suspend/resume them.

But sys_sync is too expensive to have hard-coded in the kernel suspend path.
Some machines can suspend and resume in under 10ms -- they
absolutely do not want sys_sync() hard-coded in the suspend path.

>> FYI,
>> I ran analyze_suspend.py -x2
>> from current directory /tmp, which is mounted on tmpfs,
>> but still found the 2nd sync was very slow -- 200ms
>> vs 6 - 20 ms for the sync preceding the 1st suspend.
>
> So where did that time go? As I pointed out previously, function
> trace will only tell us if the delay is data writeback or not. We
> seem to have confirmed that the delay is, indeed, writeback of dirty
> data. Now we need to identify what the dirty data belongs to: we
> need to trace individual writeback events to see what dirty inodes
> are actually being written.

I expect that analyze_suspend.py is moving data around between
the back-to-back suspends when the -x2 option is used -- will look into it.

thanks,
Len Brown, Intel Open Source Technology Center

Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon, 11 May 2015, Len Brown wrote:
> On Sat, May 9, 2015 at 4:25 PM, Henrique de Moraes Holschuh
> <[email protected]> wrote:
> > On Sat, 09 May 2015, Alan Stern wrote:
> >> On Fri, 8 May 2015, Rafael J. Wysocki wrote:
> >> > My current view on that is that whether or not to do a sync() before suspending
> >> > ultimately is a policy decision and should belong to user space as such (modulo
> >> > the autosleep situation when user space may not know when the suspend is going
> >> > to happen).
> >> >
> >> > Moreover, user space is free to do as many sync()s before suspending as it
> >> > wants to and the question here is whether or not the *kernel* should sync()
> >> > in the suspend code path.
> >> >
> >> > Since we pretty much can demonstrate that having just one sync() in there is
> >> > not sufficient in general, should we put two of them in there? Or just
> >> > remove the existing one and leave it to user space entirely?
> >>
> >> I don't know about the advantages of one sync over two. But how about
> >> adding a "syncs_before_suspend" (or just "syncs") sysfs attribute that
> >> takes a small numeric value? The default can be 0, and the user could
> >> set it to 1 or 2 (or higher).
> >
> > IMO it would be much safer to both have that knob, and to set it to keep the
> > current behavior as the default. Userspace will adapt and change that knob
> > to whatever is sufficient based on what it does before signaling the kernel
> > to suspend.
> >
> > A regression in sync-before-suspend is sure to cause data loss episodes,
> > after all. And, as far as bikeshedding goes, IMHO syncs_before_suspend is
> > self-explanatory, which would be a very good reason to use it instead of the
> > shorter requires-you-to-know-what-it-is-about "syncs".
>
> When I first thought about this, I had a similar view:
> https://lkml.org/lkml/2014/1/23/45
>
> But upon reflection, I do not believe that the kernel is adding value
> here, instead it is imposing a policy, and that policy decision is
> sometimes prohibitively expensive. User-space can do this for itself (and
> in the case of desktop distros, already does), and so the kernel should
> butt-out.

There is a lot of added value in my filesystems not being trashed by
sleep/resume issues on laptops, IMHO, and the reason why we need the kernel
itself to take care of syncing and freezing filesystems has been explained
elsewhere in this thread.

I thoght for a while before replying, and I think the real issue behind this
thread is the want of a change of expected-but-implied semanthics and
behavior for the system-wide sleep-to-memory trigger, to adequate it to a
new reality for newer classes of devices.

Entering "mem" suspend mode through sysfs currently has the implied meaning
of "prepare the *entire* system to stay on a powered down state for
pontentially a _long_ time", where long means "certainly more than 10
seconds" ;-) This is unlikely to be written anywhere, of course, that's just
how it was used by the vast majority for years, at least on traditional
server/desktop/laptop platforms such as x86.

On those platforms, we have to assume the user might plug/unplug devices,
that the power supply might shut down while we're sleeping, that the entire
process is not painless and has a reasonable chance of misbehaving (crashes
on sleep/resume are _really_ common), etc.

What is the safe and proper thing to do in that situation is not necessarily
the best way to go about it when you actually want a somewhat different
behavior, i.e. to "prepare the system to stay on a powered down state for a
short while, and be very fast because this could happen at a very high
frequency"...

IMO, we would actually benefit from *adding* new system-wide sleep/suspend
modes that are optimized for oportunistic, short-lived system-wide sleep
cycles (aka "catnap") that is fast to enter and exit from, and which will be
triggered very frequently, instead of trying to change the assumptions and
expected behavior of the current "deep-sleep" mode...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2015-06-30 20:05:05

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thu, Jun 25, 2015 at 1:11 PM, Henrique de Moraes Holschuh
<[email protected]> wrote:
> On Mon, 11 May 2015, Len Brown wrote:
>> On Sat, May 9, 2015 at 4:25 PM, Henrique de Moraes Holschuh
>> <[email protected]> wrote:
>> > On Sat, 09 May 2015, Alan Stern wrote:
>> >> On Fri, 8 May 2015, Rafael J. Wysocki wrote:
>> >> > My current view on that is that whether or not to do a sync() before suspending
>> >> > ultimately is a policy decision and should belong to user space as such (modulo
>> >> > the autosleep situation when user space may not know when the suspend is going
>> >> > to happen).
>> >> >
>> >> > Moreover, user space is free to do as many sync()s before suspending as it
>> >> > wants to and the question here is whether or not the *kernel* should sync()
>> >> > in the suspend code path.
>> >> >
>> >> > Since we pretty much can demonstrate that having just one sync() in there is
>> >> > not sufficient in general, should we put two of them in there? Or just
>> >> > remove the existing one and leave it to user space entirely?
>> >>
>> >> I don't know about the advantages of one sync over two. But how about
>> >> adding a "syncs_before_suspend" (or just "syncs") sysfs attribute that
>> >> takes a small numeric value? The default can be 0, and the user could
>> >> set it to 1 or 2 (or higher).
>> >
>> > IMO it would be much safer to both have that knob, and to set it to keep the
>> > current behavior as the default. Userspace will adapt and change that knob
>> > to whatever is sufficient based on what it does before signaling the kernel
>> > to suspend.
>> >
>> > A regression in sync-before-suspend is sure to cause data loss episodes,
>> > after all. And, as far as bikeshedding goes, IMHO syncs_before_suspend is
>> > self-explanatory, which would be a very good reason to use it instead of the
>> > shorter requires-you-to-know-what-it-is-about "syncs".
>>
>> When I first thought about this, I had a similar view:
>> https://lkml.org/lkml/2014/1/23/45
>>
>> But upon reflection, I do not believe that the kernel is adding value
>> here, instead it is imposing a policy, and that policy decision is
>> sometimes prohibitively expensive. User-space can do this for itself (and
>> in the case of desktop distros, already does), and so the kernel should
>> butt-out.
>
> There is a lot of added value in my filesystems not being trashed by
> sleep/resume issues on laptops, IMHO, and the reason why we need the kernel
> itself to take care of syncing and freezing filesystems has been explained
> elsewhere in this thread.
> I thoght for a while before replying, and I think the real issue behind this
> thread is the want of a change of expected-but-implied semanthics and
> behavior for the system-wide sleep-to-memory trigger, to adequate it to a
> new reality for newer classes of devices.
>
> Entering "mem" suspend mode through sysfs currently has the implied meaning
> of "prepare the *entire* system to stay on a powered down state for
> pontentially a _long_ time", where long means "certainly more than 10
> seconds" ;-) This is unlikely to be written anywhere, of course, that's just
> how it was used by the vast majority for years, at least on traditional
> server/desktop/laptop platforms such as x86.

The _vast_ majority of systems using Linux suspend today are under
an Android user-space. Android has no assumption that that suspend to mem
will necessarily stay suspended for a long time.

The wake-on-packet use-case is generally a much shorter suspend duration,
and more frequent number of suspend/resume cycles
than the "wake on lid-open or button-press" use case.

> On those platforms, we have to assume the user might plug/unplug devices,
> that the power supply might shut down while we're sleeping, that the entire
> process is not painless and has a reasonable chance of misbehaving (crashes
> on sleep/resume are _really_ common), etc.

The fact is that sys_sync() in the kernel suspend to mem path is too expensive
for devices which suspend/resume quickly, routinely, and reliably.

The reality that somebody can stick a broken device into their desktop
or laptop and break suspend doesn't change that.

> What is the safe and proper thing to do in that situation is not necessarily
> the best way to go about it when you actually want a somewhat different
> behavior, i.e. to "prepare the system to stay on a powered down state for a
> short while, and be very fast because this could happen at a very high
> frequency"...
>
> IMO, we would actually benefit from *adding* new system-wide sleep/suspend
> modes that are optimized for oportunistic, short-lived system-wide sleep
> cycles (aka "catnap") that is fast to enter and exit from, and which will be
> triggered very frequently, instead of trying to change the assumptions and
> expected behavior of the current "deep-sleep" mode...

Thank you for sharing your opinion.

I am going to give up trying to change your mind, and those who share
your view. I plan to revive my patch from 2014 which
makes sys_sync() optional. That will not change the historic behavior,
and will still allow everybody to do what they want.
Rafael has said that he can live with the resulting kernel clutter.

BTW. the answer does not appear to be creating a new system sleep state.
Android invokes "mem", and they don't seem excited about teaching user-space
that runs on multiple platforms that what used to be a "mem" and no "freeze"
could be a "mem" plus a "freeze", or a "freeze" and no "mem".

They just want the kernel to give them a "mem" that is fast and reliable;
and they are willing to build the kernel in a way that gives them what
they want.

cheers,
Len Brown, Intel Open Source Technology Center

Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tue, Jun 30, 2015, at 17:04, Len Brown wrote:
> > Entering "mem" suspend mode through sysfs currently has the implied meaning
> > of "prepare the *entire* system to stay on a powered down state for
> > pontentially a _long_ time", where long means "certainly more than 10
> > seconds" ;-) This is unlikely to be written anywhere, of course, that's just
> > how it was used by the vast majority for years, at least on traditional
> > server/desktop/laptop platforms such as x86.
>
> The _vast_ majority of systems using Linux suspend today are under
> an Android user-space. Android has no assumption that that suspend to
> mem will necessarily stay suspended for a long time.

Indeed, however your change was not android-specific, and it is not
"comfortable" on x86-style hardware and usage patterns.

> > IMO, we would actually benefit from *adding* new system-wide sleep/suspend
> > modes that are optimized for oportunistic, short-lived system-wide sleep
> > cycles (aka "catnap") that is fast to enter and exit from, and which will be
> > triggered very frequently, instead of trying to change the assumptions and
> > expected behavior of the current "deep-sleep" mode...
>
> Thank you for sharing your opinion.
>
> I am going to give up trying to change your mind, and those who share
> your view. I plan to revive my patch from 2014 which
> makes sys_sync() optional. That will not change the historic behavior,
> and will still allow everybody to do what they want.
> Rafael has said that he can live with the resulting kernel clutter.

...

> BTW. the answer does not appear to be creating a new system sleep state.
> Android invokes "mem", and they don't seem excited about teaching
> user-space that runs on multiple platforms that what used to be a "mem" and no
> "freeze" could be a "mem" plus a "freeze", or a "freeze" and no "mem".

Hmm, maybe we could:

1. Make the behavior of "mem" configurable (select default at compile
time, allow it to be changed at runtime).

2. Add a way to always enter the "heavyweight" (x86-style) mem sleep in
platforms where it exists.

3. Add a way to always enter the "light" (android-style) mem sleep in
platforms where it exists.

And make (2) and (3) optional (as in: you can compile out the clutter).
That at least provides a way forward for userspace, at the price of more
gunk on the kernel side.

We already have a lot of stuff that works that way (idle and freq
governors, io elevators, etc).

That said, as long as x86 will still try to safeguard my data during mem
sleep/resume as it does today, I have no strong feelings about
light/heavy-weight "mem" sleep being strictly a compile-time selectable
thing, or a more flexible runtime-selectable behavior.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2015-07-02 03:07:52

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

>> The _vast_ majority of systems using Linux suspend today are under
>> an Android user-space. Android has no assumption that that suspend to
>> mem will necessarily stay suspended for a long time.
>
> Indeed, however your change was not android-specific, and it is not
> "comfortable" on x86-style hardware and usage patterns.

"comfortable on x86-style and usage patterns"?
If you mean "traditional" instead of "comfortable",
where "tradition" is based on 10-year old systems, then sure.

But today, my x86 Android tablet is quite "comfortable"
without a sys_sync() in the kernel suspend path.

No, this isn't Android specific, Android is just the highest-volume demand,
making it an obvious example.

Chrome is the #1 selling "x86-style" clamshell laptop.
Chrome is not only "comfortable" with fast suspend/resume,
the Chrome developers demand it.

> That said, as long as x86 will still try to safeguard my data during mem
> sleep/resume as it does today, I have no strong feelings about
> light/heavy-weight "mem" sleep being strictly a compile-time selectable
> thing, or a more flexible runtime-selectable behavior.

The observation here is that the kernel should not force every system
to sys_sync() on every suspend. The only question is how to best
implement that. The obvious solution was to delete this forced policy
from the kernel, and let user-space handle it.
Rafael has not agreed to push that obvious, though less-than-gentle
solution upstream, and so I'll re-send the historic patch
that allows distros to still sync like it is 1999, if they want to:-)

thanks,
Len Brown, Intel Open Source Technology Center

2015-07-03 01:43:00

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wed, Jul 01, 2015 at 11:07:29PM -0400, Len Brown wrote:
> >> The _vast_ majority of systems using Linux suspend today are under
> >> an Android user-space. Android has no assumption that that suspend to
> >> mem will necessarily stay suspended for a long time.
> >
> > Indeed, however your change was not android-specific, and it is not
> > "comfortable" on x86-style hardware and usage patterns.
>
> "comfortable on x86-style and usage patterns"?
> If you mean "traditional" instead of "comfortable",
> where "tradition" is based on 10-year old systems, then sure.

Even if this were true(*) we don't break things that currently work
just because something different is "just around the corner". e.g.
if you shut the lid on your laptop and it suspends to RAM, you can
pull the USB drive out that you just copied stuff to and plug it
into another machine and find all the data you copied there is
present.

Remove the sync() from the freeze code, and this isn't guaranteed to
work anymore. It is now dependent on userspace implementations for
this to work, and we know what userspace developers will choose in
this situation. i.e. fast and "works for me", not "safe for
everyone".

(*) Which it clearly isn't true because, as this example shows, my
shiny new laptop still has exactly the same data integrity
requirements as the laptop I was using 10 years ago.

Just because there are lots of Android or Chrome out there it
doesn't mean we can just ignore the requirements of everything
else...

> > That said, as long as x86 will still try to safeguard my data during mem
> > sleep/resume as it does today, I have no strong feelings about
> > light/heavy-weight "mem" sleep being strictly a compile-time selectable
> > thing, or a more flexible runtime-selectable behavior.
>
> The observation here is that the kernel should not force every system
> to sys_sync() on every suspend. The only question is how to best
> implement that.

No, your observation was that "sync is slow". Your *solution* is "we
need to remove sync".

However, your arguement so far has these problems:

- repeated sync from outside the suspend context is does not
demonstrate the problem you are seeing during suspend, and
you have not yet identified why this is the case.
- it has been demonstrated that inode cache size plays a
significant role in sync latency, but you haven't provided
any information to tell us what the cache sizes were when
you see large latencies.
- it has been demonstrated that there are patches pending
that improve clean filesystem sync speed, but you have not
produced numbers to demonstrate that those patches do not
meet your requirements.
- In several tests your "sync latency" monitoring was
dirtying the filesystem and hence *causing* the repeated
syncs to be slow.
- you have not told us whether your suspend monitoring was
the cause of the suspend sync latency or not.
- you have been testing on hardware with questionable power
management behaviour.

IOWs, you have not yet identified the root cause of the slow sync
behaviour on suspend, you have not determined if pending work fixes
the latency problems, and you have not reproduced your results after
fixing the flaws in your testing methodology.

> The obvious solution was to delete this forced policy
> from the kernel, and let user-space handle it.
> Rafael has not agreed to push that obvious, though less-than-gentle
> solution upstream, and so I'll re-send the historic patch
> that allows distros to still sync like it is 1999, if they want to:-)

Please stop shouting about "obvious" solutions until you've actually
proved there is a problem and that problems you find aren't already
fixed by the pending sync changes....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-04 00:37:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Friday, July 03, 2015 11:42:50 AM Dave Chinner wrote:
> On Wed, Jul 01, 2015 at 11:07:29PM -0400, Len Brown wrote:
> > >> The _vast_ majority of systems using Linux suspend today are under
> > >> an Android user-space. Android has no assumption that that suspend to
> > >> mem will necessarily stay suspended for a long time.
> > >
> > > Indeed, however your change was not android-specific, and it is not
> > > "comfortable" on x86-style hardware and usage patterns.
> >
> > "comfortable on x86-style and usage patterns"?
> > If you mean "traditional" instead of "comfortable",
> > where "tradition" is based on 10-year old systems, then sure.
>
> Even if this were true(*) we don't break things that currently work
> just because something different is "just around the corner". e.g.
> if you shut the lid on your laptop and it suspends to RAM, you can
> pull the USB drive out that you just copied stuff to and plug it
> into another machine and find all the data you copied there is
> present.
>
> Remove the sync() from the freeze code, and this isn't guaranteed to
> work anymore. It is now dependent on userspace implementations for
> this to work, and we know what userspace developers will choose in
> this situation. i.e. fast and "works for me", not "safe for
> everyone".
>
> (*) Which it clearly isn't true because, as this example shows, my
> shiny new laptop still has exactly the same data integrity
> requirements as the laptop I was using 10 years ago.
>
> Just because there are lots of Android or Chrome out there it
> doesn't mean we can just ignore the requirements of everything
> else...
>
> > > That said, as long as x86 will still try to safeguard my data during mem
> > > sleep/resume as it does today, I have no strong feelings about
> > > light/heavy-weight "mem" sleep being strictly a compile-time selectable
> > > thing, or a more flexible runtime-selectable behavior.
> >
> > The observation here is that the kernel should not force every system
> > to sys_sync() on every suspend. The only question is how to best
> > implement that.
>
> No, your observation was that "sync is slow". Your *solution* is "we
> need to remove sync".

Not only slow, but pointless too. The argument goes: "It is slow and
pointless and so it may be dropped."

Now, I can agree that it wasn't clearly demonstrated that the unconditional
sys_sync() in the suspend code path was pointless, but it also has never
been clearly shown why it is not pointless on systems that suspend and resume
reliably.

[The argument that the user can pull removable storage devices out of the
system while suspended doesn't hold any water to me, because the user can
pull them out of the system when not suspended just as well and cause the
same kind of damage to happen.]

When we were adding it, the thinking was along the lines of "Well, suspend
isn't too reliable, so let's put sys_sync() in there to possibly reduce the
damage from suspend/resume crashes and suspend is slow anyway, so the possible
effect on performance from that shouldn't be noticeable". Clearly, the world
has changed since then and suspend is far more reliable than it used to be in
general and it is not that slow too at least on some systems (especially the
suspend-to-idle flavor).

The only argument against dropping sys_sync() from the suspend code path
I've seen in this thread that I entirely agree with is that it may lead to
regressions, because we've done it practically forever and it may hide latent
bugs somewhere in block drivers etc. Dropping it, though, is the only way
to see those bugs, if any, and if we want to ever fix them, we need to see
them. That's why I think that it may be a good idea to allow people to
drop it if they are willing to accept some extra risk (via the kernel
command line, for example).

Moreover, question is if we really need to carry out the sync on *every*
suspend even if it is not pointless overall. That shouldn't really be
necessary if we suspend and resume often enough or if we resume only for
a while and then suspend again. Maybe it should be rate limited somehow
at least?

Thanks,
Rafael

2015-07-05 09:20:02

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, Jul 4, 2015 at 3:03 AM, Rafael J. Wysocki <[email protected]> wrote:
> [The argument that the user can pull removable storage devices out of the
> system while suspended doesn't hold any water to me, because the user can
> pull them out of the system when not suspended just as well and cause the
> same kind of damage to happen.]

The user may not notice the difference between a suspended and powered
off machine, and thus think it's safe to unplug the USB memory stick while
the machine appears off.

If that's considered a valid use case, this would be a regression.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-07-05 09:06:46

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:

> The only argument against dropping sys_sync() from the suspend code path
> I've seen in this thread that I entirely agree with is that it may lead to
> regressions, because we've done it practically forever and it may hide latent
> bugs somewhere in block drivers etc. Dropping it, though, is the only way
> to see those bugs, if any, and if we want to ever fix them, we need to see
> them. That's why I think that it may be a good idea to allow people to
> drop it if they are willing to accept some extra risk (via the kernel
> command line, for example).

I'd be perfectly happy to have the sync selectable at runtime, one way
or another. The three most reasonable options seem to be:

kernel command line

sysfs file

sysctl setting

The command line is less flexible (it can't be changed after booting).
Either of the other two would be fine with me.

> Moreover, question is if we really need to carry out the sync on *every*
> suspend even if it is not pointless overall. That shouldn't really be
> necessary if we suspend and resume often enough or if we resume only for
> a while and then suspend again. Maybe it should be rate limited somehow
> at least?

For example, skip the sync if the system has been awake for < 100 ms?
The cutoff time could also be controlled by the sysfs file: -1 =>
never sync, 0 => always sync, > 0 => sync if the system has been awake
longer than the value.

Alan Stern

2015-07-05 22:59:20

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Saturday, July 04, 2015 10:50:36 AM Geert Uytterhoeven wrote:
> On Sat, Jul 4, 2015 at 3:03 AM, Rafael J. Wysocki <[email protected]> wrote:
> > [The argument that the user can pull removable storage devices out of the
> > system while suspended doesn't hold any water to me, because the user can
> > pull them out of the system when not suspended just as well and cause the
> > same kind of damage to happen.]
>
> The user may not notice the difference between a suspended and powered
> off machine, and thus think it's safe to unplug the USB memory stick while
> the machine appears off.
>
> If that's considered a valid use case, this would be a regression.

Well, if user space configures the system to use suspend as a "power off
flavor" and doesn't even try to ensure the filesystems to be synced before
suspending (in that mode), that's on the edge of insanity.

And it falls into the "latent bugs that may be there because we've always
synced before suspending" category in my view.

Thanks,
Rafael

2015-07-05 23:02:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Saturday, July 04, 2015 10:19:55 AM Alan Stern wrote:
> On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:
>
> > The only argument against dropping sys_sync() from the suspend code path
> > I've seen in this thread that I entirely agree with is that it may lead to
> > regressions, because we've done it practically forever and it may hide latent
> > bugs somewhere in block drivers etc. Dropping it, though, is the only way
> > to see those bugs, if any, and if we want to ever fix them, we need to see
> > them. That's why I think that it may be a good idea to allow people to
> > drop it if they are willing to accept some extra risk (via the kernel
> > command line, for example).
>
> I'd be perfectly happy to have the sync selectable at runtime, one way
> or another. The three most reasonable options seem to be:
>
> kernel command line
>
> sysfs file
>
> sysctl setting
>
> The command line is less flexible (it can't be changed after booting).
> Either of the other two would be fine with me.

We'll probably use a sysfs file (possibly plus a Kconfig option to set the
boot time default).

> > Moreover, question is if we really need to carry out the sync on *every*
> > suspend even if it is not pointless overall. That shouldn't really be
> > necessary if we suspend and resume often enough or if we resume only for
> > a while and then suspend again. Maybe it should be rate limited somehow
> > at least?
>
> For example, skip the sync if the system has been awake for < 100 ms?

Yes, something like that.

> The cutoff time could also be controlled by the sysfs file: -1 =>
> never sync, 0 => always sync, > 0 => sync if the system has been awake
> longer than the value.

That sounds like a good idea to me.

Thanks,
Rafael

2015-07-06 00:07:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, Jul 04, 2015 at 03:03:46AM +0200, Rafael J. Wysocki wrote:
> On Friday, July 03, 2015 11:42:50 AM Dave Chinner wrote:
> > On Wed, Jul 01, 2015 at 11:07:29PM -0400, Len Brown wrote:
> > > >> The _vast_ majority of systems using Linux suspend today are under
> > > >> an Android user-space. Android has no assumption that that suspend to
> > > >> mem will necessarily stay suspended for a long time.
> > > >
> > > > Indeed, however your change was not android-specific, and it is not
> > > > "comfortable" on x86-style hardware and usage patterns.
> > >
> > > "comfortable on x86-style and usage patterns"?
> > > If you mean "traditional" instead of "comfortable",
> > > where "tradition" is based on 10-year old systems, then sure.
> >
> > Even if this were true(*) we don't break things that currently work
> > just because something different is "just around the corner". e.g.
> > if you shut the lid on your laptop and it suspends to RAM, you can
> > pull the USB drive out that you just copied stuff to and plug it
> > into another machine and find all the data you copied there is
> > present.
> >
> > Remove the sync() from the freeze code, and this isn't guaranteed to
> > work anymore. It is now dependent on userspace implementations for
> > this to work, and we know what userspace developers will choose in
> > this situation. i.e. fast and "works for me", not "safe for
> > everyone".
> >
> > (*) Which it clearly isn't true because, as this example shows, my
> > shiny new laptop still has exactly the same data integrity
> > requirements as the laptop I was using 10 years ago.
> >
> > Just because there are lots of Android or Chrome out there it
> > doesn't mean we can just ignore the requirements of everything
> > else...
> >
> > > > That said, as long as x86 will still try to safeguard my data during mem
> > > > sleep/resume as it does today, I have no strong feelings about
> > > > light/heavy-weight "mem" sleep being strictly a compile-time selectable
> > > > thing, or a more flexible runtime-selectable behavior.
> > >
> > > The observation here is that the kernel should not force every system
> > > to sys_sync() on every suspend. The only question is how to best
> > > implement that.
> >
> > No, your observation was that "sync is slow". Your *solution* is "we
> > need to remove sync".
>
> Not only slow, but pointless too. The argument goes: "It is slow and
> pointless and so it may be dropped."
>
> Now, I can agree that it wasn't clearly demonstrated that the unconditional
> sys_sync() in the suspend code path was pointless, but it also has never
> been clearly shown why it is not pointless on systems that suspend and resume
> reliably.

I just gave you an example of why sync is needed: Suspend, pull out
USB drive when putting laptop in bag.

> [The argument that the user can pull removable storage devices out of the
> system while suspended doesn't hold any water to me, because the user can
> pull them out of the system when not suspended just as well and cause the
> same kind of damage to happen.]

How many times have you forgotten to pull a usb stick out of a
laptop before putting it away? I've done that several times in the
past few months, and so I've *personally* pulled the USB sticks out
of suspended machines. This is a *common occurrence* and it
currently works just fine, so changing sync behaviour is something
that will directly impact me *as a user*.

Not to mention hybrid suspend (i.e write suspend image to disk, then
suspend to RAM for fast resume, but if power is lost resume from
disk image) both images have exactly the same filesystem state in
them and that is an absolute requirement for a hybrid suspend...

> Moreover, question is if we really need to carry out the sync on *every*
> suspend even if it is not pointless overall. That shouldn't really be
> necessary if we suspend and resume often enough or if we resume only for
> a while and then suspend again. Maybe it should be rate limited somehow
> at least?

If you suspend and resume frequently, then the cost of the sync
shoul dbe negliable because the amount of data dirtied between
resume/suspend shoul dbe negliable. hence my questions about where
sync is spending too much time, and whether we've actually fixed
those problems or not. If sync speed on clean filesystems is a
problem then we need to fix sync, not work around it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-06 10:04:02

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

Hi!

> > Understand, however, there are systems which suspend/resume reliably
> > many times per second, making policy choice of having the kernel hard-code
> > a sys_sync() into the suspend path a bad idea.
>
> My current view on that is that whether or not to do a sync() before suspending
> ultimately is a policy decision and should belong to user space as such (modulo
> the autosleep situation when user space may not know when the suspend is going
> to happen).
>
> Moreover, user space is free to do as many sync()s before suspending as it
> wants to and the question here is whether or not the *kernel* should sync()
> in the suspend code path.

sync()s from userspace do not work, as userspace is still running.

sync() from kernel happens with tasks stopped. ... so it should really
get consistent image on disk.

And there are already interfaces that can s2ram without sync, just use
uswsusp ioctls, not the sysfs writes.

If you are doing multiple suspends per second, a) you are doing
something wrong and b) you'd better use ioctl anyway.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-06 10:07:40

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

Hi!

> conditions, or s2disk fails because a kernel upgrade was done before
> the s2disk and so can't be resumed. With your change, users lose all

s2disk should be able to survive kernel change on x86-64 for long time
now, and I have patches for x86, too.

[But yes, I think sync() should be kept.]
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-06 10:15:06

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Sat, Jul 4, 2015 at 9:03 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday, July 03, 2015 11:42:50 AM Dave Chinner wrote:
>> On Wed, Jul 01, 2015 at 11:07:29PM -0400, Len Brown wrote:
>> > >> The _vast_ majority of systems using Linux suspend today are under
>> > >> an Android user-space. Android has no assumption that that suspend to
>> > >> mem will necessarily stay suspended for a long time.
>> > >
>> > > Indeed, however your change was not android-specific, and it is not
>> > > "comfortable" on x86-style hardware and usage patterns.
>> >
>> > "comfortable on x86-style and usage patterns"?
>> > If you mean "traditional" instead of "comfortable",
>> > where "tradition" is based on 10-year old systems, then sure.
>>
>> Even if this were true(*) we don't break things that currently work
>> just because something different is "just around the corner". e.g.
>> if you shut the lid on your laptop and it suspends to RAM, you can
>> pull the USB drive out that you just copied stuff to and plug it
>> into another machine and find all the data you copied there is
>> present.
>>
>> Remove the sync() from the freeze code, and this isn't guaranteed to
>> work anymore. It is now dependent on userspace implementations for
>> this to work, and we know what userspace developers will choose in
>> this situation. i.e. fast and "works for me", not "safe for
>> everyone".
>>
>> (*) Which it clearly isn't true because, as this example shows, my
>> shiny new laptop still has exactly the same data integrity
>> requirements as the laptop I was using 10 years ago.
>>
>> Just because there are lots of Android or Chrome out there it
>> doesn't mean we can just ignore the requirements of everything
>> else...
>>
>> > > That said, as long as x86 will still try to safeguard my data during mem
>> > > sleep/resume as it does today, I have no strong feelings about
>> > > light/heavy-weight "mem" sleep being strictly a compile-time selectable
>> > > thing, or a more flexible runtime-selectable behavior.
>> >
>> > The observation here is that the kernel should not force every system
>> > to sys_sync() on every suspend. The only question is how to best
>> > implement that.
>>
>> No, your observation was that "sync is slow". Your *solution* is "we
>> need to remove sync".
>
> Not only slow, but pointless too. The argument goes: "It is slow and
> pointless and so it may be dropped."
>
> Now, I can agree that it wasn't clearly demonstrated that the unconditional
> sys_sync() in the suspend code path was pointless, but it also has never
> been clearly shown why it is not pointless on systems that suspend and resume
> reliably.
>
> [The argument that the user can pull removable storage devices out of the
> system while suspended doesn't hold any water to me, because the user can
> pull them out of the system when not suspended just as well and cause the
> same kind of damage to happen.]

I think common users do understand the difference between
suspend and running, don't they? So they may not complain
anyone if they plug off USB disk when system is running, but they
will complain the patch and the regression if their data is lost because
doing that during suspend. Not to mention, it is very common to
plug off USB disk after suspend laptop.

Also there are network based storage(iSCSI, NBD, ...), NFS/CIFS, and
is it safe for these cases?


--
Ming Lei

2015-07-06 11:06:49

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon 2015-07-06 01:28:20, Rafael J. Wysocki wrote:
> On Saturday, July 04, 2015 10:19:55 AM Alan Stern wrote:
> > On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:
> >
> > > The only argument against dropping sys_sync() from the suspend code path
> > > I've seen in this thread that I entirely agree with is that it may lead to
> > > regressions, because we've done it practically forever and it may hide latent
> > > bugs somewhere in block drivers etc. Dropping it, though, is the only way
> > > to see those bugs, if any, and if we want to ever fix them, we need to see
> > > them. That's why I think that it may be a good idea to allow people to
> > > drop it if they are willing to accept some extra risk (via the kernel
> > > command line, for example).
> >
> > I'd be perfectly happy to have the sync selectable at runtime, one way
> > or another. The three most reasonable options seem to be:
> >
> > kernel command line
> >
> > sysfs file
> >
> > sysctl setting
> >
> > The command line is less flexible (it can't be changed after booting).
> > Either of the other two would be fine with me.
>
> We'll probably use a sysfs file (possibly plus a Kconfig option to set the
> boot time default).

Android people can already do sync-less s2ram using existing
interface. IMO they should just do it.

In any case, sysfs file + Kconfig is an overkill. We already have too
many Kconfig options.

There's not a single Android phone supported by mainline
kernel. I'm sure they have bigger problems than Android setting
default sysfs values...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-06 11:11:11

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

Hi!

> > Moreover, question is if we really need to carry out the sync on *every*
> > suspend even if it is not pointless overall. That shouldn't really be
> > necessary if we suspend and resume often enough or if we resume only for
> > a while and then suspend again. Maybe it should be rate limited somehow
> > at least?
>
> If you suspend and resume frequently, then the cost of the sync
> shoul dbe negliable because the amount of data dirtied between
> resume/suspend shoul dbe negliable. hence my questions about where
> sync is spending too much time, and whether we've actually fixed
> those problems or not. If sync speed on clean filesystems is a
> problem then we need to fix sync, not work around it.

And yes, that's solution I'd really prefer over adding knobs to
suspend.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-06 13:26:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Monday, July 06, 2015 10:06:14 AM Dave Chinner wrote:
> On Sat, Jul 04, 2015 at 03:03:46AM +0200, Rafael J. Wysocki wrote:
> > On Friday, July 03, 2015 11:42:50 AM Dave Chinner wrote:
> > > On Wed, Jul 01, 2015 at 11:07:29PM -0400, Len Brown wrote:
> > > > >> The _vast_ majority of systems using Linux suspend today are under
> > > > >> an Android user-space. Android has no assumption that that suspend to
> > > > >> mem will necessarily stay suspended for a long time.
> > > > >
> > > > > Indeed, however your change was not android-specific, and it is not
> > > > > "comfortable" on x86-style hardware and usage patterns.
> > > >
> > > > "comfortable on x86-style and usage patterns"?
> > > > If you mean "traditional" instead of "comfortable",
> > > > where "tradition" is based on 10-year old systems, then sure.
> > >
> > > Even if this were true(*) we don't break things that currently work
> > > just because something different is "just around the corner". e.g.
> > > if you shut the lid on your laptop and it suspends to RAM, you can
> > > pull the USB drive out that you just copied stuff to and plug it
> > > into another machine and find all the data you copied there is
> > > present.
> > >
> > > Remove the sync() from the freeze code, and this isn't guaranteed to
> > > work anymore. It is now dependent on userspace implementations for
> > > this to work, and we know what userspace developers will choose in
> > > this situation. i.e. fast and "works for me", not "safe for
> > > everyone".
> > >
> > > (*) Which it clearly isn't true because, as this example shows, my
> > > shiny new laptop still has exactly the same data integrity
> > > requirements as the laptop I was using 10 years ago.
> > >
> > > Just because there are lots of Android or Chrome out there it
> > > doesn't mean we can just ignore the requirements of everything
> > > else...
> > >
> > > > > That said, as long as x86 will still try to safeguard my data during mem
> > > > > sleep/resume as it does today, I have no strong feelings about
> > > > > light/heavy-weight "mem" sleep being strictly a compile-time selectable
> > > > > thing, or a more flexible runtime-selectable behavior.
> > > >
> > > > The observation here is that the kernel should not force every system
> > > > to sys_sync() on every suspend. The only question is how to best
> > > > implement that.
> > >
> > > No, your observation was that "sync is slow". Your *solution* is "we
> > > need to remove sync".
> >
> > Not only slow, but pointless too. The argument goes: "It is slow and
> > pointless and so it may be dropped."
> >
> > Now, I can agree that it wasn't clearly demonstrated that the unconditional
> > sys_sync() in the suspend code path was pointless, but it also has never
> > been clearly shown why it is not pointless on systems that suspend and resume
> > reliably.
>
> I just gave you an example of why sync is needed: Suspend, pull out
> USB drive when putting laptop in bag.

Exactly the same thing can happen while not suspended. What does make suspend
special with respect to that?

> > [The argument that the user can pull removable storage devices out of the
> > system while suspended doesn't hold any water to me, because the user can
> > pull them out of the system when not suspended just as well and cause the
> > same kind of damage to happen.]
>
> How many times have you forgotten to pull a usb stick out of a
> laptop before putting it away? I've done that several times in the
> past few months, and so I've *personally* pulled the USB sticks out
> of suspended machines. This is a *common occurrence* and it
> currently works just fine, so changing sync behaviour is something
> that will directly impact me *as a user*.
>
> Not to mention hybrid suspend (i.e write suspend image to disk, then
> suspend to RAM for fast resume, but if power is lost resume from
> disk image) both images have exactly the same filesystem state in
> them and that is an absolute requirement for a hybrid suspend...

For that you need to sync. Let's not confuse what we're talking about.

The code path in question is suspend-only, no hibernation involved in any form.

> > Moreover, question is if we really need to carry out the sync on *every*
> > suspend even if it is not pointless overall. That shouldn't really be
> > necessary if we suspend and resume often enough or if we resume only for
> > a while and then suspend again. Maybe it should be rate limited somehow
> > at least?
>
> If you suspend and resume frequently, then the cost of the sync
> shoul dbe negliable because the amount of data dirtied between
> resume/suspend shoul dbe negliable. hence my questions about where
> sync is spending too much time, and whether we've actually fixed
> those problems or not. If sync speed on clean filesystems is a
> problem then we need to fix sync, not work around it.

Well, say you never suspend, but you also only shut down the system when you
need to replace the kernel on it. How often would you invoke global sync on
that system?

Thanks,
Rafael

2015-07-06 13:32:51

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Monday, July 06, 2015 01:06:45 PM Pavel Machek wrote:
> On Mon 2015-07-06 01:28:20, Rafael J. Wysocki wrote:
> > On Saturday, July 04, 2015 10:19:55 AM Alan Stern wrote:
> > > On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:
> > >
> > > > The only argument against dropping sys_sync() from the suspend code path
> > > > I've seen in this thread that I entirely agree with is that it may lead to
> > > > regressions, because we've done it practically forever and it may hide latent
> > > > bugs somewhere in block drivers etc. Dropping it, though, is the only way
> > > > to see those bugs, if any, and if we want to ever fix them, we need to see
> > > > them. That's why I think that it may be a good idea to allow people to
> > > > drop it if they are willing to accept some extra risk (via the kernel
> > > > command line, for example).
> > >
> > > I'd be perfectly happy to have the sync selectable at runtime, one way
> > > or another. The three most reasonable options seem to be:
> > >
> > > kernel command line
> > >
> > > sysfs file
> > >
> > > sysctl setting
> > >
> > > The command line is less flexible (it can't be changed after booting).
> > > Either of the other two would be fine with me.
> >
> > We'll probably use a sysfs file (possibly plus a Kconfig option to set the
> > boot time default).
>
> Android people can already do sync-less s2ram using existing
> interface. IMO they should just do it.
>
> In any case, sysfs file + Kconfig is an overkill. We already have too
> many Kconfig options.

I don't think we can reach a general agreement on what's the *right* approach
with respect to the sys_sync() in the suspend code path, so the only way out
of this situation I can see is to make it configurable.

> There's not a single Android phone supported by mainline
> kernel. I'm sure they have bigger problems than Android setting
> default sysfs values...

But perhaps we'd like to change that?

Thanks,
Rafael

2015-07-07 01:18:26

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon, Jul 06, 2015 at 03:52:25PM +0200, Rafael J. Wysocki wrote:
> On Monday, July 06, 2015 10:06:14 AM Dave Chinner wrote:
> > On Sat, Jul 04, 2015 at 03:03:46AM +0200, Rafael J. Wysocki wrote:
> > > > No, your observation was that "sync is slow". Your *solution* is "we
> > > > need to remove sync".
> > >
> > > Not only slow, but pointless too. The argument goes: "It is slow and
> > > pointless and so it may be dropped."
> > >
> > > Now, I can agree that it wasn't clearly demonstrated that the unconditional
> > > sys_sync() in the suspend code path was pointless, but it also has never
> > > been clearly shown why it is not pointless on systems that suspend and resume
> > > reliably.
> >
> > I just gave you an example of why sync is needed: Suspend, pull out
> > USB drive when putting laptop in bag.
>
> Exactly the same thing can happen while not suspended. What does make suspend
> special with respect to that?

Stop being obtuse. If you remember that you need to pull the USB
stick before you suspend, then you damn well remember to "safely
eject" the USB stick and that syncs, unmounts and flushes the drive
caches before telling you it is safe to pull the stick.

> > > Moreover, question is if we really need to carry out the sync on *every*
> > > suspend even if it is not pointless overall. That shouldn't really be
> > > necessary if we suspend and resume often enough or if we resume only for
> > > a while and then suspend again. Maybe it should be rate limited somehow
> > > at least?
> >
> > If you suspend and resume frequently, then the cost of the sync
> > shoul dbe negliable because the amount of data dirtied between
> > resume/suspend shoul dbe negliable. hence my questions about where
> > sync is spending too much time, and whether we've actually fixed
> > those problems or not. If sync speed on clean filesystems is a
> > problem then we need to fix sync, not work around it.
>
> Well, say you never suspend, but you also only shut down the system when you
> need to replace the kernel on it. How often would you invoke global sync on
> that system?

Never, because:

- the kernel does background writeback of dirty data so you
don't need to run sync while the system is running; and
- unmount during shutdown runs sync_filesystem() internally
(just like sys_sync does) to ensure the filesystem is
clean and no data is lost.

Seriously, stop being making ignorant arguments to justify removing
sys_sync(). *If* there's a problem sys_sync() we need to *fix it*,
not ignore it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-07 10:25:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Mon 2015-07-06 15:59:15, Rafael J. Wysocki wrote:
> On Monday, July 06, 2015 01:06:45 PM Pavel Machek wrote:
> > On Mon 2015-07-06 01:28:20, Rafael J. Wysocki wrote:
> > > On Saturday, July 04, 2015 10:19:55 AM Alan Stern wrote:
> > > > On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:
> > > >
> > > > > The only argument against dropping sys_sync() from the suspend code path
> > > > > I've seen in this thread that I entirely agree with is that it may lead to
> > > > > regressions, because we've done it practically forever and it may hide latent
> > > > > bugs somewhere in block drivers etc. Dropping it, though, is the only way
> > > > > to see those bugs, if any, and if we want to ever fix them, we need to see
> > > > > them. That's why I think that it may be a good idea to allow people to
> > > > > drop it if they are willing to accept some extra risk (via the kernel
> > > > > command line, for example).
> > > >
> > > > I'd be perfectly happy to have the sync selectable at runtime, one way
> > > > or another. The three most reasonable options seem to be:
> > > >
> > > > kernel command line
> > > >
> > > > sysfs file
> > > >
> > > > sysctl setting
> > > >
> > > > The command line is less flexible (it can't be changed after booting).
> > > > Either of the other two would be fine with me.
> > >
> > > We'll probably use a sysfs file (possibly plus a Kconfig option to set the
> > > boot time default).
> >
> > Android people can already do sync-less s2ram using existing
> > interface. IMO they should just do it.
> >
> > In any case, sysfs file + Kconfig is an overkill. We already have too
> > many Kconfig options.
>
> I don't think we can reach a general agreement on what's the *right* approach
> with respect to the sys_sync() in the suspend code path, so the only way out
> of this situation I can see is to make it configurable.

So first: not having general agreement does not mean we should
introduce Kconfig + sysfs file. Second: your proposal of "lets sync if
runtime was shorter than xxx" is over complex, but at least should not
need Kconfig support... Third: we have ioctl() based interface, and I
guess android should use that one; it already has "s2ram without sync"
method.

> > There's not a single Android phone supported by mainline
> > kernel. I'm sure they have bigger problems than Android setting
> > default sysfs values...
>
> But perhaps we'd like to change that?

We'd like to, but lets start with the real hard stuff (merging support
for Qualcomm chipsets) that is 1000000 LoC+, not with trivial tweaks
that would be 1-line change, but we pollute code with Kconfig+sysfs
making it 100..

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-07 11:48:27

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tuesday, July 07, 2015 11:17:56 AM Dave Chinner wrote:
> On Mon, Jul 06, 2015 at 03:52:25PM +0200, Rafael J. Wysocki wrote:
> > On Monday, July 06, 2015 10:06:14 AM Dave Chinner wrote:
> > > On Sat, Jul 04, 2015 at 03:03:46AM +0200, Rafael J. Wysocki wrote:
> > > > > No, your observation was that "sync is slow". Your *solution* is "we
> > > > > need to remove sync".
> > > >
> > > > Not only slow, but pointless too. The argument goes: "It is slow and
> > > > pointless and so it may be dropped."
> > > >
> > > > Now, I can agree that it wasn't clearly demonstrated that the unconditional
> > > > sys_sync() in the suspend code path was pointless, but it also has never
> > > > been clearly shown why it is not pointless on systems that suspend and resume
> > > > reliably.
> > >
> > > I just gave you an example of why sync is needed: Suspend, pull out
> > > USB drive when putting laptop in bag.
> >
> > Exactly the same thing can happen while not suspended. What does make suspend
> > special with respect to that?
>
> Stop being obtuse.

Stop being rude.

> If you remember that you need to pull the USB
> stick before you suspend, then you damn well remember to "safely
> eject" the USB stick and that syncs, unmounts and flushes the drive
> caches before telling you it is safe to pull the stick.

All of that only means "the kernel needs to protect users from consequences
of silly mistakes" which I'm not sure I agree with.

For example, on desktop systems I use user space syncs filesystems before
writing to /sys/power/state, so the additional sys_sync() in the kernel doesn't
seem to serve any purpose.

> > > > Moreover, question is if we really need to carry out the sync on *every*
> > > > suspend even if it is not pointless overall. That shouldn't really be
> > > > necessary if we suspend and resume often enough or if we resume only for
> > > > a while and then suspend again. Maybe it should be rate limited somehow
> > > > at least?
> > >
> > > If you suspend and resume frequently, then the cost of the sync
> > > shoul dbe negliable because the amount of data dirtied between
> > > resume/suspend shoul dbe negliable. hence my questions about where
> > > sync is spending too much time, and whether we've actually fixed
> > > those problems or not. If sync speed on clean filesystems is a
> > > problem then we need to fix sync, not work around it.
> >
> > Well, say you never suspend, but you also only shut down the system when you
> > need to replace the kernel on it. How often would you invoke global sync on
> > that system?
>
> Never, because:
>
> - the kernel does background writeback of dirty data so you
> don't need to run sync while the system is running; and

OK, and that surely happens between suspend-resume cycles too, so why does
the sync in the suspend code path make a difference, really (except for
avoiding the consequences of possible user mistakes)?

> - unmount during shutdown runs sync_filesystem() internally
> (just like sys_sync does) to ensure the filesystem is
> clean and no data is lost.

And that still will be the case regardless of whether or not suspend is used.

> Seriously, stop being making ignorant arguments to justify removing
> sys_sync(). *If* there's a problem sys_sync() we need to *fix it*,
> not ignore it.

I'm not advocating for dropping the sys_sync(). I'd just have applied the Len's
patch had I thought that had been the way to go.

I'm not agreeing with your argumentation about why the sync is needed, though
(except for the one point that removing it might uncover some latent bugs).

For now, I've seen two arguments: the one about the possible latent bug (that
I agree with) and the one about how users may be hurt if they unplug storage
devices holding unmounted filesystems while suspended (which may be addressed
by a sync in user space just fine). Are there any other reasons?

Rafael

2015-07-07 11:55:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tuesday, July 07, 2015 12:25:07 PM Pavel Machek wrote:
> On Mon 2015-07-06 15:59:15, Rafael J. Wysocki wrote:
> > On Monday, July 06, 2015 01:06:45 PM Pavel Machek wrote:
> > > On Mon 2015-07-06 01:28:20, Rafael J. Wysocki wrote:
> > > > On Saturday, July 04, 2015 10:19:55 AM Alan Stern wrote:
> > > > > On Sat, 4 Jul 2015, Rafael J. Wysocki wrote:
> > > > >
> > > > > > The only argument against dropping sys_sync() from the suspend code path
> > > > > > I've seen in this thread that I entirely agree with is that it may lead to
> > > > > > regressions, because we've done it practically forever and it may hide latent
> > > > > > bugs somewhere in block drivers etc. Dropping it, though, is the only way
> > > > > > to see those bugs, if any, and if we want to ever fix them, we need to see
> > > > > > them. That's why I think that it may be a good idea to allow people to
> > > > > > drop it if they are willing to accept some extra risk (via the kernel
> > > > > > command line, for example).
> > > > >
> > > > > I'd be perfectly happy to have the sync selectable at runtime, one way
> > > > > or another. The three most reasonable options seem to be:
> > > > >
> > > > > kernel command line
> > > > >
> > > > > sysfs file
> > > > >
> > > > > sysctl setting
> > > > >
> > > > > The command line is less flexible (it can't be changed after booting).
> > > > > Either of the other two would be fine with me.
> > > >
> > > > We'll probably use a sysfs file (possibly plus a Kconfig option to set the
> > > > boot time default).
> > >
> > > Android people can already do sync-less s2ram using existing
> > > interface. IMO they should just do it.
> > >
> > > In any case, sysfs file + Kconfig is an overkill. We already have too
> > > many Kconfig options.
> >
> > I don't think we can reach a general agreement on what's the *right* approach
> > with respect to the sys_sync() in the suspend code path, so the only way out
> > of this situation I can see is to make it configurable.
>
> So first: not having general agreement does not mean we should
> introduce Kconfig + sysfs file. Second: your proposal of "lets sync if
> runtime was shorter than xxx" is over complex, but at least should not
> need Kconfig support...

The Kconfig part is only about setting the default at compile time, so I'm
not sure why you have any problems with it.


> Third: we have ioctl() based interface, and I guess android should use that
> one; it already has "s2ram without sync" method.

And that is tied to hibernation and therefore useless to people who don't
build that in.


> > > There's not a single Android phone supported by mainline
> > > kernel. I'm sure they have bigger problems than Android setting
> > > default sysfs values...
> >
> > But perhaps we'd like to change that?
>
> We'd like to, but lets start with the real hard stuff (merging support
> for Qualcomm chipsets) that is 1000000 LoC+, not with trivial tweaks
> that would be 1-line change, but we pollute code with Kconfig+sysfs
> making it 100..

It is suboptimal, but this is the only way I can see allowing us to make
any forward progress here.

Evidently, some people are not happy with the status quo and they also should
be able to cover their use cases with the mainline kernel, shouldn't they?

Rafael

2015-07-07 13:18:05

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tue, 2015-07-07 at 14:14 +0200, Rafael J. Wysocki wrote:
> For example, on desktop systems I use user space syncs filesystems
> before
> writing to /sys/power/state, so the additional sys_sync() in the
> kernel doesn't
> seem to serve any purpose.

There is a race you cannot close in user space.

Regards
Oliver

2015-07-07 13:42:45

by Takashi Iwai

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

At Tue, 7 Jul 2015 11:17:56 +1000,
Dave Chinner wrote:
>
> > > > Moreover, question is if we really need to carry out the sync on *every*
> > > > suspend even if it is not pointless overall. That shouldn't really be
> > > > necessary if we suspend and resume often enough or if we resume only for
> > > > a while and then suspend again. Maybe it should be rate limited somehow
> > > > at least?
> > >
> > > If you suspend and resume frequently, then the cost of the sync
> > > shoul dbe negliable because the amount of data dirtied between
> > > resume/suspend shoul dbe negliable. hence my questions about where
> > > sync is spending too much time, and whether we've actually fixed
> > > those problems or not. If sync speed on clean filesystems is a
> > > problem then we need to fix sync, not work around it.
> >
> > Well, say you never suspend, but you also only shut down the system when you
> > need to replace the kernel on it. How often would you invoke global sync on
> > that system?
>
> Never, because:
>
> - the kernel does background writeback of dirty data so you
> don't need to run sync while the system is running; and
> - unmount during shutdown runs sync_filesystem() internally
> (just like sys_sync does) to ensure the filesystem is
> clean and no data is lost.
>
> Seriously, stop being making ignorant arguments to justify removing
> sys_sync(). *If* there's a problem sys_sync() we need to *fix it*,
> not ignore it.

This thread remembered me of an old problem I faced: suspend fails
while copying a large file. So I tried again with 4.2-rc1.

I put a slow USB stick with ext4, started copying a 4GB ISO file, and
triggered the suspend. It failed.

Then removed sys_sync(), and retested. It still failed.

The log shows like:

PM: Preparing system for sleep (freeze)
Freezing user space processes ... (elapsed 0.002 seconds) done.
Freezing remaining freezable tasks ...
Freezing of tasks failed after 20.003 seconds (0 tasks refusing to freeze, wq_busy=1):
Restarting kernel threads ... done.
Restarting tasks ... done.

So, removing sys_sync() will save a small amount of time, but it won't
help always...


Takashi

2015-07-07 14:05:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tuesday, July 07, 2015 03:16:48 PM Oliver Neukum wrote:
> On Tue, 2015-07-07 at 14:14 +0200, Rafael J. Wysocki wrote:
> > For example, on desktop systems I use user space syncs filesystems
> > before
> > writing to /sys/power/state, so the additional sys_sync() in the
> > kernel doesn't
> > seem to serve any purpose.
>
> There is a race you cannot close in user space.

Yes, there is, but I'm not sure how much of a help the sync in the kernel
provides here anyway.

Say this happens. There is a process writing to a file running in parallel
with the suspend process. Suspend starts and that process is frozen. The
sync is called and causes all of the outstanding data to be written back.
The user doesn't realize that the write is technically still in progress, so
he (or she) pulls the storage device out of the system, moves it to another
system, makes changes (say removes the file written to by the process above,
so the blocks previously occupied by that file are now used for some metadata)
and moves the storage back to the suspended system. The system is resumed
and the writing process continues writing possibly to the wrong blocks and
corrupts the filesystem.

Is this possible? If not, why not?

Rafael

2015-07-07 14:39:39

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tue, 2015-07-07 at 16:32 +0200, Rafael J. Wysocki wrote:
> On Tuesday, July 07, 2015 03:16:48 PM Oliver Neukum wrote:
> > On Tue, 2015-07-07 at 14:14 +0200, Rafael J. Wysocki wrote:
> > > For example, on desktop systems I use user space syncs filesystems
> > > before
> > > writing to /sys/power/state, so the additional sys_sync() in the
> > > kernel doesn't
> > > seem to serve any purpose.
> >
> > There is a race you cannot close in user space.
>
> Yes, there is, but I'm not sure how much of a help the sync in the kernel
> provides here anyway.
>
> Say this happens. There is a process writing to a file running in parallel
> with the suspend process. Suspend starts and that process is frozen. The
> sync is called and causes all of the outstanding data to be written back.
> The user doesn't realize that the write is technically still in progress, so

Well, in that case the user never got the feedback that the write is
finished. That is a race that always exists, like sending SIGKILL to a
running task.
What you describe is in principle unsolvable every time under
any circumstances.

> he (or she) pulls the storage device out of the system, moves it to another
> system, makes changes (say removes the file written to by the process above,
> so the blocks previously occupied by that file are now used for some metadata)
> and moves the storage back to the suspended system. The system is resumed
> and the writing process continues writing possibly to the wrong blocks and
> corrupts the filesystem.

That is a tough nut. But that's not a reason to make it worse.
I'd say there's no reason not to use a secondary interface to
suspend without syncing or to extend or introduce such an interface
if the API is deficient.

Regards
Oliver

2015-07-07 15:03:29

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tue, 7 Jul 2015, Oliver Neukum wrote:

> > he (or she) pulls the storage device out of the system, moves it to another
> > system, makes changes (say removes the file written to by the process above,
> > so the blocks previously occupied by that file are now used for some metadata)
> > and moves the storage back to the suspended system. The system is resumed
> > and the writing process continues writing possibly to the wrong blocks and
> > corrupts the filesystem.
>
> That is a tough nut. But that's not a reason to make it worse.
> I'd say there's no reason not to use a secondary interface to
> suspend without syncing or to extend or introduce such an interface
> if the API is deficient.

Indeed, the problem Rafael outlined always exists whether or not the
kernel does a sync. Even if no I/O is in progress when the system goes
to sleep, if the user moves a portable storage device with a mounted
filesystem to another computer and updates it before waking the system
up, corruption is highly likely.

In principle this could be solved by adding suspend/resume callbacks to
filesystems. For example, the resume callback could verify that the
superblock had not been changed since the suspend occurred. Or there
could be some other simple way of determining that the filesystem had
not been remounted and changed.

Either way, this is irrelevant to the question of whether the kernel
should issue a sync when suspending.

Alan Stern

2015-07-07 21:45:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tuesday, July 07, 2015 04:38:26 PM Oliver Neukum wrote:
> On Tue, 2015-07-07 at 16:32 +0200, Rafael J. Wysocki wrote:
> > On Tuesday, July 07, 2015 03:16:48 PM Oliver Neukum wrote:
> > > On Tue, 2015-07-07 at 14:14 +0200, Rafael J. Wysocki wrote:
> > > > For example, on desktop systems I use user space syncs filesystems
> > > > before
> > > > writing to /sys/power/state, so the additional sys_sync() in the
> > > > kernel doesn't
> > > > seem to serve any purpose.
> > >
> > > There is a race you cannot close in user space.
> >
> > Yes, there is, but I'm not sure how much of a help the sync in the kernel
> > provides here anyway.
> >
> > Say this happens. There is a process writing to a file running in parallel
> > with the suspend process. Suspend starts and that process is frozen. The
> > sync is called and causes all of the outstanding data to be written back.
> > The user doesn't realize that the write is technically still in progress, so
>
> Well, in that case the user never got the feedback that the write is
> finished. That is a race that always exists, like sending SIGKILL to a
> running task.
> What you describe is in principle unsolvable every time under
> any circumstances.
>
> > he (or she) pulls the storage device out of the system, moves it to another
> > system, makes changes (say removes the file written to by the process above,
> > so the blocks previously occupied by that file are now used for some metadata)
> > and moves the storage back to the suspended system. The system is resumed
> > and the writing process continues writing possibly to the wrong blocks and
> > corrupts the filesystem.
>
> That is a tough nut. But that's not a reason to make it worse.
> I'd say there's no reason not to use a secondary interface to
> suspend without syncing or to extend or introduce such an interface
> if the API is deficient.

Well, the point here is that the sync we have doesn't prevent all potentially
possible bad things from happening. It's a partial measure at best in that
respect.

Thanks,
Rafael

2015-07-07 21:53:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tuesday, July 07, 2015 11:03:20 AM Alan Stern wrote:
> On Tue, 7 Jul 2015, Oliver Neukum wrote:
>
> > > he (or she) pulls the storage device out of the system, moves it to another
> > > system, makes changes (say removes the file written to by the process above,
> > > so the blocks previously occupied by that file are now used for some metadata)
> > > and moves the storage back to the suspended system. The system is resumed
> > > and the writing process continues writing possibly to the wrong blocks and
> > > corrupts the filesystem.
> >
> > That is a tough nut. But that's not a reason to make it worse.
> > I'd say there's no reason not to use a secondary interface to
> > suspend without syncing or to extend or introduce such an interface
> > if the API is deficient.
>
> Indeed, the problem Rafael outlined always exists whether or not the
> kernel does a sync. Even if no I/O is in progress when the system goes
> to sleep, if the user moves a portable storage device with a mounted
> filesystem to another computer and updates it before waking the system
> up, corruption is highly likely.
>
> In principle this could be solved by adding suspend/resume callbacks to
> filesystems. For example, the resume callback could verify that the
> superblock had not been changed since the suspend occurred. Or there
> could be some other simple way of determining that the filesystem had
> not been remounted and changed.
>
> Either way, this is irrelevant to the question of whether the kernel
> should issue a sync when suspending.

well, that depends on what the purpose of the sync is supposed to be.

If it is there to prevent users from corrupting their filesystems as a result
of a mistake, it is insufficient. If it's there for other reasons, I'm wondering
what those reasons are (on systems that suspend and resume reliably, because the
original reason to put it in there was to reduce the damage from suspend/resume
crashes).

Thanks,
Rafael

2015-07-08 07:52:29

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wed, 2015-07-08 at 00:11 +0200, Rafael J. Wysocki wrote:
> On Tuesday, July 07, 2015 04:38:26 PM Oliver Neukum wrote:
>
> > That is a tough nut. But that's not a reason to make it worse.
> > I'd say there's no reason not to use a secondary interface to
> > suspend without syncing or to extend or introduce such an interface
> > if the API is deficient.
>
> Well, the point here is that the sync we have doesn't prevent all potentially
> possible bad things from happening. It's a partial measure at best in that
> respect.

Well, removed hardware doesn't work. That is a very basic limitation.
But can we guarantee that any returned syscall actually wrote to disk?
Yes, but it must be done in kernel space. So doing a sync has a true
benefit.
I don't see why you would want to generally remove it. What is wrong
with an interface allowing a selection there to those who don't care
about additional guarantees?

Regards
Oliver

2015-07-08 11:17:23

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Tue 2015-07-07 16:32:08, Rafael J. Wysocki wrote:
> On Tuesday, July 07, 2015 03:16:48 PM Oliver Neukum wrote:
> > On Tue, 2015-07-07 at 14:14 +0200, Rafael J. Wysocki wrote:
> > > For example, on desktop systems I use user space syncs filesystems
> > > before
> > > writing to /sys/power/state, so the additional sys_sync() in the
> > > kernel doesn't
> > > seem to serve any purpose.
> >
> > There is a race you cannot close in user space.
>
> Yes, there is, but I'm not sure how much of a help the sync in the kernel
> provides here anyway.
>
> Say this happens. There is a process writing to a file running in parallel
> with the suspend process. Suspend starts and that process is frozen. The
> sync is called and causes all of the outstanding data to be written back.
> The user doesn't realize that the write is technically still in progress, so
> he (or she) pulls the storage device out of the system, moves it to another
> system, makes changes (say removes the file written to by the process above,
> so the blocks previously occupied by that file are now used for some metadata)
> and moves the storage back to the suspended system. The system is resumed
> and the writing process continues writing possibly to the wrong blocks and
> corrupts the filesystem.
>
> Is this possible? If not, why not?

Of course it is possible; don't do it.

Some warnings, first.

* BIG FAT WARNING
*********************************************************
*
* If you touch anything on disk between suspend and resume...
* ...kiss your data goodbye.

Correct option is to set up machine so that USB unplug awakes it. Or
assume USB was always removed during suspend (we actually have an
option for that).

Still, if you don't see difference from pulling USB from running
machine, and from "off" machine.. I do. And most users I know do. And
some of my machines don't indicate whether they are "off" or "sleeping".

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-08 11:20:48

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wed 2015-07-08 00:20:05, Rafael J. Wysocki wrote:
> On Tuesday, July 07, 2015 11:03:20 AM Alan Stern wrote:
> > On Tue, 7 Jul 2015, Oliver Neukum wrote:
> >
> > > > he (or she) pulls the storage device out of the system, moves it to another
> > > > system, makes changes (say removes the file written to by the process above,
> > > > so the blocks previously occupied by that file are now used for some metadata)
> > > > and moves the storage back to the suspended system. The system is resumed
> > > > and the writing process continues writing possibly to the wrong blocks and
> > > > corrupts the filesystem.
> > >
> > > That is a tough nut. But that's not a reason to make it worse.
> > > I'd say there's no reason not to use a secondary interface to
> > > suspend without syncing or to extend or introduce such an interface
> > > if the API is deficient.
> >
> > Indeed, the problem Rafael outlined always exists whether or not the
> > kernel does a sync. Even if no I/O is in progress when the system goes
> > to sleep, if the user moves a portable storage device with a mounted
> > filesystem to another computer and updates it before waking the system
> > up, corruption is highly likely.
> >
> > In principle this could be solved by adding suspend/resume callbacks to
> > filesystems. For example, the resume callback could verify that the
> > superblock had not been changed since the suspend occurred. Or there
> > could be some other simple way of determining that the filesystem had
> > not been remounted and changed.
> >
> > Either way, this is irrelevant to the question of whether the kernel
> > should issue a sync when suspending.
>
> well, that depends on what the purpose of the sync is supposed to be.
>
> If it is there to prevent users from corrupting their filesystems as a result
> of a mistake, it is insufficient. If it's there for other reasons, I'm wondering
> what those reasons are (on systems that suspend and resume reliably, because the
> original reason to put it in there was to reduce the damage from suspend/resume
> crashes).

I put it there, and there were more reasons than "crashes" to put it
there.

1) crashes.

2) battery is quite likely to run out in suspended machine.

3) if someone pulls the stick and puts it in other machine, I wanted
consistent filesystem at least after journal replay.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-08 14:40:03

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wed, 8 Jul 2015, Pavel Machek wrote:

> > well, that depends on what the purpose of the sync is supposed to be.
> >
> > If it is there to prevent users from corrupting their filesystems as a result
> > of a mistake, it is insufficient. If it's there for other reasons, I'm wondering
> > what those reasons are (on systems that suspend and resume reliably, because the
> > original reason to put it in there was to reduce the damage from suspend/resume
> > crashes).
>
> I put it there, and there were more reasons than "crashes" to put it
> there.
>
> 1) crashes.
>
> 2) battery is quite likely to run out in suspended machine.
>
> 3) if someone pulls the stick and puts it in other machine, I wanted
> consistent filesystem at least after journal replay.

I was going to make the same points.

>From my point of view, whether to issue a sync is a tradeoff. I can't
remember any time in the last several years where lack of a sync would
have caused a problem for my computers, but the possibility still
exists.

So on one hand, issuing the sync can help prevent a low-probability
problem. On the other hand, issuing the sync takes a small amount of
time (negligible for my purposes but not negligible for Len and
others).

I prefer to pay a very small cost to prevent a low-probability problem.
Others may not want to pay, because to them the cost is larger or the
probability is lower.

_That_ is the justification for not eliminating the sync completely but
making it optional.

Alan Stern

2015-07-08 21:36:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wednesday, July 08, 2015 09:51:13 AM Oliver Neukum wrote:
> On Wed, 2015-07-08 at 00:11 +0200, Rafael J. Wysocki wrote:
> > On Tuesday, July 07, 2015 04:38:26 PM Oliver Neukum wrote:
> >
> > > That is a tough nut. But that's not a reason to make it worse.
> > > I'd say there's no reason not to use a secondary interface to
> > > suspend without syncing or to extend or introduce such an interface
> > > if the API is deficient.
> >
> > Well, the point here is that the sync we have doesn't prevent all potentially
> > possible bad things from happening. It's a partial measure at best in that
> > respect.
>
> Well, removed hardware doesn't work. That is a very basic limitation.
> But can we guarantee that any returned syscall actually wrote to disk?
> Yes, but it must be done in kernel space. So doing a sync has a true
> benefit.
> I don't see why you would want to generally remove it. What is wrong
> with an interface allowing a selection there to those who don't care
> about additional guarantees?

Nothing and I'm not discussing that (I've said that already at least once in
this thread).

What I'm questioning is the "why" of calling sys_sync() from the kernel.

If we had a good answer to why we do that to start with, the whole discussion
wouldn't be necessary.

So the answer I'm getting from this thread so far is something like "It is a
safety measure to prevent users from losing data if they pull removable storage
out of the system while suspended".

Your point about the returned syscall guaranees is a good one, but still.

Thanks,
Rafael

2015-07-08 21:38:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Wednesday, July 08, 2015 10:40:00 AM Alan Stern wrote:
> On Wed, 8 Jul 2015, Pavel Machek wrote:
>
> > > well, that depends on what the purpose of the sync is supposed to be.
> > >
> > > If it is there to prevent users from corrupting their filesystems as a result
> > > of a mistake, it is insufficient. If it's there for other reasons, I'm wondering
> > > what those reasons are (on systems that suspend and resume reliably, because the
> > > original reason to put it in there was to reduce the damage from suspend/resume
> > > crashes).
> >
> > I put it there, and there were more reasons than "crashes" to put it
> > there.
> >
> > 1) crashes.
> >
> > 2) battery is quite likely to run out in suspended machine.
> >
> > 3) if someone pulls the stick and puts it in other machine, I wanted
> > consistent filesystem at least after journal replay.
>
> I was going to make the same points.
>
> From my point of view, whether to issue a sync is a tradeoff. I can't
> remember any time in the last several years where lack of a sync would
> have caused a problem for my computers, but the possibility still
> exists.
>
> So on one hand, issuing the sync can help prevent a low-probability
> problem. On the other hand, issuing the sync takes a small amount of
> time (negligible for my purposes but not negligible for Len and
> others).
>
> I prefer to pay a very small cost to prevent a low-probability problem.
> Others may not want to pay, because to them the cost is larger or the
> probability is lower.
>
> _That_ is the justification for not eliminating the sync completely but
> making it optional.

Agreed.

Thanks,
Rafael

2015-07-09 07:34:04

by Oliver Neukum

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thu, 2015-07-09 at 00:03 +0200, Rafael J. Wysocki wrote:

> Nothing and I'm not discussing that (I've said that already at least once in
> this thread).
>
> What I'm questioning is the "why" of calling sys_sync() from the kernel.

That's strictly speaking two questions

1. Why do it in the kernel

That is easy. It closes the window of a race condition.

2. Why do it at all

In essence because the system becomes inactive. For example we say that
data hits the disk after 30s maximum. We cannot meet such a limit unless
we sync.
There are additional issues, such as a system appearing inactive during
suspension and frankly the far greater likelihood of a crash.

Regards
Oliver

2015-07-09 22:56:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Thursday, July 09, 2015 09:32:51 AM Oliver Neukum wrote:
> On Thu, 2015-07-09 at 00:03 +0200, Rafael J. Wysocki wrote:
>
> > Nothing and I'm not discussing that (I've said that already at least once in
> > this thread).
> >
> > What I'm questioning is the "why" of calling sys_sync() from the kernel.
>
> That's strictly speaking two questions
>
> 1. Why do it in the kernel
>
> That is easy. It closes the window of a race condition.
>
> 2. Why do it at all
>
> In essence because the system becomes inactive. For example we say that
> data hits the disk after 30s maximum. We cannot meet such a limit unless
> we sync.

Absolute deadlines are not guaranteed to be met at all in general when
system suspend is used, at least from the user space perspective, so the
above is quite a bit of an overstretch.

> There are additional issues, such as a system appearing inactive during
> suspension and frankly the far greater likelihood of a crash.

I prefer the Alan's point, to be honest: it's a tradeoff between some extra
safety and some extra suspend overhead. Unfortunately, though, we don't have
any numbers allowing us to estimate how much extra safety it provides and so
to justify it quantitatively, so to speak.

Thanks,
Rafael

2015-08-04 19:54:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 1/1] suspend: delete sys_sync()

On Fri 2015-07-10 01:22:55, Rafael J. Wysocki wrote:
> On Thursday, July 09, 2015 09:32:51 AM Oliver Neukum wrote:
> > On Thu, 2015-07-09 at 00:03 +0200, Rafael J. Wysocki wrote:
> >
> > > Nothing and I'm not discussing that (I've said that already at least once in
> > > this thread).
> > >
> > > What I'm questioning is the "why" of calling sys_sync() from the kernel.
> >
> > That's strictly speaking two questions
> >
> > 1. Why do it in the kernel
> >
> > That is easy. It closes the window of a race condition.
> >
> > 2. Why do it at all
> >
> > In essence because the system becomes inactive. For example we say that
> > data hits the disk after 30s maximum. We cannot meet such a limit unless
> > we sync.
>
> Absolute deadlines are not guaranteed to be met at all in general when
> system suspend is used, at least from the user space perspective, so the
> above is quite a bit of an overstretch.

Well, this particular deadline _was_ guaranteed before Len's patch,
and it is useful one, too:

"if notebook is not on, it is ok to pull the USB stick".

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html