Hi Linus,
Please pull power management updates for 2.6.33 from:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6.git for-linus
They include:
* Asynchronous suspend and resume infrastructure. For now, PCI, ACPI and
serio devices are enabled to suspend and resume asynchronously.
* Fixes for the runtime PM framework.
* Hibernate cleanups from Nigel and Jiri Slaby.
* Freezer optimisation from Tejun.
 Documentation/power/runtime_pm.txt |  12 +-
 drivers/acpi/glue.c                |   3 +
 drivers/acpi/scan.c                |   1 +
 drivers/base/core.c                |   4 +
 drivers/base/power/Makefile        |   2 +-
 drivers/base/power/common.c        | 283 +++++++++++++++
 drivers/base/power/main.c          | 677 +++++++++++++++++++++++++++++++++---
 drivers/base/power/power.h         |  42 ++-
 drivers/base/power/runtime.c       |  27 +-
 drivers/base/power/sysfs.c         |  47 +++
 drivers/input/serio/serio.c        |   1 +
 drivers/pci/pci.c                  |   1 +
 drivers/pci/pcie/portdrv_core.c    |   1 +
 include/linux/device.h             |  11 +
 include/linux/pm.h                 |  21 +-
 include/linux/pm_link.h            |  30 ++
 include/linux/pm_runtime.h         |  12 +
 include/linux/resume-trace.h       |   7 +
 kernel/power/Kconfig               |  14 +
 kernel/power/Makefile              |   2 +-
 kernel/power/hibernate.c           |  26 ++
 kernel/power/main.c                |  32 ++-
 kernel/power/process.c             |  14 +-
 kernel/power/swap.c                | 107 ++++++-
 kernel/power/swsusp.c              | 188 ----------
 25 files changed, 1281 insertions(+), 284 deletions(-)
---------------
Alan Stern (2):
      PM / Runtime: Export the PM runtime workqueue
      PM / Runtime: Use deferred_resume flag in pm_request_resume

Jaswinder Singh Rajput (1):
      PM: Fix kernel-doc notation

Jiri Slaby (1):
      PM / Hibernate: Swap, use KERN_CONT

Nigel Cunningham (2):
      PM / Hibernate: Move swap functions to kernel/power/swap.c.
      PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c

Rafael J. Wysocki (15):
      PM: Introduce PM links framework
      PM: Asynchronous resume of devices
      PM: Asynchronous suspend of devices
      PM: Allow PCI devices to suspend/resume asynchronously
      PM: Allow ACPI devices to suspend/resume asynchronously
      PM: Add a switch for disabling/enabling asynchronous suspend/resume
      PM: Measure device suspend and resume times
      PM: Add facility for advanced testing of async suspend/resume
      PM: Measure suspend and resume times for individual devices
      PM: Allow serio input devices to suspend/resume asynchronously
      PM / Runtime: Fix lockdep warning in __pm_runtime_set_status()
      PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend()
      PM / Runtime: Make documentation of runtime_idle() agree with the code
      PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status()
      PM: Add flag for devices capable of generating run-time wake-up events

Stephen Rothwell (1):
      PM / Suspend: Using TASK_ macros requires sched.h

Tejun Heo (1):
      PM / freezer: Don't get over-anxious while waiting
On Sat, 5 Dec 2009, Rafael J. Wysocki wrote:
>
> * Asynchronous suspend and resume infrastructure. For now, PCI, ACPI and
> serio devices are enabled to suspend and resume asynchronously.
I really think this is totally and utterly broken. Both from an
implementation standpoint _and_ from a pure conceptual one.
Why isn't the suspend/resume async stuff just done like the init async
stuff?
We don't need that crazy per-device flag for initialization, neither do we
need drivers "enabling" any async code at all. They just do some things
asynchronously, and then at the end of init time we wait for all those
async events.
So why does suspend/resume need to do crazy sh*t instead?
It all looks terminally broken: you force async suspend for all PCI
drivers, even when it makes no sense. Rather than let the drivers that
already know how to do things like disk spinup asynchronously just do it
that way.
The "timing" routines are also just crazy. What is the excuse for
dpm_show_time() taking both start and stop times, since there is never any
valid situation when it shouldn't have that do_gettimeofday(&stop) just
before it? IOW - the whole end-time thing should be _inside_
dpm_show_time, rather than being done by the caller. No?
In other words - I'm not pulling this crazy thing. You'd better explain
why it was done that way, when we already have done the same things better
before in different ways.
Linus
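A minimal userspace sketch of the interface change asked for above, in which
the reporting helper samples the stop time itself and the caller passes only
the start time. This is not the kernel's dpm_show_time(): show_time(),
elapsed_usecs() and the message format are hypothetical names chosen for
illustration, and clock_gettime() stands in for do_gettimeofday().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Microseconds elapsed between two timestamps (hypothetical helper). */
static long long elapsed_usecs(const struct timespec *start,
			       const struct timespec *stop)
{
	return (stop->tv_sec - start->tv_sec) * 1000000LL +
	       (stop->tv_nsec - start->tv_nsec) / 1000;
}

/*
 * Report how long a phase took. The caller supplies only the start time;
 * the stop time is sampled here, so no caller can forget to take it or
 * take it at the wrong moment.
 */
static long long show_time(const struct timespec *start, const char *info)
{
	struct timespec stop;
	long long usecs;

	assert(start != NULL && info != NULL);
	clock_gettime(CLOCK_MONOTONIC, &stop);
	usecs = elapsed_usecs(start, &stop);
	printf("%s took %lld.%03lld msecs\n", info, usecs / 1000, usecs % 1000);
	return usecs;
}
```

A caller records the start time with clock_gettime(CLOCK_MONOTONIC, &start),
runs the phase, and then calls show_time(&start, "resume").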
On Sat, 5 Dec 2009, Linus Torvalds wrote:
>
> In other words - I'm not pulling this crazy thing. You'd better explain
> why it was done that way, when we already have done the same things better
> before in different ways.
I get the feeling that all the crazy infrastructure was due to worrying
about the suspend/resume topology.
But the reason we don't worry about that during init is that it doesn't
really tend to matter. Most slow operations are the things that aren't
topology-aware, ie things like spinning up/down disks etc, that really
could be done as a separate phase instead.
For example, is there really any reason why resume doesn't look exactly
like the init sequence? Drivers that do slow things can start async work
to do them, and then at the end of the resume sequence we just do a "wait
for all the async work", exactly like we do for the current init
sequences.
And yes, for the suspend sequence we obviously need to do any async work
(and wait for it) before we actually shut down the controllers, but that
would be _way_ more natural to do by just introducing a "pre-suspend" hook
that walks the device tree and does any async stuff. And then just wait
for the async stuff to finish before doing the suspend, and perhaps again
before doing late_suspend (maybe somebody wants to do async stuff at the
second stage too).
Then, because we need a way to undo things if things go wrong in the
middle (and because it's also nice to be symmetric), we'd probably want to
introduce that kind of "post_resume()" callback that allows you to have a
separate async wakeup thing for resume time too.
What are actually the expensive/slow things during suspend/resume? Am I
wrong when I say it's things like disk spinup/spindown (and USB discovery,
which needs USB-level support anyway, since it can involve devices that we
didn't even know about before discovery started).
Linus
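The init-style pattern described above (drivers start their slow work
asynchronously, and the core waits once for all of it at the end of the phase)
can be sketched in userspace roughly as follows. This uses POSIX threads
rather than any kernel facility; async_start(), async_wait_all() and
spin_up_disk() are hypothetical names, and the fixed-size table is an
assumption made to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_ASYNC 16

typedef void (*async_fn)(void *data);

struct async_item {
	async_fn fn;
	void *data;
	pthread_t thread;
};

static struct async_item pending[MAX_ASYNC];
static size_t n_pending;

static void *async_trampoline(void *arg)
{
	struct async_item *item = arg;

	item->fn(item->data);
	return NULL;
}

/* Called by a "driver": start slow work and return at once (0 on success). */
static int async_start(async_fn fn, void *data)
{
	struct async_item *item;

	assert(fn != NULL);
	if (n_pending == MAX_ASYNC)
		return -1;
	item = &pending[n_pending];
	item->fn = fn;
	item->data = data;
	if (pthread_create(&item->thread, NULL, async_trampoline, item) != 0)
		return -1;
	n_pending++;
	return 0;
}

/* Called once by the "core" at the end of a phase; returns how many it joined. */
static size_t async_wait_all(void)
{
	size_t i, waited = n_pending;

	for (i = 0; i < n_pending; i++)
		pthread_join(pending[i].thread, NULL);
	n_pending = 0;
	return waited;
}

/* Stand-in for a slow operation such as a disk spinup. */
static void spin_up_disk(void *data)
{
	int *ready = data;

	*ready = 1;
}
```

In this model no per-device flag is needed: a driver that has nothing slow to
do simply never calls async_start(), and the single async_wait_all() at the
end of the phase covers every driver that did.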
On Saturday 05 December 2009, Linus Torvalds wrote:
>
> On Sat, 5 Dec 2009, Linus Torvalds wrote:
> >
> > In other words - I'm not pulling this crazy thing. You'd better explain
> > why it was done that way, when we already have done the same things better
> > before in different ways.
OK, I'll send another pull request without these patches if the rest of the
changes is fine with you (they are more important than the async stuff to me).
> I get the feeling that all the crazy infrastructure was due to worrying
> about the suspend/resume topology.
Yes, that's the main reason.
> But the reason we don't worry about that during init is that it doesn't
> really tend to matter. Most slow operations are the things that aren't
> topology-aware, ie things like spinning up/down disks etc, that really
> could be done as a separate phase instead.
It was based on the observation that in many cases the current drivers' suspend
and resume callbacks can be run in parallel with the other drivers' callbacks
without any changes to the drivers (and without introducing another phase of
suspend for that matter), because there are no dependencies between them.
The approach you're suggesting would require modifying individual drivers which
I just wanted to avoid. If you don't like that, we'll have to take the longer
route, although I'm afraid that will take lots of time and we won't be able to
exploit the entire possible parallelism this way.
> For example, is there really any reason why resume doesn't look exactly
> like the init sequence? Drivers that do slow things can start async work
> to do them, and then at the end of the resume sequence we just do a "wait
> for all the async work", exactly like we do for the current init
> sequences.
During suspend we actually know what the dependences between the devices
are and we can use that information to do more things in parallel. For
instance, in the majority of cases (I'm yet to find a counter example), the
entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
other.
So, the point is not to look for "async stuff" in a driver's suspend/resume
callbacks, but to execute the whole suspend/resume callbacks in parallel,
if possible.
> And yes, for the suspend sequence we obviously need to do any async work
> (and wait for it) before we actually shut down the controllers, but that
> would be _way_ more natural to do by just introducing a "pre-suspend" hook
> that walks the device tree and does any async stuff. And then just wait
> for the async stuff to finish before doing the suspend, and perhaps again
> before doing late_suspend (maybe somebody wants to do async stuff at the
> second stage too).
>
> Then, because we need a way to undo things if things go wrong in the
> middle (and because it's also nice to be symmetric), we'd probably want to
> introduce that kind of "post_resume()" callback that allows you to have a
> separate async wakeup thing for resume time too.
Yes, we can do that, but I'm afraid that the majority of drivers won't use the
new hooks (people generally seem to be too reluctant to modify their
suspend/resume callbacks, so as not to break things).
Also, for an individual driver it really is difficult to separate the "async
stuff" from the stuff which is not async, because everything that can be done
in parallel with the other drivers' suspend callbacks is potentially async, as
long as there are no dependences between the devices in question (like
parent-child dependences, or PCI-shadow ACPI dependences). And it's
generally worth doing that if a driver's suspend or resume callback calls
msleep() for whatever the reason.
> What are actually the expensive/slow things during suspend/resume? Am I
> wrong when I say it's things like disk spinup/spindown (and USB discovery,
> which needs USB-level support anyway, since it can involve devices that we
> didn't even know about before discovery started).
Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
serio devices do that too and there are surprisingly many drivers that wait
(using msleep()) during suspend and resume. Apart from this, every PCI device
going from D0 to D3 during suspend and from D3 to D0 during resume requires
us to sleep for 10 ms (the sleeping is done by the PCI core, so the drivers
don't even realize it's there).
Thanks,
Rafael
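A rough way to see what running whole callbacks in parallel could buy is to
compare the sum of all callback times with the longest parent/child chain.
The sketch below is a simplified model, not the PM core: struct dev,
sequential_ms() and parallel_ms() are hypothetical, the only dependency it
knows about is "a parent suspends after all of its children", it assumes
unlimited workers, and it assumes parents are listed before their children.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 64

/* Hypothetical model of one device. */
struct dev {
	int parent;		/* index of the parent device, -1 for a root */
	unsigned cost_ms;	/* time spent in the suspend callback, in ms */
};

/* One callback after another: the costs simply add up. */
static unsigned sequential_ms(const struct dev *devs, size_t n)
{
	unsigned total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += devs[i].cost_ms;
	return total;
}

/*
 * All callbacks run in parallel, except that a parent starts only after
 * every one of its children has finished. The result is the length of the
 * longest chain from a leaf up to a root.
 */
static unsigned parallel_ms(const struct dev *devs, size_t n)
{
	unsigned finish[MAX_DEVS] = { 0 };
	unsigned total = 0;
	size_t i;

	assert(n <= MAX_DEVS);
	/* Walk backwards so that children are handled before their parents. */
	for (i = n; i-- > 0; ) {
		int p = devs[i].parent;

		assert(p < (int)i);	/* parents must precede children */
		/* finish[i] already holds the latest finish time of i's children. */
		finish[i] += devs[i].cost_ms;
		if (p < 0) {
			if (finish[i] > total)
				total = finish[i];
		} else if (finish[i] > finish[p]) {
			finish[p] = finish[i];
		}
	}
	return total;
}
```

For one bridge with four leaf devices, each sleeping the 10 ms mentioned
above, this model gives 50 ms sequentially and 20 ms in parallel. It says
nothing about dependencies that are not visible in the tree, which is the
point of contention in this thread.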
On Saturday 05 December 2009, Linus Torvalds wrote:
>
> On Sat, 5 Dec 2009, Rafael J. Wysocki wrote:
> >
> > * Asynchronous suspend and resume infrastructure. For now, PCI, ACPI and
> > serio devices are enabled to suspend and resume asynchronously.
>
> I really think this is totally and utterly broken. Both from an
> implementation standpoint _and_ from a pure conceptual one.
>
> Why isn't the suspend/resume async stuff just done like the init async
> stuff?
>
> We don't need that crazy per-device flag for initialization, neither do we
> need drivers "enabling" any async code at all. They just do some things
> asynchronously, and then at the end of init time we wait for all those
> async events.
>
> So why does suspend/resume need to do crazy sh*t instead?
Because it can run entire suspend and resume callbacks in parallel and not
just some stuff inside of them. The flag is to tell it which callbacks not to
execute in parallel, but it essentially should not be necessary as soon as we
know all dependences between devices (ie. the ones that are not encoded in
the structure of the device tree).
The problem is there are dependences between devices we're not aware of,
which are not documented anywhere and not reflected by the device tree
structure and we need some time to figure them out.
> It all looks terminally broken: you force async suspend for all PCI
> drivers, even when it makes no sense.
I'm not exactly sure what you're referring to. The async suspend is not
forced, it just tells the PM core that it can execute PCI suspend/resume
callbacks in parallel as long as the devices in question don't depend on each
other.
> Rather than let the drivers that already know how to do things like disk
> spinup asynchronously just do it that way.
This isn't just about disk spin up and things like that. If we can run entire
suspend/resume callbacks in parallel, why not do that?
> The "timing" routines are also just crazy. What is the excuse for
> dpm_show_time() taking both start and stop times,
This is a mistake, although really easily fixable in a followup patch.
> since there is never any valid situation when it shouldn't have that
> do_gettimeofday(&stop) just before it? IOW - the whole end-time thing should
> be _inside_ dpm_show_time, rather than being done by the caller. No?
Yes, you're right.
> In other words - I'm not pulling this crazy thing. You'd better explain
> why it was done that way, when we already have done the same things better
> before in different ways.
I'm not sure we have, but whatever.
As I said before, if the rest of the changes in my pull request are fine with
you, I'll just drop the async changes, although I'm not really convinced
they're so bad. They've been discussed a lot and they've been in linux-next
for a few months without any objection from anyone.
Thanks,
Rafael
On Sun, 6 Dec 2009 00:55:36 +0100
"Rafael J. Wysocki" wrote:
>
> Disk spinup/spindown takes time, but also some ACPI devices resume
> slowly, serio devices do that too and there are surprisingly many
> drivers that wait (using msleep()) during suspend and resume. Apart
> from this, every PCI device going from D0 to D3 during suspend and
> from D3 to D0 during resume requires us to sleep for 10 ms (the
> sleeping is done by the PCI core, so the drivers don't even realize
> it's there).
maybe a good step is to make a scripts/bootgraph.pl equivalent for
suspend/resume (or make a debug mode that outputs in a compatible format
so that the script can be used as is.. I don't mind either way, and
consider this my offer to help with such a script as long as there's
sufficient logging in dmesg ;-)
that way we can SEE which ones are an issue.... and by how much.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> The approach you're suggesting would require modifying individual drivers which
> I just wanted to avoid.
In the init path, we had the reverse worry - not wanting to make
everything (where "everything" can be some subsystem like just the set of
PCI drivers, of course - not really "everything" in an absolute sense)
async, and then having to try to work out with the random driver that
couldn't handle it.
And there were _lots_ of drivers that couldn't handle it, because they
knew they got woken up serially. The ATA layer needed to know about
asynchronous things, because sometimes those independent devices aren't so
independent at all. Which is why I don't think your approach is safe.
Just to take an example of the whole "independent devices are not
necessarily independent" thing - things like multi-port PCMCIA controllers
generally show up as multiple PCI devices. But they are _not_ independent,
and they actually share some registers. Resuming them asynchronously might
well be ok, but maybe it's not. Who knows?
In contrast, a device driver can generally know that certain _parts_ of
the initialization are safe. As an example of that, I think the libata
layer does all the port enumeration synchronously, but then once the ports
have been identified, it does the rest async.
That's the kind of decision we can sanely make when we do the async part
as a "drivers may choose to do certain parts asynchronously". Doing it at
a higher level sounds like a problem to me.
> If you don't like that, we'll have to take the longer route, although
> I'm afraid that will take lots of time and we won't be able to exploit
> the entire possible parallelism this way.
Sure. But I'd rather do the safe thing. Especially since there are likely
just a few cases that really take a long time.
> During suspend we actually know what the dependences between the devices
> are and we can use that information to do more things in parallel. For
> instance, in the majority of cases (I'm yet to find a counter example), the
> entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
> other.
See above. That's simply not at all guaranteed to be true.
And when it isn't true (ie different PCI leaf devices end up having subtle
dependencies), now you need to start doing hacky things.
I'd much rather have the individual drivers say "I can do this part in
parallel", and not force it on them. Because it is definitely _not_
guaranteed that PCI devices can do parallel resume and suspend.
> Yes, we can do that, but I'm afraid that the majority of drivers won't use the
> new hooks (people generally seem to be too reluctant to modify their
> suspend/resume callbacks, so as not to break things).
See above - I don't think this is a "majority" issue. I think it's a
"let's figure out the problem spots, and fix _those_". IOW, get 2% of the
coverage, and get 95% of the advantage.
> Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
We actually saw that when we did async init. And it was horrible. There's
nothing that says that the ACPI stuff necessarily even _can_ run in
parallel.
I think we currently only do the ACPI battery ops asynchronously.
Linus
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > It all looks terminally broken: you force async suspend for all PCI
> > drivers, even when it makes no sense.
>
> I'm not exactly sure what you're referring to. The async suspend is not
> forced, it just tells the PM core that it can execute PCI suspend/resume
> callbacks in parallel as long as the devices in question don't depend on each
> other.
That's exactly what I mean by forcing async suspend/resume.
You don't know the ordering rules for PCI devices. Multi-function PCI
devices commonly share registers - they're on the same chip, after all.
And even when the _hardware_ is totally independent, we often have
discovery rules and want to initialize in order because different drivers
will do things like unregister entirely on suspend, and then re-register
on resume.
Imagine the mess when two ethernet devices randomly end up coming up with
different names (eth0/eth1) depending on subtle timing issues.
THAT is why we do things in order. Asynchronous programming is _hard_.
Just deciding that "all PCI devices can always be resumed and suspended
asynchronously" is a much MUCH bigger decision than you seem to have
even realized.
Linus
On Sunday 06 December 2009, Linus Torvalds wrote:
>
> On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > It all looks terminally broken: you force async suspend for all PCI
> > > drivers, even when it makes no sense.
> >
> > I'm not exactly sure what you're referring to. The async suspend is not
> > forced, it just tells the PM core that it can execute PCI suspend/resume
> > callbacks in parallel as long as the devices in question don't depend on each
> > other.
>
> That's exactly what I mean by forcing async suspend/resume.
>
> You don't know the ordering rules for PCI devices.
That's true at the moment, but in principle we can abstract all dependences
between devices as PM links that will enforce specific suspend/resume ordering
between them.
> Multi-function PCI devices commonly share registers - they're on the same
> chip, after all. And even when the _hardware_ is totally independent, we
> often have discovery rules and want to initialize in order because different
> drivers will do things like unregister entirely on suspend, and then
> re-register on resume.
Do any of the PCI drivers do that?
> Imagine the mess when two ethernet devices randomly end up coming up with
> different names (eth0/eth1) depending on subtle timing issues.
>
> THAT is why we do things in order. Asynchronous programming is _hard_.
> Just deciding that "all PCI devices can always be resumed and suspended
> asynchronously" is a much MUCH bigger decision than you seem to have
> even realized.
I have considered that, but at the end of the day I haven't seen a single
problem with that showing up in testing during the last two or three months.
Given the time the patchset spent in linux-next I'd expect someone to report
a problem with it - if there's a problem. But no one has said a word, so I'm
not that worried, although I'm still a bit cautious.
That's why there is the switch for disabling the feature altogether. It is
enabled by default, which perhaps is not the right setting, but I don't really
see the reason why not to turn it on where it doesn't break things (like on
all of my test boxes at the moment).
Still, as I said before, the other changes in my pull request are more
important to me than the async patchset, so please let me know if they are fine
with you.
Thanks,
Rafael
On Sunday 06 December 2009, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 00:55:36 +0100
> "Rafael J. Wysocki" wrote:
>
> >
> > Disk spinup/spindown takes time, but also some ACPI devices resume
> > slowly, serio devices do that too and there are surprisingly many
> > drivers that wait (using msleep()) during suspend and resume. Apart
> > from this, every PCI device going from D0 to D3 during suspend and
> > from D3 to D0 during resume requires us to sleep for 10 ms (the
> > sleeping is done by the PCI core, so the drivers don't even realize
> > it's there).
>
> maybe a good step is to make a scripts/bootgraph.pl equivalent for
> suspend/resume (or make a debug mode that outputs in a compatible format
> so that the script can be used as is.. I don't mind either way, and
> consider this my offer to help with such a script as long as there's
> sufficient logging in dmesg ;-)
OK, so what kind of logging is needed?
> that way we can SEE which ones are an issue.... and by how much.
Well, why not.
Thanks,
Rafael
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > Multi-function PCI devices commonly share registers - they're on the same
> > chip, after all. And even when the _hardware_ is totally independent, we
> > often have discovery rules and want to initialize in order because different
> > drivers will do things like unregister entirely on suspend, and then
> > re-register on resume.
>
> Do any of the PCI drivers do that?
It used to be common at least for ethernet - there were a number of
drivers that essentially did the same thing on suspend/resume and on
module unload/reload.
The point is, I don't know. And neither do you. It's much safer to just do
drivers one by one, and not touch drivers that people don't test.
Linus
On Sunday 06 December 2009, Linus Torvalds wrote:
>
> On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> >
> > The approach you're suggesting would require modifying individual drivers which
> > I just wanted to avoid.
>
> In the init path, we had the reverse worry - not wanting to make
> everything (where "everything" can be some subsystem like just the set of
> PCI drivers, of course - not really "everything" in an absolute sense)
> async, and then having to try to work out with the random driver that
> couldn't handle it.
>
> And there were _lots_ of drivers that couldn't handle it, because they
> knew they got woken up serially. The ATA layer needed to know about
> asynchronous things, because sometimes those independent devices aren't so
> independent at all. Which is why I don't think your approach is safe.
While the current settings are probably unsafe (like enabling PCI devices
to be suspended asynchronously by default if there are not any direct
dependences between them), there are provisions to make everything safe, if
we have enough information (which also is needed to put the required logic into
the drivers). The device tree represents a good deal of the dependences
between devices and the other dependences may be represented as PM links
enforcing specific ordering of the PM callbacks.
> Just to take an example of the whole "independent devices are not
> necessarily independent" thing - things like multi-port PCMCIA controllers
> generally show up as multiple PCI devices. But they are _not_ independent,
> and they actually share some registers. Resuming them asynchronously might
> well be ok, but maybe it's not. Who knows?
I'd say if there's a worry that the same register may be accessed concurrently
from two different code paths, there should be some locking in place.
> In contrast, a device driver can generally know that certain _parts_ of
> the initialization are safe. As an example of that, I think the libata
> layer does all the port enumeration synchronously, but then once the ports
> have been identified, it does the rest async.
>
> That's the kind of decision we can sanely make when we do the async part
> as a "drivers may choose to do certain parts asynchronously". Doing it at
> a higher level sounds like a problem to me.
The difference between suspend and initialization is that during suspend we
have already enumerated all devices and we should know how they depend on
each other (and we really should know that if we are to actually understand how
things work), so we can represent that information somehow and use it to do
things at the higher level.
How to represent it is a different matter, but in principle it should be
possible.
> > If you don't like that, we'll have to take the longer route, although
> > I'm afraid that will take lots of time and we won't be able to exploit
> > the entire possible parallelism this way.
>
> Sure. But I'd rather do the safe thing. Especially since there are likely
> just a few cases that really take a long time.
And there are lots of small sleeps here and there that accumulate and are
entirely avoidable.
> > During suspend we actually know what the dependences between the devices
> > are and we can use that information to do more things in parallel. For
> > instance, in the majority of cases (I'm yet to find a counter example), the
> > entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
> > other.
>
> See above. That's simply not at all guaranteed to be true.
>
> And when it isn't true (ie different PCI leaf devices end up having subtle
> dependencies), now you need to start doing hacky things.
>
> I'd much rather have the individual drivers say "I can do this part in
> parallel", and not force it on them. Because it is definitely _not_
> guaranteed that PCI devices can do parallel resume and suspend.
OK, it's not guaranteed, but why not do this on systems where it's known
to work?
> > Yes, we can do that, but I'm afraid that the majority of drivers won't use the
> > new hooks (people generally seem to be too reluctant to modify their
> > suspend/resume callbacks, so as not to break things).
>
> See above - I don't think this is a "majority" issue. I think it's a
> "let's figure out the problem spots, and fix _those_". IOW, get 2% of the
> coverage, and get 95% of the advantage.
I wouldn't really like to add even more suspend/resume callbacks for this
purpose, because we already have so many of them. And even if we do that,
I don't really expect drivers to start using them any time soon.
> > Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
>
> We actually saw that when we did async init. And it was horrible. There's
> nothing that says that the ACPI stuff necessarily even _can_ run in
> parallel.
>
> I think we currently only do the ACPI battery ops asynchronously.
There are only a few ACPI devices that have real suspend/resume callbacks
and I haven't seen problems with these in practice.
Thanks,
Rafael
On Sunday 06 December 2009, Rafael J. Wysocki wrote:
> On Sunday 06 December 2009, Linus Torvalds wrote:
> >
> > On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > The approach you're suggesting would require modifying individual drivers which
> > > I just wanted to avoid.
> >
> > In the init path, we had the reverse worry - not wanting to make
> > everything (where "everything" can be some subsystem like just the set of
> > PCI drivers, of course - not really "everything" in an absolute sense)
> > async, and then having to try to work out with the random driver that
> > couldn't handle it.
> >
> > And there were _lots_ of drivers that couldn't handle it, because they
> > knew they got woken up serially. The ATA layer needed to know about
> > asynchronous things, because sometimes those independent devices aren't so
> > independent at all. Which is why I don't think your approach is safe.
>
> While the current settings are probably unsafe (like enabling PCI devices
> to be suspended asynchronously by default if there are not any direct
> dependences between them), there are provisions to make everything safe, if
> we have enough information (which also is needed to put the required logic into
> the drivers). The device tree represents a good deal of the dependences
> between devices and the other dependences may be represented as PM links
> enforcing specific ordering of the PM callbacks.
>
> > Just to take an example of the whole "independent devices are not
> > necessarily independent" thing - things like multi-port PCMCIA controllers
> > generally show up as multiple PCI devices. But they are _not_ independent,
> > and they actually share some registers. Resuming them asynchronously might
> > well be ok, but maybe it's not. Who knows?
>
> I'd say if there's a worry that the same register may be accessed concurrently
> from two different code paths, there should be some locking in place.
>
> > In contrast, a device driver can generally know that certain _parts_ of
> > the initialization are safe. As an example of that, I think the libata
> > layer does all the port enumeration synchronously, but then once the ports
> > have been identified, it does the rest async.
> >
> > That's the kind of decision we can sanely make when we do the async part
> > as a "drivers may choose to do certain parts asynchronously". Doing it at
> > a higher level sounds like a problem to me.
>
> The difference between suspend and initialization is that during suspend we
> have already enumerated all devices and we should know how they depend on
> each other (and we really should know that if we are to actually understand how
> things work), so we can represent that information somehow and use it to do
> things at the higher level.
>
> How to represent it is a different matter, but in principle it should be
> possible.
>
> > > If you don't like that, we'll have to take the longer route, although
> > > I'm afraid that will take lots of time and we won't be able to exploit
> > > the entire possible parallelism this way.
> >
> > Sure. But I'd rather do the safe thing. Especially since there are likely
> > just a few cases that really take a long time.
>
> And there are lots of small sleeps here and there that accumulate and are
> entirely avoidable.
I mean, it is avoidable to do all these sleeps sequentially.
Thanks,
Rafael
On Sun, 6 Dec 2009 02:26:06 +0100
"Rafael J. Wysocki" wrote:
> On Sunday 06 December 2009, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 00:55:36 +0100
> > "Rafael J. Wysocki" wrote:
> >
> > >
> > > Disk spinup/spindown takes time, but also some ACPI devices resume
> > > slowly, serio devices do that too and there are surprisingly many
> > > drivers that wait (using msleep()) during suspend and resume.
> > > Apart from this, every PCI device going from D0 to D3 during
> > > suspend and from D3 to D0 during resume requires us to sleep for
> > > 10 ms (the sleeping is done by the PCI core, so the drivers don't
> > > even realize it's there).
> >
> > maybe a good step is to make a scripts/bootgraph.pl equivalent for
> > suspend/resume (or make a debug mode that outputs in a compatible
> > format so that the script can be used as is.. I don't mind either
> > way, and consider this my offer to help with such a script as long
> > as there's sufficient logging in dmesg ;-)
>
> OK, so what kind of logging is needed?
basically the equivalent of the two initcall_debug paths in
init/main.c:do_one_initcall()
which prints a start time and an end time (and a pid) for each init
function; if we have the same for suspend calls (and resume)... we can
make the tool graph it.
Would be nice to get markers for start and end of the whole suspend
sequence as well as for the resume sequence; those make it easier to
know when the end is (so that the axis can be drawn etc)
shouldn't be too hard to implement...
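
[Editorial sketch, not part of the original message.] The logging described
above could look roughly like the following user-space model: each callback
is bracketed by a "calling" line and a "returned" line that carries the
elapsed time, so a graphing script can pair the two. The names
timed_pm_call(), pm_callback_t and demo_suspend() are hypothetical, the
output format is only loosely modelled on initcall_debug, and a timestamp is
printed where the kernel code would print a pid.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical callback type standing in for a driver's suspend/resume hook. */
typedef int (*pm_callback_t)(void *dev);

/* Monotonic clock in microseconds (user-space stand-in for kernel timekeeping). */
static long long now_usecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/*
 * Run one callback and bracket it with a "calling" line and a "returned"
 * line carrying the elapsed time, so that a script can pair them up.
 */
static int timed_pm_call(const char *name, pm_callback_t cb, void *dev,
			 long long *usecs_out)
{
	long long start, delta;
	int ret;

	assert(cb != NULL);
	start = now_usecs();
	printf("calling  %s @ %lld\n", name, start);
	ret = cb(dev);
	delta = now_usecs() - start;
	printf("call %s returned %d after %lld usecs\n", name, ret, delta);
	if (usecs_out)
		*usecs_out = delta;
	return ret;
}

/* Dummy callback that sleeps for about 10 ms, like the PCI D-state delay. */
static int demo_suspend(void *dev)
{
	struct timespec ten_ms = { 0, 10 * 1000 * 1000 };

	(void)dev;
	nanosleep(&ten_ms, NULL);
	return 0;
}
```

Calling timed_pm_call("demo_suspend", demo_suspend, NULL, &usecs) prints the
two marker lines and stores the measured duration in usecs. Start and end
markers for the whole sequence could be emitted the same way around the
outer loop.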
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> While the current settings are probably unsafe (like enabling PCI devices
> to be suspended asynchronously by default if there are not any direct
> dependences between them), there are provisions to make everything safe, if
> we have enough information (which also is needed to put the required logic into
> the drivers).
I disagree.
Think of a situation that we already handle pretty poorly: USB mass
storage devices over a suspend/resume.
> The device tree represents a good deal of the dependences
> between devices and the other dependences may be represented as PM links
> enforcing specific ordering of the PM callbacks.
The device tree means nothing at all, because it may need to be entirely
rebuilt at resume time.
Optimally, what we _should_ be doing (and aren't) for suspend/resume of
USB is to just tear down the whole topology and rebuild it and re-connect
the things like mass storage devices. IOW, there would be no device tree
to describe the topology, because we're finding it anew. And it's one of
the things we _would_ want to do asynchronously with other things.
We don't want to build up some irrelevant PM links and callbacks. We don't
want to have some completely made-up new infrastructure for something that
we _already_ want to handle totally differently for init time.
IOW, I argue very strongly against making up something PM-specific, when
there really doesn't seem to be much of an advantage. We're much better
off trying to share the init code than making up something new.
> I'd say if there's a worry that the same register may be accessed concurrently
> from two different code paths, there should be some locking in place.
Yeah. And I wish ACPI didn't exist at all. We don't know.
And we want to _limit_ our exposure to these things.
Linus
On Sunday 06 December 2009, Linus Torvalds wrote:
>
> On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> >
> > While the current settings are probably unsafe (like enabling PCI devices
> > to be suspended asynchronously by default if there are not any direct
> > dependences between them), there are provisions to make everything safe, if
> > we have enough information (which also is needed to put the required logic into
> > the drivers).
>
> I disagree.
>
> Think of a situation that we already handle pretty poorly: USB mass
> storage devices over a suspend/resume.
>
> > The device tree represents a good deal of the dependences
> > between devices and the other dependences may be represented as PM links
> > enforcing specific ordering of the PM callbacks.
>
> The device tree means nothing at all, because it may need to be entirely
> rebuilt at resume time.
With that assumption we have no choice but to leave the async stuff to the
drivers, which generally I'm fine with, although I really don't expect to see
it done.
> Optimally, what we _should_ be doing (and aren't) for suspend/resume of
> USB is to just tear down the whole topology and rebuild it and re-connect
> the things like mass storage devices. IOW, there would be no device tree
> to describe the topology, because we're finding it anew. And it's one of
> the things we _would_ want to do asynchronously with other things.
I think you should tell that to the USB people, because they don't seem to
think this way.
[Side note, I do think that at least some information in the device tree will
remain valid over suspend/resume, but this is a different matter.]
> We don't want to build up some irrelevant PM links and callbacks. We don't
> want to have some completely made-up new infrastructure for something that
> we _already_ want to handle totally differently for init time.
>
> IOW, I argue very strongly against making up something PM-specific, when
> there really doesn't seem to be much of an advantage. We're much better
> off trying to share the init code than making up something new.
>
> > I'd say if there's a worry that the same register may be accessed concurrently
> > from two different code paths, there should be some locking in place.
>
> Yeah. And I wish ACPI didn't exist at all. We don't know.
>
> And we want to _limit_ our exposure to these things.
Don't worry, I'm not going to touch async suspend/resume again, unless
somebody makes me do it.
BTW, you seem to have some quite strong opinions about power management that
you only share with people when somebody sends you patches you don't like. I
guess it will be much more productive if we know your thoughts about it in
advance, so I hope you won't mind being sent CCs of core PM patches posted to
linux-pm for discussions.
Thanks,
Rafael
* Arjan van de Ven <[email protected]> wrote:
> On Sun, 6 Dec 2009 02:26:06 +0100
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > On Sunday 06 December 2009, Arjan van de Ven wrote:
> > > On Sun, 6 Dec 2009 00:55:36 +0100
> > > "Rafael J. Wysocki" <[email protected]> wrote:
> > >
> > > >
> > > > Disk spinup/spindown takes time, but also some ACPI devices resume
> > > > slowly, serio devices do that too and there are surprisingly many
> > > > drivers that wait (using msleep()) during suspend and resume.
> > > > Apart from this, every PCI device going from D0 to D3 during
> > > > suspend and from D3 to D0 during resume requires us to sleep for
> > > > 10 ms (the sleeping is done by the PCI core, so the drivers don't
> > > > even realize it's there).
> > >
> > > maybe a good step is to make a scripts/bootgraph.pl equivalent for
> > > suspend/resume (or make a debug mode that outputs in a compatible
> > > format so that the script can be used as is.. I don't mind either
> > > way, and consider this my offer to help with such a script as long
> > > as there's sufficient logging in dmesg ;-)
> >
> > OK, so what kind of logging is needed?
>
> basically the equivalent of the two initcall_debug paths in
>
> init/main.c:do_one_initcall()
>
> which prints a start time and an end time (and a pid) for each init
> function; if we have the same for suspend calls (and resume)... we can
> make the tool graph it.
>
> Would be nice to get markers for start and end of the whole suspend
> sequence as well as for the resume sequence; those make it easier to
> know when the end is (so that the axis can be drawn etc)
>
> shouldn't be too hard to implement...
I think an even better option would be to extend 'perf timechart' to be
suspend/resume aware: add a few tracepoint events and teach 'perf
timechart' to draw them. (We should be able to do perf timechart record
across suspend/resume cycles just fine.)
( Doing that would also improve the tracing facilities within
suspend/resume quite significantly. It wouldn't just be a
single-purpose thing for graphing, but perf trace and perf stat would
work equally well. )
Thanks,
Ingo
On Sat, 5 Dec 2009, Linus Torvalds wrote:
> Think of a situation that we already handle pretty poorly: USB mass
> storage devices over a suspend/resume.
>
> > The device tree represents a good deal of the dependences
> > between devices and the other dependences may be represented as PM links
> > enforcing specific ordering of the PM callbacks.
>
> The device tree means nothing at all, because it may need to be entirely
> rebuilt at resume time.
Nonsense.
> Optimally, what we _should_ be doing (and aren't) for suspend/resume of
> USB is to just tear down the whole topology and rebuild it and re-connect
> the things like mass storage devices. IOW, there would be no device tree
> to describe the topology, because we're finding it anew. And it's one of
> the things we _would_ want to do asynchronously with other things.
That's ridiculous. Having gone to all the trouble of building a device
tree, one which is presumably still almost entirely correct, why go to
all the trouble of tearing it down only to rebuild it again? (Note:
I'm talking about resume-from-RAM here, not resume-from-hibernation.)
Instead what we do is verify that the devices we remember from before
the suspend are still there, and then asynchronously handle new devices
which have been plugged in during the meantime. Doing this involves
relatively little extra or new code; most of the routines are shared
with the runtime PM and device reset paths.
As for asynchronicity... At init time, USB device discovery truly is
asynchronous. It can happen long after you log in (especially if you
don't plug in the device until after you log in!). But at resume time
we are more highly constrained. User processes cannot be unfrozen
until all the devices have been resumed; otherwise they would encounter
errors when trying to do I/O to a suspended device. (With the runtime
PM framework this is much less of a problem, but plenty of drivers
don't support runtime PM yet.)
> We don't want to build up some irrelevant PM links and callbacks. We don't
> want to have some completely made-up new infrastructure for something that
> we _already_ want to handle totally differently for init time.
>
> IOW, I argue very strongly against making up something PM-specific, when
> there really doesn't seem to be much of an advantage. We're much better
> off trying to share the init code than making up something new.
If I understand correctly, what you're suggesting is impractical. You
would have each driver responsible for resuming the devices it
registers. If it registered some children synchronously (during the
parent's probe) then it would resume them synchronously (during the
parent's resume); if it registered them asynchronously then it would
resume them asynchronously. In essence, every single device_add() or
device_register() call would have to be paired with a resume call.
To make such significant changes in every driver would be prohibitively
difficult. What we need is a compromise which gives drivers control
over the resume process without making them responsible for actually
carrying it out.
So consider this suggestion: Let's define PM groups. Each device
belongs to a group, and each group (except group 0, the initial group)
has an owner device. By default a device is added to its parent's
group during registration, but the driver can request that it be
assigned to a different group, which must be owned by that parent.
During resume, each PM group would correspond to an async task. The
devices in each group would be resumed sequentially, in order of
registration, but asynchronously with respect to other groups. The
async thread to resume a group would be launched after the group's
owner device was resumed.
So for example, the sibling functions on a PCI card could all be
assigned to the same group, but different cards could belong to
different groups. Likewise for ATA and PCMCIA controllers. Extra
cross-group constraints could be added if needed, but there should be
relatively few of them.
This way drivers can decide which of their devices will be resumed in
sequence or concurrently, but they won't have to do any of the
necessary work.
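
[Editorial sketch, not part of the original message.] To make the proposal
concrete, here is a small user-space timing model of it. Devices are listed
in registration order, each group resumes its members one after another, and
a group starts only once its owner device has finished. All names (struct
pm_dev, struct pm_group, simulate_group_resume(), demo_two_cards()) and the
millisecond costs are made up for illustration; nothing here is an existing
kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 16

struct pm_dev {
	const char *name;
	int group;	/* PM group this device belongs to */
	int resume_ms;	/* assumed cost of its resume callback */
	int done_ms;	/* filled in: time at which its resume completes */
};

struct pm_group {
	int owner;	/* index of the owner device, -1 for group 0 */
};

/*
 * Walk the devices in registration order. Within a group the devices are
 * resumed back to back; a group's clock starts when its owner has been
 * resumed. Returns the wall-clock time at which the last device is done.
 */
static int simulate_group_resume(struct pm_dev *devs, size_t ndevs,
				 const struct pm_group *groups, size_t ngroups)
{
	int group_clock[MAX_GROUPS];
	int started[MAX_GROUPS] = { 0 };
	int total = 0;

	assert(ngroups <= MAX_GROUPS);
	for (size_t i = 0; i < ndevs; i++) {
		int g = devs[i].group;

		assert(g >= 0 && (size_t)g < ngroups);
		if (!started[g]) {
			int owner = groups[g].owner;

			/* The owner must have been registered earlier. */
			assert(owner < (int)i);
			group_clock[g] = owner < 0 ? 0 : devs[owner].done_ms;
			started[g] = 1;
		}
		group_clock[g] += devs[i].resume_ms;
		devs[i].done_ms = group_clock[g];
		if (devs[i].done_ms > total)
			total = devs[i].done_ms;
	}
	return total;
}

/* One bridge in group 0 that owns two groups, one per card. */
static int demo_two_cards(void)
{
	struct pm_group groups[] = {
		{ -1 },	/* group 0: initial group, no owner */
		{ 0 },	/* group 1: card A, owned by the bridge */
		{ 0 },	/* group 2: card B, owned by the bridge */
	};
	struct pm_dev devs[] = {
		{ "bridge",  0, 10, 0 },
		{ "cardA.0", 1, 20, 0 },
		{ "cardA.1", 1, 20, 0 },
		{ "cardB.0", 2, 30, 0 },
	};

	return simulate_group_resume(devs, 4, groups, 3);
}
```

In this example the two functions of card A resume in sequence while card B
resumes alongside them, so the model finishes at 50 ms instead of the 80 ms
a fully sequential resume would take.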
Alan Stern
On Sun, 2009-12-06 at 10:23 -0500, Alan Stern wrote:
> On Sat, 5 Dec 2009, Linus Torvalds wrote:
> That's ridiculous. Having gone to all the trouble of building a device
> tree, one which is presumably still almost entirely correct, why go to
> all the trouble of tearing it down only to rebuild it again? (Note:
> I'm talking about resume-from-RAM here, not resume-from-hibernation.)
There should be nothing special or privileged at all about the device
tree that gets built at boot time.
Consider the scenario of the laptop user with a docking station.
Adding, removing, and rewriting vast swaths of the device tree across
suspend/resume and hibernate/thaw is very easy to do when you are
plugging a laptop into one or more docking stations.
> Instead what we do is verify that the devices we remember from before
> the suspend are still there, and then asynchronously handle new devices
> which have been plugged in during the meantime. Doing this involves
> relatively little extra or new code; most of the routines are shared
> with the runtime PM and device reset paths.
Devices can vanish across suspend to RAM just as easily as they can be
added.
On Sat, 5 Dec 2009 18:05:14 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> >
> > While the current settings are probably unsafe (like enabling PCI
> > devices to be suspended asynchronously by default if there are not
> > any direct dependences between them), there are provisions to make
> > everything safe, if we have enough information (which also is needed
> > to put the required logic into the drivers).
>
> I disagree.
>
> Think of a situation that we already handle pretty poorly: USB mass
> storage devices over a suspend/resume.
>
> > The device tree represents a good deal of the dependences
> > between devices and the other dependences may be represented as PM
> > links enforcing specific ordering of the PM callbacks.
>
> The device tree means nothing at all, because it may need to be
> entirely rebuilt at resume time.
btw I instrumented both the suspend and resume, and made graphs out of
it for my laptop (modern laptop with Intel cpu/wifi/graphics of course).
http://www.fenrus.org/graphs/suspend.svg
http://www.fenrus.org/graphs/resume.svg
(also attached for convenience)
the resume clearly shows that all this talking about PCI stuff is
completely without practical merit.. it's the USB stuff where the time
is spent.
in suspend, there's a PCI device (:1b) that does take some time, which
is the audio controller. The bulk of the time is in the serio driver
though..
As an "interested bystander" to this thread.... sounds like Linus'
arguments have merit, and that solving the USB resume to go async in
some form will fix pretty much all we want solving...
[and that at least we need to do this stuff data/measurement driven,
and not just based on how we THINK things work]
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>
> in suspend, there's a PCI device (:1b) that does take some time, which
> is the audio controller. The bulk of the time is in the serio driver
> though..
That serio thing is disgusting. We had serious problems with the serial
driver timeouts for boot-time optimizations too, didn't we?
I assume that you don't even _use_ that serial port, do you? Or is it open
for serial console logging or something? If it isn't even open, we
shouldn't waste any time on the hardware.
Your graph seems to say that serio1 shutdown is roughly from 29.40 to
29.85, ie almost half a second. That's just bogus.
I don't see where it comes from, though. It looks like we have
- pciserial_suspend_ports/serial_pnp_suspend ->
    serial8250_suspend_port ->
      uart_suspend_port ->
        (wait for tx_empty, but only for ASYNC_INITIALIZED, which
         shouldn't be true if it's closed, and should be limited to 30ms)
        uart_change_pm ->
          serial8250_pm
and none of them look like they should take anywhere close to half a
second. So I'm obviously missing something, and your chart didn't include
the sleep/wakeup pairs.
Linus
On Sun, 6 Dec 2009 11:58:44 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> >
> > in suspend, there's a PCI device (:1b) that does take some time,
> > which is the audio controller. The bulk of the time is in the serio
> > driver though..
>
> That serio thing is disgusting. We had serious problems with the
> serial driver timeouts for boot-time optimizations too, didn't we?
isn't serio the PS/2 stuff?
(serio was an issue during boot as well due to some interesting rcu
delays btw)
> and none of them look like they should take anywhere close to half a
> second. So I'm obviously missing something, and your chart didn't
> include the sleep/wakeup pairs.
what do you mean by this? what would you like to see ?
(I have a separate graph for resume.. but the graphing program does not
show those things that take so short to resume that the font to print
the name would be less than a pixel; can fix that)
[fwiw I care more about resume speed than suspend speed, but obviously
am happy if both get fixed... just resume tends to be much more user
interesting, just like getting out of the idle loop matters more than
getting into it]
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> btw I instrumented both the suspend and resume, and made graphs out of
> it for my laptop (modern laptop with Intel cpu/wifi/graphics of course).
>
> http://www.fenrus.org/graphs/suspend.svg
> http://www.fenrus.org/graphs/resume.svg
>
> (also attached for convenience)
>
> the resume clearly shows that all this talking about PCI stuff is
> completely without practical merit.. it's the USB stuff where the time
> is spent.
Arjan, can you try testing the USB timings again with the patch below
(for vanilla 2.6.32)?
Fair warning: I just composed this and haven't tried it out myself.
Thanks,
Alan Stern
Index: 2.6.32/drivers/usb/core/driver.c
===================================================================
--- 2.6.32.orig/drivers/usb/core/driver.c
+++ 2.6.32/drivers/usb/core/driver.c
@@ -1313,8 +1313,9 @@ static int usb_resume_both(struct usb_de
* then we're stuck. */
status = usb_resume_device(udev, msg);
}
- } else if (udev->reset_resume)
+ } else {
status = usb_resume_device(udev, msg);
+ }
if (status == 0 && udev->actconfig) {
for (i = 0; i < udev->actconfig->desc.bNumInterfaces; i++) {
Index: 2.6.32/drivers/usb/core/hub.c
===================================================================
--- 2.6.32.orig/drivers/usb/core/hub.c
+++ 2.6.32/drivers/usb/core/hub.c
@@ -1674,7 +1674,7 @@ static int usb_configure_device_otg(stru
* (Includes HNP test device.)
*/
if (udev->bus->b_hnp_enable || udev->bus->is_b_host) {
- err = usb_port_suspend(udev, PMSG_SUSPEND);
+ err = usb_port_suspend(udev, PMSG_AUTO_SUSPEND);
if (err < 0)
dev_dbg(&udev->dev, "HNP fail, %d\n", err);
}
@@ -2060,6 +2060,7 @@ static int check_port_resume_type(struct
/*
* usb_port_suspend - suspend a usb device's upstream port
* @udev: device that's no longer in active use, not a root hub
+ * @msg: Power Management message describing this state transition
* Context: must be able to sleep; device not locked; pm locks held
*
* Suspends a USB device that isn't in active use, conserving power.
@@ -2107,7 +2108,7 @@ int usb_port_suspend(struct usb_device *
{
struct usb_hub *hub = hdev_to_hub(udev->parent);
int port1 = udev->portnum;
- int status;
+ int status = 0;
// dev_dbg(hub->intfdev, "suspend port %d\n", port1);
@@ -2128,6 +2129,13 @@ int usb_port_suspend(struct usb_device *
status);
}
+ /* For system sleep transitions we don't actually need to suspend
+ * the port. The device will suspend itself when the entire bus
+ * is suspended.
+ */
+ if (!(msg.event & (PM_EVENT_USER | PM_EVENT_REMOTE | PM_EVENT_AUTO)))
+ return status;
+
/* see 7.1.7.6 */
status = set_port_feature(hub->hdev, port1, USB_PORT_FEAT_SUSPEND);
if (status) {
@@ -2231,6 +2239,7 @@ static int finish_port_resume(struct usb
/*
* usb_port_resume - re-activate a suspended usb device's upstream port
* @udev: device to re-activate, not a root hub
+ * @msg: Power Management message describing this state transition
* Context: must be able to sleep; device not locked; pm locks held
*
* This will re-activate the suspended device, increasing power usage
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> > That serio thing is disgusting. We had serious problems with the
> > serial driver timeouts for boot-time optimizations too, didn't we?
>
> isn't serio the PS/2 stuff?
Oh, you're right, I just assumed it was regular serial. So it's the
keyboard and mouse.
Linus
On Sun, 6 Dec 2009 15:36:40 -0500 (EST)
Alan Stern <[email protected]> wrote:
> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>
> > btw I instrumented both the suspend and resume, and made graphs out
> > of it for my laptop (modern laptop with Intel cpu/wifi/graphics of
> > course).
> >
> > http://www.fenrus.org/graphs/suspend.svg
> > http://www.fenrus.org/graphs/resume.svg
> >
> > (also attached for convenience)
> >
> > the resume clearly shows that all this talking about PCI stuff is
> > completely without practical merit.. it's the USB stuff where the
> > time is spent.
>
> Arjan, can you try testing the USB timings again with the patch below
> (for vanilla 2.6.32)?
>
> Fair warning: I just composed this and haven't tried it out myself.
unfortunately it does not make a difference that I can notice in the
graphs.
http://www.fenrus.org/graphs/resume2.svg
the resume problem seems to be that we resume all the hubs sequentially,
much like we used to discover them sequentially during boot....
I do not know how much I'm asking for, but would it be sensible to do a
similar thing for hub resume as we did for boot? eg start resuming them
all at the same time, so that the mandatory delays of these hubs will
overlap ?
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> > Arjan, can you try testing the USB timings again with the patch below
> > (for vanilla 2.6.32)?
> >
> > Fair warning: I just composed this and haven't tried it out myself.
>
> unfortunately it does not make a difference that I can notice in the
> graphs.
>
> http://www.fenrus.org/graphs/resume2.svg
Disappointing...
> the resume problem seems to be that we resume all the hubs sequentially,
> much like we used to discover them sequentially during boot....
But the patch should have reduced the time required to resume each
non-root hub. So the fact that they go sequentially shouldn't matter
as much.
For root hubs the patch won't help. Their delays can't be reduced.
> I do not know how much I'm asking for, but would it be sensible to do a
> similar thing for hub resume as we did for boot? eg start resuming them
> all at the same time, so that the mandatory delays of these hubs will
> overlap ?
For one thing, there shouldn't be any mandatory delays for non-root
hubs during resume-from-RAM (although this depends to some extent on
your system firmware -- and it probably helps to have USB-2.0 hubs
rather than USB-1.1).
More importantly, what you're asking is impossible given the way the PM
core is structured. The hub-resume routine can't return early because
then it wouldn't be possible to resume devices plugged into that hub.
(Ironically, your request is essentially what Rafael was trying to
accomplish in the patches that provoked this email conversation.)
Guess I'll just have to try out your timing log addition for myself and
see what's going on...
Alan Stern
On Sun, 6 Dec 2009 16:46:16 -0500 (EST)
Alan Stern <[email protected]> wrote:
> For root hubs the patch won't help. Their delays can't be reduced.
>
> > I do not know how much I'm asking for, but would it be sensible to
> > do a similar thing for hub resume as we did for boot? eg start
> > resuming them all at the same time, so that the mandatory delays of
> > these hubs will overlap ?
>
> For one thing, there shouldn't be any mandatory delays for non-root
> hubs during resume-from-RAM (although this depends to some extent on
> your system firmware -- and it probably helps to have USB-2.0 hubs
> rather than USB-1.1).
>
> More importantly, what you're asking is impossible given the way the
> PM core is structured. The hub-resume routine can't return early
> because then it wouldn't be possible to resume devices plugged into
> that hub.
having spent 30 minutes trying to grok this code, I think there may be
a trick in using the async function call infrastructure.
if each USB hub's resume (hub_resume()) would be done as an async
function call, that would start allowing the hub resumes to go async,
but this is not enough.
usb_resume_both() would also then need to be an async call itself, and
do its "resume the parent" recursion as an async function call, and then
it needs to do a synchronization before actually resuming the device
itself (provided it is not a hub or hub like device I suppose).
the later synchronization guarantees that no device will be resumed
before its parent tree structure is resumed, while allowing parallel
parts of the tree to be resumed in parallel.
The hard part in this is the locking.... that is getting non-trivial
once you have multiple asynchronous functions executing.
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> having spent 30 minutes trying to grok this code, I think there may be
> a trick in using the async function call infrastructure.
>
> if each USB hub's resume (hub_resume()) would be done as an async
> function call, that would start allowing the hub resumes to go async,
> but this is not enough.
>
> usb_resume_both() would also then need to be an async call itself, and
> do its "resume the parent" recursion as an async function call, and then
> it needs to do a synchronization before actually resuming the device
> itself (provided it is not a hub or hub like device I suppose).
>
> the later synchronization guarantees that no device will be resumed
> before its parent tree structure is resumed, while allowing parallel
> parts of the tree to be resumed in parallel.
>
> The hard part in this is the locking.... that is getting non-trivial
> once you have multiple asynchronous functions executing.
That's the whole point of Rafael's async suspend/resume framework. He
has done the hard work already.
Alan Stern
On Dec 6, 2009, at 12:18 PM, Arjan van de Ven <[email protected]>
wrote:
> On Sun, 6 Dec 2009 11:58:44 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
>>
>>
>> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>>>
>>> in suspend, there's a PCI device (:1b) that does take some time,
>>> which is the audio controller. The bulk of the time is in the serio
>>> driver though..
>>
>> That serio thing is disgusting. We had serious problems with the
>> serial driver timeouts for boot-time optimizations too, didn't we?
>
> isn't serio the PS/2 stuff?
Yes, that's your PS/2 mouse (rather touchpad) and the delay comes from
device reset (needed by some keyboard controllers - I remember HP - or
else it and the keyboard will be dead at resume).
--
Dmitry
On Sun, 6 Dec 2009 14:54:48 -0800
Dmitry Torokhov <[email protected]> wrote:
> > isn't serio the PS/2 stuff?
>
> Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> from device reset (needed by some keyboard controllers - I remember
> HP - or else it and the keyboard will be dead at resume).
and I have a HP laptop... so this makes perfect sense.
Thanks for the explanation!
Now, the good news is that serio is near invisible on resume...
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009 14:54:48 -0800
Dmitry Torokhov <[email protected]> wrote:
> Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> from device reset (needed by some keyboard controllers - I remember
> HP - or else it and the keyboard will be dead at resume).
>
btw could we do this reset in an async function call (as long as we
wait for it to complete before we pull the plug finally) ?
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, Dec 06, 2009 at 05:18:56PM -0800, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 14:54:48 -0800
> Dmitry Torokhov <[email protected]> wrote:
>
> > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > from device reset (needed by some keyboard controllers - I remember
> > HP - or else it and the keyboard will be dead at resume).
> >
>
> btw could we do this reset in an async function call (as long as we
> wait for it to complete before we pull the plug finally) ?
It has to complete before we start shutting down i8042, so there are
dependencies involved...
--
Dmitry
On Sun, Dec 06, 2009 at 04:55:51PM -0800, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 14:54:48 -0800
> Dmitry Torokhov <[email protected]> wrote:
>
> > > isn't serio the PS/2 stuff?
> >
> > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > from device reset (needed by some keyboard controllers - I remember
> > > HP - or else it and the keyboard will be dead at resume).
>
> and I have a HP laptop... so this makes perfect sense.
> Thanks for the explanation!
>
Well, we do it for everyone, it's just a particular series of HPs forced
us to add it.
> Now, the good news is that serio is near invisible on resume...
>
Resume is fully offloaded to kseriod.
--
Dmitry
On Sun, 2009-12-06 at 23:23 +0800, Alan Stern wrote:
> On Sat, 5 Dec 2009, Linus Torvalds wrote:
>
> > Think of a situation that we already handle pretty poorly: USB mass
> > storage devices over a suspend/resume.
> >
> > > The device tree represents a good deal of the dependences
> > > between devices and the other dependences may be represented as PM links
> > > enforcing specific ordering of the PM callbacks.
> >
> > The device tree means nothing at all, because it may need to be entirely
> > rebuilt at resume time.
>
> Nonsense.
>
> > Optimally, what we _should_ be doing (and aren't) for suspend/resume of
> > USB is to just tear down the whole topology and rebuild it and re-connect
> > the things like mass storage devices. IOW, there would be no device tree
> > to describe the topology, because we're finding it anew. And it's one of
> > the things we _would_ want to do asynchronously with other things.
>
> That's ridiculous. Having gone to all the trouble of building a device
> tree, one which is presumably still almost entirely correct, why go to
> all the trouble of tearing it down only to rebuild it again? (Note:
> I'm talking about resume-from-RAM here, not resume-from-hibernation.)
>
> Instead what we do is verify that the devices we remember from before
> the suspend are still there, and then asynchronously handle new devices
> which have been plugged in during the meantime. Doing this involves
> relatively little extra or new code; most of the routines are shared
> with the runtime PM and device reset paths.
>
> As for asynchronicity... At init time, USB device discovery truly is
> asynchronous. It can happen long after you log in (especially if you
> don't plug in the device until after you log in!). But at resume time
> we are more highly constrained. User processes cannot be unfrozen
> until all the devices have been resumed; otherwise they would encounter
> errors when trying to do I/O to a suspended device. (With the runtime
> PM framework this is much less of a problem, but plenty of drivers
> don't support runtime PM yet.)
>
>
> > We don't want to build up some irrelevant PM links and callbacks. We don't
> > want to have some completely made-up new infrastructure for something that
> > we _already_ want to handle totally differently for init time.
> >
> > IOW, I argue very strongly against making up something PM-specific, when
> > there really doesn't seem to be much of an advantage. We're much better
> > off trying to share the init code than making up something new.
>
> If I understand correctly, what you're suggesting is impractical. You
> would have each driver responsible for resuming the devices it
> registers. If it registered some children synchronously (during the
> parent's probe) then it would resume them synchronously (during the
> parent's resume); if it registered them asynchronously then it would
> resume them asynchronously. In essence, every single device_add() or
> device_register() call would have to be paired with a resume call.
>
> To make such significant changes in every driver would be prohibitively
> difficult. What we need is a compromise which gives drivers control
> over the resume process without making them responsible for actually
> carrying it out.
>
> So consider this suggestion: Let's define PM groups. Each device
> belongs to a group, and each group (except group 0, the initial group)
> has an owner device. By default a device is added to its parent's
> group during registration, but the driver can request that it be
> assigned to a different group, which must be owned by that parent.
>
> During resume, each PM group would correspond to an async task. The
> devices in each group would be resumed sequentially, in order of
> registration, but asynchronously with respect to other groups. The
> async thread to resume a group would be launched after the group's
> owner device was resumed.
>
yes, we've talked about something similar to this before. :)
Hi, Linus,
can you please look at this patch set and see if the idea is right?
http://marc.info/?l=linux-kernel&m=124840449826386&w=2
http://marc.info/?l=linux-acpi&m=124840456826456&w=2
http://marc.info/?l=linux-acpi&m=124840456926459&w=2
http://marc.info/?l=linux-acpi&m=124840457026468&w=2
http://marc.info/?l=linux-acpi&m=124840457126471&w=2
If yes, I'll pick them up again and rework a patch set, including some
good thoughts from Rafael.
thanks,
rui
> So for example, the sibling functions on a PCI card could all be
> assigned to the same group, but different cards could belong to
> different groups. Likewise for ATA and PCMCIA controllers. Extra
> cross-group constraints could be added if needed, but there should be
> relatively few of them.
>
> This way drivers can decide which of their devices will be resumed in
> sequence or concurrently, but they won't have to do any of the
> necessary work.
>
> Alan Stern
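The PM-groups proposal quoted above can be sketched as a small data structure. This is only an illustration under stated assumptions: the names pm_device, pm_group_table, pm_register and pm_assign_new_group, as well as the fixed table size, are hypothetical and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define PM_MAX_GROUPS 16   /* arbitrary limit for the sketch */

/* Hypothetical device record; not a kernel structure. */
struct pm_device {
    const char *name;
    struct pm_device *parent;
    int group;                      /* 0 is the initial group */
};

/* owner[g] is the owner device of group g; group 0 has no owner. */
struct pm_group_table {
    struct pm_device *owner[PM_MAX_GROUPS];
    int count;
};

static void pm_groups_init(struct pm_group_table *t)
{
    t->owner[0] = NULL;
    t->count = 1;
}

/* Default rule: a newly registered device joins its parent's group. */
static void pm_register(struct pm_device *dev, struct pm_device *parent,
                        const char *name)
{
    dev->name = name;
    dev->parent = parent;
    dev->group = parent ? parent->group : 0;
}

/*
 * Opt-in rule: move dev into a new group owned by its parent.
 * Returns the new group id, or -1 if dev has no parent or the
 * table is full.
 */
static int pm_assign_new_group(struct pm_group_table *t,
                               struct pm_device *dev)
{
    if (!dev->parent || t->count >= PM_MAX_GROUPS)
        return -1;
    t->owner[t->count] = dev->parent;
    dev->group = t->count;
    return t->count++;
}
```

In this model, two PCI cards that each call pm_assign_new_group() end up in different groups, while functions registered below a card inherit that card's group, which matches the "sibling functions in one group, different cards in different groups" example.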
On Sun, 6 Dec 2009, Alan Stern wrote:
>
> That's ridiculous. Having gone to all the trouble of building a device
> tree, one which is presumably still almost entirely correct, why go to
> all the trouble of tearing it down only to rebuild it again? (Note:
> I'm talking about resume-from-RAM here, not resume-from-hibernation.)
Hey, I can believe that it's worth keeping the USB device tree, and just
validating it instead. However:
> If I understand correctly, what you're suggesting is impractical. You
> would have each driver responsible for resuming the devices it
> registers.
The thing is, for 99% of all devices, we really _really_ don't care.
Especially PCI devices.
Your average laptop will have something like ten PCI devices on it, and
99% of those have no delays at all outside of the millisecond-level ones
that it takes for power management register writes etc to take place.
So what I'm suggesting is to NOT DO ANY ASYNC RESUME AT ALL by default.
Because async device management is _hard_, and results in various nasty
ordering problems that are timing-dependent etc. And it's totally
pointless for almost all cases.
This is why I think it is so crazy to try to create those idiotic "this
device depends on that other" lists etc - it's adding serious conceptual
complexity for something that nobody cares about, and that just allows for
non-deterministic behavior that we don't even want.
> So consider this suggestion: Let's define PM groups.
Let's not.
I can imagine that doing USB resume specially is worth it, since USB is
fundamentally a pretty slow bus. But USB is also a fairly clear hierarchy,
so there is no point in PM groups or any other information outside of the
pure topology.
But there is absolutely zero point in doing that for devices in general.
PCI drivers simply do not want concurrent initialization. The upsides are
basically zero (win a few msecs?) and the downsides are the pointless
complexity. We don't do PCI discovery asynchronously either - for all the
same reasons.
Now, a PCI driver may then implement a bus that is slow (ie SCSI, ATA,
USB), and that bus may itself then want to do something else. If it really
is a good idea to add the whole hierarchical model to USB suspend/resume,
I can live with that, but that is absolutely no excuse for then doing it
for cases where the hierarchy is (a) known to be broken (ie the whole PCI
multifunction thing, but also things like motherboard power management
devices) and (b) don't have the same kind of slow bus issues.
Linus
On Sun, 6 Dec 2009 18:27:56 -0800
Dmitry Torokhov <[email protected]> wrote:
> On Sun, Dec 06, 2009 at 04:55:51PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 14:54:48 -0800
> > Dmitry Torokhov <[email protected]> wrote:
> >
> > > > isn't serio the PS/2 stuff?
> > >
> > > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > > from device reset (needed by some keyboard controllers - I
> > > remember HP - or it and keyboard will be dead at resume).
> >
> > and I have a HP laptop... so this makes perfect sense.
> > Thanks for the explanation!
> >
>
> Well, we do it for everyone, it's just a particular series of HPs
> forced us to add it.
>
wonder if it should be a DMI based quirk instead...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 6 Dec 2009 18:27:07 -0800
Dmitry Torokhov <[email protected]> wrote:
> On Sun, Dec 06, 2009 at 05:18:56PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 14:54:48 -0800
> > Dmitry Torokhov <[email protected]> wrote:
> >
> > > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > > from device reset (needed by some keyboard controllers - I
> > > remember HP - or it and keyboard will be dead at resume).
> > >
> >
> > btw could we do this reset in an async function call (as long as we
> > wait for it to complete before we pull the plug finally) ?
>
> It has to complete before we start shutting down i8042, so there are
> dependencies involved...
async function calls have 2 methods for synchronization:
* inside an async function, you can wait for all "earlier" async
functions to complete (async_synchronize_cookie)
* outside an async function, you can wait for all scheduled async
functions to complete (async_synchronize_full)
so there's two options to use the async code to cut down this time:
1) Make both the mouse, keyboard AND the i8042 suspend functions async,
and in the i8042 function the code first synchronizes on all previous
async work
2) only make the mouse and keyboard suspend async, and just wait for all
async work in i8042 suspend
I strongly prefer number 1, in terms of getting the best suspend speed.
It means that all other suspend code can run in parallel to the whole
serio/i8042 suspend.
Option two is simpler, but the delay is in the normal, synchronous path,
so other suspend code will not run in parallel.
The good news is that neither is hard for someone familiar with the
code...
--
Arjan van de Ven Intel Open Source Technology Centre
On Mon, 7 Dec 2009, Zhang Rui wrote:
>
> Hi, Linus,
> can you please look at this patch set and see if the idea is right?
> http://marc.info/?l=linux-kernel&m=124840449826386&w=2
> http://marc.info/?l=linux-acpi&m=124840456826456&w=2
> http://marc.info/?l=linux-acpi&m=124840456926459&w=2
> http://marc.info/?l=linux-acpi&m=124840457026468&w=2
> http://marc.info/?l=linux-acpi&m=124840457126471&w=2
So I'm not entirely sure about that patch-set, but the thing I like about
it is how drivers really sign up to it one by one, rather than having all
PCI devices automatically signed up for async behavior.
That said, the thing I don't like about it is some of the same thing I
don't necessarily like about the series in Rafael's tree either: it looks
rather over-designed with the whole infrastructure for async device logic
(your patch in http://marc.info/?l=linux-acpi&m=124840456926459&w=2). How
would you explain that whole async_dev_register() logic in simple terms to
somebody else?
(I think yours is simpler than the one in the PM tree, but I dunno. I've
not really compared the two).
So let me explain my dislike by trying to outline some conceptually simple
thing that doesn't have any call-backs, doesn't have any "classes",
doesn't require registration etc. It just allows drivers at any level to
decide to do some things (not necessarily everything) asynchronously.
Here's the outline:
- first off: drivers that don't know that they nest clearly don't do
anything asynchronous. No "PCI devices can be done in parallel" crap,
because they really can't - not in the general case. So just forget
about that kind of logic entirely: it's just wrong.
- the 'suspend' thing is a depth-first tree walk. As we suspend a node,
we first suspend the child nodes, and then we suspend the node itself.
Everybody agrees about that, right?
- Trivial "async rule": the tree is walked synchronously, but as we walk
it, any point in the tree may decide to do some or all of its suspend
asynchronously. For example, when we hit a disk node, the disk driver
may just decide that (a) it knows that the disk is an independent thing
and (b) it's hierarchical wrt its parent so (c) it can do the disk
suspend asynchronously.
- To protect against a parent node being suspended before any async child
work has completed, the child suspend - before it kicks off the actual
async work - just needs to take a read-lock on the parent (read-lock,
because you may have multiple children sharing a parent, and they don't
lock each other out). Then the only thing the asynchronous code needs
to do is to release the read lock when it is done.
- Now, the rule just becomes that the parent has to take a write lock on
itself when it suspends itself. That will automatically block until
all children are done.
Doesn't the above sound _simple_? Doesn't that sound like it should just
obviously do the right thing? It sounds like something you can explain as
a locking rule without having any complex semantic registration or
anything at all.
Now, the problem remains that when you walk the device tree starting off
all these potentially asynchronous events, you don't want to do that
serialization part (the "parent suspend") as you walk the tree - because
then you would only ever do one single level asynchronously. Which is why
I suggested splitting the suspend into a "pre-suspend" phase (and a
"post-resume" one). Because then the tree walk goes from
    # single depth-first thing
    suspend(root)
    {
        for_each_child(root) {
            // This may take the parent lock for
            // reading if it does something async
            suspend(child);
        }

        // This serializes with any async children
        write_lock(root->lock);
        suspend_one_node(root);
        write_unlock(root->lock);
    }

to

    # Phase one: walk the tree synchronously, starting any
    # async work on the leaves
    suspend_prepare(root)
    {
        for_each_child(root) {
            // This may take the parent lock for
            // reading if it does something async
            suspend_prepare(child);
        }
        suspend_prepare_one_node(root);
    }

    # Phase two: walk the tree synchronously, waiting for
    # and finishing the suspend
    suspend(root)
    {
        for_each_child(root) {
            suspend(child);
        }
        // This serializes with any async children started in phase 1
        write_lock(root->lock);
        suspend_one_node(root);
        write_unlock(root->lock);
    }

and I really think this should work.
The advantage: untouched drivers don't change ANY SEMANTICS AT ALL. If
they don't have a 'suspend_prepare()' function, then they still see that
exact same sequence of 'suspend()' calls. In fact, even if they have
children that _do_ have drivers that have that async phase, they'll never
know, because that simple write-semaphore trivially guarantees that
whether there was async work or not, it will be completed by the time we
call 'suspend()'.
And drivers that want to do things asynchronously don't need to register
or worry: all they do is literally
- move their 'suspend()' function to 'suspend_prepare()' instead
- add a
down_read(dev->parent->lock);
async_run(mysuspend, dev);
to the point that they want to be asynchronous (which may be _all_ of
it or just some slow part). The 'mysuspend' part would be the async
part.
- add a
up_read(dev->parent->lock);
to the end of their asynchronous 'mysuspend()' function, so that when
the child has finished suspending, the parent down_write() will finally
succeed.
Doesn't that all sound pretty simple? And it has a very clear architecture
that is pretty easy to explain to anybody who knows about traversing trees
depth-first.
No complex concepts. No change to existing tested drivers. No callbacks,
no flags, no nothing. And a pretty simple way for a driver to decide: I'll
do my suspends asynchronously (without parent drivers really ever even
having to know about it).
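A userspace sketch of the locking rule described above, for illustration only. It makes one substitution: instead of a real read-write lock it models the parent lock as a count of outstanding async children guarded by a mutex and condition variable, because a POSIX rwlock must be released by the thread that took it, whereas here the walking thread takes the read side and the async thread drops it. The node structure and all function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NCHILD 3

/* Hypothetical device node; not a kernel structure. */
struct node {
    struct node *parent;
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    int readers;    /* async children still holding the read side */
    int suspended;
};

static void node_init(struct node *n, struct node *parent)
{
    n->parent = parent;
    n->readers = 0;
    n->suspended = 0;
    pthread_mutex_init(&n->mtx, NULL);
    pthread_cond_init(&n->cv, NULL);
}

/* Read side: taken by the walking thread before async work starts. */
static void node_down_read(struct node *n)
{
    pthread_mutex_lock(&n->mtx);
    n->readers++;
    pthread_mutex_unlock(&n->mtx);
}

/* Read side release: called by the async thread when it is done. */
static void node_up_read(struct node *n)
{
    pthread_mutex_lock(&n->mtx);
    if (--n->readers == 0)
        pthread_cond_broadcast(&n->cv);
    pthread_mutex_unlock(&n->mtx);
}

/* Write side: returns holding the mutex once no reader remains. */
static void node_down_write(struct node *n)
{
    pthread_mutex_lock(&n->mtx);
    while (n->readers > 0)
        pthread_cond_wait(&n->cv, &n->mtx);
}

/* The async part of a child's suspend ("mysuspend" in the text). */
static void *child_async_suspend(void *arg)
{
    struct node *child = arg;

    child->suspended = 1;    /* the slow work would go here */
    node_up_read(child->parent);
    return NULL;
}

/* Runs both phases for one parent; returns 1 if ordering held. */
static int run_two_phase_suspend(void)
{
    struct node parent, kids[NCHILD];
    pthread_t tid[NCHILD];
    int started[NCHILD], ok = 1;

    node_init(&parent, NULL);
    /* Phase one: take the parent's read side, then go async. */
    for (int i = 0; i < NCHILD; i++) {
        node_init(&kids[i], &parent);
        node_down_read(&parent);
        started[i] = !pthread_create(&tid[i], NULL,
                                     child_async_suspend, &kids[i]);
        if (!started[i])
            child_async_suspend(&kids[i]);  /* synchronous fallback */
    }
    /* Phase two: the write side blocks until all children are done. */
    node_down_write(&parent);
    for (int i = 0; i < NCHILD; i++)
        ok = ok && kids[i].suspended;
    parent.suspended = 1;
    pthread_mutex_unlock(&parent.mtx);      /* write-side release */

    for (int i = 0; i < NCHILD; i++)
        if (started[i])
            pthread_join(tid[i], NULL);
    return ok;
}
```

Calling run_two_phase_suspend() should return 1: by the time the parent gets past node_down_write(), every child has already dropped its read side, whether it ran asynchronously or fell back to the synchronous path.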
I dunno. Maybe I'm overlooking something, but the above is much closer to
what I think would be worth doing.
Linus
On Sun, Dec 06, 2009 at 09:26:00PM -0800, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 18:27:56 -0800
> Dmitry Torokhov <[email protected]> wrote:
>
> > On Sun, Dec 06, 2009 at 04:55:51PM -0800, Arjan van de Ven wrote:
> > > On Sun, 6 Dec 2009 14:54:48 -0800
> > > Dmitry Torokhov <[email protected]> wrote:
> > >
> > > > > isn't serio the PS/2 stuff?
> > > >
> > > > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > > > from device reset (needed by some keyboard controllers - I
> > > > remember HP - or it and keyboard will be dead at resume).
> > >
> > > and I have a HP laptop... so this makes perfect sense.
> > > Thanks for the explanation!
> > >
> >
> > Well, we do it for everyone, it's just a particular series of HPs
> > forced us to add it.
> >
> wonder if it should be a DMI based quirk instead...
>
I have not received reports where it causes harm or reduces
functionality so I'd prefer having it by default and not try to race
with manufacturers.
--
Dmitry
On Sun, 6 Dec 2009, Linus Torvalds wrote:
>
> And drivers that want to do things asynchronously don't need to register
> or worry: all they do is literally [...]
Side note: for specific bus implementations, you obviously don't have to
even expose the choice. Things like the whole "suspend_late" and
"resume_early" phases don't make sense for USB devices, and the USB core
layer doesn't even expose those to the various USB drivers.
The same is true of the prepare_suspend/suspend split I'm proposing: I
suspect that for something like USB, it would make most sense to just do
normal node suspend in prepare_suspend, which would do everything
asynchronously. Only USB hub devices would get involved at the later
'suspend()' phase.
So I'm not suggesting that "all drivers" would necessarily even need
changing in order to take advantage of asynchronous behavior.
You could change just the _core_ USB layer so that it does everything
automatically for USB devices, and now USB devices would automatically
suspend asynchronously not because the generic device layer knows about
it, but because the USB bus layer chose to do that "async_run()" on the
leaf node suspend functions (or rather: a helper function that calls the
leaf-node suspend, and then does the 'up_read()' call on the parent
lock: the actual USB drivers would never know about any of this).
Linus
On Sun, Dec 06, 2009 at 09:31:12PM -0800, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 18:27:07 -0800
> Dmitry Torokhov <[email protected]> wrote:
>
> > On Sun, Dec 06, 2009 at 05:18:56PM -0800, Arjan van de Ven wrote:
> > > On Sun, 6 Dec 2009 14:54:48 -0800
> > > Dmitry Torokhov <[email protected]> wrote:
> > >
> > > > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > > > from device reset (needed by some keyboard controllers - I
> > > > remember HP - or it and keyboard will be dead at resume).
> > > >
> > >
> > > btw could we do this reset in an async function call (as long as we
> > > wait for it to complete before we pull the plug finally) ?
> >
> > It has to complete before we start shutting down i8042, so there are
> > dependencies involved...
>
> async function calls have 2 methods for synchronization:
>
> * inside an async function, you can wait for all "earlier" async
> functions to complete (async_synchronize_cookie)
> * outside an async function, you can wait for all scheduled async
> functions to complete (async_synchronize_full)
>
> so there's two options to use the async code to cut down this time:
>
> 1) Make both the mouse, keyboard AND the i8042 suspend functions async,
> and in the i8042 function the code first synchronizes on all previous
> async work
> 2) only make the mouse and keyboard suspend async, and just wait for all
> async work in i8042 suspend
>
> I strongly prefer number 1, in terms of getting the best suspend speed.
> It means that all other suspend code can run in parallel to the whole
> serio/i8042 suspend.
> Option two is simpler, but the delay is in the normal, synchronous path,
> so other suspend code will not run in parallel.
>
> The good news is that neither is hard for someone familiar with the
> code...
>
And the bad thing is that it violates multiple layers in the kernel. Atkbd
driver does not have to be using i8042; neither does psmouse. Although
they do in 99% of the cases there are other controllers providing the
i8042-style ports. Just grep for SERIO_8042 in drivers/input/serio.
I do not want to hard-code the i8042-psmouse-atkbd dependency.
--
Dmitry
On Sun, 6 Dec 2009 22:15:49 -0800
Dmitry Torokhov <[email protected]> wrote:
> And the bad thing is that it violates multiple layers in the kernel.
> Atkbd driver does not have to be using i8042; neither does psmouse.
> Although they do in 99% of the cases there are other controllers
> providing the i8042-style ports. Just grep for SERIO_8042 in
> drivers/input/serio.
>
> I do not want to hard-code the i8042-psmouse-atkbd dependency.
it's not a specific dependency.
it's a "I know I'm critical, so everything before me needs to be done".
that doesn't encode an actual relationship, it encodes a potential
relationship... with a worst case behavior of ... what we do right
now ;_)
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, Dec 06, 2009 at 10:31:12PM -0800, Arjan van de Ven wrote:
> On Sun, 6 Dec 2009 22:15:49 -0800
> Dmitry Torokhov <[email protected]> wrote:
>
> > And the bad thing is that it violates multiple layers in the kernel.
> > Atkbd driver does not have to be using i8042; neither does psmouse.
> > Although they do in 99% of the cases there are other controllers
> > providing the i8042-style ports. Just grep for SERIO_8042 in
> > drivers/input/serio.
> >
> > I do not want to hard-code the i8042-psmouse-atkbd dependency.
>
> it's not a specific dependency.
>
> it's a "I know I'm critical, so everything before me needs to be done".
>
> that doesn't encode an actual relationship, it encodes a potential
> relationship... with a worst case behavior of ... what we do right
> now ;_)
This is the case with every parent device, isn't it? It is important for
its children. And wasn't Rafael's patchset trying to address exactly
this?
--
Dmitry
On Sun, 6 Dec 2009 21:57:55 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
> Now, the problem remains that when you walk the device tree starting
> off all these potentially asynchronous events, you don't want to do
> that serialization part (the "parent suspend") as you walk the tree -
> because then you would only ever do one single level asynchronously.
> Which is why I suggested splitting the suspend into a "pre-suspend"
> phase (and a "post-resume" one). Because then the tree walk goes from
> I dunno. Maybe I'm overlooking something, but the above is much
> closer to what I think would be worth doing.
with what you're describing I suspect the current async function calls
could be used;
in the first tree walk, the drivers do an async_schedule() of the
things they want done asynchronous;
all the core then needs to do is a full synchronization step between the
two tree walks... and we get pretty much all the benefits without
needing the read-then-write-lock primitive for synchronization.
alternative would be to do the synchronization in the part where we
know there's a dependency (like your lock is doing);
but instead of a lock we could store the async cookie there; and just
wait on that in the 2nd phase.... this would be more finegrained, and
an optimization from the "global synchronize"... but I'm not sure it'll
be worth it in practice; it will if there's significant cost in various
parts of the tree AND in the 2nd run; if the 2nd run is cheap in
general, you're not going to get real extra parallelism at the price of
more complexity.
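A userspace sketch of the "global synchronize between the two walks" variant, for illustration only. Threads stand in for the kernel's async function calls, and joining all of them stands in for the full synchronization step; every name below (demo_async_set, demo_dev, demo_async_start, demo_async_wait_all, demo_suspend_all) is hypothetical and chosen so as not to suggest a real kernel interface.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_ASYNC 8

/* Hypothetical stand-in for "all scheduled async work". */
struct demo_async_set {
    pthread_t tid[MAX_ASYNC];
    int count;
};

struct demo_dev {
    int prepared;   /* set by the async part started in walk one */
    int suspended;  /* set by the synchronous part in walk two */
};

/* Start fn(arg) asynchronously; returns 0 on success, -1 otherwise. */
static int demo_async_start(struct demo_async_set *s,
                            void *(*fn)(void *), void *arg)
{
    if (s->count >= MAX_ASYNC)
        return -1;
    if (pthread_create(&s->tid[s->count], NULL, fn, arg) != 0)
        return -1;
    s->count++;
    return 0;
}

/* Full synchronization: wait for everything started so far. */
static void demo_async_wait_all(struct demo_async_set *s)
{
    for (int i = 0; i < s->count; i++)
        pthread_join(s->tid[i], NULL);
    s->count = 0;
}

static void *demo_slow_prepare(void *arg)
{
    struct demo_dev *d = arg;

    d->prepared = 1;    /* the slow part of the suspend would go here */
    return NULL;
}

/* Returns 1 if every device was prepared before its final suspend. */
static int demo_suspend_all(struct demo_dev *devs, int n)
{
    struct demo_async_set set = { .count = 0 };
    int ok = 1;

    /* Walk one: start the async parts. */
    for (int i = 0; i < n; i++)
        if (demo_async_start(&set, demo_slow_prepare, &devs[i]) != 0)
            demo_slow_prepare(&devs[i]);    /* run it synchronously */

    /* Single global barrier between the two walks. */
    demo_async_wait_all(&set);

    /* Walk two: finish synchronously. */
    for (int i = 0; i < n; i++) {
        ok = ok && devs[i].prepared;
        devs[i].suspended = 1;
    }
    return ok;
}
```

The trade-off mentioned above is visible in the structure: no per-device lock or cookie is stored, but walk two cannot start on any device until the slowest async part anywhere in the tree has finished.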
--
Arjan van de Ven Intel Open Source Technology Centre
On Sun, 6 Dec 2009, Linus Torvalds wrote:
>     # Phase one: walk the tree synchronously, starting any
>     # async work on the leaves
>     suspend_prepare(root)
>     {
>         for_each_child(root) {
>             // This may take the parent lock for
>             // reading if it does something async
>             suspend_prepare(child);
>         }
>         suspend_prepare_one_node(root);
>     }
>
>     # Phase two: walk the tree synchronously, waiting for
>     # and finishing the suspend
>     suspend(root)
>     {
>         for_each_child(root) {
>             suspend(child);
>         }
>         // This serializes with any async children started in phase 1
>         write_lock(root->lock);
>         suspend_one_node(root);
>         write_unlock(root->lock);
>     }
>
> and I really think this should work.
> No complex concepts. No change to existing tested drivers. No callbacks,
> no flags, no nothing. And a pretty simple way for a driver to decide: I'll
> do my suspends asynchronously (without parent drivers really ever even
> having to know about it).
>
> I dunno. Maybe I'm overlooking something, but the above is much closer to
> what I think would be worth doing.
You're overlooking resume. It's more difficult than suspend. The
issue is that a child can't start its async part until the parent's
synchronous part is finished.
So for example, suppose the device listing contains P, C, Q, where C is
a child of P, Q is unrelated, and P has a long-lasting asynchronous
requirement. The resume process will stall upon reaching C, waiting
for P to finish. Thus even though P and Q might be able to resume in
parallel, they won't get the chance.
An approach that handles resume well can probably be adapted to handle
suspend too. The reverse isn't true, as this example shows.
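The stall in the P, C, Q example can be worked through with a small timing model. This is a sketch under stated assumptions, not a measurement: the walk is synchronous, each device has a synchronous cost paid by the walking thread and an asynchronous cost paid in the background, a child may not start until its parent has completely finished, and the cost values used with it are made up. The names rdev and resume_walk_time are hypothetical.

```c
#include <assert.h>

#define MAX_DEVS 16

struct rdev {
    int parent;      /* index of the parent in the list, or -1 */
    int sync_cost;   /* time spent in the walking thread */
    int async_cost;  /* time spent in the device's async part */
};

/*
 * Walk the list in order and return the time at which the last
 * device has finished resuming. Parents must precede children.
 */
static int resume_walk_time(const struct rdev *devs, int n)
{
    int finish[MAX_DEVS];
    int t = 0, total = 0;

    assert(n <= MAX_DEVS);
    for (int i = 0; i < n; i++) {
        int p = devs[i].parent;

        assert(p < i);
        if (p >= 0 && finish[p] > t)
            t = finish[p];          /* the walk stalls on the parent */
        t += devs[i].sync_cost;
        finish[i] = t + devs[i].async_cost;
        if (finish[i] > total)
            total = finish[i];
    }
    return total;
}
```

With P = {-1, 0, 10}, C = {0, 1, 0} and Q = {-1, 0, 10}, the list order P, C, Q gives 21 time units, because Q cannot even start until the walk gets past C at time 11. Reordering the same devices as P, Q, C gives 11, since P and Q then overlap. The difference comes purely from where the walking thread blocks, which is the point of the example.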
Alan Stern
On Monday 07 December 2009, Linus Torvalds wrote:
>
> On Mon, 7 Dec 2009, Zhang Rui wrote:
> >
> > Hi, Linus,
> > can you please look at this patch set and see if the idea is right?
> > http://marc.info/?l=linux-kernel&m=124840449826386&w=2
> > http://marc.info/?l=linux-acpi&m=124840456826456&w=2
> > http://marc.info/?l=linux-acpi&m=124840456926459&w=2
> > http://marc.info/?l=linux-acpi&m=124840457026468&w=2
> > http://marc.info/?l=linux-acpi&m=124840457126471&w=2
>
> So I'm not entirely sure about that patch-set, but the thing I like about
> it is how drivers really sign up to it one by one, rather than having all
> PCI devices automatically signed up for async behavior.
>
> That said, the thing I don't like about it is some of the same thing I
> don't necessarily like about the series in Rafael's tree either:
Just for the record, it's not in there any more.
> it looks rather over-designed with the whole infrastructure for async device logic
> (your patch in http://marc.info/?l=linux-acpi&m=124840456926459&w=2). How
> would you explain that whole async_dev_register() logic in simple terms to
> somebody else?
>
> (I think yours is simpler than the one in the PM tree, but I dunno. I've
> not really compared the two).
>
> So let me explain my dislike by trying to outline some conceptually simple
> thing that doesn't have any call-backs, doesn't have any "classes",
> doesn't require registration etc. It just allows drivers at any level to
> decide to do some things (not necessarily everything) asynchronously.
>
> Here's the outline:
>
> - first off: drivers that don't know that they nest clearly don't do
> anything asynchronous. No "PCI devices can be done in parallel" crap,
> because they really can't - not in the general case. So just forget
> about that kind of logic entirely: it's just wrong.
>
> - the 'suspend' thing is a depth-first tree walk. As we suspend a node,
> we first suspend the child nodes, and then we suspend the node itself.
> Everybody agrees about that, right?
>
> - Trivial "async rule": the tree is walked synchronously, but as we walk
> it, any point in the tree may decide to do some or all of its suspend
> asynchronously. For example, when we hit a disk node, the disk driver
> may just decide that (a) it knows that the disk is an independent thing
> and (b) it's hierarchical wrt its parent so (c) it can do the disk
> suspend asynchronously.
>
> - To protect against a parent node being suspended before any async child
> work has completed, the child suspend - before it kicks off the actual
> async work - just needs to take a read-lock on the parent (read-lock,
> because you may have multiple children sharing a parent, and they don't
> lock each other out). Then the only thing the asynchronous code needs
> to do is to release the read lock when it is done.
>
> - Now, the rule just becomes that the parent has to take a write lock on
> itself when it suspends itself. That will automatically block until
> all children are done.
>
> Doesn't the above sound _simple_?
I don't think the idea is really that much simpler than the one behind the
patchset you've just rejected. The only real difference is that in that
patchset the entire suspend and resume callbacks could be either
asynchronous or synchronous and in your approach each callback may
be divided into a synchronous and an asynchronous part, which admittedly is
more flexible, but not necessarily simpler.
Now, apart from the idea there are some details that need to be taken into
consideration like the fact that the children may not be the only devices
you need to wait for with the parent suspend and that implies additional
locking rules. But you need to know which devices to lock and that has
to be represented somehow (the PM links in my patchset were for this and
_nothing_ else).
Also, it looks like the parent locking should rather be done at the core
level, as it appears to be a piece of code that needs to be called for each
device:
    if (I_have_children || I_have_other_dependent_devices)
        write_lock_myself
> Now, the problem remains that when you walk the device tree starting off
> all these potentially asynchronous events, you don't want to do that
> serialization part (the "parent suspend") as you walk the tree - because
> then you would only ever do one single level asynchronously. Which is why
> I suggested splitting the suspend into a "pre-suspend" phase (and a
> "post-resume" one). Because then the tree walk goes from
>
>     # single depth-first thing
>     suspend(root)
>     {
>         for_each_child(root) {
>             // This may take the parent lock for
>             // reading if it does something async
>             suspend(child);
>         }
>
>         // This serializes with any async children
>         write_lock(root->lock);
>         suspend_one_node(root);
>         write_unlock(root->lock);
>     }
>
> to
>
>     # Phase one: walk the tree synchronously, starting any
>     # async work on the leaves
>     suspend_prepare(root)
>     {
>         for_each_child(root) {
>             // This may take the parent lock for
>             // reading if it does something async
>             suspend_prepare(child);
>         }
>         suspend_prepare_one_node(root);
>     }
>
>     # Phase two: walk the tree synchronously, waiting for
>     # and finishing the suspend
>     suspend(root)
>     {
>         for_each_child(root) {
>             suspend(child);
>         }
>         // This serializes with any async children started in phase 1
>         write_lock(root->lock);
>         suspend_one_node(root);
>         write_unlock(root->lock);
>     }
>
> and I really think this should work.
We already have prepare and complete suspend callbacks, for a different reason,
and I'm not sure they're suitable for doing the async thing.
So, we'd need to add another two callbacks, just for suspend to RAM, and what
about hibernation? Isn't that going to become a bit too complicated?
> The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
This also was true for my patchset.
> If they don't have a 'suspend_prepare()' function, then they still see that
> exact same sequence of 'suspend()' calls.
The same held for drivers without the async_suspend flag set in my patchset
(I really should have left setting it to individual drivers).
> In fact, even if they have children that _do_ have drivers that have that
> async phase, they'll never know, because that simple write-semaphore
> trivially guarantees that whether there was async work or not, it will be
> completed by the time we call 'suspend()'.
Ditto.
> And drivers that want to do things asynchronously don't need to register
> or worry: all they do is literally
>
> - move their 'suspend()' function to 'suspend_prepare()' instead
>
> - add a
>
> down_read(dev->parent->lock);
> async_run(mysuspend, dev);
>
> to the point that they want to be asynchronous (which may be _all_ of
> it or just some slow part). The 'mysuspend' part would be the async
> part.
>
> - add a
>
> up_read(dev->parent->lock);
>
> to the end of their asynchronous 'mysuspend()' function, so that when
> the child has finished suspending, the parent down_write() will finally
> succeed.
In my patchset the drivers didn't need to do all that stuff. The only thing
they needed, if they wanted their suspend/resume to be executed
asynchronously, was to set the async_suspend flag.
But this is just for the record, in case you end up with code that's more
complicated than the rejected one.
Rafael
On Monday 07 December 2009, Dmitry Torokhov wrote:
> On Sun, Dec 06, 2009 at 10:31:12PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 22:15:49 -0800
> > Dmitry Torokhov <[email protected]> wrote:
> >
> > > And the bad thing is that it violates multiple layers in the kernel.
> > > Atkbd driver does not have to be using i8042; neither does psmouse.
> > > Although they do in 99% of the cases there are other controllers
> > > providing the i8042-style ports. Just grep for SERIO_8042 in
> > > drivers/input/serio.
> > >
> > > I do not want to hard-code the i8042-psmouse-atkbd dependency.
> >
> > it's not a specific dependency.
> >
> > it's a "I know I'm critical, so everything before me needs to be done".
> >
> > that doesn't encode an actual relationship, it encodes a potential
> > relationship... with a worst case behavior of ... what we do right
> > now ;_)
>
> This is the case with every parent device, isn't it? It is important for
> its children. And wasn't Rafael's patchset trying to address exactly
> this?
Yes, it was.
Thanks,
Rafael
On Sun, 6 Dec 2009, Linus Torvalds wrote:
> I can imagine that doing USB resume specially is worth it, since USB is
> fundamentally a pretty slow bus. But USB is also a fairly clear hierarchy,
> so there is no point in PM groups or any other information outside of the
> pure topology.
>
> But there is absolutely zero point in doing that for devices in general.
> PCI drivers simply do not want concurrent initialization. The upsides are
> basically zero (win a few msecs?) and the downsides are the pointless
> complexity. We don't do PCI discovery asynchronously either - for all the
> same reasons.
>
> Now, a PCI driver may then implement a bus that is slow (ie SCSI, ATA,
> USB), and that bus may itself then want to do something else. If it really
> is a good idea to add the whole hierarchical model to USB suspend/resume,
> I can live with that, but that is absolutely no excuse for then doing it
> for cases where the hierarchy is (a) known to be broken (ie the whole PCI
> multifunction thing, but also things like motherboard power management
> devices) and (b) don't have the same kind of slow bus issues.
Okay. I can understand not wanting to burden everybody else with USB's
weaknesses.
Simply doing an async suspend & resume of each USB root hub might be
enough to give a significant advantage. For the most part these root
hubs tend to be registered sequentially with few or no other devices in
between.[*] Hence the "stalls" that would occur when suspending a
parent or resuming a child wouldn't slow things down very much. We
would not always reap the maximum advantage of a fully asynchronous
approach but there would be some improvement.
This is sort of what Arjan suggested yesterday. Its benefit is that
nothing outside usbcore has to change.
Alan Stern
[*] In fact this is true only on systems where the USB host controller
drivers are built as modules. If everything is compiled into the
kernel then the devices are registered in the worst possible order:
controller 1, root hub 1, controller 2, root hub 2, ...
I suppose the root hubs could be registered in a delayed work routine.
It would be a little awkward but it would solve this issue.
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > I dunno. Maybe I'm overlooking something, but the above is much closer to
> > what I think would be worth doing.
>
> You're overlooking resume. It's more difficult than suspend. The
> issue is that a child can't start its async part until the parent's
> synchronous part is finished.
No, I haven't overlooked resume at all. I just assumed that it was
obvious. It's the exact same thing, except in reverse (the locking ends
up being slightly different, but the changes are actually fairly
straightforward).
And by reverse, I mean that you walk the tree in the reverse order too,
exactly like we already do - on suspend we walk it children-first, on
resume we walk it parents-first (small detail: we actually just walk a
simple linked list, but the list is topologically ordered, so walking it
forwards/backwards is topologically the same thing as doing that
depth-first search).
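As a rough userspace illustration of that equivalence (a sketch only, not kernel code: struct dev, register_dev() and the array standing in for dpm_list are made-up names), registering devices parent-before-child yields a list where a forward walk visits parents first and a backward walk visits children first:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 16

/* Hypothetical device record: only the parent link matters here. */
struct dev {
    const char *name;
    struct dev *parent;
};

/* Stand-in for dpm_list: devices kept in registration order. */
static struct dev *dpm_list[MAX_DEVS];
static int dpm_count;

static struct dev dev_root, dev_p, dev_c, dev_q;

static void register_dev(struct dev *d, const char *name, struct dev *parent)
{
    d->name = name;
    d->parent = parent;
    dpm_list[dpm_count++] = d;
}

static int index_of(const struct dev *d)
{
    for (int i = 0; i < dpm_count; i++)
        if (dpm_list[i] == d)
            return i;
    return -1;
}

/* Returns 1 if no device appears in the list before its parent. */
static int parents_come_first(void)
{
    for (int i = 0; i < dpm_count; i++) {
        struct dev *p = dpm_list[i]->parent;

        if (p && index_of(p) > i)
            return 0;
    }
    return 1;
}

/* Position at which d is visited on suspend (list walked backwards). */
static int suspend_position(const struct dev *d)
{
    return dpm_count - 1 - index_of(d);
}

/* Registers root -> { P -> C, Q } and checks the resulting ordering. */
static int demo_topological_order(void)
{
    dpm_count = 0;
    register_dev(&dev_root, "root", NULL);
    register_dev(&dev_p, "P", &dev_root);
    register_dev(&dev_c, "C", &dev_p);
    register_dev(&dev_q, "Q", &dev_root);
    return parents_come_first();
}
```

Calling demo_topological_order() builds the list; suspend_position() then shows that the backward walk reaches C before P, which is all the "children first" rule needs.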
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement. The resume process will stall upon reaching C, waiting
> for P to finish. Thus even though P and Q might be able to resume in
> parallel, they won't get the chance.
No. The resume process does EXACTLY THE SAME THING as I outlined for
suspend, but just all in reverse. So now the resume process becomes the
same two-phase thing:
    # Phase one
    resume(root)
    {
        // This can do things asynchronously if it wants,
        // but needs to take the write lock on itself until
        // it is done if it does
        resume_one_node(root);
        for_each_child(root)
            resume(child);
    }

    # Phase two
    post_resume(root)
    {
        post_resume_one_node(root);
        for_each_child(root)
            post_resume(child);
    }
Notice? It's _exactly_ the same thing as suspend - except all turned
around. We do the nodes before the children ("walk the list backwards"),
and we also do the locking the other way around (ie on suspend we'd lock
the _parent_ if we wanted to do async stuff - to keep it around - but on
resume we lock _ourselves_, so that the children can have something to
wait on. Also note how we take a _write_ lock rather than a read lock).
(And again, I've only written it out in email, I've not tested it or
thought about it all that deeply, so you'll excuse any stupid thinkos.)
Now, for something like PCI, I'd suggest (once more) leaving all drivers
totally unchanged, and you end up with the exact same behavior as we had
before (no real change to the whole resume ordering, and everything is
synchronous so there is no relevant locking).
But how would the USB layer do this?
Simple: all the normal leaf devices would have their resume callback be
called at "post_resume()" time (exactly the reverse of the suspend phase:
we suspend early, and we resume late - it's all a mirror image). And I'd
suggest that the USB layer do it all totally asynchronously, except again
turned around the other way.
Remember how on suspend, the suspend of a leaf device ended up being an
issue of asynchronously calling a function that did the suspend, and then
released the read-lock of the parent. Resume is the same, except now we'd
actually want to take the parent read-lock asynchronously too, so you'd do
    down_write(leaf->lock);
    async_schedule(usb_node_resume, leaf);

where that function simply does

    usb_node_resume(node)
    {
        /* Wait for the parent to have resumed completely */
        down_read(node->parent->lock);
        node->resume(node);
        up_read(node->parent->lock);
        up_write(node->lock);
    }
and you're all done. Once more the ordering and the locking takes care of
any need to serialize - there are no data structures to keep track of.
And what about USB hubs? They get resumed in the first phase (again,
exactly the mirror image of the suspend), and the only thing they need to
do is that _exact_ same thing above:
    down_write(hub->lock);
    async_schedule(usb_node_resume, hub);
- Ta-daa! All done.
Notice? It's really pretty straightforward, and there are _zero_ new
concepts. And again, no callbacks, no nothing. Just the obvious mirror
image of what happened when suspending. We do everything with simple async
calls. And none of the tree walking actually blocks (yes, we do a
"down_write()" on the nodes as we schedule the resume code, but it won't
be a blocking one, since that is the first time we encounter that node:
the blocking will come later when the async threads actually need to wait
for things).
Again, I do not guarantee that I've dotted every i, and crossed every t.
It's just that I'm pretty sure that we really don't need any fancy
"infrastructure" for something this simple. And I really much prefer
"conceptually simple high-level model" over a model of "keep track of all
the relationships and have some complex model of devices".
So let's just look at your example:
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement.
The tree is:
    ... -> P -> C
        -> Q
and with what I suggest, during phase one, P will asynchronously start the
resume. As part of its async resume it will have to wait for its parents,
of course, but all of that happens in a separate context, and the tree
traversal goes on.
And during phase #1, C and Q won't do anything at all. We _could_ do them
during this phase, and it would actually all work out fine, but we
wouldn't want to do that for a simple reason: we _want_ the pre_suspend
and post_resume phases to be total mirror images, because if we end up
doing error handling for the pre-suspend case, then the post-resume phase
would be the "fixup" for it, so we actually want leaf things to happen
during phase #2 - not because it would screw up locking or ordering, but
because of other issues.
When we hit phase #2, we then do C and Q, and do the same thing - we have
an async call that does the read-lock on the parent to make sure it's
all resumed, and then we resume C and Q. And they'll automatically resume
in parallel (unless C is waiting for P, of course, in which case P and Q
end up resuming in parallel, and C ends up waiting).
Now, the above just takes care of the inter-device ordering. There are
unrelated semantics we want to give, like "all devices will have resumed
before we start waking up user space". Those are unrelated to the
topological requirements, of course, and are not a requirement imposed by
the device tree, but by our _other_ semantics (IOW, in this respect it's
kind of like how we wanted pre-suspend and post-resume to be mirror images
for other outside reasons).
So we'd actually have a "phase #3", but that phase wouldn't be visible to
the devices themselves, it would be a
    # Phase three: make sure everything is resumed
    for_each_device() {
        down_read(dev->lock);
        up_read(dev->lock);
    }
but as you can see, there's no actual device callbacks involved. It would
be just the core device layer saying "ok, now I'm going to wait for all
the devices to have finished their resume".
Linus
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> > The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
>
> This also was true for my patchset.
That's simply not true.
You set async_suspend on every single PCI driver. I object very heavily to
it.
You also introduce this whole big "callback when ready", and
"non-topological PM dependency chain" thing. Which I also object to.
Notice how with the simpler "lock parent" model, you _can_ actually encode
non-topological dependencies, but you do it by simply read-locking
whatever other independent device you want. So if an architecture has some
system devices that have odd rules, that architecture can simply encode
those rules in its suspend() functions.
It doesn't need to expose it to the device layer - because the device
layer won't even care. The code will just automatically "do the right
thing" without even having that notion of PM dependencies at any other
level than the driver that knows about it.
No registration, no callbacks, no nothing.
> In my patchset the drivers didn't need to do all that stuff. The only thing
> they needed, if they wanted their suspend/resume to be executed
> asynchronously, was to set the async_suspend flag.
In my patchset, the drivers don't need to either.
The _only_ thing that would do this is something like the USB layer. We're
talking ten lines of code or so. And you get rid of all the PM
dependencies and all the infrastructure - because the model is so simple
that it doesn't need any.
(Well, except for the infrastructure to run things asynchronously, but
that was kind of my point from the very beginning: we can just re-use all
that existing async infrastructure. We already have that).
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
>
> And during phase #1, C and Q won't do anything at all. We _could_ do them
> during this phase, and it would actually all work out fine, but we
> wouldn't want to do that for a simple reason: we _want_ the pre_suspend
> and post_resume phases to be total mirror images, because if we end up
> doing error handling for the pre-suspend case, then the post-resume phase
> would be the "fixup" for it, so we actually want leaf things to happen
> during phase #2 - not because it would screw up locking or ordering, but
> because of other issues.
Ho humm.
This part made me think. Since I started mulling over the fact that we
could do the resume thing in a single phase (and really only wanted the
second phase in order to be a mirror image to the suspend), I started
thinking that we could perhaps do even the suspend with a single phase,
and avoid introducing that pre-suspend/post-resume phase at all.
And now that I think about it, we can do that by simply changing the
locking just a tiny bit.
I originally envisioned that two-phase suspend because I was thinking that
the first phase would start off the suspend, and the second phase would
finish it, but we can actually do it all with a single phase that does
both. So starting with just the regular depth-first post-ordering that is
a suspend:
    suspend(root)
    {
        for_each_child(root)
            suspend(child);
        suspend_one_node(root);
    }
the rule would be that for something like USB that wants to do the suspend
asynchronously, the node suspend routine would do
    usb_node_suspend(node)
    {
        // Make sure parent doesn't suspend: this will not block,
        // because we'll call the 'suspend' function for all nodes
        // before we call it for the parent.
        down_read(node->parent->lock);

        // Do the part that may block asynchronously
        async_schedule(do_usb_node_suspend, node);
    }

    do_usb_node_suspend(node)
    {
        // Start out suspend. This will block if we have any
        // children that are still busy suspending (they will
        // have done a down_read() in their suspend).
        down_write(node->lock);
        node->suspend(node);
        up_write(node->lock);

        // This lets our parent continue
        up_read(node->parent->lock);
    }
and it looks like we don't even need a second phase at all.
IOW, I think USB could do this on its own right now, with no extra
infrastructure from the device layer AT ALL, except for one small thing:
that new "rwsem" lock in the device data structure, and then we'd need the
"wait for everybody to have completed" loop, ie
    for_each_dev(dev) {
        down_write(dev->lock);
        up_write(dev->lock);
    }
thing at the end of the suspend loop (same thing as I mentioned about
resuming).
So I think even that whole two-phase thing was unnecessarily complicated.
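A userspace sketch of that single-pass suspend, for illustration only. Everything here is an assumption rather than kernel code: struct rwsem is a small mutex-and-condvar stand-in (pthread_rwlock_t would not do, because the read lock is taken in the walking thread and dropped in the async thread), pthread_create() stands in for async_schedule(), and pthread_join() stands in for the final "wait for everybody" loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for a kernel rwsem; may be released by a thread other than
 * the one that acquired it, which this locking scheme relies on. */
struct rwsem {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int readers;
    int writer;
};
#define RWSEM_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

static void down_read(struct rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    while (s->writer)
        pthread_cond_wait(&s->cv, &s->mu);
    s->readers++;
    pthread_mutex_unlock(&s->mu);
}

static void up_read(struct rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    s->readers--;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

static void down_write(struct rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    while (s->writer || s->readers > 0)
        pthread_cond_wait(&s->cv, &s->mu);
    s->writer = 1;
    pthread_mutex_unlock(&s->mu);
}

static void up_write(struct rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    s->writer = 0;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

struct node {
    struct rwsem lock;
    struct node *parent;
    pthread_t thread;
    int order;      /* 1-based order in which the suspend completed */
};

static pthread_mutex_t seq_mu = PTHREAD_MUTEX_INITIALIZER;
static int seq;

static void *do_node_suspend(void *arg)
{
    struct node *n = arg;

    down_write(&n->lock);           /* waits for children still suspending */
    pthread_mutex_lock(&seq_mu);
    n->order = ++seq;               /* the real suspend work would go here */
    pthread_mutex_unlock(&seq_mu);
    up_write(&n->lock);
    if (n->parent)
        up_read(&n->parent->lock);  /* lets the parent continue */
    return NULL;
}

static void node_suspend(struct node *n)
{
    if (n->parent)
        down_read(&n->parent->lock);    /* keep the parent from suspending */
    pthread_create(&n->thread, NULL, do_node_suspend, n);
}

/* Suspends a root with two children; returns 1 if the root finished last. */
static int demo_single_pass_suspend(void)
{
    static struct node root = { .lock = RWSEM_INIT };
    static struct node c1 = { .lock = RWSEM_INIT }, c2 = { .lock = RWSEM_INIT };
    struct node *list[] = { &c1, &c2, &root };  /* children first */

    seq = 0;
    c1.parent = c2.parent = &root;
    for (int i = 0; i < 3; i++)
        node_suspend(list[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(list[i]->thread, NULL);    /* wait for everybody */
    return root.order == 3 && c1.order > 0 && c2.order > 0;
}
```

The walking thread never blocks in this example: each down_read() on the parent happens before the parent's own async task is started, so the only waiting is done inside the async tasks.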
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
> No, I haven't overlooked resume at all. I just assumed that it was
> obvious. It's the exact same thing, except in reverse (the locking ends
> up being slightly different, but the changes are actually fairly
> straightforward).
>
> And by reverse, I mean that you walk the tree in the reverse order too,
> exactly like we already do - on suspend we walk it children-first, on
> resume we walk it parents-first (small detail: we actually just walk a
> simple linked list, but the list is topologically ordered, so walking it
> forwards/backwards is topologically the same thing as doing that
> depth-first search).
> Notice? It's _exactly_ the same thing as suspend - except all turned
> around. We do the nodes before the children ("walk the list backwards"),
> and we also do the locking the other way around (ie on suspend we'd lock
> the _parent_ if we wanted to do async stuff - to keep it around - but on
> resume we lock _ourselves_, so that the children can have something to
> wait on. Also note how we take a _write_ lock rather than a read lock).
Okay, I think I've got it. But you're wrong about one thing: Resume
isn't _exactly_ the reverse of suspend. For both of them we have to
start the async thread in the first pass. So instead of
resume/post_resume we would have pre_resume/resume, just like
pre_suspend/suspend.
During the pre- pass, the driver launches an async thread and takes the
appropriate locks. The thread does its work as appropriate (with
locking to ensure that it first waits for children or parents), and
then in the second pass the driver waits for the async thread to
finish.
A non-async driver (i.e., most of them) would ignore the pre- pass
entirely and do all its work in the second pass.
An async-aware driver would look like this:
    pre_suspend(dev)
    {
        /* Prevent parent from suspending until we are ready */
        down_read(dev->parent->lock);
        dev->pm_cookie = async_schedule(async_suspend, dev);
    }

    async_suspend(dev)
    {
        /* Wait until all children are fully suspended */
        down_write(dev->lock);
        Suspend dev, taking as much time as needed
        up_write(dev->lock);
        /* Allow parent to suspend */
        up_read(dev->parent->lock);
    }

    suspend(dev)
    {
        /* Wait until the suspend is complete */
        async_synchronize_cookie(dev->pm_cookie);
    }

    pre_resume(dev)
    {
        /* Prevent children from resuming */
        down_write(dev->lock);
        dev->pm_cookie = async_schedule(async_resume, dev);
    }

    async_resume(dev)
    {
        /* Wait until parent is fully resumed */
        down_read(dev->parent->lock);
        Resume dev, taking as much time as needed
        up_read(dev->parent->lock);
        /* Allow children to resume */
        up_write(dev->lock);
    }

    resume(dev)
    {
        /* Wait until resume is complete */
        async_synchronize_cookie(dev->pm_cookie);
    }
So there's some time symmetry here, but it isn't perfect. This is
probably what you had in mind all along, but I needed to get it
straight.
There's some question about what to do if a suspend or resume fails. A
bunch of async threads will have been launched for other devices, but
now there won't be anything to wait for them. It's not clear how this
should be handled.
Alan Stern
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Okay, I think I've got it. But you're wrong about one thing: Resume
> isn't _exactly_ the reverse of suspend.
Yeah, no. But I think I made it much closer by getting rid of pre-suspend
and post-resume (my next email to the one you quoted).
And yeah, I started thinking along those lines exactly because it wasn't
as clean a mirror image as I thought it should be able to be.
> A non-async driver (i.e., most of them) would ignore the pre- pass
> entirely and do all its work in the second pass.
See my second email, where I think I can get rid of the whole second pass
thing. I think you'll agree that it's an even nicer mirror image.
> There's some question about what to do if a suspend or resume fails. A
> bunch of async threads will have been launched for other devices, but
> now there won't be anything to wait for them. It's not clear how this
> should be handled.
I think the rule for "suspend fails" is very simple: you can't fail in the
async codepath. There's no sane way to return errors, and trying to would
be too complex anyway. What would you do?
In fact, even though we _can_ fail in the synchronous path, I personally
consider a device driver that ever fails its suspend to be terminally
broken. We're practically always better off suspending and simply turning
off the power than saying "uh, I failed the suspend".
I've occasionally hit a few drivers that caused suspend failures, and each
and every time it was a driver bug, and the right thing to do was to just
ignore the error and suspend anyway - returning an error code and trying
to undo the suspend is not what anybody ever really wants, even if our
model _allows_ for it.
(And the rule for "resume fails" is even simpler: there's nothing we can
really do if something fails to resume - and that's true whether the
failure is synchronous or asynchronous. The device is dead. Try to reset
it, or remove it from the device tree. Tough).
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
> See my second email, where I think I can get rid of the whole second pass
> thing. I think you'll agree that it's an even nicer mirror image.
Yes, I like this approach better and better.
There is still a problem. In your code outlines, you have presented a
classic depth-first (suspend) or depth-last (resume) tree algorithm.
But that's not how the PM core works. Instead it maintains dpm_list, a
list of all devices in order of registration. Suspends and resumes are
carried out by iterating along this list, in the reverse and forward
directions respectively.
There are two advantages. The matter of stack usage, of course. But
more importantly, this order of devices is guaranteed to work. For any
device D, we _know_ that the system can function properly in
circumstances where everything on dpm_list before D is active and
everything after D is inactive -- because that's the state the system
was in when D was registered. Any other order risks errors because of
unknown dependencies.
The consequence is that there's no way to hand off an entire subtree to
an async thread. And as a result, your single-pass algorithm runs into
the kind of "stall" problem I described before.
(In theory we could convert over to a tree algorithm. IMO that would
be nearly as dangerous as going to a full-fledged totally async
scheme.)
But all is not lost. We can still get what we want using a two-pass
list algorithm, where one of the passes is contained within the PM core
-- no extra callbacks are needed. Here's how suspend would work:
    dpm_suspend()    /* Suspend all devices on dpm_list */
    {
        list_for_each_entry_reverse(dev, dpm_list, ...) {
            /* Make the parent wait for dev */
            down_read(dev->parent->lock);
            if (dev->async_pm)
                async_schedule(device_suspend, dev);
        }
        list_for_each_entry_reverse(dev, dpm_list, ...) {
            if (!dev->async_pm)
                device_suspend(dev);
        }
        async_synchronize_full();
    }

    device_suspend(dev)    /* Suspend a single device */
    {
        /* Wait until all the children are suspended */
        down_write(dev->lock);
        dev->bus->suspend(dev);
        up_write(dev->lock);
        /* Tell the parent we are finished */
        up_read(dev->parent->lock);
    }
I have glossed over a bunch of details, such as the fact that
device_suspend() really takes two arguments. And it's necessary to be
more careful with the list operations than shown here, because devices
can be unregistered while all this is going on.
Still, this seems reasonable. Bus subsystems and drivers can set the
dev->async_pm flag as desired, and they can use the new rwsems to
handle special dependencies without involving the PM core. No new
callbacks are needed, nor any changes to existing methods.
(Convincing lockdep that all this fancy footwork is valid may require
some effort, though.)
By the way, this bears a striking resemblance to Rafael's patch. The
biggest difference is the use of the new rwsem for dependency
resolution, instead of his somewhat cumbersome constraint structures.
> > There's some question about what to do if a suspend or resume fails. A
> > bunch of async threads will have been launched for other devices, but
> > now there won't be anything to wait for them. It's not clear how this
> > should be handled.
>
> I think the rule for "suspend fails" is very simple: you can't fail in the
> async codepath. There's no sane way to return errors, and trying to would
> be too complex anyway. What would you do?
You could prevent the suspend procedure from going any further and
abort the entire system sleep. If you wanted to.
> In fact, even though we _can_ fail in the synchronous path, I personally
> consider a device driver that ever fails its suspend to be terminally
> broken. We're practically always better off suspending and simply turning
> off the power than saying "uh, I failed the suspend".
>
> I've occasionally hit a few drivers that caused suspend failures, and each
> and every time it was a driver bug, and the right thing to do was to just
> ignore the error and suspend anyway - returning an error code and trying
> to undo the suspend is not what anybody ever really wants, even if our
> model _allows_ for it.
There is a valid reason for aborting a sleep transition: the driver has
received a remote wakeup request. Wakeup requests race with sleep, of
course. A request coming after the system is asleep will wake it up;
one coming before the system is asleep should either cause it to wake
up immediately after shutting down or prevent the sleep entirely.
Causing the system to wake up immediately needs hardware support. But
by the time the kernel is aware of a wakeup request, the request is
generally no longer present in the hardware. (For example, an
interrupt has been delivered and the IRQ line is no longer active.)
So the only remaining choice is to abort the sleep transition.
> (And the rule for "resume fails" is even simpler: there's nothing we can
> really do if something fails to resume - and that's true whether the
> failure is synchronous or asynchronous. The device is dead. Try to reset
> it, or remove it from the device tree. Tough).
Right.
Alan Stern
On Monday 07 December 2009, Linus Torvalds wrote:
>
> On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
> >
> > This also was true for my patchset.
>
> That's simply not true.
>
> You set async_suspend on every single PCI driver. I object very heavily to
> it.
That was a mistake, I admit.
However, it was done in a separate patch that (1) was not necessary and (2)
shouldn't have been there. Sorry for making the mistake of including that into
the patchset. So I understand your objection to that and let's not get back to
this again, ok?
> You also introduce this whole big "callback when ready", and
> "non-topological PM dependency chain" thing. Which I also object to.
These things are also non-essential. Actually they weren't there in the initial
version of my patches and were added after people had complained that it had
not been parallel enough and hadn't taken the off-tree dependencies into account.
I could remove these things too, and quite easily.
> Notice how with the simpler "lock parent" model, you _can_ actually encode
> non-topological dependencies, but you do it by simply read-locking
> whatever other independent device you want. So if an architecture has some
> system devices that have odd rules, that architecture can simply encode
> those rules in its suspend() functions.
I'm not arguing against that. In fact, my only worry was the additional
suspend/resume callbacks, which I really wouldn't like to introduce. But since you've
found a way of doing things without them, I'm totally fine with this approach.
> It doesn't need to expose it to the device layer - because the device
> layer won't even care. The code will just automatically "do the right
> thing" without even having that notion of PM dependencies at any other
> level than the driver that knows about it.
>
> No registration, no callbacks, no nothing.
>
> > In my patchset the drivers didn't need to do all that stuff. The only thing
> > they needed, if they wanted their suspend/resume to be executed
> > asynchronously, was to set the async_suspend flag.
>
> In my patchset, the drivers don't need to either.
>
> The _only_ thing that would do this is something like the USB layer. We're
> talking ten lines of code or so. And you get rid of all the PM
> dependencies and all the infrastructure - because the model is so simple
> that it doesn't need any.
It just uses a different way of representing these things, perhaps more
efficiently.
> (Well, except for the infrastructure to run things asynchronously, but
> that was kind of my point from the very beginning: we can just re-use all
> that existing async infrastructure. We already have that).
So I guess the only thing we need at the core level is to call
async_synchronize_full() after every stage of suspend/resume, right?
Rafael
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Yes, I like this approach better and better.
>
> There is still a problem. In your code outlines, you have presented a
> classic depth-first (suspend) or depth-last (resume) tree algorithm.
Yes, I did that because that clarifies the locking rules (ie "we traverse
parent nodes last/first"), not because it was actually relevant to
anything else.
And the whole pre-order vs post-order is important, and really only shows
up when you show the pseudo-code as a tree walk.
> But that's not how the PM core works. Instead it maintains dpm_list, a
> list of all devices in order of registration.
Right. I did mention that in a couple of the asides, I'm well aware that
we don't actually traverse the tree as a tree.
But the "traverse list forward" is logically the same thing as doing
a pre-order DFS, while going backwards is equivalent to doing a post-order
DFS, since all we really care about is the whole "parent first" or
"children first" part of the ordering.
So I wanted to show the logic in pseudo-code using the tree walk (because
that explains the logic _conceptually_ much better), but the actual code
would just do the list traversal.
> The consequence is that there's no way to hand off an entire subtree to
> an async thread. And as a result, your single-pass algorithm runs into
> the kind of "stall" problem I described before.
No, look again. There's no stall in the thing, because all it really
depends on (for the suspend path) is that it sees all children before
the parent (because the child will do a "down_read()" on the parent node
and that should not stall), and for the resume path it depends on seeing
the parent node before any children (because the parent node does that
"down_write()" on its own node).
Everything else is _entirely_ asynchronous, including all the other locks
it takes. So there are no stalls (except, of course, if we then hit limits
on numbers of outstanding async work and refuse to create too many
outstanding async things, but that's a separate issue, and intentional, of
course).
You're right that my first one (two-phase suspend) had a stall situation.
Linus
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> So I guess the only thing we need at the core level is to call
> async_synchronize_full() after every stage of suspend/resume, right?
Yes and no.
Yes in the sense that _if_ everybody always uses "async_schedule()" (or
whatever the call is named - I've really only written pseudo-code and
haven't even tried to look up the details), then the only thing you need
to do is async_synchronize_full().
But one of the nice things about using just the trivial rwlock model and
letting any async users just depend on that is that we could easily just
depend entirely on those device locks, and allow drivers to do async
shutdowns other ways too.
For example, I could imagine some driver just doing an async suspend (or
resume) that gets completed in an interrupt context, rather than being
done by 'async_schedule()' at all.
So in many ways it's nicer to serialize by just doing
    serialize_all_PM_events()
    {
        for_each_device() {
            down_write(dev->lock);
            up_write(dev->lock);
        }
    }
rather than depend on something like async_synchronize_full() that
obviously waits for all async events, but doesn't have the capability to
wait for any other event that some random driver might be using.
[ That "down+up" is kind of stupid, but I don't think we have a "wait for
unlocked" rwsem operation. We could add one, and it would be cheaper for
the case where the device never did anything async at all, and didn't
really need to dirty that cacheline by doing that write lock/unlock
pair. ]
But that really isn't a big deal. I think it would be perfectly ok to also
just say "if you do any async PM, you need to use 'async_schedule()'
because that's all we're going to wait for". It's probably perfectly fine.
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
> > The consequence is that there's no way to hand off an entire subtree to
> > an async thread. And as a result, your single-pass algorithm runs into
> > the kind of "stall" problem I described before.
>
> No, look again. There's no stall in the thing, because all it really
> depends on (for the suspend path) is that it sees all children before
> the parent (because the child will do a "down_read()" on the parent node
> and that should not stall), and for the resume path it depends on seeing
> the parent node before any children (because the parent node does that
> "down_write()" on its own node).
>
> Everything else is _entirely_ asynchronous, including all the other locks
> it takes. So there are no stalls (except, of course, if we then hit limits
> on numbers of outstanding async work and refuse to create too many
> outstanding async things, but that's a separate issue, and intentional, of
> course).
It only seems that way because you didn't take into account devices
that suspend synchronously but whose children suspend asynchronously.
A synchronous suspend routine for a device with async child suspends
would have to look just like your usb_node_suspend():
    suspend_one_node(dev)
    {
        /* Wait until the children are suspended */
        down_write(dev->lock);
        Suspend dev
        up_write(dev->lock);
        /* Allow the parent to suspend */
        up_read(dev->parent->lock);
    }
So now suppose we've got two USB host controllers, A and B. They are
PCI devices, so they suspend synchronously. Each has a root hub child
(P and Q respectively) which is a USB device and therefore suspends
asynchronously. dpm_list contains: A, P, B, Q. (In fact A doesn't
enter into this discussion; you can ignore it.)
In your one-pass algorithm, we start with usb_node_suspend(Q). It does
down_read(B->lock) and starts an async task for Q. Then we move on to
suspend_one_node(B). It does down_write(B->lock) and blocks until the
async task finishes; then it suspends B. Finally we move on to
usb_node_suspend(P), which does down_read(A->lock) and starts an async
task for P.
The upshot is that P is stuck waiting for Q to suspend, even though it
should have been able to suspend in parallel. This is simply because P
precedes B in the list, and B is synchronous and must wait for Q to
finish.
With my two-pass algorithm, we start with Q. The first loop does
down_read(B->lock) and starts an async task for Q. We move on to B and
do down_read(B->parent->lock), nothing more. Then we move on to P,
with down_read(A->lock), and start an async task for P. Finally, for A,
we do down_read(A->parent->lock). Notice that now there are two async tasks,
for P and Q, running in parallel.
The second pass waits for Q to finish before suspending B
synchronously, and waits for P to finish before suspending A
synchronously. This is unavoidable. The point is that it allows P and
Q to suspend at the same time, not one after the other as in the
one-pass scheme.
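
To make the difference concrete, here is a toy timing model of the A, P,
B, Q example. It is only a sketch: the struct, the function names and the
durations (1 time unit for each controller, 3 for each root hub) are made
up for illustration, and it models nothing but "who has to wait for whom".
The list is walked in reverse, a synchronous device blocks the walker
until its children are done, and an asynchronous device either starts when
the walker reaches it (one pass) or was already started by a first pass
(two passes).

```c
#include <stddef.h>

#define TOY_MAX_DEVS 16

/* Toy model of one dpm_list entry; every value here is illustrative. */
struct toy_dev {
	int parent;	/* index of the parent in the list, or -1 */
	int is_async;	/* nonzero: suspend runs in an async task */
	int duration;	/* made-up time units the suspend takes */
};

/* The example from the text, in dpm_list order: A, P, B, Q. */
static const struct toy_dev toy_example[4] = {
	{ -1, 0, 1 },	/* A: host controller, synchronous */
	{  0, 1, 3 },	/* P: root hub under A, asynchronous */
	{ -1, 0, 1 },	/* B: host controller, synchronous */
	{  2, 1, 3 },	/* Q: root hub under B, asynchronous */
};

static int toy_max(int a, int b)
{
	return a > b ? a : b;
}

/*
 * Return the time at which the last device has finished suspending,
 * or -1 if the list is too long for this toy model.
 *
 * two_pass == 0: an async task starts only when the walker reaches it.
 * two_pass != 0: all async tasks were started up front by a first pass.
 */
static int toy_suspend_time(const struct toy_dev *devs, size_t n, int two_pass)
{
	int finish[TOY_MAX_DEVS] = { 0 };
	int walker = 0;		/* time at which the list walker is free */
	int total = 0;
	size_t i, j;

	if (n > TOY_MAX_DEVS)
		return -1;

	/* Children come later in the list, so walk it backwards. */
	for (i = n; i-- > 0; ) {
		int children_done = 0;
		int start;

		for (j = i + 1; j < n; j++)
			if (devs[j].parent == (int)i)
				children_done = toy_max(children_done, finish[j]);

		if (devs[i].is_async && two_pass)
			start = children_done;
		else
			start = toy_max(walker, children_done);

		finish[i] = start + devs[i].duration;
		if (!devs[i].is_async)
			walker = finish[i];	/* sync devices block the walker */
		total = toy_max(total, finish[i]);
	}
	return total;
}
```

With these made-up numbers, toy_suspend_time(toy_example, 4, 0) gives 8
and toy_suspend_time(toy_example, 4, 1) gives 5: in the one-pass case P
cannot start until B is done, in the two-pass case P and Q overlap.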
Alan Stern
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> It only seems that way because you didn't take into account devices
> that suspend synchronously but whose children suspend asynchronously.
But why would I care? If somebody suspends synchronously, then that's what
he wants.
> A synchronous suspend routine for a device with async child suspends
> would have to look just like your usb_node_suspend():
Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
Don't do that.
Make the USB host controller do its suspend asynchronously. We don't
suspend PCI bridges anyway, iirc (but I didn't actually check). And at
worst, we can make the PCI _bridges_ know about async suspends, and solve
it that way - without actually making any normal PCI drivers do it.
Linus
On Monday 07 December 2009, Linus Torvalds wrote:
>
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > It only seems that way because you didn't take into account devices
> > that suspend synchronously but whose children suspend asynchronously.
>
> But why would I care? If somebody suspends synchronously, then that's what
> he wants.
>
> > A synchronous suspend routine for a device with async child suspends
> > would have to look just like your usb_node_suspend():
>
> Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
> Don't do that.
>
> Make the USB host controller do its suspend asynchronously. We don't
> suspend PCI bridges anyway, iirc (but I didn't actually check).
That's correct, we don't.
Rafael
On Mon, 7 Dec 2009, Linus Torvalds wrote:
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > It only seems that way because you didn't take into account devices
> > that suspend synchronously but whose children suspend asynchronously.
>
> But why would I care? If somebody suspends synchronously, then that's what
> he wants.
It doesn't mean he wants to block unrelated devices from suspending
asynchronously, merely because they happen to come earlier in the list.
> > A synchronous suspend routine for a device with async child suspends
> > would have to look just like your usb_node_suspend():
>
> Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
> Don't do that.
>
> Make the USB host controller do its suspend asynchronously. We don't
> suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> worst, we can make the PCI _bridges_ know about async suspends, and solve
> it that way - without actually making any normal PCI drivers do it.
This sounds suspiciously like pushing the problem up a level and
hoping it will go away. (Sometimes that even works.)
In the end it isn't a very big issue. Using one vs. two passes in
dpm_suspend() is pretty unimportant.
Alan Stern
P.S.: In fact I planned all along to handle USB host controllers
asynchronously anyway, since their resume routines contain some long
delays. I was merely using them as an example.
On Monday 07 December 2009, Linus Torvalds wrote:
>
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > It only seems that way because you didn't take into account devices
> > that suspend synchronously but whose children suspend asynchronously.
>
> But why would I care? If somebody suspends synchronously, then that's what
> he wants.
>
> > A synchronous suspend routine for a device with async child suspends
> > would have to look just like your usb_node_suspend():
>
> Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
> Don't do that.
>
> Make the USB host controller do its suspend asynchronously. We don't
> suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> worst, we can make the PCI _bridges_ know about async suspends, and solve
> it that way - without actually making any normal PCI drivers do it.
BTW, I still don't quite understand why not to put the parent's down_write
operation into the core. It's not going to hurt for the "synchronous" devices
and the "asynchronous" ones will need to do it anyway.
Also it looks like that's something to do unconditionally for all nodes
having children, because the parent need not know if the children do async
operations.
Rafael
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > Make the USB host controller do its suspend asynchronously. We don't
> > suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> > worst, we can make the PCI _bridges_ know about async suspends, and solve
> > it that way - without actually making any normal PCI drivers do it.
>
> This sounds suspiciously like pushing the problem up a level and
> hoping it will go away. (Sometimes that even works.)
The "we don't suspend bridges anyway" is definitely a "hoping it will go
away" issue. I think we did suspend bridges for a short while during the
PM switch-over some time ago, and it worked most of the time, and then on
some machines it just didn't work at all. Probably because ACPI ends up
touching registers behind bridges that we closed down etc.
So PCI bridges are kind of special. Right now we don't touch them, and if
we ever do, that will be another issue.
> In the end it isn't a very big issue. Using one vs. two passes in
> dpm_suspend() is pretty unimportant.
I also suspect that even if you do the USB host controller suspend
synchronously, doing the actual USB devices asynchronously would still
help - even if it's only "asynchronously per bus" thing.
So in fact, it's probably a good first step to start off doing only the
USB devices, not the controller.
Linus
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> BTW, I still don't quite understand why not to put the parent's down_write
> operation into the core. It's not going to hurt for the "synchronous" devices
> and the "asynchronous" ones will need to do it anyway.
That's what I started out doing (see the first pseudo-code with the two
phases). But it _does_ actually hurt.
Because it will hurt exactly for the "multiple hubs" case: if you have two
USB hubs in parallel (and the case that Alan pointed out about a USB host
bridge is the exact same deal), then you want to be able to suspend and
resume those two independent hubs in parallel too.
But if you do the "down_write()" synchronously in the core, that means
that you are also stopping the whole "traverse the tree" thing - so now
you aren't handling the hubs in parallel even if you are handling all the
devices _behind_ them asynchronously.
This "serialize while traversing the tree" was what I was initially trying
to avoid with the two-phase approach, but I then realized (after writing
the resume path) that I could avoid it much better by just moving the
parent's down_write into the asynchronous path.
> Also it looks like that's something to do unconditionally for all nodes
> having children, because the parent need not know if the children do async
> operations.
True, and that was (again) the first iteration. But see above: in order to
allow way more concurrency, you don't want to introduce the false
dependency between the write-lock and the traversal of the tree (or, as
Alan points out - just a list - but that doesn't really change anything)
that is introduced by taking the lock synchronously.
So by moving the write-lock to the asynchronous work that also shuts down
the parent, you avoid that whole unnecessary serialization. But that means
that you can't do the lock in generic code.
Unless you want to do _all_ of the async logic in generic code and
re-introduce the "dev->async_suspend" flag. I would be ok with that now
that the infrastructure seems so simple.
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
> I also suspect that even if you do the USB host controller suspend
> synchronously, doing the actual USB devices asynchronously would still
> help - even if it's only "asynchronously per bus" thing.
>
> So in fact, it's probably a good first step to start off doing only the
> USB devices, not the controller.
Interesting you should say that. The patch I asked Arjan to test
involved not suspending USB devices at all (root hubs being the
exception). That is in fact just what we do when CONFIG_USB_SUSPEND
isn't set.
There's no need to suspend the individual devices when the whole system
is going down. They will automatically suspend when the controller
stops sending out SOF packets, which occurs when the root hub is
suspended. The USB spec describes this, grandiosely, as a "global
suspend".
But yes, I agree. Doing just the USB devices is a good first step.
Alan Stern
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> There's no need to suspend the individual devices when the whole system
> is going down. They will automatically suspend when the controller
> stops sending out SOF packets, which occurs when the root hub is
> suspended. The USB spec describes this, grandiosely, as a "global
> suspend".
Ahh, but the sync vs async would then still matter on resume. No?
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
>
>
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > There's no need to suspend the individual devices when the whole system
> > is going down. They will automatically suspend when the controller
> > stops sending out SOF packets, which occurs when the root hub is
> > suspended. The USB spec describes this, grandiosely, as a "global
> > suspend".
>
> Ahh, but the sync vs async would then still matter on resume. No?
That's complicated. If we assume the devices weren't runtime-suspended
before the sleep began, then they would automatically resume themselves
when the controller started transmitting SOF packets. So in that case
resume would be fast and async wouldn't matter.
But if the devices were runtime-suspended, then what? The safest
course is to resume them during the system-wide resume. In that case
yes, the sync vs async would matter.
And if (as happens on many machines) the firmware messes up the
controller settings during resume, then all the USB devices would have
to be reset -- another slow procedure.
Alan Stern
On Monday 07 December 2009, Linus Torvalds wrote:
>
> On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
> >
> > BTW, I still don't quite understand why not to put the parent's down_write
> > operation into the core. It's not going to hurt for the "synchronous" devices
> > and the "asynchronous" ones will need to do it anyway.
>
> That's what I started out doing (see the first pseudo-code with the two
> phases). But it _does_ actually hurt.
Hmm. If no one calls down_read() on the "synchronous" devices, their
down_write()s will be nops. In turn, if somebody does call down_read(), it
means they really need to wait for someone. They presumably don't need
to wait for each other, but we don't really know that (otherwise they would
have been "asynchronous").
> Because it will hurt exactly for the "multiple hubs" case: if you have two
> USB hubs in parallel (and the case that Alan pointed out about a USB host
> bridge is the exact same deal), then you want to be able to suspend and
> resume those two independent hubs in parallel too.
>
> But if you do the "down_write()" synchronously in the core, that means
> that you are also stopping the whole "traverse the tree" thing - so now
> you aren't handling the hubs in parallel even if you are handling all the
> devices _behind_ them asynchronously.
>
> This "serialize while traversing the tree" was what I was initially trying
> to avoid with the two-phase approach, but that I realized (after writing
> the resume path) that I could avoid much better by just moving the parents
> down_write into the asynchronous path.
But the asynchronous path has to be started somewhere. Basically, there are
three possible places: the core itself, the bus type's suspend routine called
by the core (same goes for resume of course), and the device driver's suspend
routine called by the bus type.
Now, I don't really see how we can put the parent's down_write() in a
child's suspend routine, for multiple reasons (one of them being that there can
be multiple asynchronous children the parent needs to wait for), so it looks like
it needs to be above the driver's suspend.
However, the parent can be on a different bus type than the children, so it
looks like we can only start the asynchronous path at the core level.
> > Also it looks like that's something to do unconditionally for all nodes
> > having children, because the parent need not know if the children do async
> > operations.
>
> True, and that was (again) the first iteration. But see above: in order to
> allow way more concurrency, you don't want to introduce the false
> dependency between the write-lock and the traversal of the tree (or, as
> Alan points out - just a list - but that doesn't really change anything)
> that is introduced by taking the lock synchronously.
>
> So by moving the write-lock to the asynchronous work that also shuts down
> the parent, you avoid that whole unnecessary serialization. But that means
> that you can't do the lock in generic code.
>
> Unless you want to do _all_ of the async logic in generic code and
> re-introduce the "dev->async_suspend" flag.
Quite frankly, I would like to.
> I would be ok with that now that the infrastructure seems so simple.
Well, perhaps I should dig out my original async suspend/resume patches
that didn't contain all of the non-essential stuff and post them here for
discussion, after all ...
Rafael
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> However, the parent can be on a different bus type than the children, so it
> looks like we can only start the asynchronous path at the core level.
Agreed.
> > Unless you want to do _all_ of the async logic in generic code and
> > re-introduce the "dev->async_suspend" flag.
>
> Quite frankly, I would like to.
>
> > I would be ok with that now that the infrastructure seems so simple.
>
> Well, perhaps I should dig out my original async suspend/resume patches
> that didn't contain all of the non-essential stuff and post them here for
> discussion, after all ...
That seems like a very good idea. IIRC they were quite similar to what
we have been discussing.
Alan Stern
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > However, the parent can be on a different bus type than the children, so it
> > looks like we can only start the asynchronous path at the core level.
>
> Agreed.
>
> > > Unless you want to do _all_ of the async logic in generic code and
> > > re-introduce the "dev->async_suspend" flag.
> >
> > Quite frankly, I would like to.
> >
> > > I would be ok with that now that the infrastructure seems so simple.
> >
> > Well, perhaps I should dig out my original async suspend/resume patches
> > that didn't contain all of the non-essential stuff and post them here for
> > discussion, after all ...
>
> That seems like a very good idea. IIRC they were quite similar to what
> we have been discussing.
There you go.
Below is the resume part. I have reworked the original patch a bit so that
it's even simpler. I'll post the suspend part in a reply to this message.
The idea is basically that if a device has the power.async_suspend flag set,
we schedule the execution of its resume callback asynchronously, but we
wait for the device's parent to finish resume before the device's resume is
actually executed.
The wait queue plus the op_complete flag combo plays the role of the locking
in Linus' picture, and it's essentially equivalent, since the devices being
waited for during resume will have to wait during suspend, so for example if
A has to wait for B during suspend, then B will have to wait for A during
resume (thus they both need to know in advance who's going to wait for them
and whom they need to wait for).
Of course, the code in this patch has the problem that if there are two
"asynchronous" devices in dpm_list separated by a series of "synchronous"
devices, then they usually won't be resumed in parallel (which is what we
ultimately want). That can be optimised in a couple of ways, but such
optimisations add quite some details to the code, so let's just omit them for
now.
BTW, thanks to the discussion with Linus I've realized that the off-tree
dependencies may be (relatively easily) taken into account by making the
interested drivers directly execute dpm_wait() for the extra devices they
need to wait for, so the entire PM links thing is simply unnecessary. So it
looks like the only things this patch is missing are the optimisations
mentioned above.
[This version of the patch has only been slightly tested.]
---
drivers/base/power/main.c | 129 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 ++
include/linux/pm.h | 4 +
include/linux/resume-trace.h | 7 ++
4 files changed, 134 insertions(+), 12 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -412,15 +412,17 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
+ wait_queue_head_t wait_queue;
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ unsigned int op_complete:1;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
unsigned long timer_expires;
struct work_struct work;
- wait_queue_head_t wait_queue;
spinlock_t lock;
atomic_t usage_count;
atomic_t child_count;
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_waitqueue_head(&dev->power.wait_queue);
pm_runtime_init(dev);
}
@@ -162,6 +165,56 @@ void device_pm_move_last(struct device *
}
/**
+ * dpm_reset - Clear op_complete for given device.
+ * @dev: Device to handle.
+ */
+static void dpm_reset(struct device *dev)
+{
+ dev->power.op_complete = false;
+}
+
+/**
+ * dpm_finish - Set op_complete for a device and wake up threads waiting for it.
+ */
+static void dpm_finish(struct device *dev)
+{
+ dev->power.op_complete = true;
+ wake_up_all(&dev->power.wait_queue);
+}
+
+/**
+ * dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @async: If true, ignore the device's async_suspend flag.
+ *
+ * Wait for a PM operation carried out for @dev to complete, unless @dev has to
+ * be handled synchronously and @async is false.
+ */
+static void dpm_wait(struct device *dev, bool async)
+{
+ if (!dev)
+ return;
+
+ if (!(async || dev->power.async_suspend))
+ return;
+
+ if (!dev->power.op_complete)
+ wait_event(dev->power.wait_queue, !!dev->power.op_complete);
+}
+
+/**
+ * dpm_synchronize - Wait for PM callbacks of all devices to complete.
+ */
+static void dpm_synchronize(void)
+{
+ struct device *dev;
+
+ async_synchronize_full();
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ dpm_reset(dev);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -334,25 +387,48 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
-
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ dpm_finish(dev);
+
TRACE_RESUME(error);
return error;
}
+static void async_resume_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ dpm_wait(dev->parent, true);
+ error = __device_resume_noirq(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async EARLY", error);
+ put_device(dev);
+}
+
+static int device_resume_noirq(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume_noirq, dev);
+ return 0;
+ }
+
+ dpm_wait(dev->parent, false);
+ return __device_resume_noirq(dev, pm_transition);
+}
+
/**
* dpm_resume_noirq - Execute "early resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -366,26 +442,28 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
+ pm_transition = state;
list_for_each_entry(dev, &dpm_list, power.entry)
if (dev->power.status > DPM_OFF) {
int error;
dev->power.status = DPM_OFF;
- error = device_resume_noirq(dev, state);
+ error = device_resume_noirq(dev);
if (error)
pm_dev_err(dev, state, " early", error);
}
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
@@ -426,11 +504,36 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ dpm_finish(dev);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ dpm_wait(dev->parent, true);
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ dpm_wait(dev->parent, false);
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +547,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +558,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -468,6 +572,7 @@ static void dpm_resume(pm_message_t stat
put_device(dev);
}
list_splice(&list, &dpm_list);
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
}
@@ -793,8 +898,10 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ dpm_reset(dev);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> On Tuesday 08 December 2009, Alan Stern wrote:
> > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > However, the parent can be on a different bus type than the children, so it
> > > looks like we can only start the asynchronous path at the core level.
> >
> > Agreed.
> >
> > > > Unless you want to do _all_ of the async logic in generic code and
> > > > re-introduce the "dev->async_suspend" flag.
> > >
> > > Quite frankly, I would like to.
> > >
> > > > I would be ok with that now that the infrastructure seems so simple.
> > >
> > > Well, perhaps I should dig out my original async suspend/resume patches
> > > that didn't contain all of the non-essential stuff and post them here for
> > > discussion, after all ...
> >
> > That seems like a very good idea. IIRC they were quite similar to what
> > we have been discussing.
>
> There you go.
Below is the suspend part. It contains some extra code for rolling back the
suspend if one of the asynchronous callbacks returns an error code, but apart
from this it's completely analogous to the resume part.
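
As an aside on the "remember the first error" part: the patch below keeps
it in a plain static int that async callbacks test and then set. One
possible way to make "first error wins" hold even when two callbacks fail
at the same moment is a compare-and-exchange. The following is a userspace
sketch with C11 atomics, not what the patch does; all names in it are
hypothetical, and kernel code would use the kernel's own atomic helpers.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the patch's "static int async_error". */
static atomic_int first_async_error;

/* Store @error only if no earlier error has been recorded. */
static void record_async_error(int error)
{
	int expected = 0;

	if (error)
		atomic_compare_exchange_strong(&first_async_error,
					       &expected, error);
}

/* Return the first recorded error, or 0 if there was none. */
static int read_async_error(void)
{
	return atomic_load(&first_async_error);
}

/* Clear the state before a new transition starts. */
static void reset_async_error(void)
{
	atomic_store(&first_async_error, 0);
}
```

A later failure cannot overwrite an earlier one here, because the exchange
only succeeds while the stored value is still 0.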
[This patch has only been slightly tested.]
---
drivers/base/power/main.c | 113 ++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 104 insertions(+), 9 deletions(-)
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -202,6 +202,17 @@ static void dpm_wait(struct device *dev,
wait_event(dev->power.wait_queue, !!dev->power.op_complete);
}
+static int device_pm_wait_fn(struct device *dev, void *async_ptr)
+{
+ dpm_wait(dev, *((bool *)async_ptr));
+ return 0;
+}
+
+static void dpm_wait_for_children(struct device *dev, bool async)
+{
+ device_for_each_child(dev, &async, device_pm_wait_fn);
+}
+
/**
* dpm_synchronize - Wait for PM callbacks of all devices to complete.
*/
@@ -638,6 +649,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}
+static int async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -685,20 +698,52 @@ static pm_message_t resume_event(pm_mess
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_suspend_noirq(struct device *dev, pm_message_t state)
+static int __device_suspend_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
- if (!dev->bus)
- return 0;
-
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "LATE ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
+
+ dpm_finish(dev);
+
return error;
}
+static void async_suspend_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = async_error;
+
+ if (error)
+ return;
+
+ dpm_wait_for_children(dev, true);
+ error = __device_suspend_noirq(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async LATE", error);
+ dev->power.status = DPM_OFF;
+ }
+ put_device(dev);
+
+ if (error && !async_error)
+ async_error = error;
+}
+
+static int device_suspend_noirq(struct device *dev)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend_noirq, dev);
+ return 0;
+ }
+
+ dpm_wait_for_children(dev, false);
+ return __device_suspend_noirq(dev, pm_transition);
+}
+
/**
* dpm_suspend_noirq - Execute "late suspend" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -713,14 +758,21 @@ int dpm_suspend_noirq(pm_message_t state
suspend_device_irqs();
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
- error = device_suspend_noirq(dev, state);
+ dev->power.status = DPM_OFF_IRQ;
+ error = device_suspend_noirq(dev);
if (error) {
pm_dev_err(dev, state, " late", error);
+ dev->power.status = DPM_OFF;
+ break;
+ }
+ if (async_error) {
+ error = async_error;
break;
}
- dev->power.status = DPM_OFF_IRQ;
}
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
if (error)
dpm_resume_noirq(resume_event(state));
@@ -733,7 +785,7 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
@@ -773,10 +825,45 @@ static int device_suspend(struct device
}
End:
up(&dev->sem);
+ dpm_finish(dev);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = async_error;
+
+ if (error)
+ goto End;
+
+ dpm_wait_for_children(dev, true);
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+
+ dev->power.status = DPM_SUSPENDING;
+ if (!async_error)
+ async_error = error;
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ dpm_wait_for_children(dev, false);
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -788,10 +875,12 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ dev->power.status = DPM_OFF;
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -799,16 +888,21 @@ static int dpm_suspend(pm_message_t stat
mutex_lock(&dpm_list_mtx);
if (error) {
pm_dev_err(dev, state, "", error);
+ dev->power.status = DPM_SUSPENDING;
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
+ if (!error)
+ error = async_error;
return error;
}
@@ -867,6 +961,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> The wait queue plus the op_complete flag combo plays the role of the locking
> in Linus' picture
Please just use the lock. Don't make up your own locking crap. Really.
Your patch is horrible. Exactly because your locking is horribly
mis-designed. You can't say things are complete from an interrupt, for
example, since you made it some random bitfield, which has unknown
characteristics (ie non-atomic read-modify-write etc).
The fact is, any time anybody makes up a new locking mechanism, THEY
ALWAYS GET IT WRONG. Don't do it.
I suggested using the rwsem locking for a good reason. It made sense. It
was simpler. Just do it that way, stop making up crap.
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> Please just use the lock. Don't make up your own locking crap. Really.
>
> Your patch is horrible. Exactly because your locking is horribly
> mis-designed. You can't say things are complete from an interrupt, for
> example, since you made it some random bitfield, which has unknown
> characteristics (ie non-atomic read-modify-write etc).
>
> The fact is, any time anybody makes up a new locking mechanism, THEY
> ALWAYS GET IT WRONG. Don't do it.
>
> I suggested using the rwsem locking for a good reason. It made sense. It
> was simpler. Just do it that way, stop making up crap.
The semantics needed for this kind of lock aren't really the same as
for an rwsem (although obviously an rwsem will do the job). Basically
it needs to have the capability for multiple users to lock it (no
blocking when acquiring a lock) and the capability for a user to wait
until it is totally unlocked. It could be implemented trivially using
an atomic_t counter and a waitqueue head.
Is this a standard sort of lock? It's a lot simpler than most others.
I don't recall seeing anything quite like it anywhere; the closest
thing might be some kind of barrier.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> The semantics needed for this kind of lock aren't really the same as
> for an rwsem (although obviously an rwsem will do the job). Basically
> it needs to have the capability for multiple users to lock it (no
> blocking when acquiring a lock) and the capability for a user to wait
> until it is totally unlocked. It could be implemented trivially using
> an atomic_t counter and a waitqueue head.
>
> Is this a standard sort of lock?
Yes it is.
It's called a rwlock. The counter is for readers, the exclusion is for
writers.
Really.
And the thing is, you actually do want the rwlock semantics, because on
the resume side you want the parent to lock it for writing first (so that
the children can wait for the parent to have completed its resume).
So we actually _want_ the full rwlock semantics.
See the code I posted earlier. Here condensed into one email:
- resume:
usb_node_resume(node)
{
// Wait for parent to finish resume
down_read(node->parent->lock);
// .. before resuming ourselves
node->resume(node);
// Now we're all done
up_read(node->parent->lock);
up_write(node->lock);
}
/* caller: */
..
// This won't block, because we resume parents before children,
// and the children will take the read lock.
down_write(leaf->lock);
// Do the blocking part asynchronously
async_schedule(usb_node_resume, leaf);
..
- suspend:
usb_node_suspend(node)
{
// Start our suspend. This will block if we have any
// children that are still busy suspending (they will
// have done a down_read() in their suspend).
down_write(node->lock);
node->suspend(node);
up_write(node->lock);
// This lets our parent continue
up_read(node->parent->lock);
}
/* caller: */
// This won't block, because we suspend nodes before parents
down_read(node->parent->lock);
// Do the part that may block asynchronously
async_schedule(do_usb_node_suspend, node);
It really should be that simple. Nothing more, nothing less. And with the
above, finishing the suspend (or resume) from interrupts is fine, and you
don't have any new lock that has undefined memory ordering issues etc.
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > The semantics needed for this kind of lock aren't really the same as
> > for an rwsem (although obviously an rwsem will do the job). Basically
> > it needs to have the capability for multiple users to lock it (no
> > blocking when acquiring a lock) and the capability for a user to wait
> > until it is totally unlocked. It could be implemented trivially using
> > an atomic_t counter and a waitqueue head.
> >
> > Is this a standard sort of lock?
>
> Yes it is.
>
> It's called a rwlock. The counter is for readers, the exclusion is for
> writers.
>
> Really.
>
> And the thing is, you actually do want the rwlock semantics, because on
> the resume side you want the parent to lock it for writing first (so that
> the children can wait for the parent to have completed its resume).
>
> So we actually _want_ the full rwlock semantics.
I'm not convinced. Condense the description a little farther:
Suspend: Children lock the parent first. When they are
finished they unlock the parent, allowing it to
proceed.
Resume: Parent locks itself first. When it is finished
it unlocks itself, allowing the children to proceed.
The whole readers vs. writers thing is a non-sequitur. (For instance,
this never uses the fact that writers exclude each other.) In each
case a lock is taken and eventually released, allowing someone else to
stop waiting and move forward. In the suspend case we have multiple
lockers and one waiter, whereas in the resume case we have one locker
and multiple waiters.
The simplest generalization is to allow both multiple lockers and
multiple waiters. Call it a waitlock, for want of a better name:
wait_lock(wl)
{
atomic_inc(&wl->count);
}
wait_unlock(wl)
{
if (atomic_dec_and_test(&wl->count)) {
smp_mb__after_atomic_dec();
wake_up_all(wl->wqh);
}
}
wait_for_lock(wl)
{
wait_event(wl->wqh, atomic_read(&wl->count) == 0);
smp_rmb();
}
Note that both wait_lock() and wait_unlock() can be called
in_interrupt.
> See the code I posted earlier. Here condensed into one email:
>
> - resume:
>
> usb_node_resume(node)
> {
> // Wait for parent to finish resume
> down_read(node->parent->lock);
> // .. before resuming ourselves
> node->resume(node);
>
> // Now we're all done
> up_read(node->parent->lock);
> up_write(node->lock);
> }
>
> /* caller: */
> ..
> // This won't block, because we resume parents before children,
> // and the children will take the read lock.
> down_write(leaf->lock);
> // Do the blocking part asynchronously
> async_schedule(usb_node_resume, leaf);
> ..
This becomes:
usb_node_resume(node)
{
// Wait for parent to finish resume
wait_for_lock(node->parent->lock);
// .. before resuming ourselves
node->resume(node);
// Now we're all done
wait_unlock(node->lock);
}
/* caller: */
..
// This can't block, because wait_lock() is non-blocking.
wait_lock(node->lock);
// Do the blocking part asynchronously
async_schedule(usb_node_resume, leaf);
..
> - suspend:
>
> usb_node_suspend(node)
> {
> // Start our suspend. This will block if we have any
> // children that are still busy suspending (they will
> // have done a down_read() in their suspend).
> down_write(node->lock);
> node->suspend(node);
> up_write(node->lock);
>
> // This lets our parent continue
> up_read(node->parent->lock);
> }
>
> /* caller: */
>
> // This won't block, because we suspend nodes before parents
> down_read(node->parent->lock);
> // Do the part that may block asynchronously
> async_schedule(do_usb_node_suspend, node);
usb_node_suspend(node)
{
// Start our suspend. This will block if we have any
// children that are still busy suspending (they will
// have done a wait_lock() in their suspend).
wait_for_lock(node->lock);
node->suspend(node);
// This lets our parent continue
wait_unlock(node->parent->lock);
}
/* caller: */
..
// This can't block, because wait_lock is non-blocking.
wait_lock(node->parent->lock);
// Do the part that may block asynchronously
async_schedule(do_usb_node_suspend, node);
..
> It really should be that simple. Nothing more, nothing less. And with the
> above, finishing the suspend (or resume) from interrupts is fine, and you
> don't have any new lock that has undefined memory ordering issues etc.
Aren't waitlocks simpler than rwsems? Not as generally useful,
perhaps. But just as correct in this situation.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > So we actually _want_ the full rwlock semantics.
>
> I'm not convinced. Condense the description a little farther:
>
> Suspend: Children lock the parent first. When they are
> finished they unlock the parent, allowing it to
> proceed.
>
> Resume: Parent locks itself first. When it is finished
> it unlocks itself, allowing the children to proceed.
Yes. You can implement it with a simple lock with a count. Nobody debates
that.
But a simple counting lock _is_ a rwlock. Really. They are 100%
semantically equivalent. There is no difference.
> The whole readers vs. writers thing is a non-sequitur.
No it's not.
It's a 100% equivalent problem. It's purely a change of wording. The end
result is the same.
> The simplest generalization is to allow both multiple lockers and
> multiple waiters. Call it a waitlock, for want of a better name:
But we have that. It _has_ a better name: rwlocks.
And the reason the name is better is because now the name describes all
the semantics to anybody who has ever taken a course in operating systems
or in parallelism.
It's also a better implementation, because it actually _works_.
> wait_lock(wl)
> {
> atomic_inc(&wl->count);
> }
>
> wait_unlock(wl)
> {
> if (atomic_dec_and_test(&wl->count)) {
> smp_mb__after_atomic_dec();
> wake_up_all(wl->wqh);
> }
> }
>
> wait_for_lock(wl)
> {
> wait_event(wl->wqh, atomic_read(&wl->count) == 0);
> smp_rmb();
> }
>
> Note that both wait_lock() and wait_unlock() can be called
> in_interrupt.
And note how even though you sprinkled random memory barriers around, you
still got it wrong.
So you just implemented a buggy lock, and for what gain? Tell me exactly
why your buggy lock (assuming you'd know enough about memory ordering to
actually fix it) is better than just using the existing one?
It's certainly not smaller. It's not faster. It doesn't have support for
lockdep. And it's BUGGY.
Really. Tell me why you want to re-implement an existing lock - badly.
[ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
that anybody else who sees the new value will be guaranteed to have seen
anything else the unlocker did.
You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
allow writes to migrate up either. 'atomic_read()' does not imply any
barriers.
But most architectures can optimize these things for their particular
memory ordering model, and do so in their rwsem implementation. ]
> This becomes:
>
> usb_node_resume(node)
> {
> // Wait for parent to finish resume
> wait_for_lock(node->parent->lock);
> // .. before resuming ourselves
> node->resume(node);
>
> // Now we're all done
> wait_unlock(node->lock);
> }
>
> /* caller: */
> ..
> // This can't block, because wait_lock() is non-blocking.
> wait_lock(node->lock);
> // Do the blocking part asynchronously
> async_schedule(usb_node_resume, leaf);
> ..
Umm? Same thing, different words?
That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'. We
_could_ expose such a mechanism for rwsem's too, but why do it? It's
actually nicer to use a real read-lock - and do it _around_ the operation,
because now the locking also automatically gets things like overlapping
suspends and resumes right.
(Which you'd obviously hope never happens, but it's nice from a conceptual
standpoint to know that the locking is robust).
> Aren't waitlocks simpler than rwsems? Not as generally useful,
> perhaps. But just as correct in this situation.
NO!
Dammit. I started this whole rant with this comment to Rafael:
"The fact is, any time anybody makes up a new locking mechanism, THEY
ALWAYS GET IT WRONG. Don't do it."
Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
ordering is HARD.
So leave locking to the pro's. They _also_ got it wrong, but they got it
wrong several years ago, and fixed up their sh*t.
This is why you use generic locking. ALWAYS.
Linus
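For anyone who wants to experiment with the counting-lock idea outside the
kernel, here is a minimal userspace sketch. It is not the kernel code from
this thread: struct waitlock and waitlock_init() are hypothetical names, and
it deliberately leans on a pthread mutex and condition variable so that an
existing primitive supplies the memory ordering, rather than hand-placed
barriers. Unlike the atomic_t version, it is not something you could call
from interrupt context.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace analogue of the "waitlock" discussed above. */
struct waitlock {
	pthread_mutex_t mtx;	/* protects count, provides the ordering */
	pthread_cond_t zero;	/* signalled when count drops to zero */
	int count;		/* number of outstanding lockers */
};

static void waitlock_init(struct waitlock *wl)
{
	pthread_mutex_init(&wl->mtx, NULL);
	pthread_cond_init(&wl->zero, NULL);
	wl->count = 0;
}

/* Register one more locker. Only contends briefly on the mutex. */
static void wait_lock(struct waitlock *wl)
{
	pthread_mutex_lock(&wl->mtx);
	wl->count++;
	pthread_mutex_unlock(&wl->mtx);
}

/* Drop one locker and wake every waiter once nobody holds the lock. */
static void wait_unlock(struct waitlock *wl)
{
	pthread_mutex_lock(&wl->mtx);
	assert(wl->count > 0);
	if (--wl->count == 0)
		pthread_cond_broadcast(&wl->zero);
	pthread_mutex_unlock(&wl->mtx);
}

/* Block until the lock is totally unlocked. */
static void wait_for_lock(struct waitlock *wl)
{
	pthread_mutex_lock(&wl->mtx);
	while (wl->count != 0)
		pthread_cond_wait(&wl->zero, &wl->mtx);
	pthread_mutex_unlock(&wl->mtx);
}
```

Because every access to count happens under the mutex, anything a locker
wrote before wait_unlock() is visible to a thread returning from
wait_for_lock(). That is the guarantee the barrier discussion above is about.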
On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
> that anybody else who sees the new value will be guaranteed to have seen
> anything else the unlocker did.
>
> You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> allow writes to migrate up either. 'atomic_read()' does not imply any
> barriers.
>
> But most architectures can optimize these things for their particular
> memory ordering model, and do so in their rwsem implementation. ]
Side note: if this was a real lock, you'd also need an smp_wmb() in the
'wait_lock()' path after the atomic_inc(), to make sure that the atomic
lock was seen by other people before the suspend started.
In your usage scenario, I don't think it would ever be noticeable, since
the other users are always going to start running from the same thread
that did the wait_lock(), so even if they run on other CPU's, we'll have
scheduled _to_ those other CPU's and done enough memory ordering to
guarantee that they will see the thing.
So it would be ok in this situation, simply because it acts as an
initializer and never sees any real SMP issues.
But it's an example of how you now don't just depend on the locking
primitives themselves doing the right thing, you end up depending very
subtly on exactly how the lock is used. The standard locks do have the
same kind of issue for initializers, but we avoid it elsewhere because
it's so risky.
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> > The whole readers vs. writers thing is a non-sequitur.
>
> No it's not.
>
> It's a 100% equivalent problem. It's purely a change of wording. The end
> result is the same.
Well, of course the end result is the same (ignoring bugs) -- that was
the point. It doesn't follow that the two locking mechanisms are 100%
equivalent.
> And note how even though you sprinkled random memory barriers around, you
> still got it wrong.
Yes. That comes of trying to think at the keyboard.
> It's certainly not smaller. It's not faster. It doesn't have support for
> lockdep. And it's BUGGY.
Lockdep will choke on the rwsem approach anyway. It has never been
very good at handling tree-structured locking, especially when there
are non-parent-child interactions. But never mind.
> Really. Tell me why you want to re-implement an existing lock - badly.
I didn't want to. The whole exercise was intended to make a point --
that rwsems do more than we really need here.
> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
> that anybody else who sees the new value will be guaranteed to have seen
> anything else the unlocker did.
Yes.
> You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> allow writes to migrate up either. 'atomic_read()' does not imply any
> barriers.
No, that's not needed. Unlike reads, writes can't move in front of
data or control dependencies. Or so I've been led to believe...
> That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
the suspend routine). Are you claiming these two compound operations
are equivalent?
> We
> _could_ expose such a mechanism for rwsem's too, but why do it? It's
> actually nicer to use a real read-lock - and do it _around_ the operation,
> because now the locking also automatically gets things like overlapping
> suspends and resumes right.
>
> (Which you'd obviously hope never happens, but it's nice from a conceptual
> standpoint to know that the locking is robust).
> Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
> ordering is HARD.
Oh, there's no question about that. I never seriously intended this
stuff to be adopted. It was just for discussion.
Alan Stern
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> Side note: if this was a real lock, you'd also need an smp_wmb() in the
> 'wait_lock()' path after the atomic_inc(), to make sure that the atomic
> lock was seen by other people before the suspend started.
>
> In your usage scenario, I don't think it would ever be noticeable, since
> the other users are always going to start running from the same thread
> that did the wait_lock(), so even if they run on other CPU's, we'll have
> scheduled _to_ those other CPU's and done enough memory ordering to
> guarantee that they will see the thing.
>
> So it would be ok in this situation, simply because it acts as an
> initializer and never sees any real SMP issues.
Yes. I would have brought this up, but you made the point for me.
> But it's an example of how you now don't just depend on the locking
> primitives themselves doing the right thing, you end up depending very
> subtly on exactly how the lock is used. The standard locks do have the
> same kind of issue for initializers, but we avoid it elsewhere because
> it's so risky.
No doubt there are other reasons why the "wait-lock" pattern doesn't
get used enough to be noticed.
Alan Stern
On Tuesday 08 December 2009, Linus Torvalds wrote:
>
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> >
> > The wait queue plus the op_complete flag combo plays the role of the locking
> > in Linus' picture
>
> Please just use the lock. Don't make up your own locking crap. Really.
>
> Your patch is horrible. Exactly because your locking is horribly
> mis-designed. You can't say things are complete from an interrupt, for
> example, since you made it some random bitfield, which has unknown
> characteristics (ie non-atomic read-modify-write etc).
I didn't assume anyone would check it from an interrupt, because I didn't see
a point. In fact I didn't assume anyone except for the PM core would check it.
In case this assumption is wrong, it can be easily put under the dev->sem
that we take anyway before calling the bus type (etc.) callbacks.
Anyway, if we use an rwsem, it won't be checkable from interrupt context just
as well.
> The fact is, any time anybody makes up a new locking mechanism, THEY
> ALWAYS GET IT WRONG. Don't do it.
>
> I suggested using the rwsem locking for a good reason. It made sense. It
> was simpler. Just do it that way, stop making up crap.
Suppose we use rwsem and during suspend each child uses a down_read() on a
parent and then the parent uses down_write() on itself. What if, whatever the
reason, the parent is a bit early and does the down_write() before one of the
children has a chance to do the down_read()? Aren't we toast?
Do we need any direct protection against that or does it just work itself out
in a way I just don't see right now?
Rafael
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> Suppose we use rwsem and during suspend each child uses a down_read() on a
> parent and then the parent uses down_write() on itself. What if, whatever the
> reason, the parent is a bit early and does the down_write() before one of the
> children has a chance to do the down_read()? Aren't we toast?
>
> Do we need any direct protection against that or does it just work itself out
> in a way I just don't see right now?
That's not the way it should be done. Linus had children taking their
parents' locks during suspend, which is simple but leads to
difficulties.
Instead, the PM core should do a down_write() on each device before
starting the device's async suspend routine, and an up_write() when the
routine finishes. Parents should, at the start of their async routine,
do down_read() on each of their children plus whatever other devices
they need to wait for. The core can do the waiting for children part
and the driver's suspend routine can handle any other waiting.
This is a little more awkward because it requires the parent to iterate
through its children. But it does solve the off-tree dependency
problem for suspends.
Alan Stern
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > Suppose we use rwsem and during suspend each child uses a down_read() on a
> > parent and then the parent uses down_write() on itself. What if, whatever the
> > reason, the parent is a bit early and does the down_write() before one of the
> > children has a chance to do the down_read()? Aren't we toast?
> >
> > Do we need any direct protection against that or does it just work itself out
> > in a way I just don't see right now?
>
> That's not the way it should be done. Linus had children taking their
> parents' locks during suspend, which is simple but leads to
> difficulties.
>
> Instead, the PM core should do a down_write() on each device before
> starting the device's async suspend routine, and an up_write() when the
> routine finishes. Parents should, at the start of their async routine,
> do down_read() on each of their children plus whatever other devices
> they need to wait for. The core can do the waiting for children part
> and the driver's suspend routine can handle any other waiting.
>
> This is a little more awkward because it requires the parent to iterate
> through its children.
I can live with that.
> But it does solve the off-tree dependency problem for suspends.
That's a plus, but I still think we're trying to create a barrier-like
mechanism using a lock.
There's one more possibility to consider, though. What if we use a completion
instead of the flag + wait queue? It surely is a standard synchronization
mechanism and it seems it might work here.
Rafael
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > This is a little more awkward because it requires the parent to iterate
> > through its children.
>
> I can live with that.
>
> > But it does solve the off-tree dependency problem for suspends.
>
> That's a plus, but I still think we're trying to create a barrier-like
> mechanism using a lock.
>
> There's one more possibility to consider, though. What if we use a completion
> instead of the flag + wait queue? It surely is a standard synchronization
> mechanism and it seems it might work here.
You're right. I should have thought of that. Linus's original
approach couldn't use a completion because during suspend it needed to
make one task (the parent) wait for a bunch of others (the children).
But if you iterate through the children by hand, that objection no
longer applies.
Alan Stern
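To make the completion idea concrete, here is a rough userspace sketch of
"the parent iterates through its children by hand". It is only an
illustration under stated assumptions: struct ucompletion, ucomplete_all(),
uwait_for_completion(), struct node and node_suspend_thread() are all
hypothetical names built on pthreads, not the kernel's completion API, and
the seq field exists only so the ordering can be observed.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a kernel completion. */
struct ucompletion {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	bool done;
};

static void uinit_completion(struct ucompletion *c)
{
	pthread_mutex_init(&c->mtx, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Mark the operation finished and release every waiter. */
static void ucomplete_all(struct ucompletion *c)
{
	pthread_mutex_lock(&c->mtx);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->mtx);
}

static void uwait_for_completion(struct ucompletion *c)
{
	pthread_mutex_lock(&c->mtx);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->mtx);
	pthread_mutex_unlock(&c->mtx);
}

static atomic_int suspend_seq;	/* global order in which nodes suspended */

struct node {
	struct ucompletion suspended;
	struct node **children;
	int nr_children;
	int seq;		/* position in the observed suspend order */
};

static void node_init(struct node *n, struct node **children, int nr)
{
	uinit_completion(&n->suspended);
	n->children = children;
	n->nr_children = nr;
	n->seq = -1;
}

/* Async suspend body: wait for each child, "suspend", then signal. */
static void *node_suspend_thread(void *arg)
{
	struct node *n = arg;

	for (int i = 0; i < n->nr_children; i++)
		uwait_for_completion(&n->children[i]->suspended);

	/* A real driver's suspend callback would run here. */
	n->seq = atomic_fetch_add(&suspend_seq, 1);

	ucomplete_all(&n->suspended);
	return NULL;
}
```

The parent thread can be started before its children without harm: it just
sleeps in uwait_for_completion() until each child has signalled.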
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> > You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> > allow writes to migrate up either. 'atomic_read()' does not imply any
> > barriers.
>
> No, that's not needed. Unlike reads, writes can't move in front of
> data or control dependencies. Or so I've been led to believe...
Sure they can. Control dependencies are trivial - it's called "branch
prediction", and everybody does it, and data dependencies don't exist on
many CPU architectures (even to the point of reading through a pointer
that you loaded).
But yes, on x86, stores only move down. But that's an x86-specific thing.
[ Not that it isn't also very common elsewhere - write buffering is easy
and matters for performance, so any in-order implementation will generally
do it. In contrast, writes moving up doesn't really help performance and is
harder to do, but can happen with a weakly ordered memory subsystem
especially if you have multi-way caches where some ways are busy and end up
being congested.
So the _common_ case is definitely about delaying writes and doing reads
early if possible. But it's not necessarily at all guaranteed in
general. ]
> > That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
>
> Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
> the suspend routine). Are you claiming these two compound operations
> are equivalent?
They have separate semantics, and you just want to pick the one that suits
you. Your counting lock doesn't have the "read_lock+read_unlock" version,
it only has the write_lock/unlock one (ie it requires totally unlocked
thing).
The point being, rwsem's can do everything your counting lock does. And
they already exist. And they already know about all the subtleties of
architecture-specific memory ordering etc.
Linus
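The ordering requirement being argued about can be illustrated in userspace
with C11 atomics. This is only an analogy and an assumption on my part, not
the kernel barrier API: a release store on the "unlock" side keeps earlier
writes from sinking below the flag update, and an acquire load on the waiting
side keeps later reads and writes from moving up past the flag check. The
names payload, ready, publisher() and consume() are made up for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;		/* plain data written by the "unlocker" */
static atomic_int ready;	/* flag that the waiter polls */

/* "Unlock" side: finish all work first, then publish the flag. */
static void *publisher(void *arg)
{
	(void)arg;
	payload = 42;
	/* Release: the write to payload cannot be reordered after this. */
	atomic_store_explicit(&ready, 1, memory_order_release);
	return NULL;
}

/* "Wait" side: spin until the flag is seen, then touch the data. */
static int consume(void)
{
	/* Acquire: nothing below this loop can be reordered above it. */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;
	return payload;
}
```

With relaxed ordering on either side, the language would no longer promise
that consume() observes the publisher's write to payload.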
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > > This is a little more awkward because it requires the parent to iterate
> > > through its children.
> >
> > I can live with that.
> >
> > > But it does solve the off-tree dependency problem for suspends.
> >
> > That's a plus, but I still think we're trying to create a barrier-like
> > mechanism using a lock.
> >
> > There's one more possibility to consider, though. What if we use a completion
> > instead of the flag + wait queue? It surely is a standard synchronization
> > mechanism and it seems it might work here.
>
> You're right. I should have thought of that. Linus's original
> approach couldn't use a completion because during suspend it needed to
> make one task (the parent) wait for a bunch of others (the children).
> But if you iterate through the children by hand, that objection no
> longer applies.
BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
here?
Rafael
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> Anyway, if we use an rwsem, it won't be checkable from interrupt context just
> as well.
You can't do a lock() from an interrupt, but the unlocks should be
irq-safe.
> Suppose we use rwsem and during suspend each child uses a down_read() on a
> parent and then the parent uses down_write() on itself. What if, whatever the
> reason, the parent is a bit early and does the down_write() before one of the
> children has a chance to do the down_read()? Aren't we toast?
We're toast, but we're toast for a totally unrelated reason: it means that
you tried to resume a child before a parent, which would be a major bug to
begin with.
Look, I even wrote out the comments, so let me repeat the code one more
time.
- suspend time calling:
// This won't block, because we suspend nodes before parents
down_read(node->parent->lock);
// Do the part that may block asynchronously
async_schedule(do_usb_node_suspend, node);
- resume time calling:
// This won't block, because we resume parents before children,
// and the children will take the read lock.
down_write(leaf->lock);
// Do the blocking part asynchronously
async_schedule(usb_node_resume, leaf);
See? So when we take the parent lock for suspend, we are guaranteed to do
so _before_ the parent node itself suspends. And conversely, when we take
the parent lock (asynchronously) for resume, we're guaranteed to do that
_after_ the parent node has done its own down_write.
And that all depends on just one trivial thing; that the suspend and
resume is called in the right order (children first vs parent first
respectively). And that is such a _major_ correctness issue that if that
isn't correct, your suspend isn't going to work _anyway_.
Linus
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> That's not the way it should be done. Linus had children taking their
> parents' locks during suspend, which is simple but leads to
> difficulties.
No it doesn't. Name them.
> Instead, the PM core should do a down_write() on each device before
> starting the device's async suspend routine, and an up_write() when the
> routine finishes.
No you should NOT do that. If you do that, you serialize the suspend
incorrectly and much too early. IOW, think a topology like this:
a -> b -> c
\
> d -> e
where you'd want to suspend 'c' and 'e' asynchronously. If we do a
'down-write()' on b, then we'll delay until 'c' has suspended, and if we
have ordered the nodes in the obvious depth-first order, we'll walk the PM
device list in the order:
c b e d a
and now we'll serialize on 'b', waiting for 'c' to suspend. Which we do
_not_ want to do, because the whole point was to suspend 'c' and 'e'
together.
> Parents should, at the start of their async routine,
> do down_read() on each of their children plus whatever other devices
> they need to wait for. The core can do the waiting for children part
> and the driver's suspend routine can handle any other waiting.
Why?
That just complicates things. Compare to my simple locking scheme I've
quoted several times.
Linus
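The "c b e d a" walk order in the topology example above is simply a
depth-first post-order of the device tree, which a few lines of C can
reproduce. This is a sketch for illustration only: struct dev and
suspend_order() are hypothetical and have nothing to do with the real
dpm_list code.

```c
#include <stddef.h>

/* Hypothetical device node with at most two children. */
struct dev {
	char name;
	struct dev *child[2];
};

/*
 * Append device names to out[] in depth-first post-order, so that every
 * child appears before its parent. Returns the new write position.
 */
static size_t suspend_order(const struct dev *d, char *out, size_t pos)
{
	for (size_t i = 0; i < 2; i++)
		if (d->child[i])
			pos = suspend_order(d->child[i], out, pos);
	out[pos++] = d->name;
	return pos;
}
```

For the tree where 'a' has children 'b' and 'd', 'b' has child 'c' and 'd'
has child 'e', the walk reaches 'b' immediately after 'c'. A blocking
operation on 'b' in the main loop would therefore stall before 'e' is ever
scheduled, which is the serialization being objected to.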
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > That's not the way it should be done. Linus had children taking their
> > parents' locks during suspend, which is simple but leads to
> > difficulties.
>
> No it doesn't. Name them.
Really.
Let me put this simply: I've told you guys how to do it simply, with
_zero_ crap. No "iterating over children". No games. No data structures.
No new infrastructure. Just a single new rwlock per device, and _trivial_
code.
So here's the challenge: try it my simple way first. I've quoted the code
about five million times already. If you _actually_ see some problems,
explain them. Don't make up stupid "iterate over each child" things. Don't
claim totally made-up "leads to difficulties". Don't make it any more
complicated than it needs to be.
Keep it simple. And once you have tried that simple approach, and you
really can show why it doesn't work, THEN you can try something else.
But before you try the simple approach and explain why it wouldn't work, I
simply will not pull anything more complex. Understood and agreed?
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> > No, that's not needed. Unlike reads, writes can't move in front of
> > data or control dependencies. Or so I've been led to believe...
>
> Sure they can. Control dependencies are trivial - it's called "branch
> prediction", and everybody does it, and data dependencies don't exist on
> many CPU architectures (even to the point of reading through a pointer
> that you loaded).
Wait a second. Are you saying that with code like this:
if (x == 1)
y = 5;
the CPU may write to y before it has finished reading the value of x?
And this write is visible to other CPUs, so that if x was initially 0
and a second CPU sets x to 1, the second CPU may see y == 5 before it
executes the write to x (whatever that may mean)?
Alan Stern
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> here?
And likewise in try_wait_for_completion(). It looks like a bug. Maybe
these routines were not intended to be called with interrupts disabled,
but that requirement doesn't seem to be documented. And it isn't a
natural requirement anyway.
Alan Stern
On Tuesday 08 December 2009, Linus Torvalds wrote:
>
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Anyway, if we use an rwsem, it won't be checkable from interrupt context just
> > as well.
>
> You can't do a lock() from an interrupt, but the unlocks should be
> irq-safe.
>
> > Suppose we use rwsem and during suspend each child uses a down_read() on a
> > parent and then the parent uses down_write() on itself. What if, whatever the
> > reason, the parent is a bit early and does the down_write() before one of the
> > children has a chance to do the down_read()? Aren't we toast?
>
> We're toast, but we're toast for a totally unrelated reason: it means that
> you tried to resume a child before a parent, which would be a major bug to
> begin with.
>
> Look, I even wrote out the comments, so let me repeat the code one more
> time.
>
> - suspend time calling:
> // This won't block, because we suspend nodes before parents
> down_read(node->parent->lock);
> // Do the part that may block asynchronously
> async_schedule(do_usb_node_suspend, node);
>
> - resume time calling:
> // This won't block, because we resume parents before children,
> // and the children will take the read lock.
> down_write(leaf->lock);
> // Do the blocking part asynchronously
> async_schedule(usb_node_resume, leaf);
>
> See? So when we take the parent lock for suspend, we are guaranteed to do
> so _before_ the parent node itself suspends. And conversely, when we take
> the parent lock (asynchronously) for resume, we're guaranteed to do that
> _after_ the parent node has done its own down_write.
>
> And that all depends on just one trivial thing; that the suspend and
> resume is called in the right order (children first vs parent first
> respectively). And that is such a _major_ correctness issue that if that
> isn't correct, your suspend isn't going to work _anyway_.
Understood (I think).
Let's try it, then. Below is the resume patch based on my previous one in this
thread (I have only verified that it builds). Is that along the lines you want?
Rafael
---
drivers/base/power/main.c | 78 ++++++++++++++++++++++++++++++++++++++-----
include/linux/device.h | 6 +++
include/linux/pm.h | 3 +
include/linux/resume-trace.h | 7 +++
4 files changed, 85 insertions(+), 9 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/rwsem.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct rw_semaphore rwsem;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_rwsem(&dev->power.rwsem);
pm_runtime_init(dev);
}
@@ -334,25 +337,51 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
+ down_read(&dev->parent->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
+
TRACE_RESUME(error);
return error;
}
+static void async_resume_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume_noirq(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async EARLY", error);
+ put_device(dev);
+}
+
+static int device_resume_noirq(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume_noirq, dev);
+ return 0;
+ }
+
+ return __device_resume_noirq(dev, pm_transition);
+}
+
/**
* dpm_resume_noirq - Execute "early resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -366,32 +395,35 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
+ pm_transition = state;
list_for_each_entry(dev, &dpm_list, power.entry)
if (dev->power.status > DPM_OFF) {
int error;
dev->power.status = DPM_OFF;
- error = device_resume_noirq(dev, state);
+ error = device_resume_noirq(dev);
if (error)
pm_dev_err(dev, state, " early", error);
}
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ down_read(&dev->parent->power.rwsem);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +458,37 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +502,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +513,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +528,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > here?
>
> And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> these routines were not intended to be called with interrupts disabled,
> but that requirement doesn't seem to be documented. And it isn't a
> natural requirement anyway.
OK, let's ask Ingo about that.
Ingo, is there any particular reason why completion_done() and
try_wait_for_completion() don't use spin_lock_irqsave() and
spin_unlock_irqrestore()?
Rafael
> > Sure they can. Control dependencies are trivial - it's called "branch
> > prediction", and everybody does it, and data dependencies don't exist on
> > many CPU architectures (even to the point of reading through a pointer
> > that you loaded).
>
> Wait a second. Are you saying that with code like this:
>
> if (x == 1)
> y = 5;
>
> the CPU may write to y before it has finished reading the value of x?
> And this write is visible to other CPUs, so that if x was initially 0
> and a second CPU sets x to 1, the second CPU may see y == 5 before it
> executes the write to x (whatever that may mean)?
No, the write really depends on x being 1 at any time before the comparison.
On the other hand x being != 0 during the comparison does not prevent the
write without proper locking or barriers.
Have a look at
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212
especially the Alpha part, which shows what can happen when dealing with
pointer accesses.
Christian
On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> On Tuesday 08 December 2009, Linus Torvalds wrote:
> >
> > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > Anyway, if we use an rwsem, it won't be checkable from interrupt context just
> > > as well.
> >
> > You can't do a lock() from an interrupt, but the unlocks should be
> > irq-safe.
> >
> > > Suppose we use rwsem and during suspend each child uses a down_read() on a
> > > parent and then the parent uses down_write() on itself. What if, whatever the
> > > reason, the parent is a bit early and does the down_write() before one of the
> > > children has a chance to do the down_read()? Aren't we toast?
> >
> > We're toast, but we're toast for a totally unrelated reason: it means that
> > you tried to resume a child before a parent, which would be a major bug to
> > begin with.
> >
> > Look, I even wrote out the comments, so let me repeat the code one more
> > time.
> >
> > - suspend time calling:
> > // This won't block, because we suspend nodes before parents
> > down_read(node->parent->lock);
> > // Do the part that may block asynchronously
> > async_schedule(do_usb_node_suspend, node);
> >
> > - resume time calling:
> > // This won't block, because we resume parents before children,
> > // and the children will take the read lock.
> > down_write(leaf->lock);
> > // Do the blocking part asynchronously
> > async_schedule(usb_node_resume, leaf);
> >
> > See? So when we take the parent lock for suspend, we are guaranteed to do
> > so _before_ the parent node itself suspends. And conversely, when we take
> > the parent lock (asynchronously) for resume, we're guaranteed to do that
> > _after_ the parent node has done its own down_write.
> >
> > And that all depends on just one trivial thing; that the suspend and
> > resume is called in the right order (children first vs parent first
> > respectively). And that is such a _major_ correctness issue that if that
> > isn't correct, your suspend isn't going to work _anyway_.
>
> Understood (I think).
>
> Let's try it, then. Below is the resume patch based on my previous one in this
> thread (I have only verified that it builds).
Ah, I need to check if dev->parent is not NULL before trying to lock it, but
apart from this it doesn't break things at least.
Rafael
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > That's not the way it should be done. Linus had children taking their
> > parents' locks during suspend, which is simple but leads to
> > difficulties.
>
> No it doesn't. Name them.
Well, one difficulty. It arises only because we are contemplating
having the PM core fire up the async tasks, rather than having the
drivers' suspend routines launch them (the way your original proposal
did -- the difficulty does not arise there).
Suppose A and B are unrelated devices and we need to impose the
off-tree constraint that A suspends after B. With children taking
their parent's lock, the way to prevent A from suspending too soon is
by having B's suspend routine acquire A's lock.
But B's suspend routine runs entirely in an async task, because that
task is started by the PM core and it does the method call. Hence by
the time B's suspend routine is called, A may already have begun
suspending -- it's too late to take A's lock. To make the locking
work, B would have to acquire A's lock _before_ B's async task starts.
Since the PM core is unaware of the off-tree dependency, there's no
simple way to make it work.
> > Instead, the PM core should do a down_write() on each device before
> > starting the device's async suspend routine, and an up_write() when the
> > routine finishes.
>
> No you should NOT do that. If you do that, you serialize the suspend
> incorrectly and much too early. IOW, think a topology like this:
>
> 	a -> b -> c
> 	 \
> 	  > d -> e
>
> where you'd want to suspend 'c' and 'e' asynchronously. If we do a
> 'down_write()' on b, then we'll delay until 'c' has suspended, and if we
> have ordered the nodes in the obvious depth-first order, we'll walk the PM
> device list in the order:
>
> c b e d a
>
> and now we'll serialize on 'b', waiting for 'c' to suspend. Which we do
> _not_ want to do, because the whole point was to suspend 'c' and 'e'
> together.
You misunderstand. The suspend algorithm will look like this:
	dpm_suspend()
	{
		list_for_each_entry_reverse(dpm_list, dev) {
			down_write(dev->lock);
			async_schedule(device_suspend, dev);
		}
	}

	device_suspend(dev)
	{
		device_for_each_child(dev, child) {
			down_read(child->lock);
			up_read(child->lock);
		}
		dev->suspend(dev);	/* May do off-tree down+up pairs */
		up_write(dev->lock);
	}
With completions instead of rwsems, the down_write() changes to
init_completion(), the up_write() changes to complete_all(), and the
down_read()+up_read() pairs change to wait_for_completion().
So 'b' will wait for 'c' to suspend, as it must, but 'e' won't wait for
anything.
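
As a rough illustration of the completion-based variant, here is a minimal userspace sketch. It is not kernel code: the completion is rebuilt from a pthread mutex and condition variable, and struct toy_dev, toy_device_suspend() and toy_dpm_suspend() are hypothetical names invented for this example. It assumes the list holds parents before children, as dpm_list does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete_all(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical toy device, not the kernel's struct device. */
struct toy_dev {
	struct toy_dev *parent;
	struct completion suspended;
	int order;		/* sequence number of the "suspend" call */
	pthread_t thread;
};

static struct toy_dev **toy_list;	/* parents before children */
static size_t toy_count;
static int toy_next_order;
static pthread_mutex_t toy_order_lock = PTHREAD_MUTEX_INITIALIZER;

static void *toy_device_suspend(void *data)
{
	struct toy_dev *dev = data;
	size_t i;

	/* Wait for every child of dev to finish suspending. */
	for (i = 0; i < toy_count; i++)
		if (toy_list[i]->parent == dev)
			wait_for_completion(&toy_list[i]->suspended);

	/* Stand-in for dev->suspend(dev): just record the order. */
	pthread_mutex_lock(&toy_order_lock);
	dev->order = toy_next_order++;
	pthread_mutex_unlock(&toy_order_lock);

	complete_all(&dev->suspended);
	return NULL;
}

static void toy_dpm_suspend(struct toy_dev **devs, size_t n)
{
	size_t i;
	int err;

	toy_list = devs;
	toy_count = n;
	toy_next_order = 0;

	/* Walk the list in reverse, so children are started first. */
	for (i = n; i-- > 0; ) {
		init_completion(&devs[i]->suspended);
		err = pthread_create(&devs[i]->thread, NULL,
				     toy_device_suspend, devs[i]);
		assert(err == 0);
	}
	for (i = 0; i < n; i++)
		pthread_join(devs[i]->thread, NULL);
}
```

With the a/b/c/d/e topology quoted above, 'b' records its order only after 'c', and 'a' only after both 'b' and 'd', while 'c' and 'e' never wait for each other.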
> > Parents should, at the start of their async routine,
> > do down_read() on each of their children plus whatever other devices
> > they need to wait for. The core can do the waiting for children part
> > and the driver's suspend routine can handle any other waiting.
>
> Why?
>
> That just complicates things. Compare to my simple locking scheme I've
> quoted several times.
It is a little more complicated in that it involves explicitly
iterating over children. But it is simpler in that it can use
completions instead of rwsems and it avoids the off-tree dependency
problem described above.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > Sure they can. Control dependencies are trivial - it's called "branch
> > prediction", and everybody does it, and data dependencies don't exist on
> > many CPU architectures (even to the point of reading through a pointer
> > that you loaded).
>
> Wait a second. Are you saying that with code like this:
>
> if (x == 1)
> y = 5;
>
> the CPU may write to y before it has finished reading the value of x?
Well, in a way. The branch may have been predicted, and the CPU can
_internally_ have done the 'y=5' thing into a write buffer before it even
did the read.
Some time later it will have to _verify_ the prediction and then perhaps
kill the write before it makes it to a data structure that is visible to
others, but internally from the CPU standpoint, yes, the write could have
happened before the read.
Now, whether that write is "before" or "after" the read is debatable. But
one way of looking at it is certainly that the write took place earlier,
and the read might have just caused it to be undone.
And there are real effects of this - looking at the bus, you might have a
bus transaction to get the cacheline that contains 'y' for exclusive
access happen _before_ the bus transaction that reads in the value of 'x'
(but you'd never see the writeout of that '5' before).
> And this write is visible to other CPUs, so that if x was initially 0
> and a second CPU sets x to 1, the second CPU may see y == 5 before it
> executes the write to x (whatever that may mean)?
Well, yes and no. CPU1 above won't release the '5' until it has confirmed
the '1' (even if it does so by reading it late). But assuming the other
CPU also does speculation, then yes, the situation you describe could
happen. If the other CPU does
z = y;
x = 1;
then it's certainly possible that 'z' contains 5 at the end (even if both
x and y started out zero). Because now the read of 'y' on that other CPU
might be delayed, and the write of 'x' goes ahead, CPU1 sees the 1, and
commits its write of 5, so when CPU2 gets the cacheline, z will now
contain 5.
Is it likely? No. CPU microarchitectures aim to do reads early, and writes
late. Reads are on the critical path, writes can be buffered. But you can
basically get into "impossible" situations where a write that was _later_
in the instruction stream than a read (on CPU2, the 'store 1 to x' would
be after the load of 'y' from memory) could show up in the other order on
another CPU.
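
For anyone who wants to poke at this from userspace, here is a small sketch of the same two-CPU example written with C11 atomics and pthreads. That is an assumption made for the sake of a runnable example; the kernel uses its own barrier primitives rather than <stdatomic.h>. The function names cpu1(), cpu2() and run_litmus_once() are made up here. The release store to x paired with the acquire load of x is what rules out the "z == 5" outcome; with both weakened to memory_order_relaxed the language would no longer forbid it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int z;

/* "CPU1" in the discussion: if (x == 1) y = 5; */
static void *cpu1(void *unused)
{
	(void)unused;
	/* Acquire: the store to y is ordered after this load of x. */
	if (atomic_load_explicit(&x, memory_order_acquire) == 1)
		atomic_store_explicit(&y, 5, memory_order_relaxed);
	return NULL;
}

/* "CPU2" in the discussion: z = y; x = 1; */
static void *cpu2(void *unused)
{
	(void)unused;
	z = atomic_load_explicit(&y, memory_order_relaxed);
	/* Release: the load of y above is ordered before this store. */
	atomic_store_explicit(&x, 1, memory_order_release);
	return NULL;
}

/* Run both "CPUs" once from a clean state and return the value of z. */
static int run_litmus_once(void)
{
	pthread_t t1, t2;
	int err;

	atomic_store(&x, 0);
	atomic_store(&y, 0);
	z = -1;

	err = pthread_create(&t1, NULL, cpu1, NULL);
	assert(err == 0);
	err = pthread_create(&t2, NULL, cpu2, NULL);
	assert(err == 0);

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return z;
}
```

If cpu1() sees x == 1, the release/acquire pair orders cpu2()'s load of y before cpu1()'s store of 5, so the load cannot observe that store and z stays 0.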
Linus
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> these routines were not intended to be called with interrupts disabled,
> but that requirement doesn't seem to be documented. And it isn't a
> natural requirement anyway.
'complete()' is supposed to be callable from interrupts, but the waiting
ones aren't. But 'complete()' is all you should need to call from
interrupts, so that's fine.
So I think completions should work, if done right. That whole "make the
parent wait for all the children to complete" is fine in that sense. And
I'll happily take such an approach if my rwlock thing doesn't work.
Linus
On Tuesday 08 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> > On Tue, 8 Dec 2009, Alan Stern wrote:
> > >
> > > That's not the way it should be done. Linus had children taking their
> > > parents' locks during suspend, which is simple but leads to
> > > difficulties.
> >
> > No it doesn't. Name them.
>
> Well, one difficulty. It arises only because we are contemplating
> having the PM core fire up the async tasks, rather than having the
> drivers' suspend routines launch them (the way your original proposal
> did -- the difficulty does not arise there).
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B. With children taking
> their parent's lock, the way to prevent A from suspending too soon is
> by having B's suspend routine acquire A's lock.
>
> But B's suspend routine runs entirely in an async task, because that
> task is started by the PM core and it does the method call. Hence by
> the time B's suspend routine is called, A may already have begun
> suspending -- it's too late to take A's lock. To make the locking
> work, B would have to acquire A's lock _before_ B's async task starts.
> Since the PM core is unaware of the off-tree dependency, there's no
> simple way to make it work.
Do not set async_suspend for B and instead start your own async thread
from its suspend callback. The parent-children synchronization is done by the
core anyway (at least I'd do it that way), so the only thing you need to worry
about is the extra dependency.
> > That just complicates things. Compare to my simple locking scheme I've
> > quoted several times.
>
> It is a little more complicated in that it involves explicitly
> iterating over children. But it is simpler in that it can use
> completions instead of rwsems and it avoids the off-tree dependency
> problem described above.
I would be slightly more comfortable using completions, but the rwsem-based
approach is fine with me as well.
Rafael
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B.
Ah. Ok, I can imagine the off-tree constraints, but part of my "keep it
simple" was to simply not do them. If there are constraints that aren't
in the topology of the tree, then I simply don't think that async is worth
it in the first place.
> You misunderstand. The suspend algorithm will look like this:
>
> dpm_suspend()
> {
> list_for_each_entry_reverse(dpm_list, dev) {
> down_write(dev->lock);
> async_schedule(device_suspend, dev);
> }
> }
>
> device_suspend(dev)
> {
> device_for_each_child(dev, child) {
> down_read(child->lock);
> up_read(child->lock);
> }
> dev->suspend(dev); /* May do off-tree down+up pairs */
> up_write(dev->lock);
> }
Ok, so the above I think works (and see my previous email: I think
completions would be workable there too).
It's just that I think the "looping over children" is ugly, when I think
that by doing it the other way around you can make the code simpler and
only depend on the PM device list and a simple parent pointer access.
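
To make the "other way around" concrete, here is a minimal userspace sketch of the resume direction, assuming a toy rwsem built from a pthread mutex and condition variable. Everything prefixed toy_ is hypothetical and exists only for this example. Unlike pthread_rwlock_t, the toy rwsem may be released by a different thread from the one that took it, which the scheme relies on: the list walker takes the write lock and the async worker drops it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy rwsem; may be released by a thread other than the locker. */
struct toy_rwsem {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int readers;
	bool writer;
};

static void toy_init_rwsem(struct toy_rwsem *s)
{
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->cond, NULL);
	s->readers = 0;
	s->writer = false;
}

static void toy_down_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->writer)
		pthread_cond_wait(&s->cond, &s->lock);
	s->readers++;
	pthread_mutex_unlock(&s->lock);
}

static void toy_up_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	assert(s->readers > 0);
	s->readers--;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

static void toy_down_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->writer || s->readers > 0)
		pthread_cond_wait(&s->cond, &s->lock);
	s->writer = true;
	pthread_mutex_unlock(&s->lock);
}

static void toy_up_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->lock);
	assert(s->writer);
	s->writer = false;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

/* Hypothetical toy device: a parent pointer and a lock, nothing else. */
struct toy_dev {
	struct toy_dev *parent;
	struct toy_rwsem rwsem;
	int order;		/* sequence number of the "resume" call */
	pthread_t thread;
};

static int toy_next_order;
static pthread_mutex_t toy_order_lock = PTHREAD_MUTEX_INITIALIZER;

static void toy_dev_init(struct toy_dev *dev, struct toy_dev *parent)
{
	dev->parent = parent;
	dev->order = -1;
	toy_init_rwsem(&dev->rwsem);
}

static void *toy_async_resume(void *data)
{
	struct toy_dev *dev = data;

	/* Blocks until the parent has dropped its own write lock. */
	if (dev->parent)
		toy_down_read(&dev->parent->rwsem);

	pthread_mutex_lock(&toy_order_lock);
	dev->order = toy_next_order++;	/* stand-in for the resume callback */
	pthread_mutex_unlock(&toy_order_lock);

	if (dev->parent)
		toy_up_read(&dev->parent->rwsem);
	toy_up_write(&dev->rwsem);
	return NULL;
}

/* devs[] must list parents before children. */
static void toy_dpm_resume(struct toy_dev **devs, size_t n)
{
	size_t i;
	int err;

	toy_next_order = 0;
	for (i = 0; i < n; i++) {
		/* Won't block: no child has been started yet. */
		toy_down_write(&devs[i]->rwsem);
		err = pthread_create(&devs[i]->thread, NULL,
				     toy_async_resume, devs[i]);
		assert(err == 0);
	}
	for (i = 0; i < n; i++)
		pthread_join(devs[i]->thread, NULL);
}
```

The only per-device state touched is the device's own lock and its parent pointer; no child iteration is needed.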
I also think that you are wrong that the above somehow protects against
non-topological dependencies. If the device you want to delay your own
suspend for is after you in the list, the down_read() on that
may succeed simply because it hasn't even done its down_write() yet and
you got scheduled early.
But I guess you could do that by walking the list twice (first to lock
them all, then to actually call the suspend function). That whole
two-phase thing, except the first phase _only_ locks, and doesn't do any
callbacks.
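
A rough userspace sketch of that two-pass walk, under the assumption that the "lock" is modelled as a completion and that each device has at most one extra device it must wait for. The names struct toy_dev, wait_for and toy_dpm_suspend_two_phase() are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete_all(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical toy device with one optional off-tree dependency. */
struct toy_dev {
	struct toy_dev *wait_for;	/* must suspend before we do, or NULL */
	struct completion suspended;
	int order;			/* sequence number of the "suspend" */
	pthread_t thread;
};

static int toy_next_order;
static pthread_mutex_t toy_order_lock = PTHREAD_MUTEX_INITIALIZER;

static void *toy_device_suspend(void *data)
{
	struct toy_dev *dev = data;

	/* Safe even if wait_for is started later: phase 1 already armed it. */
	if (dev->wait_for)
		wait_for_completion(&dev->wait_for->suspended);

	pthread_mutex_lock(&toy_order_lock);
	dev->order = toy_next_order++;	/* stand-in for the suspend callback */
	pthread_mutex_unlock(&toy_order_lock);

	complete_all(&dev->suspended);
	return NULL;
}

static void toy_dpm_suspend_two_phase(struct toy_dev **devs, size_t n)
{
	size_t i;
	int err;

	toy_next_order = 0;

	/* Phase 1: only "lock" every device, no callbacks. */
	for (i = n; i-- > 0; )
		init_completion(&devs[i]->suspended);

	/* Phase 2: start the suspend work. */
	for (i = n; i-- > 0; ) {
		err = pthread_create(&devs[i]->thread, NULL,
				     toy_device_suspend, devs[i]);
		assert(err == 0);
	}
	for (i = 0; i < n; i++)
		pthread_join(devs[i]->thread, NULL);
}
```

Because every completion is armed before any worker starts, a device whose dependency is handled later in the walk still blocks until that dependency has actually suspended.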
Linus
On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> > On Tuesday 08 December 2009, Linus Torvalds wrote:
> > >
> > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > >
> > > > Anyway, if we use an rwsem, it won't be checkable from interrupt context just
> > > > as well.
> > >
> > > You can't do a lock() from an interrupt, but the unlocks should be
> > > irq-safe.
> > >
> > > > Suppose we use rwsem and during suspend each child uses a down_read() on a
> > > > parent and then the parent uses down_write() on itself. What if, whatever the
> > > > reason, the parent is a bit early and does the down_write() before one of the
> > > > children has a chance to do the down_read()? Aren't we toast?
> > >
> > > > We're toast, but we're toast for a totally unrelated reason: it means that
> > > you tried to resume a child before a parent, which would be a major bug to
> > > begin with.
> > >
> > > Look, I even wrote out the comments, so let me repeat the code one more
> > > time.
> > >
> > > - suspend time calling:
> > > // This won't block, because we suspend nodes before parents
> > > down_read(node->parent->lock);
> > > // Do the part that may block asynchronously
> > > async_schedule(do_usb_node_suspend, node);
> > >
> > > - resume time calling:
> > > // This won't block, because we resume parents before children,
> > > // and the children will take the read lock.
> > > down_write(leaf->lock);
> > > // Do the blocking part asynchronously
> > > async_schedule(usb_node_resume, leaf);
> > >
> > > See? So when we take the parent lock for suspend, we are guaranteed to do
> > > so _before_ the parent node itself suspends. And conversely, when we take
> > > the parent lock (asynchronously) for resume, we're guaranteed to do that
> > > _after_ the parent node has done its own down_write.
> > >
> > > And that all depends on just one trivial thing; that the suspend and
> > > resume is called in the right order (children first vs parent first
> > > respectively). And that is such a _major_ correctness issue that if that
> > > isn't correct, your suspend isn't going to work _anyway_.
> >
> > Understood (I think).
> >
> > Let's try it, then. Below is the resume patch based on my previous one in this
> > thread (I have only verified that it builds).
>
> Ah, I need to check if dev->parent is not NULL before trying to lock it, but
> apart from this it doesn't break things at least.
For completeness, below is the full async suspend/resume patch with rwlocks,
which has been (very slightly) tested and doesn't seem to break things.
[Note to Alan: lockdep doesn't seem to complain about the unannotated nested
locks.]
Thanks,
Rafael
---
drivers/base/power/main.c | 195 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 +
include/linux/pm.h | 3
include/linux/resume-trace.h | 7 +
4 files changed, 194 insertions(+), 17 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/rwsem.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct rw_semaphore rwsem;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_rwsem(&dev->power.rwsem);
pm_runtime_init(dev);
}
@@ -334,25 +337,53 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
+
TRACE_RESUME(error);
return error;
}
+static void async_resume_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume_noirq(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async EARLY", error);
+ put_device(dev);
+}
+
+static int device_resume_noirq(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume_noirq, dev);
+ return 0;
+ }
+
+ return __device_resume_noirq(dev, pm_transition);
+}
+
/**
* dpm_resume_noirq - Execute "early resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -366,32 +397,36 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
+ pm_transition = state;
list_for_each_entry(dev, &dpm_list, power.entry)
if (dev->power.status > DPM_OFF) {
int error;
dev->power.status = DPM_OFF;
- error = device_resume_noirq(dev, state);
+ error = device_resume_noirq(dev);
if (error)
pm_dev_err(dev, state, " early", error);
}
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +461,38 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +506,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +517,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +532,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
@@ -533,6 +597,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}
+static atomic_t async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -580,20 +646,59 @@ static pm_message_t resume_event(pm_mess
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_suspend_noirq(struct device *dev, pm_message_t state)
+static int __device_suspend_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
- if (!dev->bus)
- return 0;
+ down_write(&dev->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "LATE ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
+
+ up_write(&dev->power.rwsem);
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+
return error;
}
+static void async_suspend_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ dev->power.status = DPM_OFF;
+ return;
+ }
+
+ error = __device_suspend_noirq(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async LATE", error);
+ dev->power.status = DPM_OFF;
+ atomic_set(&async_error, error);
+ }
+ put_device(dev);
+}
+
+static int device_suspend_noirq(struct device *dev)
+{
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend_noirq, dev);
+ return 0;
+ }
+
+ return __device_suspend_noirq(dev, pm_transition);
+}
+
/**
* dpm_suspend_noirq - Execute "late suspend" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -608,15 +713,21 @@ int dpm_suspend_noirq(pm_message_t state
suspend_device_irqs();
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
- error = device_suspend_noirq(dev, state);
+ dev->power.status = DPM_OFF_IRQ;
+ error = device_suspend_noirq(dev);
if (error) {
pm_dev_err(dev, state, " late", error);
+ dev->power.status = DPM_OFF;
break;
}
- dev->power.status = DPM_OFF_IRQ;
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
if (error)
dpm_resume_noirq(resume_event(state));
return error;
@@ -628,10 +739,11 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
+ down_write(&dev->power.rwsem);
down(&dev->sem);
if (dev->class) {
@@ -668,10 +780,50 @@ static int device_suspend(struct device
}
End:
up(&dev->sem);
+ up_write(&dev->power.rwsem);
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ dev->power.status = DPM_SUSPENDING;
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ dev->power.status = DPM_SUSPENDING;
+ atomic_set(&async_error, error);
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,10 +835,12 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ dev->power.status = DPM_OFF;
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -694,16 +848,22 @@ static int dpm_suspend(pm_message_t stat
mutex_lock(&dpm_list_mtx);
if (error) {
pm_dev_err(dev, state, "", error);
+ dev->power.status = DPM_SUSPENDING;
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = atomic_read(&async_error);
return error;
}
@@ -762,6 +922,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ atomic_set(&async_error, 0);
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> > On Tuesday 08 December 2009, Rafael J. Wysocki wrote:
> > > On Tuesday 08 December 2009, Linus Torvalds wrote:
> > > >
> > > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > > >
> > > > > Anyway, if we use an rwsem, it won't be checkable from interrupt context just
> > > > > as well.
> > > >
> > > > You can't do a lock() from an interrupt, but the unlocks should be
> > > > irq-safe.
> > > >
> > > > > Suppose we use rwsem and during suspend each child uses a down_read() on a
> > > > > parent and then the parent uses down_write() on itself. What if, whatever the
> > > > > reason, the parent is a bit early and does the down_write() before one of the
> > > > > children has a chance to do the down_read()? Aren't we toast?
> > > >
> > > > We're toast, but we're toast for a totally unrelated reason: it means that
> > > > you tried to resume a child before a parent, which would be a major bug to
> > > > begin with.
> > > >
> > > > Look, I even wrote out the comments, so let me repeat the code one more
> > > > time.
> > > >
> > > > - suspend time calling:
> > > > // This won't block, because we suspend nodes before parents
> > > > down_read(node->parent->lock);
> > > > // Do the part that may block asynchronously
> > > > async_schedule(do_usb_node_suspend, node);
> > > >
> > > > - resume time calling:
> > > > // This won't block, because we resume parents before children,
> > > > // and the children will take the read lock.
> > > > down_write(leaf->lock);
> > > > // Do the blocking part asynchronously
> > > > async_schedule(usb_node_resume, leaf);
> > > >
> > > > See? So when we take the parent lock for suspend, we are guaranteed to do
> > > > so _before_ the parent node itself suspends. And conversely, when we take
> > > > the parent lock (asynchronously) for resume, we're guaranteed to do that
> > > > _after_ the parent node has done its own down_write.
> > > >
> > > > And that all depends on just one trivial thing; that the suspend and
> > > > resume is called in the right order (children first vs parent first
> > > > respectively). And that is such a _major_ correctness issue that if that
> > > > isn't correct, your suspend isn't going to work _anyway_.
> > >
> > > Understood (I think).
> > >
> > > Let's try it, then. Below is the resume patch based on my previous one in this
> > > thread (I have only verified that it builds).
> >
> > Ah, I need to check if dev->parent is not NULL before trying to lock it, but
> > apart from this it doesn't break things at least.
>
> For completeness, below is the full async suspend/resume patch with rwlocks,
> that has been (very slightly) tested and doesn't seem to break things.
>
> [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> locks.]
BTW, I can easily change it so that it uses completions for synchronization,
but I'm not sure if that's worth spending time on, so please let me know.
Rafael
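The parent/child locking order discussed above can be tried out in userspace.
What follows is only a minimal sketch under stated assumptions, not the code
from the patch: struct rwsem here is a hand-rolled stand-in built on a pthread
mutex and condition variable (a pthread_rwlock_t must be released by the thread
that took it, which the scheme does not do), and the hub/port devices, order[]
and suspend_tree() are hypothetical names. The main thread takes the parent's
read lock before starting each child's task, and every task takes its own write
lock before "suspending", so the parent cannot proceed until both children have
dropped their read locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace stand-in for a kernel rwsem. Unlike pthread_rwlock_t, it may
 * be released by a thread other than the one that acquired it. */
struct rwsem {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	int readers, writer;
};

struct device {
	struct device *parent;
	struct rwsem lock;
};

static struct device *order[3];	/* devices in the order they suspended */
static atomic_int logged;

static void init_rwsem(struct rwsem *s)
{
	pthread_mutex_init(&s->mtx, NULL);
	pthread_cond_init(&s->cond, NULL);
	s->readers = s->writer = 0;
}

static void down_read(struct rwsem *s)
{
	pthread_mutex_lock(&s->mtx);
	while (s->writer)
		pthread_cond_wait(&s->cond, &s->mtx);
	s->readers++;
	pthread_mutex_unlock(&s->mtx);
}

static void up_read(struct rwsem *s)
{
	pthread_mutex_lock(&s->mtx);
	s->readers--;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->mtx);
}

static void down_write(struct rwsem *s)
{
	pthread_mutex_lock(&s->mtx);
	while (s->writer || s->readers)
		pthread_cond_wait(&s->cond, &s->mtx);
	s->writer = 1;
	pthread_mutex_unlock(&s->mtx);
}

static void up_write(struct rwsem *s)
{
	pthread_mutex_lock(&s->mtx);
	s->writer = 0;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->mtx);
}

/* Async part: blocks until every child has dropped its read lock on us. */
static void *async_suspend(void *data)
{
	struct device *dev = data;
	down_write(&dev->lock);
	order[atomic_fetch_add(&logged, 1)] = dev;	/* "suspend callback" */
	up_write(&dev->lock);
	if (dev->parent)
		up_read(&dev->parent->lock);
	return NULL;
}

/* Returns 1 if the parent suspended last, i.e. after both children. */
int suspend_tree(void)
{
	struct device hub = { NULL };
	struct device port[2] = { { &hub }, { &hub } };
	struct device *dpm_list[3] = { &hub, &port[0], &port[1] };
	pthread_t tid[3];
	int i, err;

	atomic_store(&logged, 0);
	for (i = 0; i < 3; i++)
		init_rwsem(&dpm_list[i]->lock);
	/* Walk the list in reverse: children before parents. */
	for (i = 2; i >= 0; i--) {
		struct device *dev = dpm_list[i];
		if (dev->parent)
			down_read(&dev->parent->lock);	/* does not block */
		err = pthread_create(&tid[i], NULL, async_suspend, dev);
		assert(err == 0);
	}
	for (i = 0; i < 3; i++)
		pthread_join(tid[i], NULL);
	return order[2] == &hub;
}
```

Calling suspend_tree() from a small main() should return 1 on every run: the
two ports may finish in either order, but the hub's down_write() cannot succeed
before both read locks are gone.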
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > these routines were not intended to be called with interrupts disabled,
> > but that requirement doesn't seem to be documented. And it isn't a
> > natural requirement anyway.
>
> 'complete()' is supposed to be callable from interrupts, but the waiting
> ones aren't. But 'complete()' is all you should need to call from
> interrupts, so that's fine.
And try_wait_for_completion()? The fact that it doesn't block makes it
interrupt-safe. What's the point of having an interrupt-safe routine
that you can't call from within interrupt handlers?
Even if nobody uses it that way now, there's no guarantee somebody
won't attempt it in the future.
> So I think completions should work, if done right. That whole "make the
> parent wait for all the children to complete" is fine in that sense. And
> I'll happily take such an approach if my rwlock thing doesn't work.
In principle the two approaches could be combined: Add an rwsem for use
by children and a completion for off-tree[*] use. But that would
certainly be overkill. Looping over children doesn't take a
tremendous amount of time compared to a full system suspend.
Alan Stern
[*] "Off-tree" isn't really an appropriate term; these devices aren't
"off" the tree. "Non-tree" would be better.
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > Well, one difficulty. It arises only because we are contemplating
> > having the PM core fire up the async tasks, rather than having the
> > drivers' suspend routines launch them (the way your original proposal
> > did -- the difficulty does not arise there).
> >
> > Suppose A and B are unrelated devices and we need to impose the
> > off-tree constraint that A suspends after B. With children taking
> > their parent's lock, the way to prevent A from suspending too soon is
> > by having B's suspend routine acquire A's lock.
> >
> > But B's suspend routine runs entirely in an async task, because that
> > task is started by the PM core and it does the method call. Hence by
> > the time B's suspend routine is called, A may already have begun
> > suspending -- it's too late to take A's lock. To make the locking
> > work, B would have to acquire A's lock _before_ B's async task starts.
> > Since the PM core is unaware of the off-tree dependency, there's no
> > simple way to make it work.
>
> Do not set async_suspend for B and instead start your own async thread
> from its suspend callback. The parent-children synchronization is done by the
> core anyway (at least I'd do it that way), so the only thing you need to worry
> about is the extra dependency.
I don't like that because it introduces "artificial" dependencies: It
makes B depend on all the preceding synchronous suspends, even totally
unrelated ones. But yes, it would work.
> I would be slightly more comfortable using completions, but the rwsem-based
> approach is fine with me as well.
On the principle of making things as easy and foolproof as possible for
driver authors, I also favor completions since it makes dealing with
non-tree dependencies easier.
However either way would be okay. I do have to handle some non-tree
dependencies in USB, but oddly enough they affect only resume, not
suspend. So this "who starts the async task" issue doesn't apply.
Alan Stern
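The completion-based alternative for a non-tree dependency can be sketched in
userspace as well. This is an illustration under stated assumptions, not the
PM core's implementation: struct completion, init_completion(), complete_all()
and wait_for_completion() below are small pthread-based stand-ins that only
borrow the kernel names, and the wait_for pointer, order[] and
suspend_with_dependency() are hypothetical. Both devices are assumed to be
asynchronous. The point it tries to show is that the wait can happen inside
A's async task, so nothing has to be acquired before the tasks are started.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	int done;
};

struct device {
	struct device *wait_for;	/* hypothetical non-tree dependency */
	struct completion suspended;
};

static struct device *order[2];	/* devices in the order they suspended */
static atomic_int logged;

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->mtx, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

static void complete_all(struct completion *c)
{
	pthread_mutex_lock(&c->mtx);
	c->done = 1;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->mtx);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->mtx);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->mtx);
	pthread_mutex_unlock(&c->mtx);
}

static void *async_suspend(void *data)
{
	struct device *dev = data;

	/* The wait happens inside the async task, after it was started. */
	if (dev->wait_for)
		wait_for_completion(&dev->wait_for->suspended);
	order[atomic_fetch_add(&logged, 1)] = dev;	/* "suspend callback" */
	complete_all(&dev->suspended);
	return NULL;
}

/* Returns 1 if B suspended before A even though A's task started first. */
int suspend_with_dependency(void)
{
	struct device b = { NULL };
	struct device a = { &b };
	pthread_t ta, tb;
	int err;

	atomic_store(&logged, 0);
	init_completion(&a.suspended);
	init_completion(&b.suspended);
	err = pthread_create(&ta, NULL, async_suspend, &a);
	assert(err == 0);
	err = pthread_create(&tb, NULL, async_suspend, &b);
	assert(err == 0);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	return order[0] == &b && order[1] == &a;
}
```

In this sketch A's task is deliberately started before B's, and A still
suspends second, because a completion starts out "not done" and needs no
setup by the dependent device.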
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> It's just that I think the "looping over children" is ugly, when I think
> that by doing it the other way around you can make the code simpler and
> only depend on the PM device list and a simple parent pointer access.
I agree that it is uglier. The only advantage is in handling
asynchronous non-tree suspend dependencies, of which we probably won't
have very many. In fact, I don't know of _any_ offhand.
Interestingly, this non-tree dependency problem does not affect resume.
> I also think that you are wrong that the above somehow protects against
> non-topological dependencies. If the device you want to delay your own
> suspend for is after you in the list, the down_read() on that
> may succeed simply because it hasn't even done its down_write() yet and
> you got scheduled early.
You mean, if A comes before B in the list and A must suspend after B?
Then A's down_read() on B _can't_ occur before B's down_write() on
itself. The down_write() on B happens before the
list_for_each_entry_reverse() iteration reaches A; it even happens
before B's async task is launched.
> But I guess you could do that by walking the list twice (first to lock
> them all, then to actually call the suspend function). That whole
> two-phase thing, except the first phase _only_ locks, and doesn't do any
> callbacks.
Not necessary.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> You mean, if A comes before B in the list and A must suspend after B?
But if they are not topologically ordered, then A wouldn't necessarily be
before B on the list in the first place.
Of course, if we've mucked with the list by hand and made sure the
ordering is ok, then that's a different issue. But your whole point seemed
to be that the device could impose its own ordering in its suspend
callback, which is not true on its own without external ordering.
Linus
* Rafael J. Wysocki <[email protected]> wrote:
> On Tuesday 08 December 2009, Alan Stern wrote:
> > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > here?
> >
> > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > these routines were not intended to be called with interrupts disabled,
> > but that requirement doesn't seem to be documented. And it isn't a
> > natural requirement anyway.
>
> OK, let's ask Ingo about that.
>
> Ingo, is there any particular reason why completion_done() and
> try_wait_for_completion() don't use spin_lock_irqsave() and
> spin_unlock_irqrestore()?
that's a bug that should be fixed - all the wakeup side (and atomic)
variants of the completion API should be irq safe.
It appears that these new completion APIs were added via the XFS tree
about a year ago:
39d2f1a: [XFS] extend completions to provide XFS object flush requirements
Please Cc: scheduler folks to all scheduler patches.
Ingo
On Tue, Dec 08, 2009 at 09:35:59PM -0500, Alan Stern wrote:
> On Tue, 8 Dec 2009, Linus Torvalds wrote:
> > It's just that I think the "looping over children" is ugly, when I think
> > that by doing it the other way around you can make the code simpler and
> > only depend on the PM device list and a simple parent pointer access.
> I agree that it is uglier. The only advantage is in handling
> asynchronous non-tree suspend dependencies, of which we probably won't
> have very many. In fact, I don't know of _any_ offhand.
There's some potential for this in embedded audio - it wants to bring
down the entire embedded audio subsystem at once before the individual
devices (and their parents) get suspended since bringing them down out
of sync can result in audible artifacts. Depending on the system the
suspend may take a noticeable amount of time so it'd be nice to be able
to run it asynchronously, though we don't currently do so.
At the minute we get away with this mostly through not being able to
represent the cases that are likely to actually trip up over it.
> Interestingly, this non-tree dependency problem does not affect resume.
Embedded audio does potentially - the resume needs all the individual
devices in the subsystem and can take a substantial proportion of the
overall resume time. Currently we get away with a combination of
assuming that all the drivers are live when we decide to start resuming
them and using the ALSA userspace API to deal with bringing the resume
out of line, but it's not ideal.
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > You mean, if A comes before B in the list and A must suspend after B?
>
> But if they are not topologically ordered, then A wouldn't necessarily be
> before B on the list in the first place.
Okay, I see what you're getting at. Yes, this is quite true -- if A
doesn't precede B in dpm_list then A can't safely wait for B to
suspend. To put it another way, only list-compatible constraints are
feasible.
This shouldn't be a problem. If it were we'd be seeing it right now,
because A would _always_ suspend before B.
> Of course, if we've mucked with the list by hand and made sure the
> ordering is ok, then that's a different issue. But your whole point seemed
> to be that the device could impose its own ordering in its suspend
> callback, which is not true on its own without external ordering.
No, sorry for not making it clearer. I was assuming all along that the
non-tree constraints were compatible with the list ordering.
In fact these considerations already affect the USB resume operations,
even without asynchronous resume. The code relies on the fact that the
PCI layer registers sibling devices on a slot in order of increasing
function number. There's no guarantee this will remain true in the
future (it may already be wrong AFAIK), so putting in some explicit
list manipulation is the prudent thing to do.
Alan Stern
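The "list-compatible" rule above fits in a few lines of C. This is only a
sketch under stated assumptions: the array named dpm_list and the device names
"A", "B" and "C" are hypothetical stand-ins for the real list, and
suspend_wait_is_feasible() is not a kernel function. Since suspend walks the
list from tail to head, a device can safely wait for another one to suspend
only if it comes earlier in the list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registration order; parents come before their children. */
static const char *const dpm_list[] = { "A", "B", "C" };

/* Index of @name in dpm_list, or -1 if it is not on the list. */
static int dpm_index(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(dpm_list) / sizeof(dpm_list[0]); i++)
		if (strcmp(dpm_list[i], name) == 0)
			return (int)i;
	return -1;
}

/*
 * Suspend walks dpm_list from tail to head, so @waiter can safely wait for
 * @target to suspend only if @waiter comes earlier in the list.
 */
bool suspend_wait_is_feasible(const char *waiter, const char *target)
{
	int w = dpm_index(waiter), t = dpm_index(target);

	return w >= 0 && t >= 0 && w < t;
}
```

With this list, "A" may wait for "B" but "B" may not wait for "A"; a
constraint of the second kind would need the list itself to be reordered
first.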
On Wed, 9 Dec 2009, Alan Stern wrote:
>
> In fact these considerations already affect the USB resume operations,
> even without asynchronous resume. The code relies on the fact that the
> PCI layer registers sibling devices on a slot in order of increasing
> function number. There's no guarantee this will remain true in the
> future (it may already be wrong AFAIK), so putting in some explicit
> list manipulation is the prudent thing to do.
I do think we want to keep the slot ordering.
One of the silent issues that the device management code has always had is
the whole notion of naming stability. Now, udev and various fancy naming
schemes solve that at a higher level, but it is still the case that we
_really_ want basic things like your PCI controllers to show up in stable
order.
For example, it is _very_ inconvenient if things like PCI probing end up
allocating different bus numbers (or resource allocations) across reboots
even if the hardware hasn't been changed. Just from a debuggability
standpoint, that just ends up being a total disaster.
For example, we continually hit odd special cases where PCI resource
allocation has some unexplained problem because there is some motherboard
resource that is hidden and invisible to our allocator. They are rare in
the sense that it's usually just a couple of odd laptops or something, but
they are not rare in the sense that pretty much _every_ single time we
change some resource allocation logic, we find one or two machines that
have some issue.
Things like that would be total disasters if the core device layer then
ended up also not having well-defined ordering. This is why I don't want
to do asynchronous PCI device probing, for example (ie we probe the
hardware synchronously, the PCI driver sets it all up synchronously, and
the asynchronous portion is the non-PCI part if any - things like PHY
detection, disk spinup etc).
So async things are fine, but they have _huge_ disadvantages, and I'll
personally take reliability and a stable serial algorithm over an async
one as far as possible.
That's partly why I really did suggest that we do the async stuff purely in
the USB layer, rather than try to put it deeper in the device layer. And
if we do support it "natively" in the device layer like Rafael's latest
patch, I still think we should be very very nervous about making devices
async unless there is a measured - and very noticeable - advantage.
So I really don't want to push things any further than absolutely
necessary. I do not think that something like "embedded audio" is a reason
for async, for example.
Linus
On Wed, 9 Dec 2009, Mark Brown wrote:
> On Tue, Dec 08, 2009 at 09:35:59PM -0500, Alan Stern wrote:
> > On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> > > It's just that I think the "looping over children" is ugly, when I think
> > > that by doing it the other way around you can make the code simpler and
> > > only depend on the PM device list and a simple parent pointer access.
>
> > I agree that it is uglier. The only advantage is in handling
> > asynchronous non-tree suspend dependencies, of which we probably won't
> > have very many. In fact, I don't know of _any_ offhand.
>
> There's some potential for this in embedded audio - it wants to bring
> down the entire embedded audio subsystem at once before the individual
> devices (and their parents) get suspended since bringing them down out
> of sync can result in audible artifacts. Depending on the system the
> suspend may take a noticeable amount of time so it'd be nice to be able
> to run it asynchronously, though we don't currently do so.
For something like bringing down the entire embedded audio subsystem,
which isn't directly tied to a single device, you would probably be
better off doing it when the PM core broadcasts a suspend notification
(see register_pm_notifier() in include/linux/suspend.h). This occurs
before any devices are suspended, so synchronization isn't an issue.
> At the minute we get away with this mostly through not being able to
> represent the cases that are likely to actually trip up over it.
>
> > Interestingly, this non-tree dependency problem does not affect resume.
>
> Embedded audio does potentially - the resume needs all the individual
> devices in the subsystem and can take a substantial proportion of the
> overall resume time. Currently we get away with a combination of
> assuming that all the drivers are live when we decide to start resuming
> them and using the ALSA userspace API to deal with bringing the resume
> out of line, but it's not ideal.
You can do the same thing with the resume notifier.
Alan Stern
On Wed, 9 Dec 2009, Linus Torvalds wrote:
> That's partly why I really did suggest that we do the async stuff purely in
> the USB layer, rather than try to put it deeper in the device layer. And
> if we do support it "natively" in the device layer like Rafael's latest
> patch, I still think we should be very very nervous about making devices
> async unless there is a measured - and very noticeable - advantage.
Agreed. Arjan's measurements indicated that USB was one of the biggest
offenders; everything else other than the PS/2 mouse was much faster.
Given these results there isn't much incentive to do anything else
asynchronously.
(However other devices not present on Arjan's machine may be a
different story. Spinning up multiple external disks is a good example
-- although here it may be necessary for the driver to take charge,
because spinning up a disk requires a lot of power and doing too many
of them at the same time could be bad.)
Alan Stern
On Wed, Dec 09, 2009 at 10:49:56AM -0500, Alan Stern wrote:
> On Wed, 9 Dec 2009, Mark Brown wrote:
> > There's some potential for this in embedded audio - it wants to bring
> > down the entire embedded audio subsystem at once before the individual
> > devices (and their parents) get suspended since bringing them down out
> For something like bringing down the entire embedded audio subsystem,
> which isn't directly tied to a single device, you would probably be
> better off doing it when the PM core broadcasts a suspend notification
> (see register_pm_notifier() in include/linux/suspend.h). This occurs
> before any devices are suspended, so synchronization isn't an issue.
I'm not convinced that helps with the fact that the suspend may take a
long time - ideally we'd be able to start the suspend process off but
let other things carry on while it completes without having to worry
about something we're relying on getting suspended underneath us.
> > Embedded audio does potentially - the resume needs all the individual
> > overall resume time. Currently we get away with a combination of
> You can do the same thing with the resume notifier.
Similarly, the length of time the resume may take to complete means it'd
be nice to start as soon as we've got the devices and complete it at our
leisure. This is less pressing since we can tell the PM core we've
resumed but still block userspace.
On Wed, 9 Dec 2009, Mark Brown wrote:
> On Wed, Dec 09, 2009 at 10:49:56AM -0500, Alan Stern wrote:
> > On Wed, 9 Dec 2009, Mark Brown wrote:
>
> > > There's some potential for this in embedded audio - it wants to bring
> > > down the entire embedded audio subsystem at once before the individual
> > > devices (and their parents) get suspended since bringing them down out
>
> > For something like bringing down the entire embedded audio subsystem,
> > which isn't directly tied to a single device, you would probably be
> > better off doing it when the PM core broadcasts a suspend notification
> > (see register_pm_notifier() in include/linux/suspend.h). This occurs
> > before any devices are suspended, so synchronization isn't an issue.
>
> I'm not convinced that helps with the fact that the suspend may take a
> long time - ideally we'd be able to start the suspend process off but
> let other things carry on while it completes without having to worry
> about something we're relying on getting suspended underneath us.
The suspend procedure is oriented around device structures, and what
you're talking about isn't. It's something separate which has to be
finished before _any_ of the audio devices are suspended.
How long does it take to bring down the entire embedded audio
subsystem? And how critical is the timing for typical systems?
Alan Stern
On Wed, Dec 09, 2009 at 11:23:00AM -0500, Alan Stern wrote:
> On Wed, 9 Dec 2009, Mark Brown wrote:
> > I'm not convinced that helps with the fact that the suspend may take a
> > long time - ideally we'd be able to start the suspend process off but
> > let other things carry on while it completes without having to worry
> > about something we're relying on getting suspended underneath us.
> The suspend procedure is oriented around device structures, and what
> you're talking about isn't. It's something separate which has to be
> finished before _any_ of the audio devices are suspended.
In this context the "subsystem" actually has a struct device associated
with it so does appear in the device flow.
> How long does it take to bring down the entire embedded audio
> subsystem? And how critical is the timing for typical systems?
Worst case is about a second for both resume and suspend which means two
seconds total but it's very hardware dependent.
The latency budgets for suspend and resume are both zero in an ideal
world: users want to be able to suspend as much as possible, which means
they'd like it to take no perceptible time at the human level. Some
hardware is at the point where that's getting realistic but the folks on
older hardware still want to get as close to that as they can.
On Wed, 9 Dec 2009, Mark Brown wrote:
> > How long does it take to bring down the entire embedded audio
> > subsystem? And how critical is the timing for typical systems?
>
> Worst case is about a second for both resume and suspend which means two
> seconds total but it's very hardware dependent.
I would seriously suggest just looking at the code itself.
Maybe the code is just plain sh*t? If we're talking embedded audio, we're
generally talking SoC chips (maybe some external audio daughtercard), and
quite frankly, it sounds to me like you're just wasting your own time.
There is no way that kind of hardware really needs that much time.
We should not design the device infrastructure for crap coding.
Now, I can easily see one-second delays in code that simply has never been
thought about or cared about. We used to have things like that in the
serial code where just probing for non-existent serial ports took half a
second per port because there was a timeout.
But christ, using that as an argument for "we should do things
asynchronously" sounds like a crazy idea. Why not just take a hard look at
the driver in question, asking hard questions like "does it really need to
do something horrible like that"?
Because bad coding is much more likely to be the real reason.
Linus
On Wed, 9 Dec 2009, Mark Brown wrote:
> > How long does it take to bring down the entire embedded audio
> > subsystem? And how critical is the timing for typical systems?
>
> Worst case is about a second for both resume and suspend which means two
> seconds total but it's very hardware dependent.
A second seems awfully long. What happens if audio isn't being played
when the suspend occurs? Can't you shorten things with no artifacts in
that case?
If audio _is_ being played when a suspend occurs, users probably don't
mind audible artifacts. In fact, they probably expect some.
Alan Stern
On Wed, 9 Dec 2009, Alan Stern wrote:
>
> If audio _is_ being played when a suspend occurs, users probably don't
> mind audible artifacts. In fact, they probably expect some.
I'd say it's physically impossible not to get them. If you're really
suspending your audio hardware, it _will_ be quiet ;)
I suspect somebody is draining existing queues or something, or just
probing for an external analog part. Neither of which is really sensible
or absolutely required in an embedded suspend/resume kind of situation.
Especially for STR, just "leave all the data structures around, and just
stop the DMA engine" is often a perfectly fine solution - but drivers
don't do it, exactly because we've often had the mentality that you
re-initialize everything under the sun.
I can see _why_ a driver would do that ("we re-use the same code that we
use on close/open or module unload/reload"), but it doesn't change the
fact that it's stupid to do if you worry about latency.
And yeah, turning it async might hide the problem. But the code word there
is "hide" rather than "fix".
Linus
On Wed, Dec 09, 2009 at 08:57:32AM -0800, Linus Torvalds wrote:
> On Wed, 9 Dec 2009, Mark Brown wrote:
> > Worst case is about a second for both resume and suspend which means two
> > seconds total but it's very hardware dependent.
> I would seriously suggest just looking at the code itself.
> Maybe the code is just plain sh*t? If we're talking embedded audio, we're
> generally talking SoC chips (maybe some external audio daughtercard), and
Yes, usually this is a SoC plus one or more external devices handling
the mixed signal parts of things all soldered down onto a board.
> quite frankly, it sounds to me like you're just wasting your own time.
> There is no way that kind of hardware really needs that much time.
Some of the older hardware really does need that much time, sadly.
More recent hardware got that down much lower (into the low hundreds of
ms where it's much less of an issue but still present) and current
generations basically don't have the problem any more but for worst case
a second is a good approximation.
The problem comes when you've got audio outputs referenced to something
other than ground which used to happen because no negative supplies were
available in these systems. To bring these up from cold you need to
bring the outputs up to the reference level but if you do that by just
turning on the power you get an audible (often loud) noise in the output
from the square(ish) waveform that results which users don't find
acceptable.
The initial solution was to ramp the voltage on the outputs in such a
way that the waveform that appears on the outputs isn't audible, which
broadly boils down to ramping it slowly. People were very aware of the
problems so later generations of devices added features which allowed
this to happen much more quickly than the original implementations had,
but still noticeably slow in terms of the timescales people need.
Current generation hardware solves the problem by using charge pumps to
provide a negative supply, allowing ground referenced outputs which are
just a win all round for this and other reasons. They're fast enough to
allow the power up to be brought completely in line with the start of
the audio stream, taking this out of suspend and resume entirely.
> Now, I can easily see one-second delays in code that simply has never been
> thought about or cared about. We used to have things like that in the
> serial code where just probing for non-existent serial ports took half a
> second per port because there was a timeout.
It's a deliberate delay waiting for the voltages to ramp, there's plenty
of things that need to be fixed or optimised in the code but those that
are causing issues these days really are just explicitly inserted delays
waiting for things to happen in hardware that do actually take that long.
> Because bad coding is much more likely to be the real reason.
Would that it were - you wouldn't believe the amount of time that's been
spent over the years tuning for this.
On Wed, 9 Dec 2009, Mark Brown wrote:
>
> The problem comes when you've got audio outputs referenced to something
> other than ground which used to happen because no negative supplies were
> available in these systems. To bring these up from cold you need to
> bring the outputs up to the reference level but if you do that by just
> turning on the power you get an audible (often loud) noise in the output
> from the square(ish) waveform that results which users don't find
> acceptable.
Ouch. A second still sounds way too long - but whatever.
However, it sounds like the nice way to do that isn't by doing it
synchronously in the suspend/resume code itself, but simply ramping it
down (and up) from a timer. It would be asynchronous, but not because the
suspend itself is in any way asynchronous.
Done right, it might even result in a nice volume fade of the sound (ie if
the hw allows for it, stop the actual sound engine late on suspend, and
start it early on resume, so that sound works _while_ the whole reference
volume rampdown/up is going on)
Linus
On Wed, Dec 09, 2009 at 12:10:03PM -0500, Alan Stern wrote:
> On Wed, 9 Dec 2009, Mark Brown wrote:
> > Worst case is about a second for both resume and suspend which means two
> > seconds total but it's very hardware dependent.
> A second seems awfully long. What happens if audio isn't being played
> when the suspend occurs? Can't you shorten things with no artifacts in
> that case?
For the affected hardware the problem is basically the same with or
without audio being played. As I said in my reply to Linus this is
delays caused by ramping reference voltages. These delays are
sufficiently long that the reference voltages have to be maintained all
the time so that they don't delay the start of audio streams which means
that having or not having an audio stream at suspend time doesn't affect
the reference voltage ramps since we don't turn them off when not in
use. There is a win from other stuff having been shut off already, but
it's already being exploited.
On suspend the problem is the same as for resume - we need to ramp the
voltages quietly, this time down to zero. We want to make sure they're
actually at zero to ensure that the ramp at resume time starts from a
known hardware state.
On Wed, Dec 09, 2009 at 09:57:22AM -0800, Linus Torvalds wrote:
> On Wed, 9 Dec 2009, Mark Brown wrote:
> > The problem comes when you've got audio outputs referenced to something
> > other than ground which used to happen because no negative supplies were
> > available in these systems. To bring these up from cold you need to
> > bring the outputs up to the reference level but if you do that by just
> > turning on the power you get an audible (often loud) noise in the output
> > from the square(ish) waveform that results which users don't find
> > acceptable.
> Ouch. A second still sounds way too long - but whatever.
Yes, I think there's pretty much universal agreement on that :)
Hardware that needs a few hundred milliseconds is much more common at the
minute (and like I say current generation hardware is basically
unaffected), but it's the number I keep in mind when considering how bad
things might be.
> However, it sounds like the nice way to do that isn't by doing it
> synchronously in the suspend/resume code itself, but simply ramping it
> down (and up) from a timer. It would be asynchronous, but not because the
> suspend itself is in any way asynchronous.
We don't actually need a timer for most of this - generally the ramp is
done by charging or discharging a capacitor through a resistor so you
just set it going then wait, possibly in several stages with a little
bit twiddling in the middle to speed things up which could be done off a
timer.
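
To put rough numbers on the capacitor-through-a-resistor ramp, here is a
minimal sketch assuming an ideal first-order RC stage. The function names
and the 100 kOhm / 2.2 uF part values are made up for illustration and are
not taken from any particular codec.

```c
#include <assert.h>

/*
 * Time for a first-order RC ramp to reach `fraction` of its target level,
 * found by stepping dv/dt = (1 - v) / tau with a forward Euler integrator.
 * The target is normalised to 1.0. Returns a negative value on bad input.
 */
double rc_ramp_seconds(double r_ohms, double c_farads, double fraction)
{
	double tau = r_ohms * c_farads;	/* RC time constant in seconds */
	double dt = tau / 10000.0;	/* step small relative to tau */
	double v = 0.0;			/* output starts fully discharged */
	long steps = 0;

	if (tau <= 0.0 || fraction <= 0.0 || fraction >= 1.0)
		return -1.0;

	/* Cap the loop so rounding near 1.0 cannot spin forever. */
	while (v < fraction && steps < 1000000L) {
		v += (1.0 - v) * dt / tau;
		steps++;
	}
	return steps * dt;
}

/* Hypothetical part values: 100 kOhm and 2.2 uF give tau = 0.22 s. */
double example_ramp_seconds(void)
{
	double t = rc_ramp_seconds(100e3, 2.2e-6, 0.99);

	assert(t > 0.0);	/* the inputs above are valid */
	return t;
}
```

With those assumed values, reaching 99% of the reference takes about 4.6
time constants, i.e. roughly a second, which is the same order as the
worst case discussed in this thread.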
> Done right, it might even result in a nice volume fade of the sound (ie if
> the hw allows for it, stop the actual sound engine late on suspend, and
> start it early on resume, so that sound works _while_ the whole reference
> volume rampdown/up is going on)
The big issue with running off a partially ramped supply is that it can
upset the analogue components - for example, if an amplifier is trying
to handle a signal with an amplitude outside the supply range then it'll
clip. But sometimes that approach does work and it does get used.
For resume we're pretty much taking care of it already by moving the
resume out of the main device resume and using ALSA-specific stuff to
keep audio streams stopped until we're done but for suspend we don't
know the system is going down until the suspend starts and we do want to
make sure we got the analogue into a known poweroff state so that we can
control powerup properly.
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> > Wait a second. Are you saying that with code like this:
> >
> > 	if (x == 1)
> > 		y = 5;
> >
> > the CPU may write to y before it has finished reading the value of x?
> > And this write is visible to other CPUs, so that if x was initially 0
> > and a second CPU sets x to 1, the second CPU may see y == 5 before it
> > executes the write to x (whatever that may mean)?
>
> Well, yes and no. CPU1 above won't release the '5' until it has confirmed
> the '1' (even if it does so by reading it late). But assuming the other
> CPU also does speculation, then yes, the situation you describe could
> happen. If the other CPU does
>
> z = y;
> x = 1;
>
> then it's certainly possible that 'z' contains 5 at the end (even if both
> x and y started out zero). Because now the read of 'y' on that other CPU
> might be delayed, and the write of 'x' goes ahead, CPU1 sees the 1, and
> commits its write of 5, so when CPU2 gets the cacheline, z will now
> contain 5.
That could be attributed to reordering on CPU2, so let's take CPU2's
peculiarities out of the picture (initially everything is set to 0):
    CPU1                    CPU2
    ----                    ----
    if (x == 1)             z = y;
        y = 5;              mb();
                            x = 1;
This gets at the heart of the question: Can a write move up past a
control dependency? Similar questions apply to the two types of data
dependency:
    CPU1                    CPU2
    ----                    ----
    y = x + 4;              z = y;
                            mb();
                            x = 1;
(Initially p points to x, not y):
    CPU1                    CPU2
    ----                    ----
    *p = 5;                 z = y;
                            mb();
                            p = &y;
Can z end up equal to 5 in any of these examples?
Alan Stern
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> For completeness, below is the full async suspend/resume patch with rwlocks,
> that has been (very slightly) tested and doesn't seem to break things.
>
> [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> locks.]
I can't imagine why not. And wouldn't lockdep get confused by the fact
that in the async case, the rwsems are released by a different process
from the one that acquired them?
> Index: linux-2.6/drivers/base/power/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/main.c
> +++ linux-2.6/drivers/base/power/main.c
Should we have an attribute under /sys/power to disable async
suspend/resume? It would make testing easier and give people a way to
work around problems.
> @@ -334,25 +337,53 @@ static void pm_dev_err(struct device *de
> * The driver of @dev will not receive interrupts while this function is being
> * executed.
> */
> -static int device_resume_noirq(struct device *dev, pm_message_t state)
> +static int __device_resume_noirq(struct device *dev, pm_message_t state)
> {
Do you want to use async tasks in the late-suspend/early-resume stages?
I know that USB won't use it, not even for the PCI host controllers --
not unless the PCI core specifically wants it. Doing just the regular
suspend/resume stages may be enough.
> +static int device_resume_noirq(struct device *dev)
> +{
> + down_write(&dev->power.rwsem);
> +
> + if (dev->power.async_suspend && !pm_trace_is_enabled()) {
If the sysfs attribute exists, then maybe we _should_ allow async with
PM tracing enabled. I don't know; it's your decision.
atomic_set(&async_error, error);
}
> @@ -683,10 +835,12 @@ static int dpm_suspend(pm_message_t stat
>
> INIT_LIST_HEAD(&list);
> mutex_lock(&dpm_list_mtx);
> + pm_transition = state;
> while (!list_empty(&dpm_list)) {
> struct device *dev = to_device(dpm_list.prev);
>
> get_device(dev);
> + dev->power.status = DPM_OFF;
What's that for? dev->power.status is supposed to be DPM_SUSPENDING
until the suspend method is successfully completed.
> mutex_unlock(&dpm_list_mtx);
>
> error = device_suspend(dev, state);
> @@ -694,16 +848,22 @@ static int dpm_suspend(pm_message_t stat
> mutex_lock(&dpm_list_mtx);
> if (error) {
> pm_dev_err(dev, state, "", error);
> + dev->power.status = DPM_SUSPENDING;
And then this isn't needed.
> put_device(dev);
> break;
> }
> - dev->power.status = DPM_OFF;
This line has to be moved into __device_suspend(), even though it won't
be protected by dpm_list_mtx. The same sort of thing applies to
dpm_suspend_noirq() (although nothing needs to be moved if you don't
make it async).
The rest looks okay.
How about exporting a wait_for_device_to_resume() routine? Drivers
could call it for non-tree resume constraints:
void wait_for_device_to_resume(struct device *other)
{
	down_read(&other->power.rwsem);
	up_read(&other->power.rwsem);
}
Unfortunately there is no equivalent for non-tree suspend constraints.
Alan Stern
On Wed, 9 Dec 2009, Alan Stern wrote:
>
> That could be attributed to reordering on CPU2, so let's take CPU2's
> peculiarities out of the picture (initially everything is set to 0):
>
>     CPU1                    CPU2
>     ----                    ----
>     if (x == 1)             z = y;
>         y = 5;              mb();
>                             x = 1;
>
> This gets at the heart of the question: Can a write move up past a
> control dependency?
> [ .. ]
> Can z end up equal to 5 in any of these examples?
In any _practical_ microarchitecture I know of, the above will never
result in 'z' being 5, even though CPU1 doesn't really have a memory
barrier. But if I read the alpha memory ordering guarantees right, then at
least in theory you really can end up with z=5.
Let me write that as five events (with the things in brackets being what
the alpha memory ordering manual calls them):
- A is "read of x returns 1" on CPU1 [ P1:R(x,1) ]
- B is "write of value 5 to y" on CPU1 [ P1:W(y,5) ]
- C is "read of y returns 5" on CPU2 [ P2:R(y,5) ]
- D is "write of value 1 to x" on CPU2 [ P2:W(x,1) ]
- 'MB' is the mb() on CPU2 [ P2:MB ]
(The write of 'z' is irrelevant, we can think of it as a register, the end
result is the same).
And yes, if I read the alpha memory ordering rules correctly, you really
can end up with z=5, although I don't think you will ever find an alpha
_implementation_ that does it.
Why?
The alpha memory ordering literally defines ordering in two ways:
- "location access order". But that is _only_ defined per actual
location, so while 'x' can have a location access order specified by
seeing certain values, there is no "location access order" for two
different memory locations (x and y).
The alpha architecture manual uses "A << B" to say "event A" is before
"event B" when there is a defined ordering.
So in the example above, there is a location access ordering between
P2:W(x,1) << P1:R(x, 1)
and
P1:W(y,5) << P2:R(y,5)
ie you have D << A and B << C.
Good so far, but that doesn't define anything else: there's only
ordering between the pairs (D,A) and (B,C), nothing between them.
- "Processor issue order" for two instructions is _only_ defined by either
(a) memory barriers or (b) accesses to the _same_ locations. The alpha
architecture manual uses "A < B" to say that "event A" is before "event
B" in processor issue order.
So there is a "Processor issue order" on CPU2 due to the memory
barrier: P2:R(y,5) < P2:MB < P2:W(x,1), or put another way C < MB < D:
C < D.
Now, the question is, can we actually get the behaviour of reading 5 on
CPU2 (ie P2:R(y,5)), and that is only possible if we can find an ordering
that satisfies all the constraints. We have
D << A
B << C
C < D
and it seems to me that it is a possible situation: "B C D A"
really does satisfy all the constraints afaik.
So yes, according to the actual alpha architecture memory ordering rules,
you can see '5' from that first read of 'y'. DESPITE having a mb() on
CPU2.
In order to not see 5, you need to also specify "A < B", and the _only_
way to do that processor issue order specification is with a memory
barrier (or if the locations are the same, which they aren't).
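
The search for an ordering can be done mechanically. The sketch below is
only an illustration and not a model of the full alpha rules: it treats
both "<<" and "<" as "comes earlier in one global sequence", which is the
same simplification used above when looking for a sequence such as
"B C D A". All names in it are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The four memory events, lettered as in the text. */
enum event { EV_A, EV_B, EV_C, EV_D, EV_COUNT };

/* "first" must come before "second" in any acceptable order. */
struct before {
	enum event first;
	enum event second;
};

/* True if pos[] assigns each event a distinct slot 0..3. */
static bool is_permutation(const int pos[EV_COUNT])
{
	bool used[EV_COUNT] = { false };

	for (int e = 0; e < EV_COUNT; e++) {
		if (used[pos[e]])
			return false;
		used[pos[e]] = true;
	}
	return true;
}

/* Count total orders of A, B, C, D that respect every constraint. */
static int count_orders(const struct before *c, size_t n)
{
	int count = 0;

	assert(c != NULL || n == 0);

	/* Decode each number 0..255 as four base-4 digits, one slot per event. */
	for (int code = 0; code < 256; code++) {
		int pos[EV_COUNT];
		bool ok;

		for (int e = 0, rest = code; e < EV_COUNT; e++, rest /= 4)
			pos[e] = rest % 4;
		ok = is_permutation(pos);
		for (size_t i = 0; ok && i < n; i++)
			ok = pos[c[i].first] < pos[c[i].second];
		if (ok)
			count++;
	}
	return count;
}

/* D << A, B << C and C < D only: the constraints listed in the text. */
int orders_without_causality(void)
{
	static const struct before c[] = {
		{ EV_D, EV_A }, { EV_B, EV_C }, { EV_C, EV_D },
	};

	return count_orders(c, sizeof(c) / sizeof(c[0]));
}

/* The same constraints plus A < B, as a barrier on CPU1 would add. */
int orders_with_causality(void)
{
	static const struct before c[] = {
		{ EV_D, EV_A }, { EV_B, EV_C }, { EV_C, EV_D }, { EV_A, EV_B },
	};

	return count_orders(c, sizeof(c) / sizeof(c[0]));
}
```

Under these assumptions the first function finds exactly one order
(B C D A) and the second finds none, because adding A < B closes a cycle.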
"Causality" simply is nowhere in the officially defined alpha memory
ordering. The fact that we test 'x == 1' and conditionally do the write
simply doesn't enter the picture. I suspect you'd have a really hard time
not having causality in practice, but there _are_ things that can break
causality (value prediction etc), so it's not like you'd have to actually
violate physics of reality to do it.
IOW, you could at least in theory implement a CPU that does every
instruction speculatively in parallel, and then validates the end result
afterwards according to the architecture rules. And that CPU would require
the memory barrier on alpha.
(On x86, 'causality' is defined to be part of the memory ordering rules,
so on x86, you _do_ have a 'A < B' relationship. But not on alpha).
Linus
On Wednesday 09 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > > Well, one difficulty. It arises only because we are contemplating
> > > having the PM core fire up the async tasks, rather than having the
> > > drivers' suspend routines launch them (the way your original proposal
> > > did -- the difficulty does not arise there).
> > >
> > > Suppose A and B are unrelated devices and we need to impose the
> > > off-tree constraint that A suspends after B. With children taking
> > > their parent's lock, the way to prevent A from suspending too soon is
> > > by having B's suspend routine acquire A's lock.
> > >
> > > But B's suspend routine runs entirely in an async task, because that
> > > task is started by the PM core and it does the method call. Hence by
> > > the time B's suspend routine is called, A may already have begun
> > > suspending -- it's too late to take A's lock. To make the locking
> > > work, B would have to acquire A's lock _before_ B's async task starts.
> > > Since the PM core is unaware of the off-tree dependency, there's no
> > > simple way to make it work.
> >
> > Do not set async_suspend for B and instead start your own async thread
> > from its suspend callback. The parent-children synchronization is done by the
> > core anyway (at least I'd do it that way), so the only thing you need to worry
> > about is the extra dependency.
>
> I don't like that because it introduces "artificial" dependencies: It
> makes B depend on all the preceding synchronous suspends, even totally
> unrelated ones. But yes, it would work.
Well, unfortunately, it wouldn't, because (at least in the context of my last
patch) the core would release the rwsems as soon as your suspend had
returned. So you'd have to make your suspend wait for the async thread and
that would make it pointless. So scratch that, it wasn't a good idea at all.
This leaves us with basically two options, where the first one is to use
rwsems in a way that you've proposed (with iterating over children), and the
second one is to use completions. In my opinion rwsems don't give us any
advantage in this case, so I'd very much prefer to use completions.
Rafael
On Wednesday 09 December 2009, Alan Stern wrote:
> On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> > For completeness, below is the full async suspend/resume patch with rwlocks,
> > that has been (very slightly) tested and doesn't seem to break things.
> >
> > [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> > locks.]
>
> I can't imagine why not. And wouldn't lockdep get confused by the fact
> that in the async case, the rwsems are released by a different process
> from the one that acquired them?
/me looks at the .config
I have CONFIG_LOCKDEP_SUPPORT set, is there anything else I need to set
in .config?
> > Index: linux-2.6/drivers/base/power/main.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/base/power/main.c
> > +++ linux-2.6/drivers/base/power/main.c
>
> Should we have an attribute under /sys/power to disable async
> suspend/resume? It would make testing easier and give people a way to
> work around problems.
I have a separate patch adding that, but I'd prefer to focus on the core
feature first, if possible.
> > @@ -334,25 +337,53 @@ static void pm_dev_err(struct device *de
> > * The driver of @dev will not receive interrupts while this function is being
> > * executed.
> > */
> > -static int device_resume_noirq(struct device *dev, pm_message_t state)
> > +static int __device_resume_noirq(struct device *dev, pm_message_t state)
> > {
>
> Do you want to use async tasks in the late-suspend/early-resume stages?
> I know that USB won't use it, not even for the PCI host controllers --
> not unless the PCI core specifically wants it. Doing just the regular
> suspend/resume stages may be enough.
I guess so. It's a leftover from the time I thought PCI might use async
suspend, but it didn't really speed up things at all AFAICS.
I think I'll remove it for now and it's going to be trivial to add it back if
desired.
> > +static int device_resume_noirq(struct device *dev)
> > +{
> > + down_write(&dev->power.rwsem);
> > +
> > + if (dev->power.async_suspend && !pm_trace_is_enabled()) {
>
> If the sysfs attribute exists, then maybe we _should_ allow async with
> PM tracing enabled. I don't know; it's your decision.
I don't think it would be reliable in that case, because the RTC might be
written to by two concurrent threads at the same time.
> atomic_set(&async_error, error);
> }
>
>
> > @@ -683,10 +835,12 @@ static int dpm_suspend(pm_message_t stat
> >
> > INIT_LIST_HEAD(&list);
> > mutex_lock(&dpm_list_mtx);
> > + pm_transition = state;
> > while (!list_empty(&dpm_list)) {
> > struct device *dev = to_device(dpm_list.prev);
> >
> > get_device(dev);
> > + dev->power.status = DPM_OFF;
>
> What's that for? dev->power.status is supposed to be DPM_SUSPENDING
> until the suspend method is successfully completed.
If the suspend is run asynchronously, the main thread will always get a
"success" from device_suspend(), so it can't change power.status on this
basis. I thought we could set power.status to DPM_OFF upfront and change
it back when error is returned.
The alternative would be to move the modification of power.status to
device_suspend() and async_suspend(). Well, maybe that's better.
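
The point about the main thread always seeing "success" can be shown with
a small single-threaded mock. This is only a sketch loosely modelled on
the discussion, not the kernel code: the mock_* names, the pending queue
and the error value -5 are invented for the example, and a plain int
stands in for the atomic_t that concurrent tasks would need.

```c
#include <assert.h>
#include <stddef.h>

enum mock_status { MOCK_SUSPENDING, MOCK_OFF };

struct mock_dev {
	int async;			/* plays the role of power.async_suspend */
	int callback_result;		/* what the suspend callback returns */
	enum mock_status status;	/* plays the role of power.status */
};

#define MAX_PENDING 8

static struct mock_dev *pending[MAX_PENDING];	/* queued "async" work */
static size_t npending;
static int async_error;		/* first error seen by a queued task */

/* Status changes only when the callback succeeds. */
static int mock_suspend_one(struct mock_dev *dev)
{
	int error = dev->callback_result;

	if (!error)
		dev->status = MOCK_OFF;
	return error;
}

/* Async devices are only queued here, so the caller always sees 0. */
int mock_device_suspend(struct mock_dev *dev)
{
	if (dev->async) {
		assert(npending < MAX_PENDING);
		pending[npending++] = dev;
		return 0;
	}
	return mock_suspend_one(dev);
}

/* Run the queued work and report the first error, if any. */
int mock_synchronize(void)
{
	for (size_t i = 0; i < npending; i++) {
		int error;

		if (async_error)
			break;	/* an earlier task failed: skip the rest */
		error = mock_suspend_one(pending[i]);
		if (error)
			async_error = error;
	}
	npending = 0;
	return async_error;
}

/* Returns 1 if a failing async device looks fine at first and fails late. */
int demo_late_failure(void)
{
	struct mock_dev dev = { .async = 1, .callback_result = -5,
				.status = MOCK_SUSPENDING };
	int immediate, final;

	async_error = 0;
	immediate = mock_device_suspend(&dev);
	final = mock_synchronize();
	return immediate == 0 && final == -5 && dev.status == MOCK_SUSPENDING;
}

/* Returns 1 if a working async device ends up in the "off" state. */
int demo_async_success(void)
{
	struct mock_dev dev = { .async = 1, .callback_result = 0,
				.status = MOCK_SUSPENDING };

	async_error = 0;
	return mock_device_suspend(&dev) == 0 && mock_synchronize() == 0 &&
	       dev.status == MOCK_OFF;
}
```

Because the caller cannot tell success from failure at scheduling time,
the status update in this mock lives next to the callback itself.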
> > mutex_unlock(&dpm_list_mtx);
> >
> > error = device_suspend(dev, state);
> > @@ -694,16 +848,22 @@ static int dpm_suspend(pm_message_t stat
> > mutex_lock(&dpm_list_mtx);
> > if (error) {
> > pm_dev_err(dev, state, "", error);
> > + dev->power.status = DPM_SUSPENDING;
>
> And then this isn't needed.
>
> > put_device(dev);
> > break;
> > }
> > - dev->power.status = DPM_OFF;
>
> This line has to be moved into __device_suspend(), even though it won't
> be protected by dpm_list_mtx. The same sort of thing applies to
> dpm_suspend_noirq() (although nothing needs to be moved if you don't
> make it async).
>
> The rest looks okay.
Still, I think I'd rework it to use completions for the reason described in the
message I've just sent (in short, because of the off-tree dependencies
problem).
> How about exporting a wait_for_device_to_resume() routine? Drivers
> could call it for non-tree resume constraints:
>
> void wait_for_device_to_resume(struct device *other)
> {
> down_read(&other->power.rwsem);
> up_read(&other->power.rwsem);
> }
>
> Unfortunately there is no equivalent for non-tree suspend constraints.
If we use completions, it will be possible to just export something like
void dpm_wait(struct device *dev)
{
	if (dev)
		wait_for_completion(&dev->power.completion);
}
I think. It appears that will also work for suspend, unless I'm missing
something.
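
To see why waiting on a completion would cover an off-tree suspend
constraint as well, here is a single-threaded model. It is a sketch under
stated assumptions, not kernel code: polling stands in for blocking in
wait_for_completion(), a bool stands in for the completion, and the
four-device topology and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NDEV	4
#define NO_DEV	(-1)

struct mock_dev {
	int parent;	/* index of the parent device, or NO_DEV */
	int waits_for;	/* off-tree device that must suspend first, or NO_DEV */
	bool completed;	/* stands in for power.completion being done */
};

/* Would waiting for all children return without blocking? */
static bool children_completed(const struct mock_dev *devs, int self)
{
	for (int i = 0; i < NDEV; i++)
		if (devs[i].parent == self && !devs[i].completed)
			return false;
	return true;
}

/*
 * Poll the devices until none can make progress, recording the order in
 * which they "suspend". Returns how many devices completed.
 */
static int mock_suspend_all(struct mock_dev *devs, int *order)
{
	int done = 0;
	bool progress = true;

	while (progress) {
		progress = false;
		for (int i = 0; i < NDEV; i++) {
			struct mock_dev *d = &devs[i];

			if (d->completed || !children_completed(devs, i))
				continue;
			/* The off-tree wait a suspend callback would do. */
			if (d->waits_for != NO_DEV &&
			    !devs[d->waits_for].completed)
				continue;
			d->completed = true;	/* models complete_all() */
			order[done++] = i;
			progress = true;
		}
	}
	return done;
}

/*
 * Example topology: 0 is the parent of 1 and 2, 3 is unrelated, and
 * device 1 must not suspend before device 3.
 * Returns the position of `dev` in the resulting suspend order.
 */
int suspend_position(int dev)
{
	struct mock_dev devs[NDEV] = {
		{ NO_DEV, NO_DEV, false },
		{ 0, 3, false },
		{ 0, NO_DEV, false },
		{ NO_DEV, NO_DEV, false },
	};
	int order[NDEV];
	int done = mock_suspend_all(devs, order);

	assert(dev >= 0 && dev < NDEV);
	assert(done == NDEV);	/* no dependency cycle in this example */
	for (int i = 0; i < done; i++)
		if (order[i] == dev)
			return i;
	return -1;
}
```

In this model device 3 always completes before device 1, and device 0
completes last, after both of its children.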
Rafael
On Wed, 9 Dec 2009, Rafael J. Wysocki wrote:
> > I don't like that because it introduces "artificial" dependencies: It
> > makes B depend on all the preceding synchronous suspends, even totally
> > unrelated ones. But yes, it would work.
>
> Well, unfortunately, it wouldn't, because (at least in the context of my last
> patch) the core would release the rwsems as soon as your suspend had
> returned. So you'd have to make your suspend wait for the async thread and
> that would make it pointless. So scratch that, it wasn't a good idea at all.
>
> This leaves us with basically two options, where the first one is to use
> rwsems in a way that you've proposed (with iterating over children), and the
> second one is to use completions. In my opinion rwsems don't give us any
> advantage in this case, so I'd very much prefer to use completions.
If you really want to add support for async suspend constraints, then
completions are clearer than rwsems. If you don't care (and it's
unlikely that anyone will need them in the near future) then you might
as well stick with the current rwsem implementation and avoid iterating
over children.
Alan Stern
On Wednesday 09 December 2009, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > On Tuesday 08 December 2009, Alan Stern wrote:
> > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > > here?
> > >
> > > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > > these routines were not intended to be called with interrupts disabled,
> > > but that requirement doesn't seem to be documented. And it isn't a
> > > natural requirement anyway.
> >
> > OK, let's ask Ingo about that.
> >
> > Ingo, is there any particular reason why completion_done() and
> > try_wait_for_completion() don't use spin_lock_irqsave() and
> > spin_unlock_irqrestore()?
>
> that's a bug that should be fixed - all the wakeup side (and atomic)
> variants of completion API should be irq safe.
>
> It appears that these new completion APIs were added via the XFS tree
> about a year ago:
>
> 39d2f1a: [XFS] extend completions to provide XFS object flush requirements
>
> Please Cc: scheduler folks to all scheduler patches.
If you haven't fixed it locally yet, would you mind me posting a fix?
Rafael
On Wed, 9 Dec 2009, Rafael J. Wysocki wrote:
> On Wednesday 09 December 2009, Alan Stern wrote:
> > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > For completeness, below is the full async suspend/resume patch with rwlocks,
> > > that has been (very slightly) tested and doesn't seem to break things.
> > >
> > > [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> > > locks.]
> >
> > I can't imagine why not. And wouldn't lockdep get confused by the fact
> > that in the async case, the rwsems are released by a different process
> > from the one that acquired them?
>
> /me looks at the .config
>
> I have CONFIG_LOCKDEP_SUPPORT set, is there anything else I need to set
> in .config?
How about CONFIG_PROVE_LOCKING? If lockdep really does start
complaining then switching to completions would be a simple way to
appease it.
> > > @@ -683,10 +835,12 @@ static int dpm_suspend(pm_message_t stat
> > >
> > > INIT_LIST_HEAD(&list);
> > > mutex_lock(&dpm_list_mtx);
> > > + pm_transition = state;
> > > while (!list_empty(&dpm_list)) {
> > > struct device *dev = to_device(dpm_list.prev);
> > >
> > > get_device(dev);
> > > + dev->power.status = DPM_OFF;
> >
> > What's that for? dev->power.status is supposed to be DPM_SUSPENDING
> > until the suspend method is successfully completed.
>
> If the suspend is run asynchronously, the main thread will always get a
> "success" from device_suspend(), so it can't change power.status on this
> basis. I thought we could set power.status to DPM_OFF upfront and change
> it back when error is returned.
>
> The alternative would be to move the modification of power.status to
> device_suspend() and async_suspend(). Well, maybe that's better.
Yes, I think so. Or into __device_suspend(). And the same thing in
dpm_suspend_noirq().
> > How about exporting a wait_for_device_to_resume() routine? Drivers
> > could call it for non-tree resume constraints:
> >
> > void wait_for_device_to_resume(struct device *other)
> > {
> > down_read(&other->power.rwsem);
> > up_read(&other->power.rwsem);
> > }
> >
> > Unfortunately there is no equivalent for non-tree suspend constraints.
>
> If we use completions, it will be possible to just export something like
>
> dpm_wait(dev)
> {
> if (dev)
> wait_for_completion(dev->power.completion);
> }
>
> I think. It appears that will also work for suspend, unless I'm missing
> something.
It will.
Alan Stern
On Wednesday 09 December 2009, Alan Stern wrote:
> On Wed, 9 Dec 2009, Rafael J. Wysocki wrote:
>
> > On Wednesday 09 December 2009, Alan Stern wrote:
> > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > > For completeness, below is the full async suspend/resume patch with rwlocks,
> > > > that has been (very slightly) tested and doesn't seem to break things.
> > > >
> > > > [Note to Alan: lockdep doesn't seem to complain about the not annotated nested
> > > > locks.]
> > >
> > > I can't imagine why not. And wouldn't lockdep get confused by the fact
> > > that in the async case, the rwsems are released by a different process
> > > from the one that acquired them?
> >
> > /me looks at the .config
> >
> > I have CONFIG_LOCKDEP_SUPPORT set, is there anything else I need to set
> > in .config?
>
> How about CONFIG_PROVE_LOCKING? If lockdep really does start
> complaining then switching to completions would be a simple way to
> appease it.
Ah, that one is not set. I guess I'll try it later, although I've already
decided to use completions anyway.
...
> > > How about exporting a wait_for_device_to_resume() routine? Drivers
> > > could call it for non-tree resume constraints:
> > >
> > > void wait_for_device_to_resume(struct device *other)
> > > {
> > > down_read(&other->power.rwsem);
> > > up_read(&other->power.rwsem);
> > > }
> > >
> > > Unfortunately there is no equivalent for non-tree suspend constraints.
> >
> > If we use completions, it will be possible to just export something like
> >
> > dpm_wait(dev)
> > {
> > if (dev)
> > wait_for_completion(dev->power.completion);
> > }
> >
> > I think. It appears that will also work for suspend, unless I'm missing
> > something.
>
> It will.
Completions it is, then.
Additionally, I've removed the async support from the _noirq parts and moved
the setting of power.status on suspend to __device_suspend(). The result is
appended.
Rafael
---
drivers/base/power/main.c | 124 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 12 ++++
include/linux/resume-trace.h | 7 ++
4 files changed, 143 insertions(+), 6 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,13 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)
+extern int __dpm_wait(struct device *dev, void *ign);
+
+static inline void dpm_wait(struct device *dev)
+{
+ __dpm_wait(dev, NULL);
+}
+
#else /* !CONFIG_PM_SLEEP */
#define device_pm_lock() do {} while (0)
@@ -520,6 +530,8 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)
+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */
/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}
@@ -162,6 +165,39 @@ void device_pm_move_last(struct device *
}
/**
+ * __dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @ign: This value is not used by the function.
+ */
+int __dpm_wait(struct device *dev, void *ign)
+{
+ if (dev)
+ wait_for_completion(&dev->power.completion);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(__dpm_wait);
+
+static void dpm_wait_for_children(struct device *dev)
+{
+ device_for_each_child(dev, NULL, __dpm_wait);
+}
+
+/**
+ * dpm_synchronize - Wait for PM callbacks of all devices to complete.
+ */
+static void dpm_synchronize(void)
+{
+ struct device *dev;
+
+ async_synchronize_full();
+
+ mutex_lock(&dpm_list_mtx);
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ INIT_COMPLETION(dev->power.completion);
+ mutex_unlock(&dpm_list_mtx);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -381,17 +417,18 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ dpm_wait(dev->parent);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +463,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +504,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +515,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +530,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ dpm_synchronize();
}
/**
@@ -533,6 +595,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}
+static atomic_t async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -628,10 +692,11 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
+ dpm_wait_for_children(dev);
down(&dev->sem);
if (dev->class) {
@@ -666,12 +731,50 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ complete_all(&dev->power.completion);
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ atomic_set(&async_error, error);
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ int error;
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,6 +786,7 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
@@ -697,13 +801,18 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ dpm_synchronize();
+ if (!error)
+ error = atomic_read(&async_error);
return error;
}
@@ -762,6 +871,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ atomic_set(&async_error, 0);
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -793,8 +903,10 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ INIT_COMPLETION(dev->power.completion);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> Completions it is, then.
What was so hard with the "Try the simple one first" to understand? You
had a simpler working patch, why are you making this more complex one
without ever having had any problems with the simpler one?
Btw, your 'atomic_set()' with errors is pure voodoo programming. That's
not how atomics work. They do SMP-atomic addition etc, the 'atomic_set()'
and 'atomic_read()' things are not in any way more atomic than any other
access.
They are meant for racy reads (atomic_read()) and for initializations
(atomic_set()), and the way you use them that 'atomic' part is entirely
pointless, because it really isn't anything different from an 'int',
except that it may be very very expensive on some architectures due to
hashed spinlocks etc.
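
To make the distinction concrete, here is a minimal userspace sketch. It uses C11 <stdatomic.h> rather than the kernel's atomic_t, and every name in it (jobs_in_flight, job_started, latch_error, read_latched_error) is made up for illustration. The counter needs a genuine read-modify-write, which is what an atomic type provides. The error latch is only ever stored to and loaded from, which is the usage pattern described above as gaining nothing from the 'atomic' part.

```c
#include <assert.h>
#include <stdatomic.h>

/* A counter is modified in place, so it needs an atomic read-modify-write. */
static atomic_int jobs_in_flight;

static int job_started(void)
{
    /* atomic_fetch_add() returns the value held before the addition. */
    return atomic_fetch_add(&jobs_in_flight, 1) + 1;
}

/* An error latch is only stored and loaded, never modified in place. */
static atomic_int latched_error;

static void latch_error(int error)
{
    assert(error != 0);   /* zero means "no error" in this sketch */
    atomic_store(&latched_error, error);
}

static int read_latched_error(void)
{
    return atomic_load(&latched_error);
}
```

This is only an analogy: the C11 memory model and the kernel's rules for plain accesses are not the same thing, so it should not be read as a statement about what is safe in kernel code.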
So stop this overdesign thing. Start simple. If you _ever_ see real
problems, that's when you add stuff. As it is, any time you add
complexity, you just add bugs.
> +/**
> + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> + */
> +static void dpm_synchronize(void)
> +{
> + struct device *dev;
> +
> + async_synchronize_full();
> +
> + mutex_lock(&dpm_list_mtx);
> + list_for_each_entry(dev, &dpm_list, power.entry)
> + INIT_COMPLETION(dev->power.completion);
> + mutex_unlock(&dpm_list_mtx);
> +}
And this, for example, is pretty disgusting. Not only is that
INIT_COMPLETION purely brought on by the whole problem with completions
(they are fundamentally one-shot, but you want to use them over and over
so you need to re-initialize them: a nice lock wouldn't have that problem
to begin with), but the comment isn't even accurate. Sure, it waits for
any async jobs, but that's the _least_ of what the function actually does,
so the comment is actively misleading, isn't it?
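
The one-shot behaviour can be shown with a deliberately tiny, single-threaded model. This is a sketch under stated assumptions, not the kernel's struct completion: completion_model and the model_* helpers are hypothetical names, and a real completion also has a wait queue and a spinlock that are left out here. The point is only that once complete_all has run, every later waiter passes straight through until somebody re-initializes the object.

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded stand-in for a completion: just the "done" state. */
struct completion_model {
    bool done;
};

/* Arms (or re-arms) the completion so that waiters would block again. */
static void model_init(struct completion_model *c)
{
    c->done = false;
}

/* Like complete_all(): lets current waiters and every later one through. */
static void model_complete_all(struct completion_model *c)
{
    c->done = true;
}

/* Would a waiter arriving right now have to sleep? */
static bool model_wait_would_block(const struct completion_model *c)
{
    return !c->done;
}

/* One suspend or resume pass over a device, ending with complete_all. */
static void model_run_phase(struct completion_model *c)
{
    /* A phase only synchronizes anything if waiters could still block. */
    assert(model_wait_would_block(c));
    model_complete_all(c);
}
```

Running model_run_phase() twice in a row without a model_init() in between trips the assertion, which is the re-initialization burden being complained about.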
Linus
* Rafael J. Wysocki <[email protected]> wrote:
> On Wednesday 09 December 2009, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <[email protected]> wrote:
> >
> > > On Tuesday 08 December 2009, Alan Stern wrote:
> > > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > >
> > > > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > > > here?
> > > >
> > > > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > > > these routines were not intended to be called with interrupts disabled,
> > > > but that requirement doesn't seem to be documented. And it isn't a
> > > > natural requirement anyway.
> > >
> > > OK, let's ask Ingo about that.
> > >
> > > Ingo, is there any particular reason why completion_done() and
> > > try_wait_for_completion() don't use spin_lock_irqsave() and
> > > spin_unlock_irqrestore()?
> >
> > that's a bug that should be fixed - all the wakeup side (and atomic)
> > variants of completion API should be irq safe.
> >
> > It appears that these new completion APIs were added via the XFS tree
> > about a year ago:
> >
> > 39d2f1a: [XFS] extend completions to provide XFS object flush requirements
> >
> > Please Cc: scheduler folks to all scheduler patches.
>
> If you haven't fixed it locally yet, would you mind me posting a fix?
I wouldn't mind it at all.
Thanks,
Ingo
On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
> > How about CONFIG_PROVE_LOCKING? If lockdep really does start
> > complaining then switching to completions would be a simple way to
> > appease it.
>
> Ah, that one is not set. I guess I'll try it later, although I've already
> decided to use completions anyway.
You should see how badly lockdep complains about the rwsems. If it
really doesn't like them then using completions makes sense.
> Index: linux-2.6/drivers/base/power/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/main.c
> +++ linux-2.6/drivers/base/power/main.c
> @@ -56,6 +58,7 @@ static bool transition_started;
> void device_pm_init(struct device *dev)
> {
> dev->power.status = DPM_ON;
> + init_completion(&dev->power.completion);
> pm_runtime_init(dev);
> }
You need a matching complete_all() in device_pm_remove(), in case
someone else is waiting for the device when it gets unregistered.
> +/**
> + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> + */
> +static void dpm_synchronize(void)
> +{
> + struct device *dev;
> +
> + async_synchronize_full();
> +
> + mutex_lock(&dpm_list_mtx);
> + list_for_each_entry(dev, &dpm_list, power.entry)
> + INIT_COMPLETION(dev->power.completion);
> + mutex_unlock(&dpm_list_mtx);
> +}
I agree with Linus, initializing the completions here is weird. You
should initialize them just before using them.
> @@ -683,6 +786,7 @@ static int dpm_suspend(pm_message_t stat
>
> INIT_LIST_HEAD(&list);
> mutex_lock(&dpm_list_mtx);
> + pm_transition = state;
> while (!list_empty(&dpm_list)) {
> struct device *dev = to_device(dpm_list.prev);
>
> @@ -697,13 +801,18 @@ static int dpm_suspend(pm_message_t stat
> put_device(dev);
> break;
> }
> - dev->power.status = DPM_OFF;
> if (!list_empty(&dev->power.entry))
> list_move(&dev->power.entry, &list);
> put_device(dev);
> + error = atomic_read(&async_error);
> + if (error)
> + break;
> }
> list_splice(&list, dpm_list.prev);
Here's something you might want to do in a later patch. These awkward
list-pointer manipulations can be simplified as follows:
static bool dpm_iterate_forward;
static struct device *dpm_next;
In device_pm_remove():
mutex_lock(&dpm_list_mtx);
if (dev == dpm_next)
dpm_next = to_device(dpm_iterate_forward ?
dev->power.entry.next : dev->power.entry.prev);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
In dpm_resume():
dpm_iterate_forward = true;
list_for_each_entry_safe(dev, dpm_next, &dpm_list, power.entry) {
...
In dpm_suspend():
dpm_iterate_forward = false;
list_for_each_entry_safe_reverse(dev, dpm_next, &dpm_list,
power.entry) {
...
Whether this really is better is a matter of opinion; I like it.
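
For anyone who wants to play with the idea outside the kernel, here is a minimal userspace sketch of the cursor fix-up. It is not the kernel's list.h: struct node, head, cursor_next, add_tail, remove_node, walk and visit_and_unlink are all hypothetical names, and there is no locking. It only shows why adjusting the saved "next" pointer on removal keeps the walk valid when the visitor unlinks the element the walk was about to visit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *prev, *next;
    int id;
};

/* Circular list with a sentinel head, in the style of a kernel list. */
static struct node head = { &head, &head, 0 };
static struct node *cursor_next;   /* element the walk will visit next */
static bool iterate_forward;       /* direction of the walk in progress */

static void add_tail(struct node *n)
{
    n->prev = head.prev;
    n->next = &head;
    head.prev->next = n;
    head.prev = n;
}

/* Unlink n; if the walk was about to visit it, step the cursor past it. */
static void remove_node(struct node *n)
{
    if (n == cursor_next)
        cursor_next = iterate_forward ? n->next : n->prev;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/* Visit every element in the given direction; returns how many were seen. */
static int walk(bool forward, void (*visit)(struct node *))
{
    struct node *n;
    int visited = 0;

    iterate_forward = forward;
    for (n = forward ? head.next : head.prev; n != &head; n = cursor_next) {
        cursor_next = forward ? n->next : n->prev;
        visit(n);
        visited++;
    }
    cursor_next = NULL;
    return visited;
}

static int seen[8];
static int nseen;

/* Example visitor: records the node and, on node 1, unlinks its successor. */
static void visit_and_unlink(struct node *n)
{
    assert(nseen < 8);
    seen[nseen++] = n->id;
    if (n->id == 1 && n->next != &head)
        remove_node(n->next);
}
```

With nodes 1, 2, 3 on the list, a forward walk using visit_and_unlink sees 1 and 3: node 2 is unlinked while node 1 is being visited, and the cursor has already been moved on to node 3.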
Alan Stern
On Thu, 10 Dec 2009, Alan Stern wrote:
>
> In device_pm_remove():
>
> mutex_lock(&dpm_list_mtx);
> if (dev == dpm_next)
> dpm_next = to_device(dpm_iterate_forward ?
> dev->power.entry.next : dev->power.entry.prev);
> list_del_init(&dev->power.entry);
> mutex_unlock(&dpm_list_mtx);
I'm really not seeing the point - it's much better to hardcode the
ordering in the place you use it (where it is static and the compiler can
generate better code) than to do some dynamic choice that depends on some
fake flag - especially a global one.
Also, quite frankly, error handling needs to be separated out of the whole
async patch, and needs to be thought about a lot more. And I would
seriously argue that if you have any async suspends, then those async
suspends are _not_ allowed to fail. At least not initially
Having async failures and trying to fix them up is just a disaster. Which
ones actually failed, and which ones were aborted before they even really
got to their suspend routines? Which ones do you try to resume?
IOW, it needs way more thought than what has clearly happened so far. And
once more, I will refuse to merge anything that is complicated for no
actual reason (where reason is "real life, and tested to make a big
difference", not some hand-waving)
Linus
On Thu, 10 Dec 2009, Linus Torvalds wrote:
>
>
> On Thu, 10 Dec 2009, Alan Stern wrote:
> >
> > In device_pm_remove():
> >
> > mutex_lock(&dpm_list_mtx);
> > if (dev == dpm_next)
> > dpm_next = to_device(dpm_iterate_forward ?
> > dev->power.entry.next : dev->power.entry.prev);
> > list_del_init(&dev->power.entry);
> > mutex_unlock(&dpm_list_mtx);
>
> I'm really not seeing the point - it's much better to hardcode the
> ordering in the place you use it (where it is static and the compiler can
> generate better code) than to do some dynamic choice that depends on some
> fake flag - especially a global one.
You probably didn't look closely at the original code in dpm_suspend()
and dpm_resume(). It's very awkward; each device is removed from
dpm_list, operated on, and then added on to a new local list. At the
end the new list is spliced back into dpm_list.
This approach is better because it doesn't involve changing any list
pointers while the sleep transition is in progress. At any rate, I
don't recommend doing it in the same patch as the async stuff; it
should be done separately. Either before or after -- the two are
independent.
> Also, quite frankly, error handling needs to be separated out of the whole
> async patch, and needs to be thought about a lot more. And I would
> seriously argue that if you have any async suspends, then those async
> suspends are _not_ allowed to fail. At least not initially
>
> Having async failures and trying to fix them up is just a disaster. Which
> ones actually failed, and which ones were aborted before they even really
> got to their suspend routines? Which ones do you try to resume?
We record the status of each device; dev->power.status stores different
values depending on whether the device suspend succeeded or failed.
The value will be correct and up-to-date after async_synchronize_full()
returns. The value is used in dpm_resume() to decide which devices
need their resume methods called. I don't see any problems there.
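
As a rough illustration of that bookkeeping, here is a small userspace model. It is a sketch only: model_state, model_dev and resume_suspended are hypothetical, and the three states are a simplification that does not reproduce the real enum dpm_state. A device whose suspend succeeded is marked MODEL_OFF; one that failed or was never reached keeps its earlier status, so the resume pass can tell them apart without any extra list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-device PM status. */
enum model_state { MODEL_ON, MODEL_SUSPENDING, MODEL_OFF };

struct model_dev {
    enum model_state status;
    int resume_calls;   /* how often the resume method was invoked */
};

/* Resume only the devices that actually reached the "off" state. */
static int resume_suspended(struct model_dev *devs, size_t n)
{
    int resumed = 0;
    size_t i;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (devs[i].status != MODEL_OFF)
            continue;   /* suspend failed or never ran: nothing to undo */
        devs[i].resume_calls++;
        devs[i].status = MODEL_ON;
        resumed++;
    }
    return resumed;
}
```

Given three devices where the middle one is still MODEL_SUSPENDING, resume_suspended() calls the resume method on the other two and leaves the middle one untouched.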
> IOW, it needs way more thought than what has clearly happened so far. And
> once more, I will refuse to merge anything that is complicated for no
> actual reason (where reason is "real life, and tested to make a big
> difference", not some hand-waving)
I don't think the error handling requires more than minimal changes.
The whole atomic_t thing was overkill. It probably stemmed from a
discussion some time back with Pavel Machek about concurrent writes to
a single variable. I claimed that concurrent writes to a properly
aligned pointer, int, or long would never create a "mash-up"; that is,
readers would see either the original value or one of the new values
but never some weird combination of bits.
Alan Cox pointed out that while this was technically correct, there's
nothing to prevent the compiler from translating
a = b + c;
into something like:
load b, R1
store R1, a
load c, R1
add R1, a
in which case readers might see the intermediate value. (Okay, the
compiler would have to be pretty stupid to do this with such a simple
expression, but it could happen with more complicated expressions.)
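
One way to rule out that kind of intermediate value, sketched here in plain userspace C, is to finish the computation in a private local and then write the shared variable once through a volatile-qualified lvalue. The names shared_a, publish_sum and read_shared are made up for this example. It is a sketch under the assumption that an aligned int store is a single machine store on the target, which the C standard itself does not promise, and it says nothing about ordering against other variables.

```c
/* Shared variable; volatile makes each access below a single, real access. */
static volatile int shared_a;

static void publish_sum(int b, int c)
{
    int sum = b + c;   /* all arithmetic happens on a private local */

    shared_a = sum;    /* the shared variable is written exactly once */
}

static int read_shared(void)
{
    return shared_a;   /* one load; sees the old value or the new one */
}
```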
Pavel favored always using atomic types when there could be concurrent
writes, and apparently Rafael was following his advice.
Alan Stern
On Thursday 10 December 2009, Linus Torvalds wrote:
>
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Completions it is, then.
>
> What was so hard with the "Try the simple one first" to understand? You
> had a simpler working patch, why are you making this more complex one
> without ever having had any problems with the simpler one?
OK, why don't you just say you won't merge anything that doesn't use rwsems
(although you said before that completions would be fine with you)? That would
make things clear, but also it would mean we gave up handling the off-tree
dependencies in general.
> Btw, your 'atomic_set()' with errors is pure voodoo programming. That's
> not how atomics work. They do SMP-atomic addition etc, the 'atomic_set()'
> and 'atomic_read()' things are not in any way more atomic than any other
> access.
>
> They are meant for racy reads (atomic_read()) and for initializations
> (atomic_set()), and the way you use them that 'atomic' part is entirely
> pointless, because it really isn't anything different from an 'int',
> except that it may be very very expensive on some architectures due to
> hashed spinlocks etc.
>
> So stop this overdesign thing. Start simple. If you _ever_ see real
> problems, that's when you add stuff. As it is, any time you add
> complexity, you just add bugs.
OK, so that need not be atomic.
> > +/**
> > + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> > + */
> > +static void dpm_synchronize(void)
> > +{
> > + struct device *dev;
> > +
> > + async_synchronize_full();
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + list_for_each_entry(dev, &dpm_list, power.entry)
> > + INIT_COMPLETION(dev->power.completion);
> > + mutex_unlock(&dpm_list_mtx);
> > +}
>
> And this, for example, is pretty disgusting. Not only is that
> INIT_COMPLETION purely brought on by the whole problem with completions
> (they are fundamentally one-shot, but you want to use them over and over
Actually, twice. However, since I don't want to do any async handling in the
_noirq phases any more, I can get rid of this whole function. Thanks for
pointing that out to me.
Rafael
On Thursday 10 December 2009, Alan Stern wrote:
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> > > How about CONFIG_PROVE_LOCKING? If lockdep really does start
> > > complaining then switching to completions would be a simple way to
> > > appease it.
> >
> > Ah, that one is not set. I guess I'll try it later, although I've already
> > decided to use completions anyway.
>
> You should see how badly lockdep complains about the rwsems. If it
> really doesn't like them then using completions makes sense.
It does complain about them, but when the nested _down operations are marked
as nested, it stops complaining (that's in the version where there's no async
in the _noirq phases).
> > Index: linux-2.6/drivers/base/power/main.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/base/power/main.c
> > +++ linux-2.6/drivers/base/power/main.c
> > @@ -56,6 +58,7 @@ static bool transition_started;
> > void device_pm_init(struct device *dev)
> > {
> > dev->power.status = DPM_ON;
> > + init_completion(&dev->power.completion);
> > pm_runtime_init(dev);
> > }
>
> You need a matching complete_all() in device_pm_remove(), in case
> someone else is waiting for the device when it gets unregistered.
Right, added.
> > +/**
> > + * dpm_synchronize - Wait for PM callbacks of all devices to complete.
> > + */
> > +static void dpm_synchronize(void)
> > +{
> > + struct device *dev;
> > +
> > + async_synchronize_full();
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + list_for_each_entry(dev, &dpm_list, power.entry)
> > + INIT_COMPLETION(dev->power.completion);
> > + mutex_unlock(&dpm_list_mtx);
> > +}
>
> I agree with Linus, initializing the completions here is weird. You
> should initialize them just before using them.
I removed that completely and now the INIT_COMPLETION() is always done in the
preceding phase.
> > @@ -683,6 +786,7 @@ static int dpm_suspend(pm_message_t stat
> >
> > INIT_LIST_HEAD(&list);
> > mutex_lock(&dpm_list_mtx);
> > + pm_transition = state;
> > while (!list_empty(&dpm_list)) {
> > struct device *dev = to_device(dpm_list.prev);
> >
> > @@ -697,13 +801,18 @@ static int dpm_suspend(pm_message_t stat
> > put_device(dev);
> > break;
> > }
> > - dev->power.status = DPM_OFF;
> > if (!list_empty(&dev->power.entry))
> > list_move(&dev->power.entry, &list);
> > put_device(dev);
> > + error = atomic_read(&async_error);
> > + if (error)
> > + break;
> > }
> > list_splice(&list, dpm_list.prev);
>
> Here's something you might want to do in a later patch. These awkward
> list-pointer manipulations can be simplified as follows:
Well, I'm not sure if that's more straightforward.
Anyway, as you said, that's something for a different patch. :-)
Below is an updated version of the $subject one. I don't use the atomic_t for
async_error any more and (apart from this fixed issue) I don't see any problems
in the suspend error path now.
Rafael
---
drivers/base/power/main.c | 113 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 12 ++++
include/linux/resume-trace.h | 7 ++
4 files changed, 131 insertions(+), 7 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,13 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)
+extern int __dpm_wait(struct device *dev, void *ign);
+
+static inline void dpm_wait(struct device *dev)
+{
+ __dpm_wait(dev, NULL);
+}
+
#else /* !CONFIG_PM_SLEEP */
#define device_pm_lock() do {} while (0)
@@ -520,6 +530,8 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)
+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */
/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}
@@ -111,6 +114,7 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ complete_all(&dev->power.completion);
mutex_lock(&dpm_list_mtx);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
@@ -162,6 +166,24 @@ void device_pm_move_last(struct device *
}
/**
+ * __dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @ign: This value is not used by the function.
+ */
+int __dpm_wait(struct device *dev, void *ign)
+{
+ if (dev)
+ wait_for_completion(&dev->power.completion);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(__dpm_wait);
+
+static void dpm_wait_for_children(struct device *dev)
+{
+ device_for_each_child(dev, NULL, __dpm_wait);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -366,7 +388,7 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
- list_for_each_entry(dev, &dpm_list, power.entry)
+ list_for_each_entry(dev, &dpm_list, power.entry) {
if (dev->power.status > DPM_OFF) {
int error;
@@ -375,23 +397,27 @@ void dpm_resume_noirq(pm_message_t state
if (error)
pm_dev_err(dev, state, " early", error);
}
+ /* Needed by the subsequent dpm_resume(). */
+ INIT_COMPLETION(dev->power.completion);
+ }
mutex_unlock(&dpm_list_mtx);
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ dpm_wait(dev->parent);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +452,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +493,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +504,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +519,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
@@ -623,15 +674,18 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
+ dpm_wait_for_children(dev);
down(&dev->sem);
if (dev->class) {
@@ -666,12 +720,48 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ if (async_error) {
+ complete_all(&dev->power.completion);
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ async_error = error;
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,6 +773,7 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
@@ -697,13 +788,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}
@@ -762,6 +857,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -793,8 +889,11 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ /* Needed by the subsequent dpm_suspend(). */
+ INIT_COMPLETION(dev->power.completion);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
> > You should see how badly lockdep complains about the rwsems. If it
> > really doesn't like them then using completions makes sense.
>
> It does complain about them, but when the nested _down operations are marked
> as nested, it stops complaining (that's in the version where there's no async
> in the _noirq phases).
Did you set the async_suspend flag for any devices during the test?
And did you run more than one suspend/resume cycle?
> +extern int __dpm_wait(struct device *dev, void *ign);
> +
> +static inline void dpm_wait(struct device *dev)
> +{
> + __dpm_wait(dev, NULL);
> +}
Sorry, I intended to mention this before but forgot. This design is
inelegant. You shouldn't have inlines calling functions with extra
unused arguments; they just waste code space. Make dpm_wait() be a
real routine and add a shim to the device_for_each_child() loop.
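
Roughly what I have in mind, as a userspace sketch rather than a patch: everything prefixed model_ is a hypothetical stand-in (model_for_each_child for device_for_each_child(), a counter for the actual wait), so only the shape of the code carries over. The wait routine takes just the device, and a small static shim adapts it to the two-argument callback type that the iterator expects.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4

struct model_dev {
    struct model_dev *children[MODEL_MAX_CHILDREN];
    int nchildren;
    int waited;   /* how many times someone waited on this device */
};

/* Stand-in for device_for_each_child(): stops at the first non-zero return. */
static int model_for_each_child(struct model_dev *parent, void *data,
                                int (*fn)(struct model_dev *, void *))
{
    int i, ret;

    assert(parent->nchildren <= MODEL_MAX_CHILDREN);
    for (i = 0; i < parent->nchildren; i++) {
        ret = fn(parent->children[i], data);
        if (ret)
            return ret;
    }
    return 0;
}

/* A real routine with the natural one-argument signature. */
static void model_dpm_wait(struct model_dev *dev)
{
    if (dev)
        dev->waited++;   /* kernel code would wait_for_completion() here */
}

/* Shim: adapts the one-argument routine to the iterator's callback type. */
static int model_dpm_wait_fn(struct model_dev *dev, void *data)
{
    (void)data;   /* unused, required only by the callback signature */
    model_dpm_wait(dev);
    return 0;
}

static void model_dpm_wait_for_children(struct model_dev *dev)
{
    model_for_each_child(dev, NULL, model_dpm_wait_fn);
}
```

The unused argument then exists in exactly one place, the shim, instead of leaking into every caller of the wait routine.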
> @@ -366,7 +388,7 @@ void dpm_resume_noirq(pm_message_t state
>
> mutex_lock(&dpm_list_mtx);
> transition_started = false;
> - list_for_each_entry(dev, &dpm_list, power.entry)
> + list_for_each_entry(dev, &dpm_list, power.entry) {
> if (dev->power.status > DPM_OFF) {
> int error;
>
> @@ -375,23 +397,27 @@ void dpm_resume_noirq(pm_message_t state
> if (error)
> pm_dev_err(dev, state, " early", error);
> }
> + /* Needed by the subsequent dpm_resume(). */
> + INIT_COMPLETION(dev->power.completion);
You're still doing it. Don't initialize the completions in a totally
different phase! Initialize them directly before they are used.
Namely, at the start of device_resume() and device_suspend().
One more thing. A logical time to check for errors is just after
waiting for the children in __device_suspend(), instead of beforehand
in async_suspend(). After all, if an error occurs then it's likely to
happen while we are waiting.
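
In sketch form, with a userspace model instead of the real functions (model_async_error, model_dev, model_wait_for_children and model_device_suspend are all hypothetical names, and returning 0 when the callback is skipped is just one possible choice):

```c
static int model_async_error;   /* first error seen by any device, 0 if none */

struct model_dev {
    int child_error;       /* error a child reports while we wait, 0 if none */
    int callback_result;   /* what this device's own suspend callback returns */
    int callback_ran;      /* set once the suspend callback has been invoked */
};

/* Stand-in for waiting on the children; a child may fail during the wait. */
static void model_wait_for_children(struct model_dev *dev)
{
    if (dev->child_error && !model_async_error)
        model_async_error = dev->child_error;
}

static int model_device_suspend(struct model_dev *dev)
{
    int error;

    model_wait_for_children(dev);

    /* Checked after the wait, so failures that happened meanwhile count. */
    if (model_async_error)
        return 0;   /* skip the callback; the failure is already recorded */

    dev->callback_ran = 1;
    error = dev->callback_result;
    if (error)
        model_async_error = error;
    return error;
}
```

A check placed before the wait would have let the second device in the test below run its callback even though one of its children had already failed.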
Alan Stern
On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> OK, why don't you just say you won't merge anything that doesn't use rwsems
I did! Here's a quote (and it's pretty much the whole email, so it's not
like it was hidden):
- [email protected]:
"Let me put this simply: I've told you guys how to do it simply, with
_zero_ crap. No "iterating over children". No games. No data structures.
No new infrastructure. Just a single new rwlock per device, and _trivial_
code.
So here's the challenge: try it my simple way first. I've quoted the code
about five million times already. If you _actually_ see some problems,
explain them. Don't make up stupid "iterate over each child" things. Don't
claim totally made-up "leads to difficulties". Don't make it any more
complicated than it needs to be.
Keep it simple. And once you have tried that simple approach, and you
really can show why it doesn't work, THEN you can try something else.
But before you try the simple approach and explain why it wouldn't work, I
simply will not pull anything more complex. Understood and agreed?"
And then later about completions:
- [email protected]:
"So I think completions should work, if done right. That whole "make the
parent wait for all the children to complete" is fine in that sense. And
I'll happily take such an approach if my rwlock thing doesn't work."
IOW, I'll happily take the completions version, but dammit, I refuse to
take it when there is a simpler approach that does NOT need to iterate,
and does NOT need to re-initialize the data structures each round etc.
That's what I've been arguing against the whole time. It started as
arguing against complex and unnecessary infrastructure, and trying to show
that it _can_ be done so much simpler using existing basic locking.
And I get annoyed when you guys continually seem to want to make it more
complex than it needs to be.
> > And this, for example, is pretty disgusting. Not only is that
> > INIT_COMPLETION purely brought on by the whole problem with completions
> > (they are fundamentally one-shot, but you want to use them over and over
>
> Actually, twice. However, since I don't want to do any async handling in the
> _noirq phases any more, I can get rid of this whole function. Thanks for
> pointing that out to me.
Well, my point was that you'll need to do that
INIT_COMPLETION(dev->power.completion);
thing each suspend and each resume. Exactly because completions are
designed to be "one-way" things, so you end up having to reset them each
cycle (you just reset them even _more_ than you needed).
Again, my point was that using locks is actually a very _natural_ thing to
do. I really don't understand what problems you and Alan have with just
using locks - we have way more locks in the kernel than we have
completions, so they are the "default" thing to do, and they really are
very natural to use.
[ Ok, so admittedly the actual use of 'struct rw_semaphore' is pretty
unusual, but my point is that people are used to locking semantics in
general, more so than the semantics of completions ]
Completions were literally designed to be used for one-off things - one of
the most common uses is that the 'struct completion' is on the _stack_. It
doesn't get much more one-off than that - and the completions are really
very explicitly designed so that you can do a 'complete()' on something
that will literally disappear from under you as you do it (because the
struct completion might be on the stack of the thing that is waiting for
it, and gets de-allocated when the waiter goes ahead).
That is why 'wait_for_completion()' always has to take the spinlock, for
example - there is no fastpath for completion, because the races for the
waiter releasing things too early are too nasty.
So completions are actually very subtle things - and you don't need any of
that subtlety. I realize that from a user perspective, completions look
very simple, but in many ways they actually have subtler semantics than a
regular lock has.
Linus
On Thursday 10 December 2009, Alan Stern wrote:
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
>
> > > You should see how badly lockdep complains about the rwsems. If it
> > > really doesn't like them then using completions makes sense.
> >
> > It does complain about them, but when the nested _down operations are marked
> > as nested, it stops complaining (that's in the version where there's no async
> > in the _noirq phases).
>
> Did you set the async_suspend flag for any devices during the test?
Yes. All ACPI, all PCI, all serio, as usual. ;-)
> And did you run more than one suspend/resume cycle?
Sure. Actually, I test it in the /sys/power/pm_test = core mode, but that
shouldn't really matter.
> > +extern int __dpm_wait(struct device *dev, void *ign);
> > +
> > +static inline void dpm_wait(struct device *dev)
> > +{
> > + __dpm_wait(dev, NULL);
> > +}
>
> Sorry, I intended to mention this before but forgot. This design is
> inelegant. You shouldn't have inlines calling functions with extra
> unused arguments; they just waste code space. Make dpm_wait() be a
> real routine and add a shim to the device_for_each_child() loop.
I thought about that myself, done now.
> > @@ -366,7 +388,7 @@ void dpm_resume_noirq(pm_message_t state
> >
> > mutex_lock(&dpm_list_mtx);
> > transition_started = false;
> > - list_for_each_entry(dev, &dpm_list, power.entry)
> > + list_for_each_entry(dev, &dpm_list, power.entry) {
> > if (dev->power.status > DPM_OFF) {
> > int error;
> >
> > @@ -375,23 +397,27 @@ void dpm_resume_noirq(pm_message_t state
> > if (error)
> > pm_dev_err(dev, state, " early", error);
> > }
> > + /* Needed by the subsequent dpm_resume(). */
> > + INIT_COMPLETION(dev->power.completion);
>
> You're still doing it. Don't initialize the completions in a totally
> different phase! Initialize them directly before they are used.
> Namely, at the start of device_resume() and device_suspend().
The idea was to initialize them all at the same time, before entering the
phase in which they were used, but I came to the conclusion that this was not
necessary, because the dpm_list ordering was such that the devices to be waited
for would always have their completions reinitialized before starting
__device_suspend() or __device_resume() for the waiting ones.
> One more thing. A logical time to check for errors is just after
> waiting for the children in __device_suspend(), instead of beforehand
> in async_suspend(). After all, if an error occurs then it's likely to
> happen while we are waiting.
Good idea, done.
Updated patch is appended.
Rafael
---
drivers/base/power/main.c | 106 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 7 ++
include/linux/resume-trace.h | 7 ++
4 files changed, 121 insertions(+), 5 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -508,6 +511,8 @@ extern void __suspend_report_result(cons
__suspend_report_result(__func__, fn, ret); \
} while (0)
+extern void dpm_wait(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */
#define device_pm_lock() do {} while (0)
@@ -520,6 +525,8 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)
+static inline void dpm_wait(struct device *dev) {}
+
#endif /* !CONFIG_PM_SLEEP */
/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}
@@ -111,6 +114,7 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ complete_all(&dev->power.completion);
mutex_lock(&dpm_list_mtx);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
@@ -162,6 +166,28 @@ void device_pm_move_last(struct device *
}
/**
+ * dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ */
+void dpm_wait(struct device *dev)
+{
+	if (dev)
+		wait_for_completion(&dev->power.completion);
+}
+EXPORT_SYMBOL_GPL(dpm_wait);
+
+static int dpm_wait_fn(struct device *dev, void *ignore)
+{
+	dpm_wait(dev);
+	return 0;
+}
+
+static void dpm_wait_for_children(struct device *dev)
+{
+	device_for_each_child(dev, NULL, dpm_wait_fn);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -381,17 +407,18 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ dpm_wait(dev->parent);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +453,34 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+	struct device *dev = (struct device *)data;
+	int error;
+
+	error = __device_resume(dev, pm_transition);
+	if (error)
+		pm_dev_err(dev, pm_transition, " async", error);
+	put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+	if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+		get_device(dev);
+		async_schedule(async_resume, dev);
+		return 0;
+	}
+
+	return __device_resume(dev, pm_transition);
+}
+
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +494,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -451,10 +502,11 @@ static void dpm_resume(pm_message_t stat
if (dev->power.status >= DPM_OFF) {
int error;
+ INIT_COMPLETION(dev->power.completion);
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +521,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
@@ -623,17 +676,23 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
+ dpm_wait_for_children(dev);
down(&dev->sem);
+ if (async_error)
+ goto End;
+
if (dev->class) {
if (dev->class->pm) {
pm_dev_dbg(dev, state, "class ");
@@ -666,12 +725,42 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+	struct device *dev = (struct device *)data;
+	int error;
+
+	error = __device_suspend(dev, pm_transition);
+	if (error) {
+		pm_dev_err(dev, pm_transition, " async", error);
+		async_error = error;
+	}
+
+	put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+	if (dev->power.async_suspend) {
+		get_device(dev);
+		async_schedule(async_suspend, dev);
+		return 0;
+	}
+
+	return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,10 +772,12 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ INIT_COMPLETION(dev->power.completion);
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -697,13 +788,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}
@@ -762,6 +857,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+	return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+	if (dev->power.status == DPM_ON)
+		dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
On Thu, 10 Dec 2009, Alan Stern wrote:
>
> You probably didn't look closely at the original code in dpm_suspend()
> and dpm_resume(). It's very awkward; each device is removed from
> dpm_list, operated on, and then added on to a new local list. At the
> end the new list is spliced back into dpm_list.
>
> This approach is better because it doesn't involve changing any list
> pointers while the sleep transition is in progress. At any rate, I
> don't recommend doing it in the same patch as the async stuff; it
> should be done separately. Either before or after -- the two are
> independent.
I do agree with the "independent" part. But I don't agree about the
awkwardness per se.
Sure, it moves things back and forth and has private lists, but that's
actually a fairly standard thing to do in those kinds of situations where
you're taking something off a list, operating on it, and may need to put
it back on the same list eventually. The VM layer does similar things.
So that's why I think your version was actually odder - the existing list
manipulation isn't all that odd. It has that strange "did we get removed
while we dropped the lock and tried to suspend the device" thing, of
course, but that's not entirely unheard of either.
Could it be done more cleanly? I think so, but I agree with you that it's
likely a separate issue.
I _suspect_, for example, that we could just do something like the
appended to avoid _some_ of the subtlety. IOW, just move the device to the
local list early - and if it gets removed while being suspended, it will
automatically get removed from the local list (the remover doesn't care
_what_ list it is on when it does a 'list_del(power.entry)').
UNTESTED PATCH! This may be total crap, of course. But it _looks_ like an
"ObviousCleanup(tm)" - famous last words.
Linus
---
drivers/base/power/main.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 8aa2443..f2bb493 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -687,6 +687,7 @@ static int dpm_suspend(pm_message_t state)
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ list_move(&dev->power.entry, &list);
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -698,8 +699,6 @@ static int dpm_suspend(pm_message_t state)
break;
}
dev->power.status = DPM_OFF;
- if (!list_empty(&dev->power.entry))
- list_move(&dev->power.entry, &list);
put_device(dev);
}
list_splice(&list, dpm_list.prev);
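
The property the untested cleanup above relies on (unlinking an entry never
needs to know which list head it hangs off) can be checked with a small
userspace sketch. Everything below is a hypothetical miniature written for
illustration: the toy_* names are not kernel API, and the helpers only imitate
the behaviour of list_add(), list_move() and list_del_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a circular doubly linked list. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *head)
{
	head->next = head;
	head->prev = head;
}

static bool toy_list_empty(const struct toy_list *head)
{
	return head->next == head;
}

/* Insert @entry right after @head. */
static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink @entry from whatever list it is on; no list head is needed. */
static void toy_list_del_init(struct toy_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	toy_list_init(entry);
}

static void toy_list_move(struct toy_list *entry, struct toy_list *head)
{
	toy_list_del_init(entry);
	toy_list_add(entry, head);
}

static size_t toy_list_len(const struct toy_list *head)
{
	size_t n = 0;

	for (const struct toy_list *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/*
 * Move the last device to a local list first, then let a "remover"
 * unlink it. Returns true if both lists end up consistent.
 */
bool toy_move_early_then_remove(void)
{
	struct toy_list dpm, local, dev_a, dev_b;
	struct toy_list *dev;

	toy_list_init(&dpm);
	toy_list_init(&local);
	toy_list_add(&dev_a, &dpm);
	toy_list_add(&dev_b, &dpm);	/* dpm now holds dev_b, dev_a */

	dev = dpm.prev;			/* the last entry, here dev_a */
	toy_list_move(dev, &local);	/* the early move to the local list */
	toy_list_del_init(dev);		/* removal while "being suspended" */

	return toy_list_empty(&local) && toy_list_len(&dpm) == 1 &&
	       toy_list_empty(dev);
}
```

After the removal the local list is empty again, so a later splice of the
local list back into the main one adds nothing for the removed device. The
sketch says nothing about locking; it assumes, as in the patch, that list
manipulation is serialized by a mutex.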
On Friday 11 December 2009, Linus Torvalds wrote:
>
> On Thu, 10 Dec 2009, Rafael J. Wysocki wrote:
...
>
> IOW, I'll happily take the completions version, but dammit, I refuse to
> take it when there is a simpler approach that does NOT need to iterate,
> and does NOT need to re-initialize the data structures each round etc.
I don't think it really is that simple. For example, the fact that the outer
lock has to be taken by one thread and released by another is not exactly
straightforward. [One might ask what's the critical section in this case.]
Besides, suppose a device driver wants some off-tree constraints to be
satisfied. What's the driver writer supposed to do? He can only lock the
other device, but that will cause lockdep to complain, because this lock
is going to be nested. Moreover, it's already too late, because his async
thread has started and there's no guarantee that the other device hasn't
acquired its rwsem yet.
With completions, the driver doesn't have to take any action to prevent another
one from suspending too early. Instead, the other one has to wait for its
suspend to complete, and for me personally this is a much more natural thing
to do. IOW, if I were a driver writer, I'd probably prefer to wait on a
completion than to use a lock in a tricky manner.
> That's what I've been arguing against the whole time. It started as
> arguing against complex and unnecessary infrastructure, and trying to show
> that it _can_ be done so much simpler using existing basic locking.
>
> And I get annoyed when you guys continually seem to want to make it more
> complex than it needs to be.
>
> > > And this, for example, is pretty disgusting. Not only is that
> > > INIT_COMPLETION purely brought on by the whole problem with completions
> > > (they are fundamentally one-shot, but you want to use them over and over
> >
> > Actually, twice. However, since I don't want to do any async handling in the
> > _noirq phases any more, I can get rid of this whole function. Thanks for
> > pointing that out to me.
>
> Well, my point was that you'll need to do that
>
> INIT_COMPLETION(dev->power.completion);
>
> thing each suspend and each resume. Exactly because completions are
> designed to be "one-way" things, so you end up having to reset them each
> cycle (you just reset them even _more_ than you needed).
Well, why actually do we need to preserve the state of the data structure from
one cycle to another? There's no need whatsoever.
> Again, my point was that using locks is actually a very _natural_ thing to
> do. I really don't understand what problems you and Alan have with just
> using locks - we have way more locks in the kernel than we have
> completions, so they are the "default" thing to do, and they really are
> very natural to use.
>
> [ Ok, so admittedly the actual use of 'struct rw_semaphore' is pretty
> unusual, but my point is that people are used to locking semantics in
> general, more so than the semantics of completions ]
I still don't think there are many places where locks are used in a way you're
suggesting. I would even say it's quite unusual to use locks this way.
> Completions were literally designed to be used for one-off things - one of
> the most common uses is that the 'struct completion' is on the _stack_. It
> doesn't get much more one-off than that - and the completions are really
> very explicitly designed so that you can do a 'complete()' on something
> that will literally disappear from under you as you do it (because the
> struct completion might be on the stack of the thing that is waiting for
> it, and gets de-allocated when the waiter goes ahead).
We could literally throw away a completion after all of the potentially waiting
threads have finished their operations and then allocate it back again when
necessary. We only need the synchronization in this particular phase of
suspend or resume and it doesn't need to extend to the other phases or other
cycles, because all of the concurrent threads we need to synchronize will
only live during this one particular phase of suspend or resume. They will
all exit when it's finished anyway.
> That is why 'wait_for_completion()' always has to take the spinlock, for
> example - there is no fastpath for completion, because the races for the
> waiter releasing things too early are too nasty.
>
> So completions are actually very subtle things - and you don't need any of
> that subtlety. I realize that from a user perspective, completions look
> very simple, but in many ways they actually have subtler semantics than a
> regular lock has.
Well, I guess your point is that the implementation of completions is much
more complicated than we really need, but I'm not sure if that really hurts.
Rafael
On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
>
> I don't think it really is that simple. For example, the fact that the outer
> lock has to be taken by one thread and released by another is not exactly
> straightforward. [One might ask what's the critical section in this case.]
Why is that any different from initializing the completion in one thread,
and completing it in another?
It's exactly equivalent.
Completions really are "locks that were initialized to locked". That is,
in fact, how completions came to be: we literally used to use semaphores
for them, and the reason for completions is literally the magic lifetime
rules they have.
So when you do
INIT_COMPLETION(dev->power.completion);
that really is historically, logically, and conceptually exactly the same
thing as initializing a lock to the locked state. We literally used to do
it with the equivalent of
init_MUTEX_LOCKED()
way back when (well, except we didn't have mutexes back then, we had only
counting semaphores) and instead of "complete()", we had "up()" on the
semaphore to complete it.
> Besides, suppose a device driver wants some off-tree constraints to be
> satisfied.
.. and I've told you several times that we should simply not do such
devices asynchronously. At least not unless there is some _overriding_
reason to. And so far, nobody has suggested anything even remotely
likely for that.
Again - KISS: Keep It Simple, Stupid!
Don't try to make up problems. The _only_ subsystem we know wants this is
USB, and we know USB is purely a tree.
> > INIT_COMPLETION(dev->power.completion);
> >
> > thing each suspend and each resume. Exactly because completions are
> > designed to be "one-way" things, so you end up having to reset them each
> > cycle (you just reset them even _more_ than you needed).
>
> Well, why actually do we need to preserve the state of the data structure from
> one cycle to another? There's no need whatsoever.
My point is, with locks, none of that is necessary. Because they
automatically do the right thing.
By picking the right concept, you don't have any of those "oh, we need to
re-initialize things" issues. They just work.
> I still don't think there are many places where locks are used in a way you're
> suggesting. I would even say it's quite unusual to use locks this way.
See above. It's what completions _are_.
> Well, I guess your point is that the implementation of completions is much
> more complicated than we really need, but I'm not sure if that really hurts.
No. The implementation of completions is actually pretty simple, exactly
because they have that spinlock that is required to protect them.
That wasn't the point. The point was that locks are actually the "normal"
thing to use.
You are arguing as if completions are somehow the simpler model. That's
simply not true. Completions are just a _special_case_of_locking_.
So why not just use regular locks instead, when it's actually the natural
way to do it, and results in simpler code?
Linus
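
The "locks that were initialized to locked" description above can be made
concrete with a toy, single-threaded model. This is only a sketch under
stated assumptions: the toy_* types and functions are hypothetical, they are
not the kernel's struct semaphore or struct completion, and only non-blocking
"try" operations are modelled so that the example needs no threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types, written only for this illustration. */
struct toy_sem {
	int count;		/* 0 means "locked" */
};

struct toy_completion {
	unsigned int done;	/* 0 means "not completed yet" */
};

/* Old style: a counting semaphore that starts out locked. */
static void toy_sem_init_locked(struct toy_sem *s)
{
	s->count = 0;
}

static void toy_up(struct toy_sem *s)
{
	s->count++;
}

/* Non-blocking down(): true if the semaphore was acquired. */
static bool toy_down_trylock(struct toy_sem *s)
{
	if (s->count == 0)
		return false;
	s->count--;
	return true;
}

/* Completion style: the same state machine under different names. */
static void toy_init_completion(struct toy_completion *c)
{
	c->done = 0;
}

static void toy_complete(struct toy_completion *c)
{
	c->done++;
}

/* Non-blocking wait: true if a completion was consumed. */
static bool toy_try_wait_for_completion(struct toy_completion *c)
{
	if (c->done == 0)
		return false;
	c->done--;
	return true;
}

/* Drive both objects through the same steps; true if they always agree. */
bool toy_sem_matches_completion(void)
{
	struct toy_sem s;
	struct toy_completion c;
	bool same = true;

	toy_sem_init_locked(&s);
	toy_init_completion(&c);

	/* Both start "locked": a waiter would have to block. */
	same &= toy_down_trylock(&s) == toy_try_wait_for_completion(&c);

	/* up() and complete() each let exactly one waiter through. */
	toy_up(&s);
	toy_complete(&c);
	same &= toy_down_trylock(&s) == toy_try_wait_for_completion(&c);

	/* Once consumed, both are "locked" again. */
	same &= toy_down_trylock(&s) == toy_try_wait_for_completion(&c);
	return same;
}
```

The model deliberately leaves out complete_all(), which is what the patch
under discussion uses: that variant releases every waiter and leaves the
completion in the "done" state, which is why the patch has to call
INIT_COMPLETION() again before the next phase.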
Up front: This is my personal view of the matter. Which probably isn't
of much interest to anybody, so I won't bother to defend these views or
comment any further on them. The decision about what version to use is
up to the two of you. The fact is, either implementation would get the
job done.
On Thu, 10 Dec 2009, Linus Torvalds wrote:
> Completions really are "locks that were initialized to locked". That is,
> in fact, how completions came to be: we literally used to use semaphores
> for them, and the reason for completions is literally the magic lifetime
> rules they have.
>
> So when you do
>
> INIT_COMPLETION(dev->power.completion);
>
> that really is historically, logically, and conceptually exactly the same
> thing as initializing a lock to the locked state. We literally used to do
> it with the equivalent of
>
> init_MUTEX_LOCKED()
>
> way back when (well, except we didn't have mutexes back then, we had only
> counting semaphores) and instead of "complete()", we had "up()" on the
> semaphore to complete it.
You think of it that way because you have been closely involved in the
development of the various kinds of locks. Speaking as an outsider who
has relatively little interest in the internal details, completions
appear simpler than rwsems. Mostly because they have a smaller API:
complete() (or complete_all()) and wait_for_completion() as opposed to
down_read(), up_read(), down_write(), and up_write().
> > Besides, suppose a device driver wants some off-tree constraints to be
> > satisfied.
>
> .. and I've told you several times that we should simply not do such
> devices asynchronously. At least not unless there is some _overriding_
> reason to. And so far, nobody has suggested anything even remotely
> likely for that.
Agreed. The fact that async non-tree suspend constraints are difficult
with rwsems isn't a drawback if nobody needs to use them.
> > Well, why actually do we need to preserve the state of the data structure from
> > one cycle to another? There's no need whatsoever.
>
> My point is, with locks, none of that is necessary. Because they
> automatically do the right thing.
>
> By picking the right concept, you don't have any of those "oh, we need to
> re-initialize things" issues. They just work.
That's true, but it's not entirely clear. There are subtle questions
about what happens if you stop in the middle or a device gets
unregistered or registered in the middle. They require careful thought
in both approaches.
Having to reinitialize a completion each time doesn't bother me. It's
merely an indication that each suspend & resume is independent of all
the others.
> > I still don't think there are many places where locks are used in a way you're
> > suggesting. I would even say it's quite unusual to use locks this way.
>
> See above. It's what completions _are_.
This is almost a philosophical issue. If each A_i must wait for some
B_j's, is the onus on each A_i to test the B_j's it's interested in?
Or is the onus on each B_j to tell the A_i's waiting for it that they
may proceed? As Humpty-Dumpty said, "The question is which is to be
master -- that's all".
> > Well, I guess your point is that the implementation of completions is much
> > more complicated than we really need, but I'm not sure if that really hurts.
>
> No. The implementation of completions is actually pretty simple, exactly
> because they have that spinlock that is required to protect them.
>
> That wasn't the point. The point was that locks are actually the "normal"
> thing to use.
>
> You are arguing as if completions are somehow the simpler model. That's
> simply not true. Completions are just a _special_case_of_locking_.
Doesn't that make them simpler by definition? Special cases always
have less to worry about than the general case.
> So why not just use regular locks instead, when it's actually the natural
> way to do it, and results in simpler code?
Simpler but also more subtle, IMO. If you didn't already know how the
algorithm worked, figuring it out from the code would be harder with
rwsems than with completions. Partly because of the way readers and
writers exchange roles in suspend vs. resume, and partly because
sometimes devices lock themselves and sometimes they lock other
devices. With completions each device has its own, and each device
waits for other devices' completions -- easier to keep track of
mentally.
(I still think this whole readers vs. writers thing is a red herring.
The essential property is that there are two opposing classes of lock
holders. The fact that multiple writers can't hold the lock at the
same time whereas multiple readers can is of no importance; the
algorithm would work just as well if multiple writers _could_ hold the
lock simultaneously.)
Balancing the additional conceptual complexity of the rwsem approach is
the conceptual simplicity afforded by not needing to check all the
children. To me this makes it pretty much a toss-up.
Alan Stern
On Thu, Dec 10, 2009 at 08:59:47AM +0100, Ingo Molnar wrote:
> * Rafael J. Wysocki <[email protected]> wrote:
> > On Wednesday 09 December 2009, Ingo Molnar wrote:
> > > * Rafael J. Wysocki <[email protected]> wrote:
> > > > On Tuesday 08 December 2009, Alan Stern wrote:
> > > > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > > >
> > > > > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > > > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > > > > here?
> > > > >
> > > > > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > > > > these routines were not intended to be called with interrupts disabled,
> > > > > but that requirement doesn't seem to be documented. And it isn't a
> > > > > natural requirement anyway.
When I implemented them they were not called from anywhere that
disabled interrupts. IIRC the main reason I used spin_lock_irq()
was because that is what wait_for_completion() used at the time....
> > > that's a bug that should be fixed - all the wakeup side (and atomic)
> > > variants of completion API should be irq safe.
I see no problems with that ;)
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Dave Chinner <[email protected]> wrote:
> On Thu, Dec 10, 2009 at 08:59:47AM +0100, Ingo Molnar wrote:
> > * Rafael J. Wysocki <[email protected]> wrote:
> > > On Wednesday 09 December 2009, Ingo Molnar wrote:
> > > > * Rafael J. Wysocki <[email protected]> wrote:
> > > > > On Tuesday 08 December 2009, Alan Stern wrote:
> > > > > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > > > >
> > > > > > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > > > > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > > > > > here?
> > > > > >
> > > > > > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > > > > > these routines were not intended to be called with interrupts disabled,
> > > > > > but that requirement doesn't seem to be documented. And it isn't a
> > > > > > natural requirement anyway.
>
> When I implemented them they were not called from anywhere that
> disabled interrupts. IIRC the main reason I used spin_lock_irq()
> was because that is what wait_for_completion() used at the time....
Obviously wait_for_completion() as a non-atomic API that can block will
(and should) use _irq() - but atomic variants (complete, but also the
try-wait thing) use irqsafe methods. A fair portion of completions
happen in IRQ context.
Ingo
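
Why the _irq lock variants are a problem in a helper that may be called with
interrupts already disabled can be shown with a toy model. This is a sketch
only: the "interrupt flag" is a plain variable, all toy_* names are
hypothetical, and no real locking takes place. It models just one fact about
the real primitives, namely that the unlock half of the _irq pair enables
interrupts unconditionally while the irqsave/irqrestore pair puts back the
state it found.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the local CPU's interrupt-enable flag. */
static bool toy_irqs_enabled = true;

static void toy_local_irq_disable(void) { toy_irqs_enabled = false; }
static void toy_local_irq_enable(void)  { toy_irqs_enabled = true; }

/* _irq variants: disable on lock, enable unconditionally on unlock. */
static void toy_spin_lock_irq(void)   { toy_irqs_enabled = false; }
static void toy_spin_unlock_irq(void) { toy_irqs_enabled = true; }

/* irqsave variants: remember the previous state and restore it. */
static bool toy_spin_lock_irqsave(void)
{
	bool flags = toy_irqs_enabled;

	toy_irqs_enabled = false;
	return flags;
}

static void toy_spin_unlock_irqrestore(bool flags)
{
	toy_irqs_enabled = flags;
}

/* A completion_done()-like helper written with the _irq variants. */
static void toy_check_done_irq(void)
{
	toy_spin_lock_irq();
	/* ... look at the protected state ... */
	toy_spin_unlock_irq();
}

/* The same helper written with the irqsave variants. */
static void toy_check_done_irqsave(void)
{
	bool flags = toy_spin_lock_irqsave();

	/* ... look at the protected state ... */
	toy_spin_unlock_irqrestore(flags);
}

/* Call @helper with interrupts off; report whether they stayed off. */
static bool toy_irqs_stay_off(void (*helper)(void))
{
	bool stayed_off;

	toy_local_irq_disable();
	helper();
	stayed_off = !toy_irqs_enabled;
	toy_local_irq_enable();
	return stayed_off;
}

bool toy_irq_variant_is_safe(void)
{
	return toy_irqs_stay_off(toy_check_done_irq);
}

bool toy_irqsave_variant_is_safe(void)
{
	return toy_irqs_stay_off(toy_check_done_irqsave);
}
```

In this model the _irq helper hands control back with interrupts enabled even
though the caller had disabled them, whereas the irqsave helper leaves the
caller's state alone.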
On Friday 11 December 2009, Linus Torvalds wrote:
>
> On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
> >
> > I don't think it really is that simple. For example, the fact that the outer
> > lock has to be taken by one thread and released by another is not exactly
> > straightforward. [One might ask what's the critical section in this case.]
>
> Why is that any different from initializing the completion in one thread,
> and completing it in another?
>
> It's exactly equivalent.
>
> Completions really are "locks that were initialized to locked". That is,
> in fact, how completions came to be: we literally used to use semaphores
> for them, and the reason for completions is literally the magic lifetime
> rules they have.
I don't know how they emerged historically and that's probably why I look at
them in a different way than you do.
But fine, say we use the approach based on rwsems and consider suspend and
the inner lock. We acquire it using down_write(), because we want to wait for
multiple other drivers. Now, in fact we could do literally
	down_write(&dev->power.rwsem);
	up_write(&dev->power.rwsem);
because the lock doesn't really protect anything from anyone. What it does is
to prevent _us_ from doing something too early. To me, personally, it's not a
usual use of locks.
Moreover, if you think completions should be treated like locks, the up_write()
above plays the role of the INIT_COMPLETION() in my last patch (or vice versa),
so we reinitialize the data structure to the previous state in this case too,
only earlier (and we could do that later just as well).
The only real drawback of using completions I can see is that we have to
iterate over the children during suspend, but if async suspend is going to save
us any time at all, we can easily afford it (resume with completions is
actually simpler than with rwsems, because we only have to wait for one device
each time).
> > Besides, suppose a device driver wants some off-tree constraints to be
> > satisfied.
>
> .. and I've told you several times that we should simply not do such
> devices asynchronously. At least not unless there is some _overriding_
> reason to. And so far, nobody has suggested anything even remotely
> likely for that.
>
> Again - KISS: Keep It Simple, Stupid!
>
> Don't try to make up problems. The _only_ subsystem we know wants this is
> USB, and we know USB is purely a tree.
Not really.
I've already said it once, but let me repeat. Some device objects have those
ACPI "shadow" device objects that represent the ACPI view of given "physical"
device and have their own suspend and resume routines. It turns out that
these ACPI "shadow" devices have to be suspended after their "physical"
counterparts and resumed before them, or else things break really badly.
I don't know the reason for that, I only verified it experimentally (I also
don't like that design, but I didn't invent it and I have to live with it at
least for now). So if we don't enforce these constraints doing async
suspend and resume, we won't be able to handle _any_ devices with those
ACPI "shadow" things asynchronously. Ever. [That includes the majority of
PCI devices, at least the "planar" ones (which is unfortunate, but that's how
it goes).]
If we had a clean way of representing off-tree constraints during asynchronous
suspend and resume, we'd be able to handle this issue at the bus type level.
And even if we don't anticipate it right now, I think the iteration over
children during suspend is a fair price for a clean interface that bus types or
drivers can use in future. YMMV.
> > Well, I guess your point is that the implementation of completions is much
> > more complicated than we really need, but I'm not sure if that really hurts.
>
> No. The implementation of completions is actually pretty simple, exactly
> because they have that spinlock that is required to protect them.
>
> That wasn't the point. The point was that locks are actually the "normal"
> thing to use.
>
> You are arguing as if completions are somehow the simpler model.
That's because I think so.
> That's simply not true. Completions are just a _special_case_of_locking_.
Which doesn't necessarily prevent them from being conceptually simpler
than the locking scheme based on rwsems.
> So why not just use regular locks instead, when it's actually the natural
> way to do it, and results in simpler code?
Well, to me, it's way not natural and, quite frankly, in my not so humble
opinion, it's a matter of personal preference.
But, since your personal preference is what matters in this case, I'm not
going to argue any more, because that just plain doesn't make sense.
So, if you're not fine with the last patch I sent
(http://patchwork.kernel.org/patch/66375/), I'll send one using rwsems instead
of completions just to make _you_ happy, not because I think that's what we
should do objectively.
Rafael
On Friday 11 December 2009, Alan Stern wrote:
> Up front: This is my personal view of the matter. Which probably isn't
> of much interest to anybody, so I won't bother to defend these views or
> comment any further on them. The decision about what version to use is
> up to the two of you. The fact is, either implementation would get the
> job done.
>
> On Thu, 10 Dec 2009, Linus Torvalds wrote:
>
> > Completions really are "locks that were initialized to locked". That is,
> > in fact, how completions came to be: we literally used to use semaphores
> > for them, and the reason for completions is literally the magic lifetime
> > rules they have.
> >
> > So when you do
> >
> > INIT_COMPLETION(dev->power.completion);
> >
> > that really is historically, logically, and conceptually exactly the same
> > thing as initializing a lock to the locked state. We literally used to do
> > it with the equivalent of
> >
> > init_MUTEX_LOCKED()
> >
> > way back when (well, except we didn't have mutexes back then, we had only
> > counting semaphores) and instead of "complete()", we had "up()" on the
> > semaphore to complete it.
>
> You think of it that way because you have been closely involved in the
> development of the various kinds of locks. Speaking as an outsider who
> has relatively little interest in the internal details, completions
> appear simpler than rwsems. Mostly because they have a smaller API:
> complete() (or complete_all()) and wait_for_completion() as opposed to
> down_read(), up_read(), down_write(), and up_write().
Agreed.
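The "locks that were initialized to locked" description quoted above can be
sketched in user space. What follows is only a minimal illustration, assuming
POSIX semaphores and pthreads in place of the kernel primitives; every
sketch_* name is hypothetical and none of this is kernel API. A semaphore
created with a count of zero behaves like a lock that starts out held: the
waiter blocks until another thread posts it, which is the role up() is said
to have played before complete() existed.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical user-space stand-in for a one-shot completion. */
struct sketch_completion {
	sem_t sem;
};

struct sketch_work {
	struct sketch_completion done;
	int result;
};

/* Signal one waiter; the analogue of up() on a semaphore that began at 0. */
static void sketch_complete(struct sketch_completion *c)
{
	sem_post(&c->sem);
}

/* Block until sketch_complete() has been called, retrying on EINTR. */
static void sketch_wait_for_completion(struct sketch_completion *c)
{
	while (sem_wait(&c->sem) == -1 && errno == EINTR)
		continue;
}

static void *sketch_worker(void *data)
{
	struct sketch_work *w = data;

	w->result = 42;			/* the "work" being waited for */
	sketch_complete(&w->done);	/* released by a thread that never "took" it */
	return NULL;
}

/* Returns the worker's result, or -1 if setup failed. */
static int sketch_run(void)
{
	struct sketch_work w = { .result = 0 };
	pthread_t thread;
	int result;

	/* Count 0: "initialized to locked", so the first wait blocks. */
	if (sem_init(&w.done.sem, 0, 0) != 0)
		return -1;
	if (pthread_create(&thread, NULL, sketch_worker, &w) != 0) {
		sem_destroy(&w.done.sem);
		return -1;
	}

	sketch_wait_for_completion(&w.done);
	result = w.result;

	pthread_join(thread, NULL);
	sem_destroy(&w.done.sem);
	return result;
}
```

Each post releases exactly one waiter, so this corresponds to complete()
rather than complete_all(); it says nothing about the lifetime rules that
motivated completions in the kernel.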
> > > Besides, suppose a device driver wants some off-tree constraints to be
> > > satisfied.
> >
> > .. and I've told you several times that we should simply not do such
> > devices asynchronously. At least not unless there is some _overriding_
> > reason to. And so far, nobody has suggested anything even remotely
> > likely for that.
>
> Agreed. The fact that async non-tree suspend constraints are difficult
> with rwsems isn't a drawback if nobody needs to use them.
Well, see my reply to Linus. The only thing that bothers me is that if we use
rwsems, there's no way to handle that even if it turns out that someone
needs them after all.
> > > Well, why actually do we need to preserve the state of the data structure from
> > > one cycle to another? There's no need whatsoever.
> >
> > My point is, with locks, none of that is necessary. Because they
> > automatically do the right thing.
> >
> > By picking the right concept, you don't have any of those "oh, we need to
> > re-initialize things" issues. They just work.
>
> That's true, but it's not entirely clear. There are subtle questions
> about what happens if you stop in the middle or a device gets
> unregistered or registered in the middle. They require careful thought
> in both approaches.
>
> Having to reinitialize a completion each time doesn't bother me. It's
> merely an indication that each suspend & resume is independent of all
> the others.
YES!
> > > I still don't think there are many places where locks are used in a way you're
> > > suggesting. I would even say it's quite unusual to use locks this way.
> >
> > See above. It's what completions _are_.
>
> This is almost a philosophical issue. If each A_i must wait for some
> B_j's, is the onus on each A_i to test the B_j's it's interested in?
> Or is the onus on each B_j to tell the A_i's waiting for it that they
> may proceed? As Humpty-Dumpty said, "The question is which is to be
> master -- that's all".
Agreed.
> > > Well, I guess your point is that the implementation of completions is much
> > > more complicated than we really need, but I'm not sure if that really hurts.
> >
> > No. The implementation of completions is actually pretty simple, exactly
> > because they have that spinlock that is required to protect them.
> >
> > That wasn't the point. The point was that locks are actually the "normal"
> > thing to use.
> >
> > You are arguing as if completions are somehow the simpler model. That's
> > simply not true. Completions are just a _special_case_of_locking_.
>
> Doesn't that make them simpler by definition? Special cases always
> have less to worry about than the general case.
Heh, good point.
> > So why not just use regular locks instead, when it's actually the natural
> > way to do it, and results in simpler code?
>
> Simpler but also more subtle, IMO. If you didn't already know how the
> algorithm worked, figuring it out from the code would be harder with
> rwsems than with completions.
Indeed.
> Partly because of the way readers and
> writers exchange roles in suspend vs. resume, and partly because
> sometimes devices lock themselves and sometimes they lock other
> devices. With completions each device has its own, and each device
> waits for other devices' completions -- easier to keep track of
> mentally.
Agreed again.
> (I still think this whole readers vs. writers thing is a red herring.
> The essential property is that there are two opposing classes of lock
> holders. The fact that multiple writers can't hold the lock at the
> same time whereas multiple readers can is of no importance; the
> algorithm would work just as well if multiple writers _could_ hold the
> lock simultaneously.)
>
> Balancing the additional conceptual complexity of the rwsem approach is
> the conceptual simplicity afforded by not needing to check all the
> children. To me this makes it pretty much a toss-up.
Yup.
Thanks!
Rafael
On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
>
> But fine, say we use the approach based on rwsems and consider suspend and
> the inner lock. We acquire it using down_write(), because we want to wait for
> multiple other drivers. Now, in fact we could do literally
>
> 	down_write(&dev->power.rwsem);
> 	up_write(&dev->power.rwsem);
>
> because the lock doesn't really protect anything from anyone. What it does is
> to prevent _us_ from doing something too early. To me, personally, it's not a
> usual use of locks.
I agree that it's fairly unusual, but on the other hand, it's unusual only
because you contrived it to be.
If you instead do
	down_write(&dev->power.rwsem);
	.. do the actual suspend ..
	up_write(&dev->power.rwsem);
it doesn't look odd any more, does it? And while you don't _need_ to hold
the power lock over the suspend call, it actually does make sense, and
gives you some nicer guarantees.
For an example of the kinds of guarantees it would give you - I think that
you might actually be able to do a partial suspend and then a resume
without any other locks, and you'd know that just the per-device locking
would already guarantee that no device is ever resumed before it has
finished its asynchronous suspend.
Think about it.
In the completion model, the "async_synchronize_full()" will synchronize
all async work, and as a result you think that you don't need that level
of robustness from the locking itself.
But think about it this way: if you could abort a failed suspend, and
start resuming devices immediately, without doing that
"async_synchronize_full()" in between - simply because you know that the
node locking itself will just "do the right thing".
To me, that's a sign of a _good_ design. Using a rwsem is simply just more
robust and natural for the problem in question. Exactly because it's a
real lock.
> > Don't try to make up problems. The _only_ subsystem we know wants this is
> > USB, and we know USB is purely a tree.
>
> Not really.
>
> I've already said it once, but let me repeat. Some device objects have those
> ACPI "shadow" device objects that represent the ACPI view of given "physical"
> device and have their own suspend and resume routines. It turns out that
> these ACPI "shadow" devices have to be suspended after their "physical"
> counterparts and resumed before them, or else things break really badly.
> I don't know the reason for that, I only verified it experimentally (I also
> don't like that design, but I didn't invent it and I have to live with it at
> least for now). So if we don't enforce these constraints doing async
> suspend and resume, we won't be able to handle _any_ devices with those
> ACPI "shadow" things asynchronously. Ever. [That includes the majority of
> PCI devices, at least the "planar" ones (which is unfortunate, but that's how
> it goes).]
So?
First off, you're wrong. It's not "ever". I'm happy to add complexity
later, I just don't want to start out with a complex model. Adding
complexity too early "just because we might need it" is the wrong thing to
do.
Secondly, I repeat: we don't want to do those PCI devices asynchronously
anyway. You're again digging yourself deeper by just continually bringing
up this total non-issue. I realize you did it for testing, but I'm serious
when I say that we should limit these things as much as possible, rather
than see it as an opportunity to do crazy things.
Solve the problem at hand _first_. Solve it as simply as you can. And hope
that you never ever will need anything more complex.
Linus
On Friday 11 December 2009, Linus Torvalds wrote:
>
> On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
> >
> > But fine, say we use the approach based on rwsems and consider suspend and
> > the inner lock. We acquire it using down_write(), because we want to wait for
> > multiple other drivers. Now, in fact we could do literally
> >
> > 	down_write(&dev->power.rwsem);
> > 	up_write(&dev->power.rwsem);
> >
> > because the lock doesn't really protect anything from anyone. What it does is
> > to prevent _us_ from doing something too early. To me, personally, it's not a
> > usual use of locks.
>
> I agree that it's fairly unusual, but on the other hand, it's unusual only
> because you contrived it to be.
Whatever. The very fact that you can freely move the up_write() (as long as
it's after the down_write()) is fairly unusual.
> But think about it this way: if you could abort a failed suspend, and
> start resuming devices immediately, without doing that
> "async_synchronize_full()" in between - simply because you know that the
> node locking itself will just "do the right thing".
I'd rather not. :-)
> To me, that's a sign of a _good_ design. Using a rwsem is simply just more
> robust and natural for the problem in question. Exactly because it's a
> real lock.
...
> Solve the problem at hand _first_. Solve it as simply as you can. And hope
> that you never ever will need anything more complex.
Below is a patch I've just tested, but there's a lockdep problem in it I don't
know how to solve. Namely, lockdep is apparently unhappy with us not releasing
the lock taken in device_suspend() and it complains we take it twice in a row
(which we do, but for another device). I need to use down_read_non_owner()
to make it shut up and then I also need to use up_read_non_owner() in
__device_suspend(), although there's the comment in include/linux/rwsem.h
saying exactly this about that:
/*
 * Take/release a lock when not the owner will release it.
 *
 * [ This API should be avoided as much as possible - the
 *   proper abstraction for this case is completions. ]
 */
(I'd like to know your opinion about that). Yet, that's not all, because next
it complains during resume that __device_resume() releases a lock it didn't
acquire, which it clearly does, but that is intentional. Unfortunately,
there's no up_write_non_owner() ...
So, what am I supposed to do about that?
Rafael
---
drivers/base/power/main.c | 107 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 ++
include/linux/pm.h | 3 +
include/linux/resume-trace.h | 7 ++
4 files changed, 114 insertions(+), 9 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
 #include <linux/spinlock.h>
 #include <linux/wait.h>
 #include <linux/timer.h>
+#include <linux/rwsem.h>
 
 /*
  * Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
 	pm_message_t		power_state;
 	unsigned int		can_wakeup:1;
 	unsigned int		should_wakeup:1;
+	unsigned		async_suspend:1;
 	enum dpm_state		status;		/* Owned by the PM core */
 #ifdef CONFIG_PM_SLEEP
 	struct list_head	entry;
+	struct rw_semaphore	rwsem;
 #endif
 #ifdef CONFIG_PM_RUNTIME
 	struct timer_list	suspend_timer;
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
 #include <linux/resume-trace.h>
 #include <linux/rwsem.h>
 #include <linux/interrupt.h>
+#include <linux/async.h>
 
 #include "../base.h"
 #include "power.h"
@@ -42,6 +43,7 @@
 LIST_HEAD(dpm_list);
 
 static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
 
 /*
  * Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
 void device_pm_init(struct device *dev)
 {
 	dev->power.status = DPM_ON;
+	init_rwsem(&dev->power.rwsem);
 	pm_runtime_init(dev);
 }
 
@@ -381,17 +384,22 @@ void dpm_resume_noirq(pm_message_t state
 EXPORT_SYMBOL_GPL(dpm_resume_noirq);
 
 /**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  */
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
 {
+	struct device *parent = dev->parent;
 	int error = 0;
 
 	TRACE_DEVICE(dev);
 	TRACE_RESUME(0);
 
+	/* Wait for the parent's resume to complete, if necessary. */
+	if (parent)
+		down_read_nested(&parent->power.rwsem, SINGLE_DEPTH_NESTING);
+
 	down(&dev->sem);
 
 	if (dev->bus) {
@@ -426,11 +434,41 @@ static int device_resume(struct device *
 	}
  End:
 	up(&dev->sem);
+	if (parent)
+		up_read(&parent->power.rwsem);
+
+	/* Allow the children to resume now. */
+	up_write(&dev->power.rwsem);
 
 	TRACE_RESUME(error);
 	return error;
 }
 
+static void async_resume(void *data, async_cookie_t cookie)
+{
+	struct device *dev = (struct device *)data;
+	int error;
+
+	error = __device_resume(dev, pm_transition);
+	if (error)
+		pm_dev_err(dev, pm_transition, " async", error);
+	put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+	/* Prevent the children from resuming before us. */
+	down_write(&dev->power.rwsem);
+
+	if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+		get_device(dev);
+		async_schedule(async_resume, dev);
+		return 0;
+	}
+
+	return __device_resume(dev, pm_transition);
+}
+
 /**
  * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
  * @state: PM transition of the system being carried out.
@@ -444,6 +482,7 @@ static void dpm_resume(pm_message_t stat
 
 	INIT_LIST_HEAD(&list);
 	mutex_lock(&dpm_list_mtx);
+	pm_transition = state;
 	while (!list_empty(&dpm_list)) {
 		struct device *dev = to_device(dpm_list.next);
 
@@ -454,7 +493,7 @@ static void dpm_resume(pm_message_t stat
 			dev->power.status = DPM_RESUMING;
 			mutex_unlock(&dpm_list_mtx);
 
-			error = device_resume(dev, state);
+			error = device_resume(dev);
 
 			mutex_lock(&dpm_list_mtx);
 			if (error)
@@ -469,6 +508,7 @@ static void dpm_resume(pm_message_t stat
 	}
 	list_splice(&list, &dpm_list);
 	mutex_unlock(&dpm_list_mtx);
+	async_synchronize_full();
 }
 
 /**
@@ -584,13 +624,11 @@ static int device_suspend_noirq(struct d
 {
 	int error = 0;
 
-	if (!dev->bus)
-		return 0;
-
-	if (dev->bus->pm) {
+	if (dev->bus && dev->bus->pm) {
 		pm_dev_dbg(dev, state, "LATE ");
 		error = pm_noirq_op(dev, dev->bus->pm, state);
 	}
+
 	return error;
 }
 
@@ -623,17 +661,24 @@ int dpm_suspend_noirq(pm_message_t state
 }
 EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
 
+static int async_error;
+
 /**
  * device_suspend - Execute "suspend" callbacks for given device.
  * @dev: Device to handle.
  * @state: PM transition of the system being carried out.
  */
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
 {
 	int error = 0;
 
+	/* Wait for the suspends of the children to complete, if necessary. */
+	down_write_nested(&dev->power.rwsem, SINGLE_DEPTH_NESTING);
 	down(&dev->sem);
 
+	if (async_error)
+		goto End;
+
 	if (dev->class) {
 		if (dev->class->pm) {
 			pm_dev_dbg(dev, state, "class ");
@@ -666,12 +711,50 @@ static int device_suspend(struct device
 			suspend_report_result(dev->bus->suspend, error);
 		}
 	}
+
+	if (!error)
+		dev->power.status = DPM_OFF;
+
  End:
 	up(&dev->sem);
+	up_write(&dev->power.rwsem);
+
+	/* Allow the parent to suspend now. */
+	if (dev->parent)
+		up_read_non_owner(&dev->parent->power.rwsem);
 
 	return error;
 }
 
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+	struct device *dev = (struct device *)data;
+	int error;
+
+	error = __device_suspend(dev, pm_transition);
+	if (error) {
+		pm_dev_err(dev, pm_transition, " async", error);
+		async_error = error;
+	}
+
+	put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+	/* Prevent the parent from suspending before us. */
+	if (dev->parent)
+		down_read_non_owner(&dev->parent->power.rwsem);
+
+	if (dev->power.async_suspend) {
+		get_device(dev);
+		async_schedule(async_suspend, dev);
+		return 0;
+	}
+
+	return __device_suspend(dev, pm_transition);
+}
+
 /**
  * dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
  * @state: PM transition of the system being carried out.
@@ -683,6 +766,7 @@ static int dpm_suspend(pm_message_t stat
 
 	INIT_LIST_HEAD(&list);
 	mutex_lock(&dpm_list_mtx);
+	pm_transition = state;
 	while (!list_empty(&dpm_list)) {
 		struct device *dev = to_device(dpm_list.prev);
 
@@ -697,13 +781,17 @@ static int dpm_suspend(pm_message_t stat
 			put_device(dev);
 			break;
 		}
-		dev->power.status = DPM_OFF;
 		if (!list_empty(&dev->power.entry))
 			list_move(&dev->power.entry, &list);
 		put_device(dev);
+		if (async_error)
+			break;
 	}
 	list_splice(&list, dpm_list.prev);
 	mutex_unlock(&dpm_list_mtx);
+	async_synchronize_full();
+	if (!error)
+		error = async_error;
 	return error;
 }
 
@@ -762,6 +850,7 @@ static int dpm_prepare(pm_message_t stat
 	INIT_LIST_HEAD(&list);
 	mutex_lock(&dpm_list_mtx);
 	transition_started = true;
+	async_error = 0;
 	while (!list_empty(&dpm_list)) {
 		struct device *dev = to_device(dpm_list.next);
 
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
 
 extern int pm_trace_enabled;
 
+static inline int pm_trace_is_enabled(void)
+{
+	return pm_trace_enabled;
+}
+
 struct device;
 extern void set_trace_device(struct device *);
 extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
 
 #else
 
+static inline int pm_trace_is_enabled(void) { return 0; }
+
 #define TRACE_DEVICE(dev) do { } while (0)
 #define TRACE_RESUME(dev) do { } while (0)
 
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
 	return dev->kobj.state_in_sysfs;
 }
 
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+	if (dev->power.status == DPM_ON)
+		dev->power.async_suspend = enable;
+}
+
 void driver_init(void);
 
 /*
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> Below is a patch I've just tested, but there's a lockdep problem in it I don't
> know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> the lock taken in device_suspend() and it complains we take it twice in a row
> (which we do, but for another device). I need to use down_read_non_owner()
> to make it shut up and then I also need to use up_read_non_owner() in
> __device_suspend(),
Ok, that I admit is actually a problem.
Ok, ok, I'll accept that completion() version, even though I think it's
inferior.
Linus
On Fri, 11 Dec 2009, Rafael J. Wysocki wrote:
> > > .. and I've told you several times that we should simply not do such
> > > devices asynchronously. At least not unless there is some _overriding_
> > > reason to. And so far, nobody has suggested anything even remotely
> > > likely for that.
> >
> > Agreed. The fact that async non-tree suspend constraints are difficult
> > with rwsems isn't a drawback if nobody needs to use them.
>
> Well, see my reply to Linus. The only thing that bothers me is that if we use
> rwsems, there's no way to handle that even if it turns out that someone
> needs them after all.
This is now a totally moot point, but I want to make it anyway just to
show how perverse life can be. It turns out that by combining some of
the worst parts of the rwsem approach and the completion approach, it
_is_ possible to have async non-tree suspend constraints with rwsems.
The key is to imitate the way the completions work.
The resume algorithm doesn't change, but the suspend algorithm does.
Currently, when suspending a device you first read-lock the parent (to
prevent it from suspending too soon), then you asynchronously
write-lock the device and suspend it, and finally read-unlock the
parent.
Instead, you could first write-lock the device (to prevent the parent
and any other dependents from suspending too soon), then asynchronously
read-lock each of the children and anything else the device needs to
wait for, then suspend the device, and finally write-unlock it. This
really is analogous to completions: down_write() is like
init_completion(), up_write() is like complete_all(), and
down_read()+up_read() is like wait_for_completion(). I got the idea
from Linus's comment that completions really are nothing but locks
initialized in the "locked" state.
Of course, you would have to iterate over all the children and deal
with lockdep complaints. So this obviously is not to be considered as
a serious proposal.
Alan Stern
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> Below is a patch I've just tested, but there's a lockdep problem in it I don't
> know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> the lock taken in device_suspend() and it complains we take it twice in a row
> (which we do, but for another device). I need to use down_read_non_owner()
> to make it shut up and then I also need to use up_read_non_owner() in
> __device_suspend(), although there's the comment in include/linux/rwsem.h
> saying exactly this about that:
>
> /*
> * Take/release a lock when not the owner will release it.
> *
> * [ This API should be avoided as much as possible - the
> * proper abstraction for this case is completions. ]
> */
>
> (I'd like to know your opinion about that). Yet, that's not all, because next
> it complains during resume that __device_resume() releases a lock it didn't
> acquire, which it clearly does, but that is intentional. Unfortunately,
> there's no up_write_non_owner() ...
Hah! I knew it!
How come lockdep didn't complain earlier? What's different about this
patch? Only the nesting annotations? Why should adding annotations
make lockdep less happy?
Alan Stern
On Saturday 12 December 2009, Alan Stern wrote:
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> > Below is a patch I've just tested, but there's a lockdep problem in it I don't
> > know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> > the lock taken in device_suspend() and it complains we take it twice in a row
> > (which we do, but for another device). I need to use down_read_non_owner()
> > to make it shut up and then I also need to use up_read_non_owner() in
> > __device_suspend(), although there's the comment in include/linux/rwsem.h
> > saying exactly this about that:
> >
> > /*
> > * Take/release a lock when not the owner will release it.
> > *
> > * [ This API should be avoided as much as possible - the
> > * proper abstraction for this case is completions. ]
> > */
> >
> > (I'd like to know your opinion about that). Yet, that's not all, because next
> > it complains during resume that __device_resume() releases a lock it didn't
> > acquire, which it clearly does, but that is intentional. Unfortunately,
> > there's no up_write_non_owner() ...
>
> Hah! I knew it!
>
> How come lockdep didn't complain earlier? What's different about this
> patch? Only the nesting annotations? Why should adding annotations
> make lockdep less happy?
I'm not sure. Perhaps I made a mistake during the previous tests.
Rafael
On Saturday 12 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Below is a patch I've just tested, but there's a lockdep problem in it I don't
> > know how to solve. Namely, lockdep is apparently unhappy with us not releasing
> > the lock taken in device_suspend() and it complains we take it twice in a row
> > (which we do, but for another device). I need to use down_read_non_owner()
> > to make it shut up and then I also need to use up_read_non_owner() in
> > __device_suspend(),
>
> Ok, that I admit is actually a problem.
>
> Ok, ok, I'll accept that completion() version, even though I think it's
> inferior.
Great! :-)
I slightly changed it in the meantime to avoid calling wait_for_completion()
when both the parent and the child are "synchronous", which prevents the code
from choking on some situations when the ordering of dpm_list is wrong (this
happens as a result of bugs, but not necessarily fatal, for example if one of
the drivers' suspend and resume callbacks are NULL and the bus type doesn't
access the hardware directly, so we shouldn't make things worse than they
already are IMO).
I'd like to put it into my tree in this form, if you don't mind.
[Note for Alan: dpm_wait() is not exported for now, we'll export it when there
are any users.]
Rafael
---
From: Rafael J. Wysocki <[email protected]>
Subject: PM: Asynchronous suspend and resume of devices
Theoretically, the total time of system sleep transitions (suspend
to RAM, hibernation) can be reduced by running suspend and resume
callbacks of device drivers in parallel with each other. However,
there are dependencies between devices such that we're not allowed
to suspend the parent of a device before suspending the device
itself. Analogously, we're not allowed to resume a device before
resuming its parent.
Thus, to make it possible to execute device drivers' suspend and
resume callbacks in parallel with each other, introduce (at the PM
core level) a synchronization mechanism preventing the dependencies
between devices from being violated.
First, device drivers that want their suspend and resume callbacks
to be run asynchronously need to set the power.async_suspend flags
of their devices using device_enable_async_suspend().
Second, for each device with the power.async_suspend flag set the PM
core will start async threads to execute its suspend and resume
callbacks.
The async threads started for different devices are synchronized with
each other and with the main suspend (or resume) thread with the help
of completions, in the following way:
(1) There is a completion, power.completion, for each device object.
(2) Each device's completion is reset before starting the async
suspend (or resume) thread for the device or, in the case of
devices whose power.async_suspend flags are not set, before
executing the device's suspend and resume callbacks.
(3) During suspend, right before running the bus type, device type
and device class suspend callbacks for the device, the PM core
waits for the completions of all the device's children to be
completed.
(4) During resume, right before running the bus type, device type and
device class resume callbacks for the device, the PM core waits
for the completion of the device's parent to be completed.
(5) The PM core completes power.completion for each device right
after the bus type, device type and device class suspend (or
resume) callbacks executed for the device have returned.
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/base/power/main.c | 115 ++++++++++++++++++++++++++++++++++++++++---
include/linux/device.h | 6 ++
include/linux/pm.h | 3 +
include/linux/resume-trace.h | 7 ++
4 files changed, 125 insertions(+), 6 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/completion.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct completion completion;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_completion(&dev->power.completion);
pm_runtime_init(dev);
}
@@ -111,6 +114,7 @@ void device_pm_remove(struct device *dev
pr_debug("PM: Removing info for %s:%s\n",
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
+ complete_all(&dev->power.completion);
mutex_lock(&dpm_list_mtx);
list_del_init(&dev->power.entry);
mutex_unlock(&dpm_list_mtx);
@@ -162,6 +166,31 @@ void device_pm_move_last(struct device *
}
/**
+ * dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @async: If unset, wait only if the device's power.async_suspend flag is set.
+ */
+static void dpm_wait(struct device *dev, bool async)
+{
+ if (!dev)
+ return;
+
+ if (async || dev->power.async_suspend)
+ wait_for_completion(&dev->power.completion);
+}
+
+static int dpm_wait_fn(struct device *dev, void *async_ptr)
+{
+ dpm_wait(dev, *((bool *)async_ptr));
+ return 0;
+}
+
+static void dpm_wait_for_children(struct device *dev, bool async)
+{
+ device_for_each_child(dev, &async, dpm_wait_fn);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -381,17 +410,19 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
+ * @async: If true, the device is being resumed asynchronously.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state, bool async)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ dpm_wait(dev->parent, async);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +457,36 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition, true);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ INIT_COMPLETION(dev->power.completion);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition, false);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +500,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +511,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +526,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
@@ -623,17 +681,24 @@ int dpm_suspend_noirq(pm_message_t state
}
EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
+static int async_error;
+
/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
+ * @async: If true, the device is being suspended asynchronously.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state, bool async)
{
int error = 0;
+ dpm_wait_for_children(dev, async);
down(&dev->sem);
+ if (async_error)
+ goto End;
+
if (dev->class) {
if (dev->class->pm) {
pm_dev_dbg(dev, state, "class ");
@@ -666,12 +731,44 @@ static int device_suspend(struct device
suspend_report_result(dev->bus->suspend, error);
}
}
+
+ if (!error)
+ dev->power.status = DPM_OFF;
+
End:
up(&dev->sem);
+ complete_all(&dev->power.completion);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_suspend(dev, pm_transition, true);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ async_error = error;
+ }
+
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev)
+{
+ INIT_COMPLETION(dev->power.completion);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition, false);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,13 +780,15 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
mutex_unlock(&dpm_list_mtx);
- error = device_suspend(dev, state);
+ error = device_suspend(dev);
mutex_lock(&dpm_list_mtx);
if (error) {
@@ -697,13 +796,17 @@ static int dpm_suspend(pm_message_t stat
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = async_error;
return error;
}
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
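The condition inside dpm_wait() in the patch above can be restated as a small
userspace sketch. Nothing here is kernel code: mock_completion, mock_device and
mock_dpm_must_wait() are hypothetical stand-ins, and the real function blocks
in wait_for_completion() instead of returning a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; not the real definitions. */
struct mock_completion {
	bool done;
};

struct mock_device {
	const char *name;
	struct mock_device *parent;
	bool async_suspend;                /* mirrors power.async_suspend */
	struct mock_completion completion; /* mirrors power.completion */
};

/*
 * Mirrors the condition in dpm_wait(): returns true if the caller would
 * have to block on dev's completion, false if it can proceed at once.
 */
static bool mock_dpm_must_wait(const struct mock_device *dev, bool async)
{
	if (!dev)
		return false;   /* e.g. resuming a device that has no parent */
	if (!(async || dev->async_suspend))
		return false;   /* sync caller, sync target: no wait is done */
	return !dev->completion.done;
}
```

A synchronous caller skips the wait on a synchronous target, presumably because
the main thread has already handled that target earlier in list order. That
reading is inferred from the changelog, not stated in it.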
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> I'd like to put it into my tree in this form, if you don't mind.
This version still has a major problem, which is not related to
completions vs rwsems, but simply to the fact that you wanted to do this
at the generic device layer level rather than do it at the actual
low-level suspend/resume level.
Namely that there's no apparent sane way to say "don't wait for children".
PCI bridges that don't suspend at all - or any other device that only
suspends in the 'suspend_late()' thing, for that matter - don't have any
reason what-so-ever to wait for children, since they aren't actually
suspending in the first place. But you make them wait regardless, which
then serializes things unnecessarily (for example, two unrelated USB
controllers).
And no, making _everything_ be async is _not_ the answer.
Linus
On Saturday 12 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > I'd like to put it into my tree in this form, if you don't mind.
>
> This version still has a major problem, which is not related to
> completions vs rwsems, but simply to the fact that you wanted to do this
> at the generic device layer level rather than do it at the actual
> low-level suspend/resume level.
>
> Namely that there's no apparent sane way to say "don't wait for children".
>
> PCI bridges that don't suspend at all - or any other device that only
> suspends in the 'suspend_late()' thing, for that matter - don't have any
> reason what-so-ever to wait for children, since they aren't actually
> suspending in the first place. But you make them wait regardless, which
> then serializes things unnecessarily (for example, two unrelated USB
> controllers).
This is a problem that needs to be solved.
One solution that we have discussed on linux-pm is to start a bunch of async
threads searching for async devices that can be suspended and suspending
them (assuming suspend is considered) out of order with respect to dpm_list.
For example, leaf async devices can always be suspended at the same time
regardless of their positions in dpm_list. This way we could get almost the
entire gain resulting from suspending or resuming devices in parallel without
bothering drivers with the problem of dependencies that need to be honoured.
That's something we can add on top of this patch, though, not to complicate
things from the start and it surely requires more discussion.
> And no, making _everything_ be async is _not_ the answer.
I'm not sure what you mean, really.
Speaking of PCI bridges, even though they don't "suspend" in the sense of
being put into low power states or something, we still need to save their
registers on suspend and restore them on resume, and that restore has to
be done before we start to access devices below the bridge.
There are devices with totally null suspend and resume routines that even
the bus type doesn't really handle, but those can be marked as "async" from
the start and they won't really get in the way any more (this creates another
issue to solve, namely that we shouldn't really start a new async thread for
each of them; we have considered that too).
Even if we move that all to drivers, the constraints won't go away and someone
will have to take care of them. Now, since _we_ have problems with reaching
an agreement about how to do it, the driver writers will be even less likely to
figure that out.
Rafael
On Saturday 12 December 2009, Rafael J. Wysocki wrote:
> On Saturday 12 December 2009, Linus Torvalds wrote:
> >
> > On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> > >
...
>
> > And no, making _everything_ be async is _not_ the answer.
>
> I'm not sure what you mean, really.
>
> Speaking of PCI bridges, even though they don't "suspend" in the sense of
> being put into low power states or something, we still need to save their
> registers on suspend and restore them on resume, and that restore has to
> be done before we start to access devices below the bridge.
Of course we restore them at the early stage now so the above remark doesn't
apply to the patch in question, sorry.
But the one below does.
> Even if we move that all to drivers, the constraints won't go away and someone
> will have to take care of them. Now, since _we_ have problems with reaching
> an agreement about how to do it, the driver writers will be even less likely to
> figure that out.
Rafael
On Thursday 10 December 2009, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > On Wednesday 09 December 2009, Ingo Molnar wrote:
> > >
> > > * Rafael J. Wysocki <[email protected]> wrote:
> > >
> > > > On Tuesday 08 December 2009, Alan Stern wrote:
> > > > > On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
> > > > >
> > > > > > BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> > > > > > and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> > > > > > here?
> > > > >
> > > > > And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> > > > > these routines were not intended to be called with interrupts disabled,
> > > > > but that requirement doesn't seem to be documented. And it isn't a
> > > > > natural requirement anyway.
> > > >
> > > > OK, let's ask Ingo about that.
> > > >
> > > > Ingo, is there any particular reason why completion_done() and
> > > > try_wait_for_completion() don't use spin_lock_irqsave() and
> > > > spin_unlock_irqrestore()?
> > >
> > > that's a bug that should be fixed - all the wakeup side (and atomic)
> > > variants of the completion API should be irq safe.
> > >
> > > It appears that these new completion APIs were added via the XFS tree
> > > about a year ago:
> > >
> > > 39d2f1a: [XFS] extend completions to provide XFS object flush requirements
> > >
> > > Please Cc: scheduler folks to all scheduler patches.
> >
> > If you haven't fixed it locally yet, would you mind me posting a fix?
>
> I wouldn't mind it at all.
It is appended.
Thanks,
Rafael
---
From: Rafael J. Wysocki <[email protected]>
Subject: sched: Make wakeup side variants of completion API irq safe
All the wakeup side variants of the completion API should be irq
safe, but completion_done() and try_wait_for_completion() aren't.
Fix the problem by making them use spin_lock_irqsave() and
spin_unlock_irqrestore().
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/sched.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5931,14 +5931,15 @@ EXPORT_SYMBOL(wait_for_completion_killab
*/
bool try_wait_for_completion(struct completion *x)
{
+ unsigned long flags;
int ret = 1;
- spin_lock_irq(&x->wait.lock);
+ spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
else
x->done--;
- spin_unlock_irq(&x->wait.lock);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(try_wait_for_completion);
@@ -5953,12 +5954,13 @@ EXPORT_SYMBOL(try_wait_for_completion);
*/
bool completion_done(struct completion *x)
{
+ unsigned long flags;
int ret = 1;
- spin_lock_irq(&x->wait.lock);
+ spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
- spin_unlock_irq(&x->wait.lock);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(completion_done);
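Why the patch above matters can be shown with a toy userspace model. A single
flag stands in for the CPU's local interrupt state, and the mock_* helpers are
hypothetical stand-ins for the spinlock calls (the real spin_lock_irqsave() is
a macro that stores the flags in its second argument). Only the interrupt
state is modelled, not the locking itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for the local interrupt state. */
static bool irqs_enabled = true;

struct mock_completion {
	unsigned int done;
};

/* Models spin_lock_irq()/spin_unlock_irq(): unlock always re-enables. */
static void mock_lock_irq(void)   { irqs_enabled = false; }
static void mock_unlock_irq(void) { irqs_enabled = true; }

/* Models spin_lock_irqsave(): remember the old state, then disable. */
static unsigned long mock_lock_irqsave(void)
{
	unsigned long flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Models spin_unlock_irqrestore(): put back whatever state was saved. */
static void mock_unlock_irqrestore(unsigned long flags)
{
	irqs_enabled = flags != 0;
}

/* Shape of completion_done() before the patch. */
static bool completion_done_old(struct mock_completion *x)
{
	bool ret;

	mock_lock_irq();
	ret = x->done != 0;
	mock_unlock_irq();
	return ret;
}

/* Shape of completion_done() after the patch. */
static bool completion_done_new(struct mock_completion *x)
{
	unsigned long flags = mock_lock_irqsave();
	bool ret = x->done != 0;

	mock_unlock_irqrestore(flags);
	return ret;
}
```

In this model, a caller that enters completion_done_old() with interrupts
disabled gets them back enabled, while completion_done_new() leaves the
caller's state as it found it.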
Commit-ID: 7539a3b3d1f892dd97eaf094134d7de55c13befe
Gitweb: http://git.kernel.org/tip/7539a3b3d1f892dd97eaf094134d7de55c13befe
Author: Rafael J. Wysocki <[email protected]>
AuthorDate: Sun, 13 Dec 2009 00:07:30 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 13 Dec 2009 08:12:46 +0100
sched: Make wakeup side and atomic variants of completion API irq safe
Alan Stern noticed that all the wakeup side (and atomic) variants of the
completion APIs should be irq safe, but the newly introduced
completion_done() and try_wait_for_completion() aren't. The use of the
irq unsafe variants in IRQ contexts can cause crashes/hangs.
Fix the problem by making them use spin_lock_irqsave() and
spin_unlock_irqrestore().
Reported-by: Alan Stern <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: pm list <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: David Chinner <[email protected]>
Cc: Lachlan McIlroy <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ff39cad..8b3532f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5908,14 +5908,15 @@ EXPORT_SYMBOL(wait_for_completion_killable);
*/
bool try_wait_for_completion(struct completion *x)
{
+ unsigned long flags;
int ret = 1;
- spin_lock_irq(&x->wait.lock);
+ spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
else
x->done--;
- spin_unlock_irq(&x->wait.lock);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(try_wait_for_completion);
@@ -5930,12 +5931,13 @@ EXPORT_SYMBOL(try_wait_for_completion);
*/
bool completion_done(struct completion *x)
{
+ unsigned long flags;
int ret = 1;
- spin_lock_irq(&x->wait.lock);
+ spin_lock_irqsave(&x->wait.lock, flags);
if (!x->done)
ret = 0;
- spin_unlock_irq(&x->wait.lock);
+ spin_unlock_irqrestore(&x->wait.lock, flags);
return ret;
}
EXPORT_SYMBOL(completion_done);
On Saturday 12 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > I'd like to put it into my tree in this form, if you don't mind.
>
> This version still has a major problem, which is not related to
> completions vs rwsems, but simply to the fact that you wanted to do this
> at the generic device layer level rather than do it at the actual
> low-level suspend/resume level.
>
> Namely that there's no apparent sane way to say "don't wait for children".
There is, if the parent would really do something that could disturb the
children. This isn't always the case, but at least in a few important cases
it is (think of a USB controller and USB devices behind it, for example).
I thought we had this discussion already, but perhaps that was with someone
else and in a slightly different context.
The main reasons why I think it's useful to do this at the generic device layer
level are that, if we do it this way:
a. Drivers that don't want to be "asynchronous" don't need to care in any case.
b. Drivers whose suspend and resume routines are guaranteed not to disturb
anyone else can mark their devices as "async" and be done with it, no other
modification of the code is needed (drivers that do nothing in their suspend
and resume routines also fall into this category).
Now, if it's done at the low-level suspend/resume level, a. will not be true
any more in general. Say device A has parent B and the driver of A wants to
suspend asynchronously. It needs to split its suspend into a synchronous and an
asynchronous part and at one point start an async thread to run the latter.
Now assume B has a real reason not to suspend before the suspend of A has
finished. Then, the driver of B has to be modified so that it waits for
A's async suspend to complete (some sort of synchronization between the two
has to be added). So, even if B is "synchronous", its driver has to be
modified to handle the asynchronous suspend of A.
Similarly, b. will no longer be true if it's done at the low-level
suspend/resume level, because now every driver that wants to be
"asynchronous" will need to take care of running an async thread etc.
Moreover, it will need to make sure that the device parent's driver doesn't
need to be modified, because the parent's suspend may do something that will
disturb the child's asynchronous suspend. Furthermore, if the parent's driver
doesn't need to be modified, it will need to consider the parent of the parent,
because that one may potentially disturb the asynchronous suspend of its
grand child and so on up to a device without a parent.
That already is a pain to a driver writer, but the problem you're saying would
be solved by doing this at the low-level suspend/resume level is still there
in general! Namely, go back to the example with devices A and B and say B
_really_ has to wait for A's suspend to complete. Then, since B is after A in
dpm_list, the PM core will not start the suspend of any device after B until
the suspend of B returns. Now, if the suspend of B waits for the suspend of
A, then the PM core will effectively wait for the suspend of A to complete
before suspending any other devices. Worse yet, if that happens, we can't do
anything about it at the low-level suspend/resume level, although at the PM
core level we can.
Rafael
On Sat, 12 Dec 2009, Linus Torvalds wrote:
> This version still has a major problem, which is not related to
> completions vs rwsems, but simply to the fact that you wanted to do this
> at the generic device layer level rather than do it at the actual
> low-level suspend/resume level.
>
> Namely that there's no apparent sane way to say "don't wait for children".
>
> PCI bridges that don't suspend at all - or any other device that only
> suspends in the 'suspend_late()' thing, for that matter - don't have any
> reason what-so-ever to wait for children, since they aren't actually
> suspending in the first place. But you make them wait regardless, which
> then serializes things unnecessarily (for example, two unrelated USB
> controllers).
In reality this should never be a problem.
Consider that ultimately we want to achieve the following two goals:
* Implement a two-pass algorithm, so that synchronous devices
  can't cause spurious dependencies between two async devices.
  (This will fix the issue of an intermediate PCI bridge
  serializing two unrelated USB controllers.)
* Convert all lengthy suspend/resume operations to async.
Obviously we don't want to do this all at once. But until the goals
are achieved, there's no point worrying about devices being forced to
wait for their children or parents. And after the goals are achieved,
it won't matter.
Why not? Consider the devices which would be delayed. If they use
synchronous suspend/resume then they won't take much time, so delaying
them won't matter. Indeed, based on Arjan's preliminary measurements
it's fair to say that the total time taken by all the synchronous
suspends/resumes put together should be negligible. Even if all of
them were somehow delayed until all the async activities were complete,
nobody would notice or care. (And conversely, if all the async
activities could somehow be forced to wait until all the synchronous
suspends/resumes were done, nobody would notice or care.)
Okay, so consider a case where A comes before B in dpm_list and B is
the parent of C. Suppose B doesn't need to wait for C to suspend, but
we force it to wait anyhow.
If A or C is synchronous then we're okay, by the considerations above.
Suppose A is async. Then it wouldn't be delayed unless it was one of
B's ancestors, so suppose it is. Now we are potentially delaying A
more than necessary.
Or are we? Even though B might not need to wait for C to suspend,
there's an excellent chance that A _does_ need to wait for C. If we
allow B to suspend before C then there would be nothing to prevent A
from suspending too quickly. A's driver would need to wait explicitly
for C -- which is unreasonable since C isn't one of A's children.
(Rafael made a similar point.)
In short, allowing devices to suspend before their children would be
dangerous and probably would not save a significant amount of time.
Alan Stern
On Sun, 13 Dec 2009, Alan Stern wrote:
> > Namely that there's no apparent sane way to say "don't wait for children".
> >
> > PCI bridges that don't suspend at all - or any other device that only
> > suspends in the 'suspend_late()' thing, for that matter - don't have any
> > reason what-so-ever to wait for children, since they aren't actually
> > suspending in the first place. But you make them wait regardless, which
> > then serializes things unnecessarily (for example, two unrelated USB
> > controllers).
> In short, allowing devices to suspend before their children would be
> dangerous and probably would not save a significant amount of time.
There's more to be said. Even without this "don't wait for children"
thing, there can be bad interactions causing unnecessary delays. For
example, suppose A (async) is the parent of B (sync), B comes before C
(sync) in dpm_list, and C is the parent of D (async). Even if A & B
are unrelated to C & D, they will be forced to wait for them. It
doesn't matter that A and D are unrelated and so could suspend
concurrently.
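The A/B/C/D example can be checked with a rough timeline model. This is a
userspace sketch, not kernel code: sim_dev, sim_suspend() and the costs (10
units for async devices, 1 for sync ones) are made up for illustration, and
starting an async thread is assumed to cost nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_dev {
	const char *name;
	int parent;   /* index of the parent in the array, or -1 */
	bool async;   /* models power.async_suspend */
	int cost;     /* time spent in the suspend callbacks */
	int finish;   /* filled in by the simulation */
};

/*
 * devs[] is given in the order the main suspend thread visits it
 * (reverse dpm_list, so children come before their parents).
 * Returns the time at which the last device has finished suspending.
 */
static int sim_suspend(struct sim_dev *devs, size_t n)
{
	int main_clock = 0;   /* when the main thread reaches the next device */
	int total = 0;

	for (size_t i = 0; i < n; i++) {
		int start = main_clock;

		/* A parent's callbacks run only after all its children finish. */
		for (size_t j = 0; j < i; j++)
			if (devs[j].parent == (int)i && devs[j].finish > start)
				start = devs[j].finish;

		devs[i].finish = start + devs[i].cost;
		if (!devs[i].async)
			main_clock = devs[i].finish;   /* main thread was busy */
		if (devs[i].finish > total)
			total = devs[i].finish;
	}
	return total;
}

/* dpm_list is A, B, C, D; suspend visits it in reverse: D, C, B, A. */
static int sim_example(bool all_async)
{
	struct sim_dev devs[] = {
		{ "D",  1, true,      10, 0 },   /* child of C */
		{ "C", -1, all_async,  1, 0 },
		{ "B",  3, all_async,  1, 0 },   /* child of A */
		{ "A", -1, true,      10, 0 },
	};

	return sim_suspend(devs, sizeof(devs) / sizeof(devs[0]));
}
```

With these costs sim_example(false) gives 22 units, because A cannot start
until the main thread has waited for D and worked through C and B.
sim_example(true) drops the list-order constraint by marking everything async
and gives 11 units. That is only a way to compute the bound set by the
parent-child constraints in this model, not a suggestion to mark every device
async.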
In essence, every synchronous device is treated as though it depends on
all the synchronous devices preceding it in dpm_list. That's a lot of
unnecessary constraints. At the moment we have no choice, because we
have to assume that some of those constraints actually are necessary --
and we don't know which ones.
It's an inescapable fact: If there are unnecessary ordering constraints
then you generally can't be 100% efficient in carrying out parallel
operations. Compared with all these extra "synchronous" constraints,
the relatively small number of "don't need to wait for children"
constraints is harmless. I bet that if we got rid of all unnecessary
constraints except for making parents always wait for their children,
we'd attain more than 95% of the ideal speedup.
Alan Stern
On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
>
> One solution that we have discussed on linux-pm is to start a bunch of async
> threads searching for async devices that can be suspended and suspending
> them (assuming suspend is considered) out of order with respect to dpm_list.
Ok, guys, stop the crazy.
That's another of those "ok, that's just totally stupid and clearly too
complex" ideas that I would never pull.
I should seriously suggest that people just stop discussing architectural
details on the pm list if they all end up being this level of crazy.
The sane thing to do is to just totally ignore the async layer on PCI
bridges and other things that only have a late-suspend/early-resume thing.
No need for the above kind of obviously idiotic crap.
However, my point was really that we wouldn't even have _needed_ that kind
of special case if we had just decided to let the subsystems do it. But
whatever. At worst, the PCI layer can even just mark such devices with
just late/early suspend/resume as being asynchronous, even though that
ends up resulting in some totally pointless async work that doesn't do
anything.
But please guys - rein in the crazy ideas on the pm list. It's not like
our suspend/resume has gotten so stable as to be boring, and we want it to
become unreliable again.
Linus
On Monday 14 December 2009, Linus Torvalds wrote:
>
> On Sat, 12 Dec 2009, Rafael J. Wysocki wrote:
> >
> > One solution that we have discussed on linux-pm is to start a bunch of async
> > threads searching for async devices that can be suspended and suspending
> > them (assuming suspend is considered) out of order with respect to dpm_list.
>
> Ok, guys, stop the crazy.
>
> That's another of those "ok, that's just totally stupid and clearly too
> complex" ideas that I would never pull.
>
> I should seriously suggest that people just stop discussing architectural
> details on the pm list if they all end up being this level of crazy.
>
> The sane thing to do is to just totally ignore the async layer on PCI
> bridges and other things that only have a late-suspend/early-resume thing.
> No need for the above kind of obviously idiotic crap.
>
> However, my point was really that we wouldn't even have _needed_ that kind
> of special case if we had just decided to let the subsystems do it. But
> whatever. At worst, the PCI layer can even just mark such devices with
> just late/early suspend/resume as being asynchronous, even though that
> ends up resulting in some totally pointless async work that doesn't do
> anything.
>
> But please guys - rein in the crazy ideas on the pm list. It's not like
> our suspend/resume has gotten so stable as to be boring, and we want it to
> become unreliable again.
Indeed.
OK, what about a two-pass approach in which the first pass only inits the
completions and starts async threads for leaf "async" devices? I think leaf
devices are most likely to take much time to suspend, so this will give us
a chance to save quite some time.
A more aggressive version of this might start the async threads for all async
devices in the first pass and then only handle the synchronous ones in the
second pass - as long as there are only a few async devices that should be
quite efficient.
Rafael
On Mon, 14 Dec 2009, Rafael J. Wysocki wrote:
>
> OK, what about a two-pass approach in which the first pass only inits the
> completions and starts async threads for leaf "async" devices? I think leaf
> devices are most likely to take much time to suspend, so this will give us
> a chance to save quite some time.
Why?
Really.
Again, stop making it harder than it needs to be.
Why do you make up these crazy schemes that are way more complex than they
need to be?
Here's an untested one-liner that has a 10-line comment.
I agree it is ugly, but it is ugly exactly because the generic device
layer _forces_ us to wait for children even when we don't want to. With
this, that unnecessary wait is now done asynchronously.
I'd rather do it some other way - perhaps having an explicit flag that
says "don't wait for children because I'm not going to suspend myself
until 'suspend_late' _anyway_". But at least this is _simple_.
Linus
---
drivers/pci/probe.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 98ffb2d..4e0ad7b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -437,6 +437,17 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
}
bridge->subordinate = child;
+ /*
+ * We don't really suspend PCI buses asynchronously.
+ *
+ * However, since we don't actually suspend them at all until
+ * the late phase, we might as well lie to the device layer
+ * and tell it to do our no-op not-suspend asynchronously, so that
+ * we end up not synchronizing with any of our child devices
+ * that might want to be asynchronous.
+ */
+ bridge->dev.power.async_suspend = 1;
+
return child;
}
On Mon, 14 Dec 2009, Linus Torvalds wrote:
>
> Here's an untested one-liner that has a 10-line comment.
Btw, when I say "untested", in this case I mean that it isn't even
compile-tested. I haven't merged your other patches yet, so in my tree
that 'async_suspend' flag doesn't even exist, and the patch I sent out
definitely doesn't compile.
But it _might_ compile (and perhaps even work) in your tree.
Linus
On Monday 14 December 2009, Linus Torvalds wrote:
>
> On Mon, 14 Dec 2009, Rafael J. Wysocki wrote:
> >
> > OK, what about a two-pass approach in which the first pass only inits the
> > completions and starts async threads for leaf "async" devices? I think leaf
> > devices are most likely to take much time to suspend, so this will give us
> > a chance to save quite some time.
>
> Why?
>
> Really.
Because the PCI bridges are not the only case where it matters (I'd say they
are really a corner case). Basically, any two async devices separated by a
series of sync ones are likely not to be suspended (or resumed) in parallel
with each other, because the parent is usually next to its children in dpm_list.
So, if the first device suspends, its "synchronous" parent waits for it and the
suspend of the second async device won't be started until the first one's
suspend has returned. And it doesn't matter at what level we do the async
thing, because dpm_list is there anyway.
As Alan said, the real problem is that we generally can't change the ordering
of dpm_list arbitrarily, because we don't know what's going to happen as a
result. The async_suspend flag tells us, basically, what devices can be safely
moved to different positions in dpm_list without breaking things, as long as
they are not moved behind their parents or in front of their children.
Starting the async suspends upfront would effectively work in the same way as
moving those devices to the beginning of dpm_list without breaking the
parent-child chains, which in turn is likely to allow us to save some extra
time.
That's not only about the PCI bridges, it's more general. As far as your
one-liner is concerned, I'm going to test it, because I think we could use it
anyway.
Rafael
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
>
> Because the PCI bridges are not the only case where it matters (I'd say they
> are really a corner case). Basically, any two async devices separated by a
> series of sync ones are likely not to be suspended (or resumed) in parallel
> with each other, because the parent is usually next to its children in dpm_list.
Give a real example that matters.
Really.
How hard can it be to understand: KISS. Keep It Simple, Stupid.
I get really tired of this whole stupid async discussion, because you're
overdesigning it.
To a first approximation, THE ONLY THING THAT MATTERS IS USB.
Linus
On Mon, 14 Dec 2009, Linus Torvalds wrote:
>
> I get really tired of this whole stupid async discussion, because you're
> overdesigning it.
Btw, this is important. I'm not going to pull even your _current_ async
stuff if you can't show that you fundamentally UNDERSTAND this fact.
Stop making up idiotic complex interfaces. Look at my one-liner patch, and
realize that it gets you 99% there - the 99% that matters.
Linus
On Tuesday 15 December 2009, Linus Torvalds wrote:
>
> On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Because the PCI bridges are not the only case where it matters (I'd say they
> > are really a corner case). Basically, any two async devices separated by a
> > series of sync ones are likely not to be suspended (or resumed) in parallel
> > with each other, because the parent is usually next to its children in dpm_list.
>
> Give a real example that matters.
I'll try. Let -> denote child-parent relationships and assume dpm_list looks
like this:
..., A->B->C, D, E->F->G, ...
where A, B, E, F are all async and C, D, G are sync (E, F, G may be USB and
A, B, C may be serio input devices and D is a device that just happens to be in
dpm_list between them). Say A and C take the majority of the total suspend
time and assume we traverse the dpm_list from left to right.
Now, during suspend, C waits for B that waits for A and G waits for F that
waits for E. Moreover, since C is sync, the PM core won't start the suspend
of D until the suspend of C has returned. In turn, since D is sync, the
suspend of E won't be started until the suspend of D has returned. So in
this situation the gain from the async suspends of A, B, E, F is zero.
However, it won't be zero if we start the async suspends of A, B, E, F
upfront.
I'm not sure if this is sufficiently "real life" for you, but this is how
dpm_list looks on one of my test boxes, more or less.
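The A->B->C, D, E->F->G example above can be checked with a toy model. The sketch below is not PM core code: struct sim_dev, simulate_suspend() and the durations are all hypothetical, chosen only so that A and C dominate. It assumes a sync device blocks the list walk until its callback returns, an async device only delays its own parent, and "upfront" means every async device is launched at time 0.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_DEVS 16

/* One entry of a toy dpm_list; all names here are hypothetical. */
struct sim_dev {
	int parent;	/* index of the parent in the list, or -1 */
	int is_async;	/* nonzero if the device may suspend in an async thread */
	int duration;	/* time its suspend callback takes, arbitrary units */
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/*
 * Walk the list left to right, children before parents, and return the
 * time at which the last suspend callback finishes.  A sync device blocks
 * the walk until it is done; an async device only blocks its own parent.
 * With upfront != 0 every async device is launched at time 0.
 */
int simulate_suspend(const struct sim_dev *devs, size_t n, int upfront)
{
	int children_done[SIM_MAX_DEVS] = { 0 };
	int main_time = 0, total = 0;
	size_t i;

	assert(n <= SIM_MAX_DEVS);
	for (i = 0; i < n; i++) {
		int launch = (devs[i].is_async && upfront) ? 0 : main_time;
		int start = max_int(launch, children_done[i]);
		int finish = start + devs[i].duration;

		if (!devs[i].is_async)
			main_time = finish;
		if (devs[i].parent >= 0) {
			size_t p = (size_t)devs[i].parent;

			/* Suspend order: a parent must come after its child. */
			assert(p > i && p < n);
			children_done[p] = max_int(children_done[p], finish);
		}
		total = max_int(total, finish);
	}
	return total;
}

/* A->B->C, D, E->F->G with A, B, E, F async; durations are made up. */
const struct sim_dev example_list[7] = {
	{ 1, 1, 4 },	/* A, child of B */
	{ 2, 1, 1 },	/* B, child of C */
	{ -1, 0, 4 },	/* C */
	{ -1, 0, 1 },	/* D */
	{ 5, 1, 2 },	/* E, child of F */
	{ 6, 1, 1 },	/* F, child of G */
	{ -1, 0, 1 },	/* G */
};
```

With these made-up durations, simulate_suspend(example_list, 7, 0) returns 14, which is just the sum of all seven durations (no gain from async), while simulate_suspend(example_list, 7, 1) returns 11, because E and F now overlap with A.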
> Really.
>
> How hard can it be to understand: KISS. Keep It Simple, Stupid.
>
> I get really tired of this whole stupid async discussion, because you're
> overdesigning it.
>
> To a first approximation, THE ONLY THING THAT MATTERS IS USB.
If this applies to _resume_ only, then I agree, but Arjan's data clearly
show that serio devices take much more time to suspend than USB.
But if we only talk about resume, the PCI bridges don't really matter,
because they are resumed before all devices that depend on them, so they don't
really need to wait for anyone anyway.
Rafael
On Tuesday 15 December 2009, Linus Torvalds wrote:
>
> On Mon, 14 Dec 2009, Linus Torvalds wrote:
> >
> > I get really tired of this whole stupid async discussion, because you're
> > overdesigning it.
>
> Btw, this is important. I'm not going to pull even your _current_ async
> stuff if you can't show that you fundamentally UNDERSTAND this fact.
What fact? The only thing that matters is USB? For resume, it is. For
suspend, it clearly isn't.
> Stop making up idiotic complex interfaces. Look at my one-liner patch, and
> realize that it gets you 99% there - the 99% that matters.
I said I was going to use it, but I don't think that's going to be sufficient.
[BTW, I'm not sure what you want to achieve by insulting me. Either you may
want to scare me, but I'm not scared, or you may want to try to make me so
disgusted that I'll just give up and back off, but this is not going to happen
either.]
Insults aside, I'm going to make some measurements to see how much time we can
save.
Rafael
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Give a real example that matters.
>
> I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> like this:
No.
I mean something real - something like
- if you run on a non-PC with two USB buses behind non-PCI controllers.
- device xyz.
> If this applies to _resume_ only, then I agree, but Arjan's data clearly
> show that serio devices take much more time to suspend than USB.
I mean in general - something where you actually have hard data that some
device really needs anything more than my one-liner, and really _needs_
some complex infrastructure.
Not "let's imagine a case like xyz".
> But if we only talk about resume, the PCI bridges don't really matter,
> because they are resumed before all devices that depend on them, so they don't
> really need to wait for anyone anyway.
But that's my _point_. That's the whole point of the one-liner patch. Read
the comment above that one-liner.
My whole point was that by doing the whole "wait for children" in generic
code, you also made devices - such as PCI bridges - have to wait for
children, even though they don't need to, and don't want to.
So I suggested an admittedly ugly hack to take care of it - rather than
some complex infrastructure.
Linus
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
>
> What fact? The only thing that matters is USB? For resume, it is. For
> suspend, it clearly isn't.
For suspend, the only other case we've seen has been the keyboard and
mouse controller, which has exactly the same "we can special case it with
a single 'let's do _this_ device asynchronously'". Again, it may not be
pretty, but it sure is simple.
Much simpler than talking about some generic infrastructure changes and
about doing "let's do leaves of the tree separately" schemes.
And that's why I'm _soo_ unhappy with you, and am insulting you. Because
you keep on making the same mistake over and over - overdesigning.
Overdesigning is a SIN. It's the archetypal example of what I call "bad
taste". I get really upset when a subsystem maintainer starts
overdesigning things.
Linus
On Tue, 15 Dec 2009, Linus Torvalds wrote:
> My whole point was that by doing the whole "wait for children" in generic
> code, you also made devices - such as PCI bridges - have to wait for
> children, even though they don't need to, and don't want to.
>
> So I suggested an admittedly ugly hack to take care of it - rather than
> some complex infrastructure.
It doesn't feel like an ugly hack to me. It seems like exactly the
Right Thing To Do: Make as many devices as possible use async
suspend/resume.
The only reason we don't make every device async is because we don't
know whether it's safe. In the case of PCI bridges we _do_ know --
because they don't have any work to do outside of
late_suspend/early_resume -- and so they _should_ be async.
The same goes for devices that don't have suspend or resume methods.
There remains a separate question: Should async devices also be forced
to wait for their children? I don't see why not. For PCI bridges it
won't make any significant difference. As long as the async code
doesn't have to do anything, who cares when it runs?
Alan Stern
On Tue, 15 Dec 2009, Alan Stern wrote:
>
> It doesn't feel like an ugly hack to me. It seems like exactly the
> Right Thing To Do: Make as many devices as possible use async
> suspend/resume.
The reason it's an ugly hack is that it's actually not a simple decision to
make. The devil is in the details:
> The only reason we don't make every device async is because we don't
> know whether it's safe. In the case of PCI bridges we _do_ know --
> because they don't have any work to do outside of
> late_suspend/early_resume -- and so they _should_ be async.
That's the theory, yes. And it was worth the comment to spell out that
theory. But..
It's a very subtle theory, and it's not necessarily always 100% true. For
example, a cardbus bridge is strictly speaking very much a PCI bridge, but
for cardbus bridges we _do_ have a suspend/resume function.
And perhaps worse than that, cardbus bridges are one of the canonical
examples where two different PCI devices actually share registers. It's
quite common that some of the control registers are shared across the two
subfunctions of a two-slot cardbus controller (and we generally don't even
have full docs for them!)
> The same goes for devices that don't have suspend or resume methods.
Yes and no.
Again, the "async_suspend" flag is done at the generic device layer, but
99% of all suspend/resume methods are _not_ done at that level: they are
bus-specific functions, where the bus has a generic suspend-resume
function that it exposes to the generic device layer, and that knows about
the bus-specific rules.
So if you are a PCI device (to take just that example - but it's true of
just about all other buses too), and you don't have any suspend or resume
methods, it's actually impossible to see that fact from the generic device
layer.
And even when you know it's PCI, our rules are actually not simple at all.
Our rules for PCI devices (and this strictly speaking is true for bridges
too) are rather complex:
- do we have _any_ legacy PM support (ie the "direct" driver
suspend/resume functions in the driver ops, rather than having a
"struct dev_pm_ops" pointer)? If so, call "->suspend()"
- If not - do we have that "dev_pm_ops" thing? If so, call it.
- If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
Notice? The way things are set up, if you have no suspend routine, you'll
not get suspended, but you will get disabled.
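The three rules above can be written down as a small decision function. This is only a sketch of the logic as described in this mail: struct sketch_pci_dev, its fields and the enum values are hypothetical names invented for illustration, not the PCI core's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the bus-level suspend path ends up doing for one device. */
enum sketch_pci_action {
	SKETCH_CALL_LEGACY_SUSPEND,	/* driver's "direct" ->suspend() */
	SKETCH_CALL_PM_OPS_SUSPEND,	/* driver's dev_pm_ops callback */
	SKETCH_DISABLE_DEVICE,		/* no callbacks: disable the device */
	SKETCH_LEAVE_ALONE,		/* no callbacks and it is a bridge */
};

/* Hypothetical summary of the facts the rules depend on. */
struct sketch_pci_dev {
	bool has_legacy_pm;	/* driver has legacy suspend/resume ops */
	bool has_pm_ops;	/* driver has a struct dev_pm_ops pointer */
	bool is_bridge;		/* device is a PCI bridge */
};

enum sketch_pci_action sketch_pci_suspend_action(const struct sketch_pci_dev *dev)
{
	assert(dev != NULL);

	if (dev->has_legacy_pm)
		return SKETCH_CALL_LEGACY_SUSPEND;
	if (dev->has_pm_ops)
		return SKETCH_CALL_PM_OPS_SUSPEND;
	if (!dev->is_bridge)
		return SKETCH_DISABLE_DEVICE;
	return SKETCH_LEAVE_ALONE;
}
```

Note that the legacy check comes first, so a driver with both kinds of callbacks takes the legacy path, and that a device with no driver at all still ends up in the "disable" branch unless it is a bridge.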
So it's _not_ actually safe to asynchronously suspend a PCI device if that
device has no driver or no suspend routines - because even in the absence
of a driver and suspend routines, we'll still at least disable it. And if
there is some subtle dependency on that device that isn't obvious (say, it
might be used indirectly for some ACPI thing), then that async suspend is
the wrong thing to do.
Subtle? Hell yes.
So the whole thing about "we can do PCI bridges asynchronously because
they are obviously no-op" is kind of true - except for the "obviously"
part. It's not obvious at all. It's rather subtle.
As an example of this kind of subtlety - iirc PCIE bridges used to have
suspend and resume bugs when we initially switched over to the "new world"
suspend/resume exactly because they actually did things at "suspend" time
(rather than suspend_late), and that broke devices behind them (this was
not related to async, of course, but the point is that even when you look
like a PCI bridge, you might be doing odd things).
So just saying "let's do it asynchronously" is _not_ always guaranteed to
be the right thing at all. It's _probably_ safe for at least regular PCI
bridges. Cardbus bridges? Probably not, but since most modern laptops have
just a single slot - and people who have multiple slots seldom use them
all - most people will probably never see the problems that it _could_
introduce.
And PCIE bridges? Should be safe these days, but it wasn't quite as
obvious, because a PCIE bridge actually has a driver unlike a regular
plain PCI-PCI bridge.
Subtle, subtle.
> There remains a separate question: Should async devices also be forced
> to wait for their children? I don't see why not. For PCI bridges it
> won't make any significant difference. As long as the async code
> doesn't have to do anything, who cares when it runs?
That's why I just set the "async_suspend = 1" thing.
But there might actually be reasons why we care. Like the fact that we
actually throttle the amount of parallel work we do in async_schedule().
So doing even a "no-op" asynchronously isn't actually a no-op: while it is
pending (and those things can be pending for a long time, since they have
to wait for those slow devices underneath them), it can cause _other_
async work - that isn't necessarily a no-op at all - to be then done
synchronously.
Now, admittedly our async throttling limits are high enough that the above
kind of detail will probably never ever really matter (default 256 worker
threads etc). But it's an example of how practice is different from theory
- in _theory_ it doesn't make any difference if you wait for something
asynchronously, but in practice it could make a difference under some
circumstances.
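The throttling effect described above can be shown with a toy model. This is a sketch, not the kernel's async_schedule(): the pool structure, its limit and both functions are hypothetical. It only assumes that an item which cannot be queued because too much work is already pending gets run synchronously by the caller instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the async machinery's bookkeeping. */
struct sketch_async_pool {
	int limit;	/* max items allowed to be pending at once */
	int pending;	/* items queued but not yet completed */
	int ran_sync;	/* items the caller had to run itself */
};

/*
 * Try to queue one item.  Returns true if it was queued asynchronously,
 * false if the pool was full and the caller ran it synchronously.
 */
bool sketch_async_schedule(struct sketch_async_pool *pool)
{
	assert(pool != NULL);

	if (pool->pending >= pool->limit) {
		pool->ran_sync++;
		return false;
	}
	pool->pending++;
	return true;
}

/* Mark one previously queued item as finished. */
void sketch_async_complete(struct sketch_async_pool *pool)
{
	assert(pool != NULL && pool->pending > 0);
	pool->pending--;
}
```

If the pending slots are all taken by "no-op" bridge suspends that are merely waiting for slow children, the next real work item gets false back and runs in the caller. With a limit in the hundreds this is unlikely to bite, which is the point made above.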
Linus
On Tue, 15 Dec 2009, Linus Torvalds wrote:
>
> And even when you know it's PCI, our rules are actually not simple at all.
> Our rules for PCI devices (and this strictly speaking is true for bridges
> too) are rather complex:
>
> - do we have _any_ legacy PM support (ie the "direct" driver
> suspend/resume functions in the driver ops, rather than having a
> "struct dev_pm_ops" pointer)? If so, call "->suspend()"
>
> - If not - do we have that "dev_pm_ops" thing? If so, call it.
>
> - If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
>
> Notice? The way things are set up, if you have no suspend routine, you'll
> not get suspended, but you will get disabled.
Side note - what I think might be a clean solution for PCI at least is to
do something like the following:
- move that "disable the device entirely" thing to suspend_late, rather
than the earlier suspend phase. Now PCI devices without drivers or PM
will not be touched at all in the first suspend phase.
- initialize all PCI devices to have 'async_suspend = 1' on discovery
- whenever we bind a driver to the PCI device, we'd then look at whether
that driver implements suspend/resume callbacks (legacy or new), and
clear the async_suspend bit if so.
That way we'd have the same old synchronous behavior for all PCI suspend
and resume events (unless the driver itself then sets the async_suspend
bit at device init time, which it could do, of course), while still always
doing async "no-op" events.
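The "async by default, cleared on driver bind" policy in the list above could look roughly like this. It is a sketch under assumptions: the structures and function names are hypothetical, and only the async_suspend flag name comes from the patch earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a driver: does it have any PM callbacks? */
struct sketch_driver {
	bool has_legacy_suspend;	/* legacy suspend/resume ops */
	bool has_pm_ops;		/* new-style dev_pm_ops */
};

/* Hypothetical view of a device as the bus sees it. */
struct sketch_device {
	bool async_suspend;
	const struct sketch_driver *drv;
};

/* On discovery: no driver yet, so suspend is a no-op and may be async. */
void sketch_device_discovered(struct sketch_device *dev)
{
	assert(dev != NULL);
	dev->drv = NULL;
	dev->async_suspend = true;
}

/* On bind: fall back to sync if the driver does any suspend/resume work. */
void sketch_bind_driver(struct sketch_device *dev,
			const struct sketch_driver *drv)
{
	assert(dev != NULL && drv != NULL);
	dev->drv = drv;
	if (drv->has_legacy_suspend || drv->has_pm_ops)
		dev->async_suspend = false;
}
```

A driver that knows it is safe could set async_suspend again after binding, as mentioned above; the sketch leaves that to the driver's own init code.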
That would avoid the ugly one-liner that just "knows" that PCI bridges are
special and don't do anything at suspend time (even though they aren't
really - a PCI bridge _could_ have a driver associated with it that does
something that might not be happy being asynchronous).
Linus
On Tue, 15 Dec 2009, Linus Torvalds wrote:
> It's a very subtle theory, and it's not necessarily always 100% true. For
> example, a cardbus bridge is strictly speaking very much a PCI bridge, but
> for cardbus bridges we _do_ have a suspend/resume function.
>
> And perhaps worse than that, cardbus bridges are one of the canonical
> examples where two different PCI devices actually share registers. It's
> quite common that some of the control registers are shared across the two
> subfunctions of a two-slot cardbus controller (and we generally don't even
> have full docs for them!)
Okay. This obviously implies that if/when cardbus bridges are
converted to async suspend/resume, the driver should make sure that the
lower-numbered devices wait for their sibling higher-numbered devices
to suspend (and vice versa for resume). Awkward though it may be.
> > The same goes for devices that don't have suspend or resume methods.
>
> Yes and no.
>
> Again, the "async_suspend" flag is done at the generic device layer, but
> 99% of all suspend/resume methods are _not_ done at that level: they are
> bus-specific functions, where the bus has a generic suspend-resume
> function that it exposes to the generic device layer, and that knows about
> the bus-specific rules.
>
> So if you are a PCI device (to take just that example - but it's true of
> just about all other buses too), and you don't have any suspend or resume
> methods, it's actually impossible to see that fact from the generic device
> layer.
Sure. That's why the async_suspend flag is set at the bus/driver
level.
> And even when you know it's PCI, our rules are actually not simple at all.
> Our rules for PCI devices (and this strictly speaking is true for bridges
> too) are rather complex:
>
> - do we have _any_ legacy PM support (ie the "direct" driver
> suspend/resume functions in the driver ops, rather than having a
> "struct dev_pm_ops" pointer)? If so, call "->suspend()"
>
> - If not - do we have that "dev_pm_ops" thing? If so, call it.
>
> - If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
>
> Notice? The way things are set up, if you have no suspend routine, you'll
> not get suspended, but you will get disabled.
>
> So it's _not_ actually safe to asynchronously suspend a PCI device if that
> device has no driver or no suspend routines - because even in the absence
> of a driver and suspend routines, we'll still at least disable it. And if
> there is some subtle dependency on that device that isn't obvious (say, it
> might be used indirectly for some ACPI thing), then that async suspend is
> the wrong thing to do.
>
> Subtle? Hell yes.
I don't disagree. However the subtlety lies mainly in the matter of
non-obvious dependencies. (The other stuff is all known to the PCI
core.) AFAICS there's otherwise little difference between an async
routine that does nothing and one that disables the device -- both
operations are very fast.
The ACPI relations are definitely something to worry about. It would
be a good idea, at an early stage, to add those dependencies
explicitly. I don't know enough about them to say more; perhaps Rafael
does.
As for other non-obvious dependencies... Who knows? Probably the only
way to find them is by experimentation. My guess is that they will
turn out to be connected mostly with "high-level" devices: system
devices, things on the motherboard -- generally speaking, stuff close
to the CPU. Relatively few will be associated with devices below the
level of a PCI device or equivalent.
Ideally we would figure out how to do the slow devices in parallel
without interference from fast devices having unknown dependencies.
Unfortunately this may not be possible.
> So the whole thing about "we can do PCI bridges asynchronously because
> they are obviously no-op" is kind of true - except for the "obviously"
> part. It's not obvious at all. It's rather subtle.
>
> As an example of this kind of subtlety - iirc PCIE bridges used to have
> suspend and resume bugs when we initially switched over to the "new world"
> suspend/resume exactly because they actually did things at "suspend" time
> (rather than suspend_late), and that broke devices behind them (this was
> not related to async, of course, but the point is that even when you look
> like a PCI bridge, you might be doing odd things).
>
> So just saying "let's do it asynchronously" is _not_ always guaranteed to
> be the right thing at all. It's _probably_ safe for at least regular PCI
> bridges. Cardbus bridges? Probably not, but since most modern laptops have
> just a single slot - and people who have multiple slots seldom use them
> all - most people will probably never see the problems that it _could_
> introduce.
>
> And PCIE bridges? Should be safe these days, but it wasn't quite as
> obvious, because a PCIE bridge actually has a driver unlike a regular
> plain PCI-PCI bridge.
>
> Subtle, subtle.
Indeed. Perhaps you were too hasty in suggesting that PCI bridges
should be async.
It would help a lot to see some device lists for typical machines. (If
there are such things.) Otherwise we are just blowing gas.
> > There remains a separate question: Should async devices also be forced
> > to wait for their children? I don't see why not. For PCI bridges it
> > won't make any significant difference. As long as the async code
> > doesn't have to do anything, who cares when it runs?
>
> That's why I just set the "async_suspend = 1" thing.
>
> But there might actually be reasons why we care. Like the fact that we
> actually throttle the amount of parallel work we do in async_schedule().
> So doing even a "no-op" asynchronously isn't actually a no-op: while it is
> pending (and those things can be pending for a long time, since they have
> to wait for those slow devices underneath them), it can cause _other_
> async work - that isn't necessarily a no-op at all - to be then done
> synchronously.
>
> Now, admittedly our async throttling limits are high enough that the above
> kind of detail will probably never ever really matter (default 256 worker
> threads etc). But it's an example of how practice is different from theory
> - in _theory_ it doesn't make any difference if you wait for something
> asynchronously, but in practice it could make a difference under some
> circumstances.
We certainly shouldn't be worried about side effects of async
throttling at this stage. KISS works both ways: Don't overdesign, and
don't worry about things that might crop up when you expand the design.
We have strayed off the point of your original objection: not providing
a way for devices to skip waiting for their children. This really is a
separate issue from deciding whether or not to go async. For example,
your proposed patch makes PCI bridges async but doesn't allow them to
avoid waiting for children. IMO that's a good thing.
The real issue is "blockage": synchronous devices preventing
possible concurrency among async devices. That's what you thought
making PCI bridges async would help.
In general, blockage arises in suspend when you have an async child
with a synchronous parent. The parent has to wait for the child, which
might take a long time, thereby delaying other unrelated devices.
(This explains why you wanted to make PCI bridges async -- they are the
parents of USB controllers.) For resume it's the opposite: an async
parent with synchronous children. Thus, while making PCI bridges async
might make suspend faster, it probably won't help much with resume
speed. You'd have to make the children of USB devices (SCSI hosts,
TTYs, and so on) async. Depending on the order of device registration,
of course.
Apart from all this, there's a glaring hole in the discussion so far.
You and Arjan may not have noticed it, but those of us still using
rotating media have to put up with disk resume times that are a factor
of 100 (!) larger than USB resume times. That's where the greatest
gains are to be found.
Alan Stern
On Tuesday 15 December 2009, Alan Stern wrote:
> On Tue, 15 Dec 2009, Linus Torvalds wrote:
>
> > It's a very subtle theory, and it's not necessarily always 100% true. For
> > example, a cardbus bridge is strictly speaking very much a PCI bridge, but
> > for cardbus bridges we _do_ have a suspend/resume function.
> >
> > And perhaps worse than that, cardbus bridges are one of the canonical
> > examples where two different PCI devices actually share registers. It's
> > quite common that some of the control registers are shared across the two
> > subfunctions of a two-slot cardbus controller (and we generally don't even
> > have full docs for them!)
>
> Okay. This obviously implies that if/when cardbus bridges are
> converted to async suspend/resume, the driver should make sure that the
> lower-numbered devices wait for their sibling higher-numbered devices
> to suspend (and vice versa for resume). Awkward though it may be.
>
> > > The same goes for devices that don't have suspend or resume methods.
> >
> > Yes and no.
> >
> > Again, the "async_suspend" flag is done at the generic device layer, but
> > 99% of all suspend/resume methods are _not_ done at that level: they are
> > bus-specific functions, where the bus has a generic suspend-resume
> > function that it exposes to the generic device layer, and that knows about
> > the bus-specific rules.
> >
> > So if you are a PCI device (to take just that example - but it's true of
> > just about all other buses too), and you don't have any suspend or resume
> > methods, it's actually impossible to see that fact from the generic device
> > layer.
>
> Sure. That's why the async_suspend flag is set at the bus/driver
> level.
>
> > And even when you know it's PCI, our rules are actually not simple at all.
> > Our rules for PCI devices (and this strictly speaking is true for bridges
> > too) are rather complex:
> >
> > - do we have _any_ legacy PM support (ie the "direct" driver
> > suspend/resume functions in the driver ops, rather than having a
> > "struct dev_pm_ops" pointer)? If so, call "->suspend()"
> >
> > - If not - do we have that "dev_pm_ops" thing? If so, call it.
> >
> > - If not - just disable the device entirely _UNLESS_ you're a PCI bridge.
> >
> > Notice? The way things are set up, if you have no suspend routine, you'll
> > not get suspended, but you will get disabled.
> >
> > So it's _not_ actually safe to asynchronously suspend a PCI device if that
> > device has no driver or no suspend routines - because even in the absence
> > of a driver and suspend routines, we'll still at least disable it. And if
> > there is some subtle dependency on that device that isn't obvious (say, it
> > might be used indirectly for some ACPI thing), then that async suspend is
> > the wrong thing to do.
> >
> > Subtle? Hell yes.
>
> I don't disagree. However the subtlety lies mainly in the matter of
> non-obvious dependencies. (The other stuff is all known to the PCI
> core.) AFAICS there's otherwise little difference between an async
> routine that does nothing and one that disables the device -- both
> operations are very fast.
>
> The ACPI relations are definitely something to worry about. It would
> be a good idea, at an early stage, to add those dependencies
> explicitly. I don't know enough about them to say more; perhaps Rafael
> does.
It boils down to the fact that for each PCI device known to the ACPI BIOS
there is a "shadow" ACPI device that generally has its own suspend/resume
callbacks and these "shadow" devices are members of the ACPI subtree
of the device tree (ie. they have parents and so on).
Now, when I worked on the first version of async suspend/resume, I noticed
that if those "shadow" ACPI devices did not wait for their PCI counterparts to
suspend, things broke badly. The reason probably wasn't related to what they
did in their suspend/resume callbacks, because they are usually empty, but it
was rather related to the dependencies between devices in the ACPI subtree
(so, generally speaking, it seems the entire ACPI subtree of the device tree
should be suspended after the entire PCI subtree).
That obviously requires more investigation, though.
> As for other non-obvious dependencies... Who knows? Probably the only
> way to find them is by experimentation. My guess is that they will
> turn out to be connected mostly with "high-level" devices: system
> devices, things on the motherboard -- generally speaking, stuff close
> to the CPU. Relatively few will be associated with devices below the
> level of a PCI device or equivalent.
>
> Ideally we would figure out how to do the slow devices in parallel
> without interference from fast devices having unknown dependencies.
> Unfortunately this may not be possible.
I really expect to see those "unknown dependencies" in the _noirq
suspend/resume phases and above. [The very fact they exist is worrisome,
because that's why we don't know why things work on one system and don't
work on another, although they appear to be very similar.]
> > So the whole thing about "we can do PCI bridges asynchronously because
> > they are obviously no-op" is kind of true - except for the "obviously"
> > part. It's not obvious at all. It's rather subtle.
> >
> > As an example of this kind of subtlety - iirc PCIE bridges used to have
> > suspend and resume bugs when we initially switched over to the "new world"
> > suspend/resume exactly because they actually did things at "suspend" time
> > (rather than suspend_late), and that broke devices behind them (this was
> > not related to async, of course, but the point is that even when you look
> > like a PCI bridge, you might be doing odd things).
Well, those "pcieport devices" still are the children of PCIe ports, although
physically they just correspond to different sets of registers within the
ports' config spaces (_that_ is overdesigned IMnsHO) and they are "suspended"
during the regular suspend of their PCIe port "parents".
> > So just saying "let's do it asynchronously" is _not_ always guaranteed to
> > be the right thing at all. It's _probably_ safe for at least regular PCI
> > bridges. Cardbus bridges? Probably not, but since most modern laptops have
> > just a single slot - and people who have multiple slots seldom use them
> > all - most people will probably never see the problems that it _could_
> > introduce.
> >
> > And PCIE bridges? Should be safe these days, but it wasn't quite as
> > obvious, because a PCIE bridge actually has a driver unlike a regular
> > plain PCI-PCI bridge.
> >
> > Subtle, subtle.
>
> Indeed. Perhaps you were too hasty in suggesting that PCI bridges
> should be async.
>
> It would help a lot to see some device lists for typical machines. (If
> there are such things.) Otherwise we are just blowing gas.
>
> > > There remains a separate question: Should async devices also be forced
> > > to wait for their children? I don't see why not. For PCI bridges it
> > > won't make any significant difference. As long as the async code
> > > doesn't have to do anything, who cares when it runs?
> >
> > That's why I just set the "async_suspend = 1" thing.
> >
> > But there might actually be reasons why we care. Like the fact that we
> > actually throttle the amount of parallel work we do in async_schedule().
> > So doing even a "no-op" asynchronously isn't actually a no-op: while it is
> > pending (and those things can be pending for a long time, since they have
> > to wait for those slow devices underneath them), it can cause _other_
> > async work - that isn't necessarily a no-op at all - to be then done
> > synchronously.
> >
> > Now, admittedly our async throttling limits are high enough that the above
> > kind of detail will probably never ever really matter (default 256 worker
> > threads etc). But it's an example of how practice is different from theory
> > - in _theory_ it doesn't make any difference if you wait for something
> > asynchronously, but in practice it could make a difference under some
> > circumstances.
>
> We certainly shouldn't be worried about side effects of async
> throttling at this stage. KISS works both ways: Don't overdesign, and
> don't worry about things that might crop up when you expand the design.
>
> We have strayed off the point of your original objection: not providing
> a way for devices to skip waiting for their children. This really is a
> separate issue from deciding whether or not to go async. For example,
> your proposed patch makes PCI bridges async but doesn't allow them to
> avoid waiting for children. IMO that's a good thing.
>
> The real issue is "blockage": synchronous devices preventing
> possible concurrency among async devices. That's what you thought
> making PCI bridges async would help.
>
> In general, blockage arises in suspend when you have an async child
> with a synchronous parent. The parent has to wait for the child, which
> might take a long time, thereby delaying other unrelated devices.
Exactly, but Linus' point seems to be that this is going to be rare and we
should be able to special-case all of the interesting cases.
> (This explains why you wanted to make PCI bridges async -- they are the
> parents of USB controllers.) For resume it's the opposite: an async
> parent with synchronous children.
Is that really going to happen in practice? I mean, what would be the point?
> Thus, while making PCI bridges async might make suspend faster, it probably
> won't help much with resume speed. You'd have to make the children of USB
> devices (SCSI hosts, TTYs, and so on) async. Depending on the order of
> device registration, of course.
>
> Apart from all this, there's a glaring hole in the discussion so far.
> You and Arjan may not have noticed it, but those of us still using
> rotating media have to put up with disk resume times that are a factor
> of 100 (!) larger than USB resume times. That's where the greatest
> gains are to be found.
I guess so.
Rafael
On Tue, 15 Dec 2009, Alan Stern wrote:
>
> Okay. This obviously implies that if/when cardbus bridges are
> converted to async suspend/resume, the driver should make sure that the
> lower-numbered devices wait for their sibling higher-numbered devices
> to suspend (and vice versa for resume). Awkward though it may be.
Yes. However, this is an excellent case where the whole "the device layer
does things asynchronously" is really rather awkward.
For cardbus, the nicest model really would be for the _driver_ to decide
to do some things asynchronously, after having done some other things
synchronously (to make sure of ordering).
That said, I think we are ok for at least Yenta resume, because the really
ordering-critical stuff we tend to do at "resume_early", which wouldn't be
asynchronous anyway.
But for an idea of what I'm talking about, look at the o2micro stuff in
drivers/pcmcia/o2micro.h, and notice how it does certain things only for
the "PCI_FUNC(..devfn) == 0" case.
So I suspect that we _can_ just do cardbus bridges asynchronously too, but
it really needs some care. I suspect to a first approximation we would
want to do the easy cases first, and ignore cardbus as being "known to
possibly have issues".
> > Subtle? Hell yes.
>
> I don't disagree. However the subtlety lies mainly in the matter of
> non-obvious dependencies.
Yes. But we don't necessarily even _know_ those dependencies.
The Cardbus ones I know about, but really only because I wrote much of
that code initially when converting cardbus to look like the PCI bridge it
largely is. But how many other cases like that do we have that we have
perhaps never even hit, because we've never done anything out of order?
> The ACPI relations are definitely something to worry about. It would
> be a good idea, at an early stage, to add those dependencies
> explicitly. I don't know enough about them to say more; perhaps Rafael
> does.
Quite frankly, I would really not want to do ACPI first at all.
We already handle batteries specially, but any random system device? Don't
touch it, is my suggestion. There are just too many ways it can fail. Don't
tell me that things "should work" - we know for a fact that BIOS tables
almost always have every single bug they could possibly have.
> > And PCIE bridges? Should be safe these days, but it wasn't quite as
> > obvious, because a PCIE bridge actually has a driver unlike a regular
> > plain PCI-PCI bridge.
> >
> > Subtle, subtle.
>
> Indeed. Perhaps you were too hasty in suggesting that PCI bridges
> should be async.
Oh, yes. I would suggest that first we do _nothing_ async except for
within just a single USB tree, and perhaps some individual drivers like
the PS/2 keyboard controller (and do even that perhaps only for the PC
version, which we know is on the southbridge and not anywhere else).
If that ends up meaning that we block due to PCI bridges, so be it. I
really would prefer baby steps over anything more complete.
Linus
On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > Ideally we would figure out how to do the slow devices in parallel
> > without interference from fast devices having unknown dependencies.
> > Unfortunately this may not be possible.
>
> I really expect to see those "unknown dependencies" in the _noirq
> suspend/resume phases and above. [The very fact they exist is worrisome,
> because that's why we don't know why things work on one system and don't
> work on another, although they appear to be very similar.]
This is a good reason for keeping the _noirq phases synchronous. AFAIK
they don't take long enough to be worth converting, so there's no loss.
> > The real issue is "blockage": synchronous devices preventing
> > possible concurrency among async devices. That's what you thought
> > making PCI bridges async would help.
> >
> > In general, blockage arises in suspend when you have an async child
> > with a synchronous parent. The parent has to wait for the child, which
> > might take a long time, thereby delaying other unrelated devices.
>
> Exactly, but Linus' point seems to be that this is going to be rare and we
> should be able to special-case all of the interesting cases.
Maybe that's true. Without seeing some examples of actual dpm_list
contents, we can't tell. Can you post the interesting parts of the
lists from some of your test machines? Maybe with a USB device or two
plugged in? (The device names together with the names of their parents
should be enough.)
> > (This explains why you wanted to make PCI bridges async -- they are the
> > parents of USB controllers.) For resume it's the opposite: an async
> > parent with synchronous children.
>
> Is that really going to happen in practice? I mean, what would be the point?
I don't know. It's all speculation until we see some actual lists.
Alan Stern
On Tue, 15 Dec 2009, Linus Torvalds wrote:
> On Tue, 15 Dec 2009, Alan Stern wrote:
> >
> > Okay. This obviously implies that if/when cardbus bridges are
> > converted to async suspend/resume, the driver should make sure that the
> > lower-numbered devices wait for their sibling higher-numbered devices
> > to suspend (and vice versa for resume). Awkward though it may be.
>
> Yes. However, this is an excellent case where the whole "the device layer
> does things asynchronously" is really rather awkward.
>
> For cardbus, the nicest model really would be for the _driver_ to decide
> to do some things asynchronously, after having done some other things
> synchronously (to make sure of ordering).
Have you considered the possibility of augmenting the design to allow
this? Perhaps reserve a particular return code from the suspend
routine to mean that asynchronous operations are still underway, so the
PM core shouldn't automatically do the complete_all().
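Roughly what I have in mind, as a toy model rather than kernel code (Python only to keep it short; SUSPEND_ASYNC_PENDING, core_suspend() and the callback names are made up for illustration and do not exist in the PM core):

```python
import threading

# Hypothetical return code: "my asynchronous work is still under way".
SUSPEND_ASYNC_PENDING = 1


class Device:
    def __init__(self, name, suspend_cb):
        self.name = name
        self.suspend_cb = suspend_cb
        self.completion = threading.Event()  # stands in for power.completion
        self.finish_later = None


def core_suspend(dev):
    """Toy PM core: run the callback, then complete unless told not to."""
    ret = dev.suspend_cb(dev)
    if ret == SUSPEND_ASYNC_PENDING:
        return 0              # the driver signals completion itself, later
    dev.completion.set()      # analogue of complete_all()
    return ret


def plain_suspend(dev):
    return 0                  # everything was done synchronously


def deferred_suspend(dev):
    # The ordering-critical part would run here, synchronously.
    # The driver keeps a handle it calls once its remaining work is done.
    dev.finish_later = dev.completion.set
    return SUSPEND_ASYNC_PENDING
```

With plain_suspend() the core signals completion as it does today; with deferred_suspend() anything waiting on the device stays blocked until the driver calls finish_later().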
> So I suspect that we _can_ just do cardbus bridges asynchronously too, but
> it really needs some care. I suspect to a first approximation we would
> want to do the easy cases first, and ignore cardbus as being "known to
> possibly have issues".
Certainly. Start with the easy things and leave harder devices like
cardbus bridges for later.
> > > Subtle? Hell yes.
> >
> > I don't disagree. However the subtlety lies mainly in the matter of
> > non-obvious dependencies.
>
> Yes. But we don't necessarily even _know_ those dependencies.
Yep. Both non-obvious and non-known.
> The Cardbus ones I know about, but really only because I wrote much of
> that code initially when converting cardbus to look like the PCI bridge it
> largely is. But how many other cases like that do we have that we have
> perhaps never even hit, because we've never done anything out of order?
>
> > The ACPI relations are definitely something to worry about. It would
> > be a good idea, at an early stage, to add those dependencies
> > explicitly. I don't know enough about them to say more; perhaps Rafael
> > does.
>
> Quite frankly, I would really not want to do ACPI first at all.
Dear me, no! I wasn't saying ACPI should be made async; I was saying
that ACPI "shadow" devices should be made to wait for their async PCI
counterparts.
> > Indeed. Perhaps you were too hasty in suggesting that PCI bridges
> > should be async.
>
> Oh, yes. I would suggest that first we do _nothing_ async except for
> within just a single USB tree, and perhaps some individual drivers like
> the PS/2 keyboard controller (and do even that perhaps only for the PC
> version, which we know is on the southbridge and not anywhere else).
>
> If that ends up meaning that we block due to PCI bridges, so be it. I
> really would prefer baby steps over anything more complete.
Agreed. I'm not in any hurry.
Alan Stern
On Tuesday 15 December 2009, Linus Torvalds wrote:
>
> On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > Give a real example that matters.
> >
> > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > like this:
>
> No.
>
> I mean something real - something like
>
> - if you run on a non-PC with two USB buses behind non-PCI controllers.
>
> - device xyz.
>
> > If this applies to _resume_ only, then I agree, but Arjan's data clearly
> > show that serio devices take much more time to suspend than USB.
>
> I mean in general - something where you actually have hard data that some
> device really needs anything more than my one-liner, and really _needs_
> some complex infrastructure.
>
> Not "let's imagine a case like xyz".
As I said I would, I made some measurements.
I measured the total time of suspending and resuming devices as shown by the
code added by this patch:
http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
different and the HP was running 64-bit kernel and user space).
I took four cases into consideration:
(1) synchronous suspend and resume (/sys/power/pm_async = 0)
(2) asynchronous suspend and resume as introduced by the async branch at:
http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
(3) asynchronous suspend and resume like in (2), but with your one-liner setting
the power.async_suspend flag for PCI bridges on top
(4) asynchronous suspend and resume like in (2), but with an extra patch that
is appended on top
For those tests I set power.async_suspend for all USB devices, all serio input
devices, the ACPI battery and the USB PCI controllers (to see the impact of the
one-liner, if any).
I carried out 5 consecutive suspend-resume cycles (started from under X) on
each box in each case, and the raw data are here (all times in milliseconds):
http://www.sisk.pl/kernel/data/async-suspend.pdf
The summarized data are below (the "big" numbers are averages and the +/-
numbers are standard deviations, all in milliseconds):
                          HP nx6325        MSI Wind U100

sync suspend              1482 (+/- 40)    1180 (+/- 24)
sync resume               2955 (+/- 2)     3597 (+/- 25)

async suspend             1553 (+/- 49)    1177 (+/- 32)
async resume              2692 (+/- 326)   3556 (+/- 33)

async+one-liner suspend   1600 (+/- 39)    1212 (+/- 41)
async+one-liner resume    2692 (+/- 324)   3579 (+/- 24)

async+extra suspend       1496 (+/- 37)    1217 (+/- 38)
async+extra resume        1859 (+/- 114)   1923 (+/- 35)
So, in my opinion, with the above set of "async" devices, it doesn't
make sense to do async suspend at all, because the sync suspend is actually
the fastest on both machines.
However, it surely is worth doing async _resume_ with the extra patch appended
below, because that allows us to save 1 second or more on both machines with
respect to the sync case. The other variants of async resume also produce some
time savings, but (on the nx6325) at the expense of huge fluctuations from one
cycle to another (so they can actually be slower than the sync resume). Only
the async resume with the extra patch is consistently better than the sync one.
The impact of the one-liner is either negligible or slightly negative.
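For reference, the differences with respect to the sync case can be read off the averages with a short script (Python used purely for illustration; the numbers are copied from the summary table, nothing new is measured, and saving_vs_sync() is just a helper name chosen here):

```python
# Summary averages in milliseconds, copied from the table above.
RESULTS = {
    "HP nx6325": {
        "sync": {"suspend": 1482, "resume": 2955},
        "async": {"suspend": 1553, "resume": 2692},
        "async+one-liner": {"suspend": 1600, "resume": 2692},
        "async+extra": {"suspend": 1496, "resume": 1859},
    },
    "MSI Wind U100": {
        "sync": {"suspend": 1180, "resume": 3597},
        "async": {"suspend": 1177, "resume": 3556},
        "async+one-liner": {"suspend": 1212, "resume": 3579},
        "async+extra": {"suspend": 1217, "resume": 1923},
    },
}


def saving_vs_sync(box, variant, phase):
    """Milliseconds saved relative to the sync case (negative means slower)."""
    return RESULTS[box]["sync"][phase] - RESULTS[box][variant][phase]


if __name__ == "__main__":
    for box in RESULTS:
        for variant in ("async", "async+one-liner", "async+extra"):
            for phase in ("suspend", "resume"):
                delta = saving_vs_sync(box, variant, phase)
                print(f"{box:14} {variant:16} {phase:8} {delta:+5d} ms")
```

For async+extra resume this gives 1096 ms saved on the nx6325 and 1674 ms on the Wind. The suspend differences range from +3 ms (plain async on the Wind, well inside the standard deviation) to -118 ms (async+one-liner on the nx6325).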
Now, what does the extra patch do? Exactly the thing I was talking about, it
starts all async suspends and resumes upfront.
So, it looks like we both were wrong. I was wrong, because I thought the
extra patch would help suspend, but not resume, while in fact it appears to
help resume big time. You were wrong, because you thought that the one-liner
would have positive impact, while in fact it doesn't.
Concluding, at this point I'd opt for implementing asynchronous resume alone,
_without_ asynchronous suspend, which is more complicated and doesn't really
give us any time savings. At the same time, I'd implement the asynchronous
resume in such a way that all of the async resume threads would be started
before the synchronous resume thread, because that would give us the best
results.
Rafael
---
drivers/base/power/main.c | 48 +++++++++++++++++++++++++++++-----------------
1 file changed, 31 insertions(+), 17 deletions(-)
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -523,14 +523,9 @@ static void async_resume(void *data, asy
 static int device_resume(struct device *dev)
 {
-	INIT_COMPLETION(dev->power.completion);
-
-	if (pm_async_enabled && dev->power.async_suspend
-	    && !pm_trace_is_enabled()) {
-		get_device(dev);
-		async_schedule(async_resume, dev);
+	if (dev->power.async_suspend && pm_async_enabled
+	    && !pm_trace_is_enabled())
 		return 0;
-	}
 	return __device_resume(dev, pm_transition, false);
 }
@@ -545,14 +540,28 @@ static int device_resume(struct device *
 static void dpm_resume(pm_message_t state)
 {
 	struct list_head list;
+	struct device *dev;
 	ktime_t starttime = ktime_get();
 	INIT_LIST_HEAD(&list);
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
-	while (!list_empty(&dpm_list)) {
-		struct device *dev = to_device(dpm_list.next);
+	list_for_each_entry(dev, &dpm_list, power.entry) {
+		if (dev->power.status < DPM_OFF)
+			continue;
+
+		INIT_COMPLETION(dev->power.completion);
+
+		if (dev->power.async_suspend && pm_async_enabled
+		    && !pm_trace_is_enabled()) {
+			get_device(dev);
+			async_schedule(async_resume, dev);
+		}
+	}
+
+	while (!list_empty(&dpm_list)) {
+		dev = to_device(dpm_list.next);
 		get_device(dev);
 		if (dev->power.status >= DPM_OFF) {
 			int error;
@@ -809,13 +818,8 @@ static void async_suspend(void *data, as
 static int device_suspend(struct device *dev)
 {
-	INIT_COMPLETION(dev->power.completion);
-
-	if (pm_async_enabled && dev->power.async_suspend) {
-		get_device(dev);
-		async_schedule(async_suspend, dev);
+	if (pm_async_enabled && dev->power.async_suspend)
 		return 0;
-	}
 	return __device_suspend(dev, pm_transition, false);
 }
@@ -827,6 +831,7 @@ static int device_suspend(struct device
 static int dpm_suspend(pm_message_t state)
 {
 	struct list_head list;
+	struct device *dev;
 	ktime_t starttime = ktime_get();
 	int error = 0;
@@ -834,9 +839,18 @@ static int dpm_suspend(pm_message_t stat
 	mutex_lock(&dpm_list_mtx);
 	pm_transition = state;
 	async_error = 0;
-	while (!list_empty(&dpm_list)) {
-		struct device *dev = to_device(dpm_list.prev);
+	list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
+		INIT_COMPLETION(dev->power.completion);
+
+		if (pm_async_enabled && dev->power.async_suspend) {
+			get_device(dev);
+			async_schedule(async_suspend, dev);
+		}
+	}
+
+	while (!list_empty(&dpm_list)) {
+		dev = to_device(dpm_list.prev);
 		get_device(dev);
 		mutex_unlock(&dpm_list_mtx);
On Wed, Dec 16, 2009 at 03:11:05AM +0100, Rafael J. Wysocki wrote:
> On Tuesday 15 December 2009, Linus Torvalds wrote:
> >
> > On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > > >
> > > > Give a real example that matters.
> > >
> > > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > > like this:
> >
> > No.
> >
> > I mean something real - something like
> >
> > - if you run on a non-PC with two USB buses behind non-PCI controllers.
> >
> > - device xyz.
> >
> > > If this applies to _resume_ only, then I agree, but Arjan's data clearly
> > > show that serio devices take much more time to suspend than USB.
> >
> > I mean in general - something where you actually have hard data that some
> > device really needs anything more than my one-liner, and really _needs_
> > some complex infrastructure.
> >
> > Not "let's imagine a case like xyz".
>
> As I said I would, I made some measurements.
>
> I measured the total time of suspending and resuming devices as shown by the
> code added by this patch:
> http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> different and the HP was running 64-bit kernel and user space).
>
> I took four cases into consideration:
> (1) synchronous suspend and resume (/sys/power/pm_async = 0)
> (2) asynchronous suspend and resume as introduced by the async branch at:
> http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
> (3) asynchronous suspend and resume like in (2), but with your one-liner setting
> the power.async_suspend flag for PCI bridges on top
> (4) asynchronous suspend and resume like in (2), but with an extra patch that
> is appended on top
>
> For those tests I set power.async_suspend for all USB devices, all serio input
> devices, the ACPI battery and the USB PCI controllers (to see the impact of the
> one-liner, if any).
>
> I carried out 5 consecutive suspend-resume cycles (started from under X) on
> each box in each case, and the raw data are here (all times in milliseconds):
> http://www.sisk.pl/kernel/data/async-suspend.pdf
>
> The summarized data are below (the "big" numbers are averages and the +/-
> numbers are standard deviations, all in milliseconds):
>
>                           HP nx6325        MSI Wind U100
>
> sync suspend              1482 (+/- 40)    1180 (+/- 24)
> sync resume               2955 (+/- 2)     3597 (+/- 25)
>
> async suspend             1553 (+/- 49)    1177 (+/- 32)
> async resume              2692 (+/- 326)   3556 (+/- 33)
>
> async+one-liner suspend   1600 (+/- 39)    1212 (+/- 41)
> async+one-liner resume    2692 (+/- 324)   3579 (+/- 24)
>
> async+extra suspend       1496 (+/- 37)    1217 (+/- 38)
> async+extra resume        1859 (+/- 114)   1923 (+/- 35)
>
> So, in my opinion, with the above set of "async" devices, it doesn't
> make sense to do async suspend at all, because the sync suspend is actually
> the fastest on both machines.
I think the async suspend is not asynchronous enough then - what kind of
time do you get if you simply comment out the call to psmouse_reset() in
drivers/input/mouse/psmouse-base.c:psmouse_cleanup()? (For testing
purposes only, I don't think we want to do that by default.)
--
Dmitry
On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> I measured the total time of suspending and resuming devices as shown by the
> code added by this patch:
> http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> different and the HP was running 64-bit kernel and user space).
> I carried out 5 consecutive suspend-resume cycles (started from under X) on
> each box in each case, and the raw data are here (all times in milliseconds):
> http://www.sisk.pl/kernel/data/async-suspend.pdf
I'd like to see much more detailed data. For each device, let's get
the device name, the parent's name, and the start time, end time, and
duration for suspend or resume. The start time should be measured when
you have finished waiting for the children. The end time should be
measured just before the complete_all().
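To make the measurement points concrete, here is a toy model of the suspend case (Python threads standing in for the async workers; the Device class, suspend_all() and the sleep-based "work" are illustrative assumptions, not the kernel code):

```python
import threading
import time


class Device:
    """Toy device; 'done' stands in for the per-device completion."""

    def __init__(self, name, work_s, parent=None):
        self.name = name
        self.work_s = work_s           # pretend duration of the suspend callback
        self.parent = parent
        self.children = []
        self.done = threading.Event()
        self.start = self.end = None
        if parent is not None:
            parent.children.append(self)

    def suspend(self):
        for child in self.children:
            child.done.wait()          # suspend: wait for all children first
        self.start = time.monotonic()  # start time: finished waiting for children
        time.sleep(self.work_s)        # the callback itself
        self.end = time.monotonic()    # end time: just before signalling
        self.done.set()                # analogue of complete_all()


def suspend_all(devices):
    """Suspend every device in its own thread and return one row per device."""
    threads = [threading.Thread(target=dev.suspend) for dev in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [(dev.name, dev.parent.name if dev.parent else "-",
             dev.start, dev.end, dev.end - dev.start) for dev in devices]
```

Each row is (device, parent, start, end, duration). Because the start time is taken after the wait, the duration excludes time spent blocked on children, which is what makes the slow devices stand out.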
Alan Stern
On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
>
> The summarized data are below (the "big" numbers are averages and the +/-
> numbers are standard deviations, all in milliseconds):
>
>                           HP nx6325        MSI Wind U100
>
> sync suspend              1482 (+/- 40)    1180 (+/- 24)
> sync resume               2955 (+/- 2)     3597 (+/- 25)
>
> async suspend             1553 (+/- 49)    1177 (+/- 32)
> async resume              2692 (+/- 326)   3556 (+/- 33)
>
> async+one-liner suspend   1600 (+/- 39)    1212 (+/- 41)
> async+one-liner resume    2692 (+/- 324)   3579 (+/- 24)
>
> async+extra suspend       1496 (+/- 37)    1217 (+/- 38)
> async+extra resume        1859 (+/- 114)   1923 (+/- 35)
>
> So, in my opinion, with the above set of "async" devices, it doesn't
> make sense to do async suspend at all, because the sync suspend is actually
> the fastest on both machines.
Hmm. I certainly agree - your numbers do not seem to support any async at
all.
However, I do note that the "extra patch" makes a big difference at
resume time. That implies that the resume serializes on some slow device
that wasn't marked async - and starting the async ones early avoids that.
But without the per-device timings, it's hard to even guess what device
that was.
But even that doesn't really help the suspend cases, only resume.
Do you have any sample timing output with devices listed?
Linus
On Wednesday 16 December 2009, Alan Stern wrote:
> On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
>
> > I measured the total time of suspending and resuming devices as shown by the
> > code added by this patch:
> > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> > on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> > different and the HP was running 64-bit kernel and user space).
>
> > I carried out 5 consecutive suspend-resume cycles (started from under X) on
> > each box in each case, and the raw data are here (all times in milliseconds):
> > http://www.sisk.pl/kernel/data/async-suspend.pdf
>
> I'd like to see much more detailed data. For each device, let's get
> the device name, the parent's name, and the start time, end time, and
> duration for suspend or resume. The start time should be measured when
> you have finished waiting for the children. The end time should be
> measured just before the complete_all().
I'm going to use Arjan's patch + script to chart the suspend/resume times
for individual devices. I can send you the raw data, though.
Rafael
On Wednesday 16 December 2009, Linus Torvalds wrote:
>
> On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> >
> > The summarized data are below (the "big" numbers are averages and the +/-
> > numbers are standard deviations, all in milliseconds):
> >
> >                           HP nx6325        MSI Wind U100
> >
> > sync suspend              1482 (+/- 40)    1180 (+/- 24)
> > sync resume               2955 (+/- 2)     3597 (+/- 25)
> >
> > async suspend             1553 (+/- 49)    1177 (+/- 32)
> > async resume              2692 (+/- 326)   3556 (+/- 33)
> >
> > async+one-liner suspend   1600 (+/- 39)    1212 (+/- 41)
> > async+one-liner resume    2692 (+/- 324)   3579 (+/- 24)
> >
> > async+extra suspend       1496 (+/- 37)    1217 (+/- 38)
> > async+extra resume        1859 (+/- 114)   1923 (+/- 35)
> >
> > So, in my opinion, with the above set of "async" devices, it doesn't
> > make sense to do async suspend at all, because the sync suspend is actually
> > the fastest on both machines.
>
> Hmm. I certainly agree - your numbers do not seem to support any async at
> all.
>
> However, I do note that the "extra patch" makes a big difference at
> resume time. That implies that the resume serializes on some slow device
> that wasn't marked async - and starting the async ones early avoids that.
>
> But without the per-device timings, it's hard to even guess what device
> that was.
>
> But even that doesn't really help the suspend cases, only resume.
>
> Do you have any sample timing output with devices listed?
I'm going to generate one shortly.
Rafael
On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Do you have any sample timing output with devices listed?
>
> I'm going to generate one shortly.
From my bootup timings, I have this memory of SATA link bringup being
noticeable. I wonder if that is the case on resume too...
Linus
On Wednesday 16 December 2009, Linus Torvalds wrote:
>
> On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > Do you have any sample timing output with devices listed?
> >
> > I'm going to generate one shortly.
I've just put the first set of data, for the HP nx6325 at:
http://www.sisk.pl/kernel/data/nx6325/
The *-dmesg.log files contain full dmesg outputs starting from a cold boot and
including one suspend-resume cycle in each case, with initcall_debug enabled.
The *-suspend.log files are excerpts from the *-dmesg.log files containing
the suspend messages only, and analogously for *-resume.log.
The *-times.txt files contain the suspend/resume time of every device, sorted
in decreasing order.
> From my bootup timings, I have this memory of SATA link bringup being
> noticeable. I wonder if that is the case on resume too...
There's no SATA in the nx6325, only IDE, so we'd need to wait for the Wind data
(in the works).
The slowest suspending device in the nx6325 is the audio chip (surprise,
surprise), it takes ~220 ms alone. Then - serio, but since i8042 was not
async, the async suspend of serio didn't really help (another ~140 ms).
Then network, FireWire, MMC, USB, SD host (~15 ms each). [I think we can
help suspend a bit by making i8042 async, although I'm not sure that's going
to be safe.]
The slowest resuming are USB (by far) and then CardBus, audio, USB controllers,
FireWire, network and IDE (but that only takes about 7 ms).
But the main problem with async resume is that the USB devices are at the
beginning of dpm_list, so the resume of them is not even started until _all_ of
the slow devices behind them are woken up. That's why the extra patch helps so
much IMO.
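A toy model of that effect (Python for brevity; the device names and durations below are invented for illustration, parent/child waits are ignored and the number of async workers is assumed unlimited):

```python
def total_time(devices, start_async_upfront):
    """Return the elapsed ms until every device has finished.

    devices: list of (name, duration_ms, is_async) in the order in which
    the main loop walks them.
    """
    main_clock = 0   # time elapsed on the main (synchronous) thread
    finish = 0       # latest completion time seen so far
    for _name, duration, is_async in devices:
        if is_async:
            # Either scheduled when the main loop reaches the device,
            # or already started at time 0 by an upfront pass.
            start = 0 if start_async_upfront else main_clock
            finish = max(finish, start + duration)
        else:
            main_clock += duration
            finish = max(finish, main_clock)
    return finish


# Hypothetical list: two slow sync devices walked before one async device.
DEVICES = [("disk", 1000, False), ("audio", 200, False), ("usb", 800, True)]

if __name__ == "__main__":
    print("async scheduled on the fly:", total_time(DEVICES, False), "ms")
    print("async started upfront:     ", total_time(DEVICES, True), "ms")
```

With these made-up numbers, scheduling on the fly takes 2000 ms, the same as a fully synchronous walk, while starting the async device upfront brings the total down to 1200 ms.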
Rafael
Btw, what are the timings if you just force everything async? I think that
worked on your laptops, no?
It would be interesting to know - if only to see what the asymptotic upper
bound for all of this is.
Linus
On Wednesday 16 December 2009, Linus Torvalds wrote:
>
> Btw, what are the timings if you just force everything async? I think that
> worked on your laptops, no?
No, it didn't. I could make all PCI async, provided that the ACPI subtree was
resumed before any PCI devices. [Theoretically I can make that happen by
moving ACPI resume to the _noirq phase (just for testing of course). So I can
try to make PCI async in addition to serio and USB, plus i8042 perhaps, which
should be sufficient for the nx6325 I think.]
Making everything async always hung the boxes on resume.
Rafael
On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> I've just put the first set of data, for the HP nx6325 at:
> http://www.sisk.pl/kernel/data/nx6325/
>
> The *-dmesg.log files contain full dmesg outputs starting from a cold boot and
> including one suspend-resume cycle in each case, with initcall_debug enabled.
>
> The *-suspend.log files are excerpts from the *-dmesg.log files containing
> the suspend messages only, and analogously for *-resume.log.
I've just started looking at the sync-suspend.log file. What are all
the '+' characters and " @ 3368" strings after the device names?
You didn't print out the parent name for each device, so the tree
structure has been lost.
Why do those "sd 0:0:0:0 [sda]" messages appear in between two
callbacks? The cache-synchronization and the spin-down commands are
not executed asynchronously.
Alan Stern
On Thursday 17 December 2009, Alan Stern wrote:
> On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
>
> > I've just put the first set of data, for the HP nx6325 at:
> > http://www.sisk.pl/kernel/data/nx6325/
> >
> > The *-dmesg.log files contain full dmesg outputs starting from a cold boot and
> > including one suspend-resume cycle in each case, with initcall_debug enabled.
> >
> > The *-suspend.log files are excerpts from the *-dmesg.log files containing
> > the suspend messages only, and analogously for *-resume.log.
>
> I've just started looking at the sync-suspend.log file. What are all
> the '+' characters and " @ 3368" strings after the device names?
I think the + is necessary for Arjan's graph-generating script and the
@ number is the PID of current (i.e. the calling task).
> You didn't print out the parent name for each device, so the tree
> structure has been lost.
That's because Arjan's original patch doesn't do that, I'm adding it
right now.
> Why do those "sd 0:0:0:0 [sda]" messages appear in between two
> callbacks? The cache-synchronization and the spin-down commands are
> not executed asynchronously.
Because the data are incomplete. :-(
I've just realized that Arjan's patch only covers bus types and classes
that have been converted to dev_pm_ops already, so I'm extending it to the
"legacy" ones at the moment.
Rafael
On Thursday 17 December 2009, Rafael J. Wysocki wrote:
> On Thursday 17 December 2009, Alan Stern wrote:
> > On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > I've just put the first set of data, for the HP nx6325 at:
> > > http://www.sisk.pl/kernel/data/nx6325/
> > >
> > > The *-dmesg.log files contain full dmesg outputs starting from a cold boot and
> > > including one suspend-resume cycle in each case, with initcall_debug enabled.
> > >
> > > The *-suspend.log files are excerpts from the *-dmesg.log files containing
> > > the suspend messages only, and analogously for *-resume.log.
> >
> > I've just started looking at the sync-suspend.log file. What are all
> > the '+' characters and " @ 3368" strings after the device names?
>
> I think the + is necessary for Arjan's graph-generating script and the
> @ number is the PID of current (i.e. the calling task).
>
> > You didn't print out the parent name for each device, so the tree
> > structure has been lost.
>
> That's because Arjan's original patch doesn't do that, I'm adding it
> right now.
>
> > Why do those "sd 0:0:0:0 [sda]" messages appear in between two
> > callbacks? The cache-synchronization and the spin-down commands are
> > not executed asynchronously.
>
> Because the data are incomplete. :-(
>
> I've just realized that Arjan's patch only covers bus types and classes
> that have been converted to dev_pm_ops already, so I'm extending it to the
> "legacy" ones at the moment.
New data files have been uploaded to:
http://www.sisk.pl/kernel/data/nx6325/
http://www.sisk.pl/kernel/data/wind/
Please let me know if you need more information.
Rafael
On Wednesday 16 December 2009, Rafael J. Wysocki wrote:
> On Wednesday 16 December 2009, Linus Torvalds wrote:
> >
> > On Wed, 16 Dec 2009, Rafael J. Wysocki wrote:
> > > >
> > > > Do you have any sample timing output with devices listed?
> > >
> > > I'm going to generate one shortly.
>
> I've just put the first set of data, for the HP nx6325 at:
> http://www.sisk.pl/kernel/data/nx6325/
As I said in a message to Alan, the data were incomplete, because Arjan's
original patch only covers bus types and device classes converted to
dev_pm_ops, which I only noticed earlier today. So I added the appended patch
on top of the async tree and I applied a one-liner adding the name of the
parent to each device line during (regular) suspend and resume.
The new data sets are at:
http://www.sisk.pl/kernel/data/nx6325/
http://www.sisk.pl/kernel/data/wind/
and the format is the same as described below.
> The *-dmesg.log files contain full dmesg outputs starting from a cold boot and
> including one suspend-resume cycle in each case, with initcall_debug enabled.
>
> The *-suspend.log files are excerpts from the *-dmesg.log files containing
> the suspend messages only, and analogously for *-resume.log.
>
> The *-times.txt files contain suspend/resume time for every device sorted
> in the decreasing order.
>
> > From my bootup timings, I have this memory of SATA link bringup being
> > noticeable. I wonder if that is the case on resume too...
That actually is correct. On the nx6325 suspend is totally dominated by disk
spindown, almost everything else is negligible compared to it (well, except for
the audio), so we can't go down below 1 s during suspend on this box.
On the Wind, disk spindown time is comparable with serio suspend time,
so at least in principle we should be able to get .5 s suspend on this box -
if the disk spindown is async.
In turn, the resume on the Wind is dominated by disk spinup, so we can't
go below 1.5 s on this box during resume (notice that the "async+extra"
approach brings us close to this limit, although we could save .5 s more in
principle by making more devices async).
Resume on the nx6325 is a different story, though, as it is dominated by USB
and PCI devices, so marking those as async would probably bring us close to
the limit.
[Surprisingly enough to me, some ACPI devices appear to take quite noticeable
amounts of time to resume on both boxes.]
Tomorrow I'll try to mark as many devices as reasonably possible as async
and see how the total suspend-resume times change.
Rafael
---
drivers/base/power/main.c | 97 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 77 insertions(+), 20 deletions(-)
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -165,6 +165,32 @@ void device_pm_move_last(struct device *
list_move_tail(&dev->power.entry, &dpm_list);
}
+static ktime_t initcall_debug_start(struct device *dev)
+{
+ ktime_t calltime = ktime_set(0, 0);
+
+ if (initcall_debug) {
+ pr_info("calling %s+ @ %i\n",
+ dev_name(dev), task_pid_nr(current));
+ calltime = ktime_get();
+ }
+
+ return calltime;
+}
+
+static void initcall_debug_report(struct device *dev, ktime_t calltime,
+ int error)
+{
+ ktime_t delta, rettime;
+
+ if (initcall_debug) {
+ rettime = ktime_get();
+ delta = ktime_sub(rettime, calltime);
+ pr_info("call %s+ returned %d after %Ld usecs\n", dev_name(dev),
+ error, (unsigned long long)ktime_to_ns(delta) >> 10);
+ }
+}
+
/**
* dpm_wait - Wait for a PM operation to complete.
* @dev: Device to wait for.
@@ -201,13 +227,9 @@ static int pm_op(struct device *dev,
pm_message_t state)
{
int error = 0;
- ktime_t calltime, delta, rettime;
+ ktime_t calltime;
- if (initcall_debug) {
- pr_info("calling %s+ @ %i\n",
- dev_name(dev), task_pid_nr(current));
- calltime = ktime_get();
- }
+ calltime = initcall_debug_start(dev);
switch (state.event) {
#ifdef CONFIG_SUSPEND
@@ -256,12 +278,7 @@ static int pm_op(struct device *dev,
error = -EINVAL;
}
- if (initcall_debug) {
- rettime = ktime_get();
- delta = ktime_sub(rettime, calltime);
- pr_info("call %s+ returned %d after %Ld usecs\n", dev_name(dev),
- error, (unsigned long long)ktime_to_ns(delta) >> 10);
- }
+ initcall_debug_report(dev, calltime, error);
return error;
}
@@ -338,8 +355,9 @@ static int pm_noirq_op(struct device *de
if (initcall_debug) {
rettime = ktime_get();
delta = ktime_sub(rettime, calltime);
- printk("initcall %s_i+ returned %d after %Ld usecs\n", dev_name(dev),
- error, (unsigned long long)ktime_to_ns(delta) >> 10);
+ printk("initcall %s_i+ returned %d after %Ld usecs\n",
+ dev_name(dev), error,
+ (unsigned long long)ktime_to_ns(delta) >> 10);
}
return error;
@@ -456,6 +474,26 @@ void dpm_resume_noirq(pm_message_t state
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
+ * legacy_resume - Execute a legacy (bus or class) resume callback for device.
+ * @dev: Device to resume.
+ * @cb: Resume callback to execute.
+ */
+static int legacy_resume(struct device *dev, int (*cb)(struct device *dev))
+{
+ int error;
+ ktime_t calltime;
+
+ calltime = initcall_debug_start(dev);
+
+ error = cb(dev);
+ suspend_report_result(cb, error);
+
+ initcall_debug_report(dev, calltime, error);
+
+ return error;
+}
+
+/**
* __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
@@ -477,7 +515,7 @@ static int __device_resume(struct device
error = pm_op(dev, dev->bus->pm, state);
} else if (dev->bus->resume) {
pm_dev_dbg(dev, state, "legacy ");
- error = dev->bus->resume(dev);
+ error = legacy_resume(dev, dev->bus->resume);
}
if (error)
goto End;
@@ -498,7 +536,7 @@ static int __device_resume(struct device
error = pm_op(dev, dev->class->pm, state);
} else if (dev->class->resume) {
pm_dev_dbg(dev, state, "legacy class ");
- error = dev->class->resume(dev);
+ error = legacy_resume(dev, dev->class->resume);
}
}
End:
@@ -734,6 +772,27 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
static int async_error;
/**
+ * legacy_suspend - Execute a legacy (bus or class) suspend callback for device.
+ * @dev: Device to suspend.
+ * @cb: Suspend callback to execute.
+ */
+static int legacy_suspend(struct device *dev, pm_message_t state,
+ int (*cb)(struct device *dev, pm_message_t state))
+{
+ int error;
+ ktime_t calltime;
+
+ calltime = initcall_debug_start(dev);
+
+ error = cb(dev, state);
+ suspend_report_result(cb, error);
+
+ initcall_debug_report(dev, calltime, error);
+
+ return error;
+}
+
+/**
* device_suspend - Execute "suspend" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
@@ -755,8 +814,7 @@ static int __device_suspend(struct devic
error = pm_op(dev, dev->class->pm, state);
} else if (dev->class->suspend) {
pm_dev_dbg(dev, state, "legacy class ");
- error = dev->class->suspend(dev, state);
- suspend_report_result(dev->class->suspend, error);
+ error = legacy_suspend(dev, state, dev->class->suspend);
}
if (error)
goto End;
@@ -777,8 +835,7 @@ static int __device_suspend(struct devic
error = pm_op(dev, dev->bus->pm, state);
} else if (dev->bus->suspend) {
pm_dev_dbg(dev, state, "legacy ");
- error = dev->bus->suspend(dev, state);
- suspend_report_result(dev->bus->suspend, error);
+ error = legacy_suspend(dev, state, dev->bus->suspend);
}
}
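A note on the ">> 10" in initcall_debug_report() above: shifting the
nanosecond count right by 10 divides by 1024 rather than 1000, so the printed
"usecs" come out roughly 2.3% low. That is harmless for comparing devices
against each other, but worth keeping in mind when reading absolute values.
A minimal userspace sketch of the difference (both helper names are
hypothetical and not part of the patch):

```c
#include <stdint.h>

/* Hypothetical helper: the shift-based approximation, as in the patch. */
static uint64_t ns_to_usecs_shift(uint64_t ns)
{
	return ns >> 10;	/* divides by 1024, not 1000 */
}

/* Hypothetical helper: exact conversion, for comparison only. */
static uint64_t ns_to_usecs_exact(uint64_t ns)
{
	return ns / 1000;
}
```

For a callback that takes exactly one second, the shift-based version reports
976562 where the exact conversion gives 1000000.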
On Thu, 17 Dec 2009, Rafael J. Wysocki wrote:
> That actually is correct. On the nx6325 suspend is totally dominated by disk
> spindown, almost everything else is negligible compared to it (well, except for
> the audio), so we can't go down below 1 s during suspend on this box.
>
> On the Wind, disk spindown time is comparable with serio suspend time,
> so at least in principle we should be able to get .5 s suspend on this box -
> if the disk spindown in async.
>
> In turn, the resume on the Wind is dominated by disk spinup, so we can't
> go below 1.5 s on this box during resume (notice that the "async+extra"
> approach brings us close to this limit, although we could save .5 s more in
> principle by making more devices async).
>
> Resume on the nx6325 is a different story, though, as it is dominated by USB
> and PCI devices, so marking those as async would probably bring us close to
> the limit.
The implications seem pretty clear. If the following sorts of devices
were async:
USB (devices and interfaces), PCI, serio, SCSI (hosts, targets,
devices)
then we would reap close to the maximum benefit -- providing:
async threads are started in a first pass without waiting
for synchronous devices, and
It's not clear that making all these types of devices async will really
work, but it's worth testing.
Alan Stern
On Thursday 17 December 2009, Alan Stern wrote:
> On Thu, 17 Dec 2009, Rafael J. Wysocki wrote:
>
> > That actually is correct. On the nx6325 suspend is totally dominated by disk
> > spindown, almost everything else is negligible compared to it (well, except for
> > the audio), so we can't go down below 1 s during suspend on this box.
> >
> > On the Wind, disk spindown time is comparable with serio suspend time,
> > so at least in principle we should be able to get .5 s suspend on this box -
> > if the disk spindown in async.
> >
> > In turn, the resume on the Wind is dominated by disk spinup, so we can't
> > go below 1.5 s on this box during resume (notice that the "async+extra"
> > approach brings us close to this limit, although we could save .5 s more in
> > principle by making more devices async).
> >
> > Resume on the nx6325 is a different story, though, as it is dominated by USB
> > and PCI devices, so marking those as async would probably bring us close to
> > the limit.
>
> The implications seem pretty clear. If the following sorts of devices
> were async:
>
> USB (devices and interfaces), PCI, serio, SCSI (hosts, targets,
> devices)
Plus ACPI battery.
> then we would reap close to the maximum benefit -- providing:
>
> async threads are started in a first pass without waiting
> for synchronous devices, and
Agreed.
> It's not clear that making all these types of devices async will really
> work, but it's worth testing.
I'm working on it.
Rafael
On Sun, 2009-12-06 at 22:15 -0800, Linus Torvalds wrote:
>
> The same is true of the prepare_suspend/suspend split I'm proposing: I
> suspect that for something like USB, it would make most sense to just do
> normal node suspend in prepare_suspend, which would do everything
> asynchronously. Only USB hub devices would get involved at the later
> 'suspend()' phase.
Wasn't part of the goal with prepare_suspend() vs. suspend() to handle
the problem of backing store vs the VM ?
I.e., once any device potentially in the VM path is suspended, things like
kmalloc() or gfp() can potentially stall until resume. Or did we address
that recently?
Iirc, part of the idea behind prepare_* is that it's safe vs. the above.
Now if you start suspending USB devices at prepare() then you break that
assumption since those could be mass storage with dirty mmap'ed pages on
them.
Now, I'm all for fixing it at the VM/allocator level (if we didn't
already) turning pretty much everything into NO_IO once we start
suspending devices but that's a whole different matter :-)
Cheers,
Ben.
On Thursday 17 December 2009, Rafael J. Wysocki wrote:
...
> Tomorrow I'll try to mark as many devices as reasonably possible as async
> and see how the total suspend-resume times change.
I didn't manage to do that, but I was able to mark sd and i8042 as async and
see the impact of this.
The raw data are in the usual place:
http://www.sisk.pl/kernel/data/async-suspend-resume.pdf
and the individual device timings and logs are in:
http://www.sisk.pl/kernel/data/nx6325/
http://www.sisk.pl/kernel/data/wind/
This is the summary (previous results are included for easier reference):
                            HP nx6325         MSI Wind U100

sync suspend                1482 (+/- 40)     1180 (+/- 24)
sync resume                 2955 (+/- 2)      3597 (+/- 25)

async suspend               1553 (+/- 49)     1177 (+/- 32)
async resume                2692 (+/- 326)    3556 (+/- 33)

async+one-liner suspend     1600 (+/- 39)     1212 (+/- 41)
async+one-liner resume      2692 (+/- 324)    3579 (+/- 24)

async+extra suspend         1496 (+/- 37)     1217 (+/- 38)
async+extra resume          1859 (+/- 114)    1923 (+/- 35)

with "async" i8042 and sd:

async suspend               1319 (+/- 51)     1045 (+/- 41)
async resume                2929 (+/- 3)      3546 (+/- 27)

async+extra suspend         1327 (+/- 36)     (didn't work)
async+extra resume          1742 (+/- 164)    1896 (+/- 28)
(the summary is also available at: http://www.sisk.pl/kernel/data/results.txt).
So, it actually makes the case for async suspend! Although it's not very
strong, with these two additional devices marked as "async" we get noticeable
suspend time improvement.
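To put a number on "noticeable": comparing the sync suspend averages with the
async suspend averages obtained with i8042 and sd marked as "async" gives a
saving of roughly 11% on both boxes. A small sketch of that arithmetic (the
helper name is hypothetical; it works in tenths of a percent to stay in
integer arithmetic):

```c
#include <assert.h>

/*
 * Hypothetical helper: time saved relative to a baseline, in tenths of
 * a percent, rounded to the nearest unit.
 */
static unsigned int saved_permille(unsigned int base_ms, unsigned int new_ms)
{
	assert(base_ms > 0 && new_ms <= base_ms);
	return ((base_ms - new_ms) * 1000u + base_ms / 2) / base_ms;
}
```

With the averages from the summary, saved_permille(1482, 1319) gives 110
(about 11.0%) for the nx6325 and saved_permille(1180, 1045) gives 114
(about 11.4%) for the Wind.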
Still, the "extra" patch doesn't help on suspend at all and on the Wind the
suspend part of it didn't even work (I have yet to figure out which of the two
devices crashed the suspend). Nevertheless the resume part of the "extra"
patch worked in both cases and worked better than without the two additional
"async" devices.
To me, this means that the suspend part of the "extra" patch is not really
useful. However, the resume part of it is _very_ useful, so I'd like to add
that part only to the async patchset. The explanation why it helps so much
is also straightforward to me. Namely, if slow async devices are last to
resume, then without the "extra" patch they need to wait for all of the
preceding sync devices and the speedup from executing their resume routines
asynchronously is very limited. Now, with the "extra" patch their resume
routines start as soon as their parents complete resuming and that may be
early enough for the speedup to be significant.
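The effect can be reproduced with a toy model. The sketch below is not the
PM core code: it assumes an unlimited number of async threads, no
dependencies other than parent-child, a list ordered so that parents come
before children, and made-up callback costs. The main thread walks the list
and runs sync callbacks inline; an async callback starts either when the main
thread reaches the device (no "extra" patch) or at time zero (resume part of
the "extra" patch), and in both cases not before its parent has finished.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_DEVS 64

/* Hypothetical model of one dpm_list entry (not a kernel structure). */
struct sim_dev {
	int parent;		/* index of the parent in the list, or -1 */
	int is_async;		/* nonzero if resumed from an async thread */
	unsigned int cost_ms;	/* made-up duration of the resume callback */
};

/* Made-up list: a fast bridge, two slow sync devices, a slow async child. */
static const struct sim_dev example_list[] = {
	{ -1, 0,   10 },	/* 0: sync bridge */
	{ -1, 0, 1000 },	/* 1: slow sync device */
	{ -1, 0,  500 },	/* 2: another sync device */
	{  0, 1, 1500 },	/* 3: slow async child of device 0 */
};

/* Return the time (ms) at which the last resume callback finishes. */
static unsigned int sim_resume(const struct sim_dev *devs, size_t n,
			       int async_upfront)
{
	unsigned int finish[SIM_MAX_DEVS];
	unsigned int now = 0;	/* clock of the main (sync) thread */
	unsigned int total = 0;
	size_t i;

	assert(n <= SIM_MAX_DEVS);

	for (i = 0; i < n; i++) {
		unsigned int parent_done = 0, start;

		if (devs[i].parent >= 0) {
			/* Parents must precede children in the list. */
			assert((size_t)devs[i].parent < i);
			parent_done = finish[devs[i].parent];
		}

		/* When may this callback be started at the earliest? */
		start = (devs[i].is_async && async_upfront) ? 0 : now;
		if (start < parent_done)
			start = parent_done;

		finish[i] = start + devs[i].cost_ms;

		/* Only sync callbacks hold up the main thread. */
		if (!devs[i].is_async)
			now = finish[i];
		if (finish[i] > total)
			total = finish[i];
	}

	return total;
}
```

For example_list, sim_resume(example_list, 4, 0) returns 3010 and
sim_resume(example_list, 4, 1) returns 1510: the slow async device overlaps
with the sync ones only if its thread is started upfront.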
Rafael
On Fri, 18 Dec 2009, Rafael J. Wysocki wrote:
> I didn't manage to do that, but I was able to mark sd and i8042 as async and
> see the impact of this.
Apparently this didn't do what you wanted. In the nx6325
sd+i8042+async+extra log, the 0:0:0:0 device (which is a SCSI disk) was
suspended by the main thread instead of an async thread.
There's an important point I neglected to mention before. Your logs
don't show anything for devices with no suspend callbacks at all.
Nevertheless, these devices sit on the device list and prevent other
devices from suspending or resuming as soon as they could.
For example, the fingerprint sensor (3-1) took the most time to resume.
But other devices were delayed until after it finished because it had
children with no callbacks, and they delayed the devices following
them in the list.
What would happen if you completed these devices immediately, as part
of the first pass?
Alan Stern
On Wednesday 16 December 2009, Dmitry Torokhov wrote:
> On Wed, Dec 16, 2009 at 03:11:05AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday 15 December 2009, Linus Torvalds wrote:
> > >
> > > On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > > > >
> > > > > Give a real example that matters.
> > > >
> > > > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > > > like this:
> > >
> > > No.
> > >
> > > I mean something real - something like
> > >
> > > - if you run on a non-PC with two USB buses behind non-PCI controllers.
> > >
> > > - device xyz.
> > >
> > > > If this applies to _resume_ only, then I agree, but the Arjan's data clearly
> > > > show that serio devices take much more time to suspend than USB.
> > >
> > > I mean in general - something where you actually have hard data that some
> > > device really needs anythign more than my one-liner, and really _needs_
> > > some complex infrastructure.
> > >
> > > Not "let's imagine a case like xyz".
> >
> > As I said I would, I made some measurements.
> >
> > I measured the total time of suspending and resuming devices as shown by the
> > code added by this patch:
> > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> > on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> > different and the HP was running 64-bit kernel and user space).
> >
> > I took four cases into consideration:
> > (1) synchronous suspend and resume (/sys/power/pm_async = 0)
> > (2) asynchronous suspend and resume as introduced by the async branch at:
> > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
> > (3) asynchronous suspend and resume like in (2), but with your one-liner setting
> > the power.async_suspend flag for PCI bridges on top
> > (4) asynchronous suspend and resume like in (2), but with an extra patch that
> > is appended on top
> >
> > For those tests I set power.async_suspend for all USB devices, all serio input
> > devices, the ACPI battery and the USB PCI controllers (to see the impact of the
> > one-liner, if any).
> >
> > I carried out 5 consecutive suspend-resume cycles (started from under X) on
> > each box in each case, and the raw data are here (all times in milliseconds):
> > http://www.sisk.pl/kernel/data/async-suspend.pdf
> >
> > The summarized data are below (the "big" numbers are averages and the +/-
> > numbers are standard deviations, all in milliseconds):
> >
> > HP nx6325 MSI Wind U100
> >
> > sync suspend 1482 (+/- 40) 1180 (+/- 24)
> > sync resume 2955 (+/- 2) 3597 (+/- 25)
> >
> > async suspend 1553 (+/- 49) 1177 (+/- 32)
> > async resume 2692 (+/- 326) 3556 (+/- 33)
> >
> > async+one-liner suspend 1600 (+/- 39) 1212 (+/- 41)
> > async+one-liner resume 2692 (+/- 324) 3579 (+/- 24)
> >
> > async+extra suspend 1496 (+/- 37) 1217 (+/- 38)
> > async+extra resume 1859 (+/- 114) 1923 (+/- 35)
> >
> > So, in my opinion, with the above set of "async" devices, it doesn't
> > make sense to do async suspend at all, because the sync suspend is actually
> > the fastest on both machines.
>
> I think the async suspend is not asynchronous enough then - what kind of
> time do you get if you simply comment out call to psmouse_reset() in
> drivers/input/mouse/psmouse-base.c:psmouse_cleanup()? (Just for testing
> purposes only, I don't think we want to do that by default.)
The problem apparently is that the i8042 suspend/resume is synchronous.
Do you think it's safe to mark it as asynchronous?
Rafael
On Friday 18 December 2009, Rafael J. Wysocki wrote:
> On Thursday 17 December 2009, Rafael J. Wysocki wrote:
> ...
> > Tomorrow I'll try to mark as many devices as reasonably possible as async
> > and see how the total suspend-resume times change.
>
> I didn't manage to do that, but I was able to mark sd and i8042 as async and
> see the impact of this.
>
> The raw data are in the usual place:
>
> http://www.sisk.pl/kernel/data/async-suspend-resume.pdf
>
> and the individual device timings and logs are in:
>
> http://www.sisk.pl/kernel/data/nx6325/
> http://www.sisk.pl/kernel/data/wind/
>
> This is the summary (previous results are inculded for easier reference):
>
> HP nx6325 MSI Wind U100
>
> sync suspend 1482 (+/- 40) 1180 (+/- 24)
> sync resume 2955 (+/- 2) 3597 (+/- 25)
>
> async suspend 1553 (+/- 49) 1177 (+/- 32)
> async resume 2692 (+/- 326) 3556 (+/- 33)
>
> async+one-liner suspend 1600 (+/- 39) 1212 (+/- 41)
> async+one-liner resume 2692 (+/- 324) 3579 (+/- 24)
>
> async+extra suspend 1496 (+/- 37) 1217 (+/- 38)
> async+extra resume 1859 (+/- 114) 1923 (+/- 35)
>
> with "async" i8042 and sd:
>
> async suspend 1319 (+/- 51) 1045 (+/- 41)
> async resume 2929 (+/- 3) 3546 (+/- 27)
>
> async+extra suspend 1327 (+/- 36) (didn't work)
> async+extra resume 1742 (+/- 164) 1896 (+/- 28)
>
> (the summary is also available at: http://www.sisk.pl/kernel/data/results.txt).
>
> So, it actually makes the case for async suspend! Although it's not very
> strong, with these two additional devices marked as "async" we get noticeable
> suspend time improvement.
>
> Still, the "extra" patch doesn't help on suspend at all and on the Wind the
> suspend part of it didn't even work (I'm yet to figure out which of the two
> devices crashed the suspend).
Small update. I've just verified that sd was the failing device, although I'm
not sure about the reason.
Apart from this, I ran some tests on the Wind with i8042 marked as "async"
and sd marked as "sync". In that case all of the tests succeeded and I got
the following numbers:
suspend (i8042 async, full extra patch applied): 1070 (+/- 40)
resume (i8042 async, full extra patch applied): 1915.84 (+/- 27)
suspend (i8042 async, resume part of extra patch applied): 1050 (+/- 34)
First, it looks like the suspend speedup was related to marking i8042 as
"async". Since the serio devices, which are the i8042's children, were also
"async" (just like in all of the tests before), this means that the speedup
resulted from removing a suspend stall caused by a sync parent of async
children (i8042 and serio, respectively, in this case).
However, the suspend part of the extra patch doesn't really help. In fact it
even makes things worse.
So, I still think the resume part of the extra patch is definitely useful, but
the suspend part of it is not. IOW, it's worth running async resumes upfront,
but it's not worth running async suspends upfront.
Rafael
On Fri, Dec 18, 2009 at 11:43:29PM +0100, Rafael J. Wysocki wrote:
> On Wednesday 16 December 2009, Dmitry Torokhov wrote:
> > On Wed, Dec 16, 2009 at 03:11:05AM +0100, Rafael J. Wysocki wrote:
> > > On Tuesday 15 December 2009, Linus Torvalds wrote:
> > > >
> > > > On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > > > > >
> > > > > > Give a real example that matters.
> > > > >
> > > > > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > > > > like this:
> > > >
> > > > No.
> > > >
> > > > I mean something real - something like
> > > >
> > > > - if you run on a non-PC with two USB buses behind non-PCI controllers.
> > > >
> > > > - device xyz.
> > > >
> > > > > If this applies to _resume_ only, then I agree, but the Arjan's data clearly
> > > > > show that serio devices take much more time to suspend than USB.
> > > >
> > > > I mean in general - something where you actually have hard data that some
> > > > device really needs anythign more than my one-liner, and really _needs_
> > > > some complex infrastructure.
> > > >
> > > > Not "let's imagine a case like xyz".
> > >
> > > As I said I would, I made some measurements.
> > >
> > > I measured the total time of suspending and resuming devices as shown by the
> > > code added by this patch:
> > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> > > on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> > > different and the HP was running 64-bit kernel and user space).
> > >
> > > I took four cases into consideration:
> > > (1) synchronous suspend and resume (/sys/power/pm_async = 0)
> > > (2) asynchronous suspend and resume as introduced by the async branch at:
> > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
> > > (3) asynchronous suspend and resume like in (2), but with your one-liner setting
> > > the power.async_suspend flag for PCI bridges on top
> > > (4) asynchronous suspend and resume like in (2), but with an extra patch that
> > > is appended on top
> > >
> > > For those tests I set power.async_suspend for all USB devices, all serio input
> > > devices, the ACPI battery and the USB PCI controllers (to see the impact of the
> > > one-liner, if any).
> > >
> > > I carried out 5 consecutive suspend-resume cycles (started from under X) on
> > > each box in each case, and the raw data are here (all times in milliseconds):
> > > http://www.sisk.pl/kernel/data/async-suspend.pdf
> > >
> > > The summarized data are below (the "big" numbers are averages and the +/-
> > > numbers are standard deviations, all in milliseconds):
> > >
> > > HP nx6325 MSI Wind U100
> > >
> > > sync suspend 1482 (+/- 40) 1180 (+/- 24)
> > > sync resume 2955 (+/- 2) 3597 (+/- 25)
> > >
> > > async suspend 1553 (+/- 49) 1177 (+/- 32)
> > > async resume 2692 (+/- 326) 3556 (+/- 33)
> > >
> > > async+one-liner suspend 1600 (+/- 39) 1212 (+/- 41)
> > > async+one-liner resume 2692 (+/- 324) 3579 (+/- 24)
> > >
> > > async+extra suspend 1496 (+/- 37) 1217 (+/- 38)
> > > async+extra resume 1859 (+/- 114) 1923 (+/- 35)
> > >
> > > So, in my opinion, with the above set of "async" devices, it doesn't
> > > make sense to do async suspend at all, because the sync suspend is actually
> > > the fastest on both machines.
> >
> > I think the async suspend is not asynchronous enough then - what kind of
> > time do you get if you simply comment out call to psmouse_reset() in
> > drivers/input/mouse/psmouse-base.c:psmouse_cleanup()? (Just for testing
> > purposes only, I don't think we want to do that by default.)
>
> The problem apparently is that the i8042 suspend/resume is synchronous.
>
> Do you think it's safe to mark it as asynchronous?
>
Umm.. there lie dragons. There is an implicit relationship between i8042
and PNP/ACPI devices representing keyboard and mouse ports, and I am not
sure how happy i8042 (and most importantly the BIOS) will be if they get
shut down before i8042. Also there is EC which is in theory independent
but in practice not so much.
--
Dmitry
On Saturday 19 December 2009, Dmitry Torokhov wrote:
> On Fri, Dec 18, 2009 at 11:43:29PM +0100, Rafael J. Wysocki wrote:
> > On Wednesday 16 December 2009, Dmitry Torokhov wrote:
> > > On Wed, Dec 16, 2009 at 03:11:05AM +0100, Rafael J. Wysocki wrote:
> > > > On Tuesday 15 December 2009, Linus Torvalds wrote:
> > > > >
> > > > > On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > > > > > >
> > > > > > > Give a real example that matters.
> > > > > >
> > > > > > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > > > > > like this:
> > > > >
> > > > > No.
> > > > >
> > > > > I mean something real - something like
> > > > >
> > > > > - if you run on a non-PC with two USB buses behind non-PCI controllers.
> > > > >
> > > > > - device xyz.
> > > > >
> > > > > > If this applies to _resume_ only, then I agree, but the Arjan's data clearly
> > > > > > show that serio devices take much more time to suspend than USB.
> > > > >
> > > > > I mean in general - something where you actually have hard data that some
> > > > > device really needs anythign more than my one-liner, and really _needs_
> > > > > some complex infrastructure.
> > > > >
> > > > > Not "let's imagine a case like xyz".
> > > >
> > > > As I said I would, I made some measurements.
> > > >
> > > > I measured the total time of suspending and resuming devices as shown by the
> > > > code added by this patch:
> > > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> > > > on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> > > > different and the HP was running 64-bit kernel and user space).
> > > >
> > > > I took four cases into consideration:
> > > > (1) synchronous suspend and resume (/sys/power/pm_async = 0)
> > > > (2) asynchronous suspend and resume as introduced by the async branch at:
> > > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
> > > > (3) asynchronous suspend and resume like in (2), but with your one-liner setting
> > > > the power.async_suspend flag for PCI bridges on top
> > > > (4) asynchronous suspend and resume like in (2), but with an extra patch that
> > > > is appended on top
> > > >
> > > > For those tests I set power.async_suspend for all USB devices, all serio input
> > > > devices, the ACPI battery and the USB PCI controllers (to see the impact of the
> > > > one-liner, if any).
> > > >
> > > > I carried out 5 consecutive suspend-resume cycles (started from under X) on
> > > > each box in each case, and the raw data are here (all times in milliseconds):
> > > > http://www.sisk.pl/kernel/data/async-suspend.pdf
> > > >
> > > > The summarized data are below (the "big" numbers are averages and the +/-
> > > > numbers are standard deviations, all in milliseconds):
> > > >
> > > > HP nx6325 MSI Wind U100
> > > >
> > > > sync suspend 1482 (+/- 40) 1180 (+/- 24)
> > > > sync resume 2955 (+/- 2) 3597 (+/- 25)
> > > >
> > > > async suspend 1553 (+/- 49) 1177 (+/- 32)
> > > > async resume 2692 (+/- 326) 3556 (+/- 33)
> > > >
> > > > async+one-liner suspend 1600 (+/- 39) 1212 (+/- 41)
> > > > async+one-liner resume 2692 (+/- 324) 3579 (+/- 24)
> > > >
> > > > async+extra suspend 1496 (+/- 37) 1217 (+/- 38)
> > > > async+extra resume 1859 (+/- 114) 1923 (+/- 35)
> > > >
> > > > So, in my opinion, with the above set of "async" devices, it doesn't
> > > > make sense to do async suspend at all, because the sync suspend is actually
> > > > the fastest on both machines.
> > >
> > > I think the async suspend is not asynchronous enough then - what kind of
> > > time do you get if you simply comment out call to psmouse_reset() in
> > > drivers/input/mouse/psmouse-base.c:psmouse_cleanup()? (Just for testing
> > > purposes only, I don't think we want to do that by default.)
> >
> > The problem apparently is that the i8042 suspend/resume is synchronous.
> >
> > Do you think it's safe to mark it as asynchronous?
> >
>
> Umm.. there lie dragons. There is an implicit relationship between i8042
> and PNP/ACPI devices representing keyboard and mouse ports, and I am not
> sure how happy i8042 (and most importantly the BIOS) will be if they get
> shut down before i8042. Also there is EC which is in theory independent
> but in practice not so much.
I see.
Is it possible to identify the ACPI devices that should wait for the i8042
suspend and that should be waited for by it on resume?
Rafael
On Friday 18 December 2009, Alan Stern wrote:
> On Fri, 18 Dec 2009, Rafael J. Wysocki wrote:
>
> > I didn't manage to do that, but I was able to mark sd and i8042 as async and
> > see the impact of this.
>
> Apparently this didn't do what you wanted. In the nx6325
> sd+i8042+async+extra log, the 0:0:0:0 device (which is a SCSI disk) was
> suspended by the main thread instead of an async thread.
Hm, that's odd, because there's a noticeable time difference between the
two cases in which the sd is sync and async. I'll look into it further.
> There's an important point I neglected to mention before. Your logs
> don't show anything for devices with no suspend callbacks at all.
> Nevertheless, these devices sit on the device list and prevent other
> devices from suspending or resuming as soon as they could.
Unless they are async, that is.
> For example, the fingerprint sensor (3-1) took the most time to resume.
> But other devices were delayed until after it finished because it had
> children with no callbacks, and they delayed the devices following
> them in the list.
>
> What would happen if you completed these devices immediately, as part
> of the first pass?
OK. How is the PM core supposed to check whether a device has null suspend
and resume callbacks? Check all of the function pointers in the first pass?
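A rough sketch of what such a first-pass check might look like is below. All
of the sketch_* types and names are hypothetical, simplified stand-ins rather
than the real struct device, dev_pm_ops, bus_type or class definitions; a
real check would have to cover every callback relevant to the transition in
question (freeze, thaw, poweroff, restore and the noirq variants included),
and legacy suspend callbacks take an additional pm_message_t argument.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev;
typedef int (*sketch_cb)(struct sketch_dev *dev);

/* Hypothetical stand-in for the suspend/resume part of dev_pm_ops. */
struct sketch_pm_ops {
	sketch_cb suspend;
	sketch_cb resume;
};

/* Hypothetical stand-in for a bus type, device type or class. */
struct sketch_layer {
	const struct sketch_pm_ops *pm;
	sketch_cb legacy_suspend;
	sketch_cb legacy_resume;
};

struct sketch_dev {
	const struct sketch_layer *bus;
	const struct sketch_layer *type;
	const struct sketch_layer *cls;
};

/* Example callback that does nothing, used only to populate a pointer. */
static int sketch_noop(struct sketch_dev *dev)
{
	(void)dev;
	return 0;
}

/* Nonzero if this layer provides any suspend or resume callback. */
static int sketch_layer_has_cb(const struct sketch_layer *layer)
{
	if (!layer)
		return 0;
	if (layer->pm && (layer->pm->suspend || layer->pm->resume))
		return 1;
	return layer->legacy_suspend || layer->legacy_resume;
}

/* Nonzero if any layer of the device has something to run. */
static int sketch_dev_has_pm_cb(const struct sketch_dev *dev)
{
	assert(dev != NULL);
	return sketch_layer_has_cb(dev->bus) ||
	       sketch_layer_has_cb(dev->type) ||
	       sketch_layer_has_cb(dev->cls);
}
```

Under these assumptions, a device for which sketch_dev_has_pm_cb() returns 0
could be marked as complete in the first pass, so that it never holds up the
devices following it in the list.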
Rafael
On Saturday 19 December 2009, Rafael J. Wysocki wrote:
> On Saturday 19 December 2009, Dmitry Torokhov wrote:
> > On Fri, Dec 18, 2009 at 11:43:29PM +0100, Rafael J. Wysocki wrote:
> > > On Wednesday 16 December 2009, Dmitry Torokhov wrote:
> > > > On Wed, Dec 16, 2009 at 03:11:05AM +0100, Rafael J. Wysocki wrote:
> > > > > On Tuesday 15 December 2009, Linus Torvalds wrote:
> > > > > >
> > > > > > On Tue, 15 Dec 2009, Rafael J. Wysocki wrote:
> > > > > > > >
> > > > > > > > Give a real example that matters.
> > > > > > >
> > > > > > > I'll try. Let -> denote child-parent relationships and assume dpm_list looks
> > > > > > > like this:
> > > > > >
> > > > > > No.
> > > > > >
> > > > > > I mean something real - something like
> > > > > >
> > > > > > - if you run on a non-PC with two USB buses behind non-PCI controllers.
> > > > > >
> > > > > > - device xyz.
> > > > > >
> > > > > > > If this applies to _resume_ only, then I agree, but the Arjan's data clearly
> > > > > > > show that serio devices take much more time to suspend than USB.
> > > > > >
> > > > > > I mean in general - something where you actually have hard data that some
> > > > > > device really needs anythign more than my one-liner, and really _needs_
> > > > > > some complex infrastructure.
> > > > > >
> > > > > > Not "let's imagine a case like xyz".
> > > > >
> > > > > As I said I would, I made some measurements.
> > > > >
> > > > > I measured the total time of suspending and resuming devices as shown by the
> > > > > code added by this patch:
> > > > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=commitdiff_plain;h=c1b8fc0a8bff7707c10f31f3d26bfa88e18ccd94;hp=087dbf5f079f1b55cbd3964c9ce71268473d5b67
> > > > > on two boxes, HP nx6325 and MSI Wind U100 (hardware-wise they are quite
> > > > > different and the HP was running 64-bit kernel and user space).
> > > > >
> > > > > I took four cases into consideration:
> > > > > (1) synchronous suspend and resume (/sys/power/pm_async = 0)
> > > > > (2) asynchronous suspend and resume as introduced by the async branch at:
> > > > > http://git.kernel.org/?p=linux/kernel/git/rafael/suspend-2.6.git;a=shortlog;h=refs/heads/async
> > > > > (3) asynchronous suspend and resume like in (2), but with your one-liner setting
> > > > > the power.async_suspend flag for PCI bridges on top
> > > > > (4) asynchronous suspend and resume like in (2), but with an extra patch that
> > > > > is appended on top
> > > > >
> > > > > For those tests I set power.async_suspend for all USB devices, all serio input
> > > > > devices, the ACPI battery and the USB PCI controllers (to see the impact of the
> > > > > one-liner, if any).
> > > > >
> > > > > I carried out 5 consecutive suspend-resume cycles (started from under X) on
> > > > > each box in each case, and the raw data are here (all times in milliseconds):
> > > > > http://www.sisk.pl/kernel/data/async-suspend.pdf
> > > > >
> > > > > The summarized data are below (the "big" numbers are averages and the +/-
> > > > > numbers are standard deviations, all in milliseconds):
> > > > >
> > > > > >                          HP nx6325        MSI Wind U100
> > > > > >
> > > > > > sync suspend             1482 (+/- 40)    1180 (+/- 24)
> > > > > > sync resume              2955 (+/- 2)     3597 (+/- 25)
> > > > > >
> > > > > > async suspend            1553 (+/- 49)    1177 (+/- 32)
> > > > > > async resume             2692 (+/- 326)   3556 (+/- 33)
> > > > > >
> > > > > > async+one-liner suspend  1600 (+/- 39)    1212 (+/- 41)
> > > > > > async+one-liner resume   2692 (+/- 324)   3579 (+/- 24)
> > > > > >
> > > > > > async+extra suspend      1496 (+/- 37)    1217 (+/- 38)
> > > > > > async+extra resume       1859 (+/- 114)   1923 (+/- 35)
> > > > >
> > > > > So, in my opinion, with the above set of "async" devices, it doesn't
> > > > > make sense to do async suspend at all, because the sync suspend is actually
> > > > > the fastest on both machines.
> > > >
> > > > I think the async suspend is not asynchronous enough then - what kind of
> > > > time do you get if you simply comment out call to psmouse_reset() in
> > > > drivers/input/mouse/psmouse-base.c:psmouse_cleanup()? (Just for testing
> > > > purposes only, I don't think we want to do that by default.)
> > >
> > > The problem apparently is that the i8042 suspend/resume is synchronous.
> > >
> > > Do you think it's safe to mark it as asynchronous?
> > >
> >
> > Umm.. there lie dragons. There is an implicit relationship between i8042
> > and PNP/ACPI devices representing keyboard and mouse ports, and I am not
> > sure how happy i8042 (and most importantly the BIOS) will be if they get
> > shut down before i8042. Also there is EC which is in theory independent
> > but in practice not so much.
>
> I see.
>
> Is this possible to identify ACPI devices that should wait for the i8042
> suspend and that should be waited for by it on resume?
Wait, if you look at the logs at
http://www.sisk.pl/kernel/data/nx6325/
http://www.sisk.pl/kernel/data/wind/
you'll see that the i8042 suspend is called before any ACPI devices are
suspended anyway. In fact, it is suspended right after its serio children
which is very early in the suspend sequence.
So, it seems, if there were any problems with i8042 vs ACPI, we'd experience
them anyway.
Rafael
On Dec 19, 2009, at 2:29 PM, "Rafael J. Wysocki" <[email protected]> wrote:
> Wait, if you look at the logs at
>
> http://www.sisk.pl/kernel/data/nx6325/
> http://www.sisk.pl/kernel/data/wind/
>
> you'll see that the i8042 suspend is called before any ACPI devices are
> suspended anyway. In fact, it is suspended right after its serio children
> which is very early in the suspend sequence.
Right, and we do want to "suspend" i8042 (well, reset to the initial
state we found it at bootup) before touching ACPI.
If i8042 is async, given the fact that psmouse reset takes a long
time, it is possible that we start suspending PNP before we are done
with i8042.
--
Dmitry
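The ordering concern above can be modelled in userspace. The following is only a sketch under stated assumptions, not code from the async branch: it uses pthreads instead of kernel threads, the completion helpers are userspace stand-ins that merely borrow the kernel names, and i8042_suspend_async() and pnp_suspend_after_i8042() are hypothetical functions invented for illustration. It shows that once the i8042 suspend runs in its own thread, the PNP side only stays ordered after it if it waits explicitly.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for a kernel-style completion (illustrative only). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void complete_all(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

static struct completion i8042_done = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false
};
static bool i8042_suspended;

/* Hypothetical async suspend of the i8042 (a real one would include the slow psmouse reset). */
static void *i8042_suspend_async(void *unused)
{
	(void)unused;
	i8042_suspended = true;
	complete_all(&i8042_done);
	return NULL;
}

/*
 * Hypothetical PNP suspend that explicitly waits for the i8042.
 * Returns true if the i8042 was already suspended when PNP proceeded.
 */
static bool pnp_suspend_after_i8042(void)
{
	pthread_t thread;
	bool ordered;

	if (pthread_create(&thread, NULL, i8042_suspend_async, NULL) != 0)
		return false;

	/* Without this wait, the PNP suspend could run before the i8042 is done. */
	wait_for_completion(&i8042_done);
	ordered = i8042_suspended;

	pthread_join(thread, NULL);
	return ordered;
}
```

The explicit wait is the whole point of the model: it stands for an off-tree dependency that the device hierarchy alone does not express.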
On Dec 19, 2009, at 1:33 PM, "Rafael J. Wysocki" <[email protected]> wrote:
> On Saturday 19 December 2009, Dmitry Torokhov wrote:
>> Umm.. there lie dragons. There is an implicit relationship between i8042
>> and PNP/ACPI devices representing keyboard and mouse ports, and I am not
>> sure how happy i8042 (and most importantly the BIOS) will be if they get
>> shut down before i8042. Also there is EC which is in theory independent
>> but in practice not so much.
>
> I see.
>
> Is this possible to identify ACPI devices that should wait for the i8042
> suspend and that should be waited for by it on resume?
We could try to add some dependencies while discovering PNP to get the KBC
addresses in i8042, but we need to make sure we do it even in the presence
of i8042.nopnp.
--
Dmitry
On Saturday 19 December 2009, Dmitry Torokhov wrote:
> We could try to add some dependencies while discovering PNP to get the KBC
> addresses in i8042, but we need to make sure we do it even in the presence
> of i8042.nopnp.
Well, I guess this is the example of the off-tree dependencies that actually
matter Linus wanted. :-)
I guess there are quite a few devices that can depend on the i8042 in
principle, is this correct?
Rafael
On Dec 19, 2009, at 3:10 PM, "Rafael J. Wysocki" <[email protected]> wrote:
> Well, I guess this is the example of the off-tree dependencies that actually
> matter Linus wanted. :-)
>
> I guess there are quite a few devices that can depend on the i8042 in
> principle, is this correct?
The devices that depend on i8042 are the serio ports that are its
children. The i8042 itself may have an indirect dependency on a couple of PNP
devices.
I hope this answers your question...
--
Dmitry
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> Well, I guess this is the example of the off-tree dependencies that actually
> matter Linus wanted. :-)
It's also the kind of dependency where I say "if we get into these kinds
of messes, then the whole async crap isn't worth it".
Really. Having to try to match things up with ACPI and PnP is a nightmare.
Especially since I doubt Windows does anything like this, which means that
there's no reason for BIOS vendors to do the tables so that we'd even
know.
Linus
On Sunday 20 December 2009, Dmitry Torokhov wrote:
> The devices that depend on i8042 are the serio ports that are its
> children.
That I already knew. :-)
> I8042 itself may have indirect dependency on a couple of PNP devices.
I was really asking about these.
> I hope this answers your question...
Yes, thanks.
On Sunday 20 December 2009, Linus Torvalds wrote:
>
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Well, I guess this is the example of the off-tree dependencies that actually
> > matter Linus wanted. :-)
>
> It's also the kind of dependency where I say "if we get into these kinds
> of messes, then the whole async crap isn't worth it".
>
> Really. Having to try to match things up with ACPI and PnP is a nightmare.
> Especially since I doubt Windows does anything like this, which means that
> there's no reason for BIOS vendors to do the tables so that we'd even
> know.
OK, so this means we can just forget about suspending/resuming i8042
asynchronously, which is a pity, because that gave us some real suspend
speedup on my test systems.
Well, whatever.
So, seriously, do you think it makes sense to do asynchronous suspend at all?
I'm asking because we're likely to run into trouble like this during suspend
for other kinds of devices too, and without resolving it we won't get any
significant speedup from asynchronous suspend.
That said, to me it's definitely worth doing asynchronous resume with the
"start asynch threads upfront" modification, as the results of the tests show
that quite clearly. I hope you agree.
Rafael
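For reference, the relative resume savings implied by the summary table quoted earlier in this thread can be worked out as below. This is only an illustrative sketch, not code from the patch set: the helper names percent_saved() and print_resume_savings() are made up, and it assumes that the "async+extra" rows are the ones measured with the "start async threads upfront" modification.

```c
#include <assert.h>
#include <stdio.h>

/* Relative time saved, in percent, when going from baseline_ms to new_ms. */
static double percent_saved(double baseline_ms, double new_ms)
{
	assert(baseline_ms > 0.0);
	return 100.0 * (baseline_ms - new_ms) / baseline_ms;
}

/* Print the resume-time savings of "async+extra" over "sync" for both boxes. */
static void print_resume_savings(void)
{
	/* Average resume times in milliseconds, taken from the quoted table. */
	printf("HP nx6325:     %.1f%%\n", percent_saved(2955.0, 1859.0));
	printf("MSI Wind U100: %.1f%%\n", percent_saved(3597.0, 1923.0));
}
```

With the quoted averages this comes to roughly 37% on the HP nx6325 and 47% on the MSI Wind U100, whereas the suspend averages for the same configuration are no better than the synchronous ones.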
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> OK, so this means we can just forget about suspending/resuming i8042
> asynchronously, which is a pity, because that gave us some real suspend
> speedup on my test systems.
No. What it means is that you shouldn't try to come up with these idiotic
scenarios just trying to make trouble for yourself, and using it as an
excuse for crap.
I suggest you try to treat the i8042 controller async, and see if it is
problematic. If it isn't, don't do that then. But we actually have no real
reason to believe that it would be problematic, at least on a PC where the
actual logic is on the SB (presumably behind the LPC controller).
Why would it be?
The fact that PnP and ACPI enumerates those devices has exactly _what_ to
do with anything?
Linus
On Sat, 19 Dec 2009, Linus Torvalds wrote:
>
> I suggest you try to treat the i8042 controller async, and see if it is
> problematic. If it isn't, don't do that then.
I obviously meant: "If it _is_ problematic, don't do that then". "Is", not
"isn't".
Linus
On Sunday 20 December 2009, Linus Torvalds wrote:
>
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> >
> > OK, so this means we can just forget about suspending/resuming i8042
> > asynchronously, which is a pity, because that gave us some real suspend
> > speedup on my test systems.
>
> No. What it means is that you shouldn't try to come up with these idiotic
> scenarios just trying to make trouble for yourself,
I haven't. I've just asked Dmitry for his opinion and got it. The fact that
you don't like it doesn't mean it's actually "idiotic".
> and using it as an excuse for crap.
I'm not sure what you mean exactly, but whatever.
> I suggest you try to treat the i8042 controller async, and see if it is
> problematic.
I already have and I don't see problems with it, but quite obviously I can't
test all possible configurations out there.
> If it isn't, don't do that then. But we actually have no real
> reason to believe that it would be problematic, at least on a PC where the
> actual logic is on the SB (presumably behind the LPC controller).
>
> Why would it be?
The embedded controller may depend on it.
Rafael
On Sunday 20 December 2009, Linus Torvalds wrote:
>
> On Sat, 19 Dec 2009, Linus Torvalds wrote:
> >
> > I suggest you try to treat the i8042 controller async, and see if it is
> > problematic. If it isn't, don't do that then.
>
> I obviously meant: "If it _is_ problematic, don't do that then". "Is", not
> "isn't".
Sure, I understood that was a typo. :-)
Rafael
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> >
> > Why would it be?
>
> The embedded controller may depend on it.
Again, I say "why?"
Anything can be true. That doesn't _make_ everything true. There's no real
reason why PnP/ACPI suspend/resume should really care.
We can try it. Not for 2.6.33, but by the 34 merge window maybe we'll have
a patch-series that is ready to be tested, and that aggressively tries to
do the devices that matter asynchronously.
So instead of you trying to make up some idiotic cross-device worries,
just see if those worries have any actual background in reality. So far I
haven't actually heard anything but "in theory, anything is possible",
which is such a truism that it's not even worth voicing.
That said, I still get the feeling that we'd be even better off simply
trying to avoid the whole keyboard reset entirely. Apparently we do it for
a few HP laptops. It's entirely possible that we'd be better off simply
not _doing_ the slow thing in the first place.
For example, we may be _much_ better off doing that whole keyboard reset
at resume time than at suspend time. That's what we do when we probe
things on initialization - and the resume-time keyboard code is actually
already asynchronous, it does that atkbd_reconnect asynchronously by
queuing it as an event.
So again, all these problems may not at all be fundamental problems: the
keyboard driver does certain things, but there is no guarantee that it
_needs_ to do those things. Turning the driver async may be totally the
wrong thing to do, when we could potentially fix latency problems at the
driver level instead.
Linus
On Sunday 20 December 2009, Linus Torvalds wrote:
>
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > Why would it be?
> >
> > The embedded controller may depend on it.
>
> Again, I say "why?"
>
> Anything can be true. That doesn't _make_ everything true. There's no real
> reason why PnP/ACPI suspend/resume should really care.
>
> We can try it. Not for 2.6.33, but by the 34 merge window maybe we'll have
> a patch-series that is ready to be tested, and that aggressively tries to
> do the devices that matter asynchronously.
Yes, I'd like to have such a patch series for 2.6.34.
So far I've been able to confirm that doing serio+i8042, USB and ACPI battery
asynchronously may give us significant time savings, especially during resume.
> So instead of you trying to make up some idiotic cross-device worries,
> just see if those worries have any actual background in reality. So far I
> haven't actually heard anything but "in theory, anything is possible",
> which is such a truism that it's not even worth voicing.
>
> That said, I still get the feeling that we'd be even better off simply
> trying to avoid the whole keyboard reset entirely. Apparently we do it for
> a few HP laptops. It's entirely possible that we'd be better off simply
> not _doing_ the slow thing in the first place.
That very well may be the case, but I'm not the right person to confirm or deny
that.
Rafael
On Sat, Dec 19, 2009 at 04:09:07PM -0800, Linus Torvalds wrote:
>
> That said, I still get the feeling that we'd be even better off simply
> trying to avoid the whole keyboard reset entirely. Apparently we do it for
> a few HP laptops.
I was mistaken, HP laptops do not like mouse disabled when suspending,
not sure about the rest of the state.
> It's entirely possible that we'd be better off simply
> not _doing_ the slow thing in the first place.
>
The reset appeared first in 2.5.42. I expect that some BIOSes get very
confused when they find mouse speaking something that they do not
understand (i.e. synaptics, ALPS or anything else that is not bare PS/2
or intellimouse), but maybe Vojtech remembers better?
> For example, we may be _much_ better off doing that whole keyboard reset
> at resume time than at suspend time.
We do the reset for the different reasons - at resume we want the device
in known state to ensure that it properly responds to the probes we
send to it. At suspend we are trying to reset things into original state so
that the firmware will not be confused.
If we want to try to live without reset we could do PSMOUSE_CMD_RESET_DIS
instead of PSMOUSE_CMD_RESET_BAT which is much heavier. We should
probably not wait for .34 then because the bulk of testing will happen
only when .33 is close to being released because that's when most of
regular users will start using the new code and try to suspend and
resume.
Rafael, how long does suspend take if you change call to psmouse_reset()
in psmouse_cleanup() to ps2_command(&psmouse->ps2dev, NULL, PSMOUSE_CMD_RESET_DIS)?
And do the same for atkbd...
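To make the proposed experiment concrete, here is a minimal userspace sketch in C. It is not the psmouse driver: the two structs are cut-down stand-ins, ps2_command() is a stub that only records what it was asked to do, and the two cleanup helpers are hypothetical names. The numeric command values, and the reading that bits 8-11 encode the number of reply bytes to wait for, are assumptions based on the kernel headers rather than anything stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encodings: low byte is the command, bits 8-11 the reply length. */
#define PSMOUSE_CMD_RESET_DIS 0x00f5
#define PSMOUSE_CMD_RESET_BAT 0x02ff

/* Cut-down stand-ins for the driver structures (hypothetical fields). */
struct ps2dev {
    unsigned int last_cmd;     /* last command the stub was given */
    unsigned int reply_bytes;  /* reply bytes that command waits for */
};

struct psmouse {
    struct ps2dev ps2dev;
};

/* Stub, not the real libps2 routine: records the command and reply length. */
int ps2_command(struct ps2dev *dev, unsigned char *param, unsigned int command)
{
    unsigned int receive = (command >> 8) & 0xf;
    unsigned int i;

    assert(dev != NULL);
    dev->last_cmd = command;
    dev->reply_bytes = receive;
    for (i = 0; param != NULL && i < receive; i++)
        param[i] = 0;
    return 0;
}

/* Behaviour discussed above: a full reset that waits for the self-test reply. */
int cleanup_with_full_reset(struct psmouse *psmouse)
{
    unsigned char param[2];

    return ps2_command(&psmouse->ps2dev, param, PSMOUSE_CMD_RESET_BAT);
}

/* Proposed experiment: restore defaults and disable, with no reply to wait for. */
int cleanup_with_reset_dis(struct psmouse *psmouse)
{
    return ps2_command(&psmouse->ps2dev, NULL, PSMOUSE_CMD_RESET_DIS);
}
```

In this model the only visible difference is that the RESET_BAT path waits for two reply bytes while the RESET_DIS path waits for none, which is the kind of latency difference the timing test is meant to expose.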
BTW, making just serio asynchronous while keeping i8042 synchronous
makes no sense because I serialize access to i8042 - the thing does not
survive simultaneous [command] access to both keyboard and mouse...
--
Dmitry
On Sun, Dec 20, 2009 at 12:53:45AM +0100, Rafael J. Wysocki wrote:
> On Sunday 20 December 2009, Linus Torvalds wrote:
> >
> > If it isn't, don't do that then. But we actually have no real
> > reason to believe that it would be problematic, at least on a PC where the
> > actual logic is on the SB (presumably behind the LPC controller).
> >
> > Why would it be?
>
> The embedded controller may depend on it.
>
No, not really depend but rather weird things may happen if you are
accessing both. Witness regressions where touching embedded controller
makes us lose data from touchpad, I think you are CCed on that bug.
--
Dmitry
On Sat, 19 Dec 2009, Rafael J. Wysocki wrote:
> On Friday 18 December 2009, Alan Stern wrote:
> > On Fri, 18 Dec 2009, Rafael J. Wysocki wrote:
> >
> > > I didn't manage to do that, but I was able to mark sd and i8042 as async and
> > > see the impact of this.
> >
> > Apparently this didn't do what you wanted. In the nx6325
> > sd+i8042+async+extra log, the 0:0:0:0 device (which is a SCSI disk) was
To be precise, the device is an ATA or SATA disk but it is managed by
the sd driver.
> > suspended by the main thread instead of an async thread.
>
> Hm, that's odd, because there's a noticeable time difference between the
> two cases in which the sd is sync and async. I'll look into it further.
I don't know what the whole story is, but the PID number tells the
tale.
> > There's an important point I neglected to mention before. Your logs
> > don't show anything for devices with no suspend callbacks at all.
> > Nevertheless, these devices sit on the device list and prevent other
> > devices from suspending or resuming as soon as they could.
>
> Unless they are async, that is.
Yes. It would be simpler to make them async. But first we ought to
know what they are. Can you add an extra line to the log for such
devices?
What I'm afraid of is that there might be a "normal" device with a
"normal" ancestor but with "abnormal" devices in between (where
"normal" means there is a suspend or resume routine and "abnormal"
means all the method pointers are NULL). I know that this happens when
there's a USB mass-storage device, for example. If we complete the
intermediate devices immediately, then there won't be anything to
prevent the ancestor from suspending before the device or the device
from resuming before the ancestor. Forcing the "abnormal" devices to
be async, even if they aren't marked that way, would avoid these
problems.
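The hazard described above can be illustrated with a small userspace model in C. Everything in it is hypothetical (the struct, the single-child chain, and all function names); it assumes a simplified rule in which a device may suspend as soon as its direct child has completed, which is only a stand-in for whatever the PM core actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model: a single-child chain keeps the sketch short. */
struct model_dev {
    struct model_dev *child;
    bool has_suspend_cb;  /* "normal" if true, "abnormal" if false */
    bool completed;       /* suspend of this device has finished */
};

/* Simplified rule: a device may suspend once its direct child completed. */
bool model_may_suspend(const struct model_dev *dev)
{
    assert(dev != NULL);
    return dev->child == NULL || dev->child->completed;
}

/* Naive first pass: complete every callback-less device right away. */
void model_complete_abnormal_immediately(struct model_dev *top)
{
    struct model_dev *d;

    for (d = top; d != NULL; d = d->child)
        if (!d->has_suspend_cb)
            d->completed = true;
}

/*
 * Safer variant: a callback-less device completes only after everything
 * below it has completed, so it still passes the ordering constraint up.
 * Returns whether 'dev' is completed after the walk.
 */
bool model_complete_abnormal_when_ready(struct model_dev *dev)
{
    bool child_done;

    assert(dev != NULL);
    child_done = dev->child == NULL ||
                 model_complete_abnormal_when_ready(dev->child);
    if (!dev->has_suspend_cb && child_done)
        dev->completed = true;
    return dev->completed;
}
```

With a chain of normal hub, abnormal interface, and normal disk, the naive pass lets the hub suspend while the disk is still pending; the safer variant keeps the hub waiting until the disk has completed.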
> > For example, the fingerprint sensor (3-1) took the most time to resume.
> > But other devices were delayed until after it finished because it had
> > children with no callbacks, and they delayed the devices following
> > them in the list.
> >
> > What would happen if you completed these devices immediately, as part
> > of the first pass?
>
> OK. How is the PM core supposed to check if a device has null suspend
> and resume? Check all of the function pointers in the first pass?
All the relevant pointers (including the legacy pointers). That is,
you check only the suspend pointers during the first suspend pass, and
likewise for resume.
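A rough sketch of such a check, written as a self-contained C model rather than against the real driver core: the struct layout, the three callback levels, and every name below are hypothetical and chosen only to show "check the suspend pointers in the suspend pass, the resume pointers in the resume pass, legacy pointers included".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_device;
typedef int (*model_pm_cb)(struct model_device *dev);

/* Hypothetical: one level of callbacks (for example bus, type or class). */
struct model_pm_ops {
    model_pm_cb suspend;
    model_pm_cb resume;
};

/* Hypothetical device with three callback levels plus legacy pointers. */
struct model_device {
    const struct model_pm_ops *bus_ops;
    const struct model_pm_ops *type_ops;
    const struct model_pm_ops *class_ops;
    model_pm_cb legacy_suspend;
    model_pm_cb legacy_resume;
};

/* Placeholder callback so that a pointer can be non-NULL in examples. */
int model_noop_cb(struct model_device *dev)
{
    (void)dev;
    return 0;
}

/* First suspend pass: look only at the suspend pointers. */
bool model_has_suspend_cb(const struct model_device *dev)
{
    assert(dev != NULL);
    return dev->legacy_suspend != NULL ||
           (dev->bus_ops != NULL && dev->bus_ops->suspend != NULL) ||
           (dev->type_ops != NULL && dev->type_ops->suspend != NULL) ||
           (dev->class_ops != NULL && dev->class_ops->suspend != NULL);
}

/* First resume pass: look only at the resume pointers. */
bool model_has_resume_cb(const struct model_device *dev)
{
    assert(dev != NULL);
    return dev->legacy_resume != NULL ||
           (dev->bus_ops != NULL && dev->bus_ops->resume != NULL) ||
           (dev->type_ops != NULL && dev->type_ops->resume != NULL) ||
           (dev->class_ops != NULL && dev->class_ops->resume != NULL);
}
```

A device for which model_has_suspend_cb() returns false would be an "abnormal" device in the sense used above during suspend, even if it does have a resume routine.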
Alan Stern
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> So, seriously, do you think it makes sense to do asynchronous suspend at all?
> I'm asking, because we're likely to get into troubles like this during suspend
> for other kinds of devices too and without resolving them we won't get any
> significant speedup from asynchronous suspend.
>
> That said, to me it's definitely worth doing asynchronous resume with the
> "start asynch threads upfront" modification, as the results of the tests show
> that quite clearly. I hope you agree.
It's too early to come to this sort of conclusion (i.e., that suspend
and resume react very differently to an asynchronous approach). Unless
you have some definite _reason_ for thinking that resume will benefit
more than suspend, you shouldn't try to generalize so much from tests
on only two systems.
Alan Stern
On Sunday 20 December 2009, Alan Stern wrote:
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> > So, seriously, do you think it makes sense to do asynchronous suspend at all?
> > I'm asking, because we're likely to get into troubles like this during suspend
> > for other kinds of devices too and without resolving them we won't get any
> > significant speedup from asynchronous suspend.
> >
> > That said, to me it's definitely worth doing asynchronous resume with the
> > "start asynch threads upfront" modification, as the results of the tests show
> > that quite clearly. I hope you agree.
>
> It's too early to come to this sort of conclusion (i.e., that suspend
> and resume react very differently to an asynchronous approach). Unless
> you have some definite _reason_ for thinking that resume will benefit
> more than suspend, you shouldn't try to generalize so much from tests
> on only two systems.
In fact I have one reason. Namely, the things that drivers do on suspend and
resume are evidently quite different and on these two systems I was able to
test they apparently took different amounts of time to complete.
The very fact that on both systems resume is substantially longer than suspend,
even if all devices are suspended and resumed synchronously, is quite
interesting.
Rafael
On Sunday 20 December 2009, Alan Stern wrote:
> On Sat, 19 Dec 2009, Rafael J. Wysocki wrote:
>
> > On Friday 18 December 2009, Alan Stern wrote:
> > > On Fri, 18 Dec 2009, Rafael J. Wysocki wrote:
> > >
> > > > I didn't manage to do that, but I was able to mark sd and i8042 as async and
> > > > see the impact of this.
> > >
> > > Apparently this didn't do what you wanted. In the nx6325
> > > sd+i8042+async+extra log, the 0:0:0:0 device (which is a SCSI disk) was
>
> To be precise, the device is an ATA or SATA disk but it is managed by
> the sd driver.
>
> > > suspended by the main thread instead of an async thread.
> >
> > Hm, that's odd, because there's a noticeable time difference between the
> > two cases in which the sd is sync and async. I'll look into it further.
>
> I don't know what the whole story is, but the PID number tells the
> tale.
>
> > > There's an important point I neglected to mention before. Your logs
> > > don't show anything for devices with no suspend callbacks at all.
> > > Nevertheless, these devices sit on the device list and prevent other
> > > devices from suspending or resuming as soon as they could.
> >
> > Unless they are async, that is.
>
> Yes. It would be simpler to make them async. But first we ought to
> know what they are. Can you add an extra line to the log for such
> devices?
Sure, I'll do that.
> What I'm afraid of is that there might be a "normal" device with a
> "normal" ancestor but with "abnormal" devices in between (where
> "normal" means there is a suspend or resume routine and "abnormal"
> means all the method pointers are NULL). I know that this happens when
> there's a USB mass-storage device, for example. If we complete the
> intermediate devices immediately, then there won't be anything to
> prevent the ancestor from suspending before the device or the device
> from resuming before the ancestor.
I'm afraid of that too.
Rafael
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> > It's too early to come to this sort of conclusion (i.e., that suspend
> > and resume react very differently to an asynchronous approach). Unless
> > you have some definite _reason_ for thinking that resume will benefit
> > more than suspend, you shouldn't try to generalize so much from tests
> > on only two systems.
>
> In fact I have one reason. Namely, the things that drivers do on suspend and
> resume are evidently quite different and on these two systems I was able to
> test they apparently took different amounts of time to complete.
>
> The very fact that on both systems resume is substantially longer than suspend,
> even if all devices are suspended and resumed synchronously, is quite
> interesting.
Yes, it is. But it doesn't mean that suspend won't benefit from
asynchronicity; it just means that the benefits might not be as large
as they are for resume.
Alan Stern
On Sunday 20 December 2009, Alan Stern wrote:
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> > > It's too early to come to this sort of conclusion (i.e., that suspend
> > > and resume react very differently to an asynchronous approach). Unless
> > > you have some definite _reason_ for thinking that resume will benefit
> > > more than suspend, you shouldn't try to generalize so much from tests
> > > on only two systems.
> >
> > In fact I have one reason. Namely, the things that drivers do on suspend and
> > resume are evidently quite different and on these two systems I was able to
> > test they apparently took different amounts of time to complete.
> >
> > The very fact that on both systems resume is substantially longer than suspend,
> > even if all devices are suspended and resumed synchronously, is quite
> > interesting.
>
> Yes, it is. But it doesn't mean that suspend won't benefit from
> asynchronicity; it just means that the benefits might not be as large
> as they are for resume.
Agreed, although that raises the question whether they are sufficiently
significant. I guess time will tell. With the i8042 done asynchronously they
are IMO.
BTW, what's the right place to call device_enable_async_suspend() for USB
devices?
Rafael
On Sunday 20 December 2009, Dmitry Torokhov wrote:
> On Sat, Dec 19, 2009 at 04:09:07PM -0800, Linus Torvalds wrote:
> >
> > That said, I still get the feeling that we'd be even better off simply
> > trying to avoid the whole keyboard reset entirely. Apparently we do it for
> > a few HP laptops.
>
> I was mistaken, HP laptops do not like mouse disabled when suspending,
> not sure about the rest of the state.
>
> > It's entirely possible that we'd be better off simply
> > not _doing_ the slow thing in the first place.
> >
>
> The reset appeared first in 2.5.42. I expect that some BIOSes get very
> confused when they find mouse speaking something that they do not
> understand (i.e. synaptics, ALPS or anything else that is not bare PS/2
> or intellimouse), but maybe Vojtech remembers better?
>
> > For example, we may be _much_ better off doing that whole keyboard reset
> > at resume time than at suspend time.
>
> We do the reset for the different reasons - at resume we want the device
> in known state to ensure that it properly responds to the probes we
> send to it. At suspend we are trying to reset things into original state so
> that the firmware will not be confused.
>
> If we want to try to live without reset we could do PSMOUSE_CMD_RESET_DIS
> instead of PSMOUSE_CMD_RESET_BAT which is much heavier. We should
> probably not wait for .34 then because the bulk of testing will happen
> only when .33 is close to being released because that's when most of
> regular users will start using the new code and try to suspend and
> resume.
>
> Rafael, how long does suspend take if you change call to psmouse_reset()
> in psmouse_cleanup() to ps2_command(&psmouse->ps2dev, NULL, PSMOUSE_CMD_RESET_DIS)?
> And do the same for atkbd...
On the nx6325 that appears to reduce the suspend time so much that the effect
of async is not visible any more. On the Wind it decreases the total suspend
time almost by half!
Please push this patch to Linus. :-)
> BTW, making just serio asynchronous while keeping i8042 synchronous
> makes no sense because I serialize access to i8042 - the thing does not
> survive simultaneous [command] access to both keyboard and mouse...
OK
Rafael
On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
> BTW, what's the right place to call device_enable_async_suspend() for USB
> devices?
For USB devices, it's in drivers/usb/core/hub.c:usb_new_device()
anywhere before the call to usb_device_add().
For USB interfaces, it's in
drivers/usb/core/message.c:usb_set_configuration() before the call to
device_add().
For USB endpoints, it's in
drivers/usb/core/endpoint.c:usb_create_ep_devs() before the call to
device_register().
However you won't need to do it for interfaces and endpoints if you
automatically treat as async any device without suspend/resume
callbacks.
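As an illustration of the ordering only, here is a tiny userspace model in C. All names are hypothetical and the real driver core is not involved. The model ignores a late request purely to make the "before the device is added" ordering visible; whether the real core enforces anything like that is not stated here. The last helper models the "treat callback-less devices as async automatically" idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device as seen by the PM core. */
struct model_device {
    const char *name;
    bool has_pm_callbacks;  /* any suspend/resume routine at all */
    bool async_suspend;     /* explicitly marked async by its subsystem */
    bool registered;        /* already added to the device list */
};

/* Model of marking a device async; only honoured before registration. */
void model_enable_async_suspend(struct model_device *dev)
{
    assert(dev != NULL);
    if (!dev->registered)
        dev->async_suspend = true;
}

/* Model of adding the device to the device list. */
void model_device_add(struct model_device *dev)
{
    assert(dev != NULL);
    dev->registered = true;
}

/* Explicitly async, or implicitly async because it has no callbacks. */
bool model_treated_as_async(const struct model_device *dev)
{
    assert(dev != NULL);
    return dev->async_suspend || !dev->has_pm_callbacks;
}
```

Under this model a USB device needs the explicit call before it is added, whereas an interface or endpoint with no callbacks is treated as async without any call at all.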
Alan Stern
On Sunday 20 December 2009, Alan Stern wrote:
> On Sun, 20 Dec 2009, Rafael J. Wysocki wrote:
>
> > BTW, what's the right place to call device_enable_async_suspend() for USB
> > devices?
>
> For USB devices, it's in drivers/usb/core/hub.c:usb_new_device()
> anywhere before the call to usb_device_add().
>
> For USB interfaces, it's in
> drivers/usb/core/message.c:usb_set_configuration() before the call to
> device_add().
>
> For USB endpoints, it's in
> drivers/usb/core/endpoint.c:usb_create_ep_devs() before the call to
> device_register().
Thanks!
> However you won't need to do it for interfaces and endpoints if you
> automatically treat as async any device without suspend/resume
> callbacks.
I don't do that right now and I need these settings just for testing at the
moment.
Rafael
On Sun, Dec 20, 2009 at 08:25:25PM +0100, Rafael J. Wysocki wrote:
> On Sunday 20 December 2009, Dmitry Torokhov wrote:
> > On Sat, Dec 19, 2009 at 04:09:07PM -0800, Linus Torvalds wrote:
> > >
> > > That said, I still get the feeling that we'd be even better off simply
> > > trying to avoid the whole keyboard reset entirely. Apparently we do it for
> > > a few HP laptops.
> >
> > I was mistaken, HP laptops do not like mouse disabled when suspending,
> > not sure about the rest of the state.
> >
> > > It's entirely possible that we'd be better off simply
> > > not _doing_ the slow thing in the first place.
> > >
> >
> > The reset appeared first in 2.5.42. I expect that some BIOSes get very
> > confused when they find mouse speaking something that they do not
> > understand (i.e. synaptics, ALPS or anything else that is not bare PS/2
> > or intellimouse), but maybe Vojtech remembers better?
> >
> > > For example, we may be _much_ better off doing that whole keyboard reset
> > > at resume time than at suspend time.
> >
> > We do the reset for the different reasons - at resume we want the device
> > in known state to ensure that it properly responds to the probes we
> > send to it. At suspend we are trying to reset things into original state so
> > that the firmware will not be confused.
> >
> > If we want to try to live without reset we could do PSMOUSE_CMD_RESET_DIS
> > instead of PSMOUSE_CMD_RESET_BAT which is much heavier. We should
> > probably not wait for .34 then because the bulk of testing will happen
> > only when .33 is close to being released because that's when most of
> > regular users will start using the new code and try to suspend and
> > resume.
> >
> > Rafael, how long does suspend take if you change call to psmouse_reset()
> > in psmouse_cleanup() to ps2_command(&psmouse->ps2dev, NULL, PSMOUSE_CMD_RESET_DIS)?
> > And do the same for atkbd...
>
> On the nx6325 that appears to reduce the suspend time so much that the effect
> of async is not visible any more. On the Wind it decreases the total suspend
> time almost by half!
>
> Please push this patch to Linus. :-)
>
Let's see if I manage to solicit some testers first. FWIW it seems to be
working on my boxes.
But if this works then I am not sure we even want to bother with async
suspend of i8042 and serios. And serio already does resume
asynchronously through kseriod.
--
Dmitry
On Sun 2009-12-06 22:00:53, Dmitry Torokhov wrote:
> On Sun, Dec 06, 2009 at 09:26:00PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 18:27:56 -0800
> > Dmitry Torokhov <[email protected]> wrote:
> >
> > > On Sun, Dec 06, 2009 at 04:55:51PM -0800, Arjan van de Ven wrote:
> > > > On Sun, 6 Dec 2009 14:54:48 -0800
> > > > Dmitry Torokhov <[email protected]> wrote:
> > > >
> > > > > > isn't serio the PS/2 stuff?
> > > > >
> > > > > Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> > > > > from device reset (needed by some keyboard controllers - I
> > > > > > remember HP - or it and keyboard will be dead at resume).
> > > >
> > > > and I have a HP laptop... so this makes perfect sense.
> > > > > Thanks for the explanation!
> > > >
> > >
> > > Well, we do it for everyone, it's just a particular series of HPs
> > > forced us to add it.
> > >
> > wonder if it should be a DMI based quirk instead...
> >
>
> I have not received reports where it causes harm or reduces
> functionality so I'd prefer having it by default and not try to race
> with manufacturers.
Well, it slows down everyone... and people are actually testing with
linux, so it makes this problem more common on new systems.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun, Dec 20, 2009 at 11:39:15PM -0800, Dmitry Torokhov wrote:
> > On the nx6325 that appears to reduce the suspend time so much that the effect
> > of async is not visible any more. On the Wind it decreases the total suspend
> > time almost by half!
> >
> > Please push this patch to Linus. :-)
> >
>
> Let's see if I manage to solicit some testers first. FWIW it seems to be
> working on my boxes.
>
> But if this works then I am not sure we even want to bother with async
> suspend of i8042 and serios. And serio already does resume
> asynchronously through kseriod.
I'm kind of wondering where this will break, but I don't remember why
the RESET_BAT was put in exactly - the point of making sure the BIOS
doesn't get confused by the advanced modes is correct, and is required
at least when a keyboard is set to "Set 3", but RESET_BAT is too heavy a
hammer anyway - we could just make sure to switch the kbd/mouse to
'default' modes instead of doing a full reset.
--
Vojtech Pavlik
Director SuSE Labs
Hi!
> > That's partly why I really did suggest that we do the async stuff purely in
> > the USB layer, rather than try to put it deeper in the device layer. And
> > if we do support it "natively" in the device layer like Rafael's latest
> > patch, I still think we should be very very nervous about making devices
> > async unless there is a measured - and very noticeable - advantage.
>
> Agreed. Arjan's measurements indicated that USB was one of the biggest
> offenders; everything else other than the PS/2 mouse was much faster.
> Given these results there isn't much incentive to do anything else
> asynchronously.
>
> (However other devices not present on Arjan's machine may be a
> different story. Spinning up multiple external disks is a good example
> -- although here it may be necessary for the driver to take charge,
> because spinning up a disk requires a lot of power and doing too many
> of them at the same time could be bad.)
Well, system would better be able to supply enough current... because
usb disks auto-sleep on their own, and then something like async ls -l
/*/* would kill your machine...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html