2008-03-24 17:40:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

Hi,

This is the 3rd revision of the patch introducing new callbacks for suspend
and hibernation. It has been tested on x86-64.

I did my best to address the Alan's comments regarding the 2nd revision of the
patch. In particular:
* The ->prepare() callbacks are now executed in the forward order to avoid
races between the PM core and the registration of new devices. Analogously,
the ->complete() callbacks are executed in the reverse order.
* Only one list, called dpm_list, is used by the PM core its operations and
the dev->power.status field is used to mark the current status of the device
wrt the suspend/hibernation operations.
* The registrations of parentless devices are disabled before the first
->prepare() method is called and enabled before the first ->resume() method
is called
* The registration of new devices with a parent is disabled right after the
parent's ->prepare() is called and enabled just prior to calling the parent's
->resume().

The patch is on top of the "PM: Remove destroy_suspended_device()" patch
I sent yesterday.

Please review.

Thanks,
Rafael

---
From: Rafael J. Wysocki <[email protected]>

Introduce 'struct pm_ops' and 'struct pm_noirq_ops' representing
suspend and hibernation operations for bus types, device classes and
device types.

Modify the PM core to use 'struct pm_ops' and 'struct pm_noirq_ops'
objects, if defined, instead of the ->suspend() and ->resume() or,
respectively, ->suspend_late() and ->resume_early() callbacks that
will be considered as legacy and gradually phased out.

The main purpose of doing this is to separate suspend (aka S2RAM and
standby) callbacks from hibernation callbacks in such a way that the
new callbacks won't take arguments and the semantics of each of them
will be clearly specified. ?This has been requested for multiple
times by many people, including Linus himself, and the reason is that
within the current scheme if ->resume() is called, for example, it's
difficult to say why it's been called (ie. is it a resume from RAM or
from hibernation or a suspend/hibernation failure etc.?).

The second purpose is to make the suspend/hibernation callbacks more
flexible so that device drivers can handle more than they can within
the current scheme. For example, some drivers may need to prevent
new children of the device from being registered before their
->suspend() callbacks are executed or they may want to carry out some
operations requiring the availability of some other devices, not
directly bound via the parent-child relationship, in order to prepare
for the execution of ->suspend(), etc.

Ultimately, we'd like to stop using the freezing of tasks for suspend
and therefore the drivers' suspend/hibernation code will have to take
care of the handling of the user space during suspend/hibernation.
That, in turn, would be difficult within the current scheme, without
the new ->prepare() and ->complete() callbacks.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

arch/x86/kernel/apm_32.c | 4
drivers/base/power/main.c | 650 +++++++++++++++++++++++++++++++++------------
drivers/base/power/power.h | 2
drivers/base/power/trace.c | 4
include/linux/device.h | 8
include/linux/pm.h | 278 +++++++++++++++++--
kernel/power/disk.c | 14
kernel/power/main.c | 4
8 files changed, 763 insertions(+), 201 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -114,7 +114,9 @@ typedef struct pm_message {
int event;
} pm_message_t;

-/*
+/**
+ * struct pm_ops - device PM callbacks
+ *
* Several driver power state transitions are externally visible, affecting
* the state of pending I/O queues and (for drivers that touch hardware)
* interrupts, wakeups, DMA, and other hardware state. There may also be
@@ -122,6 +124,250 @@ typedef struct pm_message {
* to the rest of the driver stack (such as a driver that's ON gating off
* clocks which are not in active use).
*
+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ * its hardware state. Prevent new children of the device from being
+ * registered and prevent new calls to the probe method from being made
+ * after @prepare() returns. If @prepare() detects a situation it cannot
+ * handle (e.g. registration of a child already in progress), it may return
+ * -EAGAIN, so that the PM core can execute it once again (e.g. after the
+ * new child has been registered) to recover from the race condition. This
+ * method is executed for all kinds of suspend transitions and is followed
+ * by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().
+ * The PM core executes @prepare() for all devices before starting to
+ * execute suspend callbacks for any of them, so drivers may assume all of
+ * the other devices to be present and functional while @prepare() is being
+ * executed. However, they may NOT assume anything about the availability
+ * of the user space at that time.
+ *
+ * @complete: Undo the changes made by @prepare(). This method is executed for
+ * all kinds of resume transitions, following one of the resume callbacks:
+ * @resume(), @thaw(), @restore(). Also called if the state transition
+ * fails before the driver's suspend callback (@suspend(), @freeze(),
+ * @poweroff()) can be executed (e.g. if the suspend callback fails for one
+ * of the other devices that the PM core has unsucessfully attempted to
+ * suspend earlier).
+ * The PM core executes @complete() after it has executed the appropriate
+ * resume callback for all devices.
+ *
+ * @suspend: Executed before putting the system into a sleep state in which the
+ * contents of main memory are preserved. Quiesce the device, put it into
+ * a low power state appropriate for the upcoming system state (such as
+ * PCI_D3hot), and enable wakeup events as appropriate.
+ *
+ * @resume: Executed after waking the system up from a sleep state in which the
+ * contents of main memory were preserved. Put the device into the
+ * appropriate state, according to the information saved in memory by the
+ * preceding @suspend(). The driver starts working again, responding to
+ * hardware events and software requests. The hardware may have gone
+ * through a power-off reset, or it may have maintained state from the
+ * previous suspend() which the driver may rely on while resuming. On most
+ * platforms, there are no restrictions on availability of resources like
+ * clocks during @resume().
+ *
+ * @freeze: Hibernation-specific, executed before creating a hibernation image.
+ * Quiesce operations so that a consistent image can be created, but do NOT
+ * otherwise put the device into a low power device state and do NOT emit
+ * system wakeup events. Save in main memory the device settings to be
+ * used by @restore() during the subsequent resume from hibernation or by
+ * the subsequent @thaw(), if the creation of the image or the restoration
+ * of main memory contents from it fails.
+ *
+ * @thaw: Hibernation-specific, executed after creating a hibernation image OR
+ * if the creation of the image fails. Also executed after a failing
+ * attempt to restore the contents of main memory from such an image.
+ * Undo the changes made by the preceding @freeze(), so the device can be
+ * operated in the same way as immediately before the call to @freeze().
+ *
+ * @poweroff: Hibernation-specific, executed after saving a hibernation image.
+ * Quiesce the device, put it into a low power state appropriate for the
+ * upcoming system state (such as PCI_D3hot), and enable wakeup events as
+ * appropriate.
+ *
+ * @restore: Hibernation-specific, executed after restoring the contents of main
+ * memory from a hibernation image. Driver starts working again,
+ * responding to hardware events and software requests. Drivers may NOT
+ * make ANY assumptions about the hardware state right prior to @restore().
+ * On most platforms, there are no restrictions on availability of
+ * resources like clocks during @restore().
+ *
+ * All of the above callbacks, except for @complete(), return error codes.
+ * However, the error codes returned by the resume operations, @resume(),
+ * @thaw(), and @restore(), do not cause the PM core to abort the resume
+ * transition during which they are returned.
+ */
+
+struct pm_ops {
+ int (*prepare)(struct device *dev);
+ void (*complete)(struct device *dev);
+ int (*suspend)(struct device *dev);
+ int (*resume)(struct device *dev);
+ int (*freeze)(struct device *dev);
+ int (*thaw)(struct device *dev);
+ int (*poweroff)(struct device *dev);
+ int (*restore)(struct device *dev);
+};
+
+/**
+ * struct pm_noirq_ops - device PM callbacks executed with interrupts disabled
+ *
+ * The following callbacks included in 'struct pm_noirq_ops' are executed with
+ * interrupts disabled and with the nonboot CPUs switched off:
+ *
+ * @suspend_noirq: Complete the operations of ->suspend() by carrying out any
+ * actions required for suspending the device that need interrupts to be
+ * disabled
+ *
+ * @resume_noirq: Prepare for the execution of ->resume() by carrying out any
+ * actions required for resuming the device that need interrupts to be
+ * disabled
+ *
+ * @freeze_noirq: Complete the operations of ->freeze() by carrying out any
+ * actions required for freezing the device that need interrupts to be
+ * disabled
+ *
+ * @thaw_noirq: Prepare for the execution of ->thaw() by carrying out any
+ * actions required for thawing the device that need interrupts to be
+ * disabled
+ *
+ * @poweroff_noirq: Complete the operations of ->poweroff() by carrying out any
+ * actions required for handling the device that need interrupts to be
+ * disabled
+ *
+ * @restore_noirq: Prepare for the execution of ->restore() by carrying out any
+ * actions required for restoring the operations of the device that need
+ * interrupts to be disabled
+ *
+ * All of the above callbacks, return error codes, but the error codes returned
+ * by the resume operations, @resume_noirq(), @thaw_noirq(), and
+ * @restore_noirq(), do not cause the PM core to abort the resume transition
+ * during which they are returned.
+ */
+
+struct pm_noirq_ops {
+ int (*suspend_noirq)(struct device *dev);
+ int (*resume_noirq)(struct device *dev);
+ int (*freeze_noirq)(struct device *dev);
+ int (*thaw_noirq)(struct device *dev);
+ int (*poweroff_noirq)(struct device *dev);
+ int (*restore_noirq)(struct device *dev);
+};
+
+/**
+ * PM_EVENT_ messages
+ *
+ * The following PM_EVENT_ messages are defined for the internal use of the PM
+ * core, in order to provide a mechanism allowing the high level suspend and
+ * hibernation code to convey the necessary information to the device PM core
+ * code:
+ *
+ * ON No transition.
+ *
+ * FREEZE System is going to hibernate, call ->prepare() and ->freeze()
+ * for all devices.
+ *
+ * SUSPEND System is going to suspend, call ->prepare() and ->suspend()
+ * for all devices.
+ *
+ * HIBERNATE Hibernation image has been saved, call ->prepare() and
+ * ->poweroff() for all devices.
+ *
+ * QUIESCE Contents of main memory are going to be restored from a (loaded)
+ * hibernation image, call ->prepare() and ->freeze() for all
+ * devices.
+ *
+ * RESUME System is resuming, call ->resume() and ->complete() for all
+ * devices.
+ *
+ * THAW Hibernation image has been created, call ->thaw() and
+ * ->complete() for all devices.
+ *
+ * RESTORE Contents of main memory have been restored from a hibernation
+ * image, call ->restore() and ->complete() for all devices.
+ *
+ * RECOVER Creation of a hibernation image or restoration of the main
+ * memory contents from a hibernation image has failed, call
+ * ->thaw() and ->complete() for all devices.
+ */
+
+#define PM_EVENT_ON 0x0000
+#define PM_EVENT_FREEZE 0x0001
+#define PM_EVENT_SUSPEND 0x0002
+#define PM_EVENT_HIBERNATE 0x0004
+#define PM_EVENT_QUIESCE 0x0008
+#define PM_EVENT_RESUME 0x0010
+#define PM_EVENT_THAW 0x0020
+#define PM_EVENT_RESTORE 0x0040
+#define PM_EVENT_RECOVER 0x0080
+
+#define PM_EVENT_SLEEP (PM_EVENT_SUSPEND | PM_EVENT_HIBERNATE)
+
+#define PMSG_FREEZE ((struct pm_message){ .event = PM_EVENT_FREEZE, })
+#define PMSG_QUIESCE ((struct pm_message){ .event = PM_EVENT_QUIESCE, })
+#define PMSG_SUSPEND ((struct pm_message){ .event = PM_EVENT_SUSPEND, })
+#define PMSG_HIBERNATE ((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
+#define PMSG_RESUME ((struct pm_message){ .event = PM_EVENT_RESUME, })
+#define PMSG_THAW ((struct pm_message){ .event = PM_EVENT_THAW, })
+#define PMSG_RESTORE ((struct pm_message){ .event = PM_EVENT_RESTORE, })
+#define PMSG_RECOVER ((struct pm_message){ .event = PM_EVENT_RECOVER, })
+#define PMSG_ON ((struct pm_message){ .event = PM_EVENT_ON, })
+
+/**
+ * Device power management states
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.
+ *
+ * DPM_ON Device is regarded as operational. Set this way
+ * initially and when ->complete() is about to be called.
+ * Also set when ->prepare() fails.
+ *
+ * DPM_RESUMING Device is going to be resumed. Set when ->resume(),
+ * ->thaw(), or ->restore() is about to be called.
+ *
+ * DPM_SUSPENDING Device has been prepared for a power transition. Set
+ * when ->prepare() has just succeeded.
+ *
+ * DPM_OFF Device is regarded as inactive. Set immediately after
+ * ->suspend(), ->freeze(), or ->poweroff() has succeeded.
+ * Also set when ->resume()_noirq, ->thaw_noirq(), or
+ * ->restore_noirq() is about to be called.
+ *
+ * DPM_OFF_IRQ Device is in a "deep sleep". Set immediately after
+ * ->suspend_noirq(), ->freeze_noirq(), or
+ * ->poweroff_noirq() has just succeeded.
+ */
+
+enum dpm_state {
+ DPM_ON,
+ DPM_RESUMING,
+ DPM_SUSPENDING,
+ DPM_OFF,
+ DPM_OFF_IRQ,
+};
+
+struct dev_pm_info {
+ pm_message_t power_state;
+ unsigned can_wakeup:1;
+ unsigned should_wakeup:1;
+ enum dpm_state status; /* Owned by the PM core */
+#ifdef CONFIG_PM_SLEEP
+ struct list_head entry;
+#endif
+};
+
+/*
+ * The PM_EVENT_ messages are also used by drivers implementing the legacy
+ * suspend framework, based on the ->suspend() and ->resume() callbacks common
+ * for suspend and hibernation transitions, according to the rules below.
+ */
+
+/* Necessary, because several drivers use PM_EVENT_PRETHAW */
+#define PM_EVENT_PRETHAW PM_EVENT_QUIESCE
+
+/*
* One transition is triggered by resume(), after a suspend() call; the
* message is implicit:
*
@@ -166,35 +412,11 @@ typedef struct pm_message {
* or from system low-power states such as standby or suspend-to-RAM.
*/

-#define PM_EVENT_ON 0
-#define PM_EVENT_FREEZE 1
-#define PM_EVENT_SUSPEND 2
-#define PM_EVENT_HIBERNATE 4
-#define PM_EVENT_PRETHAW 8
-
-#define PM_EVENT_SLEEP (PM_EVENT_SUSPEND | PM_EVENT_HIBERNATE)
-
-#define PMSG_FREEZE ((struct pm_message){ .event = PM_EVENT_FREEZE, })
-#define PMSG_PRETHAW ((struct pm_message){ .event = PM_EVENT_PRETHAW, })
-#define PMSG_SUSPEND ((struct pm_message){ .event = PM_EVENT_SUSPEND, })
-#define PMSG_HIBERNATE ((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
-#define PMSG_ON ((struct pm_message){ .event = PM_EVENT_ON, })
-
-struct dev_pm_info {
- pm_message_t power_state;
- unsigned can_wakeup:1;
- unsigned should_wakeup:1;
- bool sleeping:1; /* Owned by the PM core */
-#ifdef CONFIG_PM_SLEEP
- struct list_head entry;
-#endif
-};
+#ifdef CONFIG_PM_SLEEP
+extern void device_power_up(pm_message_t state);
+extern void device_resume(pm_message_t state);

extern int device_power_down(pm_message_t state);
-extern void device_power_up(void);
-extern void device_resume(void);
-
-#ifdef CONFIG_PM_SLEEP
extern int device_suspend(pm_message_t state);
extern int device_prepare_suspend(pm_message_t state);

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -12,11 +12,9 @@
* and add it to the list of power-controlled devices. sysfs entries for
* controlling device power management will also be added.
*
- * A different set of lists than the global subsystem list are used to
- * keep track of power info because we use different lists to hold
- * devices based on what stage of the power management process they
- * are in. The power domain dependencies may also differ from the
- * ancestral dependencies that the subsystem list maintains.
+ * A separate list is used for keeping track of power info, because the power
+ * domain dependencies may differ from the ancestral dependencies that the
+ * subsystem list maintains.
*/

#include <linux/device.h>
@@ -30,31 +28,24 @@
#include "power.h"

/*
- * The entries in the dpm_active list are in a depth first order, simply
+ * The entries in the dpm_list list are in a depth first order, simply
* because children are guaranteed to be discovered after parents, and
* are inserted at the back of the list on discovery.
*
- * All the other lists are kept in the same order, for consistency.
- * However the lists aren't always traversed in the same order.
- * Semaphores must be acquired from the top (i.e., front) down
- * and released in the opposite order. Devices must be suspended
- * from the bottom (i.e., end) up and resumed in the opposite order.
- * That way no parent will be suspended while it still has an active
- * child.
- *
* Since device_pm_add() may be called with a device semaphore held,
* we must never try to acquire a device semaphore while holding
* dpm_list_mutex.
*/

-LIST_HEAD(dpm_active);
-static LIST_HEAD(dpm_off);
-static LIST_HEAD(dpm_off_irq);
+LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);

-/* 'true' if all devices have been suspended, protected by dpm_list_mtx */
-static bool all_sleeping;
+/*
+ * Set once the preparation of devices for a PM transition has started, reset
+ * before starting to resume devices. Protected by dpm_list_mtx.
+ */
+static bool transition_started;

/**
* device_pm_add - add a device to the list of active devices
@@ -68,22 +59,30 @@ int device_pm_add(struct device *dev)
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
mutex_lock(&dpm_list_mtx);
- if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
- if (dev->parent->power.sleeping)
- dev_warn(dev,
- "parent %s is sleeping, will not add\n",
+ if (dev->parent) {
+ if (dev->parent->power.status > DPM_RESUMING) {
+ dev_warn(dev, "parent %s is sleeping, will not add\n",
dev->parent->bus_id);
- else
- dev_warn(dev, "devices are sleeping, will not add\n");
- WARN_ON(true);
- error = -EBUSY;
- } else {
- error = dpm_sysfs_add(dev);
- if (!error)
- list_add_tail(&dev->power.entry, &dpm_active);
+ goto Refuse;
+ }
+ } else if (transition_started) {
+ /*
+ * We refuse to register parentless devices while a PM
+ * transition is in progress in order to avoid leaving them
+ * unhandled down the road
+ */
+ goto Refuse;
}
+ error = dpm_sysfs_add(dev);
+ if (!error)
+ list_add_tail(&dev->power.entry, &dpm_list);
+ End:
mutex_unlock(&dpm_list_mtx);
return error;
+ Refuse:
+ WARN_ON(true);
+ error = -EBUSY;
+ goto End;
}

/**
@@ -103,73 +102,230 @@ void device_pm_remove(struct device *dev
mutex_unlock(&dpm_list_mtx);
}

+/**
+ * pm_op - execute the PM operation appropiate for given PM event
+ * @dev: Device.
+ * @ops: PM operations to choose from.
+ * @state: PM transition of the system being carried out.
+ */
+static int pm_op(struct device *dev, struct pm_ops *ops, pm_message_t state)
+{
+ int error = 0;
+
+ switch (state.event) {
+#ifdef CONFIG_SUSPEND
+ case PM_EVENT_SUSPEND:
+ if (ops->suspend) {
+ error = ops->suspend(dev);
+ suspend_report_result(ops->suspend, error);
+ }
+ break;
+ case PM_EVENT_RESUME:
+ if (ops->resume) {
+ error = ops->resume(dev);
+ suspend_report_result(ops->resume, error);
+ }
+ break;
+#endif /* CONFIG_SUSPEND */
+#ifdef CONFIG_HIBERNATION
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ if (ops->freeze) {
+ error = ops->freeze(dev);
+ suspend_report_result(ops->freeze, error);
+ }
+ break;
+ case PM_EVENT_HIBERNATE:
+ if (ops->poweroff) {
+ error = ops->poweroff(dev);
+ suspend_report_result(ops->poweroff, error);
+ }
+ break;
+ case PM_EVENT_THAW:
+ case PM_EVENT_RECOVER:
+ if (ops->thaw) {
+ error = ops->thaw(dev);
+ suspend_report_result(ops->thaw, error);
+ }
+ break;
+ case PM_EVENT_RESTORE:
+ if (ops->restore) {
+ error = ops->restore(dev);
+ suspend_report_result(ops->restore, error);
+ }
+ break;
+#endif /* CONFIG_HIBERNATION */
+ default:
+ error = -EINVAL;
+ }
+ return error;
+}
+
+/**
+ * pm_noirq_op - execute the PM operation appropiate for given PM event
+ * @dev: Device.
+ * @ops: PM operations to choose from.
+ * @state: PM transition of the system being carried out.
+ *
+ * The operation is executed with interrupts disabled by the only remaining
+ * functional CPU in the system.
+ */
+static int pm_noirq_op(struct device *dev, struct pm_noirq_ops *ops,
+ pm_message_t state)
+{
+ int error = 0;
+
+ switch (state.event) {
+#ifdef CONFIG_SUSPEND
+ case PM_EVENT_SUSPEND:
+ if (ops->suspend_noirq) {
+ error = ops->suspend_noirq(dev);
+ suspend_report_result(ops->suspend_noirq, error);
+ }
+ break;
+ case PM_EVENT_RESUME:
+ if (ops->resume_noirq) {
+ error = ops->resume_noirq(dev);
+ suspend_report_result(ops->resume_noirq, error);
+ }
+ break;
+#endif /* CONFIG_SUSPEND */
+#ifdef CONFIG_HIBERNATION
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ if (ops->freeze_noirq) {
+ error = ops->freeze_noirq(dev);
+ suspend_report_result(ops->freeze_noirq, error);
+ }
+ break;
+ case PM_EVENT_HIBERNATE:
+ if (ops->poweroff_noirq) {
+ error = ops->poweroff_noirq(dev);
+ suspend_report_result(ops->poweroff_noirq, error);
+ }
+ break;
+ case PM_EVENT_THAW:
+ case PM_EVENT_RECOVER:
+ if (ops->thaw_noirq) {
+ error = ops->thaw_noirq(dev);
+ suspend_report_result(ops->thaw_noirq, error);
+ }
+ break;
+ case PM_EVENT_RESTORE:
+ if (ops->restore_noirq) {
+ error = ops->restore_noirq(dev);
+ suspend_report_result(ops->restore_noirq, error);
+ }
+ break;
+#endif /* CONFIG_HIBERNATION */
+ default:
+ error = -EINVAL;
+ }
+ return error;
+}
+
+static char *pm_verb(int event)
+{
+ switch (event) {
+ case PM_EVENT_SUSPEND:
+ return "suspend";
+ case PM_EVENT_RESUME:
+ return "resume";
+ case PM_EVENT_FREEZE:
+ return "freeze";
+ case PM_EVENT_QUIESCE:
+ return "quiesce";
+ case PM_EVENT_HIBERNATE:
+ return "hibernate";
+ case PM_EVENT_THAW:
+ return "thaw";
+ case PM_EVENT_RESTORE:
+ return "restore";
+ default:
+ return "(unknown PM event)";
+ }
+}
+
+static void pm_dev_dbg(struct device *dev, pm_message_t state, char *info)
+{
+ dev_dbg(dev, "%s%s%s\n", info, pm_verb(state.event),
+ ((state.event & PM_EVENT_SLEEP) && device_may_wakeup(dev)) ?
+ ", may wakeup" : "");
+}
+
/*------------------------- Resume routines -------------------------*/

/**
- * resume_device_early - Power on one device (early resume).
+ * resume_device_noirq - Power on one device (early resume).
* @dev: Device.
+ * @state: PM transition of the system being carried out.
*
* Must be called with interrupts disabled.
*/
-static int resume_device_early(struct device *dev)
+static int resume_device_noirq(struct device *dev, pm_message_t state)
{
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

- if (dev->bus && dev->bus->resume_early) {
- dev_dbg(dev, "EARLY resume\n");
+ if (!dev->bus)
+ goto End;
+
+ if (dev->bus->pm_noirq) {
+ pm_dev_dbg(dev, state, "EARLY ");
+ error = pm_noirq_op(dev, dev->bus->pm_noirq, state);
+ } else if (dev->bus->resume_early) {
+ pm_dev_dbg(dev, state, "legacy EARLY ");
error = dev->bus->resume_early(dev);
}
-
+ End:
TRACE_RESUME(error);
return error;
}

/**
* dpm_power_up - Power on all regular (non-sysdev) devices.
+ * @state: PM transition of the system being carried out.
*
- * Walk the dpm_off_irq list and power each device up. This
- * is used for devices that required they be powered down with
- * interrupts disabled. As devices are powered on, they are moved
- * to the dpm_off list.
+ * Execute the appropriate "noirq resume" callback for all devices marked
+ * as DPM_OFF_IRQ.
*
* Must be called with interrupts disabled and only one CPU running.
*/
-static void dpm_power_up(void)
+static void dpm_power_up(pm_message_t state)
{
+ struct device *dev;

- while (!list_empty(&dpm_off_irq)) {
- struct list_head *entry = dpm_off_irq.next;
- struct device *dev = to_device(entry);
-
- list_move_tail(entry, &dpm_off);
- resume_device_early(dev);
- }
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ if (dev->power.status > DPM_OFF) {
+ dev->power.status = DPM_OFF;
+ resume_device_noirq(dev, state);
+ }
}

/**
* device_power_up - Turn on all devices that need special attention.
+ * @state: PM transition of the system being carried out.
*
* Power on system devices, then devices that required we shut them down
* with interrupts disabled.
*
* Must be called with interrupts disabled.
*/
-void device_power_up(void)
+void device_power_up(pm_message_t state)
{
sysdev_resume();
- dpm_power_up();
+ dpm_power_up(state);
}
EXPORT_SYMBOL_GPL(device_power_up);

/**
* resume_device - Restore state for one device.
* @dev: Device.
- *
+ * @state: PM transition of the system being carried out.
*/
-static int resume_device(struct device *dev)
+static int resume_device(struct device *dev, pm_message_t state)
{
int error = 0;

@@ -178,21 +334,40 @@ static int resume_device(struct device *

down(&dev->sem);

- if (dev->bus && dev->bus->resume) {
- dev_dbg(dev,"resuming\n");
- error = dev->bus->resume(dev);
+ if (dev->bus) {
+ if (dev->bus->pm) {
+ pm_dev_dbg(dev, state, "");
+ error = pm_op(dev, dev->bus->pm, state);
+ } else if (dev->bus->resume) {
+ pm_dev_dbg(dev, state, "legacy ");
+ error = dev->bus->resume(dev);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->type && dev->type->resume) {
- dev_dbg(dev,"resuming\n");
- error = dev->type->resume(dev);
+ if (dev->type) {
+ if (dev->type->pm) {
+ pm_dev_dbg(dev, state, "type ");
+ error = pm_op(dev, dev->type->pm, state);
+ } else if (dev->type->resume) {
+ pm_dev_dbg(dev, state, "legacy type ");
+ error = dev->type->resume(dev);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->class && dev->class->resume) {
- dev_dbg(dev,"class resume\n");
- error = dev->class->resume(dev);
+ if (dev->class) {
+ if (dev->class->pm) {
+ pm_dev_dbg(dev, state, "class ");
+ error = pm_op(dev, dev->class->pm, state);
+ } else if (dev->class->resume) {
+ pm_dev_dbg(dev, state, "legacy class ");
+ error = dev->class->resume(dev);
+ }
}
-
+ End:
up(&dev->sem);

TRACE_RESUME(error);
@@ -201,78 +376,154 @@ static int resume_device(struct device *

/**
* dpm_resume - Resume every device.
+ * @state: PM transition of the system being carried out.
*
- * Resume the devices that have either not gone through
- * the late suspend, or that did go through it but also
- * went through the early resume.
+ * Execute the appropriate "resume" callback for all devices the status of
+ * which indicates that they are inactive.
+ */
+static void dpm_resume(pm_message_t state)
+{
+ struct list_head list;
+
+ INIT_LIST_HEAD(&list);
+ mutex_lock(&dpm_list_mtx);
+ transition_started = false;
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.next);
+
+ if (dev->power.status > DPM_SUSPENDING) {
+ dev->power.status = DPM_RESUMING;
+ get_device(dev);
+ mutex_unlock(&dpm_list_mtx);
+
+ resume_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ put_device(dev);
+ }
+ if (!list_empty(&dev->power.entry))
+ list_move_tail(&dev->power.entry, &list);
+ }
+ list_splice(&list, &dpm_list);
+ mutex_unlock(&dpm_list_mtx);
+}
+
+/**
+ * complete_device - Complete a PM transition for given device
+ * @dev: Device.
+ * @state: PM transition of the system being carried out.
+ */
+static void complete_device(struct device *dev, pm_message_t state)
+{
+ down(&dev->sem);
+
+ if (dev->bus && dev->bus->pm && dev->bus->pm->complete) {
+ pm_dev_dbg(dev, state, "completing ");
+ dev->bus->pm->complete(dev);
+ }
+
+ if (dev->type && dev->type->pm && dev->type->pm->complete) {
+ pm_dev_dbg(dev, state, "completing type ");
+ dev->type->pm->complete(dev);
+ }
+
+ if (dev->class && dev->class->pm && dev->class->pm->complete) {
+ pm_dev_dbg(dev, state, "completing class ");
+ dev->class->pm->complete(dev);
+ }
+
+ up(&dev->sem);
+}
+
+/**
+ * dpm_complete - Complete a PM transition for all devices.
+ * @state: PM transition of the system being carried out.
*
- * Take devices from the dpm_off_list, resume them,
- * and put them on the dpm_locked list.
+ * Execute the ->complete() callbacks for all devices that are not marked
+ * as DPM_ON.
*/
-static void dpm_resume(void)
+static void dpm_complete(pm_message_t state)
{
+ struct list_head list;
+
+ INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
- all_sleeping = false;
- while(!list_empty(&dpm_off)) {
- struct list_head *entry = dpm_off.next;
- struct device *dev = to_device(entry);
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.prev);

- list_move_tail(entry, &dpm_active);
- dev->power.sleeping = false;
- mutex_unlock(&dpm_list_mtx);
- resume_device(dev);
- mutex_lock(&dpm_list_mtx);
+ if (dev->power.status > DPM_ON) {
+ dev->power.status = DPM_ON;
+ get_device(dev);
+ mutex_unlock(&dpm_list_mtx);
+
+ complete_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ put_device(dev);
+ }
+ if (!list_empty(&dev->power.entry))
+ list_move(&dev->power.entry, &list);
}
+ list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
}

/**
* device_resume - Restore state of each device in system.
+ * @state: PM transition of the system being carried out.
*
* Resume all the devices, unlock them all, and allow new
* devices to be registered once again.
*/
-void device_resume(void)
+void device_resume(pm_message_t state)
{
might_sleep();
- dpm_resume();
+ dpm_resume(state);
+ dpm_complete(state);
}
EXPORT_SYMBOL_GPL(device_resume);


/*------------------------- Suspend routines -------------------------*/

-static inline char *suspend_verb(u32 event)
+/**
+ * resume_event - return a PM message representing the resume event
+ * corresponding to given sleep state.
+ * @sleep_state: PM message representing a sleep state.
+ */
+static pm_message_t resume_event(pm_message_t sleep_state)
{
- switch (event) {
- case PM_EVENT_SUSPEND: return "suspend";
- case PM_EVENT_FREEZE: return "freeze";
- case PM_EVENT_PRETHAW: return "prethaw";
- default: return "(unknown suspend event)";
+ switch (sleep_state.event) {
+ case PM_EVENT_SUSPEND:
+ return PMSG_RESUME;
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ return PMSG_RECOVER;
+ case PM_EVENT_HIBERNATE:
+ return PMSG_RESTORE;
}
-}
-
-static void
-suspend_device_dbg(struct device *dev, pm_message_t state, char *info)
-{
- dev_dbg(dev, "%s%s%s\n", info, suspend_verb(state.event),
- ((state.event == PM_EVENT_SUSPEND) && device_may_wakeup(dev)) ?
- ", may wakeup" : "");
+ return PMSG_ON;
}

/**
- * suspend_device_late - Shut down one device (late suspend).
+ * suspend_device_noirq - Shut down one device (late suspend).
* @dev: Device.
- * @state: Power state device is entering.
+ * @state: PM transition of the system being carried out.
*
* This is called with interrupts off and only a single CPU running.
*/
-static int suspend_device_late(struct device *dev, pm_message_t state)
+static int suspend_device_noirq(struct device *dev, pm_message_t state)
{
int error = 0;

- if (dev->bus && dev->bus->suspend_late) {
- suspend_device_dbg(dev, state, "LATE ");
+ if (!dev->bus)
+ return 0;
+
+ if (dev->bus->pm_noirq) {
+ pm_dev_dbg(dev, state, "LATE ");
+ error = pm_noirq_op(dev, dev->bus->pm_noirq, state);
+ } else if (dev->bus->suspend_late) {
+ pm_dev_dbg(dev, state, "legacy LATE ");
error = dev->bus->suspend_late(dev, state);
suspend_report_result(dev->bus->suspend_late, error);
}
@@ -281,37 +532,32 @@ static int suspend_device_late(struct de

/**
* device_power_down - Shut down special devices.
- * @state: Power state to enter.
+ * @state: PM transition of the system being carried out.
*
- * Power down devices that require interrupts to be disabled
- * and move them from the dpm_off list to the dpm_off_irq list.
+ * Power down devices that require interrupts to be disabled.
* Then power down system devices.
*
* Must be called with interrupts disabled and only one CPU running.
*/
int device_power_down(pm_message_t state)
{
+ struct device *dev;
int error = 0;

- while (!list_empty(&dpm_off)) {
- struct list_head *entry = dpm_off.prev;
- struct device *dev = to_device(entry);
-
- error = suspend_device_late(dev, state);
+ list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
+ error = suspend_device_noirq(dev, state);
if (error) {
printk(KERN_ERR "Could not power down device %s: "
"error %d\n",
kobject_name(&dev->kobj), error);
break;
}
- if (!list_empty(&dev->power.entry))
- list_move(&dev->power.entry, &dpm_off_irq);
+ dev->power.status = DPM_OFF_IRQ;
}
-
if (!error)
error = sysdev_suspend(state);
if (error)
- dpm_power_up();
+ dpm_power_up(resume_event(state));
return error;
}
EXPORT_SYMBOL_GPL(device_power_down);
@@ -319,7 +565,7 @@ EXPORT_SYMBOL_GPL(device_power_down);
/**
* suspend_device - Save state of one device.
* @dev: Device.
- * @state: Power state device is entering.
+ * @state: PM transition of the system being carried out.
*/
static int suspend_device(struct device *dev, pm_message_t state)
{
@@ -327,29 +573,43 @@ static int suspend_device(struct device

down(&dev->sem);

- if (dev->power.power_state.event) {
- dev_dbg(dev, "PM: suspend %d-->%d\n",
- dev->power.power_state.event, state.event);
- }
-
- if (dev->class && dev->class->suspend) {
- suspend_device_dbg(dev, state, "class ");
- error = dev->class->suspend(dev, state);
- suspend_report_result(dev->class->suspend, error);
+ if (dev->class) {
+ if (dev->class->pm) {
+ pm_dev_dbg(dev, state, "class ");
+ error = pm_op(dev, dev->class->pm, state);
+ } else if (dev->class->suspend) {
+ pm_dev_dbg(dev, state, "legacy class ");
+ error = dev->class->suspend(dev, state);
+ suspend_report_result(dev->class->suspend, error);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->type && dev->type->suspend) {
- suspend_device_dbg(dev, state, "type ");
- error = dev->type->suspend(dev, state);
- suspend_report_result(dev->type->suspend, error);
+ if (dev->type) {
+ if (dev->type->pm) {
+ pm_dev_dbg(dev, state, "type ");
+ error = pm_op(dev, dev->type->pm, state);
+ } else if (dev->type->suspend) {
+ pm_dev_dbg(dev, state, "legacy type ");
+ error = dev->type->suspend(dev, state);
+ suspend_report_result(dev->type->suspend, error);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->bus && dev->bus->suspend) {
- suspend_device_dbg(dev, state, "");
- error = dev->bus->suspend(dev, state);
- suspend_report_result(dev->bus->suspend, error);
+ if (dev->bus) {
+ if (dev->bus->pm) {
+ pm_dev_dbg(dev, state, "");
+ error = pm_op(dev, dev->bus->pm, state);
+ } else if (dev->bus->suspend) {
+ pm_dev_dbg(dev, state, "legacy ");
+ error = dev->bus->suspend(dev, state);
+ suspend_report_result(dev->bus->suspend, error);
+ }
}
-
+ End:
up(&dev->sem);

return error;
@@ -357,67 +617,139 @@ static int suspend_device(struct device

/**
* dpm_suspend - Suspend every device.
- * @state: Power state to put each device in.
- *
- * Walk the dpm_locked list. Suspend each device and move it
- * to the dpm_off list.
+ * @state: PM transition of the system being carried out.
*
- * (For historical reasons, if it returns -EAGAIN, that used to mean
- * that the device would be called again with interrupts disabled.
- * These days, we use the "suspend_late()" callback for that, so we
- * print a warning and consider it an error).
+ * Execute the appropriate "suspend" callbacks for all devices.
*/
static int dpm_suspend(pm_message_t state)
{
+ struct list_head list;
int error = 0;

+ INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
- while (!list_empty(&dpm_active)) {
- struct list_head *entry = dpm_active.prev;
- struct device *dev = to_device(entry);
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.prev);

- WARN_ON(dev->parent && dev->parent->power.sleeping);
-
- dev->power.sleeping = true;
+ get_device(dev);
mutex_unlock(&dpm_list_mtx);
+
error = suspend_device(dev, state);
+
mutex_lock(&dpm_list_mtx);
+ put_device(dev);
if (error) {
printk(KERN_ERR "Could not suspend device %s: "
- "error %d%s\n",
- kobject_name(&dev->kobj),
- error,
- (error == -EAGAIN ?
- " (please convert to suspend_late)" :
- ""));
- dev->power.sleeping = false;
+ "error %d\n", kobject_name(&dev->kobj), error);
+ list_splice_init(&dpm_list, &list);
break;
}
- if (!list_empty(&dev->power.entry))
- list_move(&dev->power.entry, &dpm_off);
+ if (!list_empty(&dev->power.entry)) {
+ dev->power.status = DPM_OFF;
+ list_move(&dev->power.entry, &list);
+ }
}
- if (!error)
- all_sleeping = true;
+ list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ return error;
+}
+
+/**
+ * prepare_device - Execute the ->prepare() callback(s) for given device.
+ * @dev: Device.
+ * @state: PM transition of the system being carried out.
+ */
+static int prepare_device(struct device *dev, pm_message_t state)
+{
+ int error = 0;
+
+ down(&dev->sem);
+
+ if (dev->bus && dev->bus->pm && dev->bus->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing ");
+ error = dev->bus->pm->prepare(dev);
+ suspend_report_result(dev->bus->pm->prepare, error);
+ if (error)
+ goto End;
+ }
+
+ if (dev->type && dev->type->pm && dev->type->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing type ");
+ error = dev->type->pm->prepare(dev);
+ suspend_report_result(dev->type->pm->prepare, error);
+ if (error)
+ goto End;
+ }
+
+ if (dev->class && dev->class->pm && dev->class->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing class ");
+ error = dev->class->pm->prepare(dev);
+ suspend_report_result(dev->class->pm->prepare, error);
+ }
+ End:
+ up(&dev->sem);

return error;
}

/**
+ * dpm_prepare - Prepare all devices for a PM transition.
+ * @state: PM transition of the system being carried out.
+ *
+ * Execute the ->prepare() callback for all devices.
+ */
+static int dpm_prepare(pm_message_t state)
+{
+ struct list_head list;
+ int error = 0;
+
+ INIT_LIST_HEAD(&list);
+ mutex_lock(&dpm_list_mtx);
+ transition_started = true;
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.next);
+
+ get_device(dev);
+ mutex_unlock(&dpm_list_mtx);
+
+ error = prepare_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ put_device(dev);
+ if (error) {
+ if (error == -EAGAIN)
+ continue;
+ printk(KERN_ERR "Could not prepare device %s "
+ "for suspend: error %d\n",
+ kobject_name(&dev->kobj), error);
+ break;
+ }
+ if (!list_empty(&dev->power.entry)) {
+ dev->power.status = DPM_SUSPENDING;
+ list_move_tail(&dev->power.entry, &list);
+ }
+ }
+ list_splice(&list, &dpm_list);
+ mutex_unlock(&dpm_list_mtx);
+ return error;
+}
+
+/**
* device_suspend - Save state and stop all devices in system.
- * @state: new power management state
+ * @state: PM transition of the system being carried out.
*
- * Prevent new devices from being registered, then lock all devices
- * and suspend them.
+ * Prepare and suspend all devices.
*/
int device_suspend(pm_message_t state)
{
int error;

might_sleep();
- error = dpm_suspend(state);
+ error = dpm_prepare(state);
+ if (!error)
+ error = dpm_suspend(state);
if (error)
- device_resume();
+ device_resume(resume_event(state));
return error;
}
EXPORT_SYMBOL_GPL(device_suspend);
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -69,6 +69,9 @@ struct bus_type {
int (*resume_early)(struct device *dev);
int (*resume)(struct device *dev);

+ struct pm_ops *pm;
+ struct pm_noirq_ops *pm_noirq;
+
struct bus_type_private *p;
};

@@ -202,6 +205,8 @@ struct class {

int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
+
+ struct pm_ops *pm;
};

extern int __must_check class_register(struct class *class);
@@ -345,8 +350,11 @@ struct device_type {
struct attribute_group **groups;
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
void (*release)(struct device *dev);
+
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
+
+ struct pm_ops *pm;
};

/* interface for exporting device attributes */
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -224,7 +224,7 @@ static int create_image(int platform_mod
/* NOTE: device_power_up() is just a resume() for devices
* that suspended with irqs off ... no overall powerup.
*/
- device_power_up();
+ device_power_up(in_suspend ? PMSG_RECOVER : PMSG_RESTORE);
Enable_irqs:
local_irq_enable();
return error;
@@ -280,7 +280,7 @@ int hibernation_snapshot(int platform_mo
Finish:
platform_finish(platform_mode);
Resume_devices:
- device_resume();
+ device_resume(in_suspend ? PMSG_RECOVER : PMSG_RESTORE);
Resume_console:
resume_console();
Close:
@@ -301,7 +301,7 @@ static int resume_target_kernel(void)
int error;

local_irq_disable();
- error = device_power_down(PMSG_PRETHAW);
+ error = device_power_down(PMSG_QUIESCE);
if (error) {
printk(KERN_ERR "PM: Some devices failed to power down, "
"aborting resume\n");
@@ -329,7 +329,7 @@ static int resume_target_kernel(void)
swsusp_free();
restore_processor_state();
touch_softlockup_watchdog();
- device_power_up();
+ device_power_up(PMSG_THAW);
Enable_irqs:
local_irq_enable();
return error;
@@ -350,7 +350,7 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
- error = device_suspend(PMSG_PRETHAW);
+ error = device_suspend(PMSG_QUIESCE);
if (error)
goto Finish;

@@ -362,7 +362,7 @@ int hibernation_restore(int platform_mod
enable_nonboot_cpus();
}
platform_restore_cleanup(platform_mode);
- device_resume();
+ device_resume(PMSG_RECOVER);
Finish:
resume_console();
pm_restore_console();
@@ -419,7 +419,7 @@ int hibernation_platform_enter(void)
Finish:
hibernation_ops->finish();
Resume_devices:
- device_resume();
+ device_resume(PMSG_RESTORE);
Resume_console:
resume_console();
Close:
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -239,7 +239,7 @@ static int suspend_enter(suspend_state_t
if (!suspend_test(TEST_CORE))
error = suspend_ops->enter(state);

- device_power_up();
+ device_power_up(PMSG_RESUME);
Done:
arch_suspend_enable_irqs();
BUG_ON(irqs_disabled());
@@ -291,7 +291,7 @@ int suspend_devices_and_enter(suspend_st
if (suspend_ops->finish)
suspend_ops->finish();
Resume_devices:
- device_resume();
+ device_resume(PMSG_RESUME);
Resume_console:
resume_console();
Close:
Index: linux-2.6/arch/x86/kernel/apm_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apm_32.c
+++ linux-2.6/arch/x86/kernel/apm_32.c
@@ -1208,9 +1208,9 @@ static int suspend(int vetoable)
if (err != APM_SUCCESS)
apm_error("suspend", err);
err = (err == APM_SUCCESS) ? 0 : -EIO;
- device_power_up();
+ device_power_up(PMSG_RESUME);
local_irq_enable();
- device_resume();
+ device_resume(PMSG_RESUME);
queue_event(APM_NORMAL_RESUME, NULL);
out:
spin_lock(&user_list_lock);
Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -4,7 +4,7 @@
* main.c
*/

-extern struct list_head dpm_active; /* The active device list */
+extern struct list_head dpm_list; /* The active device list */

static inline struct device *to_device(struct list_head *entry)
{
Index: linux-2.6/drivers/base/power/trace.c
===================================================================
--- linux-2.6.orig/drivers/base/power/trace.c
+++ linux-2.6/drivers/base/power/trace.c
@@ -188,9 +188,9 @@ static int show_file_hash(unsigned int v
static int show_dev_hash(unsigned int value)
{
int match = 0;
- struct list_head * entry = dpm_active.prev;
+ struct list_head * entry = dpm_list.prev;

- while (entry != &dpm_active) {
+ while (entry != &dpm_list) {
struct device * dev = to_device(entry);
unsigned int hash = hash_string(DEVSEED, dev->bus_id, DEVHASH);
if (hash == value) {


2008-03-24 20:15:11

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:

> Hi,
>
> This is the 3rd revision of the patch introducing new callbacks for suspend
> and hibernation. It has been tested on x86-64.
...
> * The registrations of parentless devices are disabled before the first
> ->prepare() method is called and enabled before the first ->resume() method
> is called

It would be okay to wait until after the last prepare() method is
called. I don't know if it makes any difference in the end, however.

> +enum dpm_state {
> + DPM_ON,
> + DPM_RESUMING,
> + DPM_SUSPENDING,
> + DPM_OFF,
> + DPM_OFF_IRQ,
> +};

Can we also have a DPM_PREPARING state, set when ->prepare() is about
to be called? The PM core wouldn't make use of it but some drivers
would. (I can't think of any use at all for the analogous
DPM_COMPLETING state, however.)

> @@ -68,22 +59,30 @@ int device_pm_add(struct device *dev)
...
> + if (dev->parent) {
> + if (dev->parent->power.status > DPM_RESUMING) {

Clearer to say: if (dev->parent->power.status >= DPM_SUSPENDING) {

...
> + } else if (transition_started) {
> + /*
> + * We refuse to register parentless devices while a PM
> + * transition is in progress in order to avoid leaving them
> + * unhandled down the road
> + */

Log a warning here? If this ever happened, it would be the sort of
unexpected regression that people get all excited about.

> + goto Refuse;
> }
...

> +static void dpm_resume(pm_message_t state)
> +{
> + struct list_head list;
> +
> + INIT_LIST_HEAD(&list);
> + mutex_lock(&dpm_list_mtx);
> + transition_started = false;
> + while (!list_empty(&dpm_list)) {
> + struct device *dev = to_device(dpm_list.next);
> +
> + if (dev->power.status > DPM_SUSPENDING) {

Clearer to say: if (dev->power.status >= DPM_OFF) {

Note that if dev->power.status is equal to DPM_SUSPENDING then you
don't want to call resume_device(), but you still do want to change
dev->power.status to DPM_RESUMING so that new children can be
registered.

> + dev->power.status = DPM_RESUMING;
> + get_device(dev);
> + mutex_unlock(&dpm_list_mtx);
> +
> + resume_device(dev, state);
> +
> + mutex_lock(&dpm_list_mtx);
> + put_device(dev);
> + }
> + if (!list_empty(&dev->power.entry))
> + list_move_tail(&dev->power.entry, &list);

A little problem here: You refer to dev after calling put_device().

> + }
> + list_splice(&list, &dpm_list);

This isn't the way I imagined doing it (your extra "list"), but it's
fine.

...
> +static void dpm_complete(pm_message_t state)
> {
...
> + complete_device(dev, state);
> +
> + mutex_lock(&dpm_list_mtx);
> + put_device(dev);
> + }
> + if (!list_empty(&dev->power.entry))
> + list_move(&dev->power.entry, &list);

Same problem with use-after-put. Also present in dpm_prepare().

> }
> + list_splice(&list, &dpm_list);
> mutex_unlock(&dpm_list_mtx);
> }

...
> static int dpm_suspend(pm_message_t state)
> {
...
> error = suspend_device(dev, state);
> +
> mutex_lock(&dpm_list_mtx);
> + put_device(dev);
> if (error) {
> printk(KERN_ERR "Could not suspend device %s: "
> - "error %d%s\n",
> - kobject_name(&dev->kobj),
> - error,
> - (error == -EAGAIN ?
> - " (please convert to suspend_late)" :
> - ""));
> - dev->power.sleeping = false;
> + "error %d\n", kobject_name(&dev->kobj), error);
> + list_splice_init(&dpm_list, &list);
> break;
> }
> - if (!list_empty(&dev->power.entry))
> - list_move(&dev->power.entry, &dpm_off);
> + if (!list_empty(&dev->power.entry)) {
> + dev->power.status = DPM_OFF;
> + list_move(&dev->power.entry, &list);
> + }

Use-after-put again.

> }
> - if (!error)
> - all_sleeping = true;
> + list_splice(&list, &dpm_list);

Instead you could eliminate the list_splice_init() above and put here:

list_splice(&list, dpm_list->prev);

This will move the entries from list to the end of dpm_list.

> mutex_unlock(&dpm_list_mtx);
> + return error;
> +}

On the whole it looks quite good.

Alan Stern

2008-03-24 21:19:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Monday, 24 of March 2008, Alan Stern wrote:
> On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:
>
> > Hi,
> >
> > This is the 3rd revision of the patch introducing new callbacks for suspend
> > and hibernation. It has been tested on x86-64.
> ...
> > * The registrations of parentless devices are disabled before the first
> > ->prepare() method is called and enabled before the first ->resume() method
> > is called
>
> It would be okay to wait until after the last prepare() method is
> called. I don't know if it makes any difference in the end, however.
>
> > +enum dpm_state {
> > + DPM_ON,
> > + DPM_RESUMING,
> > + DPM_SUSPENDING,
> > + DPM_OFF,
> > + DPM_OFF_IRQ,
> > +};
>
> Can we also have a DPM_PREPARING state, set when ->prepare() is about
> to be called? The PM core wouldn't make use of it but some drivers
> would. (I can't think of any use at all for the analogous
> DPM_COMPLETING state, however.)

Hmm. dev->power.status is protected by dpm_list_mtx. Do you think it would be
useful to have an accessor function for reading it under the lock?

> > @@ -68,22 +59,30 @@ int device_pm_add(struct device *dev)
> ...
> > + if (dev->parent) {
> > + if (dev->parent->power.status > DPM_RESUMING) {
>
> Clearer to say: if (dev->parent->power.status >= DPM_SUSPENDING) {

OK

> ...
> > + } else if (transition_started) {
> > + /*
> > + * We refuse to register parentless devices while a PM
> > + * transition is in progress in order to avoid leaving them
> > + * unhandled down the road
> > + */
>
> Log a warning here? If this ever happened, it would be the sort of
> unexpected regression that people get all excited about.

The WARN_ON() below 'Refuse' will trigger. I think that's sufficient.

> > + goto Refuse;
> > }
> ...
>
> > +static void dpm_resume(pm_message_t state)
> > +{
> > + struct list_head list;
> > +
> > + INIT_LIST_HEAD(&list);
> > + mutex_lock(&dpm_list_mtx);
> > + transition_started = false;
> > + while (!list_empty(&dpm_list)) {
> > + struct device *dev = to_device(dpm_list.next);
> > +
> > + if (dev->power.status > DPM_SUSPENDING) {
>
> Clearer to say: if (dev->power.status >= DPM_OFF) {

OK

> Note that if dev->power.status is equal to DPM_SUSPENDING then you
> don't want to call resume_device(), but you still do want to change
> dev->power.status to DPM_RESUMING so that new children can be
> registered.

Ah, I overlooked that. Will fix.

> > + dev->power.status = DPM_RESUMING;
> > + get_device(dev);
> > + mutex_unlock(&dpm_list_mtx);
> > +
> > + resume_device(dev, state);
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > + }
> > + if (!list_empty(&dev->power.entry))
> > + list_move_tail(&dev->power.entry, &list);
>
> A little problem here: You refer to dev after calling put_device().

The device can't be removed at this point, because we hold dpm_list_mtx, which
is needed by device_del(). Still, I'll move the put_device() to avoid
confusion (as well as in all of the other places).

> > + }
> > + list_splice(&list, &dpm_list);
>
> This isn't the way I imagined doing it (your extra "list"), but it's
> fine.
>
> ...
> > +static void dpm_complete(pm_message_t state)
> > {
> ...
> > + complete_device(dev, state);
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > + }
> > + if (!list_empty(&dev->power.entry))
> > + list_move(&dev->power.entry, &list);
>
> Same problem with use-after-put. Also present in dpm_prepare().
>
> > }
> > + list_splice(&list, &dpm_list);
> > mutex_unlock(&dpm_list_mtx);
> > }
>
> ...
> > static int dpm_suspend(pm_message_t state)
> > {
> ...
> > error = suspend_device(dev, state);
> > +
> > mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > if (error) {
> > printk(KERN_ERR "Could not suspend device %s: "
> > - "error %d%s\n",
> > - kobject_name(&dev->kobj),
> > - error,
> > - (error == -EAGAIN ?
> > - " (please convert to suspend_late)" :
> > - ""));
> > - dev->power.sleeping = false;
> > + "error %d\n", kobject_name(&dev->kobj), error);
> > + list_splice_init(&dpm_list, &list);
> > break;
> > }
> > - if (!list_empty(&dev->power.entry))
> > - list_move(&dev->power.entry, &dpm_off);
> > + if (!list_empty(&dev->power.entry)) {
> > + dev->power.status = DPM_OFF;
> > + list_move(&dev->power.entry, &list);
> > + }
>
> Use-after-put again.
>
> > }
> > - if (!error)
> > - all_sleeping = true;
> > + list_splice(&list, &dpm_list);
>
> Instead you could eliminate the list_splice_init() above and put here:
>
> list_splice(&list, dpm_list->prev);
>
> This will move the entries from list to the end of dpm_list.

dpm_list may be empty at this point. Wouldn't that cause any trouble?

> > mutex_unlock(&dpm_list_mtx);
> > + return error;
> > +}
>
> On the whole it looks quite good.

Okay, thanks for the comments.

Rafael

2008-03-24 22:18:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Monday, 24 of March 2008, Alan Stern wrote:
> On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:
>
> > Hi,
> >
> > This is the 3rd revision of the patch introducing new callbacks for suspend
> > and hibernation. It has been tested on x86-64.
> ...
> > * The registrations of parentless devices are disabled before the first
> > ->prepare() method is called and enabled before the first ->resume() method
> > is called
>
> It would be okay to wait until after the last prepare() method is
> called. I don't know if it makes any difference in the end, however.
>
> > +enum dpm_state {
> > + DPM_ON,
> > + DPM_RESUMING,
> > + DPM_SUSPENDING,
> > + DPM_OFF,
> > + DPM_OFF_IRQ,
> > +};
>
> Can we also have a DPM_PREPARING state, set when ->prepare() is about
> to be called? The PM core wouldn't make use of it but some drivers
> would. (I can't think of any use at all for the analogous
> DPM_COMPLETING state, however.)
>
> > @@ -68,22 +59,30 @@ int device_pm_add(struct device *dev)
> ...
> > + if (dev->parent) {
> > + if (dev->parent->power.status > DPM_RESUMING) {
>
> Clearer to say: if (dev->parent->power.status >= DPM_SUSPENDING) {
>
> ...
> > + } else if (transition_started) {
> > + /*
> > + * We refuse to register parentless devices while a PM
> > + * transition is in progress in order to avoid leaving them
> > + * unhandled down the road
> > + */
>
> Log a warning here? If this ever happened, it would be the sort of
> unexpected regression that people get all excited about.
>
> > + goto Refuse;
> > }
> ...
>
> > +static void dpm_resume(pm_message_t state)
> > +{
> > + struct list_head list;
> > +
> > + INIT_LIST_HEAD(&list);
> > + mutex_lock(&dpm_list_mtx);
> > + transition_started = false;
> > + while (!list_empty(&dpm_list)) {
> > + struct device *dev = to_device(dpm_list.next);
> > +
> > + if (dev->power.status > DPM_SUSPENDING) {
>
> Clearer to say: if (dev->power.status >= DPM_OFF) {
>
> Note that if dev->power.status is equal to DPM_SUSPENDING then you
> don't want to call resume_device(), but you still do want to change
> dev->power.status to DPM_RESUMING so that new children can be
> registered.
>
> > + dev->power.status = DPM_RESUMING;
> > + get_device(dev);
> > + mutex_unlock(&dpm_list_mtx);
> > +
> > + resume_device(dev, state);
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > + }
> > + if (!list_empty(&dev->power.entry))
> > + list_move_tail(&dev->power.entry, &list);
>
> A little problem here: You refer to dev after calling put_device().
>
> > + }
> > + list_splice(&list, &dpm_list);
>
> This isn't the way I imagined doing it (your extra "list"), but it's
> fine.
>
> ...
> > +static void dpm_complete(pm_message_t state)
> > {
> ...
> > + complete_device(dev, state);
> > +
> > + mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > + }
> > + if (!list_empty(&dev->power.entry))
> > + list_move(&dev->power.entry, &list);
>
> Same problem with use-after-put. Also present in dpm_prepare().
>
> > }
> > + list_splice(&list, &dpm_list);
> > mutex_unlock(&dpm_list_mtx);
> > }
>
> ...
> > static int dpm_suspend(pm_message_t state)
> > {
> ...
> > error = suspend_device(dev, state);
> > +
> > mutex_lock(&dpm_list_mtx);
> > + put_device(dev);
> > if (error) {
> > printk(KERN_ERR "Could not suspend device %s: "
> > - "error %d%s\n",
> > - kobject_name(&dev->kobj),
> > - error,
> > - (error == -EAGAIN ?
> > - " (please convert to suspend_late)" :
> > - ""));
> > - dev->power.sleeping = false;
> > + "error %d\n", kobject_name(&dev->kobj), error);
> > + list_splice_init(&dpm_list, &list);
> > break;
> > }
> > - if (!list_empty(&dev->power.entry))
> > - list_move(&dev->power.entry, &dpm_off);
> > + if (!list_empty(&dev->power.entry)) {
> > + dev->power.status = DPM_OFF;
> > + list_move(&dev->power.entry, &list);
> > + }
>
> Use-after-put again.
>
> > }
> > - if (!error)
> > - all_sleeping = true;
> > + list_splice(&list, &dpm_list);
>
> Instead you could eliminate the list_splice_init() above and put here:
>
> list_splice(&list, dpm_list->prev);
>
> This will move the entries from list to the end of dpm_list.
>
> > mutex_unlock(&dpm_list_mtx);
> > + return error;
> > +}

The patch below should address all of your comments above. It also seems to
work.

If it looks good to you, please let me know and I'll repost it in a new thread
along with the platform and PCI patches adapted to it.

Thanks,
Rafael


Signed-off-by: Rafael J. Wysocki <[email protected]>
---

arch/x86/kernel/apm_32.c | 4
drivers/base/power/main.c | 654 ++++++++++++++++++++++++++++++++++-----------
drivers/base/power/power.h | 2
drivers/base/power/trace.c | 4
include/linux/device.h | 8
include/linux/pm.h | 282 +++++++++++++++++--
kernel/power/disk.c | 14
kernel/power/main.c | 4
8 files changed, 772 insertions(+), 200 deletions(-)

Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -114,7 +114,9 @@ typedef struct pm_message {
int event;
} pm_message_t;

-/*
+/**
+ * struct pm_ops - device PM callbacks
+ *
* Several driver power state transitions are externally visible, affecting
* the state of pending I/O queues and (for drivers that touch hardware)
* interrupts, wakeups, DMA, and other hardware state. There may also be
@@ -122,6 +124,254 @@ typedef struct pm_message {
* to the rest of the driver stack (such as a driver that's ON gating off
* clocks which are not in active use).
*
+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ * its hardware state. Prevent new children of the device from being
+ * registered and prevent new calls to the probe method from being made
+ * after @prepare() returns. If @prepare() detects a situation it cannot
+ * handle (e.g. registration of a child already in progress), it may return
+ * -EAGAIN, so that the PM core can execute it once again (e.g. after the
+ * new child has been registered) to recover from the race condition. This
+ * method is executed for all kinds of suspend transitions and is followed
+ * by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().
+ * The PM core executes @prepare() for all devices before starting to
+ * execute suspend callbacks for any of them, so drivers may assume all of
+ * the other devices to be present and functional while @prepare() is being
+ * executed. However, they may NOT assume anything about the availability
+ * of the user space at that time.
+ *
+ * @complete: Undo the changes made by @prepare(). This method is executed for
+ * all kinds of resume transitions, following one of the resume callbacks:
+ * @resume(), @thaw(), @restore(). Also called if the state transition
+ * fails before the driver's suspend callback (@suspend(), @freeze(),
+ * @poweroff()) can be executed (e.g. if the suspend callback fails for one
+ * of the other devices that the PM core has unsucessfully attempted to
+ * suspend earlier).
+ * The PM core executes @complete() after it has executed the appropriate
+ * resume callback for all devices.
+ *
+ * @suspend: Executed before putting the system into a sleep state in which the
+ * contents of main memory are preserved. Quiesce the device, put it into
+ * a low power state appropriate for the upcoming system state (such as
+ * PCI_D3hot), and enable wakeup events as appropriate.
+ *
+ * @resume: Executed after waking the system up from a sleep state in which the
+ * contents of main memory were preserved. Put the device into the
+ * appropriate state, according to the information saved in memory by the
+ * preceding @suspend(). The driver starts working again, responding to
+ * hardware events and software requests. The hardware may have gone
+ * through a power-off reset, or it may have maintained state from the
+ * previous suspend() which the driver may rely on while resuming. On most
+ * platforms, there are no restrictions on availability of resources like
+ * clocks during @resume().
+ *
+ * @freeze: Hibernation-specific, executed before creating a hibernation image.
+ * Quiesce operations so that a consistent image can be created, but do NOT
+ * otherwise put the device into a low power device state and do NOT emit
+ * system wakeup events. Save in main memory the device settings to be
+ * used by @restore() during the subsequent resume from hibernation or by
+ * the subsequent @thaw(), if the creation of the image or the restoration
+ * of main memory contents from it fails.
+ *
+ * @thaw: Hibernation-specific, executed after creating a hibernation image OR
+ * if the creation of the image fails. Also executed after a failing
+ * attempt to restore the contents of main memory from such an image.
+ * Undo the changes made by the preceding @freeze(), so the device can be
+ * operated in the same way as immediately before the call to @freeze().
+ *
+ * @poweroff: Hibernation-specific, executed after saving a hibernation image.
+ * Quiesce the device, put it into a low power state appropriate for the
+ * upcoming system state (such as PCI_D3hot), and enable wakeup events as
+ * appropriate.
+ *
+ * @restore: Hibernation-specific, executed after restoring the contents of main
+ * memory from a hibernation image. Driver starts working again,
+ * responding to hardware events and software requests. Drivers may NOT
+ * make ANY assumptions about the hardware state right prior to @restore().
+ * On most platforms, there are no restrictions on availability of
+ * resources like clocks during @restore().
+ *
+ * All of the above callbacks, except for @complete(), return error codes.
+ * However, the error codes returned by the resume operations, @resume(),
+ * @thaw(), and @restore(), do not cause the PM core to abort the resume
+ * transition during which they are returned.
+ */
+
+struct pm_ops {
+ int (*prepare)(struct device *dev);
+ void (*complete)(struct device *dev);
+ int (*suspend)(struct device *dev);
+ int (*resume)(struct device *dev);
+ int (*freeze)(struct device *dev);
+ int (*thaw)(struct device *dev);
+ int (*poweroff)(struct device *dev);
+ int (*restore)(struct device *dev);
+};
+
+/**
+ * struct pm_noirq_ops - device PM callbacks executed with interrupts disabled
+ *
+ * The following callbacks included in 'struct pm_noirq_ops' are executed with
+ * interrupts disabled and with the nonboot CPUs switched off:
+ *
+ * @suspend_noirq: Complete the operations of ->suspend() by carrying out any
+ * actions required for suspending the device that need interrupts to be
+ * disabled
+ *
+ * @resume_noirq: Prepare for the execution of ->resume() by carrying out any
+ * actions required for resuming the device that need interrupts to be
+ * disabled
+ *
+ * @freeze_noirq: Complete the operations of ->freeze() by carrying out any
+ * actions required for freezing the device that need interrupts to be
+ * disabled
+ *
+ * @thaw_noirq: Prepare for the execution of ->thaw() by carrying out any
+ * actions required for thawing the device that need interrupts to be
+ * disabled
+ *
+ * @poweroff_noirq: Complete the operations of ->poweroff() by carrying out any
+ * actions required for handling the device that need interrupts to be
+ * disabled
+ *
+ * @restore_noirq: Prepare for the execution of ->restore() by carrying out any
+ * actions required for restoring the operations of the device that need
+ * interrupts to be disabled
+ *
+ * All of the above callbacks, return error codes, but the error codes returned
+ * by the resume operations, @resume_noirq(), @thaw_noirq(), and
+ * @restore_noirq(), do not cause the PM core to abort the resume transition
+ * during which they are returned.
+ */
+
+struct pm_noirq_ops {
+ int (*suspend_noirq)(struct device *dev);
+ int (*resume_noirq)(struct device *dev);
+ int (*freeze_noirq)(struct device *dev);
+ int (*thaw_noirq)(struct device *dev);
+ int (*poweroff_noirq)(struct device *dev);
+ int (*restore_noirq)(struct device *dev);
+};
+
+/**
+ * PM_EVENT_ messages
+ *
+ * The following PM_EVENT_ messages are defined for the internal use of the PM
+ * core, in order to provide a mechanism allowing the high level suspend and
+ * hibernation code to convey the necessary information to the device PM core
+ * code:
+ *
+ * ON No transition.
+ *
+ * FREEZE System is going to hibernate, call ->prepare() and ->freeze()
+ * for all devices.
+ *
+ * SUSPEND System is going to suspend, call ->prepare() and ->suspend()
+ * for all devices.
+ *
+ * HIBERNATE Hibernation image has been saved, call ->prepare() and
+ * ->poweroff() for all devices.
+ *
+ * QUIESCE Contents of main memory are going to be restored from a (loaded)
+ * hibernation image, call ->prepare() and ->freeze() for all
+ * devices.
+ *
+ * RESUME System is resuming, call ->resume() and ->complete() for all
+ * devices.
+ *
+ * THAW Hibernation image has been created, call ->thaw() and
+ * ->complete() for all devices.
+ *
+ * RESTORE Contents of main memory have been restored from a hibernation
+ * image, call ->restore() and ->complete() for all devices.
+ *
+ * RECOVER Creation of a hibernation image or restoration of the main
+ * memory contents from a hibernation image has failed, call
+ * ->thaw() and ->complete() for all devices.
+ */
+
+#define PM_EVENT_ON 0x0000
+#define PM_EVENT_FREEZE 0x0001
+#define PM_EVENT_SUSPEND 0x0002
+#define PM_EVENT_HIBERNATE 0x0004
+#define PM_EVENT_QUIESCE 0x0008
+#define PM_EVENT_RESUME 0x0010
+#define PM_EVENT_THAW 0x0020
+#define PM_EVENT_RESTORE 0x0040
+#define PM_EVENT_RECOVER 0x0080
+
+#define PM_EVENT_SLEEP (PM_EVENT_SUSPEND | PM_EVENT_HIBERNATE)
+
+#define PMSG_FREEZE ((struct pm_message){ .event = PM_EVENT_FREEZE, })
+#define PMSG_QUIESCE ((struct pm_message){ .event = PM_EVENT_QUIESCE, })
+#define PMSG_SUSPEND ((struct pm_message){ .event = PM_EVENT_SUSPEND, })
+#define PMSG_HIBERNATE ((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
+#define PMSG_RESUME ((struct pm_message){ .event = PM_EVENT_RESUME, })
+#define PMSG_THAW ((struct pm_message){ .event = PM_EVENT_THAW, })
+#define PMSG_RESTORE ((struct pm_message){ .event = PM_EVENT_RESTORE, })
+#define PMSG_RECOVER ((struct pm_message){ .event = PM_EVENT_RECOVER, })
+#define PMSG_ON ((struct pm_message){ .event = PM_EVENT_ON, })
+
+/**
+ * Device power management states
+ *
+ * These state labels are used internally by the PM core to indicate the current
+ * status of a device with respect to the PM core operations.
+ *
+ * DPM_ON Device is regarded as operational. Set this way
+ * initially and when ->complete() is about to be called.
+ * Also set when ->prepare() fails.
+ *
+ * DPM_PREPARING Device is going to be prepared for a PM transition. Set
+ * when ->prepare() is about to be called.
+ *
+ * DPM_RESUMING Device is going to be resumed. Set when ->resume(),
+ * ->thaw(), or ->restore() is about to be called.
+ *
+ * DPM_SUSPENDING Device has been prepared for a power transition. Set
+ * when ->prepare() has just succeeded.
+ *
+ * DPM_OFF Device is regarded as inactive. Set immediately after
+ * ->suspend(), ->freeze(), or ->poweroff() has succeeded.
+ * Also set when ->resume()_noirq, ->thaw_noirq(), or
+ * ->restore_noirq() is about to be called.
+ *
+ * DPM_OFF_IRQ Device is in a "deep sleep". Set immediately after
+ * ->suspend_noirq(), ->freeze_noirq(), or
+ * ->poweroff_noirq() has just succeeded.
+ */
+
+enum dpm_state {
+ DPM_ON,
+ DPM_PREPARING,
+ DPM_RESUMING,
+ DPM_SUSPENDING,
+ DPM_OFF,
+ DPM_OFF_IRQ,
+};
+
+struct dev_pm_info {
+ pm_message_t power_state;
+ unsigned can_wakeup:1;
+ unsigned should_wakeup:1;
+ enum dpm_state status; /* Owned by the PM core */
+#ifdef CONFIG_PM_SLEEP
+ struct list_head entry;
+#endif
+};
+
+/*
+ * The PM_EVENT_ messages are also used by drivers implementing the legacy
+ * suspend framework, based on the ->suspend() and ->resume() callbacks common
+ * for suspend and hibernation transitions, according to the rules below.
+ */
+
+/* Necessary, because several drivers use PM_EVENT_PRETHAW */
+#define PM_EVENT_PRETHAW PM_EVENT_QUIESCE
+
+/*
* One transition is triggered by resume(), after a suspend() call; the
* message is implicit:
*
@@ -166,35 +416,11 @@ typedef struct pm_message {
* or from system low-power states such as standby or suspend-to-RAM.
*/

-#define PM_EVENT_ON 0
-#define PM_EVENT_FREEZE 1
-#define PM_EVENT_SUSPEND 2
-#define PM_EVENT_HIBERNATE 4
-#define PM_EVENT_PRETHAW 8
-
-#define PM_EVENT_SLEEP (PM_EVENT_SUSPEND | PM_EVENT_HIBERNATE)
-
-#define PMSG_FREEZE ((struct pm_message){ .event = PM_EVENT_FREEZE, })
-#define PMSG_PRETHAW ((struct pm_message){ .event = PM_EVENT_PRETHAW, })
-#define PMSG_SUSPEND ((struct pm_message){ .event = PM_EVENT_SUSPEND, })
-#define PMSG_HIBERNATE ((struct pm_message){ .event = PM_EVENT_HIBERNATE, })
-#define PMSG_ON ((struct pm_message){ .event = PM_EVENT_ON, })
-
-struct dev_pm_info {
- pm_message_t power_state;
- unsigned can_wakeup:1;
- unsigned should_wakeup:1;
- bool sleeping:1; /* Owned by the PM core */
-#ifdef CONFIG_PM_SLEEP
- struct list_head entry;
-#endif
-};
+#ifdef CONFIG_PM_SLEEP
+extern void device_power_up(pm_message_t state);
+extern void device_resume(pm_message_t state);

extern int device_power_down(pm_message_t state);
-extern void device_power_up(void);
-extern void device_resume(void);
-
-#ifdef CONFIG_PM_SLEEP
extern int device_suspend(pm_message_t state);
extern int device_prepare_suspend(pm_message_t state);

Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -12,11 +12,9 @@
* and add it to the list of power-controlled devices. sysfs entries for
* controlling device power management will also be added.
*
- * A different set of lists than the global subsystem list are used to
- * keep track of power info because we use different lists to hold
- * devices based on what stage of the power management process they
- * are in. The power domain dependencies may also differ from the
- * ancestral dependencies that the subsystem list maintains.
+ * A separate list is used for keeping track of power info, because the power
+ * domain dependencies may differ from the ancestral dependencies that the
+ * subsystem list maintains.
*/

#include <linux/device.h>
@@ -30,31 +28,24 @@
#include "power.h"

/*
- * The entries in the dpm_active list are in a depth first order, simply
+ * The entries in the dpm_list list are in a depth first order, simply
* because children are guaranteed to be discovered after parents, and
* are inserted at the back of the list on discovery.
*
- * All the other lists are kept in the same order, for consistency.
- * However the lists aren't always traversed in the same order.
- * Semaphores must be acquired from the top (i.e., front) down
- * and released in the opposite order. Devices must be suspended
- * from the bottom (i.e., end) up and resumed in the opposite order.
- * That way no parent will be suspended while it still has an active
- * child.
- *
* Since device_pm_add() may be called with a device semaphore held,
* we must never try to acquire a device semaphore while holding
* dpm_list_mutex.
*/

-LIST_HEAD(dpm_active);
-static LIST_HEAD(dpm_off);
-static LIST_HEAD(dpm_off_irq);
+LIST_HEAD(dpm_list);

static DEFINE_MUTEX(dpm_list_mtx);

-/* 'true' if all devices have been suspended, protected by dpm_list_mtx */
-static bool all_sleeping;
+/*
+ * Set once the preparation of devices for a PM transition has started, reset
+ * before starting to resume devices. Protected by dpm_list_mtx.
+ */
+static bool transition_started;

/**
* device_pm_add - add a device to the list of active devices
@@ -68,22 +59,30 @@ int device_pm_add(struct device *dev)
dev->bus ? dev->bus->name : "No Bus",
kobject_name(&dev->kobj));
mutex_lock(&dpm_list_mtx);
- if ((dev->parent && dev->parent->power.sleeping) || all_sleeping) {
- if (dev->parent->power.sleeping)
- dev_warn(dev,
- "parent %s is sleeping, will not add\n",
+ if (dev->parent) {
+ if (dev->parent->power.status >= DPM_SUSPENDING) {
+ dev_warn(dev, "parent %s is sleeping, will not add\n",
dev->parent->bus_id);
- else
- dev_warn(dev, "devices are sleeping, will not add\n");
- WARN_ON(true);
- error = -EBUSY;
- } else {
- error = dpm_sysfs_add(dev);
- if (!error)
- list_add_tail(&dev->power.entry, &dpm_active);
+ goto Refuse;
+ }
+ } else if (transition_started) {
+ /*
+ * We refuse to register parentless devices while a PM
+ * transition is in progress in order to avoid leaving them
+ * unhandled down the road
+ */
+ goto Refuse;
}
+ error = dpm_sysfs_add(dev);
+ if (!error)
+ list_add_tail(&dev->power.entry, &dpm_list);
+ End:
mutex_unlock(&dpm_list_mtx);
return error;
+ Refuse:
+ WARN_ON(true);
+ error = -EBUSY;
+ goto End;
}

/**
@@ -103,73 +102,230 @@ void device_pm_remove(struct device *dev
mutex_unlock(&dpm_list_mtx);
}

+/**
+ * pm_op - execute the PM operation appropiate for given PM event
+ * @dev: Device.
+ * @ops: PM operations to choose from.
+ * @state: PM transition of the system being carried out.
+ */
+static int pm_op(struct device *dev, struct pm_ops *ops, pm_message_t state)
+{
+ int error = 0;
+
+ switch (state.event) {
+#ifdef CONFIG_SUSPEND
+ case PM_EVENT_SUSPEND:
+ if (ops->suspend) {
+ error = ops->suspend(dev);
+ suspend_report_result(ops->suspend, error);
+ }
+ break;
+ case PM_EVENT_RESUME:
+ if (ops->resume) {
+ error = ops->resume(dev);
+ suspend_report_result(ops->resume, error);
+ }
+ break;
+#endif /* CONFIG_SUSPEND */
+#ifdef CONFIG_HIBERNATION
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ if (ops->freeze) {
+ error = ops->freeze(dev);
+ suspend_report_result(ops->freeze, error);
+ }
+ break;
+ case PM_EVENT_HIBERNATE:
+ if (ops->poweroff) {
+ error = ops->poweroff(dev);
+ suspend_report_result(ops->poweroff, error);
+ }
+ break;
+ case PM_EVENT_THAW:
+ case PM_EVENT_RECOVER:
+ if (ops->thaw) {
+ error = ops->thaw(dev);
+ suspend_report_result(ops->thaw, error);
+ }
+ break;
+ case PM_EVENT_RESTORE:
+ if (ops->restore) {
+ error = ops->restore(dev);
+ suspend_report_result(ops->restore, error);
+ }
+ break;
+#endif /* CONFIG_HIBERNATION */
+ default:
+ error = -EINVAL;
+ }
+ return error;
+}
+
+/**
+ * pm_noirq_op - execute the PM operation appropiate for given PM event
+ * @dev: Device.
+ * @ops: PM operations to choose from.
+ * @state: PM transition of the system being carried out.
+ *
+ * The operation is executed with interrupts disabled by the only remaining
+ * functional CPU in the system.
+ */
+static int pm_noirq_op(struct device *dev, struct pm_noirq_ops *ops,
+ pm_message_t state)
+{
+ int error = 0;
+
+ switch (state.event) {
+#ifdef CONFIG_SUSPEND
+ case PM_EVENT_SUSPEND:
+ if (ops->suspend_noirq) {
+ error = ops->suspend_noirq(dev);
+ suspend_report_result(ops->suspend_noirq, error);
+ }
+ break;
+ case PM_EVENT_RESUME:
+ if (ops->resume_noirq) {
+ error = ops->resume_noirq(dev);
+ suspend_report_result(ops->resume_noirq, error);
+ }
+ break;
+#endif /* CONFIG_SUSPEND */
+#ifdef CONFIG_HIBERNATION
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ if (ops->freeze_noirq) {
+ error = ops->freeze_noirq(dev);
+ suspend_report_result(ops->freeze_noirq, error);
+ }
+ break;
+ case PM_EVENT_HIBERNATE:
+ if (ops->poweroff_noirq) {
+ error = ops->poweroff_noirq(dev);
+ suspend_report_result(ops->poweroff_noirq, error);
+ }
+ break;
+ case PM_EVENT_THAW:
+ case PM_EVENT_RECOVER:
+ if (ops->thaw_noirq) {
+ error = ops->thaw_noirq(dev);
+ suspend_report_result(ops->thaw_noirq, error);
+ }
+ break;
+ case PM_EVENT_RESTORE:
+ if (ops->restore_noirq) {
+ error = ops->restore_noirq(dev);
+ suspend_report_result(ops->restore_noirq, error);
+ }
+ break;
+#endif /* CONFIG_HIBERNATION */
+ default:
+ error = -EINVAL;
+ }
+ return error;
+}
+
+static char *pm_verb(int event)
+{
+ switch (event) {
+ case PM_EVENT_SUSPEND:
+ return "suspend";
+ case PM_EVENT_RESUME:
+ return "resume";
+ case PM_EVENT_FREEZE:
+ return "freeze";
+ case PM_EVENT_QUIESCE:
+ return "quiesce";
+ case PM_EVENT_HIBERNATE:
+ return "hibernate";
+ case PM_EVENT_THAW:
+ return "thaw";
+ case PM_EVENT_RESTORE:
+ return "restore";
+ default:
+ return "(unknown PM event)";
+ }
+}
+
+static void pm_dev_dbg(struct device *dev, pm_message_t state, char *info)
+{
+ dev_dbg(dev, "%s%s%s\n", info, pm_verb(state.event),
+ ((state.event & PM_EVENT_SLEEP) && device_may_wakeup(dev)) ?
+ ", may wakeup" : "");
+}
+
/*------------------------- Resume routines -------------------------*/

/**
- * resume_device_early - Power on one device (early resume).
+ * resume_device_noirq - Power on one device (early resume).
* @dev: Device.
+ * @state: PM transition of the system being carried out.
*
* Must be called with interrupts disabled.
*/
-static int resume_device_early(struct device *dev)
+static int resume_device_noirq(struct device *dev, pm_message_t state)
{
int error = 0;

TRACE_DEVICE(dev);
TRACE_RESUME(0);

- if (dev->bus && dev->bus->resume_early) {
- dev_dbg(dev, "EARLY resume\n");
+ if (!dev->bus)
+ goto End;
+
+ if (dev->bus->pm_noirq) {
+ pm_dev_dbg(dev, state, "EARLY ");
+ error = pm_noirq_op(dev, dev->bus->pm_noirq, state);
+ } else if (dev->bus->resume_early) {
+ pm_dev_dbg(dev, state, "legacy EARLY ");
error = dev->bus->resume_early(dev);
}
-
+ End:
TRACE_RESUME(error);
return error;
}

/**
* dpm_power_up - Power on all regular (non-sysdev) devices.
+ * @state: PM transition of the system being carried out.
*
- * Walk the dpm_off_irq list and power each device up. This
- * is used for devices that required they be powered down with
- * interrupts disabled. As devices are powered on, they are moved
- * to the dpm_off list.
+ * Execute the appropriate "noirq resume" callback for all devices marked
+ * as DPM_OFF_IRQ.
*
* Must be called with interrupts disabled and only one CPU running.
*/
-static void dpm_power_up(void)
+static void dpm_power_up(pm_message_t state)
{
+ struct device *dev;

- while (!list_empty(&dpm_off_irq)) {
- struct list_head *entry = dpm_off_irq.next;
- struct device *dev = to_device(entry);
-
- list_move_tail(entry, &dpm_off);
- resume_device_early(dev);
- }
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ if (dev->power.status > DPM_OFF) {
+ dev->power.status = DPM_OFF;
+ resume_device_noirq(dev, state);
+ }
}

/**
* device_power_up - Turn on all devices that need special attention.
+ * @state: PM transition of the system being carried out.
*
* Power on system devices, then devices that required we shut them down
* with interrupts disabled.
*
* Must be called with interrupts disabled.
*/
-void device_power_up(void)
+void device_power_up(pm_message_t state)
{
sysdev_resume();
- dpm_power_up();
+ dpm_power_up(state);
}
EXPORT_SYMBOL_GPL(device_power_up);

/**
* resume_device - Restore state for one device.
* @dev: Device.
- *
+ * @state: PM transition of the system being carried out.
*/
-static int resume_device(struct device *dev)
+static int resume_device(struct device *dev, pm_message_t state)
{
int error = 0;

@@ -178,21 +334,40 @@ static int resume_device(struct device *

down(&dev->sem);

- if (dev->bus && dev->bus->resume) {
- dev_dbg(dev,"resuming\n");
- error = dev->bus->resume(dev);
+ if (dev->bus) {
+ if (dev->bus->pm) {
+ pm_dev_dbg(dev, state, "");
+ error = pm_op(dev, dev->bus->pm, state);
+ } else if (dev->bus->resume) {
+ pm_dev_dbg(dev, state, "legacy ");
+ error = dev->bus->resume(dev);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->type && dev->type->resume) {
- dev_dbg(dev,"resuming\n");
- error = dev->type->resume(dev);
+ if (dev->type) {
+ if (dev->type->pm) {
+ pm_dev_dbg(dev, state, "type ");
+ error = pm_op(dev, dev->type->pm, state);
+ } else if (dev->type->resume) {
+ pm_dev_dbg(dev, state, "legacy type ");
+ error = dev->type->resume(dev);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->class && dev->class->resume) {
- dev_dbg(dev,"class resume\n");
- error = dev->class->resume(dev);
+ if (dev->class) {
+ if (dev->class->pm) {
+ pm_dev_dbg(dev, state, "class ");
+ error = pm_op(dev, dev->class->pm, state);
+ } else if (dev->class->resume) {
+ pm_dev_dbg(dev, state, "legacy class ");
+ error = dev->class->resume(dev);
+ }
}
-
+ End:
up(&dev->sem);

TRACE_RESUME(error);
@@ -201,78 +376,157 @@ static int resume_device(struct device *

/**
* dpm_resume - Resume every device.
+ * @state: PM transition of the system being carried out.
*
- * Resume the devices that have either not gone through
- * the late suspend, or that did go through it but also
- * went through the early resume.
+ * Execute the appropriate "resume" callback for all devices the status of
+ * which indicates that they are inactive.
+ */
+static void dpm_resume(pm_message_t state)
+{
+ struct list_head list;
+
+ INIT_LIST_HEAD(&list);
+ mutex_lock(&dpm_list_mtx);
+ transition_started = false;
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.next);
+
+ get_device(dev);
+ if (dev->power.status >= DPM_OFF) {
+ dev->power.status = DPM_RESUMING;
+ mutex_unlock(&dpm_list_mtx);
+
+ resume_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ } else if (dev->power.status == DPM_SUSPENDING) {
+ /* Allow new children of the device to be registered */
+ dev->power.status = DPM_RESUMING;
+ }
+ if (!list_empty(&dev->power.entry))
+ list_move_tail(&dev->power.entry, &list);
+ put_device(dev);
+ }
+ list_splice(&list, &dpm_list);
+ mutex_unlock(&dpm_list_mtx);
+}
+
+/**
+ * complete_device - Complete a PM transition for given device
+ * @dev: Device.
+ * @state: PM transition of the system being carried out.
+ */
+static void complete_device(struct device *dev, pm_message_t state)
+{
+ down(&dev->sem);
+
+ if (dev->bus && dev->bus->pm && dev->bus->pm->complete) {
+ pm_dev_dbg(dev, state, "completing ");
+ dev->bus->pm->complete(dev);
+ }
+
+ if (dev->type && dev->type->pm && dev->type->pm->complete) {
+ pm_dev_dbg(dev, state, "completing type ");
+ dev->type->pm->complete(dev);
+ }
+
+ if (dev->class && dev->class->pm && dev->class->pm->complete) {
+ pm_dev_dbg(dev, state, "completing class ");
+ dev->class->pm->complete(dev);
+ }
+
+ up(&dev->sem);
+}
+
+/**
+ * dpm_complete - Complete a PM transition for all devices.
+ * @state: PM transition of the system being carried out.
*
- * Take devices from the dpm_off_list, resume them,
- * and put them on the dpm_locked list.
+ * Execute the ->complete() callbacks for all devices that are not marked
+ * as DPM_ON.
*/
-static void dpm_resume(void)
+static void dpm_complete(pm_message_t state)
{
+ struct list_head list;
+
+ INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
- all_sleeping = false;
- while(!list_empty(&dpm_off)) {
- struct list_head *entry = dpm_off.next;
- struct device *dev = to_device(entry);
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.prev);

- list_move_tail(entry, &dpm_active);
- dev->power.sleeping = false;
- mutex_unlock(&dpm_list_mtx);
- resume_device(dev);
- mutex_lock(&dpm_list_mtx);
+ get_device(dev);
+ if (dev->power.status > DPM_ON) {
+ dev->power.status = DPM_ON;
+ mutex_unlock(&dpm_list_mtx);
+
+ complete_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ }
+ if (!list_empty(&dev->power.entry))
+ list_move(&dev->power.entry, &list);
+ put_device(dev);
}
+ list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
}

/**
* device_resume - Restore state of each device in system.
+ * @state: PM transition of the system being carried out.
*
* Resume all the devices, unlock them all, and allow new
* devices to be registered once again.
*/
-void device_resume(void)
+void device_resume(pm_message_t state)
{
might_sleep();
- dpm_resume();
+ dpm_resume(state);
+ dpm_complete(state);
}
EXPORT_SYMBOL_GPL(device_resume);


/*------------------------- Suspend routines -------------------------*/

-static inline char *suspend_verb(u32 event)
+/**
+ * resume_event - return a PM message representing the resume event
+ * corresponding to given sleep state.
+ * @sleep_state: PM message representing a sleep state.
+ */
+static pm_message_t resume_event(pm_message_t sleep_state)
{
- switch (event) {
- case PM_EVENT_SUSPEND: return "suspend";
- case PM_EVENT_FREEZE: return "freeze";
- case PM_EVENT_PRETHAW: return "prethaw";
- default: return "(unknown suspend event)";
+ switch (sleep_state.event) {
+ case PM_EVENT_SUSPEND:
+ return PMSG_RESUME;
+ case PM_EVENT_FREEZE:
+ case PM_EVENT_QUIESCE:
+ return PMSG_RECOVER;
+ case PM_EVENT_HIBERNATE:
+ return PMSG_RESTORE;
}
-}
-
-static void
-suspend_device_dbg(struct device *dev, pm_message_t state, char *info)
-{
- dev_dbg(dev, "%s%s%s\n", info, suspend_verb(state.event),
- ((state.event == PM_EVENT_SUSPEND) && device_may_wakeup(dev)) ?
- ", may wakeup" : "");
+ return PMSG_ON;
}

/**
- * suspend_device_late - Shut down one device (late suspend).
+ * suspend_device_noirq - Shut down one device (late suspend).
* @dev: Device.
- * @state: Power state device is entering.
+ * @state: PM transition of the system being carried out.
*
* This is called with interrupts off and only a single CPU running.
*/
-static int suspend_device_late(struct device *dev, pm_message_t state)
+static int suspend_device_noirq(struct device *dev, pm_message_t state)
{
int error = 0;

- if (dev->bus && dev->bus->suspend_late) {
- suspend_device_dbg(dev, state, "LATE ");
+ if (!dev->bus)
+ return 0;
+
+ if (dev->bus->pm_noirq) {
+ pm_dev_dbg(dev, state, "LATE ");
+ error = pm_noirq_op(dev, dev->bus->pm_noirq, state);
+ } else if (dev->bus->suspend_late) {
+ pm_dev_dbg(dev, state, "legacy LATE ");
error = dev->bus->suspend_late(dev, state);
suspend_report_result(dev->bus->suspend_late, error);
}
@@ -281,37 +535,32 @@ static int suspend_device_late(struct de

/**
* device_power_down - Shut down special devices.
- * @state: Power state to enter.
+ * @state: PM transition of the system being carried out.
*
- * Power down devices that require interrupts to be disabled
- * and move them from the dpm_off list to the dpm_off_irq list.
+ * Power down devices that require interrupts to be disabled.
* Then power down system devices.
*
* Must be called with interrupts disabled and only one CPU running.
*/
int device_power_down(pm_message_t state)
{
+ struct device *dev;
int error = 0;

- while (!list_empty(&dpm_off)) {
- struct list_head *entry = dpm_off.prev;
- struct device *dev = to_device(entry);
-
- error = suspend_device_late(dev, state);
+ list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
+ error = suspend_device_noirq(dev, state);
if (error) {
printk(KERN_ERR "Could not power down device %s: "
"error %d\n",
kobject_name(&dev->kobj), error);
break;
}
- if (!list_empty(&dev->power.entry))
- list_move(&dev->power.entry, &dpm_off_irq);
+ dev->power.status = DPM_OFF_IRQ;
}
-
if (!error)
error = sysdev_suspend(state);
if (error)
- dpm_power_up();
+ dpm_power_up(resume_event(state));
return error;
}
EXPORT_SYMBOL_GPL(device_power_down);
@@ -319,7 +568,7 @@ EXPORT_SYMBOL_GPL(device_power_down);
/**
* suspend_device - Save state of one device.
* @dev: Device.
- * @state: Power state device is entering.
+ * @state: PM transition of the system being carried out.
*/
static int suspend_device(struct device *dev, pm_message_t state)
{
@@ -327,29 +576,43 @@ static int suspend_device(struct device

down(&dev->sem);

- if (dev->power.power_state.event) {
- dev_dbg(dev, "PM: suspend %d-->%d\n",
- dev->power.power_state.event, state.event);
- }
-
- if (dev->class && dev->class->suspend) {
- suspend_device_dbg(dev, state, "class ");
- error = dev->class->suspend(dev, state);
- suspend_report_result(dev->class->suspend, error);
+ if (dev->class) {
+ if (dev->class->pm) {
+ pm_dev_dbg(dev, state, "class ");
+ error = pm_op(dev, dev->class->pm, state);
+ } else if (dev->class->suspend) {
+ pm_dev_dbg(dev, state, "legacy class ");
+ error = dev->class->suspend(dev, state);
+ suspend_report_result(dev->class->suspend, error);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->type && dev->type->suspend) {
- suspend_device_dbg(dev, state, "type ");
- error = dev->type->suspend(dev, state);
- suspend_report_result(dev->type->suspend, error);
+ if (dev->type) {
+ if (dev->type->pm) {
+ pm_dev_dbg(dev, state, "type ");
+ error = pm_op(dev, dev->type->pm, state);
+ } else if (dev->type->suspend) {
+ pm_dev_dbg(dev, state, "legacy type ");
+ error = dev->type->suspend(dev, state);
+ suspend_report_result(dev->type->suspend, error);
+ }
+ if (error)
+ goto End;
}

- if (!error && dev->bus && dev->bus->suspend) {
- suspend_device_dbg(dev, state, "");
- error = dev->bus->suspend(dev, state);
- suspend_report_result(dev->bus->suspend, error);
+ if (dev->bus) {
+ if (dev->bus->pm) {
+ pm_dev_dbg(dev, state, "");
+ error = pm_op(dev, dev->bus->pm, state);
+ } else if (dev->bus->suspend) {
+ pm_dev_dbg(dev, state, "legacy ");
+ error = dev->bus->suspend(dev, state);
+ suspend_report_result(dev->bus->suspend, error);
+ }
}
-
+ End:
up(&dev->sem);

return error;
@@ -357,67 +620,142 @@ static int suspend_device(struct device

/**
* dpm_suspend - Suspend every device.
- * @state: Power state to put each device in.
+ * @state: PM transition of the system being carried out.
*
- * Walk the dpm_locked list. Suspend each device and move it
- * to the dpm_off list.
- *
- * (For historical reasons, if it returns -EAGAIN, that used to mean
- * that the device would be called again with interrupts disabled.
- * These days, we use the "suspend_late()" callback for that, so we
- * print a warning and consider it an error).
+ * Execute the appropriate "suspend" callbacks for all devices.
*/
static int dpm_suspend(pm_message_t state)
{
+ struct list_head list;
int error = 0;

+ INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
- while (!list_empty(&dpm_active)) {
- struct list_head *entry = dpm_active.prev;
- struct device *dev = to_device(entry);
-
- WARN_ON(dev->parent && dev->parent->power.sleeping);
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.prev);

- dev->power.sleeping = true;
+ get_device(dev);
mutex_unlock(&dpm_list_mtx);
+
error = suspend_device(dev, state);
+
mutex_lock(&dpm_list_mtx);
if (error) {
printk(KERN_ERR "Could not suspend device %s: "
- "error %d%s\n",
- kobject_name(&dev->kobj),
- error,
- (error == -EAGAIN ?
- " (please convert to suspend_late)" :
- ""));
- dev->power.sleeping = false;
+ "error %d\n", kobject_name(&dev->kobj), error);
+ put_device(dev);
break;
}
+ dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
- list_move(&dev->power.entry, &dpm_off);
+ list_move(&dev->power.entry, &list);
+ put_device(dev);
}
- if (!error)
- all_sleeping = true;
+ list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ return error;
+}
+
+/**
+ * prepare_device - Execute the ->prepare() callback(s) for given device.
+ * @dev: Device.
+ * @state: PM transition of the system being carried out.
+ */
+static int prepare_device(struct device *dev, pm_message_t state)
+{
+ int error = 0;
+
+ down(&dev->sem);
+
+ if (dev->bus && dev->bus->pm && dev->bus->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing ");
+ error = dev->bus->pm->prepare(dev);
+ suspend_report_result(dev->bus->pm->prepare, error);
+ if (error)
+ goto End;
+ }
+
+ if (dev->type && dev->type->pm && dev->type->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing type ");
+ error = dev->type->pm->prepare(dev);
+ suspend_report_result(dev->type->pm->prepare, error);
+ if (error)
+ goto End;
+ }
+
+ if (dev->class && dev->class->pm && dev->class->pm->prepare) {
+ pm_dev_dbg(dev, state, "preparing class ");
+ error = dev->class->pm->prepare(dev);
+ suspend_report_result(dev->class->pm->prepare, error);
+ }
+ End:
+ up(&dev->sem);
+
+ return error;
+}
+
+/**
+ * dpm_prepare - Prepare all devices for a PM transition.
+ * @state: PM transition of the system being carried out.
+ *
+ * Execute the ->prepare() callback for all devices.
+ */
+static int dpm_prepare(pm_message_t state)
+{
+ struct list_head list;
+ int error = 0;
+
+ INIT_LIST_HEAD(&list);
+ mutex_lock(&dpm_list_mtx);
+ transition_started = true;
+ while (!list_empty(&dpm_list)) {
+ struct device *dev = to_device(dpm_list.next);
+
+ get_device(dev);
+ dev->power.status = DPM_PREPARING;
+ mutex_unlock(&dpm_list_mtx);

+ error = prepare_device(dev, state);
+
+ mutex_lock(&dpm_list_mtx);
+ if (error) {
+ dev->power.status = DPM_ON;
+ if (error == -EAGAIN) {
+ put_device(dev);
+ continue;
+ }
+ printk(KERN_ERR "Could not prepare device %s "
+ "for suspend: error %d\n",
+ kobject_name(&dev->kobj), error);
+ put_device(dev);
+ break;
+ }
+ dev->power.status = DPM_SUSPENDING;
+ if (!list_empty(&dev->power.entry))
+ list_move_tail(&dev->power.entry, &list);
+ put_device(dev);
+ }
+ list_splice(&list, &dpm_list);
+ mutex_unlock(&dpm_list_mtx);
return error;
}

/**
* device_suspend - Save state and stop all devices in system.
- * @state: new power management state
+ * @state: PM transition of the system being carried out.
*
- * Prevent new devices from being registered, then lock all devices
- * and suspend them.
+ * Prepare and suspend all devices.
*/
int device_suspend(pm_message_t state)
{
int error;

might_sleep();
- error = dpm_suspend(state);
+ error = dpm_prepare(state);
+ if (!error)
+ error = dpm_suspend(state);
if (error)
- device_resume();
+ device_resume(resume_event(state));
return error;
}
EXPORT_SYMBOL_GPL(device_suspend);
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -69,6 +69,9 @@ struct bus_type {
int (*resume_early)(struct device *dev);
int (*resume)(struct device *dev);

+ struct pm_ops *pm;
+ struct pm_noirq_ops *pm_noirq;
+
struct bus_type_private *p;
};

@@ -202,6 +205,8 @@ struct class {

int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
+
+ struct pm_ops *pm;
};

extern int __must_check class_register(struct class *class);
@@ -345,8 +350,11 @@ struct device_type {
struct attribute_group **groups;
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
void (*release)(struct device *dev);
+
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
+
+ struct pm_ops *pm;
};

/* interface for exporting device attributes */
Index: linux-2.6/kernel/power/disk.c
===================================================================
--- linux-2.6.orig/kernel/power/disk.c
+++ linux-2.6/kernel/power/disk.c
@@ -224,7 +224,7 @@ static int create_image(int platform_mod
/* NOTE: device_power_up() is just a resume() for devices
* that suspended with irqs off ... no overall powerup.
*/
- device_power_up();
+ device_power_up(in_suspend ? PMSG_RECOVER : PMSG_RESTORE);
Enable_irqs:
local_irq_enable();
return error;
@@ -280,7 +280,7 @@ int hibernation_snapshot(int platform_mo
Finish:
platform_finish(platform_mode);
Resume_devices:
- device_resume();
+ device_resume(in_suspend ? PMSG_RECOVER : PMSG_RESTORE);
Resume_console:
resume_console();
Close:
@@ -301,7 +301,7 @@ static int resume_target_kernel(void)
int error;

local_irq_disable();
- error = device_power_down(PMSG_PRETHAW);
+ error = device_power_down(PMSG_QUIESCE);
if (error) {
printk(KERN_ERR "PM: Some devices failed to power down, "
"aborting resume\n");
@@ -329,7 +329,7 @@ static int resume_target_kernel(void)
swsusp_free();
restore_processor_state();
touch_softlockup_watchdog();
- device_power_up();
+ device_power_up(PMSG_THAW);
Enable_irqs:
local_irq_enable();
return error;
@@ -350,7 +350,7 @@ int hibernation_restore(int platform_mod

pm_prepare_console();
suspend_console();
- error = device_suspend(PMSG_PRETHAW);
+ error = device_suspend(PMSG_QUIESCE);
if (error)
goto Finish;

@@ -362,7 +362,7 @@ int hibernation_restore(int platform_mod
enable_nonboot_cpus();
}
platform_restore_cleanup(platform_mode);
- device_resume();
+ device_resume(PMSG_RECOVER);
Finish:
resume_console();
pm_restore_console();
@@ -419,7 +419,7 @@ int hibernation_platform_enter(void)
Finish:
hibernation_ops->finish();
Resume_devices:
- device_resume();
+ device_resume(PMSG_RESTORE);
Resume_console:
resume_console();
Close:
Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -239,7 +239,7 @@ static int suspend_enter(suspend_state_t
if (!suspend_test(TEST_CORE))
error = suspend_ops->enter(state);

- device_power_up();
+ device_power_up(PMSG_RESUME);
Done:
arch_suspend_enable_irqs();
BUG_ON(irqs_disabled());
@@ -291,7 +291,7 @@ int suspend_devices_and_enter(suspend_st
if (suspend_ops->finish)
suspend_ops->finish();
Resume_devices:
- device_resume();
+ device_resume(PMSG_RESUME);
Resume_console:
resume_console();
Close:
Index: linux-2.6/arch/x86/kernel/apm_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apm_32.c
+++ linux-2.6/arch/x86/kernel/apm_32.c
@@ -1208,9 +1208,9 @@ static int suspend(int vetoable)
if (err != APM_SUCCESS)
apm_error("suspend", err);
err = (err == APM_SUCCESS) ? 0 : -EIO;
- device_power_up();
+ device_power_up(PMSG_RESUME);
local_irq_enable();
- device_resume();
+ device_resume(PMSG_RESUME);
queue_event(APM_NORMAL_RESUME, NULL);
out:
spin_lock(&user_list_lock);
Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -4,7 +4,7 @@
* main.c
*/

-extern struct list_head dpm_active; /* The active device list */
+extern struct list_head dpm_list; /* The active device list */

static inline struct device *to_device(struct list_head *entry)
{
Index: linux-2.6/drivers/base/power/trace.c
===================================================================
--- linux-2.6.orig/drivers/base/power/trace.c
+++ linux-2.6/drivers/base/power/trace.c
@@ -188,9 +188,9 @@ static int show_file_hash(unsigned int v
static int show_dev_hash(unsigned int value)
{
int match = 0;
- struct list_head * entry = dpm_active.prev;
+ struct list_head *entry = dpm_list.prev;

- while (entry != &dpm_active) {
+ while (entry != &dpm_list) {
struct device * dev = to_device(entry);
unsigned int hash = hash_string(DEVSEED, dev->bus_id, DEVHASH);
if (hash == value) {

2008-03-25 11:55:07

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

Am Montag 24 M?rz 2008 schrieb Rafael J. Wysocki:
> -/*
> +/**
> + * struct pm_ops - device PM callbacks
> + *
> ? * Several driver power state transitions are externally visible, affecting
> ? * the state of pending I/O queues and (for drivers that touch hardware)
> ? * interrupts, wakeups, DMA, and other hardware state. ?There may also be
> @@ -122,6 +124,254 @@ typedef struct pm_message {
> ? * to the rest of the driver stack (such as a driver that's ON gating off
> ? * clocks which are not in active use).
> ? *
> + * The externally visible transitions are handled with the help of the following
> + * callbacks included in this structure:
> + *
> + * @prepare: Prepare the device for the upcoming transition, but do NOT change
> + *?????its hardware state. ?Prevent new children of the device from being
> + *?????registered and prevent new calls to the probe method from being made

How is a driver supposed to prevent calls to probe()? You can block in probe()
or you can fail it, but how do you prevent it from being called? User space
can trigger probe via sysfs. To be useful to driver writers, this has
to be clarified.

> + *?????after @prepare() returns. ?If @prepare() detects a situation it cannot
> + *?????handle (e.g. registration of a child already in progress), it may return
> + *?????-EAGAIN, so that the PM core can execute it once again (e.g. after the
> + *?????new child has been registered) to recover from the race condition. This
> + *?????method is executed for all kinds of suspend transitions and is followed
> + *?????by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().

This could be understood so that disconnect() cannot be called.

> + *?????The PM core executes @prepare() for all devices before starting to
> + *?????execute suspend callbacks for any of them, so drivers may assume all of
> + *?????the other devices to be present and functional while @prepare() is being
> + *?????executed. ?However, they may NOT assume anything about the availability
> + *?????of the user space at that time.

Probably it should be mentioned that this is the time to allocate memory
if you have to. And it is too late to request firmware.

Regards
Oliver

2008-03-25 12:41:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tuesday, 25 of March 2008, Oliver Neukum wrote:
> Am Montag 24 M?rz 2008 schrieb Rafael J. Wysocki:
> > -/*
> > +/**
> > + * struct pm_ops - device PM callbacks
> > + *
> > ? * Several driver power state transitions are externally visible, affecting
> > ? * the state of pending I/O queues and (for drivers that touch hardware)
> > ? * interrupts, wakeups, DMA, and other hardware state. ?There may also be
> > @@ -122,6 +124,254 @@ typedef struct pm_message {
> > ? * to the rest of the driver stack (such as a driver that's ON gating off
> > ? * clocks which are not in active use).
> > ? *
> > + * The externally visible transitions are handled with the help of the following
> > + * callbacks included in this structure:
> > + *
> > + * @prepare: Prepare the device for the upcoming transition, but do NOT change
> > + *?????its hardware state. ?Prevent new children of the device from being
> > + *?????registered and prevent new calls to the probe method from being made
>
> How is a driver supposed to prevent calls to probe()? You can block in probe()
> or you can fail it, but how do you prevent it from being called? User space
> can trigger probe via sysfs.

I overlooked that.

> To be useful to driver writers, this has to be clarified.

The comment is supposed to mean that probe() should be prevented from running
(i.e. even if it's executed, it should either fail or block). I'll write it this way.

> > + *?????after @prepare() returns. ?If @prepare() detects a situation it cannot
> > + *?????handle (e.g. registration of a child already in progress), it may return
> > + *?????-EAGAIN, so that the PM core can execute it once again (e.g. after the
> > + *?????new child has been registered) to recover from the race condition. This
> > + *?????method is executed for all kinds of suspend transitions and is followed
> > + *?????by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().
>
> This could be understood so that disconnect() cannot be called.

At what time exactly?

> > + *?????The PM core executes @prepare() for all devices before starting to
> > + *?????execute suspend callbacks for any of them, so drivers may assume all of
> > + *?????the other devices to be present and functional while @prepare() is being
> > + *?????executed. ?However, they may NOT assume anything about the availability
> > + *?????of the user space at that time.
>
> Probably it should be mentioned that this is the time to allocate memory
> if you have to.

Well, not exactly. Afterwards you cannot use GFP_KERNEL allocations, but
GFP_NOIO should still work, although for hibernation it's quite possible that
they'll fail if substantial amounts of memory are requested.

> And it is too late to request firmware.

Yes.

Thanks,
Rafael

2008-03-25 13:37:32

by Oliver Neukum

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

Am Dienstag, 25. M?rz 2008 13:40:53 schrieb Rafael J. Wysocki:
> On Tuesday, 25 of March 2008, Oliver Neukum wrote:
> > Am Montag 24 M?rz 2008 schrieb Rafael J. Wysocki:

> > > + *?????after @prepare() returns. ?If @prepare() detects a situation it cannot
> > > + *?????handle (e.g. registration of a child already in progress), it may return
> > > + *?????-EAGAIN, so that the PM core can execute it once again (e.g. after the
> > > + *?????new child has been registered) to recover from the race condition. This
> > > + *?????method is executed for all kinds of suspend transitions and is followed
> > > + *?????by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().
> >
> > This could be understood so that disconnect() cannot be called.
>
> At what time exactly?

I see no locking that would would prevent disconnect() in the window between
prepare() and suspend()/...

> > > + *?????The PM core executes @prepare() for all devices before starting to
> > > + *?????execute suspend callbacks for any of them, so drivers may assume all of
> > > + *?????the other devices to be present and functional while @prepare() is being
> > > + *?????executed. ?However, they may NOT assume anything about the availability
> > > + *?????of the user space at that time.
> >
> > Probably it should be mentioned that this is the time to allocate memory
> > if you have to.
>
> Well, not exactly. Afterwards you cannot use GFP_KERNEL allocations, but
> GFP_NOIO should still work, although for hibernation it's quite possible that
> they'll fail if substantial amounts of memory are requested.
>
> > And it is too late to request firmware.
>
> Yes.

This is better documented explicitly.

Regards
Oliver

2008-03-25 14:24:18

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tue, 25 Mar 2008, Oliver Neukum wrote:

> Am Dienstag, 25. M?rz 2008 13:40:53 schrieb Rafael J. Wysocki:
> > On Tuesday, 25 of March 2008, Oliver Neukum wrote:
> > > Am Montag 24 M?rz 2008 schrieb Rafael J. Wysocki:
>
> > > > + *?????after @prepare() returns. ?If @prepare() detects a situation it cannot
> > > > + *?????handle (e.g. registration of a child already in progress), it may return
> > > > + *?????-EAGAIN, so that the PM core can execute it once again (e.g. after the
> > > > + *?????new child has been registered) to recover from the race condition. This
> > > > + *?????method is executed for all kinds of suspend transitions and is followed
> > > > + *?????by one of the suspend callbacks: @suspend(), @freeze(), or @poweroff().
> > >
> > > This could be understood so that disconnect() cannot be called.
> >
> > At what time exactly?
>
> I see no locking that would would prevent disconnect() in the window between
> prepare() and suspend()/...

There is no such locking. It's perfectly legal for a device to be
unregistered between prepare() and suspend(). I suppose it wouldn't
hurt to add a general comment explaining that a device can be
unregistered at any time except when one of its methods is running.

Alan Stern

2008-03-25 14:29:45

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tue, 25 Mar 2008, Rafael J. Wysocki wrote:

> On Tuesday, 25 of March 2008, Oliver Neukum wrote:
> > Am Montag 24 M?rz 2008 schrieb Rafael J. Wysocki:
> > > -/*
> > > +/**
> > > + * struct pm_ops - device PM callbacks
> > > + *
> > > ? * Several driver power state transitions are externally visible, affecting
> > > ? * the state of pending I/O queues and (for drivers that touch hardware)
> > > ? * interrupts, wakeups, DMA, and other hardware state. ?There may also be
> > > @@ -122,6 +124,254 @@ typedef struct pm_message {
> > > ? * to the rest of the driver stack (such as a driver that's ON gating off
> > > ? * clocks which are not in active use).
> > > ? *
> > > + * The externally visible transitions are handled with the help of the following
> > > + * callbacks included in this structure:
> > > + *
> > > + * @prepare: Prepare the device for the upcoming transition, but do NOT change
> > > + *?????its hardware state. ?Prevent new children of the device from being
> > > + *?????registered and prevent new calls to the probe method from being made
> >
> > How is a driver supposed to prevent calls to probe()? You can block in probe()
> > or you can fail it, but how do you prevent it from being called? User space
> > can trigger probe via sysfs.
>
> I overlooked that.

The driver isn't supposed to prevent calls to its own probe(). The
comment means that the subsystem -- or the rest of the kernel generally
-- is supposed to avoid binding a driver to the device (thereby calling
the probe routine), assuming the device isn't already bound.

I don't expect this sort of thing to be very common. Mostly it happens
when new kernel modules are loaded and new drivers are registered; we
will have to block module loading during a sleep transition. It also
happens when the user writes to a driver's "bind" attribute in sysfs;
the code there will have to block during a sleep transition.

Alan Stern

2008-03-25 15:06:55

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:

> > Can we also have a DPM_PREPARING state, set when ->prepare() is about
> > to be called? The PM core wouldn't make use of it but some drivers
> > would. (I can't think of any use at all for the analogous
> > DPM_COMPLETING state, however.)
>
> Hmm. dev->power.status is protected by dpm_list_mtx. Do you think it would be
> useful to have an accessor function for reading it under the lock?

I don't think so. What I have in mind is situations where there
accessed has already been synchronized by other means, while the
prepare() method is running. For example:

Task 0 Task 1
------ ------
->prepare() is called
Waits for currently-running
registration in task 1
to finish
Does other stuff
Receives a request to register
a new child under dev
Sees that dev->power.state is
still DPM_ON, so goes ahead
with the child's registration
->prepare() returns
dev->power.state is set to
DPM_SUSPENDING
device_pm_add() checks
dev->power.state and fails
the registration

If dev->power.state had been set to DPM_PREPARING before ->prepare()
was called, then task 1 would have avoided trying to register the
child.

> > > + dev->power.status = DPM_RESUMING;
> > > + get_device(dev);
> > > + mutex_unlock(&dpm_list_mtx);
> > > +
> > > + resume_device(dev, state);
> > > +
> > > + mutex_lock(&dpm_list_mtx);
> > > + put_device(dev);
> > > + }
> > > + if (!list_empty(&dev->power.entry))
> > > + list_move_tail(&dev->power.entry, &list);
> >
> > A little problem here: You refer to dev after calling put_device().
>
> The device can't be removed at this point, because we hold dpm_list_mtx, which
> is needed by device_del().

True, it can't be removed at this point. But it _can_ be removed
between the calls to resume_device() and mutex_lock().

> > > }
> > > - if (!error)
> > > - all_sleeping = true;
> > > + list_splice(&list, &dpm_list);
> >
> > Instead you could eliminate the list_splice_init() above and put here:
> >
> > list_splice(&list, dpm_list->prev);
> >
> > This will move the entries from list to the end of dpm_list.
>
> dpm_list may be empty at this point. Wouldn't that cause any trouble?

It will still work correctly. If dpm_list is empty then dpm_list->prev
is equal to &dpm_list, so it will do the same thing as your current
code does.


I just thought of another problem. At the point where
local_irq_disable() is called, in between device_suspend() and
device_power_down(), it is possible in a preemptible kernel that
another task is holding dpm_list_mtx and is in the middle of updating
the list pointers. This would mess up the traversal in
device_power_down().

I'm not sure about the best way to prevent this. Is it legal to call
unlock_mutex() while interrupts or preemption are disabled?

Alan Stern

2008-03-25 15:12:39

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:

> The patch below should address all of your comments above. It also seems to
> work.
>
> If it looks good to you, please let me know and I'll repost it in a new thread
> along with the platform and PCI patches adapted to it.

I don't see anything else to change. ACK.

Alan Stern

2008-03-25 20:35:42

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tuesday, 25 of March 2008, Alan Stern wrote:
> On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:
>
> > > Can we also have a DPM_PREPARING state, set when ->prepare() is about
> > > to be called? The PM core wouldn't make use of it but some drivers
> > > would. (I can't think of any use at all for the analogous
> > > DPM_COMPLETING state, however.)
> >
> > Hmm. dev->power.status is protected by dpm_list_mtx. Do you think it would be
> > useful to have an accessor function for reading it under the lock?
>
> I don't think so. What I have in mind is situations where there
> accessed has already been synchronized by other means, while the
> prepare() method is running. For example:
>
> Task 0 Task 1
> ------ ------
> ->prepare() is called
> Waits for currently-running
> registration in task 1
> to finish
> Does other stuff
> Receives a request to register
> a new child under dev
> Sees that dev->power.state is
> still DPM_ON, so goes ahead
> with the child's registration
> ->prepare() returns
> dev->power.state is set to
> DPM_SUSPENDING
> device_pm_add() checks
> dev->power.state and fails
> the registration
>
> If dev->power.state had been set to DPM_PREPARING before ->prepare()
> was called, then task 1 would have avoided trying to register the
> child.
>
> > > > + dev->power.status = DPM_RESUMING;
> > > > + get_device(dev);
> > > > + mutex_unlock(&dpm_list_mtx);
> > > > +
> > > > + resume_device(dev, state);
> > > > +
> > > > + mutex_lock(&dpm_list_mtx);
> > > > + put_device(dev);
> > > > + }
> > > > + if (!list_empty(&dev->power.entry))
> > > > + list_move_tail(&dev->power.entry, &list);
> > >
> > > A little problem here: You refer to dev after calling put_device().
> >
> > The device can't be removed at this point, because we hold dpm_list_mtx, which
> > is needed by device_del().
>
> True, it can't be removed at this point. But it _can_ be removed
> between the calls to resume_device() and mutex_lock().
>
> > > > }
> > > > - if (!error)
> > > > - all_sleeping = true;
> > > > + list_splice(&list, &dpm_list);
> > >
> > > Instead you could eliminate the list_splice_init() above and put here:
> > >
> > > list_splice(&list, dpm_list->prev);
> > >
> > > This will move the entries from list to the end of dpm_list.
> >
> > dpm_list may be empty at this point. Wouldn't that cause any trouble?
>
> It will still work correctly. If dpm_list is empty then dpm_list->prev
> is equal to &dpm_list, so it will do the same thing as your current
> code does.
>
>
> I just thought of another problem. At the point where
> local_irq_disable() is called, in between device_suspend() and
> device_power_down(), it is possible in a preemptible kernel that
> another task is holding dpm_list_mtx and is in the middle of updating
> the list pointers. This would mess up the traversal in
> device_power_down().
>
> I'm not sure about the best way to prevent this. Is it legal to call
> unlock_mutex() while interrupts or preemption are disabled?

Well, I think it is, but I'm not sure how that can help.

To prevent the race from happening, we can lock dpm_list_mtx before switching
interrupts off in kernel/power/main.c:suspend_enter() and analogously in
kernel/power/disk.c .

Thanks,
Rafael

2008-03-25 20:39:29

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tuesday, 25 of March 2008, Alan Stern wrote:
> On Mon, 24 Mar 2008, Rafael J. Wysocki wrote:
>
> > The patch below should address all of your comments above. It also seems to
> > work.
> >
> > If it looks good to you, please let me know and I'll repost it in a new thread
> > along with the platform and PCI patches adapted to it.
>
> I don't see anything else to change. ACK.

Thanks.

I'd like to fix the problem with dpm_list_mtx vs the disabling of interrupts
you pointed out in the other message, though.

Rafael

2008-03-26 14:03:51

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Tue, 25 Mar 2008, Rafael J. Wysocki wrote:

> > I just thought of another problem. At the point where
> > local_irq_disable() is called, in between device_suspend() and
> > device_power_down(), it is possible in a preemptible kernel that
> > another task is holding dpm_list_mtx and is in the middle of updating
> > the list pointers. This would mess up the traversal in
> > device_power_down().
> >
> > I'm not sure about the best way to prevent this. Is it legal to call
> > unlock_mutex() while interrupts or preemption are disabled?
>
> Well, I think it is, but I'm not sure how that can help.
>
> To prevent the race from happening, we can lock dpm_list_mtx before switching
> interrupts off in kernel/power/main.c:suspend_enter() and analogously in
> kernel/power/disk.c .

That's right. And once interrupts are turned off you should unlock
dpm_list_mtx again, in case a noirq method wants to unregister a
device. Hence my question: Is it legal to call unlock_mutex() while
interrupts are disabled?

Alan Stern

2008-03-26 20:28:12

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Wednesday, 26 of March 2008, Alan Stern wrote:
> On Tue, 25 Mar 2008, Rafael J. Wysocki wrote:
>
> > > I just thought of another problem. At the point where
> > > local_irq_disable() is called, in between device_suspend() and
> > > device_power_down(), it is possible in a preemptible kernel that
> > > another task is holding dpm_list_mtx and is in the middle of updating
> > > the list pointers. This would mess up the traversal in
> > > device_power_down().
> > >
> > > I'm not sure about the best way to prevent this. Is it legal to call
> > > unlock_mutex() while interrupts or preemption are disabled?
> >
> > Well, I think it is, but I'm not sure how that can help.
> >
> > To prevent the race from happening, we can lock dpm_list_mtx before switching
> > interrupts off in kernel/power/main.c:suspend_enter() and analogously in
> > kernel/power/disk.c .
>
> That's right. And once interrupts are turned off you should unlock
> dpm_list_mtx again, in case a noirq method wants to unregister a
> device.

Why would a noirq method want to do that? IMO, it's not a big deal if noirq
methods are not allowed to unregister devices.

> Hence my question: Is it legal to call unlock_mutex() while interrupts are
> disabled?

Well, I suspect that will confuse lockdep quite a bit. Otherwise, I don't see
a problem with it (it's just changing the value of a shared variable after
all).

Thanks,
Rafael

2008-03-26 20:36:53

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Wed, 26 Mar 2008, Rafael J. Wysocki wrote:

> On Wednesday, 26 of March 2008, Alan Stern wrote:
> > On Tue, 25 Mar 2008, Rafael J. Wysocki wrote:
> >
> > > > I just thought of another problem. At the point where
> > > > local_irq_disable() is called, in between device_suspend() and
> > > > device_power_down(), it is possible in a preemptible kernel that
> > > > another task is holding dpm_list_mtx and is in the middle of updating
> > > > the list pointers. This would mess up the traversal in
> > > > device_power_down().
> > > >
> > > > I'm not sure about the best way to prevent this. Is it legal to call
> > > > unlock_mutex() while interrupts or preemption are disabled?
> > >
> > > Well, I think it is, but I'm not sure how that can help.
> > >
> > > To prevent the race from happening, we can lock dpm_list_mtx before switching
> > > interrupts off in kernel/power/main.c:suspend_enter() and analogously in
> > > kernel/power/disk.c .
> >
> > That's right. And once interrupts are turned off you should unlock
> > dpm_list_mtx again, in case a noirq method wants to unregister a
> > device.
>
> Why would a noirq method want to do that? IMO, it's not a big deal if noirq
> methods are not allowed to unregister devices.

Okay, that's fine. It keeps things simple.

> > Hence my question: Is it legal to call unlock_mutex() while interrupts are
> > disabled?
>
> Well, I suspect that will confuse lockdep quite a bit. Otherwise, I don't see
> a problem with it (it's just changing the value of a shared variable after
> all).

Then you have your answer. Perhaps have device_suspend() exit with the
mutex held and have device_resume() release it (with appropriate
handling for error situations, of course).

Alan Stern

2008-03-26 20:54:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Introduce new top level suspend and hibernation callbacks (rev. 3)

On Wednesday, 26 of March 2008, Alan Stern wrote:
> On Wed, 26 Mar 2008, Rafael J. Wysocki wrote:
>
> > On Wednesday, 26 of March 2008, Alan Stern wrote:
> > > On Tue, 25 Mar 2008, Rafael J. Wysocki wrote:
> > >
> > > > > I just thought of another problem. At the point where
> > > > > local_irq_disable() is called, in between device_suspend() and
> > > > > device_power_down(), it is possible in a preemptible kernel that
> > > > > another task is holding dpm_list_mtx and is in the middle of updating
> > > > > the list pointers. This would mess up the traversal in
> > > > > device_power_down().
> > > > >
> > > > > I'm not sure about the best way to prevent this. Is it legal to call
> > > > > unlock_mutex() while interrupts or preemption are disabled?
> > > >
> > > > Well, I think it is, but I'm not sure how that can help.
> > > >
> > > > To prevent the race from happening, we can lock dpm_list_mtx before switching
> > > > interrupts off in kernel/power/main.c:suspend_enter() and analogously in
> > > > kernel/power/disk.c .
> > >
> > > That's right. And once interrupts are turned off you should unlock
> > > dpm_list_mtx again, in case a noirq method wants to unregister a
> > > device.
> >
> > Why would a noirq method want to do that? IMO, it's not a big deal if noirq
> > methods are not allowed to unregister devices.
>
> Okay, that's fine. It keeps things simple.
>
> > > Hence my question: Is it legal to call unlock_mutex() while interrupts are
> > > disabled?
> >
> > Well, I suspect that will confuse lockdep quite a bit. Otherwise, I don't see
> > a problem with it (it's just changing the value of a shared variable after
> > all).
>
> Then you have your answer. Perhaps have device_suspend() exit with the
> mutex held and have device_resume() release it (with appropriate
> handling for error situations, of course).

That wouldn't work, because enable_nonboot_cpus() is called before
device_resume() and the notifiers in there may want to unregister devices
if some CPUs fail to go online.

I added two accessor functions device_pm_lock() and device_pm_unlock()
to be called just prior to disabling interrupts and right after enabling them,
respectively, in the higher-level PM core (ie. kernel/power/main(disk).c).

Thanks,
Rafael