2010-06-26 13:16:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH] PM: Make it possible to avoid wakeup events from being lost

From: Rafael J. Wysocki <[email protected]>

One of the arguments during the suspend blockers discussion was that
the mainline kernel didn't contain any mechanisms making it possible
to avoid losing wakeup events during system suspend.

Generally, there are two problems in that area. First, if a wakeup
event occurs exactly when /sys/power/state is being written to, it
may be delivered to user space right before the freezer kicks in, so
the user space consumer of the event may not be able to process it
before the system is suspended. Second, if a wakeup event occurs
after user space has been frozen, it is not generally guaranteed that
the ongoing transition of the system into a sleep state will be
aborted.

To address these issues introduce a new global sysfs attribute,
/sys/power/wakeup_count, associated with a running counter of wakeup
events and three helper functions, pm_stay_awake(), pm_relax(), and
pm_wakeup_event(), that may be used by kernel subsystems to control
the behavior of this attribute and to request the PM core to abort
system transitions into a sleep state already in progress.

The /sys/power/wakeup_count file may be read from or written to by
user space. Reads will always succeed (unless interrupted by a
signal) and return the current value of the wakeup events counter.
Writes, however, will only succeed if the written number is equal to
the current value of the wakeup events counter. If a write is
successful, it will cause the kernel to save the current value of the
wakeup events counter and to abort the subsequent system transition
into a sleep state if any wakeup events are reported after the write
has returned.

[The assumption is that before writing to /sys/power/state user space
will first read from /sys/power/wakeup_count. Next, user space
consumers of wakeup events will have a chance to acknowledge or
veto the upcoming system transition to a sleep state. Finally, if
the transition is allowed to proceed, /sys/power/wakeup_count will
be written to and if that succeeds, /sys/power/state will be written
to as well. Still, if any wakeup events are reported to the PM core
by kernel subsystems after that point, the transition will be
aborted.]

Additionally, put a wakeup events counter into struct dev_pm_info and
make these per-device wakeup event counters available via sysfs,
so that it's possible to check the activity of various wakeup event
sources within the kernel.

To illustrate how subsystems can use pm_wakeup_event(), make the
low-level PCI runtime PM wakeup-handling code use it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Documentation/ABI/testing/sysfs-power | 14 +
drivers/base/power/Makefile | 2
drivers/base/power/main.c | 1
drivers/base/power/sysfs.c | 11 +
drivers/base/power/wakeup.c | 240 ++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 1
drivers/pci/pci.c | 20 ++
drivers/pci/pci.h | 1
drivers/pci/pcie/pme/pcie_pme.c | 5
include/linux/pm.h | 10 +
include/linux/suspend.h | 7
kernel/power/hibernate.c | 20 +-
kernel/power/main.c | 53 +++++++
kernel/power/suspend.c | 4
14 files changed, 379 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,58 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+/*
+ * The 'wakeup_count' attribute, along with the functions defined in
+ * drivers/base/power/wakeup.c, provides a means by which wakeup events can be
+ * handled in a non-racy way.
+ *
+ * If a wakeup event occurs when the system is in a sleep state, it simply is
+ * woken up. In turn, if an event that would wake the system up from a sleep
+ * state occurs when it is undergoing a transition to that sleep state, the
+ * transition should be aborted. Moreover, if such an event occurs when the
+ * system is in the working state, an attempt to start a transition to the
+ * given sleep state should fail during certain period after the detection of
+ * the event. Using the 'state' attribute alone is not sufficient to satisfy
+ * these requirements, because a wakeup event may occur exactly when 'state'
+ * is being written to and may be delivered to user space right before it is
+ * frozen, so the event will remain only partially processed until the system is
+ * woken up by another event. In particular, it won't cause the transition to
+ * a sleep state to be aborted.
+ *
+ * This difficulty may be overcome if user space uses 'wakeup_count' before
+ * writing to 'state'. It first should read from 'wakeup_count' and store
+ * the read value. Then, after carrying out its own preparations for the system
+ * transition to a sleep state, it should write the stored value to
+ * 'wakeup_count'. If that fails, at least one wakeup event has occured since
+ * 'wakeup_count' was read and 'state' should not be written to. Otherwise, it
+ * is allowed to write to 'state', but the transition will be aborted if there
+ * are any wakeup events detected after 'wakeup_count' was written to.
+ */
+
+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ unsigned long val;
+
+ return pm_get_wakeup_count(&val) ? sprintf(buf, "%lu\n", val) : -EINTR;
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +288,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,240 @@
+/*
+ * drivers/base/power/wakeup.c - System wakeup events framework
+ *
+ * Copyright (c) 2010 Rafael J. Wysocki <[email protected]>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/capability.h>
+#include <linux/suspend.h>
+#include <linux/pm.h>
+
+/*
+ * If set, the suspend/hibernate code will abort transitions to a sleep state
+ * if wakeup events are registered during or immediately before the transition.
+ */
+bool events_check_enabled;
+
+/* The counter of registered wakeup events. */
+static unsigned long event_count;
+/* A preserved old value of event_count. */
+static unsigned long saved_event_count;
+/* The counter of wakeup events being processed. */
+static unsigned long events_in_progress;
+
+static DEFINE_SPINLOCK(events_lock);
+static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
+
+/*
+ * The functions below use the observation that each wakeup event starts a
+ * period in which the system should not be suspended. The moment this period
+ * will end depends on how the wakeup event is going to be processed after being
+ * detected and all of the possible cases can be divided into two distinct
+ * groups.
+ *
+ * First, a wakeup event may be detected by the same functional unit that will
+ * carry out the entire processing of it and possibly will pass it to user space
+ * for further processing. In that case the functional unit that has detected
+ * the event may later "close" the "no suspend" period associated with it
+ * directly as soon as it has been dealt with. The pair of pm_stay_awake() and
+ * pm_relax(), balanced with each other, is supposed to be used in such
+ * situations.
+ *
+ * Second, a wakeup event may be detected by one functional unit and processed
+ * by another one. In that case the unit that has detected it cannot really
+ * "close" the "no suspend" period associated with it, unless it knows in
+ * advance what's going to happen to the event during processing. This
+ * knowledge, however, may not be available to it, so it can simply specify time
+ * to wait before the system can be suspended and pass it as the second
+ * argument of pm_wakeup_event().
+ */
+
+/**
+ * pm_stay_awake - Notify the PM core that a wakeup event is being processed.
+ * @dev: Device the wakeup event is related to.
+ *
+ * Notify the PM core of a wakeup event (signaled by @dev) by incrementing the
+ * counter of wakeup events being processed. If @dev is not NULL, the counter
+ * of wakeup events related to @dev is incremented too.
+ *
+ * Call this function after detecting of a wakeup event if pm_relax() is going
+ * to be called directly after processing the event (and possibly passing it to
+ * user space for further processing).
+ *
+ * It is safe to call this function from interrupt context.
+ */
+void pm_stay_awake(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ events_in_progress++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_relax - Notify the PM core that processing of a wakeup event has ended.
+ *
+ * Notify the PM core that a wakeup event has been processed by decrementing
+ * the counter of wakeup events being processed and incrementing the counter
+ * of registered wakeup events.
+ *
+ * Call this function for wakeup events whose processing started with calling
+ * pm_stay_awake().
+ *
+ * It is safe to call it from interrupt context.
+ */
+void pm_relax(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_in_progress) {
+ event_count++;
+ if (!--events_in_progress)
+ wake_up_all(&events_wait_queue);
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_wakeup_work_fn - Deferred closing of a wakeup event.
+ *
+ * Execute pm_relax() for a wakeup event detected in the past and free the
+ * work item object used for queuing up the work.
+ */
+static void pm_wakeup_work_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+
+ pm_relax();
+ kfree(dwork);
+}
+
+/**
+ * pm_wakeup_event - Notify the PM core of a wakeup event.
+ * @dev: Device the wakeup event is related to.
+ * @msec: Anticipated event processing time (in milliseconds).
+ *
+ * Notify the PM core of a wakeup event (signaled by @dev) that will take
+ * approximately @msec milliseconds to be processed by the kernel. Increment
+ * the counter of wakeup events being processed and queue up a work item
+ * that will execute pm_relax() for the event after @msec milliseconds. If @dev
+ * is not NULL, the counter of wakeup events related to @dev is incremented too.
+ *
+ * It is safe to call this function from interrupt context.
+ */
+void pm_wakeup_event(struct device *dev, unsigned int msec)
+{
+ unsigned long flags;
+ struct delayed_work *dwork;
+
+ dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ if (dwork) {
+ INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
+ schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+
+ events_in_progress++;
+ } else {
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_check_wakeup_events - Check for new wakeup events.
+ *
+ * Compare the current number of registered wakeup events with its preserved
+ * value from the past to check if new wakeup events have been registered since
+ * the old value was stored. Check if the current number of wakeup events being
+ * processed is zero.
+ */
+bool pm_check_wakeup_events(void)
+{
+ unsigned long flags;
+ bool ret = true;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_check_enabled) {
+ ret = (event_count == saved_event_count) && !events_in_progress;
+ events_check_enabled = ret;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+/**
+ * pm_get_wakeup_count - Read the number of registered wakeup events.
+ * @count: Address to store the value at.
+ *
+ * Store the number of registered wakeup events at the address in @count. Block
+ * if the current number of wakeup events being processed is nonzero.
+ *
+ * Return false if the wait for the number of wakeup events being processed to
+ * drop down to zero has been interrupted by a signal (and the current number
+ * of wakeup events being processed is still nonzero). Otherwise return true.
+ */
+bool pm_get_wakeup_count(unsigned long *count)
+{
+ bool ret;
+
+ spin_lock_irq(&events_lock);
+ if (capable(CAP_SYS_ADMIN))
+ events_check_enabled = false;
+
+ if (events_in_progress) {
+ DEFINE_WAIT(wait);
+
+ do {
+ prepare_to_wait(&events_wait_queue, &wait,
+ TASK_INTERRUPTIBLE);
+ if (!events_in_progress)
+ break;
+ spin_unlock_irq(&events_lock);
+
+ schedule();
+
+ spin_lock_irq(&events_lock);
+ } while (!signal_pending(current));
+ finish_wait(&events_wait_queue, &wait);
+ }
+ *count = event_count;
+ ret = !events_in_progress;
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
+
+/**
+ * pm_save_wakeup_count - Save the current number of registered wakeup events.
+ * @count: Value to compare with the current number of registered wakeup events.
+ *
+ * If @count is equal to the current number of registered wakeup events and the
+ * current number of wakeup events being processed is zero, store @count as the
+ * old number of registered wakeup events to be used by pm_check_wakeup_events()
+ * and return true. Otherwise return false.
+ */
+bool pm_save_wakeup_count(unsigned long count)
+{
+ bool ret = false;
+
+ spin_lock_irq(&events_lock);
+ if (count == event_count && !events_in_progress) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev, unsigned int msec);
+extern void pm_stay_awake(struct device *dev);
+extern void pm_relax(void);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev, unsigned int msec) {}
+static inline void pm_stay_awake(struct device *dev) {}
+static inline void pm_relax(void) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -172,8 +172,10 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
error = suspend_ops->enter(state);
+ events_check_enabled = false;
+ }
sysdev_resume();
}

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
goto Power_up;

in_suspend = 1;
@@ -288,8 +288,10 @@ static int create_image(int platform_mod
error);
/* Restore control flow magically appears here */
restore_processor_state();
- if (!in_suspend)
+ if (!in_suspend) {
+ events_check_enabled = false;
platform_leave(platform_mode);
+ }

Power_up:
sysdev_resume();
@@ -511,14 +513,20 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events()) {
+ error = -EAGAIN;
+ goto Power_up;
+ }
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,7 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ pci_wakeup_event(pci_dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -154,6 +154,7 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
ret = true;
}

@@ -254,8 +255,10 @@ static void pcie_pme_handle_request(stru
if (found) {
/* The device is there, but we have to check its PME status. */
found = pci_check_pme_status(dev);
- if (found)
+ if (found) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
pci_dev_put(dev);
} else if (devfn) {
/*
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -73,6 +73,8 @@
* device are known to the PM core. However, for some devices this
* attribute is set to "enabled" by bus type code or device drivers and in
* that cases it should be safe to leave the default value.
+ *
+ * wakeup_count - Report the number of wakeup events related to the device
*/

static const char enabled[] = "enabled";
@@ -144,6 +146,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", dev->power.wakeup_count);
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +240,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME
Index: linux-2.6/drivers/pci/pci.h
===================================================================
--- linux-2.6.orig/drivers/pci/pci.h
+++ linux-2.6/drivers/pci/pci.h
@@ -56,6 +56,7 @@ extern void pci_update_current_state(str
extern void pci_disable_enabled_device(struct pci_dev *dev);
extern bool pci_check_pme_status(struct pci_dev *dev);
extern int pci_finish_runtime_suspend(struct pci_dev *dev);
+extern void pci_wakeup_event(struct pci_dev *dev);
extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
extern void pci_pme_wakeup_bus(struct pci_bus *bus);
extern void pci_pm_init(struct pci_dev *dev);
Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -1275,6 +1275,22 @@ bool pci_check_pme_status(struct pci_dev
return ret;
}

+/*
+ * Time to wait before the system can be put into a sleep state after reporting
+ * a wakeup event signaled by a PCI device.
+ */
+#define PCI_WAKEUP_COOLDOWN 100
+
+/**
+ * pci_wakeup_event - Report a wakeup event related to a given PCI device.
+ * @dev: Device to report the wakeup event for.
+ */
+void pci_wakeup_event(struct pci_dev *dev)
+{
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
+}
+
/**
* pci_pme_wakeup - Wake up a PCI device if its PME Status bit is set.
* @dev: Device to handle.
@@ -1285,8 +1301,10 @@ bool pci_check_pme_status(struct pci_dev
*/
static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
{
- if (pci_check_pme_status(dev))
+ if (pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
return 0;
}

Index: linux-2.6/include/linux/suspend.h
===================================================================
--- linux-2.6.orig/include/linux/suspend.h
+++ linux-2.6/include/linux/suspend.h
@@ -295,6 +295,13 @@ extern int unregister_pm_notifier(struct
{ .notifier_call = fn, .priority = pri }; \
register_pm_notifier(&fn##_nb); \
}
+
+/* drivers/base/power/wakeup.c */
+extern bool events_check_enabled;
+
+extern bool pm_check_wakeup_events(void);
+extern bool pm_get_wakeup_count(unsigned long *count);
+extern bool pm_save_wakeup_count(unsigned long count);
#else /* !CONFIG_PM_SLEEP */

static inline int register_pm_notifier(struct notifier_block *nb)
Index: linux-2.6/Documentation/ABI/testing/sysfs-power
===================================================================
--- linux-2.6.orig/Documentation/ABI/testing/sysfs-power
+++ linux-2.6/Documentation/ABI/testing/sysfs-power
@@ -114,3 +114,17 @@ Description:
if this file contains "1", which is the default. It may be
disabled by writing "0" to this file, in which case all devices
will be suspended and resumed synchronously.
+
+What: /sys/power/wakeup_count
+Date: July 2010
+Contact: Rafael J. Wysocki <[email protected]>
+Description:
+ The /sys/power/wakeup_count file allows user space to avoid
+ losing wakeup events when transitioning the system into a sleep
+ state. Reading from it returns the current number of registered
+ wakeup events and it blocks if some wakeup events are being
+ processed at the time the file is read from. Writing to it
+ will only succeed if the current number of wakeup events is
+ equal to the written value and, if successful, will make the
+ kernel abort a subsequent transition to a sleep state if any
+ wakeup events are reported after the write has returned.


2010-06-27 15:50:09

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Sat, 26 Jun 2010, Rafael J. Wysocki wrote:

> +void pm_relax(void)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (events_in_progress) {
> + event_count++;
> + if (!--events_in_progress)
> + wake_up_all(&events_wait_queue);
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> +}

> +bool pm_get_wakeup_count(unsigned long *count)
> +{
> + bool ret;
> +
> + spin_lock_irq(&events_lock);
> + if (capable(CAP_SYS_ADMIN))
> + events_check_enabled = false;
> +
> + if (events_in_progress) {
> + DEFINE_WAIT(wait);
> +
> + do {
> + prepare_to_wait(&events_wait_queue, &wait,
> + TASK_INTERRUPTIBLE);
> + if (!events_in_progress)
> + break;
> + spin_unlock_irq(&events_lock);
> +
> + schedule();
> +
> + spin_lock_irq(&events_lock);
> + } while (!signal_pending(current));
> + finish_wait(&events_wait_queue, &wait);
> + }
> + *count = event_count;
> + ret = !events_in_progress;
> + spin_unlock_irq(&events_lock);
> + return ret;
> +}

Here's a thought. Presumably pm_relax() will end up getting called a
lot more often than pm_get_wakeup_count(). Instead of using a wait
queue, you could make pm_get_wakeup_count() poll at 100-ms intervals.
The total overhead would be smaller.

Here's another thought. If event_count and events_in_progress were
atomic_t then the new spinlock wouldn't be needed at all. (But you
would need an appropriate pair of memory barriers, to guarantee that
when a writer decrements events_in_progress to 0 and increments
event_count, a reader won't see events_in_progress == 0 without also
seeing the incremented event_count.) Overall, this may not be a
significant improvement.

Alan Stern

2010-06-27 22:27:47

by 640E9920

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

Looks good to me!

--mgross

On Sat, Jun 26, 2010 at 03:14:13PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> One of the arguments during the suspend blockers discussion was that
> the mainline kernel didn't contain any mechanisms making it possible
> to avoid losing wakeup events during system suspend.
>
> Generally, there are two problems in that area. First, if a wakeup
> event occurs exactly when /sys/power/state is being written to, it
> may be delivered to user space right before the freezer kicks in, so
> the user space consumer of the event may not be able to process it
> before the system is suspended. Second, if a wakeup event occurs
> after user space has been frozen, it is not generally guaranteed that
> the ongoing transition of the system into a sleep state will be
> aborted.
>
> To address these issues introduce a new global sysfs attribute,
> /sys/power/wakeup_count, associated with a running counter of wakeup
> events and three helper functions, pm_stay_awake(), pm_relax(), and
> pm_wakeup_event(), that may be used by kernel subsystems to control
> the behavior of this attribute and to request the PM core to abort
> system transitions into a sleep state already in progress.
>
> The /sys/power/wakeup_count file may be read from or written to by
> user space. Reads will always succeed (unless interrupted by a
> signal) and return the current value of the wakeup events counter.
> Writes, however, will only succeed if the written number is equal to
> the current value of the wakeup events counter. If a write is
> successful, it will cause the kernel to save the current value of the
> wakeup events counter and to abort the subsequent system transition
> into a sleep state if any wakeup events are reported after the write
> has returned.
>
> [The assumption is that before writing to /sys/power/state user space
> will first read from /sys/power/wakeup_count. Next, user space
> consumers of wakeup events will have a chance to acknowledge or
> veto the upcoming system transition to a sleep state. Finally, if
> the transition is allowed to proceed, /sys/power/wakeup_count will
> be written to and if that succeeds, /sys/power/state will be written
> to as well. Still, if any wakeup events are reported to the PM core
> by kernel subsystems after that point, the transition will be
> aborted.]
>
> Additionally, put a wakeup events counter into struct dev_pm_info and
> make these per-device wakeup event counters available via sysfs,
> so that it's possible to check the activity of various wakeup event
> sources within the kernel.
>
> To illustrate how subsystems can use pm_wakeup_event(), make the
> low-level PCI runtime PM wakeup-handling code use it.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-power | 14 +
> drivers/base/power/Makefile | 2
> drivers/base/power/main.c | 1
> drivers/base/power/sysfs.c | 11 +
> drivers/base/power/wakeup.c | 240 ++++++++++++++++++++++++++++++++++
> drivers/pci/pci-acpi.c | 1
> drivers/pci/pci.c | 20 ++
> drivers/pci/pci.h | 1
> drivers/pci/pcie/pme/pcie_pme.c | 5
> include/linux/pm.h | 10 +
> include/linux/suspend.h | 7
> kernel/power/hibernate.c | 20 +-
> kernel/power/main.c | 53 +++++++
> kernel/power/suspend.c | 4
> 14 files changed, 379 insertions(+), 10 deletions(-)
>
> Index: linux-2.6/kernel/power/main.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/main.c
> +++ linux-2.6/kernel/power/main.c
> @@ -204,6 +204,58 @@ static ssize_t state_store(struct kobjec
>
> power_attr(state);
>
> +/*
> + * The 'wakeup_count' attribute, along with the functions defined in
> + * drivers/base/power/wakeup.c, provides a means by which wakeup events can be
> + * handled in a non-racy way.
> + *
> + * If a wakeup event occurs when the system is in a sleep state, it simply is
> + * woken up. In turn, if an event that would wake the system up from a sleep
> + * state occurs when it is undergoing a transition to that sleep state, the
> + * transition should be aborted. Moreover, if such an event occurs when the
> + * system is in the working state, an attempt to start a transition to the
> + * given sleep state should fail during certain period after the detection of
> + * the event. Using the 'state' attribute alone is not sufficient to satisfy
> + * these requirements, because a wakeup event may occur exactly when 'state'
> + * is being written to and may be delivered to user space right before it is
> + * frozen, so the event will remain only partially processed until the system is
> + * woken up by another event. In particular, it won't cause the transition to
> + * a sleep state to be aborted.
> + *
> + * This difficulty may be overcome if user space uses 'wakeup_count' before
> + * writing to 'state'. It first should read from 'wakeup_count' and store
> + * the read value. Then, after carrying out its own preparations for the system
> + * transition to a sleep state, it should write the stored value to
> + * 'wakeup_count'. If that fails, at least one wakeup event has occured since
> + * 'wakeup_count' was read and 'state' should not be written to. Otherwise, it
> + * is allowed to write to 'state', but the transition will be aborted if there
> + * are any wakeup events detected after 'wakeup_count' was written to.
> + */
> +
> +static ssize_t wakeup_count_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + unsigned long val;
> +
> + return pm_get_wakeup_count(&val) ? sprintf(buf, "%lu\n", val) : -EINTR;
> +}
> +
> +static ssize_t wakeup_count_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t n)
> +{
> + unsigned long val;
> +
> + if (sscanf(buf, "%lu", &val) == 1) {
> + if (pm_save_wakeup_count(val))
> + return n;
> + }
> + return -EINVAL;
> +}
> +
> +power_attr(wakeup_count);
> +
> #ifdef CONFIG_PM_TRACE
> int pm_trace_enabled;
>
> @@ -236,6 +288,7 @@ static struct attribute * g[] = {
> #endif
> #ifdef CONFIG_PM_SLEEP
> &pm_async_attr.attr,
> + &wakeup_count_attr.attr,
> #ifdef CONFIG_PM_DEBUG
> &pm_test_attr.attr,
> #endif
> Index: linux-2.6/drivers/base/power/wakeup.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/drivers/base/power/wakeup.c
> @@ -0,0 +1,240 @@
> +/*
> + * drivers/base/power/wakeup.c - System wakeup events framework
> + *
> + * Copyright (c) 2010 Rafael J. Wysocki <[email protected]>, Novell Inc.
> + *
> + * This file is released under the GPLv2.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/capability.h>
> +#include <linux/suspend.h>
> +#include <linux/pm.h>
> +
> +/*
> + * If set, the suspend/hibernate code will abort transitions to a sleep state
> + * if wakeup events are registered during or immediately before the transition.
> + */
> +bool events_check_enabled;
> +
> +/* The counter of registered wakeup events. */
> +static unsigned long event_count;
> +/* A preserved old value of event_count. */
> +static unsigned long saved_event_count;
> +/* The counter of wakeup events being processed. */
> +static unsigned long events_in_progress;
> +
> +static DEFINE_SPINLOCK(events_lock);
> +static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
> +
> +/*
> + * The functions below use the observation that each wakeup event starts a
> + * period in which the system should not be suspended. The moment this period
> + * will end depends on how the wakeup event is going to be processed after being
> + * detected and all of the possible cases can be divided into two distinct
> + * groups.
> + *
> + * First, a wakeup event may be detected by the same functional unit that will
> + * carry out the entire processing of it and possibly will pass it to user space
> + * for further processing. In that case the functional unit that has detected
> + * the event may later "close" the "no suspend" period associated with it
> + * directly as soon as it has been dealt with. The pair of pm_stay_awake() and
> + * pm_relax(), balanced with each other, is supposed to be used in such
> + * situations.
> + *
> + * Second, a wakeup event may be detected by one functional unit and processed
> + * by another one. In that case the unit that has detected it cannot really
> + * "close" the "no suspend" period associated with it, unless it knows in
> + * advance what's going to happen to the event during processing. This
> + * knowledge, however, may not be available to it, so it can simply specify time
> + * to wait before the system can be suspended and pass it as the second
> + * argument of pm_wakeup_event().
> + */
> +
> +/**
> + * pm_stay_awake - Notify the PM core that a wakeup event is being processed.
> + * @dev: Device the wakeup event is related to.
> + *
> + * Notify the PM core of a wakeup event (signaled by @dev) by incrementing the
> + * counter of wakeup events being processed. If @dev is not NULL, the counter
> + * of wakeup events related to @dev is incremented too.
> + *
> + * Call this function after detecting of a wakeup event if pm_relax() is going
> + * to be called directly after processing the event (and possibly passing it to
> + * user space for further processing).
> + *
> + * It is safe to call this function from interrupt context.
> + */
> +void pm_stay_awake(struct device *dev)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (dev)
> + dev->power.wakeup_count++;
> +
> + events_in_progress++;
> + spin_unlock_irqrestore(&events_lock, flags);
> +}
> +
> +/**
> + * pm_relax - Notify the PM core that processing of a wakeup event has ended.
> + *
> + * Notify the PM core that a wakeup event has been processed by decrementing
> + * the counter of wakeup events being processed and incrementing the counter
> + * of registered wakeup events.
> + *
> + * Call this function for wakeup events whose processing started with calling
> + * pm_stay_awake().
> + *
> + * It is safe to call it from interrupt context.
> + */
> +void pm_relax(void)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (events_in_progress) {
> + event_count++;
> + if (!--events_in_progress)
> + wake_up_all(&events_wait_queue);
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> +}
> +
> +/**
> + * pm_wakeup_work_fn - Deferred closing of a wakeup event.
> + *
> + * Execute pm_relax() for a wakeup event detected in the past and free the
> + * work item object used for queuing up the work.
> + */
> +static void pm_wakeup_work_fn(struct work_struct *work)
> +{
> + struct delayed_work *dwork = to_delayed_work(work);
> +
> + pm_relax();
> + kfree(dwork);
> +}
> +
> +/**
> + * pm_wakeup_event - Notify the PM core of a wakeup event.
> + * @dev: Device the wakeup event is related to.
> + * @msec: Anticipated event processing time (in milliseconds).
> + *
> + * Notify the PM core of a wakeup event (signaled by @dev) that will take
> + * approximately @msec milliseconds to be processed by the kernel. Increment
> + * the counter of wakeup events being processed and queue up a work item
> + * that will execute pm_relax() for the event after @msec milliseconds. If @dev
> + * is not NULL, the counter of wakeup events related to @dev is incremented too.
> + *
> + * It is safe to call this function from interrupt context.
> + */
> +void pm_wakeup_event(struct device *dev, unsigned int msec)
> +{
> + unsigned long flags;
> + struct delayed_work *dwork;
> +
> + dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (dev)
> + dev->power.wakeup_count++;
> +
> + if (dwork) {
> + INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
> + schedule_delayed_work(dwork, msecs_to_jiffies(msec));
> +
> + events_in_progress++;
> + } else {
> + event_count++;
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> +}
> +
> +/**
> + * pm_check_wakeup_events - Check for new wakeup events.
> + *
> + * Compare the current number of registered wakeup events with its preserved
> + * value from the past to check if new wakeup events have been registered since
> + * the old value was stored. Check if the current number of wakeup events being
> + * processed is zero.
> + */
> +bool pm_check_wakeup_events(void)
> +{
> + unsigned long flags;
> + bool ret = true;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (events_check_enabled) {
> + ret = (event_count == saved_event_count) && !events_in_progress;
> + events_check_enabled = ret;
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> + return ret;
> +}
> +
> +/**
> + * pm_get_wakeup_count - Read the number of registered wakeup events.
> + * @count: Address to store the value at.
> + *
> + * Store the number of registered wakeup events at the address in @count. Block
> + * if the current number of wakeup events being processed is nonzero.
> + *
> + * Return false if the wait for the number of wakeup events being processed to
> + * drop down to zero has been interrupted by a signal (and the current number
> + * of wakeup events being processed is still nonzero). Otherwise return true.
> + */
> +bool pm_get_wakeup_count(unsigned long *count)
> +{
> + bool ret;
> +
> + spin_lock_irq(&events_lock);
> + if (capable(CAP_SYS_ADMIN))
> + events_check_enabled = false;
> +
> + if (events_in_progress) {
> + DEFINE_WAIT(wait);
> +
> + do {
> + prepare_to_wait(&events_wait_queue, &wait,
> + TASK_INTERRUPTIBLE);
> + if (!events_in_progress)
> + break;
> + spin_unlock_irq(&events_lock);
> +
> + schedule();
> +
> + spin_lock_irq(&events_lock);
> + } while (!signal_pending(current));
> + finish_wait(&events_wait_queue, &wait);
> + }
> + *count = event_count;
> + ret = !events_in_progress;
> + spin_unlock_irq(&events_lock);
> + return ret;
> +}
> +
> +/**
> + * pm_save_wakeup_count - Save the current number of registered wakeup events.
> + * @count: Value to compare with the current number of registered wakeup events.
> + *
> + * If @count is equal to the current number of registered wakeup events and the
> + * current number of wakeup events being processed is zero, store @count as the
> + * old number of registered wakeup events to be used by pm_check_wakeup_events()
> + * and return true. Otherwise return false.
> + */
> +bool pm_save_wakeup_count(unsigned long count)
> +{
> + bool ret = false;
> +
> + spin_lock_irq(&events_lock);
> + if (count == event_count && !events_in_progress) {
> + saved_event_count = count;
> + events_check_enabled = true;
> + ret = true;
> + }
> + spin_unlock_irq(&events_lock);
> + return ret;
> +}
> Index: linux-2.6/include/linux/pm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pm.h
> +++ linux-2.6/include/linux/pm.h
> @@ -457,6 +457,7 @@ struct dev_pm_info {
> #ifdef CONFIG_PM_SLEEP
> struct list_head entry;
> struct completion completion;
> + unsigned long wakeup_count;
> #endif
> #ifdef CONFIG_PM_RUNTIME
> struct timer_list suspend_timer;
> @@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
> } while (0)
>
> extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
> +
> +/* drivers/base/power/wakeup.c */
> +extern void pm_wakeup_event(struct device *dev, unsigned int msec);
> +extern void pm_stay_awake(struct device *dev);
> +extern void pm_relax(void);
> #else /* !CONFIG_PM_SLEEP */
>
> #define device_pm_lock() do {} while (0)
> @@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
> #define suspend_report_result(fn, ret) do {} while (0)
>
> static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
> +
> +static inline void pm_wakeup_event(struct device *dev, unsigned int msec) {}
> +static inline void pm_stay_awake(struct device *dev) {}
> +static inline void pm_relax(void) {}
> #endif /* !CONFIG_PM_SLEEP */
>
> /* How to reorder dpm_list after device_move() */
> Index: linux-2.6/drivers/base/power/Makefile
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/Makefile
> +++ linux-2.6/drivers/base/power/Makefile
> @@ -1,5 +1,5 @@
> obj-$(CONFIG_PM) += sysfs.o
> -obj-$(CONFIG_PM_SLEEP) += main.o
> +obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
> obj-$(CONFIG_PM_RUNTIME) += runtime.o
> obj-$(CONFIG_PM_OPS) += generic_ops.o
> obj-$(CONFIG_PM_TRACE_RTC) += trace.o
> Index: linux-2.6/drivers/base/power/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/main.c
> +++ linux-2.6/drivers/base/power/main.c
> @@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
> {
> dev->power.status = DPM_ON;
> init_completion(&dev->power.completion);
> + dev->power.wakeup_count = 0;
> pm_runtime_init(dev);
> }
>
> Index: linux-2.6/kernel/power/suspend.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/suspend.c
> +++ linux-2.6/kernel/power/suspend.c
> @@ -172,8 +172,10 @@ static int suspend_enter(suspend_state_t
>
> error = sysdev_suspend(PMSG_SUSPEND);
> if (!error) {
> - if (!suspend_test(TEST_CORE))
> + if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
> error = suspend_ops->enter(state);
> + events_check_enabled = false;
> + }
> sysdev_resume();
> }
>
> Index: linux-2.6/kernel/power/hibernate.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/hibernate.c
> +++ linux-2.6/kernel/power/hibernate.c
> @@ -277,7 +277,7 @@ static int create_image(int platform_mod
> goto Enable_irqs;
> }
>
> - if (hibernation_test(TEST_CORE))
> + if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
> goto Power_up;
>
> in_suspend = 1;
> @@ -288,8 +288,10 @@ static int create_image(int platform_mod
> error);
> /* Restore control flow magically appears here */
> restore_processor_state();
> - if (!in_suspend)
> + if (!in_suspend) {
> + events_check_enabled = false;
> platform_leave(platform_mode);
> + }
>
> Power_up:
> sysdev_resume();
> @@ -511,14 +513,20 @@ int hibernation_platform_enter(void)
>
> local_irq_disable();
> sysdev_suspend(PMSG_HIBERNATE);
> + if (!pm_check_wakeup_events()) {
> + error = -EAGAIN;
> + goto Power_up;
> + }
> +
> hibernation_ops->enter();
> /* We should never get here */
> while (1);
>
> - /*
> - * We don't need to reenable the nonboot CPUs or resume consoles, since
> - * the system is going to be halted anyway.
> - */
> + Power_up:
> + sysdev_resume();
> + local_irq_enable();
> + enable_nonboot_cpus();
> +
> Platform_finish:
> hibernation_ops->finish();
>
> Index: linux-2.6/drivers/pci/pci-acpi.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pci-acpi.c
> +++ linux-2.6/drivers/pci/pci-acpi.c
> @@ -48,6 +48,7 @@ static void pci_acpi_wake_dev(acpi_handl
> if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
> pci_check_pme_status(pci_dev);
> pm_runtime_resume(&pci_dev->dev);
> + pci_wakeup_event(pci_dev);
> if (pci_dev->subordinate)
> pci_pme_wakeup_bus(pci_dev->subordinate);
> }
> Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
> +++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
> @@ -154,6 +154,7 @@ static bool pcie_pme_walk_bus(struct pci
> /* Skip PCIe devices in case we started from a root port. */
> if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
> pm_request_resume(&dev->dev);
> + pci_wakeup_event(dev);
> ret = true;
> }
>
> @@ -254,8 +255,10 @@ static void pcie_pme_handle_request(stru
> if (found) {
> /* The device is there, but we have to check its PME status. */
> found = pci_check_pme_status(dev);
> - if (found)
> + if (found) {
> pm_request_resume(&dev->dev);
> + pci_wakeup_event(dev);
> + }
> pci_dev_put(dev);
> } else if (devfn) {
> /*
> Index: linux-2.6/drivers/base/power/sysfs.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/sysfs.c
> +++ linux-2.6/drivers/base/power/sysfs.c
> @@ -73,6 +73,8 @@
> * device are known to the PM core. However, for some devices this
> * attribute is set to "enabled" by bus type code or device drivers and in
> * that cases it should be safe to leave the default value.
> + *
> + * wakeup_count - Report the number of wakeup events related to the device
> */
>
> static const char enabled[] = "enabled";
> @@ -144,6 +146,14 @@ wake_store(struct device * dev, struct d
>
> static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);
>
> +static ssize_t wakeup_count_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", dev->power.wakeup_count);
> +}
> +
> +static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
> +
> #ifdef CONFIG_PM_ADVANCED_DEBUG
> #ifdef CONFIG_PM_RUNTIME
>
> @@ -230,6 +240,7 @@ static struct attribute * power_attrs[]
> &dev_attr_control.attr,
> #endif
> &dev_attr_wakeup.attr,
> + &dev_attr_wakeup_count.attr,
> #ifdef CONFIG_PM_ADVANCED_DEBUG
> &dev_attr_async.attr,
> #ifdef CONFIG_PM_RUNTIME
> Index: linux-2.6/drivers/pci/pci.h
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pci.h
> +++ linux-2.6/drivers/pci/pci.h
> @@ -56,6 +56,7 @@ extern void pci_update_current_state(str
> extern void pci_disable_enabled_device(struct pci_dev *dev);
> extern bool pci_check_pme_status(struct pci_dev *dev);
> extern int pci_finish_runtime_suspend(struct pci_dev *dev);
> +extern void pci_wakeup_event(struct pci_dev *dev);
> extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
> extern void pci_pme_wakeup_bus(struct pci_bus *bus);
> extern void pci_pm_init(struct pci_dev *dev);
> Index: linux-2.6/drivers/pci/pci.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pci.c
> +++ linux-2.6/drivers/pci/pci.c
> @@ -1275,6 +1275,22 @@ bool pci_check_pme_status(struct pci_dev
> return ret;
> }
>
> +/*
> + * Time to wait before the system can be put into a sleep state after reporting
> + * a wakeup event signaled by a PCI device.
> + */
> +#define PCI_WAKEUP_COOLDOWN 100
> +
> +/**
> + * pci_wakeup_event - Report a wakeup event related to a given PCI device.
> + * @dev: Device to report the wakeup event for.
> + */
> +void pci_wakeup_event(struct pci_dev *dev)
> +{
> + if (device_may_wakeup(&dev->dev))
> + pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
> +}
> +
> /**
> * pci_pme_wakeup - Wake up a PCI device if its PME Status bit is set.
> * @dev: Device to handle.
> @@ -1285,8 +1301,10 @@ bool pci_check_pme_status(struct pci_dev
> */
> static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
> {
> - if (pci_check_pme_status(dev))
> + if (pci_check_pme_status(dev)) {
> pm_request_resume(&dev->dev);
> + pci_wakeup_event(dev);
> + }
> return 0;
> }
>
> Index: linux-2.6/include/linux/suspend.h
> ===================================================================
> --- linux-2.6.orig/include/linux/suspend.h
> +++ linux-2.6/include/linux/suspend.h
> @@ -295,6 +295,13 @@ extern int unregister_pm_notifier(struct
> { .notifier_call = fn, .priority = pri }; \
> register_pm_notifier(&fn##_nb); \
> }
> +
> +/* drivers/base/power/wakeup.c */
> +extern bool events_check_enabled;
> +
> +extern bool pm_check_wakeup_events(void);
> +extern bool pm_get_wakeup_count(unsigned long *count);
> +extern bool pm_save_wakeup_count(unsigned long count);
> #else /* !CONFIG_PM_SLEEP */
>
> static inline int register_pm_notifier(struct notifier_block *nb)
> Index: linux-2.6/Documentation/ABI/testing/sysfs-power
> ===================================================================
> --- linux-2.6.orig/Documentation/ABI/testing/sysfs-power
> +++ linux-2.6/Documentation/ABI/testing/sysfs-power
> @@ -114,3 +114,17 @@ Description:
> if this file contains "1", which is the default. It may be
> disabled by writing "0" to this file, in which case all devices
> will be suspended and resumed synchronously.
> +
> +What: /sys/power/wakeup_count
> +Date: July 2010
> +Contact: Rafael J. Wysocki <[email protected]>
> +Description:
> + The /sys/power/wakeup_count file allows user space to avoid
> + losing wakeup events when transitioning the system into a sleep
> + state. Reading from it returns the current number of registered
> + wakeup events and it blocks if some wakeup events are being
> + processed at the time the file is read from. Writing to it
> + will only succeed if the current number of wakeup events is
> + equal to the written value and, if successful, will make the
> + kernel abort a subsequent transition to a sleep state if any
> + wakeup events are reported after the write has returned.

2010-06-28 00:01:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Sunday, June 27, 2010, Alan Stern wrote:
> On Sat, 26 Jun 2010, Rafael J. Wysocki wrote:
>
> > +void pm_relax(void)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&events_lock, flags);
> > + if (events_in_progress) {
> > + event_count++;
> > + if (!--events_in_progress)
> > + wake_up_all(&events_wait_queue);
> > + }
> > + spin_unlock_irqrestore(&events_lock, flags);
> > +}
>
> > +bool pm_get_wakeup_count(unsigned long *count)
> > +{
> > + bool ret;
> > +
> > + spin_lock_irq(&events_lock);
> > + if (capable(CAP_SYS_ADMIN))
> > + events_check_enabled = false;
> > +
> > + if (events_in_progress) {
> > + DEFINE_WAIT(wait);
> > +
> > + do {
> > + prepare_to_wait(&events_wait_queue, &wait,
> > + TASK_INTERRUPTIBLE);
> > + if (!events_in_progress)
> > + break;
> > + spin_unlock_irq(&events_lock);
> > +
> > + schedule();
> > +
> > + spin_lock_irq(&events_lock);
> > + } while (!signal_pending(current));
> > + finish_wait(&events_wait_queue, &wait);
> > + }
> > + *count = event_count;
> > + ret = !events_in_progress;
> > + spin_unlock_irq(&events_lock);
> > + return ret;
> > +}
>
> Here's a thought. Presumably pm_relax() will end up getting called a
> lot more often than pm_get_wakeup_count(). Instead of using a wait
> queue, you could make pm_get_wakeup_count() poll at 100-ms intervals.
> The total overhead would be smaller.

For that I'd need a separate kernel thread or a work item that would reschedule
itself periodically, because pm_get_wakeup_count() is only called via
/sys/power/wakeup_count. It would complicate things quite a bit which I'm not
sure is worth it at this point.

> Here's another thought. If event_count and events_in_progress were
> atomic_t then the new spinlock wouldn't be needed at all. (But you
> would need an appropriate pair of memory barriers, to guarantee that
> when a writer decrements events_in_progress to 0 and increments
> event_count, a reader won't see events_in_progress == 0 without also
> seeing the incremented event_count.) Overall, this may not be a
> significant improvement.

No, I don't think it would be significant. Still, we can go back to this
if the spinlock turns out to be a problem in future.

Rafael

2010-06-28 12:52:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Monday, June 28, 2010, mark gross wrote:
> Looks good to me!

Great, thanks! May I add your "Acked-by" to the patch, then?

Rafael


> On Sat, Jun 26, 2010 at 03:14:13PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > One of the arguments during the suspend blockers discussion was that
> > the mainline kernel didn't contain any mechanisms making it possible
> > to avoid losing wakeup events during system suspend.
> >
> > Generally, there are two problems in that area. First, if a wakeup
> > event occurs exactly when /sys/power/state is being written to, it
> > may be delivered to user space right before the freezer kicks in, so
> > the user space consumer of the event may not be able to process it
> > before the system is suspended. Second, if a wakeup event occurs
> > after user space has been frozen, it is not generally guaranteed that
> > the ongoing transition of the system into a sleep state will be
> > aborted.
> >
> > To address these issues introduce a new global sysfs attribute,
> > /sys/power/wakeup_count, associated with a running counter of wakeup
> > events and three helper functions, pm_stay_awake(), pm_relax(), and
> > pm_wakeup_event(), that may be used by kernel subsystems to control
> > the behavior of this attribute and to request the PM core to abort
> > system transitions into a sleep state already in progress.
> >
> > The /sys/power/wakeup_count file may be read from or written to by
> > user space. Reads will always succeed (unless interrupted by a
> > signal) and return the current value of the wakeup events counter.
> > Writes, however, will only succeed if the written number is equal to
> > the current value of the wakeup events counter. If a write is
> > successful, it will cause the kernel to save the current value of the
> > wakeup events counter and to abort the subsequent system transition
> > into a sleep state if any wakeup events are reported after the write
> > has returned.
> >
> > [The assumption is that before writing to /sys/power/state user space
> > will first read from /sys/power/wakeup_count. Next, user space
> > consumers of wakeup events will have a chance to acknowledge or
> > veto the upcoming system transition to a sleep state. Finally, if
> > the transition is allowed to proceed, /sys/power/wakeup_count will
> > be written to and if that succeeds, /sys/power/state will be written
> > to as well. Still, if any wakeup events are reported to the PM core
> > by kernel subsystems after that point, the transition will be
> > aborted.]
> >
> > Additionally, put a wakeup events counter into struct dev_pm_info and
> > make these per-device wakeup event counters available via sysfs,
> > so that it's possible to check the activity of various wakeup event
> > sources within the kernel.
> >
> > To illustrate how subsystems can use pm_wakeup_event(), make the
> > low-level PCI runtime PM wakeup-handling code use it.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---

2010-06-28 14:16:58

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:

> > > +bool pm_get_wakeup_count(unsigned long *count)
> > > +{
> > > + bool ret;
> > > +
> > > + spin_lock_irq(&events_lock);
> > > + if (capable(CAP_SYS_ADMIN))
> > > + events_check_enabled = false;
> > > +
> > > + if (events_in_progress) {
> > > + DEFINE_WAIT(wait);
> > > +
> > > + do {
> > > + prepare_to_wait(&events_wait_queue, &wait,
> > > + TASK_INTERRUPTIBLE);
> > > + if (!events_in_progress)
> > > + break;
> > > + spin_unlock_irq(&events_lock);
> > > +
> > > + schedule();
> > > +
> > > + spin_lock_irq(&events_lock);
> > > + } while (!signal_pending(current));
> > > + finish_wait(&events_wait_queue, &wait);
> > > + }
> > > + *count = event_count;
> > > + ret = !events_in_progress;
> > > + spin_unlock_irq(&events_lock);
> > > + return ret;
> > > +}
> >
> > Here's a thought. Presumably pm_relax() will end up getting called a
> > lot more often than pm_get_wakeup_count(). Instead of using a wait
> > queue, you could make pm_get_wakeup_count() poll at 100-ms intervals.
> > The total overhead would be smaller.
>
> For that I'd need a separate kernel thread or a work item that would reschedule
> itself periodically, because pm_get_wakeup_count() is only called via
> /sys/power/wakeup_count. It would complicate things quite a bit which I'm not
> sure is worth it at this point.

What? All I'm saying is that the do-while loop above should be
replaced by a loop that sleeps for 100 ms instead of waiting on a
wait_queue. That is:

while (events_in_progress) {
if (signal_pending(current))
break;
spin_unlock_irq(&events_lock);

schedule_timeout_interruptible(msecs_to_jiffies(100));

spin_lock_irq(&events_lock);
}

And of course, get rid of events_wait_queue.

Alan Stern

2010-06-28 19:03:40

by Rafael J. Wysocki

[permalink] [raw]
Subject: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Monday, June 28, 2010, Alan Stern wrote:
> On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > +bool pm_get_wakeup_count(unsigned long *count)
> > > > +{
> > > > + bool ret;
> > > > +
> > > > + spin_lock_irq(&events_lock);
> > > > + if (capable(CAP_SYS_ADMIN))
> > > > + events_check_enabled = false;
> > > > +
> > > > + if (events_in_progress) {
> > > > + DEFINE_WAIT(wait);
> > > > +
> > > > + do {
> > > > + prepare_to_wait(&events_wait_queue, &wait,
> > > > + TASK_INTERRUPTIBLE);
> > > > + if (!events_in_progress)
> > > > + break;
> > > > + spin_unlock_irq(&events_lock);
> > > > +
> > > > + schedule();
> > > > +
> > > > + spin_lock_irq(&events_lock);
> > > > + } while (!signal_pending(current));
> > > > + finish_wait(&events_wait_queue, &wait);
> > > > + }
> > > > + *count = event_count;
> > > > + ret = !events_in_progress;
> > > > + spin_unlock_irq(&events_lock);
> > > > + return ret;
> > > > +}
> > >
> > > Here's a thought. Presumably pm_relax() will end up getting called a
> > > lot more often than pm_get_wakeup_count(). Instead of using a wait
> > > queue, you could make pm_get_wakeup_count() poll at 100-ms intervals.
> > > The total overhead would be smaller.
> >
> > For that I'd need a separate kernel thread or a work item that would reschedule
> > itself periodically, because pm_get_wakeup_count() is only called via
> > /sys/power/wakeup_count. It would complicate things quite a bit which I'm not
> > sure is worth it at this point.
>
> What? All I'm saying is that the do-while loop above should be
> replaced by a loop that sleeps for 100 ms instead of waiting on a
> wait_queue.

Ah, OK. I didn't understand what you were trying to say.

Below is the patch I'd like to add to suspend-2.6/linux-next.

Rafael

---
From: Rafael J. Wysocki <[email protected]>
Subject: PM: Make it possible to avoid wakeup events from being lost

One of the arguments during the suspend blockers discussion was that
the mainline kernel didn't contain any mechanisms making it possible
to avoid losing wakeup events during system suspend.

Generally, there are two problems in that area. First, if a wakeup
event occurs exactly when /sys/power/state is being written to, it
may be delivered to user space right before the freezer kicks in, so
the user space consumer of the event may not be able to process it
before the system is suspended. Second, if a wakeup event occurs
after user space has been frozen, it is not generally guaranteed that
the ongoing transition of the system into a sleep state will be
aborted.

To address these issues introduce a new global sysfs attribute,
/sys/power/wakeup_count, associated with a running counter of wakeup
events and three helper functions, pm_stay_awake(), pm_relax(), and
pm_wakeup_event(), that may be used by kernel subsystems to control
the behavior of this attribute and to request the PM core to abort
system transitions into a sleep state already in progress.

The /sys/power/wakeup_count file may be read from or written to by
user space. Reads will always succeed (unless interrupted by a
signal) and return the current value of the wakeup events counter.
Writes, however, will only succeed if the written number is equal to
the current value of the wakeup events counter. If a write is
successful, it will cause the kernel to save the current value of the
wakeup events counter and to abort the subsequent system transition
into a sleep state if any wakeup events are reported after the write
has returned.

[The assumption is that before writing to /sys/power/state user space
will first read from /sys/power/wakeup_count. Next, user space
consumers of wakeup events will have a chance to acknowledge or
veto the upcoming system transition to a sleep state. Finally, if
the transition is allowed to proceed, /sys/power/wakeup_count will
be written to and if that succeeds, /sys/power/state will be written
to as well. Still, if any wakeup events are reported to the PM core
by kernel subsystems after that point, the transition will be
aborted.]

Additionally, put a wakeup events counter into struct dev_pm_info and
make these per-device wakeup event counters available via sysfs,
so that it's possible to check the activity of various wakeup event
sources within the kernel.

To illustrate how subsystems can use pm_wakeup_event(), make the
low-level PCI runtime PM wakeup-handling code use it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
Documentation/ABI/testing/sysfs-power | 15 ++
drivers/base/power/Makefile | 2
drivers/base/power/main.c | 1
drivers/base/power/sysfs.c | 11 +
drivers/base/power/wakeup.c | 229 ++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 1
drivers/pci/pci.c | 20 ++
drivers/pci/pci.h | 1
drivers/pci/pcie/pme/pcie_pme.c | 5
include/linux/pm.h | 10 +
include/linux/suspend.h | 7 +
kernel/power/hibernate.c | 20 ++
kernel/power/main.c | 53 +++++++
kernel/power/suspend.c | 4
14 files changed, 369 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,58 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+/*
+ * The 'wakeup_count' attribute, along with the functions defined in
+ * drivers/base/power/wakeup.c, provides a means by which wakeup events can be
+ * handled in a non-racy way.
+ *
+ * If a wakeup event occurs when the system is in a sleep state, it simply is
+ * woken up. In turn, if an event that would wake the system up from a sleep
+ * state occurs when it is undergoing a transition to that sleep state, the
+ * transition should be aborted. Moreover, if such an event occurs when the
+ * system is in the working state, an attempt to start a transition to the
+ * given sleep state should fail during certain period after the detection of
+ * the event. Using the 'state' attribute alone is not sufficient to satisfy
+ * these requirements, because a wakeup event may occur exactly when 'state'
+ * is being written to and may be delivered to user space right before it is
+ * frozen, so the event will remain only partially processed until the system is
+ * woken up by another event. In particular, it won't cause the transition to
+ * a sleep state to be aborted.
+ *
+ * This difficulty may be overcome if user space uses 'wakeup_count' before
+ * writing to 'state'. It first should read from 'wakeup_count' and store
+ * the read value. Then, after carrying out its own preparations for the system
+ * transition to a sleep state, it should write the stored value to
+ * 'wakeup_count'. If that fails, at least one wakeup event has occured since
+ * 'wakeup_count' was read and 'state' should not be written to. Otherwise, it
+ * is allowed to write to 'state', but the transition will be aborted if there
+ * are any wakeup events detected after 'wakeup_count' was written to.
+ */
+
+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ unsigned long val;
+
+ return pm_get_wakeup_count(&val) ? sprintf(buf, "%lu\n", val) : -EINTR;
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +288,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,229 @@
+/*
+ * drivers/base/power/wakeup.c - System wakeup events framework
+ *
+ * Copyright (c) 2010 Rafael J. Wysocki <[email protected]>, Novell Inc.
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/capability.h>
+#include <linux/suspend.h>
+#include <linux/pm.h>
+
+/*
+ * If set, the suspend/hibernate code will abort transitions to a sleep state
+ * if wakeup events are registered during or immediately before the transition.
+ */
+bool events_check_enabled;
+
+/* The counter of registered wakeup events. */
+static unsigned long event_count;
+/* A preserved old value of event_count. */
+static unsigned long saved_event_count;
+/* The counter of wakeup events being processed. */
+static unsigned long events_in_progress;
+
+static DEFINE_SPINLOCK(events_lock);
+
+/*
+ * The functions below use the observation that each wakeup event starts a
+ * period in which the system should not be suspended. The moment this period
+ * will end depends on how the wakeup event is going to be processed after being
+ * detected and all of the possible cases can be divided into two distinct
+ * groups.
+ *
+ * First, a wakeup event may be detected by the same functional unit that will
+ * carry out the entire processing of it and possibly will pass it to user space
+ * for further processing. In that case the functional unit that has detected
+ * the event may later "close" the "no suspend" period associated with it
+ * directly as soon as it has been dealt with. The pair of pm_stay_awake() and
+ * pm_relax(), balanced with each other, is supposed to be used in such
+ * situations.
+ *
+ * Second, a wakeup event may be detected by one functional unit and processed
+ * by another one. In that case the unit that has detected it cannot really
+ * "close" the "no suspend" period associated with it, unless it knows in
+ * advance what's going to happen to the event during processing. This
+ * knowledge, however, may not be available to it, so it can simply specify time
+ * to wait before the system can be suspended and pass it as the second
+ * argument of pm_wakeup_event().
+ */
+
+/**
+ * pm_stay_awake - Notify the PM core that a wakeup event is being processed.
+ * @dev: Device the wakeup event is related to.
+ *
+ * Notify the PM core of a wakeup event (signaled by @dev) by incrementing the
+ * counter of wakeup events being processed. If @dev is not NULL, the counter
+ * of wakeup events related to @dev is incremented too.
+ *
+ * Call this function after detecting of a wakeup event if pm_relax() is going
+ * to be called directly after processing the event (and possibly passing it to
+ * user space for further processing).
+ *
+ * It is safe to call this function from interrupt context.
+ */
+void pm_stay_awake(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ events_in_progress++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_relax - Notify the PM core that processing of a wakeup event has ended.
+ *
+ * Notify the PM core that a wakeup event has been processed by decrementing
+ * the counter of wakeup events being processed and incrementing the counter
+ * of registered wakeup events.
+ *
+ * Call this function for wakeup events whose processing started with calling
+ * pm_stay_awake().
+ *
+ * It is safe to call it from interrupt context.
+ */
+void pm_relax(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_in_progress) {
+ events_in_progress--;
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_wakeup_work_fn - Deferred closing of a wakeup event.
+ *
+ * Execute pm_relax() for a wakeup event detected in the past and free the
+ * work item object used for queuing up the work.
+ */
+static void pm_wakeup_work_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+
+ pm_relax();
+ kfree(dwork);
+}
+
+/**
+ * pm_wakeup_event - Notify the PM core of a wakeup event.
+ * @dev: Device the wakeup event is related to.
+ * @msec: Anticipated event processing time (in milliseconds).
+ *
+ * Notify the PM core of a wakeup event (signaled by @dev) that will take
+ * approximately @msec milliseconds to be processed by the kernel. Increment
+ * the counter of wakeup events being processed and queue up a work item
+ * that will execute pm_relax() for the event after @msec milliseconds. If @dev
+ * is not NULL, the counter of wakeup events related to @dev is incremented too.
+ *
+ * It is safe to call this function from interrupt context.
+ */
+void pm_wakeup_event(struct device *dev, unsigned int msec)
+{
+ unsigned long flags;
+ struct delayed_work *dwork;
+
+ dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ if (dwork) {
+ INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
+ schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+
+ events_in_progress++;
+ } else {
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * pm_check_wakeup_events - Check for new wakeup events.
+ *
+ * Compare the current number of registered wakeup events with its preserved
+ * value from the past to check if new wakeup events have been registered since
+ * the old value was stored. Check if the current number of wakeup events being
+ * processed is zero.
+ */
+bool pm_check_wakeup_events(void)
+{
+ unsigned long flags;
+ bool ret = true;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_check_enabled) {
+ ret = (event_count == saved_event_count) && !events_in_progress;
+ events_check_enabled = ret;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+/**
+ * pm_get_wakeup_count - Read the number of registered wakeup events.
+ * @count: Address to store the value at.
+ *
+ * Store the number of registered wakeup events at the address in @count. Block
+ * if the current number of wakeup events being processed is nonzero.
+ *
+ * Return false if the wait for the number of wakeup events being processed to
+ * drop down to zero has been interrupted by a signal (and the current number
+ * of wakeup events being processed is still nonzero). Otherwise return true.
+ */
+bool pm_get_wakeup_count(unsigned long *count)
+{
+ bool ret;
+
+ spin_lock_irq(&events_lock);
+ if (capable(CAP_SYS_ADMIN))
+ events_check_enabled = false;
+
+ while (events_in_progress && !signal_pending(current)) {
+ spin_unlock_irq(&events_lock);
+
+ schedule_timeout_interruptible(msecs_to_jiffies(100));
+
+ spin_lock_irq(&events_lock);
+ }
+ *count = event_count;
+ ret = !events_in_progress;
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
+
+/**
+ * pm_save_wakeup_count - Save the current number of registered wakeup events.
+ * @count: Value to compare with the current number of registered wakeup events.
+ *
+ * If @count is equal to the current number of registered wakeup events and the
+ * current number of wakeup events being processed is zero, store @count as the
+ * old number of registered wakeup events to be used by pm_check_wakeup_events()
+ * and return true. Otherwise return false.
+ */
+bool pm_save_wakeup_count(unsigned long count)
+{
+ bool ret = false;
+
+ spin_lock_irq(&events_lock);
+ if (count == event_count && !events_in_progress) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev, unsigned int msec);
+extern void pm_stay_awake(struct device *dev);
+extern void pm_relax(void);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev, unsigned int msec) {}
+static inline void pm_stay_awake(struct device *dev) {}
+static inline void pm_relax(void) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -172,8 +172,10 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
error = suspend_ops->enter(state);
+ events_check_enabled = false;
+ }
sysdev_resume();
}

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
goto Power_up;

in_suspend = 1;
@@ -288,8 +288,10 @@ static int create_image(int platform_mod
error);
/* Restore control flow magically appears here */
restore_processor_state();
- if (!in_suspend)
+ if (!in_suspend) {
+ events_check_enabled = false;
platform_leave(platform_mode);
+ }

Power_up:
sysdev_resume();
@@ -511,14 +513,20 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events()) {
+ error = -EAGAIN;
+ goto Power_up;
+ }
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,7 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ pci_wakeup_event(pci_dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -154,6 +154,7 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
ret = true;
}

@@ -254,8 +255,10 @@ static void pcie_pme_handle_request(stru
if (found) {
/* The device is there, but we have to check its PME status. */
found = pci_check_pme_status(dev);
- if (found)
+ if (found) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
pci_dev_put(dev);
} else if (devfn) {
/*
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -73,6 +73,8 @@
* device are known to the PM core. However, for some devices this
* attribute is set to "enabled" by bus type code or device drivers and in
* that cases it should be safe to leave the default value.
+ *
+ * wakeup_count - Report the number of wakeup events related to the device
*/

static const char enabled[] = "enabled";
@@ -144,6 +146,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", dev->power.wakeup_count);
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +240,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME
Index: linux-2.6/drivers/pci/pci.h
===================================================================
--- linux-2.6.orig/drivers/pci/pci.h
+++ linux-2.6/drivers/pci/pci.h
@@ -56,6 +56,7 @@ extern void pci_update_current_state(str
extern void pci_disable_enabled_device(struct pci_dev *dev);
extern bool pci_check_pme_status(struct pci_dev *dev);
extern int pci_finish_runtime_suspend(struct pci_dev *dev);
+extern void pci_wakeup_event(struct pci_dev *dev);
extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
extern void pci_pme_wakeup_bus(struct pci_bus *bus);
extern void pci_pm_init(struct pci_dev *dev);
Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -1275,6 +1275,22 @@ bool pci_check_pme_status(struct pci_dev
return ret;
}

+/*
+ * Time to wait before the system can be put into a sleep state after reporting
+ * a wakeup event signaled by a PCI device.
+ */
+#define PCI_WAKEUP_COOLDOWN 100
+
+/**
+ * pci_wakeup_event - Report a wakeup event related to a given PCI device.
+ * @dev: Device to report the wakeup event for.
+ */
+void pci_wakeup_event(struct pci_dev *dev)
+{
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
+}
+
/**
* pci_pme_wakeup - Wake up a PCI device if its PME Status bit is set.
* @dev: Device to handle.
@@ -1285,8 +1301,10 @@ bool pci_check_pme_status(struct pci_dev
*/
static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
{
- if (pci_check_pme_status(dev))
+ if (pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
return 0;
}

Index: linux-2.6/include/linux/suspend.h
===================================================================
--- linux-2.6.orig/include/linux/suspend.h
+++ linux-2.6/include/linux/suspend.h
@@ -295,6 +295,13 @@ extern int unregister_pm_notifier(struct
{ .notifier_call = fn, .priority = pri }; \
register_pm_notifier(&fn##_nb); \
}
+
+/* drivers/base/power/wakeup.c */
+extern bool events_check_enabled;
+
+extern bool pm_check_wakeup_events(void);
+extern bool pm_get_wakeup_count(unsigned long *count);
+extern bool pm_save_wakeup_count(unsigned long count);
#else /* !CONFIG_PM_SLEEP */

static inline int register_pm_notifier(struct notifier_block *nb)
Index: linux-2.6/Documentation/ABI/testing/sysfs-power
===================================================================
--- linux-2.6.orig/Documentation/ABI/testing/sysfs-power
+++ linux-2.6/Documentation/ABI/testing/sysfs-power
@@ -114,3 +114,18 @@ Description:
if this file contains "1", which is the default. It may be
disabled by writing "0" to this file, in which case all devices
will be suspended and resumed synchronously.
+
+What: /sys/power/wakeup_count
+Date: July 2010
+Contact: Rafael J. Wysocki <[email protected]>
+Description:
+ The /sys/power/wakeup_count file allows user space to put the
+ system into a sleep state while taking into account the
+ concurrent arrival of wakeup events. Reading from it returns
+ the current number of registered wakeup events and it blocks if
+ some wakeup events are being processed at the time the file is
+ read from. Writing to it will only succeed if the current
+ number of wakeup events is equal to the written value and, if
+ successful, will make the kernel abort a subsequent transition
+ to a sleep state if any wakeup events are reported after the
+ write has returned.

2010-06-28 19:13:35

by Jesse Barnes

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, 28 Jun 2010 21:01:53 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> --- linux-2.6.orig/drivers/pci/pci.c
> +++ linux-2.6/drivers/pci/pci.c
> @@ -1275,6 +1275,22 @@ bool pci_check_pme_status(struct pci_dev
> return ret;
> }
>
> +/*
> + * Time to wait before the system can be put into a sleep state after reporting
> + * a wakeup event signaled by a PCI device.
> + */
> +#define PCI_WAKEUP_COOLDOWN 100
> +
> +/**
> + * pci_wakeup_event - Report a wakeup event related to a given PCI device.
> + * @dev: Device to report the wakeup event for.
> + */
> +void pci_wakeup_event(struct pci_dev *dev)
> +{
> + if (device_may_wakeup(&dev->dev))
> + pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
> +}
> +
> /**
> * pci_pme_wakeup - Wake up a PCI device if its PME Status bit is set.
> * @dev: Device to handle.
> @@ -1285,8 +1301,10 @@ bool pci_check_pme_status(struct pci_dev
> */
> static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
> {
> - if (pci_check_pme_status(dev))
> + if (pci_check_pme_status(dev)) {
> pm_request_resume(&dev->dev);
> + pci_wakeup_event(dev);
> + }
> return 0;
> }
>

I assume the units here are ms? And the wakeup event propagation check
is pushed down into pci_wakeup_event, is there no place to check
whether the device is configured to generate wakeups in the core device
or PM code?

Other than that, the PCI part is
Acked-by: Jesse Barnes <[email protected]>

--
Jesse Barnes, Intel Open Source Technology Center

2010-06-28 19:19:48

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:

> Below is the patch I'd like to add to suspend-2.6/linux-next.
>
> Rafael
>
> ---
> From: Rafael J. Wysocki <[email protected]>
> Subject: PM: Make it possible to avoid wakeup events from being lost
>
> One of the arguments during the suspend blockers discussion was that
> the mainline kernel didn't contain any mechanisms making it possible
> to avoid losing wakeup events during system suspend.

I don't see anything more to complain about in the patch. :-)

You probably should change the title and the first paragraph of the
description to avoid saying that events can get lost. Like:

PM: Make it possible to avoid races between wakeup and system sleep

Alan Stern

2010-06-28 20:41:01

by Greg KH

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, Jun 28, 2010 at 09:01:53PM +0200, Rafael J. Wysocki wrote:
> Below is the patch I'd like to add to suspend-2.6/linux-next.

Looks nice to me, thanks for doing this:
Acked-by: Greg Kroah-Hartman <[email protected]>

2010-06-28 21:21:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Monday, June 28, 2010, Alan Stern wrote:
> On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:
>
> > Below is the patch I'd like to add to suspend-2.6/linux-next.
> >
> > Rafael
> >
> > ---
> > From: Rafael J. Wysocki <[email protected]>
> > Subject: PM: Make it possible to avoid wakeup events from being lost
> >
> > One of the arguments during the suspend blockers discussion was that
> > the mainline kernel didn't contain any mechanisms making it possible
> > to avoid losing wakeup events during system suspend.
>
> I don't see anything more to complain about in the patch. :-)

Great! :-)

> You probably should change the title and the first paragraph of the
> description to avoid saying that events can get lost. Like:
>
> PM: Make it possible to avoid races between wakeup and system sleep

I will, thanks!

Rafael

2010-06-28 23:28:32

by David Brownell

[permalink] [raw]
Subject: Re: [linux-pm] [PATCH] PM: Make it possible to avoid wakeup events from being lost

Did someone post the canonical driver changes
to make use of this? Something like

suspend() { /* if wake-enabled, up count */ }
resume() { /* if upcounted, downcount */ }

is what first comes to mind.. expecting that
the suspend/resume methods in the driver are
already doing the right things for enabling
and later disabling the "system wake" behavior
on the various relevant hardware events...


2010-06-29 04:43:13

by 640E9920

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, Jun 28, 2010 at 02:50:10PM +0200, Rafael J. Wysocki wrote:
> On Monday, June 28, 2010, mark gross wrote:
> > Looks good to me!
>
> Great, thanks! May I add your "Acked-by" to the patch, then?
yes.

Acked-by: markgross <[email protected]>

--mgross

> Rafael
>
>
> > On Sat, Jun 26, 2010 at 03:14:13PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <[email protected]>
> > >
> > > One of the arguments during the suspend blockers discussion was that
> > > the mainline kernel didn't contain any mechanisms making it possible
> > > to avoid losing wakeup events during system suspend.
> > >
> > > Generally, there are two problems in that area. First, if a wakeup
> > > event occurs exactly when /sys/power/state is being written to, it
> > > may be delivered to user space right before the freezer kicks in, so
> > > the user space consumer of the event may not be able to process it
> > > before the system is suspended. Second, if a wakeup event occurs
> > > after user space has been frozen, it is not generally guaranteed that
> > > the ongoing transition of the system into a sleep state will be
> > > aborted.
> > >
> > > To address these issues introduce a new global sysfs attribute,
> > > /sys/power/wakeup_count, associated with a running counter of wakeup
> > > events and three helper functions, pm_stay_awake(), pm_relax(), and
> > > pm_wakeup_event(), that may be used by kernel subsystems to control
> > > the behavior of this attribute and to request the PM core to abort
> > > system transitions into a sleep state already in progress.
> > >
> > > The /sys/power/wakeup_count file may be read from or written to by
> > > user space. Reads will always succeed (unless interrupted by a
> > > signal) and return the current value of the wakeup events counter.
> > > Writes, however, will only succeed if the written number is equal to
> > > the current value of the wakeup events counter. If a write is
> > > successful, it will cause the kernel to save the current value of the
> > > wakeup events counter and to abort the subsequent system transition
> > > into a sleep state if any wakeup events are reported after the write
> > > has returned.
> > >
> > > [The assumption is that before writing to /sys/power/state user space
> > > will first read from /sys/power/wakeup_count. Next, user space
> > > consumers of wakeup events will have a chance to acknowledge or
> > > veto the upcoming system transition to a sleep state. Finally, if
> > > the transition is allowed to proceed, /sys/power/wakeup_count will
> > > be written to and if that succeeds, /sys/power/state will be written
> > > to as well. Still, if any wakeup events are reported to the PM core
> > > by kernel subsystems after that point, the transition will be
> > > aborted.]
> > >
> > > Additionally, put a wakeup events counter into struct dev_pm_info and
> > > make these per-device wakeup event counters available via sysfs,
> > > so that it's possible to check the activity of various wakeup event
> > > sources within the kernel.
> > >
> > > To illustrate how subsystems can use pm_wakeup_event(), make the
> > > low-level PCI runtime PM wakeup-handling code use it.
> > >
> > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > ---

2010-06-29 19:57:28

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, 28 Jun 2010, David Brownell wrote:

> Did someone post the canonical driver changes
> to make use of this?

No, not really. The patch itself contains an example (PCI) but it
doesn't demonstrate the full range of possible usages.

> Something like
>
> suspend() { /* if wake-enabled, up count */ }
> resume() { /* if upcounted, downcount */ }
>
> is what first comes to mind.. expecting that
> the suspend/resume methods in the driver are
> already doing the right things for enabling
> and later disabling the "system wake" behavior
> on the various relevant hardware events...

The PCI example looks like this:

resume()
{
...
if (device_may_wakeup(dev))
pm_wakeup_event(dev, TIMEOUT_GUESS);
...
}

where TIMEOUT_GUESS is an estimate of how long to wait before allowing
the system to sleep.

For things like keyboards, an example would go more like this:

irq_handler()
{
...
if (key-press event occurred) {
...
if (input queue is empty)
pm_stay_awake(dev);
add event to input queue;
...
}
...
}

read_queue()
{
...
send queued data to userspace
if (input queue is empty)
pm_relax();
...
}

I left out the device_may_wakeup tests; things become rather
complicated if you can have more than one keyboard feeding the same
input queue and some of them are wakeup-enabled while others aren't.

Clearly the appropriate changes will depend on the subsystem and the
kind of event. They may also end up depending on the platform; perhaps
this will be used only on relatively small systems like an Android
phone.

Alan Stern

2010-06-30 07:11:28

by Florian Mickler

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

Hi Rafael!

On Mon, 28 Jun 2010 21:01:53 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> On Monday, June 28, 2010, Alan Stern wrote:
> > On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:

> From: Rafael J. Wysocki <[email protected]>
> Subject: PM: Make it possible to avoid wakeup events from being lost

I have nothing substantial to add, but just wanted to let you know that
this approach seems like a good alternative to me. As far as I can see
the userspace suspend-blocker interface could be expressed in terms of
this kernel facility which brings android closer to mainline.

The only thing I haven't thought through yet is the 'maintain a discrete
set of constraints' vs 'just increment a number' thing. Especially if
what we loose in information through that (in comparison to 'the other
approach') is made up for by easier in-kernel-API. I _think_ if there
is any need for in-kernel-accounting (i don't know that) it could be
retro-fitted by using the trace event infrastructure?

Cheers,
Flo

2010-06-30 13:47:33

by 640E9920

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Wed, Jun 30, 2010 at 09:10:38AM +0200, Florian Mickler wrote:
> Hi Rafael!
>
> On Mon, 28 Jun 2010 21:01:53 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > On Monday, June 28, 2010, Alan Stern wrote:
> > > On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:
>
> > From: Rafael J. Wysocki <[email protected]>
> > Subject: PM: Make it possible to avoid wakeup events from being lost
>
> I have nothing substantial to add, but just wanted to let you know that
> this approach seems like a good alternative to me. As far as I can see
> the userspace suspend-blocker interface could be expressed in terms of
> this kernel facility which brings android closer to mainline.
>
> The only thing I haven't thought through yet is the 'maintain a discrete
> set of constraints' vs 'just increment a number' thing. Especially if
> what we loose in information through that (in comparison to 'the other
> approach') is made up for by easier in-kernel-API. I _think_ if there
> is any need for in-kernel-accounting (i don't know that) it could be
> retro-fitted by using the trace event infrastructure?

Adding ftracing hooks and some less invasive partial state or trace
support to this and pm_qos is likely the next order of business.

--mgross

>
> Cheers,
> Flo

2010-06-30 18:00:47

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:

> +/*
> + * The functions below use the observation that each wakeup event starts a
> + * period in which the system should not be suspended. The moment this period
> + * will end depends on how the wakeup event is going to be processed after being
> + * detected and all of the possible cases can be divided into two distinct
> + * groups.
> + *
> + * First, a wakeup event may be detected by the same functional unit that will
> + * carry out the entire processing of it and possibly will pass it to user space
> + * for further processing. In that case the functional unit that has detected
> + * the event may later "close" the "no suspend" period associated with it
> + * directly as soon as it has been dealt with. The pair of pm_stay_awake() and
> + * pm_relax(), balanced with each other, is supposed to be used in such
> + * situations.
> + *
> + * Second, a wakeup event may be detected by one functional unit and processed
> + * by another one. In that case the unit that has detected it cannot really
> + * "close" the "no suspend" period associated with it, unless it knows in
> + * advance what's going to happen to the event during processing. This
> + * knowledge, however, may not be available to it, so it can simply specify time
> + * to wait before the system can be suspended and pass it as the second
> + * argument of pm_wakeup_event().
> + */

Since there's no longer any way to cancel a call to pm_wakeup_event()
or close the "no suspend" period early, there is no need to use
dynamically-allocated delayed_work structures. You can make do with a
single static timer; always keep it set to expire at the latest time
passed to pm_wakeup_event().

Alan Stern

2010-06-30 19:29:27

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Wednesday, June 30, 2010, Alan Stern wrote:
> On Mon, 28 Jun 2010, Rafael J. Wysocki wrote:
>
> > +/*
> > + * The functions below use the observation that each wakeup event starts a
> > + * period in which the system should not be suspended. The moment this period
> > + * will end depends on how the wakeup event is going to be processed after being
> > + * detected and all of the possible cases can be divided into two distinct
> > + * groups.
> > + *
> > + * First, a wakeup event may be detected by the same functional unit that will
> > + * carry out the entire processing of it and possibly will pass it to user space
> > + * for further processing. In that case the functional unit that has detected
> > + * the event may later "close" the "no suspend" period associated with it
> > + * directly as soon as it has been dealt with. The pair of pm_stay_awake() and
> > + * pm_relax(), balanced with each other, is supposed to be used in such
> > + * situations.
> > + *
> > + * Second, a wakeup event may be detected by one functional unit and processed
> > + * by another one. In that case the unit that has detected it cannot really
> > + * "close" the "no suspend" period associated with it, unless it knows in
> > + * advance what's going to happen to the event during processing. This
> > + * knowledge, however, may not be available to it, so it can simply specify time
> > + * to wait before the system can be suspended and pass it as the second
> > + * argument of pm_wakeup_event().
> > + */
>
> Since there's no longer any way to cancel a call to pm_wakeup_event()
> or close the "no suspend" period early, there is no need to use
> dynamically-allocated delayed_work structures. You can make do with a
> single static timer; always keep it set to expire at the latest time
> passed to pm_wakeup_event().

The decremenations of events_in_progress wouldn't be balanced with
incrementations this way. Or do you have any clever way of dealing with
that in mind?

Rafael

2010-06-30 19:58:40

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Wed, 30 Jun 2010, Rafael J. Wysocki wrote:

> > Since there's no longer any way to cancel a call to pm_wakeup_event()
> > or close the "no suspend" period early, there is no need to use
> > dynamically-allocated delayed_work structures. You can make do with a
> > single static timer; always keep it set to expire at the latest time
> > passed to pm_wakeup_event().
>
> The decremenations of events_in_progress wouldn't be balanced with
> incrementations this way. Or do you have any clever way of dealing with
> that in mind?

Keep track of the current expiration time in a static variable called
wakeup_timeout, and use 0 to indicate there is no timeout.

In pm_wakeup_event() (everything protected by the spinlock):

unsigned long new_timeout = jiffies + msecs_to_jiffies(msecs);
if (new_timeout == 0)
new_timeout = 1;

++event_count;
if (!wakeup_timeout || time_after(new_timeout, wakeup_timeout)) {
if (!wakeup_timeout)
++events_in_progress;
wakeup_timeout = new_timeout;
mod_timer(&wakeup_timer, wakeup_timeout);
}

In the timer routine:

if (wakeup_timeout && time_before_eq(wakeup_timeout, jiffies)) {
--events_in_progres;
wakeup_timeout = 0;
}

Alan Stern

2010-06-30 23:53:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Wednesday, June 30, 2010, Alan Stern wrote:
> On Wed, 30 Jun 2010, Rafael J. Wysocki wrote:
>
> > > Since there's no longer any way to cancel a call to pm_wakeup_event()
> > > or close the "no suspend" period early, there is no need to use
> > > dynamically-allocated delayed_work structures. You can make do with a
> > > single static timer; always keep it set to expire at the latest time
> > > passed to pm_wakeup_event().
> >
> > The decremenations of events_in_progress wouldn't be balanced with
> > incrementations this way. Or do you have any clever way of dealing with
> > that in mind?
>
> Keep track of the current expiration time in a static variable called
> wakeup_timeout, and use 0 to indicate there is no timeout.

I invented a slightly different version in the meantime, which is appended
as a patch on top of the original one (I don't want to modify the original
patch, since it's been reviewed already by several people and went to my
linux-next branch).

> In pm_wakeup_event() (everything protected by the spinlock):
>
> unsigned long new_timeout = jiffies + msecs_to_jiffies(msecs);
> if (new_timeout == 0)
> new_timeout = 1;
>
> ++event_count;
> if (!wakeup_timeout || time_after(new_timeout, wakeup_timeout)) {
> if (!wakeup_timeout)
> ++events_in_progress;
> wakeup_timeout = new_timeout;
> mod_timer(&wakeup_timer, wakeup_timeout);
> }
>
> In the timer routine:
>
> if (wakeup_timeout && time_before_eq(wakeup_timeout, jiffies)) {
> --events_in_progres;
> wakeup_timeout = 0;
> }
>

---
drivers/base/power/wakeup.c | 47 +++++++++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 11 deletions(-)

Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- linux-2.6.orig/drivers/base/power/wakeup.c
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -9,6 +9,7 @@
#include <linux/device.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/ktime.h>
#include <linux/capability.h>
#include <linux/suspend.h>
#include <linux/pm.h>
@@ -28,6 +29,12 @@ static unsigned long events_in_progress;

static DEFINE_SPINLOCK(events_lock);

+static void pm_wakeup_timer_fn(unsigned long data);
+
+static DEFINE_TIMER(events_timer, pm_wakeup_timer_fn, 0, 0);
+static unsigned long events_timer_expires;
+static unsigned long delayed_count;
+
/*
* The functions below use the observation that each wakeup event starts a
* period in which the system should not be suspended. The moment this period
@@ -103,17 +110,23 @@ void pm_relax(void)
}

/**
- * pm_wakeup_work_fn - Deferred closing of a wakeup event.
+ * pm_wakeup_timer_fn - Deferred closing of a wakeup event.
*
* Execute pm_relax() for a wakeup event detected in the past and free the
* work item object used for queuing up the work.
*/
-static void pm_wakeup_work_fn(struct work_struct *work)
+static void pm_wakeup_timer_fn(unsigned long data)
{
- struct delayed_work *dwork = to_delayed_work(work);
+ unsigned long flags;

- pm_relax();
- kfree(dwork);
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_timer_expires && time_after(jiffies, events_timer_expires)) {
+ events_in_progress -= delayed_count;
+ event_count += delayed_count;
+ delayed_count = 0;
+ events_timer_expires = 0;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
}

/**
@@ -132,19 +145,31 @@ static void pm_wakeup_work_fn(struct wor
void pm_wakeup_event(struct device *dev, unsigned int msec)
{
unsigned long flags;
- struct delayed_work *dwork;
-
- dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;

spin_lock_irqsave(&events_lock, flags);
if (dev)
dev->power.wakeup_count++;

- if (dwork) {
- INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
- schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+ if (msec) {
+ ktime_t kt;
+ struct timespec ts;
+ unsigned long expires;
+
+ kt = ktime_get();
+ kt = ktime_add_ns(kt, msec * NSEC_PER_MSEC);
+ ts = ktime_to_timespec(kt);
+ expires = timespec_to_jiffies(&ts);
+ if (!expires)
+ expires = 1;
+
+ if (!events_timer_expires
+ || time_after(expires, events_timer_expires)) {
+ mod_timer(&events_timer, expires);
+ events_timer_expires = expires;
+ }

events_in_progress++;
+ delayed_count++;
} else {
event_count++;
}

2010-07-01 13:32:19

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

Hi!

> @@ -114,3 +114,17 @@ Description:
> if this file contains "1", which is the default. It may be
> disabled by writing "0" to this file, in which case all devices
> will be suspended and resumed synchronously.
> +
> +What: /sys/power/wakeup_count
> +Date: July 2010
> +Contact: Rafael J. Wysocki <[email protected]>
> +Description:
> + The /sys/power/wakeup_count file allows user space to avoid
> + losing wakeup events when transitioning the system into a sleep
> + state. Reading from it returns the current number of registered
> + wakeup events and it blocks if some wakeup events are being
> + processed at the time the file is read from. Writing to it
> + will only succeed if the current number of wakeup events is
> + equal to the written value and, if successful, will make the
> + kernel abort a subsequent transition to a sleep state if any
> + wakeup events are reported after the write has returned.

I assume that second suspend always succeeds?

I can't say I quite like the way two sysfs files interact with each
other, but it is certainly better then wakelocks...

Maybe we should create sys_suspend()?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-07-01 13:58:17

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thu, 1 Jul 2010, Rafael J. Wysocki wrote:

> I invented a slightly different version in the meantime, which is appended
> as a patch on top of the original one (I don't want to modify the original
> patch, since it's been reviewed already by several people and went to my
> linux-next branch).

> /**
> - * pm_wakeup_work_fn - Deferred closing of a wakeup event.
> + * pm_wakeup_timer_fn - Deferred closing of a wakeup event.
> *
> * Execute pm_relax() for a wakeup event detected in the past and free the
> * work item object used for queuing up the work.
> */
> -static void pm_wakeup_work_fn(struct work_struct *work)
> +static void pm_wakeup_timer_fn(unsigned long data)
> {
> - struct delayed_work *dwork = to_delayed_work(work);
> + unsigned long flags;
>
> - pm_relax();
> - kfree(dwork);
> + spin_lock_irqsave(&events_lock, flags);
> + if (events_timer_expires && time_after(jiffies, events_timer_expires)) {

Should be time_after_eq.

> + events_in_progress -= delayed_count;
> + event_count += delayed_count;
> + delayed_count = 0;
> + events_timer_expires = 0;
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> }
>
> /**
> @@ -132,19 +145,31 @@ static void pm_wakeup_work_fn(struct wor
> void pm_wakeup_event(struct device *dev, unsigned int msec)
> {
> unsigned long flags;
> - struct delayed_work *dwork;
> -
> - dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
>
> spin_lock_irqsave(&events_lock, flags);
> if (dev)
> dev->power.wakeup_count++;
>
> - if (dwork) {
> - INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
> - schedule_delayed_work(dwork, msecs_to_jiffies(msec));
> + if (msec) {
> + ktime_t kt;
> + struct timespec ts;
> + unsigned long expires;
> +
> + kt = ktime_get();
> + kt = ktime_add_ns(kt, msec * NSEC_PER_MSEC);
> + ts = ktime_to_timespec(kt);
> + expires = timespec_to_jiffies(&ts);

Is this somehow better than jiffies + msecs_to_jiffies(msec)?

> + if (!expires)
> + expires = 1;
> +
> + if (!events_timer_expires
> + || time_after(expires, events_timer_expires)) {
> + mod_timer(&events_timer, expires);
> + events_timer_expires = expires;
> + }
>
> events_in_progress++;
> + delayed_count++;
> } else {
> event_count++;
> }

Alan Stern

2010-07-01 15:09:00

by Florian Mickler

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thu, 1 Jul 2010 15:32:08 +0200
Pavel Machek <[email protected]> wrote:

> Hi!
>
> > @@ -114,3 +114,17 @@ Description:
> > if this file contains "1", which is the default. It may be
> > disabled by writing "0" to this file, in which case all devices
> > will be suspended and resumed synchronously.
> > +
> > +What: /sys/power/wakeup_count
> > +Date: July 2010
> > +Contact: Rafael J. Wysocki <[email protected]>
> > +Description:
> > + The /sys/power/wakeup_count file allows user space to avoid
> > + losing wakeup events when transitioning the system into a sleep
> > + state. Reading from it returns the current number of registered
> > + wakeup events and it blocks if some wakeup events are being
> > + processed at the time the file is read from. Writing to it
> > + will only succeed if the current number of wakeup events is
> > + equal to the written value and, if successful, will make the
> > + kernel abort a subsequent transition to a sleep state if any
> > + wakeup events are reported after the write has returned.


> I can't say I quite like the way two sysfs files interact with each
> other, but it is certainly better then wakelocks...

What two files?

Cheers,
Flo

2010-07-01 19:04:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thursday, July 01, 2010, Pavel Machek wrote:
> Hi!
>
> > @@ -114,3 +114,17 @@ Description:
> > if this file contains "1", which is the default. It may be
> > disabled by writing "0" to this file, in which case all devices
> > will be suspended and resumed synchronously.
> > +
> > +What: /sys/power/wakeup_count
> > +Date: July 2010
> > +Contact: Rafael J. Wysocki <[email protected]>
> > +Description:
> > + The /sys/power/wakeup_count file allows user space to avoid
> > + losing wakeup events when transitioning the system into a sleep
> > + state. Reading from it returns the current number of registered
> > + wakeup events and it blocks if some wakeup events are being
> > + processed at the time the file is read from. Writing to it
> > + will only succeed if the current number of wakeup events is
> > + equal to the written value and, if successful, will make the
> > + kernel abort a subsequent transition to a sleep state if any
> > + wakeup events are reported after the write has returned.
>
> I assume that second suspend always succeeds?

The mechanism is one-shot if that's what you're asking for.

> I can't say I quite like the way two sysfs files interact with each
> other, but it is certainly better then wakelocks...
>
> Maybe we should create sys_suspend()?

Well, one can modify pm-utils to use the new sysfs file quite easily, but it
wouldn't be that easy with sys_suspend() IMO.

Rafael

2010-07-01 20:10:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thursday, July 01, 2010, Alan Stern wrote:
> On Thu, 1 Jul 2010, Rafael J. Wysocki wrote:
>
> > I invented a slightly different version in the meantime, which is appended
> > as a patch on top of the original one (I don't want to modify the original
> > patch, since it's been reviewed already by several people and went to my
> > linux-next branch).
>
> > /**
> > - * pm_wakeup_work_fn - Deferred closing of a wakeup event.
> > + * pm_wakeup_timer_fn - Deferred closing of a wakeup event.
> > *
> > * Execute pm_relax() for a wakeup event detected in the past and free the
> > * work item object used for queuing up the work.
> > */
> > -static void pm_wakeup_work_fn(struct work_struct *work)
> > +static void pm_wakeup_timer_fn(unsigned long data)
> > {
> > - struct delayed_work *dwork = to_delayed_work(work);
> > + unsigned long flags;
> >
> > - pm_relax();
> > - kfree(dwork);
> > + spin_lock_irqsave(&events_lock, flags);
> > + if (events_timer_expires && time_after(jiffies, events_timer_expires)) {
>
> Should be time_after_eq.

Yes, it should.

> > + events_in_progress -= delayed_count;
> > + event_count += delayed_count;
> > + delayed_count = 0;
> > + events_timer_expires = 0;
> > + }
> > + spin_unlock_irqrestore(&events_lock, flags);
> > }
> >
> > /**
> > @@ -132,19 +145,31 @@ static void pm_wakeup_work_fn(struct wor
> > void pm_wakeup_event(struct device *dev, unsigned int msec)
> > {
> > unsigned long flags;
> > - struct delayed_work *dwork;
> > -
> > - dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
> >
> > spin_lock_irqsave(&events_lock, flags);
> > if (dev)
> > dev->power.wakeup_count++;
> >
> > - if (dwork) {
> > - INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
> > - schedule_delayed_work(dwork, msecs_to_jiffies(msec));
> > + if (msec) {
> > + ktime_t kt;
> > + struct timespec ts;
> > + unsigned long expires;
> > +
> > + kt = ktime_get();
> > + kt = ktime_add_ns(kt, msec * NSEC_PER_MSEC);
> > + ts = ktime_to_timespec(kt);
> > + expires = timespec_to_jiffies(&ts);
>
> Is this somehow better than jiffies + msecs_to_jiffies(msec)?

I'm not sure about overflows. That said, the "+" version is used in many
places, so there's no problem I think.

> > + if (!expires)
> > + expires = 1;
> > +
> > + if (!events_timer_expires
> > + || time_after(expires, events_timer_expires)) {
> > + mod_timer(&events_timer, expires);
> > + events_timer_expires = expires;
> > + }
> >
> > events_in_progress++;
> > + delayed_count++;
> > } else {
> > event_count++;
> > }

I think your version is better for a few reasons, so I created the appended
patch.

Rafael

---
Subject: PM: Do not use dynamically allocated objects in pm_wakeup_event()

Originally, pm_wakeup_event() uses struct delayed_work objects,
allocated with GFP_ATOMIC, to schedule the execution of pm_relax()
in future. However, as noted by Alan Stern, it is not necessary to
do that, because all pm_wakeup_event() calls can use one static timer
that will always be set to expire at the latest time passed to
pm_wakeup_event().

The modifications are based on the example code posted by Alan.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/base/power/wakeup.c | 57 +++++++++++++++++++++++++++++---------------
1 file changed, 38 insertions(+), 19 deletions(-)

Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- linux-2.6.orig/drivers/base/power/wakeup.c
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -9,6 +9,7 @@
#include <linux/device.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/ktime.h>
#include <linux/capability.h>
#include <linux/suspend.h>
#include <linux/pm.h>
@@ -28,6 +29,11 @@ static unsigned long events_in_progress;

static DEFINE_SPINLOCK(events_lock);

+static void pm_wakeup_timer_fn(unsigned long data);
+
+static DEFINE_TIMER(events_timer, pm_wakeup_timer_fn, 0, 0);
+static unsigned long events_timer_expires;
+
/*
* The functions below use the observation that each wakeup event starts a
* period in which the system should not be suspended. The moment this period
@@ -103,17 +109,22 @@ void pm_relax(void)
}

/**
- * pm_wakeup_work_fn - Deferred closing of a wakeup event.
+ * pm_wakeup_timer_fn - Delayed finalization of a wakeup event.
*
- * Execute pm_relax() for a wakeup event detected in the past and free the
- * work item object used for queuing up the work.
+ * Decrease the counter of wakeup events being processed after it was increased
+ * by pm_wakeup_event().
*/
-static void pm_wakeup_work_fn(struct work_struct *work)
+static void pm_wakeup_timer_fn(unsigned long data)
{
- struct delayed_work *dwork = to_delayed_work(work);
+ unsigned long flags;

- pm_relax();
- kfree(dwork);
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_timer_expires
+ && time_before_eq(events_timer_expires, jiffies)) {
+ events_in_progress--;
+ events_timer_expires = 0;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
}

/**
@@ -123,30 +134,38 @@ static void pm_wakeup_work_fn(struct wor
*
* Notify the PM core of a wakeup event (signaled by @dev) that will take
* approximately @msec milliseconds to be processed by the kernel. Increment
- * the counter of wakeup events being processed and queue up a work item
- * that will execute pm_relax() for the event after @msec milliseconds. If @dev
- * is not NULL, the counter of wakeup events related to @dev is incremented too.
+ * the counter of registered wakeup events and (if @msec is nonzero) set up
+ * the wakeup events timer to execute pm_wakeup_timer_fn() in future (if the
+ * timer has not been set up already, increment the counter of wakeup events
+ * being processed). If @dev is not NULL, the counter of wakeup events related
+ * to @dev is incremented too.
*
* It is safe to call this function from interrupt context.
*/
void pm_wakeup_event(struct device *dev, unsigned int msec)
{
unsigned long flags;
- struct delayed_work *dwork;
-
- dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;

spin_lock_irqsave(&events_lock, flags);
+ event_count++;
if (dev)
dev->power.wakeup_count++;

- if (dwork) {
- INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
- schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+ if (msec) {
+ unsigned long expires;

- events_in_progress++;
- } else {
- event_count++;
+ expires = jiffies + msecs_to_jiffies(msec);
+ if (!expires)
+ expires = 1;
+
+ if (!events_timer_expires
+ || time_after(expires, events_timer_expires)) {
+ if (!events_timer_expires)
+ events_in_progress++;
+
+ mod_timer(&events_timer, expires);
+ events_timer_expires = expires;
+ }
}
spin_unlock_irqrestore(&events_lock, flags);
}

2010-07-01 20:44:10

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thu, 1 Jul 2010, Rafael J. Wysocki wrote:

> > > + if (msec) {
> > > + ktime_t kt;
> > > + struct timespec ts;
> > > + unsigned long expires;
> > > +
> > > + kt = ktime_get();
> > > + kt = ktime_add_ns(kt, msec * NSEC_PER_MSEC);
> > > + ts = ktime_to_timespec(kt);
> > > + expires = timespec_to_jiffies(&ts);
> >
> > Is this somehow better than jiffies + msecs_to_jiffies(msec)?
>
> I'm not sure about overflows. That said, the "+" version is used in many
> places, so there's no problem I think.

Hmm. NSEC_PER_MSEC must be one million, right? So if msec referred to
anything above 4 seconds (which seems unlikely but not impossible), the
multiplication would overflow on a 32-bit machine.

Apart from that, the main difference between the two patches lies in
when the events are counted, i.e., whether event_count gets incremented
at the start or when the timer expires. I can't see that it matters
much either way.


> Index: linux-2.6/drivers/base/power/wakeup.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/wakeup.c
> +++ linux-2.6/drivers/base/power/wakeup.c
> @@ -9,6 +9,7 @@
> #include <linux/device.h>
> #include <linux/slab.h>
> #include <linux/sched.h>
> +#include <linux/ktime.h>

This isn't needed any more.

Alan Stern

2010-07-01 21:07:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thursday, July 01, 2010, Alan Stern wrote:
> On Thu, 1 Jul 2010, Rafael J. Wysocki wrote:
>
> > > > + if (msec) {
> > > > + ktime_t kt;
> > > > + struct timespec ts;
> > > > + unsigned long expires;
> > > > +
> > > > + kt = ktime_get();
> > > > + kt = ktime_add_ns(kt, msec * NSEC_PER_MSEC);
> > > > + ts = ktime_to_timespec(kt);
> > > > + expires = timespec_to_jiffies(&ts);
> > >
> > > Is this somehow better than jiffies + msecs_to_jiffies(msec)?
> >
> > I'm not sure about overflows. That said, the "+" version is used in many
> > places, so there's no problem I think.
>
> Hmm. NSEC_PER_MSEC must be one million, right? So if msec referred to
> anything above 4 seconds (which seems unlikely but not impossible), the
> multiplication would overflow on a 32-bit machine.
>
> Apart from that, the main difference between the two patches lies in
> when the events are counted, i.e., whether event_count gets incremented
> at the start or when the timer expires. I can't see that it matters
> much either way.

Well, your version doesn't require the additional static variable and
generally takes fewer lines of code. :-)

> > Index: linux-2.6/drivers/base/power/wakeup.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/base/power/wakeup.c
> > +++ linux-2.6/drivers/base/power/wakeup.c
> > @@ -9,6 +9,7 @@
> > #include <linux/device.h>
> > #include <linux/slab.h>
> > #include <linux/sched.h>
> > +#include <linux/ktime.h>
>
> This isn't needed any more.

Right, I've dropped this line.

Rafael

2010-07-02 18:14:59

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Thu 2010-07-01 21:02:27, Rafael J. Wysocki wrote:
> On Thursday, July 01, 2010, Pavel Machek wrote:
> > Hi!
> >
> > > @@ -114,3 +114,17 @@ Description:
> > > if this file contains "1", which is the default. It may be
> > > disabled by writing "0" to this file, in which case all devices
> > > will be suspended and resumed synchronously.
> > > +
> > > +What: /sys/power/wakeup_count
> > > +Date: July 2010
> > > +Contact: Rafael J. Wysocki <[email protected]>
> > > +Description:
> > > + The /sys/power/wakeup_count file allows user space to avoid
> > > + losing wakeup events when transitioning the system into a sleep
> > > + state. Reading from it returns the current number of registered
> > > + wakeup events and it blocks if some wakeup events are being
> > > + processed at the time the file is read from. Writing to it
> > > + will only succeed if the current number of wakeup events is
> > > + equal to the written value and, if successful, will make the
> > > + kernel abort a subsequent transition to a sleep state if any
> > > + wakeup events are reported after the write has returned.
> >
> > I assume that second suspend always succeeds?
>
> The mechanism is one-shot if that's what you're asking for.

Yep, it would be nice to document it.

Plus maybe... should there be way to clear the wakeup_count? For
example when userspace decides that battery is low and that it wants
to go to sleep now?

> > I can't say I quite like the way two sysfs files interact with each
> > other, but it is certainly better then wakelocks...
> >
> > Maybe we should create sys_suspend()?
>
> Well, one can modify pm-utils to use the new sysfs file quite easily, but it
> wouldn't be that easy with sys_suspend() IMO.

Well... small C helper should be easy...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-07-02 19:23:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] PM: Make it possible to avoid wakeup events from being lost

On Friday, July 02, 2010, Pavel Machek wrote:
> On Thu 2010-07-01 21:02:27, Rafael J. Wysocki wrote:
> > On Thursday, July 01, 2010, Pavel Machek wrote:
> > > Hi!
> > >
> > > > @@ -114,3 +114,17 @@ Description:
> > > > if this file contains "1", which is the default. It may be
> > > > disabled by writing "0" to this file, in which case all devices
> > > > will be suspended and resumed synchronously.
> > > > +
> > > > +What: /sys/power/wakeup_count
> > > > +Date: July 2010
> > > > +Contact: Rafael J. Wysocki <[email protected]>
> > > > +Description:
> > > > + The /sys/power/wakeup_count file allows user space to avoid
> > > > + losing wakeup events when transitioning the system into a sleep
> > > > + state. Reading from it returns the current number of registered
> > > > + wakeup events and it blocks if some wakeup events are being
> > > > + processed at the time the file is read from. Writing to it
> > > > + will only succeed if the current number of wakeup events is
> > > > + equal to the written value and, if successful, will make the
> > > > + kernel abort a subsequent transition to a sleep state if any
> > > > + wakeup events are reported after the write has returned.
> > >
> > > I assume that second suspend always succeeds?
> >
> > The mechanism is one-shot if that's what you're asking for.
>
> Yep, it would be nice to document it.
>
> Plus maybe... should there be way to clear the wakeup_count? For
> example when userspace decides that battery is low and that it wants
> to go to sleep now?

You don't need to clear it, just don't write to it before writing to
/sys/power/state.

Rafael