2010-06-19 22:07:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

Hi,

One of the arguments during the suspend blockers discussion was that the
mainline kernel didn't contain any mechanisms allowing it to avoid losing
wakeup events during system suspend.

Generally, there are two problems in that area. First, if a wakeup event
occurs exactly at the same time when /sys/power/state is being written to,
the even may be delivered to user space right before the freezing of it,
in which case the user space consumer of the event may not be able to process
it before the system is suspended. Second, if a wakeup event occurs after user
space has been frozen and that event is not a wakeup interrupt, the kernel will
not react to it and the system will be suspended.

The following patch illustrates my idea of how these two problems may be
addressed. It introduces a new global sysfs attribute,
/sys/power/wakeup_count, associated with a running counter of wakeup events
and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
to increment the wakeup events counter.

/sys/power/wakeup_count may be read from or written to by user space. Reads
will always succeed and return the current value of the wakeup events counter.
Writes, however, will only succeed if the written number is equal to the
current value of the wakeup events counter. If a write is successful, it will
cause the kernel to save the current value of the wakeup events counter and
to compare the saved number with the current value of the counter at certain
points of the subsequent suspend (or hibernate) sequence. If the two values
don't match, the suspend will be aborted just as though a wakeup interrupt
happened. Reading from /sys/power/wakeup_count again will turn that mechanism
off.

The assumption is that there's a user space power manager that will first
read from /sys/power/wakeup_count. Then it will check all user space consumers
of wakeup events known to it for unprocessed events. If there are any, it will
wait for them to be processed and repeat. In turn, if there are not any,
it will try to write to /sys/power/wakeup_count and if the write is successful,
it will write to /sys/power/state to start suspend, so if any wakeup events
accur past that point, they will be noticed by the kernel and will eventually
cause the suspend to be aborted.

In addition to the above, the patch adds a wakeup events counter to the
power member of struct device and makes these per-device wakeup event counters
available via sysfs, so that it's possible to check the activity of various
wakeup event sources within the kernel.

To illustrate how subsystems can use pm_wakeup_event(), I added it to the
PCI runtime PM wakeup-handling code.

At the moment the patch only contains code changes (ie. no documentation),
but I'm going to add comments etc. if people like the idea.

Please tell me what you think.

Rafael

---
drivers/base/power/Makefile | 2 -
drivers/base/power/main.c | 1
drivers/base/power/power.h | 3 +
drivers/base/power/sysfs.c | 9 ++++
drivers/base/power/wakeup.c | 74 ++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 2 +
drivers/pci/pcie/pme/pcie_pme.c | 2 +
include/linux/pm.h | 6 +++
kernel/power/hibernate.c | 14 ++++---
kernel/power/main.c | 24 ++++++++++++
kernel/power/power.h | 6 +++
kernel/power/suspend.c | 2 -
12 files changed, 138 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_get_wakeup_count());
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +258,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
@@ -266,6 +289,7 @@ static int __init pm_init(void)
int error = pm_start_workqueue();
if (error)
return error;
+ pm_wakeup_events_init();
power_kobj = kobject_create_and_add("power", NULL);
if (!power_kobj)
return -ENOMEM;
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,74 @@
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+static unsigned long event_count;
+static unsigned long saved_event_count;
+static bool events_check_enabled;
+static spinlock_t events_lock;
+
+void pm_wakeup_events_init(void)
+{
+ spin_lock_init(&events_lock);
+}
+
+void pm_wakeup_event(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ event_count++;
+ if (dev)
+ dev->power.wakeup_count++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+bool pm_check_wakeup_events(bool enable)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(&events_lock, flags);
+ ret = !events_check_enabled || (event_count == saved_event_count);
+ events_check_enabled = enable;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_get_wakeup_count(void)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ events_check_enabled = false;
+ count = event_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
+
+bool pm_save_wakeup_count(unsigned long count)
+{
+ unsigned long flags;
+ bool ret = false;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (count == event_count) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_dev_wakeup_count(struct device *dev)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ count = dev->power.wakeup_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,9 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +569,8 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -184,6 +184,12 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_events_init(void);
+extern bool pm_check_wakeup_events(bool enable);
+extern unsigned long pm_get_wakeup_count(void);
+extern bool pm_save_wakeup_count(unsigned long count);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -157,7 +157,7 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events(false))
error = suspend_ops->enter(state);
sysdev_resume();
}
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events(true))
goto Power_up;

in_suspend = 1;
@@ -511,14 +511,18 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events(false))
+ goto Power_up;
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,8 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ if (device_may_wakeup(&pci_dev->dev))
+ pm_wakeup_event(&pci_dev->dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -147,6 +147,8 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev);
ret = true;
}

Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
extern void device_pm_move_after(struct device *, struct device *);
extern void device_pm_move_last(struct device *);

+/* drivers/base/power/wakeup.c */
+extern unsigned long pm_dev_wakeup_count(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

static inline void device_pm_init(struct device *dev)
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME


2010-06-20 05:52:38

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, Jun 20, 2010 at 12:05:35AM +0200, Rafael J. Wysocki wrote:
> Hi,
>
> One of the arguments during the suspend blockers discussion was that the
> mainline kernel didn't contain any mechanisms allowing it to avoid losing
> wakeup events during system suspend.
>
> Generally, there are two problems in that area. First, if a wakeup event
> occurs exactly at the same time when /sys/power/state is being written to,
> the even may be delivered to user space right before the freezing of it,
> in which case the user space consumer of the event may not be able to process
yes this is racy. souldn't the wakeup event handers/driver force a user
mode ACK before they stop failing suspend attempts?

> it before the system is suspended. Second, if a wakeup event occurs after user
> space has been frozen and that event is not a wakeup interrupt, the kernel will
> not react to it and the system will be suspended.

If its not a wakeup interrupt is it not fair to allow the suspend to
happen even if its handler's are "in flight" at suspend time?
>
> The following patch illustrates my idea of how these two problems may be
> addressed. It introduces a new global sysfs attribute,
> /sys/power/wakeup_count, associated with a running counter of wakeup events
> and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> to increment the wakeup events counter.
>
> /sys/power/wakeup_count may be read from or written to by user space. Reads
> will always succeed and return the current value of the wakeup events counter.
> Writes, however, will only succeed if the written number is equal to the
> current value of the wakeup events counter. If a write is successful, it will
> cause the kernel to save the current value of the wakeup events counter and
> to compare the saved number with the current value of the counter at certain
> points of the subsequent suspend (or hibernate) sequence. If the two values
> don't match, the suspend will be aborted just as though a wakeup interrupt
> happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> off.

why would you want to turn it off?

>
> The assumption is that there's a user space power manager that will first
> read from /sys/power/wakeup_count. Then it will check all user space consumers
> of wakeup events known to it for unprocessed events. If there are any, it will
> wait for them to be processed and repeat. In turn, if there are not any,
> it will try to write to /sys/power/wakeup_count and if the write is successful,
> it will write to /sys/power/state to start suspend, so if any wakeup events
> accur past that point, they will be noticed by the kernel and will eventually
> cause the suspend to be aborted.
>
> In addition to the above, the patch adds a wakeup events counter to the
> power member of struct device and makes these per-device wakeup event counters
> available via sysfs, so that it's possible to check the activity of various
> wakeup event sources within the kernel.
>
> To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> PCI runtime PM wakeup-handling code.
>
> At the moment the patch only contains code changes (ie. no documentation),
> but I'm going to add comments etc. if people like the idea.
>
> Please tell me what you think.
>
> Rafael
>
> ---
> drivers/base/power/Makefile | 2 -
> drivers/base/power/main.c | 1
> drivers/base/power/power.h | 3 +
> drivers/base/power/sysfs.c | 9 ++++
> drivers/base/power/wakeup.c | 74 ++++++++++++++++++++++++++++++++++++++++
> drivers/pci/pci-acpi.c | 2 +
> drivers/pci/pcie/pme/pcie_pme.c | 2 +
> include/linux/pm.h | 6 +++
> kernel/power/hibernate.c | 14 ++++---
> kernel/power/main.c | 24 ++++++++++++
> kernel/power/power.h | 6 +++
> kernel/power/suspend.c | 2 -
> 12 files changed, 138 insertions(+), 7 deletions(-)
>
> Index: linux-2.6/kernel/power/main.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/main.c
> +++ linux-2.6/kernel/power/main.c
> @@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec
>
> power_attr(state);
>
> +static ssize_t wakeup_count_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%lu\n", pm_get_wakeup_count());
> +}
> +
> +static ssize_t wakeup_count_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t n)
> +{
> + unsigned long val;
> +
> + if (sscanf(buf, "%lu", &val) == 1) {
> + if (pm_save_wakeup_count(val))
> + return n;
> + }
> + return -EINVAL;
> +}
> +
> +power_attr(wakeup_count);
> +
> #ifdef CONFIG_PM_TRACE
> int pm_trace_enabled;
>
> @@ -236,6 +258,7 @@ static struct attribute * g[] = {
> #endif
> #ifdef CONFIG_PM_SLEEP
> &pm_async_attr.attr,
> + &wakeup_count_attr.attr,
> #ifdef CONFIG_PM_DEBUG
> &pm_test_attr.attr,
> #endif
> @@ -266,6 +289,7 @@ static int __init pm_init(void)
> int error = pm_start_workqueue();
> if (error)
> return error;
> + pm_wakeup_events_init();
> power_kobj = kobject_create_and_add("power", NULL);
> if (!power_kobj)
> return -ENOMEM;
> Index: linux-2.6/drivers/base/power/wakeup.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/drivers/base/power/wakeup.c
> @@ -0,0 +1,74 @@
> +
> +#include <linux/device.h>
> +#include <linux/pm.h>
> +
> +static unsigned long event_count;
> +static unsigned long saved_event_count;

what about over flow issues?

> +static bool events_check_enabled;
> +static spinlock_t events_lock;
> +
> +void pm_wakeup_events_init(void)
> +{
> + spin_lock_init(&events_lock);
> +}
> +
> +void pm_wakeup_event(struct device *dev)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + event_count++;
should event_count be an atomic type so we can not bother with taking
the evnets_lock?

> + if (dev)
> + dev->power.wakeup_count++;
> + spin_unlock_irqrestore(&events_lock, flags);
> +}
> +
> +bool pm_check_wakeup_events(bool enable)
> +{
> + unsigned long flags;
> + bool ret;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + ret = !events_check_enabled || (event_count == saved_event_count);
I'm not getting the events_check_enbled flag yet.

> + events_check_enabled = enable;
I'm not sure if this is the right thing depending on all the different
ways the boolians are interacting with eachother.

Which is a red flag to me. This code is confusing.


I'll look at it some more when I'm fresh tomorrow.

--mgross

> + spin_unlock_irqrestore(&events_lock, flags);
> + return ret;
> +}
> +
> +unsigned long pm_get_wakeup_count(void)
> +{
> + unsigned long flags;
> + unsigned long count;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + events_check_enabled = false;
> + count = event_count;
> + spin_unlock_irqrestore(&events_lock, flags);
> + return count;
> +}
> +
> +bool pm_save_wakeup_count(unsigned long count)
> +{
> + unsigned long flags;
> + bool ret = false;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + if (count == event_count) {
> + saved_event_count = count;
> + events_check_enabled = true;
> + ret = true;
> + }
> + spin_unlock_irqrestore(&events_lock, flags);
> + return ret;
> +}
> +
> +unsigned long pm_dev_wakeup_count(struct device *dev)
> +{
> + unsigned long flags;
> + unsigned long count;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + count = dev->power.wakeup_count;
> + spin_unlock_irqrestore(&events_lock, flags);
> + return count;
> +}
> Index: linux-2.6/include/linux/pm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pm.h
> +++ linux-2.6/include/linux/pm.h
> @@ -457,6 +457,7 @@ struct dev_pm_info {
> #ifdef CONFIG_PM_SLEEP
> struct list_head entry;
> struct completion completion;
> + unsigned long wakeup_count;
> #endif
> #ifdef CONFIG_PM_RUNTIME
> struct timer_list suspend_timer;
> @@ -552,6 +553,9 @@ extern void __suspend_report_result(cons
> } while (0)
>
> extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
> +
> +/* drivers/base/power/wakeup.c */
> +extern void pm_wakeup_event(struct device *dev);
> #else /* !CONFIG_PM_SLEEP */
>
> #define device_pm_lock() do {} while (0)
> @@ -565,6 +569,8 @@ static inline int dpm_suspend_start(pm_m
> #define suspend_report_result(fn, ret) do {} while (0)
>
> static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
> +
> +static inline void pm_wakeup_event(struct device *dev) {}
> #endif /* !CONFIG_PM_SLEEP */
>
> /* How to reorder dpm_list after device_move() */
> Index: linux-2.6/drivers/base/power/Makefile
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/Makefile
> +++ linux-2.6/drivers/base/power/Makefile
> @@ -1,5 +1,5 @@
> obj-$(CONFIG_PM) += sysfs.o
> -obj-$(CONFIG_PM_SLEEP) += main.o
> +obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
> obj-$(CONFIG_PM_RUNTIME) += runtime.o
> obj-$(CONFIG_PM_OPS) += generic_ops.o
> obj-$(CONFIG_PM_TRACE_RTC) += trace.o
> Index: linux-2.6/drivers/base/power/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/main.c
> +++ linux-2.6/drivers/base/power/main.c
> @@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
> {
> dev->power.status = DPM_ON;
> init_completion(&dev->power.completion);
> + dev->power.wakeup_count = 0;
> pm_runtime_init(dev);
> }
>
> Index: linux-2.6/kernel/power/power.h
> ===================================================================
> --- linux-2.6.orig/kernel/power/power.h
> +++ linux-2.6/kernel/power/power.h
> @@ -184,6 +184,12 @@ static inline void suspend_test_finish(c
> #ifdef CONFIG_PM_SLEEP
> /* kernel/power/main.c */
> extern int pm_notifier_call_chain(unsigned long val);
> +
> +/* drivers/base/power/wakeup.c */
> +extern void pm_wakeup_events_init(void);
> +extern bool pm_check_wakeup_events(bool enable);
> +extern unsigned long pm_get_wakeup_count(void);
> +extern bool pm_save_wakeup_count(unsigned long count);
> #endif
>
> #ifdef CONFIG_HIGHMEM
> Index: linux-2.6/kernel/power/suspend.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/suspend.c
> +++ linux-2.6/kernel/power/suspend.c
> @@ -157,7 +157,7 @@ static int suspend_enter(suspend_state_t
>
> error = sysdev_suspend(PMSG_SUSPEND);
> if (!error) {
> - if (!suspend_test(TEST_CORE))
> + if (!suspend_test(TEST_CORE) && pm_check_wakeup_events(false))
> error = suspend_ops->enter(state);
> sysdev_resume();
> }
> Index: linux-2.6/kernel/power/hibernate.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/hibernate.c
> +++ linux-2.6/kernel/power/hibernate.c
> @@ -277,7 +277,7 @@ static int create_image(int platform_mod
> goto Enable_irqs;
> }
>
> - if (hibernation_test(TEST_CORE))
> + if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events(true))
> goto Power_up;
>
> in_suspend = 1;
> @@ -511,14 +511,18 @@ int hibernation_platform_enter(void)
>
> local_irq_disable();
> sysdev_suspend(PMSG_HIBERNATE);
> + if (!pm_check_wakeup_events(false))
> + goto Power_up;
> +
> hibernation_ops->enter();
> /* We should never get here */
> while (1);
>
> - /*
> - * We don't need to reenable the nonboot CPUs or resume consoles, since
> - * the system is going to be halted anyway.
> - */
> + Power_up:
> + sysdev_resume();
> + local_irq_enable();
> + enable_nonboot_cpus();
> +
> Platform_finish:
> hibernation_ops->finish();
>
> Index: linux-2.6/drivers/pci/pci-acpi.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pci-acpi.c
> +++ linux-2.6/drivers/pci/pci-acpi.c
> @@ -48,6 +48,8 @@ static void pci_acpi_wake_dev(acpi_handl
> if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
> pci_check_pme_status(pci_dev);
> pm_runtime_resume(&pci_dev->dev);
> + if (device_may_wakeup(&pci_dev->dev))
> + pm_wakeup_event(&pci_dev->dev);
> if (pci_dev->subordinate)
> pci_pme_wakeup_bus(pci_dev->subordinate);
> }
> Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
> +++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
> @@ -147,6 +147,8 @@ static bool pcie_pme_walk_bus(struct pci
> /* Skip PCIe devices in case we started from a root port. */
> if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
> pm_request_resume(&dev->dev);
> + if (device_may_wakeup(&dev->dev))
> + pm_wakeup_event(&dev->dev);
> ret = true;
> }
>
> Index: linux-2.6/drivers/base/power/power.h
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/power.h
> +++ linux-2.6/drivers/base/power/power.h
> @@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
> extern void device_pm_move_after(struct device *, struct device *);
> extern void device_pm_move_last(struct device *);
>
> +/* drivers/base/power/wakeup.c */
> +extern unsigned long pm_dev_wakeup_count(struct device *dev);
> +
> #else /* !CONFIG_PM_SLEEP */
>
> static inline void device_pm_init(struct device *dev)
> Index: linux-2.6/drivers/base/power/sysfs.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/power/sysfs.c
> +++ linux-2.6/drivers/base/power/sysfs.c
> @@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d
>
> static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);
>
> +static ssize_t wakeup_count_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
> +}
> +
> +static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
> +
> #ifdef CONFIG_PM_ADVANCED_DEBUG
> #ifdef CONFIG_PM_RUNTIME
>
> @@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
> &dev_attr_control.attr,
> #endif
> &dev_attr_wakeup.attr,
> + &dev_attr_wakeup_count.attr,
> #ifdef CONFIG_PM_ADVANCED_DEBUG
> &dev_attr_async.attr,
> #ifdef CONFIG_PM_RUNTIME
>

2010-06-20 12:51:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sunday, June 20, 2010, mark gross wrote:
> On Sun, Jun 20, 2010 at 12:05:35AM +0200, Rafael J. Wysocki wrote:
> > Hi,
> >
> > One of the arguments during the suspend blockers discussion was that the
> > mainline kernel didn't contain any mechanisms allowing it to avoid losing
> > wakeup events during system suspend.
> >
> > Generally, there are two problems in that area. First, if a wakeup event
> > occurs exactly at the same time when /sys/power/state is being written to,
> > the even may be delivered to user space right before the freezing of it,
> > in which case the user space consumer of the event may not be able to process
> yes this is racy. souldn't the wakeup event handers/driver force a user
> mode ACK before they stop failing suspend attempts?

I'm not sure what you mean. There are wake-up events without any user space
consumers, like wake-on-LAN.

> > it before the system is suspended. Second, if a wakeup event occurs after user
> > space has been frozen and that event is not a wakeup interrupt, the kernel will
> > not react to it and the system will be suspended.
>
> If its not a wakeup interrupt is it not fair to allow the suspend to
> happen even if its handler's are "in flight" at suspend time?

A wakeup event need not be an interrupt, or at least there are interrupts that
have other meanings, like ACPI SCI.

> > The following patch illustrates my idea of how these two problems may be
> > addressed. It introduces a new global sysfs attribute,
> > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > to increment the wakeup events counter.
> >
> > /sys/power/wakeup_count may be read from or written to by user space. Reads
> > will always succeed and return the current value of the wakeup events counter.
> > Writes, however, will only succeed if the written number is equal to the
> > current value of the wakeup events counter. If a write is successful, it will
> > cause the kernel to save the current value of the wakeup events counter and
> > to compare the saved number with the current value of the counter at certain
> > points of the subsequent suspend (or hibernate) sequence. If the two values
> > don't match, the suspend will be aborted just as though a wakeup interrupt
> > happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> > off.
>
> why would you want to turn it off?

In some cases I may want to suspend/hibernate losing some wakeup events,
like for example in the case of emergency hibernation on critical battery.

> > The assumption is that there's a user space power manager that will first
> > read from /sys/power/wakeup_count. Then it will check all user space consumers
> > of wakeup events known to it for unprocessed events. If there are any, it will
> > wait for them to be processed and repeat. In turn, if there are not any,
> > it will try to write to /sys/power/wakeup_count and if the write is successful,
> > it will write to /sys/power/state to start suspend, so if any wakeup events
> > accur past that point, they will be noticed by the kernel and will eventually
> > cause the suspend to be aborted.
> >
> > In addition to the above, the patch adds a wakeup events counter to the
> > power member of struct device and makes these per-device wakeup event counters
> > available via sysfs, so that it's possible to check the activity of various
> > wakeup event sources within the kernel.
> >
> > To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> > PCI runtime PM wakeup-handling code.
> >
> > At the moment the patch only contains code changes (ie. no documentation),
> > but I'm going to add comments etc. if people like the idea.
> >
> > Please tell me what you think.
> >
> > Rafael
> >
> > ---
> > drivers/base/power/Makefile | 2 -
> > drivers/base/power/main.c | 1
> > drivers/base/power/power.h | 3 +
> > drivers/base/power/sysfs.c | 9 ++++
> > drivers/base/power/wakeup.c | 74 ++++++++++++++++++++++++++++++++++++++++
> > drivers/pci/pci-acpi.c | 2 +
> > drivers/pci/pcie/pme/pcie_pme.c | 2 +
> > include/linux/pm.h | 6 +++
> > kernel/power/hibernate.c | 14 ++++---
> > kernel/power/main.c | 24 ++++++++++++
> > kernel/power/power.h | 6 +++
> > kernel/power/suspend.c | 2 -
> > 12 files changed, 138 insertions(+), 7 deletions(-)
> >
> > Index: linux-2.6/kernel/power/main.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/main.c
> > +++ linux-2.6/kernel/power/main.c
> > @@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec
> >
> > power_attr(state);
> >
> > +static ssize_t wakeup_count_show(struct kobject *kobj,
> > + struct kobj_attribute *attr,
> > + char *buf)
> > +{
> > + return sprintf(buf, "%lu\n", pm_get_wakeup_count());
> > +}
> > +
> > +static ssize_t wakeup_count_store(struct kobject *kobj,
> > + struct kobj_attribute *attr,
> > + const char *buf, size_t n)
> > +{
> > + unsigned long val;
> > +
> > + if (sscanf(buf, "%lu", &val) == 1) {
> > + if (pm_save_wakeup_count(val))
> > + return n;
> > + }
> > + return -EINVAL;
> > +}
> > +
> > +power_attr(wakeup_count);
> > +
> > #ifdef CONFIG_PM_TRACE
> > int pm_trace_enabled;
> >
> > @@ -236,6 +258,7 @@ static struct attribute * g[] = {
> > #endif
> > #ifdef CONFIG_PM_SLEEP
> > &pm_async_attr.attr,
> > + &wakeup_count_attr.attr,
> > #ifdef CONFIG_PM_DEBUG
> > &pm_test_attr.attr,
> > #endif
> > @@ -266,6 +289,7 @@ static int __init pm_init(void)
> > int error = pm_start_workqueue();
> > if (error)
> > return error;
> > + pm_wakeup_events_init();
> > power_kobj = kobject_create_and_add("power", NULL);
> > if (!power_kobj)
> > return -ENOMEM;
> > Index: linux-2.6/drivers/base/power/wakeup.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/drivers/base/power/wakeup.c
> > @@ -0,0 +1,74 @@
> > +
> > +#include <linux/device.h>
> > +#include <linux/pm.h>
> > +
> > +static unsigned long event_count;
> > +static unsigned long saved_event_count;
>
> what about over flow issues?

There's no issue AFAICS. It only matters if the value is different from the
previous one after incrementation.

> > +static bool events_check_enabled;
> > +static spinlock_t events_lock;
> > +
> > +void pm_wakeup_events_init(void)
> > +{
> > + spin_lock_init(&events_lock);
> > +}
> > +
> > +void pm_wakeup_event(struct device *dev)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&events_lock, flags);
> > + event_count++;
> should event_count be an atomic type so we can not bother with taking
> the evnets_lock?

The lock is there, because two counters are incremented at the same time.
Also, in some other places the counter is accessed along with the enable
flag, so there are two operations that need to be done in the same critical
section.

> > + if (dev)
> > + dev->power.wakeup_count++;
> > + spin_unlock_irqrestore(&events_lock, flags);
> > +}
> > +
> > +bool pm_check_wakeup_events(bool enable)
> > +{
> > + unsigned long flags;
> > + bool ret;
> > +
> > + spin_lock_irqsave(&events_lock, flags);
> > + ret = !events_check_enabled || (event_count == saved_event_count);
> I'm not getting the events_check_enbled flag yet.

I tells the PM core whether or not to check wakeup events during suspend.
The checking is only enabled after a successful write to
/sys/power/wakeup_count, it is disabled by reads and by the final check right
before the system enters suspend. [That's because the "saved" value from
the previous suspend should not be used for checking wakeup events during
the next one.]

> > + events_check_enabled = enable;
> I'm not sure if this is the right thing depending on all the different
> ways the boolians are interacting with eachother.
>
> Which is a red flag to me. This code is confusing.

Well, I could use check_wakeup_events() and
check_and_disable_wakeup_events() (the latter is only necessary for the final
check), but I'm not sure if that's going to be better.

> I'll look at it some more when I'm fresh tomorrow.

Thanks for the comments.

Rafael

2010-06-20 16:28:31

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:

> Hi,
>
> One of the arguments during the suspend blockers discussion was that the
> mainline kernel didn't contain any mechanisms allowing it to avoid losing
> wakeup events during system suspend.
>
> Generally, there are two problems in that area. First, if a wakeup event
> occurs exactly at the same time when /sys/power/state is being written to,
> the even may be delivered to user space right before the freezing of it,
> in which case the user space consumer of the event may not be able to process
> it before the system is suspended.

Indeed, the same problem arises if the event isn't delivered to
userspace until after userspace is frozen. Of course, the underlying
issue here is that the kernel has no direct way to know when userspace
has finished processing an event. Userspace would have to tell it,
which generally would mean rewriting some large number of user
programs.

> Second, if a wakeup event occurs after user
> space has been frozen and that event is not a wakeup interrupt, the kernel will
> not react to it and the system will be suspended.

I don't quite understand what you mean here. "Reacting" to an event
involves more than one action. The kernel has to tell the hardware to
stop generating the wakeup signal, and it has to handle the event
somehow.

If the kernel doesn't tell the hardware to stop generating the wakeup
signal, the signal will continue to be active until the system goes to
sleep. At that point it will cause the system to wake up immediately,
so there won't be any problem.

The real problem arises when the hardware stops generating the wakeup
signal but the kernel suspends before it finishes handling the event.
For example, an interrupt handler might receive the event and start
processing it by calling pm_request_resume() -- but if the pm workqueue
thread is already frozen then the processing won't finish until
something else wakes the system up. (IMO this is a potential bug which
could be fixed without too much effort.)

> The following patch illustrates my idea of how these two problems may be
> addressed. It introduces a new global sysfs attribute,
> /sys/power/wakeup_count, associated with a running counter of wakeup events
> and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> to increment the wakeup events counter.

In what way is this better than suspend blockers?

> /sys/power/wakeup_count may be read from or written to by user space. Reads
> will always succeed and return the current value of the wakeup events counter.
> Writes, however, will only succeed if the written number is equal to the
> current value of the wakeup events counter. If a write is successful, it will
> cause the kernel to save the current value of the wakeup events counter and
> to compare the saved number with the current value of the counter at certain
> points of the subsequent suspend (or hibernate) sequence. If the two values
> don't match, the suspend will be aborted just as though a wakeup interrupt
> happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> off.
>
> The assumption is that there's a user space power manager that will first
> read from /sys/power/wakeup_count. Then it will check all user space consumers
> of wakeup events known to it for unprocessed events.

What happens if an event arrives just before you read
/sys/power/wakeup_count, but the userspace consumer doesn't realize
there is a new unprocessed event until after the power manager checks
it? Your plan is missing a critical step: the "handoff" whereby
responsibility for handling an event passes from the kernel to
userspace.

With suspend blockers, this handoff occurs when an event queue is
emptied and its associate suspend blocker is deactivated. Or with some
kinds of events for which the Android people have not written an
explicit handoff, it occurs when a timer expires (timed suspend
blockers).

> If there are any, it will
> wait for them to be processed and repeat. In turn, if there are not any,
> it will try to write to /sys/power/wakeup_count and if the write is successful,
> it will write to /sys/power/state to start suspend, so if any wakeup events
> accur past that point, they will be noticed by the kernel and will eventually
> cause the suspend to be aborted.

This shares with the other alternatives posted recently the need for a
central power-manager process. And like in-kernel suspend blockers, it
requires changes to wakeup-capable drivers (the wakeup-events counter
has to be incremented).

One advantage of the suspend-blocker approach is that it essentially
uses a single tool to handle both kinds of races (event not fully
handled by the kernel, or event not fully handled by userspace).
Things aren't quite this simple, because of the need for a special API
to implement userspace suspend blockers, but this does avoid the need
for a power-manager process.

> In addition to the above, the patch adds a wakeup events counter to the
> power member of struct device and makes these per-device wakeup event counters
> available via sysfs, so that it's possible to check the activity of various
> wakeup event sources within the kernel.
>
> To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> PCI runtime PM wakeup-handling code.
>
> At the moment the patch only contains code changes (ie. no documentation),
> but I'm going to add comments etc. if people like the idea.
>
> Please tell me what you think.

While this isn't a bad idea, I don't see how it is superior to the
other alternatives that have been proposed.

Alan Stern

2010-06-20 21:52:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sunday, June 20, 2010, Alan Stern wrote:
> On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:
>
> > Hi,
> >
> > One of the arguments during the suspend blockers discussion was that the
> > mainline kernel didn't contain any mechanisms allowing it to avoid losing
> > wakeup events during system suspend.
> >
> > Generally, there are two problems in that area. First, if a wakeup event
> > occurs exactly at the same time when /sys/power/state is being written to,
> > the even may be delivered to user space right before the freezing of it,
> > in which case the user space consumer of the event may not be able to process
> > it before the system is suspended.
>
> Indeed, the same problem arises if the event isn't delivered to
> userspace until after userspace is frozen.

In that case the kernel should abort the suspend so that the event can be
delivered to the user space.

> Of course, the underlying issue here is that the kernel has no direct way
> to know when userspace has finished processing an event. Userspace would
> have to tell it, which generally would mean rewriting some large number of user
> programs.

I'm not sure of that. If the kernel doesn't initiate suspend, it doesn't
really need to know whether or not user space has already consumed the event.

> > Second, if a wakeup event occurs after user
> > space has been frozen and that event is not a wakeup interrupt, the kernel will
> > not react to it and the system will be suspended.
>
> I don't quite understand what you mean here. "Reacting" to an event
> involves more than one action. The kernel has to tell the hardware to
> stop generating the wakeup signal, and it has to handle the event
> somehow.

Yes. I meant that the event wouldn't cause the suspend to be aborted.

> If the kernel doesn't tell the hardware to stop generating the wakeup
> signal, the signal will continue to be active until the system goes to
> sleep. At that point it will cause the system to wake up immediately,
> so there won't be any problem.
>
> The real problem arises when the hardware stops generating the wakeup
> signal but the kernel suspends before it finishes handling the event.
> For example, an interrupt handler might receive the event and start
> processing it by calling pm_request_resume() -- but if the pm workqueue
> thread is already frozen then the processing won't finish until
> something else wakes the system up. (IMO this is a potential bug which
> could be fixed without too much effort.)

That's why I put pm_wakeup_event() into the PCI runtime wakeup code, which
doesn't run from the PM workqueue.

> > The following patch illustrates my idea of how these two problems may be
> > addressed. It introduces a new global sysfs attribute,
> > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > to increment the wakeup events counter.
>
> In what way is this better than suspend blockers?

It doesn't add any new framework and it doesn't require the users of
pm_wakeup_event() to "unblock" suspend, so it is simpler. It also doesn't add
the user space interface that caused so much opposition to appear.

> > /sys/power/wakeup_count may be read from or written to by user space. Reads
> > will always succeed and return the current value of the wakeup events counter.
> > Writes, however, will only succeed if the written number is equal to the
> > current value of the wakeup events counter. If a write is successful, it will
> > cause the kernel to save the current value of the wakeup events counter and
> > to compare the saved number with the current value of the counter at certain
> > points of the subsequent suspend (or hibernate) sequence. If the two values
> > don't match, the suspend will be aborted just as though a wakeup interrupt
> > happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> > off.
> >
> > The assumption is that there's a user space power manager that will first
> > read from /sys/power/wakeup_count. Then it will check all user space consumers
> > of wakeup events known to it for unprocessed events.
>
> What happens if an event arrives just before you read
> /sys/power/wakeup_count, but the userspace consumer doesn't realize
> there is a new unprocessed event until after the power manager checks
> it? Your plan is missing a critical step: the "handoff" whereby
> responsibility for handling an event passes from the kernel to
> userspace.

I think this is not the kernel's problem. In this approach the kernel makes it
possible for the user space to avoid the race. Whether or not the user space
will use this opportunity is a different matter.

> With suspend blockers, this handoff occurs when an event queue is
> emptied and its associate suspend blocker is deactivated. Or with some
> kinds of events for which the Android people have not written an
> explicit handoff, it occurs when a timer expires (timed suspend
> blockers).

Well, quite frankly, I don't see any difference here. In either case there is
a possibility for user space to mess up things and the kernel can't really help
that.

> > If there are any, it will
> > wait for them to be processed and repeat. In turn, if there are not any,
> > it will try to write to /sys/power/wakeup_count and if the write is successful,
> > it will write to /sys/power/state to start suspend, so if any wakeup events
> > accur past that point, they will be noticed by the kernel and will eventually
> > cause the suspend to be aborted.
>
> This shares with the other alternatives posted recently the need for a
> central power-manager process. And like in-kernel suspend blockers, it
> requires changes to wakeup-capable drivers (the wakeup-events counter
> has to be incremented).

It doesn't really require changes to drivers, but to code that knows of wakeup
events, like the PCI runtime wakeup code. Moreover, it doesn't require kernel
subsystems to know or even care when it is reasonable to allow suspend to
happen. The only thing they need to do is to call pm_wakeup_event() whenever
they see a wakeup event. I don't really think it is too much of a requirement
(and quite frnakly I can't imagine anything simpler than that).

> One advantage of the suspend-blocker approach is that it essentially
> uses a single tool to handle both kinds of races (event not fully
> handled by the kernel, or event not fully handled by userspace).
> Things aren't quite this simple, because of the need for a special API
> to implement userspace suspend blockers, but this does avoid the need
> for a power-manager process.

Yes, it does, but I have an idea about how to implement such a power manager
and I'm going to actually try it.

> > In addition to the above, the patch adds a wakeup events counter to the
> > power member of struct device and makes these per-device wakeup event counters
> > available via sysfs, so that it's possible to check the activity of various
> > wakeup event sources within the kernel.
> >
> > To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> > PCI runtime PM wakeup-handling code.
> >
> > At the moment the patch only contains code changes (ie. no documentation),
> > but I'm going to add comments etc. if people like the idea.
> >
> > Please tell me what you think.
>
> While this isn't a bad idea, I don't see how it is superior to the
> other alternatives that have been proposed.

I don't think any of the approaches that don't use suspend blockers allows
one to avoid the race between the process that writes to /sys/power/state
and a wakeup event happening at the same time. They attempt to address another
issue, which is how to prevent untrusted user space processes from keeping the
system out of idle, but that is a different story.

My patch is all about the (system-wide) suspend mechanism, regardless of
whether or not it is used for opportunistic suspending.

Rafael

2010-06-20 22:57:53

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, Jun 20, 2010 at 12:28:29PM -0400, Alan Stern wrote:
> On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:
>
> > Hi,
> >
> > One of the arguments during the suspend blockers discussion was that the
> > mainline kernel didn't contain any mechanisms allowing it to avoid losing
> > wakeup events during system suspend.
> >
> > Generally, there are two problems in that area. First, if a wakeup event
> > occurs exactly at the same time when /sys/power/state is being written to,
> > the even may be delivered to user space right before the freezing of it,
> > in which case the user space consumer of the event may not be able to process
> > it before the system is suspended.
>
> Indeed, the same problem arises if the event isn't delivered to
> userspace until after userspace is frozen. Of course, the underlying
> issue here is that the kernel has no direct way to know when userspace
> has finished processing an event. Userspace would have to tell it,
> which generally would mean rewriting some large number of user
> programs.

Suspend blockers don't solve this, and the large number of user programs
isn't a big number. /me thinks its 1 or 2 services.

> > Second, if a wakeup event occurs after user
> > space has been frozen and that event is not a wakeup interrupt, the kernel will
> > not react to it and the system will be suspended.
>
> I don't quite understand what you mean here. "Reacting" to an event
> involves more than one action. The kernel has to tell the hardware to
> stop generating the wakeup signal, and it has to handle the event
> somehow.
>
> If the kernel doesn't tell the hardware to stop generating the wakeup
> signal, the signal will continue to be active until the system goes to
> sleep. At that point it will cause the system to wake up immediately,
> so there won't be any problem.
>
> The real problem arises when the hardware stops generating the wakeup
> signal but the kernel suspends before it finishes handling the event.
> For example, an interrupt handler might receive the event and start
> processing it by calling pm_request_resume() -- but if the pm workqueue
> thread is already frozen then the processing won't finish until
> something else wakes the system up. (IMO this is a potential bug which
> could be fixed without too much effort.)
>
> > The following patch illustrates my idea of how these two problems may be
> > addressed. It introduces a new global sysfs attribute,
> > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > to increment the wakeup events counter.
>
> In what way is this better than suspend blockers?

1) I don't think suspend blockers really solve or effectively articulate
the suspend wake event race problem.
2) the partial solution suspend blocker offer for the suspend race is
tightly bound to the suspend blocker implementation and not useful in
more general contexts.

> > /sys/power/wakeup_count may be read from or written to by user space. Reads
> > will always succeed and return the current value of the wakeup events counter.
> > Writes, however, will only succeed if the written number is equal to the
> > current value of the wakeup events counter. If a write is successful, it will
> > cause the kernel to save the current value of the wakeup events counter and
> > to compare the saved number with the current value of the counter at certain
> > points of the subsequent suspend (or hibernate) sequence. If the two values
> > don't match, the suspend will be aborted just as though a wakeup interrupt
> > happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> > off.
> >
> > The assumption is that there's a user space power manager that will first
> > read from /sys/power/wakeup_count. Then it will check all user space consumers
> > of wakeup events known to it for unprocessed events.
>
> What happens if an event arrives just before you read
> /sys/power/wakeup_count, but the userspace consumer doesn't realize
> there is a new unprocessed event until after the power manager checks
> it? Your plan is missing a critical step: the "handoff" whereby
> responsibility for handling an event passes from the kernel to
> userspace.
>
> With suspend blockers, this handoff occurs when an event queue is
> emptied and its associate suspend blocker is deactivated. Or with some
> kinds of events for which the Android people have not written an
> explicit handoff, it occurs when a timer expires (timed suspend
> blockers).

The wakeup_count will need to be incremented in the same even queue the
suspend blocker handoff happens.

>
> > If there are any, it will
> > wait for them to be processed and repeat. In turn, if there are not any,
> > it will try to write to /sys/power/wakeup_count and if the write is successful,
> > it will write to /sys/power/state to start suspend, so if any wakeup events
> > accur past that point, they will be noticed by the kernel and will eventually
> > cause the suspend to be aborted.
>
> This shares with the other alternatives posted recently the need for a
> central power-manager process. And like in-kernel suspend blockers, it
> requires changes to wakeup-capable drivers (the wakeup-events counter
> has to be incremented).

true.

> One advantage of the suspend-blocker approach is that it essentially
> uses a single tool to handle both kinds of races (event not fully
> handled by the kernel, or event not fully handled by userspace).
> Things aren't quite this simple, because of the need for a special API
> to implement userspace suspend blockers, but this does avoid the need
> for a power-manager process.

I don't think suspend-blocker handles both kinds of races as well as you
seem to think. A single tool that conflates multiple kernel facilities
is not an advantage. It limits applications outside of the scope of
doing it the "android way".

Where do you get the idea that there isn't a need for a centralized PM
process in Android? Thats not true.

> > In addition to the above, the patch adds a wakeup events counter to the
> > power member of struct device and makes these per-device wakeup event counters
> > available via sysfs, so that it's possible to check the activity of various
> > wakeup event sources within the kernel.
> >
> > To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> > PCI runtime PM wakeup-handling code.
> >
> > At the moment the patch only contains code changes (ie. no documentation),
> > but I'm going to add comments etc. if people like the idea.
> >
> > Please tell me what you think.
>
> While this isn't a bad idea, I don't see how it is superior to the
> other alternatives that have been proposed.
>
I like my PM-event framework better but I think this is ok too.

--mgross


2010-06-20 23:13:25

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, Jun 20, 2010 at 02:49:14PM +0200, Rafael J. Wysocki wrote:
> On Sunday, June 20, 2010, mark gross wrote:
> > On Sun, Jun 20, 2010 at 12:05:35AM +0200, Rafael J. Wysocki wrote:
> > > Hi,
> > >
> > > One of the arguments during the suspend blockers discussion was that the
> > > mainline kernel didn't contain any mechanisms allowing it to avoid losing
> > > wakeup events during system suspend.
> > >
> > > Generally, there are two problems in that area. First, if a wakeup event
> > > occurs exactly at the same time when /sys/power/state is being written to,
> > > the even may be delivered to user space right before the freezing of it,
> > > in which case the user space consumer of the event may not be able to process
> > yes this is racy. souldn't the wakeup event handers/driver force a user
> > mode ACK before they stop failing suspend attempts?
>
> I'm not sure what you mean. There are wake-up events without any user space
> consumers, like wake-on-LAN.

I was thinking about the incoming call scenario. The 3G modem starts
ringing and the we don't want the suspend to happen again until user
mode ACKs the call (answer or ignore)

however; you are right about events without user mode consumers. I
forgot about those.

> > > it before the system is suspended. Second, if a wakeup event occurs after user
> > > space has been frozen and that event is not a wakeup interrupt, the kernel will
> > > not react to it and the system will be suspended.
> >
> > If its not a wakeup interrupt is it not fair to allow the suspend to
> > happen even if its handler's are "in flight" at suspend time?
>
> A wakeup event need not be an interrupt, or at least there are interrupts that
> have other meanings, like ACPI SCI.

true, but who desides if such a thing is a wakeup event? And how does
that polocy get generalized in a platform portable way?

(I don't think this is an issue for this current patch, its just a
nagging issue to me in general with all this PM stuff)


> > > The following patch illustrates my idea of how these two problems may be
> > > addressed. It introduces a new global sysfs attribute,
> > > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > > to increment the wakeup events counter.
> > >
> > > /sys/power/wakeup_count may be read from or written to by user space. Reads
> > > will always succeed and return the current value of the wakeup events counter.
> > > Writes, however, will only succeed if the written number is equal to the
> > > current value of the wakeup events counter. If a write is successful, it will
> > > cause the kernel to save the current value of the wakeup events counter and
> > > to compare the saved number with the current value of the counter at certain
> > > points of the subsequent suspend (or hibernate) sequence. If the two values
> > > don't match, the suspend will be aborted just as though a wakeup interrupt
> > > happened. Reading from /sys/power/wakeup_count again will turn that mechanism
> > > off.
> >
> > why would you want to turn it off?
>
> In some cases I may want to suspend/hibernate losing some wakeup events,
> like for example in the case of emergency hibernation on critical battery.

Or a user directed explicit suspend request. your right.

> > > The assumption is that there's a user space power manager that will first
> > > read from /sys/power/wakeup_count. Then it will check all user space consumers
> > > of wakeup events known to it for unprocessed events. If there are any, it will
> > > wait for them to be processed and repeat. In turn, if there are not any,
> > > it will try to write to /sys/power/wakeup_count and if the write is successful,
> > > it will write to /sys/power/state to start suspend, so if any wakeup events
> > > accur past that point, they will be noticed by the kernel and will eventually
> > > cause the suspend to be aborted.
> > >
> > > In addition to the above, the patch adds a wakeup events counter to the
> > > power member of struct device and makes these per-device wakeup event counters
> > > available via sysfs, so that it's possible to check the activity of various
> > > wakeup event sources within the kernel.
> > >
> > > To illustrate how subsystems can use pm_wakeup_event(), I added it to the
> > > PCI runtime PM wakeup-handling code.
> > >
> > > At the moment the patch only contains code changes (ie. no documentation),
> > > but I'm going to add comments etc. if people like the idea.
> > >
> > > Please tell me what you think.
> > >
> > > Rafael
> > >
> > > ---
> > > drivers/base/power/Makefile | 2 -
> > > drivers/base/power/main.c | 1
> > > drivers/base/power/power.h | 3 +
> > > drivers/base/power/sysfs.c | 9 ++++
> > > drivers/base/power/wakeup.c | 74 ++++++++++++++++++++++++++++++++++++++++
> > > drivers/pci/pci-acpi.c | 2 +
> > > drivers/pci/pcie/pme/pcie_pme.c | 2 +
> > > include/linux/pm.h | 6 +++
> > > kernel/power/hibernate.c | 14 ++++---
> > > kernel/power/main.c | 24 ++++++++++++
> > > kernel/power/power.h | 6 +++
> > > kernel/power/suspend.c | 2 -
> > > 12 files changed, 138 insertions(+), 7 deletions(-)
> > >
> > > Index: linux-2.6/kernel/power/main.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/power/main.c
> > > +++ linux-2.6/kernel/power/main.c
> > > @@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec
> > >
> > > power_attr(state);
> > >
> > > +static ssize_t wakeup_count_show(struct kobject *kobj,
> > > + struct kobj_attribute *attr,
> > > + char *buf)
> > > +{
> > > + return sprintf(buf, "%lu\n", pm_get_wakeup_count());
> > > +}
> > > +
> > > +static ssize_t wakeup_count_store(struct kobject *kobj,
> > > + struct kobj_attribute *attr,
> > > + const char *buf, size_t n)
> > > +{
> > > + unsigned long val;
> > > +
> > > + if (sscanf(buf, "%lu", &val) == 1) {
> > > + if (pm_save_wakeup_count(val))
> > > + return n;
> > > + }
> > > + return -EINVAL;
> > > +}
> > > +
> > > +power_attr(wakeup_count);
> > > +
> > > #ifdef CONFIG_PM_TRACE
> > > int pm_trace_enabled;
> > >
> > > @@ -236,6 +258,7 @@ static struct attribute * g[] = {
> > > #endif
> > > #ifdef CONFIG_PM_SLEEP
> > > &pm_async_attr.attr,
> > > + &wakeup_count_attr.attr,
> > > #ifdef CONFIG_PM_DEBUG
> > > &pm_test_attr.attr,
> > > #endif
> > > @@ -266,6 +289,7 @@ static int __init pm_init(void)
> > > int error = pm_start_workqueue();
> > > if (error)
> > > return error;
> > > + pm_wakeup_events_init();
> > > power_kobj = kobject_create_and_add("power", NULL);
> > > if (!power_kobj)
> > > return -ENOMEM;
> > > Index: linux-2.6/drivers/base/power/wakeup.c
> > > ===================================================================
> > > --- /dev/null
> > > +++ linux-2.6/drivers/base/power/wakeup.c
> > > @@ -0,0 +1,74 @@
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/pm.h>
> > > +
> > > +static unsigned long event_count;
> > > +static unsigned long saved_event_count;
> >
> > what about over flow issues?
>
> There's no issue AFAICS. It only matters if the value is different from the
> previous one after incrementation.
>
> > > +static bool events_check_enabled;
> > > +static spinlock_t events_lock;
> > > +
> > > +void pm_wakeup_events_init(void)
> > > +{
> > > + spin_lock_init(&events_lock);
> > > +}
> > > +
> > > +void pm_wakeup_event(struct device *dev)
> > > +{
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&events_lock, flags);
> > > + event_count++;
> > should event_count be an atomic type so we can not bother with taking
> > the evnets_lock?
>
> The lock is there, because two counters are incremented at the same time.
> Also, in some other places the counter is accessed along with the enable
> flag, so there are two operations that need to be done in the same critical
> section.

ok, I was thinking there is only one counter that needs locking, and one
global value writen too from user mode that didn't need locking.

>
> > > + if (dev)
> > > + dev->power.wakeup_count++;
> > > + spin_unlock_irqrestore(&events_lock, flags);
> > > +}
> > > +
> > > +bool pm_check_wakeup_events(bool enable)
> > > +{
> > > + unsigned long flags;
> > > + bool ret;
> > > +
> > > + spin_lock_irqsave(&events_lock, flags);
> > > + ret = !events_check_enabled || (event_count == saved_event_count);
> > I'm not getting the events_check_enbled flag yet.
>
> I tells the PM core whether or not to check wakeup events during suspend.
> The checking is only enabled after a successful write to
> /sys/power/wakeup_count, it is disabled by reads and by the final check right
> before the system enters suspend. [That's because the "saved" value from
> the previous suspend should not be used for checking wakeup events during
> the next one.]
>
> > > + events_check_enabled = enable;
> > I'm not sure if this is the right thing depending on all the different
> > ways the boolians are interacting with eachother.
> >
> > Which is a red flag to me. This code is confusing.
>
> Well, I could use check_wakeup_events() and
> check_and_disable_wakeup_events() (the latter is only necessary for the final
> check), but I'm not sure if that's going to be better.

I'm not sure either what you have is growing on me anyway.


> > I'll look at it some more when I'm fresh tomorrow.
>
> Thanks for the comments.

I still need to look more at the patch.

--mgross

2010-06-21 02:23:43

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:

> > > Generally, there are two problems in that area. First, if a wakeup event
> > > occurs exactly at the same time when /sys/power/state is being written to,
> > > the even may be delivered to user space right before the freezing of it,
> > > in which case the user space consumer of the event may not be able to process
> > > it before the system is suspended.
> >
> > Indeed, the same problem arises if the event isn't delivered to
> > userspace until after userspace is frozen.
>
> In that case the kernel should abort the suspend so that the event can be
> delivered to the user space.

Yes.

> > Of course, the underlying issue here is that the kernel has no direct way
> > to know when userspace has finished processing an event. Userspace would
> > have to tell it, which generally would mean rewriting some large number of user
> > programs.
>
> I'm not sure of that. If the kernel doesn't initiate suspend, it doesn't
> really need to know whether or not user space has already consumed the event.

That's true. But it only shifts the onus: When a userspace program has
finished processing an event, it has to tell the power-manager process.
Clearly this sort of thing is unavoidable, one way or another.

> > > The following patch illustrates my idea of how these two problems may be
> > > addressed. It introduces a new global sysfs attribute,
> > > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > > to increment the wakeup events counter.
> >
> > In what way is this better than suspend blockers?
>
> It doesn't add any new framework and it doesn't require the users of
> pm_wakeup_event() to "unblock" suspend, so it is simpler. It also doesn't add
> the user space interface that caused so much opposition to appear.

Okay. A quick comparison shows that in your proposal:

There's no need to register and unregister suspend blockers.
But instead you create the equivalent of a suspend blocker
inside every struct device.

Drivers (or subsystems) don't have to activate suspend
blockers. But instead they have to call pm_wakeup_event().

Drivers don't have to deactivate suspend blockers. You don't
have anything equivalent, and as a result your scheme is
subject to the race described below.

There are no userspace suspend blockers and no opportunistic
suspend. Instead a power-manager process takes care of
initiating or preventing suspends as needed.

In short, you have eliminated the userspace part of the suspend blocker
approach just as in some of the proposals posted earlier, and you have
replaced the in-kernel suspend blockers with new data in struct device
and a new PM API. On the whole, it doesn't seem very different from
the in-kernel part of suspend blockers. The most notable difference is
the name: pm_wake_event() vs. suspend_blocker_activate(), or whatever
it ended up being called.

This is the race I was talking about:

> > What happens if an event arrives just before you read
> > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > there is a new unprocessed event until after the power manager checks
> > it?

> I think this is not the kernel's problem. In this approach the kernel makes it
> possible for the user space to avoid the race. Whether or not the user space
> will use this opportunity is a different matter.

It is _not_ possible for userspace to avoid this race. Help from the
kernel is needed.

> > Your plan is missing a critical step: the "handoff" whereby
> > responsibility for handling an event passes from the kernel to
> > userspace.

> > With suspend blockers, this handoff occurs when an event queue is
> > emptied and its associate suspend blocker is deactivated. Or with some
> > kinds of events for which the Android people have not written an
> > explicit handoff, it occurs when a timer expires (timed suspend
> > blockers).
>
> Well, quite frankly, I don't see any difference here. In either case there is
> a possibility for user space to mess up things and the kernel can't really help
> that.

With suspend blockers, there is also the possibility for userspace to
handle races correctly. But with your scheme there isn't -- that's the
difference.

> > This shares with the other alternatives posted recently the need for a
> > central power-manager process. And like in-kernel suspend blockers, it
> > requires changes to wakeup-capable drivers (the wakeup-events counter
> > has to be incremented).
>
> It doesn't really require changes to drivers, but to code that knows of wakeup
> events, like the PCI runtime wakeup code.

Just like in-kernel suspend blockers.

> Moreover, it doesn't require kernel
> subsystems to know or even care when it is reasonable to allow suspend to
> happen. The only thing they need to do is to call pm_wakeup_event() whenever
> they see a wakeup event.

That's just semantics. Obviously a wakeup event should prevent suspend
from happening, so if subsystems know or care about one then they know
or care about the other.

> I don't really think it is too much of a requirement
> (and quite frnakly I can't imagine anything simpler than that).

This is because you have omitted the part about allowing suspends again
(or if you prefer, about notifying the PM core that a wakeup event has
been handed off to userspace). As a result of leaving this out, you
haven't eliminated all the races.

> Yes, it does, but I have an idea about how to implement such a power manager
> and I'm going to actually try it.

A logical design would be to use dbus for disseminating PM-related
information. Does your idea work that way?

> I don't think any of the approaches that don't use suspend blockers allows
> one to avoid the race between the process that writes to /sys/power/state
> and a wakeup event happening at the same time. They attempt to address another
> issue, which is how to prevent untrusted user space processes from keeping the
> system out of idle, but that is a different story.

Well, there was one approach that didn't use suspend blockers and did
solve the race: the original wakelocks proposal. Of course, that was
just suspend blockers under a different name. One could make a very
good case that your scheme is also suspend blockers under a different
name (and with an important part missing).

Alan Stern

2010-06-21 02:33:07

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, mark gross wrote:

> > Indeed, the same problem arises if the event isn't delivered to
> > userspace until after userspace is frozen. Of course, the underlying
> > issue here is that the kernel has no direct way to know when userspace
> > has finished processing an event. Userspace would have to tell it,
> > which generally would mean rewriting some large number of user
> > programs.
>
> Suspend blockers don't solve this, and the large number of user programs
> isn't a big number. /me thinks its 1 or 2 services.

On Android, perhaps. What about on other systems?

> > In what way is this better than suspend blockers?
>
> 1) I don't think suspend blockers really solve or effectively articulate
> the suspend wake event race problem.

Why not?

> 2) the partial solution suspend blocker offer for the suspend race is
> tightly bound to the suspend blocker implementation and not useful in
> more general contexts.

I don't understand. Can you explain further?

> > One advantage of the suspend-blocker approach is that it essentially
> > uses a single tool to handle both kinds of races (event not fully
> > handled by the kernel, or event not fully handled by userspace).
> > Things aren't quite this simple, because of the need for a special API
> > to implement userspace suspend blockers, but this does avoid the need
> > for a power-manager process.
>
> I don't think suspend-blocker handles both kinds of races as well as you
> seem to think.

Can you give any counterexamples?

> A single tool that conflates multiple kernel facilities
> is not an advantage. It limits applications outside of the scope of
> doing it the "android way".

I don't see that necessarily as a drawback. You might just as well say
that the entire Linux kernel limits applications to doing things the
"Unix" way. Often have fewer choices is an advantage.

> Where do you get the idea that there isn't a need for a centralized PM
> process in Android? Thats not true.

I didn't get that idea from anywhere. Sorry if I gave the wrong
impression -- I meant that non-Android systems would need to implement
a centralized power-manager process, along with all the necessary
changes to other programs.

(Incidentally, even on Android the centralized PM process does not
handle _all_ of the userspace/kernel wakelock interactions.)

Alan Stern

2010-06-21 04:11:01

by David Brownell

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Avoid losing wakeup events during suspend


> > > Indeed, the same problem arises if the event
> isn't delivered to
> > > userspace until after userspace is frozen.

Can we put this more directly: the problem is
that the *SYSTEM ISN'T FULLY SUSPENDED* when the
hardware wake event triggers? (Where "*SYSTEM*
includes userspace not just kernel. In fact the
overall system is built from many subsystems,
some in the kernel and some in userspace.

At the risk of being prematurely general: I'd
point out that these subsystems probably have
sequencing requirements. kernel-then-user is
a degenerate case, and surely oversimplified.
There are other examples, e.g. between kernel
subsystems... Like needing to suspend a PMIC
before the bus it uses, where that bus uses
a task to manage request/response protocols.
(Think I2C or SPI.)

This is like the __init/__exit sequencing mess...

In terms of userspace event delivery, I'd say
it's a bug in the event mechanism if taking the
next step in suspension drops any event. It
should be queued, not lost... As a rule the
hardware queuing works (transparently)...

> Of course, the underlying
> > > issue here is that the kernel has no direct way
> to know when userspace
> > > has finished processing an event.


Again said more directly: there's no current
mechanism to coordinate subsystems. Userspace
can't communicate "I'm ready" to kernel, and
vice versa. (a few decades ago, APM could do
that ... we dropped such mechanisms though, and
I'm fairly sure APM's implementation was holey.)



2010-06-21 05:33:16

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010 22:23:38 -0400 (EDT)
Alan Stern <[email protected]> wrote:

> On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:

> > > In what way is this better than suspend blockers?
> >
> > It doesn't add any new framework and it doesn't require the users of
> > pm_wakeup_event() to "unblock" suspend, so it is simpler. It also doesn't add
> > the user space interface that caused so much opposition to appear.
>
> Okay. A quick comparison shows that in your proposal:
>
> There's no need to register and unregister suspend blockers.
> But instead you create the equivalent of a suspend blocker
> inside every struct device.
>
> Drivers (or subsystems) don't have to activate suspend
> blockers. But instead they have to call pm_wakeup_event().
>
> Drivers don't have to deactivate suspend blockers. You don't
> have anything equivalent, and as a result your scheme is
> subject to the race described below.
>
> There are no userspace suspend blockers and no opportunistic
> suspend. Instead a power-manager process takes care of
> initiating or preventing suspends as needed.
>
> In short, you have eliminated the userspace part of the suspend blocker
> approach just as in some of the proposals posted earlier, and you have
> replaced the in-kernel suspend blockers with new data in struct device
> and a new PM API. On the whole, it doesn't seem very different from
> the in-kernel part of suspend blockers. The most notable difference is
> the name: pm_wake_event() vs. suspend_blocker_activate(), or whatever
> it ended up being called.
>
> This is the race I was talking about:
>
> > > What happens if an event arrives just before you read
> > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > there is a new unprocessed event until after the power manager checks
> > > it?
>
> > I think this is not the kernel's problem. In this approach the kernel makes it
> > possible for the user space to avoid the race. Whether or not the user space
> > will use this opportunity is a different matter.
>
> It is _not_ possible for userspace to avoid this race. Help from the
> kernel is needed.

It is possible if every (relevant) userspace program implements a
callback for the powermanager to check if one of it's wakeup-sources
got activated.

That way the powermanager would read /sys/power/wakeup_count, then do
the roundtrip to all it's registered users and only then suspend.

This turns the suspend_blockers concept around. Instead of actively
signaling the suspend_blockers, the userspace programs only answer
"yes/no" when asked. (i.e. polling?)

You _can not_ implement userspace suspend blockers with this approach,
as it is vital for every userspace program to get scheduled and check
it's wakeup-source (if even possible) before you know that the right
parties have won the race.


Cheers,
Flo

2010-06-21 05:55:08

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, Jun 20, 2010 at 10:33:01PM -0400, Alan Stern wrote:
> On Sun, 20 Jun 2010, mark gross wrote:
>
> > > Indeed, the same problem arises if the event isn't delivered to
> > > userspace until after userspace is frozen. Of course, the underlying
> > > issue here is that the kernel has no direct way to know when userspace
> > > has finished processing an event. Userspace would have to tell it,
> > > which generally would mean rewriting some large number of user
> > > programs.
> >
> > Suspend blockers don't solve this, and the large number of user programs
> > isn't a big number. /me thinks its 1 or 2 services.
>
> On Android, perhaps. What about on other systems?

This is just speculation but, I would expect other systems would have
the graphics, input, telephony, and any PM services wake_count / wake
event ware.

I can't imagine (at the moment) anything else that would care.

>
> > > In what way is this better than suspend blockers?
> >
> > 1) I don't think suspend blockers really solve or effectively articulate
> > the suspend wake event race problem.
>
> Why not?

For starters the suspend block patch is about suspend blocking, at least
at first before competing ideas starting coming out and this race issue
was offered up as an excuse to not consider the alternative ideas, then
suddenly suspend blocking became a wake event notification mechanism
implementation that was baked into the implementation. The arguments
are still a blur and irritating to recall / look up again.

But, the point I'm trying to make is that wake event serialization /
event handling suddenly became a big FUD-pie tightly bound to a specific
feature for suspend blocking (or was is auto suspending? Its a magical
solution to all the PM problems in the kernel. I'm being sarcastic.)

Its much better to call out the issue and provide a clean targeted
solution to it without binding it to some other solution to a different
problem.

>
> > 2) the partial solution suspend blocker offer for the suspend race is
> > tightly bound to the suspend blocker implementation and not useful in
> > more general contexts.
>
> I don't understand. Can you explain further?
>
> > > One advantage of the suspend-blocker approach is that it essentially
> > > uses a single tool to handle both kinds of races (event not fully
> > > handled by the kernel, or event not fully handled by userspace).
> > > Things aren't quite this simple, because of the need for a special API
> > > to implement userspace suspend blockers, but this does avoid the need
> > > for a power-manager process.
> >
> > I don't think suspend-blocker handles both kinds of races as well as you
> > seem to think.
>
> Can you give any counterexamples?

I knew you'd ask such a thing. Do you have any correctness proofs of it
working right?

Lets consider the RIL incoming call race:
1) user mode initiates an "opportunistic suspend" or whatever we call
this these days by writing to some sys interface.
2) a call comes in (multi-threaded cpu) just after the write.
3) the caif driver grabs a blocker, stopping the in flight suspend in the
nick of time. But, when should it release it without racing? How does
one implement that kernel code such that its portable to non-android user
mode stacks?

>
> > A single tool that conflates multiple kernel facilities
> > is not an advantage. It limits applications outside of the scope of
> > doing it the "android way".
>
> I don't see that necessarily as a drawback. You might just as well say
> that the entire Linux kernel limits applications to doing things the
> "Unix" way. Often have fewer choices is an advantage.

I disagree with your analogy. One problem one solution with minimal
coupling to other implementation details is nice. Two problems with one
solution dependent on the system wide architecture bound to a user mode
PM design is fragile and not portable.

> > Where do you get the idea that there isn't a need for a centralized PM
> > process in Android? Thats not true.
>
> I didn't get that idea from anywhere. Sorry if I gave the wrong
> impression -- I meant that non-Android systems would need to implement
> a centralized power-manager process, along with all the necessary
> changes to other programs.
>
> (Incidentally, even on Android the centralized PM process does not
> handle _all_ of the userspace/kernel wakelock interactions.)
>

Yeah, I've been told that too. I've grepped for where the wake_unlock
sysfs interfaces are hit from user mode android (eclair) and I would
like some help in identifying those guys. Maybe its in the FroYo code I
don't get to look at yet?

There is libhardware_legacy/power/power.c and an out of tree kernel
broadcom bcm4329 driver under hardware/broadcom but that doesn't count
as it looks like a kernel driver to me.

--mgross

2010-06-21 06:02:43

by David Brownell

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Avoid losing wakeup events during suspend



--- On Sun, 6/20/10, David Brownell <[email protected]> wrote:

... in a sort of "aren't we asking the
wrong questions??" manner ...


I suspect that
looking at the problem in terms of how to
coordinate subsystems (an abstraction which
is at best very ad-hoc today!) we would
end up with a cleaner model, which doesn't
bother so many folk the ay wakelocks or
even suspend blockers seem to bother them...


> From: David Brownell <[email protected]>
> Subject: Re: [linux-pm] [RFC][PATCH] PM: Avoid losing wakeup events during suspend
> To: [email protected], "Alan Stern" <[email protected]>
> Cc: "Neil Brown" <[email protected]>, [email protected], "Dmitry Torokhov" <[email protected]>, "Linux Kernel Mailing List" <[email protected]>, "mark gross" <[email protected]>
> Date: Sunday, June 20, 2010, 9:04 PM
>
> > > > Indeed, the same problem arises if the
> event
> > isn't delivered to
> > > > userspace until after userspace is frozen.
>
> Can we put this more directly:? the problem is
> that the *SYSTEM ISN'T FULLY SUSPENDED* when the
> hardware wake event triggers?? (Where "*SYSTEM*
> includes userspace not just kernel.? In fact the
> overall system is built from many subsystems,
> some in the kernel and some in userspace.
>
> At the risk of being prematurely general:? I'd
> point out that these subsystems probably have
> sequencing requirements.? kernel-then-user is
> a degenerate case, and surely oversimplified.
> There are other examples, e.g. between kernel
> subsystems...? Like needing to suspend a PMIC
> before the bus it uses, where that bus uses
> a task to manage request/response protocols.
> (Think I2C or SPI.)
>
> This is like the __init/__exit sequencing mess...
>
> In terms of userspace event delivery, I'd say
> it's a bug in the event mechanism if taking the
> next step in suspension drops any event.? It
> should be queued, not lost...? As a rule the
> hardware queuing works (transparently)...
>
> > Of course, the underlying
> > > > issue here is that the kernel has no direct
> way
> > to know when userspace
> > > > has finished processing an event.
>
>
> Again said more directly:? there's no current
> mechanism to coordinate subsystems.? Userspace
> can't communicate "I'm ready" to kernel, and
> vice versa.? (a few decades ago, APM could do
> that ... we dropped such mechanisms though, and
> I'm fairly sure APM's implementation was holey.)
>
>
>
>
> _______________________________________________
> linux-pm mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>

2010-06-21 06:13:28

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, Jun 20, 2010 at 10:23:38PM -0400, Alan Stern wrote:
> On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > Generally, there are two problems in that area. First, if a wakeup event
> > > > occurs exactly at the same time when /sys/power/state is being written to,
> > > > the even may be delivered to user space right before the freezing of it,
> > > > in which case the user space consumer of the event may not be able to process
> > > > it before the system is suspended.
> > >
> > > Indeed, the same problem arises if the event isn't delivered to
> > > userspace until after userspace is frozen.
> >
> > In that case the kernel should abort the suspend so that the event can be
> > delivered to the user space.
>
> Yes.
>
> > > Of course, the underlying issue here is that the kernel has no direct way
> > > to know when userspace has finished processing an event. Userspace would
> > > have to tell it, which generally would mean rewriting some large number of user
> > > programs.
> >
> > I'm not sure of that. If the kernel doesn't initiate suspend, it doesn't
> > really need to know whether or not user space has already consumed the event.
>
> That's true. But it only shifts the onus: When a userspace program has
> finished processing an event, it has to tell the power-manager process.
> Clearly this sort of thing is unavoidable, one way or another.
>
> > > > The following patch illustrates my idea of how these two problems may be
> > > > addressed. It introduces a new global sysfs attribute,
> > > > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > > > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > > > to increment the wakeup events counter.
> > >
> > > In what way is this better than suspend blockers?
> >
> > It doesn't add any new framework and it doesn't require the users of
> > pm_wakeup_event() to "unblock" suspend, so it is simpler. It also doesn't add
> > the user space interface that caused so much opposition to appear.
>
> Okay. A quick comparison shows that in your proposal:
>
> There's no need to register and unregister suspend blockers.
> But instead you create the equivalent of a suspend blocker
> inside every struct device.
>
> Drivers (or subsystems) don't have to activate suspend
> blockers. But instead they have to call pm_wakeup_event().
>
> Drivers don't have to deactivate suspend blockers. You don't
> have anything equivalent, and as a result your scheme is
> subject to the race described below.
>
> There are no userspace suspend blockers and no opportunistic
> suspend. Instead a power-manager process takes care of
> initiating or preventing suspends as needed.
>
> In short, you have eliminated the userspace part of the suspend blocker
> approach just as in some of the proposals posted earlier, and you have
> replaced the in-kernel suspend blockers with new data in struct device
> and a new PM API. On the whole, it doesn't seem very different from
> the in-kernel part of suspend blockers. The most notable difference is
> the name: pm_wake_event() vs. suspend_blocker_activate(), or whatever
> it ended up being called.
>
> This is the race I was talking about:

Your confused about what problem this patch attempts to solve. There is
a pm_qos patch in the works to address the suspend blocker
functionality.
http://lists.linux-foundation.org/pipermail/linux-pm/2010-June/026760.html

--mgross



>
> > > What happens if an event arrives just before you read
> > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > there is a new unprocessed event until after the power manager checks
> > > it?
>
> > I think this is not the kernel's problem. In this approach the kernel makes it
> > possible for the user space to avoid the race. Whether or not the user space
> > will use this opportunity is a different matter.
>
> It is _not_ possible for userspace to avoid this race. Help from the
> kernel is needed.
>
> > > Your plan is missing a critical step: the "handoff" whereby
> > > responsibility for handling an event passes from the kernel to
> > > userspace.
>
> > > With suspend blockers, this handoff occurs when an event queue is
> > > emptied and its associate suspend blocker is deactivated. Or with some
> > > kinds of events for which the Android people have not written an
> > > explicit handoff, it occurs when a timer expires (timed suspend
> > > blockers).
> >
> > Well, quite frankly, I don't see any difference here. In either case there is
> > a possibility for user space to mess up things and the kernel can't really help
> > that.
>
> With suspend blockers, there is also the possibility for userspace to
> handle races correctly. But with your scheme there isn't -- that's the
> difference.
>
> > > This shares with the other alternatives posted recently the need for a
> > > central power-manager process. And like in-kernel suspend blockers, it
> > > requires changes to wakeup-capable drivers (the wakeup-events counter
> > > has to be incremented).
> >
> > It doesn't really require changes to drivers, but to code that knows of wakeup
> > events, like the PCI runtime wakeup code.
>
> Just like in-kernel suspend blockers.
>
> > Moreover, it doesn't require kernel
> > subsystems to know or even care when it is reasonable to allow suspend to
> > happen. The only thing they need to do is to call pm_wakeup_event() whenever
> > they see a wakeup event.
>
> That's just semantics. Obviously a wakeup event should prevent suspend
> from happening, so if subsystems know or care about one then they know
> or care about the other.
>
> > I don't really think it is too much of a requirement
> > (and quite frnakly I can't imagine anything simpler than that).
>
> This is because you have omitted the part about allowing suspends again
> (or if you prefer, about notifying the PM core that a wakeup event has
> been handed off to userspace). As a result of leaving this out, you
> haven't eliminated all the races.
>
> > Yes, it does, but I have an idea about how to implement such a power manager
> > and I'm going to actually try it.
>
> A logical design would be to use dbus for disseminating PM-related
> information. Does your idea work that way?
>
> > I don't think any of the approaches that don't use suspend blockers allows
> > one to avoid the race between the process that writes to /sys/power/state
> > and a wakeup event happening at the same time. They attempt to address another
> > issue, which is how to prevent untrusted user space processes from keeping the
> > system out of idle, but that is a different story.
>
> Well, there was one approach that didn't use suspend blockers and did
> solve the race: the original wakelocks proposal. Of course, that was
> just suspend blockers under a different name. One could make a very
> good case that your scheme is also suspend blockers under a different
> name (and with an important part missing).
>
> Alan Stern
>

2010-06-21 12:11:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

So where are we at this point?

Discussion had completely died down for a while, and it's picked up
again, but it's not clear to me that we're any closer to reaching
consensus.

There's been one proposal that we simply merge in a set of no-op
inline functions for suspend blockers, just so we can get let the
drivers go in (assuming that Greg K-H believes this is still a
problem), but with an automatical removal of N months (N to be
decided, say 9 or 12 or 18 months).

My concern is that if we do that, we will have simply kicked the ball
down the road for N months. Another approach is to simply merge in
no-op functions and not leave any kind of deprecation schedule.
That's sort of an implicit admission of the fact that we may not reach
consensus on this issue. Or we could simply ship the patches as-is to
Linus after he gets back from vacation and ask him for a thumbs up or
thumbs down vote, which might settle things once and for all.

How do we go forward from here?

- Ted

2010-06-21 12:20:14

by Alan

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

> There's been one proposal that we simply merge in a set of no-op
> inline functions for suspend blockers, just so we can get let the
> drivers go in (assuming that Greg K-H believes this is still a
> problem), but with an automatical removal of N months (N to be
> decided, say 9 or 12 or 18 months).

If you automatically remove the suspend blocker wrappers in 12 months
then you can keep the drivers in tree.

The drivers are generally not a problem (ok some of them like the binder
are interesting little trips) its just those hooks, and we'd all benefit
somewhat even if the only bit google are patching back in are their hooks.

> down the road for N months. Another approach is to simply merge in
> no-op functions and not leave any kind of deprecation schedule.
> That's sort of an implicit admission of the fact that we may not reach
> consensus on this issue. Or we could simply ship the patches as-is to

Very bad precedent. What happens when every other vendor does the same ?
Keep the upstream code clean.

> Linus after he gets back from vacation and ask him for a thumbs up or
> thumbs down vote, which might settle things once and for all.

You seem desperate to just throw it at Linus - you have been all along
before the discussion, during it and now: but I don't understand why ?
It's as if you don't want progress to be made by other means ?

> How do we go forward from here?

PM QoS seems to be evolving nicely.

As for merging stuff - if its new code then it should get submitted with
the hooks left out anyway, just as all the other vendors are doing with
weird little custom hooks of their own. The hooks can still easily be
patched back in as a tiny easy to maintain vendor patch.

This works - the code still gets updated and seen by people, only the
little hook patch has to be maintained and generally it looks after
itself except for the odd .rej when something changes right up by the
merge point.

Alan

2010-06-21 12:27:43

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010 13:22:34 +0100
Alan Cox <[email protected]> wrote:

> > How do we go forward from here?
>
> PM QoS seems to be evolving nicely.

Still, it doesn't solve the userspace-part.

Cheers,
Flo

2010-06-21 12:40:32

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010 22:55:22 -0700
mark gross <[email protected]> wrote:

> On Sun, Jun 20, 2010 at 10:33:01PM -0400, Alan Stern wrote:
> > On Sun, 20 Jun 2010, mark gross wrote:
> > > > In what way is this better than suspend blockers?
> > >
> > > 1) I don't think suspend blockers really solve or effectively articulate
> > > the suspend wake event race problem.
> >
> > Why not?
>
> For starters the suspend block patch is about suspend blocking, at least
> at first before competing ideas starting coming out and this race issue
> was offered up as an excuse to not consider the alternative ideas, then
> suddenly suspend blocking became a wake event notification mechanism
> implementation that was baked into the implementation. The arguments
> are still a blur and irritating to recall / look up again.

I really can't follow you here. Userspace and kernelspace actively
blocking suspend when necessary guarantees that the race is won by the
right party. I.e. the race problem is solved.

You can view it as a wake event notification if you want. But that's
just another point of view.

>
> But, the point I'm trying to make is that wake event serialization /
> event handling suddenly became a big FUD-pie tightly bound to a specific
> feature for suspend blocking (or was is auto suspending? Its a magical
> solution to all the PM problems in the kernel. I'm being sarcastic.)
>
> Its much better to call out the issue and provide a clean targeted
> solution to it without binding it to some other solution to a different
> problem.

See my other mail in response to Alan Stern. In the case of a
userspace-suspend-manager, if we just use this wake-event-counting
mechanism, we have to make shure userspace get's scheduled and notifies
the suspend-manager with an 'all clear' before it suspends the machine.

Else it can happen for the userspace-manager to suspend before
userspace could notify the userspace-manager to not suspend.

Cheers,
Flo

2010-06-21 13:43:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, Jun 21, 2010 at 01:22:34PM +0100, Alan Cox wrote:
>
> You seem desperate to just throw it at Linus - you have been all along
> before the discussion, during it and now: but I don't understand why ?

Why are you so afraid to let Linus make the call? He's the benevolent
dictator, isn't he?

> It's as if you don't want progress to be made by other means ?

It doesn't look like progress is being made here.

- Ted

2010-06-21 13:59:20

by Alan

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010 09:42:49 -0400
[email protected] wrote:

> On Mon, Jun 21, 2010 at 01:22:34PM +0100, Alan Cox wrote:
> >
> > You seem desperate to just throw it at Linus - you have been all along
> > before the discussion, during it and now: but I don't understand why ?
>
> Why are you so afraid to let Linus make the call? He's the benevolent
> dictator, isn't he?

I don't mind him laughing at it. I'm just curious why you've tried to
short circuit the more general discussion repeatedly - even right at the
start.

> It doesn't look like progress is being made here.

On suspend blockers no - on some of the real problems yes.

2010-06-21 15:06:32

by Alan Stern

[permalink] [raw]
Subject: Re: [linux-pm] [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, David Brownell wrote:

> Can we put this more directly: the problem is
> that the *SYSTEM ISN'T FULLY SUSPENDED* when the
> hardware wake event triggers? (Where "*SYSTEM*
> includes userspace not just kernel. In fact the
> overall system is built from many subsystems,
> some in the kernel and some in userspace.

Indeed, the system may not even be partially suspended when the wake
event triggers.

> At the risk of being prematurely general: I'd
> point out that these subsystems probably have
> sequencing requirements. kernel-then-user is
> a degenerate case, and surely oversimplified.
> There are other examples, e.g. between kernel
> subsystems... Like needing to suspend a PMIC
> before the bus it uses, where that bus uses
> a task to manage request/response protocols.
> (Think I2C or SPI.)
>
> This is like the __init/__exit sequencing mess...
>
> In terms of userspace event delivery, I'd say
> it's a bug in the event mechanism if taking the
> next step in suspension drops any event. It
> should be queued, not lost... As a rule the
> hardware queuing works (transparently)...

There may be a misunderstanding here... People talk about events
getting lost, but what they (usually) mean is that the event isn't
actually _dropped_ -- rather, it fails to trigger a wakeup or to
prevent a suspend. When something else causes the system to resume
later on, the event will be delivered normally.

This means that the problem is not one of sequencing. The problem is
twofold:

To recognize when a wakeup event has occurred and therefore
it is not now safe to allow the system to suspend;

And to recognize when a wakeup event has been completely
handled and therefore it is once again safe to allow the system
to suspend.

> > Of course, the underlying
> > > > issue here is that the kernel has no direct way
> > to know when userspace
> > > > has finished processing an event.
>
>
> Again said more directly: there's no current
> mechanism to coordinate subsystems. Userspace
> can't communicate "I'm ready" to kernel, and
> vice versa. (a few decades ago, APM could do
> that ... we dropped such mechanisms though, and
> I'm fairly sure APM's implementation was holey.)

Yes, that's a better way of putting it. And it's not just a matter of
"userspace communicating with the kernel", because userspace is not
monolithic. There has to be a way for one user process to communicate
this information to another (I like Florian's idea). Of course, the
kernel doesn't have to worry about those details.

If one accepts a scheme in which all the suspend initiations and
cancellations are carried out by a single process (a power-manager
process), then the difficulties of communication and coordination
between the kernel and userspace are minimized.

Alan Stern

2010-06-21 15:23:37

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, Florian Mickler wrote:

> On Sun, 20 Jun 2010 22:23:38 -0400 (EDT)
> Alan Stern <[email protected]> wrote:

> > This is the race I was talking about:
> >
> > > > What happens if an event arrives just before you read
> > > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > > there is a new unprocessed event until after the power manager checks
> > > > it?
> >
> > > I think this is not the kernel's problem. In this approach the kernel makes it
> > > possible for the user space to avoid the race. Whether or not the user space
> > > will use this opportunity is a different matter.
> >
> > It is _not_ possible for userspace to avoid this race. Help from the
> > kernel is needed.
>
> It is possible if every (relevant) userspace program implements a
> callback for the powermanager to check if one of it's wakeup-sources
> got activated.
>
> That way the powermanager would read /sys/power/wakeup_count, then do
> the roundtrip to all it's registered users and only then suspend.
>
> This turns the suspend_blockers concept around. Instead of actively
> signaling the suspend_blockers, the userspace programs only answer
> "yes/no" when asked. (i.e. polling?)

In the end you would want to have communication in both directions:
suspend blockers _and_ callbacks. Polling is bad if done too often.
But I think the idea is a good one.

In fact, you don't need a "yes/no" response. Programs merely need a
chance to activate a new suspend blocker if a wakeup source was
recently activated before they acknowledge the poll.

> You _can not_ implement userspace suspend blockers with this approach,
> as it is vital for every userspace program to get scheduled and check
> it's wakeup-source (if even possible) before you know that the right
> parties have won the race.

I'm not sure what you mean. Certainly you can take a userspace
suspend-blocker implementation of the sort discussed before (where
programs communicate their needs to a central power-manager process)
and add this callback mechanism on top.

There is still at least one loophole to be closed: Android's
timer-based wakelocks. These include cases where the Android
developers didn't add enough wakelocks to cover the entire path from
kernel-event to userspace-handler, so they punted and relied on a timer
to decide when the wakelock should be deactivated. (There may be other
cases too; I didn't follow the original discussion very closely.)
It's not clear whether these things can be handled already in Rafael's
scheme with your addition, or whether something new is needed.

Alan Stern

2010-06-21 15:57:28

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, mark gross wrote:

> > > 1) I don't think suspend blockers really solve or effectively articulate
> > > the suspend wake event race problem.
> >
> > Why not?
>
> For starters the suspend block patch is about suspend blocking, at least
> at first before competing ideas starting coming out and this race issue
> was offered up as an excuse to not consider the alternative ideas, then
> suddenly suspend blocking became a wake event notification mechanism
> implementation that was baked into the implementation. The arguments
> are still a blur and irritating to recall / look up again.

I get the feeling you didn't fully absorb the intent of the original
wakelock patches. Right from the start they were about fixing a race
in system suspend: The system may go to sleep even though a pending
wakeup event should have blocked or prevented the suspend. In
particular that means notifying the PM core about wakeup events, when
they occur, and when the system may once again be allowed to suspend.

The userspace parts of the original patches did cloud the issue,
admittedly. But the purpose of the in-kernel parts has always been
clear.

> But, the point I'm trying to make is that wake event serialization /
> event handling suddenly became a big FUD-pie tightly bound to a specific
> feature for suspend blocking (or was is auto suspending? Its a magical
> solution to all the PM problems in the kernel. I'm being sarcastic.)
>
> Its much better to call out the issue and provide a clean targeted
> solution to it without binding it to some other solution to a different
> problem.

That's exactly what the wakelock patches did: They called out the
issue of the system suspending even while there were pending wakeup
events, they provided a targeted solution, and it wasn't bound to
another solution to a different problem.

Indeed, if you go back through the (I agree, irritating) threads on
this topic, you'll see that the FUD and other issues were all
introduced by other kernel developers, mainly because they didn't like
the idea of using system suspend as a mechanism for dynamic power
management.

> > > I don't think suspend-blocker handles both kinds of races as well as you
> > > seem to think.
> >
> > Can you give any counterexamples?
>
> I knew you'd ask such a thing. Do you have any correctness proofs of it
> working right?

If you want, sure. But what you think "working right" means may not be
the same as what I think.

> Lets consider the RIL incoming call race:

"RIL"?

> 1) user mode initiates an "opportunistic suspend" or whatever we call
> this these days by writing to some sys interface.

Okay.

> 2) a call comes in (multi-threaded cpu) just after the write.

Right. Presumably we want the suspend to be delayed until the call
can be handled by a user programm, yes?

> 3) the caif driver grabs a blocker, stopping the in flight suspend in the
> nick of time. But, when should it release it without racing? How does
> one implement that kernel code such that its portable to non-android user
> mode stacks?

I don't know anything about the caif driver so I can't answer this
question in detail. Here's the general idea: Somehow the kernel has to
notify userspace that an incoming call has arrived. Whatever driver or
subsystem sends that notification should release the suspend blocker
once userspace indicates that the notification has been received (for
example, by reading all the data in an input event queue). That's true
for both Android and non-Android.

In some cases there may not be any mechanism for userspace to tell the
kernel when a notification has been received. For thoses cases, the
Android people used timer-based blockers. This is not necessarily the
best approach but they seem happy with it. Others might prefer to add
an explicit "notification received" mechanism.

Still others take the view that since suspends are initiated by
userspace anyway, it's not necessary to tell the kernel when suspend is
safe again. Just tell the user process responsible for initiating
suspends.

> > > A single tool that conflates multiple kernel facilities
> > > is not an advantage. It limits applications outside of the scope of
> > > doing it the "android way".
> >
> > I don't see that necessarily as a drawback. You might just as well say
> > that the entire Linux kernel limits applications to doing things the
> > "Unix" way. Often have fewer choices is an advantage.
>
> I disagree with your analogy. One problem one solution with minimal
> coupling to other implementation details is nice. Two problems with one
> solution dependent on the system wide architecture bound to a user mode
> PM design is fragile and not portable.

I assume the "two problems" you refer to are: telling the PM core that
the kernel has a wakeup event in progress, and telling the PM core that
userspace has a wakeup event in progress. To me these don't seem to be
vastly different problems, and having a single solution for both
doesn't seem out of line.

The fact that this is bound to a user-mode PM design was inevitable,
given the whole idea was to overcome weaknesses in the system suspend
implementation and that implementation already is driven by userspace.

> > (Incidentally, even on Android the centralized PM process does not
> > handle _all_ of the userspace/kernel wakelock interactions.)
> >
>
> Yeah, I've been told that too. I've grepped for where the wake_unlock
> sysfs interfaces are hit from user mode android (eclair) and I would
> like some help in identifying those guys. Maybe its in the FroYo code I
> don't get to look at yet?
>
> There is libhardware_legacy/power/power.c and an out of tree kernel
> broadcom bcm4329 driver under hardware/broadcom but that doesn't count
> as it looks like a kernel driver to me.

I don't know -- I have never looked at the Android userspace. No doubt
Arve can provide some examples.

Alan Stern

2010-06-21 16:01:14

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Sun, 20 Jun 2010, mark gross wrote:

> Your confused about what problem this patch attempts to solve.

I don't think so. Rafael's description was pretty clear.

> There is
> a pm_qos patch in the works to address the suspend blocker
> functionality.
> http://lists.linux-foundation.org/pipermail/linux-pm/2010-June/026760.html

No. That patch addresses something _similar_ to the suspend blocker
functionality. The fact remains, though, that pm_qos is not used
during system suspend (the /sys/power/state interface), hence changes
to pm_qos won't solve the system-suspend problems that suspend blockers
do solve.

Alan Stern

2010-06-21 16:54:48

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, Florian Mickler wrote:

> > > > What happens if an event arrives just before you read
> > > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > > there is a new unprocessed event until after the power manager checks
> > > > it?
> >
> > > I think this is not the kernel's problem. In this approach the kernel makes it
> > > possible for the user space to avoid the race. Whether or not the user space
> > > will use this opportunity is a different matter.
> >
> > It is _not_ possible for userspace to avoid this race. Help from the
> > kernel is needed.
>
> It is possible if every (relevant) userspace program implements a
> callback for the powermanager to check if one of it's wakeup-sources
> got activated.
>
> That way the powermanager would read /sys/power/wakeup_count, then do
> the roundtrip to all it's registered users and only then suspend.

After further thought, there's still a race:

A wakeup event arrives.

The kernel increments /sys/power/wakeup_count and starts
processing the event.

The power-manager process reads /sys/power/wakeup_count.

The power-manager process polls the relevant programs and
they all say no events are pending.

The power-manager process successfully writes
/sys/power/wakeup_count.

The power-manager process initiates a suspend.

... Hours later ...

The system wakes up.

The kernel finishes its internal processing of the event and
sends a notification to a user program.

The problem here is that the power-manager process can't tell when the
kernel has finished processing an event. This is true both for events
that need to propagate to userspace and for events that are handled
entirely by the kernel.

This is a reflection of the two distinct pieces of information that we
need to keep track of:

A wakeup event has arrived, so it's no longer safe to suspend.

Wakeup events are no longer pending, so it's once again
safe to suspend.

The wakeup_count interface tracks the first, but in this scheme nothing
tracks the second.

Alan Stern

2010-06-21 20:39:23

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010 11:23:33 -0400 (EDT)
Alan Stern <[email protected]> wrote:

> On Mon, 21 Jun 2010, Florian Mickler wrote:
>
> > On Sun, 20 Jun 2010 22:23:38 -0400 (EDT)
> > Alan Stern <[email protected]> wrote:
>
> > > This is the race I was talking about:
> > >
> > > > > What happens if an event arrives just before you read
> > > > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > > > there is a new unprocessed event until after the power manager checks
> > > > > it?
> > >
> > > > I think this is not the kernel's problem. In this approach the kernel makes it
> > > > possible for the user space to avoid the race. Whether or not the user space
> > > > will use this opportunity is a different matter.
> > >
> > > It is _not_ possible for userspace to avoid this race. Help from the
> > > kernel is needed.
> >
> > It is possible if every (relevant) userspace program implements a
> > callback for the powermanager to check if one of it's wakeup-sources
> > got activated.
> >
> > That way the powermanager would read /sys/power/wakeup_count, then do
> > the roundtrip to all it's registered users and only then suspend.
> >
> > This turns the suspend_blockers concept around. Instead of actively
> > signaling the suspend_blockers, the userspace programs only answer
> > "yes/no" when asked. (i.e. polling?)
>
> In the end you would want to have communication in both directions:
> suspend blockers _and_ callbacks. Polling is bad if done too often.
> But I think the idea is a good one.

Actually, I'm not so shure.

1. you have to roundtrip whereas in the suspend_blocker scheme you have
active annotations (i.e. no further action needed)

2. it may not be possible for a user to determine if a wake-event is
in-flight. you would have to somehow pass the wake-event-number with
it, so that the userspace process could ack it properly without
confusion. Or... I don't know of anything else...

1. userspace-manager (UM) reads a number (42).

2. it questions userspace program X: is it ok to suspend?

[please fill in how userspace program X determines to block
suspend]

3a. UM's roundtrip ends and it proceeds to write "42" to the
kernel [suspending]
3b. UM's roundtrip ends and it aborts suspend, because a
(userspace-)suspend-blocker got activated

I'm not shure how the userspace program could determine that there is a
wake-event in flight. Perhaps by storing the number of last wake-event.
But then you need per-wake-event-counters... :|


> In fact, you don't need a "yes/no" response. Programs merely need a
> chance to activate a new suspend blocker if a wakeup source was
> recently activated before they acknowledge the poll.

true. (incorporated alreeady above)

>
> > You _can not_ implement userspace suspend blockers with this approach,
> > as it is vital for every userspace program to get scheduled and check
> > it's wakeup-source (if even possible) before you know that the right
> > parties have won the race.
>
> I'm not sure what you mean.

Sorry, that was not understandable. What I meant was that you "_can
not_" implement the suspend-blockers scheme, where you don't need to
roundtrip through all userspace (with all it's glory).
( => you need the roundtrip here)

>
> There is still at least one loophole to be closed: Android's
> timer-based wakelocks. These include cases where the Android
> developers didn't add enough wakelocks to cover the entire path from
> kernel-event to userspace-handler, so they punted and relied on a timer
> to decide when the wakelock should be deactivated. (There may be other
> cases too; I didn't follow the original discussion very closely.)
> It's not clear whether these things can be handled already in Rafael's
> scheme with your addition, or whether something new is needed.
>
> Alan Stern

Do you have some thoughts about the wake-event-in-flight detection?

Cheers,
Flo

2010-06-21 20:41:17

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010 12:54:44 -0400 (EDT)
Alan Stern <[email protected]> wrote:

> On Mon, 21 Jun 2010, Florian Mickler wrote:
>
> > > > > What happens if an event arrives just before you read
> > > > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > > > there is a new unprocessed event until after the power manager checks
> > > > > it?
> > >
> > > > I think this is not the kernel's problem. In this approach the kernel makes it
> > > > possible for the user space to avoid the race. Whether or not the user space
> > > > will use this opportunity is a different matter.
> > >
> > > It is _not_ possible for userspace to avoid this race. Help from the
> > > kernel is needed.
> >
> > It is possible if every (relevant) userspace program implements a
> > callback for the powermanager to check if one of it's wakeup-sources
> > got activated.
> >
> > That way the powermanager would read /sys/power/wakeup_count, then do
> > the roundtrip to all it's registered users and only then suspend.
>
> After further thought, there's still a race:

Ah, yes. That's what I meant in my other mail with 'how to determine
that an wake event is in flight'.

Read this too late.

> Alan Stern

Cheers,
Flo

2010-06-21 21:19:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Monday, June 21, 2010, Alan Stern wrote:
> On Mon, 21 Jun 2010, Florian Mickler wrote:
>
> > > > > What happens if an event arrives just before you read
> > > > > /sys/power/wakeup_count, but the userspace consumer doesn't realize
> > > > > there is a new unprocessed event until after the power manager checks
> > > > > it?
> > >
> > > > I think this is not the kernel's problem. In this approach the kernel makes it
> > > > possible for the user space to avoid the race. Whether or not the user space
> > > > will use this opportunity is a different matter.
> > >
> > > It is _not_ possible for userspace to avoid this race. Help from the
> > > kernel is needed.
> >
> > It is possible if every (relevant) userspace program implements a
> > callback for the powermanager to check if one of it's wakeup-sources
> > got activated.
> >
> > That way the powermanager would read /sys/power/wakeup_count, then do
> > the roundtrip to all it's registered users and only then suspend.
>
> After further thought, there's still a race:
>
> A wakeup event arrives.
>
> The kernel increments /sys/power/wakeup_count and starts
> processing the event.
>
> The power-manager process reads /sys/power/wakeup_count.

You assume that these two steps will occur instantaneously one after the other,
while there may be (and in fact there should be) a delay between them.

I would make the power manager wait for certain time after reading
/sys/power/wakeup_count and if no wakeup events are reported to it within
that time, to write to /sys/power/wakeup_count.

The length of the time to wait would be system-dependent in general, but I'd
also allow the event consumers to influence it (like when an application knows
it will process things for 10 minutes going forward, so it tells the power
manager to wait for at least 10 minutes before attempting to suspend).

> The power-manager process polls the relevant programs and
> they all say no events are pending.
>
> The power-manager process successfully writes
> /sys/power/wakeup_count.
>
> The power-manager process initiates a suspend.
>
> ... Hours later ...
>
> The system wakes up.
>
> The kernel finishes its internal processing of the event and
> sends a notification to a user program.
>
> The problem here is that the power-manager process can't tell when the
> kernel has finished processing an event. This is true both for events
> that need to propagate to userspace and for events that are handled
> entirely by the kernel.
>
> This is a reflection of the two distinct pieces of information that we
> need to keep track of:
>
> A wakeup event has arrived, so it's no longer safe to suspend.
>
> Wakeup events are no longer pending, so it's once again
> safe to suspend.
>
> The wakeup_count interface tracks the first, but in this scheme nothing
> tracks the second.

Which I don't think is really necessary, because we'll need to use timeouts
anyway, at least for events that have no user space consumers.

Rafael

2010-06-21 21:59:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Monday, June 21, 2010, Alan Stern wrote:
> On Sun, 20 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > Generally, there are two problems in that area. First, if a wakeup event
> > > > occurs exactly at the same time when /sys/power/state is being written to,
> > > > the even may be delivered to user space right before the freezing of it,
> > > > in which case the user space consumer of the event may not be able to process
> > > > it before the system is suspended.
> > >
> > > Indeed, the same problem arises if the event isn't delivered to
> > > userspace until after userspace is frozen.
> >
> > In that case the kernel should abort the suspend so that the event can be
> > delivered to the user space.
>
> Yes.
>
> > > Of course, the underlying issue here is that the kernel has no direct way
> > > to know when userspace has finished processing an event. Userspace would
> > > have to tell it, which generally would mean rewriting some large number of user
> > > programs.
> >
> > I'm not sure of that. If the kernel doesn't initiate suspend, it doesn't
> > really need to know whether or not user space has already consumed the event.
>
> That's true. But it only shifts the onus: When a userspace program has
> finished processing an event, it has to tell the power-manager process.
> Clearly this sort of thing is unavoidable, one way or another.

That's correct, but there already are multiple ways to communicate things
between user space processes, while we would need new interfaces to
pass that information between user space and the kernel. That makes quite
some difference.

> > > > The following patch illustrates my idea of how these two problems may be
> > > > addressed. It introduces a new global sysfs attribute,
> > > > /sys/power/wakeup_count, associated with a running counter of wakeup events
> > > > and a helper function, pm_wakeup_event(), that may be used by kernel subsystems
> > > > to increment the wakeup events counter.
> > >
> > > In what way is this better than suspend blockers?
> >
> > It doesn't add any new framework and it doesn't require the users of
> > pm_wakeup_event() to "unblock" suspend, so it is simpler. It also doesn't add
> > the user space interface that caused so much opposition to appear.
>
> Okay. A quick comparison shows that in your proposal:
>
> There's no need to register and unregister suspend blockers.
> But instead you create the equivalent of a suspend blocker
> inside every struct device.

Not really. That thing is for statistics purposes only and it's not essential
from the functionality POV. But you know that. :-)

> Drivers (or subsystems) don't have to activate suspend
> blockers. But instead they have to call pm_wakeup_event().
>
> Drivers don't have to deactivate suspend blockers. You don't
> have anything equivalent, and as a result your scheme is
> subject to the race described below.

The original one without user space variant of suspend blockers is subject to
it as well, because the kernel doesn't know when the user space consumer is
going to finish processing the event.

Rafael

2010-06-21 22:18:06

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, Florian Mickler wrote:

> > In the end you would want to have communication in both directions:
> > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > But I think the idea is a good one.
>
> Actually, I'm not so shure.
>
> 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> active annotations (i.e. no further action needed)

That's why it's best to use both. The normal case is that programs
activate and deactivate blockers by sending one-way messages to the PM
process. The exceptional case is when the PM process is about to
initiate a suspend; that's when it does the round-trip polling. Since
the only purpose of the polling is to avoid a race, 90% of the time it
will succeed.

> 2. it may not be possible for a user to determine if a wake-event is
> in-flight. you would have to somehow pass the wake-event-number with
> it, so that the userspace process could ack it properly without
> confusion. Or... I don't know of anything else...
>
> 1. userspace-manager (UM) reads a number (42).
>
> 2. it questions userspace program X: is it ok to suspend?
>
> [please fill in how userspace program X determines to block
> suspend]
>
> 3a. UM's roundtrip ends and it proceeds to write "42" to the
> kernel [suspending]
> 3b. UM's roundtrip ends and it aborts suspend, because a
> (userspace-)suspend-blocker got activated
>
> I'm not shure how the userspace program could determine that there is a
> wake-event in flight. Perhaps by storing the number of last wake-event.
> But then you need per-wake-event-counters... :|

Rafael seems to think timeouts will fix this. I'm not so sure.

> Do you have some thoughts about the wake-event-in-flight detection?

Not really, except for something like the original wakelock scheme in
which the kernel tells the PM core when an event is over.

Alan Stern

2010-06-21 22:27:59

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, Rafael J. Wysocki wrote:

> > After further thought, there's still a race:
> >
> > A wakeup event arrives.
> >
> > The kernel increments /sys/power/wakeup_count and starts
> > processing the event.
> >
> > The power-manager process reads /sys/power/wakeup_count.
>
> You assume that these two steps will occur instantaneously one after the other,
> while there may be (and in fact there should be) a delay between them.

No, I'm not assuming that.

> I would make the power manager wait for certain time after reading
> /sys/power/wakeup_count and if no wakeup events are reported to it within
> that time, to write to /sys/power/wakeup_count.

Why? That's just wasted time -- time during which the system could
have been suspended.

I can understand the power manager might reason as follows: If this
wakeup event hasn't been handed over to a userspace program in the next
5 seconds, I'm going to suspend anyway on the theory that something is
wrong. But why do that when you can get exact information?

> The length of the time to wait would be system-dependent in general, but I'd
> also allow the event consumers to influence it (like when an application knows
> it will process things for 10 minutes going forward, so it tells the power
> manager to wait for at least 10 minutes before attempting to suspend).

It tells the power manager to wait by activating a userspace suspend
blocker. While a blocker is active, the power manager doesn't have to
poll /sys/power/wakeup_count or anything; it just waits for all the
suspend blockers to be deactivated. And there's no guesswork involved;
the power manager knows immediately when it's time to try suspending
again.

> > The power-manager process polls the relevant programs and
> > they all say no events are pending.
> >
> > The power-manager process successfully writes
> > /sys/power/wakeup_count.
> >
> > The power-manager process initiates a suspend.
> >
> > ... Hours later ...
> >
> > The system wakes up.
> >
> > The kernel finishes its internal processing of the event and
> > sends a notification to a user program.
> >
> > The problem here is that the power-manager process can't tell when the
> > kernel has finished processing an event. This is true both for events
> > that need to propagate to userspace and for events that are handled
> > entirely by the kernel.
> >
> > This is a reflection of the two distinct pieces of information that we
> > need to keep track of:
> >
> > A wakeup event has arrived, so it's no longer safe to suspend.
> >
> > Wakeup events are no longer pending, so it's once again
> > safe to suspend.
> >
> > The wakeup_count interface tracks the first, but in this scheme nothing
> > tracks the second.
>
> Which I don't think is really necessary, because we'll need to use timeouts
> anyway, at least for events that have no user space consumers.

You wouldn't need to use timeouts for them either if the kernel had a
way to indicate when it was finished processing events.

Alan Stern

2010-06-21 22:42:24

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Alan Stern wrote:
> On Mon, 21 Jun 2010, Florian Mickler wrote:
>
> > > In the end you would want to have communication in both directions:
> > > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > > But I think the idea is a good one.
> >
> > Actually, I'm not so shure.
> >
> > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> > active annotations (i.e. no further action needed)
>
> That's why it's best to use both. The normal case is that programs
> activate and deactivate blockers by sending one-way messages to the PM
> process. The exceptional case is when the PM process is about to
> initiate a suspend; that's when it does the round-trip polling. Since
> the only purpose of the polling is to avoid a race, 90% of the time it
> will succeed.
>
> > 2. it may not be possible for a user to determine if a wake-event is
> > in-flight. you would have to somehow pass the wake-event-number with
> > it, so that the userspace process could ack it properly without
> > confusion. Or... I don't know of anything else...
> >
> > 1. userspace-manager (UM) reads a number (42).
> >
> > 2. it questions userspace program X: is it ok to suspend?
> >
> > [please fill in how userspace program X determines to block
> > suspend]
> >
> > 3a. UM's roundtrip ends and it proceeds to write "42" to the
> > kernel [suspending]
> > 3b. UM's roundtrip ends and it aborts suspend, because a
> > (userspace-)suspend-blocker got activated
> >
> > I'm not shure how the userspace program could determine that there is a
> > wake-event in flight. Perhaps by storing the number of last wake-event.
> > But then you need per-wake-event-counters... :|
>
> Rafael seems to think timeouts will fix this. I'm not so sure.
>
> > Do you have some thoughts about the wake-event-in-flight detection?
>
> Not really, except for something like the original wakelock scheme in
> which the kernel tells the PM core when an event is over.

But the kernel doesn't really know that, so it really can't tell the PM core
anything useful. What happens with suspend blockers is that a kernel suspend
cooperates with a user space consumer of the event to get the story straight.

However, that will only work if the user space is not buggy and doesn't crash,
for example, before releasing the suspend blocker it's holding.

Apart from this, there are those events withoug user space "handoff" that
use timeouts.

Also there are events like wake-on-LAN that can be regarded as instantaneous
from the power manager's point of view, so they don't really need all of the
"suspend blockers" machinery and for them we will need to use a cooldown
timeout anyway.

And if we need to use that cooldown timeout, I don't see why not to use
timeouts for avoiding the race you're worrying about.

Rafael

2010-06-21 22:49:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Alan Stern wrote:
> > On Mon, 21 Jun 2010, Florian Mickler wrote:
> >
> > > > In the end you would want to have communication in both directions:
> > > > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > > > But I think the idea is a good one.
> > >
> > > Actually, I'm not so shure.
> > >
> > > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> > > active annotations (i.e. no further action needed)
> >
> > That's why it's best to use both. The normal case is that programs
> > activate and deactivate blockers by sending one-way messages to the PM
> > process. The exceptional case is when the PM process is about to
> > initiate a suspend; that's when it does the round-trip polling. Since
> > the only purpose of the polling is to avoid a race, 90% of the time it
> > will succeed.
> >
> > > 2. it may not be possible for a user to determine if a wake-event is
> > > in-flight. you would have to somehow pass the wake-event-number with
> > > it, so that the userspace process could ack it properly without
> > > confusion. Or... I don't know of anything else...
> > >
> > > 1. userspace-manager (UM) reads a number (42).
> > >
> > > 2. it questions userspace program X: is it ok to suspend?
> > >
> > > [please fill in how userspace program X determines to block
> > > suspend]
> > >
> > > 3a. UM's roundtrip ends and it proceeds to write "42" to the
> > > kernel [suspending]
> > > 3b. UM's roundtrip ends and it aborts suspend, because a
> > > (userspace-)suspend-blocker got activated
> > >
> > > I'm not shure how the userspace program could determine that there is a
> > > wake-event in flight. Perhaps by storing the number of last wake-event.
> > > But then you need per-wake-event-counters... :|
> >
> > Rafael seems to think timeouts will fix this. I'm not so sure.
> >
> > > Do you have some thoughts about the wake-event-in-flight detection?
> >
> > Not really, except for something like the original wakelock scheme in
> > which the kernel tells the PM core when an event is over.
>
> But the kernel doesn't really know that, so it really can't tell the PM core
> anything useful. What happens with suspend blockers is that a kernel suspend

s/suspend/subsyste/ (-ETOOLATE)

> cooperates with a user space consumer of the event to get the story straight.
>
> However, that will only work if the user space is not buggy and doesn't crash,
> for example, before releasing the suspend blocker it's holding.
>
> Apart from this, there are those events withoug user space "handoff" that
> use timeouts.
>
> Also there are events like wake-on-LAN that can be regarded as instantaneous
> from the power manager's point of view, so they don't really need all of the
> "suspend blockers" machinery and for them we will need to use a cooldown
> timeout anyway.
>
> And if we need to use that cooldown timeout, I don't see why not to use
> timeouts for avoiding the race you're worrying about.

2010-06-22 00:50:23

by Arve Hjønnevåg

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, Jun 21, 2010 at 3:40 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Tuesday, June 22, 2010, Alan Stern wrote:
>> On Mon, 21 Jun 2010, Florian Mickler wrote:
>>
>> > > In the end you would want to have communication in both directions:
>> > > suspend blockers _and_ callbacks. ?Polling is bad if done too often.
>> > > But I think the idea is a good one.
>> >
>> > Actually, I'm not so shure.
>> >
>> > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
>> > active annotations (i.e. no further action needed)
>>
>> That's why it's best to use both. ?The normal case is that programs
>> activate and deactivate blockers by sending one-way messages to the PM
>> process. ?The exceptional case is when the PM process is about to
>> initiate a suspend; that's when it does the round-trip polling. ?Since
>> the only purpose of the polling is to avoid a race, 90% of the time it
>> will succeed.
>>
>> > 2. it may not be possible for a user to determine if a wake-event is
>> > in-flight. you would have to somehow pass the wake-event-number with
>> > it, so that the userspace process could ack it properly without
>> > confusion. Or... I don't know of anything else...
>> >
>> > ? ? 1. userspace-manager (UM) reads a number (42).
>> >
>> > ? ? 2. it questions userspace program X: is it ok to suspend?
>> >
>> > ? ? [please fill in how userspace program X determines to block
>> > ? ? suspend]
>> >
>> > ? ? 3a. UM's roundtrip ends and it proceeds to write "42" to the
>> > ? ? kernel [suspending]
>> > ? ? 3b. UM's roundtrip ends and it aborts suspend, because a
>> > ? ? (userspace-)suspend-blocker got activated
>> >
>> > I'm not shure how the userspace program could determine that there is a
>> > wake-event in flight. Perhaps by storing the number of last wake-event.
>> > But then you need per-wake-event-counters... :|
>>
>> Rafael seems to think timeouts will fix this. ?I'm not so sure.
>>
>> > Do you have some thoughts about the wake-event-in-flight detection?
>>
>> Not really, except for something like the original wakelock scheme in
>> which the kernel tells the PM core when an event is over.
>
> But the kernel doesn't really know that, so it really can't tell the PM core
> anything useful. ?What happens with suspend blockers is that a kernel suspend
> cooperates with a user space consumer of the event to get the story straight.
>
> However, that will only work if the user space is not buggy and doesn't crash,
> for example, before releasing the suspend blocker it's holding.
>
> Apart from this, there are those events withoug user space "handoff" that
> use timeouts.
>
> Also there are events like wake-on-LAN that can be regarded as instantaneous
> from the power manager's point of view, so they don't really need all of the
> "suspend blockers" machinery and for them we will need to use a cooldown
> timeout anyway.
>
> And if we need to use that cooldown timeout, I don't see why not to use
> timeouts for avoiding the race you're worrying about.
>

Just because some events need to use timeouts, does not mean that all
events should use timeouts. You may even want different timeout for
different events. If I want a 5 minute timeout after a wake-on-lan
packet, I don't want the same timeout after every battery check alarm
(every 10 minutes on the Nexus One).

Also, I don't see how this patch helps with events that never make
reach user-space. Unless each driver blocks suspend by returning and
error from its suspend hook, the driver then becomes dependent on the
user-space power manager to choose a sufficiently long timeout. For
some drivers, like the gpio keypad matrix driver there, is no safe
value for the timeout. The keypad driver has to block suspend if any
keys are held down.

I won't have time to actively follow this discussion, but I don't
think this patch is a good solution.

--
Arve Hj?nnev?g

2010-06-22 01:07:28

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, Jun 21, 2010 at 08:10:58AM -0400, [email protected] wrote:
> So where are we at this point?

The patches are in James Bottomly's tree.

> Discussion had completely died down for a while, and it's picked up
> again, but it's not clear to me that we're any closer to reaching
> consensus.

I thought we (linux community and Android) where ok with the plist /
pm-qos implementation of the building blocks needed to implement the
suspend blocker feature on top of a pm-qos request class (I think the
name was "interactive") pretty much the exact same symantecs as the
suspend blocker thing, just with pm-qos kernel api's.

> There's been one proposal that we simply merge in a set of no-op
> inline functions for suspend blockers, just so we can get let the
> drivers go in (assuming that Greg K-H believes this is still a
> problem), but with an automatical removal of N months (N to be
> decided, say 9 or 12 or 18 months).

I'd rather see the re-tooling of pmqos happen.

>
> My concern is that if we do that, we will have simply kicked the ball
> down the road for N months. Another approach is to simply merge in
> no-op functions and not leave any kind of deprecation schedule.
> That's sort of an implicit admission of the fact that we may not reach
> consensus on this issue. Or we could simply ship the patches as-is to
> Linus after he gets back from vacation and ask him for a thumbs up or
> thumbs down vote, which might settle things once and for all.
>
> How do we go forward from here?
>
put the pm_qos -plist update into linux-next?

--mgross

2010-06-22 01:25:06

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, Jun 21, 2010 at 12:01:09PM -0400, Alan Stern wrote:
> On Sun, 20 Jun 2010, mark gross wrote:
>
> > Your confused about what problem this patch attempts to solve.
>
> I don't think so. Rafael's description was pretty clear.

Then how is it you don't understand the fact that Rafael's patch is to
solve the wake event notification suspend race and not block opertunistic
suspends or kernel critical sections where suspending should be disabled?

>
> > There is
> > a pm_qos patch in the works to address the suspend blocker
> > functionality.
> > http://lists.linux-foundation.org/pipermail/linux-pm/2010-June/026760.html
>
> No. That patch addresses something _similar_ to the suspend blocker
> functionality. The fact remains, though, that pm_qos is not used
> during system suspend (the /sys/power/state interface), hence changes
> to pm_qos won't solve the system-suspend problems that suspend blockers
> do solve.

You keep saying they solve something, I keep wondering what you are
talking aobut.
Lets see what problems it solves:
* implements oppertunistic suspending (this is a feature not a problem)
* enables kernel critical sections blocking suspending.
* requiers overlapping application specific critcal sections from ISR
into user mode to make implementation correct.
* exposes a user mode interface to set a critical section.
* reduces races between wake events (or suspend blocking events) but I'm
not convinced it solves them.

suspend blockers provide a way to block oppertunistic suspending, wich
I'll have you know, is a pain to get working right and the enabling from
device to device is not very portable and *that* doesn't say good things
about the scheme.

--mgross

2010-06-22 01:58:21

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, Jun 21, 2010 at 11:57:21AM -0400, Alan Stern wrote:
> On Sun, 20 Jun 2010, mark gross wrote:
>
> > > > 1) I don't think suspend blockers really solve or effectively articulate
> > > > the suspend wake event race problem.
> > >
> > > Why not?
> >
> > For starters the suspend block patch is about suspend blocking, at least
> > at first before competing ideas starting coming out and this race issue
> > was offered up as an excuse to not consider the alternative ideas, then
> > suddenly suspend blocking became a wake event notification mechanism
> > implementation that was baked into the implementation. The arguments
> > are still a blur and irritating to recall / look up again.
>
> I get the feeling you didn't fully absorb the intent of the original
> wakelock patches. Right from the start they were about fixing a race
> in system suspend: The system may go to sleep even though a pending

wrong. They are about 1) adding opportunistic suspend, and 2) adding
critical sections to block suspend.

No race issues where ever talked about until alternative solutions
where put up.

> wakeup event should have blocked or prevented the suspend. In
> particular that means notifying the PM core about wakeup events, when
> they occur, and when the system may once again be allowed to suspend.
>
> The userspace parts of the original patches did cloud the issue,
> admittedly. But the purpose of the in-kernel parts has always been
> clear.
>
> > But, the point I'm trying to make is that wake event serialization /
> > event handling suddenly became a big FUD-pie tightly bound to a specific
> > feature for suspend blocking (or was is auto suspending? Its a magical
> > solution to all the PM problems in the kernel. I'm being sarcastic.)
> >
> > Its much better to call out the issue and provide a clean targeted
> > solution to it without binding it to some other solution to a different
> > problem.
>
> That's exactly what the wakelock patches did: They called out the
> issue of the system suspending even while there were pending wakeup
> events, they provided a targeted solution, and it wasn't bound to
> another solution to a different problem.
>
> Indeed, if you go back through the (I agree, irritating) threads on
> this topic, you'll see that the FUD and other issues were all
> introduced by other kernel developers, mainly because they didn't like
> the idea of using system suspend as a mechanism for dynamic power
> management.

I am struck by how differently you are seeing things.

>
> > > > I don't think suspend-blocker handles both kinds of races as well as you
> > > > seem to think.
> > >
> > > Can you give any counterexamples?
> >
> > I knew you'd ask such a thing. Do you have any correctness proofs of it
> > working right?
>
> If you want, sure. But what you think "working right" means may not be
> the same as what I think.
>
> > Lets consider the RIL incoming call race:
>
> "RIL"?

You are not an android developer are you? RIL is the user mode Radio
Interface Layer.

> > 1) user mode initiates an "opportunistic suspend" or whatever we call
> > this these days by writing to some sys interface.
>
> Okay.
>
> > 2) a call comes in (multi-threaded cpu) just after the write.
>
> Right. Presumably we want the suspend to be delayed until the call
> can be handled by a user programm, yes?
>
> > 3) the caif driver grabs a blocker, stopping the in flight suspend in the
> > nick of time. But, when should it release it without racing? How does
> > one implement that kernel code such that its portable to non-android user
> > mode stacks?
>
> I don't know anything about the caif driver so I can't answer this
> question in detail. Here's the general idea: Somehow the kernel has to
> notify userspace that an incoming call has arrived. Whatever driver or
> subsystem sends that notification should release the suspend blocker
> once userspace indicates that the notification has been received (for
> example, by reading all the data in an input event queue). That's true
> for both Android and non-Android.

so tell me how user space indicates to the kernel object will ever know
when its ok to release the blocker?

then tell me how suspend blocker provide an elegant portable solution
to this that works for multiple user mode stacks.

>
> In some cases there may not be any mechanism for userspace to tell the
> kernel when a notification has been received. For thoses cases, the
> Android people used timer-based blockers. This is not necessarily the
> best approach but they seem happy with it. Others might prefer to add
> an explicit "notification received" mechanism.
the time-based blockers are not part of the latest patch set.

>
> Still others take the view that since suspends are initiated by
> userspace anyway, it's not necessary to tell the kernel when suspend is
> safe again. Just tell the user process responsible for initiating
> suspends.
>
> > > > A single tool that conflates multiple kernel facilities
> > > > is not an advantage. It limits applications outside of the scope of
> > > > doing it the "android way".
> > >
> > > I don't see that necessarily as a drawback. You might just as well say
> > > that the entire Linux kernel limits applications to doing things the
> > > "Unix" way. Often have fewer choices is an advantage.
> >
> > I disagree with your analogy. One problem one solution with minimal
> > coupling to other implementation details is nice. Two problems with one
> > solution dependent on the system wide architecture bound to a user mode
> > PM design is fragile and not portable.
>
> I assume the "two problems" you refer to are: telling the PM core that
> the kernel has a wakeup event in progress, and telling the PM core that
> userspace has a wakeup event in progress. To me these don't seem to be
> vastly different problems, and having a single solution for both
> doesn't seem out of line.

nope.
problem 1) opportunistic enable suspend deferral for critical sections
when suspending is "bad"
problem 2) race between entering pm_suspend call tree and wake events.

> The fact that this is bound to a user-mode PM design was inevitable,
> given the whole idea was to overcome weaknesses in the system suspend
> implementation and that implementation already is driven by userspace.

I think we can do better.

> > > (Incidentally, even on Android the centralized PM process does not
> > > handle _all_ of the userspace/kernel wakelock interactions.)
> > >
> >
> > Yeah, I've been told that too. I've grepped for where the wake_unlock
> > sysfs interfaces are hit from user mode android (eclair) and I would
> > like some help in identifying those guys. Maybe its in the FroYo code I
> > don't get to look at yet?
> >
> > There is libhardware_legacy/power/power.c and an out of tree kernel
> > broadcom bcm4329 driver under hardware/broadcom but that doesn't count
> > as it looks like a kernel driver to me.
>
> I don't know -- I have never looked at the Android userspace. No doubt
> Arve can provide some examples.
who are you? You don't know about the Android stack, and you keep
blowing about how suspend blockers solve all the PM problems? I'm
starting to think your just trolling.

--mgross

2010-06-22 02:24:54

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, mark gross wrote:

> On Mon, Jun 21, 2010 at 12:01:09PM -0400, Alan Stern wrote:
> > On Sun, 20 Jun 2010, mark gross wrote:
> >
> > > Your confused about what problem this patch attempts to solve.
> >
> > I don't think so. Rafael's description was pretty clear.
>
> Then how is it you don't understand the fact that Rafael's patch is to
> solve the wake event notification suspend race and not block opertunistic
> suspends or kernel critical sections where suspending should be disabled?

I don't know what gave you the idea that I think Rafael's patch is
meant to block kernel critical sections. I certainly don't think that.

However leaving that aside, the rest of the above is just two different
ways of saying the same thing:

Wakeup events should cause suspend to be disabled until the
events are processed.

Arrival of wakeup events races with initiation of system
suspend (whether opportunistic or not).

Therefore blocking suspends when they ought to be disabled requires us
to solve the wake event/suspend race. You can't do one without doing
the other.

> > > There is
> > > a pm_qos patch in the works to address the suspend blocker
> > > functionality.
> > > http://lists.linux-foundation.org/pipermail/linux-pm/2010-June/026760.html
> >
> > No. That patch addresses something _similar_ to the suspend blocker
> > functionality. The fact remains, though, that pm_qos is not used
> > during system suspend (the /sys/power/state interface), hence changes
> > to pm_qos won't solve the system-suspend problems that suspend blockers
> > do solve.
>
> You keep saying they solve something, I keep wondering what you are
> talking aobut.
> Lets see what problems it solves:
> * implements oppertunistic suspending (this is a feature not a problem)
> * enables kernel critical sections blocking suspending.
> * requiers overlapping application specific critcal sections from ISR
> into user mode to make implementation correct.
> * exposes a user mode interface to set a critical section.
> * reduces races between wake events (or suspend blocking events) but I'm
> not convinced it solves them.

The last item on your list is what I meant. The others are not
problems solved by suspend blockers. Well, maybe the second is, but I
don't know of any kernel critical sections that need to block suspend
other than those caused by wakeup events.

I agree with you that bundling opportunistic suspend and the user mode
interface together with suspend blockers made the situation more
difficult by mixing up the important issues.

> suspend blockers provide a way to block oppertunistic suspending, wich
> I'll have you know, is a pain to get working right and the enabling from
> device to device is not very portable and *that* doesn't say good things
> about the scheme.

The real problem with the scheme is to define which events should count
as "wakeup" events and to determine when they have been fully handled.
I can see that these might well vary from one platform to another.
(Maybe that's what you mean by saying the enabling is not very
portable.) But these issues are unavoidable; any scheme will have to
address them.

Alan Stern

2010-06-22 02:46:55

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010, mark gross wrote:

> > I get the feeling you didn't fully absorb the intent of the original
> > wakelock patches. Right from the start they were about fixing a race
> > in system suspend: The system may go to sleep even though a pending
>
> wrong. They are about 1) adding opportunistic suspend, and 2) adding
> critical sections to block suspend.
>
> No race issues where ever talked about until alternative solutions
> where put up.

That's not how I remember it. But this point isn't important enough to
be worth digging through the old emails to verify or disprove it.

> I am struck by how differently you are seeing things.

Those discussions were (and still are!) so long and annoying, it
wouldn't be surprising if _everybody_ saw them differently! :-) You
certainly must agree that a lot of differing personal viewpoints were
expressed.

> > > Lets consider the RIL incoming call race:
> >
> > "RIL"?
>
> You are not an android developer are you?

Nope.

> RIL is the user mode Radio
> Interface Layer.
>
> > > 1) user mode initiates an "opportunistic suspend" or whatever we call
> > > this these days by writing to some sys interface.
> >
> > Okay.
> >
> > > 2) a call comes in (multi-threaded cpu) just after the write.
> >
> > Right. Presumably we want the suspend to be delayed until the call
> > can be handled by a user programm, yes?
> >
> > > 3) the caif driver grabs a blocker, stopping the in flight suspend in the
> > > nick of time. But, when should it release it without racing? How does
> > > one implement that kernel code such that its portable to non-android user
> > > mode stacks?
> >
> > I don't know anything about the caif driver so I can't answer this
> > question in detail. Here's the general idea: Somehow the kernel has to
> > notify userspace that an incoming call has arrived. Whatever driver or
> > subsystem sends that notification should release the suspend blocker
> > once userspace indicates that the notification has been received (for
> > example, by reading all the data in an input event queue). That's true
> > for both Android and non-Android.
>
> so tell me how user space indicates to the kernel object will ever know
> when its ok to release the blocker?

First you tell me how the kernel indicates to userspace that an
incoming call has arrived. My answer will depend on that.

> then tell me how suspend blocker provide an elegant portable solution
> to this that works for multiple user mode stacks.

When did I -- or anyone -- ever say it was "elegant"?

As for multiple user mode stacks, I don't see the issue. If you grant
there is a mechanism whereby userspace indicates to the kernel that a
blocker may be released, then it seems obvious that this mechanism
could be used in multiple user mode stacks.

> problem 1) opportunistic enable suspend deferral for critical sections
> when suspending is "bad"
> problem 2) race between entering pm_suspend call tree and wake events.

Can you point out any examples of kernel critical sections where
suspending is "bad" that aren't started by arrival of a wakeup event?
Unless you can, I'm justified in saying that these "two" problems are
one and the same.

> > The fact that this is bound to a user-mode PM design was inevitable,
> > given the whole idea was to overcome weaknesses in the system suspend
> > implementation and that implementation already is driven by userspace.
>
> I think we can do better.

Maybe we can. I'm certainly not going to claim that suspend blockers
are the best possible solution. And I'm not really keen on seeing them
merged along with all the attendant opportunistic suspend and userspace
API stuff. But so far I haven't heard anything significantly better.

> > I don't know -- I have never looked at the Android userspace. No doubt
> > Arve can provide some examples.
> who are you?

You sound like a Vorlon. Should I ask: "What do you want?" :-)

> You don't know about the Android stack, and you keep
> blowing about how suspend blockers solve all the PM problems?

Not all of them. Just the race between wakeup events and suspend.

> I'm
> starting to think your just trolling.

You may think what you like. Perhaps there is a certain degree of
truth to the accusation -- goodness knows, I do tend to prolong
technical conversations when I think I'm right. (Ask anybody who has
corresponded with me for any length of time.) But this is probably
true of a lot of kernel developers...

Alan Stern

2010-06-22 06:18:45

by Florian Mickler

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Mon, 21 Jun 2010 18:58:37 -0700
mark gross <[email protected]> wrote:


> wrong. They are about 1) adding opportunistic suspend, and 2) adding
> critical sections to block suspend.
>
> No race issues where ever talked about until alternative solutions
> where put up.

The whole and only reason to even define the term "critical sections" is
when you need to define "a race". Or vice versa. A race is prevented by
defining critical sections and protecting these against concurrent
access.

[..snip..]

[some rant that alan is not familiar with android userspace..]

Are you suggesting that only android developers are supposed to talk
about this?

This is a pretty basic thing. It has only to do with system suspend.
(And using system suspend aggressively)

>
> --mgross
>

Cheers,
Flo

2010-06-22 09:25:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Alan Stern wrote:
> On Mon, 21 Jun 2010, mark gross wrote:
>
...
> > I'm starting to think your just trolling.
>
> You may think what you like. Perhaps there is a certain degree of
> truth to the accusation -- goodness knows, I do tend to prolong
> technical conversations when I think I'm right. (Ask anybody who has
> corresponded with me for any length of time.) But this is probably
> true of a lot of kernel developers...

I certainly don't mind discussing with you. :-)

Thanks,
Rafael

2010-06-22 09:31:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, mark gross wrote:
> On Mon, Jun 21, 2010 at 11:57:21AM -0400, Alan Stern wrote:
> > On Sun, 20 Jun 2010, mark gross wrote:
...
> > I don't know -- I have never looked at the Android userspace. No doubt
> > Arve can provide some examples.
> who are you? You don't know about the Android stack, and you keep
> blowing about how suspend blockers solve all the PM problems? I'm
> starting to think your just trolling.

And that's just an excellent way of convincing people that you're right. Guess
how it worked on me.

For your information, Alan is one of the people most actively engaged in
power management development at the kernel level and his ideas have heavily
influenced the design of the I/O runtime PM framework, among other things.

I certainly appeciate Alan's opinions very much, because he usually is
technically right.

Thanks,
Rafael

2010-06-22 10:23:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Alan Stern wrote:
> > On Mon, 21 Jun 2010, Florian Mickler wrote:
> >
> > > > In the end you would want to have communication in both directions:
> > > > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > > > But I think the idea is a good one.
> > >
> > > Actually, I'm not so shure.
> > >
> > > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> > > active annotations (i.e. no further action needed)
> >
> > That's why it's best to use both. The normal case is that programs
> > activate and deactivate blockers by sending one-way messages to the PM
> > process. The exceptional case is when the PM process is about to
> > initiate a suspend; that's when it does the round-trip polling. Since
> > the only purpose of the polling is to avoid a race, 90% of the time it
> > will succeed.
> >
> > > 2. it may not be possible for a user to determine if a wake-event is
> > > in-flight. you would have to somehow pass the wake-event-number with
> > > it, so that the userspace process could ack it properly without
> > > confusion. Or... I don't know of anything else...
> > >
> > > 1. userspace-manager (UM) reads a number (42).
> > >
> > > 2. it questions userspace program X: is it ok to suspend?
> > >
> > > [please fill in how userspace program X determines to block
> > > suspend]
> > >
> > > 3a. UM's roundtrip ends and it proceeds to write "42" to the
> > > kernel [suspending]
> > > 3b. UM's roundtrip ends and it aborts suspend, because a
> > > (userspace-)suspend-blocker got activated
> > >
> > > I'm not shure how the userspace program could determine that there is a
> > > wake-event in flight. Perhaps by storing the number of last wake-event.
> > > But then you need per-wake-event-counters... :|
> >
> > Rafael seems to think timeouts will fix this. I'm not so sure.
> >
> > > Do you have some thoughts about the wake-event-in-flight detection?
> >
> > Not really, except for something like the original wakelock scheme in
> > which the kernel tells the PM core when an event is over.
>
> But the kernel doesn't really know that, so it really can't tell the PM core
> anything useful. What happens with suspend blockers is that a kernel subsystem
> cooperates with a user space consumer of the event to get the story straight.
>
> However, that will only work if the user space is not buggy and doesn't crash,
> for example, before releasing the suspend blocker it's holding.

Having reconsidered that I think there's more to it.

Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
This is the place where wakeup events are started, but it has no idea about
how they are going to be handled. Thus in the suspend blocker scheme it would
need to activate a blocker, but it wouldn't be able to release it. So, it
seems, we would need to associate a suspend blocker with each PCIe device
that can generate wakeup events and require all drivers of such devices to
deal with a blocker activated by someone else (PCIe PME driver in this
particular case). That sounds cumbersome to say the least.

Moreover, even if we do that, it still doesn't solve the entire problem,
because the event may need to be delivered to user space and processed by it.
While a driver can check if user space has already read the event, it has
no way to detect whether or not it has finished processing it. In fact,
processing an event may involve an interaction with a (human) user and there's
no general way by which software can figure out when the user considers the
event as processed.

It looks like user space suspend blockers only help in some special cases
when the user space processing of a wakeup event is simple enough, but I don't
think they help in general. For an extreme example, a user may want to wake up
a system using wake-on-LAN to log into it, do some work and log out, so
effectively the initial wakeup event has not been processed entirely until the
user finally logs out of the system. Now, after the system wakeup (resulting
from the wake-on-LAN signal) we need to give the user some time to log in, but
if the user doesn't do that in certain time, it may be reasonable to suspend
and let the user wake up the system again.

Similar situation takes place when the system is woken up by a lid switch.
Evidently, the user has opened the laptop lid to do something, but we don't
even know what the user is going to do, so there's no way we can say when
the wakeup event is finally processed.

So, even if we can say when the kernel has finished processing the event
(although that would be complicated in the PCIe case above), I don't think
it's generally possible to ensure that the entire processing of a wakeup event
has been completed. This leads to the question whether or not it is worth
trying to detect the ending of the processing of a wakeup event.

Now, going back to the $subject patch, I didn't really think it would be
suitable for opportunistic suspend, so let's focus on the "forced" suspend
instead. It still has the problem that wakeup events occuring while
/sys/power/state is written to (or even slightly before) should cause the
system to cancel the suspend, but they generally won't. With the patch
applied that can be avoided by (a) reading from /sys/power/wakeup_count,
(b) waiting for certain time (such that if a suspend event is not entirely
processed within that time, it's worth suspending and waking up the
system again) and (c) writing to /sys/power/wakeup_count right before writing
to /sys/power/state (where the latter is only done if the former succeeds).

Rafael

2010-06-22 14:35:41

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> Having reconsidered that I think there's more to it.
>
> Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> This is the place where wakeup events are started, but it has no idea about
> how they are going to be handled. Thus in the suspend blocker scheme it would
> need to activate a blocker, but it wouldn't be able to release it. So, it
> seems, we would need to associate a suspend blocker with each PCIe device
> that can generate wakeup events and require all drivers of such devices to
> deal with a blocker activated by someone else (PCIe PME driver in this
> particular case). That sounds cumbersome to say the least.

Maybe this means pcie_pme_handle_request() isn't a good place to note
the arrival of a wakeup event. Doing it in the individual driver might
work out better.

Or look at this from a different point of view. Adopting Mark's
terminology, let's say that the kernel is in a "critical section" at
times when we don't want the system to suspend. Currently there's no
way to tell the PM core when the kernel enters or leaves a critical
section, so there's no way to prevent the system from suspending at the
wrong time.

Most wakeup events indicate the start of a critical section, in the
sense that you hardly ever say: "I want the computer to wake up when I
press this button, but I don't care what it does afterward -- it can go
right back to sleep without doing anything if it wants." Much more
common is that a wakeup event requires a certain amount of processing,
and you don't want the system to go back to sleep until the processing
is over. Of course, if the processing is simple enough that it can all
be done in an interrupt handler or a resume method, then nothing
extra is needed since obviously the system won't suspend while an
interrupt handler or a resume method is running. But for more
complicated cases, we need to do more.

The problem in your example is that pcie_pme_handle_request() has no
idea about the nature or extent of the critical section to follow.
Therefore it's not in a good position to mark the beginning of the
critical section, even though it is in an excellent position to mark
the receipt of a wakeup event.

> Moreover, even if we do that, it still doesn't solve the entire problem,
> because the event may need to be delivered to user space and processed by it.
> While a driver can check if user space has already read the event, it has
> no way to detect whether or not it has finished processing it. In fact,
> processing an event may involve an interaction with a (human) user and there's
> no general way by which software can figure out when the user considers the
> event as processed.

Is this the kernel's problem? Once userspace has read the event, we
can safely say that the kernel's critical section is over. Perhaps a
userspace critical section has begun, perhaps not; either way, it's no
longer the kernel's responsibility.

> It looks like user space suspend blockers only help in some special cases
> when the user space processing of a wakeup event is simple enough, but I don't
> think they help in general. For an extreme example, a user may want to wake up
> a system using wake-on-LAN to log into it, do some work and log out, so
> effectively the initial wakeup event has not been processed entirely until the
> user finally logs out of the system. Now, after the system wakeup (resulting
> from the wake-on-LAN signal) we need to give the user some time to log in, but
> if the user doesn't do that in certain time, it may be reasonable to suspend
> and let the user wake up the system again.

I agree. This is a case where there is no clear-cut end to the
kernel's critical section, because the event is not handed over to
userspace. A reasonable approach would be to use a timeout, perhaps
also with some heuristic like: End the critical section early if we
receive 100(?) more network packets before the timeout expires.

> Similar situation takes place when the system is woken up by a lid switch.
> Evidently, the user has opened the laptop lid to do something, but we don't
> even know what the user is going to do, so there's no way we can say when
> the wakeup event is finally processed.

In this case, the kernel could inform an appropriate user process (some
part of DeviceKit? or the power-manager process?) about the lid-switch
event. Once that information had been passed on, the kernel's critical
section would be over. The process could start its own critical
section or not, as it sees fit.

If there is no process to send the information to, the kernel could
again end the critical section after a reasonable timeout (1 minute?).

> So, even if we can say when the kernel has finished processing the event
> (although that would be complicated in the PCIe case above), I don't think
> it's generally possible to ensure that the entire processing of a wakeup event
> has been completed. This leads to the question whether or not it is worth
> trying to detect the ending of the processing of a wakeup event.

As Arve pointed out, in some cases it definitely is worthwhile (the
gpio keypad matrix example). In other cases there may be no reasonable
way to tell. That doesn't mean we have to give up entirely.

> Now, going back to the $subject patch, I didn't really think it would be
> suitable for opportunistic suspend, so let's focus on the "forced" suspend
> instead. It still has the problem that wakeup events occuring while
> /sys/power/state is written to (or even slightly before) should cause the
> system to cancel the suspend, but they generally won't. With the patch
> applied that can be avoided by (a) reading from /sys/power/wakeup_count,
> (b) waiting for certain time (such that if a suspend event is not entirely
> processed within that time, it's worth suspending and waking up the
> system again) and (c) writing to /sys/power/wakeup_count right before writing
> to /sys/power/state (where the latter is only done if the former succeeds).

In other words, you could detect if a critical section begins after the
user process has decided to initiate a suspend. Yes, that's so.

On the other hand, we should already be able to abort a suspend if a
wakeup event arrives after tasks are frozen (to pick a reasonable
boundary point). If we can't -- if some wakeup events are able to slip
through our fingers -- I would say it's a bug and the relevant drivers
need to be fixed. In the end this probably will require adding a
function to notify the PM core that a wakeup event occurred and
therefore a suspend-in-progress should be aborted -- almost exactly
what pm_wakeup_event() does.

So I'm not opposed to the new function. But it doesn't solve the
entire problem of avoiding suspends during critical sections.

Alan Stern

2010-06-22 15:37:05

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Alan Stern wrote:
> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>
> > Having reconsidered that I think there's more to it.
> >
> > Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> > This is the place where wakeup events are started, but it has no idea about
> > how they are going to be handled. Thus in the suspend blocker scheme it would
> > need to activate a blocker, but it wouldn't be able to release it. So, it
> > seems, we would need to associate a suspend blocker with each PCIe device
> > that can generate wakeup events and require all drivers of such devices to
> > deal with a blocker activated by someone else (PCIe PME driver in this
> > particular case). That sounds cumbersome to say the least.
>
> Maybe this means pcie_pme_handle_request() isn't a good place to note
> the arrival of a wakeup event. Doing it in the individual driver might
> work out better.

But it this case there will be an open window in which suspend may be started
after the wakeup event is signaled, but before the PM core is told about it.

> Or look at this from a different point of view. Adopting Mark's
> terminology, let's say that the kernel is in a "critical section" at
> times when we don't want the system to suspend. Currently there's no
> way to tell the PM core when the kernel enters or leaves a critical
> section, so there's no way to prevent the system from suspending at the
> wrong time.
>
> Most wakeup events indicate the start of a critical section, in the
> sense that you hardly ever say: "I want the computer to wake up when I
> press this button, but I don't care what it does afterward -- it can go
> right back to sleep without doing anything if it wants." Much more
> common is that a wakeup event requires a certain amount of processing,
> and you don't want the system to go back to sleep until the processing
> is over. Of course, if the processing is simple enough that it can all
> be done in an interrupt handler or a resume method, then nothing
> extra is needed since obviously the system won't suspend while an
> interrupt handler or a resume method is running. But for more
> complicated cases, we need to do more.
>
> The problem in your example is that pcie_pme_handle_request() has no
> idea about the nature or extent of the critical section to follow.

Exactly.

> Therefore it's not in a good position to mark the beginning of the
> critical section, even though it is in an excellent position to mark
> the receipt of a wakeup event.

I think we have no choice but to regard the detection of a wakeup event as the
beginning of a "suspend critical section".

> > Moreover, even if we do that, it still doesn't solve the entire problem,
> > because the event may need to be delivered to user space and processed by it.
> > While a driver can check if user space has already read the event, it has
> > no way to detect whether or not it has finished processing it. In fact,
> > processing an event may involve an interaction with a (human) user and there's
> > no general way by which software can figure out when the user considers the
> > event as processed.
>
> Is this the kernel's problem? Once userspace has read the event, we
> can safely say that the kernel's critical section is over. Perhaps a
> userspace critical section has begun, perhaps not; either way, it's no
> longer the kernel's responsibility.

Well, I agree here, but in the suspend blockers world it is the kernel
responsibility, because the kernel contains the power manager part.

> > It looks like user space suspend blockers only help in some special cases
> > when the user space processing of a wakeup event is simple enough, but I don't
> > think they help in general. For an extreme example, a user may want to wake up
> > a system using wake-on-LAN to log into it, do some work and log out, so
> > effectively the initial wakeup event has not been processed entirely until the
> > user finally logs out of the system. Now, after the system wakeup (resulting
> > from the wake-on-LAN signal) we need to give the user some time to log in, but
> > if the user doesn't do that in certain time, it may be reasonable to suspend
> > and let the user wake up the system again.
>
> I agree. This is a case where there is no clear-cut end to the
> kernel's critical section, because the event is not handed over to
> userspace. A reasonable approach would be to use a timeout, perhaps
> also with some heuristic like: End the critical section early if we
> receive 100(?) more network packets before the timeout expires.

Exactly.

> > Similar situation takes place when the system is woken up by a lid switch.
> > Evidently, the user has opened the laptop lid to do something, but we don't
> > even know what the user is going to do, so there's no way we can say when
> > the wakeup event is finally processed.
>
> In this case, the kernel could inform an appropriate user process (some
> part of DeviceKit? or the power-manager process?) about the lid-switch
> event. Once that information had been passed on, the kernel's critical
> section would be over. The process could start its own critical
> section or not, as it sees fit.
>
> If there is no process to send the information to, the kernel could
> again end the critical section after a reasonable timeout (1 minute?).

Agreed.

> > So, even if we can say when the kernel has finished processing the event
> > (although that would be complicated in the PCIe case above), I don't think
> > it's generally possible to ensure that the entire processing of a wakeup event
> > has been completed. This leads to the question whether or not it is worth
> > trying to detect the ending of the processing of a wakeup event.
>
> As Arve pointed out, in some cases it definitely is worthwhile (the
> gpio keypad matrix example). In other cases there may be no reasonable
> way to tell. That doesn't mean we have to give up entirely.

Well, I'm not sure, because that really depends on the hardware and bus in
question. The necessary condition seems to be that the event be detected
and handled entirely by the same functional unit (eg. a device driver) within
the kernel and such that it is able to detect whether or not user space has
acquired the event information. That doesn't seem to be a common case to me.

> > Now, going back to the $subject patch, I didn't really think it would be
> > suitable for opportunistic suspend, so let's focus on the "forced" suspend
> > instead. It still has the problem that wakeup events occuring while
> > /sys/power/state is written to (or even slightly before) should cause the
> > system to cancel the suspend, but they generally won't. With the patch
> > applied that can be avoided by (a) reading from /sys/power/wakeup_count,
> > (b) waiting for certain time (such that if a suspend event is not entirely
> > processed within that time, it's worth suspending and waking up the
> > system again) and (c) writing to /sys/power/wakeup_count right before writing
> > to /sys/power/state (where the latter is only done if the former succeeds).
>
> In other words, you could detect if a critical section begins after the
> user process has decided to initiate a suspend. Yes, that's so.

Generally yes, although I think it will also detect "critical sections"
starting exactly at the moment the suspend is started. Which in fact is the
purpose of the patch.

> On the other hand, we should already be able to abort a suspend if a
> wakeup event arrives after tasks are frozen (to pick a reasonable
> boundary point). If we can't -- if some wakeup events are able to slip
> through our fingers -- I would say it's a bug and the relevant drivers
> need to be fixed. In the end this probably will require adding a
> function to notify the PM core that a wakeup event occurred and
> therefore a suspend-in-progress should be aborted -- almost exactly
> what pm_wakeup_event() does.

That's correct.

> So I'm not opposed to the new function. But it doesn't solve the
> entire problem of avoiding suspends during critical sections.

Surely not and it isn't my goal at this point.

I think there are a few different issues that the suspend blockers (or
wakelocks) framework attempts to address in one bit hammer. To me, they are
at least (1) deciding when to suspend, (2) detecting events that should make
us avoid suspending (or abort suspend if already started), (3) preventing
"untrusted" processes from making the system use too much energy.

IMO it's better to treat them as different issues and try to address them
separately.

Rafael

2010-06-22 19:55:54

by Alan Stern

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> On Tuesday, June 22, 2010, Alan Stern wrote:
> > On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> >
> > > Having reconsidered that I think there's more to it.
> > >
> > > Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> > > This is the place where wakeup events are started, but it has no idea about
> > > how they are going to be handled. Thus in the suspend blocker scheme it would
> > > need to activate a blocker, but it wouldn't be able to release it. So, it
> > > seems, we would need to associate a suspend blocker with each PCIe device
> > > that can generate wakeup events and require all drivers of such devices to
> > > deal with a blocker activated by someone else (PCIe PME driver in this
> > > particular case). That sounds cumbersome to say the least.
> >
> > Maybe this means pcie_pme_handle_request() isn't a good place to note
> > the arrival of a wakeup event. Doing it in the individual driver might
> > work out better.
>
> But it this case there will be an open window in which suspend may be started
> after the wakeup event is signaled, but before the PM core is told about it.

That's true but it doesn't matter, assuming the suspend can't progress
during this window.

> > The problem in your example is that pcie_pme_handle_request() has no
> > idea about the nature or extent of the critical section to follow.
>
> Exactly.
>
> > Therefore it's not in a good position to mark the beginning of the
> > critical section, even though it is in an excellent position to mark
> > the receipt of a wakeup event.
>
> I think we have no choice but to regard the detection of a wakeup event as the
> beginning of a "suspend critical section".

Receipt of a wakeup event triggers a whole series of function calls,
including calls to the resume methods of every driver. The system
should be designed so that the next suspend can't begin until those
function calls complete. For example, the next suspend certainly can't
begin before the resume methods all complete. Given that premise, any
one of those functions could serve as the start of a suspend critical
section.

> > > Moreover, even if we do that, it still doesn't solve the entire problem,
> > > because the event may need to be delivered to user space and processed by it.
> > > While a driver can check if user space has already read the event, it has
> > > no way to detect whether or not it has finished processing it. In fact,
> > > processing an event may involve an interaction with a (human) user and there's
> > > no general way by which software can figure out when the user considers the
> > > event as processed.
> >
> > Is this the kernel's problem? Once userspace has read the event, we
> > can safely say that the kernel's critical section is over. Perhaps a
> > userspace critical section has begun, perhaps not; either way, it's no
> > longer the kernel's responsibility.
>
> Well, I agree here, but in the suspend blockers world it is the kernel
> responsibility, because the kernel contains the power manager part.

In the suspend blockers world (or at least, in Android's version of the
suspend blockers world), userspace would activate another suspend
blocker before reading the event. The kernel's critical section could
then end safely once the event was read.

> > > So, even if we can say when the kernel has finished processing the event
> > > (although that would be complicated in the PCIe case above), I don't think
> > > it's generally possible to ensure that the entire processing of a wakeup event
> > > has been completed. This leads to the question whether or not it is worth
> > > trying to detect the ending of the processing of a wakeup event.
> >
> > As Arve pointed out, in some cases it definitely is worthwhile (the
> > gpio keypad matrix example). In other cases there may be no reasonable
> > way to tell. That doesn't mean we have to give up entirely.
>
> Well, I'm not sure, because that really depends on the hardware and bus in
> question. The necessary condition seems to be that the event be detected
> and handled entirely by the same functional unit (eg. a device driver) within
> the kernel and such that it is able to detect whether or not user space has
> acquired the event information. That doesn't seem to be a common case to me.

It's hard to say how common this is without having a list of possible
wakeup sources. And of course, that list will differ from one platform
to another.

> > > Now, going back to the $subject patch, I didn't really think it would be
> > > suitable for opportunistic suspend, so let's focus on the "forced" suspend
> > > instead. It still has the problem that wakeup events occuring while
> > > /sys/power/state is written to (or even slightly before) should cause the
> > > system to cancel the suspend, but they generally won't. With the patch
> > > applied that can be avoided by (a) reading from /sys/power/wakeup_count,
> > > (b) waiting for certain time (such that if a suspend event is not entirely
> > > processed within that time, it's worth suspending and waking up the
> > > system again) and (c) writing to /sys/power/wakeup_count right before writing
> > > to /sys/power/state (where the latter is only done if the former succeeds).
> >
> > In other words, you could detect if a critical section begins after the
> > user process has decided to initiate a suspend. Yes, that's so.
>
> Generally yes, although I think it will also detect "critical sections"
> starting exactly at the moment the suspend is started. Which in fact is the
> purpose of the patch.

Well, the moment the suspend is started _does_ occur after the user
process decides to initiate a suspend. Hence critical sections
starting exactly at that moment would indeed be detected.

> > So I'm not opposed to the new function. But it doesn't solve the
> > entire problem of avoiding suspends during critical sections.
>
> Surely not and it isn't my goal at this point.
>
> I think there are a few different issues that the suspend blockers (or
> wakelocks) framework attempts to address in one bit hammer. To me, they are
> at least (1) deciding when to suspend, (2) detecting events that should make
> us avoid suspending (or abort suspend if already started), (3) preventing
> "untrusted" processes from making the system use too much energy.
>
> IMO it's better to treat them as different issues and try to address them
> separately.

Certainly (3) needs to be addressed separately. It should be handled
completely within userspace, if at all possible.

(1) and (2) are closely related. In fact, a reasonable criterion for
(1) might be: Suspend whenever it is allowed. Then (2) becomes: What
sorts of things should disallow suspend, and for how long?


Evidently part of the problem here is that for a very long time
(predating the existence of Linux), people have been using a bad
abstraction. We talk about "wakeup events", but an event occupies only
a single moment in time. If the computer happens to be asleep at that
moment then it wakes up, fine. But if it was already awake, then once
the moment is passed there's no reason not to suspend -- even if only
1 microsecond has elapsed. Likewise, if an event causes the computer
to wake up, then once the computer is awake the moment is over and
there's nothing to prevent the computer from immediately going back to
sleep.

Instead of talking about events, we should be talking about procedures
or sections: something that happens over a non-zero period of time.
But people have never thought in terms of wakeup procedures or suspend
critical sections, and so the kernel isn't designed to accomodate them.
We may know when they begin, but we often have only a cloudy idea of
when they end.

Historically, people have mostly had in mind that the answer to (1)
would be: Suspend whenever the user tells the computer to suspend. In
that kind of setting, (2) doesn't matter: When the user tells the
machine to suspend, it should obey.

But when we start to consider energy conservation and autonomous (or
opportunistic) suspend, things become more complex. This is why, for
example, the USB subsystem has a user-selectable autosuspend delay.
It's not an ideal solution, but it does prevent us from thrashing
between suspending and resuming a device over and over if it gets used
repeatedly during a short period of time.


We can design mechanisms until we are blue in the face (some of us may
be blue already!), but until we remedy this weakness in our thinking we
won't know how to apply them. Which means people won't be able to
agree on a single correct approach.

Alan Stern

2010-06-22 20:01:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Alan Stern wrote:
> > On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
...
> > > So, even if we can say when the kernel has finished processing the event
> > > (although that would be complicated in the PCIe case above), I don't think
> > > it's generally possible to ensure that the entire processing of a wakeup event
> > > has been completed. This leads to the question whether or not it is worth
> > > trying to detect the ending of the processing of a wakeup event.
> >
> > As Arve pointed out, in some cases it definitely is worthwhile (the
> > gpio keypad matrix example). In other cases there may be no reasonable
> > way to tell. That doesn't mean we have to give up entirely.
>
> Well, I'm not sure, because that really depends on the hardware and bus in
> question. The necessary condition seems to be that the event be detected
> and handled entirely by the same functional unit (eg. a device driver) within
> the kernel and such that it is able to detect whether or not user space has
> acquired the event information. That doesn't seem to be a common case to me.

Anyway, below's an update that addresses this particular case.

It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
that play similar roles to suspend_block() and suspend_unblock(), but they
don't operate on suspend blocker objects. Instead, the first of them increases
a counter of events in progress and the other one decreases this counter.
Together they have the same effect as pm_wakeup_event(), but the counter
of wakeup events in progress they operate on is also checked by
pm_check_wakeup_events().

Thus there are two ways kernel subsystems can signal wakeup events. First,
if the event is not explicitly handed over to user space and "instantaneous",
they can simply call pm_wakeup_event() and be done with it. Second, if the
event is going to be delivered to user space, the subsystem that processes
the event can call pm_wakeup_begin() right when the event is detected and
pm_wakeup_end() when it's been handed over to user space.

Please tell me what you think.

Rafael

---
drivers/base/power/Makefile | 2
drivers/base/power/main.c | 1
drivers/base/power/power.h | 3 +
drivers/base/power/sysfs.c | 9 +++
drivers/base/power/wakeup.c | 113 ++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 2
drivers/pci/pcie/pme/pcie_pme.c | 2
include/linux/pm.h | 10 +++
kernel/power/hibernate.c | 14 +++-
kernel/power/main.c | 24 ++++++++
kernel/power/power.h | 7 ++
kernel/power/suspend.c | 2
12 files changed, 182 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_get_wakeup_count());
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +258,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
@@ -266,6 +289,7 @@ static int __init pm_init(void)
int error = pm_start_workqueue();
if (error)
return error;
+ pm_wakeup_events_init();
power_kobj = kobject_create_and_add("power", NULL);
if (!power_kobj)
return -ENOMEM;
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,113 @@
+
+#include <linux/device.h>
+#include <linux/pm.h>
+
+static unsigned long event_count;
+static unsigned long saved_event_count;
+static unsigned long events_in_progress;
+static bool events_check_enabled;
+static spinlock_t events_lock;
+
+void pm_wakeup_events_init(void)
+{
+ spin_lock_init(&events_lock);
+}
+
+void pm_wakeup_event(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ event_count++;
+ if (dev)
+ dev->power.wakeup_count++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+void pm_wakeup_begin(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ events_in_progress++;
+ if (dev)
+ dev->power.wakeup_count++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+void pm_wakeup_end(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_in_progress) {
+ events_in_progress--;
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+static bool check_wakeup_events(void)
+{
+ return !events_check_enabled
+ || ((event_count == saved_event_count) && !events_in_progress);
+}
+
+bool pm_check_wakeup_events(void)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(&events_lock, flags);
+ ret = check_wakeup_events();
+ events_check_enabled = ret;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+bool pm_check_wakeup_events_final(void)
+{
+ bool ret;
+
+ ret = check_wakeup_events();
+ events_check_enabled = false;
+ return ret;
+}
+
+unsigned long pm_get_wakeup_count(void)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ events_check_enabled = false;
+ count = event_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
+
+bool pm_save_wakeup_count(unsigned long count)
+{
+ unsigned long flags;
+ bool ret = false;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (count == event_count) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_dev_wakeup_count(struct device *dev)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ count = dev->power.wakeup_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev);
+extern void pm_wakeup_begin(struct device *dev);
+extern void pm_wakeup_end(void);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev) {}
+static inline void pm_wakeup_begin(struct device *dev) {}
+static inline void pm_wakeup_end(void) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -184,6 +184,13 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_events_init(void);
+extern bool pm_check_wakeup_events(void);
+extern bool pm_check_wakeup_events_final(void);
+extern unsigned long pm_get_wakeup_count(void);
+extern bool pm_save_wakeup_count(unsigned long count);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -172,7 +172,7 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events_final())
error = suspend_ops->enter(state);
sysdev_resume();
}
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
goto Power_up;

in_suspend = 1;
@@ -511,14 +511,18 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events_final())
+ goto Power_up;
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,8 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ if (device_may_wakeup(&pci_dev->dev))
+ pm_wakeup_event(&pci_dev->dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -154,6 +154,8 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev);
ret = true;
}

Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
extern void device_pm_move_after(struct device *, struct device *);
extern void device_pm_move_last(struct device *);

+/* drivers/base/power/wakeup.c */
+extern unsigned long pm_dev_wakeup_count(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

static inline void device_pm_init(struct device *dev)
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME

2010-06-22 20:34:54

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> Anyway, below's an update that addresses this particular case.
>
> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> that play similar roles to suspend_block() and suspend_unblock(), but they
> don't operate on suspend blocker objects. Instead, the first of them increases
> a counter of events in progress and the other one decreases this counter.
> Together they have the same effect as pm_wakeup_event(), but the counter
> of wakeup events in progress they operate on is also checked by
> pm_check_wakeup_events().
>
> Thus there are two ways kernel subsystems can signal wakeup events. First,
> if the event is not explicitly handed over to user space and "instantaneous",
> they can simply call pm_wakeup_event() and be done with it. Second, if the
> event is going to be delivered to user space, the subsystem that processes
> the event can call pm_wakeup_begin() right when the event is detected and
> pm_wakeup_end() when it's been handed over to user space.

Or if the event is going to be handled entirely in the kernel but over
a prolonged period of time.

> Please tell me what you think.

I like it a lot. It addresses the main weakness in the earlier
version. With this, you could essentially duplicate the in-kernel part
of the wakelocks/suspend blockers stuff. All except the timed
wakelocks -- you might want to consider adding a
pm_wakeup_begin_timeout() convenience routine.

Here's another possible enhancement (if you can figure out a way to do
it without too much effort): After a suspend begins, keep track of the
first wakeup event you get. Then when the suspend is aborted, print a
log message saying why and indicating which device was responsible for
the wakeup.

One little thing: You have the PCI subsystem call pm_wakeup_event().
If the driver then wants to call pm_wakeup_begin(), the event will get
counted twice. I guess this doesn't matter much, but it does seem
peculiar.

Alan Stern

2010-06-22 21:00:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Alan Stern wrote:
> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>
> > On Tuesday, June 22, 2010, Alan Stern wrote:
> > > On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> > >
> > > > Having reconsidered that I think there's more to it.
> > > >
> > > > Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> > > > This is the place where wakeup events are started, but it has no idea about
> > > > how they are going to be handled. Thus in the suspend blocker scheme it would
> > > > need to activate a blocker, but it wouldn't be able to release it. So, it
> > > > seems, we would need to associate a suspend blocker with each PCIe device
> > > > that can generate wakeup events and require all drivers of such devices to
> > > > deal with a blocker activated by someone else (PCIe PME driver in this
> > > > particular case). That sounds cumbersome to say the least.
> > >
> > > Maybe this means pcie_pme_handle_request() isn't a good place to note
> > > the arrival of a wakeup event. Doing it in the individual driver might
> > > work out better.
> >
> > But it this case there will be an open window in which suspend may be started
> > after the wakeup event is signaled, but before the PM core is told about it.
>
> That's true but it doesn't matter, assuming the suspend can't progress
> during this window.
>
> > > The problem in your example is that pcie_pme_handle_request() has no
> > > idea about the nature or extent of the critical section to follow.
> >
> > Exactly.
> >
> > > Therefore it's not in a good position to mark the beginning of the
> > > critical section, even though it is in an excellent position to mark
> > > the receipt of a wakeup event.
> >
> > I think we have no choice but to regard the detection of a wakeup event as the
> > beginning of a "suspend critical section".
>
> Receipt of a wakeup event triggers a whole series of function calls,
> including calls to the resume methods of every driver. The system
> should be designed so that the next suspend can't begin until those
> function calls complete. For example, the next suspend certainly can't
> begin before the resume methods all complete. Given that premise, any
> one of those functions could serve as the start of a suspend critical
> section.

Well, consider pcie_pme_handle_request() again. It certainly can be called
during suspend (until the PME interrupt is disabled), but the PM workqueue
is frozen at this point, so the device driver's resume routine won't be called.
However, the wakeup signal from the device should be regarded as a wakeup
event in that case IMO.

[We have a check for that in dpm_prepare(), but I think it should be replaced
by the "proper" handling of wakeup events, once we have one.]

...
> > > > So, even if we can say when the kernel has finished processing the event
> > > > (although that would be complicated in the PCIe case above), I don't think
> > > > it's generally possible to ensure that the entire processing of a wakeup event
> > > > has been completed. This leads to the question whether or not it is worth
> > > > trying to detect the ending of the processing of a wakeup event.
> > >
> > > As Arve pointed out, in some cases it definitely is worthwhile (the
> > > gpio keypad matrix example). In other cases there may be no reasonable
> > > way to tell. That doesn't mean we have to give up entirely.
> >
> > Well, I'm not sure, because that really depends on the hardware and bus in
> > question. The necessary condition seems to be that the event be detected
> > and handled entirely by the same functional unit (eg. a device driver) within
> > the kernel and such that it is able to detect whether or not user space has
> > acquired the event information. That doesn't seem to be a common case to me.
>
> It's hard to say how common this is without having a list of possible
> wakeup sources. And of course, that list will differ from one platform
> to another.

Agreed.

...
> > I think there are a few different issues that the suspend blockers (or
> > wakelocks) framework attempts to address in one bit hammer. To me, they are
> > at least (1) deciding when to suspend, (2) detecting events that should make
> > us avoid suspending (or abort suspend if already started), (3) preventing
> > "untrusted" processes from making the system use too much energy.
> >
> > IMO it's better to treat them as different issues and try to address them
> > separately.
>
> Certainly (3) needs to be addressed separately. It should be handled
> completely within userspace, if at all possible.
>
> (1) and (2) are closely related. In fact, a reasonable criterion for
> (1) might be: Suspend whenever it is allowed. Then (2) becomes: What
> sorts of things should disallow suspend, and for how long?

The events I mean by (2) are the minimal subset of conditions used in (1),
because the user may add more restrictions that should be checked by user space.

For example, the user may request to suspend whenever there are no open SSH
connections to the machine (why not?), but even if that criterion is satisfied,
wake-on-LAN events should prevent suspend from happening.

> Evidently part of the problem here is that for a very long time
> (predating the existence of Linux), people have been using a bad
> abstraction. We talk about "wakeup events", but an event occupies only
> a single moment in time. If the computer happens to be asleep at that
> moment then it wakes up, fine. But if it was already awake, then once
> the moment is passed there's no reason not to suspend -- even if only
> 1 microsecond has elapsed. Likewise, if an event causes the computer
> to wake up, then once the computer is awake the moment is over and
> there's nothing to prevent the computer from immediately going back to
> sleep.
>
> Instead of talking about events, we should be talking about procedures
> or sections: something that happens over a non-zero period of time.

Agreed.

> But people have never thought in terms of wakeup procedures or suspend
> critical sections, and so the kernel isn't designed to accomodate them.
> We may know when they begin, but we often have only a cloudy idea of
> when they end.

Yeah.

> Historically, people have mostly had in mind that the answer to (1)
> would be: Suspend whenever the user tells the computer to suspend. In
> that kind of setting, (2) doesn't matter: When the user tells the
> machine to suspend, it should obey.

Well, not necessarily, because the user can change his mind while the machine
is suspending and try to generate a wakeup event to abort the suspend.

> But when we start to consider energy conservation and autonomous (or
> opportunistic) suspend, things become more complex. This is why, for
> example, the USB subsystem has a user-selectable autosuspend delay.
> It's not an ideal solution, but it does prevent us from thrashing
> between suspending and resuming a device over and over if it gets used
> repeatedly during a short period of time.
>
> We can design mechanisms until we are blue in the face (some of us may
> be blue already!), but until we remedy this weakness in our thinking we
> won't know how to apply them. Which means people won't be able to
> agree on a single correct approach.

I agree 100%.

Rafael

2010-06-22 21:43:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tuesday, June 22, 2010, Alan Stern wrote:
> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>
> > Anyway, below's an update that addresses this particular case.
> >
> > It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> > that play similar roles to suspend_block() and suspend_unblock(), but they
> > don't operate on suspend blocker objects. Instead, the first of them increases
> > a counter of events in progress and the other one decreases this counter.
> > Together they have the same effect as pm_wakeup_event(), but the counter
> > of wakeup events in progress they operate on is also checked by
> > pm_check_wakeup_events().
> >
> > Thus there are two ways kernel subsystems can signal wakeup events. First,
> > if the event is not explicitly handed over to user space and "instantaneous",
> > they can simply call pm_wakeup_event() and be done with it. Second, if the
> > event is going to be delivered to user space, the subsystem that processes
> > the event can call pm_wakeup_begin() right when the event is detected and
> > pm_wakeup_end() when it's been handed over to user space.
>
> Or if the event is going to be handled entirely in the kernel but over
> a prolonged period of time.
>
> > Please tell me what you think.
>
> I like it a lot. It addresses the main weakness in the earlier
> version. With this, you could essentially duplicate the in-kernel part
> of the wakelocks/suspend blockers stuff. All except the timed
> wakelocks -- you might want to consider adding a
> pm_wakeup_begin_timeout() convenience routine.

That may be added in future quite easily if it really turns out to be necessary.
IIRC Arve said Android only used timeouts in user space wakelocks, not in the
kernel ones.

> Here's another possible enhancement (if you can figure out a way to do
> it without too much effort): After a suspend begins, keep track of the
> first wakeup event you get. Then when the suspend is aborted, print a
> log message saying why and indicating which device was responsible for
> the wakeup.

Good idea, but I'd prefer to add it in a separate patch not to complicate
things too much to start with.

> One little thing: You have the PCI subsystem call pm_wakeup_event().
> If the driver then wants to call pm_wakeup_begin(), the event will get
> counted twice. I guess this doesn't matter much, but it does seem
> peculiar.

Knowing that the PCI core has increased the wakeup count of its device, a
PCI driver may simply use pm_wakeup_begin(NULL) and that will only cause
the main counter to be increased in the end. Which kind of makes sense,
because in that case there really is a sequence of events. First, the PCI core
detects a wakeup signal and requests wakeup so that the driver has a chance
to access the device and get the event information from it (although at this
point it is not known whether or not the driver will need to do that). Second,
the driver requests that the system stay in the working state, because it needs
time to process the event data and (presumably) hand it over to user space.
The device has only signaled wakeup once, though, and that should be recorded.

BTW, I thought I would make pm_get_wakeup_count() and pm_save_wakeup_count()
fail if they are called when events_in_progress is nonzero. For
pm_save_wakeup_count() that's pretty obvious (I think) and it also kind of
makes sense for pm_get_wakeup_count(), because that will tell the reader of
/sys/power/wakeup_count that the value is going to change immediately so it
should really try again.

Rafael

2010-06-22 23:00:19

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, Jun 22, 2010 at 12:21:53PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> > On Tuesday, June 22, 2010, Alan Stern wrote:
> > > On Mon, 21 Jun 2010, Florian Mickler wrote:
> > >
> > > > > In the end you would want to have communication in both directions:
> > > > > suspend blockers _and_ callbacks. Polling is bad if done too often.
> > > > > But I think the idea is a good one.
> > > >
> > > > Actually, I'm not so shure.
> > > >
> > > > 1. you have to roundtrip whereas in the suspend_blocker scheme you have
> > > > active annotations (i.e. no further action needed)
> > >
> > > That's why it's best to use both. The normal case is that programs
> > > activate and deactivate blockers by sending one-way messages to the PM
> > > process. The exceptional case is when the PM process is about to
> > > initiate a suspend; that's when it does the round-trip polling. Since
> > > the only purpose of the polling is to avoid a race, 90% of the time it
> > > will succeed.
> > >
> > > > 2. it may not be possible for a user to determine if a wake-event is
> > > > in-flight. you would have to somehow pass the wake-event-number with
> > > > it, so that the userspace process could ack it properly without
> > > > confusion. Or... I don't know of anything else...
> > > >
> > > > 1. userspace-manager (UM) reads a number (42).
> > > >
> > > > 2. it questions userspace program X: is it ok to suspend?
> > > >
> > > > [please fill in how userspace program X determines to block
> > > > suspend]
> > > >
> > > > 3a. UM's roundtrip ends and it proceeds to write "42" to the
> > > > kernel [suspending]
> > > > 3b. UM's roundtrip ends and it aborts suspend, because a
> > > > (userspace-)suspend-blocker got activated
> > > >
> > > > I'm not shure how the userspace program could determine that there is a
> > > > wake-event in flight. Perhaps by storing the number of last wake-event.
> > > > But then you need per-wake-event-counters... :|
> > >
> > > Rafael seems to think timeouts will fix this. I'm not so sure.
> > >
> > > > Do you have some thoughts about the wake-event-in-flight detection?
> > >
> > > Not really, except for something like the original wakelock scheme in
> > > which the kernel tells the PM core when an event is over.
> >
> > But the kernel doesn't really know that, so it really can't tell the PM core
> > anything useful. What happens with suspend blockers is that a kernel subsystem
> > cooperates with a user space consumer of the event to get the story straight.
> >
> > However, that will only work if the user space is not buggy and doesn't crash,
> > for example, before releasing the suspend blocker it's holding.
>
> Having reconsidered that I think there's more to it.
>
> Take the PCI subsystem as an example, specifically pcie_pme_handle_request().
> This is the place where wakeup events are started, but it has no idea about
> how they are going to be handled. Thus in the suspend blocker scheme it would
> need to activate a blocker, but it wouldn't be able to release it. So, it
> seems, we would need to associate a suspend blocker with each PCIe device
> that can generate wakeup events and require all drivers of such devices to
> deal with a blocker activated by someone else (PCIe PME driver in this
> particular case). That sounds cumbersome to say the least.
>
> Moreover, even if we do that, it still doesn't solve the entire problem,
> because the event may need to be delivered to user space and processed by it.
> While a driver can check if user space has already read the event, it has
> no way to detect whether or not it has finished processing it. In fact,
> processing an event may involve an interaction with a (human) user and there's
> no general way by which software can figure out when the user considers the
> event as processed.
>
> It looks like user space suspend blockers only help in some special cases
> when the user space processing of a wakeup event is simple enough, but I don't
> think they help in general. For an extreme example, a user may want to wake up
> a system using wake-on-LAN to log into it, do some work and log out, so
> effectively the initial wakeup event has not been processed entirely until the
> user finally logs out of the system. Now, after the system wakeup (resulting
> from the wake-on-LAN signal) we need to give the user some time to log in, but
> if the user doesn't do that in certain time, it may be reasonable to suspend
> and let the user wake up the system again.
>
> Similar situation takes place when the system is woken up by a lid switch.
> Evidently, the user has opened the laptop lid to do something, but we don't
> even know what the user is going to do, so there's no way we can say when
> the wakeup event is finally processed.
>
> So, even if we can say when the kernel has finished processing the event
> (although that would be complicated in the PCIe case above), I don't think
> it's generally possible to ensure that the entire processing of a wakeup event
> has been completed. This leads to the question whether or not it is worth
> trying to detect the ending of the processing of a wakeup event.
>
> Now, going back to the $subject patch, I didn't really think it would be
> suitable for opportunistic suspend, so let's focus on the "forced" suspend
> instead. It still has the problem that wakeup events occuring while
> /sys/power/state is written to (or even slightly before) should cause the
> system to cancel the suspend, but they generally won't. With the patch
> applied that can be avoided by (a) reading from /sys/power/wakeup_count,
> (b) waiting for certain time (such that if a suspend event is not entirely
> processed within that time, it's worth suspending and waking up the
> system again) and (c) writing to /sys/power/wakeup_count right before writing
> to /sys/power/state (where the latter is only done if the former succeeds).
>
This is what thought was the problem your idea as trying to deal with.

--mgross

2010-06-22 23:22:25

by 640E9920

[permalink] [raw]
Subject: Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, Jun 22, 2010 at 08:18:03AM +0200, Florian Mickler wrote:
> On Mon, 21 Jun 2010 18:58:37 -0700
> mark gross <[email protected]> wrote:
>
>
> > wrong. They are about 1) adding opportunistic suspend, and 2) adding
> > critical sections to block suspend.
> >
> > No race issues where ever talked about until alternative solutions
> > where put up.
>
> The whole and only reason to even define the term "critical sections" is
> when you need to define "a race". Or vice versa. A race is prevented by
> defining critical sections and protecting these against concurrent
> access.
>
> [..snip..]
>
> [some rant that alan is not familiar with android userspace..]
>
> Are you suggesting that only android developers are supposed to talk
> about this?

of course not. I'm just getting frustrated with having android-isms
tossed in my face as we try to discuss the merits of the ideas, only to
find that the are getting tossed around by someone not familiar with
Android.

Sorry about that. I was having a bad day.

--mgross




>
> This is a pretty basic thing. It has only to do with system suspend.
> (And using system suspend aggressively)
>
> >
> > --mgross
> >
>
> Cheers,
> Flo

2010-06-23 02:12:18

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:

> > > Please tell me what you think.
> >
> > I like it a lot. It addresses the main weakness in the earlier
> > version. With this, you could essentially duplicate the in-kernel part
> > of the wakelocks/suspend blockers stuff. All except the timed
> > wakelocks -- you might want to consider adding a
> > pm_wakeup_begin_timeout() convenience routine.
>
> That may be added in future quite easily if it really turns out to be necessary.
> IIRC Arve said Android only used timeouts in user space wakelocks, not in the
> kernel ones.

Didn't we agree that timeouts would be needed for Wake-on-LAN?

> > Here's another possible enhancement (if you can figure out a way to do
> > it without too much effort): After a suspend begins, keep track of the
> > first wakeup event you get. Then when the suspend is aborted, print a
> > log message saying why and indicating which device was responsible for
> > the wakeup.
>
> Good idea, but I'd prefer to add it in a separate patch not to complicate
> things too much to start with.

Okay. Another thing to be considered later is whether there should be
a way to write to /sys/power/state that would block until the active
wakeup count drops to 0. On the other hand polling maybe once per
second wouldn't be so bad. It would happen only when the kernel had
some events outstanding and userspace didn't.

One thing that stands out is the new spinlock. It could potentially be
a big source of contention. Any wakeup-enabled device is liable to
need it during every interrupt. Do you think this could cause a
noticeable slowdown?

> > One little thing: You have the PCI subsystem call pm_wakeup_event().
> > If the driver then wants to call pm_wakeup_begin(), the event will get
> > counted twice. I guess this doesn't matter much, but it does seem
> > peculiar.
>
> Knowing that the PCI core has increased the wakeup count of its device, a
> PCI driver may simply use pm_wakeup_begin(NULL) and that will only cause
> the main counter to be increased in the end. Which kind of makes sense,
> because in that case there really is a sequence of events. First, the PCI core
> detects a wakeup signal and requests wakeup so that the driver has a chance
> to access the device and get the event information from it (although at this
> point it is not known whether or not the driver will need to do that). Second,
> the driver requests that the system stay in the working state, because it needs
> time to process the event data and (presumably) hand it over to user space.
> The device has only signaled wakeup once, though, and that should be recorded.

Okay, that works. Although if anybody wanted to keep track of timing
statistics, the results wouldn't be as effective since the start and
end times would not be associated with the device.

> BTW, I thought I would make pm_get_wakeup_count() and pm_save_wakeup_count()
> fail if they are called when events_in_progress is nonzero. For
> pm_save_wakeup_count() that's pretty obvious (I think) and it also kind of
> makes sense for pm_get_wakeup_count(), because that will tell the reader of
> /sys/power/wakeup_count that the value is going to change immediately so it
> should really try again.

Sensible.

Alan Stern

2010-06-23 10:10:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Wednesday, June 23, 2010, Alan Stern wrote:
> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > Please tell me what you think.
> > >
> > > I like it a lot. It addresses the main weakness in the earlier
> > > version. With this, you could essentially duplicate the in-kernel part
> > > of the wakelocks/suspend blockers stuff. All except the timed
> > > wakelocks -- you might want to consider adding a
> > > pm_wakeup_begin_timeout() convenience routine.
> >
> > That may be added in future quite easily if it really turns out to be necessary.
> > IIRC Arve said Android only used timeouts in user space wakelocks, not in the
> > kernel ones.
>
> Didn't we agree that timeouts would be needed for Wake-on-LAN?

Yes, we did, but in the WoL case the timeout will have to be used by the user
space rather than the kernel IMO.

It would make sense to add a timeout argument to pm_wakeup_event() that would
make the system stay in the working state long enough for the driver wakeup
code to start in the PCIe case. I think pm_wakeup_event() mght just increment
events_in_progress and the timer might simply decrement it.

So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
will delete the timer, decrement events_in_progress and increment event_count
(unless the timer has already expired before).

That would cost us a (one more) timer in struct dev_pm_info, but it would
allow us to cover all of the cases identified so far. So, if a wakeup event is
handled within one functional unit that both detects it and delivers it to
user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
beginning and then pm_wakeup_commit(dev) when it's done with the event.
If a wakeup event it's just detected by one piece of code and is going to
be handled by another, the first one could call pm_wakeup_event(dev, tm) and
allow the other one to call pm_wakeup_commit(dev) when it's done. However,
calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
(eg. a PCI driver) doesn't really need to do anything in the simplest case.

> > > Here's another possible enhancement (if you can figure out a way to do
> > > it without too much effort): After a suspend begins, keep track of the
> > > first wakeup event you get. Then when the suspend is aborted, print a
> > > log message saying why and indicating which device was responsible for
> > > the wakeup.
> >
> > Good idea, but I'd prefer to add it in a separate patch not to complicate
> > things too much to start with.
>
> Okay. Another thing to be considered later is whether there should be
> a way to write to /sys/power/state that would block until the active
> wakeup count drops to 0. On the other hand polling maybe once per
> second wouldn't be so bad. It would happen only when the kernel had
> some events outstanding and userspace didn't.

Blocking on a write to /sys/power/state wouldn't help, because if the active
wakeup count is nonzero at this point, the suspend should be aborted anyway,
so there's no need to wait. The same applies to writing to
/sys/power/wakeup_count. However, blocking a read from /sys/power/wakeup_count
instead of failing it when the active wakeup count is nonzero would make sense.

> One thing that stands out is the new spinlock. It could potentially be
> a big source of contention. Any wakeup-enabled device is liable to
> need it during every interrupt. Do you think this could cause a
> noticeable slowdown?

That really depends on the number of wakeup devices. However, ISTR the
original wakelocks code had exactly the same issue (it used a spinlock to
protect the lists of wakelocks).

Rafael

2010-06-23 15:21:34

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:

> > Didn't we agree that timeouts would be needed for Wake-on-LAN?
>
> Yes, we did, but in the WoL case the timeout will have to be used by the user
> space rather than the kernel IMO.

The kernel will still have to specify some small initial timeout. Just
long enough for userspace to realize what has happened and start its
own critical section.

> It would make sense to add a timeout argument to pm_wakeup_event() that would
> make the system stay in the working state long enough for the driver wakeup
> code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> events_in_progress and the timer might simply decrement it.

Hmm. I was thinking about a similar problem with the USB hub driver.

Maybe a better answer for this particular issue is to change the
workqueue code. Don't allow a work thread to enter the freezer until
its queue is empty. Then you wouldn't need a timeout.

> So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> will delete the timer, decrement events_in_progress and increment event_count
> (unless the timer has already expired before).
>
> That would cost us a (one more) timer in struct dev_pm_info, but it would
> allow us to cover all of the cases identified so far. So, if a wakeup event is
> handled within one functional unit that both detects it and delivers it to
> user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> beginning and then pm_wakeup_commit(dev) when it's done with the event.
> If a wakeup event it's just detected by one piece of code and is going to
> be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> (eg. a PCI driver) doesn't really need to do anything in the simplest case.

You have to handle the case where pm_wakeup_commit() gets called after
the timer expires (it should do nothing). And what happens if the
device gets a second wakeup event before the timer for the first one
expires? dev_pm_info would need to store a count of pending events.

In fact, this gets even worse. What if the second event causes you to
move the timeout forward, but then you get a commit for the second
event before the original timer would have expired? It's not clear
that timeouts and early commit work well together.

You could consider changing some of the new function names. Instead of
"wakeup" (which implies that the system was asleep previously) use
"awake" (which implies that you want to prevent the system from going
to sleep, as in "stay awake").

> > One thing that stands out is the new spinlock. It could potentially be
> > a big source of contention. Any wakeup-enabled device is liable to
> > need it during every interrupt. Do you think this could cause a
> > noticeable slowdown?
>
> That really depends on the number of wakeup devices. However, ISTR the
> original wakelocks code had exactly the same issue (it used a spinlock to
> protect the lists of wakelocks).

Yeah, that's right. I have already forgotten the details of how that
original design worked.

Alan Stern

2010-06-23 22:19:59

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Wednesday, June 23, 2010, Alan Stern wrote:
> On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:
>
> > > Didn't we agree that timeouts would be needed for Wake-on-LAN?
> >
> > Yes, we did, but in the WoL case the timeout will have to be used by the user
> > space rather than the kernel IMO.
>
> The kernel will still have to specify some small initial timeout. Just
> long enough for userspace to realize what has happened and start its
> own critical section.
>
> > It would make sense to add a timeout argument to pm_wakeup_event() that would
> > make the system stay in the working state long enough for the driver wakeup
> > code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> > events_in_progress and the timer might simply decrement it.
>
> Hmm. I was thinking about a similar problem with the USB hub driver.
>
> Maybe a better answer for this particular issue is to change the
> workqueue code. Don't allow a work thread to enter the freezer until
> its queue is empty. Then you wouldn't need a timeout.
>
> > So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> > increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> > will delete the timer, decrement events_in_progress and increment event_count
> > (unless the timer has already expired before).
> >
> > That would cost us a (one more) timer in struct dev_pm_info, but it would
> > allow us to cover all of the cases identified so far. So, if a wakeup event is
> > handled within one functional unit that both detects it and delivers it to
> > user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> > beginning and then pm_wakeup_commit(dev) when it's done with the event.
> > If a wakeup event it's just detected by one piece of code and is going to
> > be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> > allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> > calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> > (eg. a PCI driver) doesn't really need to do anything in the simplest case.
>
> You have to handle the case where pm_wakeup_commit() gets called after
> the timer expires (it should do nothing).

Yup.

> And what happens if the device gets a second wakeup event before the timer
> for the first one expires?

Good question. I don't have an answer to it at the moment, but it seems to
arise from using a single timer for all events.

It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
event and make the timer function remove it. That would cause suspend to
be blocked until the timer expires without a way to cancel it earlier, though.

> dev_pm_info would need to store a count of pending events.

I'm not sure if that helps. It would if both events were going to be handled
in the same way, but I'm not sure if we can safely assume this.

> In fact, this gets even worse. What if the second event causes you to
> move the timeout forward, but then you get a commit for the second
> event before the original timer would have expired? It's not clear
> that timeouts and early commit work well together.

I think they generally do, but there are problems here, as you noted.

> You could consider changing some of the new function names. Instead of
> "wakeup" (which implies that the system was asleep previously) use
> "awake" (which implies that you want to prevent the system from going
> to sleep, as in "stay awake").

A wakeup event may be defined as an event that would cause the system to wakeup
if it were is a sleep state, so I think the name of pm_wakeup_event() is fine.
The other names might be better.

Rafael

2010-06-24 13:14:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Rafael J. Wysocki wrote:
> On Wednesday, June 23, 2010, Alan Stern wrote:
> > On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:
> >
> > > > Didn't we agree that timeouts would be needed for Wake-on-LAN?
> > >
> > > Yes, we did, but in the WoL case the timeout will have to be used by the user
> > > space rather than the kernel IMO.
> >
> > The kernel will still have to specify some small initial timeout. Just
> > long enough for userspace to realize what has happened and start its
> > own critical section.
> >
> > > It would make sense to add a timeout argument to pm_wakeup_event() that would
> > > make the system stay in the working state long enough for the driver wakeup
> > > code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> > > events_in_progress and the timer might simply decrement it.
> >
> > Hmm. I was thinking about a similar problem with the USB hub driver.
> >
> > Maybe a better answer for this particular issue is to change the
> > workqueue code. Don't allow a work thread to enter the freezer until
> > its queue is empty. Then you wouldn't need a timeout.
> >
> > > So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> > > increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> > > will delete the timer, decrement events_in_progress and increment event_count
> > > (unless the timer has already expired before).
> > >
> > > That would cost us a (one more) timer in struct dev_pm_info, but it would
> > > allow us to cover all of the cases identified so far. So, if a wakeup event is
> > > handled within one functional unit that both detects it and delivers it to
> > > user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> > > beginning and then pm_wakeup_commit(dev) when it's done with the event.
> > > If a wakeup event it's just detected by one piece of code and is going to
> > > be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> > > allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> > > calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> > > (eg. a PCI driver) doesn't really need to do anything in the simplest case.
> >
> > You have to handle the case where pm_wakeup_commit() gets called after
> > the timer expires (it should do nothing).
>
> Yup.
>
> > And what happens if the device gets a second wakeup event before the timer
> > for the first one expires?
>
> Good question. I don't have an answer to it at the moment, but it seems to
> arise from using a single timer for all events.
>
> It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> event and make the timer function remove it. That would cause suspend to
> be blocked until the timer expires without a way to cancel it earlier, though.

So, I decided to try this after all.

Below is a new version of the patch. It introduces pm_stay_awake(dev) and
pm_relax() that play the roles of the "old" pm_wakeup_begin() and
pm_wakeup_end().

pm_wakeup_event() now takes an extra timeout argument and uses it for
deferred execution of pm_relax(). So, one can either use the
pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
if the ending is under someone else's control.

In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
zero.

Please tell me what you think.

Rafael

---
drivers/base/power/Makefile | 2
drivers/base/power/main.c | 1
drivers/base/power/power.h | 3
drivers/base/power/sysfs.c | 9 ++
drivers/base/power/wakeup.c | 150 ++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 2
drivers/pci/pci.c | 5 +
drivers/pci/pci.h | 2
drivers/pci/pcie/pme/pcie_pme.c | 7 +
include/linux/pm.h | 10 ++
kernel/power/hibernate.c | 18 +++-
kernel/power/main.c | 24 ++++++
kernel/power/power.h | 7 +
kernel/power/suspend.c | 2
14 files changed, 232 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,28 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_get_wakeup_count());
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +258,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
@@ -266,6 +289,7 @@ static int __init pm_init(void)
int error = pm_start_workqueue();
if (error)
return error;
+ pm_wakeup_events_init();
power_kobj = kobject_create_and_add("power", NULL);
if (!power_kobj)
return -ENOMEM;
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,150 @@
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/pm.h>
+
+static unsigned long event_count;
+static unsigned long saved_event_count;
+static unsigned long events_in_progress;
+static bool events_check_enabled;
+static spinlock_t events_lock;
+static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
+
+void pm_wakeup_events_init(void)
+{
+ spin_lock_init(&events_lock);
+}
+
+void pm_stay_awake(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ events_in_progress++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+void pm_relax(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_in_progress) {
+ events_in_progress--;
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+static void pm_wakeup_work_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+
+ pm_relax();
+ kfree(dwork);
+}
+
+void pm_wakeup_event(struct device *dev, unsigned int msec)
+{
+ unsigned long flags;
+ struct delayed_work *dwork;
+
+ dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ if (dwork) {
+ INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
+ schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+
+ events_in_progress++;
+ } else {
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+static bool check_wakeup_events(void)
+{
+ return !events_check_enabled
+ || ((event_count == saved_event_count) && !events_in_progress);
+}
+
+bool pm_check_wakeup_events(void)
+{
+ unsigned long flags;
+ bool ret;
+
+ spin_lock_irqsave(&events_lock, flags);
+ ret = check_wakeup_events();
+ events_check_enabled = events_check_enabled && ret;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+bool pm_check_wakeup_events_final(void)
+{
+ bool ret;
+
+ ret = check_wakeup_events();
+ events_check_enabled = false;
+ return ret;
+}
+
+unsigned long pm_dev_wakeup_count(struct device *dev)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ count = dev->power.wakeup_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
+
+unsigned long pm_get_wakeup_count(void)
+{
+ unsigned long count;
+
+ spin_lock_irq(&events_lock);
+ events_check_enabled = false;
+ if (events_in_progress) {
+ DEFINE_WAIT(wait);
+
+ for (;;) {
+ prepare_to_wait(&events_wait_queue, &wait,
+ TASK_INTERRUPTIBLE);
+ if (!events_in_progress)
+ break;
+ spin_unlock_irq(&events_lock);
+
+ schedule();
+
+ spin_lock_irq(&events_lock);
+ }
+ finish_wait(&events_wait_queue, &wait);
+ }
+ count = event_count;
+ spin_unlock_irq(&events_lock);
+ return count;
+}
+
+bool pm_save_wakeup_count(unsigned long count)
+{
+ bool ret = false;
+
+ spin_lock_irq(&events_lock);
+ if (count == event_count && !events_in_progress) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev, unsigned int msec);
+extern void pm_stay_awake(struct device *dev);
+extern void pm_relax(void);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev, unsigned int msec) {}
+static inline void pm_stay_awake(struct device *dev) {}
+static inline void pm_relax(void) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -184,6 +184,13 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_events_init(void);
+extern bool pm_check_wakeup_events(void);
+extern bool pm_check_wakeup_events_final(void);
+extern unsigned long pm_get_wakeup_count(void);
+extern bool pm_save_wakeup_count(unsigned long count);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -172,7 +172,7 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events_final())
error = suspend_ops->enter(state);
sysdev_resume();
}
Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
goto Power_up;

in_suspend = 1;
@@ -511,18 +511,24 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events_final()) {
+ error = -EAGAIN;
+ goto Power_up;
+ }
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

- dpm_suspend_noirq(PMSG_RESTORE);
+ dpm_resume_noirq(PMSG_RESTORE);

Resume_devices:
entering_platform_hibernation = false;
Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,8 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ if (device_may_wakeup(&pci_dev->dev))
+ pm_wakeup_event(&pci_dev->dev, PCI_WAKEUP_COOLDOWN);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -154,6 +154,8 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
ret = true;
}

@@ -254,8 +256,11 @@ static void pcie_pme_handle_request(stru
if (found) {
/* The device is there, but we have to check its PME status. */
found = pci_check_pme_status(dev);
- if (found)
+ if (found) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
+ }
pci_dev_put(dev);
} else if (devfn) {
/*
Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
extern void device_pm_move_after(struct device *, struct device *);
extern void device_pm_move_last(struct device *);

+/* drivers/base/power/wakeup.c */
+extern unsigned long pm_dev_wakeup_count(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

static inline void device_pm_init(struct device *dev)
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME
Index: linux-2.6/drivers/pci/pci.h
===================================================================
--- linux-2.6.orig/drivers/pci/pci.h
+++ linux-2.6/drivers/pci/pci.h
@@ -6,6 +6,8 @@
#define PCI_CFG_SPACE_SIZE 256
#define PCI_CFG_SPACE_EXP_SIZE 4096

+#define PCI_WAKEUP_COOLDOWN 100
+
/* Functions internal to the PCI core code */

extern int pci_uevent(struct device *dev, struct kobj_uevent_env *env);
Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -1285,8 +1285,11 @@ bool pci_check_pme_status(struct pci_dev
*/
static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
{
- if (pci_check_pme_status(dev))
+ if (pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
+ }
return 0;
}

2010-06-24 14:16:48

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

Rafael J. Wysocki wrote:
> On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
>> On Tuesday, June 22, 2010, Alan Stern wrote:
>>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> ...
>>>> So, even if we can say when the kernel has finished processing the event
>>>> (although that would be complicated in the PCIe case above), I don't think
>>>> it's generally possible to ensure that the entire processing of a wakeup event
>>>> has been completed. This leads to the question whether or not it is worth
>>>> trying to detect the ending of the processing of a wakeup event.
>>> As Arve pointed out, in some cases it definitely is worthwhile (the
>>> gpio keypad matrix example). In other cases there may be no reasonable
>>> way to tell. That doesn't mean we have to give up entirely.
>> Well, I'm not sure, because that really depends on the hardware and bus in
>> question. The necessary condition seems to be that the event be detected
>> and handled entirely by the same functional unit (eg. a device driver) within
>> the kernel and such that it is able to detect whether or not user space has
>> acquired the event information. That doesn't seem to be a common case to me.
>
> Anyway, below's an update that addresses this particular case.
>
> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> that play similar roles to suspend_block() and suspend_unblock(), but they
> don't operate on suspend blocker objects. Instead, the first of them increases
> a counter of events in progress and the other one decreases this counter.
> Together they have the same effect as pm_wakeup_event(), but the counter
> of wakeup events in progress they operate on is also checked by
> pm_check_wakeup_events().
>
> Thus there are two ways kernel subsystems can signal wakeup events. First,
> if the event is not explicitly handed over to user space and "instantaneous",
> they can simply call pm_wakeup_event() and be done with it. Second, if the
> event is going to be delivered to user space, the subsystem that processes
> the event can call pm_wakeup_begin() right when the event is detected and
> pm_wakeup_end() when it's been handed over to user space.

How does userspace handle this without races? (I don't see an example
in a driver that talks to userspace in your code...)

For example, if I push a button on my keyboard, the driver calls
pm_wakeup_begin(). Then userspace reads the key from the evdev device
and tells the userspace suspend manager not to go to sleep.

But there's a race: the keyboard driver (or input subsystem) could call
pm_wakeup_end() before the userspace program has a chance to tell the
suspend manager not to sleep.

One possibility would be for poll to report that events are pending
without calling pm_wakeup_end(), giving userspace a chance to prevent
itself from suspending before actually reading the event.


(Also, should "echo mem >/sys/power/state" be different from "echo
mem_respect_suspend_blockers >/sys/power/state?" If I physically press
the suspend key on my laptop, I want it to go to sleep even though I'm
still holding the Fn key that was part of the suspend hotkey.)

--Andy

2010-06-24 14:45:12

by Alan Stern

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thu, 24 Jun 2010, Andy Lutomirski wrote:

> How does userspace handle this without races? (I don't see an example
> in a driver that talks to userspace in your code...)
>
> For example, if I push a button on my keyboard, the driver calls
> pm_wakeup_begin(). Then userspace reads the key from the evdev device
> and tells the userspace suspend manager not to go to sleep.
>
> But there's a race: the keyboard driver (or input subsystem) could call
> pm_wakeup_end() before the userspace program has a chance to tell the
> suspend manager not to sleep.

The assumption is that the user program will poll the evdev device.
When the poll indicates data is available, the program will tell the
userspace power manager not to go to sleep before reading the data.
This avoids the race.

> One possibility would be for poll to report that events are pending
> without calling pm_wakeup_end(), giving userspace a chance to prevent
> itself from suspending before actually reading the event.

Exactly. The critical section doesn't end until the events have been
read. Polling alone doesn't affect the critical section.

> (Also, should "echo mem >/sys/power/state" be different from "echo
> mem_respect_suspend_blockers >/sys/power/state?" If I physically press
> the suspend key on my laptop, I want it to go to sleep even though I'm
> still holding the Fn key that was part of the suspend hotkey.)

With Rafael's scheme, the difference is not in the write to
/sys/power/state but rather in the write to /sys/power/wakeup_count.
If the power manager program doesn't write to /sys/power/wakeup_count
first then /sys/power/state takes effect immediately, just as it does
now.

Alan Stern

2010-06-24 14:50:41

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Andy Lutomirski wrote:
> Rafael J. Wysocki wrote:
> > On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
> >> On Tuesday, June 22, 2010, Alan Stern wrote:
> >>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
> > ...
> >>>> So, even if we can say when the kernel has finished processing the event
> >>>> (although that would be complicated in the PCIe case above), I don't think
> >>>> it's generally possible to ensure that the entire processing of a wakeup event
> >>>> has been completed. This leads to the question whether or not it is worth
> >>>> trying to detect the ending of the processing of a wakeup event.
> >>> As Arve pointed out, in some cases it definitely is worthwhile (the
> >>> gpio keypad matrix example). In other cases there may be no reasonable
> >>> way to tell. That doesn't mean we have to give up entirely.
> >> Well, I'm not sure, because that really depends on the hardware and bus in
> >> question. The necessary condition seems to be that the event be detected
> >> and handled entirely by the same functional unit (eg. a device driver) within
> >> the kernel and such that it is able to detect whether or not user space has
> >> acquired the event information. That doesn't seem to be a common case to me.
> >
> > Anyway, below's an update that addresses this particular case.
> >
> > It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
> > that play similar roles to suspend_block() and suspend_unblock(), but they
> > don't operate on suspend blocker objects. Instead, the first of them increases
> > a counter of events in progress and the other one decreases this counter.
> > Together they have the same effect as pm_wakeup_event(), but the counter
> > of wakeup events in progress they operate on is also checked by
> > pm_check_wakeup_events().
> >
> > Thus there are two ways kernel subsystems can signal wakeup events. First,
> > if the event is not explicitly handed over to user space and "instantaneous",
> > they can simply call pm_wakeup_event() and be done with it. Second, if the
> > event is going to be delivered to user space, the subsystem that processes
> > the event can call pm_wakeup_begin() right when the event is detected and
> > pm_wakeup_end() when it's been handed over to user space.
>
> How does userspace handle this without races? (I don't see an example
> in a driver that talks to userspace in your code...)
>
> For example, if I push a button on my keyboard, the driver calls
> pm_wakeup_begin(). Then userspace reads the key from the evdev device
> and tells the userspace suspend manager not to go to sleep.
>
> But there's a race: the keyboard driver (or input subsystem) could call
> pm_wakeup_end() before the userspace program has a chance to tell the
> suspend manager not to sleep.

That basically is what /sys/power/wakeup_count is for. The power manager is
supposed to first read from it (that will block if there are any events in
progress in the last, recently posted, version of the patch), obtain ACKs from
user space event consumers known to it, then write to
/sys/power/wakeup_count (that will fail if there were any wakeup events in the
meantime) and write to /sys/power/state only if that was successful.

Rafael

2010-06-24 15:08:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Rafael J. Wysocki wrote:
> On Thursday, June 24, 2010, Rafael J. Wysocki wrote:
> > On Wednesday, June 23, 2010, Alan Stern wrote:
> > > On Wed, 23 Jun 2010, Rafael J. Wysocki wrote:
> > >
> > > > > Didn't we agree that timeouts would be needed for Wake-on-LAN?
> > > >
> > > > Yes, we did, but in the WoL case the timeout will have to be used by the user
> > > > space rather than the kernel IMO.
> > >
> > > The kernel will still have to specify some small initial timeout. Just
> > > long enough for userspace to realize what has happened and start its
> > > own critical section.
> > >
> > > > It would make sense to add a timeout argument to pm_wakeup_event() that would
> > > > make the system stay in the working state long enough for the driver wakeup
> > > > code to start in the PCIe case. I think pm_wakeup_event() mght just increment
> > > > events_in_progress and the timer might simply decrement it.
> > >
> > > Hmm. I was thinking about a similar problem with the USB hub driver.
> > >
> > > Maybe a better answer for this particular issue is to change the
> > > workqueue code. Don't allow a work thread to enter the freezer until
> > > its queue is empty. Then you wouldn't need a timeout.
> > >
> > > > So, maybe it's just better to have pm_wakeup_event(dev, timeout) that will
> > > > increment events_in_progress and set up a timer and pm_wakeup_commit(dev) that
> > > > will delete the timer, decrement events_in_progress and increment event_count
> > > > (unless the timer has already expired before).
> > > >
> > > > That would cost us a (one more) timer in struct dev_pm_info, but it would
> > > > allow us to cover all of the cases identified so far. So, if a wakeup event is
> > > > handled within one functional unit that both detects it and delivers it to
> > > > user space, it would call pm_wakeup_event(dev, 0) (ie. infinite timeout) at the
> > > > beginning and then pm_wakeup_commit(dev) when it's done with the event.
> > > > If a wakeup event it's just detected by one piece of code and is going to
> > > > be handled by another, the first one could call pm_wakeup_event(dev, tm) and
> > > > allow the other one to call pm_wakeup_commit(dev) when it's done. However,
> > > > calling pm_wakeup_commit(dev) is not mandatory, so the second piece of code
> > > > (eg. a PCI driver) doesn't really need to do anything in the simplest case.
> > >
> > > You have to handle the case where pm_wakeup_commit() gets called after
> > > the timer expires (it should do nothing).
> >
> > Yup.
> >
> > > And what happens if the device gets a second wakeup event before the timer
> > > for the first one expires?
> >
> > Good question. I don't have an answer to it at the moment, but it seems to
> > arise from using a single timer for all events.
> >
> > It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> > event and make the timer function remove it. That would cause suspend to
> > be blocked until the timer expires without a way to cancel it earlier, though.
>
> So, I decided to try this after all.
>
> Below is a new version of the patch. It introduces pm_stay_awake(dev) and
> pm_relax() that play the roles of the "old" pm_wakeup_begin() and
> pm_wakeup_end().
>
> pm_wakeup_event() now takes an extra timeout argument and uses it for
> deferred execution of pm_relax(). So, one can either use the
> pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
> if the ending is under someone else's control.
>
> In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
> zero.
>
> Please tell me what you think.

Ah, one piece is missing. Namely, the waiting /sys/power/wakeup_count reader
needs to be woken up when events_in_progress goes down to zero.

I'll send a new version with this bug fixed later today.

Rafael

2010-06-24 15:22:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [update] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend



On Jun 24, 2010, at 10:48 AM, "Rafael J. Wysocki" <[email protected]> wrote:

> On Thursday, June 24, 2010, Andy Lutomirski wrote:
>> Rafael J. Wysocki wrote:
>>> On Tuesday, June 22, 2010, Rafael J. Wysocki wrote:
>>>> On Tuesday, June 22, 2010, Alan Stern wrote:
>>>>> On Tue, 22 Jun 2010, Rafael J. Wysocki wrote:
>>> ...
>>>>>> So, even if we can say when the kernel has finished processing
>>>>>> the event
>>>>>> (although that would be complicated in the PCIe case above), I
>>>>>> don't think
>>>>>> it's generally possible to ensure that the entire processing of
>>>>>> a wakeup event
>>>>>> has been completed. This leads to the question whether or not
>>>>>> it is worth
>>>>>> trying to detect the ending of the processing of a wakeup event.
>>>>> As Arve pointed out, in some cases it definitely is worthwhile
>>>>> (the
>>>>> gpio keypad matrix example). In other cases there may be no
>>>>> reasonable
>>>>> way to tell. That doesn't mean we have to give up entirely.
>>>> Well, I'm not sure, because that really depends on the hardware
>>>> and bus in
>>>> question. The necessary condition seems to be that the event be
>>>> detected
>>>> and handled entirely by the same functional unit (eg. a device
>>>> driver) within
>>>> the kernel and such that it is able to detect whether or not user
>>>> space has
>>>> acquired the event information. That doesn't seem to be a common
>>>> case to me.
>>>
>>> Anyway, below's an update that addresses this particular case.
>>>
>>> It adds two more functions, pm_wakeup_begin() and pm_wakeup_end()
>>> that play similar roles to suspend_block() and suspend_unblock(),
>>> but they
>>> don't operate on suspend blocker objects. Instead, the first of
>>> them increases
>>> a counter of events in progress and the other one decreases this
>>> counter.
>>> Together they have the same effect as pm_wakeup_event(), but the
>>> counter
>>> of wakeup events in progress they operate on is also checked by
>>> pm_check_wakeup_events().
>>>
>>> Thus there are two ways kernel subsystems can signal wakeup
>>> events. First,
>>> if the event is not explicitly handed over to user space and
>>> "instantaneous",
>>> they can simply call pm_wakeup_event() and be done with it.
>>> Second, if the
>>> event is going to be delivered to user space, the subsystem that
>>> processes
>>> the event can call pm_wakeup_begin() right when the event is
>>> detected and
>>> pm_wakeup_end() when it's been handed over to user space.
>>
>> How does userspace handle this without races? (I don't see an
>> example
>> in a driver that talks to userspace in your code...)
>>
>> For example, if I push a button on my keyboard, the driver calls
>> pm_wakeup_begin(). Then userspace reads the key from the evdev
>> device
>> and tells the userspace suspend manager not to go to sleep.
>>
>> But there's a race: the keyboard driver (or input subsystem) could
>> call
>> pm_wakeup_end() before the userspace program has a chance to tell the
>> suspend manager not to sleep.
>
> That basically is what /sys/power/wakeup_count is for. The power
> manager is
> supposed to first read from it (that will block if there are any
> events in
> progress in the last, recently posted, version of the patch), obtain
> ACKs from
> user space event consumers known to it, then write to
> /sys/power/wakeup_count (that will fail if there were any wakeup
> events in the
> meantime) and write to /sys/power/state only if that was successful.

That sounds a good deal more complicated than the poll, block suspend,
read, unblock approach, but I guess that as long as both work, the
kernel doesn't really have to care.


>
> Rafael

2010-06-24 15:36:03

by Alan Stern

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:

> Ah, one piece is missing. Namely, the waiting /sys/power/wakeup_count reader
> needs to be woken up when events_in_progress goes down to zero.

It also needs to abort immediately if a signal is received.

Alan Stern

2010-06-24 15:44:54

by Alan Stern

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:

> > > And what happens if the device gets a second wakeup event before the timer
> > > for the first one expires?
> >
> > Good question. I don't have an answer to it at the moment, but it seems to
> > arise from using a single timer for all events.
> >
> > It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> > event and make the timer function remove it. That would cause suspend to
> > be blocked until the timer expires without a way to cancel it earlier, though.
>
> So, I decided to try this after all.
>
> Below is a new version of the patch. It introduces pm_stay_awake(dev) and
> pm_relax() that play the roles of the "old" pm_wakeup_begin() and
> pm_wakeup_end().
>
> pm_wakeup_event() now takes an extra timeout argument and uses it for
> deferred execution of pm_relax(). So, one can either use the
> pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
> if the ending is under someone else's control.
>
> In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
> zero.
>
> Please tell me what you think.

This is slightly different from the wakelock design. Each call to
pm_stay_awake() must be paired with a call to pm_relax(), allowing one
device to have multiple concurrent critical sections, whereas calls to
pm_wakeup_event() must not be paired with anything. With wakelocks,
you couldn't have multiple pending events for the same device. I'm not
sure which model is better in practice. No doubt the Android people
will prefer their way.

This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
think that's okay; I had to do something similar with USB and SCSI.
(And I still think it would be a good idea to prevent workqueue threads
from freezing until their queues are empty.)

Instead of allocating the work structures dynamically, would you be
better off using a memory pool?

Alan Stern

2010-06-24 16:21:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Alan Stern wrote:
> On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:
>
> > > > And what happens if the device gets a second wakeup event before the timer
> > > > for the first one expires?
> > >
> > > Good question. I don't have an answer to it at the moment, but it seems to
> > > arise from using a single timer for all events.
> > >
> > > It looks like it's simpler to make pm_wakeup_event() allocate a timer for each
> > > event and make the timer function remove it. That would cause suspend to
> > > be blocked until the timer expires without a way to cancel it earlier, though.
> >
> > So, I decided to try this after all.
> >
> > Below is a new version of the patch. It introduces pm_stay_awake(dev) and
> > pm_relax() that play the roles of the "old" pm_wakeup_begin() and
> > pm_wakeup_end().
> >
> > pm_wakeup_event() now takes an extra timeout argument and uses it for
> > deferred execution of pm_relax(). So, one can either use the
> > pm_stay_awake(dev) / pm_relax() pair, or use pm_wakeup_event(dev, timeout)
> > if the ending is under someone else's control.
> >
> > In addition to that, pm_get_wakeup_count() blocks until events_in_progress is
> > zero.
> >
> > Please tell me what you think.
>
> This is slightly different from the wakelock design. Each call to
> pm_stay_awake() must be paired with a call to pm_relax(), allowing one
> device to have multiple concurrent critical sections, whereas calls to
> pm_wakeup_event() must not be paired with anything. With wakelocks,
> you couldn't have multiple pending events for the same device.

You could, but you needed to define multiple wakelocks for the same device for
this purpose.

> I'm not sure which model is better in practice. No doubt the Android people
> will prefer their way.

I suppose so.

> This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> think that's okay; I had to do something similar with USB and SCSI.
> (And I still think it would be a good idea to prevent workqueue threads
> from freezing until their queues are empty.)

I guess you mean the freezable ones? I'm not sure if that helps a lot, because
new work items may still be added after the workqueue thread has been frozen.

> Instead of allocating the work structures dynamically, would you be
> better off using a memory pool?

Well, it would be kind of equivalent to defining my own slab cache for that,
wouldn't it?

Rafael

2010-06-24 17:09:30

by Alan Stern

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:

> > This is slightly different from the wakelock design. Each call to
> > pm_stay_awake() must be paired with a call to pm_relax(), allowing one
> > device to have multiple concurrent critical sections, whereas calls to
> > pm_wakeup_event() must not be paired with anything. With wakelocks,
> > you couldn't have multiple pending events for the same device.
>
> You could, but you needed to define multiple wakelocks for the same device for
> this purpose.

Yeah, okay, but who's going to do that?

> > I'm not sure which model is better in practice. No doubt the Android people
> > will prefer their way.
>
> I suppose so.

It may not make a significant difference in the end. You can always
emulate the wakelock approach by not calling pm_stay_awake() when you
know there is an earlier call still pending.

> > This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> > think that's okay; I had to do something similar with USB and SCSI.
> > (And I still think it would be a good idea to prevent workqueue threads
> > from freezing until their queues are empty.)
>
> I guess you mean the freezable ones?

Yes. The unfreezable workqueue threads don't have to worry about
getting frozen while their queues are non-empty. :-)

> I'm not sure if that helps a lot, because
> new work items may still be added after the workqueue thread has been frozen.

That's not the point. If a wakeup handler queues a work item (for
example, by calling pm_request_resume) then it wouldn't need to guess a
timeout. The work item would be guaranteed to run before the system
could suspend again.

> > Instead of allocating the work structures dynamically, would you be
> > better off using a memory pool?
>
> Well, it would be kind of equivalent to defining my own slab cache for that,
> wouldn't it?

I suppose so. It would make the GFP_ATOMIC allocations a little more
reliable.

Alan Stern

2010-06-24 23:02:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: [update 3] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Alan Stern wrote:
> On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:
>
> > Ah, one piece is missing. Namely, the waiting /sys/power/wakeup_count reader
> > needs to be woken up when events_in_progress goes down to zero.
>
> It also needs to abort immediately if a signal is received.

Right.

So, there it goes.

I decided not to play with memory allocations at this point, because I really
don't expect pm_wakeup_event() to be heavily used initially. If there are more
users and it's called more frequently, we can always switch to using a separate
slab cache.

Hopefully, I haven't overlooked anything vitally important this time.

Please tell me what you think.

Rafael

---
drivers/base/power/Makefile | 2
drivers/base/power/main.c | 1
drivers/base/power/power.h | 3
drivers/base/power/sysfs.c | 9 ++
drivers/base/power/wakeup.c | 143 ++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-acpi.c | 1
drivers/pci/pci.c | 10 ++
drivers/pci/pci.h | 3
drivers/pci/pcie/pme/pcie_pme.c | 5 +
include/linux/pm.h | 10 ++
kernel/power/hibernate.c | 22 ++++--
kernel/power/main.c | 26 +++++++
kernel/power/power.h | 9 ++
kernel/power/suspend.c | 4 -
14 files changed, 237 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/power/main.c
===================================================================
--- linux-2.6.orig/kernel/power/main.c
+++ linux-2.6/kernel/power/main.c
@@ -204,6 +204,30 @@ static ssize_t state_store(struct kobjec

power_attr(state);

+static ssize_t wakeup_count_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ unsigned long val;
+
+ return pm_get_wakeup_count(&val) ? sprintf(buf, "%lu\n", val) : -EINTR;
+}
+
+static ssize_t wakeup_count_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t n)
+{
+ unsigned long val;
+
+ if (sscanf(buf, "%lu", &val) == 1) {
+ if (pm_save_wakeup_count(val))
+ return n;
+ }
+ return -EINVAL;
+}
+
+power_attr(wakeup_count);
+
#ifdef CONFIG_PM_TRACE
int pm_trace_enabled;

@@ -236,6 +260,7 @@ static struct attribute * g[] = {
#endif
#ifdef CONFIG_PM_SLEEP
&pm_async_attr.attr,
+ &wakeup_count_attr.attr,
#ifdef CONFIG_PM_DEBUG
&pm_test_attr.attr,
#endif
@@ -266,6 +291,7 @@ static int __init pm_init(void)
int error = pm_start_workqueue();
if (error)
return error;
+ pm_wakeup_events_init();
power_kobj = kobject_create_and_add("power", NULL);
if (!power_kobj)
return -ENOMEM;
Index: linux-2.6/drivers/base/power/wakeup.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/base/power/wakeup.c
@@ -0,0 +1,143 @@
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/capability.h>
+#include <linux/pm.h>
+
+bool events_check_enabled;
+
+static unsigned long event_count;
+static unsigned long saved_event_count;
+static unsigned long events_in_progress;
+static spinlock_t events_lock;
+static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
+
+void pm_wakeup_events_init(void)
+{
+ spin_lock_init(&events_lock);
+}
+
+void pm_stay_awake(struct device *dev)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ events_in_progress++;
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+void pm_relax(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_in_progress) {
+ event_count++;
+ if (!--events_in_progress)
+ wake_up_all(&events_wait_queue);
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+static void pm_wakeup_work_fn(struct work_struct *work)
+{
+ struct delayed_work *dwork = to_delayed_work(work);
+
+ pm_relax();
+ kfree(dwork);
+}
+
+void pm_wakeup_event(struct device *dev, unsigned int msec)
+{
+ unsigned long flags;
+ struct delayed_work *dwork;
+
+ dwork = msec ? kzalloc(sizeof(*dwork), GFP_ATOMIC) : NULL;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (dev)
+ dev->power.wakeup_count++;
+
+ if (dwork) {
+ INIT_DELAYED_WORK(dwork, pm_wakeup_work_fn);
+ schedule_delayed_work(dwork, msecs_to_jiffies(msec));
+
+ events_in_progress++;
+ } else {
+ event_count++;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+}
+
+bool pm_check_wakeup_events(void)
+{
+ unsigned long flags;
+ bool ret = true;
+
+ spin_lock_irqsave(&events_lock, flags);
+ if (events_check_enabled) {
+ ret = (event_count == saved_event_count) && !events_in_progress;
+ events_check_enabled = ret;
+ }
+ spin_unlock_irqrestore(&events_lock, flags);
+ return ret;
+}
+
+unsigned long pm_dev_wakeup_count(struct device *dev)
+{
+ unsigned long flags;
+ unsigned long count;
+
+ spin_lock_irqsave(&events_lock, flags);
+ count = dev->power.wakeup_count;
+ spin_unlock_irqrestore(&events_lock, flags);
+ return count;
+}
+
+bool pm_get_wakeup_count(unsigned long *count)
+{
+ bool ret;
+
+ spin_lock_irq(&events_lock);
+ if (capable(CAP_SYS_ADMIN))
+ events_check_enabled = false;
+
+ if (events_in_progress) {
+ DEFINE_WAIT(wait);
+
+ do {
+ prepare_to_wait(&events_wait_queue, &wait,
+ TASK_INTERRUPTIBLE);
+ if (!events_in_progress)
+ break;
+ spin_unlock_irq(&events_lock);
+
+ schedule();
+
+ spin_lock_irq(&events_lock);
+ } while (!signal_pending(current));
+ finish_wait(&events_wait_queue, &wait);
+ }
+ *count = event_count;
+ ret = !events_in_progress;
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
+
+bool pm_save_wakeup_count(unsigned long count)
+{
+ bool ret = false;
+
+ spin_lock_irq(&events_lock);
+ if (count == event_count && !events_in_progress) {
+ saved_event_count = count;
+ events_check_enabled = true;
+ ret = true;
+ }
+ spin_unlock_irq(&events_lock);
+ return ret;
+}
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -457,6 +457,7 @@ struct dev_pm_info {
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
struct completion completion;
+ unsigned long wakeup_count;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
@@ -552,6 +553,11 @@ extern void __suspend_report_result(cons
} while (0)

extern void device_pm_wait_for_dev(struct device *sub, struct device *dev);
+
+/* drivers/base/power/wakeup.c */
+extern void pm_wakeup_event(struct device *dev, unsigned int msec);
+extern void pm_stay_awake(struct device *dev);
+extern void pm_relax(void);
#else /* !CONFIG_PM_SLEEP */

#define device_pm_lock() do {} while (0)
@@ -565,6 +571,10 @@ static inline int dpm_suspend_start(pm_m
#define suspend_report_result(fn, ret) do {} while (0)

static inline void device_pm_wait_for_dev(struct device *a, struct device *b) {}
+
+static inline void pm_wakeup_event(struct device *dev, unsigned int msec) {}
+static inline void pm_stay_awake(struct device *dev) {}
+static inline void pm_relax(void) {}
#endif /* !CONFIG_PM_SLEEP */

/* How to reorder dpm_list after device_move() */
Index: linux-2.6/drivers/base/power/Makefile
===================================================================
--- linux-2.6.orig/drivers/base/power/Makefile
+++ linux-2.6/drivers/base/power/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_PM) += sysfs.o
-obj-$(CONFIG_PM_SLEEP) += main.o
+obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o
obj-$(CONFIG_PM_RUNTIME) += runtime.o
obj-$(CONFIG_PM_OPS) += generic_ops.o
obj-$(CONFIG_PM_TRACE_RTC) += trace.o
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -59,6 +59,7 @@ void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
init_completion(&dev->power.completion);
+ dev->power.wakeup_count = 0;
pm_runtime_init(dev);
}

Index: linux-2.6/kernel/power/power.h
===================================================================
--- linux-2.6.orig/kernel/power/power.h
+++ linux-2.6/kernel/power/power.h
@@ -184,6 +184,15 @@ static inline void suspend_test_finish(c
#ifdef CONFIG_PM_SLEEP
/* kernel/power/main.c */
extern int pm_notifier_call_chain(unsigned long val);
+
+/* drivers/base/power/wakeup.c */
+extern bool events_check_enabled;
+
+extern void pm_wakeup_events_init(void);
+extern bool pm_check_wakeup_events(void);
+extern bool pm_check_wakeup_events_final(void);
+extern bool pm_get_wakeup_count(unsigned long *count);
+extern bool pm_save_wakeup_count(unsigned long count);
#endif

#ifdef CONFIG_HIGHMEM
Index: linux-2.6/kernel/power/suspend.c
===================================================================
--- linux-2.6.orig/kernel/power/suspend.c
+++ linux-2.6/kernel/power/suspend.c
@@ -172,8 +172,10 @@ static int suspend_enter(suspend_state_t

error = sysdev_suspend(PMSG_SUSPEND);
if (!error) {
- if (!suspend_test(TEST_CORE))
+ if (!suspend_test(TEST_CORE) && pm_check_wakeup_events()) {
error = suspend_ops->enter(state);
+ events_check_enabled = false;
+ }
sysdev_resume();
}

Index: linux-2.6/kernel/power/hibernate.c
===================================================================
--- linux-2.6.orig/kernel/power/hibernate.c
+++ linux-2.6/kernel/power/hibernate.c
@@ -277,7 +277,7 @@ static int create_image(int platform_mod
goto Enable_irqs;
}

- if (hibernation_test(TEST_CORE))
+ if (hibernation_test(TEST_CORE) || !pm_check_wakeup_events())
goto Power_up;

in_suspend = 1;
@@ -288,8 +288,10 @@ static int create_image(int platform_mod
error);
/* Restore control flow magically appears here */
restore_processor_state();
- if (!in_suspend)
+ if (!in_suspend) {
+ events_check_enabled = false;
platform_leave(platform_mode);
+ }

Power_up:
sysdev_resume();
@@ -511,18 +513,24 @@ int hibernation_platform_enter(void)

local_irq_disable();
sysdev_suspend(PMSG_HIBERNATE);
+ if (!pm_check_wakeup_events()) {
+ error = -EAGAIN;
+ goto Power_up;
+ }
+
hibernation_ops->enter();
/* We should never get here */
while (1);

- /*
- * We don't need to reenable the nonboot CPUs or resume consoles, since
- * the system is going to be halted anyway.
- */
+ Power_up:
+ sysdev_resume();
+ local_irq_enable();
+ enable_nonboot_cpus();
+
Platform_finish:
hibernation_ops->finish();

- dpm_suspend_noirq(PMSG_RESTORE);
+ dpm_resume_noirq(PMSG_RESTORE);

Resume_devices:
entering_platform_hibernation = false;
Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -48,6 +48,7 @@ static void pci_acpi_wake_dev(acpi_handl
if (event == ACPI_NOTIFY_DEVICE_WAKE && pci_dev) {
pci_check_pme_status(pci_dev);
pm_runtime_resume(&pci_dev->dev);
+ pci_wakeup_event(pci_dev);
if (pci_dev->subordinate)
pci_pme_wakeup_bus(pci_dev->subordinate);
}
Index: linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/pme/pcie_pme.c
+++ linux-2.6/drivers/pci/pcie/pme/pcie_pme.c
@@ -154,6 +154,7 @@ static bool pcie_pme_walk_bus(struct pci
/* Skip PCIe devices in case we started from a root port. */
if (!pci_is_pcie(dev) && pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
ret = true;
}

@@ -254,8 +255,10 @@ static void pcie_pme_handle_request(stru
if (found) {
/* The device is there, but we have to check its PME status. */
found = pci_check_pme_status(dev);
- if (found)
+ if (found) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
pci_dev_put(dev);
} else if (devfn) {
/*
Index: linux-2.6/drivers/base/power/power.h
===================================================================
--- linux-2.6.orig/drivers/base/power/power.h
+++ linux-2.6/drivers/base/power/power.h
@@ -30,6 +30,9 @@ extern void device_pm_move_before(struct
extern void device_pm_move_after(struct device *, struct device *);
extern void device_pm_move_last(struct device *);

+/* drivers/base/power/wakeup.c */
+extern unsigned long pm_dev_wakeup_count(struct device *dev);
+
#else /* !CONFIG_PM_SLEEP */

static inline void device_pm_init(struct device *dev)
Index: linux-2.6/drivers/base/power/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/base/power/sysfs.c
+++ linux-2.6/drivers/base/power/sysfs.c
@@ -144,6 +144,14 @@ wake_store(struct device * dev, struct d

static DEVICE_ATTR(wakeup, 0644, wake_show, wake_store);

+static ssize_t wakeup_count_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", pm_dev_wakeup_count(dev));
+}
+
+static DEVICE_ATTR(wakeup_count, 0444, wakeup_count_show, NULL);
+
#ifdef CONFIG_PM_ADVANCED_DEBUG
#ifdef CONFIG_PM_RUNTIME

@@ -230,6 +238,7 @@ static struct attribute * power_attrs[]
&dev_attr_control.attr,
#endif
&dev_attr_wakeup.attr,
+ &dev_attr_wakeup_count.attr,
#ifdef CONFIG_PM_ADVANCED_DEBUG
&dev_attr_async.attr,
#ifdef CONFIG_PM_RUNTIME
Index: linux-2.6/drivers/pci/pci.h
===================================================================
--- linux-2.6.orig/drivers/pci/pci.h
+++ linux-2.6/drivers/pci/pci.h
@@ -6,6 +6,8 @@
#define PCI_CFG_SPACE_SIZE 256
#define PCI_CFG_SPACE_EXP_SIZE 4096

+#define PCI_WAKEUP_COOLDOWN 100
+
/* Functions internal to the PCI core code */

extern int pci_uevent(struct device *dev, struct kobj_uevent_env *env);
@@ -56,6 +58,7 @@ extern void pci_update_current_state(str
extern void pci_disable_enabled_device(struct pci_dev *dev);
extern bool pci_check_pme_status(struct pci_dev *dev);
extern int pci_finish_runtime_suspend(struct pci_dev *dev);
+extern void pci_wakeup_event(struct pci_dev *dev);
extern int __pci_pme_wakeup(struct pci_dev *dev, void *ign);
extern void pci_pme_wakeup_bus(struct pci_bus *bus);
extern void pci_pm_init(struct pci_dev *dev);
Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -1275,6 +1275,12 @@ bool pci_check_pme_status(struct pci_dev
return ret;
}

+void pci_wakeup_event(struct pci_dev *dev)
+{
+ if (device_may_wakeup(&dev->dev))
+ pm_wakeup_event(&dev->dev, PCI_WAKEUP_COOLDOWN);
+}
+
/**
* pci_pme_wakeup - Wake up a PCI device if its PME Status bit is set.
* @dev: Device to handle.
@@ -1285,8 +1291,10 @@ bool pci_check_pme_status(struct pci_dev
*/
static int pci_pme_wakeup(struct pci_dev *dev, void *ign)
{
- if (pci_check_pme_status(dev))
+ if (pci_check_pme_status(dev)) {
pm_request_resume(&dev->dev);
+ pci_wakeup_event(dev);
+ }
return 0;
}

2010-06-24 23:08:12

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thursday, June 24, 2010, Alan Stern wrote:
> On Thu, 24 Jun 2010, Rafael J. Wysocki wrote:
>
> > > This is slightly different from the wakelock design. Each call to
> > > pm_stay_awake() must be paired with a call to pm_relax(), allowing one
> > > device to have multiple concurrent critical sections, whereas calls to
> > > pm_wakeup_event() must not be paired with anything. With wakelocks,
> > > you couldn't have multiple pending events for the same device.
> >
> > You could, but you needed to define multiple wakelocks for the same device for
> > this purpose.
>
> Yeah, okay, but who's going to do that?
>
> > > I'm not sure which model is better in practice. No doubt the Android people
> > > will prefer their way.
> >
> > I suppose so.
>
> It may not make a significant difference in the end. You can always
> emulate the wakelock approach by not calling pm_stay_awake() when you
> know there is an earlier call still pending.
>
> > > This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> > > think that's okay; I had to do something similar with USB and SCSI.
> > > (And I still think it would be a good idea to prevent workqueue threads
> > > from freezing until their queues are empty.)
> >
> > I guess you mean the freezable ones?
>
> Yes. The unfreezable workqueue threads don't have to worry about
> getting frozen while their queues are non-empty. :-)
>
> > I'm not sure if that helps a lot, because
> > new work items may still be added after the workqueue thread has been frozen.
>
> That's not the point. If a wakeup handler queues a work item (for
> example, by calling pm_request_resume) then it wouldn't need to guess a
> timeout. The work item would be guaranteed to run before the system
> could suspend again.

You seem to be referring to the PM workqueue specifically. Perhaps it would be
better to special-case it and stop it by adding a barrier work during suspend
instead of just freezing? Then, it wouldn't need to be singlethread any more.

Still, I think the timeout is necessary anyway in case the driver simply
doesn't handle the event and user space needs time to catch up. Unfortunately,
the PCI wakeup code doesn't know what happens next in advance.

Rafael

2010-06-25 06:41:43

by Florian Mickler

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Thu, 24 Jun 2010 13:09:27 -0400 (EDT)
Alan Stern <[email protected]> wrote:

> > > This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> > > think that's okay; I had to do something similar with USB and SCSI.
> > > (And I still think it would be a good idea to prevent workqueue threads
> > > from freezing until their queues are empty.)

I'm not that familiar with the freezer, but couldn't it be
deadlocky if the work depends on some already frozen part?

What about a new work-type that calls
pm_relax() after executing it's workfunction and executing
pm_stay_awake() on enqueue?


Cheers,
Flo

2010-06-25 13:30:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Friday, June 25, 2010, Florian Mickler wrote:
> On Thu, 24 Jun 2010 13:09:27 -0400 (EDT)
> Alan Stern <[email protected]> wrote:
>
> > > > This requires you to define an explicit PCI_WAKEUP_COOLDOWN delay. I
> > > > think that's okay; I had to do something similar with USB and SCSI.
> > > > (And I still think it would be a good idea to prevent workqueue threads
> > > > from freezing until their queues are empty.)
>
> I'm not that familiar with the freezer, but couldn't it be
> deadlocky if the work depends on some already frozen part?

No, in the case of freezable workqueues (which is the one we're discussing)
they generally can't depend on anything freezable, because it's never known
which freezable tasks will be frozen first.

> What about a new work-type that calls
> pm_relax() after executing it's workfunction and executing
> pm_stay_awake() on enqueue?

That might be useful., although it doesn't really help here, because there
still is a window between queuing up a work item and executing it.

Rafael

2010-06-25 14:42:31

by Alan Stern

[permalink] [raw]
Subject: Re: [update 3] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Fri, 25 Jun 2010, Rafael J. Wysocki wrote:

> So, there it goes.
>
> I decided not to play with memory allocations at this point, because I really
> don't expect pm_wakeup_event() to be heavily used initially. If there are more
> users and it's called more frequently, we can always switch to using a separate
> slab cache.
>
> Hopefully, I haven't overlooked anything vitally important this time.
>
> Please tell me what you think.

Obviously comments still need to be added. Beyond that...

> --- /dev/null
> +++ linux-2.6/drivers/base/power/wakeup.c
> @@ -0,0 +1,143 @@
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/capability.h>
> +#include <linux/pm.h>
> +
> +bool events_check_enabled;
> +
> +static unsigned long event_count;
> +static unsigned long saved_event_count;
> +static unsigned long events_in_progress;
> +static spinlock_t events_lock;

Use static DEFINE_SPINLOCK(events_lock) instead.

> +static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
> +
> +void pm_wakeup_events_init(void)
> +{
> + spin_lock_init(&events_lock);
> +}

Then this routine won't be needed.

> +unsigned long pm_dev_wakeup_count(struct device *dev)
> +{
> + unsigned long flags;
> + unsigned long count;
> +
> + spin_lock_irqsave(&events_lock, flags);
> + count = dev->power.wakeup_count;
> + spin_unlock_irqrestore(&events_lock, flags);
> + return count;
> +}

Are the spin_lock calls needed here? I doubt it.

> --- linux-2.6.orig/kernel/power/power.h
> +++ linux-2.6/kernel/power/power.h
> @@ -184,6 +184,15 @@ static inline void suspend_test_finish(c
> #ifdef CONFIG_PM_SLEEP
> /* kernel/power/main.c */
> extern int pm_notifier_call_chain(unsigned long val);
> +
> +/* drivers/base/power/wakeup.c */
> +extern bool events_check_enabled;
> +
> +extern void pm_wakeup_events_init(void);
> +extern bool pm_check_wakeup_events(void);
> +extern bool pm_check_wakeup_events_final(void);
> +extern bool pm_get_wakeup_count(unsigned long *count);
> +extern bool pm_save_wakeup_count(unsigned long count);
> #endif

This is unfortunate. These declarations belong in a file that can
also be #included by drivers/base/power/wakeup.c. Otherwise future
changes might cause type mismatches the compiler won't be able to
catch.

> @@ -511,18 +513,24 @@ int hibernation_platform_enter(void)
>
> local_irq_disable();
> sysdev_suspend(PMSG_HIBERNATE);
> + if (!pm_check_wakeup_events()) {
> + error = -EAGAIN;
> + goto Power_up;
> + }
> +
> hibernation_ops->enter();
> /* We should never get here */
> while (1);
>
> - /*
> - * We don't need to reenable the nonboot CPUs or resume consoles, since
> - * the system is going to be halted anyway.
> - */
> + Power_up:
> + sysdev_resume();
> + local_irq_enable();
> + enable_nonboot_cpus();
> +
> Platform_finish:
> hibernation_ops->finish();
>
> - dpm_suspend_noirq(PMSG_RESTORE);
> + dpm_resume_noirq(PMSG_RESTORE);

Is this a bug fix that crept in along with the other changes?

> --- linux-2.6.orig/drivers/pci/pci.h
> +++ linux-2.6/drivers/pci/pci.h
> @@ -6,6 +6,8 @@
> #define PCI_CFG_SPACE_SIZE 256
> #define PCI_CFG_SPACE_EXP_SIZE 4096
>
> +#define PCI_WAKEUP_COOLDOWN 100

This definition can go directly in pci.c, since it isn't used anywhere
else.

Alan Stern

2010-06-25 15:09:53

by Alan Stern

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Fri, 25 Jun 2010, Rafael J. Wysocki wrote:

> > That's not the point. If a wakeup handler queues a work item (for
> > example, by calling pm_request_resume) then it wouldn't need to guess a
> > timeout. The work item would be guaranteed to run before the system
> > could suspend again.
>
> You seem to be referring to the PM workqueue specifically. Perhaps it would be
> better to special-case it and stop it by adding a barrier work during suspend
> instead of just freezing? Then, it wouldn't need to be singlethread any more.

The barrier work would have to be queued to each CPU's thread. That
would be okay.

Hmm, it looks like wait_event_freezable() and
wait_event_freezable_timeout() could use similar changes: If the
condition is true then they shouldn't try to freeze the caller.

> Still, I think the timeout is necessary anyway in case the driver simply
> doesn't handle the event and user space needs time to catch up. Unfortunately,
> the PCI wakeup code doesn't know what happens next in advance.

That could all be handled by the lower driver. Still, a 100-ms timeout
isn't going to make a significant difference, since a suspend/resume
cycle will take a comparable length of time.

Alan Stern

2010-06-25 20:35:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 3] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Friday, June 25, 2010, Alan Stern wrote:
> On Fri, 25 Jun 2010, Rafael J. Wysocki wrote:
>
> > So, there it goes.
> >
> > I decided not to play with memory allocations at this point, because I really
> > don't expect pm_wakeup_event() to be heavily used initially. If there are more
> > users and it's called more frequently, we can always switch to using a separate
> > slab cache.
> >
> > Hopefully, I haven't overlooked anything vitally important this time.
> >
> > Please tell me what you think.
>
> Obviously comments still need to be added.

Indeed.

> Beyond that...
>
> > --- /dev/null
> > +++ linux-2.6/drivers/base/power/wakeup.c
> > @@ -0,0 +1,143 @@
> > +
> > +#include <linux/device.h>
> > +#include <linux/slab.h>
> > +#include <linux/sched.h>
> > +#include <linux/capability.h>
> > +#include <linux/pm.h>
> > +
> > +bool events_check_enabled;
> > +
> > +static unsigned long event_count;
> > +static unsigned long saved_event_count;
> > +static unsigned long events_in_progress;
> > +static spinlock_t events_lock;
>
> Use static DEFINE_SPINLOCK(events_lock) instead.

Hmm. I thought that was deprecated. Never mind.

> > +static DECLARE_WAIT_QUEUE_HEAD(events_wait_queue);
> > +
> > +void pm_wakeup_events_init(void)
> > +{
> > + spin_lock_init(&events_lock);
> > +}
>
> Then this routine won't be needed.
>
> > +unsigned long pm_dev_wakeup_count(struct device *dev)
> > +{
> > + unsigned long flags;
> > + unsigned long count;
> > +
> > + spin_lock_irqsave(&events_lock, flags);
> > + count = dev->power.wakeup_count;
> > + spin_unlock_irqrestore(&events_lock, flags);
> > + return count;
> > +}
>
> Are the spin_lock calls needed here? I doubt it.

No, they aren't. In fact it may be a static inline.

> > --- linux-2.6.orig/kernel/power/power.h
> > +++ linux-2.6/kernel/power/power.h
> > @@ -184,6 +184,15 @@ static inline void suspend_test_finish(c
> > #ifdef CONFIG_PM_SLEEP
> > /* kernel/power/main.c */
> > extern int pm_notifier_call_chain(unsigned long val);
> > +
> > +/* drivers/base/power/wakeup.c */
> > +extern bool events_check_enabled;
> > +
> > +extern void pm_wakeup_events_init(void);
> > +extern bool pm_check_wakeup_events(void);
> > +extern bool pm_check_wakeup_events_final(void);
> > +extern bool pm_get_wakeup_count(unsigned long *count);
> > +extern bool pm_save_wakeup_count(unsigned long count);
> > #endif
>
> This is unfortunate. These declarations belong in a file that can
> also be #included by drivers/base/power/wakeup.c. Otherwise future
> changes might cause type mismatches the compiler won't be able to
> catch.

You're right. In that case I think include/linux/suspend.h is the right header
to put them into.

> > @@ -511,18 +513,24 @@ int hibernation_platform_enter(void)
> >
> > local_irq_disable();
> > sysdev_suspend(PMSG_HIBERNATE);
> > + if (!pm_check_wakeup_events()) {
> > + error = -EAGAIN;
> > + goto Power_up;
> > + }
> > +
> > hibernation_ops->enter();
> > /* We should never get here */
> > while (1);
> >
> > - /*
> > - * We don't need to reenable the nonboot CPUs or resume consoles, since
> > - * the system is going to be halted anyway.
> > - */
> > + Power_up:
> > + sysdev_resume();
> > + local_irq_enable();
> > + enable_nonboot_cpus();
> > +
> > Platform_finish:
> > hibernation_ops->finish();
> >
> > - dpm_suspend_noirq(PMSG_RESTORE);
> > + dpm_resume_noirq(PMSG_RESTORE);
>
> Is this a bug fix that crept in along with the other changes?

Yeah.

> > --- linux-2.6.orig/drivers/pci/pci.h
> > +++ linux-2.6/drivers/pci/pci.h
> > @@ -6,6 +6,8 @@
> > #define PCI_CFG_SPACE_SIZE 256
> > #define PCI_CFG_SPACE_EXP_SIZE 4096
> >
> > +#define PCI_WAKEUP_COOLDOWN 100
>
> This definition can go directly in pci.c, since it isn't used anywhere
> else.

OK

Thanks for the comments,
Rafael

2010-06-25 20:39:27

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Friday, June 25, 2010, Alan Stern wrote:
> On Fri, 25 Jun 2010, Rafael J. Wysocki wrote:
>
> > > That's not the point. If a wakeup handler queues a work item (for
> > > example, by calling pm_request_resume) then it wouldn't need to guess a
> > > timeout. The work item would be guaranteed to run before the system
> > > could suspend again.
> >
> > You seem to be referring to the PM workqueue specifically. Perhaps it would be
> > better to special-case it and stop it by adding a barrier work during suspend
> > instead of just freezing? Then, it wouldn't need to be singlethread any more.
>
> The barrier work would have to be queued to each CPU's thread. That
> would be okay.

I guess we should stop the PM workqueue after the freezing of tasks, shouldn't we?

> Hmm, it looks like wait_event_freezable() and
> wait_event_freezable_timeout() could use similar changes: If the
> condition is true then they shouldn't try to freeze the caller.

Yes, but that should be a separate patch IMHO.

Rafael

2010-06-25 20:58:01

by Alan Stern

[permalink] [raw]
Subject: Re: [update 2] Re: [RFC][PATCH] PM: Avoid losing wakeup events during suspend

On Fri, 25 Jun 2010, Rafael J. Wysocki wrote:

> > > You seem to be referring to the PM workqueue specifically. Perhaps it would be
> > > better to special-case it and stop it by adding a barrier work during suspend
> > > instead of just freezing? Then, it wouldn't need to be singlethread any more.
> >
> > The barrier work would have to be queued to each CPU's thread. That
> > would be okay.
>
> I guess we should stop the PM workqueue after the freezing of tasks, shouldn't we?

Yes. The exact spot probably doesn't matter; that's as good as any.

> > Hmm, it looks like wait_event_freezable() and
> > wait_event_freezable_timeout() could use similar changes: If the
> > condition is true then they shouldn't try to freeze the caller.
>
> Yes, but that should be a separate patch IMHO.

Agreed.

Alan Stern