LinuxLists.cc - [PATCH 0/2] PM / Domains: Infinite loop during reboot

2015-06-18 10:23:09

Subject: [PATCH 0/2] PM / Domains: Infinite loop during reboot

Hi all,

This patch series fixes an infinite loop during reboot after s2ram, as
seen on r8a7791/koelsch with the CPG Clock Domain:
- Patch 1 fixes the cause in the sh_cmt driver,
- Patch 2 hardens the PM Domain core code against infinite loops.

Both patches individually fix the symptom (lock up during reboot), but
only the first one fixes the root cause.
Patches are against next-20150618.

The discussion thread where the problem was originally reported is "PM /
Domains: Infinite loop during reboot"
(http://permalink.gmane.org/gmane.linux.ports.sh.devel/45249).

Thanks!

Geert Uytterhoeven (2):
clocksource: sh_cmt: Only perform clocksource suspend/resume if
enabled
PM / Domains: Avoid infinite loops in attach/detach code

drivers/base/power/domain.c | 15 +++++++++++++--
drivers/clocksource/sh_cmt.c | 6 ++++++
2 files changed, 19 insertions(+), 2 deletions(-)

--
1.9.1

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-18 10:22:48

by Geert Uytterhoeven

[permalink] [raw]

Subject: [PATCH 1/2] clocksource: sh_cmt: Only perform clocksource suspend/resume if enabled

Currently the sh_cmt clocksource timer is disabled or enabled
unconditionally on clocksource suspend resp. resume, even if a better
clocksource is present (e.g. arch_sys_counter) and the sh_cmt
clocksource is not enabled.

As sh_cmt is a syscore device when its timer is enabled, this may lead
to a genpd.prepared_count imbalance in the presence of PM Domains, which
may cause a lock up during reboot after s2ram.

During suspend:
- pm_genpd_prepare() is called for all non-syscore devices (incl.
sh_cmt), increasing genpd.prepared_count for each device,
- clocksource.suspend() is called for all clocksource devices,
- sh_cmt_clocksource_suspend() calls sh_cmt_stop(), which is a no-op
as the clocksource was not enabled.

During resume:
- clocksource.resume() is called for all clocksource devices,
- sh_cmt_clocksource_resume() calls sh_cmt_start(), which enables the
clocksource timer, and turns sh_cmt into a syscore device,
- pm_genpd_complete() is called for all non-syscore devices (excl.
sh_cmt now!), decreasing genpd.prepared_count for each device but
sh_cmt.

Now genpd.prepared_count of the PM Domain containing sh_cmt is still 1
instead of zero. On subsequent suspend/resume cycles, sh_cmt is still a
syscore device, hence it's skipped for pm_genpd_{prepare,complete}(),
keeping the imbalance of genpd.prepared_count at 1.

During reboot:
- platform_drv_shutdown() is called for any platform device that has
a driver with a .shutdown() method (only rcar-dmac on R-Car Gen2),
- platform_drv_shutdown() calls dev_pm_domain_detach(), which
calls genpd_dev_pm_detach(),
- genpd_dev_pm_detach() keeps calling pm_genpd_remove_device() until
it doesn't return -EAGAIN,
- If the device is part of the same PM Domain as sh_cmt,
pm_genpd_remove_device() always fails with -EAGAIN due to
genpd.prepared_count > 0.
- Infinite loop in genpd_dev_pm_detach().

To fix this, only disable or enable the clocksource timer on clocksource
suspend resp. resume if the clocksource was enabled.

This was tested on r8a7791/koelsch with the CPG Clock Domain:
- using arch_sys_counter as the clocksource, which is the default, and
which showed the problem,
- using sh_cmt as a clocksource ("echo ffca0000.timer > \
/sys/devices/system/clocksource/clocksource0/current_clocksource"),
which behaves the same as before.

Signed-off-by: Geert Uytterhoeven <[email protected]>
---
drivers/clocksource/sh_cmt.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/clocksource/sh_cmt.c b/drivers/clocksource/sh_cmt.c
index b8ff3c64cc452a16..c96de14036a0adeb 100644
--- a/drivers/clocksource/sh_cmt.c
+++ b/drivers/clocksource/sh_cmt.c
@@ -661,6 +661,9 @@ static void sh_cmt_clocksource_suspend(struct clocksource *cs)
{
struct sh_cmt_channel *ch = cs_to_sh_cmt(cs);

+ if (!ch->cs_enabled)
+ return;
+
sh_cmt_stop(ch, FLAG_CLOCKSOURCE);
pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
}
@@ -669,6 +672,9 @@ static void sh_cmt_clocksource_resume(struct clocksource *cs)
{
struct sh_cmt_channel *ch = cs_to_sh_cmt(cs);

+ if (!ch->cs_enabled)
+ return;
+
pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
sh_cmt_start(ch, FLAG_CLOCKSOURCE);
}
--
1.9.1

2015-06-18 10:22:58

by Geert Uytterhoeven

[permalink] [raw]

Subject: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
up with an infinite loop in genpd_dev_pm_{at,de}tach().

This may happen due to a genpd.prepared_count imbalance. This is a bug
elsewhere, but it will result in a system lock up, possibly during
reboot of an otherwise functioning system.

To avoid this, put a limit on the maximum number of loop iterations,
including a simple back-off mechanism. If the limit is reached, the
operation will just fail. An error message is already printed.

Signed-off-by: Geert Uytterhoeven <[email protected]>
---
drivers/base/power/domain.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index cdd547bd67df8218..aa06f60b98b8af35 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -6,6 +6,7 @@
* This file is released under the GPLv2.
*/

+#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/io.h>
#include <linux/platform_device.h>
@@ -19,6 +20,9 @@
#include <linux/suspend.h>
#include <linux/export.h>

+#define GENPD_RETRIES 20
+#define GENPD_DELAY_US 10
+
#define GENPD_DEV_CALLBACK(genpd, type, callback, dev) \
({ \
type (*__routine)(struct device *__d); \
@@ -2131,6 +2135,7 @@ EXPORT_SYMBOL_GPL(of_genpd_get_from_provider);
static void genpd_dev_pm_detach(struct device *dev, bool power_off)
{
struct generic_pm_domain *pd;
+ unsigned int i;
int ret = 0;

pd = pm_genpd_lookup_dev(dev);
@@ -2139,10 +2144,13 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)

dev_dbg(dev, "removing from PM domain %s\n", pd->name);

- while (1) {
+ for (i = 0; i < GENPD_RETRIES; i++) {
ret = pm_genpd_remove_device(pd, dev);
if (ret != -EAGAIN)
break;
+
+ if (i > GENPD_RETRIES / 2)
+ udelay(GENPD_DELAY_US);
cond_resched();
}

@@ -2218,10 +2226,13 @@ int genpd_dev_pm_attach(struct device *dev)

dev_dbg(dev, "adding to PM domain %s\n", pd->name);

- while (1) {
+ for (i = 0; i < GENPD_RETRIES; i++) {
ret = pm_genpd_add_device(pd, dev);
if (ret != -EAGAIN)
break;
+
+ if (i > GENPD_RETRIES / 2)
+ udelay(GENPD_DELAY_US);
cond_resched();
}

--
1.9.1

2015-06-18 14:13:50

by Laurent Pinchart

[permalink] [raw]

Subject: Re: [PATCH 1/2] clocksource: sh_cmt: Only perform clocksource suspend/resume if enabled

Hi Geert,

Thank you for the patch.

On Thursday 18 June 2015 12:22:33 Geert Uytterhoeven wrote:
> Currently the sh_cmt clocksource timer is disabled or enabled
> unconditionally on clocksource suspend resp. resume, even if a better
> clocksource is present (e.g. arch_sys_counter) and the sh_cmt
> clocksource is not enabled.
>
> As sh_cmt is a syscore device when its timer is enabled, this may lead
> to a genpd.prepared_count imbalance in the presence of PM Domains, which
> may cause a lock up during reboot after s2ram.
>
> During suspend:
> - pm_genpd_prepare() is called for all non-syscore devices (incl.
> sh_cmt), increasing genpd.prepared_count for each device,
> - clocksource.suspend() is called for all clocksource devices,
> - sh_cmt_clocksource_suspend() calls sh_cmt_stop(), which is a no-op
> as the clocksource was not enabled.
>
> During resume:
> - clocksource.resume() is called for all clocksource devices,
> - sh_cmt_clocksource_resume() calls sh_cmt_start(), which enables the
> clocksource timer, and turns sh_cmt into a syscore device,
> - pm_genpd_complete() is called for all non-syscore devices (excl.
> sh_cmt now!), decreasing genpd.prepared_count for each device but
> sh_cmt.
>
> Now genpd.prepared_count of the PM Domain containing sh_cmt is still 1
> instead of zero. On subsequent suspend/resume cycles, sh_cmt is still a
> syscore device, hence it's skipped for pm_genpd_{prepare,complete}(),
> keeping the imbalance of genpd.prepared_count at 1.
>
> During reboot:
> - platform_drv_shutdown() is called for any platform device that has
> a driver with a .shutdown() method (only rcar-dmac on R-Car Gen2),
> - platform_drv_shutdown() calls dev_pm_domain_detach(), which
> calls genpd_dev_pm_detach(),
> - genpd_dev_pm_detach() keeps calling pm_genpd_remove_device() until
> it doesn't return -EAGAIN,
> - If the device is part of the same PM Domain as sh_cmt,
> pm_genpd_remove_device() always fails with -EAGAIN due to
> genpd.prepared_count > 0.
> - Infinite loop in genpd_dev_pm_detach().
>
> To fix this, only disable or enable the clocksource timer on clocksource
> suspend resp. resume if the clocksource was enabled.
>
> This was tested on r8a7791/koelsch with the CPG Clock Domain:
> - using arch_sys_counter as the clocksource, which is the default, and
> which showed the problem,
> - using sh_cmt as a clocksource ("echo ffca0000.timer > \
> /sys/devices/system/clocksource/clocksource0/current_clocksource"),
> which behaves the same as before.
>
> Signed-off-by: Geert Uytterhoeven <[email protected]>

Good catch. The patch looks good to me.

Acked-by: Laurent Pinchart <[email protected]>

I've quickly checked the MTU2 and TMU timer drivers and they should be immune
to the issue, but I'd appreciate if you could confirm that.

> ---
> drivers/clocksource/sh_cmt.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/clocksource/sh_cmt.c b/drivers/clocksource/sh_cmt.c
> index b8ff3c64cc452a16..c96de14036a0adeb 100644
> --- a/drivers/clocksource/sh_cmt.c
> +++ b/drivers/clocksource/sh_cmt.c
> @@ -661,6 +661,9 @@ static void sh_cmt_clocksource_suspend(struct
> clocksource *cs) {
> struct sh_cmt_channel *ch = cs_to_sh_cmt(cs);
>
> + if (!ch->cs_enabled)
> + return;
> +
> sh_cmt_stop(ch, FLAG_CLOCKSOURCE);
> pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
> }
> @@ -669,6 +672,9 @@ static void sh_cmt_clocksource_resume(struct clocksource
> *cs) {
> struct sh_cmt_channel *ch = cs_to_sh_cmt(cs);
>
> + if (!ch->cs_enabled)
> + return;
> +
> pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
> sh_cmt_start(ch, FLAG_CLOCKSOURCE);
> }

--
Regards,

Laurent Pinchart

2015-06-18 14:19:32

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 1/2] clocksource: sh_cmt: Only perform clocksource suspend/resume if enabled

Hi Laurent,

On Thu, Jun 18, 2015 at 4:14 PM, Laurent Pinchart
<[email protected]> wrote:
> On Thursday 18 June 2015 12:22:33 Geert Uytterhoeven wrote:
>> Currently the sh_cmt clocksource timer is disabled or enabled
>> unconditionally on clocksource suspend resp. resume, even if a better
>> clocksource is present (e.g. arch_sys_counter) and the sh_cmt
>> clocksource is not enabled.

>> Signed-off-by: Geert Uytterhoeven <[email protected]>
>
> Good catch. The patch looks good to me.

While the solution was simple, it was hard to catch (engineers and
lightbulbs... euh timers ;-)

> Acked-by: Laurent Pinchart <[email protected]>

Thanks!

> I've quickly checked the MTU2 and TMU timer drivers and they should be immune
> to the issue, but I'd appreciate if you could confirm that.

I had concluded the same, thanks for verifying!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-22 07:30:28

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Thu, Jun 18, 2015 at 12:22 PM, Geert Uytterhoeven
<[email protected]> wrote:
> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>
> This may happen due to a genpd.prepared_count imbalance. This is a bug
> elsewhere, but it will result in a system lock up, possibly during
> reboot of an otherwise functioning system.
>
> To avoid this, put a limit on the maximum number of loop iterations,
> including a simple back-off mechanism. If the limit is reached, the
> operation will just fail. An error message is already printed.
>
> Signed-off-by: Geert Uytterhoeven <[email protected]>
> ---
> drivers/base/power/domain.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index cdd547bd67df8218..aa06f60b98b8af35 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c

> @@ -2218,10 +2226,13 @@ int genpd_dev_pm_attach(struct device *dev)
>
> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>
> - while (1) {
> + for (i = 0; i < GENPD_RETRIES; i++) {

Sorry, there's a declaration of "i" missing.

> ret = pm_genpd_add_device(pd, dev);
> if (ret != -EAGAIN)
> break;
> +
> + if (i > GENPD_RETRIES / 2)
> + udelay(GENPD_DELAY_US);
> cond_resched();
> }
>
> --
> 1.9.1

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-22 07:31:42

by Geert Uytterhoeven

[permalink] [raw]

Subject: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
up with an infinite loop in genpd_dev_pm_{at,de}tach().

This may happen due to a genpd.prepared_count imbalance. This is a bug
elsewhere, but it will result in a system lock up, possibly during
reboot of an otherwise functioning system.

To avoid this, put a limit on the maximum number of loop iterations,
including a simple back-off mechanism. If the limit is reached, the
operation will just fail. An error message is already printed.

Signed-off-by: Geert Uytterhoeven <[email protected]>
---
drivers/base/power/domain.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index cdd547bd67df8218..60e0309dd8dd0264 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -6,6 +6,7 @@
* This file is released under the GPLv2.
*/

+#include <linux/delay.h>
#include <linux/kernel.h>
#include <linux/io.h>
#include <linux/platform_device.h>
@@ -19,6 +20,9 @@
#include <linux/suspend.h>
#include <linux/export.h>

+#define GENPD_RETRIES 20
+#define GENPD_DELAY_US 10
+
#define GENPD_DEV_CALLBACK(genpd, type, callback, dev) \
({ \
type (*__routine)(struct device *__d); \
@@ -2131,6 +2135,7 @@ EXPORT_SYMBOL_GPL(of_genpd_get_from_provider);
static void genpd_dev_pm_detach(struct device *dev, bool power_off)
{
struct generic_pm_domain *pd;
+ unsigned int i;
int ret = 0;

pd = pm_genpd_lookup_dev(dev);
@@ -2139,10 +2144,13 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)

dev_dbg(dev, "removing from PM domain %s\n", pd->name);

- while (1) {
+ for (i = 0; i < GENPD_RETRIES; i++) {
ret = pm_genpd_remove_device(pd, dev);
if (ret != -EAGAIN)
break;
+
+ if (i > GENPD_RETRIES / 2)
+ udelay(GENPD_DELAY_US);
cond_resched();
}

@@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
{
struct of_phandle_args pd_args;
struct generic_pm_domain *pd;
+ unsigned int i;
int ret;

if (!dev->of_node)
@@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)

dev_dbg(dev, "adding to PM domain %s\n", pd->name);

- while (1) {
+ for (i = 0; i < GENPD_RETRIES; i++) {
ret = pm_genpd_add_device(pd, dev);
if (ret != -EAGAIN)
break;
+
+ if (i > GENPD_RETRIES / 2)
+ udelay(GENPD_DELAY_US);
cond_resched();
}

--
1.9.1

2015-06-22 23:15:05

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Monday, June 22, 2015 09:31:22 AM Geert Uytterhoeven wrote:
> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>
> This may happen due to a genpd.prepared_count imbalance. This is a bug
> elsewhere, but it will result in a system lock up, possibly during
> reboot of an otherwise functioning system.
>
> To avoid this, put a limit on the maximum number of loop iterations,
> including a simple back-off mechanism. If the limit is reached, the
> operation will just fail. An error message is already printed.
>
> Signed-off-by: Geert Uytterhoeven <[email protected]>

This was too late for my first 4.2 pull request, but I may push it in the
second half of the merge window. Does it depend on anything?

Rafael

2015-06-23 07:16:31

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

Hi Rafael,

On Tue, Jun 23, 2015 at 1:41 AM, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, June 22, 2015 09:31:22 AM Geert Uytterhoeven wrote:
>> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
>> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>>
>> This may happen due to a genpd.prepared_count imbalance. This is a bug
>> elsewhere, but it will result in a system lock up, possibly during
>> reboot of an otherwise functioning system.
>>
>> To avoid this, put a limit on the maximum number of loop iterations,
>> including a simple back-off mechanism. If the limit is reached, the
>> operation will just fail. An error message is already printed.
>>
>> Signed-off-by: Geert Uytterhoeven <[email protected]>
>
> This was too late for my first 4.2 pull request, but I may push it in the
> second half of the merge window. Does it depend on anything?

Not that I'm aware of.

It applies fine on v4.1 or next-20150622.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-23 12:50:54

by Ulf Hansson

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On 22 June 2015 at 09:31, Geert Uytterhoeven <[email protected]> wrote:
> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>
> This may happen due to a genpd.prepared_count imbalance. This is a bug
> elsewhere, but it will result in a system lock up, possibly during
> reboot of an otherwise functioning system.
>
> To avoid this, put a limit on the maximum number of loop iterations,
> including a simple back-off mechanism. If the limit is reached, the
> operation will just fail. An error message is already printed.
>
> Signed-off-by: Geert Uytterhoeven <[email protected]>
> ---
> drivers/base/power/domain.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index cdd547bd67df8218..60e0309dd8dd0264 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -6,6 +6,7 @@
> * This file is released under the GPLv2.
> */
>
> +#include <linux/delay.h>
> #include <linux/kernel.h>
> #include <linux/io.h>
> #include <linux/platform_device.h>
> @@ -19,6 +20,9 @@
> #include <linux/suspend.h>
> #include <linux/export.h>
>
> +#define GENPD_RETRIES 20
> +#define GENPD_DELAY_US 10
> +
> #define GENPD_DEV_CALLBACK(genpd, type, callback, dev) \
> ({ \
> type (*__routine)(struct device *__d); \
> @@ -2131,6 +2135,7 @@ EXPORT_SYMBOL_GPL(of_genpd_get_from_provider);
> static void genpd_dev_pm_detach(struct device *dev, bool power_off)
> {
> struct generic_pm_domain *pd;
> + unsigned int i;
> int ret = 0;
>
> pd = pm_genpd_lookup_dev(dev);
> @@ -2139,10 +2144,13 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>
> dev_dbg(dev, "removing from PM domain %s\n", pd->name);
>
> - while (1) {
> + for (i = 0; i < GENPD_RETRIES; i++) {
> ret = pm_genpd_remove_device(pd, dev);
> if (ret != -EAGAIN)
> break;
> +
> + if (i > GENPD_RETRIES / 2)
> + udelay(GENPD_DELAY_US);
> cond_resched();
> }
>
> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
> {
> struct of_phandle_args pd_args;
> struct generic_pm_domain *pd;
> + unsigned int i;
> int ret;
>
> if (!dev->of_node)
> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>
> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>
> - while (1) {
> + for (i = 0; i < GENPD_RETRIES; i++) {
> ret = pm_genpd_add_device(pd, dev);
> if (ret != -EAGAIN)
> break;
> +
> + if (i > GENPD_RETRIES / 2)
> + udelay(GENPD_DELAY_US);

In this execution path, we retry when getting -EAGAIN while believing
the reason to the error are only *temporary* as we are soon waiting
for all devices in the genpd to be system PM resumed. At least that's
my understanding to why we want to deal with -EAGAIN here, but I might
be wrong.

In this regards, I wonder whether it could be better to re-try only a
few times but with a far longer interval time than a couple us. What
do you think?

However, what if the reason to why we get -EAGAIN isn't *temporary*,
because we are about to enter system PM suspend state. Then the caller
of this function which comes via some bus' ->probe(), will hang until
the a system PM resume is completed. Is that really going to work? So,
for this case your limited re-try approach will affect this scenario
as well, have you considered that?

> cond_resched();
> }
>
> --
> 1.9.1
>

Kind regards
Uffe

2015-06-23 13:20:51

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

Hi Ulf,

On Tue, Jun 23, 2015 at 2:50 PM, Ulf Hansson <[email protected]> wrote:
> On 22 June 2015 at 09:31, Geert Uytterhoeven <[email protected]> wrote:
>> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
>> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>>
>> This may happen due to a genpd.prepared_count imbalance. This is a bug
>> elsewhere, but it will result in a system lock up, possibly during
>> reboot of an otherwise functioning system.
>>
>> To avoid this, put a limit on the maximum number of loop iterations,
>> including a simple back-off mechanism. If the limit is reached, the
>> operation will just fail. An error message is already printed.
>>
>> Signed-off-by: Geert Uytterhoeven <[email protected]>
>> ---
>> drivers/base/power/domain.c | 16 ++++++++++++++--
>> 1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>> index cdd547bd67df8218..60e0309dd8dd0264 100644
>> --- a/drivers/base/power/domain.c
>> +++ b/drivers/base/power/domain.c
>> @@ -6,6 +6,7 @@
>> * This file is released under the GPLv2.
>> */
>>
>> +#include <linux/delay.h>
>> #include <linux/kernel.h>
>> #include <linux/io.h>
>> #include <linux/platform_device.h>
>> @@ -19,6 +20,9 @@
>> #include <linux/suspend.h>
>> #include <linux/export.h>
>>
>> +#define GENPD_RETRIES 20
>> +#define GENPD_DELAY_US 10
>> +
>> #define GENPD_DEV_CALLBACK(genpd, type, callback, dev) \
>> ({ \
>> type (*__routine)(struct device *__d); \
>> @@ -2131,6 +2135,7 @@ EXPORT_SYMBOL_GPL(of_genpd_get_from_provider);
>> static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>> {
>> struct generic_pm_domain *pd;
>> + unsigned int i;
>> int ret = 0;
>>
>> pd = pm_genpd_lookup_dev(dev);
>> @@ -2139,10 +2144,13 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>>
>> dev_dbg(dev, "removing from PM domain %s\n", pd->name);
>>
>> - while (1) {
>> + for (i = 0; i < GENPD_RETRIES; i++) {
>> ret = pm_genpd_remove_device(pd, dev);
>> if (ret != -EAGAIN)
>> break;
>> +
>> + if (i > GENPD_RETRIES / 2)
>> + udelay(GENPD_DELAY_US);
>> cond_resched();
>> }
>>
>> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
>> {
>> struct of_phandle_args pd_args;
>> struct generic_pm_domain *pd;
>> + unsigned int i;
>> int ret;
>>
>> if (!dev->of_node)
>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>>
>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>>
>> - while (1) {
>> + for (i = 0; i < GENPD_RETRIES; i++) {
>> ret = pm_genpd_add_device(pd, dev);
>> if (ret != -EAGAIN)
>> break;
>> +
>> + if (i > GENPD_RETRIES / 2)
>> + udelay(GENPD_DELAY_US);
>
> In this execution path, we retry when getting -EAGAIN while believing
> the reason to the error are only *temporary* as we are soon waiting
> for all devices in the genpd to be system PM resumed. At least that's
> my understanding to why we want to deal with -EAGAIN here, but I might
> be wrong.
>
> In this regards, I wonder whether it could be better to re-try only a
> few times but with a far longer interval time than a couple us. What
> do you think?

That's indeed viable. I have no idea for how long this temporary state can
extend.

> However, what if the reason to why we get -EAGAIN isn't *temporary*,
> because we are about to enter system PM suspend state. Then the caller
> of this function which comes via some bus' ->probe(), will hang until
> the a system PM resume is completed. Is that really going to work? So,
> for this case your limited re-try approach will affect this scenario
> as well, have you considered that?

There's a limit on the number of retries, so it won't hang indefinitely.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-23 13:38:58

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Tue, Jun 23, 2015 at 3:20 PM, Geert Uytterhoeven
<[email protected]> wrote:
> Hi Ulf,
>
> On Tue, Jun 23, 2015 at 2:50 PM, Ulf Hansson <[email protected]> wrote:
>> On 22 June 2015 at 09:31, Geert Uytterhoeven <[email protected]> wrote:
>>> If pm_genpd_{add,remove}_device() keeps on failing with -EAGAIN, we end
>>> up with an infinite loop in genpd_dev_pm_{at,de}tach().
>>>
>>> This may happen due to a genpd.prepared_count imbalance. This is a bug
>>> elsewhere, but it will result in a system lock up, possibly during
>>> reboot of an otherwise functioning system.
>>>
>>> To avoid this, put a limit on the maximum number of loop iterations,
>>> including a simple back-off mechanism. If the limit is reached, the
>>> operation will just fail. An error message is already printed.
>>>
>>> Signed-off-by: Geert Uytterhoeven <[email protected]>
>>> ---
>>> drivers/base/power/domain.c | 16 ++++++++++++++--
>>> 1 file changed, 14 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
>>> index cdd547bd67df8218..60e0309dd8dd0264 100644
>>> --- a/drivers/base/power/domain.c
>>> +++ b/drivers/base/power/domain.c
>>> @@ -6,6 +6,7 @@
>>> * This file is released under the GPLv2.
>>> */
>>>
>>> +#include <linux/delay.h>
>>> #include <linux/kernel.h>
>>> #include <linux/io.h>
>>> #include <linux/platform_device.h>
>>> @@ -19,6 +20,9 @@
>>> #include <linux/suspend.h>
>>> #include <linux/export.h>
>>>
>>> +#define GENPD_RETRIES 20
>>> +#define GENPD_DELAY_US 10
>>> +
>>> #define GENPD_DEV_CALLBACK(genpd, type, callback, dev) \
>>> ({ \
>>> type (*__routine)(struct device *__d); \
>>> @@ -2131,6 +2135,7 @@ EXPORT_SYMBOL_GPL(of_genpd_get_from_provider);
>>> static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>>> {
>>> struct generic_pm_domain *pd;
>>> + unsigned int i;
>>> int ret = 0;
>>>
>>> pd = pm_genpd_lookup_dev(dev);
>>> @@ -2139,10 +2144,13 @@ static void genpd_dev_pm_detach(struct device *dev, bool power_off)
>>>
>>> dev_dbg(dev, "removing from PM domain %s\n", pd->name);
>>>
>>> - while (1) {
>>> + for (i = 0; i < GENPD_RETRIES; i++) {
>>> ret = pm_genpd_remove_device(pd, dev);
>>> if (ret != -EAGAIN)
>>> break;
>>> +
>>> + if (i > GENPD_RETRIES / 2)
>>> + udelay(GENPD_DELAY_US);
>>> cond_resched();
>>> }
>>>
>>> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
>>> {
>>> struct of_phandle_args pd_args;
>>> struct generic_pm_domain *pd;
>>> + unsigned int i;
>>> int ret;
>>>
>>> if (!dev->of_node)
>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>>>
>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>>>
>>> - while (1) {
>>> + for (i = 0; i < GENPD_RETRIES; i++) {
>>> ret = pm_genpd_add_device(pd, dev);
>>> if (ret != -EAGAIN)
>>> break;
>>> +
>>> + if (i > GENPD_RETRIES / 2)
>>> + udelay(GENPD_DELAY_US);
>>
>> In this execution path, we retry when getting -EAGAIN while believing
>> the reason to the error are only *temporary* as we are soon waiting
>> for all devices in the genpd to be system PM resumed. At least that's
>> my understanding to why we want to deal with -EAGAIN here, but I might
>> be wrong.
>>
>> In this regards, I wonder whether it could be better to re-try only a
>> few times but with a far longer interval time than a couple us. What
>> do you think?
>
> That's indeed viable. I have no idea for how long this temporary state can
> extend.

A usual approach to this kind of thing is to use exponential fallback
where you increase the delay twice with respect to the previous one
every time.

Rafael

2015-06-23 13:45:49

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

Hi Rafael,

On Tue, Jun 23, 2015 at 3:38 PM, Rafael J. Wysocki <[email protected]> wrote:
>>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>>>>
>>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>>>>
>>>> - while (1) {
>>>> + for (i = 0; i < GENPD_RETRIES; i++) {
>>>> ret = pm_genpd_add_device(pd, dev);
>>>> if (ret != -EAGAIN)
>>>> break;
>>>> +
>>>> + if (i > GENPD_RETRIES / 2)
>>>> + udelay(GENPD_DELAY_US);
>>>
>>> In this execution path, we retry when getting -EAGAIN while believing
>>> the reason to the error are only *temporary* as we are soon waiting
>>> for all devices in the genpd to be system PM resumed. At least that's
>>> my understanding to why we want to deal with -EAGAIN here, but I might
>>> be wrong.
>>>
>>> In this regards, I wonder whether it could be better to re-try only a
>>> few times but with a far longer interval time than a couple us. What
>>> do you think?
>>
>> That's indeed viable. I have no idea for how long this temporary state can
>> extend.
>
> A usual approach to this kind of thing is to use exponential fallback
> where you increase the delay twice with respect to the previous one
> every time.

Right, but when do you give up?

Note that udelay() is a busy loop. Should it fall back to msleep() after
a while?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-24 08:34:05

by Ulf Hansson

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

[...]

>>>
>>> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
>>> {
>>> struct of_phandle_args pd_args;
>>> struct generic_pm_domain *pd;
>>> + unsigned int i;
>>> int ret;
>>>
>>> if (!dev->of_node)
>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>>>
>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>>>
>>> - while (1) {
>>> + for (i = 0; i < GENPD_RETRIES; i++) {
>>> ret = pm_genpd_add_device(pd, dev);
>>> if (ret != -EAGAIN)
>>> break;
>>> +
>>> + if (i > GENPD_RETRIES / 2)
>>> + udelay(GENPD_DELAY_US);
>>
>> In this execution path, we retry when getting -EAGAIN while believing
>> the reason to the error are only *temporary* as we are soon waiting
>> for all devices in the genpd to be system PM resumed. At least that's
>> my understanding to why we want to deal with -EAGAIN here, but I might
>> be wrong.
>>
>> In this regards, I wonder whether it could be better to re-try only a
>> few times but with a far longer interval time than a couple us. What
>> do you think?
>
> That's indeed viable. I have no idea for how long this temporary state can
> extend.

That will depend on the system PM resume time for the devices residing
in the genpd. So, I guess we need a guestimate then. How about a total
sleep time of a few seconds?

>
>> However, what if the reason to why we get -EAGAIN isn't *temporary*,
>> because we are about to enter system PM suspend state. Then the caller
>> of this function which comes via some bus' ->probe(), will hang until
>> the a system PM resume is completed. Is that really going to work? So,
>> for this case your limited re-try approach will affect this scenario
>> as well, have you considered that?
>
> There's a limit on the number of retries, so it won't hang indefinitely.

What happens with the timer functions (like msleep()) during the
system PM suspend transition?

Kind regards
Uffe

2015-06-24 08:35:54

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Wed, Jun 24, 2015 at 10:33 AM, Ulf Hansson <[email protected]> wrote:
> [...]
>
>>>>
>>>> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
>>>> {
>>>> struct of_phandle_args pd_args;
>>>> struct generic_pm_domain *pd;
>>>> + unsigned int i;
>>>> int ret;
>>>>
>>>> if (!dev->of_node)
>>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
>>>>
>>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
>>>>
>>>> - while (1) {
>>>> + for (i = 0; i < GENPD_RETRIES; i++) {
>>>> ret = pm_genpd_add_device(pd, dev);
>>>> if (ret != -EAGAIN)
>>>> break;
>>>> +
>>>> + if (i > GENPD_RETRIES / 2)
>>>> + udelay(GENPD_DELAY_US);
>>>
>>> In this execution path, we retry when getting -EAGAIN while believing
>>> the reason to the error are only *temporary* as we are soon waiting
>>> for all devices in the genpd to be system PM resumed. At least that's
>>> my understanding to why we want to deal with -EAGAIN here, but I might
>>> be wrong.
>>>
>>> In this regards, I wonder whether it could be better to re-try only a
>>> few times but with a far longer interval time than a couple us. What
>>> do you think?
>>
>> That's indeed viable. I have no idea for how long this temporary state can
>> extend.
>
> That will depend on the system PM resume time for the devices residing
> in the genpd. So, I guess we need a guestimate then. How about a total
> sleep time of a few seconds?
>
>>
>>> However, what if the reason to why we get -EAGAIN isn't *temporary*,
>>> because we are about to enter system PM suspend state. Then the caller
>>> of this function which comes via some bus' ->probe(), will hang until
>>> the a system PM resume is completed. Is that really going to work? So,
>>> for this case your limited re-try approach will affect this scenario
>>> as well, have you considered that?
>>
>> There's a limit on the number of retries, so it won't hang indefinitely.
>
> What happens with the timer functions (like msleep()) during the
> system PM suspend transition?

I guess we can no longer call msleep() after syscore suspend?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-06-24 13:22:10

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Wednesday, June 24, 2015 10:35:44 AM Geert Uytterhoeven wrote:
> On Wed, Jun 24, 2015 at 10:33 AM, Ulf Hansson <[email protected]> wrote:
> > [...]
> >
> >>>>
> >>>> @@ -2183,6 +2191,7 @@ int genpd_dev_pm_attach(struct device *dev)
> >>>> {
> >>>> struct of_phandle_args pd_args;
> >>>> struct generic_pm_domain *pd;
> >>>> + unsigned int i;
> >>>> int ret;
> >>>>
> >>>> if (!dev->of_node)
> >>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
> >>>>
> >>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
> >>>>
> >>>> - while (1) {
> >>>> + for (i = 0; i < GENPD_RETRIES; i++) {
> >>>> ret = pm_genpd_add_device(pd, dev);
> >>>> if (ret != -EAGAIN)
> >>>> break;
> >>>> +
> >>>> + if (i > GENPD_RETRIES / 2)
> >>>> + udelay(GENPD_DELAY_US);
> >>>
> >>> In this execution path, we retry when getting -EAGAIN while believing
> >>> the reason to the error are only *temporary* as we are soon waiting
> >>> for all devices in the genpd to be system PM resumed. At least that's
> >>> my understanding to why we want to deal with -EAGAIN here, but I might
> >>> be wrong.
> >>>
> >>> In this regards, I wonder whether it could be better to re-try only a
> >>> few times but with a far longer interval time than a couple us. What
> >>> do you think?
> >>
> >> That's indeed viable. I have no idea for how long this temporary state can
> >> extend.
> >
> > That will depend on the system PM resume time for the devices residing
> > in the genpd. So, I guess we need a guestimate then. How about a total
> > sleep time of a few seconds?
> >
> >>
> >>> However, what if the reason to why we get -EAGAIN isn't *temporary*,
> >>> because we are about to enter system PM suspend state. Then the caller
> >>> of this function which comes via some bus' ->probe(), will hang until
> >>> the a system PM resume is completed. Is that really going to work? So,
> >>> for this case your limited re-try approach will affect this scenario
> >>> as well, have you considered that?
> >>
> >> There's a limit on the number of retries, so it won't hang indefinitely.
> >
> > What happens with the timer functions (like msleep()) during the
> > system PM suspend transition?
>
> I guess we can no longer call msleep() after syscore suspend?

That's correct. Time is effectively frozen at that point.

Rafael

2015-06-24 13:44:15

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [PATCH 2/2] PM / Domains: Avoid infinite loops in attach/detach code

On Tuesday, June 23, 2015 03:45:43 PM Geert Uytterhoeven wrote:
> Hi Rafael,
>
> On Tue, Jun 23, 2015 at 3:38 PM, Rafael J. Wysocki <[email protected]> wrote:
> >>>> @@ -2218,10 +2227,13 @@ int genpd_dev_pm_attach(struct device *dev)
> >>>>
> >>>> dev_dbg(dev, "adding to PM domain %s\n", pd->name);
> >>>>
> >>>> - while (1) {
> >>>> + for (i = 0; i < GENPD_RETRIES; i++) {
> >>>> ret = pm_genpd_add_device(pd, dev);
> >>>> if (ret != -EAGAIN)
> >>>> break;
> >>>> +
> >>>> + if (i > GENPD_RETRIES / 2)
> >>>> + udelay(GENPD_DELAY_US);
> >>>
> >>> In this execution path, we retry when getting -EAGAIN while believing
> >>> the reason to the error are only *temporary* as we are soon waiting
> >>> for all devices in the genpd to be system PM resumed. At least that's
> >>> my understanding to why we want to deal with -EAGAIN here, but I might
> >>> be wrong.
> >>>
> >>> In this regards, I wonder whether it could be better to re-try only a
> >>> few times but with a far longer interval time than a couple us. What
> >>> do you think?
> >>
> >> That's indeed viable. I have no idea for how long this temporary state can
> >> extend.
> >
> > A usual approach to this kind of thing is to use exponential fallback
> > where you increase the delay twice with respect to the previous one
> > every time.
>
> Right, but when do you give up?

Well, I guess you know what a reasonable timeout should be?

> Note that udelay() is a busy loop. Should it fall back to msleep() after
> a while?

If we can't fall back to msleep() at one point, you may as well simply poll
periodically as you did originally.

Thanks,
Rafael