2017-12-05 22:58:53

by James Hogan

[permalink] [raw]
Subject: [RFC PATCH] cpuidle/coupled: Handle broadcast enter failures

From: James Hogan <[email protected]>

If the hrtimer based broadcast tick device is in use, the enabling of
broadcast ticks by cpuidle may fail when the next broadcast event is
brought forward to match the next event due on the local tick device,
This is because setting the next event may migrate the hrtimer based
broadcast tick to the current CPU, which then makes
broadcast_needs_cpu() fail.

This isn't normally a problem as cpuidle handles it by falling back to
the deepest idle state not needing broadcast ticks, however when coupled
cpuidle is used it can happen after the coupled CPUs have all agreed on
a particular idle state, resulting in only one of the CPUs falling back
to a shallower state, and an attempt to couple two completely different
idle states which may not be safe.

Therefore extend cpuidle_enter_state_coupled() to be able to handle the
enabling of broadcast ticks directly, so that a failure can be detected
at the higher level, and all coupled CPUs can be made to fall back to
the same idle state.

This takes place after the idle state has been initially agreed. Each
CPU will then attempt to enable broadcast ticks (if necessary), and upon
failure it will update the requested_state[] array before a second
coupled parallel barrier so that all coupled CPUs can recognise the
change.

Signed-off-by: James Hogan <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Preeti U Murthy <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Is this an acceptable approach in principle?

Better/cleaner ideas to handle this are most welcome.

This doesn't directly address the problem that some of the time it won't
be possible to enter deeper idle states because of the hrtimer based
broadcast tick's affinity. The actual case I'm looking at is on MIPS
with cpuidle-cps, where the first core cannot (currently) go into a deep
idle state requiring broadcast ticks, so it'd be nice if the hrtimer
based broadcast tick device could just stay on core 0.
---
drivers/cpuidle/coupled.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/cpuidle/cpuidle.c | 24 ++++++++++++------------
include/linux/cpuidle.h | 4 ++++
3 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 147f38ea0fcd..bb76bb68dc29 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -23,6 +23,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/tick.h>

#include "cpuidle.h"

@@ -107,6 +108,7 @@ struct cpuidle_coupled {
int requested_state[NR_CPUS];
atomic_t ready_waiting_counts;
atomic_t abort_barrier;
+ atomic_t broadcast_barrier;
int online_count;
int refcnt;
int prevent;
@@ -480,6 +482,8 @@ int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
{
int entered_state = -1;
struct cpuidle_coupled *coupled = dev->coupled;
+ struct cpuidle_state *target_state;
+ bool broadcast;
int w;

if (!coupled)
@@ -600,8 +604,51 @@ int cpuidle_enter_state_coupled(struct cpuidle_device *dev,
/* all cpus have acked the coupled state */
next_state = cpuidle_coupled_get_state(dev, coupled);

+ target_state = &drv->states[next_state];
+ broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
+ /*
+ * Tell the time framework to switch to a broadcast timer because our
+ * local timer will be shut down. If a local timer is used from another
+ * CPU as a broadcast timer, this call may fail if it is not available.
+ * In that case all coupled CPUs must fall back to a different idle
+ * state.
+ */
+ if (broadcast) {
+ if (tick_broadcast_enter()) {
+ next_state = find_deepest_state(drv, dev,
+ target_state->exit_latency,
+ CPUIDLE_FLAG_TIMER_STOP, false);
+ if (next_state < 0) {
+ default_idle_call();
+ /* FIXME goto reset? */
+ entered_state = -EBUSY;
+ goto skip;
+ }
+ broadcast = false;
+
+ /* update state */
+ coupled->requested_state[dev->cpu] = next_state;
+ /* matches smp_rmb() in cpuidle_coupled_get_state() */
+ smp_wmb();
+ }
+
+ cpuidle_coupled_parallel_barrier(dev,
+ &coupled->broadcast_barrier);
+ }
+
+ /* all cpus have acked the coupled state */
+ next_state = cpuidle_coupled_get_state(dev, coupled);
+
entered_state = cpuidle_enter_state(dev, drv, next_state);

+ if (broadcast) {
+ if (WARN_ON_ONCE(!irqs_disabled()))
+ local_irq_disable();
+
+ tick_broadcast_exit();
+ }
+
+skip:
cpuidle_coupled_set_done(dev->cpu, coupled);

out:
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 68a16827f45f..85357cee31ed 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -73,11 +73,9 @@ int cpuidle_play_dead(void)
return -ENODEV;
}

-static int find_deepest_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev,
- unsigned int max_latency,
- unsigned int forbidden_flags,
- bool s2idle)
+int find_deepest_state(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ unsigned int max_latency, unsigned int forbidden_flags,
+ bool s2idle)
{
unsigned int latency_req = 0;
int i, ret = 0;
@@ -200,7 +198,8 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
* local timer will be shut down. If a local timer is used from another
* CPU as a broadcast timer, this call may fail if it is not available.
*/
- if (broadcast && tick_broadcast_enter()) {
+ if (!cpuidle_state_is_coupled(drv, index) &&
+ broadcast && tick_broadcast_enter()) {
index = find_deepest_state(drv, dev, target_state->exit_latency,
CPUIDLE_FLAG_TIMER_STOP, false);
if (index < 0) {
@@ -228,15 +227,16 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
/* The cpu is no longer idle or about to enter idle. */
sched_idle_set_state(NULL);

- if (broadcast) {
- if (WARN_ON_ONCE(!irqs_disabled()))
- local_irq_disable();
+ if (!cpuidle_state_is_coupled(drv, index)) {
+ if (broadcast) {
+ if (WARN_ON_ONCE(!irqs_disabled()))
+ local_irq_disable();

- tick_broadcast_exit();
- }
+ tick_broadcast_exit();
+ }

- if (!cpuidle_state_is_coupled(drv, index))
local_irq_enable();
+ }

diff = ktime_us_delta(time_end, time_start);
if (diff > INT_MAX)
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 8f7788d23b57..54856535f03d 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -153,6 +153,10 @@ extern void cpuidle_resume(void);
extern int cpuidle_enable_device(struct cpuidle_device *dev);
extern void cpuidle_disable_device(struct cpuidle_device *dev);
extern int cpuidle_play_dead(void);
+extern int find_deepest_state(struct cpuidle_driver *drv,
+ struct cpuidle_device *dev,
+ unsigned int max_latency,
+ unsigned int forbidden_flags, bool s2idle);

extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
static inline struct cpuidle_device *cpuidle_get_device(void)
--
2.14.1


2017-12-07 11:17:30

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC PATCH] cpuidle/coupled: Handle broadcast enter failures

On 05/12/2017 23:55, James Hogan wrote:
> From: James Hogan <[email protected]>
>
> If the hrtimer based broadcast tick device is in use, the enabling of
> broadcast ticks by cpuidle may fail when the next broadcast event is
> brought forward to match the next event due on the local tick device,
> This is because setting the next event may migrate the hrtimer based
> broadcast tick to the current CPU, which then makes
> broadcast_needs_cpu() fail.
>
> This isn't normally a problem as cpuidle handles it by falling back to
> the deepest idle state not needing broadcast ticks, however when coupled
> cpuidle is used it can happen after the coupled CPUs have all agreed on
> a particular idle state, resulting in only one of the CPUs falling back
> to a shallower state, and an attempt to couple two completely different
> idle states which may not be safe.
>
> Therefore extend cpuidle_enter_state_coupled() to be able to handle the
> enabling of broadcast ticks directly, so that a failure can be detected
> at the higher level, and all coupled CPUs can be made to fall back to
> the same idle state.
>
> This takes place after the idle state has been initially agreed. Each
> CPU will then attempt to enable broadcast ticks (if necessary), and upon
> failure it will update the requested_state[] array before a second
> coupled parallel barrier so that all coupled CPUs can recognise the
> change.
>
> Signed-off-by: James Hogan <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Daniel Lezcano <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Preeti U Murthy <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: Paul Burton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> Is this an acceptable approach in principle?
>
> Better/cleaner ideas to handle this are most welcome.
>
> This doesn't directly address the problem that some of the time it won't
> be possible to enter deeper idle states because of the hrtimer based
> broadcast tick's affinity. The actual case I'm looking at is on MIPS
> with cpuidle-cps, where the first core cannot (currently) go into a deep
> idle state requiring broadcast ticks, so it'd be nice if the hrtimer
> based broadcast tick device could just stay on core 0.
> ---

Before commenting this patch, I would like to understand why the couple
idle state is needed for the MIPS, what in the architecture forces the
usage of the couple idle state?

The hrtimer broadcast mechanism is only needed if there isn't a backup
timer outside of the idle state's power domain. That's the case on this
platform?


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2017-12-07 11:48:09

by James Hogan

[permalink] [raw]
Subject: Re: [RFC PATCH] cpuidle/coupled: Handle broadcast enter failures

On Thu, Dec 07, 2017 at 12:17:25PM +0100, Daniel Lezcano wrote:
> On 05/12/2017 23:55, James Hogan wrote:
> > From: James Hogan <[email protected]>
> >
> > If the hrtimer based broadcast tick device is in use, the enabling of
> > broadcast ticks by cpuidle may fail when the next broadcast event is
> > brought forward to match the next event due on the local tick device,
> > This is because setting the next event may migrate the hrtimer based
> > broadcast tick to the current CPU, which then makes
> > broadcast_needs_cpu() fail.
> >
> > This isn't normally a problem as cpuidle handles it by falling back to
> > the deepest idle state not needing broadcast ticks, however when coupled
> > cpuidle is used it can happen after the coupled CPUs have all agreed on
> > a particular idle state, resulting in only one of the CPUs falling back
> > to a shallower state, and an attempt to couple two completely different
> > idle states which may not be safe.
> >
> > Therefore extend cpuidle_enter_state_coupled() to be able to handle the
> > enabling of broadcast ticks directly, so that a failure can be detected
> > at the higher level, and all coupled CPUs can be made to fall back to
> > the same idle state.
> >
> > This takes place after the idle state has been initially agreed. Each
> > CPU will then attempt to enable broadcast ticks (if necessary), and upon
> > failure it will update the requested_state[] array before a second
> > coupled parallel barrier so that all coupled CPUs can recognise the
> > change.
> >
> > Signed-off-by: James Hogan <[email protected]>
> > Cc: "Rafael J. Wysocki" <[email protected]>
> > Cc: Daniel Lezcano <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Preeti U Murthy <[email protected]>
> > Cc: Ralf Baechle <[email protected]>
> > Cc: Paul Burton <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > Is this an acceptable approach in principle?
> >
> > Better/cleaner ideas to handle this are most welcome.
> >
> > This doesn't directly address the problem that some of the time it won't
> > be possible to enter deeper idle states because of the hrtimer based
> > broadcast tick's affinity. The actual case I'm looking at is on MIPS
> > with cpuidle-cps, where the first core cannot (currently) go into a deep
> > idle state requiring broadcast ticks, so it'd be nice if the hrtimer
> > based broadcast tick device could just stay on core 0.
> > ---
>
> Before commenting this patch, I would like to understand why the couple
> idle state is needed for the MIPS, what in the architecture forces the
> usage of the couple idle state?

Hardware multithreading.

Each physical core may have more than one hardware thread (VPE or VP in
MIPS lingo), each of which appears as a separate CPU to Linux. The lower
power states are all effective at the core level though:
- non-coherent wait - the hardware threads share physical caches, so
coherency can only be turned off when all hardware threads are in a
safe state, else they (1) wouldn't be coherent with the rest of the
system and (2) could bring stuff into the cache which isn't kept
coherent, requiring I presume a second cache flush.
- clock gated - must go non-coherent first, and applies to the whole
core and all the physical resources shared by the VP(E)s.
- power gated - again must go non-coherent first, and applies to the
whole core and all the physical resources shared by the VP(E)s.

>
> The hrtimer broadcast mechanism is only needed if there isn't a backup
> timer outside of the idle state's power domain. That's the case on this
> platform?

I believe there is an external timer, and I believe we recommend
customers implement one for use as a clocksource anyway (in case of
frequency scaling), but on this particular platform the driver isn't
upstream yet.

If its something that shouldn't be supported in Linux, perhaps a simple
WARN_ON is the better approach (i.e. if the broadcast tick can't be
enabled in the current place and its a coupled idle state), though it
does at least seem to work now with this and a couple of other patches
(though I haven't been able to take any power measurements yet).

Cheers
James


Attachments:
(No filename) (4.27 kB)
signature.asc (833.00 B)
Digital signature
Download all attachments

2017-12-07 13:03:53

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC PATCH] cpuidle/coupled: Handle broadcast enter failures

On 07/12/2017 12:47, James Hogan wrote:
> On Thu, Dec 07, 2017 at 12:17:25PM +0100, Daniel Lezcano wrote:
>> On 05/12/2017 23:55, James Hogan wrote:
>>> From: James Hogan <[email protected]>
>>>
>>> If the hrtimer based broadcast tick device is in use, the enabling of
>>> broadcast ticks by cpuidle may fail when the next broadcast event is
>>> brought forward to match the next event due on the local tick device,
>>> This is because setting the next event may migrate the hrtimer based
>>> broadcast tick to the current CPU, which then makes
>>> broadcast_needs_cpu() fail.
>>>
>>> This isn't normally a problem as cpuidle handles it by falling back to
>>> the deepest idle state not needing broadcast ticks, however when coupled
>>> cpuidle is used it can happen after the coupled CPUs have all agreed on
>>> a particular idle state, resulting in only one of the CPUs falling back
>>> to a shallower state, and an attempt to couple two completely different
>>> idle states which may not be safe.
>>>
>>> Therefore extend cpuidle_enter_state_coupled() to be able to handle the
>>> enabling of broadcast ticks directly, so that a failure can be detected
>>> at the higher level, and all coupled CPUs can be made to fall back to
>>> the same idle state.
>>>
>>> This takes place after the idle state has been initially agreed. Each
>>> CPU will then attempt to enable broadcast ticks (if necessary), and upon
>>> failure it will update the requested_state[] array before a second
>>> coupled parallel barrier so that all coupled CPUs can recognise the
>>> change.
>>>
>>> Signed-off-by: James Hogan <[email protected]>
>>> Cc: "Rafael J. Wysocki" <[email protected]>
>>> Cc: Daniel Lezcano <[email protected]>
>>> Cc: Frederic Weisbecker <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: Preeti U Murthy <[email protected]>
>>> Cc: Ralf Baechle <[email protected]>
>>> Cc: Paul Burton <[email protected]>
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> Cc: [email protected]
>>> ---
>>> Is this an acceptable approach in principle?
>>>
>>> Better/cleaner ideas to handle this are most welcome.
>>>
>>> This doesn't directly address the problem that some of the time it won't
>>> be possible to enter deeper idle states because of the hrtimer based
>>> broadcast tick's affinity. The actual case I'm looking at is on MIPS
>>> with cpuidle-cps, where the first core cannot (currently) go into a deep
>>> idle state requiring broadcast ticks, so it'd be nice if the hrtimer
>>> based broadcast tick device could just stay on core 0.
>>> ---
>>
>> Before commenting this patch, I would like to understand why the couple
>> idle state is needed for the MIPS, what in the architecture forces the
>> usage of the couple idle state?
>
> Hardware multithreading.
>
> Each physical core may have more than one hardware thread (VPE or VP in
> MIPS lingo), each of which appears as a separate CPU to Linux. The lower
> power states are all effective at the core level though:
> - non-coherent wait - the hardware threads share physical caches, so
> coherency can only be turned off when all hardware threads are in a
> safe state, else they (1) wouldn't be coherent with the rest of the
> system and (2) could bring stuff into the cache which isn't kept
> coherent, requiring I presume a second cache flush.
> - clock gated - must go non-coherent first, and applies to the whole
> core and all the physical resources shared by the VP(E)s.
> - power gated - again must go non-coherent first, and applies to the
> whole core and all the physical resources shared by the VP(E)s.

The couple idle state was introduced to compensate hardware mis-design.

There are a couple of examples omap4 and exynos4 where only CPU0 can do
some PM operations. AFAICT, from the feedbacks I got, couple idle state
consume more energy than it saves (very likely because of the busy loop
sync mechanism).

I did some tests with a 4 cores system and the overhead was so high the
system had a very bad response time, so dropped the cluster idle state.

So before going further in the couple idle path, I suggest you
investigate first a sync mechanism which may be implemented by the
hardware. One good example is the cpuidle-ux500.c.

With a proper last man standing sync, we can then take care of the timer
broadcast thing. I can give you a hand for that if you need.

>> The hrtimer broadcast mechanism is only needed if there isn't a backup
>> timer outside of the idle state's power domain. That's the case on this
>> platform?
>
> I believe there is an external timer, and I believe we recommend
> customers implement one for use as a clocksource anyway (in case of
> frequency scaling), but on this particular platform the driver isn't
> upstream yet.

Perhaps, you can upstream the external timer driver first?

> If its something that shouldn't be supported in Linux, perhaps a simple
> WARN_ON is the better approach (i.e. if the broadcast tick can't be
> enabled in the current place and its a coupled idle state), though it
> does at least seem to work now with this and a couple of other patches
> (though I haven't been able to take any power measurements yet).




--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog