2013-09-11 02:53:39

by Preeti U Murthy

Subject: [PATCH V3 0/6] cpuidle/ppc: Enable broadcast support for deep idle states

On PowerPC, when CPUs enter deep idle states, their local timers get
switched off. An external clock device needs to be programmed to wake them
up at their next timer event.
On PowerPC, we do not have an external device equivalent to HPET,
which is currently used on architectures like x86 under the same scenario.
Instead we assign the local timer of one of the CPUs to do this job.

This patchset is an attempt to hook onto the existing timer broadcast
framework in the kernel by using the local timer of one of the CPUs to do the
job of the external clock device.

On expiry of this device, the broadcast framework today has the infrastructure
to send ipis to all such CPUs whose local timers have expired. Hence the term
"broadcast" and the ipi sent is called the broadcast ipi.

This patch series is ported on top of 3.11-rc7 + the cpuidle driver backend
for power posted by Deepthi Dharwar recently.
http://comments.gmane.org/gmane.linux.ports.ppc.embedded/63556

Changes in V3:

1. Fix the way in which a broadcast ipi is handled on the idling cpus. Timer
handling on a broadcast ipi is now done without skipping any timer
stats generation.

2. Fix a bug in the programming of the hrtimer meant to do broadcast. Program
it to trigger at the earlier of a "broadcast period", and the next wakeup
event. By introducing the "broadcast period" as the maximum period after
which the broadcast hrtimer can fire, we ensure that we do not miss
wakeups in corner cases.

3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast
to fire immediately on the new broadcast cpu. This ensures that we do not miss
a broadcast that is pending in the near future.

4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while
initializing bc_hrtimer since we are in an atomic context and cannot sleep.

5. Use the broadcast ipi to wake up the newly nominated broadcast cpu on
hotplug of the old instead of smp_call_function_single(). This is because
interrupts are disabled at this point and we should not be using
smp_call_function_single() or its children in this context to send an ipi.

6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig.

7. Fix coding style issues.

Changes in V2: https://lkml.org/lkml/2013/8/14/239

1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast
CPU by queueing a hrtimer dedicated to do broadcast.

V1 posting: https://lkml.org/lkml/2013/7/25/740.

The patchset has been tested for stability in idle and during multi-threaded
ebizzy runs.

Many thanks to Ben H, Frederic Weisbecker, Li Yang, Srivatsa S. Bhat and
Vaidyanathan Srinivasan for all their comments and suggestions so far.

---

Preeti U Murthy (4):
cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc
cpuidle/ppc: Introduce the deep idle state in which the local timers stop
cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old

Srivatsa S. Bhat (2):
powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
powerpc: Implement broadcast timer interrupt as an IPI message


arch/powerpc/Kconfig | 1
arch/powerpc/include/asm/smp.h | 3 -
arch/powerpc/include/asm/time.h | 4 +
arch/powerpc/kernel/smp.c | 23 +++-
arch/powerpc/kernel/time.c | 143 ++++++++++++++++++++------
arch/powerpc/platforms/cell/interrupt.c | 2
arch/powerpc/platforms/ps3/smp.c | 2
drivers/cpuidle/cpuidle-ibm-power.c | 172 +++++++++++++++++++++++++++++++
scripts/kconfig/streamline_config.pl | 0
9 files changed, 307 insertions(+), 43 deletions(-)
mode change 100644 => 100755 scripts/kconfig/streamline_config.pl


2013-09-11 02:53:58

by Preeti U Murthy

Subject: [PATCH V3 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)

From: Srivatsa S. Bhat <[email protected]>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map to
a common implementation - generic_smp_call_function_single_interrupt(). So, we
can consolidate them and save one of the IPI message slots (which are precious,
since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC using
PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it
can be used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/include/asm/smp.h | 2 +-
arch/powerpc/kernel/smp.c | 12 +++++-------
arch/powerpc/platforms/cell/interrupt.c | 2 +-
arch/powerpc/platforms/ps3/smp.c | 2 +-
4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 48cfc85..a632b6e 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
*
* Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
* in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_CALL_FUNCTION 0
+#define PPC_MSG_UNUSED 0
#define PPC_MSG_RESCHEDULE 1
#define PPC_MSG_CALL_FUNC_SINGLE 2
#define PPC_MSG_DEBUGGER_BREAK 3
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 38b0ba6..bc41e9f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr)
}
#endif /* CONFIG_PPC64 */

-static irqreturn_t call_function_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
{
- generic_smp_call_function_interrupt();
+ /* This slot is unused and hence available for use, if needed */
return IRQ_HANDLED;
}

@@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
}

static irq_handler_t smp_ipi_action[] = {
- [PPC_MSG_CALL_FUNCTION] = call_function_action,
+ [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
};

const char *smp_ipi_name[] = {
- [PPC_MSG_CALL_FUNCTION] = "ipi call function",
+ [PPC_MSG_UNUSED] = "ipi unused",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);

#ifdef __BIG_ENDIAN
- if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION)))
- generic_smp_call_function_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
unsigned int cpu;

for_each_cpu(cpu, mask)
- do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
+ do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
}

#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..28166e4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)

void iic_request_IPIs(void)
{
- iic_request_ipi(PPC_MSG_CALL_FUNCTION);
+ iic_request_ipi(PPC_MSG_UNUSED);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..488f069 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/

- BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION != 0);
+ BUILD_BUG_ON(PPC_MSG_UNUSED != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);

2013-09-11 02:54:15

by Preeti U Murthy

Subject: [PATCH V3 2/6] powerpc: Implement broadcast timer interrupt as an IPI message

From: Srivatsa S. Bhat <[email protected]>

For scalability and performance reasons, we want the broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster
than the smp_call_function mechanism because the IPI handlers are fixed
and hence they don't involve costly operations such as adding IPI handlers
to the target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.
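
As a rough illustration of why a fixed message slot is cheap, here is a
user-space sketch of the demultiplexing step. It assumes, as the big-endian
test in smp_ipi_demux() suggests, that each of the four messages owns one
byte of a 32-bit word, so message N maps to bit (24 - 8 * N). The names
msg_bit(), post_message(), demux() and hits[] are hypothetical and exist only
in this sketch; the kernel clears the word with xchg(), which is modelled
here by a plain store.

```c
#include <assert.h>
#include <stdint.h>

/* Slot numbers mirror the PPC_MSG_* values used in this patch. */
enum {
	MSG_TIMER = 0,
	MSG_RESCHEDULE = 1,
	MSG_CALL_FUNC_SINGLE = 2,
	MSG_DEBUGGER_BREAK = 3,
	MSG_COUNT = 4
};

/* Per-message hit counters; they stand in for the fixed handlers. */
static int hits[MSG_COUNT];

/* Bit that represents message 'msg' in the pending word. */
static uint32_t msg_bit(int msg)
{
	assert(msg >= 0 && msg < MSG_COUNT);
	return UINT32_C(1) << (24 - 8 * msg);
}

/* Sender side of the model: mark a message as pending. */
static uint32_t post_message(uint32_t pending, int msg)
{
	return pending | msg_bit(msg);
}

/*
 * Receiver side of the model: take all pending messages at once and
 * dispatch each to its fixed handler. No queue and no lock is needed.
 * Returns the number of messages handled.
 */
static int demux(uint32_t *pending)
{
	uint32_t all = *pending;
	int handled = 0;

	*pending = 0; /* the kernel does this atomically with xchg() */
	for (int msg = 0; msg < MSG_COUNT; msg++) {
		if (all & msg_bit(msg)) {
			hits[msg]++;
			handled++;
		}
	}
	return handled;
}
```

Posting MSG_TIMER and MSG_CALL_FUNC_SINGLE to an empty word and then calling
demux() handles both in a single pass, which is the property the broadcast
IPI relies on.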

Signed-off-by: Srivatsa S. Bhat <[email protected]>
[Changelog modified by [email protected]]
Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/include/asm/smp.h | 3 ++-
arch/powerpc/include/asm/time.h | 1 +
arch/powerpc/kernel/smp.c | 19 +++++++++++++++----
arch/powerpc/kernel/time.c | 4 ++++
arch/powerpc/platforms/cell/interrupt.c | 2 +-
arch/powerpc/platforms/ps3/smp.c | 2 +-
scripts/kconfig/streamline_config.pl | 0
7 files changed, 24 insertions(+), 7 deletions(-)
mode change 100644 => 100755 scripts/kconfig/streamline_config.pl

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index a632b6e..22f6d63 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
*
* Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
* in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_UNUSED 0
+#define PPC_MSG_TIMER 0
#define PPC_MSG_RESCHEDULE 1
#define PPC_MSG_CALL_FUNC_SINGLE 2
#define PPC_MSG_DEBUGGER_BREAK 3
@@ -194,6 +194,7 @@ extern struct smp_ops_t *smp_ops;

extern void arch_send_call_function_single_ipi(int cpu);
extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);

/* Definitions relative to the secondary CPU spin loop
* and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..4e35282 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
struct rtc_time;
extern void to_tm(int tim, struct rtc_time * tm);
extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);

extern void generic_calibrate_decr(void);

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
#include <asm/ptrace.h>
#include <linux/atomic.h>
#include <asm/irq.h>
+#include <asm/hw_irq.h>
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
}
#endif /* CONFIG_PPC64 */

-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
{
- /* This slot is unused and hence available for use, if needed */
+ decrementer_timer_interrupt();
return IRQ_HANDLED;
}

@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
}

static irq_handler_t smp_ipi_action[] = {
- [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */
+ [PPC_MSG_TIMER] = timer_action,
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
};

const char *smp_ipi_name[] = {
- [PPC_MSG_UNUSED] = "ipi unused",
+ [PPC_MSG_TIMER] = "ipi timer",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);

#ifdef __BIG_ENDIAN
+ if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+ decrementer_timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
}

+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+ unsigned int cpu;
+
+ for_each_cpu(cpu, mask)
+ do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
void smp_send_debugger_break(void)
{
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..0dfa0c5 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,6 +813,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
decrementer_set_next_event(DECREMENTER_MAX, dev);
}

+void decrementer_timer_interrupt(void)
+{
+}
+
static void register_decrementer_clockevent(int cpu)
{
struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)

void iic_request_IPIs(void)
{
- iic_request_ipi(PPC_MSG_UNUSED);
+ iic_request_ipi(PPC_MSG_TIMER);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/

- BUILD_BUG_ON(PPC_MSG_UNUSED != 0);
+ BUILD_BUG_ON(PPC_MSG_TIMER != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);
diff --git a/scripts/kconfig/streamline_config.pl b/scripts/kconfig/streamline_config.pl
old mode 100644
new mode 100755

2013-09-11 02:54:30

by Preeti U Murthy

Subject: [PATCH V3 3/6] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines

On PowerPC, when CPUs enter deep idle states, their local timers get
switched off. The local timer is called the decrementer. An external clock
device needs to be programmed to wake them up at their next timer event.
On PowerPC, we do not have an external device equivalent to HPET,
which is currently used on architectures like x86 under the same scenario.
Instead we assign the local timer of one of the CPUs to do this job.

On expiry of this timer, the broadcast framework today has the infrastructure
to send ipis to all such CPUs whose local timers have expired.

When such an ipi is received, the cpus in deep idle should handle their
expired timers. It should be as though they were woken up from a
timer interrupt itself. Hence this external ipi serves as an emulated timer
interrupt for the cpus in deep idle.

Therefore ideally on ppc, these cpus should call timer_interrupt() which
is the interrupt handler for a decrementer interrupt. But timer_interrupt()
also contains routines which are usually performed in an interrupt handler.
These are not required to be done in this scenario as the external interrupt
handler takes care of them.

Therefore split up timer_interrupt() into routines performed during regular
interrupt handling and __timer_interrupt(), which takes care of running local
timers and collecting time related stats. Now on a broadcast ipi, call
__timer_interrupt().
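
The shape of the split can be sketched in user space as follows. This is a
simplified model, not the kernel code: struct cpu_model and every function
below are hypothetical names, and the only thing modelled is that the
interrupt entry/exit bookkeeping happens once per interrupt while the timer
work is shared by both paths.

```c
#include <assert.h>

/* Hypothetical per-cpu state for the model. */
struct cpu_model {
	unsigned int irq_depth;     /* current interrupt nesting level */
	unsigned int max_irq_depth; /* deepest nesting seen so far */
	unsigned int timer_irqs;    /* timer stat, as in irq_stat.timer_irqs */
};

static void irq_enter_model(struct cpu_model *cpu)
{
	cpu->irq_depth++;
	if (cpu->irq_depth > cpu->max_irq_depth)
		cpu->max_irq_depth = cpu->irq_depth;
}

static void irq_exit_model(struct cpu_model *cpu)
{
	assert(cpu->irq_depth > 0);
	cpu->irq_depth--;
}

/* Stands in for __timer_interrupt(): timer work and stats only. */
static void timer_work(struct cpu_model *cpu)
{
	assert(cpu->irq_depth > 0); /* caller already entered irq context */
	cpu->timer_irqs++;
}

/* Stands in for timer_interrupt(): bookkeeping plus the shared work. */
static void decrementer_entry(struct cpu_model *cpu)
{
	irq_enter_model(cpu);
	timer_work(cpu);
	irq_exit_model(cpu);
}

/*
 * Stands in for the external interrupt path that delivers the
 * broadcast ipi: it does its own bookkeeping, so it calls the shared
 * work directly instead of going through decrementer_entry().
 */
static void broadcast_ipi_entry(struct cpu_model *cpu)
{
	irq_enter_model(cpu);
	timer_work(cpu);
	irq_exit_model(cpu);
}
```

Running one decrementer interrupt and one broadcast ipi through the model
counts two timer interrupts while the nesting depth never exceeds one.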

Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/kernel/time.c | 69 ++++++++++++++++++++++++--------------------
1 file changed, 37 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 0dfa0c5..eb48291 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,42 @@ void arch_irq_work_raise(void)

#endif /* CONFIG_IRQ_WORK */

+static void __timer_interrupt(void)
+{
+ struct pt_regs *regs = get_irq_regs();
+ u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+ struct clock_event_device *evt = &__get_cpu_var(decrementers);
+ u64 now;
+
+ __get_cpu_var(irq_stat).timer_irqs++;
+ trace_timer_interrupt_entry(regs);
+
+ if (test_irq_work_pending()) {
+ clear_irq_work_pending();
+ irq_work_run();
+ }
+
+ now = get_tb_or_rtc();
+ if (now >= *next_tb) {
+ *next_tb = ~(u64)0;
+ if (evt->event_handler)
+ evt->event_handler(evt);
+ } else {
+ now = *next_tb - now;
+ if (now <= DECREMENTER_MAX)
+ set_dec((int)now);
+ }
+
+#ifdef CONFIG_PPC64
+ /* collect purr register values often, for accurate calculations */
+ if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+ struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+ cu->current_tb = mfspr(SPRN_PURR);
+ }
+#endif
+ trace_timer_interrupt_exit(regs);
+}
+
/*
* timer_interrupt - gets called when the decrementer overflows,
* with interrupts disabled.
@@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs)
{
struct pt_regs *old_regs;
u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
- struct clock_event_device *evt = &__get_cpu_var(decrementers);
- u64 now;

/* Ensure a positive value is written to the decrementer, or else
* some CPUs will continue to take decrementer exceptions.
@@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs)
*/
may_hard_irq_enable();

- __get_cpu_var(irq_stat).timer_irqs++;
-
#if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
if (atomic_read(&ppc_n_lost_interrupts) != 0)
do_IRQ(regs);
@@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs)
old_regs = set_irq_regs(regs);
irq_enter();

- trace_timer_interrupt_entry(regs);
-
- if (test_irq_work_pending()) {
- clear_irq_work_pending();
- irq_work_run();
- }
-
- now = get_tb_or_rtc();
- if (now >= *next_tb) {
- *next_tb = ~(u64)0;
- if (evt->event_handler)
- evt->event_handler(evt);
- } else {
- now = *next_tb - now;
- if (now <= DECREMENTER_MAX)
- set_dec((int)now);
- }
-
-#ifdef CONFIG_PPC64
- /* collect purr register values often, for accurate calculations */
- if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
- struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
- cu->current_tb = mfspr(SPRN_PURR);
- }
-#endif
-
- trace_timer_interrupt_exit(regs);
-
+ __timer_interrupt();
irq_exit();
set_irq_regs(old_regs);
}

2013-09-11 02:54:46

by Preeti U Murthy

Subject: [PATCH V3 4/6] cpuidle/ppc: Add basic infrastructure to support the broadcast framework on ppc

The broadcast framework in the kernel expects an external clock device which will
continue functioning in deep idle states as well. This ability is specified by
the absence of the C3STOP feature. This is the device that it relies
upon to wake up cpus in deep idle states whose local timers/clock devices get
switched off in deep idle states.

On ppc we do not have such an external device. Therefore we introduce a
pseudo clock device called broadcast_clockevent, which has the features of
this external clock device. Having such a device qualifies the cpus to
enter and exit deep idle states from the point of view of the broadcast
framework, because there is an external device to wake them up.
Specifically the broadcast framework uses this device's event
handler and next_event members in its functioning. On ppc we use this
device as the gateway into the broadcast framework and *not* as a
timer. An explicit timer infrastructure will be developed in the following
patches to keep track of when to wake up cpus in deep idle.

Since this device is a pseudo device, it can be safely assumed to work for
all cpus. Therefore its cpumask is set to cpu_possible_mask. Also due to the
same reason, the set_next_event() routine associated with this device is a
nop.

The broadcast framework relies on a broadcast functionality being made
available in the .broadcast member of the local clock devices on all cpus.
This function is called upon by the broadcast framework on one of the nominated
cpus, to send ipis to all the cpus in deep idle at their expired timer events.
This patch also initializes the .broadcast member of the decrementer whose
job is to send the broadcast ipis.

When cpus inform the broadcast framework that they are entering deep idle,
their local timers are put in shutdown mode. On ppc, this means setting
decrementers_next_tb and programming the decrementer to DECREMENTER_MAX.
On being woken up by the broadcast ipi, these cpus call __timer_interrupt(),
which runs the local timers only if decrementers_next_tb has expired.
Therefore, on being woken up by the broadcast ipi, set decrementers_next_tb
to now before calling __timer_interrupt().
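
The reason for resetting the per-cpu next timebase value can be seen in a
small user-space model. It assumes that shutdown leaves the value roughly
DECREMENTER_MAX ticks in the future; struct dec_model, DEC_MAX_MODEL and the
function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEC_MAX_MODEL UINT64_C(0x7fffffff) /* assumed decrementer range */
#define NEXT_TB_NONE  UINT64_MAX           /* "no event armed" marker */

/* Hypothetical per-cpu timer state. */
struct dec_model {
	uint64_t next_tb;        /* timebase value of the next local event */
	unsigned int events_run; /* how often the event handler ran */
};

/* Model of putting the local timer in shutdown mode at time 'now'. */
static void shutdown_local_timer(struct dec_model *d, uint64_t now)
{
	d->next_tb = now + DEC_MAX_MODEL;
}

/* Model of the expiry check: run the handler only if next_tb passed. */
static bool timer_work(struct dec_model *d, uint64_t now)
{
	assert(d != 0);
	if (now >= d->next_tb) {
		d->next_tb = NEXT_TB_NONE;
		d->events_run++;
		return true;
	}
	return false;
}

/* Model of the broadcast ipi path: force expiry, then do the work. */
static bool broadcast_wakeup(struct dec_model *d, uint64_t now)
{
	d->next_tb = now;
	return timer_work(d, now);
}
```

Without the reset, a cpu that was shut down at tick 1000 and woken at tick
2000 would skip its event handler, because its stored next_tb still lies far
in the future; with the reset, the handler runs.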

Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/time.h | 1 +
arch/powerpc/kernel/time.c | 69 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index dbd9d3c..550fc04 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -130,6 +130,7 @@ config PPC
select GENERIC_CMOS_UPDATE
select GENERIC_TIME_VSYSCALL_OLD
select GENERIC_CLOCKEVENTS
+ select GENERIC_CLOCKEVENTS_BROADCAST
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 4e35282..264dc96 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,6 +24,7 @@ extern unsigned long tb_ticks_per_jiffy;
extern unsigned long tb_ticks_per_usec;
extern unsigned long tb_ticks_per_sec;
extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;

struct rtc_time;
extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index eb48291..bda78bb 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
#include <linux/timex.h>
#include <linux/kernel_stat.h>
#include <linux/time.h>
+#include <linux/timer.h>
#include <linux/init.h>
#include <linux/profile.h>
#include <linux/cpu.h>
@@ -97,8 +98,13 @@ static struct clocksource clocksource_timebase = {

static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
+static void broadcast_set_mode(enum clock_event_mode mode,
+ struct clock_event_device *dev);
static void decrementer_set_mode(enum clock_event_mode mode,
struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);

struct clock_event_device decrementer_clockevent = {
.name = "decrementer",
@@ -106,12 +112,24 @@ struct clock_event_device decrementer_clockevent = {
.irq = 0,
.set_next_event = decrementer_set_next_event,
.set_mode = decrementer_set_mode,
- .features = CLOCK_EVT_FEAT_ONESHOT,
+ .broadcast = decrementer_timer_broadcast,
+ .features = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
};
EXPORT_SYMBOL(decrementer_clockevent);

+struct clock_event_device broadcast_clockevent = {
+ .name = "broadcast",
+ .rating = 200,
+ .irq = 0,
+ .set_next_event = broadcast_set_next_event,
+ .set_mode = broadcast_set_mode,
+ .features = CLOCK_EVT_FEAT_ONESHOT,
+};
+EXPORT_SYMBOL(broadcast_clockevent);
+
DEFINE_PER_CPU(u64, decrementers_next_tb);
static DEFINE_PER_CPU(struct clock_event_device, decrementers);
+static struct clock_event_device bc_timer;

#define XSEC_PER_SEC (1024*1024)

@@ -811,6 +829,19 @@ static int decrementer_set_next_event(unsigned long evt,
return 0;
}

+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev)
+{
+ return 0;
+}
+
+static void broadcast_set_mode(enum clock_event_mode mode,
+ struct clock_event_device *dev)
+{
+ if (mode != CLOCK_EVT_MODE_ONESHOT)
+ broadcast_set_next_event(DECREMENTER_MAX, dev);
+}
+
static void decrementer_set_mode(enum clock_event_mode mode,
struct clock_event_device *dev)
{
@@ -820,6 +851,15 @@ static void decrementer_set_mode(enum clock_event_mode mode,

void decrementer_timer_interrupt(void)
{
+ u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+ *next_tb = get_tb_or_rtc();
+ __timer_interrupt();
+}
+
+static void decrementer_timer_broadcast(const struct cpumask *mask)
+{
+ arch_send_tick_broadcast(mask);
}

static void register_decrementer_clockevent(int cpu)
@@ -835,6 +875,19 @@ static void register_decrementer_clockevent(int cpu)
clockevents_register_device(dec);
}

+static void register_broadcast_clockevent(int cpu)
+{
+ struct clock_event_device *bc_evt = &bc_timer;
+
+ *bc_evt = broadcast_clockevent;
+ bc_evt->cpumask = cpu_possible_mask;
+
+ printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
+ bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
+
+ clockevents_register_device(bc_evt);
+}
+
static void __init init_decrementer_clockevent(void)
{
int cpu = smp_processor_id();
@@ -849,6 +902,19 @@ static void __init init_decrementer_clockevent(void)
register_decrementer_clockevent(cpu);
}

+static void __init init_broadcast_clockevent(void)
+{
+ int cpu = smp_processor_id();
+
+ clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
+
+ broadcast_clockevent.max_delta_ns =
+ clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
+ broadcast_clockevent.min_delta_ns =
+ clockevent_delta2ns(2, &broadcast_clockevent);
+ register_broadcast_clockevent(cpu);
+}
+
void secondary_cpu_time_init(void)
{
/* Start the decrementer on CPUs that have manual control
@@ -925,6 +991,7 @@ void __init time_init(void)
clocksource_init();

init_decrementer_clockevent();
+ init_broadcast_clockevent();
}


2013-09-11 02:55:05

by Preeti U Murthy

Subject: [PATCH V3 5/6] cpuidle/ppc: Introduce the deep idle state in which the local timers stop

Now that we have the basic infrastructure set up to make use of the broadcast
framework, introduce the deep idle state in which cpus need to use the
functionality provided by this infrastructure to wake them up at their
expired timer events. On ppc this deep idle state is called sleep.
In this patch, however, we introduce longnap, which emulates the sleep
state by disabling timer interrupts. This is a stand-in until sleep support
is made available in the kernel.

Since on ppc we do not have an external device that can wake up cpus in deep
idle, the local timer of one of the cpus needs to be nominated to do this job.
This cpu is called the broadcast cpu/bc_cpu. Only if the bc_cpu is nominated
will the remaining cpus be allowed to enter deep idle state after notifying
the broadcast framework about their next timer event. The bc_cpu is not allowed
to enter deep idle state.

The first cpu that enters longnap is made the bc_cpu. It queues a hrtimer onto
itself which expires after a broadcast period. The job of this
hrtimer is to call into the broadcast framework[1] using the pseudo clock device
that we have initialized; the framework then sends an ipi to the cpus whose
wakeup times have expired.
On each expiry, the hrtimer is reprogrammed to the earlier of the
next pending timer event of the cpus in deep idle and the broadcast period, so
as to not miss any wakeups.
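
The reprogramming rule can be sketched as a pure function over nanosecond
values. The name next_bc_delay_ns() and the NO_EVENT marker are hypothetical;
NO_EVENT plays the role of a next_event of KTIME_MAX, in which case the
hrtimer is not restarted. Clamping a negative interval to zero is an addition
of this sketch and is not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define NO_EVENT INT64_MAX /* no cpu in deep idle has a pending event */

/*
 * Returns the delay, in ns, after which the broadcast hrtimer should
 * fire again, or -1 if it should not be restarted at all.
 */
static int64_t next_bc_delay_ns(int64_t now_ns, int64_t next_event_ns,
				int64_t period_ns)
{
	int64_t interval;

	assert(period_ns > 0);
	if (next_event_ns == NO_EVENT)
		return -1;

	interval = next_event_ns - now_ns;
	if (interval < 0)
		interval = 0; /* event already due: fire as soon as possible */

	/* Earlier of the next wakeup and the broadcast period. */
	return interval < period_ns ? interval : period_ns;
}
```

With a 10 ms period, a wakeup due in 4 ms yields a 4 ms delay, while a wakeup
due in 50 ms yields the 10 ms period, so the bc_cpu re-checks before the
wakeup can be missed.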

The broadcast period is the maximum duration for which the
bc_cpu need not concern itself with checking for expired timer events on cpus
in deep idle. The broadcast period is set to a jiffy in this patch for debug
purposes. Ideally it needn't be smaller than the target_residency of the deep
idle state.
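
As a worked example of the one-jiffy period, the conversion from timebase
ticks to nanoseconds used by get_next_bc_tick() can be reproduced in user
space. The 512 MHz timebase and HZ=100 in the usage note are assumed values
chosen for illustration only, and jiffy_period_ns() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert one jiffy, given in timebase ticks, to nanoseconds.
 * The integer division truncates to whole microseconds first,
 * as the expression in get_next_bc_tick() does.
 */
static uint64_t jiffy_period_ns(uint64_t tb_ticks_per_jiffy,
				uint64_t tb_ticks_per_usec)
{
	assert(tb_ticks_per_usec != 0);
	return (tb_ticks_per_jiffy / tb_ticks_per_usec) * 1000;
}
```

Under those assumptions a jiffy is 5,120,000 ticks and a microsecond is 512
ticks, so the broadcast period works out to 10,000 us, that is 10,000,000 ns.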

But having a dedicated bc_cpu would mean overloading just one cpu with the
broadcast work which could hinder its performance apart from leading to thermal
imbalance on the chip. Therefore unassign the bc_cpu when there are no more cpus
in deep idle to be woken up. The bc_cpu is left unassigned until such a time that
a cpu enters longnap to be nominated as the bc_cpu and the above cycle repeats.

Protect the nomination, de-nomination and check for existence of the broadcast
cpu with a lock to ensure synchronization between them.
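
The nomination cycle described above can be modelled as a small state
machine. This is a single-threaded sketch with hypothetical names
(struct bc_state, try_enter_deep_idle(), broadcast_done()); in the patch the
same transitions run under a spinlock, and the error path for a failed
hrtimer allocation is left out here.

```c
#include <assert.h>

enum bc_status {
	BC_PRESENT, /* another cpu is bc_cpu: deep idle is allowed */
	BC_SELF     /* caller is (or just became) bc_cpu: stay out */
};

/* Hypothetical shared state; bc_cpu is -1 when nobody is nominated. */
struct bc_state {
	int bc_cpu;
};

/* Decide whether 'cpu' may enter deep idle, nominating it if needed. */
static enum bc_status try_enter_deep_idle(struct bc_state *s, int cpu)
{
	assert(cpu >= 0);
	if (s->bc_cpu != -1 && s->bc_cpu != cpu)
		return BC_PRESENT;
	if (s->bc_cpu == -1)
		s->bc_cpu = cpu; /* first cpu in becomes the broadcast cpu */
	return BC_SELF;
}

/* After a broadcast: de-nominate once no cpu is left to be woken. */
static void broadcast_done(struct bc_state *s, int sleepers_left)
{
	assert(sleepers_left >= 0);
	if (sleepers_left == 0)
		s->bc_cpu = -1;
}
```

Starting from an unassigned state, the first caller is nominated, later
callers are allowed into deep idle, and once the last sleeper has been woken
the role is released so that a different cpu can pick it up next time.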

[1] tick_handle_oneshot_broadcast() or tick_handle_periodic_broadcast().

Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/include/asm/time.h | 1
arch/powerpc/kernel/time.c | 2
drivers/cpuidle/cpuidle-ibm-power.c | 150 +++++++++++++++++++++++++++++++++++
3 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 264dc96..38341fa 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -25,6 +25,7 @@ extern unsigned long tb_ticks_per_usec;
extern unsigned long tb_ticks_per_sec;
extern struct clock_event_device decrementer_clockevent;
extern struct clock_event_device broadcast_clockevent;
+extern struct clock_event_device bc_timer;

struct rtc_time;
extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index bda78bb..44a76de 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -129,7 +129,7 @@ EXPORT_SYMBOL(broadcast_clockevent);

DEFINE_PER_CPU(u64, decrementers_next_tb);
static DEFINE_PER_CPU(struct clock_event_device, decrementers);
-static struct clock_event_device bc_timer;
+struct clock_event_device bc_timer;

#define XSEC_PER_SEC (1024*1024)

diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c
index f8905c3..ae47a0a 100644
--- a/drivers/cpuidle/cpuidle-ibm-power.c
+++ b/drivers/cpuidle/cpuidle-ibm-power.c
@@ -12,12 +12,19 @@
#include <linux/cpuidle.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/clockchips.h>
+#include <linux/tick.h>
+#include <linux/hrtimer.h>
+#include <linux/ktime.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>

#include <asm/paca.h>
#include <asm/reg.h>
#include <asm/machdep.h>
#include <asm/firmware.h>
#include <asm/runlatch.h>
+#include <asm/time.h>
#include <asm/plpar_wrappers.h>

struct cpuidle_driver power_idle_driver = {
@@ -28,6 +35,26 @@ struct cpuidle_driver power_idle_driver = {
static int max_idle_state;
static struct cpuidle_state *cpuidle_state_table;

+static int bc_cpu = -1;
+static struct hrtimer *bc_hrtimer;
+static int bc_hrtimer_initialized = 0;
+
+/*
+ * Bits to indicate if a cpu can enter deep idle where local timer gets
+ * switched off.
+ * BROADCAST_CPU_PRESENT : Enter deep idle since bc_cpu is assigned
+ * BROADCAST_CPU_SELF : Do not enter deep idle since you are bc_cpu, or
+ * there was no bc_cpu and you have just nominated
+ * yourself as bc_cpu
+ * BROADCAST_CPU_ERROR : Do not enter deep idle since there is no bc_cpu
+ * and the broadcast hrtimer could not be initialized.
+ */
+enum broadcast_cpu_status {
+ BROADCAST_CPU_PRESENT,
+ BROADCAST_CPU_SELF,
+ BROADCAST_CPU_ERROR,
+};
+
static inline void idle_loop_prolog(unsigned long *in_purr)
{
*in_purr = mfspr(SPRN_PURR);
@@ -44,6 +71,8 @@ static inline void idle_loop_epilog(unsigned long in_purr)
get_lppaca()->idle = 0;
}

+static DEFINE_SPINLOCK(longnap_idle_lock);
+
static int snooze_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
@@ -139,6 +168,120 @@ static int nap_loop(struct cpuidle_device *dev,
return index;
}

+/* Functions supporting broadcasting in longnap */
+static ktime_t get_next_bc_tick(void)
+{
+ u64 next_bc_ns;
+
+ next_bc_ns = (tb_ticks_per_jiffy / tb_ticks_per_usec) * 1000;
+ return ns_to_ktime(next_bc_ns);
+}
+
+static int restart_broadcast(struct clock_event_device *bc_evt)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&longnap_idle_lock, flags);
+ bc_evt->event_handler(bc_evt);
+
+ if (bc_evt->next_event.tv64 == KTIME_MAX)
+ bc_cpu = -1;
+
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ return (bc_cpu != -1);
+}
+
+static enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer)
+{
+ struct clock_event_device *bc_evt = &bc_timer;
+ ktime_t interval, next_bc_tick;
+
+ u64 now = get_tb_or_rtc();
+ ktime_t now_ktime = ns_to_ktime((now / tb_ticks_per_usec) * 1000);
+
+ if (!restart_broadcast(bc_evt))
+ return HRTIMER_NORESTART;
+
+ interval.tv64 = bc_evt->next_event.tv64 - now_ktime.tv64;
+ next_bc_tick = get_next_bc_tick();
+
+ if (interval.tv64 < next_bc_tick.tv64)
+ hrtimer_forward_now(hrtimer, interval);
+ else
+ hrtimer_forward_now(hrtimer, next_bc_tick);
+
+ return HRTIMER_RESTART;
+}
+
+static enum broadcast_cpu_status can_enter_deep_idle(int cpu)
+{
+ if (bc_cpu != -1 && cpu != bc_cpu) {
+ return BROADCAST_CPU_PRESENT;
+ } else if (bc_cpu != -1 && cpu == bc_cpu) {
+ return BROADCAST_CPU_SELF;
+ } else {
+ if (!bc_hrtimer_initialized) {
+ bc_hrtimer = kmalloc(sizeof(*bc_hrtimer), GFP_NOWAIT);
+ if (!bc_hrtimer)
+ return BROADCAST_CPU_ERROR;
+ hrtimer_init(bc_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ bc_hrtimer->function = handle_broadcast;
+ hrtimer_start(bc_hrtimer, get_next_bc_tick(),
+ HRTIMER_MODE_REL_PINNED);
+ bc_hrtimer_initialized = 1;
+ } else {
+ hrtimer_start(bc_hrtimer, get_next_bc_tick(), HRTIMER_MODE_REL_PINNED);
+ }
+
+ bc_cpu = cpu;
+ return BROADCAST_CPU_SELF;
+ }
+}
+
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the broadcast framework for ppc.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv,
+ int index)
+{
+ int cpu = dev->cpu;
+ unsigned long lpcr = mfspr(SPRN_LPCR);
+ unsigned long flags;
+ int bc_cpu_status;
+
+ lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+ /* Exit powersave upon external interrupt, but not decrementer
+ * interrupt, to emulate sleep.
+ */
+ lpcr |= LPCR_PECE0;
+
+ spin_lock_irqsave(&longnap_idle_lock, flags);
+ bc_cpu_status = can_enter_deep_idle(cpu);
+
+ if (bc_cpu_status == BROADCAST_CPU_PRESENT) {
+ mtspr(SPRN_LPCR, lpcr);
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ power7_nap();
+ spin_lock_irqsave(&longnap_idle_lock, flags);
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ } else if (bc_cpu_status == BROADCAST_CPU_SELF) {
+ lpcr |= LPCR_PECE1;
+ mtspr(SPRN_LPCR, lpcr);
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ power7_nap();
+ } else {
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ }
+
+ return index;
+}
+
/*
* States for dedicated partition case.
*/
@@ -187,6 +330,13 @@ static struct cpuidle_state powernv_states[] = {
.exit_latency = 10,
.target_residency = 100,
.enter = &nap_loop },
+ { /* LongNap */
+ .name = "LongNap",
+ .desc = "LongNap",
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 10,
+ .target_residency = 100,
+ .enter = &longnap_loop },
};

void update_smt_snooze_delay(int cpu, int residency)
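
A note on the rearm rule in handle_broadcast() above: the hrtimer is pushed
forward by the earlier of the time to the next wakeup event and one
broadcast period, where get_next_bc_tick() derives the period from one jiffy
of timebase ticks. A minimal userspace sketch of that arithmetic follows.
The struct timebase wrapper, the function names and the clamp to zero for an
already-due event are assumptions for illustration and are not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the kernel's tb_ticks_per_jiffy and
 * tb_ticks_per_usec globals; callers supply the values.
 */
struct timebase {
	uint64_t ticks_per_jiffy;
	uint64_t ticks_per_usec;
};

/*
 * Broadcast period in ns: one jiffy, converted ticks -> usec -> ns,
 * mirroring the expression in get_next_bc_tick().
 */
static uint64_t bc_period_ns(const struct timebase *tb)
{
	assert(tb->ticks_per_usec != 0);
	return (tb->ticks_per_jiffy / tb->ticks_per_usec) * 1000;
}

/*
 * Delay to rearm the broadcast timer with: the earlier of the next
 * wakeup event and one broadcast period.
 */
static uint64_t bc_rearm_delay_ns(const struct timebase *tb,
				  uint64_t now_ns, uint64_t next_event_ns)
{
	uint64_t period = bc_period_ns(tb);
	uint64_t interval;

	/* Assumption: an event that is already due rearms immediately. */
	if (next_event_ns <= now_ns)
		return 0;

	interval = next_event_ns - now_ns;
	return interval < period ? interval : period;
}
```

With a 512 MHz timebase and HZ=100 (both assumed values), ticks_per_jiffy is
5120000 and ticks_per_usec is 512, so the broadcast period comes out to
10 ms. A wakeup 4 ms away rearms at 4 ms, and one 50 ms away rearms at 10 ms.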

2013-09-11 02:55:21

by Preeti U Murthy

Subject: [PATCH V3 6/6] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old

When the broadcast cpu is hotplugged out, cancel the hrtimer queued to do
the broadcast and nominate as the new broadcast cpu the first cpu in the
broadcast mask, which holds all the cpus that have notified the broadcast
framework that they are entering a deep idle state.

Since the new broadcast cpu is one of the cpus in deep idle, send an ipi to
wake it up so that it can take over the duty of broadcast. The new broadcast
cpu needs to find out whether it was woken up to resume broadcast. If so, it
needs to restart the broadcast hrtimer on itself.

It's possible that the old broadcast cpu was hotplugged out just as the
broadcast hrtimer was about to fire on it. Therefore the newly nominated
broadcast cpu should set the broadcast hrtimer on itself to expire
immediately, so that no wakeups are missed in such a scenario.

Signed-off-by: Preeti U Murthy <[email protected]>
---
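
The nomination step described in the changelog above can be sketched in
userspace as follows. A uint64_t bitmask stands in for the broadcast oneshot
cpumask, and mask_first(), nominate_bc_cpu() and NO_BC_CPU are hypothetical
names, not kernel APIs. Clearing the dying cpu from the mask is a defensive
assumption; the patch itself calls cpumask_first() on the mask directly.

```c
#include <assert.h>
#include <stdint.h>

#define NO_BC_CPU (-1)

/*
 * Index of the lowest set bit, or NO_BC_CPU for an empty mask. A
 * userspace stand-in for cpumask_first() on at most 64 cpus.
 */
static int mask_first(uint64_t mask)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NO_BC_CPU;
}

/*
 * Return the broadcast cpu to use once 'dying' has gone offline.
 * deep_idle_mask has one bit set per cpu currently in deep idle.
 */
static int nominate_bc_cpu(int bc_cpu, int dying, uint64_t deep_idle_mask)
{
	/* Some other cpu is going away: broadcast duty is unaffected. */
	if (dying != bc_cpu)
		return bc_cpu;

	/* Assumption: the dying cpu must never be picked again. */
	assert(dying >= 0 && dying < 64);
	deep_idle_mask &= ~(UINT64_C(1) << dying);

	/* First cpu in deep idle, or NO_BC_CPU if none is idle. */
	return mask_first(deep_idle_mask);
}
```

In the patch the nominated cpu is then woken with the broadcast ipi, and
broadcast_irq_entry() starts the hrtimer with a zero timeout so that a
broadcast which was about to fire on the old cpu is not missed.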

arch/powerpc/include/asm/time.h | 1 +
arch/powerpc/kernel/time.c | 1 +
drivers/cpuidle/cpuidle-ibm-power.c | 22 ++++++++++++++++++++++
3 files changed, 24 insertions(+)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 38341fa..3bc0205 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -31,6 +31,7 @@ struct rtc_time;
extern void to_tm(int tim, struct rtc_time * tm);
extern void GregorianDay(struct rtc_time *tm);
extern void decrementer_timer_interrupt(void);
+extern void broadcast_irq_entry(void);

extern void generic_calibrate_decr(void);

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 44a76de..0ac2e11 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -853,6 +853,7 @@ void decrementer_timer_interrupt(void)
{
u64 *next_tb = &__get_cpu_var(decrementers_next_tb);

+ broadcast_irq_entry();
*next_tb = get_tb_or_rtc();
__timer_interrupt();
}
diff --git a/drivers/cpuidle/cpuidle-ibm-power.c b/drivers/cpuidle/cpuidle-ibm-power.c
index ae47a0a..580ea04 100644
--- a/drivers/cpuidle/cpuidle-ibm-power.c
+++ b/drivers/cpuidle/cpuidle-ibm-power.c
@@ -282,6 +282,12 @@ static int longnap_loop(struct cpuidle_device *dev,
return index;
}

+void broadcast_irq_entry(void)
+{
+ if (smp_processor_id() == bc_cpu)
+ hrtimer_start(bc_hrtimer, ns_to_ktime(0), HRTIMER_MODE_REL_PINNED);
+}
+
/*
* States for dedicated partition case.
*/
@@ -360,6 +366,7 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n,
unsigned long action, void *hcpu)
{
int hotcpu = (unsigned long)hcpu;
+ unsigned long flags;
struct cpuidle_device *dev =
per_cpu(cpuidle_devices, hotcpu);

@@ -372,6 +379,21 @@ static int power_cpuidle_add_cpu_notifier(struct notifier_block *n,
cpuidle_resume_and_unlock();
break;

+ case CPU_DYING:
+ case CPU_DYING_FROZEN:
+ spin_lock_irqsave(&longnap_idle_lock, flags);
+ if (hotcpu == bc_cpu) {
+ bc_cpu = -1;
+ hrtimer_cancel(bc_hrtimer);
+ if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) {
+ bc_cpu = cpumask_first(
+ tick_get_broadcast_oneshot_mask());
+ arch_send_tick_broadcast(cpumask_of(bc_cpu));
+ }
+ }
+ spin_unlock_irqrestore(&longnap_idle_lock, flags);
+ break;
+
case CPU_DEAD:
case CPU_DEAD_FROZEN:
cpuidle_pause_and_lock();

2013-09-11 16:50:21

by Geoff Levand

Subject: Re: [PATCH V3 1/6] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)

On Wed, 2013-09-11 at 08:21 +0530, Preeti U Murthy wrote:
> arch/powerpc/platforms/ps3/smp.c | 2 +-

The PS3 part is trivial and looks OK.

Acked-by: Geoff Levand <[email protected]>

2013-09-11 16:51:15

by Geoff Levand

Subject: Re: [PATCH V3 2/6] powerpc: Implement broadcast timer interrupt as an IPI message

On Wed, 2013-09-11 at 08:21 +0530, Preeti U Murthy wrote:
> arch/powerpc/platforms/ps3/smp.c | 2 +-

The PS3 part is trivial and looks OK.

Acked-by: Geoff Levand <[email protected]>