2013-07-25 09:05:40

by Preeti U Murthy

Subject: [RFC PATCH 0/5] cpuidle/ppc: Timer offload framework to support deep idle states

On PowerPC, when CPUs enter deep idle states, their local timers are
switched off. The responsibility of waking them up at their next timer event
needs to be handed over to an external device. Unlike architectures such as
x86, which use the HPET for this purpose, PowerPC has no equivalent external
device. Instead we assign the local timer of one of the CPUs to do this job.

This patchset is an attempt to make use of the existing timer broadcast
framework in the kernel to meet the above requirement, except that the tick
broadcast device is the local timer of the boot CPU.

This patch series is ported on top of 3.11-rc1 + the cpuidle driver backend
for powernv posted by Deepthi Dharwar recently. The current design and
implementation supports the ONESHOT tick mode and does not yet support
the PERIODIC tick mode. The series has been tested with NOHZ_FULL off.

Patch[1/5], Patch[2/5]: optimize the broadcast mechanism on ppc.
Patch[3/5]: Introduces the core of the timer offload framework on powerpc.
Patch[4/5]: The cpu doing the broadcast should not go into tickless idle.
Patch[5/5]: Add a deep idle state to the cpuidle state table on powernv.

Patch[5/5] is the patch that ultimately makes use of the timer offload
framework built by Patch[1/5] through Patch[4/5].
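
For reference, the CPU-side handshake with the tick broadcast framework that
this series ends up using looks roughly as below. This is only a condensed
sketch of what Patch[5/5] does (see that patch for the real code); the LPCR
manipulation and the broadcast-cpu special case are omitted here.

	int cpu = smp_processor_id();

	/* Hand our next timer event over to the broadcast framework */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);

	power7_nap();	/* deep idle: decrementer interrupts are off */

	/* Resume local timer handling on wakeup */
	clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);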

---

Preeti U Murthy (3):
cpuidle/ppc: Add timer offload framework to support deep idle states
cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints
cpuidle/ppc: Add longnap state to the idle states on powernv

Srivatsa S. Bhat (2):
powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)
powerpc: Implement broadcast timer interrupt as an IPI message


arch/powerpc/include/asm/smp.h | 3 +
arch/powerpc/include/asm/time.h | 3 +
arch/powerpc/kernel/smp.c | 23 ++++--
arch/powerpc/kernel/time.c | 84 +++++++++++++++++++++++
arch/powerpc/platforms/cell/interrupt.c | 2 -
arch/powerpc/platforms/powernv/Kconfig | 1
arch/powerpc/platforms/powernv/processor_idle.c | 48 +++++++++++++
arch/powerpc/platforms/ps3/smp.c | 2 -
kernel/time/tick-sched.c | 7 ++
9 files changed, 161 insertions(+), 12 deletions(-)



2013-07-25 09:05:54

by Preeti U Murthy

Subject: [RFC PATCH 5/5] cpuidle/ppc: Add longnap state to the idle states on powernv

This patch hooks the support that this patchset introduces for ppc into the
existing broadcast framework, together with the cpuidle driver backend for
powernv (posted recently by Deepthi Dharwar), to add sleep as one of the deep
idle states, in which the decrementer is switched off.

However, in this patch we only emulate sleep by going into a state which does
a nap with decrementer interrupts disabled, termed longnap. This keeps the
focus of this series on the timer broadcast framework for ppc, which is
required as a first step to enable sleep on ppc.

Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/platforms/powernv/processor_idle.c | 48 +++++++++++++++++++++++
1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/processor_idle.c b/arch/powerpc/platforms/powernv/processor_idle.c
index f43ad91a..9aca502 100644
--- a/arch/powerpc/platforms/powernv/processor_idle.c
+++ b/arch/powerpc/platforms/powernv/processor_idle.c
@@ -9,16 +9,18 @@
#include <linux/cpuidle.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/clockchips.h>

#include <asm/machdep.h>
#include <asm/runlatch.h>
+#include <asm/time.h>

struct cpuidle_driver powernv_idle_driver = {
.name = "powernv_idle",
.owner = THIS_MODULE,
};

-#define MAX_IDLE_STATE_COUNT 2
+#define MAX_IDLE_STATE_COUNT 3

static int max_idle_state = MAX_IDLE_STATE_COUNT - 1;
static struct cpuidle_device __percpu *powernv_cpuidle_devices;
@@ -54,6 +56,43 @@ static int nap_loop(struct cpuidle_device *dev,
return index;
}

+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the timer offload framework for ppc.
+ * We will eventually introduce the sleep state once the timer offload framework
+ * for ppc is stable.
+ */
+static int longnap_loop(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv,
+ int index)
+{
+ int cpu = dev->cpu;
+
+ unsigned long lpcr = mfspr(SPRN_LPCR);
+
+ lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+ /* Exit powersave upon an external interrupt, but not a decrementer
+ * interrupt. Emulate sleep.
+ */
+ lpcr |= LPCR_PECE0;
+
+ if (cpu != bc_cpu) {
+ mtspr(SPRN_LPCR, lpcr);
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+ power7_nap();
+ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+ } else {
+ /* Wake up on a decrementer interrupt; do a nap */
+ lpcr |= LPCR_PECE1;
+ mtspr(SPRN_LPCR, lpcr);
+ power7_nap();
+ }
+
+ return index;
+}
+
/*
* States for dedicated partition case.
*/
@@ -72,6 +111,13 @@ static struct cpuidle_state powernv_states[MAX_IDLE_STATE_COUNT] = {
.exit_latency = 10,
.target_residency = 100,
.enter = &nap_loop },
+ { /* LongNap */
+ .name = "LongNap",
+ .desc = "LongNap",
+ .flags = CPUIDLE_FLAG_TIME_VALID,
+ .exit_latency = 10,
+ .target_residency = 100,
+ .enter = &longnap_loop },
};

static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,

2013-07-25 09:06:02

by Preeti U Murthy

Subject: [RFC PATCH 3/5] cpuidle/ppc: Add timer offload framework to support deep idle states

On ppc, in deep idle states, the lapic of the cpus gets switched off.
Hence make use of the broadcast framework to wake up cpus in deep idle states,
except that on ppc we do not have an external device such as the HPET;
instead we use the lapic of one of the cpus as the broadcast device.

Instantiate two different clock event devices for each cpu, one representing
the lapic and another representing the broadcast device. The cpu which hosts
the broadcast device is forbidden from entering deep idle states, and will be
referred to as the broadcast cpu in the changelogs of this patchset for
convenience.

For now, only the boot cpu's broadcast device gets registered as a clock event
device along with the lapic. Hence the boot cpu is the broadcast cpu.

On the broadcast cpu, on each timer interrupt, the broadcast handler is
called in addition to the regular lapic event handler. We specifically avoid
programming the lapic for a broadcast event, to prevent multiple cpus from
sending IPIs to program the lapic of the broadcast cpu for their next local
event each time they go into a deep idle state.

Apart from this there is no change in the way broadcast is handled today. On
a broadcast ipi, the timer interrupt event handler is called on the cpu
in deep idle state to handle its local events.

The current design and implementation of the timer offload framework supports
the ONESHOT tick mode but not the PERIODIC mode.

Signed-off-by: Preeti U. Murthy <[email protected]>
---

arch/powerpc/include/asm/time.h | 3 +
arch/powerpc/kernel/smp.c | 4 +-
arch/powerpc/kernel/time.c | 79 ++++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/Kconfig | 1
4 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..936be0d 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,14 +24,17 @@ extern unsigned long tb_ticks_per_jiffy;
extern unsigned long tb_ticks_per_usec;
extern unsigned long tb_ticks_per_sec;
extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;

struct rtc_time;
extern void to_tm(int tim, struct rtc_time * tm);
extern void GregorianDay(struct rtc_time *tm);
+extern void decrementer_timer_interrupt(void);

extern void generic_calibrate_decr(void);

extern void set_dec_cpu6(unsigned int val);
+extern int bc_cpu;

/* Some sane defaults: 125 MHz timebase, 1GHz processor */
extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a68ca4..d3b7014 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -114,7 +114,7 @@ int smp_generic_kick_cpu(int nr)

static irqreturn_t timer_action(int irq, void *data)
{
- timer_interrupt();
+ decrementer_timer_interrupt();
return IRQ_HANDLED;
}

@@ -223,7 +223,7 @@ irqreturn_t smp_ipi_demux(void)

#ifdef __BIG_ENDIAN
if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
- timer_interrupt();
+ decrementer_timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 65ab9e9..8ed0fb3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
#include <linux/timex.h>
#include <linux/kernel_stat.h>
#include <linux/time.h>
+#include <linux/timer.h>
#include <linux/init.h>
#include <linux/profile.h>
#include <linux/cpu.h>
@@ -97,8 +98,11 @@ static struct clocksource clocksource_timebase = {

static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev);
static void decrementer_set_mode(enum clock_event_mode mode,
struct clock_event_device *dev);
+static void decrementer_timer_broadcast(const struct cpumask *mask);

struct clock_event_device decrementer_clockevent = {
.name = "decrementer",
@@ -106,13 +110,26 @@ struct clock_event_device decrementer_clockevent = {
.irq = 0,
.set_next_event = decrementer_set_next_event,
.set_mode = decrementer_set_mode,
- .features = CLOCK_EVT_FEAT_ONESHOT,
+ .broadcast = decrementer_timer_broadcast,
+ .features = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
};
EXPORT_SYMBOL(decrementer_clockevent);

+struct clock_event_device broadcast_clockevent = {
+ .name = "broadcast",
+ .rating = 200,
+ .irq = 0,
+ .set_next_event = broadcast_set_next_event,
+ .set_mode = decrementer_set_mode,
+ .features = CLOCK_EVT_FEAT_ONESHOT,
+};
+EXPORT_SYMBOL(broadcast_clockevent);
+
DEFINE_PER_CPU(u64, decrementers_next_tb);
static DEFINE_PER_CPU(struct clock_event_device, decrementers);
+static DEFINE_PER_CPU(struct clock_event_device, bc_timer);

+int bc_cpu;
#define XSEC_PER_SEC (1024*1024)

#ifdef CONFIG_PPC64
@@ -487,6 +504,8 @@ void timer_interrupt(struct pt_regs * regs)
struct pt_regs *old_regs;
u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
struct clock_event_device *evt = &__get_cpu_var(decrementers);
+ struct clock_event_device *bc_evt = &__get_cpu_var(bc_timer);
+ int cpu = smp_processor_id();
u64 now;

/* Ensure a positive value is written to the decrementer, or else
@@ -532,6 +551,10 @@ void timer_interrupt(struct pt_regs * regs)
*next_tb = ~(u64)0;
if (evt->event_handler)
evt->event_handler(evt);
+ if (cpu == bc_cpu && bc_evt->event_handler) {
+ bc_evt->event_handler(bc_evt);
+ }
+
} else {
now = *next_tb - now;
if (now <= DECREMENTER_MAX)
@@ -806,6 +829,18 @@ static int decrementer_set_next_event(unsigned long evt,
return 0;
}

+/*
+ * We cannot program the decrementer of a remote CPU. Hence CPUs going into
+ * deep idle states need to send IPIs to the broadcast CPU to program its
+ * decrementer for their next local event so as to receive a broadcast IPI
+ * for the same. In order to avoid this overhead, this function is a nop.
+ */
+static int broadcast_set_next_event(unsigned long evt,
+ struct clock_event_device *dev)
+{
+ return 0;
+}
+
static void decrementer_set_mode(enum clock_event_mode mode,
struct clock_event_device *dev)
{
@@ -813,6 +848,20 @@ static void decrementer_set_mode(enum clock_event_mode mode,
decrementer_set_next_event(DECREMENTER_MAX, dev);
}

+void decrementer_timer_interrupt(void)
+{
+ struct clock_event_device *evt;
+ evt = &per_cpu(decrementers, smp_processor_id());
+
+ if (evt->event_handler)
+ evt->event_handler(evt);
+}
+
+static void decrementer_timer_broadcast(const struct cpumask *mask)
+{
+ arch_send_tick_broadcast(mask);
+}
+
static void register_decrementer_clockevent(int cpu)
{
struct clock_event_device *dec = &per_cpu(decrementers, cpu);
@@ -826,6 +875,20 @@ static void register_decrementer_clockevent(int cpu)
clockevents_register_device(dec);
}

+static void register_broadcast_clockevent(int cpu)
+{
+ struct clock_event_device *bc_evt = &per_cpu(bc_timer, cpu);
+
+ *bc_evt = broadcast_clockevent;
+ bc_evt->cpumask = cpumask_of(cpu);
+
+ printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
+ bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
+
+ clockevents_register_device(bc_evt);
+ bc_cpu = cpu;
+}
+
static void __init init_decrementer_clockevent(void)
{
int cpu = smp_processor_id();
@@ -840,6 +903,19 @@ static void __init init_decrementer_clockevent(void)
register_decrementer_clockevent(cpu);
}

+static void __init init_broadcast_clockevent(void)
+{
+ int cpu = smp_processor_id();
+
+ clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
+
+ broadcast_clockevent.max_delta_ns =
+ clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
+ broadcast_clockevent.min_delta_ns =
+ clockevent_delta2ns(2, &broadcast_clockevent);
+ register_broadcast_clockevent(cpu);
+}
+
void secondary_cpu_time_init(void)
{
/* Start the decrementer on CPUs that have manual control
@@ -916,6 +992,7 @@ void __init time_init(void)
clocksource_init();

init_decrementer_clockevent();
+ init_broadcast_clockevent();
}


diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ace2d22..e1a96eb 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -6,6 +6,7 @@ config PPC_POWERNV
select PPC_ICP_NATIVE
select PPC_P7_NAP
select PPC_PCI_CHOICE if EMBEDDED
+ select GENERIC_CLOCKEVENTS_BROADCAST
select EPAPR_BOOT
default y

2013-07-25 09:06:07

by Preeti U Murthy

Subject: [RFC PATCH 1/5] powerpc: Free up the IPI message slot of ipi call function (PPC_MSG_CALL_FUNC)

From: Srivatsa S. Bhat <[email protected]>

The IPI handlers for both PPC_MSG_CALL_FUNCTION and PPC_MSG_CALL_FUNC_SINGLE
map to a common implementation - generic_smp_call_function_single_interrupt().
So we can consolidate them and save one of the IPI message slots (which are
precious, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNCTION using
PPC_MSG_CALL_FUNC_SINGLE itself and release its IPI message slot, so that it
can be used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/include/asm/smp.h | 2 +-
arch/powerpc/kernel/smp.c | 12 +++++-------
arch/powerpc/platforms/cell/interrupt.c | 2 +-
arch/powerpc/platforms/ps3/smp.c | 2 +-
4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ffbaabe..51bf017 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
*
* Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
* in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_CALL_FUNCTION 0
+#define PPC_MSG_UNUSED 0
#define PPC_MSG_RESCHEDULE 1
#define PPC_MSG_CALL_FUNC_SINGLE 2
#define PPC_MSG_DEBUGGER_BREAK 3
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 38b0ba6..bc41e9f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -111,9 +111,9 @@ int smp_generic_kick_cpu(int nr)
}
#endif /* CONFIG_PPC64 */

-static irqreturn_t call_function_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
{
- generic_smp_call_function_interrupt();
+ /* This slot is unused and hence available for use, if needed */
return IRQ_HANDLED;
}

@@ -144,14 +144,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
}

static irq_handler_t smp_ipi_action[] = {
- [PPC_MSG_CALL_FUNCTION] = call_function_action,
+ [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
};

const char *smp_ipi_name[] = {
- [PPC_MSG_CALL_FUNCTION] = "ipi call function",
+ [PPC_MSG_UNUSED] = "ipi unused",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,8 +221,6 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);

#ifdef __BIG_ENDIAN
- if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNCTION)))
- generic_smp_call_function_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -265,7 +263,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
unsigned int cpu;

for_each_cpu(cpu, mask)
- do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
+ do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
}

#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..28166e4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)

void iic_request_IPIs(void)
{
- iic_request_ipi(PPC_MSG_CALL_FUNCTION);
+ iic_request_ipi(PPC_MSG_UNUSED);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..488f069 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/

- BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION != 0);
+ BUILD_BUG_ON(PPC_MSG_UNUSED != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);

2013-07-25 09:08:08

by Preeti U Murthy

Subject: [RFC PATCH 2/5] powerpc: Implement broadcast timer interrupt as an IPI message

From: Srivatsa S. Bhat <[email protected]>

For scalability and performance reasons, we want the broadcast timer
interrupts to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster
than the smp_call_function mechanism because the IPI handlers are fixed
and hence they don't involve costly operations such as adding IPI handlers
to the target CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
broadcast timer interrupts efficiently.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/include/asm/smp.h | 3 ++-
arch/powerpc/kernel/smp.c | 19 +++++++++++++++----
arch/powerpc/platforms/cell/interrupt.c | 2 +-
arch/powerpc/platforms/ps3/smp.c | 2 +-
4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 51bf017..d877b69 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -117,7 +117,7 @@ extern int cpu_to_core_id(int cpu);
*
* Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
* in /proc/interrupts will be wrong!!! --Troy */
-#define PPC_MSG_UNUSED 0
+#define PPC_MSG_TIMER 0
#define PPC_MSG_RESCHEDULE 1
#define PPC_MSG_CALL_FUNC_SINGLE 2
#define PPC_MSG_DEBUGGER_BREAK 3
@@ -190,6 +190,7 @@ extern struct smp_ops_t *smp_ops;

extern void arch_send_call_function_single_ipi(int cpu);
extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
+extern void arch_send_tick_broadcast(const struct cpumask *mask);

/* Definitions relative to the secondary CPU spin loop
* and entry point. Not all of them exist on both 32 and
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bc41e9f..6a68ca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
#include <asm/ptrace.h>
#include <linux/atomic.h>
#include <asm/irq.h>
+#include <asm/hw_irq.h>
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/prom.h>
@@ -111,9 +112,9 @@ int smp_generic_kick_cpu(int nr)
}
#endif /* CONFIG_PPC64 */

-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t timer_action(int irq, void *data)
{
- /* This slot is unused and hence available for use, if needed */
+ timer_interrupt();
return IRQ_HANDLED;
}

@@ -144,14 +145,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
}

static irq_handler_t smp_ipi_action[] = {
- [PPC_MSG_UNUSED] = unused_action, /* Slot available for future use */
+ [PPC_MSG_TIMER] = timer_action,
[PPC_MSG_RESCHEDULE] = reschedule_action,
[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
};

const char *smp_ipi_name[] = {
- [PPC_MSG_UNUSED] = "ipi unused",
+ [PPC_MSG_TIMER] = "ipi timer",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
@@ -221,6 +222,8 @@ irqreturn_t smp_ipi_demux(void)
all = xchg(&info->messages, 0);

#ifdef __BIG_ENDIAN
+ if (all & (1 << (24 - 8 * PPC_MSG_TIMER)))
+ timer_interrupt();
if (all & (1 << (24 - 8 * PPC_MSG_RESCHEDULE)))
scheduler_ipi();
if (all & (1 << (24 - 8 * PPC_MSG_CALL_FUNC_SINGLE)))
@@ -266,6 +269,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
}

+void arch_send_tick_broadcast(const struct cpumask *mask)
+{
+ unsigned int cpu;
+
+ for_each_cpu(cpu, mask)
+ do_message_pass(cpu, PPC_MSG_TIMER);
+}
+
#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
void smp_send_debugger_break(void)
{
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 28166e4..1359113 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -213,7 +213,7 @@ static void iic_request_ipi(int msg)

void iic_request_IPIs(void)
{
- iic_request_ipi(PPC_MSG_UNUSED);
+ iic_request_ipi(PPC_MSG_TIMER);
iic_request_ipi(PPC_MSG_RESCHEDULE);
iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 488f069..5cb742a 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -74,7 +74,7 @@ static int __init ps3_smp_probe(void)
* to index needs to be setup.
*/

- BUILD_BUG_ON(PPC_MSG_UNUSED != 0);
+ BUILD_BUG_ON(PPC_MSG_TIMER != 0);
BUILD_BUG_ON(PPC_MSG_RESCHEDULE != 1);
BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK != 3);

2013-07-25 09:08:12

by Preeti U Murthy

Subject: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

In the current design of the timer offload framework, the broadcast cpu should
*not* go into tickless idle, so as to avoid missed wakeups on CPUs in deep idle states.

Since we prevent the CPUs entering deep idle states from programming the lapic of the
broadcast cpu for their respective next local events, for the reasons mentioned in
PATCH[3/5], the broadcast CPU checks on each of its timer interrupts, programmed for
its local events, whether there are any CPUs to be woken up.

With tickless idle, the broadcast CPU might not get a timer interrupt until after
many ticks, which can result in missed wakeups on CPUs in deep idle states. By
disabling tickless idle, in the worst case the tick_sched hrtimer will trigger a
timer interrupt every period to check for broadcast.

However, the current setup of tickless idle does not let us make the choice
of tickless on individual cpus. NOHZ_MODE_INACTIVE, which disables tickless idle,
is a system-wide setting. Hence resort to an arch-specific call to check if a cpu
can go into tickless idle.

Signed-off-by: Preeti U Murthy <[email protected]>
---

arch/powerpc/kernel/time.c | 5 +++++
kernel/time/tick-sched.c | 7 +++++++
2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 8ed0fb3..68a636f 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -862,6 +862,11 @@ static void decrementer_timer_broadcast(const struct cpumask *mask)
arch_send_tick_broadcast(mask);
}

+int arch_can_stop_idle_tick(int cpu)
+{
+ return cpu != bc_cpu;
+}
+
static void register_decrementer_clockevent(int cpu)
{
struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6960172..e9ffa84 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -700,8 +700,15 @@ static void tick_nohz_full_stop_tick(struct tick_sched *ts)
#endif
}

+int __weak arch_can_stop_idle_tick(int cpu)
+{
+ return 1;
+}
+
static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
+ if (!arch_can_stop_idle_tick(cpu))
+ return false;
/*
* If this cpu is offline and it is the one which updates
* jiffies, then give up the assignment and let it be taken by

2013-07-25 13:30:59

by Frederic Weisbecker

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
> In the current design of timer offload framework, the broadcast cpu should
> *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states.
>
> Since we prevent the CPUs entering deep idle states from programming the lapic of the
> broadcast cpu for their respective next local events for reasons mentioned in
> PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during
> each of its timer interrupt programmed to its local events.
>
> With tickless idle, the broadcast CPU might not get a timer interrupt till after
> many ticks which can result in missed wakeups on CPUs in deep idle states. By
> disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
> timer interrupt every period to check for broadcast.
>
> However the current setup of tickless idle does not let us make the choice
> of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle,
> is a system wide setting. Hence resort to an arch specific call to check if a cpu
> can go into tickless idle.

Hi Preeti,

I'm not exactly sure why the broadcast CPU can't enter dynticks idle mode.
I read in the previous patch that it's because in dynticks idle mode the broadcast
CPU deactivates its lapic so it doesn't receive the IPI. But maybe I misunderstood.
Anyway that's not good for powersaving.

Also, when an arch wants to prevent a CPU from entering dynticks idle mode, it typically
uses arch_needs_cpu(). Maybe that could fit for you as well?

Thanks.

2013-07-26 02:43:13

by Preeti U Murthy

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi Frederic,

On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
> On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
>> In the current design of timer offload framework, the broadcast cpu should
>> *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states.
>>
>> Since we prevent the CPUs entering deep idle states from programming the lapic of the
>> broadcast cpu for their respective next local events for reasons mentioned in
>> PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during
>> each of its timer interrupt programmed to its local events.
>>
>> With tickless idle, the broadcast CPU might not get a timer interrupt till after
>> many ticks which can result in missed wakeups on CPUs in deep idle states. By
>> disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
>> timer interrupt every period to check for broadcast.
>>
>> However the current setup of tickless idle does not let us make the choice
>> of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle,
>> is a system wide setting. Hence resort to an arch specific call to check if a cpu
>> can go into tickless idle.
>
> Hi Preeti,
>
> I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode.
> I read in the previous patch that's because in dynticks idle mode the broadcast
> CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood.
> Anyway that's not good for powersaving.

Let me elaborate. The CPUs in deep idle states have their lapics
deactivated. This means the next timer event which would typically have
been taken care of by a lapic firing at the appropriate moment does not
get taken care of in deep idle states, due to the lapic being switched off.

Hence such CPUs offload their next timer event to the broadcast CPU,
which should *not* enter deep idle states. The broadcast CPU has the
responsibility of waking the CPUs in deep idle states.

*The lapic of a broadcast CPU is always active*. Say CPUX wants the
broadcast CPU to wake it up at timeX. Since we cannot program the lapic
of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
asking it to program its lapic to fire at timeX so as to wake up CPUX.
*With multiple CPUs, the overhead of sending IPIs could result in
performance bottlenecks and may not scale well.*

Hence the workaround is that the broadcast CPU, on each of its timer
interrupts, checks if the next timer event of any CPU in a deep idle
state has expired, which can very well be found from dev->next_event of
that CPU. For example, the timeX mentioned above has expired. If so,
the broadcast handler is called to send an IPI to the
idling CPU to wake it up.
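
In pseudo-C, the check done in the broadcast handler on the broadcast CPU is
roughly the following. This is only a simplified sketch of what the generic
oneshot broadcast handler in kernel/time/tick-broadcast.c does; the mask and
helper names here are illustrative, not the exact kernel identifiers.

	int cpu;
	ktime_t now = ktime_get();
	struct cpumask to_wake;

	cpumask_clear(&to_wake);

	/* CPUs in deep idle have registered themselves in the broadcast mask */
	for_each_cpu(cpu, broadcast_mask) {
		struct clock_event_device *dev = per_cpu_evtdev(cpu);

		/* dev->next_event holds that CPU's next local timer event */
		if (dev->next_event.tv64 <= now.tv64)
			cpumask_set_cpu(cpu, &to_wake);
	}

	/* Send the timer IPI (PPC_MSG_TIMER after patch 2/5) to expired CPUs */
	if (!cpumask_empty(&to_wake))
		arch_send_tick_broadcast(&to_wake);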

*If the broadcast CPU is in tickless idle, its timer interrupt could be
many ticks away. It could miss waking up a CPU in deep idle*, if that CPU's
wakeup is much before this timer interrupt of the broadcast CPU. But
without tickless idle, at least at each period we are assured of a timer
interrupt, at which time broadcast handling is done as stated in the
previous paragraph and we will not miss wakeups of CPUs in deep idle states.

Yeah it is true that not allowing the broadcast CPU to enter tickless
idle is bad for power savings, but for the use case that we are aiming
at in this patch series, the current approach seems to be the best, with
minimal trade-offs in performance, power savings, scalability and no
change in the broadcast framework that exists today in the kernel.

>
> Also when an arch wants to prevent a CPU from entering dynticks idle mode, it typically
> use arch_needs_cpu(). May be that could fit for you as well?

Oh ok thanks :) I will look into this and get back on if we can use it.

Regards
Preeti U Murthy

2013-07-26 03:06:27

by Preeti U Murthy

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi Frederic,

On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
> On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
>> In the current design of timer offload framework, the broadcast cpu should
>> *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states.
>>
>> Since we prevent the CPUs entering deep idle states from programming the lapic of the
>> broadcast cpu for their respective next local events for reasons mentioned in
>> PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during
>> each of its timer interrupt programmed to its local events.
>>
>> With tickless idle, the broadcast CPU might not get a timer interrupt till after
>> many ticks which can result in missed wakeups on CPUs in deep idle states. By
>> disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
>> timer interrupt every period to check for broadcast.
>>
>> However the current setup of tickless idle does not let us make the choice
>> of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle,
>> is a system wide setting. Hence resort to an arch specific call to check if a cpu
>> can go into tickless idle.
>
> Hi Preeti,
>
> I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode.
> I read in the previous patch that's because in dynticks idle mode the broadcast
> CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood.
> Anyway that's not good for powersaving.
>
> Also when an arch wants to prevent a CPU from entering dynticks idle mode, it typically
> use arch_needs_cpu(). May be that could fit for you as well?

Yes this will suit our requirement perfectly. I will note down this
change for the next version of this patchset. Thank you very much for
pointing this out :)
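
For illustration, a minimal sketch of what that change could look like on
powerpc, assuming the arch_needs_cpu(cpu) hook as it exists around 3.11 and
the bc_cpu variable from patch 3/5. This is just a sketch, not code from the
posted series, and the header it would live in is left unspecified:

	/* Keep the tick from being deferred on the broadcast cpu */
	static inline int arch_needs_cpu(int cpu)
	{
		return cpu == bc_cpu;
	}
	#define arch_needs_cpu arch_needs_cpu

With such a hook the periodic tick should not be stopped on bc_cpu, which is
the behaviour patch 4/5 wants, so the weak arch_can_stop_idle_tick() hook
would presumably no longer be needed.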

Regards
Preeti U Murthy

2013-07-26 03:19:57

by Paul Mackerras

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

On Fri, Jul 26, 2013 at 08:09:23AM +0530, Preeti U Murthy wrote:
> Hi Frederic,
>
> On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
> > Hi Preeti,
> >
> > I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode.
> > I read in the previous patch that's because in dynticks idle mode the broadcast
> > CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood.
> > Anyway that's not good for powersaving.
>
> Let me elaborate. The CPUs in deep idle states have their lapics
> deactivated. This means the next timer event which would typically have
> been taken care of by a lapic firing at the appropriate moment does not
> get taken care of in deep idle states, due to the lapic being switched off.

I really don't think it's helpful to use the term "lapic" in
connection with Power systems. There is nothing that is called a
"lapic" in a Power machine. The nearest equivalent of the LAPIC on
x86 machines is the ICP, the interrupt-controller presentation
element, of which there is one per CPU thread.

However, I don't believe the ICP gets disabled in deep sleep modes.
What does get disabled is the "decrementer", which is a register that
normally counts down (at 512MHz) and generates an exception when it is
negative. The decrementer *is* part of the CPU core, unlike the ICP.
That's why we can still get IPIs but not timer interrupts.

Please reword your patch description to not use the term "lapic",
which is not defined in the Power context and is therefore just
causing confusion.

Paul.

2013-07-26 03:38:41

by Preeti U Murthy

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi Paul,

On 07/26/2013 08:49 AM, Paul Mackerras wrote:
> On Fri, Jul 26, 2013 at 08:09:23AM +0530, Preeti U Murthy wrote:
>> Hi Frederic,
>>
>> On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
>>> Hi Preeti,
>>>
>>> I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode.
>>> I read in the previous patch that's because in dynticks idle mode the broadcast
>>> CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood.
>>> Anyway that's not good for powersaving.
>>
>> Let me elaborate. The CPUs in deep idle states have their lapics
>> deactivated. This means the next timer event which would typically have
>> been taken care of by a lapic firing at the appropriate moment does not
>> get taken care of in deep idle states, due to the lapic being switched off.
>
> I really don't think it's helpful to use the term "lapic" in
> connection with Power systems. There is nothing that is called a
> "lapic" in a Power machine. The nearest equivalent of the LAPIC on
> x86 machines is the ICP, the interrupt-controller presentation
> element, of which there is one per CPU thread.
>
> However, I don't believe the ICP gets disabled in deep sleep modes.
> What does get disabled is the "decrementer", which is a register that
> normally counts down (at 512MHz) and generates an exception when it is
> negative. The decrementer *is* part of the CPU core, unlike the ICP.
> That's why we can still get IPIs but not timer interrupts.
>
> Please reword your patch description to not use the term "lapic",
> which is not defined in the Power context and is therefore just
> causing confusion.

Noted. Thank you :) I will probably send out a fresh patchset with the
appropriate changelog to avoid this confusion?
>
> Paul.
>
Regards
Preeti U murthy

2013-07-26 04:14:39

by Preeti U Murthy

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi Frederic,

I apologise for the confusion. As Paul pointed out maybe the usage of
the term lapic is causing a large amount of confusion. So please see the
clarification below. Maybe it will help answer your question.

On 07/26/2013 08:09 AM, Preeti U Murthy wrote:
> Hi Frederic,
>
> On 07/25/2013 07:00 PM, Frederic Weisbecker wrote:
>> On Thu, Jul 25, 2013 at 02:33:02PM +0530, Preeti U Murthy wrote:
>>> In the current design of timer offload framework, the broadcast cpu should
>>> *not* go into tickless idle so as to avoid missed wakeups on CPUs in deep idle states.
>>>
>>> Since we prevent the CPUs entering deep idle states from programming the lapic of the
>>> broadcast cpu for their respective next local events for reasons mentioned in
>>> PATCH[3/5], the broadcast CPU checks if there are any CPUs to be woken up during
>>> each of its timer interrupt programmed to its local events.
>>>
>>> With tickless idle, the broadcast CPU might not get a timer interrupt till after
>>> many ticks which can result in missed wakeups on CPUs in deep idle states. By
>>> disabling tickless idle, worst case, the tick_sched hrtimer will trigger a
>>> timer interrupt every period to check for broadcast.
>>>
>>> However the current setup of tickless idle does not let us make the choice
>>> of tickless on individual cpus. NOHZ_MODE_INACTIVE which disables tickless idle,
>>> is a system wide setting. Hence resort to an arch specific call to check if a cpu
>>> can go into tickless idle.
>>
>> Hi Preeti,
>>
>> I'm not exactly sure why you can't enter the broadcast CPU in dynticks idle mode.
>> I read in the previous patch that's because in dynticks idle mode the broadcast
>> CPU deactivates its lapic so it doesn't receive the IPI. But may be I misunderstood.
>> Anyway that's not good for powersaving.

Firstly, when CPUs enter deep idle states, their local clock event
devices get switched off. In the case of powerpc, the local clock event
device is the decrementer. Hence such CPUs *do not get timer interrupts*
but are still *capable of taking IPIs*.

So we need to ensure that some other CPU, in this case the broadcast
CPU, makes note of when the timer interrupt of the CPU in such deep idle
states is to trigger and at that moment issue an IPI to that CPU.

*The broadcast CPU however should have its decrementer active always*,
meaning it is disallowed from entering deep idle states, where the
decrementer switches off, precisely because the other idling CPUs bank
on it for the above mentioned reason.

> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> asking it to program its lapic to fire at timeX so as to wake up CPUX.
> *With multiple CPUs the overhead of sending IPI, could result in
> performance bottlenecks and may not scale well.*

Rewording the above: the decrementer of the broadcast CPU is always
active. Since we cannot program the clock event device
of a remote CPU, CPUX will need to send an IPI to the broadcast CPU
(which the broadcast CPU is very well capable of receiving), asking it
to program its decrementer to fire at timeX so as to wake up CPUX.
*With multiple CPUs, the overhead of sending IPIs could result in
performance bottlenecks and may not scale well.*

>
> Hence the workaround is that the broadcast CPU on each of its timer
> interrupt checks if any of the next timer event of a CPU in deep idle
> state has expired, which can very well be found from dev->next_event of
> that CPU. For example the timeX that has been mentioned above has
> expired. If so the broadcast handler is called to send an IPI to the
> idling CPU to wake it up.
>
> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> many ticks away. It could miss waking up a CPU in deep idle*, if its
> wakeup is much before this timer interrupt of the broadcast CPU. But
> without tickless idle, atleast at each period we are assured of a timer
> interrupt. At which time broadcast handling is done as stated in the
> previous paragraph and we will not miss wakeup of CPUs in deep idle states.
>
> Yeah it is true that not allowing the broadcast CPU to enter tickless
> idle is bad for power savings, but for the use case that we are aiming
> at in this patch series, the current approach seems to be the best, with
> minimal trade-offs in performance, power savings, scalability and no
> change in the broadcast framework that exists today in the kernel.
>

Regards
Preeti U Murthy

2013-07-27 06:32:03

by Benjamin Herrenschmidt

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> asking it to program its lapic to fire at timeX so as to wake up CPUX.
> *With multiple CPUs the overhead of sending IPI, could result in
> performance bottlenecks and may not scale well.*
>
> Hence the workaround is that the broadcast CPU on each of its timer
> interrupt checks if any of the next timer event of a CPU in deep idle
> state has expired, which can very well be found from dev->next_event of
> that CPU. For example the timeX that has been mentioned above has
> expired. If so the broadcast handler is called to send an IPI to the
> idling CPU to wake it up.
>
> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> many ticks away. It could miss waking up a CPU in deep idle*, if its
> wakeup is much before this timer interrupt of the broadcast CPU. But
> without tickless idle, atleast at each period we are assured of a timer
> interrupt. At which time broadcast handling is done as stated in the
> previous paragraph and we will not miss wakeup of CPUs in deep idle states.

But that means a great loss of power saving on the broadcast CPU when the machine
is basically completely idle. We might be able to come up with some thing better.

(Note: I do not know the timer offload code, if it exists already; I'm describing
how things could happen "out of the blue" without any knowledge of the pre-existing
framework here)

We can know when the broadcast CPU expects to wake up next. When a CPU goes to
a deep sleep state, it can then

- Indicate to the broadcast CPU when it intends to be woken up by queuing
itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
Play with the locality of that: have one queue (and one "broadcast CPU") per
chip or per node instead of a global one to limit cache bouncing).

- Check if that happens before the broadcast CPU's intended wake time (we
need statistics to see how often that happens), and in that case send an IPI
to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
time to the min of its intended sleep time and the new sleeper time.
(OPTIMISATION: dynamically choose a broadcast CPU based on closest expiry?)

- We can probably limit spurious wakeups a *LOT* by aligning that target time
to a global jiffy boundary, meaning that several CPUs going to idle are likely
to be choosing the same. Or maybe better, an adaptive alignment by essentially
getting more coarse grained as we go further into the future

- When the "broadcast" CPU goes to sleep, it can play the same game of alignment.
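
To make the ordered-queue idea above concrete, a rough illustration of the
data structure could look like the following. All names, the per-node scope
and the locking are assumptions for illustration only, not anything in the
posted series:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct sleeper {
		struct list_head node;		/* position in the wakeup-ordered queue */
		int cpu;
		u64 wake_tb;			/* intended wakeup time (timebase ticks) */
	};

	struct wakeup_queue {			/* one instance per chip/node */
		raw_spinlock_t lock;
		struct list_head sleepers;	/* sorted by wake_tb, earliest first */
	};

	/* Called by a CPU about to enter deep idle: queue its wakeup request
	 * in wakeup-time order, so the waker only has to look at the head. */
	static void queue_sleeper(struct wakeup_queue *q, struct sleeper *s)
	{
		struct sleeper *pos;

		raw_spin_lock(&q->lock);
		list_for_each_entry(pos, &q->sleepers, node)
			if (s->wake_tb < pos->wake_tb)
				break;
		list_add_tail(&s->node, &pos->node);
		raw_spin_unlock(&q->lock);
	}

The waker (whichever CPU ends up in that role) would then only need to compare
the head of this queue against its own next programmed event.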

I don't like the concept of a dedicated broadcast CPU however. I'd rather have a
general queue (or per node) of sleepers needing a wakeup and more/less dynamically
pick a waker to be the last man standing, but it does make things a bit more
tricky with tickless scheduler (non-idle).

Still, I wonder if we could just have some algorithm to actually pick wakers
more dynamically based on whoever has the closest "next wakeup" planned,
that sort of thing. A fixed broadcaster will create an imbalance in
power/thermal within the chip in addition to needing to be moved around on
hotplug etc...

Cheers,
Ben.

2013-07-27 07:53:15

by Preeti U Murthy

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi Ben,

On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
> On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
>> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
>> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
>> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
>> asking it to program its lapic to fire at timeX so as to wake up CPUX.
>> *With multiple CPUs the overhead of sending IPI, could result in
>> performance bottlenecks and may not scale well.*
>>
>> Hence the workaround is that the broadcast CPU on each of its timer
>> interrupt checks if any of the next timer event of a CPU in deep idle
>> state has expired, which can very well be found from dev->next_event of
>> that CPU. For example the timeX that has been mentioned above has
>> expired. If so the broadcast handler is called to send an IPI to the
>> idling CPU to wake it up.
>>
>> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
>> many ticks away. It could miss waking up a CPU in deep idle*, if its
>> wakeup is much before this timer interrupt of the broadcast CPU. But
>> without tickless idle, atleast at each period we are assured of a timer
>> interrupt. At which time broadcast handling is done as stated in the
>> previous paragraph and we will not miss wakeup of CPUs in deep idle states.
>
> But that means a great loss of power saving on the broadcast CPU when the machine
> is basically completely idle. We might be able to come up with some thing better.
>
> (Note : I do no know the timer offload code if it exists already, I'm describing
> how things could happen "out of the blue" without any knowledge of pre-existing
> framework here)
>
> We can know when the broadcast CPU expects to wake up next. When a CPU goes to
> a deep sleep state, it can then
>
> - Indicate to the broadcast CPU when it intends to be woken up by queuing
> itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
> Play with the locality of that: have one queue (and one "broadcast CPU") per
> chip or per node instead of a global one to limit cache bouncing).
>
> - Check if that happens before the broadcast CPU intended wake time (we
> need statistics to see how often that happens), and in that case send an IPI
> to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
> time to the min of it's intended sleep time and the new sleeper time.
> (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
>
> - We can probably limit spurrious wakeups a *LOT* by aligning that target time
> to a global jiffy boundary, meaning that several CPUs going to idle are likely
> to be choosing the same. Or maybe better, an adaptative alignment by essentially
> getting more coarse grained as we go further in the future
>
> - When the "broadcast" CPU goes to sleep, it can play the same game of alignment.
>
> I don't like the concept of a dedicated broadcast CPU however. I'd rather have a
> general queue (or per node) of sleepers needing a wakeup and more/less dynamically
> pick a waker to be the last man standing, but it does make things a bit more
> tricky with tickless scheduler (non-idle).
>
> Still, I wonder if we could just have some algorithm to actually pick wakers
> more dynamically based on who ever has the closest "next wakeup" planned,
> that sort of thing. A fixed broadcaster will create an imbalance in
> power/thermal within the chip in addition to needing to be moved around on
> hotplug etc...

Thank you for having listed out the above suggestions. Below, I will
bring out some ideas about how the concerns that you have raised can be
addressed, in increasing order of priority.

- To begin with, I think we can have the following model to have the
responsibility of the broadcast CPU float around certain CPUs, i.e. not
have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
bc_cpu henceforth for convenience.

1. The first CPU that intends to enter deep sleep state will be the bc_cpu.

2. Every other CPU that intends to enter a deep idle state will add
itself to a mask, say the bc_mask, which is already being done
today, after it checks that a bc_cpu has been assigned.

3. The bc_cpu should not enter tickless idle, until step 5a holds true.

4. So on every timer interrupt, which is at least every period, it
checks the bc_mask to see if any CPUs need to be woken up.

5. The bc_cpu should not enter tickless idle *until* it is de-nominated
as the bc_cpu. The de-nomination occurs when:
a. In one of its timer interrupts, it does broadcast handling and finds
that there are no CPUs to be woken up.

6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
to enter deep idle state again, in which case steps 1 to 5 repeat.
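
As a very rough sketch of steps 1 to 6 above (illustrative only; the -1 "no
bc_cpu" sentinel, the function names and the locking model are all assumptions,
not code from this series), the nomination and de-nomination could look like:

	static int bc_cpu = -1;			/* -1: no broadcast cpu nominated */
	static struct cpumask bc_mask;		/* CPUs in deep idle awaiting wakeup */

	/* Called by a CPU about to enter a deep idle state (steps 1 and 2).
	 * Returns true if this CPU just became the bc_cpu. */
	static bool bc_cpu_nominate(int cpu)
	{
		/* The first CPU to get here becomes the bc_cpu and stays out
		 * of deep idle to act as the waker. */
		if (cmpxchg(&bc_cpu, -1, cpu) == -1)
			return true;

		/* Everyone else just records its wakeup request. */
		cpumask_set_cpu(cpu, &bc_mask);
		return false;
	}

	/* Called from the bc_cpu's timer interrupt (steps 4 and 5a) */
	static void bc_cpu_check(int cpu)
	{
		if (cpu != bc_cpu)
			return;

		/* ... wake any expired sleepers via the broadcast handler ... */

		if (cpumask_empty(&bc_mask))
			bc_cpu = -1;		/* steps 5a/6: de-nominate ourselves */
	}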


- We could optimize this further, to allow the bc_cpu to enter tickless
idle, even while it is nominated as one. This can be the next step, if
we can get the above to work stably.

You have already brought out this point, so I will just reword it. Each
time broadcast handling is done, the bc_cpu needs to check if the wakeup
time of a CPU that has entered a deep idle state, and is yet to be woken
up, is before the bc_cpu's own wakeup time, which was programmed for its
local events.

If so, then reprogram the decrementer to the wakeup time of a CPU that
is in deep idle state.

But we need to keep in mind one point. When CPUs go into deep idle, they
cannot program the local timer of the bc_cpu to their wakeup time. This
is because a CPU cannot program the timer of a remote CPU.

Therefore the only place we can check whether 'the wakeup time of the CPU that
enters a deep idle state is before the broadcast CPU's intended wake time, so
as to reprogram the decrementer', is in the broadcast handler itself,
which runs *on* the bc_cpu alone.



What do you think?


- Coming to your third suggestion of aligning the wakeup time of CPUs, I
will spend some time on this and get back regarding the same.


>
> Cheers,
> Ben.
>

Thank you

Regards
Preeti U Murthy

2013-07-29 05:12:06

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

* Benjamin Herrenschmidt <[email protected]> [2013-07-27 16:30:05]:

> On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
> > *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> > broadcast CPU to wake it up at timeX. Since we cannot program the lapic
> > of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> > asking it to program its lapic to fire at timeX so as to wake up CPUX.
> > *With multiple CPUs the overhead of sending IPI, could result in
> > performance bottlenecks and may not scale well.*
> >
> > Hence the workaround is that the broadcast CPU on each of its timer
> > interrupt checks if any of the next timer event of a CPU in deep idle
> > state has expired, which can very well be found from dev->next_event of
> > that CPU. For example the timeX that has been mentioned above has
> > expired. If so the broadcast handler is called to send an IPI to the
> > idling CPU to wake it up.
> >
> > *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> > many ticks away. It could miss waking up a CPU in deep idle*, if its
> > wakeup is much before this timer interrupt of the broadcast CPU. But
> > without tickless idle, atleast at each period we are assured of a timer
> > interrupt. At which time broadcast handling is done as stated in the
> > previous paragraph and we will not miss wakeup of CPUs in deep idle states.
>
> But that means a great loss of power saving on the broadcast CPU when the machine
> is basically completely idle. We might be able to come up with some thing better.

Hi Ben,

Yes, we will need to improve on this case in stages. In the current
design, we will have to hold one of the CPUs in a shallow idle state
(nap) to wake up other deep idle cpus. The cost of keeping the
periodic tick ON on the broadcast CPU is minimal (but not zero), since
we would not allow that CPU to enter any deep idle states even if
there were no periodic timers queued.

> (Note : I do no know the timer offload code if it exists already, I'm describing
> how things could happen "out of the blue" without any knowledge of pre-existing
> framework here)
>
> We can know when the broadcast CPU expects to wake up next. When a CPU goes to
> a deep sleep state, it can then
>
> - Indicate to the broadcast CPU when it intends to be woken up by queuing
> itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
> Play with the locality of that: have one queue (and one "broadcast CPU") per
> chip or per node instead of a global one to limit cache bouncing).
>
> - Check if that happens before the broadcast CPU intended wake time (we
> need statistics to see how often that happens), and in that case send an IPI
> to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
> time to the min of it's intended sleep time and the new sleeper time.
> (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)

This will be an improvement, and the idea we have is to have
a hierarchical method of finding a waking CPU within core/socket/node
in order to find a better fit, and ultimately send an IPI to wake up
a broadcast CPU only if there is no other fit. This condition would
imply that more CPUs are in deep idle state and the cost of sending an
IPI to reprogram the broadcast cpu's local timer may well pay off.

> - We can probably limit spurrious wakeups a *LOT* by aligning that target time
> to a global jiffy boundary, meaning that several CPUs going to idle are likely
> to be choosing the same. Or maybe better, an adaptative alignment by essentially
> getting more coarse grained as we go further in the future
>
> - When the "broadcast" CPU goes to sleep, it can play the same game of alignment.

CPUs entering a deep idle state would need to wake up only at a jiffy
boundary, or at the jiffy boundary earlier than the target wakeup time.
Your point is: can the broadcast cpu wake up the sleeping CPU *around*
the designated wakeup time (earlier) so as to avoid reprogramming its
timer?

> I don't like the concept of a dedicated broadcast CPU however. I'd rather have a
> general queue (or per node) of sleepers needing a wakeup and more/less dynamically
> pick a waker to be the last man standing, but it does make things a bit more
> tricky with tickless scheduler (non-idle).
>
> Still, I wonder if we could just have some algorithm to actually pick wakers
> more dynamically based on who ever has the closest "next wakeup" planned,
> that sort of thing. A fixed broadcaster will create an imbalance in
> power/thermal within the chip in addition to needing to be moved around on
> hotplug etc...

Right Ben. The hierarchical way of selecting the waker will help us
have multiple wakers in different sockets/cores. The broadcast
framework allows us to decouple the cpu going to idle and the waker to
be selected independently. This patch series is the start, where we
pick one waker and allow it to move around. The ideal goal would
be to have multiple wakers serving wakeup requests from
a queue (mask), where the wakers are generally busy or idle cpus which
need not prevent themselves from going into tickless idle or deep idle
states just to wake up another one. These optimizations can adapt to the
case where we need the last cpu to stay in shallow idle mode and in
tickless, with the wakeup events queued to target one of the deep idle
cpus.

--Vaidy

2013-07-29 05:28:37

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

* Preeti U Murthy <[email protected]> [2013-07-27 13:20:37]:

> Hi Ben,
>
> On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
> > On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
> >> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
> >> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
> >> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
> >> asking it to program its lapic to fire at timeX so as to wake up CPUX.
> >> *With multiple CPUs the overhead of sending IPI, could result in
> >> performance bottlenecks and may not scale well.*
> >>
> >> Hence the workaround is that the broadcast CPU on each of its timer
> >> interrupt checks if any of the next timer event of a CPU in deep idle
> >> state has expired, which can very well be found from dev->next_event of
> >> that CPU. For example the timeX that has been mentioned above has
> >> expired. If so the broadcast handler is called to send an IPI to the
> >> idling CPU to wake it up.
> >>
> >> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
> >> many ticks away. It could miss waking up a CPU in deep idle*, if its
> >> wakeup is much before this timer interrupt of the broadcast CPU. But
> >> without tickless idle, atleast at each period we are assured of a timer
> >> interrupt. At which time broadcast handling is done as stated in the
> >> previous paragraph and we will not miss wakeup of CPUs in deep idle states.
> >
> > But that means a great loss of power saving on the broadcast CPU when the machine
> > is basically completely idle. We might be able to come up with some thing better.
> >
> > (Note : I do no know the timer offload code if it exists already, I'm describing
> > how things could happen "out of the blue" without any knowledge of pre-existing
> > framework here)
> >
> > We can know when the broadcast CPU expects to wake up next. When a CPU goes to
> > a deep sleep state, it can then
> >
> > - Indicate to the broadcast CPU when it intends to be woken up by queuing
> > itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
> > Play with the locality of that: have one queue (and one "broadcast CPU") per
> > chip or per node instead of a global one to limit cache bouncing).
> >
> > - Check if that happens before the broadcast CPU intended wake time (we
> > need statistics to see how often that happens), and in that case send an IPI
> > to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
> > time to the min of it's intended sleep time and the new sleeper time.
> > (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
> >
> > - We can probably limit spurrious wakeups a *LOT* by aligning that target time
> > to a global jiffy boundary, meaning that several CPUs going to idle are likely
> > to be choosing the same. Or maybe better, an adaptative alignment by essentially
> > getting more coarse grained as we go further in the future
> >
> > - When the "broadcast" CPU goes to sleep, it can play the same game of alignment.
> >
> > I don't like the concept of a dedicated broadcast CPU however. I'd rather have a
> > general queue (or per node) of sleepers needing a wakeup and more/less dynamically
> > pick a waker to be the last man standing, but it does make things a bit more
> > tricky with tickless scheduler (non-idle).
> >
> > Still, I wonder if we could just have some algorithm to actually pick wakers
> > more dynamically based on who ever has the closest "next wakeup" planned,
> > that sort of thing. A fixed broadcaster will create an imbalance in
> > power/thermal within the chip in addition to needing to be moved around on
> > hotplug etc...
>
> Thank you for having listed out the above suggestions. Below, I will
> bring out some ideas about how the concerns that you have raised can be
> addressed in the increasing order of priority.
>
> - To begin with, I think we can have the following model to have the
> responsibility of the broadcast CPU float around certain CPUs. i.e. Not
> have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
> bc_cpu henceforth for convenience.
>
> 1. The first CPU that intends to enter deep sleep state will be the bc_cpu.
>
> 2. Every other CPU that intends to enter deep idle state will enter
> themselves into a mask, say the bc_mask, which is already being done
> today, after they check that a bc_cpu has been assigned.
>
> 3. The bc_cpu should not enter tickless idle, until step 5a holds true.
>
> 4. So on every timer interrupt, which is at-least every period, it
> checks the bc_mask to see if any CPUs need to be woken up.
>
> 5. The bc cpu should not enter tickless idle *until* it is de-nominated
> as the bc_cpu. The de-nomination occurs when:
> a. In one of its timer interrupts, it does broadcast handling to find
> out that there are no CPUs to be woken up.
>
> 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
> to enter deep idle state again, in which case steps 1 to 5 repeat.
>
>
> - We could optimize this further, to allow the bc_cpu to enter tickless
> idle, even while it is nominated as one. This can be the next step, if
> we can get the above to work stably.
>
> You have already brought out this point, so I will just reword it. Each
> time broadcast handling is done, the bc_cpu needs to check if the wakeup
> time of a CPU, that has entered deep idle state, and is yet to be woken
> up, is before the bc_cpu's wakeup time, which was programmed to its
> local events.
>
> If so, then reprogram the decrementer to the wakeup time of a CPU that
> is in deep idle state.
>
> But we need to keep in mind one point. When CPUs go into deep idle, they
> cannot program the local timer of the bc_cpu to their wakeup time. This
> is because a CPU cannot program the timer of a remote CPU.
>
> Therefore the only time we can check if 'wakeup time of the CPU that
> enters deep idle state is before broadcast CPU's intended wake time so
> as to reprogram the decrementer', is in the broadcast handler itself,
> which is done *on* the bc_cpu alone.
>
>
>
> What do you think?
>
>
> - Coming to your third suggestion of aligning the wakeup time of CPUs, I
> will spend some time on this and get back regarding the same.

Hi Preeti,

One of Ben's suggestions is to coarse-grain the waker's timer event.
The trade-off is whether we issue an IPI for each CPU needing a wakeup,
or let the bc_cpu wake up periodically and *see* that there is a new
request. The interval for a wakeup request will be much more
coarse-grained than a tick. We may be able to reduce the power impact
of not letting the bc_cpu go tickless simply by choosing the right
coarse-grained period. For example, we can let the bc_cpu look for new
wakeup requests once every 10 or 20 jiffies rather than every jiffy,
and align the wakeup requests to this coarse-grained wakeup. We do pay
a power penalty by waking up a few jiffies early, which we can mitigate
by re-evaluating the situation and queueing a fine-grained timer for
the right jiffy on the bc_cpu if such a situation arises.

The point is that a new wakeup request will *ask* for a wakeup later
than the coarse-grained period. So the bc_cpu can wake up at the
coarse-grained boundary and reprogram its timer to the right jiffy.
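As a rough illustration of what I mean (a plain kernel timer is used
only for the sketch, the helper names below are made up, and the real
code would reprogram the decrementer/tick device):

        #include <linux/timer.h>
        #include <linux/jiffies.h>

        #define BC_COARSE_JIFFIES       10

        static struct timer_list bc_timer;

        /* Stubs for the sketch: the real versions would walk the request
         * queue/mask and IPI any deep-idle CPUs whose wakeup is due. */
        static unsigned long bc_next_request(void)
        {
                return jiffies + BC_COARSE_JIFFIES;     /* earliest queued wakeup */
        }

        static void bc_wake_expired(void)
        {
        }

        static void bc_coarse_handler(unsigned long data)
        {
                unsigned long next_req = bc_next_request();
                unsigned long next_coarse = jiffies + BC_COARSE_JIFFIES;

                bc_wake_expired();      /* IPI any CPUs that are already due */

                /* Use a fine-grained expiry only when a request falls inside
                 * the coarse window; otherwise stick to the coarse period. */
                mod_timer(&bc_timer, time_before(next_req, next_coarse) ?
                                     next_req : next_coarse);
        }

        /* The handler would be armed with something like:
         *      setup_timer(&bc_timer, bc_coarse_handler, 0);
         *      mod_timer(&bc_timer, jiffies + BC_COARSE_JIFFIES);
         */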

--Vaidy

2013-07-29 10:14:46

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] cpuidle/ppc: CPU goes tickless if there are no arch-specific constraints

Hi,

On 07/29/2013 10:58 AM, Vaidyanathan Srinivasan wrote:
> * Preeti U Murthy <[email protected]> [2013-07-27 13:20:37]:
>
>> Hi Ben,
>>
>> On 07/27/2013 12:00 PM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2013-07-26 at 08:09 +0530, Preeti U Murthy wrote:
>>>> *The lapic of a broadcast CPU is active always*. Say CPUX, wants the
>>>> broadcast CPU to wake it up at timeX. Since we cannot program the lapic
>>>> of a remote CPU, CPUX will need to send an IPI to the broadcast CPU,
>>>> asking it to program its lapic to fire at timeX so as to wake up CPUX.
>>>> *With multiple CPUs the overhead of sending IPI, could result in
>>>> performance bottlenecks and may not scale well.*
>>>>
>>>> Hence the workaround is that the broadcast CPU on each of its timer
>>>> interrupt checks if any of the next timer event of a CPU in deep idle
>>>> state has expired, which can very well be found from dev->next_event of
>>>> that CPU. For example the timeX that has been mentioned above has
>>>> expired. If so the broadcast handler is called to send an IPI to the
>>>> idling CPU to wake it up.
>>>>
>>>> *If the broadcast CPU, is in tickless idle, its timer interrupt could be
>>>> many ticks away. It could miss waking up a CPU in deep idle*, if its
>>>> wakeup is much before this timer interrupt of the broadcast CPU. But
>>>> without tickless idle, atleast at each period we are assured of a timer
>>>> interrupt. At which time broadcast handling is done as stated in the
>>>> previous paragraph and we will not miss wakeup of CPUs in deep idle states.
>>>
>>> But that means a great loss of power saving on the broadcast CPU when the machine
>>> is basically completely idle. We might be able to come up with some thing better.
>>>
>>> (Note : I do no know the timer offload code if it exists already, I'm describing
>>> how things could happen "out of the blue" without any knowledge of pre-existing
>>> framework here)
>>>
>>> We can know when the broadcast CPU expects to wake up next. When a CPU goes to
>>> a deep sleep state, it can then
>>>
>>> - Indicate to the broadcast CPU when it intends to be woken up by queuing
>>> itself into an ordered queue (ordered by target wakeup time). (OPTIMISATION:
>>> Play with the locality of that: have one queue (and one "broadcast CPU") per
>>> chip or per node instead of a global one to limit cache bouncing).
>>>
>>> - Check if that happens before the broadcast CPU intended wake time (we
>>> need statistics to see how often that happens), and in that case send an IPI
>>> to wake it up now. When the broadcast CPU goes to sleep, it limits its sleep
>>> time to the min of it's intended sleep time and the new sleeper time.
>>> (OPTIMISATION: Dynamically chose a broadcast CPU based on closest expiry ?)
>>>
>>> - We can probably limit spurrious wakeups a *LOT* by aligning that target time
>>> to a global jiffy boundary, meaning that several CPUs going to idle are likely
>>> to be choosing the same. Or maybe better, an adaptative alignment by essentially
>>> getting more coarse grained as we go further in the future
>>>
>>> - When the "broadcast" CPU goes to sleep, it can play the same game of alignment.
>>>
>>> I don't like the concept of a dedicated broadcast CPU however. I'd rather have a
>>> general queue (or per node) of sleepers needing a wakeup and more/less dynamically
>>> pick a waker to be the last man standing, but it does make things a bit more
>>> tricky with tickless scheduler (non-idle).
>>>
>>> Still, I wonder if we could just have some algorithm to actually pick wakers
>>> more dynamically based on who ever has the closest "next wakeup" planned,
>>> that sort of thing. A fixed broadcaster will create an imbalance in
>>> power/thermal within the chip in addition to needing to be moved around on
>>> hotplug etc...
>>
>> Thank you for having listed out the above suggestions. Below, I will
>> bring out some ideas about how the concerns that you have raised can be
>> addressed in the increasing order of priority.
>>
>> - To begin with, I think we can have the following model to have the
>> responsibility of the broadcast CPU float around certain CPUs. i.e. Not
>> have a dedicated broadcast CPU. I will refer to the broadcast CPU as the
>> bc_cpu henceforth for convenience.
>>
>> 1. The first CPU that intends to enter deep sleep state will be the bc_cpu.
>>
>> 2. Every other CPU that intends to enter deep idle state will enter
>> themselves into a mask, say the bc_mask, which is already being done
>> today, after they check that a bc_cpu has been assigned.
>>
>> 3. The bc_cpu should not enter tickless idle, until step 5a holds true.
>>
>> 4. So on every timer interrupt, which is at-least every period, it
>> checks the bc_mask to see if any CPUs need to be woken up.
>>
>> 5. The bc cpu should not enter tickless idle *until* it is de-nominated
>> as the bc_cpu. The de-nomination occurs when:
>> a. In one of its timer interrupts, it does broadcast handling to find
>> out that there are no CPUs to be woken up.
>>
>> 6. So if 5a holds, then there is no bc_cpu anymore until a CPU decides
>> to enter deep idle state again, in which case steps 1 to 5 repeat.
>>
>>
>> - We could optimize this further, to allow the bc_cpu to enter tickless
>> idle, even while it is nominated as one. This can be the next step, if
>> we can get the above to work stably.
>>
>> You have already brought out this point, so I will just reword it. Each
>> time broadcast handling is done, the bc_cpu needs to check if the wakeup
>> time of a CPU, that has entered deep idle state, and is yet to be woken
>> up, is before the bc_cpu's wakeup time, which was programmed to its
>> local events.
>>
>> If so, then reprogram the decrementer to the wakeup time of a CPU that
>> is in deep idle state.
>>
>> But we need to keep in mind one point. When CPUs go into deep idle, they
>> cannot program the local timer of the bc_cpu to their wakeup time. This
>> is because a CPU cannot program the timer of a remote CPU.
>>
>> Therefore the only time we can check if 'wakeup time of the CPU that
>> enters deep idle state is before broadcast CPU's intended wake time so
>> as to reprogram the decrementer', is in the broadcast handler itself,
>> which is done *on* the bc_cpu alone.
>>
>>
>>
>> What do you think?
>>
>>
>> - Coming to your third suggestion of aligning the wakeup time of CPUs, I
>> will spend some time on this and get back regarding the same.
>
> Hi Preeti,
>
> One of Ben's suggestions is to coarse grain the waker's timer event.
> The trade off is whether we issue an IPI for each CPU needing a wakeup
> or let the bc_cpu wakeup periodically and *see* that there is a new
> request. The interval for a wakeup request will be much coarse grain
> than a tick. We maybe able to easily reduce the power impact of not
> letting bc_cpu go tickless by choosing a right coarse grain period.
> For example we can let the bc_cpu look for new wakeup requests once in
> every 10 or 20 jiffies rather than every jiffy and align the wakeup
> requests at this coarse grain wakeup. We do pay a power penalty by
> waking up few jiffies earlier which we can mitigate by reevaluating
> the situation and queueing a fine grain timer to the right jiffy on
> the bc_cpu if such a situation arises.
>
> The point is a new wakeup request will *ask* for a wakeup later than
> the coarse grain period. So the bc_cpu can wakeup at the coarse time
> period and reprogram its timer to the right jiffy.
>
> --Vaidy

Thanks Ben, Vaidy for your suggestions.

I will work on the second version of this patchset, which will address
two major issues that were brought out in this thread:

1. Dynamically choosing a broadcast CPU and floating this
responsibility around. This has a lot of scope for optimization and can
be done in steps. To begin with, we could nominate the first CPU that
goes to sleep as the broadcast CPU. It would be relieved of this duty
when there are no more CPUs in deep idle states to be woken up. The
next CPU to enter deep idle after that would then be nominated as the
broadcast CPU.
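Roughly, the nomination could look something like this (illustrative
only; bc_cpu, bc_mask and the function names are placeholders, not code
from the current patches):

        #include <linux/types.h>
        #include <linux/atomic.h>
        #include <linux/cpumask.h>

        static atomic_t bc_cpu = ATOMIC_INIT(-1);       /* -1: nobody nominated */
        static struct cpumask bc_mask;                  /* deep-idle CPUs awaiting wakeup */

        /* The first CPU heading into deep idle claims the broadcast duty. */
        static bool try_become_bc_cpu(int cpu)
        {
                return atomic_cmpxchg(&bc_cpu, -1, cpu) == -1;
        }

        /* Called from the broadcast CPU's timer interrupt: resign once there
         * are no more deep-idle CPUs left to wake up. */
        static void maybe_resign_bc_cpu(int cpu)
        {
                if (cpumask_empty(&bc_mask))
                        atomic_cmpxchg(&bc_cpu, cpu, -1);
        }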

However, although this floating nomination addresses the problems
associated with a dedicated broadcast CPU, it still does not solve the
issue of the broadcast CPU having to refrain from entering tickless
idle.

Point 2 will address this issue to an extent.

2. Have a timer on the broadcast CPU that is specifically intended to
wake up CPUs in deep idle states. This timer will have a fixed period
much larger than a jiffy, yet short enough not to miss the wakeup of
any CPU in a deep idle state. If a CPU entering deep idle needs to wake
up sooner than this fixed period, it will send an IPI to the broadcast
CPU asking it to reprogram its decrementer for that wakeup. This will
allow the broadcast CPU to enter tickless idle.

This is as opposed to the current approach, where the broadcast CPU
wakes up on every periodic tick to check whether broadcast handling is
required.
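In sketch form, the idle-entry side of point 2 could look like this
(again only illustrative; the names are placeholders and locking is
omitted):

        #include <linux/ktime.h>
        #include <linux/cpumask.h>

        static struct cpumask bc_mask;          /* CPUs in deep idle awaiting wakeup */
        static int bc_cpu;                      /* currently nominated broadcast CPU */
        static ktime_t bc_next_expiry;          /* next expiry of the broadcast timer */

        /* Stub for the sketch: would send the reprogram-decrementer IPI. */
        static void smp_send_bc_reprogram(int cpu, ktime_t wakeup)
        {
        }

        static void notify_bc_cpu_before_deep_idle(int cpu, ktime_t wakeup)
        {
                cpumask_set_cpu(cpu, &bc_mask);

                /* IPI the bc_cpu only if it would otherwise wake up too late
                 * for us (synchronization around bc_next_expiry omitted). */
                if (ktime_to_ns(wakeup) < ktime_to_ns(bc_next_expiry))
                        smp_send_bc_reprogram(bc_cpu, wakeup);

                /* ... then enter the deep idle state with the decrementer off. */
        }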

Do let me know if there are points I have missed that need to be
considered as pressing next steps.

Thank you

Regards
Preeti U Murthy