2012-02-24 22:33:47

by Venkatesh Pallipadi

Subject: [PATCH 0/4] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v2

Addressed various comments on the previous version. I did have to avoid the
x86 smpboot cleanup that Ingo suggested, as it did not seem trivial to me :-).
I also separated out the change that does percpu idle task caching; that
change does provide a measurable reduction in IPI sender overhead.

Previous versions
* RFC - https://lkml.org/lkml/2012/2/6/357
* v1 - https://lkml.org/lkml/2012/2/22/512

Changes since previous versions:
RFC to v1
Moved the changes into arch-specific code, as PeterZ suggested (mostly)
Got rid of the new per-cpu state logic in favor of TIF flag bits

v1 to v2
Generic TS_POLLING cleanup
Really, really no change to generic code (other than the TS_POLLING cleanup)
Single bit in TIF flags. Had to drop the micro-optimization of avoiding a
second IPI to a CPU that already has one pending, in favor of keeping the
code simple and fast in the common case.
Added irq_enter/irq_exit around the pending interrupt handlers
Extended the optimization to cover C1 mwait and poll_idle()
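
For reference, the handshake at the heart of the series can be modeled in a
few lines of userspace C. This is only an illustration, not kernel code: the
IN_IPILESS_IDLE bit and the busy-wait loop stand in for TIF_IN_IPILESS_IDLE
and for mwait parked on the monitored thread_info->flags cacheline; the
waker's test-and-clear both claims the wakeup and, by writing the monitored
line, wakes the sleeper.

	/* Build with: cc -pthread. Illustration only; names are made up. */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define IN_IPILESS_IDLE	(1u << 0)

	static atomic_uint flags;

	static void *idle_cpu(void *arg)
	{
		(void)arg;
		atomic_fetch_or(&flags, IN_IPILESS_IDLE);	/* enter_ipiless_idle() */
		while (atomic_load(&flags) & IN_IPILESS_IDLE)	/* __mwait() stand-in */
			;					/* cpu_relax() */
		puts("idle CPU woke up without an IPI");
		return NULL;
	}

	static int try_ipiless_wakeup(void)
	{
		/* Clearing the bit IS the wakeup: it writes the watched line. */
		return atomic_fetch_and(&flags, ~IN_IPILESS_IDLE) & IN_IPILESS_IDLE;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, idle_cpu, NULL);
		while (!try_ipiless_wakeup())
			;	/* kernel would send a real IPI instead of retrying */
		pthread_join(&t, NULL);
		return 0;
	}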


2012-02-24 22:33:55

by Venkatesh Pallipadi

Subject: [PATCH 1/4] tile, ia64, x86: TS_POLLING cleanup

Essentially a revert of commit 495ab9c045e1b0e5c82951b762257fe1c9d81564,
its dependent commit 0888f06ac99f993df2bb4c479f5b9306dafe154f, and
some other subsequent dependent changes.

commit 495ab9c mentions memory traffic potentially caused by the atomic
set and clear of this bit in the idle entry/exit path. But:
* On most recent CPUs, there should be no extra memory traffic due to a
lock-prefixed set and clear from the local CPU (as long as the variable
is in cache).
* Most recent x86 CPUs use mwait-based idle variants, which do not
set/clear this bit at all.
* With tickless idle we have far fewer entries into and exits from idle,
and we also do a lot more work along the way, compared to when this
optimization was done.

So reverting these changes now makes sense, cleaning up the code and
leaving all archs in sync wrt TIF_POLLING_NRFLAG.
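
The ordering protocol is unchanged by this patch; only the storage for the
polling bit moves. A minimal userspace model of that protocol, with C11
atomics standing in for the kernel's flag ops and barriers (illustrative
names, not kernel symbols):

	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_bool polling;		/* TIF_POLLING_NRFLAG stand-in */
	static atomic_bool resched_pending;	/* TIF_NEED_RESCHED stand-in */

	/* Idle side, just before a non-polling sleep such as hlt: */
	static bool idle_may_sleep(void)
	{
		atomic_store(&polling, false);			/* clear_thread_flag() */
		atomic_thread_fence(memory_order_seq_cst);	/* smp_mb__after_clear_bit() */
		return !atomic_load(&resched_pending);		/* recheck need_resched() */
	}

	/* Waker side, as in resched_task(): returns true if an IPI is needed. */
	static bool waker_needs_ipi(void)
	{
		atomic_store(&resched_pending, true);		/* set TIF_NEED_RESCHED */
		atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
		return !atomic_load(&polling);			/* poller will notice */
	}

	int main(void)
	{
		atomic_store(&polling, true);		/* CPU enters poll_idle() */
		bool ipi = waker_needs_ipi();		/* false: target is polling */
		bool sleep_ok = idle_may_sleep();	/* false: recheck sees wakeup */
		return ipi || sleep_ok;			/* both false in this order */
	}

The paired barriers make it impossible for both sides to miss each other:
either the waker still sees the polling bit set, or the idle CPU's recheck
sees NEED_RESCHED before it halts.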

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
arch/ia64/include/asm/thread_info.h | 5 +----
arch/ia64/kernel/process.c | 10 +++-------
arch/tile/include/asm/thread_info.h | 7 +++----
arch/tile/kernel/process.c | 13 ++++---------
arch/x86/include/asm/thread_info.h | 8 +++-----
arch/x86/kernel/apm_32.c | 12 ++++--------
arch/x86/kernel/process.c | 10 +++-------
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
drivers/acpi/processor_idle.c | 34 +++++++++++-----------------------
kernel/sched/core.c | 8 ++------
11 files changed, 36 insertions(+), 75 deletions(-)

diff --git a/arch/ia64/include/asm/thread_info.h b/arch/ia64/include/asm/thread_info.h
index e054bcc..81cb5bb 100644
--- a/arch/ia64/include/asm/thread_info.h
+++ b/arch/ia64/include/asm/thread_info.h
@@ -133,10 +133,7 @@ struct thread_info {
/* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */
#define TIF_WORK_MASK (TIF_ALLWORK_MASK&~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT))

-#define TS_POLLING 1 /* true if in idle loop and not sleeping */
-#define TS_RESTORE_SIGMASK 2 /* restore signal mask in do_signal() */
-
-#define tsk_is_polling(t) (task_thread_info(t)->status & TS_POLLING)
+#define TS_RESTORE_SIGMASK 1 /* restore signal mask in do_signal() */

#ifndef __ASSEMBLY__
#define HAVE_SET_RESTORE_SIGMASK 1
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 6d33c5c..5929104 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -301,14 +301,10 @@ cpu_idle (void)
/* endless idle loop with no priority at all */
while (1) {
if (can_do_pal_halt) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();
} else {
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
}

if (!need_resched()) {
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index bc4f562..1386698 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void cpu_idle_on_new_stack(struct thread_info *old_ti,
#define TIF_SECCOMP 6 /* secure computing */
#define TIF_MEMDIE 7 /* OOM killer at work */
#define TIF_NOTIFY_RESUME 8 /* callback before returning to user */
+#define TIF_POLLING_NRFLAG 9 /* true if poll_idle() is polling TIF_NEED_RESCHED */

#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
@@ -136,6 +137,7 @@ extern void cpu_idle_on_new_stack(struct thread_info *old_ti,
#define _TIF_SECCOMP (1<<TIF_SECCOMP)
#define _TIF_MEMDIE (1<<TIF_MEMDIE)
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
+#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)

/* Work to do on any return to user space. */
#define _TIF_ALLWORK_MASK \
@@ -152,10 +154,7 @@ extern void cpu_idle_on_new_stack(struct thread_info *old_ti,
#ifdef __tilegx__
#define TS_COMPAT 0x0001 /* 32-bit compatibility mode */
#endif
-#define TS_POLLING 0x0004 /* in idle loop but not sleeping */
-#define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal */
-
-#define tsk_is_polling(t) (task_thread_info(t)->status & TS_POLLING)
+#define TS_RESTORE_SIGMASK 0x0002 /* restore signal mask in do_signal */

#ifndef __ASSEMBLY__
#define HAVE_SET_RESTORE_SIGMASK 1
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index 4c1ac6e..8725333 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -72,8 +72,7 @@ void cpu_idle(void)
{
int cpu = smp_processor_id();

-
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

if (no_idle_nap) {
while (1) {
@@ -93,18 +92,14 @@ void cpu_idle(void)

local_irq_disable();
__get_cpu_var(irq_stat).idle_timestamp = jiffies;
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();

if (!need_resched())
_cpu_idle();
else
local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
}
rcu_idle_exit();
tick_nohz_idle_exit();
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index bc817cd..a4d3888 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
#define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
+#define TIF_POLLING_NRFLAG 29 /* true if poll_idle() is polling TIF_NEED_RESCHED */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -116,6 +117,7 @@ struct thread_info {
#define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP)
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
@@ -250,11 +252,7 @@ static inline struct thread_info *current_thread_info(void)
#define TS_USEDFPU 0x0001 /* FPU was used by this task
this quantum (SMP) */
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
-#define TS_POLLING 0x0004 /* idle task polling need_resched,
- skip sending interrupt */
-#define TS_RESTORE_SIGMASK 0x0008 /* restore signal mask in do_signal() */
-
-#define tsk_is_polling(t) (task_thread_info(t)->status & TS_POLLING)
+#define TS_RESTORE_SIGMASK 0x0004 /* restore signal mask in do_signal() */

#ifndef __ASSEMBLY__
#define HAVE_SET_RESTORE_SIGMASK 1
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index f76623c..d3e8b4d 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -822,21 +822,17 @@ static int apm_do_idle(void)
int polling;
int err = 0;

- polling = !!(current_thread_info()->status & TS_POLLING);
+ polling = test_thread_flag(TIF_POLLING_NRFLAG);
if (polling) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();
}
if (!need_resched()) {
idled = 1;
ret = apm_bios_call_simple(APM_FUNC_IDLE, 0, 0, &eax, &err);
}
if (polling)
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

if (!idled)
return 0;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 15763af..99a8109 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -379,18 +379,14 @@ void default_idle(void)
if (hlt_use_halt()) {
trace_power_start(POWER_CSTATE, 1, smp_processor_id());
trace_cpu_idle(1, smp_processor_id());
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();

if (!need_resched())
safe_halt(); /* enables interrupts racelessly */
else
local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
trace_power_end(smp_processor_id());
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
} else {
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 324cd72..5de6bb1 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -95,7 +95,7 @@ void cpu_idle(void)
*/
boot_init_stack_canary();

- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

/* endless idle loop with no priority at all */
while (1) {
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 753e803..98b1854 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -109,7 +109,7 @@ static inline void play_dead(void)
*/
void cpu_idle(void)
{
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

/*
* If we're the non-boot CPU, nothing set the stack canary up
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 0e8e2de..b7a0f0d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -131,17 +131,13 @@ static struct dmi_system_id __cpuinitdata processor_power_dmi_table[] = {
*/
static void acpi_safe_halt(void)
{
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();
if (!need_resched()) {
safe_halt();
local_irq_disable();
}
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
}

#ifdef ARCH_APICTIMER_STOPS_ON_C3
@@ -795,15 +791,11 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
local_irq_disable();

if (cx->entry_method != ACPI_CSTATE_FFH) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we test
- * NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();

if (unlikely(need_resched())) {
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
local_irq_enable();
return -EINVAL;
}
@@ -835,7 +827,7 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,

local_irq_enable();
if (cx->entry_method != ACPI_CSTATE_FFH)
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

cx->usage++;

@@ -887,15 +879,11 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
local_irq_disable();

if (cx->entry_method != ACPI_CSTATE_FFH) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we test
- * NEED_RESCHED:
- */
- smp_mb();
+ clear_thread_flag(TIF_POLLING_NRFLAG);
+ smp_mb__after_clear_bit();

if (unlikely(need_resched())) {
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);
local_irq_enable();
return -EINVAL;
}
@@ -955,7 +943,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,

local_irq_enable();
if (cx->entry_method != ACPI_CSTATE_FFH)
- current_thread_info()->status |= TS_POLLING;
+ set_thread_flag(TIF_POLLING_NRFLAG);

cx->usage++;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5255c9d..9809c8b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -502,10 +502,6 @@ static inline void init_hrtick(void)
*/
#ifdef CONFIG_SMP

-#ifndef tsk_is_polling
-#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
-#endif
-
void resched_task(struct task_struct *p)
{
int cpu;
@@ -523,7 +519,7 @@ void resched_task(struct task_struct *p)

/* NEED_RESCHED must be visible before we test polling */
smp_mb();
- if (!tsk_is_polling(p))
+ if (!test_tsk_thread_flag(p, TIF_POLLING_NRFLAG))
smp_send_reschedule(cpu);
}

@@ -602,7 +598,7 @@ void wake_up_idle_cpu(int cpu)

/* NEED_RESCHED must be visible before we test polling */
smp_mb();
- if (!tsk_is_polling(rq->idle))
+ if (!test_tsk_thread_flag(rq->idle, TIF_POLLING_NRFLAG))
smp_send_reschedule(cpu);
}

--
1.7.7.3

2012-02-24 22:34:02

by Venkatesh Pallipadi

Subject: [PATCH 2/4] x86: Mwait idle optimization to avoid CAL+RES IPIs -v2

smp_call_function_single() and ttwu_queue_remote() send an unconditional IPI
to the target CPU. However, if the target CPU is in some form of poll-based
idle, we can do IPI-less wakeups.
Doing this has certain advantages:
* Lower overhead on the async IPI send path. Measurements on Westmere-based
systems show savings on "no wait" smp_call_function_single() with an idle
target CPU (as measured on the sender side).
local socket smp_call_func cost goes from ~1600 to ~1100 cycles
remote socket smp_call_func cost goes from ~2000 to ~1800 cycles
* Avoiding actual interrupts shows a measurable reduction (10%) in system
non-idle cycles and cache references with a micro-benchmark sending IPIs
from one CPU to all the other, mostly idle, CPUs in the system.
* On a mostly idle system, turbostat shows a tiny decrease in C0 (active)
time and a corresponding increase in C6 state (each row is a 10-minute avg):
%c0 %c1 %c6
Before
Run 1 1.51 2.93 95.55
Run 2 1.48 2.86 95.65
Run 3 1.46 2.78 95.74
After
Run 1 1.35 2.63 96.00
Run 2 1.46 2.78 95.74
Run 3 1.37 2.63 95.98

We started looking at this with one of our workloads where the system is
partially busy: we noticed kernel hotspots in find_next_bit and
default_send_IPI_mask_sequence_phys, coming from sched wakeups (futex
wakeups) and networking call functions.

Thanks to Suresh for the suggestion of using TIF flags instead of
having a new percpu state variable and complicated update logic.

Notes:
* This only helps when the target CPU is idle. When it is busy, we still
send the IPI as before.
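
The subtle part of the scheme is the split between the two new flags:
TIF_IN_IPILESS_IDLE may be cleared remotely by the waker, while
TS_IPILESS_WAKEUP is purely local bookkeeping that the IPI work is still
owed. A userspace model of that bookkeeping (illustrative only; the names
mirror, but are not, the kernel flags):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_bool in_ipiless_idle;	/* TIF_IN_IPILESS_IDLE stand-in */
	static bool ipiless_work_pending;	/* TS_IPILESS_WAKEUP stand-in, local only */

	static void exit_ipiless_idle(void)
	{
		/* If the waker already stole the bit, an IPI-less wakeup
		 * happened and its work (call-function + scheduler IPI)
		 * is still owed. */
		if (!atomic_exchange(&in_ipiless_idle, false))
			ipiless_work_pending = true;
	}

	static void do_ipiless_pending_work(void)
	{
		if (ipiless_work_pending) {
			ipiless_work_pending = false;
			puts("run call-function and scheduler-IPI handlers here");
		}
	}

	int main(void)
	{
		atomic_store(&in_ipiless_idle, true);	  /* target enters idle */
		atomic_exchange(&in_ipiless_idle, false); /* remote waker clears the bit */
		exit_ipiless_idle();			  /* local exit finds it cleared */
		do_ipiless_pending_work();		  /* ...and runs the owed work */
		return 0;
	}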

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
arch/x86/include/asm/ipiless_wake.h | 84 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/thread_info.h | 3 +
arch/x86/kernel/acpi/cstate.c | 7 ++-
arch/x86/kernel/process_32.c | 2 +
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/smp.c | 8 +++
6 files changed, 104 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/ipiless_wake.h

diff --git a/arch/x86/include/asm/ipiless_wake.h b/arch/x86/include/asm/ipiless_wake.h
new file mode 100644
index 0000000..a490dd3
--- /dev/null
+++ b/arch/x86/include/asm/ipiless_wake.h
@@ -0,0 +1,84 @@
+#ifndef _ASM_X86_IPILESS_WAKE_H
+#define _ASM_X86_IPILESS_WAKE_H
+
+#include <linux/hardirq.h>
+#include <linux/sched.h>
+#include <asm/thread_info.h>
+
+#ifdef CONFIG_SMP
+
+/*
+ * TIF_IN_IPILESS_IDLE: CPU is in an idle state with ipiless wakeup
+ * capability, without any pending IPIs.
+ * It is conditionally reset by an IPI source CPU and the reset automatically
+ * brings the target CPU out of its idle state.
+ *
+ * TS_IPILESS_WAKEUP is only changed by the local CPU; it records that there
+ * is pending IPI work to do after the complete idle exit.
+ */
+
+static inline void enter_ipiless_idle(void)
+{
+ set_thread_flag(TIF_IN_IPILESS_IDLE);
+}
+
+static inline void exit_ipiless_idle(void)
+{
+ if (!test_and_clear_thread_flag(TIF_IN_IPILESS_IDLE)) {
+ /*
+ * Flag was already cleared, indicating that there is
+ * a pending IPIless wakeup.
+ * Save that info in status for later use.
+ */
+ current_thread_info()->status |= TS_IPILESS_WAKEUP;
+ }
+}
+
+static inline int is_ipiless_wakeup_pending(void)
+{
+ return need_resched() ||
+ unlikely(!test_thread_flag(TIF_IN_IPILESS_IDLE));
+}
+
+static inline void do_ipiless_pending_work(void)
+{
+ if (unlikely(current_thread_info()->status & TS_IPILESS_WAKEUP)) {
+ current_thread_info()->status &= ~TS_IPILESS_WAKEUP;
+
+ local_bh_disable();
+ local_irq_disable();
+
+ irq_enter();
+ generic_smp_call_function_single_interrupt();
+ irq_exit();
+
+ scheduler_ipi(); /* Does its own irq enter/exit */
+
+ local_irq_enable();
+ local_bh_enable(); /* Needed for bh handling */
+ }
+}
+
+static inline int try_ipiless_wakeup(int cpu)
+{
+ struct thread_info *idle_ti = task_thread_info(idle_task(cpu));
+
+ if (!(idle_ti->flags & _TIF_IN_IPILESS_IDLE))
+ return 0;
+
+ return test_and_clear_bit(TIF_IN_IPILESS_IDLE,
+ (unsigned long *)&idle_ti->flags);
+}
+
+#else
+static inline void do_ipiless_pending_work(void) { }
+static inline void enter_ipiless_idle(void) { }
+static inline void exit_ipiless_idle(void) { }
+
+static inline int is_ipiless_wakeup_pending(void)
+{
+ return need_resched();
+}
+#endif
+
+#endif /* _ASM_X86_IPILESS_WAKE_H */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a4d3888..3c5ae3b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -96,6 +96,7 @@ struct thread_info {
#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_POLLING_NRFLAG 29 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_IN_IPILESS_IDLE 30 /* Task in IPIless idle state */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -118,6 +119,7 @@ struct thread_info {
#define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
+#define _TIF_IN_IPILESS_IDLE (1 << TIF_IN_IPILESS_IDLE)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
@@ -253,6 +255,7 @@ static inline struct thread_info *current_thread_info(void)
this quantum (SMP) */
#define TS_COMPAT 0x0002 /* 32bit syscall active (64BIT)*/
#define TS_RESTORE_SIGMASK 0x0004 /* restore signal mask in do_signal() */
+#define TS_IPILESS_WAKEUP 0x0008 /* pending IPI-work on idle exit */

#ifndef __ASSEMBLY__
#define HAVE_SET_RESTORE_SIGMASK 1
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f50e7fb..30ab435 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -12,6 +12,7 @@
#include <linux/sched.h>

#include <acpi/processor.h>
+#include <asm/ipiless_wake.h>
#include <asm/acpi.h>
#include <asm/mwait.h>

@@ -161,15 +162,17 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
*/
void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
{
- if (!need_resched()) {
+ enter_ipiless_idle();
+ if (!is_ipiless_wakeup_pending()) {
if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
clflush((void *)&current_thread_info()->flags);

__monitor((void *)&current_thread_info()->flags, 0, 0);
smp_mb();
- if (!need_resched())
+ if (!is_ipiless_wakeup_pending())
__mwait(ax, cx);
}
+ exit_ipiless_idle();
}

void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5de6bb1..014e26d 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -44,6 +44,7 @@
#include <asm/system.h>
#include <asm/ldt.h>
#include <asm/processor.h>
+#include <asm/ipiless_wake.h>
#include <asm/i387.h>
#include <asm/desc.h>
#ifdef CONFIG_MATH_EMULATION
@@ -116,6 +117,7 @@ void cpu_idle(void)
if (cpuidle_idle_call())
pm_idle();
start_critical_timings();
+ do_ipiless_pending_work();
}
rcu_idle_exit();
tick_nohz_idle_exit();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 98b1854..777bb7d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -42,6 +42,7 @@
#include <asm/pgtable.h>
#include <asm/system.h>
#include <asm/processor.h>
+#include <asm/ipiless_wake.h>
#include <asm/i387.h>
#include <asm/mmu_context.h>
#include <asm/prctl.h>
@@ -148,6 +149,7 @@ void cpu_idle(void)

rcu_idle_exit();
start_critical_timings();
+ do_ipiless_pending_work();

/* In many cases the interrupt that ended idle
has already called exit_idle. But some idle
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 66c74f4..4b44bef 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -27,6 +27,7 @@
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
+#include <asm/ipiless_wake.h>
#include <asm/proto.h>
#include <asm/apic.h>
#include <asm/nmi.h>
@@ -120,11 +121,18 @@ static void native_smp_send_reschedule(int cpu)
WARN_ON(1);
return;
}
+
+ if (try_ipiless_wakeup(cpu))
+ return;
+
apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
}

void native_send_call_func_single_ipi(int cpu)
{
+ if (try_ipiless_wakeup(cpu))
+ return;
+
apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
}

--
1.7.7.3

2012-02-24 22:34:10

by Venkatesh Pallipadi

Subject: [PATCH 3/4] x86: Extend IPIless wake to C1_mwait and poll_idle

poll_idle(), mwait_idle() [which is used for C1 mwait], and any other
polling-based idle loop can also use the IPI-less wakeup logic added to
mwait_idle_with_hints() in the earlier patch.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
arch/x86/kernel/process.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 99a8109..43bb0a5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -22,6 +22,7 @@
#include <asm/uaccess.h>
#include <asm/i387.h>
#include <asm/debugreg.h>
+#include <asm/ipiless_wake.h>

struct kmem_cache *task_xstate_cachep;
EXPORT_SYMBOL_GPL(task_xstate_cachep);
@@ -445,7 +446,8 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait);
/* Default MONITOR/MWAIT with no hints, used for default C1 state */
static void mwait_idle(void)
{
- if (!need_resched()) {
+ enter_ipiless_idle();
+ if (!is_ipiless_wakeup_pending()) {
trace_power_start(POWER_CSTATE, 1, smp_processor_id());
trace_cpu_idle(1, smp_processor_id());
if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
@@ -453,7 +455,7 @@ static void mwait_idle(void)

__monitor((void *)&current_thread_info()->flags, 0, 0);
smp_mb();
- if (!need_resched())
+ if (!is_ipiless_wakeup_pending())
__sti_mwait(0, 0);
else
local_irq_enable();
@@ -461,6 +463,7 @@ static void mwait_idle(void)
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
} else
local_irq_enable();
+ exit_ipiless_idle();
}

/*
@@ -470,13 +473,15 @@ static void mwait_idle(void)
*/
static void poll_idle(void)
{
+ enter_ipiless_idle();
trace_power_start(POWER_CSTATE, 0, smp_processor_id());
trace_cpu_idle(0, smp_processor_id());
local_irq_enable();
- while (!need_resched())
+ while (!is_ipiless_wakeup_pending())
cpu_relax();
trace_power_end(smp_processor_id());
trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id());
+ exit_ipiless_idle();
}

/*
--
1.7.7.3

2012-02-24 22:34:08

by Venkatesh Pallipadi

Subject: [PATCH 4/4] x86: Optimize try_ipiless_wakeup to avoid idle task lookup

Optimize try_ipiless_wakeup() by caching the idle task's ti_flags pointer
in a percpu area. This shows a measurable reduction in the cost of an async
smp_call_function_single() to a target CPU that is mwait-idle: ~50-100
cycles saved on the IPI send side, out of a total of ~1200 (local) or
~1900 (remote) cycles.
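
The change is just hoisting an invariant pointer computation out of the send
path into a boot-time per-cpu cache. A plain-C sketch of the idea (toy types;
the idle_tasks[] array and lookup stand in for the kernel's idle_task(cpu)
and per_cpu() machinery):

	#include <stdio.h>

	#define NR_CPUS 4

	struct thread_info { unsigned int flags; };
	struct task_struct { struct thread_info ti; };

	static struct task_struct idle_tasks[NR_CPUS];

	/* Stand-in for the pointer chasing the patch removes from the hot path. */
	static struct thread_info *lookup_idle_ti(int cpu)
	{
		return &idle_tasks[cpu].ti;	/* kernel: task_thread_info(idle_task(cpu)) */
	}

	/* Cache, filled once at "boot", read on every wakeup attempt. */
	static unsigned int *idletask_ti_flags[NR_CPUS];

	int main(void)
	{
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			idletask_ti_flags[cpu] = &lookup_idle_ti(cpu)->flags;

		/* Hot path now reads one array slot instead of chasing task pointers. */
		printf("cpu1 idle flags: %u\n", *idletask_ti_flags[1]);
		return 0;
	}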

Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
arch/x86/include/asm/ipiless_wake.h | 7 ++++---
arch/x86/kernel/smpboot.c | 4 ++++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ipiless_wake.h b/arch/x86/include/asm/ipiless_wake.h
index a490dd3..232ce36 100644
--- a/arch/x86/include/asm/ipiless_wake.h
+++ b/arch/x86/include/asm/ipiless_wake.h
@@ -7,6 +7,7 @@

#ifdef CONFIG_SMP

+DECLARE_PER_CPU(__u32 *, idletask_ti_flags);
/*
* TIF_IN_IPILESS_IDLE: CPU is in an idle state with ipiless wakeup
* capability, without any pending IPIs.
@@ -61,13 +62,13 @@ static inline void do_ipiless_pending_work(void)

static inline int try_ipiless_wakeup(int cpu)
{
- struct thread_info *idle_ti = task_thread_info(idle_task(cpu));
+ __u32 *ti_flags = per_cpu(idletask_ti_flags, cpu);

- if (!(idle_ti->flags & _TIF_IN_IPILESS_IDLE))
+ if (!(*ti_flags & _TIF_IN_IPILESS_IDLE))
return 0;

return test_and_clear_bit(TIF_IN_IPILESS_IDLE,
- (unsigned long *)&idle_ti->flags);
+ (unsigned long *)ti_flags);
}

#else
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 66d250c..33339e2 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -133,6 +133,8 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_llc_shared_map);
DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
EXPORT_PER_CPU_SYMBOL(cpu_info);

+DEFINE_PER_CPU(__u32 *, idletask_ti_flags);
+
atomic_t init_deasserted;

/*
@@ -715,6 +717,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
set_idle_for_cpu(cpu, c_idle.idle);
do_rest:
per_cpu(current_task, cpu) = c_idle.idle;
+ per_cpu(idletask_ti_flags, cpu) = &task_thread_info(c_idle.idle)->flags;
#ifdef CONFIG_X86_32
/* Stack for startup_32 can be just as for start_secondary onwards */
irq_ctx_init(cpu);
@@ -1143,6 +1146,7 @@ void __init native_smp_prepare_boot_cpu(void)
/* already set me in cpu_online_mask in boot_cpu_init() */
cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;
+ per_cpu(idletask_ti_flags, me) = &task_thread_info(current)->flags;
}

void __init native_smp_cpus_done(unsigned int max_cpus)
--
1.7.7.3

2012-02-27 07:33:05

by Ingo Molnar

Subject: Re: [PATCH 0/4] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v2


* Venkatesh Pallipadi <[email protected]> wrote:

> [...] I did have to avoid the x86 smpboot cleanup that Ingo
> suggested, as it did not seem trivial to me :-).

Well, idletask_ti_flags is unacceptably ugly; please do the
smpboot.c cleanup, because that paves the way to adding new features
to the x86 idle code.

Thanks,

Ingo

2012-02-27 18:21:25

by Venkatesh Pallipadi

Subject: Re: [PATCH 0/4] Extend mwait idle to optimize away CAL and RES interrupts to an idle CPU -v2

On Sun, Feb 26, 2012 at 11:32 PM, Ingo Molnar <[email protected]> wrote:
>
> * Venkatesh Pallipadi <[email protected]> wrote:
>
>> [...] I did have to avoid the x86 smpboot cleanup that Ingo
>> suggested, as it did not seem trivial to me :-).
>
> Well, idletask_ti_flags is unacceptably ugly, please do the
> smpboot.c cleanup because that paves the way to add new features
> to the x86 idle code.
>

OK. Will look at this soon.
If everything else looks OK, can you queue up the first 3 patches in
the series somewhere so that they get some wider testing? Things should
work without the idletask_ti_flags change.

Thanks,
Venki