2013-06-03 19:27:05

by Seiji Aguchi

Subject: [PATCH v13 0/3] trace,x86: irq vector tracepoint support

Change log
v12 -> v13
- Rebase to 3.10-rc3
- Patch 2/3
  - Separate a patch introducing entering/exiting_irq() from patch-v12 3/3.
- Patch 3/3
  - Introduce write_trace_idt_entry() to remove "ifdef CONFIG_TRACING"
    from _set_gate().
  - Change DEFINE_IRQ_VECTOR_EVENT to save on repeat code.
  - Add a path of irq_vector.h to Makefile.
  - Introduce current_idt_descr_ptr and load_current_idt() to switch
    IDT in a generic way with handling Debug traps/NMI and cpu hotplug.

v11 -> v12
- Rebase to 3.9-rc5

v10 -> v11
- Rebase to 3.9-rc2
- Add a modification for hyperv_callback vector. (patch 2/3)
- Change the way the IDT is switched: check which table is in use instead of saving/restoring it,
  because the saving/restoring functions would break if we had to add another table. (patch 2/3)

v9 -> v10
- Add an explanation of why the tracepoints have to be placed inside the irq enter/exit handling. (patch 3/3)

v8 -> v9
- Rebase to 3.8-rc6
- Add Steven's email address at the top of the message and
  move my signed-off-by below Steven's because the patch was
  originally created by Steven. (patch 1/3)
- Introduce an irq_vector_mutex to avoid a race at registering/unregistering
  time. (patch 2/3)
- Use per_cpu data for orig_idt_descr because an IDT descriptor is needed for each cpu
  and per_cpu data is the appropriate data type. Suggested by Steven.
  (patch 2/3)

v7 -> v8
- Rebase to 3.8-rc4
- Add a patch 1 introducing DEFINE_EVENT_FN() macro.
- Rename original patches 1 and 2 to 2 and 3.
- Change a definition of tracepoint to use DEFINE_EVENT_FN(). (patch 2)
- Change alloc_intr_gate() to use do{}while(0) to avoid a warning
of checkpatch.pl. (patch 2)
- Move entering_irq()/exiting_irq() to arch/x86/include/asm/apic.h (patch 3)

v6 -> v7
- Divide into two patches to make code review easier.
  Summary of each patch is as follows.
  - Patch 1/2
    - Add an irq_vector tracing infrastructure.
    - Create idt_table for tracing. It is refactored to avoid duplicating
      existing logic.
    - Duplicate the irq handlers into new handlers with tracepoints inserted.

  - Patch 2/2
    - Share common logic among irq handlers to make them
      manageable and readable.

v5 -> v6
- Rebased to 3.7

v4 -> v5
- Rebased to 3.6.0

- Introduce logic that switches the IDT when tracepoints are enabled/disabled
  so that the time penalty is zero when tracepoints are disabled.
  This IDT is created only when CONFIG_TRACEPOINTS is enabled.

- Remove arch_irq_vector_entry/exit and add the following again
  so that we can add each tracepoint in a generic way.
  - error_apic_vector
  - thermal_apic_vector
  - threshold_apic_vector
  - spurious_apic_vector
  - x86_platform_ipi_vector

- Drop nmi tracepoints so that we can begin with apic interrupts and discuss
  the IDT-switching logic first.

- Move irq_vectors.h into arch/x86/include/asm/trace because
  I'm not sure whether the IDT-switching logic can be shared with other architectures.

v3 -> v4
- Add a latency measurement of each tracepoint
- Rebased to 3.6-rc6

v2 -> v3
- Remove the invalidate_tlb_vector event because it was replaced by a call function vector
  in the following commit.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4

v1 -> v2
- Modify variable name from irq to vector.
- Merge the arch-specific tracepoints below into arch_irq_vector_entry/exit.
  - error_apic_vector
  - thermal_apic_vector
  - threshold_apic_vector
  - spurious_apic_vector
  - x86_platform_ipi_vector

[Purpose of this patch]

As Vaibhav explained in the thread below, tracepoints for irq vectors are useful.

http://www.spinics.net/lists/mm-commits/msg85707.html

<snip>
The current interrupt traces from irq_handler_entry and irq_handler_exit provide
when an interrupt is handled.
They provide good data about when the system has switched to kernel space and
how it affects the currently running processes.

There are some IRQ vectors which trigger the system into kernel space, which are
not handled in generic IRQ handlers.
Tracing such events gives us the information about IRQ interaction with other
system events.

The trace also tells where the system is spending its time. We want to
know which cores are handling interrupts and how they are affecting other
processes in the system. Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
<snip>

My use case, on the other hand, is tracing just the local timer event and getting
the value of the instruction pointer.

I previously suggested adding an argument to the local timer event to get the instruction pointer.
But there is another way to get it, with an external module such as systemtap.
So I no longer need to add any argument to the irq vector tracepoints.

[Patch Description]

Vaibhav's patch shared a single pair of tracepoints, irq_vector_entry/irq_vector_exit, among all events.
But the use case above traces a specific irq vector rather than all events.
In that case, we are concerned about the overhead due to unwanted events.

This patch adds the following tracepoints instead of introducing irq_vector_entry/exit,
so that we can enable them independently.
- local_timer_vector
- reschedule_vector
- call_function_vector
- call_function_single_vector
- irq_work_entry_vector
- error_apic_vector
- thermal_apic_vector
- threshold_apic_vector
- spurious_apic_vector
- x86_platform_ipi_vector

Please see descriptions in each patch.

Seiji Aguchi (2):
trace,x86: Introduce entering/exiting_irq()
trace,x86: Add irq vector tracepoints

Steven Rostedt (1):
tracing: Add DEFINE_EVENT_FN() macro

arch/x86/include/asm/apic.h | 27 ++++++++
arch/x86/include/asm/desc.h | 55 +++++++++++++++-
arch/x86/include/asm/entry_arch.h | 8 ++-
arch/x86/include/asm/hw_irq.h | 17 +++++
arch/x86/include/asm/mshyperv.h | 1 +
arch/x86/include/asm/trace/irq_vectors.h | 104 ++++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/apic/Makefile | 1 +
arch/x86/kernel/apic/apic.c | 71 +++++++++++++++++----
arch/x86/kernel/cpu/common.c | 5 +-
arch/x86/kernel/cpu/mcheck/therm_throt.c | 24 +++++--
arch/x86/kernel/cpu/mcheck/threshold.c | 24 +++++--
arch/x86/kernel/entry_32.S | 12 +++-
arch/x86/kernel/entry_64.S | 31 +++++++--
arch/x86/kernel/head_64.S | 6 ++
arch/x86/kernel/irq.c | 31 ++++++---
arch/x86/kernel/irq_work.c | 24 ++++++-
arch/x86/kernel/smp.c | 65 ++++++++++++++++--
arch/x86/kernel/tracepoint.c | 58 +++++++++++++++++
include/linux/tracepoint.h | 2 +
include/trace/define_trace.h | 5 ++
include/trace/ftrace.h | 4 +
include/xen/events.h | 3 +
23 files changed, 520 insertions(+), 59 deletions(-)
create mode 100644 arch/x86/include/asm/trace/irq_vectors.h
create mode 100644 arch/x86/kernel/tracepoint.c


2013-06-03 19:29:29

by Seiji Aguchi

Subject: [PATCH v13 1/3] tracing: Add DEFINE_EVENT_FN() macro

From: Steven Rostedt <[email protected]>

Each TRACE_EVENT() adds several helper functions. If two or more trace events
share the same structure and print format, they can also share most of these
helper functions and save a lot of space from duplicate code. This is why the
DECLARE_EVENT_CLASS() and DEFINE_EVENT() were created.

Some events require a trigger to be called at registering and unregistering of
the event and to do so they use TRACE_EVENT_FN().

If multiple events require a trigger, they currently have no choice but to use
TRACE_EVENT_FN() as there's no DEFINE_EVENT_FN() available. This unfortunately
creates a lot of wasted duplicate code.

By adding a DEFINE_EVENT_FN(), these events can still use a
DECLARE_EVENT_CLASS() and then define their own triggers.
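
To make the idea concrete, here is a rough user-space sketch, not the kernel
macros themselves. Every demo_* and DEMO_* name is hypothetical and exists only
for illustration. It assumes that the reg trigger runs when an event goes from
disabled to enabled and the unreg trigger runs when it goes back to disabled.
Two events share one print helper through a common class while each still
carries its own triggers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an event class: one shared print helper. */
struct demo_event_class {
	int (*print)(char *buf, size_t len, const char *name, int vector);
};

/* Hypothetical stand-in for one event defined from that class. */
struct demo_event {
	const char *name;
	const struct demo_event_class *cls;
	void (*reg)(void);	/* trigger run when the event becomes enabled */
	void (*unreg)(void);	/* trigger run when the event becomes disabled */
	int users;
};

/* Shared by every event of the class, so it exists only once. */
static int demo_vector_print(char *buf, size_t len, const char *name, int vector)
{
	return snprintf(buf, len, "%s: vector=%d", name, vector);
}

static const struct demo_event_class demo_vector_class = { demo_vector_print };

/* Illustrative triggers: they only count how many events are enabled. */
static int demo_trigger_depth;
static void demo_regfunc(void) { demo_trigger_depth++; }
static void demo_unregfunc(void) { demo_trigger_depth--; }

/* Define an event from a class and attach its own triggers. */
#define DEMO_DEFINE_EVENT_FN(cls_, ename, regfn, unregfn) \
	static struct demo_event demo_event_##ename = \
		{ #ename, &(cls_), (regfn), (unregfn), 0 }

DEMO_DEFINE_EVENT_FN(demo_vector_class, local_timer_entry,
		     demo_regfunc, demo_unregfunc);
DEMO_DEFINE_EVENT_FN(demo_vector_class, local_timer_exit,
		     demo_regfunc, demo_unregfunc);

static void demo_event_enable(struct demo_event *ev)
{
	assert(ev != NULL);
	if (ev->users++ == 0 && ev->reg)
		ev->reg();
}

static void demo_event_disable(struct demo_event *ev)
{
	assert(ev != NULL && ev->users > 0);
	if (--ev->users == 0 && ev->unreg)
		ev->unreg();
}
```

In this sketch, adding a third event costs one more small struct rather than
another copy of the print helper, which is the saving the class mechanism is
meant to provide.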

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Seiji Aguchi <[email protected]>
---
include/linux/tracepoint.h | 2 ++
include/trace/define_trace.h | 5 +++++
include/trace/ftrace.h | 4 ++++
3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 2f322c3..9bf59e5 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -378,6 +378,8 @@ static inline void tracepoint_synchronize_unregister(void)
#define DECLARE_EVENT_CLASS(name, proto, args, tstruct, assign, print)
#define DEFINE_EVENT(template, name, proto, args) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
+#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg)\
+ DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
#define DEFINE_EVENT_CONDITION(template, name, proto, \
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index 1905ca8..02e1003 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -44,6 +44,10 @@
#define DEFINE_EVENT(template, name, proto, args) \
DEFINE_TRACE(name)

+#undef DEFINE_EVENT_FN
+#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
+ DEFINE_TRACE_FN(name, reg, unreg)
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_TRACE(name)
@@ -91,6 +95,7 @@
#undef TRACE_EVENT_CONDITION
#undef DECLARE_EVENT_CLASS
#undef DEFINE_EVENT
+#undef DEFINE_EVENT_FN
#undef DEFINE_EVENT_PRINT
#undef DEFINE_EVENT_CONDITION
#undef TRACE_HEADER_MULTI_READ
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 19edd7f..d615f78 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -71,6 +71,10 @@
static struct ftrace_event_call __used \
__attribute__((__aligned__(4))) event_##name

+#undef DEFINE_EVENT_FN
+#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
+ DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
+
#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
-- 1.7.1

2013-06-03 19:29:59

by Seiji Aguchi

Subject: [PATCH v13 2/3] trace,x86: Introduce entering/exiting_irq()

When implementing tracepoints in interrupt handlers, simply adding the
tracepoints to the performance-sensitive path of the handlers may cause
a performance problem due to the time penalty.

To solve the problem, the idea is to prepare non-trace and trace irq handlers
and switch between their IDTs when the tracepoints are enabled/disabled.

To do this, this patch introduces entering_irq()/exiting_irq() for
pre/post-processing of each irq handler.

They are used as follows.

Non-trace irq handler:
smp_irq_handler()
{
    entering_irq();         /* pre-processing of this handler */
    __smp_irq_handler();    /*
                             * common logic between non-trace and trace handlers
                             * in a vector.
                             */
    exiting_irq();          /* post-processing of this handler */
}

Trace irq handler:
smp_trace_irq_handler()
{
    entering_irq();         /* pre-processing of this handler */
    trace_irq_entry();      /* tracepoint for irq entry */
    __smp_irq_handler();    /*
                             * common logic between non-trace and trace handlers
                             * in a vector.
                             */
    trace_irq_exit();       /* tracepoint for irq exit */
    exiting_irq();          /* post-processing of this handler */
}

If the tracepoints could be placed outside entering_irq()/exiting_irq() as follows,
it would look cleaner.

smp_trace_irq_handler()
{
    trace_irq_entry();
    smp_irq_handler();
    trace_irq_exit();
}

But it doesn't work.
The problem is where irq_enter()/irq_exit() are called. The tracepoints must be
placed between them, because rcu_irq_enter() must be called before any
tracepoints are used, as tracepoints use RCU to synchronize.

As a possible alternative, we may be able to call irq_enter() first as follows
if irq_enter() can nest.

smp_trace_irq_handler()
{
    irq_enter();
    trace_irq_entry();
    smp_irq_handler();
    trace_irq_exit();
    irq_exit();
}

But it doesn't work, either.
If irq_enter() is nested, it may have a time penalty because it has to check if it
was already called or not. The time penalty is not desired in performance sensitive
paths even if it is tiny.
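
The ordering constraint can be checked with a small user-space sketch. This is
only a model under stated assumptions: every demo_* name is hypothetical, the
depth counter stands in for the state that irq_enter()/irq_exit() maintain, and
the assert in demo_tracepoint() stands in for the requirement that a tracepoint
only fires after rcu_irq_enter() has run.

```c
#include <assert.h>
#include <stddef.h>

/* Steps recorded by the sketch, in the order they run. */
enum demo_step {
	DEMO_ENTERING_IRQ,
	DEMO_TRACE_ENTRY,
	DEMO_COMMON_LOGIC,
	DEMO_TRACE_EXIT,
	DEMO_EXITING_IRQ,
};

#define DEMO_LOG_MAX 16
static enum demo_step demo_log[DEMO_LOG_MAX];
static size_t demo_log_len;
static int demo_irq_depth;	/* > 0 between entering and exiting */

static void demo_record(enum demo_step step)
{
	assert(demo_log_len < DEMO_LOG_MAX);
	demo_log[demo_log_len++] = step;
}

/* Model of entering_irq(): pre-processing of a handler. */
static void demo_entering_irq(void)
{
	demo_irq_depth++;
	demo_record(DEMO_ENTERING_IRQ);
}

/* Model of exiting_irq(): post-processing of a handler. */
static void demo_exiting_irq(void)
{
	demo_record(DEMO_EXITING_IRQ);
	demo_irq_depth--;
}

/* A tracepoint is only legal while inside the entering/exiting pair. */
static void demo_tracepoint(enum demo_step step)
{
	assert(demo_irq_depth > 0);
	demo_record(step);
}

/* Logic shared by the non-trace and trace handlers of one vector. */
static void demo_handler_common(void)
{
	demo_record(DEMO_COMMON_LOGIC);
}

/* Non-trace handler: no tracepoint cost at all. */
static void demo_smp_irq_handler(void)
{
	demo_entering_irq();
	demo_handler_common();
	demo_exiting_irq();
}

/* Trace handler: same common logic, wrapped by the two tracepoints. */
static void demo_smp_trace_irq_handler(void)
{
	demo_entering_irq();
	demo_tracepoint(DEMO_TRACE_ENTRY);
	demo_handler_common();
	demo_tracepoint(DEMO_TRACE_EXIT);
	demo_exiting_irq();
}
```

Moving either demo_tracepoint() call outside the entering/exiting pair trips
the assert in this model, which mirrors why the cleaner-looking layout above
is rejected.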

Signed-off-by: Seiji Aguchi <[email protected]>
---
arch/x86/include/asm/apic.h | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 3388034..f8119b5 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -12,6 +12,7 @@
#include <asm/fixmap.h>
#include <asm/mpspec.h>
#include <asm/msr.h>
+#include <asm/idle.h>

#define ARCH_APICTIMER_STOPS_ON_C3 1

@@ -687,5 +688,31 @@ extern int default_check_phys_apicid_present(int phys_apicid);
#endif

#endif /* CONFIG_X86_LOCAL_APIC */
+extern void irq_enter(void);
+extern void irq_exit(void);
+
+static inline void entering_irq(void)
+{
+ irq_enter();
+ exit_idle();
+}
+
+static inline void entering_ack_irq(void)
+{
+ ack_APIC_irq();
+ entering_irq();
+}
+
+static inline void exiting_irq(void)
+{
+ irq_exit();
+}
+
+static inline void exiting_ack_irq(void)
+{
+ irq_exit();
+ /* Ack only at the end to avoid potential reentry */
+ ack_APIC_irq();
+}

#endif /* _ASM_X86_APIC_H */
-- 1.7.1

2013-06-03 19:31:00

by Seiji Aguchi

Subject: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

[Purpose of this patch]

As Vaibhav explained in the thread below, tracepoints for irq vectors
are useful.

http://www.spinics.net/lists/mm-commits/msg85707.html

<snip>
The current interrupt traces from irq_handler_entry and irq_handler_exit
provide when an interrupt is handled. They provide good data about when
the system has switched to kernel space and how it affects the currently
running processes.

There are some IRQ vectors which trigger the system into kernel space,
which are not handled in generic IRQ handlers. Tracing such events gives
us the information about IRQ interaction with other system events.

The trace also tells where the system is spending its time. We want to
know which cores are handling interrupts and how they are affecting other
processes in the system. Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
<snip>

My use case, on the other hand, is tracing just the local timer event and
getting the value of the instruction pointer.

I previously suggested adding an argument to the local timer event to get the instruction pointer.
But there is another way to get it, with an external module such as systemtap.
So I no longer need to add any argument to the irq vector tracepoints.

[Patch Description]

Vaibhav's patch shared a single pair of tracepoints, irq_vector_entry/irq_vector_exit, among all events.
But the use case above traces a specific irq vector rather than all events.
In that case, we are concerned about the overhead due to unwanted events.

This patch adds the following tracepoints instead of introducing irq_vector_entry/exit,
so that we can enable them independently.
- local_timer_vector
- reschedule_vector
- call_function_vector
- call_function_single_vector
- irq_work_entry_vector
- error_apic_vector
- thermal_apic_vector
- threshold_apic_vector
- spurious_apic_vector
- x86_platform_ipi_vector

It also introduces logic that switches the IDT when the tracepoints are enabled/disabled,
so that the time penalty is zero when tracepoints are disabled. In detail:
- Create non-trace and trace irq handlers with entering_irq()/exiting_irq().
- Create a new IDT, trace_idt_table, at boot time by adding logic to
  _set_gate(). It is just a copy of the original IDT.
- Register the new handlers for tracepoints in the new IDT by introducing
  macros to alloc_intr_gate(), which is called when irq_vector handlers are registered.
- Switch the IDT to the new one when tracepoints are enabled.
- Restore the original IDT when tracepoints are disabled.
The new IDT is created only when CONFIG_TRACING is enabled to avoid being used for other purposes.
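
The switching steps can be modelled in user space roughly as follows. This is
a sketch under assumptions, not the code in tracepoint.c: all demo_* names are
hypothetical, the reference count is an assumption about how several enabled
tracepoints could share one switch, and the per-cpu reload of the IDT register
is not modelled. The one detail taken from the patch is the convention used by
load_current_idt(): a zero pointer means "use the original table".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for an IDT descriptor; only its identity matters here. */
struct demo_table {
	int id;
};

static struct demo_table demo_idt = { 0 };		/* original table */
static struct demo_table demo_trace_idt = { 1 };	/* copy with trace handlers */

/* 0 means "use the original table", like current_idt_descr_ptr. */
static atomic_uintptr_t demo_current_ptr;

/* Number of enabled tracepoints (assumed; single-threaded in this sketch). */
static int demo_refcount;

/* Model of load_current_idt(): pick the table that should be active. */
static const struct demo_table *demo_load_current(void)
{
	uintptr_t p = atomic_load(&demo_current_ptr);

	return p ? (const struct demo_table *)p : &demo_idt;
}

/* Called when a tracepoint is enabled: the first user switches tables. */
static void demo_regfunc(void)
{
	if (demo_refcount++ == 0)
		atomic_store(&demo_current_ptr, (uintptr_t)&demo_trace_idt);
}

/* Called when a tracepoint is disabled: the last user switches back. */
static void demo_unregfunc(void)
{
	assert(demo_refcount > 0);
	if (--demo_refcount == 0)
		atomic_store(&demo_current_ptr, (uintptr_t)0);
}
```

Because the active table is looked up through one pointer, paths such as cpu
bring-up or returning from the debug stack can call the same lookup and end up
on whichever table is current, without saving and restoring anything.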

Signed-off-by: Seiji Aguchi <[email protected]>
---
arch/x86/include/asm/desc.h | 55 +++++++++++++++-
arch/x86/include/asm/entry_arch.h | 8 ++-
arch/x86/include/asm/hw_irq.h | 17 +++++
arch/x86/include/asm/mshyperv.h | 1 +
arch/x86/include/asm/trace/irq_vectors.h | 104 ++++++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/apic/Makefile | 1 +
arch/x86/kernel/apic/apic.c | 71 +++++++++++++++++----
arch/x86/kernel/cpu/common.c | 5 +-
arch/x86/kernel/cpu/mcheck/therm_throt.c | 24 +++++--
arch/x86/kernel/cpu/mcheck/threshold.c | 24 +++++--
arch/x86/kernel/entry_32.S | 12 +++-
arch/x86/kernel/entry_64.S | 31 +++++++--
arch/x86/kernel/head_64.S | 6 ++
arch/x86/kernel/irq.c | 31 ++++++---
arch/x86/kernel/irq_work.c | 24 ++++++-
arch/x86/kernel/smp.c | 65 ++++++++++++++++--
arch/x86/kernel/tracepoint.c | 58 +++++++++++++++++
include/xen/events.h | 3 +
19 files changed, 482 insertions(+), 59 deletions(-)
create mode 100644 arch/x86/include/asm/trace/irq_vectors.h
create mode 100644 arch/x86/kernel/tracepoint.c

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 8bf1c06..400f0db 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -320,6 +320,19 @@ static inline void set_nmi_gate(int gate, void *addr)
}
#endif

+#ifdef CONFIG_TRACING
+extern struct desc_ptr trace_idt_descr;
+extern gate_desc trace_idt_table[];
+static inline void write_trace_idt_entry(int entry, const gate_desc *gate)
+{
+ write_idt_entry(trace_idt_table, entry, gate);
+}
+#else
+static inline void write_trace_idt_entry(int entry, const gate_desc *gate)
+{
+}
+#endif
+
static inline void _set_gate(int gate, unsigned type, void *addr,
unsigned dpl, unsigned ist, unsigned seg)
{
@@ -331,6 +344,7 @@ static inline void _set_gate(int gate, unsigned type, void *addr,
* setup time
*/
write_idt_entry(idt_table, gate, &s);
+ write_trace_idt_entry(gate, &s);
}

/*
@@ -360,12 +374,39 @@ static inline void alloc_system_vector(int vector)
}
}

-static inline void alloc_intr_gate(unsigned int n, void *addr)
+#ifdef CONFIG_TRACING
+static inline void trace_set_intr_gate(unsigned int gate, void *addr)
+{
+ gate_desc s;
+
+ pack_gate(&s, GATE_INTERRUPT, (unsigned long)addr, 0, 0, __KERNEL_CS);
+ write_idt_entry(trace_idt_table, gate, &s);
+}
+
+static inline void __trace_alloc_intr_gate(unsigned int n, void *addr)
+{
+ trace_set_intr_gate(n, addr);
+}
+#else
+static inline void trace_set_intr_gate(unsigned int gate, void *addr)
+{
+}
+
+#define __trace_alloc_intr_gate(n, addr)
+#endif
+
+static inline void __alloc_intr_gate(unsigned int n, void *addr)
{
- alloc_system_vector(n);
set_intr_gate(n, addr);
}

+#define alloc_intr_gate(n, addr) \
+ do { \
+ alloc_system_vector(n); \
+ __alloc_intr_gate(n, addr); \
+ __trace_alloc_intr_gate(n, trace_##addr); \
+ } while (0)
+
/*
* This routine sets up an interrupt gate at directory privilege level 3.
*/
@@ -405,4 +446,14 @@ static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}

+extern atomic_long_t current_idt_descr_ptr;
+static inline void load_current_idt(void)
+{
+ if (atomic_long_read(&current_idt_descr_ptr))
+ load_idt((const struct desc_ptr *)
+ atomic_long_read(&current_idt_descr_ptr));
+ else
+ load_idt((const struct desc_ptr *)&idt_descr);
+}
+
#endif /* _ASM_X86_DESC_H */
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 9bd4eca..dc5fa66 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -13,14 +13,16 @@
BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
-BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
-BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
+BUILD_INTERRUPT3(irq_move_cleanup_interrupt, IRQ_MOVE_CLEANUP_VECTOR,
+ smp_irq_move_cleanup_interrupt)
+BUILD_INTERRUPT3(reboot_interrupt, REBOOT_VECTOR, smp_reboot_interrupt)
#endif

BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)

#ifdef CONFIG_HAVE_KVM
-BUILD_INTERRUPT(kvm_posted_intr_ipi, POSTED_INTR_VECTOR)
+BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
+ smp_kvm_posted_intr_ipi)
#endif

/*
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 1da97ef..e4ac559 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -77,6 +77,23 @@ extern void threshold_interrupt(void);
extern void call_function_interrupt(void);
extern void call_function_single_interrupt(void);

+#ifdef CONFIG_TRACING
+/* Interrupt handlers registered during init_IRQ */
+extern void trace_apic_timer_interrupt(void);
+extern void trace_x86_platform_ipi(void);
+extern void trace_error_interrupt(void);
+extern void trace_irq_work_interrupt(void);
+extern void trace_spurious_interrupt(void);
+extern void trace_thermal_interrupt(void);
+extern void trace_reschedule_interrupt(void);
+extern void trace_threshold_interrupt(void);
+extern void trace_call_function_interrupt(void);
+extern void trace_call_function_single_interrupt(void);
+#define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
+#define trace_reboot_interrupt reboot_interrupt
+#define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#endif /* CONFIG_TRACING */
+
/* IOAPIC */
#define IO_APIC_IRQ(x) (((x) >= NR_IRQS_LEGACY) || ((1<<(x)) & io_apic_irqs))
extern unsigned long io_apic_irqs;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index c2934be..cc40e03 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -12,6 +12,7 @@ struct ms_hyperv_info {
extern struct ms_hyperv_info ms_hyperv;

void hyperv_callback_vector(void);
+#define trace_hyperv_callback_vector hyperv_callback_vector
void hyperv_vector_handler(struct pt_regs *regs);
void hv_register_vmbus_handler(int irq, irq_handler_t handler);

diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h
new file mode 100644
index 0000000..2874df2
--- /dev/null
+++ b/arch/x86/include/asm/trace/irq_vectors.h
@@ -0,0 +1,104 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM irq_vectors
+
+#if !defined(_TRACE_IRQ_VECTORS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_IRQ_VECTORS_H
+
+#include <linux/tracepoint.h>
+
+extern void trace_irq_vector_regfunc(void);
+extern void trace_irq_vector_unregfunc(void);
+
+DECLARE_EVENT_CLASS(x86_irq_vector,
+
+ TP_PROTO(int vector),
+
+ TP_ARGS(vector),
+
+ TP_STRUCT__entry(
+ __field( int, vector )
+ ),
+
+ TP_fast_assign(
+ __entry->vector = vector;
+ ),
+
+ TP_printk("vector=%d", __entry->vector) );
+
+#define DEFINE_IRQ_VECTOR_EVENT(name) \
+DEFINE_EVENT_FN(x86_irq_vector, name##_entry, \
+ TP_PROTO(int vector), \
+ TP_ARGS(vector), \
+ trace_irq_vector_regfunc, \
+ trace_irq_vector_unregfunc); \
+DEFINE_EVENT_FN(x86_irq_vector, name##_exit, \
+ TP_PROTO(int vector), \
+ TP_ARGS(vector), \
+ trace_irq_vector_regfunc, \
+ trace_irq_vector_unregfunc);
+
+
+/*
+ * local_timer - called when entering/exiting a local timer interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(local_timer);
+
+/*
+ * reschedule - called when entering/exiting a reschedule vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(reschedule);
+
+/*
+ * spurious_apic - called when entering/exiting a spurious apic vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(spurious_apic);
+
+/*
+ * error_apic - called when entering/exiting an error apic vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(error_apic);
+
+/*
+ * x86_platform_ipi - called when entering/exiting a x86 platform ipi interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(x86_platform_ipi);
+
+/*
+ * irq_work - called when entering/exiting a irq work interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(irq_work);
+
+/*
+ * call_function - called when entering/exiting a call function interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(call_function);
+
+/*
+ * call_function_single - called when entering/exiting a call function
+ * single interrupt vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(call_function_single);
+
+/*
+ * threshold_apic - called when entering/exiting a threshold apic interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(threshold_apic);
+
+/*
+ * thermal_apic - called when entering/exiting a thermal apic interrupt
+ * vector handler
+ */
+DEFINE_IRQ_VECTOR_EVENT(thermal_apic);
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE irq_vectors
+#endif /* _TRACE_IRQ_VECTORS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 7bd3bd3..74b3891 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -102,6 +102,7 @@ obj-$(CONFIG_OF) += devicetree.o
obj-$(CONFIG_UPROBES) += uprobes.o

obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
+obj-$(CONFIG_TRACING) += tracepoint.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
index 0ae0323..5274c3a 100644
--- a/arch/x86/kernel/apic/Makefile
+++ b/arch/x86/kernel/apic/Makefile
@@ -2,6 +2,7 @@
# Makefile for local APIC drivers and for the IO-APIC code
#

+CFLAGS_apic.o := -I$(src)/../../include/asm/trace
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o ipi.o
obj-y += hw_nmi.o

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 904611b..61ced40 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -55,6 +55,9 @@
#include <asm/tsc.h>
#include <asm/hypervisor.h>

+#define CREATE_TRACE_POINTS
+#include <asm/trace/irq_vectors.h>
+
unsigned int num_processors;

unsigned disabled_cpus __cpuinitdata;
@@ -919,17 +922,35 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
/*
* NOTE! We'd better ACK the irq immediately,
* because timer handling can be slow.
+ *
+ * update_process_times() expects us to have done irq_enter().
+ * Besides, if we don't timer interrupts ignore the global
+ * interrupt lock, which is the WrongThing (tm) to do.
*/
- ack_APIC_irq();
+ entering_ack_irq();
+ local_apic_timer_interrupt();
+ exiting_irq();
+
+ set_irq_regs(old_regs);
+}
+
+void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
/*
+ * NOTE! We'd better ACK the irq immediately,
+ * because timer handling can be slow.
+ *
* update_process_times() expects us to have done irq_enter().
* Besides, if we don't timer interrupts ignore the global
* interrupt lock, which is the WrongThing (tm) to do.
*/
- irq_enter();
- exit_idle();
+ entering_ack_irq();
+ trace_local_timer_entry(LOCAL_TIMER_VECTOR);
local_apic_timer_interrupt();
- irq_exit();
+ trace_local_timer_exit(LOCAL_TIMER_VECTOR);
+ exiting_irq();

set_irq_regs(old_regs);
}
@@ -1907,12 +1928,10 @@ int __init APIC_init_uniprocessor(void)
/*
* This interrupt should _never_ happen with our APIC/SMP architecture
*/
-void smp_spurious_interrupt(struct pt_regs *regs)
+static inline void __smp_spurious_interrupt(void)
{
u32 v;

- irq_enter();
- exit_idle();
/*
* Check if this really is a spurious interrupt and ACK it
* if it is a vectored one. Just in case...
@@ -1927,13 +1946,28 @@ void smp_spurious_interrupt(struct pt_regs *regs)
/* see sw-dev-man vol 3, chapter 7.4.13.5 */
pr_info("spurious APIC interrupt on CPU#%d, "
"should never happen.\n", smp_processor_id());
- irq_exit();
+}
+
+void smp_spurious_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ __smp_spurious_interrupt();
+ exiting_irq();
+}
+
+void smp_trace_spurious_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ trace_spurious_apic_entry(SPURIOUS_APIC_VECTOR);
+ __smp_spurious_interrupt();
+ trace_spurious_apic_exit(SPURIOUS_APIC_VECTOR);
+ exiting_irq();
}

/*
* This interrupt should never happen with our APIC/SMP architecture
*/
-void smp_error_interrupt(struct pt_regs *regs)
+static inline void __smp_error_interrupt(struct pt_regs *regs)
{
u32 v0, v1;
u32 i = 0;
@@ -1948,8 +1982,6 @@ void smp_error_interrupt(struct pt_regs *regs)
"Illegal register address", /* APIC Error Bit 7 */
};

- irq_enter();
- exit_idle();
/* First tickle the hardware, only then report what went on. -- REW */
v0 = apic_read(APIC_ESR);
apic_write(APIC_ESR, 0);
@@ -1970,7 +2002,22 @@ void smp_error_interrupt(struct pt_regs *regs)

apic_printk(APIC_DEBUG, KERN_CONT "\n");

- irq_exit();
+}
+
+void smp_error_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ __smp_error_interrupt(regs);
+ exiting_irq();
+}
+
+void smp_trace_error_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ trace_error_apic_entry(ERROR_APIC_VECTOR);
+ __smp_error_interrupt(regs);
+ trace_error_apic_exit(ERROR_APIC_VECTOR);
+ exiting_irq();
}

/**
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 22018f7..5878d0a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1069,6 +1069,7 @@ static __init int setup_disablecpuid(char *arg)
}
__setup("clearcpuid=", setup_disablecpuid);

+atomic_long_t current_idt_descr_ptr = ATOMIC_LONG_INIT(0);
#ifdef CONFIG_X86_64
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
struct desc_ptr nmi_idt_descr = { NR_VECTORS * 16 - 1,
@@ -1161,7 +1162,7 @@ void debug_stack_reset(void)
if (WARN_ON(!this_cpu_read(debug_stack_use_ctr)))
return;
if (this_cpu_dec_return(debug_stack_use_ctr) == 0)
- load_idt((const struct desc_ptr *)&idt_descr);
+ load_current_idt();
}

#else /* CONFIG_X86_64 */
@@ -1257,7 +1258,7 @@ void __cpuinit cpu_init(void)
switch_to_new_gdt(cpu);
loadsegment(fs, 0);

- load_idt((const struct desc_ptr *)&idt_descr);
+ load_current_idt();

memset(me->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
syscall_init();
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 47a1870..2f3a799 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -29,6 +29,7 @@
#include <asm/idle.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include <asm/trace/irq_vectors.h>

/* How long to wait between reporting thermal events */
#define CHECK_INTERVAL (300 * HZ)
@@ -378,15 +379,26 @@ static void unexpected_thermal_interrupt(void)

static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;

-asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
+static inline void __smp_thermal_interrupt(void)
{
- irq_enter();
- exit_idle();
inc_irq_stat(irq_thermal_count);
smp_thermal_vector();
- irq_exit();
- /* Ack only at the end to avoid potential reentry */
- ack_APIC_irq();
+}
+
+asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ __smp_thermal_interrupt();
+ exiting_ack_irq();
+}
+
+asmlinkage void smp_trace_thermal_interrupt(struct pt_regs *regs)
+{
+ entering_irq();
+ trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
+ __smp_thermal_interrupt();
+ trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
+ exiting_ack_irq();
}

/* Thermal monitoring depends on APIC, ACPI and clock modulation */
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index aa578ca..fe6b1c8 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -8,6 +8,7 @@
#include <asm/apic.h>
#include <asm/idle.h>
#include <asm/mce.h>
+#include <asm/trace/irq_vectors.h>

static void default_threshold_interrupt(void)
{
@@ -17,13 +18,24 @@ static void default_threshold_interrupt(void)

void (*mce_threshold_vector)(void) = default_threshold_interrupt;

-asmlinkage void smp_threshold_interrupt(void)
+static inline void __smp_threshold_interrupt(void)
{
- irq_enter();
- exit_idle();
inc_irq_stat(irq_threshold_count);
mce_threshold_vector();
- irq_exit();
- /* Ack only at the end to avoid potential reentry */
- ack_APIC_irq();
+}
+
+asmlinkage void smp_threshold_interrupt(void)
+{
+ entering_irq();
+ __smp_threshold_interrupt();
+ exiting_ack_irq();
+}
+
+asmlinkage void smp_trace_threshold_interrupt(void)
+{
+ entering_irq();
+ trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
+ __smp_threshold_interrupt();
+ trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
+ exiting_ack_irq();
}
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 8f3e2de..2cfbc3a 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -801,7 +801,17 @@ ENTRY(name) \
CFI_ENDPROC; \
ENDPROC(name)

-#define BUILD_INTERRUPT(name, nr) BUILD_INTERRUPT3(name, nr, smp_##name)
+
+#ifdef CONFIG_TRACING
+#define TRACE_BUILD_INTERRUPT(name, nr) \
+ BUILD_INTERRUPT3(trace_##name, nr, smp_trace_##name)
+#else
+#define TRACE_BUILD_INTERRUPT(name, nr)
+#endif
+
+#define BUILD_INTERRUPT(name, nr) \
+ BUILD_INTERRUPT3(name, nr, smp_##name); \
+ TRACE_BUILD_INTERRUPT(name, nr)

/* The include is where all of the SMP etc. interrupts come from */
#include <asm/entry_arch.h>
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7272089..11eef43 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1138,7 +1138,7 @@ END(common_interrupt)
/*
* APIC interrupts.
*/
-.macro apicinterrupt num sym do_sym
+.macro apicinterrupt3 num sym do_sym
ENTRY(\sym)
INTR_FRAME
ASM_CLAC
@@ -1150,15 +1150,32 @@ ENTRY(\sym)
END(\sym)
.endm

+#ifdef CONFIG_TRACING
+#define trace(sym) trace_##sym
+#define smp_trace(sym) smp_trace_##sym
+
+.macro trace_apicinterrupt num sym
+apicinterrupt3 \num trace(\sym) smp_trace(\sym)
+.endm
+#else
+.macro trace_apicinterrupt num sym do_sym
+.endm
+#endif
+
+.macro apicinterrupt num sym do_sym
+apicinterrupt3 \num \sym \do_sym
+trace_apicinterrupt \num \sym
+.endm
+
#ifdef CONFIG_SMP
-apicinterrupt IRQ_MOVE_CLEANUP_VECTOR \
+apicinterrupt3 IRQ_MOVE_CLEANUP_VECTOR \
irq_move_cleanup_interrupt smp_irq_move_cleanup_interrupt
-apicinterrupt REBOOT_VECTOR \
+apicinterrupt3 REBOOT_VECTOR \
reboot_interrupt smp_reboot_interrupt
#endif

#ifdef CONFIG_X86_UV
-apicinterrupt UV_BAU_MESSAGE \
+apicinterrupt3 UV_BAU_MESSAGE \
uv_bau_message_intr1 uv_bau_message_interrupt
#endif
apicinterrupt LOCAL_TIMER_VECTOR \
@@ -1167,7 +1184,7 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
x86_platform_ipi smp_x86_platform_ipi

#ifdef CONFIG_HAVE_KVM
-apicinterrupt POSTED_INTR_VECTOR \
+apicinterrupt3 POSTED_INTR_VECTOR \
kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
#endif

@@ -1451,13 +1468,13 @@ ENTRY(xen_failsafe_callback)
CFI_ENDPROC
END(xen_failsafe_callback)

-apicinterrupt HYPERVISOR_CALLBACK_VECTOR \
+apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
xen_hvm_callback_vector xen_evtchn_do_upcall

#endif /* CONFIG_XEN */

#if IS_ENABLED(CONFIG_HYPERV)
-apicinterrupt HYPERVISOR_CALLBACK_VECTOR \
+apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
hyperv_callback_vector hyperv_vector_handler
#endif /* CONFIG_HYPERV */

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 08f7e80..dae6a9d 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -519,6 +519,12 @@ ENTRY(idt_table)
ENTRY(nmi_idt_table)
.skip IDT_ENTRIES * 16

+#ifdef CONFIG_TRACING
+ .align L1_CACHE_BYTES
+ENTRY(trace_idt_table)
+ .skip IDT_ENTRIES * 16
+#endif
+
__PAGE_ALIGNED_BSS
NEXT_PAGE(empty_zero_page)
.skip PAGE_SIZE
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index ac0631d..06af119 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -17,6 +17,7 @@
#include <asm/idle.h>
#include <asm/mce.h>
#include <asm/hw_irq.h>
+#include <asm/trace/irq_vectors.h>

atomic_t irq_err_count;

@@ -204,23 +205,21 @@ unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
/*
* Handler for X86_PLATFORM_IPI_VECTOR.
*/
-void smp_x86_platform_ipi(struct pt_regs *regs)
+void __smp_x86_platform_ipi(void)
{
- struct pt_regs *old_regs = set_irq_regs(regs);
-
- ack_APIC_irq();
-
- irq_enter();
-
- exit_idle();
-
inc_irq_stat(x86_platform_ipis);

if (x86_platform_ipi_callback)
x86_platform_ipi_callback();
+}

- irq_exit();
+void smp_x86_platform_ipi(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);

+ entering_ack_irq();
+ __smp_x86_platform_ipi();
+ exiting_irq();
set_irq_regs(old_regs);
}

@@ -246,6 +245,18 @@ void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
}
#endif

+void smp_trace_x86_platform_ipi(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ entering_ack_irq();
+ trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
+ __smp_x86_platform_ipi();
+ trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
+ exiting_irq();
+ set_irq_regs(old_regs);
+}
+
EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);

#ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index ca8f703..636a55e 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -8,14 +8,34 @@
#include <linux/irq_work.h>
#include <linux/hardirq.h>
#include <asm/apic.h>
+#include <asm/trace/irq_vectors.h>

-void smp_irq_work_interrupt(struct pt_regs *regs)
+static inline void irq_work_entering_irq(void)
{
irq_enter();
ack_APIC_irq();
+}
+
+static inline void __smp_irq_work_interrupt(void)
+{
inc_irq_stat(apic_irq_work_irqs);
irq_work_run();
- irq_exit();
+}
+
+void smp_irq_work_interrupt(struct pt_regs *regs)
+{
+ irq_work_entering_irq();
+ __smp_irq_work_interrupt();
+ exiting_irq();
+}
+
+void smp_trace_irq_work_interrupt(struct pt_regs *regs)
+{
+ irq_work_entering_irq();
+ trace_irq_work_entry(IRQ_WORK_VECTOR);
+ __smp_irq_work_interrupt();
+ trace_irq_work_exit(IRQ_WORK_VECTOR);
+ exiting_irq();
}

void arch_irq_work_raise(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..f4fe0b8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -30,6 +30,7 @@
#include <asm/proto.h>
#include <asm/apic.h>
#include <asm/nmi.h>
+#include <asm/trace/irq_vectors.h>
/*
* Some notes on x86 processor bugs affecting SMP operation:
*
@@ -249,32 +250,80 @@ finish:
/*
* Reschedule call back.
*/
-void smp_reschedule_interrupt(struct pt_regs *regs)
+static inline void __smp_reschedule_interrupt(void)
{
- ack_APIC_irq();
inc_irq_stat(irq_resched_count);
scheduler_ipi();
+}
+
+void smp_reschedule_interrupt(struct pt_regs *regs)
+{
+ ack_APIC_irq();
+ __smp_reschedule_interrupt();
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
}

-void smp_call_function_interrupt(struct pt_regs *regs)
+void smp_trace_reschedule_interrupt(struct pt_regs *regs)
+{
+ ack_APIC_irq();
+ trace_reschedule_entry(RESCHEDULE_VECTOR);
+ __smp_reschedule_interrupt();
+ trace_reschedule_exit(RESCHEDULE_VECTOR);
+ /*
+ * KVM uses this interrupt to force a cpu out of guest mode
+ */
+}
+
+static inline void call_function_entering_irq(void)
{
ack_APIC_irq();
irq_enter();
+}
+
+static inline void __smp_call_function_interrupt(void)
+{
generic_smp_call_function_interrupt();
inc_irq_stat(irq_call_count);
- irq_exit();
}

-void smp_call_function_single_interrupt(struct pt_regs *regs)
+void smp_call_function_interrupt(struct pt_regs *regs)
+{
+ call_function_entering_irq();
+ __smp_call_function_interrupt();
+ exiting_irq();
+}
+
+void smp_trace_call_function_interrupt(struct pt_regs *regs)
+{
+ call_function_entering_irq();
+ trace_call_function_entry(CALL_FUNCTION_VECTOR);
+ __smp_call_function_interrupt();
+ trace_call_function_exit(CALL_FUNCTION_VECTOR);
+ exiting_irq();
+}
+
+static inline void __smp_call_function_single_interrupt(void)
{
- ack_APIC_irq();
- irq_enter();
generic_smp_call_function_single_interrupt();
inc_irq_stat(irq_call_count);
- irq_exit();
+}
+
+void smp_call_function_single_interrupt(struct pt_regs *regs)
+{
+ call_function_entering_irq();
+ __smp_call_function_single_interrupt();
+ exiting_irq();
+}
+
+void smp_trace_call_function_single_interrupt(struct pt_regs *regs)
+{
+ call_function_entering_irq();
+ trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
+ __smp_call_function_single_interrupt();
+ trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
+ exiting_irq();
}

static int __init nonmi_ipi_setup(char *str)
diff --git a/arch/x86/kernel/tracepoint.c b/arch/x86/kernel/tracepoint.c
new file mode 100644
index 0000000..09a5fa7
--- /dev/null
+++ b/arch/x86/kernel/tracepoint.c
@@ -0,0 +1,58 @@
+/*
+ * Code for supporting irq vector tracepoints.
+ *
+ * Copyright (C) 2013 Seiji Aguchi <[email protected]>
+ *
+ */
+#include <asm/hw_irq.h>
+#include <asm/desc.h>
+#include <linux/atomic.h>
+
+struct desc_ptr trace_idt_descr = { NR_VECTORS * 16 - 1,
+ (unsigned long) trace_idt_table };
+
+#ifndef CONFIG_X86_64
+gate_desc trace_idt_table[NR_VECTORS] __page_aligned_data
+ = { { { { 0, 0 } } }, };
+#endif
+
+static int trace_irq_vector_refcount;
+static DEFINE_MUTEX(irq_vector_mutex);
+
+static void switch_idt(void *arg)
+{
+ load_current_idt();
+}
+
+void trace_irq_vector_regfunc(void)
+{
+ mutex_lock(&irq_vector_mutex);
+ if (!trace_irq_vector_refcount) {
+ atomic_long_set(&current_idt_descr_ptr,
+ (unsigned long)&trace_idt_descr);
+ wmb();
+ smp_call_function(switch_idt, NULL, 0);
+ local_irq_disable();
+ switch_idt(NULL);
+ local_irq_enable();
+ }
+ trace_irq_vector_refcount++;
+ mutex_unlock(&irq_vector_mutex);
+}
+
+void trace_irq_vector_unregfunc(void)
+{
+ mutex_lock(&irq_vector_mutex);
+ trace_irq_vector_refcount--;
+ if (!trace_irq_vector_refcount) {
+ atomic_long_set(&current_idt_descr_ptr,
+ (unsigned long)&idt_descr);
+ wmb();
+ smp_call_function(switch_idt, NULL, 0);
+ local_irq_disable();
+ switch_idt(NULL);
+ local_irq_enable();
+ }
+ mutex_unlock(&irq_vector_mutex);
+}
+
diff --git a/include/xen/events.h b/include/xen/events.h
index b2b27c6..c9ea10e 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -76,6 +76,9 @@ unsigned irq_from_evtchn(unsigned int evtchn);

/* Xen HVM evtchn vector callback */
void xen_hvm_callback_vector(void);
+#ifdef CONFIG_TRACING
+#define trace_xen_hvm_callback_vector xen_hvm_callback_vector
+#endif
extern int xen_have_vector_callback;
int xen_set_callback_via(uint64_t via);
void xen_evtchn_do_upcall(struct pt_regs *regs);
--
1.7.1

2013-06-03 23:53:08

by Steven Rostedt

Subject: Re: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

On Mon, 2013-06-03 at 15:29 -0400, Seiji Aguchi wrote:

Yeah, I believe this does work. But you probably should add a comment
like the following:

/*
* The current_idt_descr_ptr can only be set out of interrupt context
* to avoid races. Once set, the load_current_idt() is called by interrupt
* context either by NMI, debug, or via a smp_call_function(). That way
* the IDT will always be set back to the expected descriptor.
*/
>
> +extern atomic_long_t current_idt_descr_ptr;
> +static inline void load_current_idt(void)
> +{
> + if (atomic_long_read(&current_idt_descr_ptr))

Also, we should probably add here:
unsigned long new_idt = atomic_long_read(&current_idt_descr_ptr);

if (WARN_ON_ONCE(!validate_idt(new_idt)))
return;
load_idt((const struct desc_ptr *)new_idt);

> + load_idt((const struct desc_ptr *)
> + atomic_long_read(&current_idt_descr_ptr));
> + else
> + load_idt((const struct desc_ptr *)&idt_descr);
> +}
> +

Then have

bool validate_idt(unsigned long idt)
{
	/* Accept only the two descriptors the kernel itself sets up. */
	return idt == (unsigned long)&trace_idt_descr ||
	       idt == (unsigned long)&idt_descr;
}

This way we won't be opening up any easy root holes: if a process
finds a way to modify some arbitrary kernel memory, we can prevent it
from modifying the current_idt_descr_ptr and gaining a nice way to exploit
the IDT. Sure, one can argue that if they can modify arbitrary kernel
memory, we may already be lost, but let's not make it easier for them
than need be.
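
As an illustration of the whitelist check suggested above, here is a minimal
userspace sketch. It assumes stand-in definitions of struct desc_ptr,
idt_descr and trace_idt_descr (the real ones live in the kernel), and
select_idt() is a hypothetical helper that is not part of the patch. It uses
plain comparisons because an address cast to an integer is generally not
accepted by compilers as a case label.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's struct desc_ptr (userspace assumption). */
struct desc_ptr {
	uint16_t size;
	uintptr_t address;
};

/* Stand-ins for the two descriptors that may legitimately be loaded. */
static struct desc_ptr idt_descr;
static struct desc_ptr trace_idt_descr;

/* Return true only if idt is the address of a known descriptor. */
static bool validate_idt(uintptr_t idt)
{
	return idt == (uintptr_t)&idt_descr ||
	       idt == (uintptr_t)&trace_idt_descr;
}

/*
 * Hypothetical helper: return the requested descriptor if it is on the
 * whitelist, otherwise fall back to the default descriptor.
 */
static const struct desc_ptr *select_idt(uintptr_t requested)
{
	if (!validate_idt(requested))
		return &idt_descr;
	return (const struct desc_ptr *)requested;
}
```

With this shape, a corrupted pointer value never reaches the load step; the
caller ends up with the default descriptor instead.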

-- Steve

2013-06-04 15:51:31

by Seiji Aguchi

Subject: RE: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

> Yeah, I believe this does work. But you probably should add a comment
> like the following:

OK. I will add a comment above "extern atomic_long_t current_idt_descr_ptr;".

>
> /*
> * The current_idt_descr_ptr can only be set out of interrupt context
> * to avoid races.

I will introduce set_current_idt() as follows.

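void set_current_idt(unsigned long idt)
{
	/* The descriptor pointer must only be changed outside interrupt context. */
	if (WARN_ON_ONCE(in_interrupt()))
		return;

	atomic_long_set(&current_idt_descr_ptr, idt);
}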


> * Once set, the load_current_idt() is called by interrupt
> * context either by NMI, debug, or via a smp_call_function(). That way
> * the IDT will always be set back to the expected descriptor.
> */

The important thing is not "called by interrupt context" but "called with interrupts disabled",
to avoid races.
Actually, load_current_idt() is also called in process context, in trace_irq_vector_{reg,unreg}func().
I will rewrite the comment in the next patch.

> >
> > +extern atomic_long_t current_idt_descr_ptr;
> > +static inline void load_current_idt(void)
> > +{
> > + if (atomic_long_read(&current_idt_descr_ptr))
>
> Also, we should probably add here:
> unsigned long new_idt = atomic_long_read(&current_idt_descr_ptr);
>
> if (WARN_ON_ONCE(!validate_idt(new_idt)))
> return;
> load_idt((const struct desc_ptr *)new_idt);
>
> > + load_idt((const struct desc_ptr *)
> > + atomic_long_read(&current_idt_descr_ptr));
> > + else
> > + load_idt((const struct desc_ptr *)&idt_descr);
> > +}
> > +
>
> Then have
>
> bool validate_idt(unsigned long idt)
> {
> 	/* Accept only the two descriptors the kernel itself sets up. */
> 	return idt == (unsigned long)&trace_idt_descr ||
> 	       idt == (unsigned long)&idt_descr;
> }
>
> This way we wont be opening up any easy root holes where if a process
> finds a way to modify some arbitrary kernel memory, we can prevent it
> from modifying the current_idt_descr_ptr and have a nice way to exploit
> the IDT. Sure, one can argue that if they can modify arbitrary kernel
> memory, we may already be lost, but lets not make it easier for them
> than need be.

I will introduce validate_idt() as above in the next patch.

Thanks.

Seiji

>
> -- Steve
>


2013-06-04 18:17:34

by H. Peter Anvin

Subject: Re: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

On 06/03/2013 04:53 PM, Steven Rostedt wrote:
>
> This way we wont be opening up any easy root holes where if a process
> finds a way to modify some arbitrary kernel memory, we can prevent it
> from modifying the current_idt_descr_ptr and have a nice way to exploit
> the IDT. Sure, one can argue that if they can modify arbitrary kernel
> memory, we may already be lost, but lets not make it easier for them
> than need be.
>

I don't like current_idt_descr_ptr if we can avoid it. It is a direct
proxy for reading and writing the original IDT; in other words, it
hasn't really addressed the issue.

What I'm thinking we really should have is a function that returns the
IDT that we currently should be using, based on the current state. If
that state is, say, tracing on/off and NMI on/off, then that can be
indicated by bits in a state vector.

The point is that the IDT address itself should not be mutable state if
it can be at all avoided.
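
One way to read this suggestion, as a rough userspace sketch: the IDT to use
is computed from a small state word instead of being stored as a writable
pointer. The bit names, idt_for_state() and the stand-in descriptors are all
hypothetical, and giving the debug/NMI state priority over tracing is an
assumption.

```c
#include <stdint.h>

/* Stand-in for the kernel's struct desc_ptr (userspace assumption). */
struct desc_ptr {
	uint16_t size;
	uintptr_t address;
};

/* Stand-ins for the three descriptors discussed in this thread. */
static struct desc_ptr idt_descr;
static struct desc_ptr trace_idt_descr;
static struct desc_ptr nmi_idt_descr;

/* Hypothetical state bits; the names are not taken from the patch. */
enum idt_state_bits {
	IDT_STATE_TRACING = 1u << 0,	/* irq vector tracepoints enabled */
	IDT_STATE_DEBUG   = 1u << 1,	/* inside debug trap or NMI handling */
};

/*
 * Map the state word to a descriptor. Every result is one of the fixed
 * descriptors, so no descriptor address is ever kept as mutable state.
 */
static const struct desc_ptr *idt_for_state(unsigned int state)
{
	if (state & IDT_STATE_DEBUG)
		return &nmi_idt_descr;
	if (state & IDT_STATE_TRACING)
		return &trace_idt_descr;
	return &idt_descr;
}
```

Corrupting the state word can at worst select another valid table, which is
the property being asked for here.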

-hpa


2013-06-04 18:32:49

by Steven Rostedt

Subject: Re: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

On Tue, 2013-06-04 at 11:15 -0700, H. Peter Anvin wrote:
> On 06/03/2013 04:53 PM, Steven Rostedt wrote:
> >
> > This way we wont be opening up any easy root holes where if a process
> > finds a way to modify some arbitrary kernel memory, we can prevent it
> > from modifying the current_idt_descr_ptr and have a nice way to exploit
> > the IDT. Sure, one can argue that if they can modify arbitrary kernel
> > memory, we may already be lost, but lets not make it easier for them
> > than need be.
> >
>
> I don't like current_idt_descr_ptr if we can avoid it. It is a direct
> proxy for reading and writing the original IDT, in other words, it
> really hasn't really addressed the issue.
>
> What I'm thinking we really should have is a function that returns the
> IDT that we currently should be using, based on the current state. If
> that state is, say, tracing on/off and NMI on/off, then that can be
> indicated by bits in a state vector.

The NMI on/off may be a bit trickier, as it is also a debug state.
When we go into a nested debug or NMI state we use the same IDT.

>
> The point is that the IDT address itself should not be mutable state if
> it can be at all avoided.

Hmm, maybe we can do it. Have two counters, a debug_idt_ctr and a
trace_idt_ctr, then have a function that basically does this:

	if (this_cpu_read(debug_idt_ctr))
		load_idt(&nmi_idt_descr); /* probably should rename to debug_idt_descr */
	else if (trace_idt_ctr)
		load_idt(&trace_idt_descr);
	else
		load_idt(&idt_descr);

Then all modifications of the idt would call this function.
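
The counter-based selection above could be modeled in userspace roughly as
follows. This is only a sketch: the per-CPU counter is simulated with an
array indexed by a CPU number, and NR_CPUS_SKETCH, current_idt(),
trace_idt_get() and trace_idt_put() are hypothetical names that are not part
of the patch.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/* Stand-in for the kernel's struct desc_ptr (userspace assumption). */
struct desc_ptr {
	uint16_t size;
	uintptr_t address;
};

static struct desc_ptr idt_descr;
static struct desc_ptr trace_idt_descr;
static struct desc_ptr nmi_idt_descr;

/* Models the per-CPU debug counter: one slot per simulated CPU. */
static int debug_idt_ctr[NR_CPUS_SKETCH];
/* Global count of users that have enabled the irq vector tracepoints. */
static int trace_idt_ctr;

/* Choose the descriptor for a CPU from the two counters only. */
static const struct desc_ptr *current_idt(int cpu)
{
	if (debug_idt_ctr[cpu])
		return &nmi_idt_descr;
	if (trace_idt_ctr)
		return &trace_idt_descr;
	return &idt_descr;
}

/* Register and unregister paths only touch the counter. */
static void trace_idt_get(void)
{
	trace_idt_ctr++;
}

static void trace_idt_put(void)
{
	trace_idt_ctr--;
}
```

In this model a CPU that is in the debug state keeps its debug table
regardless of tracing being switched on or off on the other CPUs.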

-- Steve


2013-06-04 18:38:13

by Seiji Aguchi

Subject: RE: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

> > The point is that the IDT address itself should not be mutable state if
> > it can be at all avoided.
>
> Hmm, maybe we can do it. Have two counters, a debug_idt_ctr and a
> trace_idt_ctr, then have a function that basically does this:
>
> if (this_cpu_read(debug_idt_ctr))
> load_idt(&nmi_idt_descr); /* probably should rename to debug_idt_descr) */
> else if (trace_idt_ctr)
> load_idt(&trace_idt_descr);
> else
> load_idt(&idt_descr);
>
> Then all modifications of the idt would call this function.

I think it will work.
I will make the patch.

Seiji

>
> -- Steve
>
>


2013-06-04 20:21:52

by Seiji Aguchi

Subject: RE: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

Steven,

>
> Hmm, maybe we can do it. Have two counters, a debug_idt_ctr and a
> trace_idt_ctr, then have a function that basically does this:
>
> if (this_cpu_read(debug_idt_ctr))

I think we can use "debug_stack_use_ctr" for the checking.
Is it correct?
Or, do I need to introduce a new debug_idt_ctr?

Seiji



2013-06-04 20:58:16

by Steven Rostedt

Subject: Re: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

On Tue, 2013-06-04 at 20:20 +0000, Seiji Aguchi wrote:
> Steven,
>
> >
> > Hmm, maybe we can do it. Have two counters, a debug_idt_ctr and a
> > trace_idt_ctr, then have a function that basically does this:
> >
> > if (this_cpu_read(debug_idt_ctr))
>
> I think we can use "debug_stack_use_ctr" for the checking.
> Is it correct?
> Or, do I need to introduce a new debug_idt_ctr?
>

No, it's the same variable. I was thinking we should rename it too, as
debug_stack_use_ctr doesn't really describe what is happening anymore.

-- Steve

2013-06-04 21:02:04

by Seiji Aguchi

Subject: RE: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints

OK, I will rename debug_stack_use_ctr to debug_idt_ctr.
Thanks.

Seiji

> -----Original Message-----
> From: Steven Rostedt [mailto:[email protected]]
> Sent: Tuesday, June 04, 2013 4:58 PM
> To: Seiji Aguchi
> Cc: H. Peter Anvin; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected]; Tomoki Sekiyama
> Subject: Re: [PATCH v13 3/3] trace,x86: Add irq vector tracepoints
>
> On Tue, 2013-06-04 at 20:20 +0000, Seiji Aguchi wrote:
> > Steven,
> >
> > >
> > > Hmm, maybe we can do it. Have two counters, a debug_idt_ctr and a
> > > trace_idt_ctr, then have a function that basically does this:
> > >
> > > if (this_cpu_read(debug_idt_ctr))
> >
> > I think we can use "debug_stack_use_ctr" for the checking.
> > Is it correct?
> > Or, do I need to introduce a new debug_idt_ctr?
> >
>
> No, it's the same variable. I was thinking we should rename it too, as
> debug_stack_use_ctr, doesn't really describe what is happening anymore.
>
> -- Steve
>
