Subject: [RFC PATCH 0/7] Early task context tracking

Note: do not take it too seriously, it is just a proof of concept.

Some time ago, while using perf to check the automaton model, I noticed
that perf was losing events. The same was reproducible with ftrace.

See: https://www.spinics.net/lists/linux-rt-users/msg19781.html

Steve pointed to a problem in the identification of the execution
context used by the recursion control.

Currently, recursion control uses the preempt_counter to
identify the current context. The NMI/HARD/SOFT IRQ counters
are set in the preempt_counter in the irq_enter/exit functions.
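
A simplified sketch of irq_enter() (not the verbatim kernel code) shows how
late the HARDIRQ part of the counter becomes visible:

-------------- %< --------------------
/*
 * Simplified sketch of today's behavior: the HARDIRQ part of the
 * preempt_counter is only set inside irq_enter(), well after the
 * entry asm and the first C function (do_IRQ()) have started.
 */
void irq_enter(void)
{
	rcu_irq_enter();
	/* idle/tick handling elided */
	account_irq_enter_time(current);
	preempt_count_add(HARDIRQ_OFFSET);	/* IRQ context set only here */
}
-------------- >% --------------------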

In a trace, they are set like this:
-------------- %< --------------------
0) ==========> |
0) | do_IRQ() { /* First C function */
0) | irq_enter() {
0) | /* set the IRQ context. */
0) 1.081 us | }
0) | handle_irq() {
0) | /* IRQ handling code */
0) + 10.290 us | }
0) | irq_exit() {
0) | /* unset the IRQ context. */
0) 6.657 us | }
0) + 18.995 us | }
0) <========== |
-------------- >% --------------------

As one can see, functions (and events) that take place before the
preempt_counter is set and after it is unset are identified in the wrong
context, causing the misinterpretation that a recursion is taking place.
When this happens, events are dropped.
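
To make the failure mode concrete, here is a simplified sketch of the
per-context recursion check (modeled on trace_recursive_lock(), shown in the
patches below):

-------------- %< --------------------
/*
 * Simplified sketch: one bit per context (NMI/IRQ/SOFTIRQ/NORMAL).
 * An event emitted between the IRQ entry and irq_enter() is still
 * classified as NORMAL. If the interrupted thread was itself inside
 * an event, the NORMAL bit is already set, the check reports a
 * (false) recursion, and the event is dropped.
 */
static int recursion_check(unsigned int *context_mask)
{
	int bit = trace_get_context_bit();	/* based on preempt_count */

	if (*context_mask & (1 << bit))
		return -1;			/* looks like recursion: drop */

	*context_mask |= (1 << bit);
	return bit;
}
-------------- >% --------------------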

To resolve this problem, the set/unset of the IRQ/NMI context needs to
be done before the execution of the first C function, and after its
return. By doing so, and by using this method to identify the context in
the trace recursion protection, no more events are lost.

A possible solution is to use a per-cpu variable that is set and unset in
the entry point of NMIs/IRQs, before calling the C handler. This solution is
presented in the next patches as a proof of concept, for x86_64. However,
other ideas might be better than mine... so...
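
In short, the entry asm sets a per-cpu task_context variable right around the
call to the C handler, and the recursion protection reads it instead of the
preempt_counter; condensed from patches 1, 5 and 7:

-------------- %< --------------------
/* Set/cleared in entry_64.S, right around the call to the C handler. */
DEFINE_PER_CPU(unsigned char, task_context);

/* Context identification used by the recursion protection. */
static __always_inline int trace_get_context_bit(void)
{
	unsigned long tc = this_cpu_read(task_context);

	if (tc)
		return tc & TASK_CTX_NMI ? TRACE_CTX_NMI :
		       tc & TASK_CTX_IRQ ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
	else
		return TRACE_CTX_NORMAL;
}
-------------- >% --------------------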

Daniel Bristot de Oliveira (7):
x86/entry: Add support for early task context tracking
trace: Move the trace recursion context enum to trace.h and reuse it
trace: Optimize trace_get_context_bit()
trace/ring_buffer: Use trace_get_context_bit()
trace: Use early task context tracking if available
events: Create a trace_get_context_bit()
events: Use early task context tracking if available

arch/x86/entry/entry_64.S | 9 ++++++
arch/x86/include/asm/irqflags.h | 30 ++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 4 +++
include/linux/irqflags.h | 4 +++
kernel/events/internal.h | 50 +++++++++++++++++++++++++++------
kernel/softirq.c | 5 +++-
kernel/trace/ring_buffer.c | 28 ++----------------
kernel/trace/trace.h | 46 ++++++++++++++++++++++--------
8 files changed, 129 insertions(+), 47 deletions(-)

--
2.20.1


Subject: [RFC PATCH 1/7] x86/entry: Add support for early task context tracking

Currently, the context is identified through the preempt_counter, but it
is only set after the first functions of the IRQ/NMI handler have run,
causing potential problems in the identification of the current context.
For instance, ftrace/perf might drop events in the early stage of IRQ/NMI
handlers because the preempt_counter was not yet set.

The proposed approach is to use a dedicated per-cpu variable to keep
track of the context of execution, with values set before the execution
of the first C function of the interrupt handler.

This is a PoC for x86_64.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/entry/entry_64.S | 9 +++++++++
arch/x86/include/asm/irqflags.h | 30 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 4 ++++
include/linux/irqflags.h | 4 ++++
kernel/softirq.c | 5 ++++-
5 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..1471b544241f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -545,6 +545,7 @@ ENTRY(interrupt_entry)
testb $3, CS+8(%rsp)
jz 1f

+ TASK_CONTEXT_SET_BIT context=TASK_CTX_IRQ
/*
* IRQ from user mode.
*
@@ -561,6 +562,8 @@ ENTRY(interrupt_entry)

1:
ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
+
+ TASK_CONTEXT_SET_BIT context=TASK_CTX_IRQ
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

@@ -586,6 +589,7 @@ ret_from_intr:
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF

+ TASK_CONTEXT_RESET_BIT context=TASK_CTX_IRQ
LEAVE_IRQ_STACK

testb $3, CS(%rsp)
@@ -780,6 +784,7 @@ ENTRY(\sym)
call interrupt_entry
UNWIND_HINT_REGS indirect=1
call \do_sym /* rdi points to pt_regs */
+ TASK_CONTEXT_RESET_BIT context=TASK_CTX_IRQ
jmp ret_from_intr
END(\sym)
_ASM_NOKPROBE(\sym)
@@ -1403,9 +1408,11 @@ ENTRY(nmi)
* done with the NMI stack.
*/

+ TASK_CONTEXT_SET_BIT context=TASK_CTX_NMI
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
+ TASK_CONTEXT_RESET_BIT context=TASK_CTX_NMI

/*
* Return back to user mode. We must *not* do the normal exit
@@ -1615,10 +1622,12 @@ end_repeat_nmi:
call paranoid_entry
UNWIND_HINT_REGS

+ TASK_CONTEXT_SET_BIT context=TASK_CTX_NMI
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
+ TASK_CONTEXT_RESET_BIT context=TASK_CTX_NMI

/* Always restore stashed CR3 value (see paranoid_entry) */
RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 058e40fed167..5a12bc3ea02b 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -3,6 +3,7 @@
#define _X86_IRQFLAGS_H_

#include <asm/processor-flags.h>
+#include <asm/percpu.h>

#ifndef __ASSEMBLY__

@@ -202,4 +203,33 @@ static inline int arch_irqs_disabled(void)
#endif
#endif /* __ASSEMBLY__ */

+#ifdef CONFIG_X86_64
+/*
+ * NOTE: I know I need to implement this for 32-bit as well.
+ * But... this is just a POC.
+ */
+#define ARCH_HAS_TASK_CONTEXT 1
+
+#define TASK_CTX_THREAD 0x0
+#define TASK_CTX_SOFTIRQ 0x1
+#define TASK_CTX_IRQ 0x2
+#define TASK_CTX_NMI 0x4
+
+#ifdef __ASSEMBLY__
+.macro TASK_CONTEXT_SET_BIT context:req
+ orb $\context, PER_CPU_VAR(task_context)
+.endm
+
+.macro TASK_CONTEXT_RESET_BIT context:req
+ andb $~\context, PER_CPU_VAR(task_context)
+.endm
+#else /* __ASSEMBLY__ */
+DECLARE_PER_CPU(unsigned char, task_context);
+
+static __always_inline void task_context_set(unsigned char context)
+{
+ raw_cpu_write_1(task_context, context);
+}
+#endif /* __ASSEMBLY__ */
+#endif /* CONFIG_X86_64 */
#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..1acbec22319b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1531,6 +1531,8 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);

+DEFINE_PER_CPU(unsigned char, task_context) __visible = 0;
+
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
@@ -1604,6 +1606,8 @@ EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
EXPORT_PER_CPU_SYMBOL(__preempt_count);

+DEFINE_PER_CPU(unsigned char, task_context) __visible = 0;
+
/*
* On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
* the top of the kernel stack. Use an extra percpu variable to track the
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 21619c92c377..1c3473bbe5d2 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -168,4 +168,8 @@ do { \

#define irqs_disabled_flags(flags) raw_irqs_disabled_flags(flags)

+#ifndef ARCH_HAS_TASK_CONTEXT
+#define task_context_set(context) do {} while (0)
+#endif
+
#endif
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 10277429ed84..324de769dc07 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -410,8 +410,11 @@ void irq_exit(void)
#endif
account_irq_exit_time(current);
preempt_count_sub(HARDIRQ_OFFSET);
- if (!in_interrupt() && local_softirq_pending())
+ if (!in_interrupt() && local_softirq_pending()) {
+ task_context_set(TASK_CTX_SOFTIRQ);
invoke_softirq();
+ task_context_set(TASK_CTX_IRQ);
+ }

tick_irq_exit();
rcu_irq_exit();
--
2.20.1

Subject: [RFC PATCH 2/7] trace: Move the trace recursion context enum to trace.h and reuse it

Both the trace and the ring buffer code need to identify in which
context the current code is running in order to control recursion.

Move the enum to trace.h, and unify its usage.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/trace/ring_buffer.c | 25 +++++--------------------
kernel/trace/trace.h | 25 +++++++++++++++++++++----
2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41b6f96e5366..fa8cbad2ca88 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -27,6 +27,8 @@

#include <asm/local.h>

+#include "trace.h"
+
static void update_pages_handler(struct work_struct *work);

/*
@@ -428,23 +430,6 @@ struct rb_event_info {
int add_timestamp;
};

-/*
- * Used for which event context the event is in.
- * NMI = 0
- * IRQ = 1
- * SOFTIRQ = 2
- * NORMAL = 3
- *
- * See trace_recursive_lock() comment below for more details.
- */
-enum {
- RB_CTX_NMI,
- RB_CTX_IRQ,
- RB_CTX_SOFTIRQ,
- RB_CTX_NORMAL,
- RB_CTX_MAX
-};
-
/*
* head_page == tail_page && head == tail then buffer is empty.
*/
@@ -2704,10 +2689,10 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
int bit;

if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
- bit = RB_CTX_NORMAL;
+ bit = TRACE_CTX_NORMAL;
else
- bit = pc & NMI_MASK ? RB_CTX_NMI :
- pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;
+ bit = pc & NMI_MASK ? TRACE_CTX_NMI :
+ pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;

if (unlikely(val & (1 << (bit + cpu_buffer->nest))))
return 1;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d80cee49e0eb..dad2f0cd7208 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -616,20 +616,37 @@ enum {

#define TRACE_CONTEXT_MASK TRACE_LIST_MAX

+/*
+ * Used for which event context the event is in.
+ * NMI = 0
+ * IRQ = 1
+ * SOFTIRQ = 2
+ * NORMAL = 3
+ *
+ * See trace_recursive_lock() comment for more details.
+ */
+enum {
+ TRACE_CTX_NMI,
+ TRACE_CTX_IRQ,
+ TRACE_CTX_SOFTIRQ,
+ TRACE_CTX_NORMAL,
+ TRACE_CTX_MAX
+};
+
static __always_inline int trace_get_context_bit(void)
{
int bit;

if (in_interrupt()) {
if (in_nmi())
- bit = 0;
+ bit = TRACE_CTX_NMI;

else if (in_irq())
- bit = 1;
+ bit = TRACE_CTX_IRQ;
else
- bit = 2;
+ bit = TRACE_CTX_SOFTIRQ;
} else
- bit = 3;
+ bit = TRACE_CTX_NORMAL;

return bit;
}
--
2.20.1

Subject: [RFC PATCH 4/7] trace/ring_buffer: Use trace_get_context_bit()

Unify the context identification by using trace_get_context_bit().

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/trace/ring_buffer.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fa8cbad2ca88..5bff5bcb1bf8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2685,14 +2685,7 @@ static __always_inline int
trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
{
unsigned int val = cpu_buffer->current_context;
- unsigned long pc = preempt_count();
- int bit;
-
- if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
- bit = TRACE_CTX_NORMAL;
- else
- bit = pc & NMI_MASK ? TRACE_CTX_NMI :
- pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
+ int bit = trace_get_context_bit();

if (unlikely(val & (1 << (bit + cpu_buffer->nest))))
return 1;
--
2.20.1

Subject: [RFC PATCH 6/7] events: Create a trace_get_context_bit()

Create a specific function to get the current task context.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/events/internal.h | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 79c47076700a..241a2318bfdc 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -202,18 +202,37 @@ arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)

DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

-static inline int get_recursion_context(int *recursion)
+/*
+ * Used for which event context the event is in.
+ * NMI = 0
+ * IRQ = 1
+ * SOFTIRQ = 2
+ * NORMAL = 3
+ *
+ * See trace_recursive_lock() comment for more details.
+ */
+enum {
+ TRACE_CTX_NMI,
+ TRACE_CTX_IRQ,
+ TRACE_CTX_SOFTIRQ,
+ TRACE_CTX_NORMAL,
+ TRACE_CTX_MAX
+};
+
+static __always_inline int trace_get_context_bit(void)
{
- int rctx;
+ unsigned long pc = preempt_count();

- if (unlikely(in_nmi()))
- rctx = 3;
- else if (in_irq())
- rctx = 2;
- else if (in_softirq())
- rctx = 1;
+ if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
+ return pc & NMI_MASK ? TRACE_CTX_NMI :
+ pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
else
- rctx = 0;
+ return TRACE_CTX_NORMAL;
+}
+
+static inline int get_recursion_context(int *recursion)
+{
+ int rctx = trace_get_context_bit();

if (recursion[rctx])
return -1;
--
2.20.1

Subject: [RFC PATCH 7/7] events: Use early task context tracking if available

Use the early task context tracking to detect the current context,
if the arch supports it.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/events/internal.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 241a2318bfdc..8e4215f8cb93 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -219,6 +219,18 @@ enum {
TRACE_CTX_MAX
};

+#ifdef ARCH_HAS_TASK_CONTEXT
+static __always_inline int trace_get_context_bit(void)
+{
+ unsigned long tc = this_cpu_read(task_context);
+
+ if (tc)
+ return tc & TASK_CTX_NMI ? TRACE_CTX_NMI :
+ tc & TASK_CTX_IRQ ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
+ else
+ return TRACE_CTX_NORMAL;
+}
+#else /* ARCH_HAS_TASK_CONTEXT */
static __always_inline int trace_get_context_bit(void)
{
unsigned long pc = preempt_count();
@@ -229,6 +241,7 @@ static __always_inline int trace_get_context_bit(void)
else
return TRACE_CTX_NORMAL;
}
+#endif /* ARCH_HAS_TASK_CONTEXT */

static inline int get_recursion_context(int *recursion)
{
--
2.20.1

Subject: [RFC PATCH 5/7] trace: Use early task context tracking if available

Use the early task context tracking to detect the current context,
if the arch supports it.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/trace/trace.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 09318748fab8..62ba4bd0e436 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -632,7 +632,18 @@ enum {
TRACE_CTX_NORMAL,
TRACE_CTX_MAX
};
+#ifdef ARCH_HAS_TASK_CONTEXT
+static __always_inline int trace_get_context_bit(void)
+{
+ unsigned long tc = this_cpu_read(task_context);

+ if (tc)
+ return tc & TASK_CTX_NMI ? TRACE_CTX_NMI :
+ tc & TASK_CTX_IRQ ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
+ else
+ return TRACE_CTX_NORMAL;
+}
+#else /* ARCH_HAS_TASK_CONTEXT */
static __always_inline int trace_get_context_bit(void)
{
unsigned long pc = preempt_count();
@@ -643,6 +654,7 @@ static __always_inline int trace_get_context_bit(void)
else
return TRACE_CTX_NORMAL;
}
+#endif /* ARCH_HAS_TASK_CONTEXT */

static __always_inline int trace_test_and_set_recursion(int start, int max)
{
--
2.20.1

Subject: [RFC PATCH 3/7] trace: Optimize trace_get_context_bit()

trace_get_context_bit() and trace_recursive_lock() use the same logic,
but the latter reads the per-CPU preempt counter only once.

Use trace_recursive_lock()'s logic in trace_get_context_bit().

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "Joel Fernandes (Google)" <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Tommaso Cucinotta <[email protected]>
Cc: Romulo Silva de Oliveira <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/trace/trace.h | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index dad2f0cd7208..09318748fab8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -635,20 +635,13 @@ enum {

static __always_inline int trace_get_context_bit(void)
{
- int bit;
-
- if (in_interrupt()) {
- if (in_nmi())
- bit = TRACE_CTX_NMI;
+ unsigned long pc = preempt_count();

- else if (in_irq())
- bit = TRACE_CTX_IRQ;
- else
- bit = TRACE_CTX_SOFTIRQ;
- } else
- bit = TRACE_CTX_NORMAL;
-
- return bit;
+ if (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))
+ return pc & NMI_MASK ? TRACE_CTX_NMI :
+ pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
+ else
+ return TRACE_CTX_NORMAL;
}

static __always_inline int trace_test_and_set_recursion(int start, int max)
--
2.20.1

2019-04-04 00:02:15

by Andy Lutomirski

Subject: Re: [RFC PATCH 0/7] Early task context tracking

> On Apr 2, 2019, at 2:03 PM, Daniel Bristot de Oliveira <[email protected]> wrote:
>
> Note: do not take it too seriously, it is just a proof of concept.
>
> Some time ago, while using perf to check the automaton model, I noticed
> that perf was losing events. The same was reproducible with ftrace.
>
> See: https://www.spinics.net/lists/linux-rt-users/msg19781.html
>
> Steve pointed to a problem in the identification of the context
> execution used by the recursion control.
>
> Currently, recursion control uses the preempt_counter to
> identify the current context. The NMI/HARD/SOFT IRQ counters
> are set in the preempt_counter in the irq_enter/exit functions.
>
> In a trace, they are set like this:
> -------------- %< --------------------
> 0) ==========> |
> 0) | do_IRQ() { /* First C function */
> 0) | irq_enter() {
> 0) | /* set the IRQ context. */
> 0) 1.081 us | }
> 0) | handle_irq() {
> 0) | /* IRQ handling code */
> 0) + 10.290 us | }
> 0) | irq_exit() {
> 0) | /* unset the IRQ context. */
> 0) 6.657 us | }
> 0) + 18.995 us | }
> 0) <========== |
> -------------- >% --------------------
>
> As one can see, functions (and events) that take place before the
> preempt_counter is set and after it is unset are identified in the wrong
> context, causing the misinterpretation that a recursion is taking place.
> When this happens, events are dropped.
>
> To resolve this problem, the set/unset of the IRQ/NMI context needs to
> be done before the execution of the first C function, and after its
> return. By doing so, and using this method to identify the context in the
> trace recursion protection, no more events are lost.

I would much rather do the opposite: completely remove context
tracking from the asm and, instead, stick it into the C code. We'd
need to make sure that the C code is totally immune from tracing,
kprobes, etc, but it would be a nice cleanup. And then you could fix
this bug in C!
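
(Illustration only, not from the series; names are made up. A C-only marking
could be a pair of notrace helpers called first/last thing by the handler:)

-------------- %< --------------------
/* Hypothetical sketch, not from the series: mark the context in C,
 * keeping the helpers notrace so they stay invisible to the tracers. */
static notrace void task_context_enter(unsigned char ctx)
{
	this_cpu_or(task_context, ctx);
}

static notrace void task_context_exit(unsigned char ctx)
{
	this_cpu_and(task_context, (unsigned char)~ctx);
}
-------------- >% --------------------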


>
> A possible solution is to use a per-cpu variable set and unset in the
> entry point of NMI/IRQs, before calling the C handler. This possible
> solution is presented in the next patches as a proof of concept, for
> x86_64. However, other ideas might be better than mine... so...
>
> Daniel Bristot de Oliveira (7):
> x86/entry: Add support for early task context tracking
> trace: Move the trace recursion context enum to trace.h and reuse it
> trace: Optimize trace_get_context_bit()
> trace/ring_buffer: Use trace_get_context_bit()
> trace: Use early task context tracking if available
> events: Create a trace_get_context_bit()
> events: Use early task context tracking if available
>
> arch/x86/entry/entry_64.S | 9 ++++++
> arch/x86/include/asm/irqflags.h | 30 ++++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 4 +++
> include/linux/irqflags.h | 4 +++
> kernel/events/internal.h | 50 +++++++++++++++++++++++++++------
> kernel/softirq.c | 5 +++-
> kernel/trace/ring_buffer.c | 28 ++----------------
> kernel/trace/trace.h | 46 ++++++++++++++++++++++--------
> 8 files changed, 129 insertions(+), 47 deletions(-)
>
> --
> 2.20.1
>

2019-04-04 09:43:47

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/7] Early task context tracking

On Wed, Apr 03, 2019 at 05:01:02PM -0700, Andy Lutomirski wrote:
> I would much rather do the opposite: completely remove context
> tracking from the asm and, instead, stick it into the C code. We'd
> need to make sure that the C code is totally immune from tracing,
> kprobes, etc, but it would be a nice cleanup. And then you could fix
> this bug in C!

Right, so you had some prelim patches toward that a few weeks ago. Esp.
for the idtentry stuff that looked fairly straightforward.

2019-04-04 17:42:45

by Joel Fernandes

Subject: Re: [RFC PATCH 0/7] Early task context tracking

On Tue, Apr 02, 2019 at 10:03:52PM +0200, Daniel Bristot de Oliveira wrote:
> Note: do not take it too seriously, it is just a proof of concept.
>
> Some time ago, while using perf to check the automaton model, I noticed
> that perf was losing events. The same was reproducible with ftrace.
>
> See: https://www.spinics.net/lists/linux-rt-users/msg19781.html
>
> Steve pointed to a problem in the identification of the context
> execution used by the recursion control.
>
> Currently, recursion control uses the preempt_counter to
> identify the current context. The NMI/HARD/SOFT IRQ counters
> are set in the preempt_counter in the irq_enter/exit functions.

Just started looking.

Thinking out loud... can we not just update the preempt_count as early on
entry and as late on exit as possible, and fix it that way? (I haven't yet
fully looked into what could break if we did that.)

I also feel the context tracking should be unified; right now we already have
two methods AFAIK - preempt_count and lockdep - and now this would be a third.
Granted, lockdep cannot be enabled in production, but still. It would be nice
to unify these tracking methods into a single point of context tracking that
works well, and even better if we can just fix preempt_count and use that for
non-debugging use cases.

Also, I feel in_interrupt() etc. should be updated to rely on such a tracking
method if something other than preempt_count is used.

thanks,

- Joel


> In a trace, they are set like this:
> -------------- %< --------------------
> 0) ==========> |
> 0) | do_IRQ() { /* First C function */
> 0) | irq_enter() {
> 0) | /* set the IRQ context. */
> 0) 1.081 us | }
> 0) | handle_irq() {
> 0) | /* IRQ handling code */
> 0) + 10.290 us | }
> 0) | irq_exit() {
> 0) | /* unset the IRQ context. */
> 0) 6.657 us | }
> 0) + 18.995 us | }
> 0) <========== |
> -------------- >% --------------------
>
> As one can see, functions (and events) that take place before the
> preempt_counter is set and after it is unset are identified in the wrong
> context, causing the misinterpretation that a recursion is taking place.
> When this happens, events are dropped.
>
> To resolve this problem, the set/unset of the IRQ/NMI context needs to
> be done before the execution of the first C function, and after its
> return. By doing so, and using this method to identify the context in the
> trace recursion protection, no more events are lost.
>
> A possible solution is to use a per-cpu variable set and unset in the
> entry point of NMI/IRQs, before calling the C handler. This possible
> solution is presented in the next patches as a proof of concept, for
> x86_64. However, other ideas might be better than mine... so...
>
> Daniel Bristot de Oliveira (7):
> x86/entry: Add support for early task context tracking
> trace: Move the trace recursion context enum to trace.h and reuse it
> trace: Optimize trace_get_context_bit()
> trace/ring_buffer: Use trace_get_context_bit()
> trace: Use early task context tracking if available
> events: Create a trace_get_context_bit()
> events: Use early task context tracking if available
>
> arch/x86/entry/entry_64.S | 9 ++++++
> arch/x86/include/asm/irqflags.h | 30 ++++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 4 +++
> include/linux/irqflags.h | 4 +++
> kernel/events/internal.h | 50 +++++++++++++++++++++++++++------
> kernel/softirq.c | 5 +++-
> kernel/trace/ring_buffer.c | 28 ++----------------
> kernel/trace/trace.h | 46 ++++++++++++++++++++++--------
> 8 files changed, 129 insertions(+), 47 deletions(-)
>
> --
> 2.20.1
>

Subject: Re: [RFC PATCH 0/7] Early task context tracking

On 4/4/19 2:01 AM, Andy Lutomirski wrote:
>> To resolve this problem, the set/unset of the IRQ/NMI context needs to
>> be done before the execution of the first C function, and after its
>> return. By doing so, and using this method to identify the context in the
>> trace recursion protection, no more events are lost.
> I would much rather do the opposite: completely remove context
> tracking from the asm and, instead, stick it into the C code. We'd
> need to make sure that the C code is totally immune from tracing,
> kprobes, etc, but it would be a nice cleanup. And then you could fix
> this bug in C!
>
>

Humm... what we could do to have things in C is to set the variable right at the
beginning of the C handler, e.g., do_IRQ(), and right before the return.

But by doing this we would have a problem with two things:

1) the IRQ handler itself (e.g., do_IRQ());
2) functions/tracepoints that might run before and after the handler execution
(e.g., the preemptirq tracer), but still in IRQ context.

We can work around the first case by checking, in the recursion control,
whether the function is in the __irq_entry .text section.
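
(A rough, untested sketch of that check, assuming the usual
__irqentry_text_start/__irqentry_text_end section markers from
asm-generic/sections.h:)

-------------- %< --------------------
#include <asm/sections.h>

/* Rough sketch, untested: treat addresses inside the __irq_entry
 * section as already being in IRQ context. */
static bool addr_in_irqentry_text(unsigned long addr)
{
	return addr >= (unsigned long)__irqentry_text_start &&
	       addr <  (unsigned long)__irqentry_text_end;
}
-------------- >% --------------------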

The second case would still be a problem. For instance, the preemptirq:
tracepoints in the preemptirq tracer would be "dropped" in the case of a
misidentification of a recursion.

Thinking aloud: should we try to move the preemptirq tracers to the C part?

I will try to come up with a patch with this approach to see if it "works."

Thoughts?

-- Daniel

Subject: Re: [RFC PATCH 0/7] Early task context tracking

On 4/4/19 7:40 PM, Joel Fernandes wrote:
>> Currently, recursion control uses the preempt_counter to
>> identify the current context. The NMI/HARD/SOFT IRQ counters
>> are set in the preempt_counter in the irq_enter/exit functions.
> Just started looking.
>
> Thinking out loud... can we not just update the preempt_count as early on
> entry and as late on exit, as possible, and fix it that way? (Haven't fully
> yet looked into what could break if we did that.)
>
> I also feel the context tracking should be unified, right now we already have
> two methods AFAIK - preempt_count and lockdep. Now this is yet another third.
> Granted lockdep cannot be enabled in production, but still. It will be nice
> to unify these tracking methods and if there is a single point of all such
> context tracking that works well, and even better if we can just fix
> preempt_count and use that for non-debugging usecases.
>
> Also I feel in_interrupt() etc should be updated to rely on such tracking
> methods if something other than preempt_count is used..

Hi Joel,

I agree with you that it is important to have a single method to identify the
context.

I did the RFC using a specific per-cpu variable to make things simpler, and also
because I tried to move the set/unset of the preempt_counter and my dev VM
stopped booting. So it looked somewhat risky to move the preempt_counter.

Still, if people believe it is better to use the preempt_counter... I am not
against...

-- Daniel


> thanks,
>
> - Joel
>
>

2019-04-08 16:24:05

by Andy Lutomirski

Subject: Re: [RFC PATCH 0/7] Early task context tracking

On Mon, Apr 8, 2019 at 5:47 AM Daniel Bristot de Oliveira
<[email protected]> wrote:
>
> On 4/4/19 2:01 AM, Andy Lutomirski wrote:
> >> To resolve this problem, the set/unset of the IRQ/NMI context needs to
> >> be done before the execution of the first C function, and after its
> >> return. By doing so, and using this method to identify the context in the
> >> trace recursion protection, no more events are lost.
> > I would much rather do the opposite: completely remove context
> > tracking from the asm and, instead, stick it into the C code. We'd
> > need to make sure that the C code is totally immune from tracing,
> > kprobes, etc, but it would be a nice cleanup. And then you could fix
> > this bug in C!
> >
> >
>
> Humm... what we could do to have things in C is to set the variable right at the
> beginning of the C handler, e.g., do_IRQ(), and right before the return.
>
> But by doing this we would have a problem with two things:
>
> 1) irq handler itself (e.g., do_IRQ())
> 2) functions/tracepoints that might run before and after the handler execution
> (e.g., preemptirq tracer), but still in the IRQ context.
>
> We can work around the first case by checking if (the function is in the
> __irq_entry .text section) in the recursion control.
>
> The second case would still be a problem. For instance, the preemptirq:
> tracepoints in the preemptirq tracer would be "dropped" in the case of a
> misidentification of a recursion.
>
> Thinking aloud: should we try to move the preemptirq tracers to the C part?


I think we should try to move as much as possible to the C part.