While working on rewriting the function graph tracer, I found a design
flaw in the code. Commit 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth
before calling trace return callback") tried to fix a bug that caused
interrupts not to be traced if max_depth was set to 1 (for debugging
NO_HZ_FULL): an interrupt was missed if it came in just as another function
that was being traced at depth 1 was exiting. This caused the timing of
being in the kernel to be off. The fix was simply to move the calling of the
return function to after the updating of the curr_ret_stack index, as that
index was what was being used.
The change log even says that it was safe to do this, but unfortunately it
was not, and the barrier() there was specifically *for* that function (which
shows why one should document barriers).
The problem is that the return callback may still be using what's on the
shadow stack, and by updating the shadow stack pointer first, we allow
another function being traced to overwrite that data. Note, if this happens,
it will only cause garbage data to be traced and will not affect the actual
operation of the kernel (i.e. it won't crash).
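To make the ordering problem concrete, here's a minimal C sketch of the
race (illustrative only, not the actual kernel code: ret_entry,
pop_return_trace() and return_callback() are simplified stand-ins for the
real ret_stack handling and the fgraph return callback):

/*
 * Simplified model of the fgraph return path after the above commit
 * moved the index update before the callback.
 */
struct ret_entry {
	unsigned long ret;		/* original return address */
	unsigned long long calltime;	/* data the return callback reads */
};

static struct ret_entry ret_stack[64];
static int curr_ret_stack = -1;		/* index of top shadow-stack entry */

void return_callback(struct ret_entry *entry);	/* stand-in for the tracer */

unsigned long pop_return_trace(void)
{
	int index = curr_ret_stack;
	unsigned long ret = ret_stack[index].ret;

	curr_ret_stack--;	/* the slot now looks free ... */
	barrier();

	/*
	 * ... but the callback below still reads it. An interrupt that
	 * lands here can push a new entry into ret_stack[index], so the
	 * callback traces garbage (the kernel itself is unaffected).
	 */
	return_callback(&ret_stack[index]);

	return ret;
}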
Unfortunately, just reverting that commit would bring back the old bug. The
real way to fix the old bug is to create another counter to handle the
depth, but this means that we need to change all the architectures that
implement function graph tracing (that's 12 of them). Luckily, I need to do
this anyway in my redesign, so this is a good thing.
Since all the architectures do basically the same thing to prepare a
function graph trace, I made a generic function that they all can use and
simplified the logic of all the architectures. Then I'm able to fix the
design problem in one place.
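For reference, here is roughly what the consolidated entry path ends up
looking like after the whole series (reconstructed from the patch
descriptions; the authoritative code is in the branch below, and
curr_ret_depth is the new counter this series adds to task_struct):

int function_graph_enter(unsigned long ret, unsigned long func,
			 unsigned long frame_pointer, unsigned long *retp)
{
	struct ftrace_graph_ent trace;

	trace.func = func;
	/* Depth now has its own counter, independent of curr_ret_stack */
	trace.depth = ++current->curr_ret_depth;

	/* Reserve the shadow-stack entry first ... */
	if (ftrace_push_return_trace(ret, func, frame_pointer, retp))
		goto out;

	/* ... then only trace if the calling function expects to */
	if (!ftrace_graph_entry(&trace))
		goto out_ret;

	return 0;
 out_ret:
	current->curr_ret_stack--;
 out:
	current->curr_ret_depth--;
	return -EBUSY;
}

Each architecture's prepare_ftrace_return() then reduces to calling
function_graph_enter() and hooking the return address only when it
returns 0.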
I pushed this code up to let zero-day have a whack at it, and I also
downloaded the latest 8.1.0 cross compilers for all the affected archs and
compile-tested them all (and got rid of all the warnings I introduced as
well).
I marked this all for stable, but in reality it may not need to be
backported, as doing so will probably not be trivial: you'd also need to fix
the architectures that are no longer used (although do we care about them?).
But if someone really cares about correct timings of the function graph
profiler when options/graph-time is set to zero, then be my guest.
Feel free to test this! I'll be pushing this to linux-next and let it
sit there a week or so before pushing it to Linus.
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
ftrace/urgent
Head SHA1: 1ffd62e282a649483afb0bb4d76f7244b3c10075
Steven Rostedt (VMware) (18):
function_graph: Create function_graph_enter() to consolidate architecture code
x86/function_graph: Simplify with function_graph_enter()
ARM: function_graph: Simplify with function_graph_enter()
arm64: function_graph: Simplify with function_graph_enter()
microblaze: function_graph: Simplify with function_graph_enter()
MIPS: function_graph: Simplify with function_graph_enter()
nds32: function_graph: Simplify with function_graph_enter()
parisc: function_graph: Simplify with function_graph_enter()
powerpc/function_graph: Simplify with function_graph_enter()
riscv/function_graph: Simplify with function_graph_enter()
s390/function_graph: Simplify with function_graph_enter()
sh/function_graph: Simplify with function_graph_enter()
sparc/function_graph: Simplify with function_graph_enter()
function_graph: Make ftrace_push_return_trace() static
function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
function_graph: Move return callback before update of curr_ret_stack
function_graph: Reverse the order of pushing the ret_stack and the callback
function_graph: Have profiler use curr_ret_stack and not depth
----
arch/arm/kernel/ftrace.c | 17 +------------
arch/arm64/kernel/ftrace.c | 15 +----------
arch/microblaze/kernel/ftrace.c | 15 ++---------
arch/mips/kernel/ftrace.c | 14 ++---------
arch/nds32/kernel/ftrace.c | 18 ++-----------
arch/parisc/kernel/ftrace.c | 17 +++----------
arch/powerpc/kernel/trace/ftrace.c | 15 ++---------
arch/riscv/kernel/ftrace.c | 14 ++---------
arch/s390/kernel/ftrace.c | 13 ++--------
arch/sh/kernel/ftrace.c | 16 ++----------
arch/sparc/kernel/ftrace.c | 11 +-------
arch/x86/kernel/ftrace.c | 15 +----------
include/linux/ftrace.h | 4 +--
include/linux/sched.h | 1 +
kernel/trace/ftrace.c | 7 ++++--
kernel/trace/trace_functions_graph.c | 49 ++++++++++++++++++++++++++++--------
16 files changed, 67 insertions(+), 174 deletions(-)
Hi Steve,
This series looks good to me.
Reviewed-by: Masami Hiramatsu <[email protected]>
for this series.
Thank you!
# Now I know why sometimes func-graph depth looks broken...
On Wed, 21 Nov 2018 19:28:01 -0500
Steven Rostedt <[email protected]> wrote:
> While working on rewriting the function graph tracer, I found a design
> flaw in the code.
>
> [ ... snip ... ]
--
Masami Hiramatsu <[email protected]>
On Wed, Nov 21, 2018 at 07:28:01PM -0500, Steven Rostedt wrote:
>
> [ ... snip ... ]
>
> Feel free to test this! I'll be pushing this to linux-next and let it
> sit there a week or so before pushing it to Linus.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> ftrace/urgent
Hi Steve,
With your ftrace/urgent branch linked above, if I try a quick
function_graph test like the following:
SYSFS=/sys/kernel/debug/tracing
echo 0 > "$SYSFS/tracing_on"
echo cmdline_proc_show > "$SYSFS/set_graph_function"
echo function_graph > "$SYSFS/current_tracer"
echo 1 > "$SYSFS/tracing_on"
I see a bunch of scheduler interrupt functions in the trace/trace_pipe
without even invoking cmdline_proc_show().
This test works as expected on Linux 4.20-rc3 though:
% cat /sys/kernel/debug/tracing/trace_pipe
2) | cmdline_proc_show() {
2) 0.320 us | seq_puts();
2) 0.030 us | seq_putc();
2) 1.352 us | }
Operator error, or did the patchset break something?
Regards,
-- Joe
On Wed, 28 Nov 2018 15:39:31 -0500
Joe Lawrence <[email protected]> wrote:
> Hi Steve,
>
> With your ftrace/urgent branch linked above, if I try a quick
> function_graph test like the following:
>
> [ ... snip ... ]
>
> Operator error, or did the patchset break something?
Nope, that does seem to be a bug :-/
-- Steve
On Wed, 28 Nov 2018 16:00:21 -0500
Steven Rostedt <[email protected]> wrote:
> On Wed, 28 Nov 2018 15:39:31 -0500
> Joe Lawrence <[email protected]> wrote:
>
> > [ ... snip ... ]
> >
> > Operator error, or did the patchset break something?
>
> Nope, that does seem to be a bug :-/
>
Does this patch fix it for you?
-- Steve
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3b8c0e24ab30..1c8f4aa7020e 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -512,6 +512,9 @@ enum {
* can only be modified by current, we can reuse trace_recursion.
*/
TRACE_IRQ_BIT,
+
+ /* Set if the function is in the set_graph_function file */
+ TRACE_GRAPH_BIT,
};
#define trace_recursion_set(bit) do { (current)->trace_recursion |= (1<<(bit)); } while (0)
@@ -855,6 +858,8 @@ static inline int ftrace_graph_addr(unsigned long addr)
}
if (ftrace_lookup_ip(ftrace_graph_hash, addr)) {
+
+ trace_recursion_set(TRACE_GRAPH_BIT);
/*
* If no irqs are to be traced, but a set_graph_function
* is set, and called by an interrupt handler, we still
@@ -901,7 +906,8 @@ extern unsigned int fgraph_max_depth;
static inline bool ftrace_graph_ignore_func(struct ftrace_graph_ent *trace)
{
/* trace it when it is-nested-in or is a function enabled. */
- return !(trace->depth || ftrace_graph_addr(trace->func)) ||
+ return !(trace_recursion_test(TRACE_GRAPH_BIT) ||
+ ftrace_graph_addr(trace->func)) ||
(trace->depth < 0) ||
(fgraph_max_depth && trace->depth >= fgraph_max_depth);
}
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 2561460d7baf..69fbb6225637 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -509,6 +509,9 @@ void trace_graph_return(struct ftrace_graph_ret *trace)
int cpu;
int pc;
+ if (!trace->depth)
+ trace_recursion_clear(TRACE_GRAPH_BIT);
+
local_irq_save(flags);
cpu = raw_smp_processor_id();
data = per_cpu_ptr(tr->trace_buffer.data, cpu);
@@ -532,6 +535,8 @@ void set_graph_array(struct trace_array *tr)
static void trace_graph_thresh_return(struct ftrace_graph_ret *trace)
{
+ if (!trace->depth)
+ trace_recursion_clear(TRACE_GRAPH_BIT);
if (tracing_thresh &&
(trace->rettime - trace->calltime < tracing_thresh))
return;
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index b7357f9f82a3..b20bf076ce7f 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -208,6 +208,9 @@ static void irqsoff_graph_return(struct ftrace_graph_ret *trace)
unsigned long flags;
int pc;
+ if (!trace->depth)
+ trace_recursion_clear(TRACE_GRAPH_BIT);
+
if (!func_prolog_dec(tr, &data, &flags))
return;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index a86b303e6c67..069867f4eae6 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -270,6 +270,9 @@ static void wakeup_graph_return(struct ftrace_graph_ret *trace)
unsigned long flags;
int pc;
+ if (!trace->depth)
+ trace_recursion_clear(TRACE_GRAPH_BIT);
+
if (!func_prolog_preempt_disable(tr, &data, &pc))
return;
On Wed, 28 Nov 2018 22:29:36 -0500
Steven Rostedt <[email protected]> wrote:
> Does this patch fix it for you?
Take 2. I realized that the reason the interrupts were being traced is
that an interrupt is more likely to come in when the depth is already 0;
setting the depth to 1 then causes the interrupt to think it's already in a
function that wants to be traced (when that hadn't actually been decided
yet).

The fix uses a bit that gets set when we want to trace (and we will trace
until that bit is cleared). That bit gets cleared in the return function
when the depth is zero. But if we are tracing something in an interrupt that
happened to interrupt the beginning of a function that had already set the
depth to 0, then we need to clear the bit at a depth of 1, not zero (and at
2 if we want to trace a function that interrupted a softirq that interrupted
the start of a normal function, and at 3 if we want to trace an NMI function
in the very unlikely case of it interrupting the start of an interrupt
function, that interrupted the start of a softirq function, that interrupted
the start of a normal context function!).

If that happens, then we will continue to trace functions when we don't want
to. To solve that, I need to record the depth (0-3) when the bit is set, and
clear the bit when the return function is at that same depth (0-3).
-- Steve
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3b8c0e24ab30..447bd96ee658 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -512,12 +512,44 @@ enum {
* can only be modified by current, we can reuse trace_recursion.
*/
TRACE_IRQ_BIT,
+
+ /* Set if the function is in the set_graph_function file */
+ TRACE_GRAPH_BIT,
+
+ /*
+ * In the very unlikely case that an interrupt came in
+ * at a start of graph tracing, and we want to trace
+ * the function in that interrupt, the depth can be greater
+ * than zero, because of the preempted start of a previous
+ * trace. In an even more unlikely case, depth could be 2
+ * if a softirq interrupted the start of graph tracing,
+ * followed by an interrupt preempting a start of graph
+ * tracing in the softirq, and depth can even be 3
+ * if an NMI came in at the start of an interrupt function
+ * that preempted a softirq start of a function that
+ * preempted normal context!!!! Luckily, it can't be
+ * greater than 3, so the next two bits are a mask
+ * of what the depth is when we set TRACE_GRAPH_BIT
+ */
+
+ TRACE_GRAPH_DEPTH_START_BIT,
+ TRACE_GRAPH_DEPTH_END_BIT,
};
#define trace_recursion_set(bit) do { (current)->trace_recursion |= (1<<(bit)); } while (0)
#define trace_recursion_clear(bit) do { (current)->trace_recursion &= ~(1<<(bit)); } while (0)
#define trace_recursion_test(bit) ((current)->trace_recursion & (1<<(bit)))
+#define trace_recursion_depth() \
+ (((current)->trace_recursion >> TRACE_GRAPH_DEPTH_START_BIT) & 3)
+#define trace_recursion_set_depth(depth) \
+ do { \
+ current->trace_recursion &= \
+ ~(3 << TRACE_GRAPH_DEPTH_START_BIT); \
+ current->trace_recursion |= \
+ ((depth) & 3) << TRACE_GRAPH_DEPTH_START_BIT; \
+ } while (0)
+
#define TRACE_CONTEXT_BITS 4
#define TRACE_FTRACE_START TRACE_FTRACE_BIT
@@ -843,8 +875,9 @@ extern void __trace_graph_return(struct trace_array *tr,
extern struct ftrace_hash *ftrace_graph_hash;
extern struct ftrace_hash *ftrace_graph_notrace_hash;
-static inline int ftrace_graph_addr(unsigned long addr)
+static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace)
{
+ unsigned long addr = trace->func;
int ret = 0;
preempt_disable_notrace();
@@ -855,6 +888,14 @@ static inline int ftrace_graph_addr(unsigned long addr)
}
if (ftrace_lookup_ip(ftrace_graph_hash, addr)) {
+
+ /*
+ * This needs to be cleared on the return functions
+ * when the depth is zero.
+ */
+ trace_recursion_set(TRACE_GRAPH_BIT);
+ trace_recursion_set_depth(trace->depth);
+
/*
* If no irqs are to be traced, but a set_graph_function
* is set, and called by an interrupt handler, we still
@@ -872,6 +913,13 @@ static inline int ftrace_graph_addr(unsigned long addr)
return ret;
}
+static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace)
+{
+ if (trace_recursion_test(TRACE_GRAPH_BIT) &&
+ trace->depth == trace_recursion_depth())
+ trace_recursion_clear(TRACE_GRAPH_BIT);
+}
+
static inline int ftrace_graph_notrace_addr(unsigned long addr)
{
int ret = 0;
@@ -885,7 +933,7 @@ static inline int ftrace_graph_notrace_addr(unsigned long addr)
return ret;
}
#else
-static inline int ftrace_graph_addr(unsigned long addr)
+static inline int ftrace_graph_addr(struct ftrace_graph_ent *trace)
{
return 1;
}
@@ -894,6 +942,8 @@ static inline int ftrace_graph_notrace_addr(unsigned long addr)
{
return 0;
}
+static inline void ftrace_graph_addr_finish(struct ftrace_graph_ret *trace)
+{ }
#endif /* CONFIG_DYNAMIC_FTRACE */
extern unsigned int fgraph_max_depth;
@@ -901,7 +951,8 @@ extern unsigned int fgraph_max_depth;
static inline bool ftrace_graph_ignore_func(struct ftrace_graph_ent *trace)
{
/* trace it when it is-nested-in or is a function enabled. */
- return !(trace->depth || ftrace_graph_addr(trace->func)) ||
+ return !(trace_recursion_test(TRACE_GRAPH_BIT) ||
+ ftrace_graph_addr(trace)) ||
(trace->depth < 0) ||
(fgraph_max_depth && trace->depth >= fgraph_max_depth);
}
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 2561460d7baf..086af4f5c3e8 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -509,6 +509,8 @@ void trace_graph_return(struct ftrace_graph_ret *trace)
int cpu;
int pc;
+ ftrace_graph_addr_finish(trace);
+
local_irq_save(flags);
cpu = raw_smp_processor_id();
data = per_cpu_ptr(tr->trace_buffer.data, cpu);
@@ -532,6 +534,8 @@ void set_graph_array(struct trace_array *tr)
static void trace_graph_thresh_return(struct ftrace_graph_ret *trace)
{
+ ftrace_graph_addr_finish(trace);
+
if (tracing_thresh &&
(trace->rettime - trace->calltime < tracing_thresh))
return;
diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index b7357f9f82a3..98ea6d28df15 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -208,6 +208,8 @@ static void irqsoff_graph_return(struct ftrace_graph_ret *trace)
unsigned long flags;
int pc;
+ ftrace_graph_addr_finish(trace);
+
if (!func_prolog_dec(tr, &data, &flags))
return;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index a86b303e6c67..7d04b9890755 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -270,6 +270,8 @@ static void wakeup_graph_return(struct ftrace_graph_ret *trace)
unsigned long flags;
int pc;
+ ftrace_graph_addr_finish(trace);
+
if (!func_prolog_preempt_disable(tr, &data, &pc))
return;
On 11/28/2018 11:24 PM, Steven Rostedt wrote:
> On Wed, 28 Nov 2018 22:29:36 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> Does this patch fix it for you?
>
> Take 2. I realized that the reason the interrupts were being traced is
> that an interrupt is more likely to come in when the depth is already 0;
> setting the depth to 1 then causes the interrupt to think it's already in
> a function that wants to be traced (when that hadn't actually been decided
> yet).
>
> [ ... snip ... ]
>
With the "take 2" patch, I only see smp_irq_work_interrupt() graph when
the graph_function is in progress.. for example:
3) | cmdline_proc_show() {
3) ==========> |
3) | smp_irq_work_interrupt() {
3) | irq_enter() {
3) | rcu_irq_enter() {
3) 0.141 us | rcu_dynticks_curr_cpu_in_eqs();
3) 0.589 us | }
3) 0.877 us | }
3) | __wake_up() {
3) | __wake_up_common_lock() {
3) 0.122 us | _raw_spin_lock_irqsave();
3) | __wake_up_common() {
3) | autoremove_wake_function() {
3) | default_wake_function() {
3) | try_to_wake_up() {
3) 0.209 us | _raw_spin_lock_irqsave();
3) | select_task_rq_fair() {
3) | select_idle_sibling() {
3) 0.140 us | available_idle_cpu();
3) 0.408 us | }
3) 0.711 us | }
3) | native_smp_send_reschedule() {
3) 4.298 us | x2apic_send_IPI();
3) 4.574 us | }
3) 0.144 us | _raw_spin_unlock_irqrestore();
3) 6.677 us | }
3) 6.915 us | }
3) 7.182 us | }
3) 7.620 us | }
3) 0.132 us | _raw_spin_unlock_irqrestore();
3) 8.401 us | }
3) 9.066 us | }
3) | irq_exit() {
3) 0.128 us | idle_cpu();
3) | rcu_irq_exit() {
3) 0.122 us | rcu_dynticks_curr_cpu_in_eqs();
3) + 10.598 us | }
3) + 11.128 us | }
3) + 22.075 us | }
3) <========== |
3) 0.414 us | seq_puts();
3) 0.128 us | seq_putc();
3) + 29.040 us | }
-- Joe
On Thu, 29 Nov 2018 10:08:55 -0500
Joe Lawrence <[email protected]> wrote:
> With the "take 2" patch, I only see smp_irq_work_interrupt() graph when
> the graph_function is in progress.. for example:
In other words it works :-)
Can I get a Tested-by from you?
-- Steve
On 11/29/2018 11:17 AM, Steven Rostedt wrote:
> On Thu, 29 Nov 2018 10:08:55 -0500
> Joe Lawrence <[email protected]> wrote:
>
>> With the "take 2" patch, I only see smp_irq_work_interrupt() graph when
>> the graph_function is in progress.. for example:
>
> In other words it works :-)
>
> Can I get a Tested-by from you?
>
Yup, thanks for fixing!
Tested-by: Joe Lawrence <[email protected]>
-- Joe