This is the port of the hardware latency detector from the -rt patch
to mainline. Instead of keeping it as a device that had its own debugfs
filesystem infrastructure, it made more sense to make it into a tracer
like the irqsoff and wakeup latency tracers currently are.
With this patch set, a new tracer is added if CONFIG_HWLAT_TRACER is
enabled. The available_tracers file will then list hwlat_detector.
# cd /sys/kernel/debug/tracing
# echo hwlat_detector > current_tracer
will enable the hwlat_detector, which will create per-CPU kernel threads
(which CPUs to use is defined by tracing/hwlat_detector/cpumask; the
default is just CPU 0).
Like the other tracers (function, function_graph, preemptirqsoff,
and mmiotrace), the hwlat_detector can add a significant performance
penalty when enabled. Each of the threads created will go into a spin,
checking the trace_local_clock (sched_clock) for any gaps in time, and
will report them if they are greater than the threshold defined by
tracing/tracing_thresh (in usecs, default 10). The spin is performed with
interrupts disabled and runs for "width" usecs out of every "window"
usecs. The width and window are set via the correspondingly named files
under tracing/hwlat_detector/
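For example, one possible configuration before enabling the tracer (the
values here are illustrative only, and this assumes the cpumask file
takes a hex mask the way tracing_cpumask does):

# cd /sys/kernel/debug/tracing
# echo 15 > tracing_thresh
# echo 500 > hwlat_detector/width
# echo 1000000 > hwlat_detector/window
# echo 3 > hwlat_detector/cpumask
# echo hwlat_detector > current_tracer

That spins for 500 usecs out of every second on CPUs 0 and 1, reporting
any gap of 15 usecs or more.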
To keep authorship, the first patch is the code from the -rt patch
directly, with the setup of the device and its own ring buffer removed.
I made sure that it still compiles (it isn't included in the Makefile
at that point, but I temporarily added it to the Makefile to verify
that enough was there to build).
The next patch is what I did to make it into a tracer.
The third patch is the documentation pulled from the rt patch with
minor tweaks to still maintain authorship from Jon.
The last patch adds a new feature: the original code only ran one
thread, so this patch adds the cpumask and allows the user to run
threads on whichever CPUs they choose.
Here's an example:
# cd /sys/kernel/debug/tracing
# echo hwlat_detector > current_tracer
# sleep 1000
# cat trace
# tracer: hwlat_detector
#
# entries-in-buffer/entries-written: 4255/4255 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<...>-1093 [000] d... 190.028562: cnt:1 ts:1429650359.0152020816 inner:0 outer:12
<...>-1093 [000] dn.. 192.029474: cnt:2 ts:1429650361.0227041390 inner:10 outer:11
<...>-1093 [000] dn.. 251.057300: cnt:3 ts:1429650420.0228626083 inner:11 outer:9
<...>-1093 [000] d... 256.059542: cnt:4 ts:1429650425.0227696532 inner:0 outer:13
<...>-1093 [000] dn.. 265.063606: cnt:5 ts:1429650433.0896822479 inner:11 outer:18
<...>-1093 [000] dn.. 311.084310: cnt:6 ts:1429650480.0229495647 inner:13 outer:12
<...>-1093 [000] dn.. 324.090171: cnt:7 ts:1429650493.0105674296 inner:13 outer:9
<...>-1093 [000] dn.. 371.111359: cnt:8 ts:1429650540.0228324942 inner:0 outer:15
<...>-1093 [000] d... 385.117823: cnt:9 ts:1429650554.0105350802 inner:9 outer:11
The inner and outer times show where the latency was detected:
while (run) {
	start_ts = trace_local_clock();
	end_ts = trace_local_clock();

	/* "outer" latency: the gap between the end of the previous
	   loop iteration and the start of this one */
	if (!first && start_ts - last_ts > thresh)
		record_outer();

	/* "inner" latency: the gap between the two back-to-back
	   clock reads within this iteration */
	if (end_ts - start_ts > thresh)
		record_inner();

	last_ts = end_ts;
	first = 0;
}
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
ftrace/hwlat
Head SHA1: 876428062c7fbb94a06fefc8df86560f9a80c8dd
Jon Masters (2):
tracing: Initial hwlat_detector code from the -rt patch
tracing: Add documentation for hwlat_detector tracer
Steven Rostedt (Red Hat) (2):
tracing: Add hwlat_detector tracer
hwlat: Add per cpu mask for hwlat_detector
----
Documentation/trace/hwlat_detector.txt | 66 +++
kernel/trace/Kconfig | 40 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 11 +-
kernel/trace/trace.h | 5 +
kernel/trace/trace_entries.h | 23 +
kernel/trace/trace_hwlatdetect.c | 763 +++++++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 84 ++++
8 files changed, 988 insertions(+), 5 deletions(-)
On Thu, 23 Apr 2015, Steven Rostedt wrote:
> This is the port of the hardware latency detector from the -rt patch
> to mainline. Instead of keeping it as a device that had its own debugfs
> filesystem infrastructure, it made more sense to make it into a tracer
> like the irqsoff and wakeup latency tracers currently are.
>
> With this patch set, a new tracer is added if CONFIG_HWLAT_TRACER is
> enabled. The available_tracers file will then list hwlat_detector.
>
> # cd /sys/kernel/debug/tracing
> # echo hwlat_detector > current_tracer
>
> will enable the hwlat_detector, which will create per-CPU kernel threads
> (which CPUs to use is defined by tracing/hwlat_detector/cpumask; the
> default is just CPU 0).
>
> Like the other tracers (function, function_graph, preemptirqsoff,
> and mmiotrace), the hwlat_detector can add a significant performance
> penalty when enabled. Each of the threads created will go into a spin,
> checking the trace_local_clock (sched_clock) for any gaps in time, and
> will report them if they are greater than the threshold defined by
> tracing/tracing_thresh (in usecs, default 10). The spin is performed with
> interrupts disabled and runs for "width" usecs out of every "window"
> usecs. The
That's fine, but this still lacks a detection of NMI
disturbance. We've seen false positives reported over and over when
stuff like the NMI watchdog or perf was enabled while running this.
Aside of that isn't there a way to detect SMI crap with performance
counters on recent hardware?
Thanks,
tglx
On Thu, 23 Apr 2015 21:51:47 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> That's fine, but this still lacks a detection of NMI
> disturbance. We've seen false positives reported over and over when
> stuff like the NMI watchdog or perf was enabled while running this.
I was waiting for someone to point that out :-)
Is the NMI code generic enough now to know that an NMI triggered, so
we could detect that and ignore the latencies if one did? Or perhaps we
could even add a tracepoint at the start and end of an NMI to account
for it (add hooks there), in case there are any SMIs that sneak in
after an NMI.
I guess I could also add an NMI notifier to let me know. But I know how
much everyone loves notifiers :-)
>
> Aside of that isn't there a way to detect SMI crap with performance
> counters on recent hardware?
>
Nothing I know of that is generic enough. And just because an SMI
triggers doesn't mean it's bad, if it is quick enough. We have had
arguments with HW vendors about their SMIs, and used the hwlat_detector
to show that their SMIs are not as innocent as they claim. But we have
also seen SMIs trigger in under 1us, where they don't affect the system.
-- Steve
On Thu, 23 Apr 2015, Steven Rostedt wrote:
> On Thu, 23 Apr 2015 21:51:47 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
>
> > That's fine, but this still lacks a detection of NMI
> > disturbance. We've seen false positives reported over and over when
> > stuff like the NMI watchdog or perf was enabled while running this.
>
> I was waiting for someone to point that out :-)
Always at your service :)
> Is the NMI code generic enough now to know that an NMI triggered, so
> we could detect that and ignore the latencies if one did? Or perhaps we
> could even add a tracepoint at the start and end of an NMI to account
> for it (add hooks there), in case there are any SMIs that sneak in
> after an NMI.
There are tracepoints in nmi_enter() and nmi_exit() at least in the
kernel source I'm looking at.
> I guess I could also add an NMI notifier to let me know. But I know how
> much everyone loves notifiers :-)
I was tempted to tell you to shoot yourself, but realized in time
that this would be politically incorrect.
> >
> > Aside of that isn't there a way to detect SMI crap with performance
> > counters on recent hardware?
> >
>
> Nothing I know of that is generic enough. And just because an SMI
> triggers doesn't mean it's bad, if it is quick enough. We have had
> arguments with HW vendors about their SMIs, and used the hwlat_detector
> to show that their SMIs are not as innocent as they claim. But we have
> also seen SMIs trigger in under 1us, where they don't affect the system.
I know of an SMI event counter which is available on newer CPUs, and
Intel promised to add an SMI cycle counter as well. I have no idea
whether that one ever materialized. PeterZ should know.
But at least on the machines which have the event counter, it would be
useful to include that information as well.
Thanks,
tglx
On Thu, 23 Apr 2015 22:21:11 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> > Is the NMI code generic enough now to know that an NMI triggered, so
> > we could detect that and ignore the latencies if one did? Or perhaps we
> > could even add a tracepoint at the start and end of an NMI to account
> > for it (add hooks there), in case there are any SMIs that sneak in
> > after an NMI.
>
> There are tracepoints in nmi_enter() and nmi_exit() at least in the
> kernel source I'm looking at.
Ah, the "trace_hardirq_enter()". That would be something I could hook
into, as the only thing that could trigger that while interrupts are
disabled happens to be an NMI.
>
> > I guess I could also add an NMI notifier to let me know. But I know how
> > much everyone loves notifiers :-)
>
> I was tempted to tell you to shoot yourself, but realized in time
> that this would be politically incorrect.
And I would have to go all CodeOfConflict on you.
>
> > >
> > > Aside of that isn't there a way to detect SMI crap with performance
> > > counters on recent hardware?
> > >
> >
> > Nothing I know of that is generic enough. And just because an SMI
> > triggers doesn't mean it's bad, if it is quick enough. We have had
> > arguments with HW vendors about their SMIs, and used the hwlat_detector
> > to show that their SMIs are not as innocent as they claim. But we have
> > also seen SMIs trigger in under 1us, where they don't affect the system.
>
> I know of an SMI event counter which is available on newer CPUs, and
> Intel promised to add an SMI cycle counter as well. I have no idea
> whether that one ever materialized. PeterZ should know.
Hmm, a cycle counter would be good to add.
>
> But at least on the machines which have the event counter, it would be
> useful to include that information as well.
I'll try to add onto this series to include SMI counters if they exist.
Thanks,
-- Steve
On 04/23/2015 05:21 PM, Thomas Gleixner wrote:
> I know of an SMI event counter which is available on newer CPUs, and
> Intel promised to add an SMI cycle counter as well. I have no idea
> whether that one ever materialized. PeterZ should know.
The turbostat utility shows how many SMIs happened during a period. To
do this, it reads the 'MSR_SMI_COUNT' counter. The code's comment says
this counter has been there since Nehalem.
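For reference, a minimal userspace sketch of reading that counter
through the msr driver (this assumes root privileges, that the msr
module is loaded, and that MSR_SMI_COUNT is 0x34):

/* smi_count.c - read MSR_SMI_COUNT (0x34) on CPU 0.
 * Build: gcc -o smi_count smi_count.c
 * Needs root and the msr driver (modprobe msr). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34	/* present since Nehalem */

int main(void)
{
	uint64_t count;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* The msr driver uses the file offset as the MSR address */
	if (pread(fd, &count, sizeof(count), MSR_SMI_COUNT) != sizeof(count)) {
		perror("pread");
		close(fd);
		return 1;
	}
	close(fd);
	printf("SMIs since reset: %llu\n", (unsigned long long)count);
	return 0;
}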
On Thu, 23 Apr 2015 16:40:39 -0400
Steven Rostedt <[email protected]> wrote:
> On Thu, 23 Apr 2015 22:21:11 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
>
>
> > > Is the NMI code generic enough now to know that an NMI triggered, so
> > > we could detect that and ignore the latencies if one did? Or perhaps we
> > > could even add a tracepoint at the start and end of an NMI to account
> > > for it (add hooks there), in case there are any SMIs that sneak in
> > > after an NMI.
> >
> > There are tracepoints in nmi_enter() and nmi_exit() at least in the
> > kernel source I'm looking at.
>
> Ah, the "trace_hardirq_enter()". That would be something I could hook
> into, as the only thing that could trigger that while interrupts are
> disabled happens to be an NMI.
>
and the naming convention fooled me :-(
That's not a tracepoint, it's a hook into lockdep. There's also an
ftrace_nmi_enter/exit() there, which was once used for handling NMIs
when modifying the function counters via stop_machine. But now that we
have switched to breakpoints, we do not need that anymore.
The only user of the ftrace_nmi_enter/exit() calls is sh. Although it's
not supported anymore, I'm sure I could use these handlers for the
hwlat_detector too. I believe jump_labels are allowed in NMI context,
and we could add hooks there to be able to jump to hwlat code.
I can add a wrapper for sh to make sure that still works (even though
there's no more maintainer that I know of for the arch).
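Roughly what I'm thinking of (just a sketch; the hwlat_nmi_* names are
hypothetical, and how they get wired into the NMI entry/exit hooks is
hand-waved here):

/* Sketch: account for time stolen by NMIs during a sample window.
 * The hwlat_nmi_* names are hypothetical. */
static u64 nmi_ts;		/* timestamp taken at NMI entry */
static u64 nmi_total_ts;	/* total NMI time seen in this sample */
static int nmi_count;		/* number of NMIs seen in this sample */

void hwlat_nmi_enter(void)	/* called from the NMI entry hook */
{
	nmi_ts = trace_local_clock();
	nmi_count++;
}

void hwlat_nmi_exit(void)	/* called from the NMI exit hook */
{
	nmi_total_ts += trace_local_clock() - nmi_ts;
}

The sample loop could then subtract nmi_total_ts from a measured gap,
or simply record nmi_count with the sample, so NMI-induced hits can be
told apart from real hardware latencies.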
-- Steve
On Thu, Apr 23, 2015 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
>
> But at least on the machines which have the event counter, it would be
> useful to include that information as well.
In fact, I'd argue that we should *not* do this odd magic latency
measurement thing at all, exactly because Intel gave us the SMI
counter, and it's much more likely to be useful in real life. The odd
"stop machine and busy loop and do magic" thing is an incredibly
invasive hack that sane people will never enable at all, while the
"add support for the hardware we asked for and got" is a small thing
that we can do on all modern Intel chips, and can be enabled by
default because there is no downside.
Linus
On Thu, 23 Apr 2015 15:50:29 -0700
Linus Torvalds <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 1:21 PM, Thomas Gleixner <[email protected]> wrote:
> >
> > But at least on the machines which have the event counter, it would be
> > useful to include that information as well.
>
> In fact, I'd argue that we should *not* do this odd magic latency
> measurement thing at all, exactly because Intel gave us the SMI
> counter, and it's much more likely to be useful in real life. The odd
> "stop machine and busy loop adn do magic" thing is a incredibly
> invasive hack that sane people will never enable at all, while the
No sane person should enable this on any production machine, nor
should they use the other latency tracers (irqsoff and friends). But we
have used this odd magic latency measurement in dealings with vendors
and such in certifying their machines. Thus, this has not been
something that we just wanted to throw into the kernel for fun. This
tool has actually been very helpful to us.
> "add support for the hadrware we asked for and got" is a small thing
> that we can do on all modern Intel chips, and can be enabled by
> default because there is no downside.
>
What about a mix-and-match tracer? If the hardware supports SMI
counters (and more importantly, SMI cycle counters), it will just use
those; otherwise, if the hardware does not support SMI counting, it
falls back to the odd magic latency measurement thing.
I could even make the odd magic latency measurement thing only be
enabled via a tracer flag, such that it would be safe to use the SMI
counter if supported; if it isn't supported, a tracing message will
display info about the more invasive (not to be used in a production
environment) measurement. But the more invasive version will only be
activated if the user explicitly sets it (even if SMI counters were
not available).
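In rough code, the start logic would look something like this (all the
names here are made up for illustration):

/* Sketch of the mix-and-match start logic -- names are made up */
static int hwlat_tracer_start(void)
{
	/* The invasive busy loop runs only when explicitly requested */
	if (force_busy_loop)
		return hwlat_start_sample_threads();

	/* Prefer the cheap hardware SMI counters when they exist */
	if (smi_counters_supported())
		return hwlat_start_smi_counting();

	pr_info("hwlat: no SMI counters; the busy-loop mode must be enabled explicitly\n");
	return -ENODEV;
}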
And when this was first proposed, it was called smi_detector, and I
believe it was Andrew who suggested renaming it to hwlat_detector,
because it could, theoretically, detect other types of hardware issues
that could stop the CPU, in case something like that ever arises.
-- Steve