Subject: [PATCH V3 0/9] hwlat improvements and osnoise/timerlat tracers

This series proposes a set of improvements and new features for the
tracing subsystem to facilitate the debugging of low latency
deployments.

Currently, hwlat runs on a single CPU at a time, migrating across a
set of CPUs in a round-robin fashion. This series improves hwlat,
allowing it to run on multiple CPUs in parallel, increasing the
chances of detecting a hardware latency, at the cost of using more
CPU time.

It also proposes a new tracer named osnoise, which aims to help users
of isolcpus= (or a similar method) measure how much noise the OS and
the hardware add to the isolated application. The osnoise tracer is
based on the hwlat detector code. The difference is that, instead of
sampling with interrupts disabled, the osnoise tracer samples the CPU
with interrupts and preemption enabled. In this way, the sampling thread
is exposed to every source of noise from the OS. The detection and
classification of the type of noise are then made by observing the entry
points of NMIs, IRQs, SoftIRQs, and threads. If none of these sources of
noise is detected, the tool associates the noise with the hardware. The
tool periodically prints a status line reporting the total noise of the
period, the maximum single noise observed, and the percentage of CPU
available for the task (i.e., the fraction of the period not consumed
by noise), along with counters for each source of noise. To debug the
sources of noise, the tracer also adds a set of tracepoints that report
each NMI, IRQ, SoftIRQ, and thread occurrence. These tracepoints record
the starting time of the noise and, at its end, its net duration. This
halves the number of tracepoints (one instead of two) and removes the
need to manually account for the contribution of each noise occurrence.
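
As an illustration of the single-tracepoint accounting, the sketch
below (not the patch's code; every name in it is made up) shows how an
entry/exit pair can collapse into one event carrying the net duration,
while the gross duration is charged to the preempted context:

  struct noise_ctx {
          u64 arrival_time;       /* recorded at the entry point */
          u64 child_duration;     /* time consumed by nested noise */
  };

  static void noise_entry(struct noise_ctx *ctx)
  {
          ctx->arrival_time = sched_clock();
          ctx->child_duration = 0;
  }

  static void noise_exit(struct noise_ctx *ctx, struct noise_ctx *preempted)
  {
          u64 gross = sched_clock() - ctx->arrival_time;
          u64 net = gross - ctx->child_duration;

          /* the preempted context discounts the whole nested noise */
          if (preempted)
                  preempted->child_duration += gross;

          /* one event instead of two: start timestamp + net duration */
          trace_printk("noise: start %llu net %llu\n",
                       ctx->arrival_time, net);
  }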

Finally, the timerlat tracer aims to help preemptive kernel developers
find sources of wakeup latency of real-time threads. The tracer creates
a per-cpu kernel thread with real-time priority. The tracer thread sets
a periodic timer to wake itself up, and goes to sleep waiting for the
timer to fire. At the wakeup, the thread then computes the wakeup
latency as the difference between the current time and the absolute
time that the timer was set to expire. The tracer prints two lines at
every activation. The first is the timer latency observed at the hardirq
context before the activation of the thread. The second is the timer
latency observed by the thread, which is the same level that cyclictest
reports. The ACTIVATION ID field serves to relate the irq execution to
its respective thread execution. The tracer is built on top of the
osnoise tracer, and the osnoise: events can be used to trace the sources
of interference from NMIs, IRQs, and other threads. It also enables the
capture of the stack trace at the IRQ context, which helps to identify
the code paths that can cause thread delays.
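
In a nutshell, the hardirq half of the measurement reduces to the
sketch below (simplified from the patch: error handling, the osnoise
integration, and the real tracepoints are omitted):

  static enum hrtimer_restart timerlat_fired(struct hrtimer *timer)
  {
          struct timerlat_variables *tlat;
          u64 now, diff;

          tlat = container_of(timer, struct timerlat_variables, timer);
          now = ktime_to_ns(hrtimer_cb_get_time(timer));

          /* first line of output: latency seen at hardirq context */
          diff = now - tlat->abs_period;
          trace_printk("#%llu context irq timer_latency %llu ns\n",
                       ++tlat->count, diff);

          wake_up_process(tlat->kthread);
          return HRTIMER_NORESTART;
  }

The woken thread then computes its own now - abs_period (the second
line, at the level cyclictest measures), advances abs_period by the
period, re-arms the timer, and goes back to sleep.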

Changes from v2:
- osnoise sample reports in nanoseconds (as all other osnoise tracepoints)
(Bristot)
- Remove divisions from osnoise main loop (Bristot)
- Make the tracers work well when started via the kernel command line
(requested by Red Hat's performance team)
- Rename main/interrupt functions (Bristot)
- Fix timerlat reset (Juri Lelli)
- Fix timerlat start (Juri Lelli)

Changes from v1:
- Remove `` from RST (Corbet)
- Add RST files to the index (Corbet)
- Fix text and typos (Rostedt)
- Remove the cpus from hwlat (Rostedt)
- Remove the disable_migrate/fallback to mode none on hwlat (Rostedt)
- Add a generic way to read/write u64 and use it on
hwlat/osnoise/timerlat (Rostedt)
- Make osnoise/timerlat work properly with trace-cmd/tracer
instances (Rostedt)
- osnoise using the tracing_threshold (Rostedt)
- Rearrange tracepoint structure to avoid "holes" (Rostedt)

Daniel Bristot de Oliveira (8):
tracing/hwlat: Fix Clark's email
tracing/hwlat: Implement the mode config option
tracing/hwlat: Switch disable_migrate to mode none
tracing/hwlat: Implement the per-cpu mode
tracing/trace: Add a generic function to read/write u64 values from
tracefs
trace/hwlat: Use the generic function to read/write width and window
tracing: Add osnoise tracer
tracing: Add timerlat tracer

Steven Rostedt (1):
tracing: Add __print_ns_to_secs() and __print_ns_without_secs()
helpers

Documentation/trace/hwlat_detector.rst | 13 +-
Documentation/trace/index.rst | 2 +
Documentation/trace/osnoise-tracer.rst | 152 ++
Documentation/trace/timerlat-tracer.rst | 158 ++
include/linux/ftrace_irq.h | 13 +
include/trace/events/osnoise.h | 142 ++
include/trace/trace_events.h | 25 +
kernel/trace/Kconfig | 62 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 87 +
kernel/trace/trace.h | 30 +-
kernel/trace/trace_entries.h | 41 +
kernel/trace/trace_hwlat.c | 410 +++--
kernel/trace/trace_osnoise.c | 2126 +++++++++++++++++++++++
kernel/trace/trace_output.c | 119 +-
15 files changed, 3234 insertions(+), 147 deletions(-)
create mode 100644 Documentation/trace/osnoise-tracer.rst
create mode 100644 Documentation/trace/timerlat-tracer.rst
create mode 100644 include/trace/events/osnoise.h
create mode 100644 kernel/trace/trace_osnoise.c

--
2.26.3



Subject: [PATCH V3 1/9] tracing/hwlat: Fix Clark's email

Clark's email is [email protected]

No functional change.

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/trace/trace_hwlat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 632ef88131a9..0a5635401125 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -34,7 +34,7 @@
* Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
* Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <[email protected]>
*
- * Includes useful feedback from Clark Williams <[email protected]>
+ * Includes useful feedback from Clark Williams <[email protected]>
*
*/
#include <linux/kthread.h>
--
2.26.3


Subject: [PATCH V3 4/9] tracing/hwlat: Implement the per-cpu mode

Implements the per-cpu mode, in which one sampling thread is created
for each CPU in the tracing_cpumask.

The per-cpu mode has the potential to speed up the hwlat detection by
running on multiple CPUs at the same time, at the cost of higher CPU
usage with IRQs disabled. Use with care.
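
The per-cpu mode is selected by writing "per-cpu" into the
hwlat_detector/mode file added by the previous patches.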

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
Documentation/trace/hwlat_detector.rst | 3 +-
kernel/trace/trace_hwlat.c | 165 +++++++++++++++++++------
2 files changed, 130 insertions(+), 38 deletions(-)

diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 4d952df0586a..8881ce919553 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -78,10 +78,11 @@ in /sys/kernel/tracing:
- hwlat_detector/window - amount of time between (width) runs (usecs)
- hwlat_detector/mode - the thread mode

-By default, the hwlat detector's kernel thread will migrate across each CPU
+By default, one hwlat detector kernel thread will migrate across each CPU
specified in cpumask at the beginning of a new window, in a round-robin
fashion. This behavior can be changed by changing the thread mode,
the available options are:

- none: do not force migration
- round-robin: migrate across each CPU specified in cpumask [default]
+ - per-cpu: create one thread for each cpu in tracing_cpumask
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index fecc3752d7da..84689fa14d9a 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -54,9 +54,6 @@ static struct trace_array *hwlat_trace;
#define DEFAULT_SAMPLE_WIDTH 500000 /* 0.5s */
#define DEFAULT_LAT_THRESHOLD 10 /* 10us */

-/* sampling thread*/
-static struct task_struct *hwlat_kthread;
-
static struct dentry *hwlat_sample_width; /* sample width us */
static struct dentry *hwlat_sample_window; /* sample window us */
static struct dentry *hwlat_thread_mode; /* hwlat thread mode */
@@ -64,18 +61,26 @@ static struct dentry *hwlat_thread_mode; /* hwlat thread mode */
enum {
MODE_NONE = 0,
MODE_ROUND_ROBIN,
+ MODE_PER_CPU,
MODE_MAX
};
-static char *thread_mode_str[] = { "none", "round-robin" };
+static char *thread_mode_str[] = { "none", "round-robin", "per-cpu" };

/* Save the previous tracing_thresh value */
static unsigned long save_tracing_thresh;

-/* NMI timestamp counters */
-static u64 nmi_ts_start;
-static u64 nmi_total_ts;
-static int nmi_count;
-static int nmi_cpu;
+/* runtime kthread data */
+struct hwlat_kthread_data {
+ struct task_struct *kthread;
+ /* NMI timestamp counters */
+ u64 nmi_ts_start;
+ u64 nmi_total_ts;
+ int nmi_count;
+ int nmi_cpu;
+};
+
+static struct hwlat_kthread_data hwlat_single_cpu_data;
+static DEFINE_PER_CPU(struct hwlat_kthread_data, hwlat_per_cpu_data);

/* Tells NMIs to call back to the hwlat tracer to record timestamps */
bool trace_hwlat_callback_enabled;
@@ -112,6 +117,14 @@ static struct hwlat_data {
.thread_mode = MODE_ROUND_ROBIN
};

+static struct hwlat_kthread_data *get_cpu_data(void)
+{
+ if (hwlat_data.thread_mode == MODE_PER_CPU)
+ return this_cpu_ptr(&hwlat_per_cpu_data);
+ else
+ return &hwlat_single_cpu_data;
+}
+
static bool hwlat_busy;

static void trace_hwlat_sample(struct hwlat_sample *sample)
@@ -149,7 +162,9 @@ static void trace_hwlat_sample(struct hwlat_sample *sample)

void trace_hwlat_callback(bool enter)
{
- if (smp_processor_id() != nmi_cpu)
+ struct hwlat_kthread_data *kdata = get_cpu_data();
+
+ if (!kdata->kthread)
return;

/*
@@ -158,13 +173,13 @@ void trace_hwlat_callback(bool enter)
*/
if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
if (enter)
- nmi_ts_start = time_get();
+ kdata->nmi_ts_start = time_get();
else
- nmi_total_ts += time_get() - nmi_ts_start;
+ kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
}

if (enter)
- nmi_count++;
+ kdata->nmi_count++;
}

/**
@@ -176,6 +191,7 @@ void trace_hwlat_callback(bool enter)
*/
static int get_sample(void)
{
+ struct hwlat_kthread_data *kdata = get_cpu_data();
struct trace_array *tr = hwlat_trace;
struct hwlat_sample s;
time_type start, t1, t2, last_t2;
@@ -188,9 +204,8 @@ static int get_sample(void)

do_div(thresh, NSEC_PER_USEC); /* modifies interval value */

- nmi_cpu = smp_processor_id();
- nmi_total_ts = 0;
- nmi_count = 0;
+ kdata->nmi_total_ts = 0;
+ kdata->nmi_count = 0;
/* Make sure NMIs see this first */
barrier();

@@ -260,15 +275,15 @@ static int get_sample(void)
ret = 1;

/* We read in microseconds */
- if (nmi_total_ts)
- do_div(nmi_total_ts, NSEC_PER_USEC);
+ if (kdata->nmi_total_ts)
+ do_div(kdata->nmi_total_ts, NSEC_PER_USEC);

hwlat_data.count++;
s.seqnum = hwlat_data.count;
s.duration = sample;
s.outer_duration = outer_sample;
- s.nmi_total_ts = nmi_total_ts;
- s.nmi_count = nmi_count;
+ s.nmi_total_ts = kdata->nmi_total_ts;
+ s.nmi_count = kdata->nmi_count;
s.count = count;
trace_hwlat_sample(&s);

@@ -364,23 +379,43 @@ static int kthread_fn(void *data)
}

/**
- * start_kthread - Kick off the hardware latency sampling/detector kthread
+ * stop_single_kthread - Inform the hardware latency sampling/detector kthread to stop
+ *
+ * This kicks the running hardware latency sampling/detector kernel thread and
+ * tells it to stop sampling now. Use this on unload and at system shutdown.
+ */
+static void stop_single_kthread(void)
+{
+ struct hwlat_kthread_data *kdata = get_cpu_data();
+ struct task_struct *kthread = kdata->kthread;
+
+ if (!kthread)
+ return;
+
+ kthread_stop(kthread);
+
+ kdata->kthread = NULL;
+}
+
+/**
+ * start_single_kthread - Kick off the hardware latency sampling/detector kthread
*
* This starts the kernel thread that will sit and sample the CPU timestamp
* counter (TSC or similar) and look for potential hardware latencies.
*/
-static int start_kthread(struct trace_array *tr)
+static int start_single_kthread(struct trace_array *tr)
{
+ struct hwlat_kthread_data *kdata = get_cpu_data();
struct cpumask *current_mask = &save_cpumask;
struct task_struct *kthread;
int next_cpu;

- if (hwlat_kthread)
+ if (kdata->kthread)
return 0;

-
kthread = kthread_create(kthread_fn, NULL, "hwlatd");
if (IS_ERR(kthread)) {
pr_err(BANNER "could not start sampling thread\n");
return -ENOMEM;
}
@@ -400,24 +435,73 @@ static int start_kthread(struct trace_array *tr)

sched_setaffinity(kthread->pid, current_mask);

- hwlat_kthread = kthread;
+ kdata->kthread = kthread;
wake_up_process(kthread);

return 0;
}

/**
- * stop_kthread - Inform the hardware latency sampling/detector kthread to stop
+ * stop_per_cpu_kthreads - Inform the hardware latency sampling/detector kthreads to stop
*
- * This kicks the running hardware latency sampling/detector kernel thread and
+ * This kicks the running hardware latency sampling/detector kernel threads and
- * tells it to stop sampling now. Use this on unload and at system shutdown.
+ * tells them to stop sampling now. Use this on unload and at system shutdown.
*/
-static void stop_kthread(void)
+static void stop_per_cpu_kthreads(void)
{
- if (!hwlat_kthread)
- return;
- kthread_stop(hwlat_kthread);
- hwlat_kthread = NULL;
+ struct task_struct *kthread;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ kthread = per_cpu(hwlat_per_cpu_data, cpu).kthread;
+ if (kthread)
+ kthread_stop(kthread);
+ }
+}
+
+/**
+ * start_per_cpu_kthreads - Kick off the hardware latency sampling/detector kthreads
+ *
+ * This starts the kernel threads that will sit on potentially all cpus and
+ * sample the CPU timestamp counter (TSC or similar) and look for potential
+ * hardware latencies.
+ */
+static int start_per_cpu_kthreads(struct trace_array *tr)
+{
+ struct cpumask *current_mask = &save_cpumask;
+ struct task_struct *kthread;
+ char comm[24];
+ int cpu;
+
+ /*
+ * Run only on CPUs in which hwlat is allowed to run.
+ */
+ get_online_cpus();
+ cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
+ put_online_cpus();
+
+ for_each_online_cpu(cpu)
+ per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
+
+ for_each_cpu(cpu, current_mask) {
+ snprintf(comm, 24, "hwlatd/%d", cpu);
+
+ kthread = kthread_create_on_cpu(kthread_fn, NULL, cpu, comm);
+ if (IS_ERR(kthread)) {
+ pr_err(BANNER "could not start sampling thread\n");
+ stop_per_cpu_kthreads();
+ return -ENOMEM;
+ }
+
+ per_cpu(hwlat_per_cpu_data, cpu).kthread = kthread;
+ wake_up_process(kthread);
+ }
+
+ return 0;
}

/*
@@ -596,7 +680,8 @@ static void hwlat_tracer_stop(struct trace_array *tr);
* The "none" sets the allowed cpumask for a single hwlatd thread at the
* startup and lets the scheduler handle the migration. The default mode is
* the "round-robin" one, in which a single hwlatd thread runs, migrating
- * among the allowed CPUs in a round-robin fashion.
+ * among the allowed CPUs in a round-robin fashion. The "per-cpu" mode
+ * creates one hwlatd thread per allowed CPU.
*/
static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -720,14 +805,20 @@ static void hwlat_tracer_start(struct trace_array *tr)
{
int err;

- err = start_kthread(tr);
+ if (hwlat_data.thread_mode == MODE_PER_CPU)
+ err = start_per_cpu_kthreads(tr);
+ else
+ err = start_single_kthread(tr);
if (err)
pr_err(BANNER "Cannot start hwlat kthread\n");
}

static void hwlat_tracer_stop(struct trace_array *tr)
{
- stop_kthread();
+ if (hwlat_data.thread_mode == MODE_PER_CPU)
+ stop_per_cpu_kthreads();
+ else
+ stop_single_kthread();
}

static int hwlat_tracer_init(struct trace_array *tr)
@@ -756,7 +847,7 @@ static int hwlat_tracer_init(struct trace_array *tr)

static void hwlat_tracer_reset(struct trace_array *tr)
{
- stop_kthread();
+ hwlat_tracer_stop(tr);

/* the tracing threshold is static between runs */
last_tracing_thresh = tracing_thresh;
--
2.26.3


Subject: [PATCH V3 5/9] tracing/trace: Add a generic function to read/write u64 values from tracefs

Provides a generic read and write implementation to read and save u64
values from/to a file on tracefs. The trace_ull_config structure defines
where to read/write the value, the min and the max acceptable values,
and a lock to protect the write.
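
As an illustration, a tracer could expose a u64 knob using this
interface roughly as below (hypothetical example: the file name, the
limits, and top_dir are made up; the structure and fops are the ones
added by this patch):

  static u64 sample_width = 500000;
  static u64 width_min = 1;
  static u64 width_max = 10000000;
  static DEFINE_MUTEX(width_lock);

  static struct trace_ull_config width_config = {
          .lock = &width_lock,
          .val = &sample_width,
          .min = &width_min,
          .max = &width_max,
  };

  /* in the tracer's init code: */
  trace_create_file("width", 0644, top_dir, &width_config,
                    &trace_ull_config_fops);

Reads then return the current value, and writes are validated against
the min/max and serialized by the lock.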

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/trace/trace.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 19 ++++++++++
2 files changed, 106 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 560e4c8d3825..b4cd89010813 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7516,6 +7516,93 @@ static const struct file_operations snapshot_raw_fops = {

#endif /* CONFIG_TRACER_SNAPSHOT */

+/*
+ * trace_ull_config_write - Generic write function to save u64 value
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write
+ * @ppos: The current "file" position
+ *
+ * This function provides a generic write implementation to save u64 values
+ * from a file on tracefs. The filp->private_data must point to a
+ * trace_ull_config structure that defines where to write the value, the
+ * min and the max acceptable values, and a lock to protect the write.
+ */
+static ssize_t
+trace_ull_config_write(struct file *filp, const char __user *ubuf, size_t cnt,
+ loff_t *ppos)
+{
+ struct trace_ull_config *config = filp->private_data;
+ u64 val;
+ int err;
+
+ if (!config)
+ return -EFAULT;
+
+ err = kstrtoull_from_user(ubuf, cnt, 10, &val);
+ if (err)
+ return err;
+
+ if (config->lock)
+ mutex_lock(config->lock);
+
+ if (config->min && val < *config->min)
+ err = -EINVAL;
+
+ if (config->max && val > *config->max)
+ err = -EINVAL;
+
+ if (!err)
+ *config->val = val;
+
+ if (config->lock)
+ mutex_unlock(config->lock);
+
+ if (err)
+ return err;
+
+ return cnt;
+}
+
+/*
+ * trace_ull_config_read - Generic read function to read u64 value via tracefs
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @cnt: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * This function provides a generic read implementation to read a u64 value
+ * from a file on tracefs. The filp->private_data must point to a
+ * trace_ull_config structure with valid data.
+ */
+static ssize_t
+trace_ull_config_read(struct file *filp, char __user *ubuf, size_t cnt,
+ loff_t *ppos)
+{
+ struct trace_ull_config *config = filp->private_data;
+ char buf[ULL_STR_SIZE];
+ u64 val;
+ int len;
+
+ if (!config)
+ return -EFAULT;
+
+ val = *config->val;
+
+ if (cnt > sizeof(buf))
+ cnt = sizeof(buf);
+
+ len = snprintf(buf, sizeof(buf), "%llu\n", val);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
+}
+
+const struct file_operations trace_ull_config_fops = {
+ .open = tracing_open_generic,
+ .read = trace_ull_config_read,
+ .write = trace_ull_config_write,
+};
+
#define TRACING_LOG_ERRS_MAX 8
#define TRACING_LOG_LOC_MAX 128

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cd80d046c7a5..44fa25c1264a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1952,4 +1952,23 @@ static inline bool is_good_name(const char *name)
return true;
}

+/*
+ * This is a generic way to read and write a u64 config value from a file
+ * in tracefs.
+ *
+ * The value is stored on the variable pointed by *val. The value needs
+ * to be at least *min and at most *max. The write is protected by an
+ * existing *lock.
+ */
+struct trace_ull_config {
+ struct mutex *lock;
+ u64 *val;
+ u64 *min;
+ u64 *max;
+};
+
+#define ULL_STR_SIZE 22 /* 20 digits max */
+
+extern const struct file_operations trace_ull_config_fops;
+
#endif /* _LINUX_KERNEL_TRACE_H */
--
2.26.3


Subject: [PATCH V3 3/9] tracing/hwlat: Switch disable_migrate to mode none

When in the round-robin mode, if the tracer detects a change in the
hwlatd thread affinity by an external tool, e.g., taskset, the
round-robin logic is disabled. The disable_migrate variable currently
tracks this.

With the addition of the "mode" config and the mode "none," the
disable_migrate logic is equivalent to switching to the "none" mode.

Hence, instead of using a hidden variable to track this behavior,
switch the mode to none, informing the user about this change.

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/trace/trace_hwlat.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 1f5d48830fd6..fecc3752d7da 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -286,7 +286,6 @@ static int get_sample(void)
}

static struct cpumask save_cpumask;
-static bool disable_migrate;

static void move_to_next_cpu(void)
{
@@ -294,15 +293,13 @@ static void move_to_next_cpu(void)
struct trace_array *tr = hwlat_trace;
int next_cpu;

- if (disable_migrate)
- return;
/*
* If for some reason the user modifies the CPU affinity
* of this thread, then stop migrating for the duration
* of the current test.
*/
if (!cpumask_equal(current_mask, current->cpus_ptr))
- goto disable;
+ goto change_mode;

get_online_cpus();
cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
@@ -313,7 +310,7 @@ static void move_to_next_cpu(void)
next_cpu = cpumask_first(current_mask);

if (next_cpu >= nr_cpu_ids) /* Shouldn't happen! */
- goto disable;
+ goto change_mode;

cpumask_clear(current_mask);
cpumask_set_cpu(next_cpu, current_mask);
@@ -321,8 +318,9 @@ static void move_to_next_cpu(void)
sched_setaffinity(0, current_mask);
return;

- disable:
- disable_migrate = true;
+ change_mode:
+ hwlat_data.thread_mode = MODE_NONE;
+ pr_info(BANNER "cpumask changed while in round-robin mode, switching to mode none\n");
}

/*
@@ -740,7 +738,6 @@ static int hwlat_tracer_init(struct trace_array *tr)

hwlat_trace = tr;

- disable_migrate = false;
hwlat_data.count = 0;
tr->max_latency = 0;
save_tracing_thresh = tracing_thresh;
--
2.26.3


Subject: [PATCH V3 2/9] tracing/hwlat: Implement the mode config option

Provides the "mode" config to the hardware latency detector. hwlatd has
two different operation modes. The default mode is the "round-robin" one,
in which a single hwlatd thread runs, migrating among the allowed CPUs in a
"round-robin" fashion. This is the current behavior.

The "none" sets the allowed cpumask for a single hwlatd thread at the
startup, but skips the round-robin, letting the scheduler handle the
migration.

This is done in preparation for the per-cpu mode.
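
Reading the "mode" file lists the available modes, with the current
one surrounded by brackets. For example, with the default
configuration:

  [root@f32 tracing]# cat hwlat_detector/mode
  none [round-robin]

Writing one of the listed strings into the file switches the mode.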

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
Documentation/trace/hwlat_detector.rst | 12 +-
kernel/trace/trace_hwlat.c | 171 +++++++++++++++++++++++--
2 files changed, 169 insertions(+), 14 deletions(-)

diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 5739349649c8..4d952df0586a 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -76,8 +76,12 @@ in /sys/kernel/tracing:
- tracing_cpumask - the CPUs to move the hwlat thread across
- hwlat_detector/width - specified amount of time to spin within window (usecs)
- hwlat_detector/window - amount of time between (width) runs (usecs)
+ - hwlat_detector/mode - the thread mode

-The hwlat detector's kernel thread will migrate across each CPU specified in
-tracing_cpumask between each window. To limit the migration, either modify
-tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
-affinity directly, and the migration will stop.
+By default, the hwlat detector's kernel thread will migrate across each CPU
+specified in cpumask at the beginning of a new window, in a round-robin
+fashion. This behavior can be changed by changing the thread mode,
+the available options are:
+
+ - none: do not force migration
+ - round-robin: migrate across each CPU specified in cpumask [default]
diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 0a5635401125..1f5d48830fd6 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -59,6 +59,14 @@ static struct task_struct *hwlat_kthread;

static struct dentry *hwlat_sample_width; /* sample width us */
static struct dentry *hwlat_sample_window; /* sample window us */
+static struct dentry *hwlat_thread_mode; /* hwlat thread mode */
+
+enum {
+ MODE_NONE = 0,
+ MODE_ROUND_ROBIN,
+ MODE_MAX
+};
+static char *thread_mode_str[] = { "none", "round-robin" };

/* Save the previous tracing_thresh value */
static unsigned long save_tracing_thresh;
@@ -96,11 +104,16 @@ static struct hwlat_data {
u64 sample_window; /* total sampling window (on+off) */
u64 sample_width; /* active sampling portion of window */

+ int thread_mode; /* thread mode */
+
} hwlat_data = {
.sample_window = DEFAULT_SAMPLE_WINDOW,
.sample_width = DEFAULT_SAMPLE_WIDTH,
+ .thread_mode = MODE_ROUND_ROBIN
};

+static bool hwlat_busy;
+
static void trace_hwlat_sample(struct hwlat_sample *sample)
{
struct trace_array *tr = hwlat_trace;
@@ -328,7 +341,8 @@ static int kthread_fn(void *data)

while (!kthread_should_stop()) {

- move_to_next_cpu();
+ if (hwlat_data.thread_mode == MODE_ROUND_ROBIN)
+ move_to_next_cpu();

local_irq_disable();
get_sample();
@@ -366,11 +380,6 @@ static int start_kthread(struct trace_array *tr)
if (hwlat_kthread)
return 0;

- /* Just pick the first CPU on first iteration */
- get_online_cpus();
- cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
- put_online_cpus();
- next_cpu = cpumask_first(current_mask);

kthread = kthread_create(kthread_fn, NULL, "hwlatd");
if (IS_ERR(kthread)) {
@@ -378,8 +387,19 @@ static int start_kthread(struct trace_array *tr)
return -ENOMEM;
}

- cpumask_clear(current_mask);
- cpumask_set_cpu(next_cpu, current_mask);
+
+ /* Just pick the first CPU on first iteration */
+ get_online_cpus();
+ cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
+ put_online_cpus();
+
+ if (hwlat_data.thread_mode == MODE_ROUND_ROBIN) {
+ next_cpu = cpumask_first(current_mask);
+ cpumask_clear(current_mask);
+ cpumask_set_cpu(next_cpu, current_mask);
+ }
+
sched_setaffinity(kthread->pid, current_mask);

hwlat_kthread = kthread;
@@ -511,6 +531,125 @@ hwlat_window_write(struct file *filp, const char __user *ubuf,
return cnt;
}

+static void *s_mode_start(struct seq_file *s, loff_t *pos)
+{
+ int mode = *pos;
+
+ if (mode >= MODE_MAX)
+ return NULL;
+
+ return pos;
+}
+
+static void *s_mode_next(struct seq_file *s, void *v, loff_t *pos)
+{
+ int mode = ++(*pos);
+
+ if (mode >= MODE_MAX)
+ return NULL;
+
+ return pos;
+}
+
+static int s_mode_show(struct seq_file *s, void *v)
+{
+ loff_t *pos = v;
+ int mode = *pos;
+
+ if (mode == hwlat_data.thread_mode)
+ seq_printf(s, "[%s]", thread_mode_str[mode]);
+ else
+ seq_printf(s, "%s", thread_mode_str[mode]);
+
+ if (mode != MODE_MAX)
+ seq_puts(s, " ");
+
+ return 0;
+}
+
+static void s_mode_stop(struct seq_file *s, void *v)
+{
+ seq_puts(s, "\n");
+}
+
+static const struct seq_operations thread_mode_seq_ops = {
+ .start = s_mode_start,
+ .next = s_mode_next,
+ .show = s_mode_show,
+ .stop = s_mode_stop
+};
+
+static int hwlat_mode_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &thread_mode_seq_ops);
+};
+
+static void hwlat_tracer_start(struct trace_array *tr);
+static void hwlat_tracer_stop(struct trace_array *tr);
+/**
+ * hwlat_mode_write - Write function for "mode" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @cnt: The maximum number of bytes to write to "file"
+ * @ppos: The current position in @file
+ *
+ * This function provides a write implementation for the "mode" interface
+ * to the hardware latency detector. hwlatd has different operation modes.
+ * The "none" sets the allowed cpumask for a single hwlatd thread at the
+ * startup and lets the scheduler handle the migration. The default mode is
+ * the "round-robin" one, in which a single hwlatd thread runs, migrating
+ * among the allowed CPUs in a round-robin fashion.
+ */
+static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct trace_array *tr = hwlat_trace;
+ const char *mode;
+ char buf[64];
+ int ret, i;
+
+ if (cnt >= sizeof(buf))
+ return -EINVAL;
+
+ if (copy_from_user(buf, ubuf, cnt))
+ return -EFAULT;
+
+ buf[cnt] = 0;
+
+ mode = strstrip(buf);
+
+ ret = -EINVAL;
+
+ /*
+ * trace_types_lock is taken to avoid concurrency on start/stop
+ * and hwlat_busy.
+ */
+ mutex_lock(&trace_types_lock);
+ if (hwlat_busy)
+ hwlat_tracer_stop(tr);
+
+ mutex_lock(&hwlat_data.lock);
+
+ for (i = 0; i < MODE_MAX; i++) {
+ if (strcmp(mode, thread_mode_str[i]) == 0) {
+ hwlat_data.thread_mode = i;
+ ret = cnt;
+ }
+ }
+
+ mutex_unlock(&hwlat_data.lock);
+
+ if (hwlat_busy)
+ hwlat_tracer_start(tr);
+ mutex_unlock(&trace_types_lock);
+
+ *ppos += cnt;
+
+ return ret;
+}
+
static const struct file_operations width_fops = {
.open = tracing_open_generic,
.read = hwlat_read,
@@ -523,6 +662,13 @@ static const struct file_operations window_fops = {
.write = hwlat_window_write,
};

+static const struct file_operations thread_mode_fops = {
+ .open = hwlat_mode_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = hwlat_mode_write
+};
/**
* init_tracefs - A function to initialize the tracefs interface files
*
@@ -558,6 +704,13 @@ static int init_tracefs(void)
if (!hwlat_sample_width)
goto err;

+ hwlat_thread_mode = trace_create_file("mode", 0644,
+ top_dir,
+ NULL,
+ &thread_mode_fops);
+ if (!hwlat_thread_mode)
+ goto err;
+
return 0;

err:
@@ -579,8 +732,6 @@ static void hwlat_tracer_stop(struct trace_array *tr)
stop_kthread();
}

-static bool hwlat_busy;
-
static int hwlat_tracer_init(struct trace_array *tr)
{
/* Only allow one instance to enable this */
--
2.26.3


Subject: [PATCH V3 9/9] tracing: Add timerlat tracer

The timerlat tracer aims to help preemptive kernel developers find
sources of wakeup latency of real-time threads. Like cyclictest,
the tracer sets a periodic timer that wakes up a thread. The thread then
computes a *wakeup latency* value as the difference between the *current
time* and the *absolute time* that the timer was set to expire. The main
goal of timerlat is to trace this latency in a way that helps kernel
developers find its source.

Usage

Write the ASCII text "timerlat" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).

For example:

[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo timerlat > current_tracer

It is possible to follow the trace by reading the trace file::

[root@f32 tracing]# cat trace
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# || /
# |||| ACTIVATION
# TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
# | | | |||| | | | |
<idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
<...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
<idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
<...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
<idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
<...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
<idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
<...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns

The tracer creates a per-cpu kernel thread with real-time priority that
prints two lines at every activation. The first is the *timer latency*
observed at the *hardirq* context before the activation of the thread.
The second is the *timer latency* observed by the thread, which is the
same level that cyclictest reports. The ACTIVATION ID field
serves to relate the *irq* execution to its respective *thread* execution.
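
For instance, in the output above, activation #1 on CPU 0 shows an
*irq* timer latency of 932 ns against a *thread* timer latency of
11700 ns: the difference between the two is the cost of the timer IRQ
handling, the wakeup, and the context switch of the thread, plus any
interference that happens in between.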

The irq/thread splitting is important to clarify from which context
the unexpectedly high value is coming. The *irq* context can be
delayed by hardware-related actions, such as SMIs, NMIs, IRQs,
or by a thread masking interrupts. Once the timer fires, the delay
can also be influenced by blocking caused by threads. For example, by
postponing the scheduler execution via preempt_disable(), by the
scheduler execution, or by masking interrupts. Threads can
also be delayed by the interference from other threads and IRQs.

The timerlat tracer can also take advantage of the osnoise: trace events.
For example:

[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo timerlat > current_tracer
[root@f32 tracing]# echo osnoise > set_event
[root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
[root@f32 tracing]# tail -10 trace
cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns

For further information see: Documentation/trace/timerlat-tracer.rst

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
Documentation/trace/index.rst | 1 +
Documentation/trace/timerlat-tracer.rst | 158 ++++++
kernel/trace/Kconfig | 28 ++
kernel/trace/trace.h | 2 +
kernel/trace/trace_entries.h | 16 +
kernel/trace/trace_osnoise.c | 625 ++++++++++++++++++++++--
kernel/trace/trace_output.c | 47 ++
7 files changed, 846 insertions(+), 31 deletions(-)
create mode 100644 Documentation/trace/timerlat-tracer.rst

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 608107b27cc0..3769b9b7aed8 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -24,6 +24,7 @@ Linux Tracing Technologies
boottime-trace
hwlat_detector
osnoise-tracer
+ timerlat-tracer
intel_th
ring-buffer-design
stm
diff --git a/Documentation/trace/timerlat-tracer.rst b/Documentation/trace/timerlat-tracer.rst
new file mode 100644
index 000000000000..902d2f9a489f
--- /dev/null
+++ b/Documentation/trace/timerlat-tracer.rst
@@ -0,0 +1,158 @@
+###############
+Timerlat tracer
+###############
+
+The timerlat tracer aims to help preemptive kernel developers find
+sources of wakeup latency of real-time threads. Like cyclictest,
+the tracer sets a periodic timer that wakes up a thread. The thread then
+computes a *wakeup latency* value as the difference between the *current
+time* and the *absolute time* that the timer was set to expire. The main
+goal of timerlat is to trace this latency in a way that helps kernel
+developers find its source.
+
+Usage
+-----
+
+Write the ASCII text "timerlat" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: timerlat
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # || /
+ # |||| ACTIVATION
+ # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
+ # | | | |||| | | | |
+ <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
+ <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
+ <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
+ <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
+ <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
+ <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
+ <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
+ <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
+
+
+The tracer creates a per-cpu kernel thread with real-time priority that
+prints two lines at every activation. The first is the *timer latency*
+observed at the *hardirq* context before the activation of the thread.
+The second is the *timer latency* observed by the thread, which is the
+same level that cyclictest reports. The ACTIVATION ID field
+serves to relate the *irq* execution to its respective *thread* execution.
+
+The *irq*/*thread* splitting is important to clarify from which context
+the unexpectedly high value is coming. The *irq* context can be
+delayed by hardware-related actions, such as SMIs, NMIs, IRQs,
+or by a thread masking interrupts. Once the timer fires, the delay
+can also be influenced by blocking caused by threads. For example, by
+postponing the scheduler execution via preempt_disable(), by the
+scheduler execution, or by masking interrupts. Threads can
+also be delayed by the interference from other threads and IRQs.
+
+Tracer options
+---------------------
+
+The timerlat tracer is built on top of the osnoise tracer, so its
+configuration is also done in the osnoise/ config directory. The
+timerlat configs are:
+
+ - cpus: CPUs on which a timerlat thread will execute.
+ - timerlat_period_us: the period of the timerlat thread.
+ - stop_tracing_in_us: stop the system tracing if a
+ timer latency higher than the configured value happens at the
+ *irq* context. Writing 0 disables this option.
+ - stop_tracing_out_us: stop the system tracing if a
+ timer latency higher than the configured value happens at the
+ *thread* context. Writing 0 disables this option.
+ - print_stack: save the stack trace at the IRQ occurrence, and print
+ it after the *thread* reads the latency.
+
+timerlat and osnoise
+----------------------------
+
+The timerlat tracer can also take advantage of the osnoise: trace events.
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# echo osnoise > set_event
+ [root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
+ [root@f32 tracing]# tail -10 trace
+ cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
+ cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
+ cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
+ cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
+ cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
+ cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
+ cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
+ cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
+ cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
+ timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns
+
+In this case, the root cause of the timer latency does not point to a
+single source, but to a series of call_function_single IPIs, followed
+by a 10 *us* delay from cc1 thread noise, along with the regular timer
+activation. It is worth mentioning that the *duration* values reported
+by the osnoise events are *net* values. For example, the
+thread_noise does not include the duration of the overhead caused
+by the IRQ execution (which indeed accounted for 12736 ns).
+
+Such pieces of evidence are useful hints for the developer, who can
+use other tracing methods to figure out how to optimize the environment.
+
+IRQ stacktrace
+---------------------------
+
+The osnoise/print_stack option is helpful for the cases in which thread
+noise is the major factor in the timer latency, e.g., because of
+preemption or IRQs being disabled. For example::
+
+ [root@f32 tracing]# echo 500 > osnoise/stop_tracing_out_us
+ [root@f32 tracing]# echo 500 > osnoise/print_stack
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
+ insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
+ insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns
+ insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
+ insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
+ insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns
+ timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns
+ timerlat/7-1001 [007] ....1.. 200.203446: <stack trace>
+ => timerlat_irq
+ => __hrtimer_run_queues
+ => hrtimer_interrupt
+ => __sysvec_apic_timer_interrupt
+ => asm_call_irq_on_stack
+ => sysvec_apic_timer_interrupt
+ => asm_sysvec_apic_timer_interrupt
+ => delay_tsc
+ => dummy_load_1ms_pd_init
+ => do_one_initcall
+ => do_init_module
+ => __do_sys_finit_module
+ => do_syscall_64
+ => entry_SYSCALL_64_after_hwframe
+
+In this case, it is possible to see that the thread added the highest
+contribution to the *timer latency*, and the stack trace points to
+a function named dummy_load_1ms_pd_init, which (on purpose) contains
+the following code::
+
+ static int __init dummy_load_1ms_pd_init(void)
+ {
+ preempt_disable();
+ mdelay(1);
+ preempt_enable();
+ return 0;
+ }
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 41582ae4682b..d567b1717c4c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -390,6 +390,34 @@ config OSNOISE_TRACER
To enable this tracer, echo in "osnoise" into the current_tracer
file.

+config TIMERLAT_TRACER
+ bool "Timerlat tracer"
+ select OSNOISE_TRACER
+ select GENERIC_TRACER
+ help
+ The timerlat tracer aims to help preemptive kernel developers
+ find sources of wakeup latency of real-time threads.
+
+ The tracer creates a per-cpu kernel thread with real-time priority.
+ The tracer thread sets a periodic timer to wakeup itself, and goes
+ to sleep waiting for the timer to fire. At the wakeup, the thread
+ then computes a wakeup latency value as the difference between
+ the current time and the absolute time that the timer was set
+ to expire.
+
+ The tracer prints two lines at every activation. The first is the
+ timer latency observed at the hardirq context before the
+ activation of the thread. The second is the timer latency observed
+ by the thread, which is the same level that cyclictest reports. The
+ ACTIVATION ID field serves to relate the irq execution to its
+ respective thread execution.
+
+ The tracer is built on top of the osnoise tracer, and the osnoise:
+ events can be used to trace the source of interference from NMI,
+ IRQs and other threads. It also enables the capture of the
+ stacktrace at the IRQ context, which helps to identify the code
+ path that can cause thread delay.
+
config MMIOTRACE
bool "Memory mapped IO tracing"
depends on HAVE_MMIOTRACE_SUPPORT && PCI
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 754dfe8987a2..889986242c40 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -45,6 +45,7 @@ enum trace_type {
TRACE_BPUTS,
TRACE_HWLAT,
TRACE_OSNOISE,
+ TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,

@@ -448,6 +449,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \
IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
+ IF_ASSIGN(var, ent, struct timerlat_entry, TRACE_TIMERLAT);\
IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
TRACE_MMIO_RW); \
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 158c0984b59b..cd41e863b51c 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -385,3 +385,19 @@ FTRACE_ENTRY(osnoise, osnoise_entry,
__entry->softirq_count,
__entry->thread_count)
);
+
+FTRACE_ENTRY(timerlat, timerlat_entry,
+
+ TRACE_TIMERLAT,
+
+ F_STRUCT(
+ __field( unsigned int, seqnum )
+ __field( int, context )
+ __field( u64, timer_latency )
+ ),
+
+ F_printk("seq:%u\tcontext:%d\ttimer_latency:%llu\n",
+ __entry->seqnum,
+ __entry->context,
+ __entry->timer_latency)
+);
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 9bd40b514d84..3a8d70fbb57f 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/*
* OS Noise Tracer: computes the OS Noise suffered by a running thread.
+ * Timerlat Tracer: measures the wakeup latency of a timer-triggered IRQ and thread.
*
* Based on "hwlat_detector" tracer by:
* Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
@@ -21,6 +22,7 @@
#include <linux/cpumask.h>
#include <linux/delay.h>
#include <linux/sched/clock.h>
+#include <uapi/linux/sched/types.h>
#include <linux/sched.h>
#include "trace.h"

@@ -45,6 +47,9 @@ static struct trace_array *osnoise_trace;
#define DEFAULT_SAMPLE_PERIOD 1000000 /* 1s */
#define DEFAULT_SAMPLE_RUNTIME 1000000 /* 1s */

+#define DEFAULT_TIMERLAT_PERIOD 1000 /* 1ms */
+#define DEFAULT_TIMERLAT_PRIO 95 /* FIFO 95 */
+
/*
* NMI runtime info.
*/
@@ -62,6 +67,8 @@ struct irq {
u64 delta_start;
};

+#define IRQ_CONTEXT 0
+#define THREAD_CONTEXT 1
/*
* SofIRQ runtime info.
*/
@@ -108,32 +115,76 @@ static inline struct osnoise_variables *this_cpu_osn_var(void)
return this_cpu_ptr(&per_cpu_osnoise_var);
}

+#ifdef CONFIG_TIMERLAT_TRACER
+/*
+ * Runtime information for the timer mode.
+ */
+struct timerlat_variables {
+ struct task_struct *kthread;
+ struct hrtimer timer;
+ u64 rel_period;
+ u64 abs_period;
+ bool tracing_thread;
+ u64 count;
+};
+
+DEFINE_PER_CPU(struct timerlat_variables, per_cpu_timerlat_var);
+
/**
- * osn_var_reset - Reset the values of the given osnoise_variables
+ * this_cpu_tmr_var - Return the timerlat_variables of the current CPU
+ */
+static inline struct timerlat_variables *this_cpu_tmr_var(void)
+{
+ return this_cpu_ptr(&per_cpu_timerlat_var);
+}
+
+/**
+ * tlat_var_reset - Reset the values of the per-cpu timerlat_variables
*/
-static inline void osn_var_reset(struct osnoise_variables *osn_var)
+static inline void tlat_var_reset(void)
{
+ struct timerlat_variables *tlat_var;
+ int cpu;
/*
* So far, all the values are initialized as 0, so
* zeroing the structure is perfect.
*/
- memset(osn_var, 0, sizeof(struct osnoise_variables));
+ for_each_cpu(cpu, cpu_online_mask) {
+ tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
+ memset(tlat_var, 0, sizeof(struct timerlat_variables));
+ }
}
+#else /* CONFIG_TIMERLAT_TRACER */
+#define tlat_var_reset() do {} while (0)
+#endif /* CONFIG_TIMERLAT_TRACER */

/**
- * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
+ * osn_var_reset - Reset the values of the per-cpu osnoise_variables
*/
-static inline void osn_var_reset_all(void)
+static inline void osn_var_reset(void)
{
struct osnoise_variables *osn_var;
int cpu;

+ /*
+ * So far, all the values are initialized as 0, so
+ * zeroing the structure is perfect.
+ */
for_each_cpu(cpu, cpu_online_mask) {
osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
- osn_var_reset(osn_var);
+ memset(osn_var, 0, sizeof(struct osnoise_variables));
}
}

+/**
+ * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
+ */
+static inline void osn_var_reset_all(void)
+{
+ osn_var_reset();
+ tlat_var_reset();
+}
+
/*
* Tells NMIs to call back to the osnoise tracer to record timestamps.
*/
@@ -154,6 +205,18 @@ struct osnoise_sample {
int thread_count; /* # Threads during this sample */
};

+#ifdef CONFIG_TIMERLAT_TRACER
+/*
+ * timerlat sample structure definition. Used to store the statistics of
+ * a sample run.
+ */
+struct timerlat_sample {
+ u64 seqnum; /* unique sequence */
+ u64 timer_latency; /* timer_latency */
+ int context; /* timer context */
+};
+#endif
+
/*
* Protect the interface.
*/
@@ -165,13 +228,23 @@ struct mutex interface_lock;
static struct osnoise_data {
u64 sample_period; /* total sampling period */
u64 sample_runtime; /* active sampling portion of period */
- u64 stop_tracing_in; /* stop trace in the inside operation (loop) */
- u64 stop_tracing_out; /* stop trace in the outside operation (report) */
+ u64 stop_tracing_in; /* stop trace in the inside operation (loop/irq) */
+ u64 stop_tracing_out; /* stop trace in the outside operation (report/thread) */
+#ifdef CONFIG_TIMERLAT_TRACER
+ u64 timerlat_period; /* timerlat period */
+ u64 print_stack; /* print IRQ stack if total > */
+ int timerlat_tracer; /* timerlat tracer */
+#endif
} osnoise_data = {
.sample_period = DEFAULT_SAMPLE_PERIOD,
.sample_runtime = DEFAULT_SAMPLE_RUNTIME,
.stop_tracing_in = 0,
.stop_tracing_out = 0,
+#ifdef CONFIG_TIMERLAT_TRACER
+ .print_stack = 0,
+ .timerlat_period = DEFAULT_TIMERLAT_PERIOD,
+ .timerlat_tracer = 0,
+#endif
};

/*
@@ -232,6 +305,125 @@ static void trace_osnoise_sample(struct osnoise_sample *sample)
trace_buffer_unlock_commit_nostack(buffer, event);
}

+#ifdef CONFIG_TIMERLAT_TRACER
+/*
+ * Print the timerlat header info.
+ */
+static void print_timerlat_headers(struct seq_file *s)
+{
+ seq_puts(s, "# _-----=> irqs-off\n");
+ seq_puts(s, "# / _----=> need-resched\n");
+ seq_puts(s, "# | / _---=> hardirq/softirq\n");
+ seq_puts(s, "# || / _--=> preempt-depth\n");
+ seq_puts(s, "# || /\n");
+ seq_puts(s, "# |||| ACTIVATION\n");
+ seq_puts(s, "# TASK-PID CPU# |||| TIMESTAMP ID ");
+ seq_puts(s, " CONTEXT LATENCY\n");
+ seq_puts(s, "# | | | |||| | | ");
+ seq_puts(s, " | |\n");
+}
+
+/*
+ * Record a timerlat_sample into the tracer buffer.
+ */
+static void trace_timerlat_sample(struct timerlat_sample *sample)
+{
+ struct trace_array *tr = osnoise_trace;
+ struct trace_event_call *call = &event_osnoise;
+ struct trace_buffer *buffer = tr->array_buffer.buffer;
+ struct ring_buffer_event *event;
+ struct timerlat_entry *entry;
+
+ event = trace_buffer_lock_reserve(buffer, TRACE_TIMERLAT, sizeof(*entry),
+ tracing_gen_ctx());
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ entry->seqnum = sample->seqnum;
+ entry->context = sample->context;
+ entry->timer_latency = sample->timer_latency;
+
+ if (!call_filter_check_discard(call, entry, buffer, event))
+ trace_buffer_unlock_commit_nostack(buffer, event);
+}
+
+#ifdef CONFIG_STACKTRACE
+/*
+ * Stack trace will take place only at IRQ level, so, no need
+ * to control nesting here.
+ */
+struct trace_stack {
+ int stack_size;
+ int nr_entries;
+ unsigned long calls[PAGE_SIZE];
+};
+
+static DEFINE_PER_CPU(struct trace_stack, trace_stack);
+
+/**
+ * timerlat_save_stack - save a stack trace without printing
+ *
+ * Save the current stack trace without printing. The
+ * stack will be printed later, after the end of the measurement.
+ */
+static void timerlat_save_stack(int skip)
+{
+ unsigned int size, nr_entries;
+ struct trace_stack *fstack;
+
+ fstack = this_cpu_ptr(&trace_stack);
+
+ size = ARRAY_SIZE(fstack->calls);
+
+ nr_entries = stack_trace_save(fstack->calls, size, skip);
+
+ fstack->stack_size = nr_entries * sizeof(unsigned long);
+ fstack->nr_entries = nr_entries;
+
+ return;
+}
+
+/**
+ * timerlat_dump_stack - dump a stack trace previously saved
+ *
+ * Dump a saved stack trace into the trace buffer.
+ */
+static void timerlat_dump_stack(void)
+{
+ struct trace_event_call *call = &event_osnoise;
+ struct trace_array *tr = osnoise_trace;
+ struct trace_buffer *buffer = tr->array_buffer.buffer;
+ struct ring_buffer_event *event;
+ struct trace_stack *fstack;
+ struct stack_entry *entry;
+ unsigned int size;
+
+ preempt_disable_notrace();
+ fstack = this_cpu_ptr(&trace_stack);
+ size = fstack->stack_size;
+
+ event = trace_buffer_lock_reserve(buffer, TRACE_STACK, sizeof(*entry) + size,
+ tracing_gen_ctx());
+ if (!event)
+ goto out;
+
+ entry = ring_buffer_event_data(event);
+
+ memcpy(&entry->caller, fstack->calls, size);
+ entry->size = fstack->nr_entries;
+
+ if (!call_filter_check_discard(call, entry, buffer, event))
+ trace_buffer_unlock_commit_nostack(buffer, event);
+
+out:
+ preempt_enable_notrace();
+}
+#else
+#define timerlat_dump_stack() do {} while (0)
+#define timerlat_save_stack(a) do {} while (0)
+#endif /* CONFIG_STACKTRACE */
+#endif /* CONFIG_TIMERLAT_TRACER */
+
/**
* Macros to encapsulate the time capturing infrastructure.
*/
@@ -373,6 +565,30 @@ set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
return int_counter;
}

+#ifdef CONFIG_TIMERLAT_TRACER
+/**
+ * copy_int_safe_time - Copy *src into *dst, aware of interference
+ */
+static u64
+copy_int_safe_time(struct osnoise_variables *osn_var, u64 *dst, u64 *src)
+{
+ u64 int_counter;
+
+ do {
+ int_counter = local_read(&osn_var->int_counter);
+ /* synchronize with interrupts */
+ barrier();
+
+ *dst = *src;
+
+ /* synchronize with interrupts */
+ barrier();
+ } while (int_counter != local_read(&osn_var->int_counter));
+
+ return int_counter;
+}
+#endif /* CONFIG_TIMERLAT_TRACER */
+
/**
* trace_osnoise_callback - NMI entry/exit callback
*
@@ -801,6 +1017,22 @@ void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
if (!osn_var->sampling)
return;

+#ifdef CONFIG_TIMERLAT_TRACER
+ /*
+ * If the timerlat tracer is enabled, but the timer IRQ handler
+ * has not yet run and set tracing_thread, do not trace.
+ */
+ if (unlikely(osnoise_data.timerlat_tracer)) {
+ struct timerlat_variables *tlat_var;
+ tlat_var = this_cpu_tmr_var();
+ if (!tlat_var->tracing_thread) {
+ osn_var->softirq.arrival_time = 0;
+ osn_var->softirq.delta_start = 0;
+ return;
+ }
+ }
+#endif
+
duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
cond_move_thread_delta_start(osn_var, duration);
@@ -893,6 +1125,18 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
if (!osn_var->sampling)
return;

+#ifdef CONFIG_TIMERLAT_TRACER
+ if (osnoise_data.timerlat_tracer) {
+ struct timerlat_variables *tlat_var;
+ tlat_var = this_cpu_tmr_var();
+ if (!tlat_var->tracing_thread) {
+ osn_var->thread.delta_start = 0;
+ osn_var->thread.arrival_time = 0;
+ return;
+ }
+ }
+#endif
+
duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);

trace_thread_noise(t, osn_var->thread.arrival_time, duration);
@@ -1182,6 +1426,197 @@ static int osnoise_main(void *data)
return 0;
}

+#ifdef CONFIG_TIMERLAT_TRACER
+/**
+ * timerlat_irq - hrtimer handler for timerlat.
+ */
+static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ struct trace_array *tr = osnoise_trace;
+ struct timerlat_variables *tlat;
+ struct timerlat_sample s;
+ u64 now;
+ u64 diff;
+
+ /*
+ * The timer may have been armed for another CPU. So, get the
+ * timerlat struct from the timer itself, not from this CPU.
+ */
+ tlat = container_of(timer, struct timerlat_variables, timer);
+
+ now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
+
+ /*
+ * Enable the osnoise: events for thread and softirq.
+ */
+ tlat->tracing_thread = true;
+
+ osn_var->thread.arrival_time = time_get();
+
+ /*
+ * A hardirq is running: the timer IRQ. It is for sure preempting
+ * a thread, and potentially preempting a softirq.
+ *
+ * At this point, it is not interesting to know the duration of the
+ * preempted thread (and maybe softirq), but how much time they will
+ * delay the beginning of the execution of the timer thread.
+ *
+ * To get the correct (net) delay added by the softirq, its delta_start
+ * is set as the IRQ one. In this way, at the return of the IRQ, the delta
+ * start of the softirq will be zeroed, accounting then only the time
+ * after that.
+ *
+ * The thread follows the same principle. However, if a softirq is
+ * running, the thread needs to receive the softirq delta_start. The
+ * reason is that the softirq will be the last to be unfolded,
+ * resetting the thread delay to zero.
+ */
+#ifndef CONFIG_PREEMPT_RT
+ if (osn_var->softirq.delta_start) {
+ copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
+ &osn_var->softirq.delta_start);
+
+ copy_int_safe_time(osn_var, &osn_var->softirq.delta_start,
+ &osn_var->irq.delta_start);
+ } else {
+ copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
+ &osn_var->irq.delta_start);
+ }
+#else /* CONFIG_PREEMPT_RT */
+ /*
+ * The softirqs run as threads on RT, so there is no need
+ * to keep track of them.
+ */
+ copy_int_safe_time(osn_var, &osn_var->thread.delta_start, &osn_var->irq.delta_start);
+#endif /* CONFIG_PREEMPT_RT */
+
+ /*
+ * Compare the current time with the expected time.
+ */
+ diff = now - tlat->abs_period;
+
+ tlat->count++;
+ s.seqnum = tlat->count;
+ s.timer_latency = diff;
+ s.context = IRQ_CONTEXT;
+
+ trace_timerlat_sample(&s);
+
+ /* Keep a running maximum of the timer latency ever recorded */
+ if (diff > tr->max_latency) {
+ tr->max_latency = diff;
+ latency_fsnotify(tr);
+ }
+
+ if (osnoise_data.stop_tracing_in)
+ if (time_to_us(diff) >= osnoise_data.stop_tracing_in)
+ osnoise_stop_tracing();
+
+ wake_up_process(tlat->kthread);
+
+#ifdef CONFIG_STACKTRACE
+ if (osnoise_data.print_stack)
+ timerlat_save_stack(0);
+#endif
+
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * wait_next_period - Wait for the next period for timerlat
+ */
+static int wait_next_period(struct timerlat_variables *tlat)
+{
+ ktime_t next_abs_period, now;
+ u64 rel_period = osnoise_data.timerlat_period * 1000;
+
+ now = hrtimer_cb_get_time(&tlat->timer);
+ next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
+
+ /*
+ * Save the next abs_period.
+ */
+ tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
+
+ /*
+ * If the new abs_period is in the past, skip the activation.
+ */
+ while (ktime_compare(now, next_abs_period) > 0) {
+ next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
+ tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
+ }
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
+ schedule();
+ return 1;
+}
+
+/**
+ * timerlat_main - The timerlat tracer main loop
+ */
+static int timerlat_main(void *data)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ struct timerlat_variables *tlat = this_cpu_tmr_var();
+ struct timerlat_sample s;
+ struct sched_param sp;
+ u64 now, diff;
+
+ /*
+ * Make the thread RT, that is how cyclictest is usually used.
+ */
+ sp.sched_priority = DEFAULT_TIMERLAT_PRIO;
+ sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+
+ tlat->count = 0;
+ tlat->tracing_thread = false;
+
+ hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
+ tlat->timer.function = timerlat_irq;
+ tlat->kthread = current;
+ osn_var->pid = current->pid;
+ /*
+ * Save the current time as the start of the first period.
+ */
+ tlat->abs_period = hrtimer_cb_get_time(&tlat->timer);
+
+ wait_next_period(tlat);
+
+ osn_var->sampling = 1;
+
+ while (!kthread_should_stop()) {
+ now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
+ diff = now - tlat->abs_period;
+
+ s.seqnum = tlat->count;
+ s.timer_latency = diff;
+ s.context = THREAD_CONTEXT;
+
+ trace_timerlat_sample(&s);
+
+#ifdef CONFIG_STACKTRACE
+ if (osnoise_data.print_stack)
+ if (osnoise_data.print_stack <= time_to_us(diff))
+ timerlat_dump_stack();
+#endif /* CONFIG_STACKTRACE */
+
+ tlat->tracing_thread = false;
+ if (osnoise_data.stop_tracing_out)
+ if (time_to_us(diff) >= osnoise_data.stop_tracing_out)
+ osnoise_stop_tracing();
+
+ wait_next_period(tlat);
+ }
+
+ hrtimer_cancel(&tlat->timer);
+ return 0;
+}
+#endif /* CONFIG_TIMERLAT_TRACER */
+
/**
* stop_per_cpu_kthread - stop per-cpu threads
*
@@ -1212,6 +1647,7 @@ static int start_per_cpu_kthreads(struct trace_array *tr)
struct cpumask *current_mask = &save_cpumask;
struct task_struct *kthread;
char comm[24];
+ void *main = osnoise_main;
int cpu;

get_online_cpus();
@@ -1229,9 +1665,17 @@ static int start_per_cpu_kthreads(struct trace_array *tr)
per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;

for_each_cpu(cpu, current_mask) {
+#ifdef CONFIG_TIMERLAT_TRACER
+ if (osnoise_data.timerlat_tracer) {
+ snprintf(comm, 24, "timerlat/%d", cpu);
+ main = timerlat_main;
+ } else {
+ snprintf(comm, 24, "osnoise/%d", cpu);
+ }
+#else
snprintf(comm, 24, "osnoise/%d", cpu);
-
- kthread = kthread_create_on_cpu(osnoise_main, NULL, cpu, comm);
+#endif
+ kthread = kthread_create_on_cpu(main, NULL, cpu, comm);

if (IS_ERR(kthread)) {
pr_err(BANNER "could not start sampling thread\n");
@@ -1373,6 +1817,31 @@ static struct trace_ull_config osnoise_stop_total = {
.min = NULL,
};

+#ifdef CONFIG_TIMERLAT_TRACER
+/*
+ * osnoise/print_stack: print the stacktrace of the IRQ handler if the total
+ * latency is higher than val.
+ */
+static struct trace_ull_config osnoise_print_stack = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.print_stack,
+ .max = NULL,
+ .min = NULL,
+};
+
+/*
+ * osnoise/timerlat_period: min 100 us, max 1 s
+ */
+u64 timerlat_min_period = 100;
+u64 timerlat_max_period = 1000000;
+static struct trace_ull_config timerlat_period = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.timerlat_period,
+ .max = &timerlat_max_period,
+ .min = &timerlat_min_period,
+};
+#endif
+
static const struct file_operations cpus_fops = {
.open = tracing_open_generic,
.read = osnoise_cpus_read,
@@ -1383,10 +1852,9 @@ static const struct file_operations cpus_fops = {
/**
* init_tracefs - A function to initialize the tracefs interface files
*
- * This function creates entries in tracefs for "osnoise". It creates the
- * "osnoise" directory in the tracing directory, and within that
- * directory is the count, runtime and period files to change and view
- * those values.
+ * This function creates entries in tracefs for "osnoise" and "timerlat".
+ * It creates these directories in the tracing directory, and within that
+ * directory the user can change and view the configuration values.
*/
static int init_tracefs(void)
{
@@ -1400,7 +1868,7 @@ static int init_tracefs(void)

top_dir = tracefs_create_dir("osnoise", NULL);
if (!top_dir)
- return -ENOMEM;
+ return 0;

tmp = tracefs_create_file("period_us", 0640, top_dir,
&osnoise_period, &trace_ull_config_fops);
@@ -1425,6 +1893,19 @@ static int init_tracefs(void)
tmp = trace_create_file("cpus", 0644, top_dir, NULL, &cpus_fops);
if (!tmp)
goto err;
+#ifdef CONFIG_TIMERLAT_TRACER
+#ifdef CONFIG_STACKTRACE
+ tmp = tracefs_create_file("print_stack", 0640, top_dir,
+ &osnoise_print_stack, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+#endif
+
+ tmp = tracefs_create_file("timerlat_period_us", 0640, top_dir,
+ &timerlat_period, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+#endif

return 0;

@@ -1465,18 +1946,15 @@ static int osnoise_hook_events(void)
return -EINVAL;
}

-static void osnoise_tracer_start(struct trace_array *tr)
+static int __osnoise_tracer_start(struct trace_array *tr)
{
int retval;

- if (osnoise_busy)
- return;
-
osn_var_reset_all();

retval = osnoise_hook_events();
if (retval)
- goto out_err;
+ return retval;
/*
* Make sure NMIs see reseted values.
*/
@@ -1484,15 +1962,27 @@ static void osnoise_tracer_start(struct trace_array *tr)
trace_osnoise_callback_enabled = true;

retval = start_per_cpu_kthreads(tr);
- /*
- * all fine!
- */
- if (!retval)
+ if (retval) {
+ unhook_irq_events();
+ return retval;
+ }
+
+ osnoise_busy = true;
+
+ return 0;
+}
+
+static void osnoise_tracer_start(struct trace_array *tr)
+{
+ int retval;
+
+ if (osnoise_busy)
return;

-out_err:
- unhook_irq_events();
- pr_err(BANNER "Error starting osnoise tracer\n");
+ retval = __osnoise_tracer_start(tr);
+ if (retval)
+ pr_err(BANNER "Error starting osnoise tracer\n");
+
}

static void osnoise_tracer_stop(struct trace_array *tr)
@@ -1514,18 +2004,16 @@ static void osnoise_tracer_stop(struct trace_array *tr)

static int osnoise_tracer_init(struct trace_array *tr)
{
/* Only allow one instance to enable this */
if (osnoise_busy)
return -EBUSY;

osnoise_trace = tr;
-
tr->max_latency = 0;

osnoise_tracer_start(tr);

- osnoise_busy = true;
-
return 0;
}

@@ -1544,6 +2032,71 @@ static struct tracer osnoise_tracer __read_mostly = {
.allow_instances = true,
};

+#ifdef CONFIG_TIMERLAT_TRACER
+static void timerlat_tracer_start(struct trace_array *tr)
+{
+ int retval;
+
+ if (osnoise_busy)
+ return;
+
+ osnoise_data.timerlat_tracer = 1;
+
+ retval = __osnoise_tracer_start(tr);
+ if (retval)
+ goto out_err;
+
+ return;
+out_err:
+ pr_err(BANNER "Error starting timerlat tracer\n");
+}
+
+static void timerlat_tracer_stop(struct trace_array *tr)
+{
+ int cpu;
+
+ if (!osnoise_busy)
+ return;
+
+ for_each_online_cpu(cpu)
+ per_cpu(per_cpu_osnoise_var, cpu).sampling = 0;
+
+ osnoise_tracer_stop(tr);
+
+ osnoise_data.timerlat_tracer = 0;
+}
+
+static int timerlat_tracer_init(struct trace_array *tr)
+{
+ /* Only allow one instance to enable this */
+ if (osnoise_busy)
+ return -EBUSY;
+
+ osnoise_trace = tr;
+
+ tr->max_latency = 0;
+
+ timerlat_tracer_start(tr);
+
+ return 0;
+}
+
+static void timerlat_tracer_reset(struct trace_array *tr)
+{
+ timerlat_tracer_stop(tr);
+}
+
+static struct tracer timerlat_tracer __read_mostly = {
+ .name = "timerlat",
+ .init = timerlat_tracer_init,
+ .reset = timerlat_tracer_reset,
+ .start = timerlat_tracer_start,
+ .stop = timerlat_tracer_stop,
+ .print_header = print_timerlat_headers,
+ .allow_instances = true,
+};
+#endif /* CONFIG_TIMERLAT_TRACER */
+
__init static int init_osnoise_tracer(void)
{
int ret;
@@ -1553,8 +2106,18 @@ __init static int init_osnoise_tracer(void)
cpumask_copy(&osnoise_cpumask, cpu_all_mask);

ret = register_tracer(&osnoise_tracer);
- if (ret)
+ if (ret) {
+ pr_err(BANNER "Error registering osnoise!\n");
return ret;
+ }
+
+#ifdef CONFIG_TIMERLAT_TRACER
+ ret = register_tracer(&timerlat_tracer);
+ if (ret) {
+ pr_err(BANNER "Error registering timerlat\n");
+ return ret;
+ }
+#endif

init_tracefs();

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 642b6584eba5..a0bf446bb034 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1301,6 +1301,52 @@ static struct trace_event trace_osnoise_event = {
.funcs = &trace_osnoise_funcs,
};

+/* TRACE_TIMERLAT */
+static enum print_line_t
+trace_timerlat_print(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_seq *s = &iter->seq;
+ struct timerlat_entry *field;
+
+ trace_assign_type(field, entry);
+
+ trace_seq_printf(s, "#%-5u context %6s timer_latency %9llu ns\n",
+ field->seqnum,
+ field->context ? "thread" : "irq",
+ field->timer_latency);
+
+ return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_timerlat_raw(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct timerlat_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+
+ trace_seq_printf(s, "%u %d %llu\n",
+ field->seqnum,
+ field->context,
+ field->timer_latency);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_timerlat_funcs = {
+ .trace = trace_timerlat_print,
+ .raw = trace_timerlat_raw,
+};
+
+static struct trace_event trace_timerlat_event = {
+ .type = TRACE_TIMERLAT,
+ .funcs = &trace_timerlat_funcs,
+};
+
/* TRACE_BPUTS */
static enum print_line_t
trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1512,6 +1558,7 @@ static struct trace_event *events[] __initdata = {
&trace_print_event,
&trace_hwlat_event,
&trace_osnoise_event,
+ &trace_timerlat_event,
&trace_raw_data_event,
&trace_func_repeats_event,
NULL
--
2.26.3


Subject: [PATCH V3 6/9] trace/hwlat: Use the generic function to read/write width and window

Use the trace_ull_config generic implementation to reduce the code of
the hwlat detector.
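
For reference, this is the shape of the write-side contract the files
below now rely on (a sketch assuming the semantics of the generic
trace_ull_config patch earlier in this series; the function name here is
illustrative, not the real symbol):

	/*
	 * Update the value under the lock, rejecting values outside the
	 * optional [min, max] bounds. Pointing hwlat_width's .max at
	 * sample_window (and hwlat_window's .min at sample_width) is what
	 * keeps the "width < window" invariant.
	 */
	static int trace_ull_config_set(struct trace_ull_config *cnf, u64 val)
	{
		int err = 0;

		mutex_lock(cnf->lock);

		if ((cnf->min && val < *cnf->min) || (cnf->max && val > *cnf->max))
			err = -EINVAL;
		else
			*cnf->val = val;

		mutex_unlock(cnf->lock);

		return err;
	}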

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Willaims <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
kernel/trace/trace_hwlat.c | 135 ++++---------------------------------
1 file changed, 14 insertions(+), 121 deletions(-)

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index 84689fa14d9a..02247263d94a 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -504,115 +504,6 @@ static int start_per_cpu_kthreads(struct trace_array *tr)
return 0;
}

-/*
- * hwlat_read - Wrapper read function for reading both window and width
- * @filp: The active open file structure
- * @ubuf: The userspace provided buffer to read value into
- * @cnt: The maximum number of bytes to read
- * @ppos: The current "file" position
- *
- * This function provides a generic read implementation for the global state
- * "hwlat_data" structure filesystem entries.
- */
-static ssize_t hwlat_read(struct file *filp, char __user *ubuf,
- size_t cnt, loff_t *ppos)
-{
- char buf[U64STR_SIZE];
- u64 *entry = filp->private_data;
- u64 val;
- int len;
-
- if (!entry)
- return -EFAULT;
-
- if (cnt > sizeof(buf))
- cnt = sizeof(buf);
-
- val = *entry;
-
- len = snprintf(buf, sizeof(buf), "%llu\n", val);
-
- return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
-}
-
-/**
- * hwlat_width_write - Write function for "width" entry
- * @filp: The active open file structure
- * @ubuf: The user buffer that contains the value to write
- * @cnt: The maximum number of bytes to write to "file"
- * @ppos: The current position in @file
- *
- * This function provides a write implementation for the "width" interface
- * to the hardware latency detector. It can be used to configure
- * for how many us of the total window us we will actively sample for any
- * hardware-induced latency periods. Obviously, it is not possible to
- * sample constantly and have the system respond to a sample reader, or,
- * worse, without having the system appear to have gone out to lunch. It
- * is enforced that width is less that the total window size.
- */
-static ssize_t
-hwlat_width_write(struct file *filp, const char __user *ubuf,
- size_t cnt, loff_t *ppos)
-{
- u64 val;
- int err;
-
- err = kstrtoull_from_user(ubuf, cnt, 10, &val);
- if (err)
- return err;
-
- mutex_lock(&hwlat_data.lock);
- if (val < hwlat_data.sample_window)
- hwlat_data.sample_width = val;
- else
- err = -EINVAL;
- mutex_unlock(&hwlat_data.lock);
-
- if (err)
- return err;
-
- return cnt;
-}
-
-/**
- * hwlat_window_write - Write function for "window" entry
- * @filp: The active open file structure
- * @ubuf: The user buffer that contains the value to write
- * @cnt: The maximum number of bytes to write to "file"
- * @ppos: The current position in @file
- *
- * This function provides a write implementation for the "window" interface
- * to the hardware latency detector. The window is the total time
- * in us that will be considered one sample period. Conceptually, windows
- * occur back-to-back and contain a sample width period during which
- * actual sampling occurs. Can be used to write a new total window size. It
- * is enforced that any value written must be greater than the sample width
- * size, or an error results.
- */
-static ssize_t
-hwlat_window_write(struct file *filp, const char __user *ubuf,
- size_t cnt, loff_t *ppos)
-{
- u64 val;
- int err;
-
- err = kstrtoull_from_user(ubuf, cnt, 10, &val);
- if (err)
- return err;
-
- mutex_lock(&hwlat_data.lock);
- if (hwlat_data.sample_width < val)
- hwlat_data.sample_window = val;
- else
- err = -EINVAL;
- mutex_unlock(&hwlat_data.lock);
-
- if (err)
- return err;
-
- return cnt;
-}
-
static void *s_mode_start(struct seq_file *s, loff_t *pos)
{
int mode = *pos;
@@ -733,16 +624,18 @@ static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
return ret;
}

-static const struct file_operations width_fops = {
- .open = tracing_open_generic,
- .read = hwlat_read,
- .write = hwlat_width_write,
+static struct trace_ull_config hwlat_width = {
+ .lock = &hwlat_data.lock,
+ .val = &hwlat_data.sample_width,
+ .max = &hwlat_data.sample_window,
+ .min = NULL,
};

-static const struct file_operations window_fops = {
- .open = tracing_open_generic,
- .read = hwlat_read,
- .write = hwlat_window_write,
+static struct trace_ull_config hwlat_window = {
+ .lock = &hwlat_data.lock,
+ .val = &hwlat_data.sample_window,
+ .max = NULL,
+ .min = &hwlat_data.sample_width,
};

static const struct file_operations thread_mode_fops = {
@@ -775,15 +668,15 @@ static int init_tracefs(void)

hwlat_sample_window = tracefs_create_file("window", 0640,
top_dir,
- &hwlat_data.sample_window,
- &window_fops);
+ &hwlat_window,
+ &trace_ull_config_fops);
if (!hwlat_sample_window)
goto err;

hwlat_sample_width = tracefs_create_file("width", 0644,
top_dir,
- &hwlat_data.sample_width,
- &width_fops);
+ &hwlat_width,
+ &trace_ull_config_fops);
if (!hwlat_sample_width)
goto err;

--
2.26.3


Subject: [PATCH V3 7/9] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers

From: Steven Rostedt <[email protected]>

To have nanosecond output displayed in a more human readable format, it is
nicer to convert it to a seconds format (XXX.YYYYYYYYY). The problem is that
to do so, the numbers must be divided by NSEC_PER_SEC, and the remainder
taken as well. But as these numbers are 64 bit, this cannot be done simply
with the '/' and '%' operators, but must use do_div() instead.

Instead of performing the expensive do_div() in the hot path of the
tracepoint, it is more efficient to perform it during the output phase. But
passing in do_div() can confuse the parser, and do_div() doesn't work
exactly like a normal C function. It modifies the number in place, and we
don't want to modify the actual values in the ring buffer.
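
A minimal illustration of that in-place behavior (the variable names are
just for the example):

	u64 ts = 1234567890123ULL;		/* a timestamp in ns */
	u32 rem = do_div(ts, NSEC_PER_SEC);	/* do_div() modifies ts */
	/* now ts == 1234 (seconds) and rem == 567890123 (nanoseconds) */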

Two helper functions are now created:

__print_ns_to_secs() and __print_ns_without_secs()

They both take a value of nanoseconds, and the former will return that
number divided by NSEC_PER_SEC, and the latter will mod it with NSEC_PER_SEC
giving a way to print a nice human readable format:

__print_fmt("time=%llu.%09u",
__print_ns_to_secs(REC->nsec_val),
__print_ns_without_secs(REC->nsec_val))
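
On the kernel side, the helpers expand to the do_div() based blocks; in
the print fmt string exported to user space, they are redefined to plain
'/ 1000000000UL' and '% 1000000000UL' expressions, so userspace parsers
see only ordinary C operators.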

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Willaims <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
include/trace/trace_events.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 8268bf747d6f..248090415b97 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -358,6 +358,21 @@ TRACE_MAKE_SYSTEM_STR();
trace_print_hex_dump_seq(p, prefix_str, prefix_type, \
rowsize, groupsize, buf, len, ascii)

+#undef __print_ns_to_secs
+#define __print_ns_to_secs(value) \
+ ({ \
+ u64 ____val = (u64)value; \
+ do_div(____val, NSEC_PER_SEC); \
+ ____val; \
+ })
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(value) \
+ ({ \
+ u64 ____val = (u64)value; \
+ (u32) do_div(____val, NSEC_PER_SEC); \
+ })
+
#undef DECLARE_EVENT_CLASS
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
static notrace enum print_line_t \
@@ -736,6 +751,16 @@ static inline void ftrace_test_probe_##call(void) \
#undef __print_array
#undef __print_hex_dump

+/*
+ * The below is not executed in the kernel. It is only what is
+ * displayed in the print format for userspace to parse.
+ */
+#undef __print_ns_to_secs
+#define __print_ns_to_secs(val) val / 1000000000UL
+
+#undef __print_ns_without_secs
+#define __print_ns_without_secs(val) val % 1000000000UL
+
#undef TP_printk
#define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)

--
2.26.3


Subject: [PATCH V3 8/9] tracing: Add osnoise tracer

In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.

The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of *osnoise* to occur during its execution. Using the same approach
of hwlat, osnoise takes note of the entry and exit point of any
source of interference, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.

Usage

Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).

For example::

[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer

It is possible to follow the trace by reading the trace file::

[root@f32 tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
<...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
<...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
<...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
<...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
<...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
<...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
<...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19

In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:

- The RUNTIME IN US field reports the amount of time in microseconds that
the osnoise thread kept looping reading the time.
- The NOISE IN US reports the sum of noise in microseconds observed
by the osnoise tracer during the associated runtime.
- The % OF CPU AVAILABLE reports the percentage of CPU available for
the osnoise thread during the runtime window.
- The MAX SINGLE NOISE IN US reports the maximum single noise observed
during the runtime window.
- The Interference counters display how many times each of the respective
interferences happened during the runtime window.

Note that the example above shows a high number of HW noise samples.
The reason is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.

Tracer options

The tracer has a set of options inside the osnoise directory; they are:

- osnoise/cpus: CPUs at which an osnoise thread will execute.
- osnoise/period_us: the period of the osnoise thread.
- osnoise/runtime_us: how long an osnoise thread will look for noise.
- osnoise/stop_tracing_in_us: stop the system tracing if a single noise
higher than the configured value happens. Writing 0 disables this
option.
- osnoise/stop_tracing_out_us: stop the system tracing if total noise
higher than the configured value happens. Writing 0 disables this
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the minimum valid value
will be used, which is currently 1 us.

Additional Tracing

In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.

- osnoise:sample_threshold: printed anytime a noise is higher than
the configurable tolerance_ns.
- osnoise:nmi_noise: noise from NMI, including the duration.
- osnoise:irq_noise: noise from an IRQ, including the duration.
- osnoise:softirq_noise: noise from a SoftIRQ, including the
duration.
- osnoise:thread_noise: noise from a thread, including the duration.

Note that all the values are *net values*. For example, if, while osnoise
is running, another thread preempts the osnoise thread, a thread_noise
window starts at the preemption. Then, an IRQ takes place, preempting
the thread_noise and starting an irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.

Here is one example of the usage of these tracepoints::

osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2

In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.

It is worth noticing that the sum of the durations reported by the
tracepoints is smaller than the eight us reported by the sample_threshold.
The difference comes from the overhead of the entry and exit code that runs
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.

Cc: Jonathan Corbet <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexandre Chartre <[email protected]>
Cc: Clark Willaims <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
---
Documentation/trace/index.rst | 1 +
Documentation/trace/osnoise-tracer.rst | 152 +++
include/linux/ftrace_irq.h | 13 +
include/trace/events/osnoise.h | 142 +++
kernel/trace/Kconfig | 34 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.h | 9 +-
kernel/trace/trace_entries.h | 25 +
kernel/trace/trace_osnoise.c | 1563 ++++++++++++++++++++++++
kernel/trace/trace_output.c | 72 +-
10 files changed, 2008 insertions(+), 4 deletions(-)
create mode 100644 Documentation/trace/osnoise-tracer.rst
create mode 100644 include/trace/events/osnoise.h
create mode 100644 kernel/trace/trace_osnoise.c

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index f634b36fd3aa..608107b27cc0 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -23,6 +23,7 @@ Linux Tracing Technologies
histogram-design
boottime-trace
hwlat_detector
+ osnoise-tracer
intel_th
ring-buffer-design
stm
diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
new file mode 100644
index 000000000000..acf4b4d5ddfd
--- /dev/null
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -0,0 +1,152 @@
+==============
+OSNOISE Tracer
+==============
+
+In the context of high-performance computing (HPC), the Operating System
+Noise (*osnoise*) refers to the interference experienced by an application
+due to activities inside the operating system. In the context of Linux,
+NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
+system. Moreover, hardware-related jobs can also cause noise, for example,
+via SMIs.
+
+hwlat_detector is one of the tools used to identify the most complex
+source of noise: *hardware noise*.
+
+In a nutshell, the hwlat_detector creates a thread that runs
+periodically for a given period. At the beginning of a period, the thread
+disables interrupts and starts sampling. While running, the hwlatd
+thread reads the time in a loop. As interrupts are disabled, threads,
+IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
+cause of any gap between two different reads of the time is either an
+NMI or the hardware itself. At the end of the period, hwlatd enables
+interrupts and reports the max observed gap between the reads. It also
+prints an NMI occurrence counter. If the output does not report NMI
+executions, the user can conclude that the hardware is the culprit for
+the latency. hwlat detects NMI execution by observing
+the entry and exit of each NMI.
+
+The osnoise tracer leverages the hwlat_detector by running a
+similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
+all the sources of *osnoise* to occur during its execution. Using the same approach
+of hwlat, osnoise takes note of the entry and exit point of any
+source of interference, increasing a per-cpu interference counter. The
+osnoise tracer also saves an interference counter for each source of
+interference. The interference counter for NMI, IRQs, SoftIRQs, and
+threads is increased anytime the tool observes these interferences' entry
+events. When a noise happens without any interference from the operating
+system level, the hardware noise counter increases, pointing to a
+hardware-related noise. In this way, osnoise can account for any
+source of interference. At the end of the period, the osnoise tracer
+prints the sum of all noise, the max single noise, the percentage of CPU
+available for the thread, and the counters for the noise sources.
+
+Usage
+-----
+
+Write the ASCII text "osnoise" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo osnoise > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: osnoise
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth MAX
+ # || / SINGLE Interference counters:
+ # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
+ # | | | |||| | | | | | | | | | |
+ <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
+ <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
+ <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
+ <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
+ <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
+ <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
+ <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
+ <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19
+
+In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
+tracer prints a message at the end of each period for each CPU that is
+running an osnoise/CPU thread. The osnoise specific fields report:
+
+ - The RUNTIME IN US field reports the amount of time in microseconds that
+ the osnoise thread kept looping reading the time.
+ - The NOISE IN US reports the sum of noise in microseconds observed
+ by the osnoise tracer during the associated runtime.
+ - The % OF CPU AVAILABLE reports the percentage of CPU available for
+ the osnoise thread during the runtime window.
+ - The MAX SINGLE NOISE IN US reports the maximum single noise observed
+ during the runtime window.
+ - The Interference counters display how many times each of the respective
+ interferences happened during the runtime window.
+
+Note that the example above shows a high number of HW noise samples.
+The reason is that this sample was taken on a virtual machine,
+and the host interference is detected as a hardware interference.
+
+Tracer options
+---------------------
+
+The tracer has a set of options inside the osnoise directory; they are
+(a usage example follows the list):
+
+ - osnoise/cpus: CPUs at which an osnoise thread will execute.
+ - osnoise/period_us: the period of the osnoise thread.
+ - osnoise/runtime_us: how long an osnoise thread will look for noise.
+ - osnoise/stop_tracing_in_us: stop the system tracing if a single noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - osnoise/stop_tracing_out_us: stop the system tracing if total noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - tracing_threshold: the minimum delta between two time() reads to be
+ considered as noise, in us. When set to 0, the minimum valid value
+ will be used, which is currently 1 us.
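+
+For example, to stop the tracer as soon as a single noise occurrence
+longer than 50 us is detected::
+
+  [root@f32 tracing]# echo 50 > osnoise/stop_tracing_in_us
+  [root@f32 tracing]# echo osnoise > current_tracer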
+
+Additional Tracing
+------------------
+
+In addition to the tracer, a set of tracepoints were added to
+facilitate the identification of the osnoise source.
+
+ - osnoise:sample_threshold: printed anytime a noise is higher than
+ the configurable tolerance_ns.
+ - osnoise:nmi_noise: noise from NMI, including the duration.
+ - osnoise:irq_noise: noise from an IRQ, including the duration.
+ - osnoise:softirq_noise: noise from a SoftIRQ, including the
+ duration.
+ - osnoise:thread_noise: noise from a thread, including the duration.
+
+Note that all the values are *net values*. For example, if, while osnoise
+is running, another thread preempts the osnoise thread, a thread_noise
+window starts at the preemption. Then, an IRQ takes place, preempting
+the thread_noise and starting an irq_noise. When the IRQ ends its execution,
+it will compute its duration, and this duration will be subtracted from
+the thread_noise, in such a way as to avoid the double accounting of the
+IRQ execution. This logic is valid for all sources of noise.
+
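+These events can be enabled like any other trace event; assuming the
+usual tracefs layout (the event system is named osnoise)::
+
+  [root@f32 tracing]# echo osnoise > current_tracer
+  [root@f32 tracing]# echo 1 > events/osnoise/enable
+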
+Here is one example of the usage of these tracepoints::
+
+ osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
+ osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
+ migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
+ osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
+
+In this example, a noise sample of 8 microseconds was reported in the last
+line, pointing to two interferences. Looking backward in the trace, the
+two previous entries were about the migration thread running after a
+timer IRQ execution. The first event is not part of the noise because
+it took place one millisecond before.
+
+It is worth noticing that the sum of the durations reported by the
+tracepoints is smaller than the eight us reported by the sample_threshold.
+The difference comes from the overhead of the entry and exit code that runs
+before and after any interference execution. This justifies the dual
+approach: measuring thread and tracing.
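+
+As a worked example with the numbers above: the two interferences sum to
+2848 + 3068 = 5916 ns, while the sample_threshold event reports 8723 ns,
+leaving roughly 2.8 us attributable to that entry/exit overhead.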
diff --git a/include/linux/ftrace_irq.h b/include/linux/ftrace_irq.h
index 0abd9a1d2852..f6faa31289ba 100644
--- a/include/linux/ftrace_irq.h
+++ b/include/linux/ftrace_irq.h
@@ -7,12 +7,21 @@ extern bool trace_hwlat_callback_enabled;
extern void trace_hwlat_callback(bool enter);
#endif

+#ifdef CONFIG_OSNOISE_TRACER
+extern bool trace_osnoise_callback_enabled;
+extern void trace_osnoise_callback(bool enter);
+#endif
+
static inline void ftrace_nmi_enter(void)
{
#ifdef CONFIG_HWLAT_TRACER
if (trace_hwlat_callback_enabled)
trace_hwlat_callback(true);
#endif
+#ifdef CONFIG_OSNOISE_TRACER
+ if (trace_osnoise_callback_enabled)
+ trace_osnoise_callback(true);
+#endif
}

static inline void ftrace_nmi_exit(void)
@@ -21,6 +30,10 @@ static inline void ftrace_nmi_exit(void)
if (trace_hwlat_callback_enabled)
trace_hwlat_callback(false);
#endif
+#ifdef CONFIG_OSNOISE_TRACER
+ if (trace_osnoise_callback_enabled)
+ trace_osnoise_callback(false);
+#endif
}

#endif /* _LINUX_FTRACE_IRQ_H */
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
new file mode 100644
index 000000000000..28762c69f6c9
--- /dev/null
+++ b/include/trace/events/osnoise.h
@@ -0,0 +1,142 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM osnoise
+
+#if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _OSNOISE_TRACE_H
+
+#include <linux/tracepoint.h>
+TRACE_EVENT(thread_noise,
+
+ TP_PROTO(struct task_struct *t, u64 start, u64 duration),
+
+ TP_ARGS(t, start, duration),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN)
+ __field( u64, start )
+ __field( u64, duration)
+ __field( pid_t, pid )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+ __entry->pid = t->pid;
+ __entry->start = start;
+ __entry->duration = duration;
+ ),
+
+ TP_printk("%8s:%d start %llu.%09u duration %llu ns",
+ __entry->comm,
+ __entry->pid,
+ __print_ns_to_secs(__entry->start),
+ __print_ns_without_secs(__entry->start),
+ __entry->duration)
+);
+
+TRACE_EVENT(softirq_noise,
+
+ TP_PROTO(int vector, u64 start, u64 duration),
+
+ TP_ARGS(vector, start, duration),
+
+ TP_STRUCT__entry(
+ __field( u64, start )
+ __field( u64, duration)
+ __field( int, vector )
+ ),
+
+ TP_fast_assign(
+ __entry->vector = vector;
+ __entry->start = start;
+ __entry->duration = duration;
+ ),
+
+ TP_printk("%8s:%d start %llu.%09u duration %llu ns",
+ show_softirq_name(__entry->vector),
+ __entry->vector,
+ __print_ns_to_secs(__entry->start),
+ __print_ns_without_secs(__entry->start),
+ __entry->duration)
+);
+
+TRACE_EVENT(irq_noise,
+
+ TP_PROTO(int vector, const char *desc, u64 start, u64 duration),
+
+ TP_ARGS(vector, desc, start, duration),
+
+ TP_STRUCT__entry(
+ __field( u64, start )
+ __field( u64, duration)
+ __string( desc, desc )
+ __field( int, vector )
+
+ ),
+
+ TP_fast_assign(
+ __assign_str(desc, desc);
+ __entry->vector = vector;
+ __entry->start = start;
+ __entry->duration = duration;
+ ),
+
+ TP_printk("%s:%d start %llu.%09u duration %llu ns",
+ __get_str(desc),
+ __entry->vector,
+ __print_ns_to_secs(__entry->start),
+ __print_ns_without_secs(__entry->start),
+ __entry->duration)
+);
+
+TRACE_EVENT(nmi_noise,
+
+ TP_PROTO(u64 start, u64 duration),
+
+ TP_ARGS(start, duration),
+
+ TP_STRUCT__entry(
+ __field( u64, start )
+ __field( u64, duration)
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ __entry->duration = duration;
+ ),
+
+ TP_printk("start %llu.%09u duration %llu ns",
+ __print_ns_to_secs(__entry->start),
+ __print_ns_without_secs(__entry->start),
+ __entry->duration)
+);
+
+TRACE_EVENT(sample_threshold,
+
+ TP_PROTO(u64 start, u64 duration, u64 interference),
+
+ TP_ARGS(start, duration, interference),
+
+ TP_STRUCT__entry(
+ __field( u64, start )
+ __field( u64, duration)
+ __field( u64, interference)
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ __entry->duration = duration;
+ __entry->interference = interference;
+ ),
+
+ TP_printk("start %llu.%09u duration %llu ns interferences %llu",
+ __print_ns_to_secs(__entry->start),
+ __print_ns_without_secs(__entry->start),
+ __entry->duration,
+ __entry->interference)
+);
+
+#endif /* _OSNOISE_TRACE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 7fa82778c3e6..41582ae4682b 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -356,6 +356,40 @@ config HWLAT_TRACER
file. Every time a latency is greater than tracing_thresh, it will
be recorded into the ring buffer.

+config OSNOISE_TRACER
+ bool "OS Noise tracer"
+ select GENERIC_TRACER
+ help
+ In the context of high-performance computing (HPC), the Operating
+ System Noise (osnoise) refers to the interference experienced by an
+ application due to activities inside the operating system. In the
+ context of Linux, NMIs, IRQs, SoftIRQs, and any other system thread
+ can cause noise to the system. Moreover, hardware-related jobs can
+ also cause noise, for example, via SMIs.
+
+ The osnoise tracer leverages the hwlat_detector by running a similar
+ loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
+ the sources of osnoise to occur during its execution. The osnoise tracer takes
+ note of the entry and exit point of any source of interference,
+ increasing a per-cpu interference counter. It saves an interference
+ counter for each source of interference. The interference counter for
+ NMI, IRQs, SoftIRQs, and threads is increased anytime the tool
+ observes these interferences' entry events. When a noise happens
+ without any interference from the operating system level, the
+ hardware noise counter increases, pointing to a hardware-related
+ noise. In this way, osnoise can account for any source of
+ interference. At the end of the period, the osnoise tracer prints
+ the sum of all noise, the max single noise, the percentage of CPU
+ available for the thread, and the counters for the noise sources.
+
+ In addition to the tracer, a set of tracepoints were added to
+ facilitate the identification of the osnoise source.
+
+ The output will appear in the trace and trace_pipe files.
+
+ To enable this tracer, echo "osnoise" into the current_tracer
+ file.
+
config MMIOTRACE
bool "Memory mapped IO tracing"
depends on HAVE_MMIOTRACE_SUPPORT && PCI
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index b28d3e5013cd..b1c47ccf4f73 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o
obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o
obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o
obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
+obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 44fa25c1264a..754dfe8987a2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -44,6 +44,7 @@ enum trace_type {
TRACE_BLK,
TRACE_BPUTS,
TRACE_HWLAT,
+ TRACE_OSNOISE,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,

@@ -297,7 +298,8 @@ struct trace_array {
struct array_buffer max_buffer;
bool allocated_snapshot;
#endif
-#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)
+#if defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
+ || defined(CONFIG_OSNOISE_TRACER)
unsigned long max_latency;
#ifdef CONFIG_FSNOTIFY
struct dentry *d_max_latency;
@@ -445,6 +447,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT); \
IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \
IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
+ IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
TRACE_MMIO_RW); \
@@ -675,8 +678,8 @@ void update_max_tr_single(struct trace_array *tr,
struct task_struct *tsk, int cpu);
#endif /* CONFIG_TRACER_MAX_TRACE */

-#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER)) && \
- defined(CONFIG_FSNOTIFY)
+#if (defined(CONFIG_TRACER_MAX_TRACE) || defined(CONFIG_HWLAT_TRACER) \
+ || defined(CONFIG_OSNOISE_TRACER)) && defined(CONFIG_FSNOTIFY)

void latency_fsnotify(struct trace_array *tr);

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 251c819cf0c5..158c0984b59b 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -360,3 +360,28 @@ FTRACE_ENTRY(func_repeats, func_repeats_entry,
__entry->count,
FUNC_REPEATS_GET_DELTA_TS(__entry))
);
+
+FTRACE_ENTRY(osnoise, osnoise_entry,
+
+ TRACE_OSNOISE,
+
+ F_STRUCT(
+ __field( u64, noise )
+ __field( u64, runtime )
+ __field( u64, max_sample )
+ __field( unsigned int, hw_count )
+ __field( unsigned int, nmi_count )
+ __field( unsigned int, irq_count )
+ __field( unsigned int, softirq_count )
+ __field( unsigned int, thread_count )
+ ),
+
+ F_printk("noise:%llu\tmax_sample:%llu\thw:%u\tnmi:%u\tirq:%u\tsoftirq:%u\tthread:%u\n",
+ __entry->noise,
+ __entry->max_sample,
+ __entry->hw_count,
+ __entry->nmi_count,
+ __entry->irq_count,
+ __entry->softirq_count,
+ __entry->thread_count)
+);
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
new file mode 100644
index 000000000000..9bd40b514d84
--- /dev/null
+++ b/kernel/trace/trace_osnoise.c
@@ -0,0 +1,1563 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * OS Noise Tracer: computes the OS Noise suffered by a running thread.
+ *
+ * Based on "hwlat_detector" tracer by:
+ * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
+ * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <[email protected]>
+ * With feedback from Clark Williams <[email protected]>
+ *
+ * And also based on the rtsl tracer presented on:
+ * DE OLIVEIRA, Daniel Bristot, et al. Demystifying the real-time linux
+ * scheduling latency. In: 32nd Euromicro Conference on Real-Time Systems
+ * (ECRTS 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020.
+ *
+ * Copyright (C) 2021 Daniel Bristot de Oliveira, Red Hat, Inc. <[email protected]>
+ */
+
+#include <linux/kthread.h>
+#include <linux/tracefs.h>
+#include <linux/uaccess.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include <linux/sched/clock.h>
+#include <linux/sched.h>
+#include "trace.h"
+
+#ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/trace/irq_vectors.h>
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#endif /* CONFIG_X86_LOCAL_APIC */
+
+#include <trace/events/irq.h>
+#include <trace/events/sched.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/osnoise.h>
+
+static struct trace_array *osnoise_trace;
+
+/*
+ * Default values.
+ */
+#define BANNER "osnoise: "
+#define DEFAULT_SAMPLE_PERIOD 1000000 /* 1s */
+#define DEFAULT_SAMPLE_RUNTIME 1000000 /* 1s */
+
+/*
+ * NMI runtime info.
+ */
+struct nmi {
+ u64 count;
+ u64 delta_start;
+};
+
+/*
+ * IRQ runtime info.
+ */
+struct irq {
+ u64 count;
+ u64 arrival_time;
+ u64 delta_start;
+};
+
+/*
+ * SoftIRQ runtime info.
+ */
+struct softirq {
+ u64 count;
+ u64 arrival_time;
+ u64 delta_start;
+};
+
+/*
+ * Thread runtime info.
+ */
+struct thread {
+ u64 count;
+ u64 arrival_time;
+ u64 delta_start;
+};
+
+/*
+ * Runtime information: this structure saves the runtime information used by
+ * one sampling thread.
+ */
+struct osnoise_variables {
+ struct task_struct *kthread;
+ bool sampling;
+ pid_t pid;
+ struct nmi nmi;
+ struct irq irq;
+ struct softirq softirq;
+ struct thread thread;
+ local_t int_counter;
+};
+
+/*
+ * Per-cpu runtime information.
+ */
+DEFINE_PER_CPU(struct osnoise_variables, per_cpu_osnoise_var);
+
+/**
+ * this_cpu_osn_var - Return the per-cpu osnoise_variables on its relative CPU
+ */
+static inline struct osnoise_variables *this_cpu_osn_var(void)
+{
+ return this_cpu_ptr(&per_cpu_osnoise_var);
+}
+
+/**
+ * osn_var_reset - Reset the values of the given osnoise_variables
+ */
+static inline void osn_var_reset(struct osnoise_variables *osn_var)
+{
+ /*
+ * So far, all the values are initialized as 0, so
+ * zeroing the structure is perfect.
+ */
+ memset(osn_var, 0, sizeof(struct osnoise_variables));
+}
+
+/**
+ * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
+ */
+static inline void osn_var_reset_all(void)
+{
+ struct osnoise_variables *osn_var;
+ int cpu;
+
+ for_each_cpu(cpu, cpu_online_mask) {
+ osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
+ osn_var_reset(osn_var);
+ }
+}
+
+/*
+ * Tells NMIs to call back to the osnoise tracer to record timestamps.
+ */
+bool trace_osnoise_callback_enabled;
+
+/*
+ * osnoise sample structure definition. Used to store the statistics of a
+ * sample run.
+ */
+struct osnoise_sample {
+ u64 runtime; /* runtime */
+ u64 noise; /* noise */
+ u64 max_sample; /* max single noise sample */
+ int hw_count; /* # HW (incl. hypervisor) interference */
+ int nmi_count; /* # NMIs during this sample */
+ int irq_count; /* # IRQs during this sample */
+ int softirq_count; /* # SoftIRQs during this sample */
+ int thread_count; /* # Threads during this sample */
+};
+
+/*
+ * Protect the interface.
+ */
+struct mutex interface_lock;
+
+/*
+ * Tracer data.
+ */
+static struct osnoise_data {
+ u64 sample_period; /* total sampling period */
+ u64 sample_runtime; /* active sampling portion of period */
+ u64 stop_tracing_in; /* stop trace if a single noise is higher than this (loop) */
+ u64 stop_tracing_out; /* stop trace if the total noise is higher than this (report) */
+} osnoise_data = {
+ .sample_period = DEFAULT_SAMPLE_PERIOD,
+ .sample_runtime = DEFAULT_SAMPLE_RUNTIME,
+ .stop_tracing_in = 0,
+ .stop_tracing_out = 0,
+};
+
+/*
+ * Boolean variable used to inform that the tracer is currently sampling.
+ */
+static bool osnoise_busy;
+
+/*
+ * Print the osnoise header info.
+ */
+static void print_osnoise_headers(struct seq_file *s)
+{
+ seq_puts(s, "# _-----=> irqs-off\n");
+ seq_puts(s, "# / _----=> need-resched\n");
+ seq_puts(s, "# | / _---=> hardirq/softirq\n");
+ seq_puts(s, "# || / _--=> preempt-depth ");
+ seq_puts(s, " MAX\n");
+
+ seq_puts(s, "# || / ");
+ seq_puts(s, " SINGLE Interference counters:\n");
+
+ seq_puts(s, "# |||| RUNTIME ");
+ seq_puts(s, " NOISE %% OF CPU NOISE +-----------------------------+\n");
+
+ seq_puts(s, "# TASK-PID CPU# |||| TIMESTAMP IN US ");
+ seq_puts(s, " IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD\n");
+
+ seq_puts(s, "# | | | |||| | | ");
+ seq_puts(s, " | | | | | | | |\n");
+}
+
+/*
+ * Record an osnoise_sample into the tracer buffer.
+ */
+static void trace_osnoise_sample(struct osnoise_sample *sample)
+{
+ struct trace_array *tr = osnoise_trace;
+ struct trace_buffer *buffer = tr->array_buffer.buffer;
+ struct trace_event_call *call = &event_osnoise;
+ struct ring_buffer_event *event;
+ struct osnoise_entry *entry;
+
+ event = trace_buffer_lock_reserve(buffer, TRACE_OSNOISE, sizeof(*entry),
+ tracing_gen_ctx());
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ entry->runtime = sample->runtime;
+ entry->noise = sample->noise;
+ entry->max_sample = sample->max_sample;
+ entry->hw_count = sample->hw_count;
+ entry->nmi_count = sample->nmi_count;
+ entry->irq_count = sample->irq_count;
+ entry->softirq_count = sample->softirq_count;
+ entry->thread_count = sample->thread_count;
+
+ if (!call_filter_check_discard(call, entry, buffer, event))
+ trace_buffer_unlock_commit_nostack(buffer, event);
+}
+
+/**
+ * Macros to encapsulate the time capturing infrastructure.
+ */
+#define time_get() trace_clock_local()
+#define time_to_us(x) div_u64(x, 1000)
+#define time_sub(a, b) ((a) - (b))
+
+/**
+ * cond_move_irq_delta_start - Forward the delta_start of a running IRQ
+ *
+ * If an IRQ is preempted by an NMI, its delta_start is pushed forward
+ * to discount the NMI interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_irq_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+ if (osn_var->irq.delta_start)
+ osn_var->irq.delta_start += duration;
+}
+
+#ifndef CONFIG_PREEMPT_RT
+/**
+ * cond_move_softirq_delta_start - Forward the delta_start of a running SoftIRQ
+ *
+ * If a SoftIRQ is preempted by an IRQ or NMI, its delta_start is pushed
+ * forward to discount the interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_softirq_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+ if (osn_var->softirq.delta_start)
+ osn_var->softirq.delta_start += duration;
+}
+#else /* CONFIG_PREEMPT_RT */
+#define cond_move_softirq_delta_start(osn_var, duration) do {} while (0)
+#endif
+
+/**
+ * cond_move_thread_delta_start - Forward the delta_start of a running thread
+ *
+ * If a noisy thread is preempted by an Softirq, IRQ or NMI, its delta_start
+ * is pushed forward to discount the interference.
+ *
+ * See get_int_safe_duration().
+ */
+static inline void
+cond_move_thread_delta_start(struct osnoise_variables *osn_var, u64 duration)
+{
+ if (osn_var->thread.delta_start)
+ osn_var->thread.delta_start += duration;
+}
+
+/**
+ * get_int_safe_duration - Get the duration of a window
+ *
+ * The irq, softirq and thread variables need to have their durations without
+ * the interference from higher priority interrupts. Instead of keeping a
+ * variable to discount the interrupt interference from these variables, the
+ * starting time of these variables is pushed forward with the interrupt's
+ * duration. In this way, a single variable is used to:
+ *
+ * - Know if a given window is being measured.
+ * - Account its duration.
+ * - Discount the interference.
+ *
+ * To avoid getting inconsistent values, e.g.,:
+ *
+ * now = time_get()
+ * ---> interrupt!
+ * delta_start -= int duration;
+ * <---
+ * duration = now - delta_start;
+ *
+ * result: negative duration if the variable duration before the
+ * interrupt was smaller than the interrupt execution.
+ *
+ * A counter of interrupts is used. If the counter increased, try
+ * to capture an interference safe duration.
+ */
+static inline s64
+get_int_safe_duration(struct osnoise_variables *osn_var, u64 *delta_start)
+{
+ u64 int_counter, now;
+ s64 duration;
+
+ do {
+ int_counter = local_read(&osn_var->int_counter);
+ /* synchronize with interrupts */
+ barrier();
+
+ now = time_get();
+ duration = (now - *delta_start);
+
+ /* synchronize with interrupts */
+ barrier();
+ } while (int_counter != local_read(&osn_var->int_counter));
+
+ /*
+ * This is evidence of a race condition that causes
+ * a value to be "discounted" too much.
+ */
+ if (duration < 0)
+ pr_err("int safe negative!\n");
+
+ *delta_start = 0;
+
+ return duration;
+}
+
+/**
+ * set_int_safe_time - Save the current time on *time, aware of interference
+ *
+ * Get the time, taking into consideration a possible interference from
+ * higher priority interrupts.
+ *
+ * See get_int_safe_duration() for an explanation.
+ */
+static u64
+set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
+{
+ u64 int_counter;
+
+ do {
+ int_counter = local_read(&osn_var->int_counter);
+ /* synchronize with interrupts */
+ barrier();
+
+ *time = time_get();
+
+ /* synchronize with interrupts */
+ barrier();
+ } while (int_counter != local_read(&osn_var->int_counter));
+
+ return int_counter;
+}
+
+/**
+ * trace_osnoise_callback - NMI entry/exit callback
+ *
+ * This function is called at the NMI entry and exit code. The bool enter
+ * distinguishes between the two cases. It is used to note an NMI
+ * occurrence, compute the noise caused by the NMI, and to discount the
+ * noise it potentially causes from the other interference variables.
+ */
+void trace_osnoise_callback(bool enter)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ u64 duration;
+
+ if (!osn_var->sampling)
+ return;
+
+ /*
+ * Currently trace_clock_local() calls sched_clock() and the
+ * generic version is not NMI safe.
+ */
+ if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
+ if (enter) {
+ osn_var->nmi.delta_start = time_get();
+ local_inc(&osn_var->int_counter);
+ } else {
+ duration = time_get() - osn_var->nmi.delta_start;
+
+ trace_nmi_noise(osn_var->nmi.delta_start, duration);
+
+ cond_move_irq_delta_start(osn_var, duration);
+ cond_move_softirq_delta_start(osn_var, duration);
+ cond_move_thread_delta_start(osn_var, duration);
+ }
+ }
+
+ if (enter)
+ osn_var->nmi.count++;
+}
+
+/**
+ * __trace_irq_entry - Note the starting of an IRQ
+ *
+ * Save the starting time of an IRQ. As IRQs are non-preemptive to other IRQs,
+ * it is safe to use a single variable (osn_var->irq) to save the statistics.
+ * The arrival_time is used to report... the arrival time. The delta_start
+ * is used to compute the duration at the IRQ exit handler. See
+ * cond_move_irq_delta_start().
+ */
+static inline void __trace_irq_entry(int id)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+ if (!osn_var->sampling)
+ return;
+ /*
+ * This value will be used in the report, but not to compute
+ * the execution time, so it is safe to get it unsafe.
+ */
+ osn_var->irq.arrival_time = time_get();
+ set_int_safe_time(osn_var, &osn_var->irq.delta_start);
+ osn_var->irq.count++;
+
+ local_inc(&osn_var->int_counter);
+}
+
+/**
+ * __trace_irq_exit - Note the end of an IRQ, save data and trace
+ *
+ * Computes the duration of the IRQ noise and traces it. It also discounts the
+ * interference from other sources of noise that could currently be accounted.
+ */
+static inline void __trace_irq_exit(int id, const char *desc)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ int duration;
+
+ if (!osn_var->sampling)
+ return;
+
+ duration = get_int_safe_duration(osn_var, &osn_var->irq.delta_start);
+ trace_irq_noise(id, desc, osn_var->irq.arrival_time, duration);
+ osn_var->irq.arrival_time = 0;
+ cond_move_softirq_delta_start(osn_var, duration);
+ cond_move_thread_delta_start(osn_var, duration);
+}
+
+/**
+ * trace_irqentry_callback - Callback to the irq:irq_entry traceevent
+ *
+ * Used to note the starting of an IRQ occurrence.
+ */
+void trace_irqentry_callback(void *data, int irq, struct irqaction *action)
+{
+ __trace_irq_entry(irq);
+}
+
+/**
+ * trace_irqexit_callback - Callback to the irq:irq_exit traceevent
+ *
+ * Used to note the end of an IRQ occurrence.
+ */
+void trace_irqexit_callback(void *data, int irq, struct irqaction *action, int ret)
+{
+ __trace_irq_exit(irq, action->name);
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+/**
+ * trace_intel_irq_entry - Record Intel-specific IRQ entry
+ */
+void trace_intel_irq_entry(void *data, int vector)
+{
+ __trace_irq_entry(vector);
+}
+
+/**
+ * trace_intel_irq_exit - Record Intel-specific IRQ exit
+ */
+void trace_intel_irq_exit(void *data, int vector)
+{
+ char *vector_desc = (char *) data;
+
+ __trace_irq_exit(vector, vector_desc);
+}
+
+/**
+ * register_intel_irq_tp - Register Intel-specific IRQ entry/exit tracepoints
+ */
+static int register_intel_irq_tp(void)
+{
+ int ret;
+
+ ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_err;
+
+ ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+ if (ret)
+ goto out_timer_entry;
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_timer_exit;
+
+ ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+ if (ret)
+ goto out_thermal_entry;
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+#ifdef CONFIG_X86_MCE_AMD
+ ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_thermal_exit;
+
+ ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+ if (ret)
+ goto out_deferred_entry;
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_deferred_exit;
+
+ ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+ if (ret)
+ goto out_threshold_entry;
+#endif /* CONFIG_X86_MCE_THRESHOLD */
+
+#ifdef CONFIG_SMP
+ ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_threshold_exit;
+
+ ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
+ "call_function_single");
+ if (ret)
+ goto out_call_function_single_entry;
+
+ ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_call_function_single_exit;
+
+ ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+ if (ret)
+ goto out_call_function_entry;
+
+ ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_call_function_exit;
+
+ ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+ if (ret)
+ goto out_reschedule_entry;
+#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_IRQ_WORK
+ ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_reschedule_exit;
+
+ ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+ if (ret)
+ goto out_irq_work_entry;
+#endif
+
+ ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_irq_work_exit;
+
+ ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+ if (ret)
+ goto out_x86_ipi_entry;
+
+ ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_x86_ipi_exit;
+
+ ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+ if (ret)
+ goto out_error_apic_entry;
+
+ ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_error_apic_exit;
+
+ ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+ if (ret)
+ goto out_spurious_apic_entry;
+
+ return 0;
+
+out_spurious_apic_entry:
+ unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+out_error_apic_exit:
+ unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+out_error_apic_entry:
+ unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+out_x86_ipi_exit:
+ unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+out_x86_ipi_entry:
+ unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+out_irq_work_exit:
+
+#ifdef CONFIG_IRQ_WORK
+ unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+out_irq_work_entry:
+ unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+out_reschedule_exit:
+#endif
+
+#ifdef CONFIG_SMP
+ unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+out_reschedule_entry:
+ unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+out_call_function_exit:
+ unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+out_call_function_entry:
+ unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+out_call_function_single_exit:
+ unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+out_call_function_single_entry:
+ unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+out_threshold_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+out_threshold_entry:
+ unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+out_deferred_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+out_deferred_entry:
+ unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+out_thermal_exit:
+#endif /* CONFIG_X86_MCE_AMD */
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+out_thermal_entry:
+ unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+out_timer_exit:
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+ unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+out_timer_entry:
+ unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+out_err:
+ return -EINVAL;
+}
+
+static void unregister_intel_irq_tp(void)
+{
+ unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+ unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+ unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+ unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+
+#ifdef CONFIG_IRQ_WORK
+ unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+ unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_SMP
+ unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+ unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+ unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+ unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+ unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+ unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+ unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+ unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+ unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+}
+
+#else /* CONFIG_X86_LOCAL_APIC */
+/*
+ * The return value of register_intel_irq_tp() is assigned to "ret" in
+ * hook_irq_events(), so the stub must evaluate to 0, not to a statement.
+ */
+#define register_intel_irq_tp() (0)
+#define unregister_intel_irq_tp() do {} while (0)
+#endif /* CONFIG_X86_LOCAL_APIC */
+
+/**
+ * hook_irq_events - Hook IRQ handling events
+ *
+ * This function hooks the IRQ related callbacks to the respective trace
+ * events.
+ */
+int hook_irq_events(void)
+{
+ int ret;
+
+ ret = register_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+ if (ret)
+ goto out_err;
+
+ ret = register_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+ if (ret)
+ goto out_unregister_entry;
+
+ ret = register_intel_irq_tp();
+ if (ret)
+ goto out_irq_exit;
+
+ return 0;
+
+out_irq_exit:
+ unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+out_unregister_entry:
+ unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+out_err:
+ return -EINVAL;
+}
+
+/**
+ * unhook_irq_events - Unhook IRQ handling events
+ *
+ * This function unhooks the IRQ related callbacks from the respective trace
+ * events.
+ */
+void unhook_irq_events(void)
+{
+ unregister_intel_irq_tp();
+ unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
+ unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
+}
+
+#ifndef CONFIG_PREEMPT_RT
+/**
+ * trace_softirq_entry_callback - Note the starting of a SoftIRQ
+ *
+ * Save the starting time of a SoftIRQ. As SoftIRQs are non-preemptive to
+ * other SoftIRQs, it is safe to use a single variable (osn_var->softirq)
+ * to save the statistics. The arrival_time is used to report... the
+ * arrival time. The delta_start is used to compute the duration at the
+ * SoftIRQ exit handler. See cond_move_softirq_delta_start().
+ */
+void trace_softirq_entry_callback(void *data, unsigned int vec_nr)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+ if (!osn_var->sampling)
+ return;
+ /*
+ * This value will be used in the report, but not to compute
+ * the execution time, so it is safe to get it unsafe.
+ */
+ osn_var->softirq.arrival_time = time_get();
+ set_int_safe_time(osn_var, &osn_var->softirq.delta_start);
+ osn_var->softirq.count++;
+
+ local_inc(&osn_var->int_counter);
+}
+
+/**
+ * trace_softirq_exit_callback - Note the end of a SoftIRQ
+ *
+ * Computes the duration of the SoftIRQ noise and traces it. It also discounts
+ * the interference from other sources of noise that could currently be
+ * accounted.
+ */
+void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ int duration;
+
+ if (!osn_var->sampling)
+ return;
+
+ duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
+ trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
+ cond_move_thread_delta_start(osn_var, duration);
+ osn_var->softirq.arrival_time = 0;
+}
+
+/**
+ * hook_softirq_events - Hook SoftIRQ handling events
+ *
+ * This function hooks the SoftIRQ related callbacks to the respective trace
+ * events.
+ */
+static int hook_softirq_events(void)
+{
+ int ret;
+
+ ret = register_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+ if (ret)
+ goto out_err;
+
+ ret = register_trace_softirq_exit(trace_softirq_exit_callback, NULL);
+ if (ret)
+ goto out_unreg_entry;
+
+ return 0;
+
+out_unreg_entry:
+ unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+out_err:
+ return -EINVAL;
+}
+
+/**
+ * unhook_softirq_events - Unhook SoftIRQ handling events
+ *
+ * This function unhooks the SoftIRQ related callbacks from the respective
+ * trace events.
+ */
+static void unhook_softirq_events(void)
+{
+ unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
+ unregister_trace_softirq_exit(trace_softirq_exit_callback, NULL);
+}
+#else /* CONFIG_PREEMPT_RT */
+/*
+ * SoftIRQs are threads in the PREEMPT_RT mode.
+ */
+static int hook_softirq_events(void)
+{
+ return 0;
+}
+static void unhook_softirq_events(void)
+{
+}
+#endif
+
+/**
+ * thread_entry - Record the starting of a thread noise window
+ *
+ * It saves the context switch time for a noisy thread, and increments
+ * the interference counters.
+ */
+static void
+thread_entry(struct osnoise_variables *osn_var, struct task_struct *t)
+{
+ if (!osn_var->sampling)
+ return;
+ /*
+ * The arrival time will be used in the report, but not to compute
+ * the execution time, so it is safe to get it unsafe.
+ */
+ osn_var->thread.arrival_time = time_get();
+
+ set_int_safe_time(osn_var, &osn_var->thread.delta_start);
+
+ osn_var->thread.count++;
+ local_inc(&osn_var->int_counter);
+}
+
+/**
+ * thread_exit - Report the end of a thread noise window
+ *
+ * It computes the total noise from a thread, tracing if needed.
+ */
+static void
+thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
+{
+ int duration;
+
+ if (!osn_var->sampling)
+ return;
+
+ duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
+
+ trace_thread_noise(t, osn_var->thread.arrival_time, duration);
+
+ osn_var->thread.arrival_time = 0;
+}
+
+/**
+ * trace_sched_switch_callback - sched:sched_switch trace event handler
+ *
+ * This function is hooked to the sched:sched_switch trace event, and it is
+ * used to record the beginning and to report the end of a thread noise window.
+ */
+void
+trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
+ struct task_struct *n)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+
+ if (p->pid != osn_var->pid)
+ thread_exit(osn_var, p);
+
+ if (n->pid != osn_var->pid)
+ thread_entry(osn_var, n);
+}
+
+/**
+ * hook_thread_events - Hook the instrumentation for thread noise
+ *
+ * Hook the osnoise tracer callbacks to handle the noise from other
+ * threads on the necessary kernel events.
+ */
+int hook_thread_events(void)
+{
+ int ret;
+
+ ret = register_trace_sched_switch(trace_sched_switch_callback, NULL);
+ if (ret)
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * unhook_thread_events - Unhook the instrumentation for thread noise
+ *
+ * Unhook the osnoise tracer callbacks that handle the noise from other
+ * threads on the necessary kernel events.
+ */
+void unhook_thread_events(void)
+{
+ unregister_trace_sched_switch(trace_sched_switch_callback, NULL);
+}
+
+/**
+ * save_osn_sample_stats - Save the osnoise_sample statistics
+ *
+ * Save the osnoise_sample statistics before the sampling phase. These
+ * values will be used later to compute the diff between the statistics
+ * before and after the osnoise sampling.
+ */
+void save_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
+{
+ s->nmi_count = osn_var->nmi.count;
+ s->irq_count = osn_var->irq.count;
+ s->softirq_count = osn_var->softirq.count;
+ s->thread_count = osn_var->thread.count;
+}
+
+/**
+ * diff_osn_sample_stats - Compute the osnoise_sample statistics
+ *
+ * After a sample period, compute the difference on the osnoise_sample
+ * statistics. The struct osnoise_sample *s contains the statistics saved via
+ * save_osn_sample_stats() before the osnoise sampling.
+ */
+void diff_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
+{
+ s->nmi_count = osn_var->nmi.count - s->nmi_count;
+ s->irq_count = osn_var->irq.count - s->irq_count;
+ s->softirq_count = osn_var->softirq.count - s->softirq_count;
+ s->thread_count = osn_var->thread.count - s->thread_count;
+}
+
+/**
+ * osnoise_stop_tracing - Stop tracing and the tracer.
+ */
+static void osnoise_stop_tracing(void)
+{
+ tracing_off();
+}
+
+/**
+ * run_osnoise - Sample the time and look for osnoise
+ *
+ * Used to capture the time, looking for potential osnoise latency repeatedly.
+ * Different from hwlat_detector, it is called with preemption and interrupts
+ * enabled. This allows irqs, softirqs and threads to run, interfering on the
+ * osnoise sampling thread, as they would do with a regular thread.
+ */
+static int run_osnoise(void)
+{
+ struct osnoise_variables *osn_var = this_cpu_osn_var();
+ s64 noise = 0, max_noise = 0;
+ u64 sum_noise = 0;
+ struct trace_array *tr = osnoise_trace;
+ u64 start, sample, last_sample;
+ u64 last_int_count, int_count;
+ s64 total, last_total = 0;
+ struct osnoise_sample s;
+ unsigned int threshold;
+ int hw_count = 0;
+ u64 runtime, stop_in;
+ int ret = -1;
+
+ /*
+ * Consider the current thread as the workload.
+ */
+ osn_var->pid = current->pid;
+
+ /*
+ * Save the current stats for the diff
+ */
+ save_osn_sample_stats(osn_var, &s);
+
+ /*
+ * threshold should be at least 1 us.
+ */
+ threshold = tracing_thresh ? tracing_thresh : 1000;
+
+ /*
+ * Make sure NMIs see sampling first
+ */
+ osn_var->sampling = true;
+ barrier();
+
+ /*
+ * Transform the *_us config to nanoseconds to avoid the
+ * division on the main loop.
+ */
+ runtime = osnoise_data.sample_runtime * NSEC_PER_USEC;
+ stop_in = osnoise_data.stop_tracing_in * NSEC_PER_USEC;
+
+ /*
+ * Start timestamp
+ */
+ start = time_get();
+
+ /*
+ * "previous" loop
+ */
+ last_int_count = set_int_safe_time(osn_var, &last_sample);
+
+ do {
+ /*
+ * Get sample!
+ */
+ int_count = set_int_safe_time(osn_var, &sample);
+
+ noise = time_sub(sample, last_sample);
+
+ /*
+ * This shouldn't happen.
+ */
+ if (noise < 0) {
+ pr_err(BANNER "time running backwards\n");
+ goto out;
+ }
+
+ /*
+ * Sample runtime.
+ */
+ total = time_sub(sample, start);
+
+ /*
+ * Check for possible overflows.
+ */
+ if (total < last_total) {
+ pr_err("Time total overflowed\n");
+ break;
+ }
+
+ last_total = total;
+
+ if (noise >= threshold) {
+ int interference = int_count - last_int_count;
+
+ if (noise > max_noise)
+ max_noise = noise;
+
+ if (!interference)
+ hw_count++;
+
+ sum_noise += noise;
+
+ trace_sample_threshold(last_sample, noise, interference);
+
+ if (osnoise_data.stop_tracing_in)
+ if (noise > stop_in)
+ osnoise_stop_tracing();
+ }
+
+ /*
+ * For the non-preemptive kernel config: let threads run, if
+ * they so wish.
+ */
+ cond_resched();
+
+ last_sample = sample;
+ last_int_count = int_count;
+
+ } while (total < runtime && !kthread_should_stop());
+
+ /*
+ * Ensure the above loop is finished from the interrupts' point of view.
+ */
+ barrier();
+
+ osn_var->sampling = false;
+
+ /*
+ * Make sure sampling data is no longer updated.
+ */
+ barrier();
+
+ /*
+ * Save noise info.
+ */
+ s.noise = time_to_us(sum_noise);
+ s.runtime = time_to_us(total);
+ s.max_sample = time_to_us(max_noise);
+ s.hw_count = hw_count;
+
+ /* Save interference stats info */
+ diff_osn_sample_stats(osn_var, &s);
+
+ trace_osnoise_sample(&s);
+
+ /* Keep a running maximum ever recorded osnoise "latency" */
+ if (max_noise > tr->max_latency) {
+ tr->max_latency = max_noise;
+ latency_fsnotify(tr);
+ }
+
+ if (osnoise_data.stop_tracing_out)
+ if (s.noise > osnoise_data.stop_tracing_out)
+ osnoise_stop_tracing();
+
+ return 0;
+out:
+ return ret;
+}
+
+static struct cpumask osnoise_cpumask;
+static struct cpumask save_cpumask;
+
+/*
+ * osnoise_main - The osnoise detection kernel thread
+ *
+ * Calls the run_osnoise() function to measure the osnoise for the configured
+ * runtime, once every period.
+ */
+static int osnoise_main(void *data)
+{
+ s64 interval;
+
+ while (!kthread_should_stop()) {
+
+ run_osnoise();
+
+ mutex_lock(&interface_lock);
+ interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
+ mutex_unlock(&interface_lock);
+
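+ /*
+ * The period and runtime are in usec, while msleep_interruptible()
+ * takes msec: hence the division by USEC_PER_MSEC below.
+ */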
+ do_div(interval, USEC_PER_MSEC);
+
+ /*
+ * Differently from hwlat_detector, the osnoise tracer can run
+ * without a pause because preemption is on.
+ */
+ if (interval < 1)
+ continue;
+
+ if (msleep_interruptible(interval))
+ break;
+ }
+
+ return 0;
+}
+
+/**
+ * stop_per_cpu_kthreads - Stop per-cpu osnoise threads
+ *
+ * Stop the osnoise sampling threads. Use this on unload and at system
+ * shutdown.
+ */
+static void stop_per_cpu_kthreads(void)
+{
+ struct task_struct *kthread;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
+ if (kthread)
+ kthread_stop(kthread);
+ per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+ }
+}
+
+/**
+ * start_per_cpu_kthreads - Kick off the per-cpu osnoise sampling kthreads
+ *
+ * This starts the kernel threads that will look for osnoise on the
+ * allowed cpus.
+ */
+static int start_per_cpu_kthreads(struct trace_array *tr)
+{
+ struct cpumask *current_mask = &save_cpumask;
+ struct task_struct *kthread;
+ char comm[24];
+ int cpu;
+
+ get_online_cpus();
+ /*
+ * Run only on CPUs on which tracing and osnoise are allowed to run.
+ */
+ cpumask_and(current_mask, tr->tracing_cpumask, &osnoise_cpumask);
+ /*
+ * And the CPU is online.
+ */
+ cpumask_and(current_mask, cpu_online_mask, current_mask);
+ put_online_cpus();
+
+ for_each_online_cpu(cpu)
+ per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
+
+ for_each_cpu(cpu, current_mask) {
+ snprintf(comm, 24, "osnoise/%d", cpu);
+
+ kthread = kthread_create_on_cpu(osnoise_main, NULL, cpu, comm);
+
+ if (IS_ERR(kthread)) {
+ pr_err(BANNER "could not start sampling thread\n");
+ stop_per_cpu_kthreads();
+ return -ENOMEM;
+ }
+
+ per_cpu(per_cpu_osnoise_var, cpu).kthread = kthread;
+ wake_up_process(kthread);
+ }
+
+ return 0;
+}
+
+/*
+ * osnoise_cpus_read - Read function for reading the "cpus" file
+ * @filp: The active open file structure
+ * @ubuf: The userspace provided buffer to read value into
+ * @count: The maximum number of bytes to read
+ * @ppos: The current "file" position
+ *
+ * Prints the "cpus" output into the user-provided buffer.
+ */
+static ssize_t
+osnoise_cpus_read(struct file *filp, char __user *ubuf, size_t count,
+ loff_t *ppos)
+{
+ char *mask_str;
+ int len;
+
+ len = snprintf(NULL, 0, "%*pbl\n",
+ cpumask_pr_args(&osnoise_cpumask)) + 1;
+ mask_str = kmalloc(len, GFP_KERNEL);
+ if (!mask_str)
+ return -ENOMEM;
+
+ len = snprintf(mask_str, len, "%*pbl\n",
+ cpumask_pr_args(&osnoise_cpumask));
+ if (len >= count) {
+ count = -EINVAL;
+ goto out_err;
+ }
+ count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
+
+out_err:
+ kfree(mask_str);
+
+ return count;
+}
+
+/**
+ * osnoise_cpus_write - Write function for "cpus" entry
+ * @filp: The active open file structure
+ * @ubuf: The user buffer that contains the value to write
+ * @count: The maximum number of bytes to write to "file"
+ * @ppos: The current position in "file"
+ *
+ * This function provides a write implementation for the "cpus"
+ * interface to the osnoise trace. By default, it lists all CPUs,
+ * in this way, allowing osnoise threads to run on any online CPU
+ * of the system. It serves to restrict the execution of osnoise to the
+ * set of CPUs written via this interface. Note that osnoise also
+ * respects the "tracing_cpumask." Hence, osnoise threads will run only
+ * on the set of CPUs allowed here AND on "tracing_cpumask." Why not
+ * have just "tracing_cpumask?" Because the user might be interested
+ * in tracing what is running on other CPUs. For instance, one might
+ * run osnoise in one HT CPU while observing what is running on the
+ * sibling HT CPU.
+ */
+static ssize_t
+osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
+ loff_t *ppos)
+{
+ cpumask_var_t osnoise_cpumask_new;
+ char buf[256];
+ int err;
+
+ if (count >= 256)
+ return -EINVAL;
+
+ if (copy_from_user(buf, ubuf, count))
+ return -EFAULT;
+
+ /* cpulist_parse() requires a NUL-terminated string */
+ buf[count] = '\0';
+
+ if (!zalloc_cpumask_var(&osnoise_cpumask_new, GFP_KERNEL))
+ return -ENOMEM;
+
+ err = cpulist_parse(buf, osnoise_cpumask_new);
+ if (err)
+ goto err_free;
+
+ cpumask_copy(&osnoise_cpumask, osnoise_cpumask_new);
+
+ free_cpumask_var(osnoise_cpumask_new);
+ return count;
+
+err_free:
+ free_cpumask_var(osnoise_cpumask_new);
+
+ return err;
+}
+
+/*
+ * osnoise/runtime_us: cannot be greater than the period.
+ */
+static struct trace_ull_config osnoise_runtime = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.sample_runtime,
+ .max = &osnoise_data.sample_period,
+ .min = NULL,
+};
+
+/*
+ * osnoise/period_us: cannot be smaller than the runtime.
+ */
+static struct trace_ull_config osnoise_period = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.sample_period,
+ .max = NULL,
+ .min = &osnoise_data.sample_runtime,
+};
+
+/*
+ * osnoise/stop_tracing_in_us: no limit.
+ */
+static struct trace_ull_config osnoise_stop_single = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.stop_tracing_in,
+ .max = NULL,
+ .min = NULL,
+};
+
+/*
+ * osnoise/stop_tracing_out_us: no limit.
+ */
+static struct trace_ull_config osnoise_stop_total = {
+ .lock = &interface_lock,
+ .val = &osnoise_data.stop_tracing_out,
+ .max = NULL,
+ .min = NULL,
+};
+
+static const struct file_operations cpus_fops = {
+ .open = tracing_open_generic,
+ .read = osnoise_cpus_read,
+ .write = osnoise_cpus_write,
+ .llseek = generic_file_llseek,
+};
+
+/**
+ * init_tracefs - A function to initialize the tracefs interface files
+ *
+ * This function creates entries in tracefs for "osnoise". It creates the
+ * "osnoise" directory in the tracing directory, and within that directory
+ * are the period_us, runtime_us, stop_tracing_{in,out}_us, and cpus files
+ * to change and view those values.
+ */
+static int init_tracefs(void)
+{
+ struct dentry *top_dir;
+ struct dentry *tmp;
+ int ret;
+
+ ret = tracing_init_dentry();
+ if (ret)
+ return -ENOMEM;
+
+ top_dir = tracefs_create_dir("osnoise", NULL);
+ if (!top_dir)
+ return -ENOMEM;
+
+ tmp = tracefs_create_file("period_us", 0640, top_dir,
+ &osnoise_period, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+
+ tmp = tracefs_create_file("runtime_us", 0644, top_dir,
+ &osnoise_runtime, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+
+ tmp = tracefs_create_file("stop_tracing_in_us", 0640, top_dir,
+ &osnoise_stop_single, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+
+ tmp = tracefs_create_file("stop_tracing_out_us", 0640, top_dir,
+ &osnoise_stop_total, &trace_ull_config_fops);
+ if (!tmp)
+ goto err;
+
+ tmp = trace_create_file("cpus", 0644, top_dir, NULL, &cpus_fops);
+ if (!tmp)
+ goto err;
+
+ return 0;
+
+err:
+ tracefs_remove(top_dir);
+ return -ENOMEM;
+}
+
+static int osnoise_hook_events(void)
+{
+ int retval;
+
+ /*
+ * Trace is already hooked, we are re-enabling from
+ * a stop_tracing_*.
+ */
+ if (trace_osnoise_callback_enabled)
+ return 0;
+
+ retval = hook_irq_events();
+ if (retval)
+ return -EINVAL;
+
+ retval = hook_softirq_events();
+ if (retval)
+ goto out_unhook_irq;
+
+ retval = hook_thread_events();
+ /*
+ * All fine!
+ */
+ if (!retval)
+ return 0;
+
+ unhook_softirq_events();
+out_unhook_irq:
+ unhook_irq_events();
+ return -EINVAL;
+}
+
+static void osnoise_tracer_start(struct trace_array *tr)
+{
+ int retval;
+
+ if (osnoise_busy)
+ return;
+
+ osn_var_reset_all();
+
+ retval = osnoise_hook_events();
+ if (retval)
+ goto out_err;
+ /*
+ * Make sure NMIs see reset values.
+ */
+ barrier();
+ trace_osnoise_callback_enabled = true;
+
+ retval = start_per_cpu_kthreads(tr);
+ /*
+ * All fine!
+ */
+ if (!retval) {
+ osnoise_busy = true;
+ return;
+ }
+
+ /*
+ * Undo the hooking. Note that osnoise_hook_events() already unhooks
+ * on its own failure, so out_err only reports the error.
+ */
+ trace_osnoise_callback_enabled = false;
+ barrier();
+ unhook_irq_events();
+ unhook_softirq_events();
+ unhook_thread_events();
+out_err:
+ pr_err(BANNER "Error starting osnoise tracer\n");
+}
+
+static void osnoise_tracer_stop(struct trace_array *tr)
+{
+ if (!osnoise_busy)
+ return;
+
+ trace_osnoise_callback_enabled = false;
+ barrier();
+
+ stop_per_cpu_kthreads();
+
+ unhook_irq_events();
+ unhook_softirq_events();
+ unhook_thread_events();
+
+ osnoise_busy = false;
+}
+
+static int osnoise_tracer_init(struct trace_array *tr)
+{
+ /* Only allow one instance to enable this */
+ if (osnoise_busy)
+ return -EBUSY;
+
+ osnoise_trace = tr;
+
+ tr->max_latency = 0;
+
+ osnoise_tracer_start(tr);
+
+ return 0;
+}
+
+static void osnoise_tracer_reset(struct trace_array *tr)
+{
+ osnoise_tracer_stop(tr);
+}
+
+static struct tracer osnoise_tracer __read_mostly = {
+ .name = "osnoise",
+ .init = osnoise_tracer_init,
+ .reset = osnoise_tracer_reset,
+ .start = osnoise_tracer_start,
+ .stop = osnoise_tracer_stop,
+ .print_header = print_osnoise_headers,
+ .allow_instances = true,
+};
+
+__init static int init_osnoise_tracer(void)
+{
+ int ret;
+
+ mutex_init(&interface_lock);
+
+ cpumask_copy(&osnoise_cpumask, cpu_all_mask);
+
+ ret = register_tracer(&osnoise_tracer);
+ if (ret)
+ return ret;
+
+ init_tracefs();
+
+ return 0;
+}
+late_initcall(init_osnoise_tracer);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d0368a569bfa..642b6584eba5 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1202,7 +1202,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
return trace_handle_return(s);
}

-
static enum print_line_t
trace_hwlat_raw(struct trace_iterator *iter, int flags,
struct trace_event *event)
@@ -1232,6 +1231,76 @@ static struct trace_event trace_hwlat_event = {
.funcs = &trace_hwlat_funcs,
};

+/* TRACE_OSNOISE */
+static enum print_line_t
+trace_osnoise_print(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_seq *s = &iter->seq;
+ struct osnoise_entry *field;
+ u64 ratio, ratio_dec;
+ u64 net_runtime;
+
+ trace_assign_type(field, entry);
+
+ /*
+ * compute the available % of cpu time.
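+ * e.g., with runtime=1000000 us and noise=5000 us:
+ * 995000 * 10000000 / 1000000 = 9950000, and
+ * do_div(9950000, 100000) leaves 99, returning 50000: "99.50000".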
+ */
+ net_runtime = field->runtime - field->noise;
+ ratio = net_runtime * 10000000;
+ do_div(ratio, field->runtime);
+ ratio_dec = do_div(ratio, 100000);
+
+ trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
+ field->runtime,
+ field->noise,
+ ratio, ratio_dec,
+ field->max_sample);
+
+ trace_seq_printf(s, " %6u", field->hw_count);
+ trace_seq_printf(s, " %6u", field->nmi_count);
+ trace_seq_printf(s, " %6u", field->irq_count);
+ trace_seq_printf(s, " %6u", field->softirq_count);
+ trace_seq_printf(s, " %6u", field->thread_count);
+
+ trace_seq_putc(s, '\n');
+
+ return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_osnoise_raw(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct osnoise_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+
+ trace_seq_printf(s, "%lld %llu %llu %u %u %u %u %u\n",
+ field->runtime,
+ field->noise,
+ field->max_sample,
+ field->hw_count,
+ field->nmi_count,
+ field->irq_count,
+ field->softirq_count,
+ field->thread_count);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_osnoise_funcs = {
+ .trace = trace_osnoise_print,
+ .raw = trace_osnoise_raw,
+};
+
+static struct trace_event trace_osnoise_event = {
+ .type = TRACE_OSNOISE,
+ .funcs = &trace_osnoise_funcs,
+};
+
/* TRACE_BPUTS */
static enum print_line_t
trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1442,6 +1511,7 @@ static struct trace_event *events[] __initdata = {
&trace_bprint_event,
&trace_print_event,
&trace_hwlat_event,
+ &trace_osnoise_event,
&trace_raw_data_event,
&trace_func_repeats_event,
NULL
--
2.26.3


Subject: Re: [PATCH V3 4/9] tracing/hwlat: Implement the per-cpu mode

On 5/27/21 1:58 PM, Juri Lelli wrote:
> Hi,
>
> On 14/05/21 22:51, Daniel Bristot de Oliveira wrote:
>
> [...]
>
>> +/**
>> + * start_per_cpu_kthread - Kick off the hardware latency sampling/detector kthreads
>> + *
>> + * This starts the kernel threads that will sit on potentially all cpus and
>> + * sample the CPU timestamp counter (TSC or similar) and look for potential
>> + * hardware latencies.
>> + */
>> +static int start_per_cpu_kthreads(struct trace_array *tr)
>> +{
>> + struct cpumask *current_mask = &save_cpumask;
>> + struct cpumask *this_cpumask;
>> + struct task_struct *kthread;
>> + char comm[24];
>> + int cpu;
>> +
>> + if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
>> + return -ENOMEM;
>
> Is this_cpumask actually used anywhere?

OOpppsss, this is a left-over :-(....

Before starting to use kthread_create_on_cpu(), I was using this_cpumask to
set the affinity of threads created via kthread_create()... but it is not
needed anymore.

I will remove it, good catch.

Thanks!
-- Daniel

>
> Thanks,
> Juri
>

2021-05-27 20:48:56

by Juri Lelli

Subject: Re: [PATCH V3 0/9] hwlat improvements and osnoise/timerlat tracers

Hi,

On 14/05/21 22:51, Daniel Bristot de Oliveira wrote:
> [...]

FWIW, I've been using the new tracers extensively downstream for a while
now and I find them very useful, and quite a bit more precise at detecting
problems than what we currently have available.

The fact that one can do almost everything needed to spot latency issues
from entirely inside the kernel with a simple interface is a big plus to me
as well.

I wouldn't mind if this gets accepted very soon! :)

Best,
Juri

2021-05-27 20:49:12

by Juri Lelli

Subject: Re: [PATCH V3 4/9] tracing/hwlat: Implement the per-cpu mode

Hi,

On 14/05/21 22:51, Daniel Bristot de Oliveira wrote:

[...]

> +/**
> + * start_per_cpu_kthread - Kick off the hardware latency sampling/detector kthreads
> + *
> + * This starts the kernel threads that will sit on potentially all cpus and
> + * sample the CPU timestamp counter (TSC or similar) and look for potential
> + * hardware latencies.
> + */
> +static int start_per_cpu_kthreads(struct trace_array *tr)
> +{
> + struct cpumask *current_mask = &save_cpumask;
> + struct cpumask *this_cpumask;
> + struct task_struct *kthread;
> + char comm[24];
> + int cpu;
> +
> + if (!alloc_cpumask_var(&this_cpumask, GFP_KERNEL))
> + return -ENOMEM;

Is this_cpumask actually used anywhere?

Thanks,
Juri

2021-05-29 02:18:06

by Steven Rostedt

Subject: Re: [PATCH V3 0/9] hwlat improvements and osnoise/timerlat tracers

On Thu, 27 May 2021 14:07:37 +0200
Juri Lelli <[email protected]> wrote:

> FWIW, I've been using the new tracers extensively downstream for a while
> now and I find them very useful and quite more precise to detect
> problems than what we currently have available.
>
> The fact that one can do almost everything needed to spot latency issues
> from entirely inside the kernel with a simple interface is a big plus to me
> as well.
>
> I wouldn't mind if this gets accepted very soon! :)

Neither would I ;-) But because of my extended vacation and other
immediate responsibilities I need to take care of, it may be a couple
of weeks before I can thoroughly look at this ;-)

Anyway, thanks for the vote of confidence.

-- Steve

2021-06-03 20:13:39

by Steven Rostedt

Subject: Re: [PATCH V3 2/9] tracing/hwlat: Implement the mode config option

On Fri, 14 May 2021 22:51:11 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> Provides the "mode" config to the hardware latency detector. hwlatd has
> two different operation modes. The default mode is the "round-robin" one,
> in which a single hwlatd thread runs, migrating among the allowed CPUs in a
> "round-robin" fashion. This is the current behavior.
>
> The "none" sets the allowed cpumask for a single hwlatd thread at the
> startup, but skips the round-robin, letting the scheduler handle the
> migration.
>
> In preparation to the per-cpu mode.
>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Alexandre Chartre <[email protected]>
> Cc: Clark Willaims <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> Documentation/trace/hwlat_detector.rst | 12 +-
> kernel/trace/trace_hwlat.c | 171 +++++++++++++++++++++++--
> 2 files changed, 169 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
> index 5739349649c8..4d952df0586a 100644
> --- a/Documentation/trace/hwlat_detector.rst
> +++ b/Documentation/trace/hwlat_detector.rst
> @@ -76,8 +76,12 @@ in /sys/kernel/tracing:
> - tracing_cpumask - the CPUs to move the hwlat thread across
> - hwlat_detector/width - specified amount of time to spin within window (usecs)
> - hwlat_detector/window - amount of time between (width) runs (usecs)
> + - hwlat_detector/mode - the thread mode
>
> -The hwlat detector's kernel thread will migrate across each CPU specified in
> -tracing_cpumask between each window. To limit the migration, either modify
> -tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
> -affinity directly, and the migration will stop.
> +By default, the hwlat detector's kernel thread will migrate across each CPU
> +specified in cpumask at the beginning of a new window, in a round-robin
> +fashion. This behavior can be changed by changing the thread mode,
> +the available options are:
> +
> + - none: do not force migration
> + - round-robin: migrate across each CPU specified in cpumask [default]
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index 0a5635401125..1f5d48830fd6 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -59,6 +59,14 @@ static struct task_struct *hwlat_kthread;
>
> static struct dentry *hwlat_sample_width; /* sample width us */
> static struct dentry *hwlat_sample_window; /* sample window us */
> +static struct dentry *hwlat_thread_mode; /* hwlat thread mode */
> +
> +enum {
> + MODE_NONE = 0,
> + MODE_ROUND_ROBIN,
> + MODE_MAX
> +};
> +static char *thread_mode_str[] = { "none", "round-robin" };

I usually do the above with:

#define HWLAT_MODES \
C(NONE, "none"), \
C(ROUND_ROBIN, "round-robin")

#undef C
#define C(x,y) MODE_##x

enum {
HWLAT_MODES,
MODE_MAX
};

#undef C
#define C(x,y) y

static char *thread_mode_str[] = { HWLAT_MODES };

But this is small enough not to do this.

>
> /* Save the previous tracing_thresh value */
> static unsigned long save_tracing_thresh;
> @@ -96,11 +104,16 @@ static struct hwlat_data {
> u64 sample_window; /* total sampling window (on+off) */
> u64 sample_width; /* active sampling portion of window */
>
> + int thread_mode; /* thread mode */
> +
> } hwlat_data = {
> .sample_window = DEFAULT_SAMPLE_WINDOW,
> .sample_width = DEFAULT_SAMPLE_WIDTH,
> + .thread_mode = MODE_ROUND_ROBIN
> };
>
> +static bool hwlat_busy;
> +
> static void trace_hwlat_sample(struct hwlat_sample *sample)
> {
> struct trace_array *tr = hwlat_trace;
> @@ -328,7 +341,8 @@ static int kthread_fn(void *data)
>
> while (!kthread_should_stop()) {
>
> - move_to_next_cpu();
> + if (hwlat_data.thread_mode == MODE_ROUND_ROBIN)
> + move_to_next_cpu();
>
> local_irq_disable();
> get_sample();
> @@ -366,11 +380,6 @@ static int start_kthread(struct trace_array *tr)
> if (hwlat_kthread)
> return 0;
>
> - /* Just pick the first CPU on first iteration */
> - get_online_cpus();
> - cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
> - put_online_cpus();
> - next_cpu = cpumask_first(current_mask);
>
> kthread = kthread_create(kthread_fn, NULL, "hwlatd");
> if (IS_ERR(kthread)) {
> @@ -378,8 +387,19 @@ static int start_kthread(struct trace_array *tr)
> return -ENOMEM;
> }
>
> - cpumask_clear(current_mask);
> - cpumask_set_cpu(next_cpu, current_mask);
> +
> + /* Just pick the first CPU on first iteration */
> + get_online_cpus();
> + cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
> + put_online_cpus();
> +
> + if (hwlat_data.thread_mode == MODE_ROUND_ROBIN) {
> + next_cpu = cpumask_first(current_mask);
> + cpumask_clear(current_mask);
> + cpumask_set_cpu(next_cpu, current_mask);
> +
> + }
> +
> sched_setaffinity(kthread->pid, current_mask);
>
> hwlat_kthread = kthread;
> @@ -511,6 +531,125 @@ hwlat_window_write(struct file *filp, const char __user *ubuf,
> return cnt;
> }
>
> +static void *s_mode_start(struct seq_file *s, loff_t *pos)
> +{
> + int mode = *pos;
> +

Probably should add:

mutex_lock(&hwlat_data.lock);

here.

> + if (mode >= MODE_MAX)
> + return NULL;
> +
> + return pos;
> +}
> +
> +static void *s_mode_next(struct seq_file *s, void *v, loff_t *pos)
> +{
> + int mode = ++(*pos);
> +
> + if (mode >= MODE_MAX)
> + return NULL;
> +
> + return pos;
> +}
> +
> +static int s_mode_show(struct seq_file *s, void *v)
> +{
> + loff_t *pos = v;
> + int mode = *pos;
> +
> + if (mode == hwlat_data.thread_mode)
> + seq_printf(s, "[%s]", thread_mode_str[mode]);
> + else
> + seq_printf(s, "%s", thread_mode_str[mode]);
> +
> + if (mode != MODE_MAX)
> + seq_puts(s, " ");
> +
> + return 0;
> +}
> +
> +static void s_mode_stop(struct seq_file *s, void *v)
> +{
> + seq_puts(s, "\n");

And

mutex_unlock(&hwlat_data.lock);

here to prevent conflicts with the update?

> +}
> +
> +static const struct seq_operations thread_mode_seq_ops = {
> + .start = s_mode_start,
> + .next = s_mode_next,
> + .show = s_mode_show,
> + .stop = s_mode_stop
> +};
> +
> +static int hwlat_mode_open(struct inode *inode, struct file *file)
> +{
> + return seq_open(file, &thread_mode_seq_ops);
> +};
> +
> +static void hwlat_tracer_start(struct trace_array *tr);
> +static void hwlat_tracer_stop(struct trace_array *tr);

Add newline here.

> +/**
> + * hwlat_mode_write - Write function for "mode" entry
> + * @filp: The active open file structure
> + * @ubuf: The user buffer that contains the value to write
> + * @cnt: The maximum number of bytes to write to "file"
> + * @ppos: The current position in @file
> + *
> + * This function provides a write implementation for the "mode" interface
> + * to the hardware latency detector. hwlatd has different operation modes.
> + * The "none" sets the allowed cpumask for a single hwlatd thread at the
> + * startup and lets the scheduler handle the migration. The default mode is
> + * the "round-robin" one, in which a single hwlatd thread runs, migrating
> + * among the allowed CPUs in a round-robin fashion.
> + */
> +static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + struct trace_array *tr = hwlat_trace;
> + const char *mode;
> + char buf[64];
> + int ret, i;
> +
> + if (cnt >= sizeof(buf))
> + return -EINVAL;
> +
> + if (copy_from_user(buf, ubuf, cnt))
> + return -EFAULT;
> +
> + buf[cnt] = 0;
> +
> + mode = strstrip(buf);
> +
> + ret = -EINVAL;
> +
> + /*
> + * trace_types_lock is taken to avoid concurrency on start/stop
> + * and hwlat_busy.
> + */
> + mutex_lock(&trace_types_lock);
> + if (hwlat_busy)
> + hwlat_tracer_stop(tr);
> +
> + mutex_lock(&hwlat_data.lock);
> +
> + for (i = 0; i < MODE_MAX; i++) {
> + if (strcmp(mode, thread_mode_str[i]) == 0) {
> + hwlat_data.thread_mode = i;
> + ret = cnt;
> + }
> + }
> +
> + mutex_unlock(&hwlat_data.lock);
> +
> + if (hwlat_busy)
> + hwlat_tracer_start(tr);
> + mutex_unlock(&trace_types_lock);
> +
> + *ppos += cnt;
> +
> +
> +
> + return ret;
> +}
> +
> static const struct file_operations width_fops = {
> .open = tracing_open_generic,
> .read = hwlat_read,
> @@ -523,6 +662,13 @@ static const struct file_operations window_fops = {
> .write = hwlat_window_write,
> };
>
> +static const struct file_operations thread_mode_fops = {
> + .open = hwlat_mode_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = seq_release,
> + .write = hwlat_mode_write
> +};
> /**
> * init_tracefs - A function to initialize the tracefs interface files
> *
> @@ -558,6 +704,13 @@ static int init_tracefs(void)
> if (!hwlat_sample_width)
> goto err;
>
> + hwlat_thread_mode = trace_create_file("mode", 0644,
> + top_dir,
> + NULL,
> + &thread_mode_fops);
> + if (!hwlat_thread_mode)
> + goto err;
> +
> return 0;
>
> err:
> @@ -579,8 +732,6 @@ static void hwlat_tracer_stop(struct trace_array *tr)
> stop_kthread();
> }
>
> -static bool hwlat_busy;
> -
> static int hwlat_tracer_init(struct trace_array *tr)
> {
> /* Only allow one instance to enable this */


-- Steve

2021-06-03 21:19:35

by Steven Rostedt

Subject: Re: [PATCH V3 4/9] tracing/hwlat: Implement the per-cpu mode

On Fri, 14 May 2021 22:51:13 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> void trace_hwlat_callback(bool enter)
> {
> - if (smp_processor_id() != nmi_cpu)
> + struct hwlat_kthread_data *kdata = get_cpu_data();
> +
> + if (kdata->kthread)

Shouldn't that be:

if (!kdata->kthread)

?
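
Otherwise the NMI time is only accounted on CPUs that are *not* running
a sampling thread.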

-- Steve

> return;
>
> /*
> @@ -158,13 +173,13 @@ void trace_hwlat_callback(bool enter)
> */
> if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
> if (enter)
> - nmi_ts_start = time_get();
> + kdata->nmi_ts_start = time_get();
> else
> - nmi_total_ts += time_get() - nmi_ts_start;
> + kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
> }
>
> if (enter)
> - nmi_count++;
> + kdata->nmi_count++;
> }
>

2021-06-03 21:25:15

by Steven Rostedt

Subject: Re: [PATCH V3 5/9] tracing/trace: Add a generic function to read/write u64 values from tracefs

On Fri, 14 May 2021 22:51:14 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> Provides a generic read and write implementation to save/read u64 values
> from a file on tracefs. The trace_ull_config structure defines where to
> read/write the value, the min and the max acceptable values, and a lock
> to protect the write.

This states what the patch is doing, but does not say why it is doing it.
>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Alexandre Chartre <[email protected]>
> Cc: Clark Willaims <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> ---
> kernel/trace/trace.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
> kernel/trace/trace.h | 19 ++++++++++
> 2 files changed, 106 insertions(+)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 560e4c8d3825..b4cd89010813 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7516,6 +7516,93 @@ static const struct file_operations snapshot_raw_fops = {
>
> #endif /* CONFIG_TRACER_SNAPSHOT */
>
> +/*
> + * trace_ull_config_write - Generic write function to save u64 value


That is a horrible name. What the hell is the "config"?

> + * @filp: The active open file structure
> + * @ubuf: The userspace provided buffer to read value into
> + * @cnt: The maximum number of bytes to read
> + * @ppos: The current "file" position
> + *
> + * This function provides a generic write implementation to save u64 values
> + * from a file on tracefs. The filp->private_data must point to a
> + * trace_ull_config structure that defines where to write the value, the
> + * min and the max acceptable values, and a lock to protect the write.

This doesn't seem to be a generic way to save 64 bit values (which I still
don't understand, because unsigned long long should work too). But it looks
like the rationale is for having some kind of generic way to read 64 bit
values giving them a min and a max.

I see this is used later, but this patch needs to be rewritten. It makes no
sense.

-- Steve


> + */
> +static ssize_t
> +trace_ull_config_write(struct file *filp, const char __user *ubuf, size_t cnt,
> + loff_t *ppos)
> +{
> + struct trace_ull_config *config = filp->private_data;
> + u64 val;
> + int err;
> +
> + if (!config)
> + return -EFAULT;
> +
> + err = kstrtoull_from_user(ubuf, cnt, 10, &val);
> + if (err)
> + return err;
> +
> + if (config->lock)
> + mutex_lock(config->lock);
> +
> + if (config->min && val < *config->min)
> + err = -EINVAL;
> +
> + if (config->max && val > *config->max)
> + err = -EINVAL;
> +
> + if (!err)
> + *config->val = val;
> +
> + if (config->lock)
> + mutex_unlock(config->lock);
> +
> + if (err)
> + return err;
> +
> + return cnt;
> +}
> +
> +/*
> + * trace_ull_config_read - Generic write function to read u64 value via tracefs
> + * @filp: The active open file structure
> + * @ubuf: The userspace provided buffer to read value into
> + * @cnt: The maximum number of bytes to read
> + * @ppos: The current "file" position
> + *
> + * This function provides a generic read implementation to read a u64 value
> + * from a file on tracefs. The filp->private_data must point to a
> + * trace_ull_config structure with valid data.
> + */
> +static ssize_t
> +trace_ull_config_read(struct file *filp, char __user *ubuf, size_t cnt,
> + loff_t *ppos)
> +{
> + struct trace_ull_config *config = filp->private_data;
> + char buf[ULL_STR_SIZE];
> + u64 val;
> + int len;
> +
> + if (!config)
> + return -EFAULT;
> +
> + val = *config->val;
> +
> + if (cnt > sizeof(buf))
> + cnt = sizeof(buf);
> +
> + len = snprintf(buf, sizeof(buf), "%llu\n", val);
> +
> + return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
> +}
> +
> +const struct file_operations trace_ull_config_fops = {
> + .open = tracing_open_generic,
> + .read = trace_ull_config_read,
> + .write = trace_ull_config_write,
> +};
> +
> #define TRACING_LOG_ERRS_MAX 8
> #define TRACING_LOG_LOC_MAX 128
>
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index cd80d046c7a5..44fa25c1264a 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -1952,4 +1952,23 @@ static inline bool is_good_name(const char *name)
> return true;
> }
>
> +/*
> + * This is a generic way to read and write a u64 config value from a file
> + * in tracefs.
> + *
> + * The value is stored on the variable pointed by *val. The value needs
> + * to be at least *min and at most *max. The write is protected by an
> + * existing *lock.
> + */
> +struct trace_ull_config {
> + struct mutex *lock;
> + u64 *val;
> + u64 *min;
> + u64 *max;
> +};
> +
> +#define ULL_STR_SIZE 22 /* 20 digits max */
> +
> +extern const struct file_operations trace_ull_config_fops;
> +
> #endif /* _LINUX_KERNEL_TRACE_H */

2021-06-03 21:29:46

by Steven Rostedt

Subject: Re: [PATCH V3 6/9] trace/hwlat: Use the generic function to read/write width and window

On Fri, 14 May 2021 22:51:15 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> @@ -733,16 +624,18 @@ static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
> return ret;
> }
>
> -static const struct file_operations width_fops = {
> - .open = tracing_open_generic,
> - .read = hwlat_read,
> - .write = hwlat_width_write,
> +static struct trace_ull_config hwlat_width = {
> + .lock = &hwlat_data.lock,
> + .val = &hwlat_data.sample_width,
> + .max = &hwlat_data.sample_window,
> + .min = NULL,
> };
>
> -static const struct file_operations window_fops = {
> - .open = tracing_open_generic,
> - .read = hwlat_read,
> - .write = hwlat_window_write,
> +static struct trace_ull_config hwlat_window = {

Yeah, the naming convention needs to be changed, because ull_config is
meaningless, and this code makes no sense. I know what it is doing, but if
I didn't, I'd have no clue what it was doing by reading it. :-p
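
Perhaps name it for what it is: a locked, range-checked u64 behind a
tracefs file. Something like:

	struct trace_min_max_param {
		struct mutex	*lock;
		u64		*val;
		u64		*min;
		u64		*max;
	};

with a matching trace_min_max_fops would be self-describing.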

-- Steve


> + .lock = &hwlat_data.lock,
> + .val = &hwlat_data.sample_window,
> + .max = NULL,
> + .min = &hwlat_data.sample_width,
> };
>
> static const struct file_operations thread_mode_fops = {
> @@ -775,15 +668,15 @@ static int init_tracefs(void)
>
> hwlat_sample_window = tracefs_create_file("window", 0640,
> top_dir,
> - &hwlat_data.sample_window,
> - &window_fops);
> + &hwlat_window,
> + &trace_ull_config_fops);
> if (!hwlat_sample_window)
> goto err;
>
> hwlat_sample_width = tracefs_create_file("width", 0644,
> top_dir,
> - &hwlat_data.sample_width,
> - &width_fops);
> + &hwlat_width,
> + &trace_ull_config_fops);
> if (!hwlat_sample_width)
> goto err;
>

2021-06-03 21:30:53

by Steven Rostedt

Subject: Re: [PATCH V3 7/9] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers

On Fri, 14 May 2021 22:51:16 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> +++ b/include/trace/trace_events.h
> @@ -358,6 +358,21 @@ TRACE_MAKE_SYSTEM_STR();
> trace_print_hex_dump_seq(p, prefix_str, prefix_type, \
> rowsize, groupsize, buf, len, ascii)
>
> +#undef __print_ns_to_secs
> +#define __print_ns_to_secs(value) \
> + ({ \
> + u64 ____val = (u64)value; \
> + do_div(____val, NSEC_PER_SEC); \
> + ____val; \
> + })

I know my name is on this, but we need parentheses around "value".

> +
> +#undef __print_ns_without_secs
> +#define __print_ns_without_secs(value) \
> + ({ \
> + u64 ____val = (u64)value; \

Here too.

> + (u32) do_div(____val, NSEC_PER_SEC); \
> + })
> +
> #undef DECLARE_EVENT_CLASS
> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> static notrace enum print_line_t \
> @@ -736,6 +751,16 @@ static inline void ftrace_test_probe_##call(void) \
> #undef __print_array
> #undef __print_hex_dump
>
> +/*
> + * The below is not executed in the kernel. It is only what is
> + * displayed in the print format for userspace to parse.
> + */
> +#undef __print_ns_to_secs
> +#define __print_ns_to_secs(val) val / 1000000000UL
> +
> +#undef __print_ns_without_secs
> +#define __print_ns_without_secs(val) val % 1000000000UL

And around "val" in the above two macros.
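
Without them, an expression argument binds wrongly after expansion; e.g.,
the userspace version of __print_ns_to_secs(a + b) would expand to
a + b / 1000000000UL. A sketch of the fixed kernel-side macros:

#undef __print_ns_to_secs
#define __print_ns_to_secs(value)			\
	({						\
		u64 ____val = (u64)(value);		\
		do_div(____val, NSEC_PER_SEC);		\
		____val;				\
	})

#undef __print_ns_without_secs
#define __print_ns_without_secs(value)			\
	({						\
		u64 ____val = (u64)(value);		\
		(u32) do_div(____val, NSEC_PER_SEC);	\
	})

and of the print-format versions:

#undef __print_ns_to_secs
#define __print_ns_to_secs(val) (val) / 1000000000UL

#undef __print_ns_without_secs
#define __print_ns_without_secs(val) (val) % 1000000000UL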

-- Steve

> +
> #undef TP_printk
> #define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)
>

2021-06-03 21:32:52

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Fri, 14 May 2021 22:51:17 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> +static void osnoise_tracer_start(struct trace_array *tr)
> +{
> + int retval;
> +
> + if (osnoise_busy)
> + return;
> +

I need more time to look at this patch, but "git show" points out that
there's a spurious tab in the above line.

-- Steve


> + osn_var_reset_all();
> +
> + retval = osnoise_hook_events();
> + if (retval)
> + goto out_err;

2021-06-04 04:21:37

by Joe Perches

Subject: Re: [PATCH V3 7/9] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers

On Thu, 2021-06-03 at 17:29 -0400, Steven Rostedt wrote:
> > +++ b/include/trace/trace_events.h
> > @@ -358,6 +358,21 @@ TRACE_MAKE_SYSTEM_STR();
> > trace_print_hex_dump_seq(p, prefix_str, prefix_type, \
> > rowsize, groupsize, buf, len, ascii)
> >
> >
> > +#undef __print_ns_to_secs
> > +#define __print_ns_to_secs(value) \
> > + ({ \
> > + u64 ____val = (u64)value; \
> > + do_div(____val, NSEC_PER_SEC); \
> > + ____val; \
> > + })
>
> I know my name is on this, but we need parentheses around "value".

If tracing cleanups for trace_events.h are being done, perhaps
another bit of untidiness is the macro definition and uses of
__assign_str.

$ git grep -w -1 __assign_str include/trace/trace_events.h
include/trace/trace_events.h-
include/trace/trace_events.h:#undef __assign_str
include/trace/trace_events.h:#define __assign_str(dst, src) \
include/trace/trace_events.h- strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");

Its definition has a semicolon as do most uses but a dozen handfuls of
other uses do not have a semicolon. It'd be more consistent to add a
semicolon to the uses without them and when done treewide, then remove
the semicolon from the macro declaration.

$ git grep -P '\b__assign_str\b' | wc -l
551
$ git grep -P '\b__assign_str\b.*;' | wc -l
480

For instance: sunrpc.h has a mixture of uses with and without semicolons.

$ git grep -P '\b__assign_str\b' include/trace/events/sunrpc.h | wc -l
65
$ git grep -P '\b__assign_str\b.*;' include/trace/events/sunrpc.h | wc -l
38

It's especially odd in the last block below to have 3 successive uses
where the first and last have a semicolon, but the middle one does not.

So perhaps as a start:
---
include/trace/events/sunrpc.h | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index d02e01a27b690..861f199896c6a 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -154,8 +154,8 @@ TRACE_EVENT(rpc_clnt_new,
__entry->client_id = clnt->cl_clid;
__assign_str(addr, xprt->address_strings[RPC_DISPLAY_ADDR]);
__assign_str(port, xprt->address_strings[RPC_DISPLAY_PORT]);
- __assign_str(program, program)
- __assign_str(server, server)
+ __assign_str(program, program);
+ __assign_str(server, server);
),

TP_printk("client=%u peer=[%s]:%s program=%s server=%s",
@@ -180,8 +180,8 @@ TRACE_EVENT(rpc_clnt_new_err,

TP_fast_assign(
__entry->error = error;
- __assign_str(program, program)
- __assign_str(server, server)
+ __assign_str(program, program);
+ __assign_str(server, server);
),

TP_printk("program=%s server=%s error=%d",
@@ -284,8 +284,8 @@ TRACE_EVENT(rpc_request,
__entry->client_id = task->tk_client->cl_clid;
__entry->version = task->tk_client->cl_vers;
__entry->async = RPC_IS_ASYNC(task);
- __assign_str(progname, task->tk_client->cl_program->name)
- __assign_str(procname, rpc_proc_name(task))
+ __assign_str(progname, task->tk_client->cl_program->name);
+ __assign_str(procname, rpc_proc_name(task));
),

TP_printk("task:%u@%u %sv%d %s (%ssync)",
@@ -494,10 +494,10 @@ DECLARE_EVENT_CLASS(rpc_reply_event,
__entry->task_id = task->tk_pid;
__entry->client_id = task->tk_client->cl_clid;
__entry->xid = be32_to_cpu(task->tk_rqstp->rq_xid);
- __assign_str(progname, task->tk_client->cl_program->name)
+ __assign_str(progname, task->tk_client->cl_program->name);
__entry->version = task->tk_client->cl_vers;
- __assign_str(procname, rpc_proc_name(task))
- __assign_str(servername, task->tk_xprt->servername)
+ __assign_str(procname, rpc_proc_name(task));
+ __assign_str(servername, task->tk_xprt->servername);
),

TP_printk("task:%u@%d server=%s xid=0x%08x %sv%d %s",
@@ -622,8 +622,8 @@ TRACE_EVENT(rpc_stats_latency,
__entry->task_id = task->tk_pid;
__entry->xid = be32_to_cpu(task->tk_rqstp->rq_xid);
__entry->version = task->tk_client->cl_vers;
- __assign_str(progname, task->tk_client->cl_program->name)
- __assign_str(procname, rpc_proc_name(task))
+ __assign_str(progname, task->tk_client->cl_program->name);
+ __assign_str(procname, rpc_proc_name(task));
__entry->backlog = ktime_to_us(backlog);
__entry->rtt = ktime_to_us(rtt);
__entry->execute = ktime_to_us(execute);
@@ -669,15 +669,15 @@ TRACE_EVENT(rpc_xdr_overflow,
__entry->task_id = task->tk_pid;
__entry->client_id = task->tk_client->cl_clid;
__assign_str(progname,
- task->tk_client->cl_program->name)
+ task->tk_client->cl_program->name);
__entry->version = task->tk_client->cl_vers;
- __assign_str(procedure, task->tk_msg.rpc_proc->p_name)
+ __assign_str(procedure, task->tk_msg.rpc_proc->p_name);
} else {
__entry->task_id = 0;
__entry->client_id = 0;
- __assign_str(progname, "unknown")
+ __assign_str(progname, "unknown");
__entry->version = 0;
- __assign_str(procedure, "unknown")
+ __assign_str(procedure, "unknown");
}
__entry->requested = requested;
__entry->end = xdr->end;
@@ -735,9 +735,9 @@ TRACE_EVENT(rpc_xdr_alignment,
__entry->task_id = task->tk_pid;
__entry->client_id = task->tk_client->cl_clid;
__assign_str(progname,
- task->tk_client->cl_program->name)
+ task->tk_client->cl_program->name);
__entry->version = task->tk_client->cl_vers;
- __assign_str(procedure, task->tk_msg.rpc_proc->p_name)
+ __assign_str(procedure, task->tk_msg.rpc_proc->p_name);

__entry->offset = offset;
__entry->copied = copied;
@@ -1107,9 +1107,9 @@ TRACE_EVENT(xprt_retransmit,
__entry->xid = be32_to_cpu(rqst->rq_xid);
__entry->ntrans = rqst->rq_ntrans;
__assign_str(progname,
- task->tk_client->cl_program->name)
+ task->tk_client->cl_program->name);
__entry->version = task->tk_client->cl_vers;
- __assign_str(procedure, task->tk_msg.rpc_proc->p_name)
+ __assign_str(procedure, task->tk_msg.rpc_proc->p_name);
),

TP_printk(
@@ -1842,7 +1842,7 @@ TRACE_EVENT(svc_xprt_accept,

TP_fast_assign(
__assign_str(addr, xprt->xpt_remotebuf);
- __assign_str(protocol, xprt->xpt_class->xcl_name)
+ __assign_str(protocol, xprt->xpt_class->xcl_name);
__assign_str(service, service);
),


Subject: Re: [PATCH V3 4/9] tracing/hwlat: Implement the per-cpu mode

On 6/3/21 11:17 PM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:13 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> void trace_hwlat_callback(bool enter)
>> {
>> - if (smp_processor_id() != nmi_cpu)
>> + struct hwlat_kthread_data *kdata = get_cpu_data();
>> +
>> + if (kdata->kthread)
>
> Shouldn't that be:
>
> if (!kdata->kthread)

oops! Fixing in v4.

-- Daniel


> ?
>
> -- Steve
>
>> return;
>>
>> /*
>> @@ -158,13 +173,13 @@ void trace_hwlat_callback(bool enter)
>> */
>> if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
>> if (enter)
>> - nmi_ts_start = time_get();
>> + kdata->nmi_ts_start = time_get();
>> else
>> - nmi_total_ts += time_get() - nmi_ts_start;
>> + kdata->nmi_total_ts += time_get() - kdata->nmi_ts_start;
>> }
>>
>> if (enter)
>> - nmi_count++;
>> + kdata->nmi_count++;
>> }
>>
>

Subject: Re: [PATCH V3 5/9] tracing/trace: Add a generic function to read/write u64 values from tracefs

On 6/3/21 11:22 PM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:14 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> Provides a generic read and write implementation to save/read u64 values
>> from a file on tracefs. The trace_ull_config structure defines where to
>> read/write the value, the min and the max acceptable values, and a lock
>> to protect the write.
>
> This states what the patch is doing, but does not say why it is doing it.

Yeah...

>>
>> Cc: Jonathan Corbet <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Alexandre Chartre <[email protected]>
>> Cc: Clark Williams <[email protected]>
>> Cc: John Kacur <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>> ---
>> kernel/trace/trace.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
>> kernel/trace/trace.h | 19 ++++++++++
>> 2 files changed, 106 insertions(+)
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index 560e4c8d3825..b4cd89010813 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -7516,6 +7516,93 @@ static const struct file_operations snapshot_raw_fops = {
>>
>> #endif /* CONFIG_TRACER_SNAPSHOT */
>>
>> +/*
>> + * trace_ull_config_write - Generic write function to save u64 value
>
>
> That is a horrible name. What the hell is the "config"?
>
>> + * @filp: The active open file structure
>> + * @ubuf: The userspace provided buffer to read value into
>> + * @cnt: The maximum number of bytes to read
>> + * @ppos: The current "file" position
>> + *
>> + * This function provides a generic write implementation to save u64 values
>> + * from a file on tracefs. The filp->private_data must point to a
>> + * trace_ull_config structure that defines where to write the value, the
>> + * min and the max acceptable values, and a lock to protect the write.
>
> This doesn't seem to be a generic way to save 64 bit values (which I still
> don't understand, because unsigned long long should work too). But it looks
> like the rationale is for having some kind of generic way to read 64 bit
> values giving them a min and a max.
>
> I see this is used later, but this patch needs to be rewritten. It makes no
> sense.

The reason for this patch is that hwlat, osnoise, and timerlat have "u64 config"
options that are read/write via tracefs "files." In the previous version, I had
multiple functions doing basically the same thing:

A write function that:
read a u64 from user-space
get a lock,
check for min/max acceptable values
save the value
release the lock.

and a read function that:
write the config value to the "read" buffer.

And so, I tried to come up with a way to avoid code duplication.
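
Concretely, the write side boils down to something like this (a minimal
sketch reconstructed from the patch, with the details trimmed):

static ssize_t
trace_ull_config_write(struct file *filp, const char __user *ubuf,
		       size_t cnt, loff_t *ppos)
{
	struct trace_ull_config *config = filp->private_data;
	u64 val;
	int err;

	if (!config)
		return -EFAULT;

	/* read a u64 from user-space */
	err = kstrtoull_from_user(ubuf, cnt, 10, &val);
	if (err)
		return err;

	/* get the lock, check the min/max acceptable values, save the value */
	mutex_lock(config->lock);

	if (config->min && val < *config->min)
		err = -EINVAL;
	else if (config->max && val > *config->max)
		err = -EINVAL;
	else
		*config->val = val;

	/* release the lock */
	mutex_unlock(config->lock);

	if (err)
		return err;

	return cnt;
}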

question: is it only the names that are bad (I agree that they are bad), or do you
think that the overall idea is bad? :-)

Suggestions?

-- Daniel

> -- Steve

Subject: Re: [PATCH V3 7/9] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers

On 6/3/21 11:29 PM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:16 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> +++ b/include/trace/trace_events.h
>> @@ -358,6 +358,21 @@ TRACE_MAKE_SYSTEM_STR();
>> trace_print_hex_dump_seq(p, prefix_str, prefix_type, \
>> rowsize, groupsize, buf, len, ascii)
>>
>> +#undef __print_ns_to_secs
>> +#define __print_ns_to_secs(value) \
>> + ({ \
>> + u64 ____val = (u64)value; \
>> + do_div(____val, NSEC_PER_SEC); \
>> + ____val; \
>> + })
>
> I know my name is on this, but we need parentheses around "value".
>
>> +
>> +#undef __print_ns_without_secs
>> +#define __print_ns_without_secs(value) \
>> + ({ \
>> + u64 ____val = (u64)value; \
>
> Here too.
>
>> + (u32) do_div(____val, NSEC_PER_SEC); \
>> + })
>> +
>> #undef DECLARE_EVENT_CLASS
>> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
>> static notrace enum print_line_t \
>> @@ -736,6 +751,16 @@ static inline void ftrace_test_probe_##call(void) \
>> #undef __print_array
>> #undef __print_hex_dump
>>
>> +/*
>> + * The below is not executed in the kernel. It is only what is
>> + * displayed in the print format for userspace to parse.
>> + */
>> +#undef __print_ns_to_secs
>> +#define __print_ns_to_secs(val) val / 1000000000UL
>> +
>> +#undef __print_ns_without_secs
>> +#define __print_ns_without_secs(val) val % 1000000000UL
>
> And around "val" in the above two macros.

Fixing in the v4.

Thanks Steven!
-- Daniel

> -- Steve
>
>> +
>> #undef TP_printk
>> #define TP_printk(fmt, args...) "\"" fmt "\", " __stringify(args)
>>
>

2021-06-04 16:21:35

by Steven Rostedt

Subject: Re: [PATCH V3 5/9] tracing/trace: Add a generic function to read/write u64 values from tracefs

On Fri, 4 Jun 2021 18:05:06 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

>
> The reason for this patch is that hwlat, osnoise, and timerlat have "u64 config"
> options that are read/write via tracefs "files." In the previous version, I had
> multiple functions doing basically the same thing:
>
> A write function that:
> read a u64 from user-space
> get a lock,
> check for min/max acceptable values
> save the value
> release the lock.
>
> and a read function that:
> write the config value to the "read" buffer.
>
> And so, I tried to come up with a way to avoid code duplication.
>
> question: is it only the names that are bad (I agree that they are bad), or do you
> think that the overall idea is bad? :-)
>
> Suggestions?

I don't think the overall idea is bad, if it is what I think you are
doing. I just don't believe you articulated what you are doing.

It has nothing to do with 64 bit reads and writes, but instead has to
do with reading and writing values that depend on each other for what
is acceptable.

Perhaps have it called trace_min_max_write() and trace_min_max_read(),
and document what it is used for.
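
Something along these lines, perhaps (a sketch; the struct and fops names
are only extrapolated from the suggested function names):

/*
 * A min/max parameter: a u64 value bounded by optional minimum and
 * maximum values, protected by an existing lock while being updated.
 */
struct trace_min_max_param {
	struct mutex	*lock;
	u64		*val;
	u64		*max;
	u64		*min;
};

extern const struct file_operations trace_min_max_fops;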

-- Steve

2021-06-04 16:23:54

by Steven Rostedt

Subject: Re: [PATCH V3 7/9] tracing: Add __print_ns_to_secs() and __print_ns_without_secs() helpers

On Thu, 03 Jun 2021 21:19:50 -0700
Joe Perches <[email protected]> wrote:

> If tracing cleanups for trace_events.h are being done, perhaps
> another bit of untidiness is the macro definition and uses of
> __assign_str.

This isn't a tracing cleanup, but adding new functionality.

That said,

>
> $ git grep -w -1 __assign_str include/trace/trace_events.h
> include/trace/trace_events.h-
> include/trace/trace_events.h:#undef __assign_str
> include/trace/trace_events.h:#define __assign_str(dst, src) \
> include/trace/trace_events.h- strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
>
> Its definition has a semicolon as do most uses but a dozen handfuls of
> other uses do not have a semicolon. It'd be more consistent to add a
> semicolon to the uses without them and when done treewide, then remove
> the semicolon from the macro declaration.

I have no problem taking a cleanup patch that adds semicolons to all
use cases of "__assign_str()" and even removes the one from where it is
defined. As long as it doesn't break any builds, I'm fine with that.

-- Steve

Subject: Re: [PATCH V3 5/9] tracing/trace: Add a generic function to read/write u64 values from tracefs

On 6/4/21 6:18 PM, Steven Rostedt wrote:
> On Fri, 4 Jun 2021 18:05:06 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>>
>> The reason for this patch is that hwlat, osnoise, and timerlat have "u64 config"
>> options that are read/write via tracefs "files." In the previous version, I had
>> multiple functions doing basically the same thing:
>>
>> A write function that:
>> read a u64 from user-space
>> get a lock,
>> check for min/max acceptable values
>> save the value
>> release the lock.
>>
>> and a read function that:
>> write the config value to the "read" buffer.
>>
>> And so, I tried to come up with a way to avoid code duplication.
>>
>> question: is it only the names that are bad (I agree that they are bad), or do you
>> think that the overall idea is bad? :-)
>>
>> Suggestions?
>
> I don't think the overall idea is bad, if it is what I think you are
> doing. I just don't believe you articulated what you are doing.

I see!

> It has nothing to do with 64 bit reads and writes, but instead has to
> do with reading and writing values that depend on each other for what
> is acceptable.

yeah, that is a better (starting point for an) explanation.

> Perhaps have it called trace_min_max_write() and trace_min_max_read(),
> and document what it is used for.

I will do that!

-- Daniel

> -- Steve
>

Subject: Re: [PATCH V3 6/9] trace/hwlat: Use the generic function to read/write width and window

On 6/3/21 11:27 PM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:15 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> @@ -733,16 +624,18 @@ static ssize_t hwlat_mode_write(struct file *filp, const char __user *ubuf,
>> return ret;
>> }
>>
>> -static const struct file_operations width_fops = {
>> - .open = tracing_open_generic,
>> - .read = hwlat_read,
>> - .write = hwlat_width_write,
>> +static struct trace_ull_config hwlat_width = {
>> + .lock = &hwlat_data.lock,
>> + .val = &hwlat_data.sample_width,
>> + .max = &hwlat_data.sample_window,
>> + .min = NULL,
>> };
>>
>> -static const struct file_operations window_fops = {
>> - .open = tracing_open_generic,
>> - .read = hwlat_read,
>> - .write = hwlat_window_write,
>> +static struct trace_ull_config hwlat_window = {
> Yeah, the naming convention needs to be changed, because ull_config is
> meaningless, and this code makes no sense. I know what it is doing, but if
> I didn't, I'd have no clue what it was doing by reading it. :-p

I will rework the patch 5/9 to add a better explanation for the read/write
functions, and I will add comments to this patch, explaining the reason for the
min/max values.
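
For instance, something like this for the width/window pair (a sketch of
the comments I have in mind, keeping the current struct and field names):

static struct trace_ull_config hwlat_width = {
	.lock	= &hwlat_data.lock,
	.val	= &hwlat_data.sample_width,
	.max	= &hwlat_data.sample_window,	/* the width cannot exceed the window */
	.min	= NULL,				/* no lower bound */
};

static struct trace_ull_config hwlat_window = {
	.lock	= &hwlat_data.lock,
	.val	= &hwlat_data.sample_window,
	.max	= NULL,				/* no upper bound */
	.min	= &hwlat_data.sample_width,	/* the window cannot be smaller than the width */
};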

Sound good?

-- Daniel

> -- Steve
>
>

2021-06-04 20:51:47

by Steven Rostedt

Subject: Re: [PATCH V3 6/9] trace/hwlat: Use the generic function to read/write width and window

On Fri, 4 Jun 2021 18:36:58 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> I will rework the patch 5/9 to add a better explanation for the read/write
> functions, and I will add comments to this patch, explaining the reason for the
> min/max values.
>
> Sound good?

We'll see in v4 ;-)

-- Steve

2021-06-04 21:30:29

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Fri, 14 May 2021 22:51:17 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> Tracer options
>
> The tracer has a set of options inside the osnoise directory, they are:
>
> - osnoise/cpus: CPUs at which an osnoise thread will execute.
> - osnoise/period_us: the period of the osnoise thread.
> - osnoise/runtime_us: how long an osnoise thread will look for noise.
> - osnoise/stop_tracing_in_us: stop the system tracing if a single noise
> higher than the configured value happens. Writing 0 disables this
> option.
> - osnoise/stop_tracing_out_us: stop the system tracing if total noise
> higher than the configured value happens. Writing 0 disables this
> option.

The "in" vs "out" is confusing. Perhaps we should call it:

stop_tracing_single_us and stop_tracing_total_us ?

> - tracing_threshold: the minimum delta between two time() reads to be
> considered as noise, in us. When set to 0, the minimum valid value
> will be used, which is currently 1 us.
>
> Additional Tracing
>
> In addition to the tracer, a set of tracepoints were added to
> facilitate the identification of the osnoise source.
>
> - osnoise:sample_threshold: printed anytime a noise is higher than
> the configurable tolerance_ns.
> - osnoise:nmi_noise: noise from NMI, including the duration.
> - osnoise:irq_noise: noise from an IRQ, including the duration.
> - osnoise:softirq_noise: noise from a SoftIRQ, including the
> duration.
> - osnoise:thread_noise: noise from a thread, including the duration.
>
> Note that all the values are *net values*. For example, if while osnoise
> is running, another thread preempts the osnoise thread, it will start a
> thread_noise duration at the start. Then, an IRQ takes place, preempting
> the thread_noise, starting a irq_noise. When the IRQ ends its execution,
> it will compute its duration, and this duration will be subtracted from
> the thread_noise, starting an irq_noise. When the IRQ ends its execution,
> IRQ execution. This logic is valid for all sources of noise.
>
> Here is one example of the usage of these tracepoints::
>
> osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
> osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
> migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
> osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
>
> In this example, a noise sample of 8 microseconds was reported in the last
> line, pointing to two interferences. Looking backward in the trace, the
> two previous entries were about the migration thread running after a
> timer IRQ execution. The first event is not part of the noise because
> it took place one millisecond before.
>
> It is worth noticing that the sum of the durations reported in the
> tracepoints is smaller than the eight us reported in the sample_threshold.
> The reason lies in the overhead of the entry and exit code that runs
> before and after any interference execution. This justifies the dual
> approach: measuring thread and tracing.
>



> --- /dev/null
> +++ b/kernel/trace/trace_osnoise.c
> @@ -0,0 +1,1563 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * OS Noise Tracer: computes the OS Noise suffered by a running thread.
> + *
> + * Based on "hwlat_detector" tracer by:
> + * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
> + * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <[email protected]>
> + * With feedback from Clark Williams <[email protected]>
> + *
> + * And also based on the rtsl tracer presented on:
> + * DE OLIVEIRA, Daniel Bristot, et al. Demystifying the real-time linux
> + * scheduling latency. In: 32nd Euromicro Conference on Real-Time Systems
> + * (ECRTS 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020.
> + *
> + * Copyright (C) 2021 Daniel Bristot de Oliveira, Red Hat, Inc. <[email protected]>
> + */
> +
> +#include <linux/kthread.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <linux/cpumask.h>
> +#include <linux/delay.h>
> +#include <linux/sched/clock.h>
> +#include <linux/sched.h>
> +#include "trace.h"
> +
> +#ifdef CONFIG_X86_LOCAL_APIC
> +#include <asm/trace/irq_vectors.h>
> +#undef TRACE_INCLUDE_PATH
> +#undef TRACE_INCLUDE_FILE
> +#endif /* CONFIG_X86_LOCAL_APIC */
> +
> +#include <trace/events/irq.h>
> +#include <trace/events/sched.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/osnoise.h>
> +
> +static struct trace_array *osnoise_trace;
> +
> +/*
> + * Default values.
> + */
> +#define BANNER "osnoise: "
> +#define DEFAULT_SAMPLE_PERIOD 1000000 /* 1s */
> +#define DEFAULT_SAMPLE_RUNTIME 1000000 /* 1s */
> +
> +/*
> + * NMI runtime info.
> + */
> +struct nmi {
> + u64 count;
> + u64 delta_start;
> +};
> +
> +/*
> + * IRQ runtime info.
> + */
> +struct irq {
> + u64 count;
> + u64 arrival_time;
> + u64 delta_start;
> +};
> +
> +/*
> + * SoftIRQ runtime info.
> + */
> +struct softirq {
> + u64 count;
> + u64 arrival_time;
> + u64 delta_start;
> +};
> +
> +/*
> + * Thread runtime info.
> + */
> +struct thread {

These are rather generic struct names that could possibly conflict
with something in the future. I would suggest adding "osn_" or something
in front of them:

struct osn_nmi
struct osn_irq
struct osn_thread

or something like that.


> + u64 count;
> + u64 arrival_time;
> + u64 delta_start;
> +};
> +
> +/*
> + * Runtime information: this structure saves the runtime information used by
> + * one sampling thread.
> + */
> +struct osnoise_variables {
> + struct task_struct *kthread;
> + bool sampling;
> + pid_t pid;
> + struct nmi nmi;
> + struct irq irq;
> + struct softirq softirq;
> + struct thread thread;
> + local_t int_counter;
> +};

Nit, but it looks better to add tabs to the fields:

struct osnoise_variables {
sturct task_struct *kthread;
bool sampling;
pid_t pid;
struct nmi nmi;
struct irq irq;
struct softirq softirq;
struct thread thread;
local_t int_counter;
};

See, it's much easier to see the fields of the structure this way.


> +
> +/*
> + * Per-cpu runtime information.
> + */
> +DEFINE_PER_CPU(struct osnoise_variables, per_cpu_osnoise_var);
> +
> +/**
> + * this_cpu_osn_var - Return the per-cpu osnoise_variables of the current CPU
> + */
> +static inline struct osnoise_variables *this_cpu_osn_var(void)
> +{
> + return this_cpu_ptr(&per_cpu_osnoise_var);
> +}
> +
> +/**
> + * osn_var_reset - Reset the values of the given osnoise_variables
> + */
> +static inline void osn_var_reset(struct osnoise_variables *osn_var)
> +{
> + /*
> + * So far, all the values are initialized as 0, so
> + * zeroing the structure is perfect.
> + */
> + memset(osn_var, 0, sizeof(struct osnoise_variables));

I'm one of those that prefer:

memset(osn_var, 0, sizeof(*osn_var))

Just in case something changes, you don't need to modify the memset.


> +}
> +
> +/**
> + * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
> + */
> +static inline void osn_var_reset_all(void)
> +{
> + struct osnoise_variables *osn_var;
> + int cpu;
> +
> + for_each_cpu(cpu, cpu_online_mask) {
> + osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> + osn_var_reset(osn_var);
> + }
> +}
> +
> +/*
> + * Tells NMIs to call back to the osnoise tracer to record timestamps.
> + */
> +bool trace_osnoise_callback_enabled;
> +
> +/*
> + * osnoise sample structure definition. Used to store the statistics of a
> + * sample run.
> + */
> +struct osnoise_sample {
> + u64 runtime; /* runtime */
> + u64 noise; /* noise */
> + u64 max_sample; /* max single noise sample */
> + int hw_count; /* # HW (incl. hypervisor) interference */
> + int nmi_count; /* # NMIs during this sample */
> + int irq_count; /* # IRQs during this sample */
> + int softirq_count; /* # SoftIRQs during this sample */
> + int thread_count; /* # Threads during this sample */
> +};
> +
> +/*
> + * Protect the interface.
> + */
> +struct mutex interface_lock;
> +
> +/*
> + * Tracer data.
> + */
> +static struct osnoise_data {
> + u64 sample_period; /* total sampling period */
> + u64 sample_runtime; /* active sampling portion of period */
> + u64 stop_tracing_in; /* stop trace in the inside operation (loop) */
> + u64 stop_tracing_out; /* stop trace in the outside operation (report) */
> +} osnoise_data = {
> + .sample_period = DEFAULT_SAMPLE_PERIOD,
> + .sample_runtime = DEFAULT_SAMPLE_RUNTIME,
> + .stop_tracing_in = 0,
> + .stop_tracing_out = 0,
> +};
> +
> +/*
> + * Boolean variable used to inform that the tracer is currently sampling.
> + */
> +static bool osnoise_busy;
> +
> +/*
> + * Print the osnoise header info.
> + */
> +static void print_osnoise_headers(struct seq_file *s)
> +{
> + seq_puts(s, "# _-----=> irqs-off\n");
> + seq_puts(s, "# / _----=> need-resched\n");
> + seq_puts(s, "# | / _---=> hardirq/softirq\n");
> + seq_puts(s, "# || / _--=> preempt-depth ");
> + seq_puts(s, " MAX\n");
> +
> + seq_puts(s, "# || / ");
> + seq_puts(s, " SINGLE Interference counters:\n");
> +
> + seq_puts(s, "# |||| RUNTIME ");
> + seq_puts(s, " NOISE %% OF CPU NOISE +-----------------------------+\n");
> +
> + seq_puts(s, "# TASK-PID CPU# |||| TIMESTAMP IN US ");
> + seq_puts(s, " IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD\n");
> +
> + seq_puts(s, "# | | | |||| | | ");
> + seq_puts(s, " | | | | | | | |\n");
> +}
> +
> +/*
> + * Record an osnoise_sample into the tracer buffer.
> + */
> +static void trace_osnoise_sample(struct osnoise_sample *sample)
> +{
> + struct trace_array *tr = osnoise_trace;
> + struct trace_buffer *buffer = tr->array_buffer.buffer;
> + struct trace_event_call *call = &event_osnoise;
> + struct ring_buffer_event *event;
> + struct osnoise_entry *entry;
> +
> + event = trace_buffer_lock_reserve(buffer, TRACE_OSNOISE, sizeof(*entry),
> + tracing_gen_ctx());
> + if (!event)
> + return;
> + entry = ring_buffer_event_data(event);
> + entry->runtime = sample->runtime;
> + entry->noise = sample->noise;
> + entry->max_sample = sample->max_sample;
> + entry->hw_count = sample->hw_count;
> + entry->nmi_count = sample->nmi_count;
> + entry->irq_count = sample->irq_count;
> + entry->softirq_count = sample->softirq_count;
> + entry->thread_count = sample->thread_count;
> +
> + if (!call_filter_check_discard(call, entry, buffer, event))
> + trace_buffer_unlock_commit_nostack(buffer, event);
> +}
> +
> +/**
> + * Macros to encapsulate the time capturing infrastructure.
> + */
> +#define time_get() trace_clock_local()
> +#define time_to_us(x) div_u64(x, 1000)
> +#define time_sub(a, b) ((a) - (b))
> +
> +/**
> + * cond_move_irq_delta_start - Forward the delta_start of a running IRQ
> + *
> + * If an IRQ is preempted by an NMI, its delta_start is pushed forward
> + * to discount the NMI interference.
> + *
> + * See get_int_safe_duration().
> + */
> +static inline void
> +cond_move_irq_delta_start(struct osnoise_variables *osn_var, u64 duration)
> +{
> + if (osn_var->irq.delta_start)
> + osn_var->irq.delta_start += duration;
> +}
> +
> +#ifndef CONFIG_PREEMPT_RT
> +/**
> + * cond_move_softirq_delta_start - Forward the delta_start of a running SoftIRQ
> + *
> + * If a SoftIRQ is preempted by an IRQ or NMI, its delta_start is pushed
> + * forward to discount the interference.
> + *
> + * See get_int_safe_duration().
> + */
> +static inline void
> +cond_move_softirq_delta_start(struct osnoise_variables *osn_var, u64 duration)
> +{
> + if (osn_var->softirq.delta_start)
> + osn_var->softirq.delta_start += duration;
> +}
> +#else /* CONFIG_PREEMPT_RT */
> +#define cond_move_softirq_delta_start(osn_var, duration) do {} while (0)
> +#endif
> +
> +/**

Don't use the "/**" notation if you are not following kernel doc (which
means you need to also document each parameter and return value). The
'/**' is usually reserved for non static functions that are an API for
other parts of the kernel. Use just '/*' for these (unless you want to
add full kernel doc notation).

> + * cond_move_thread_delta_start - Forward the delta_start of a running thread
> + *
> + * If a noisy thread is preempted by an Softirq, IRQ or NMI, its delta_start
> + * is pushed forward to discount the interference.
> + *
> + * See get_int_safe_duration().
> + */
> +static inline void
> +cond_move_thread_delta_start(struct osnoise_variables *osn_var, u64 duration)
> +{
> + if (osn_var->thread.delta_start)
> + osn_var->thread.delta_start += duration;
> +}
> +
> +/**
> + * get_int_safe_duration - Get the duration of a window
> + *
> + * The irq, softirq and thread variables need to have their duration without
> + * the interference from higher priority interrupts. Instead of keeping a
> + * variable to discount the interrupt interference from these variables, the
> + * starting time of these variables is pushed forward with the interrupt's
> + * duration. In this way, a single variable is used to:
> + *
> + * - Know if a given window is being measured.
> + * - Account its duration.
> + * - Discount the interference.
> + *
> + * To avoid getting inconsistent values, e.g.,:
> + *
> + * now = time_get()
> + * ---> interrupt!
> + * delta_start -= int duration;
> + * <---
> + * duration = now - delta_start;
> + *
> + * result: negative duration if the variable duration before the
> + * interrupt was smaller than the interrupt execution.
> + *
> + * A counter of interrupts is used. If the counter increased, try
> + * to capture an interference safe duration.
> + */
> +static inline s64
> +get_int_safe_duration(struct osnoise_variables *osn_var, u64 *delta_start)
> +{
> + u64 int_counter, now;
> + s64 duration;
> +
> + do {
> + int_counter = local_read(&osn_var->int_counter);
> + /* synchronize with interrupts */
> + barrier();
> +
> + now = time_get();
> + duration = (now - *delta_start);
> +
> + /* synchronize with interrupts */
> + barrier();
> + } while (int_counter != local_read(&osn_var->int_counter));
> +
> + /*
> + * This is an evidence of race conditions that cause
> + * a value to be "discounted" too much.
> + */
> + if (duration < 0)
> + pr_err("int safe negative!\n");

Probably want to have this happen at most once a run. If something were
to break, I don't think we want this to live lock the machine doing
tons of prints. We could have a variable stored on the
osnoise_variables that states this was printed. Check that variable to
see if it wasn't printed during a run (when current_tracer was set),
and print only once if it is.
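
Something like this, maybe (a sketch; the flag is made up and would need
to be cleared when the tracer is restarted):

	/*
	 * Evidence of a race condition that "discounted" a value too
	 * much; warn at most once per run to avoid flooding the log.
	 */
	if (duration < 0 && !osn_var->reported_neg_duration) {
		osn_var->reported_neg_duration = true;
		pr_err("int safe negative!\n");
	}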


> +
> + *delta_start = 0;
> +
> + return duration;
> +}
> +
> +/**
> + * set_int_safe_time - Save the current time on *time, aware of interference
> + *
> + * Get the time, taking into consideration a possible interference from
> + * higher priority interrupts.
> + *
> + * See get_int_safe_duration() for an explanation.
> + */
> +static u64
> +set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
> +{
> + u64 int_counter;
> +
> + do {
> + int_counter = local_read(&osn_var->int_counter);
> + /* synchronize with interrupts */
> + barrier();
> +
> + *time = time_get();

time_get() is trace_clock_local() which is used in tracing, and should
not be affected by interrupts. Why the loop? Even if using the generic
NMI-unsafe version of sched_clock, it does its own loop.

> +
> + /* synchronize with interrupts */
> + barrier();
> + } while (int_counter != local_read(&osn_var->int_counter));
> +
> + return int_counter;
> +}
> +
> +/**
> + * trace_osnoise_callback - NMI entry/exit callback
> + *
> + * This function is called by the NMI entry and exit code. The bool enter
> + * distinguishes between either case. This function is used to note a NMI
> + * occurrence, compute the noise caused by the NMI, and to remove the noise
> + * it is potentially causing on other interference variables.
> + */
> +void trace_osnoise_callback(bool enter)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + u64 duration;
> +
> + if (!osn_var->sampling)
> + return;
> +
> + /*
> + * Currently trace_clock_local() calls sched_clock() and the
> + * generic version is not NMI safe.
> + */
> + if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
> + if (enter) {
> + osn_var->nmi.delta_start = time_get();
> + local_inc(&osn_var->int_counter);
> + } else {
> + duration = time_get() - osn_var->nmi.delta_start;
> +
> + trace_nmi_noise(osn_var->nmi.delta_start, duration);
> +
> + cond_move_irq_delta_start(osn_var, duration);
> + cond_move_softirq_delta_start(osn_var, duration);
> + cond_move_thread_delta_start(osn_var, duration);
> + }
> + }
> +
> + if (enter)
> + osn_var->nmi.count++;
> +}
> +
> +/**
> + * __trace_irq_entry - Note the starting of an IRQ
> + *
> + * Save the starting time of an IRQ. As IRQs are non-preemptive to other IRQs,
> + * it is safe to use a single variable (osn_var->irq) to save the statistics.
> + * The arrival_time is used to report... the arrival time. The delta_start
> + * is used to compute the duration at the IRQ exit handler. See
> + * cond_move_irq_delta_start().
> + */
> +static inline void __trace_irq_entry(int id)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> +
> + if (!osn_var->sampling)
> + return;
> + /*
> + * This value will be used in the report, but not to compute
> + * the execution time, so it is safe to get it unsafe.
> + */
> + osn_var->irq.arrival_time = time_get();
> + set_int_safe_time(osn_var, &osn_var->irq.delta_start);
> + osn_var->irq.count++;
> +
> + local_inc(&osn_var->int_counter);
> +}
> +
> +/**
> + * __trace_irq_exit - Note the end of an IRQ, save data and trace
> + *
> + * Computes the duration of the IRQ noise, and traces it. Also discounts the
> + * interference from other sources of noise that could currently be accounted.
> + */
> +static inline void __trace_irq_exit(int id, const char *desc)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + int duration;
> +
> + if (!osn_var->sampling)
> + return;
> +
> + duration = get_int_safe_duration(osn_var, &osn_var->irq.delta_start);
> + trace_irq_noise(id, desc, osn_var->irq.arrival_time, duration);
> + osn_var->irq.arrival_time = 0;
> + cond_move_softirq_delta_start(osn_var, duration);
> + cond_move_thread_delta_start(osn_var, duration);
> +}
> +
> +/**
> + * trace_irqentry_callback - Callback to the irq:irq_entry traceevent
> + *
> + * Used to note the starting of an IRQ occurrence.
> + */
> +void trace_irqentry_callback(void *data, int irq, struct irqaction *action)
> +{
> + __trace_irq_entry(irq);
> +}
> +
> +/**
> + * trace_irqexit_callback - Callback to the irq:irq_exit traceevent
> + *
> + * Used to note the end of an IRQ occurrence.
> + */
> +void trace_irqexit_callback(void *data, int irq, struct irqaction *action, int ret)
> +{
> + __trace_irq_exit(irq, action->name);
> +}
> +
> +#ifdef CONFIG_X86_LOCAL_APIC

I wonder if we should move this into a separate file, making the
__trace_irq_entry() a more name space safe name and have it call that.
I have a bit of a distaste for arch specific code in a generic file.


> +/**
> + * trace_intel_irq_entry - record intel specific IRQ entry
> + */
> +void trace_intel_irq_entry(void *data, int vector)
> +{
> + __trace_irq_entry(vector);
> +}
> +
> +/**
> + * trace_intel_irq_exit - record intel specific IRQ exit
> + */
> +void trace_intel_irq_exit(void *data, int vector)
> +{
> + char *vector_desc = (char *) data;
> +
> + __trace_irq_exit(vector, vector_desc);
> +}
> +
> +/**
> + * register_intel_irq_tp - Register intel specific IRQ entry tracepoints
> + */
> +static int register_intel_irq_tp(void)
> +{
> + int ret;
> +
> + ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_err;
> +
> + ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
> + if (ret)
> + goto out_timer_entry;
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> + ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_timer_exit;
> +
> + ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
> + if (ret)
> + goto out_thermal_entry;
> +#endif /* CONFIG_X86_THERMAL_VECTOR */
> +
> +#ifdef CONFIG_X86_MCE_AMD
> + ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_thermal_exit;
> +
> + ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
> + if (ret)
> + goto out_deferred_entry;
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> + ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_deferred_exit;
> +
> + ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
> + if (ret)
> + goto out_threshold_entry;
> +#endif /* CONFIG_X86_MCE_THRESHOLD */
> +
> +#ifdef CONFIG_SMP
> + ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_threshold_exit;
> +
> + ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
> + "call_function_single");
> + if (ret)
> + goto out_call_function_single_entry;
> +
> + ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_call_function_single_exit;
> +
> + ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
> + if (ret)
> + goto out_call_function_entry;
> +
> + ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_call_function_exit;
> +
> + ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
> + if (ret)
> + goto out_reschedule_entry;
> +#endif /* CONFIG_SMP */
> +
> +#ifdef CONFIG_IRQ_WORK
> + ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_reschedule_exit;
> +
> + ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
> + if (ret)
> + goto out_irq_work_entry;
> +#endif
> +
> + ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_irq_work_exit;
> +
> + ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
> + if (ret)
> + goto out_x86_ipi_entry;
> +
> + ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_x86_ipi_exit;
> +
> + ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
> + if (ret)
> + goto out_error_apic_entry;
> +
> + ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
> + if (ret)
> + goto out_error_apic_exit;
> +
> + ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
> + if (ret)
> + goto out_spurious_apic_entry;
> +
> + return 0;
> +
> +out_spurious_apic_entry:
> + unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
> +out_error_apic_exit:
> + unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
> +out_error_apic_entry:
> + unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
> +out_x86_ipi_exit:
> + unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
> +out_x86_ipi_entry:
> + unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
> +out_irq_work_exit:
> +
> +#ifdef CONFIG_IRQ_WORK
> + unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
> +out_irq_work_entry:
> + unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
> +out_reschedule_exit:
> +#endif
> +
> +#ifdef CONFIG_SMP
> + unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
> +out_reschedule_entry:
> + unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
> +out_call_function_exit:
> + unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
> +out_call_function_entry:
> + unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
> +out_call_function_single_exit:
> + unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
> +out_call_function_single_entry:
> + unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
> +out_threshold_exit:
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> + unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
> +out_threshold_entry:
> + unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
> +out_deferred_exit:
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_AMD
> + unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
> +out_deferred_entry:
> + unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
> +out_thermal_exit:
> +#endif /* CONFIG_X86_MCE_AMD */
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> + unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
> +out_thermal_entry:
> + unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
> +out_timer_exit:
> +#endif /* CONFIG_X86_THERMAL_VECTOR */
> +
> + unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
> +out_timer_entry:
> + unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
> +out_err:
> + return -EINVAL;
> +}
> +
> +static void unregister_intel_irq_tp(void)
> +{
> + unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
> + unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
> + unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
> + unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
> + unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
> + unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
> +
> +#ifdef CONFIG_IRQ_WORK
> + unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
> + unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
> +#endif
> +
> +#ifdef CONFIG_SMP
> + unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
> + unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
> + unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
> + unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
> + unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
> + unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> + unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
> + unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_AMD
> + unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
> + unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
> +#endif
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> + unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
> + unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
> +#endif /* CONFIG_X86_THERMAL_VECTOR */
> +
> + unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
> + unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
> +}
> +


> +#else /* CONFIG_X86_LOCAL_APIC */
> +#define register_intel_irq_tp() do {} while (0)
> +#define unregister_intel_irq_tp() do {} while (0)
> +#endif /* CONFIG_X86_LOCAL_APIC */

And have this in a local header file. kernel/trace/trace_osnoise.h ?
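
Roughly (a sketch, keeping the register/unregister names). As a bonus,
static inline stubs instead of the do {} while (0) macros would keep
"ret = register_intel_irq_tp();" in hook_irq_events() building when the
APIC code is compiled out:

/* kernel/trace/trace_osnoise.h */
#ifdef CONFIG_X86_LOCAL_APIC
int register_intel_irq_tp(void);
void unregister_intel_irq_tp(void);
#else /* CONFIG_X86_LOCAL_APIC */
static inline int register_intel_irq_tp(void)
{
	return 0;
}
static inline void unregister_intel_irq_tp(void) { }
#endif /* CONFIG_X86_LOCAL_APIC */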


> +
> +/**
> + * hook_irq_events - Hook IRQ handling events
> + *
> + * This function hooks the IRQ related callbacks to the respective trace
> + * events.
> + */
> +int hook_irq_events(void)
> +{
> + int ret;
> +
> + ret = register_trace_irq_handler_entry(trace_irqentry_callback, NULL);
> + if (ret)
> + goto out_err;
> +
> + ret = register_trace_irq_handler_exit(trace_irqexit_callback, NULL);
> + if (ret)
> + goto out_unregister_entry;
> +
> + ret = register_intel_irq_tp();
> + if (ret)
> + goto out_irq_exit;
> +
> + return 0;
> +
> +out_irq_exit:
> + unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
> +out_unregister_entry:
> + unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
> +out_err:
> + return -EINVAL;
> +}
> +
> +/**
> + * unhook_irq_events - Unhook IRQ handling events
> + *
> + * This function unhooks the IRQ related callbacks to the respective trace
> + * events.
> + */
> +void unhook_irq_events(void)
> +{
> + unregister_intel_irq_tp();
> + unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
> + unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
> +}
> +
> +#ifndef CONFIG_PREEMPT_RT
> +/**
> + * trace_softirq_entry_callback - Note the starting of a SoftIRQ

Where did you get the case usage of that? I've never seen it written
like "SoftIRQ" in the kernel. Just call it softirq as it is referenced
everyplace else.

> + *
> + * Save the starting time of a SoftIRQ. As SoftIRQs are non-preemptive to
> + * other SoftIRQs, it is safe to use a single variable (osn_var->softirq)
> + * to save the statistics. The arrival_time is used to report... the
> + * arrival time. The delta_start is used to compute the duration at the
> + * SoftIRQ exit handler. See cond_move_softirq_delta_start().
> + */
> +void trace_softirq_entry_callback(void *data, unsigned int vec_nr)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> +
> + if (!osn_var->sampling)
> + return;
> + /*
> + * This value will be used in the report, but not to compute
> + * the execution time, so it is safe to get it unsafe.
> + */
> + osn_var->softirq.arrival_time = time_get();
> + set_int_safe_time(osn_var, &osn_var->softirq.delta_start);
> + osn_var->softirq.count++;
> +
> + local_inc(&osn_var->int_counter);
> +}
> +
> +/**
> + * trace_softirq_exit_callback - Note the end of a SoftIRQ
> + *
> + * Computes the duration of the SoftIRQ noise, and traces it. Also discounts the
> + * interference from other sources of noise that could currently be accounted.
> + */
> +void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + int duration;
> +
> + if (!osn_var->sampling)
> + return;
> +
> + duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
> + trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
> + cond_move_thread_delta_start(osn_var, duration);
> + osn_var->softirq.arrival_time = 0;
> +}
> +
> +/**
> + * hook_softirq_events - Hook SoftIRQ handling events
> + *
> + * This function hooks the SoftIRQ related callbacks to the respective trace
> + * events.
> + */
> +static int hook_softirq_events(void)
> +{
> + int ret;
> +
> + ret = register_trace_softirq_entry(trace_softirq_entry_callback, NULL);
> + if (ret)
> + goto out_err;
> +
> + ret = register_trace_softirq_exit(trace_softirq_exit_callback, NULL);
> + if (ret)
> + goto out_unreg_entry;
> +
> + return 0;
> +
> +out_unreg_entry:
> + unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
> +out_err:
> + return -EINVAL;
> +}
> +
> +/**
> + * unhook_softirq_events - Unhook SoftIRQ handling events
> + *
> + * This function unhooks the SoftIRQ related callbacks from the respective trace
> + * events.
> + */
> +static void unhook_softirq_events(void)
> +{
> + unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
> + unregister_trace_softirq_exit(trace_softirq_exit_callback, NULL);
> +}
> +#else /* CONFIG_PREEMPT_RT */
> +/*
> + * SoftIRQs are threads in PREEMPT_RT mode.
> + */
> +static int hook_softirq_events(void)
> +{
> + return 0;
> +}
> +static void unhook_softirq_events(void)
> +{
> +}
> +#endif
> +
> +/**
> + * thread_entry - Record the starting of a thread noise window
> + *
> + * It saves the context switch time for a noisy thread, and increments
> + * the interference counters.
> + */
> +static void
> +thread_entry(struct osnoise_variables *osn_var, struct task_struct *t)
> +{
> + if (!osn_var->sampling)
> + return;
> + /*
> + * The arrival time will be used in the report, but not to compute
> + * the execution time, so it is safe to get it unsafe.
> + */
> + osn_var->thread.arrival_time = time_get();
> +
> + set_int_safe_time(osn_var, &osn_var->thread.delta_start);
> +
> + osn_var->thread.count++;
> + local_inc(&osn_var->int_counter);
> +}
> +
> +/**
> + * thread_exit - Report the end of a thread noise window
> + *
> + * It computes the total noise from a thread, tracing if needed.
> + */
> +static void
> +thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
> +{
> + int duration;
> +
> + if (!osn_var->sampling)
> + return;
> +
> + duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
> +
> + trace_thread_noise(t, osn_var->thread.arrival_time, duration);
> +
> + osn_var->thread.arrival_time = 0;
> +}
> +
> +/**
> + * trace_sched_switch_callback - sched:sched_switch trace event handler
> + *
> + * This function is hooked to the sched:sched_switch trace event, and it is
> + * used to record the beginning and to report the end of a thread noise window.
> + */
> +void
> +trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
> + struct task_struct *n)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> +
> + if (p->pid != osn_var->pid)
> + thread_exit(osn_var, p);
> +
> + if (n->pid != osn_var->pid)
> + thread_entry(osn_var, n);
> +}
> +
> +/**
> + * hook_thread_events - Hook the instrumentation for thread noise
> + *
> + * Hook the osnoise tracer callbacks to handle the noise from other
> + * threads on the necessary kernel events.
> + */
> +int hook_thread_events(void)
> +{
> + int ret;
> +
> + ret = register_trace_sched_switch(trace_sched_switch_callback, NULL);
> + if (ret)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/**
> + * unhook_thread_events - Unhook the instrumentation for thread noise
> + *
> + * Unhook the osnoise tracer callbacks that handle the noise from other
> + * threads on the necessary kernel events.
> + */
> +void unhook_thread_events(void)
> +{
> + unregister_trace_sched_switch(trace_sched_switch_callback, NULL);
> +}
> +
> +/**
> + * save_osn_sample_stats - Save the osnoise_sample statistics
> + *
> + * Save the osnoise_sample statistics before the sampling phase. These
> + * values will be used later to compute the diff between the statistics
> + * before and after the osnoise sampling.
> + */
> +void save_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
> +{
> + s->nmi_count = osn_var->nmi.count;
> + s->irq_count = osn_var->irq.count;
> + s->softirq_count = osn_var->softirq.count;
> + s->thread_count = osn_var->thread.count;
> +}
> +
> +/**
> + * diff_osn_sample_stats - Compute the osnoise_sample statistics
> + *
> + * After a sample period, compute the difference on the osnoise_sample
> + * statistics. The struct osnoise_sample *s contains the statistics saved via
> + * save_osn_sample_stats() before the osnoise sampling.
> + */
> +void diff_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
> +{
> + s->nmi_count = osn_var->nmi.count - s->nmi_count;
> + s->irq_count = osn_var->irq.count - s->irq_count;
> + s->softirq_count = osn_var->softirq.count - s->softirq_count;
> + s->thread_count = osn_var->thread.count - s->thread_count;
> +}
> +
> +/**
> + * osnoise_stop_tracing - Stop tracing and the tracer.
> + */
> +static void osnoise_stop_tracing(void)
> +{
> + tracing_off();

This can run in an instance. The "tracing_off()" affects only the top
level tracing buffer, not the instance that this may be running in. You
want tracer_tracing_off(tr).
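
That is, something like (sketch):

static void osnoise_stop_tracing(void)
{
	struct trace_array *tr = osnoise_trace;

	/* stop the buffer of the instance the tracer is running in */
	tracer_tracing_off(tr);
}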

> +}
> +
> +/**
> + * run_osnoise - Sample the time and look for osnoise
> + *
> + * Used to capture the time, looking for potential osnoise latency repeatedly.
> + * Different from hwlat_detector, it is called with preemption and interrupts
> + * enabled. This allows irqs, softirqs and threads to run, interfering with the
> + * osnoise sampling thread, as they would do with a regular thread.
> + */
> +static int run_osnoise(void)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + u64 noise = 0, sum_noise = 0, max_noise = 0;
> + struct trace_array *tr = osnoise_trace;
> + u64 start, sample, last_sample;
> + u64 last_int_count, int_count;
> + s64 total, last_total = 0;
> + struct osnoise_sample s;
> + unsigned int threshold;
> + int hw_count = 0;
> + u64 runtime, stop_in;
> + int ret = -1;
> +
> + /*
> + * Considers the current thread as the workload.
> + */
> + osn_var->pid = current->pid;
> +
> + /*
> + * Save the current stats for the diff
> + */
> + save_osn_sample_stats(osn_var, &s);
> +
> + /*
> + * threshold should be at least 1 us.
> + */
> + threshold = tracing_thresh ? tracing_thresh : 1000;

BTW, you can write the above as:

threshold = tracing_thresh ? : 1000;

too ;-)
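
(That shorthand is a GNU C extension; the omitted middle operand reuses
the condition, evaluated only once:)

	threshold = tracing_thresh ? : 1000; /* == tracing_thresh ? tracing_thresh : 1000 */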


> +
> + /*
> + * Make sure NMIs see sampling first
> + */
> + osn_var->sampling = true;
> + barrier();
> +
> + /*
> + * Transform the *_us config to nanoseconds to avoid the
> + * division on the main loop.
> + */
> + runtime = osnoise_data.sample_runtime * NSEC_PER_USEC;
> + stop_in = osnoise_data.stop_tracing_in * NSEC_PER_USEC;
> +
> + /*
> + * Start timestamp
> + */
> + start = time_get();
> +
> + /*
> + * "previous" loop
> + */
> + last_int_count = set_int_safe_time(osn_var, &last_sample);
> +
> + do {
> + /*
> + * Get sample!
> + */
> + int_count = set_int_safe_time(osn_var, &sample);
> +
> + noise = time_sub(sample, last_sample);
> +
> + /*
> + * This shouldn't happen.
> + */
> + if (noise < 0) {
> + pr_err(BANNER "time running backwards\n");
> + goto out;
> + }
> +
> + /*
> + * Sample runtime.
> + */
> + total = time_sub(sample, start);
> +
> + /*
> + * Check for possible overflows.
> + */
> + if (total < last_total) {
> + pr_err("Time total overflowed\n");
> + break;
> + }
> +
> + last_total = total;
> +
> + if (noise >= threshold) {
> + int interference = int_count - last_int_count;
> +
> + if (noise > max_noise)
> + max_noise = noise;
> +
> + if (!interference)
> + hw_count++;
> +
> + sum_noise += noise;
> +
> + trace_sample_threshold(last_sample, noise, interference);
> +
> + if (osnoise_data.stop_tracing_in)
> + if (noise > stop_in)
> + osnoise_stop_tracing();
> + }
> +
> + /*
> + * For the non-preemptive kernel config: let threads run, if
> + * they so wish.
> + */
> + cond_resched();
> +
> + last_sample = sample;
> + last_int_count = int_count;
> +
> + } while (total < runtime && !kthread_should_stop());
> +
> + /*
> + * Finish the above in the view for interrupts.
> + */
> + barrier();
> +
> + osn_var->sampling = false;
> +
> + /*
> + * Make sure sampling data is no longer updated.
> + */
> + barrier();
> +
> + /*
> + * Save noise info.
> + */
> + s.noise = time_to_us(sum_noise);
> + s.runtime = time_to_us(total);
> + s.max_sample = time_to_us(max_noise);
> + s.hw_count = hw_count;
> +
> + /* Save interference stats info */
> + diff_osn_sample_stats(osn_var, &s);
> +
> + trace_osnoise_sample(&s);
> +
> + /* Keep a running maximum ever recorded osnoise "latency" */
> + if (max_noise > tr->max_latency) {
> + tr->max_latency = max_noise;
> + latency_fsnotify(tr);
> + }
> +
> + if (osnoise_data.stop_tracing_out)
> + if (s.noise > osnoise_data.stop_tracing_out)
> + osnoise_stop_tracing();
> +
> + return 0;
> +out:
> + return ret;
> +}
> +
> +static struct cpumask osnoise_cpumask;
> +static struct cpumask save_cpumask;
> +
> +/*
> + * osnoise_main - The osnoise detection kernel thread
> + *
> + * Calls run_osnoise() function to measure the osnoise for the configured runtime,
> + * every period.
> + */
> +static int osnoise_main(void *data)
> +{
> + s64 interval;
> +
> + while (!kthread_should_stop()) {
> +
> + run_osnoise();
> +
> + mutex_lock(&interface_lock);
> + interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
> + mutex_unlock(&interface_lock);
> +
> + do_div(interval, USEC_PER_MSEC);
> +
> + /*
> + * differently from hwlat_detector, the osnoise tracer can run
> + * without a pause because preemption is on.


Can it? With !CONFIG_PREEMPT, this will never sleep, right? I believe
there are watchdogs that will trigger if a kernel thread never schedules
out. I might be wrong, will have to check on this.


> + */
> + if (interval < 1)
> + continue;
> +
> + if (msleep_interruptible(interval))
> + break;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * stop_per_cpu_kthreads - Stop per-cpu threads
> + *
> + * Stop the osnoise sampling threads. Use this on unload and at system
> + * shutdown.
> + */
> +static void stop_per_cpu_kthreads(void)
> +{
> + struct task_struct *kthread;
> + int cpu;
> +
> + for_each_online_cpu(cpu) {
> + kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
> + if (kthread)
> + kthread_stop(kthread);
> + per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
> + }
> +}
> +
> +/**
> + * start_per_cpu_kthreads - Kick off per-cpu osnoise sampling kthreads
> + *
> + * This starts the kernel thread that will look for osnoise on many
> + * cpus.
> + */
> +static int start_per_cpu_kthreads(struct trace_array *tr)
> +{
> + struct cpumask *current_mask = &save_cpumask;
> + struct task_struct *kthread;
> + char comm[24];
> + int cpu;
> +
> + get_online_cpus();
> + /*
> + * Run only on CPUs in which trace and osnoise are allowed to run.
> + */
> + cpumask_and(current_mask, tr->tracing_cpumask, &osnoise_cpumask);
> + /*
> + * And the CPU is online.
> + */
> + cpumask_and(current_mask, cpu_online_mask, current_mask);
> + put_online_cpus();
> +
> + for_each_online_cpu(cpu)
> + per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;

I think we want for_each_possible_cpu(), especially since you don't
have anything protecting the online cpus here after you did
put_online_cpus. But still, we probably want them all NULL anyway.

Should this have a CPU shutdown notifier, to clean things up if this is
running while people are playing with CPU hotplug?
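
Something like the sketch below, perhaps (the callback and state name
are hypothetical; the locking against concurrent start/stop, and whether
kthread_stop() may be called from the teardown context, still need to
be worked out):

	static int osnoise_cpu_die(unsigned int cpu)
	{
		struct task_struct *kthread;

		kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
		if (kthread)
			kthread_stop(kthread);
		per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;

		return 0;
	}

	/* On tracer start: */
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "trace/osnoise:online",
				NULL, osnoise_cpu_die);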

> +
> + for_each_cpu(cpu, current_mask) {
> + snprintf(comm, 24, "osnoise/%d", cpu);
> +
> + kthread = kthread_create_on_cpu(osnoise_main, NULL, cpu, comm);
> +
> + if (IS_ERR(kthread)) {
> + pr_err(BANNER "could not start sampling thread\n");
> + stop_per_cpu_kthreads();
> + return -ENOMEM;
> + }
> +
> + per_cpu(per_cpu_osnoise_var, cpu).kthread = kthread;
> + wake_up_process(kthread);
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * osnoise_cpus_read - Read function for reading the "cpus" file
> + * @filp: The active open file structure
> + * @ubuf: The userspace provided buffer to read value into
> + * @count: The maximum number of bytes to read
> + * @ppos: The current "file" position
> + *
> + * Prints the "cpus" output into the user-provided buffer.
> + */
> +static ssize_t
> +osnoise_cpus_read(struct file *filp, char __user *ubuf, size_t count,
> + loff_t *ppos)
> +{
> + char *mask_str;
> + int len;
> +
> + len = snprintf(NULL, 0, "%*pbl\n",
> + cpumask_pr_args(&osnoise_cpumask)) + 1;
> + mask_str = kmalloc(len, GFP_KERNEL);
> + if (!mask_str)
> + return -ENOMEM;
> +
> + len = snprintf(mask_str, len, "%*pbl\n",
> + cpumask_pr_args(&osnoise_cpumask));
> + if (len >= count) {
> + count = -EINVAL;
> + goto out_err;
> + }
> + count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
> +
> +out_err:
> + kfree(mask_str);
> +
> + return count;
> +}
> +
> +/**
> + * osnoise_cpus_write - Write function for "cpus" entry
> + * @filp: The active open file structure
> + * @ubuf: The user buffer that contains the value to write
> + * @count: The maximum number of bytes to write to "file"
> + * @ppos: The current position in @filp
> + *
> + * This function provides a write implementation for the "cpus"
> + * interface to the osnoise trace. By default, it lists all CPUs,
> + * allowing osnoise threads to run on any online CPU of the system.
> + * It serves to restrict the execution of osnoise to the set of CPUs
> + * written via this interface. Note that osnoise also
> + * respects the "tracing_cpumask." Hence, osnoise threads will run only
> + * on the set of CPUs allowed here AND on "tracing_cpumask." Why not
> + * have just "tracing_cpumask?" Because the user might be interested
> + * in tracing what is running on other CPUs. For instance, one might
> + * run osnoise in one HT CPU while observing what is running on the
> + * sibling HT CPU.
> + */
> +static ssize_t
> +osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
> + loff_t *ppos)
> +{
> + cpumask_var_t osnoise_cpumask_new;
> + char buf[256];
> + int err;
> +

Since this does not affect a running osnoise tracer, we should either
have it return BUSY if it is running, or have it update the threads
during the run.
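
The BUSY variant would be a two-liner at the top of the write handler
(sketch, reusing the existing osnoise_busy flag):

	if (osnoise_busy)
		return -EBUSY;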

-- Steve


> + if (count >= 256)
> + return -EINVAL;
> +
> + if (copy_from_user(buf, ubuf, count))
> + return -EFAULT;
> +
> + if (!zalloc_cpumask_var(&osnoise_cpumask_new, GFP_KERNEL))
> + return -ENOMEM;
> +
> + err = cpulist_parse(buf, osnoise_cpumask_new);
> + if (err)
> + goto err_free;
> +
> + cpumask_copy(&osnoise_cpumask, osnoise_cpumask_new);
> +
> + free_cpumask_var(osnoise_cpumask_new);
> + return count;
> +
> +err_free:
> + free_cpumask_var(osnoise_cpumask_new);
> +
> + return err;
> +}
> +
> +/*
> + * osnoise/runtime_us: cannot be greater than the period.
> + */
> +static struct trace_ull_config osnoise_runtime = {
> + .lock = &interface_lock,
> + .val = &osnoise_data.sample_runtime,
> + .max = &osnoise_data.sample_period,
> + .min = NULL,
> +};
> +
> +/*
> + * osnoise/period_us: cannot be smaller than the runtime.
> + */
> +static struct trace_ull_config osnoise_period = {
> + .lock = &interface_lock,
> + .val = &osnoise_data.sample_period,
> + .max = NULL,
> + .min = &osnoise_data.sample_runtime,
> +};
> +
> +/*
> + * osnoise/stop_tracing_in_us: no limit.
> + */
> +static struct trace_ull_config osnoise_stop_single = {
> + .lock = &interface_lock,
> + .val = &osnoise_data.stop_tracing_in,
> + .max = NULL,
> + .min = NULL,
> +};
> +
> +/*
> + * osnoise/stop_tracing_out_us: no limit.
> + */
> +static struct trace_ull_config osnoise_stop_total = {
> + .lock = &interface_lock,
> + .val = &osnoise_data.stop_tracing_out,
> + .max = NULL,
> + .min = NULL,
> +};
> +
> +static const struct file_operations cpus_fops = {
> + .open = tracing_open_generic,
> + .read = osnoise_cpus_read,
> + .write = osnoise_cpus_write,
> + .llseek = generic_file_llseek,
> +};
> +
> +/**
> + * init_tracefs - A function to initialize the tracefs interface files
> + *
> + * This function creates entries in tracefs for "osnoise". It creates the
> + * "osnoise" directory in the tracing directory, and within that
> + * directory is the count, runtime and period files to change and view
> + * those values.
> + */
> +static int init_tracefs(void)
> +{
> + struct dentry *top_dir;
> + struct dentry *tmp;
> + int ret;
> +
> + ret = tracing_init_dentry();
> + if (ret)
> + return -ENOMEM;
> +
> + top_dir = tracefs_create_dir("osnoise", NULL);
> + if (!top_dir)
> + return -ENOMEM;
> +
> + tmp = tracefs_create_file("period_us", 0640, top_dir,
> + &osnoise_period, &trace_ull_config_fops);
> + if (!tmp)
> + goto err;
> +
> + tmp = tracefs_create_file("runtime_us", 0644, top_dir,
> + &osnoise_runtime, &trace_ull_config_fops);
> + if (!tmp)
> + goto err;
> +
> + tmp = tracefs_create_file("stop_tracing_in_us", 0640, top_dir,
> + &osnoise_stop_single, &trace_ull_config_fops);
> + if (!tmp)
> + goto err;
> +
> + tmp = tracefs_create_file("stop_tracing_out_us", 0640, top_dir,
> + &osnoise_stop_total, &trace_ull_config_fops);
> + if (!tmp)
> + goto err;
> +
> + tmp = trace_create_file("cpus", 0644, top_dir, NULL, &cpus_fops);
> + if (!tmp)
> + goto err;
> +
> + return 0;
> +
> +err:
> + tracefs_remove(top_dir);
> + return -ENOMEM;
> +}
> +
> +static int osnoise_hook_events(void)
> +{
> + int retval;
> +
> + /*
> + * Trace is already hooked; we are re-enabling from
> + * a stop_tracing_*.
> + */
> + if (trace_osnoise_callback_enabled)
> + return 0;
> +
> + retval = hook_irq_events();
> + if (retval)
> + return -EINVAL;
> +
> + retval = hook_softirq_events();
> + if (retval)
> + goto out_unhook_irq;
> +
> + retval = hook_thread_events();
> + /*
> + * All fine!
> + */
> + if (!retval)
> + return 0;
> +
> + unhook_softirq_events();
> +out_unhook_irq:
> + unhook_irq_events();
> + return -EINVAL;
> +}
> +
> +static void osnoise_tracer_start(struct trace_array *tr)
> +{
> + int retval;
> +
> + if (osnoise_busy)
> + return;
> +
> + osn_var_reset_all();
> +
> + retval = osnoise_hook_events();
> + if (retval)
> + goto out_err;
> + /*
> + * Make sure NMIs see reset values.
> + */
> + barrier();
> + trace_osnoise_callback_enabled = true;
> +
> + retval = start_per_cpu_kthreads(tr);
> + /*
> + * all fine!
> + */
> + if (!retval)
> + return;
> +
> +out_err:
> + unhook_irq_events();
> + pr_err(BANNER "Error starting osnoise tracer\n");
> +}
> +
> +static void osnoise_tracer_stop(struct trace_array *tr)
> +{
> + if (!osnoise_busy)
> + return;
> +
> + trace_osnoise_callback_enabled = false;
> + barrier();
> +
> + stop_per_cpu_kthreads();
> +
> + unhook_irq_events();
> + unhook_softirq_events();
> + unhook_thread_events();
> +
> + osnoise_busy = false;
> +}
> +
> +static int osnoise_tracer_init(struct trace_array *tr)
> +{
> + /* Only allow one instance to enable this */
> + if (osnoise_busy)
> + return -EBUSY;
> +
> + osnoise_trace = tr;
> +
> + tr->max_latency = 0;
> +
> + osnoise_tracer_start(tr);
> +
> + osnoise_busy = true;
> +
> + return 0;
> +}
> +
> +static void osnoise_tracer_reset(struct trace_array *tr)
> +{
> + osnoise_tracer_stop(tr);
> +}
> +
> +static struct tracer osnoise_tracer __read_mostly = {
> + .name = "osnoise",
> + .init = osnoise_tracer_init,
> + .reset = osnoise_tracer_reset,
> + .start = osnoise_tracer_start,
> + .stop = osnoise_tracer_stop,
> + .print_header = print_osnoise_headers,
> + .allow_instances = true,
> +};
> +
> +__init static int init_osnoise_tracer(void)
> +{
> + int ret;
> +
> + mutex_init(&interface_lock);
> +
> + cpumask_copy(&osnoise_cpumask, cpu_all_mask);
> +
> + ret = register_tracer(&osnoise_tracer);
> + if (ret)
> + return ret;
> +
> + init_tracefs();
> +
> + return 0;
> +}
> +late_initcall(init_osnoise_tracer);
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index d0368a569bfa..642b6584eba5 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1202,7 +1202,6 @@ trace_hwlat_print(struct trace_iterator *iter, int flags,
> return trace_handle_return(s);
> }
>
> -
> static enum print_line_t
> trace_hwlat_raw(struct trace_iterator *iter, int flags,
> struct trace_event *event)
> @@ -1232,6 +1231,76 @@ static struct trace_event trace_hwlat_event = {
> .funcs = &trace_hwlat_funcs,
> };
>
> +/* TRACE_OSNOISE */
> +static enum print_line_t
> +trace_osnoise_print(struct trace_iterator *iter, int flags,
> + struct trace_event *event)
> +{
> + struct trace_entry *entry = iter->ent;
> + struct trace_seq *s = &iter->seq;
> + struct osnoise_entry *field;
> + u64 ratio, ratio_dec;
> + u64 net_runtime;
> +
> + trace_assign_type(field, entry);
> +
> + /*
> + * compute the available % of cpu time.
> + */
> + net_runtime = field->runtime - field->noise;
> + ratio = net_runtime * 10000000;
> + do_div(ratio, field->runtime);
> + ratio_dec = do_div(ratio, 100000);
> +
> + trace_seq_printf(s, "%llu %10llu %3llu.%05llu %7llu",
> + field->runtime,
> + field->noise,
> + ratio, ratio_dec,
> + field->max_sample);
> +
> + trace_seq_printf(s, " %6u", field->hw_count);
> + trace_seq_printf(s, " %6u", field->nmi_count);
> + trace_seq_printf(s, " %6u", field->irq_count);
> + trace_seq_printf(s, " %6u", field->softirq_count);
> + trace_seq_printf(s, " %6u", field->thread_count);
> +
> + trace_seq_putc(s, '\n');
> +
> + return trace_handle_return(s);
> +}
> +
> +static enum print_line_t
> +trace_osnoise_raw(struct trace_iterator *iter, int flags,
> + struct trace_event *event)
> +{
> + struct osnoise_entry *field;
> + struct trace_seq *s = &iter->seq;
> +
> + trace_assign_type(field, iter->ent);
> +
> + trace_seq_printf(s, "%lld %llu %llu %u %u %u %u %u\n",
> + field->runtime,
> + field->noise,
> + field->max_sample,
> + field->hw_count,
> + field->nmi_count,
> + field->irq_count,
> + field->softirq_count,
> + field->thread_count);
> +
> + return trace_handle_return(s);
> +}
> +
> +static struct trace_event_functions trace_osnoise_funcs = {
> + .trace = trace_osnoise_print,
> + .raw = trace_osnoise_raw,
> +};
> +
> +static struct trace_event trace_osnoise_event = {
> + .type = TRACE_OSNOISE,
> + .funcs = &trace_osnoise_funcs,
> +};
> +
> /* TRACE_BPUTS */
> static enum print_line_t
> trace_bputs_print(struct trace_iterator *iter, int flags,
> @@ -1442,6 +1511,7 @@ static struct trace_event *events[] __initdata = {
> &trace_bprint_event,
> &trace_print_event,
> &trace_hwlat_event,
> + &trace_osnoise_event,
> &trace_raw_data_event,
> &trace_func_repeats_event,
> NULL

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/4/21 11:28 PM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:17 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> Tracer options
>>
>> The tracer has a set of options inside the osnoise directory, they are:
>>
>> - osnoise/cpus: CPUs at which a osnoise thread will execute.
>> - osnoise/period_us: the period of the osnoise thread.
>> - osnoise/runtime_us: how long an osnoise thread will look for noise.
>> - osnoise/stop_tracing_in_us: stop the system tracing if a single noise
>> higher than the configured value happens. Writing 0 disables this
>> option.
>> - osnoise/stop_tracing_out_us: stop the system tracing if total noise
>> higher than the configured value happens. Writing 0 disables this
>> option.
>
> The "in" vs "out" is confusing. Perhaps we should call it:
>
> stop_tracing_single_us and stop_tracing_total_us ?

I am using these more "generic terms" because they are also used by the timerlat
tracer.

In the timerlat tracer, the "in" file is used to stop the tracer for a given IRQ
latency (so, the "inside" operation), while the "out" is used to stop the tracer
in the thread latency (hence the outside operation).

The total sounds good for the "out"! But the single does not work well for the
IRQ... how about: stop_tracing_partial_us ?

It is hard to find a good shared name :-/


>> - tracing_threshold: the minimum delta between two time() reads to be
>> considered as noise, in us. When set to 0, the minimum valid value
>> will be used, which is currently 1 us.
>>
>> Additional Tracing
>>
>> In addition to the tracer, a set of tracepoints were added to
>> facilitate the identification of the osnoise source.
>>
>> - osnoise:sample_threshold: printed anytime a noise is higher than
>> the configurable tolerance_ns.
>> - osnoise:nmi_noise: noise from NMI, including the duration.
>> - osnoise:irq_noise: noise from an IRQ, including the duration.
>> - osnoise:softirq_noise: noise from a SoftIRQ, including the
>> duration.
>> - osnoise:thread_noise: noise from a thread, including the duration.
>>
>> Note that all the values are *net values*. For example, if while osnoise
>> is running, another thread preempts the osnoise thread, it will start a
>> thread_noise duration at the start. Then, an IRQ takes place, preempting
>> the thread_noise, starting a irq_noise. When the IRQ ends its execution,
>> it will compute its duration, and this duration will be subtracted from
>> the thread_noise, in such a way as to avoid the double accounting of the
>> IRQ execution. This logic is valid for all sources of noise.
>>
>> Here is one example of the usage of these tracepoints::
>>
>> osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
>> osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
>> migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
>> osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
>>
>> In this example, a noise sample of 8 microseconds was reported in the last
>> line, pointing to two interferences. Looking backward in the trace, the
>> two previous entries were about the migration thread running after a
>> timer IRQ execution. The first event is not part of the noise because
>> it took place one millisecond before.
>>
>> It is worth noticing that the sum of the duration reported in the
>> tracepoints is smaller than eight us reported in the sample_threshold.
>> The reason roots in the overhead of the entry and exit code that happens
>> before and after any interference execution. This justifies the dual
>> approach: measuring thread and tracing.
>>
>
>
>
>> --- /dev/null
>> +++ b/kernel/trace/trace_osnoise.c
>> @@ -0,0 +1,1563 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * OS Noise Tracer: computes the OS Noise suffered by a running thread.
>> + *
>> + * Based on "hwlat_detector" tracer by:
>> + * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
>> + * Copyright (C) 2013-2016 Steven Rostedt, Red Hat, Inc. <[email protected]>
>> + * With feedback from Clark Williams <[email protected]>
>> + *
>> + * And also based on the rtsl tracer presented on:
>> + * DE OLIVEIRA, Daniel Bristot, et al. Demystifying the real-time linux
>> + * scheduling latency. In: 32nd Euromicro Conference on Real-Time Systems
>> + * (ECRTS 2020). Schloss Dagstuhl-Leibniz-Zentrum fur Informatik, 2020.
>> + *
>> + * Copyright (C) 2021 Daniel Bristot de Oliveira, Red Hat, Inc. <[email protected]>
>> + */
>> +
>> +#include <linux/kthread.h>
>> +#include <linux/tracefs.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/delay.h>
>> +#include <linux/sched/clock.h>
>> +#include <linux/sched.h>
>> +#include "trace.h"
>> +
>> +#ifdef CONFIG_X86_LOCAL_APIC
>> +#include <asm/trace/irq_vectors.h>
>> +#undef TRACE_INCLUDE_PATH
>> +#undef TRACE_INCLUDE_FILE
>> +#endif /* CONFIG_X86_LOCAL_APIC */
>> +
>> +#include <trace/events/irq.h>
>> +#include <trace/events/sched.h>
>> +
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/osnoise.h>
>> +
>> +static struct trace_array *osnoise_trace;
>> +
>> +/*
>> + * Default values.
>> + */
>> +#define BANNER "osnoise: "
>> +#define DEFAULT_SAMPLE_PERIOD 1000000 /* 1s */
>> +#define DEFAULT_SAMPLE_RUNTIME 1000000 /* 1s */
>> +
>> +/*
>> + * NMI runtime info.
>> + */
>> +struct nmi {
>> + u64 count;
>> + u64 delta_start;
>> +};
>> +
>> +/*
>> + * IRQ runtime info.
>> + */
>> +struct irq {
>> + u64 count;
>> + u64 arrival_time;
>> + u64 delta_start;
>> +};
>> +
>> +/*
>> + * SoftIRQ runtime info.
>> + */
>> +struct softirq {
>> + u64 count;
>> + u64 arrival_time;
>> + u64 delta_start;
>> +};
>> +
>> +/*
>> + * Thread runtime info.
>> + */
>> +struct thread {
>
> These are rather generic struct names that could possibly be a conflict
> to something in the future. I would suggest adding "osn_" or something
> in front of them:
>
> struct osn_nmi
> struct osn_irq
> struct osn_thread
>
> or something like that.

I will use the "osn_" prefix.

>
>> + u64 count;
>> + u64 arrival_time;
>> + u64 delta_start;
>> +};
>> +
>> +/*
>> + * Runtime information: this structure saves the runtime information used by
>> + * one sampling thread.
>> + */
>> +struct osnoise_variables {
>> + struct task_struct *kthread;
>> + bool sampling;
>> + pid_t pid;
>> + struct nmi nmi;
>> + struct irq irq;
>> + struct softirq softirq;
>> + struct thread thread;
>> + local_t int_counter;
>> +};
>
> Nit, but it looks better to add tabs to the fields:
>
> struct osnoise_variables {
> struct task_struct *kthread;
> bool sampling;
> pid_t pid;
> struct nmi nmi;
> struct irq irq;
> struct softirq softirq;
> struct thread thread;
> local_t int_counter;
> };
>
> See, it's much easier to see the fields of the structure this way.

yeah, I will change this too.

>
>> +
>> +/*
>> + * Per-cpu runtime information.
>> + */
>> +DEFINE_PER_CPU(struct osnoise_variables, per_cpu_osnoise_var);
>> +
>> +/**
>> + * this_cpu_osn_var - Return the per-cpu osnoise_variables of the current CPU
>> + */
>> +static inline struct osnoise_variables *this_cpu_osn_var(void)
>> +{
>> + return this_cpu_ptr(&per_cpu_osnoise_var);
>> +}
>> +
>> +/**
>> + * osn_var_reset - Reset the values of the given osnoise_variables
>> + */
>> +static inline void osn_var_reset(struct osnoise_variables *osn_var)
>> +{
>> + /*
>> + * So far, all the values are initialized as 0, so
>> + * zeroing the structure is perfect.
>> + */
>> + memset(osn_var, 0, sizeof(struct osnoise_variables));
>
> I'm one of those that prefer:
>
> memset(osn_var, 0, sizeof(*osn_var))
>
> Just in case something changes, you don't need to modify the memset.
>
>
>> +}
>> +
>> +/**
>> + * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
>> + */
>> +static inline void osn_var_reset_all(void)
>> +{
>> + struct osnoise_variables *osn_var;
>> + int cpu;
>> +
>> + for_each_cpu(cpu, cpu_online_mask) {
>> + osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>> + osn_var_reset(osn_var);
>> + }
>> +}
>> +
>> +/*
>> + * Tells NMIs to call back to the osnoise tracer to record timestamps.
>> + */
>> +bool trace_osnoise_callback_enabled;
>> +
>> +/*
>> + * osnoise sample structure definition. Used to store the statistics of a
>> + * sample run.
>> + */
>> +struct osnoise_sample {
>> + u64 runtime; /* runtime */
>> + u64 noise; /* noise */
>> + u64 max_sample; /* max single noise sample */
>> + int hw_count; /* # HW (incl. hypervisor) interference */
>> + int nmi_count; /* # NMIs during this sample */
>> + int irq_count; /* # IRQs during this sample */
>> + int softirq_count; /* # SoftIRQs during this sample */
>> + int thread_count; /* # Threads during this sample */
>> +};
>> +
>> +/*
>> + * Protect the interface.
>> + */
>> +struct mutex interface_lock;
>> +
>> +/*
>> + * Tracer data.
>> + */
>> +static struct osnoise_data {
>> + u64 sample_period; /* total sampling period */
>> + u64 sample_runtime; /* active sampling portion of period */
>> + u64 stop_tracing_in; /* stop trace in the inside operation (loop) */
>> + u64 stop_tracing_out; /* stop trace in the outside operation (report) */
>> +} osnoise_data = {
>> + .sample_period = DEFAULT_SAMPLE_PERIOD,
>> + .sample_runtime = DEFAULT_SAMPLE_RUNTIME,
>> + .stop_tracing_in = 0,
>> + .stop_tracing_out = 0,
>> +};
>> +
>> +/*
>> + * Boolean variable used to inform that the tracer is currently sampling.
>> + */
>> +static bool osnoise_busy;
>> +
>> +/*
>> + * Print the osnoise header info.
>> + */
>> +static void print_osnoise_headers(struct seq_file *s)
>> +{
>> + seq_puts(s, "# _-----=> irqs-off\n");
>> + seq_puts(s, "# / _----=> need-resched\n");
>> + seq_puts(s, "# | / _---=> hardirq/softirq\n");
>> + seq_puts(s, "# || / _--=> preempt-depth ");
>> + seq_puts(s, " MAX\n");
>> +
>> + seq_puts(s, "# || / ");
>> + seq_puts(s, " SINGLE Interference counters:\n");
>> +
>> + seq_puts(s, "# |||| RUNTIME ");
>> + seq_puts(s, " NOISE %% OF CPU NOISE +-----------------------------+\n");
>> +
>> + seq_puts(s, "# TASK-PID CPU# |||| TIMESTAMP IN US ");
>> + seq_puts(s, " IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD\n");
>> +
>> + seq_puts(s, "# | | | |||| | | ");
>> + seq_puts(s, " | | | | | | | |\n");
>> +}
>> +
>> +/*
>> + * Record an osnoise_sample into the tracer buffer.
>> + */
>> +static void trace_osnoise_sample(struct osnoise_sample *sample)
>> +{
>> + struct trace_array *tr = osnoise_trace;
>> + struct trace_buffer *buffer = tr->array_buffer.buffer;
>> + struct trace_event_call *call = &event_osnoise;
>> + struct ring_buffer_event *event;
>> + struct osnoise_entry *entry;
>> +
>> + event = trace_buffer_lock_reserve(buffer, TRACE_OSNOISE, sizeof(*entry),
>> + tracing_gen_ctx());
>> + if (!event)
>> + return;
>> + entry = ring_buffer_event_data(event);
>> + entry->runtime = sample->runtime;
>> + entry->noise = sample->noise;
>> + entry->max_sample = sample->max_sample;
>> + entry->hw_count = sample->hw_count;
>> + entry->nmi_count = sample->nmi_count;
>> + entry->irq_count = sample->irq_count;
>> + entry->softirq_count = sample->softirq_count;
>> + entry->thread_count = sample->thread_count;
>> +
>> + if (!call_filter_check_discard(call, entry, buffer, event))
>> + trace_buffer_unlock_commit_nostack(buffer, event);
>> +}
>> +
>> +/**
>> + * Macros to encapsulate the time capturing infrastructure.
>> + */
>> +#define time_get() trace_clock_local()
>> +#define time_to_us(x) div_u64(x, 1000)
>> +#define time_sub(a, b) ((a) - (b))
>> +
>> +/**
>> + * cond_move_irq_delta_start - Forward the delta_start of a running IRQ
>> + *
>> + * If an IRQ is preempted by an NMI, its delta_start is pushed forward
>> + * to discount the NMI interference.
>> + *
>> + * See get_int_safe_duration().
>> + */
>> +static inline void
>> +cond_move_irq_delta_start(struct osnoise_variables *osn_var, u64 duration)
>> +{
>> + if (osn_var->irq.delta_start)
>> + osn_var->irq.delta_start += duration;
>> +}
>> +
>> +#ifndef CONFIG_PREEMPT_RT
>> +/**
>> + * cond_move_softirq_delta_start - Forward the delta_start of a running SoftIRQ
>> + *
>> + * If a SoftIRQ is preempted by an IRQ or NMI, its delta_start is pushed
>> + * forward to discount the interference.
>> + *
>> + * See get_int_safe_duration().
>> + */
>> +static inline void
>> +cond_move_softirq_delta_start(struct osnoise_variables *osn_var, u64 duration)
>> +{
>> + if (osn_var->softirq.delta_start)
>> + osn_var->softirq.delta_start += duration;
>> +}
>> +#else /* CONFIG_PREEMPT_RT */
>> +#define cond_move_softirq_delta_start(osn_var, duration) do {} while (0)
>> +#endif
>> +
>> +/**
>
> Don't use the "/**" notation if you are not following kernel doc (which
> means you need to also document each parameter and return value. The
> '/**' is usually reserved for non static functions that are an API for
> other parts of the kernel. Use just '/*' for these (unless you want to
> add full kernel doc notation).

ok!
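
For reference, only the opening marker differs; a made-up example:

	/* Plain comment: fine for static helpers. */

	/**
	 * foo_bar - One-line summary for kernel-doc
	 * @x: description of the parameter
	 *
	 * Return: 0 on success, negative errno otherwise.
	 */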

>> + * cond_move_thread_delta_start - Forward the delta_start of a running thread
>> + *
>> + * If a noisy thread is preempted by a Softirq, IRQ or NMI, its delta_start
>> + * is pushed forward to discount the interference.
>> + *
>> + * See get_int_safe_duration().
>> + */
>> +static inline void
>> +cond_move_thread_delta_start(struct osnoise_variables *osn_var, u64 duration)
>> +{
>> + if (osn_var->thread.delta_start)
>> + osn_var->thread.delta_start += duration;
>> +}
>> +
>> +/**
>> + * get_int_safe_duration - Get the duration of a window
>> + *
>> + * The irq, softirq and thread variables need to have their duration without
>> + * the interference from higher priority interrupts. Instead of keeping a
>> + * variable to discount the interrupt interference from these variables, the
>> + * starting time of these variables are pushed forward with the interrupt's
>> + * duration. In this way, a single variable is used to:
>> + *
>> + * - Know if a given window is being measured.
>> + * - Account its duration.
>> + * - Discount the interference.
>> + *
>> + * To avoid getting inconsistent values, e.g.,:
>> + *
>> + * now = time_get()
>> + * ---> interrupt!
>> + * delta_start -= int duration;
>> + * <---
>> + * duration = now - delta_start;
>> + *
>> + * result: negative duration if the variable duration before the
>> + * interrupt was smaller than the interrupt execution.
>> + *
>> + * A counter of interrupts is used. If the counter increased, try
>> + * to capture an interference safe duration.
>> + */
>> +static inline s64
>> +get_int_safe_duration(struct osnoise_variables *osn_var, u64 *delta_start)
>> +{
>> + u64 int_counter, now;
>> + s64 duration;
>> +
>> + do {
>> + int_counter = local_read(&osn_var->int_counter);
>> + /* synchronize with interrupts */
>> + barrier();
>> +
>> + now = time_get();
>> + duration = (now - *delta_start);
>> +
>> + /* synchronize with interrupts */
>> + barrier();
>> + } while (int_counter != local_read(&osn_var->int_counter));
>> +
>> + /*
>> + * This is evidence of race conditions that cause
>> + * a value to be "discounted" too much.
>> + */
>> + if (duration < 0)
>> + pr_err("int safe negative!\n");
>
> Probably want to have this happen at most once a run. If something were
> to break, I don't think we want this to live lock the machine doing
> tons of prints. We could have a variable stored on the
> osnoise_variables that states this was printed. Check that variable to
> see if it wasn't printed during a run (when current_tracer was set),
> and print only once if it is.

yeah, also because this could happen in a section where printk does not work well.
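
A per-run flag would do it, something like (sketch; the field name is
made up):

	if (duration < 0 && !osn_var->reported_negative) {
		osn_var->reported_negative = true;
		pr_err("int safe negative!\n");
	}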

>
>> +
>> + *delta_start = 0;
>> +
>> + return duration;
>> +}
>> +
>> +/**
>> + *
>> + * set_int_safe_time - Save the current time on *time, aware of interference
>> + *
>> + * Get the time, taking into consideration a possible interference from
>> + * higher priority interrupts.
>> + *
>> + * See get_int_safe_duration() for an explanation.
>> + */
>> +static u64
>> +set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
>> +{
>> + u64 int_counter;
>> +
>> + do {
>> + int_counter = local_read(&osn_var->int_counter);
>> + /* synchronize with interrupts */
>> + barrier();
>> +
>> + *time = time_get();
>
> time_get() is trace_clock_local() which is used in tracing, and should
> not be affected by interrupts. Why the loop? Even if using the generic
> NMI-unsafe version of sched_clock, it does its own loop.

The loop is there to keep *time and the returned int_counter consistent with each other...
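
Maybe worth a comment in set_int_safe_time() then, e.g.:

	/*
	 * If an interrupt fires between time_get() and the second
	 * local_read(), int_counter changes and the loop retries, so
	 * the returned counter always matches the *time snapshot.
	 */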

>> +
>> + /* synchronize with interrupts */
>> + barrier();
>> + } while (int_counter != local_read(&osn_var->int_counter));
>> +
>> + return int_counter;
>> +}
>> +
>> +/**
>> + * trace_osnoise_callback - NMI entry/exit callback
>> + *
>> + * This function is called at the NMI entry and exit code. The bool enter
>> + * distinguishes between the two cases. This function is used to note an NMI
>> + * occurrence, compute the noise caused by the NMI, and to remove the noise
>> + * it is potentially causing on other interference variables.
>> + */
>> +void trace_osnoise_callback(bool enter)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + u64 duration;
>> +
>> + if (!osn_var->sampling)
>> + return;
>> +
>> + /*
>> + * Currently trace_clock_local() calls sched_clock() and the
>> + * generic version is not NMI safe.
>> + */
>> + if (!IS_ENABLED(CONFIG_GENERIC_SCHED_CLOCK)) {
>> + if (enter) {
>> + osn_var->nmi.delta_start = time_get();
>> + local_inc(&osn_var->int_counter);
>> + } else {
>> + duration = time_get() - osn_var->nmi.delta_start;
>> +
>> + trace_nmi_noise(osn_var->nmi.delta_start, duration);
>> +
>> + cond_move_irq_delta_start(osn_var, duration);
>> + cond_move_softirq_delta_start(osn_var, duration);
>> + cond_move_thread_delta_start(osn_var, duration);
>> + }
>> + }
>> +
>> + if (enter)
>> + osn_var->nmi.count++;
>> +}
>> +
>> +/**
>> + * __trace_irq_entry - Note the starting of an IRQ
>> + *
>> + * Save the starting time of an IRQ. As IRQs are non-preemptive to other IRQs,
>> + * it is safe to use a single variable (osn_var->irq) to save the statistics.
>> + * The arrival_time is used to report... the arrival time. The delta_start
>> + * is used to compute the duration at the IRQ exit handler. See
>> + * cond_move_irq_delta_start().
>> + */
>> +static inline void __trace_irq_entry(int id)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> +
>> + if (!osn_var->sampling)
>> + return;
>> + /*
>> + * This value will be used in the report, but not to compute
>> + * the execution time, so it is safe to get it unsafe.
>> + */
>> + osn_var->irq.arrival_time = time_get();
>> + set_int_safe_time(osn_var, &osn_var->irq.delta_start);
>> + osn_var->irq.count++;
>> +
>> + local_inc(&osn_var->int_counter);
>> +}
>> +
>> +/**
>> + * __trace_irq_exit - Note the end of an IRQ, save data and trace
>> + *
>> + * Computes the duration of the IRQ noise, and traces it. Also discounts the
>> + * interference from other sources of noise that could currently be accounted.
>> + */
>> +static inline void __trace_irq_exit(int id, const char *desc)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + int duration;
>> +
>> + if (!osn_var->sampling)
>> + return;
>> +
>> + duration = get_int_safe_duration(osn_var, &osn_var->irq.delta_start);
>> + trace_irq_noise(id, desc, osn_var->irq.arrival_time, duration);
>> + osn_var->irq.arrival_time = 0;
>> + cond_move_softirq_delta_start(osn_var, duration);
>> + cond_move_thread_delta_start(osn_var, duration);
>> +}
>> +
>> +/**
>> + * trace_irqentry_callback - Callback to the irq:irq_entry traceevent
>> + *
>> + * Used to note the starting of an IRQ occurrence.
>> + */
>> +void trace_irqentry_callback(void *data, int irq, struct irqaction *action)
>> +{
>> + __trace_irq_entry(irq);
>> +}
>> +
>> +/**
>> + * trace_irqexit_callback - Callback to the irq:irq_exit traceevent
>> + *
>> + * Used to note the end of an IRQ occurrence.
>> + */
>> +void trace_irqexit_callback(void *data, int irq, struct irqaction *action, int ret)
>> +{
>> + __trace_irq_exit(irq, action->name);
>> +}
>> +
>> +#ifdef CONFIG_X86_LOCAL_APIC
>
> I wonder if we should move this into a separate file, making the
> __trace_irq_entry() a more name space safe name and have it call that.
> I have a bit of a distaste for arch specific code in a generic file.

Yeah, where should I place it? Somewhere in arch/x86?

>
>> +/**
>> + * trace_intel_irq_entry - record intel specific IRQ entry
>> + */
>> +void trace_intel_irq_entry(void *data, int vector)
>> +{
>> + __trace_irq_entry(vector);
>> +}
>> +
>> +/**
>> + * trace_intel_irq_exit - record intel specific IRQ exit
>> + */
>> +void trace_intel_irq_exit(void *data, int vector)
>> +{
>> + char *vector_desc = (char *) data;
>> +
>> + __trace_irq_exit(vector, vector_desc);
>> +}
>> +
>> +/**
>> + * register_intel_irq_tp - Register intel specific IRQ entry tracepoints
>> + */
>> +static int register_intel_irq_tp(void)
>> +{
>> + int ret;
>> +
>> + ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_err;
>> +
>> + ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
>> + if (ret)
>> + goto out_timer_entry;
>> +
>> +#ifdef CONFIG_X86_THERMAL_VECTOR
>> + ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_timer_exit;
>> +
>> + ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
>> + if (ret)
>> + goto out_thermal_entry;
>> +#endif /* CONFIG_X86_THERMAL_VECTOR */
>> +
>> +#ifdef CONFIG_X86_MCE_AMD
>> + ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_thermal_exit;
>> +
>> + ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
>> + if (ret)
>> + goto out_deferred_entry;
>> +#endif
>> +
>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>> + ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_deferred_exit;
>> +
>> + ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
>> + if (ret)
>> + goto out_threshold_entry;
>> +#endif /* CONFIG_X86_MCE_THRESHOLD */
>> +
>> +#ifdef CONFIG_SMP
>> + ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_threshold_exit;
>> +
>> + ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
>> + "call_function_single");
>> + if (ret)
>> + goto out_call_function_single_entry;
>> +
>> + ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_call_function_single_exit;
>> +
>> + ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
>> + if (ret)
>> + goto out_call_function_entry;
>> +
>> + ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_call_function_exit;
>> +
>> + ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
>> + if (ret)
>> + goto out_reschedule_entry;
>> +#endif /* CONFIG_SMP */
>> +
>> +#ifdef CONFIG_IRQ_WORK
>> + ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_reschedule_exit;
>> +
>> + ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
>> + if (ret)
>> + goto out_irq_work_entry;
>> +#endif
>> +
>> + ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_irq_work_exit;
>> +
>> + ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
>> + if (ret)
>> + goto out_x86_ipi_entry;
>> +
>> + ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_x86_ipi_exit;
>> +
>> + ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
>> + if (ret)
>> + goto out_error_apic_entry;
>> +
>> + ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
>> + if (ret)
>> + goto out_error_apic_exit;
>> +
>> + ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
>> + if (ret)
>> + goto out_spurious_apic_entry;
>> +
>> + return 0;
>> +
>> +out_spurious_apic_entry:
>> + unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
>> +out_error_apic_exit:
>> + unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
>> +out_error_apic_entry:
>> + unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
>> +out_x86_ipi_exit:
>> + unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
>> +out_x86_ipi_entry:
>> + unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
>> +out_irq_work_exit:
>> +
>> +#ifdef CONFIG_IRQ_WORK
>> + unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
>> +out_irq_work_entry:
>> + unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
>> +out_reschedule_exit:
>> +#endif
>> +
>> +#ifdef CONFIG_SMP
>> + unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
>> +out_reschedule_entry:
>> + unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
>> +out_call_function_exit:
>> + unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
>> +out_call_function_entry:
>> + unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
>> +out_call_function_single_exit:
>> + unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
>> +out_call_function_single_entry:
>> + unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
>> +out_threshold_exit:
>> +#endif
>> +
>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>> + unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
>> +out_threshold_entry:
>> + unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
>> +out_deferred_exit:
>> +#endif
>> +
>> +#ifdef CONFIG_X86_MCE_AMD
>> + unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
>> +out_deferred_entry:
>> + unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
>> +out_thermal_exit:
>> +#endif /* CONFIG_X86_MCE_AMD */
>> +
>> +#ifdef CONFIG_X86_THERMAL_VECTOR
>> + unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
>> +out_thermal_entry:
>> + unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
>> +out_timer_exit:
>> +#endif /* CONFIG_X86_THERMAL_VECTOR */
>> +
>> + unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
>> +out_timer_entry:
>> + unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
>> +out_err:
>> + return -EINVAL;
>> +}
>> +
>> +static void unregister_intel_irq_tp(void)
>> +{
>> + unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
>> + unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
>> + unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
>> + unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
>> + unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
>> + unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
>> +
>> +#ifdef CONFIG_IRQ_WORK
>> + unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
>> + unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
>> +#endif
>> +
>> +#ifdef CONFIG_SMP
>> + unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
>> + unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
>> + unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
>> + unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
>> + unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
>> + unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
>> +#endif
>> +
>> +#ifdef CONFIG_X86_MCE_THRESHOLD
>> + unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
>> + unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
>> +#endif
>> +
>> +#ifdef CONFIG_X86_MCE_AMD
>> + unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
>> + unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
>> +#endif
>> +
>> +#ifdef CONFIG_X86_THERMAL_VECTOR
>> + unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
>> + unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
>> +#endif /* CONFIG_X86_THERMAL_VECTOR */
>> +
>> + unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
>> + unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
>> +}
>> +
>
>
>> +#else /* CONFIG_X86_LOCAL_APIC */
>> +#define register_intel_irq_tp() do {} while (0)
>> +#define unregister_intel_irq_tp() do {} while (0)
>> +#endif /* CONFIG_X86_LOCAL_APIC */
>
> And have this in a local header file. kernel/trace/trace_osnoise.h ?

Ack! Also because these osnoise: tracepoints will be used by yet another
tracer, the rtsl... https://github.com/bristot/rtsl/ (I hope :-))

>
>> +
>> +/**
>> + * hook_irq_events - Hook IRQ handling events
>> + *
>> + * This function hooks the IRQ related callbacks to the respective trace
>> + * events.
>> + */
>> +int hook_irq_events(void)
>> +{
>> + int ret;
>> +
>> + ret = register_trace_irq_handler_entry(trace_irqentry_callback, NULL);
>> + if (ret)
>> + goto out_err;
>> +
>> + ret = register_trace_irq_handler_exit(trace_irqexit_callback, NULL);
>> + if (ret)
>> + goto out_unregister_entry;
>> +
>> + ret = register_intel_irq_tp();
>> + if (ret)
>> + goto out_irq_exit;
>> +
>> + return 0;
>> +
>> +out_irq_exit:
>> + unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
>> +out_unregister_entry:
>> + unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
>> +out_err:
>> + return -EINVAL;
>> +}
>> +
>> +/**
>> + * unhook_irq_events - Unhook IRQ handling events
>> + *
>> + * This function unhooks the IRQ related callbacks from the respective trace
>> + * events.
>> + */
>> +void unhook_irq_events(void)
>> +{
>> + unregister_intel_irq_tp();
>> + unregister_trace_irq_handler_exit(trace_irqexit_callback, NULL);
>> + unregister_trace_irq_handler_entry(trace_irqentry_callback, NULL);
>> +}
>> +
>> +#ifndef CONFIG_PREEMPT_RT
>> +/**
>> + * trace_softirq_entry_callback - Note the starting of a SoftIRQ
>
> Where did you get the case usage of that? I've never seen it written
> like "SoftIRQ" in the kernel. Just call it softirq as it is referenced
> everyplace else.

I do not remember when, but I was already asked to use this "SoftIRQ" once.
Anyway, I prefer "softirq" too. I will use it.

>> + *
>> + * Save the starting time of a SoftIRQ. As SoftIRQs are non-preemptive to
>> + * other SoftIRQs, it is safe to use a single variable (osn_var->softirq)
>> + * to save the statistics. The arrival_time is used to report... the
>> + * arrival time. The delta_start is used to compute the duration at the
>> + * SoftIRQ exit handler. See cond_move_softirq_delta_start().
>> + */
>> +void trace_softirq_entry_callback(void *data, unsigned int vec_nr)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> +
>> + if (!osn_var->sampling)
>> + return;
>> + /*
>> + * This value will be used in the report, but not to compute
>> + * the execution time, so it is safe to get it unsafe.
>> + */
>> + osn_var->softirq.arrival_time = time_get();
>> + set_int_safe_time(osn_var, &osn_var->softirq.delta_start);
>> + osn_var->softirq.count++;
>> +
>> + local_inc(&osn_var->int_counter);
>> +}
>> +
>> +/**
>> + * trace_softirq_exit_callback - Note the end of a SoftIRQ
>> + *
>> + * Computes the duration of the SoftIRQ noise, and traces it. Also discounts the
>> + * interference from other sources of noise that could currently be accounted.
>> + */
>> +void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + int duration;
>> +
>> + if (!osn_var->sampling)
>> + return;
>> +
>> + duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
>> + trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
>> + cond_move_thread_delta_start(osn_var, duration);
>> + osn_var->softirq.arrival_time = 0;
>> +}
>> +
>> +/**
>> + * hook_softirq_events - Hook SoftIRQ handling events
>> + *
>> + * This function hooks the SoftIRQ related callbacks to the respective trace
>> + * events.
>> + */
>> +static int hook_softirq_events(void)
>> +{
>> + int ret;
>> +
>> + ret = register_trace_softirq_entry(trace_softirq_entry_callback, NULL);
>> + if (ret)
>> + goto out_err;
>> +
>> + ret = register_trace_softirq_exit(trace_softirq_exit_callback, NULL);
>> + if (ret)
>> + goto out_unreg_entry;
>> +
>> + return 0;
>> +
>> +out_unreg_entry:
>> + unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
>> +out_err:
>> + return -EINVAL;
>> +}
>> +
>> +/**
>> + * unhook_softirq_events - Unhook SoftIRQ handling events
>> + *
>> + * This function unhooks the SoftIRQ related callbacks from the respective trace
>> + * events.
>> + */
>> +static void unhook_softirq_events(void)
>> +{
>> + unregister_trace_softirq_entry(trace_softirq_entry_callback, NULL);
>> + unregister_trace_softirq_exit(trace_softirq_exit_callback, NULL);
>> +}
>> +#else /* CONFIG_PREEMPT_RT */
>> +/*
>> + * SoftIRQs are threads in the PREEMPT_RT mode.
>> + */
>> +static int hook_softirq_events(void)
>> +{
>> + return 0;
>> +}
>> +static void unhook_softirq_events(void)
>> +{
>> +}
>> +#endif
>> +
>> +/**
>> + * thread_entry - Record the starting of a thread noise window
>> + *
>> + * It saves the context switch time for a noisy thread, and increments
>> + * the interference counters.
>> + */
>> +static void
>> +thread_entry(struct osnoise_variables *osn_var, struct task_struct *t)
>> +{
>> + if (!osn_var->sampling)
>> + return;
>> + /*
>> + * The arrival time will be used in the report, but not to compute
>> + * the execution time, so it is safe to get it unsafe.
>> + */
>> + osn_var->thread.arrival_time = time_get();
>> +
>> + set_int_safe_time(osn_var, &osn_var->thread.delta_start);
>> +
>> + osn_var->thread.count++;
>> + local_inc(&osn_var->int_counter);
>> +}
>> +
>> +/**
>> + * thread_exit - Report the end of a thread noise window
>> + *
>> + * It computes the total noise from a thread, tracing if needed.
>> + */
>> +static void
>> +thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
>> +{
>> + int duration;
>> +
>> + if (!osn_var->sampling)
>> + return;
>> +
>> + duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
>> +
>> + trace_thread_noise(t, osn_var->thread.arrival_time, duration);
>> +
>> + osn_var->thread.arrival_time = 0;
>> +}
>> +
>> +/**
>> + * trace_sched_switch_callback - sched:sched_switch trace event handler
>> + *
>> + * This function is hooked to the sched:sched_switch trace event, and it is
>> + * used to record the beginning and to report the end of a thread noise window.
>> + */
>> +void
>> +trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
>> + struct task_struct *n)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> +
>> + if (p->pid != osn_var->pid)
>> + thread_exit(osn_var, p);
>> +
>> + if (n->pid != osn_var->pid)
>> + thread_entry(osn_var, n);
>> +}
>> +
>> +/**
>> + * hook_thread_events - Hook the instrumentation for thread noise
>> + *
>> + * Hook the osnoise tracer callbacks to handle the noise from other
>> + * threads on the necessary kernel events.
>> + */
>> +int hook_thread_events(void)
>> +{
>> + int ret;
>> +
>> + ret = register_trace_sched_switch(trace_sched_switch_callback, NULL);
>> + if (ret)
>> + return -EINVAL;
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * unhook_thread_events - Unhook the instrumentation for thread noise
>> + *
>> + * Unhook the osnoise tracer callbacks to handle the noise from other
>> + * threads on the necessary kernel events.
>> + */
>> +void unhook_thread_events(void)
>> +{
>> + unregister_trace_sched_switch(trace_sched_switch_callback, NULL);
>> +}
>> +
>> +/**
>> + * save_osn_sample_stats - Save the osnoise_sample statistics
>> + *
>> + * Save the osnoise_sample statistics before the sampling phase. These
>> + * values will be used later to compute the diff between the statistics
>> + * before and after the osnoise sampling.
>> + */
>> +void save_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
>> +{
>> + s->nmi_count = osn_var->nmi.count;
>> + s->irq_count = osn_var->irq.count;
>> + s->softirq_count = osn_var->softirq.count;
>> + s->thread_count = osn_var->thread.count;
>> +}
>> +
>> +/**
>> + * diff_osn_sample_stats - Compute the osnoise_sample statistics
>> + *
>> + * After a sample period, compute the difference in the osnoise_sample
>> + * statistics. The struct osnoise_sample *s contains the statistics saved via
>> + * save_osn_sample_stats() before the osnoise sampling.
>> + */
>> +void diff_osn_sample_stats(struct osnoise_variables *osn_var, struct osnoise_sample *s)
>> +{
>> + s->nmi_count = osn_var->nmi.count - s->nmi_count;
>> + s->irq_count = osn_var->irq.count - s->irq_count;
>> + s->softirq_count = osn_var->softirq.count - s->softirq_count;
>> + s->thread_count = osn_var->thread.count - s->thread_count;
>> +}
>> +
>> +/**
>> + * osnoise_stop_tracing - Stop tracing and the tracer.
>> + */
>> +static void osnoise_stop_tracing(void)
>> +{
>> + tracing_off();
>
> This can run in an instance. The "tracing_off()" affects only the top
> level tracing buffer, not the instance that this may be running in. You
> want tracer_tracing_off(tr).

ack!

>> +}
>> +
>> +/**
>> + * run_osnoise - Sample the time and look for osnoise
>> + *
>> + * Used to capture the time, looking for potential osnoise latency repeatedly.
>> + * Different from hwlat_detector, it is called with preemption and interrupts
>> + * enabled. This allows irqs, softirqs and threads to run, interfering on the
>> + * osnoise sampling thread, as they would do with a regular thread.
>> + */
>> +static int run_osnoise(void)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + u64 noise = 0, sum_noise = 0, max_noise = 0;
>> + struct trace_array *tr = osnoise_trace;
>> + u64 start, sample, last_sample;
>> + u64 last_int_count, int_count;
>> + s64 total, last_total = 0;
>> + struct osnoise_sample s;
>> + unsigned int threshold;
>> + int hw_count = 0;
>> + u64 runtime, stop_in;
>> + int ret = -1;
>> +
>> + /*
>> + * Considers the current thread as the workload.
>> + */
>> + osn_var->pid = current->pid;
>> +
>> + /*
>> + * Save the current stats for the diff
>> + */
>> + save_osn_sample_stats(osn_var, &s);
>> +
>> + /*
>> + * threshold should be at least 1 us.
>> + */
>> + threshold = tracing_thresh ? tracing_thresh : 1000;
>
> BTW, you can write the above as:
>
> threshold = tracing_thresh ? : 1000;
>
> too ;-)
>
>
>> +
>> + /*
>> + * Make sure NMIs see sampling first
>> + */
>> + osn_var->sampling = true;
>> + barrier();
>> +
>> + /*
>> + * Transform the *_us config to nanoseconds to avoid the
>> + * division on the main loop.
>> + */
>> + runtime = osnoise_data.sample_runtime * NSEC_PER_USEC;
>> + stop_in = osnoise_data.stop_tracing_in * NSEC_PER_USEC;
>> +
>> + /*
>> + * Start timestamp
>> + */
>> + start = time_get();
>> +
>> + /*
>> + * "previous" loop
>> + */
>> + last_int_count = set_int_safe_time(osn_var, &last_sample);
>> +
>> + do {
>> + /*
>> + * Get sample!
>> + */
>> + int_count = set_int_safe_time(osn_var, &sample);
>> +
>> + noise = time_sub(sample, last_sample);
>> +
>> + /*
>> + * This shouldn't happen.
>> + */
>> + if (noise < 0) {
>> + pr_err(BANNER "time running backwards\n");
>> + goto out;
>> + }
>> +
>> + /*
>> + * Sample runtime.
>> + */
>> + total = time_sub(sample, start);
>> +
>> + /*
>> + * Check for possible overflows.
>> + */
>> + if (total < last_total) {
>> + pr_err("Time total overflowed\n");
>> + break;
>> + }
>> +
>> + last_total = total;
>> +
>> + if (noise >= threshold) {
>> + int interference = int_count - last_int_count;
>> +
>> + if (noise > max_noise)
>> + max_noise = noise;
>> +
>> + if (!interference)
>> + hw_count++;
>> +
>> + sum_noise += noise;
>> +
>> + trace_sample_threshold(last_sample, noise, interference);
>> +
>> + if (osnoise_data.stop_tracing_in)
>> + if (noise > stop_in)
>> + osnoise_stop_tracing();
>> + }
>> +
>> + /*
>> + * For the non-preemptive kernel config: let threads run, if
>> + * they so wish.
>> + */
>> + cond_resched();
>> +
>> + last_sample = sample;
>> + last_int_count = int_count;
>> +
>> + } while (total < runtime && !kthread_should_stop());
>> +
>> + /*
>> + * Finish the above in the view for interrupts.
>> + */
>> + barrier();
>> +
>> + osn_var->sampling = false;
>> +
>> + /*
>> + * Make sure sampling data is no longer updated.
>> + */
>> + barrier();
>> +
>> + /*
>> + * Save noise info.
>> + */
>> + s.noise = time_to_us(sum_noise);
>> + s.runtime = time_to_us(total);
>> + s.max_sample = time_to_us(max_noise);
>> + s.hw_count = hw_count;
>> +
>> + /* Save interference stats info */
>> + diff_osn_sample_stats(osn_var, &s);
>> +
>> + trace_osnoise_sample(&s);
>> +
>> + /* Keep a running maximum ever recorded osnoise "latency" */
>> + if (max_noise > tr->max_latency) {
>> + tr->max_latency = max_noise;
>> + latency_fsnotify(tr);
>> + }
>> +
>> + if (osnoise_data.stop_tracing_out)
>> + if (s.noise > osnoise_data.stop_tracing_out)
>> + osnoise_stop_tracing();
>> +
>> + return 0;
>> +out:
>> + return ret;
>> +}
>> +
>> +static struct cpumask osnoise_cpumask;
>> +static struct cpumask save_cpumask;
>> +
>> +/*
>> + * osnoise_main - The osnoise detection kernel thread
>> + *
>> + * Calls run_osnoise() function to measure the osnoise for the configured runtime,
>> + * every period.
>> + */
>> +static int osnoise_main(void *data)
>> +{
>> + s64 interval;
>> +
>> + while (!kthread_should_stop()) {
>> +
>> + run_osnoise();
>> +
>> + mutex_lock(&interface_lock);
>> + interval = osnoise_data.sample_period - osnoise_data.sample_runtime;
>> + mutex_unlock(&interface_lock);
>> +
>> + do_div(interval, USEC_PER_MSEC);
>> +
>> + /*
>> + * differently from hwlat_detector, the osnoise tracer can run
>> + * without a pause because preemption is on.
>
>
> Can it? With !CONFIG_PREEMPT, this will never sleep, right? I believe
> there's watchdogs that will trigger if a kernel thread never schedules
> out. I might be wrong, will have to check on this.

There is a cond_resched() in the run_osnoise() loop. Thus making osnoise
"preemptible," allowing other threads to run (and to interfere).

>
>> + */
>> + if (interval < 1)
>> + continue;
>> +
>> + if (msleep_interruptible(interval))
>> + break;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * stop_per_cpu_kthreads - stop per-cpu threads
>> + *
>> + * Stop the osnoise sampling threads. Use this on unload and at system
>> + * shutdown.
>> + */
>> +static void stop_per_cpu_kthreads(void)
>> +{
>> + struct task_struct *kthread;
>> + int cpu;
>> +
>> + for_each_online_cpu(cpu) {
>> + kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
>> + if (kthread)
>> + kthread_stop(kthread);
>> + per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
>> + }
>> +}
>> +
>> +/**
>> + * start_per_cpu_kthreads - Kick off per-cpu osnoise sampling kthreads
>> + *
>> + * This starts the kernel thread that will look for osnoise on many
>> + * cpus.
>> + */
>> +static int start_per_cpu_kthreads(struct trace_array *tr)
>> +{
>> + struct cpumask *current_mask = &save_cpumask;
>> + struct task_struct *kthread;
>> + char comm[24];
>> + int cpu;
>> +
>> + get_online_cpus();
>> + /*
>> + * Run only on CPUs in which trace and osnoise are allowed to run.
>> + */
>> + cpumask_and(current_mask, tr->tracing_cpumask, &osnoise_cpumask);
>> + /*
>> + * And the CPU is online.
>> + */
>> + cpumask_and(current_mask, cpu_online_mask, current_mask);
>> + put_online_cpus();
>> +
>> + for_each_online_cpu(cpu)
>> + per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
>
> I think we want for_each_possible_cpu(), especially since you don't
> have anything protecting the online cpus here since you did
> put_online_cpus. But still, we probably want them all NULL anyway.
>
> Should this have a CPU shutdown notifier, to clean things up if this is
> running while people are playing with CPU hotplug?


Well... it seems to be worth trying! Question: can I take a mutex in the
hotplug callback? (it will be the interface mutex).
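For the record, a rough sketch of what I have in mind (osnoise_busy and the
per-cpu start_kthread()/stop_kthread() helpers are hypothetical names here;
the mutex is the interface one). The cpuhp callbacks run in process context,
so taking a mutex there should be fine, as long as the same mutex is not held
around the cpuhp_setup_state() call itself:

	static int osnoise_cpu_init(unsigned int cpu)
	{
		mutex_lock(&interface_lock);
		/* Start a sampling thread only if the tracer is running. */
		if (osnoise_busy)
			start_kthread(cpu);
		mutex_unlock(&interface_lock);
		return 0;
	}

	static int osnoise_cpu_die(unsigned int cpu)
	{
		mutex_lock(&interface_lock);
		/* Stop the sampling thread of the dying CPU, if any. */
		stop_kthread(cpu);
		mutex_unlock(&interface_lock);
		return 0;
	}

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "trace/osnoise:online",
				osnoise_cpu_init, osnoise_cpu_die);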

>
>> +
>> + for_each_cpu(cpu, current_mask) {
>> + snprintf(comm, 24, "osnoise/%d", cpu);
>> +
>> + kthread = kthread_create_on_cpu(osnoise_main, NULL, cpu, comm);
>> +
>> + if (IS_ERR(kthread)) {
>> + pr_err(BANNER "could not start sampling thread\n");
>> + stop_per_cpu_kthreads();
>> + return -ENOMEM;
>> + }
>> +
>> + per_cpu(per_cpu_osnoise_var, cpu).kthread = kthread;
>> + wake_up_process(kthread);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * osnoise_cpus_read - Read function for reading the "cpus" file
>> + * @filp: The active open file structure
>> + * @ubuf: The userspace provided buffer to read value into
>> + * @cnt: The maximum number of bytes to read
>> + * @ppos: The current "file" position
>> + *
>> + * Prints the "cpus" output into the user-provided buffer.
>> + */
>> +static ssize_t
>> +osnoise_cpus_read(struct file *filp, char __user *ubuf, size_t count,
>> + loff_t *ppos)
>> +{
>> + char *mask_str;
>> + int len;
>> +
>> + len = snprintf(NULL, 0, "%*pbl\n",
>> + cpumask_pr_args(&osnoise_cpumask)) + 1;
>> + mask_str = kmalloc(len, GFP_KERNEL);
>> + if (!mask_str)
>> + return -ENOMEM;
>> +
>> + len = snprintf(mask_str, len, "%*pbl\n",
>> + cpumask_pr_args(&osnoise_cpumask));
>> + if (len >= count) {
>> + count = -EINVAL;
>> + goto out_err;
>> + }
>> + count = simple_read_from_buffer(ubuf, count, ppos, mask_str, len);
>> +
>> +out_err:
>> + kfree(mask_str);
>> +
>> + return count;
>> +}
>> +
>> +/**
>> + * osnoise_cpus_write - Write function for "cpus" entry
>> + * @filp: The active open file structure
>> + * @ubuf: The user buffer that contains the value to write
>> + * @cnt: The maximum number of bytes to write to "file"
>> + * @ppos: The current position in @file
>> + *
>> + * This function provides a write implementation for the "cpus"
>> + * interface to the osnoise trace. By default, it lists all CPUs,
>> + * in this way, allowing osnoise threads to run on any online CPU
>> + * of the system. It serves to restrict the execution of osnoise to the
>> + * set of CPUs written via this interface. Note that osnoise also
>> + * respects the "tracing_cpumask." Hence, osnoise threads will run only
>> + * on the set of CPUs allowed here AND on "tracing_cpumask." Why not
>> + * have just "tracing_cpumask?" Because the user might be interested
>> + * in tracing what is running on other CPUs. For instance, one might
>> + * run osnoise in one HT CPU while observing what is running on the
>> + * sibling HT CPU.
>> + */
>> +static ssize_t
>> +osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
>> + loff_t *ppos)
>> +{
>> + cpumask_var_t osnoise_cpumask_new;
>> + char buf[256];
>> + int err;
>> +
>
> Since this does not affect a running osnoise tracer, we should either
> have it return BUSY if it is running, or have it update the threads
> during the run.

I will do the same thing I did (as you requested) for hwlat's mode: I will
stop/start the tracer to apply the changes.
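Roughly (a sketch, reusing the functions from this patch; osnoise_busy is a
hypothetical flag recording whether the tracer is currently on):

	mutex_lock(&interface_lock);

	if (osnoise_busy)
		stop_per_cpu_kthreads();

	cpumask_copy(&osnoise_cpumask, osnoise_cpumask_new);

	if (osnoise_busy)
		start_per_cpu_kthreads(osnoise_trace);

	mutex_unlock(&interface_lock);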

Thanks Steven!
-- Daniel

2021-06-07 15:49:59

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Mon, 7 Jun 2021 14:00:56 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> I am using these more "generic terms" because they are also used by the timerlat
> tracer.
>
> In the timerlat tracer, the "in" file is used to stop the tracer for a given IRQ
> latency (so, the "inside" operation), while the "out" is used to stop the tracer
> in the thread latency (hence the outside operation).
>
> The total sounds good for the "out"! But the single does not work fine for the
> IRQ... how about: stop_tracing_partial_us ?
>
> It is hard to find a good shared name :-/

What about:

stop_tracing_us and stop_tracing_total_us, and not have anything
special for the first one?

-- Steve

2021-06-08 01:40:01

by Steven Rostedt

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On Fri, 14 May 2021 22:51:18 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> The timerlat tracer aims to help the preemptive kernel developers to
> found souces of wakeup latencies of real-time threads. Like cyclictest,
> the tracer sets a periodic timer that wakes up a thread. The thread then
> computes a *wakeup latency* value as the difference between the *current
> time* and the *absolute time* that the timer was set to expire. The main
> goal of timerlat is tracing in such a way to help kernel developers.
>

Hmm, we should add a way to have wake up tracers only trace a specific
task, where these osnoise trace events would also be useful. That is,
run cyclictest with the wakeup tracer, that it does this for cyclictest
directly. That shouldn't be too difficult to add.

> Usage
>
> Write the ASCII text "timerlat" into the current_tracer file of the
> tracing system (generally mounted at /sys/kernel/tracing).
>
> For example:
>
> [root@f32 ~]# cd /sys/kernel/tracing/
> [root@f32 tracing]# echo timerlat > current_tracer
>
> It is possible to follow the trace by reading the trace trace file::

Do not need rst markup in commit logs ;-)

>
> [root@f32 tracing]# cat trace
> # tracer: timerlat
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # || /
> # |||| ACTIVATION
> # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
> # | | | |||| | | | |
> <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
> <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
> <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
> <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
> <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
> <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
> <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
> <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
>
> The tracer creates a per-cpu kernel thread with real-time priority that
> prints two lines at every activation. The first is the *timer latency*
> observed at the *hardirq* context before the activation of the thread.
> The second is the *timer latency* observed by the thread, which is the
> same level that cyclictest reports. The ACTIVATION ID field

The above is misleading. Below, I see that you state that the values are
"net values" where the thread latency does not include the irq latency.
This is not the same as cyclictest. (I had to update my ASCII art below
after reading the below statement).

> serves to relate the *irq* execution to its respective *thread* execution.
>
> The irq/thread splitting is important to clarify at which context
> the unexpected high value is coming from. The *irq* context can be
> delayed by hardware related actions, such as SMIs, NMIs, IRQs
> or by a thread masking interrupts. Once the timer happens, the delay
> can also be influenced by blocking caused by threads. For example, by
> postponing the scheduler execution via preempt_disable(), by the
> scheduler execution, or by masking interrupts. Threads can
> also be delayed by the interference from other threads and IRQs.


I wonder if ASCII art would help clarify the above. At least for the
document (not the change log here).


time ==>
           expected        actual    thread
            wakeup        wakeup    scheduled
              |              |          |
              v              v          v
    |---------|-------|------|----------|
                      ^
                      |
                  interrupt

              |--------------|
                irq latency

                             |----------|
                             thread latency


>
> The timerlat can also take advantage of the osnoise: traceevents.
> For example:
>
> [root@f32 ~]# cd /sys/kernel/tracing/
> [root@f32 tracing]# echo timerlat > current_tracer
> [root@f32 tracing]# echo osnoise > set_event
> [root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
> [root@f32 tracing]# tail -10 trace
> cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
> cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
> cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
> cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
> cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
> cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
> cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
> cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
> cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
> timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns
>
> For further information see: Documentation/trace/timerlat-tracer.rst
>


>
> diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
> index 608107b27cc0..3769b9b7aed8 100644
> --- a/Documentation/trace/index.rst
> +++ b/Documentation/trace/index.rst
> @@ -24,6 +24,7 @@ Linux Tracing Technologies
> boottime-trace
> hwlat_detector
> osnoise-tracer
> + timerlat-tracer
> intel_th
> ring-buffer-design
> stm
> diff --git a/Documentation/trace/timerlat-tracer.rst b/Documentation/trace/timerlat-tracer.rst
> new file mode 100644
> index 000000000000..902d2f9a489f
> --- /dev/null
> +++ b/Documentation/trace/timerlat-tracer.rst
> @@ -0,0 +1,158 @@
> +###############
> +Timerlat tracer
> +###############
> +
> +The timerlat tracer aims to help the preemptive kernel developers to
> +found souces of wakeup latencies of real-time threads. Like cyclictest,

"to find sources"

> +the tracer sets a periodic timer that wakes up a thread. The thread then
> +computes a *wakeup latency* value as the difference between the *current
> +time* and the *absolute time* that the timer was set to expire. The main
> +goal of timerlat is tracing in such a way to help kernel developers.
> +
> +Usage
> +-----
> +
> +Write the ASCII text "timerlat" into the current_tracer file of the
> +tracing system (generally mounted at /sys/kernel/tracing).
> +
> +For example::
> +
> + [root@f32 ~]# cd /sys/kernel/tracing/
> + [root@f32 tracing]# echo timerlat > current_tracer
> +
> +It is possible to follow the trace by reading the trace trace file::
> +
> + [root@f32 tracing]# cat trace
> + # tracer: timerlat
> + #
> + # _-----=> irqs-off
> + # / _----=> need-resched
> + # | / _---=> hardirq/softirq
> + # || / _--=> preempt-depth
> + # || /
> + # |||| ACTIVATION
> + # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
> + # | | | |||| | | | |
> + <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
> + <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
> + <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
> + <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
> + <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
> + <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
> + <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
> + <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
> +
> +
> +The tracer creates a per-cpu kernel thread with real-time priority that
> +prints two lines at every activation. The first is the *timer latency*
> +observed at the *hardirq* context before the activation of the thread.
> +The second is the *timer latency* observed by the thread, which is the
> +same level that cyclictest reports. The ACTIVATION ID field
> +serves to relate the *irq* execution to its respective *thread* execution.
> +
> +The *irq*/*thread* splitting is important to clarify at which context
> +the unexpected high value is coming from. The *irq* context can be
> +delayed by hardware related actions, such as SMIs, NMIs, IRQs
> +or by a thread masking interrupts. Once the timer happens, the delay
> +can also be influenced by blocking caused by threads. For example, by
> +postponing the scheduler execution via preempt_disable(), by the
> +scheduler execution, or by masking interrupts. Threads can
> +also be delayed by the interference from other threads and IRQs.

This is where I would add that ASCII art.

> +
> +Tracer options
> +---------------------
> +
> +The timerlat tracer is built on top of osnoise tracer.
> +So its configuration is also done in the osnoise/ config
> +directory. The timerlat configs are:
> +
> + - cpus: CPUs at which a timerlat thread will execute.
> + - timerlat_period_us: the period of the timerlat thread.
> + - osnoise/stop_tracing_in_us: stop the system tracing if a
> + timer latency at the *irq* context higher than the configured
> + value happens. Writing 0 disables this option.
> + - stop_tracing_out_us: stop the system tracing if a
> + timer latency at the *thread* context higher than the configured
> + value happens. Writing 0 disables this option.
> + - print_stack: save the stack of the IRQ occurrence, and print
> + it after the *thread* read the latency.

"thread read the latency" doesn't make sense.

"and print it after the *thread context* event". ?


> +
> +timerlat and osnoise
> +----------------------------
> +
> +The timerlat can also take advantage of the osnoise: traceevents.
> +For example::
> +
> + [root@f32 ~]# cd /sys/kernel/tracing/
> + [root@f32 tracing]# echo timerlat > current_tracer
> + [root@f32 tracing]# echo osnoise > set_event

Note, set_event should be deprecated. Use:

echo 1 > events/osnoise/enable

instead.


> + [root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
> + [root@f32 tracing]# tail -10 trace
> + cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
> + cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
> + cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
> + cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
> + cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
> + cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
> + cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
> + cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
> + cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
> + timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns
> +
> +In this case, the root cause of the timer latency does not point for a
> +single, but to a series of call_function_single IPIs, followed by a 10

"not point to a single"

> +*us* delay from a cc1 thread noise, along with the regular timer
> +activation. It is worth mentioning that the *duration* values reported
> +by the osnoise events are *net* values. For example, the
> +thread_noise does not include the duration of the overhead caused
> +by the IRQ execution (which indeed accounted for 12736 ns).

As stated above, I updated my view of the ASCII art after reading this. You
should not compare what cyclictest reports as the thread latency. But what
cyclictest reports is the sum of the two (irq latency plus thread latency).


> +
> +Such pieces of evidence are useful for the developer to use other
> +tracing methods to figure out how to optimize the environment.
> +
> +IRQ stacktrace
> +---------------------------
> +
> +The osnoise/print_stack option is helpful for the cases in which a thread
> +noise is the major factor for the timer latency, because of preemption or
> +irqs being disabled. For example::
> +
> + [root@f32 tracing]# echo 500 > osnoise/stop_tracing_out_us
> + [root@f32 tracing]# echo 500 > osnoise/print_stack
> + [root@f32 tracing]# echo timerlat > current_tracer
> + [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
> + insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
> + insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns
> + insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
> + insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
> + insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns
> + timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns
> + timerlat/7-1001 [007] ....1.. 200.203446: <stack trace>
> + => timerlat_irq
> + => __hrtimer_run_queues
> + => hrtimer_interrupt
> + => __sysvec_apic_timer_interrupt
> + => asm_call_irq_on_stack
> + => sysvec_apic_timer_interrupt
> + => asm_sysvec_apic_timer_interrupt
> + => delay_tsc
> + => dummy_load_1ms_pd_init
> + => do_one_initcall
> + => do_init_module
> + => __do_sys_finit_module
> + => do_syscall_64
> + => entry_SYSCALL_64_after_hwframe
> +
> +In this case, it is possible to see that the thread added the highest
> +contribution to the *timer latency* and the stack trace points to
> +a function named dummy_load_1ms_pd_init, which had the following
> +code (on purpose)::

Should add here as well that the stack is saved at the time of interrupt,
and not at the time it is reported.

> +
> + static int __init dummy_load_1ms_pd_init(void)
> + {
> + preempt_disable();
> + mdelay(1);
> + preempt_enable();
> + return 0;
> +
> + }
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 41582ae4682b..d567b1717c4c 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -390,6 +390,34 @@ config OSNOISE_TRACER
> To enable this tracer, echo in "osnoise" into the current_tracer
> file.
>
> +config TIMERLAT_TRACER
> + bool "Timerlat tracer"
> + select OSNOISE_TRACER
> + select GENERIC_TRACER
> + help
> + The timerlat tracer aims to help the preemptive kernel developers
> + to find sources of wakeup latencies of real-time threads.
> +
> + The tracer creates a per-cpu kernel thread with real-time priority.
> + The tracer thread sets a periodic timer to wakeup itself, and goes
> + to sleep waiting for the timer to fire. At the wakeup, the thread
> + then computes a wakeup latency value as the difference between
> + the current time and the absolute time that the timer was set
> + to expire.
> +
> + The tracer prints two lines at every activation. The first is the
> + timer latency observed at the hardirq context before the
> + activation of the thread. The second is the timer latency observed
> + by the thread, which is the same level that cyclictest reports. The

Again, change the reference to cyclictest here. As what cyclictest reports
is the sum of the two. Saying the "same level that cyclictest reports" is
misleading.

> + ACTIVATION ID field serves to relate the irq execution to its
> + respective thread execution.
> +
> + The tracer is build on top of osnoise tracer, and the osnoise:
> + events can be used to trace the source of interference from NMI,
> + IRQs and other threads. It also enables the capture of the
> + stacktrace at the IRQ context, which helps to identify the code
> + path that can cause thread delay.
> +
> config MMIOTRACE
> bool "Memory mapped IO tracing"
> depends on HAVE_MMIOTRACE_SUPPORT && PCI
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 754dfe8987a2..889986242c40 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -45,6 +45,7 @@ enum trace_type {
> TRACE_BPUTS,
> TRACE_HWLAT,
> TRACE_OSNOISE,
> + TRACE_TIMERLAT,
> TRACE_RAW_DATA,
> TRACE_FUNC_REPEATS,
>
> @@ -448,6 +449,7 @@ extern void __ftrace_bad_type(void);
> IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS); \
> IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT); \
> IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
> + IF_ASSIGN(var, ent, struct timerlat_entry, TRACE_TIMERLAT);\
> IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
> IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
> TRACE_MMIO_RW); \
> diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
> index 158c0984b59b..cd41e863b51c 100644
> --- a/kernel/trace/trace_entries.h
> +++ b/kernel/trace/trace_entries.h
> @@ -385,3 +385,19 @@ FTRACE_ENTRY(osnoise, osnoise_entry,
> __entry->softirq_count,
> __entry->thread_count)
> );
> +
> +FTRACE_ENTRY(timerlat, timerlat_entry,
> +
> + TRACE_TIMERLAT,
> +
> + F_STRUCT(
> + __field( unsigned int, seqnum )
> + __field( int, context )
> + __field( u64, timer_latency )
> + ),
> +
> + F_printk("seq:%u\tcontext:%d\ttimer_latency:%llu\n",
> + __entry->seqnum,
> + __entry->context,
> + __entry->timer_latency)
> +);
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 9bd40b514d84..3a8d70fbb57f 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * OS Noise Tracer: computes the OS Noise suffered by a running thread.
> + * Timerlat Tracer: measures the wakeup latency of a timer triggered IRQ and thread.
> *
> * Based on "hwlat_detector" tracer by:
> * Copyright (C) 2008-2009 Jon Masters, Red Hat, Inc. <[email protected]>
> @@ -21,6 +22,7 @@
> #include <linux/cpumask.h>
> #include <linux/delay.h>
> #include <linux/sched/clock.h>
> +#include <uapi/linux/sched/types.h>
> #include <linux/sched.h>
> #include "trace.h"
>
> @@ -45,6 +47,9 @@ static struct trace_array *osnoise_trace;
> #define DEFAULT_SAMPLE_PERIOD 1000000 /* 1s */
> #define DEFAULT_SAMPLE_RUNTIME 1000000 /* 1s */
>
> +#define DEFAULT_TIMERLAT_PERIOD 1000 /* 1ms */
> +#define DEFAULT_TIMERLAT_PRIO 95 /* FIFO 95 */
> +
> /*
> * NMI runtime info.
> */
> @@ -62,6 +67,8 @@ struct irq {
> u64 delta_start;
> };
>
> +#define IRQ_CONTEXT 0
> +#define THREAD_CONTEXT 1
> /*
> * SofIRQ runtime info.
> */
> @@ -108,32 +115,76 @@ static inline struct osnoise_variables *this_cpu_osn_var(void)
> return this_cpu_ptr(&per_cpu_osnoise_var);
> }
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> +/*
> + * Runtime information for the timer mode.
> + */
> +struct timerlat_variables {
> + struct task_struct *kthread;
> + struct hrtimer timer;
> + u64 rel_period;
> + u64 abs_period;
> + bool tracing_thread;
> + u64 count;
> +};

Like with the osnoise comment, put in tabs to make the fields stand out.
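I.e., something like:

	struct timerlat_variables {
		struct task_struct	*kthread;
		struct hrtimer		timer;
		u64			rel_period;
		u64			abs_period;
		bool			tracing_thread;
		u64			count;
	};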

> +
> +DEFINE_PER_CPU(struct timerlat_variables, per_cpu_timerlat_var);
> +
> /**
> - * osn_var_reset - Reset the values of the given osnoise_variables
> + * this_cpu_tmr_var - Return the per-cpu timerlat_variables on its relative CPU
> + */
> +static inline struct timerlat_variables *this_cpu_tmr_var(void)
> +{
> + return this_cpu_ptr(&per_cpu_timerlat_var);
> +}
> +
> +/**
> + * tlat_var_reset - Reset the values of the given timerlat_variables
> */
> -static inline void osn_var_reset(struct osnoise_variables *osn_var)
> +static inline void tlat_var_reset(void)
> {
> + struct timerlat_variables *tlat_var;
> + int cpu;
> /*
> * So far, all the values are initialized as 0, so
> * zeroing the structure is perfect.
> */
> - memset(osn_var, 0, sizeof(struct osnoise_variables));
> + for_each_cpu(cpu, cpu_online_mask) {
> + tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
> + memset(tlat_var, 0, sizeof(struct timerlat_variables));
> + }
> }
> +#else /* CONFIG_TIMERLAT_TRACER */
> +#define tlat_var_reset() do {} while (0)
> +#endif /* CONFIG_TIMERLAT_TRACER */
>
> /**
> - * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
> + * osn_var_reset - Reset the values of the given osnoise_variables
> */
> -static inline void osn_var_reset_all(void)
> +static inline void osn_var_reset(void)
> {
> struct osnoise_variables *osn_var;
> int cpu;
>
> + /*
> + * So far, all the values are initialized as 0, so
> + * zeroing the structure is perfect.
> + */
> for_each_cpu(cpu, cpu_online_mask) {
> osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> - osn_var_reset(osn_var);
> + memset(osn_var, 0, sizeof(struct osnoise_variables));
> }
> }
>
> +/**
> + * osn_var_reset_all - Reset the value of all per-cpu osnoise_variables
> + */
> +static inline void osn_var_reset_all(void)
> +{
> + osn_var_reset();
> + tlat_var_reset();
> +}
> +
> /*
> * Tells NMIs to call back to the osnoise tracer to record timestamps.
> */
> @@ -154,6 +205,18 @@ struct osnoise_sample {
> int thread_count; /* # Threads during this sample */
> };
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> +/*
> + * timerlat sample structure definition. Used to store the statistics of
> + * a sample run.
> + */
> +struct timerlat_sample {
> + u64 seqnum; /* unique sequence */

The seqnum in the event is unsigned int, whereas here it's u64.
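One of the two should change so that they match; e.g., widening the event
field (sketch):

	__field(	u64,	seqnum	)

or narrowing the seqnum here to unsigned int.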

> + u64 timer_latency; /* timer_latency */
> + int context; /* timer context */
> +};
> +#endif
> +
> /*
> * Protect the interface.
> */
> @@ -165,13 +228,23 @@ struct mutex interface_lock;
> static struct osnoise_data {
> u64 sample_period; /* total sampling period */
> u64 sample_runtime; /* active sampling portion of period */
> - u64 stop_tracing_in; /* stop trace in the inside operation (loop) */
> - u64 stop_tracing_out; /* stop trace in the outside operation (report) */
> + u64 stop_tracing_in; /* stop trace in the inside operation (loop/irq) */
> + u64 stop_tracing_out; /* stop trace in the outside operation (report/thread) */
> +#ifdef CONFIG_TIMERLAT_TRACER
> + u64 timerlat_period; /* timerlat period */
> + u64 print_stack; /* print IRQ stack if total > */
> + int timerlat_tracer; /* timerlat tracer */
> +#endif
> } osnoise_data = {
> .sample_period = DEFAULT_SAMPLE_PERIOD,
> .sample_runtime = DEFAULT_SAMPLE_RUNTIME,
> .stop_tracing_in = 0,
> .stop_tracing_out = 0,
> +#ifdef CONFIG_TIMERLAT_TRACER
> + .print_stack = 0,
> + .timerlat_period = DEFAULT_TIMERLAT_PERIOD,
> + .timerlat_tracer = 0,
> +#endif
> };
>
> /*
> @@ -232,6 +305,125 @@ static void trace_osnoise_sample(struct osnoise_sample *sample)
> trace_buffer_unlock_commit_nostack(buffer, event);
> }
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> +/*
> + * Print the timerlat header info.
> + */
> +static void print_timerlat_headers(struct seq_file *s)
> +{
> + seq_puts(s, "# _-----=> irqs-off\n");
> + seq_puts(s, "# / _----=> need-resched\n");
> + seq_puts(s, "# | / _---=> hardirq/softirq\n");
> + seq_puts(s, "# || / _--=> preempt-depth\n");
> + seq_puts(s, "# || /\n");
> + seq_puts(s, "# |||| ACTIVATION\n");
> + seq_puts(s, "# TASK-PID CPU# |||| TIMESTAMP ID ");
> + seq_puts(s, " CONTEXT LATENCY\n");
> + seq_puts(s, "# | | | |||| | | ");
> + seq_puts(s, " | |\n");
> +}
> +
> +/*
> + * Record an timerlat_sample into the tracer buffer.
> + */
> +static void trace_timerlat_sample(struct timerlat_sample *sample)
> +{
> + struct trace_array *tr = osnoise_trace;
> + struct trace_event_call *call = &event_osnoise;
> + struct trace_buffer *buffer = tr->array_buffer.buffer;
> + struct ring_buffer_event *event;
> + struct timerlat_entry *entry;
> +
> + event = trace_buffer_lock_reserve(buffer, TRACE_TIMERLAT, sizeof(*entry),
> + tracing_gen_ctx());
> + if (!event)
> + return;
> + entry = ring_buffer_event_data(event);
> + entry->seqnum = sample->seqnum;
> + entry->context = sample->context;
> + entry->timer_latency = sample->timer_latency;
> +
> + if (!call_filter_check_discard(call, entry, buffer, event))
> + trace_buffer_unlock_commit_nostack(buffer, event);
> +}
> +
> +#ifdef CONFIG_STACKTRACE
> +/*
> + * Stack trace will take place only at IRQ level, so, no need
> + * to control nesting here.
> + */
> +struct trace_stack {
> + int stack_size;
> + int nr_entries;
> + unsigned long calls[PAGE_SIZE];

That is rather big. It's 8 * PAGE_SIZE. I don't think that's what you really
wanted.
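Presumably one page worth of entries was the intent, i.e. something like:

	unsigned long	calls[PAGE_SIZE / sizeof(unsigned long)];

or simply a fixed number of entries (e.g. 256), which is plenty for a
stack trace.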

> +};
> +
> +static DEFINE_PER_CPU(struct trace_stack, trace_stack);
> +
> +/**

Again, remove the KernelDoc notation of /**, or make it real kerneldoc
notation.

> + * timerlat_save_stack - save a stack trace without printing
> + *
> + * Save the current stack trace without printing. The
> + * stack will be printed later, after the end of the measurement.
> + */
> +static void timerlat_save_stack(int skip)
> +{
> + unsigned int size, nr_entries;
> + struct trace_stack *fstack;
> +
> + fstack = this_cpu_ptr(&trace_stack);
> +
> + size = ARRAY_SIZE(fstack->calls);
> +
> + nr_entries = stack_trace_save(fstack->calls, size, skip);
> +
> + fstack->stack_size = nr_entries * sizeof(unsigned long);
> + fstack->nr_entries = nr_entries;
> +
> + return;
> +
> +}
> +/**
> + * timerlat_dump_stack - dump a stack trace previously saved
> + *
> + * Dump a saved stack trace into the trace buffer.
> + */
> +static void timerlat_dump_stack(void)
> +{
> + struct trace_event_call *call = &event_osnoise;
> + struct trace_array *tr = osnoise_trace;
> + struct trace_buffer *buffer = tr->array_buffer.buffer;
> + struct ring_buffer_event *event;
> + struct trace_stack *fstack;
> + struct stack_entry *entry;
> + unsigned int size;
> +
> + preempt_disable_notrace();
> + fstack = this_cpu_ptr(&trace_stack);
> + size = fstack->stack_size;
> +
> + event = trace_buffer_lock_reserve(buffer, TRACE_STACK, sizeof(*entry) + size,
> + tracing_gen_ctx());
> + if (!event)
> + goto out;
> +
> + entry = ring_buffer_event_data(event);
> +
> + memcpy(&entry->caller, fstack->calls, size);
> + entry->size = fstack->nr_entries;
> +
> + if (!call_filter_check_discard(call, entry, buffer, event))
> + trace_buffer_unlock_commit_nostack(buffer, event);
> +
> +out:
> + preempt_enable_notrace();
> +}
> +#else
> +#define timerlat_dump_stack() do {} while (0)
> +#define timerlat_save_stack(a) do {} while (0)
> +#endif /* CONFIG_STACKTRACE */
> +#endif /* CONFIG_TIMERLAT_TRACER */
> +
> /**
> * Macros to encapsulate the time capturing infrastructure.
> */
> @@ -373,6 +565,30 @@ set_int_safe_time(struct osnoise_variables *osn_var, u64 *time)
> return int_counter;
> }
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> +/**
> + * copy_int_safe_time - Copy *src into *desc aware of interference
> + */
> +static u64
> +copy_int_safe_time(struct osnoise_variables *osn_var, u64 *dst, u64 *src)
> +{
> + u64 int_counter;
> +
> + do {
> + int_counter = local_read(&osn_var->int_counter);
> + /* synchronize with interrupts */
> + barrier();
> +
> + *dst = *src;
> +
> + /* synchronize with interrupts */
> + barrier();
> + } while (int_counter != local_read(&osn_var->int_counter));

Note, that the loop is unnecessary on 64 bit machines.

The compiler should not split loads, in such cases.
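In other words, on 64 bit something like this should be enough (sketch):

	static u64
	copy_int_safe_time(struct osnoise_variables *osn_var, u64 *dst, u64 *src)
	{
		/*
		 * A naturally aligned 64-bit load cannot be torn by an
		 * interrupt, so there is no need for a retry loop here.
		 */
		*dst = READ_ONCE(*src);

		return local_read(&osn_var->int_counter);
	}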

> +
> + return int_counter;
> +}
> +#endif /* CONFIG_TIMERLAT_TRACER */
> +
> /**
> * trace_osnoise_callback - NMI entry/exit callback
> *
> @@ -801,6 +1017,22 @@ void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
> if (!osn_var->sampling)
> return;
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> + /*
> + * If the timerlat is enabled, but the irq handler did
> + * not run yet enabling timerlat_tracer, do not trace.
> + */
> + if (unlikely(osnoise_data.timerlat_tracer)) {
> + struct timerlat_variables *tlat_var;
> + tlat_var = this_cpu_tmr_var();
> + if (!tlat_var->tracing_thread) {

What happens if the timer interrupt triggers here?

> + osn_var->softirq.arrival_time = 0;
> + osn_var->softirq.delta_start = 0;
> + return;
> + }
> + }
> +#endif
> +
> duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
> trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
> cond_move_thread_delta_start(osn_var, duration);
> @@ -893,6 +1125,18 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
> if (!osn_var->sampling)
> return;
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> + if (osnoise_data.timerlat_tracer) {
> + struct timerlat_variables *tlat_var;
> + tlat_var = this_cpu_tmr_var();
> + if (!tlat_var->tracing_thread) {

Or here?

> + osn_var->thread.delta_start = 0;
> + osn_var->thread.arrival_time = 0;
> + return;
> + }
> + }
> +#endif
> +
> duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
>
> trace_thread_noise(t, osn_var->thread.arrival_time, duration);
> @@ -1182,6 +1426,197 @@ static int osnoise_main(void *data)
> return 0;
> }
>
> +#ifdef CONFIG_TIMERLAT_TRACER
> +/**
> + * timerlat_irq - hrtimer handler for timerlat.
> + */
> +static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + struct trace_array *tr = osnoise_trace;
> + struct timerlat_variables *tlat;
> + struct timerlat_sample s;
> + u64 now;
> + u64 diff;
> +
> + /*
> + * I am not sure if the timer was armed for this CPU. So, get
> + * the timerlat struct from the timer itself, not from this
> + * CPU.
> + */
> + tlat = container_of(timer, struct timerlat_variables, timer);
> +
> + now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
> +
> + /*
> + * Enable the osnoise: events for thread and softirq.
> + */
> + tlat->tracing_thread = true;
> +
> + osn_var->thread.arrival_time = time_get();
> +
> + /*
> + * A hardirq is running: the timer IRQ. It is for sure preempting
> + * a thread, and potentially preempting a softirq.
> + *
> + * At this point, it is not interesting to know the duration of the
> + * preempted thread (and maybe softirq), but how much time they will
> + * delay the beginning of the execution of the timer thread.
> + *
> + * To get the correct (net) delay added by the softirq, its delta_start
> + * is set as the IRQ one. In this way, at the return of the IRQ, the delta
> + * start of the softirq will be zeroed, accounting then only the time
> + * after that.
> + *
> + * The thread follows the same principle. However, if a softirq is
> + * running, the thread needs to receive the softirq delta_start. The
> + * reason is that the softirq will be the last to be unfolded,
> + * resetting the thread delay to zero.
> + */
> +#ifndef CONFIG_PREEMPT_RT
> + if (osn_var->softirq.delta_start) {
> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
> + &osn_var->softirq.delta_start);

Isn't softirq.delta_start going to be zero here? It doesn't look to get
updated until you set tracing_thread to true, but that happens here, and as
this is in a interrupt context, there will not be a softirq happening
between the setting of that to true to this point.

> +
> + copy_int_safe_time(osn_var, &osn_var->softirq.delta_start,
> + &osn_var->irq.delta_start);
> + } else {
> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
> + &osn_var->irq.delta_start);
> + }
> +#else /* CONFIG_PREEMPT_RT */
> + /*
> + * The softirqs run as threads on RT, so there is no need
> + * to keep track of it.
> + */
> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start, &osn_var->irq.delta_start);
> +#endif /* CONFIG_PREEMPT_RT */
> +
> + /*
> + * Compute the current time with the expected time.
> + */
> + diff = now - tlat->abs_period;
> +
> + tlat->count++;
> + s.seqnum = tlat->count;
> + s.timer_latency = diff;
> + s.context = IRQ_CONTEXT;
> +
> + trace_timerlat_sample(&s);
> +
> + /* Keep a running maximum ever recorded os noise "latency" */
> + if (diff > tr->max_latency) {
> + tr->max_latency = diff;
> + latency_fsnotify(tr);
> + }
> +
> + if (osnoise_data.stop_tracing_in)
> + if (time_to_us(diff) >= osnoise_data.stop_tracing_in)
> + osnoise_stop_tracing();
> +
> + wake_up_process(tlat->kthread);
> +
> +#ifdef CONFIG_STACKTRACE
> + if (osnoise_data.print_stack)
> + timerlat_save_stack(0);
> +#endif

No need for the #ifdef above. timerlat_save_stack() is defined as a nop
when not enabled, and the compiler will just optimize this out.

> +
> + return HRTIMER_NORESTART;
> +}
> +
> +/**
> + * wait_next_period - Wait for the next period for timerlat
> + */
> +static int wait_next_period(struct timerlat_variables *tlat)
> +{
> + ktime_t next_abs_period, now;
> + u64 rel_period = osnoise_data.timerlat_period * 1000;
> +
> + now = hrtimer_cb_get_time(&tlat->timer);
> + next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> +
> + /*
> + * Save the next abs_period.
> + */
> + tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> +
> + /*
> + * If the new abs_period is in the past, skip the activation.
> + */
> + while (ktime_compare(now, next_abs_period) > 0) {
> + next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> + tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> + }
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
> + schedule();
> + return 1;
> +}
> +
> +/**
> + * timerlat_main- Timerlat main
> + */
> +static int timerlat_main(void *data)
> +{
> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> + struct timerlat_variables *tlat = this_cpu_tmr_var();
> + struct timerlat_sample s;
> + struct sched_param sp;
> + u64 now, diff;
> +
> + /*
> + * Make the thread RT, that is how cyclictest is usually used.
> + */
> + sp.sched_priority = DEFAULT_TIMERLAT_PRIO;
> + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);

Hmm, I thought Peter Zijlstra was removing all sched_setscheduler*() calls
in the kernel :-/ Although, this one seems legit, and we are not running
from within a module.

-- Steve

> +
> + tlat->count = 0;
> + tlat->tracing_thread = false;
> +
> + hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
> + tlat->timer.function = timerlat_irq;
> + tlat->kthread = current;
> + osn_var->pid = current->pid;
> + /*
> + * Annotate the arrival time.
> + */
> + tlat->abs_period = hrtimer_cb_get_time(&tlat->timer);
> +
> + wait_next_period(tlat);
> +
> + osn_var->sampling = 1;
> +
> + while (!kthread_should_stop()) {
> + now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
> + diff = now - tlat->abs_period;
> +
> + s.seqnum = tlat->count;
> + s.timer_latency = diff;
> + s.context = THREAD_CONTEXT;
> +
> + trace_timerlat_sample(&s);
> +
> +#ifdef CONFIG_STACKTRACE
> + if (osnoise_data.print_stack)
> + if (osnoise_data.print_stack <= time_to_us(diff))
> + timerlat_dump_stack();
> +#endif /* CONFIG_STACKTRACE */
> +
> + tlat->tracing_thread = false;
> + if (osnoise_data.stop_tracing_out)
> + if (time_to_us(diff) >= osnoise_data.stop_tracing_out)
> + osnoise_stop_tracing();
> +
> + wait_next_period(tlat);
> + }
> +
> + hrtimer_cancel(&tlat->timer);
> + return 0;
> +}
> +#endif /* CONFIG_TIMERLAT_TRACER */

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/7/21 5:47 PM, Steven Rostedt wrote:
>> I am using these more "generic terms" because they are also used by the timerlat
>> tracer.
>>
>> In the timerlat tracer, the "in" file is used to stop the tracer for a given IRQ
>> latency (so, the "inside" operation), while the "out" is used to stop the tracer
>> in the thread latency (hence the outside operation).
>>
>> The total sounds good for the "out"! But the single does not work fine for the
>> IRQ... how about: stop_tracing_partial_us ?
>>
>> It is hard to find a good shared name :-/
> What about:
>
> stop_tracing_us and stop_tracing_total_us, and not have anything
> special for the first one?
I cannot find a better name... and it makes sense: if an "in" value on osnoise
or an IRQ latency on timerlat is higher than "stop_tracing_us"... it is more
important than the total... so it indeed deserves the more intuitive name.

(working on osnoise changes now...)

-- Daniel
> -- Steve
>

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/4/21 11:28 PM, Steven Rostedt wrote:
>> + /*
>> + * This is an evidence of race conditions that cause
>> + * a value to be "discounted" too much.
>> + */
>> + if (duration < 0)
>> + pr_err("int safe negative!\n");
> Probably want to have this happen at most once a run. If something were
> to break, I don't think we want this to live lock the machine doing
> tons of prints. We could have a variable stored on the
> osnoise_variables that states this was printed. Check that variable to
> see if it wasn't printed during a run (when current_tracer was set),
> and print only once if it is.

I created a "bool tainted" variable, which is set to true if any problem with
the time() related functions happens. I will pr_warn that there is a problem
on _start(), but also print this info at the top of the tracer header, so it
is also clear from the trace output.

Thoughts?

-- Daniel


Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/8/21 7:39 PM, Steven Rostedt wrote:
>> I created a "bool tainted" variable, which is set to true if any problem with
>> the time() related functions happens. I will pr_warn that there is a problem
>> on _start(), but also print this info at the top of the tracer header, so it
>> is also clear from the trace output.
>>
>> Thoughts?
>>
> Or perhaps have that pr_err() actually be written into the trace buffer?
>
> You can use
>
> trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_, "string");
>
> without it triggering that nasty trace_printk() notice ;-)

cool! I created a function osnoise_taint(char *msg) that prints the msg using
trace_array_printk_buf. I am using it instead of all the pr_warn() calls that
could take place during osnoise's regular operation.

I am still placing the note in the header, just in case we miss the message in
the log.

-- Daniel

> -- Steve
>

2021-06-09 08:53:21

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Tue, 8 Jun 2021 19:17:55 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> On 6/4/21 11:28 PM, Steven Rostedt wrote:
> >> + /*
> >> + * This is an evidence of race conditions that cause
> >> + * a value to be "discounted" too much.
> >> + */
> >> + if (duration < 0)
> >> + pr_err("int safe negative!\n");
> > Probably want to have this happen at most once a run. If something were
> > to break, I don't think we want this to live lock the machine doing
> > tons of prints. We could have a variable stored on the
> > osnoise_variables that states this was printed. Check that variable to
> > see if it wasn't printed during a run (when current_tracer was set),
> > and print only once if it is.
>
> I created a "bool tainted" variable, which is set to true if any problem with
> the time() related functions happens. I will pr_warn that there is a problem
> on _start(), but also print this info at the top of the tracer header, so it
> is also clear from the trace output.
>
> Thoughts?
>

Or perhaps have that pr_err() actually be written into the trace buffer?

You can use

trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_, "string");

without it triggering that nasty trace_printk() notice ;-)

-- Steve

2021-06-09 11:24:19

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Tue, 8 Jun 2021 21:33:31 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> cool! I created a function osnoise_taint(char *msg) that prints the msg using
> trace_array_printk_buf. I am using it instead of all the pr_warn() calls that
> could take place during osnoise's regular operation.

Make it a macro, so that _THIS_IP_ is meaningful.
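Something like (a sketch; where the "tainted" flag lives is up to you, the
osnoise_data placement below is just one option):

	#define osnoise_taint(msg) ({						\
		struct trace_array *__tr = osnoise_trace;			\
										\
		/* note the taint in the trace buffer, at the caller's IP */	\
		trace_array_printk_buf(__tr->array_buffer.buffer,		\
				       _THIS_IP_, msg);				\
		osnoise_data.tainted = true;					\
	})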

>
> I am still placing the note in the header, just in case we miss the message in
> the log.

+1

-- Steve

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/4/21 11:28 PM, Steven Rostedt wrote:
>> +#ifdef CONFIG_X86_LOCAL_APIC
> I wonder if we should move this into a separate file, making the
> __trace_irq_entry() a more name space safe name and have it call that.
> I have a bit of a distaste for arch specific code in a generic file.
>

I am placing the intel specific file in:

arch/x86/kernel/trace_osnoise.c

and the kernel/trace/trace_osnoise.h looks like this:

#ifdef CONFIG_X86_LOCAL_APIC
int osnoise_arch_register(void);
int osnoise_arch_unregister(void);
#else /* CONFIG_X86_LOCAL_APIC */
#define osnoise_arch_register() do {} while (0)
#define osnoise_arch_unregister() do {} while (0)
#endif /* CONFIG_X86_LOCAL_APIC */

This can be used by other archs as well...

sound reasonable?

-- Daniel


>> +/**
>> + * trace_intel_irq_entry - record intel specific IRQ entry
>> + */
>> +void trace_intel_irq_entry(void *data, int vector)
>> +{
>> + __trace_irq_entry(vector);
>> +}
>> +


2021-06-09 18:44:20

by Steven Rostedt

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On Wed, 9 Jun 2021 14:14:17 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> On 6/4/21 11:28 PM, Steven Rostedt wrote:
> >> +#ifdef CONFIG_X86_LOCAL_APIC
> > I wonder if we should move this into a separate file, making the
> > __trace_irq_entry() a more name space safe name and have it call that.
> > I have a bit of a distaste for arch specific code in a generic file.
> >
>
> I am placing the intel specific file in:
>
> arch/x86/kernel/trace_osnoise.c

I would make it just arch/x86/kernel/trace.c

so that it can hold all arch specific tracing information, and not need
to create a file for anything else we might need later.

>
> and the kernel/trace/trace_osnoise.h looks like this:
>
> #ifdef CONFIG_X86_LOCAL_APIC
> int osnoise_arch_register(void);
> int osnoise_arch_unregister(void);
> #else /* CONFIG_X86_LOCAL_APIC */
> #define osnoise_arch_register() do {} while (0)
> #define osnoise_arch_unregister() do {} while (0)
> #endif /* CONFIG_X86_LOCAL_APIC */
>
> This can be used by other archs as well...
>
> sound reasonable?
>

The proper way to do that is to use weak functions in the C code in the
generic file.

int __weak osnoise_arch_register(void)
{
return 0;
}

int __weak osnoise_arch_unregister(void)
{
return 0;
}

Hmm, does the unregister really need a return value?

-- Steve

Subject: Re: [PATCH V3 8/9] tracing: Add osnoise tracer

On 6/9/21 3:03 PM, Steven Rostedt wrote:
> On Wed, 9 Jun 2021 14:14:17 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> On 6/4/21 11:28 PM, Steven Rostedt wrote:
>>>> +#ifdef CONFIG_X86_LOCAL_APIC
>>> I wonder if we should move this into a separate file, making the
>>> __trace_irq_entry() a more name space safe name and have it call that.
>>> I have a bit of a distaste for arch specific code in a generic file.
>>>
>>
>> I am placing the intel specific file in:
>>
>> arch/x86/kernel/trace_osnoise.c
>
> I would make it just arch/x86/kernel/trace.c

moved!

> so that it can hold all arch specific tracing information, and not need
> to create a file for anything else we might need later.
>
>>
>> and the kernel/trace/trace_osnoise.h looks like this:
>>
>> #ifdef CONFIG_X86_LOCAL_APIC
>> int osnoise_arch_register(void);
>> int osnoise_arch_unregister(void);
>> #else /* CONFIG_X86_LOCAL_APIC */
>> #define osnoise_arch_register() do {} while (0)
>> #define osnoise_arch_unregister() do {} while (0)
>> #endif /* CONFIG_X86_LOCAL_APIC */
>>
>> This can be used by other archs as well...
>>
>> sound reasonable?
>>
>
> The proper way to do that is to use weak functions in the C code in the
> generic file.
>
> int __weak osnoise_arch_register(void)
> {
> return 0;
> }
>
> int __weak osnoise_arch_unregister(void)
> {
> return 0;
> }
>
> Hmm, does the unregister really need a return value?

it was always returning 0, changed it to void.

-- Daniel

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On 6/8/21 3:36 AM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:18 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:

[...]
>> It is possible to follow the trace by reading the trace trace file::
>
> Do not need rst markup in commit logs ;-)
>
>>
>> [root@f32 tracing]# cat trace
>> # tracer: timerlat
>> #
>> # _-----=> irqs-off
>> # / _----=> need-resched
>> # | / _---=> hardirq/softirq
>> # || / _--=> preempt-depth
>> # || /
>> # |||| ACTIVATION
>> # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
>> # | | | |||| | | | |
>> <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
>> <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
>> <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
>> <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
>> <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
>> <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
>> <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
>> <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
>>
>> The tracer creates a per-cpu kernel thread with real-time priority that
>> prints two lines at every activation. The first is the *timer latency*
>> observed at the *hardirq* context before the activation of the thread.
>> The second is the *timer latency* observed by the thread, which is the
>> same level that cyclictest reports. The ACTIVATION ID field
>
> The above is misleading. Below, I see that you state that the values are
> "net values" where the thread latency does not include the irq latency.
> This is not the same as cyclictest. (I had to update my ASCII art below
> after reading the below statement).

Replying here to all the comments about the timerlat/cyclictest
timeline and gross/net values.

So, yeah, my description was not clear enough. The values that are
net are those reported by the *osnoise: events* only. The *timerlat
tracer* values are not discounted, and that is why they are similar
to the value reported by cyclictest (ok, cyclictest still captures
the exit to user overhead and friends).

>> serves to relate the *irq* execution to its respective *thread* execution.
>>
>> The irq/thread splitting is important to clarify at which context
>> the unexpected high value is coming from. The *irq* context can be
>> delayed by hardware related actions, such as SMIs, NMIs, IRQs
>> or by a thread masking interrupts. Once the timer happens, the delay
>> can also be influenced by blocking caused by threads. For example, by
>> postponing the scheduler execution via preempt_disable(), by the
>> scheduler execution, or by masking interrupts. Threads can
>> also be delayed by the interference from other threads and IRQs.
>
>
> I wonder if ASCII art would help clarify the above. At least for the
> document (not the change log here).
>
>
> time ==>
>            expected        actual    thread
>             wakeup        wakeup    scheduled
>               |              |          |
>               v              v          v
>     |---------|-------|------|----------|
>                       ^
>                       |
>                   interrupt
>
>               |--------------|
>                 irq latency
>
>                              |----------|
>                              thread latency
>

I liked the idea of adding a timeline!

[...]

>> + [root@f32 tracing]# cat trace
>> + # tracer: timerlat
>> + #
>> + # _-----=> irqs-off
>> + # / _----=> need-resched
>> + # | / _---=> hardirq/softirq
>> + # || / _--=> preempt-depth
>> + # || /
>> + # |||| ACTIVATION
>> + # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
>> + # | | | |||| | | | |
>> + <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
>> + <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
>> + <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
>> + <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
>> + <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
>> + <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
>> + <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
>> + <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
>> +
>> +
>> +The tracer creates a per-cpu kernel thread with real-time priority that
>> +prints two lines at every activation. The first is the *timer latency*
>> +observed at the *hardirq* context before the activation of the thread.
>> +The second is the *timer latency* observed by the thread, which is the
>> +same level that cyclictest reports. The ACTIVATION ID field
>> +serves to relate the *irq* execution to its respective *thread* execution.
>> +
>> +The *irq*/*thread* splitting is important to clarify at which context
>> +the unexpected high value is coming from. The *irq* context can be
>> +delayed by hardware related actions, such as SMIs, NMIs, IRQs
>> +or by a thread masking interrupts. Once the timer happens, the delay
>> +can also be influenced by blocking caused by threads. For example, by
>> +postponing the scheduler execution via preempt_disable(), by the
>> +scheduler execution, or by masking interrupts. Threads can
>> +also be delayed by the interference from other threads and IRQs.
>
> This is where I would add that ASCII art.

I am proposing an ASCII art on another point... see below.

>
>> +
>> +Tracer options
>> +---------------------
>> +
>> +The timerlat tracer is built on top of osnoise tracer.
>> +So its configuration is also done in the osnoise/ config
>> +directory. The timerlat configs are:
>> +
>> + - cpus: CPUs at which a timerlat thread will execute.
>> + - timerlat_period_us: the period of the timerlat thread.
>> + - osnoise/stop_tracing_in_us: stop the system tracing if a
>> + timer latency at the *irq* context higher than the configured
>> + value happens. Writing 0 disables this option.
>> + - stop_tracing_out_us: stop the system tracing if a
>> + timer latency at the *thread* context higher than the configured
>> + value happens. Writing 0 disables this option.
>> + - print_stack: save the stack of the IRQ occurrence, and print
>> + it after the *thread* read the latency.
>
> "thread read the latency" doesn't make sense.
>
> "and print it after the *thread context* event". ?
>
>
>> +
>> +timerlat and osnoise
>> +----------------------------
>> +
>> +The timerlat can also take advantage of the osnoise: traceevents.
>> +For example::
>> +
>> + [root@f32 ~]# cd /sys/kernel/tracing/
>> + [root@f32 tracing]# echo timerlat > current_tracer
>> + [root@f32 tracing]# echo osnoise > set_event
>
> Note, set_event should be deprecated. Use:
>
> echo 1 > events/osnoise/enable
>
> instead.
>
>
>> + [root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
>> + [root@f32 tracing]# tail -10 trace
>> + cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
>> + cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
>> + cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
>> + cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
>> + cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
>> + cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
>> + cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
>> + cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
>> + cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
>> + timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns
>> +
>> +In this case, the root cause of the timer latency does not point for a
>> +single, but to a series of call_function_single IPIs, followed by a 10
>
> "not point to a single"
>
>> +*us* delay from a cc1 thread noise, along with the regular timer
>> +activation. It is worth mentioning that the *duration* values reported
>> +by the osnoise events are *net* values. For example, the
>> +thread_noise does not include the duration of the overhead caused
>> +by the IRQ execution (which indeed accounted for 12736 ns).
>
> As stated above, I updated my view of the ASCII art after reading this. You
> should not equate the thread latency with what cyclictest reports: what
> cyclictest reports is the sum of the two (irq latency plus thread latency).

Here is the point where I mention net values... so I am changing this part of
the documentation to:

------------------- %< -----------------------------
It is worth mentioning that the *duration* values reported
by the osnoise: events are *net* values. For example, the
thread_noise does not include the duration of the overhead caused
by the IRQ execution (which indeed accounted for 12736 ns). But
the values reported by the timerlat tracer (timerlat_latency)
are *gross* values.

The art below illustrates a CPU timeline and how the timerlat tracer
observes it at the top and the osnoise: events at the bottom. Each "-"
in the timelines means 1 us, and the time moves ==>:

 External         context irq                 context thread
  clock          timer_latency                 timer_latency
  event              18 us                         48 us
    |                  ^                             ^
    v                  |                             |
    |------------------|                             |    <-- timerlat irq timeline
    |------------------+-----------------------------|   <-- timerlat thread timeline
                       ^                             ^
===================== CPU timeline ======================================
                       [timerlat/ irq]  [ dev irq ]
[another thread........^              v..^        v..][timerlat/ thread]
===================== CPU timeline ======================================
                       |-------------|   |---------|   <-- irq_noise timeline
                                      |--^        v--------|  <-- thread_noise timeline
                       |                 |                 |
                       |                 |                 + thread_noise: 10 us
                       |                 +-> irq_noise: 9 us
                       +-> irq_noise: 13 us

--------------- >% --------------------------------
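(As a sanity check against the trace above: the *gross* thread timer_latency
of 25960 ns roughly decomposes into the *net* thread_noise (10909 ns) plus the
IRQ execution it excludes (12736 ns), with the remaining ~2300 ns coming from
the timer IRQ latency (1585 ns) and the scheduling of the timerlat thread.)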

thoughts?

-- Daniel

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On 6/8/21 3:36 AM, Steven Rostedt wrote:
> On Fri, 14 May 2021 22:51:18 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> The timerlat tracer aims to help the preemptive kernel developers to
>> found souces of wakeup latencies of real-time threads. Like cyclictest,
>> the tracer sets a periodic timer that wakes up a thread. The thread then
>> computes a *wakeup latency* value as the difference between the *current
>> time* and the *absolute time* that the timer was set to expire. The main
>> goal of timerlat is tracing in such a way to help kernel developers.
>>
>
> Hmm, we should add a way to have wake up tracers only trace a specific
> task, where these osnoise trace events would also be useful. That is,
> run cyclictest with the wakeup tracer, that it does this for cyclictest
> directly. That shouldn't be too difficult to add.

Yep! The osnoise: events are useful for other tracers, and even alone... Indeed,
they are part of the rtsl (which I plan to submit later this year).

It was already on my todo list to find a way to enable the events independently.
It could be as simple as adding a "hook" file to the osnoise/ dir, or hooking
the events when the first osnoise: event gets enabled, and unhooking them when
the last one gets disabled.

I will have a look at enabling them with wakeup tracers along the way.

>> Usage
>>
>> Write the ASCII text "timerlat" into the current_tracer file of the
>> tracing system (generally mounted at /sys/kernel/tracing).
>>
>> For example:
>>
>> [root@f32 ~]# cd /sys/kernel/tracing/
>> [root@f32 tracing]# echo timerlat > current_tracer
>>
>> It is possible to follow the trace by reading the trace trace file::
>
> Do not need rst markup in commit logs ;-)

Oops! :-)

>>
>> [root@f32 tracing]# cat trace
>> # tracer: timerlat
>> #
>> # _-----=> irqs-off
>> # / _----=> need-resched
>> # | / _---=> hardirq/softirq
>> # || / _--=> preempt-depth
>> # || /
>> # |||| ACTIVATION
>> # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
>> # | | | |||| | | | |
>> <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
>> <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
>> <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
>> <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
>> <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
>> <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
>> <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
>> <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
>>
>> The tracer creates a per-cpu kernel thread with real-time priority that
>> prints two lines at every activation. The first is the *timer latency*
>> observed at the *hardirq* context before the activation of the thread.
>> The second is the *timer latency* observed by the thread, which is the
>> same level that cyclictest reports. The ACTIVATION ID field

[..]

>> --- /dev/null
>> +++ b/Documentation/trace/timerlat-tracer.rst
>> @@ -0,0 +1,158 @@
>> +###############
>> +Timerlat tracer
>> +###############
>> +
>> +The timerlat tracer aims to help the preemptive kernel developers to
>> +found souces of wakeup latencies of real-time threads. Like cyclictest,
>
> "to find sources"

Fixed.

[...]
>> +
>> +Tracer options
>> +---------------------
>> +
>> +The timerlat tracer is built on top of osnoise tracer.
>> +So its configuration is also done in the osnoise/ config
>> +directory. The timerlat configs are:
>> +
>> + - cpus: CPUs at which a timerlat thread will execute.
>> + - timerlat_period_us: the period of the timerlat thread.
>> + - osnoise/stop_tracing_in_us: stop the system tracing if a
>> + timer latency at the *irq* context higher than the configured
>> + value happens. Writing 0 disables this option.
>> + - stop_tracing_out_us: stop the system tracing if a
>> + timer latency at the *thread* context higher than the configured
>> + value happens. Writing 0 disables this option.
>> + - print_stack: save the stack of the IRQ ocurrence, and print
>> + it after the *thread* read the latency.
>
> "thread read the latency" doesn't make sense.
>
> "and print it after the *thread context* event". ?

Fixed.

>
>> +
>> +timerlat and osnoise
>> +----------------------------
>> +
>> +The timerlat can also take advantage of the osnoise: traceevents.
>> +For example::
>> +
>> + [root@f32 ~]# cd /sys/kernel/tracing/
>> + [root@f32 tracing]# echo timerlat > current_tracer
>> + [root@f32 tracing]# echo osnoise > set_event
>
> Note, set_event should be deprecated. Use:
>
> echo 1 > events/osnoise/enable
>
> instead.
>

Fixed (and mental note added).

>> + [root@f32 tracing]# echo 25 > osnoise/stop_tracing_out_us
>> + [root@f32 tracing]# tail -10 trace
>> + cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns
>> + cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
>> + cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
>> + cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
>> + cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
>> + cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
>> + cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
>> + cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
>> + cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns
>> + timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns
>> +
>> +In this case, the root cause of the timer latency does not point for a
>> +single, but to a series of call_function_single IPIs, followed by a 10
>
> "not point to a single"

Fixed.

[...]

>> +IRQ stacktrace
>> +---------------------------
>> +
>> +The osnoise/print_stack option is helpful for the cases in which a thread
>> +noise causes the major factor for the timer latency, because of preempt or
>> +irq disabled. For example::
>> +
>> + [root@f32 tracing]# echo 500 > osnoise/stop_tracing_out_us
>> + [root@f32 tracing]# echo 500 > osnoise/print_stack
>> + [root@f32 tracing]# echo timerlat > current_tracer
>> + [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
>> + insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
>> + insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns
>> + insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
>> + insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
>> + insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns
>> + timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns
>> + timerlat/7-1001 [007] ....1.. 200.203446: <stack trace>
>> + => timerlat_irq
>> + => __hrtimer_run_queues
>> + => hrtimer_interrupt
>> + => __sysvec_apic_timer_interrupt
>> + => asm_call_irq_on_stack
>> + => sysvec_apic_timer_interrupt
>> + => asm_sysvec_apic_timer_interrupt
>> + => delay_tsc
>> + => dummy_load_1ms_pd_init
>> + => do_one_initcall
>> + => do_init_module
>> + => __do_sys_finit_module
>> + => do_syscall_64
>> + => entry_SYSCALL_64_after_hwframe
>> +
>> +In this case, it is possible to see that the thread added the highest
>> +contribution to the *timer latency* and the stack trace points to
>> +a function named dummy_load_1ms_pd_init, which had the following
>> +code (on purpose)::
>
> Should add here as well that the stack is saved at the time of interrupt,
> and not at the time it is reported.

Fixed.
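For context, the module in question is just a busy-wait loop with preemption
disabled. A minimal sketch of what it could look like (busy_wait() and
cycles_per_us below are illustrative helpers, not the exact module code):

  static u64 cycles_per_us;

  /* Spin for time_us microseconds without sleeping. */
  static void busy_wait(ulong time_us)
  {
          u64 timeout = get_cycles() + time_us * cycles_per_us;

          while (get_cycles() < timeout)
                  barrier();
  }

  /* Busy loop for 1 ms with preemption disabled, on purpose. */
  static int __init dummy_load_1ms_pd_init(void)
  {
          preempt_disable();
          busy_wait(1000);
          preempt_enable();
          return 0;
  }

  module_init(dummy_load_1ms_pd_init);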

[...]

>>
>> +#ifdef CONFIG_TIMERLAT_TRACER
>> +/*
>> + * Runtime information for the timer mode.
>> + */
>> +struct timerlat_variables {
>> + struct task_struct *kthread;
>> + struct hrtimer timer;
>> + u64 rel_period;
>> + u64 abs_period;
>> + bool tracing_thread;
>> + u64 count;
>> +};
>
> Like with the osnoise comment, put in tabs to make the fields stand out.

Done.
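i.e., the struct now reads something like:

  struct timerlat_variables {
          struct task_struct      *kthread;
          struct hrtimer          timer;
          u64                     rel_period;
          u64                     abs_period;
          bool                    tracing_thread;
          u64                     count;
  };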

[...]

>> +#ifdef CONFIG_TIMERLAT_TRACER
>> +/*
>> + * timerlat sample structure definition. Used to store the statistics of
>> + * a sample run.
>> + */
>> +struct timerlat_sample {
>> + u64 seqnum; /* unique sequence */
>
> The seqnum in the event is unsigned int, whereas here it's u64.

all set to unsigned int.

[...]

>> +
>> +#ifdef CONFIG_STACKTRACE
>> +/*
>> + * Stack trace will take place only at IRQ level, so, no need
>> + * to control nesting here.
>> + */
>> +struct trace_stack {
>> + int stack_size;
>> + int nr_entries;
>> + unsigned long calls[PAGE_SIZE];
>
> That is rather big. It's 8 * PAGE_SIZE. I don't think that's what you really
> wanted.

no, I did not want that... is 256 a good number?

>> +};
>> +
>> +static DEFINE_PER_CPU(struct trace_stack, trace_stack);
>> +
>> +/**
>
> Again, remove the KernelDoc notation of /**, or make it real kerneldoc
> notation.

Fixed!

[...]

>> *
>> @@ -801,6 +1017,22 @@ void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
>> if (!osn_var->sampling)
>> return;
>>
>> +#ifdef CONFIG_TIMERLAT_TRACER
>> + /*
>> + * If the timerlat is enabled, but the irq handler did
>> + * not run yet enabling timerlat_tracer, do not trace.
>> + */
>> + if (unlikely(osnoise_data.timerlat_tracer)) {
>> + struct timerlat_variables *tlat_var;
>> + tlat_var = this_cpu_tmr_var();
>> + if (!tlat_var->tracing_thread) {
>
> What happens if the timer interrupt triggers here?

The tracer will not report the softirq overhead. But at this point, the softirq
is returning, and the duration would be from this time to...



>> + osn_var->softirq.arrival_time = 0;
>> + osn_var->softirq.delta_start = 0;
>> + return;
>> + }
>> + }
>> +#endif
>> +
>> duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);

here.

We can disable interrupts to avoid this issue. But the question is, is it worth
disabling interrupts to avoid this problem?
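Something like the sketch below, just to illustrate the option (keeping the
check and the duration accounting in a single irq-off section):

  unsigned long flags;

  local_irq_save(flags);
  if (unlikely(osnoise_data.timerlat_tracer)) {
          struct timerlat_variables *tlat_var = this_cpu_tmr_var();

          if (!tlat_var->tracing_thread) {
                  osn_var->softirq.arrival_time = 0;
                  osn_var->softirq.delta_start = 0;
                  local_irq_restore(flags);
                  return;
          }
  }
  duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
  local_irq_restore(flags);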

>> trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
>> cond_move_thread_delta_start(osn_var, duration);
>> @@ -893,6 +1125,18 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
>> if (!osn_var->sampling)
>> return;
>>
>> +#ifdef CONFIG_TIMERLAT_TRACER
>> + if (osnoise_data.timerlat_tracer) {
>> + struct timerlat_variables *tlat_var;
>> + tlat_var = this_cpu_tmr_var();
>> + if (!tlat_var->tracing_thread) {
>
> Or here?

The problem that can happen with the softirq cannot happen here: this code runs
with interrupts disabled in __schedule() (it is hooked to the sched_switch
tracepoint).
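(For reference, the hook is registered roughly like this, so the probe always
runs from the sched_switch tracepoint inside __schedule(), with interrupts
off; the pid checks are omitted from this sketch:)

  static void
  trace_sched_switch_callback(void *data, bool preempt,
                              struct task_struct *p,
                              struct task_struct *n)
  {
          struct osnoise_variables *osn_var = this_cpu_osn_var();

          thread_exit(osn_var, p);
          thread_entry(osn_var, n);
  }

  register_trace_sched_switch(trace_sched_switch_callback, NULL);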

>> + osn_var->thread.delta_start = 0;
>> + osn_var->thread.arrival_time = 0;
>> + return;
>> + }
>> + }
>> +#endif
>> +
>> duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
>>
>> trace_thread_noise(t, osn_var->thread.arrival_time, duration);
>> @@ -1182,6 +1426,197 @@ static int osnoise_main(void *data)
>> return 0;
>> }
>>
>> +#ifdef CONFIG_TIMERLAT_TRACER
>> +/**
>> + * timerlat_irq - hrtimer handler for timerlat.
>> + */
>> +static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + struct trace_array *tr = osnoise_trace;
>> + struct timerlat_variables *tlat;
>> + struct timerlat_sample s;
>> + u64 now;
>> + u64 diff;
>> +
>> + /*
>> + * I am not sure if the timer was armed for this CPU. So, get
>> + * the timerlat struct from the timer itself, not from this
>> + * CPU.
>> + */
>> + tlat = container_of(timer, struct timerlat_variables, timer);
>> +
>> + now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
>> +
>> + /*
>> + * Enable the osnoise: events for thread an softirq.
>> + */
>> + tlat->tracing_thread = true;
>> +
>> + osn_var->thread.arrival_time = time_get();
>> +
>> + /*
>> + * A hardirq is running: the timer IRQ. It is for sure preempting
>> + * a thread, and potentially preempting a softirq.
>> + *
>> + * At this point, it is not interesting to know the duration of the
>> + * preempted thread (and maybe softirq), but how much time they will
>> + * delay the beginning of the execution of the timer thread.
>> + *
>> + * To get the correct (net) delay added by the softirq, its delta_start
>> + * is set as the IRQ one. In this way, at the return of the IRQ, the delta
>> + * start of the sofitrq will be zeroed, accounting then only the time
>> + * after that.
>> + *
>> + * The thread follows the same principle. However, if a softirq is
>> + * running, the thread needs to receive the softirq delta_start. The
>> + * reason being is that the softirq will be the last to be unfolded,
>> + * resseting the thread delay to zero.
>> + */
>> +#ifndef CONFIG_PREEMPT_RT
>> + if (osn_var->softirq.delta_start) {
>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
>> + &osn_var->softirq.delta_start);
>
> Isn't softirq.delta_start going to be zero here? It doesn't look to get
> updated until you set tracing_thread to true, but that happens here, and as
> this is in a interrupt context, there will not be a softirq happening
> between the setting of that to true to this point.

No... on the timerlat, the "sampling" is always on. And the
osnoise_data.timerlat_tracer is only checked at the softirq return, so the
softirq entry always sets the delta_start.

>> +
>> + copy_int_safe_time(osn_var, &osn_var->softirq.delta_start,
>> + &osn_var->irq.delta_start);
>> + } else {
>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
>> + &osn_var->irq.delta_start);
>> + }
>> +#else /* CONFIG_PREEMPT_RT */
>> + /*
>> + * The sofirqs run as threads on RT, so there is not need
>> + * to keep track of it.
>> + */
>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start, &osn_var->irq.delta_start);
>> +#endif /* CONFIG_PREEMPT_RT */
>> +
>> + /*
>> + * Compute the current time with the expected time.
>> + */
>> + diff = now - tlat->abs_period;
>> +
>> + tlat->count++;
>> + s.seqnum = tlat->count;
>> + s.timer_latency = diff;
>> + s.context = IRQ_CONTEXT;
>> +
>> + trace_timerlat_sample(&s);
>> +
>> + /* Keep a running maximum ever recorded os noise "latency" */
>> + if (diff > tr->max_latency) {
>> + tr->max_latency = diff;
>> + latency_fsnotify(tr);
>> + }
>> +
>> + if (osnoise_data.stop_tracing_in)
>> + if (time_to_us(diff) >= osnoise_data.stop_tracing_in)
>> + osnoise_stop_tracing();
>> +
>> + wake_up_process(tlat->kthread);
>> +
>> +#ifdef CONFIG_STACKTRACE
>> + if (osnoise_data.print_stack)
>> + timerlat_save_stack(0);
>> +#endif
>
> No need for the #ifdef above. timerlat_save_stack() is defined as a nop
> when not enabled, and the compiler will just optimize this out.

The osnoise_data.print_stack is ifdefed; should I remove it from the ifdef?

>
>> +
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +/**
>> + * wait_next_period - Wait for the next period for timerlat
>> + */
>> +static int wait_next_period(struct timerlat_variables *tlat)
>> +{
>> + ktime_t next_abs_period, now;
>> + u64 rel_period = osnoise_data.timerlat_period * 1000;
>> +
>> + now = hrtimer_cb_get_time(&tlat->timer);
>> + next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
>> +
>> + /*
>> + * Save the next abs_period.
>> + */
>> + tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
>> +
>> + /*
>> + * If the new abs_period is in the past, skip the activation.
>> + */
>> + while (ktime_compare(now, next_abs_period) > 0) {
>> + next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
>> + tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
>> + }
>> +
>> + set_current_state(TASK_INTERRUPTIBLE);
>> +
>> + hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
>> + schedule();
>> + return 1;
>> +}
>> +
>> +/**
>> + * timerlat_main- Timerlat main
>> + */
>> +static int timerlat_main(void *data)
>> +{
>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>> + struct timerlat_variables *tlat = this_cpu_tmr_var();
>> + struct timerlat_sample s;
>> + struct sched_param sp;
>> + u64 now, diff;
>> +
>> + /*
>> + * Make the thread RT, that is how cyclictest is usually used.
>> + */
>> + sp.sched_priority = DEFAULT_TIMERLAT_PRIO;
>> + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
>
> Hmm, I thought Peter Zijlstra was removing all sched_setscheduler*() calls
> in the kernel :-/ Although, this one seems legit, and we are not running
> from within a module.
>
> -- Steve
>

-- Daniel

2021-06-11 20:13:02

by Steven Rostedt

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On Fri, 11 Jun 2021 14:59:13 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> ------------------ %< -----------------------------
> It is worth mentioning that the *duration* values reported
> by the osnoise: events are *net* values. For example, the
> thread_noise does not include the duration of the overhead caused
> by the IRQ execution (which indeed accounted for 12736 ns). But
> the values reported by the timerlat tracer (timerlat_latency)
> are *gross* values.
>
> The art below illustrates a CPU timeline and how the timerlat tracer
> observes it at the top and the osnoise: events at the bottom. Each "-"
> in the timelines means 1 us, and the time moves ==>:
>
>  External         context irq                 context thread
>   clock          timer_latency                 timer_latency
>   event              18 us                         48 us
>     |                  ^                             ^
>     v                  |                             |
>     |------------------|                             |    <-- timerlat irq timeline
>     |------------------+-----------------------------|   <-- timerlat thread timeline
>                        ^                             ^
> ===================== CPU timeline ======================================
>                        [timerlat/ irq]  [ dev irq ]
> [another thread........^              v..^        v..][timerlat/ thread]
> ===================== CPU timeline ======================================
>                        |-------------|   |---------|   <-- irq_noise timeline
>                                       |--^        v--------|  <-- thread_noise timeline
>                        |                 |                 |
>                        |                 |                 + thread_noise: 10 us
>                        |                 +-> irq_noise: 9 us
>                        +-> irq_noise: 13 us
>
> --------------- >% --------------------------------

That's really busy, and honestly, I can't tell what is what.

The "context irq timer_latency" is a confusing name. Could we just have
that be "timer irq latency"? And "context thread timer_latency" just be
"thread latency". Adding too much text to the name actually makes it harder
to understand. We want to simplify it, not make people have to think harder
to see it.

I think we can get rid of the "<-- .* timeline" to the right. I don't
think they are necessary. Again, the more you add to the diagram, the
busier it looks, and the harder it is to read.

Could we switch "[timerlat/ irq]" to just "[timer irq]" and explain how
that "context irq timer_latency"/"timer irq latency" is related?

Should probably state that the "dev irq" is an unrelated device interrupt
that happened.

What's with the two CPU timeline lines? Now there I think it would be
better to have the arrow text by itself.

And finally, not sure if you plan on doing this, but have a output of the
trace that would show the above.

Thus, here's what I would expect to see:

 External
  clock        timer irq latency              thread latency
  event              18 us                         48 us
    |                  ^                             ^
    v                  |                             |
    |------------------|                             |
    |------------------+-----------------------------|
                       ^                             ^
=========================================================================
                       [timerlat/ irq]  [ dev irq ]
[another thread........^              v..^        v..][timerlat/ thread]  <-- CPU task timeline
=========================================================================
                       |-------------|   |---------|
                                      |--^        v--------|
                       |                 |                 |
                       |                 |                 + thread_noise: 10 us
                       |                 +-> irq_noise: 9 us
                       +-> irq_noise: 13 us

The "[ dev irq ]" above is an interrupt from some device on the system that
causes extra noise to the timerlat task.

I think the above may be easier to understand, especially if the trace
output that represents it is below.

Also, I have to ask, shouldn't the "thread noise" really start at the
"External clock event"?

-- Steve

2021-06-11 20:50:59

by Steven Rostedt

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On Fri, 11 Jun 2021 16:13:36 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> >> +
> >> +#ifdef CONFIG_STACKTRACE
> >> +/*
> >> + * Stack trace will take place only at IRQ level, so, no need
> >> + * to control nesting here.
> >> + */
> >> +struct trace_stack {
> >> + int stack_size;
> >> + int nr_entries;
> >> + unsigned long calls[PAGE_SIZE];
> >
> > That is rather big. It's 8 * PAGE_SIZE. I don't think that's what you really
> > wanted.
>
> no, I did not want that... is 256 a good number?

Sure. But make it a macro.

#define MAX_CALLS 256

or something like that.
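That is:

  #define MAX_CALLS       256

  struct trace_stack {
          int             stack_size;
          int             nr_entries;
          unsigned long   calls[MAX_CALLS];
  };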

>
> >> +};
> >> +
> >> +static DEFINE_PER_CPU(struct trace_stack, trace_stack);
> >> +
> >> +/**
> >
> > Again, remove the KernelDoc notation of /**, or make it real kerneldoc
> > notation.
>
> Fixed!
>
> [...]
>
> >> *
> >> @@ -801,6 +1017,22 @@ void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
> >> if (!osn_var->sampling)
> >> return;
> >>
> >> +#ifdef CONFIG_TIMERLAT_TRACER
> >> + /*
> >> + * If the timerlat is enabled, but the irq handler did
> >> + * not run yet enabling timerlat_tracer, do not trace.
> >> + */
> >> + if (unlikely(osnoise_data.timerlat_tracer)) {
> >> + struct timerlat_variables *tlat_var;
> >> + tlat_var = this_cpu_tmr_var();
> >> + if (!tlat_var->tracing_thread) {
> >
> > What happens if the timer interrupt triggers here?
>
> The tracer will not report the softirq overhead. But at this point, the softirq
> is returning, and the duration would be from this time to...
>
>
>
> >> + osn_var->softirq.arrival_time = 0;
> >> + osn_var->softirq.delta_start = 0;
> >> + return;
> >> + }
> >> + }
> >> +#endif
> >> +
> >> duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
>
> here.
>
> We can disable interrupts to avoid this issue. But the question is, is it worth
> disabling interrupts to avoid this problem?
>
> >> trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
> >> cond_move_thread_delta_start(osn_var, duration);
> >> @@ -893,6 +1125,18 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
> >> if (!osn_var->sampling)
> >> return;
> >>
> >> +#ifdef CONFIG_TIMERLAT_TRACER
> >> + if (osnoise_data.timerlat_tracer) {
> >> + struct timerlat_variables *tlat_var;
> >> + tlat_var = this_cpu_tmr_var();
> >> + if (!tlat_var->tracing_thread) {
> >
> > Or here?
>
> The problem that can happen with the softirq cannot happen here: this code runs
> with interrupts disabled on __schedule() (it is hooked to the sched_switch).
>
> >> + osn_var->thread.delta_start = 0;
> >> + osn_var->thread.arrival_time = 0;
> >> + return;
> >> + }
> >> + }
> >> +#endif
> >> +
> >> duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
> >>
> >> trace_thread_noise(t, osn_var->thread.arrival_time, duration);
> >> @@ -1182,6 +1426,197 @@ static int osnoise_main(void *data)
> >> return 0;
> >> }
> >>
> >> +#ifdef CONFIG_TIMERLAT_TRACER
> >> +/**
> >> + * timerlat_irq - hrtimer handler for timerlat.
> >> + */
> >> +static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
> >> +{
> >> + struct osnoise_variables *osn_var = this_cpu_osn_var();
> >> + struct trace_array *tr = osnoise_trace;
> >> + struct timerlat_variables *tlat;
> >> + struct timerlat_sample s;
> >> + u64 now;
> >> + u64 diff;
> >> +
> >> + /*
> >> + * I am not sure if the timer was armed for this CPU. So, get
> >> + * the timerlat struct from the timer itself, not from this
> >> + * CPU.
> >> + */
> >> + tlat = container_of(timer, struct timerlat_variables, timer);
> >> +
> >> + now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
> >> +
> >> + /*
> >> + * Enable the osnoise: events for thread an softirq.
> >> + */
> >> + tlat->tracing_thread = true;
> >> +
> >> + osn_var->thread.arrival_time = time_get();
> >> +
> >> + /*
> >> + * A hardirq is running: the timer IRQ. It is for sure preempting
> >> + * a thread, and potentially preempting a softirq.
> >> + *
> >> + * At this point, it is not interesting to know the duration of the
> >> + * preempted thread (and maybe softirq), but how much time they will
> >> + * delay the beginning of the execution of the timer thread.
> >> + *
> >> + * To get the correct (net) delay added by the softirq, its delta_start
> >> + * is set as the IRQ one. In this way, at the return of the IRQ, the delta
> >> + * start of the sofitrq will be zeroed, accounting then only the time
> >> + * after that.
> >> + *
> >> + * The thread follows the same principle. However, if a softirq is
> >> + * running, the thread needs to receive the softirq delta_start. The
> >> + * reason being is that the softirq will be the last to be unfolded,
> >> + * resseting the thread delay to zero.
> >> + */
> >> +#ifndef CONFIG_PREEMPT_RT
> >> + if (osn_var->softirq.delta_start) {
> >> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
> >> + &osn_var->softirq.delta_start);
> >
> > Isn't softirq.delta_start going to be zero here? It doesn't look to get
> > updated until you set tracing_thread to true, but that happens here, and as
> > this is in a interrupt context, there will not be a softirq happening
> > between the setting of that to true to this point.
>
> No... on the timerlat, the "sampling" is always on. And the
> osnoise_data.timerlat_tracer is only checked at the softirq return, so the
> softirq entry always sets the delta_start.

OK, I was confused by the timerlat using the "__osnoise_tracer_start()". If
timerlat is going to use that, perhaps we need to rename it, because the
"osnoise" is one tracer, and its confusing that the "timerlat" is using
functions called "*_osnoise_*". I was thinking that those functions were
only for the osnoise tracer and not part of the timerlat tracer, and
ignored them when looking at what the timerlat tracer was doing.

Can we rename that to simply "start_latency_tracing()" or something more
generic?

>
> >> +
> >> + copy_int_safe_time(osn_var, &osn_var->softirq.delta_start,
> >> + &osn_var->irq.delta_start);
> >> + } else {
> >> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
> >> + &osn_var->irq.delta_start);
> >> + }
> >> +#else /* CONFIG_PREEMPT_RT */
> >> + /*
> >> + * The sofirqs run as threads on RT, so there is not need
> >> + * to keep track of it.
> >> + */
> >> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start, &osn_var->irq.delta_start);
> >> +#endif /* CONFIG_PREEMPT_RT */
> >> +
> >> + /*
> >> + * Compute the current time with the expected time.
> >> + */
> >> + diff = now - tlat->abs_period;
> >> +
> >> + tlat->count++;
> >> + s.seqnum = tlat->count;
> >> + s.timer_latency = diff;
> >> + s.context = IRQ_CONTEXT;
> >> +
> >> + trace_timerlat_sample(&s);
> >> +
> >> + /* Keep a running maximum ever recorded os noise "latency" */
> >> + if (diff > tr->max_latency) {
> >> + tr->max_latency = diff;
> >> + latency_fsnotify(tr);
> >> + }
> >> +
> >> + if (osnoise_data.stop_tracing_in)
> >> + if (time_to_us(diff) >= osnoise_data.stop_tracing_in)
> >> + osnoise_stop_tracing();
> >> +
> >> + wake_up_process(tlat->kthread);
> >> +
> >> +#ifdef CONFIG_STACKTRACE
> >> + if (osnoise_data.print_stack)
> >> + timerlat_save_stack(0);
> >> +#endif
> >
> > No need for the #ifdef above. timerlat_save_stack() is defined as a nop
> > when not enabled, and the compiler will just optimize this out.
>
> The osnoise_data.print_stack is ifdefed, should I remove it from ifdef?

Well, the above ifdef is for STACKTRACE not for TIMERLAT_TRACER, which
encompasses all of this. And the "timerlat_save_stack()" is a nop when
STACKTRACE is not defined. So no.
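That is, the usual pattern, sketched:

  #ifdef CONFIG_STACKTRACE
  static void timerlat_save_stack(int skip)
  {
          struct trace_stack *fstack = this_cpu_ptr(&trace_stack);

          fstack->nr_entries = stack_trace_save(fstack->calls,
                                                MAX_CALLS, skip);
          fstack->stack_size = fstack->nr_entries * sizeof(unsigned long);
  }
  #else /* CONFIG_STACKTRACE */
  #define timerlat_save_stack(a)  do {} while (0)
  #endif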

-- Steve

>
> >
> >> +
> >> + return HRTIMER_NORESTART;
> >> +}
> >> +

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On 6/11/21 10:48 PM, Steven Rostedt wrote:
> On Fri, 11 Jun 2021 16:13:36 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>>>> +
>>>> +#ifdef CONFIG_STACKTRACE
>>>> +/*
>>>> + * Stack trace will take place only at IRQ level, so, no need
>>>> + * to control nesting here.
>>>> + */
>>>> +struct trace_stack {
>>>> + int stack_size;
>>>> + int nr_entries;
>>>> + unsigned long calls[PAGE_SIZE];
>>>
>>> That is rather big. It's 8 * PAGE_SIZE. I don't think that's what you really
>>> wanted.
>>
>> no, I did not want that... is 256 a good number?
>
> Sure. But make it a macro.
>
> #define MAX_CALLS 256
>
> or something like that.

Ack!

>>
>>>> +};
>>>> +
>>>> +static DEFINE_PER_CPU(struct trace_stack, trace_stack);
>>>> +
>>>> +/**
>>>
>>> Again, remove the KernelDoc notation of /**, or make it real kerneldoc
>>> notation.
>>
>> Fixed!
>>
>> [...]
>>
>>>> *
>>>> @@ -801,6 +1017,22 @@ void trace_softirq_exit_callback(void *data, unsigned int vec_nr)
>>>> if (!osn_var->sampling)
>>>> return;
>>>>
>>>> +#ifdef CONFIG_TIMERLAT_TRACER
>>>> + /*
>>>> + * If the timerlat is enabled, but the irq handler did
>>>> + * not run yet enabling timerlat_tracer, do not trace.
>>>> + */
>>>> + if (unlikely(osnoise_data.timerlat_tracer)) {
>>>> + struct timerlat_variables *tlat_var;
>>>> + tlat_var = this_cpu_tmr_var();
>>>> + if (!tlat_var->tracing_thread) {
>>>
>>> What happens if the timer interrupt triggers here?
>>
>> The tracer will not report the softirq overhead. But at this point, the softirq
>> is returning, and the duration would be from this time to...
>>
>>
>>
>>>> + osn_var->softirq.arrival_time = 0;
>>>> + osn_var->softirq.delta_start = 0;
>>>> + return;
>>>> + }
>>>> + }
>>>> +#endif
>>>> +
>>>> duration = get_int_safe_duration(osn_var, &osn_var->softirq.delta_start);
>>
>> here.
>>
>> We can disable interrupts to avoid this issue. But the question is, is it worth
>> to disable interrupts to avoid this problem?
>>
>>>> trace_softirq_noise(vec_nr, osn_var->softirq.arrival_time, duration);
>>>> cond_move_thread_delta_start(osn_var, duration);
>>>> @@ -893,6 +1125,18 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
>>>> if (!osn_var->sampling)
>>>> return;
>>>>
>>>> +#ifdef CONFIG_TIMERLAT_TRACER
>>>> + if (osnoise_data.timerlat_tracer) {
>>>> + struct timerlat_variables *tlat_var;
>>>> + tlat_var = this_cpu_tmr_var();
>>>> + if (!tlat_var->tracing_thread) {
>>>
>>> Or here?
>>
>> The problem that can happen with the softirq cannot happen here: this code runs
>> with interrupts disabled on __schedule() (it is hooked to the sched_switch).
>>
>>>> + osn_var->thread.delta_start = 0;
>>>> + osn_var->thread.arrival_time = 0;
>>>> + return;
>>>> + }
>>>> + }
>>>> +#endif
>>>> +
>>>> duration = get_int_safe_duration(osn_var, &osn_var->thread.delta_start);
>>>>
>>>> trace_thread_noise(t, osn_var->thread.arrival_time, duration);
>>>> @@ -1182,6 +1426,197 @@ static int osnoise_main(void *data)
>>>> return 0;
>>>> }
>>>>
>>>> +#ifdef CONFIG_TIMERLAT_TRACER
>>>> +/**
>>>> + * timerlat_irq - hrtimer handler for timerlat.
>>>> + */
>>>> +static enum hrtimer_restart timerlat_irq(struct hrtimer *timer)
>>>> +{
>>>> + struct osnoise_variables *osn_var = this_cpu_osn_var();
>>>> + struct trace_array *tr = osnoise_trace;
>>>> + struct timerlat_variables *tlat;
>>>> + struct timerlat_sample s;
>>>> + u64 now;
>>>> + u64 diff;
>>>> +
>>>> + /*
>>>> + * I am not sure if the timer was armed for this CPU. So, get
>>>> + * the timerlat struct from the timer itself, not from this
>>>> + * CPU.
>>>> + */
>>>> + tlat = container_of(timer, struct timerlat_variables, timer);
>>>> +
>>>> + now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
>>>> +
>>>> + /*
>>>> + * Enable the osnoise: events for thread an softirq.
>>>> + */
>>>> + tlat->tracing_thread = true;
>>>> +
>>>> + osn_var->thread.arrival_time = time_get();
>>>> +
>>>> + /*
>>>> + * A hardirq is running: the timer IRQ. It is for sure preempting
>>>> + * a thread, and potentially preempting a softirq.
>>>> + *
>>>> + * At this point, it is not interesting to know the duration of the
>>>> + * preempted thread (and maybe softirq), but how much time they will
>>>> + * delay the beginning of the execution of the timer thread.
>>>> + *
>>>> + * To get the correct (net) delay added by the softirq, its delta_start
>>>> + * is set as the IRQ one. In this way, at the return of the IRQ, the delta
>>>> + * start of the sofitrq will be zeroed, accounting then only the time
>>>> + * after that.
>>>> + *
>>>> + * The thread follows the same principle. However, if a softirq is
>>>> + * running, the thread needs to receive the softirq delta_start. The
>>>> + * reason being is that the softirq will be the last to be unfolded,
>>>> + * resseting the thread delay to zero.
>>>> + */
>>>> +#ifndef CONFIG_PREEMPT_RT
>>>> + if (osn_var->softirq.delta_start) {
>>>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
>>>> + &osn_var->softirq.delta_start);
>>>
>>> Isn't softirq.delta_start going to be zero here? It doesn't look to get
>>> updated until you set tracing_thread to true, but that happens here, and as
>>> this is in a interrupt context, there will not be a softirq happening
>>> between the setting of that to true to this point.
>>
>> No... on the timerlat, the "sampling" is always on. And the
>> osnoise_data.timerlat_tracer is only checked at the softirq return, so the
>> softirq entry always sets the delta_start.
>
> OK, I was confused by the timerlat using the "__osnoise_tracer_start()". If
> timerlat is going to use that, perhaps we need to rename it, because the
> "osnoise" is one tracer, and its confusing that the "timerlat" is using
> functions called "*_osnoise_*". I was thinking that those functions were
> only for the osnoise tracer and not part of the timerlat tracer, and
> ignored them when looking at what the timerlat tracer was doing.
>
> Can we rename that to simply "start_latency_tracing()" or something more
> generic?

Right, also considering that we will (I hope) have the rtsl next, and other
ideas coming around the usage of osnoise: events on other tracers, I also think
we should use a more generic term for the events. Indeed, on rtsl they have a
different name.

Thinking only about the instrumentation/events, what they are tracking is the
execution time. So how about naming them as:

exec_time:thread
exec_time:irq

Also, although here we measure the execution time of the "task" context, on
rtsl we have other kinds of "windows" to measure, for instance, the poid
window (Preemption or IRQ disabled). So, the term exec time also fits there.

?

>>
>>>> +
>>>> + copy_int_safe_time(osn_var, &osn_var->softirq.delta_start,
>>>> + &osn_var->irq.delta_start);
>>>> + } else {
>>>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start,
>>>> + &osn_var->irq.delta_start);
>>>> + }
>>>> +#else /* CONFIG_PREEMPT_RT */
>>>> + /*
>>>> + * The sofirqs run as threads on RT, so there is not need
>>>> + * to keep track of it.
>>>> + */
>>>> + copy_int_safe_time(osn_var, &osn_var->thread.delta_start, &osn_var->irq.delta_start);
>>>> +#endif /* CONFIG_PREEMPT_RT */
>>>> +
>>>> + /*
>>>> + * Compute the current time with the expected time.
>>>> + */
>>>> + diff = now - tlat->abs_period;
>>>> +
>>>> + tlat->count++;
>>>> + s.seqnum = tlat->count;
>>>> + s.timer_latency = diff;
>>>> + s.context = IRQ_CONTEXT;
>>>> +
>>>> + trace_timerlat_sample(&s);
>>>> +
>>>> + /* Keep a running maximum ever recorded os noise "latency" */
>>>> + if (diff > tr->max_latency) {
>>>> + tr->max_latency = diff;
>>>> + latency_fsnotify(tr);
>>>> + }
>>>> +
>>>> + if (osnoise_data.stop_tracing_in)
>>>> + if (time_to_us(diff) >= osnoise_data.stop_tracing_in)
>>>> + osnoise_stop_tracing();
>>>> +
>>>> + wake_up_process(tlat->kthread);
>>>> +
>>>> +#ifdef CONFIG_STACKTRACE
>>>> + if (osnoise_data.print_stack)
>>>> + timerlat_save_stack(0);
>>>> +#endif
>>>
>>> No need for the #ifdef above. timerlat_save_stack() is defined as a nop
>>> when not enabled, and the compiler will just optimize this out.
>>
>> The osnoise_data.print_stack is ifdefed, should I remove it from ifdef?
>
> Well, the above ifdef is for STACKTRACE not for TIMERLAT_TRACER, which
> encompasses all of this. And the "timerlat_save_stack()" is a nop when
> STACKTRACE is not defined. So no.

Ooooppppsss, right, lack of attention on my side.

-- Daniel

> -- Steve
>
>>
>>>
>>>> +
>>>> + return HRTIMER_NORESTART;
>>>> +}
>>>> +
>

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On 6/11/21 10:03 PM, Steven Rostedt wrote:
> On Fri, 11 Jun 2021 14:59:13 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> ------------------ %< -----------------------------
>> It is worth mentioning that the *duration* values reported
>> by the osnoise: events are *net* values. For example, the
>> thread_noise does not include the duration of the overhead caused
>> by the IRQ execution (which indeed accounted for 12736 ns). But
>> the values reported by the timerlat tracer (timerlat_latency)
>> are *gross* values.
>>
>> The art below illustrates a CPU timeline and how the timerlat tracer
>> observes it at the top and the osnoise: events at the bottom. Each "-"
>> in the timelines means 1 us, and the time moves ==>:
>>
>>  External         context irq                 context thread
>>   clock          timer_latency                 timer_latency
>>   event              18 us                         48 us
>>     |                  ^                             ^
>>     v                  |                             |
>>     |------------------|                             |    <-- timerlat irq timeline
>>     |------------------+-----------------------------|   <-- timerlat thread timeline
>>                        ^                             ^
>> ===================== CPU timeline ======================================
>>                        [timerlat/ irq]  [ dev irq ]
>> [another thread........^              v..^        v..][timerlat/ thread]
>> ===================== CPU timeline ======================================
>>                        |-------------|   |---------|   <-- irq_noise timeline
>>                                       |--^        v--------|  <-- thread_noise timeline
>>                        |                 |                 |
>>                        |                 |                 + thread_noise: 10 us
>>                        |                 +-> irq_noise: 9 us
>>                        +-> irq_noise: 13 us
>>
>> --------------- >% --------------------------------
>
> That's really busy, and honestly, I can't tell what is what.
>
> The "context irq timer_latency" is a confusing name. Could we just have
> that be "timer irq latency"? And "context thread timer_latency" just be
> "thread latency". Adding too much text to the name actually makes it harder
> to understand. We want to simplify it, not make people have to think harder
> to see it.
>
> I think we can get rid of the "<-- .* timeline" to the right. I don't
> think they are necessary. Again, the more you add to the diagram, the
> busier it looks, and the harder it is to read.
>
> Could we switch "[timerlat/ irq]" to just "[timer irq]" and explain how
> that "context irq timer_latency"/"timer irq latency" is related?
>
> Should probably state that the "dev irq" is an unrelated device interrupt
> that happened.
>
> What's with the two CPU timeline lines? Now there I think it would be
> better to have the arrow text by itself.
>
> And finally, not sure if you plan on doing this, but have a output of the
> trace that would show the above.
>
> Thus, here's what I would expect to see:
>
>  External
>   clock        timer irq latency              thread latency
>   event              18 us                         48 us
>     |                  ^                             ^
>     v                  |                             |
>     |------------------|                             |
>     |------------------+-----------------------------|
>                        ^                             ^
> =========================================================================
>                        [timerlat/ irq]  [ dev irq ]
> [another thread........^              v..^        v..][timerlat/ thread]  <-- CPU task timeline
> =========================================================================
>                        |-------------|   |---------|
>                                       |--^        v--------|
>                        |                 |                 |
>                        |                 |                 + thread_noise: 10 us
>                        |                 +-> irq_noise: 9 us
>                        +-> irq_noise: 13 us

It looks good to me!

> The "[ dev irq ]" above is an interrupt from some device on the system that
> causes extra noise to the timerlat task.
>
> I think the above may be easier to understand, especially if the trace
> output that represents it is below.

ok, I can try to capture a trace sample and represent it in the ASCII art
format above.

> Also, I have to ask, shouldn't the "thread noise" really start at the
> "External clock event"?

To go in that direction, we need to track things that delayed the IRQ execution.
We are already tracking other IRQs' execution, but we would have to keep a
history of past executions and "play them back". This would add some overhead
linear in the number of past events... and/or some pessimism.

We will also have to track IRQ disabled sections. The problem of tracking IRQ
disabled sections is that it depends on tracing infrastructure that is not
enabled by default on distros... And there are IRQ delay causes that are not
related to the thread... like idle states... (and all these things create more
and more states to track)...

So, I added the timer irq latency to figure out when the problem is related to
things that delay the IRQ, and the stack trace will help us figure out where the
problem is in the thread context. After the IRQ execution, the thread noise is
helpful - even without all the thread noise before the IRQ.

Furthermore, if we start trying to abstract the causes of delay, we will find
the rtsl :-). The rtsl events and abstractions give us the worst-case scheduling
latency without adding unneeded pessimism (sound analysis). It covers all the
possible cases, for any scheduler, even without the need of a measuring
thread like here (or with cyclictest) - and this is a good thing because it does
not change the target system's workload.

The problem is that... rtsl depends on tracing infrastructure that is not
enabled by default on distros, like the preempt_ and irq_ disabled events.

So, I see timerlat as a tool for on-the-fly usage, like debugging at customers
(as we do at Red Hat). It can be enabled by default on distros because it only
depends on existing and already enabled events and causes no overhead when
disabled. rtsl targets more specific cases, like safety-critical systems, where
the overhead is acceptable because of the sound analysis of the scheduling bound
(which is rooted in a formal specification & analysis of the system).

-- Daniel

> -- Steve
>

2021-06-12 23:12:36

by Steven Rostedt

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On Sat, 12 Jun 2021 11:41:41 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:
> > I think the above may be easier to understand, especially if the trace
> > output that represents it is below.
>
> ok, I can try to capture a trace sample and represent it in the ASCII art
> format above.

Why capture it? Just fudge an example that fits the example ;-)

>
> > Also, I have to ask, shouldn't the "thread noise" really start at the
> > "External clock event"?
>
> To go in that direction, we need to track things that delayed the IRQ execution.

[snip long explanation of the obvious (to me at least) ;-) ]

> the overhead is acceptable because of the sound analysis of the scheduling bound
> (which is rooted in a formal specification & analysis of the system).

I meant that it needs to be documented what the real thread noise is, but
that, due to what is available, it may not be truly accurate.

-- Steve

2021-06-12 23:15:33

by Steven Rostedt

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On Sat, 12 Jun 2021 10:47:16 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> Thinking only about the instrumentation/events, what they are tracking is the
> execution time. So how about naming them as:
>
> exec_time:thread
> exec_time:irq

I guess. I should go and look at your other code.

>
> Also adding that, although here we measure the execution time of "task" context,
> on rtsl we have other kinds of "windows" that they measure, for instance, the
> poid window (Preemption or IRQ disabled). So, the term exec time also fits there.

LOL at "poid"

-- Steve

Subject: Re: [PATCH V3 9/9] tracing: Add timerlat tracer

On 6/13/21 1:09 AM, Steven Rostedt wrote:
> On Sat, 12 Jun 2021 10:47:16 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> Thinking only about the instrumentation/events, what they are tracking is the
>> execution time. So how about naming them as:
>>
>> exec_time:thread
>> exec_time:irq
> I guess. I should go and look at your other code.
>

I have the v4 with all (including hotplug) execpt this name change. But as there
are a lot of changes already, I will send it now, and keep thinking about this...

-- Daniel