2022-10-07 15:47:52

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 0/5] Generic IPI sending tracepoint

Background
==========

Detecting IPI *reception* is relatively easy, e.g. using
trace_irq_handler_{entry,exit} or even just function-trace
flush_smp_call_function_queue() for SMP calls.

Figuring out their *origin*, is trickier as there is no generic tracepoint tied
to e.g. smp_call_function():

o AFAIA x86 has no tracepoint tied to sending IPIs, only receiving them
(cf. trace_call_function{_single}_entry()).
o arm/arm64 do have trace_ipi_raise(), which gives us the target cpus but also a
mostly useless string (smp_calls will all be "Function call interrupts").
o Other architectures don't seem to have any IPI-sending related tracepoint.

I believe one reason those tracepoints used by arm/arm64 ended up as they were
is because these archs used to handle IPIs differently from regular interrupts
(the IRQ driver would directly invoke an IPI-handling routine), which meant they
never showed up in trace_irq_handler_{entry, exit}. The trace_ipi_{entry,exit}
tracepoints gave a way to trace IPI reception but those have become redundant as
of:

56afcd3dbd19 ("ARM: Allow IPIs to be handled as normal interrupts")
d3afc7f12987 ("arm64: Allow IPIs to be handled as normal interrupts")

which gave IPIs a "proper" handler function used through
generic_handle_domain_irq(), which makes them show up via
trace_irq_handler_{entry, exit}.

Changing stuff up
=================

Per the above, it would make sense to reshuffle trace_ipi_raise() and move it
into generic code. This also came up during Daniel's talk on Osnoise at the CPU
isolation MC of LPC 2022 [1].

Now, to be useful, such a tracepoint needs to export:
o targeted CPU(s)
o calling context

The only way to get the calling context with trace_ipi_raise() is to trigger a
stack dump, e.g. $(trace-cmd -e ipi* -T echo 42).

As for the targeted CPUs, the existing tracepoint does export them, albeit in
cpumask form, which is quite inconvenient from a tooling perspective. For
instance, as far as I'm aware, it's not possible to do event filtering on a
cpumask via trace-cmd.

Because of the above points, this is introducing a new tracepoint.

Patches
=======

This results in having trace events for:

o smp_call_function*()
o smp_send_reschedule()
o irq_work_queue*()

This is incomplete, just looking at arm64 there's more IPI types that aren't covered:

IPI_CPU_STOP,
IPI_CPU_CRASH_STOP,
IPI_TIMER,
IPI_WAKEUP,

... But it feels like a good starting point.

Another thing worth mentioning is that depending on the callsite, the _RET_IP_
fed to the tracepoint is not always useful - generic_exec_single() doesn't tell
you much about the actual callback being sent via IPI, so there might be value
in exploding the single tracepoint into at least one variant for smp_calls.

Links
=====

[1]: https://youtu.be/5gT57y4OzBM?t=14234

Valentin Schneider (5):
trace: Add trace_ipi_send_{cpu, cpumask}
sched, smp: Trace send_call_function_single_ipi()
smp: Add a multi-CPU variant to send_call_function_single_ipi()
irq_work: Trace calls to arch_irq_work_raise()
treewide: Rename and trace arch-definitions of smp_send_reschedule()

arch/alpha/kernel/smp.c | 2 +-
arch/arc/kernel/smp.c | 2 +-
arch/arm/kernel/smp.c | 5 +----
arch/arm64/kernel/smp.c | 3 +--
arch/csky/kernel/smp.c | 2 +-
arch/hexagon/kernel/smp.c | 2 +-
arch/ia64/kernel/smp.c | 4 ++--
arch/loongarch/include/asm/smp.h | 2 +-
arch/mips/include/asm/smp.h | 2 +-
arch/openrisc/kernel/smp.c | 2 +-
arch/parisc/kernel/smp.c | 4 ++--
arch/powerpc/kernel/smp.c | 4 ++--
arch/riscv/kernel/smp.c | 4 ++--
arch/s390/kernel/smp.c | 2 +-
arch/sh/kernel/smp.c | 2 +-
arch/sparc/kernel/smp_32.c | 2 +-
arch/sparc/kernel/smp_64.c | 2 +-
arch/x86/include/asm/smp.h | 2 +-
arch/xtensa/kernel/smp.c | 2 +-
include/linux/smp.h | 1 +
include/trace/events/ipi.h | 27 +++++++++++++++++++++++++++
kernel/irq_work.c | 12 +++++++++++-
kernel/sched/core.c | 7 +++++--
kernel/smp.c | 18 +++++++++++++++++-
24 files changed, 84 insertions(+), 31 deletions(-)

--
2.31.1


2022-10-07 15:48:17

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 1/5] trace: Add trace_ipi_send_{cpu, cpumask}

trace_ipi_raise is unsuitable for generically tracing IPI sources; add a
variant of it that takes a callsite and a CPU. Define a macro helper for
handling IPIs sent to multiple CPUs.

Signed-off-by: Valentin Schneider <[email protected]>
---
include/trace/events/ipi.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
index 0be71dad6ec0..fd2f2aeb36fe 100644
--- a/include/trace/events/ipi.h
+++ b/include/trace/events/ipi.h
@@ -35,6 +35,33 @@ TRACE_EVENT(ipi_raise,
TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus), __entry->reason)
);

+TRACE_EVENT(ipi_send_cpu,
+
+ TP_PROTO(unsigned long callsite, unsigned int cpu),
+
+ TP_ARGS(callsite, cpu),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, callsite)
+ __field(unsigned int, cpu)
+ ),
+
+ TP_fast_assign(
+ __entry->callsite = callsite;
+ __entry->cpu = cpu;
+ ),
+
+ TP_printk("callsite=%pS target_cpu=%d", (void *)__entry->callsite, __entry->cpu)
+);
+
+#define trace_ipi_send_cpumask(callsite, mask) do { \
+ if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \
+ int cpu; \
+ for_each_cpu(cpu, mask) \
+ trace_ipi_send_cpu(callsite, cpu); \
+ } \
+} while (0)
+
DECLARE_EVENT_CLASS(ipi_handler,

TP_PROTO(const char *reason),
--
2.31.1

2022-10-07 15:49:53

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 2/5] sched, smp: Trace send_call_function_single_ipi()

send_call_function_single_ipi() is the thing that sends IPIs at the bottom
of smp_call_function*() via either generic_exec_single() or
smp_call_function_many_cond(). Give it an IPI-related tracepoint.

Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".

Signed-off-by: Valentin Schneider <[email protected]>
---
arch/arm/kernel/smp.c | 3 ---
arch/arm64/kernel/smp.c | 1 -
kernel/sched/core.c | 7 +++++--
kernel/smp.c | 4 ++++
4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 978db2d96b44..3b280d55c1c4 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -48,9 +48,6 @@
#include <asm/mach/arch.h>
#include <asm/mpu.h>

-#define CREATE_TRACE_POINTS
-#include <trace/events/ipi.h>
-
/*
* as from 2.5, kernels no longer have an init_tasks structure
* so we need some other way of telling a new secondary core
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index ffc5d76cf695..937d2623e06b 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -51,7 +51,6 @@
#include <asm/ptrace.h>
#include <asm/virt.h>

-#define CREATE_TRACE_POINTS
#include <trace/events/ipi.h>

DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 60fdc0faf1c9..14e5e137172f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -81,6 +81,7 @@
#include <linux/sched/rseq_api.h>
#include <trace/events/sched.h>
#undef CREATE_TRACE_POINTS
+#include <trace/events/ipi.h>

#include "sched.h"
#include "stats.h"
@@ -3753,10 +3754,12 @@ void send_call_function_single_ipi(int cpu)
{
struct rq *rq = cpu_rq(cpu);

- if (!set_nr_if_polling(rq->idle))
+ if (!set_nr_if_polling(rq->idle)) {
+ trace_ipi_send_cpu(_RET_IP_, cpu);
arch_send_call_function_single_ipi(cpu);
- else
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
}

/*
diff --git a/kernel/smp.c b/kernel/smp.c
index e8cdc025a046..7a7a22d69972 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -26,6 +26,10 @@
#include <linux/sched/debug.h>
#include <linux/jump_label.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/ipi.h>
+#undef CREATE_TRACE_POINTS
+
#include "smpboot.h"
#include "sched/smp.h"

--
2.31.1

2022-10-07 15:55:06

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 3/5] smp: Add a multi-CPU variant to send_call_function_single_ipi()

This simply wraps around the arch function and prepends it with a
tracepoint, bringing parity with send_call_function_single_ipi().

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/smp.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 7a7a22d69972..387735180aed 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -160,6 +160,12 @@ void __init call_function_init(void)
smpcfd_prepare_cpu(smp_processor_id());
}

+static inline void send_call_function_ipi_mask(const struct cpumask *mask)
+{
+ trace_ipi_send_cpumask(_RET_IP_, mask);
+ arch_send_call_function_ipi_mask(mask);
+}
+
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG

static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
@@ -970,7 +976,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
if (nr_cpus == 1)
send_call_function_single_ipi(last_cpu);
else if (likely(nr_cpus > 1))
- arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+ send_call_function_ipi_mask(cfd->cpumask_ipi);

cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->pinged, this_cpu, CFD_SEQ_NOCPU, CFD_SEQ_PINGED);
}
--
2.31.1

2022-10-07 16:09:28

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 4/5] irq_work: Trace calls to arch_irq_work_raise()

Adding a tracepoint to send_call_function_single_ipi() covers
irq_work_queue_on() when the CPU isn't the local one - add a tracepoint to
irq_work_raise() to cover the other cases.

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/irq_work.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 7afa40fe5cc4..5a550b681878 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -22,6 +22,8 @@
#include <asm/processor.h>
#include <linux/kasan.h>

+#include <trace/events/ipi.h>
+
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);
static DEFINE_PER_CPU(struct task_struct *, irq_workd);
@@ -74,6 +76,14 @@ void __weak arch_irq_work_raise(void)
*/
}

+static inline void irq_work_raise(void)
+{
+ if (arch_irq_work_has_interrupt())
+ trace_ipi_send_cpu(_RET_IP_, smp_processor_id());
+
+ arch_irq_work_raise();
+}
+
/* Enqueue on current CPU, work must already be claimed and preempt disabled */
static void __irq_work_queue_local(struct irq_work *work)
{
@@ -99,7 +109,7 @@ static void __irq_work_queue_local(struct irq_work *work)

/* If the work is "lazy", handle it from next tick if any */
if (!lazy_work || tick_nohz_tick_stopped())
- arch_irq_work_raise();
+ irq_work_raise();
}

/* Enqueue the irq work @work on the current CPU */
--
2.31.1

2022-10-07 16:27:14

by Valentin Schneider

[permalink] [raw]
Subject: [RFC PATCH 5/5] treewide: Rename and trace arch-definitions of smp_send_reschedule()

To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.

Signed-off-by: Valentin Schneider <[email protected]>
---
arch/alpha/kernel/smp.c | 2 +-
arch/arc/kernel/smp.c | 2 +-
arch/arm/kernel/smp.c | 2 +-
arch/arm64/kernel/smp.c | 2 +-
arch/csky/kernel/smp.c | 2 +-
arch/hexagon/kernel/smp.c | 2 +-
arch/ia64/kernel/smp.c | 4 ++--
arch/loongarch/include/asm/smp.h | 2 +-
arch/mips/include/asm/smp.h | 2 +-
arch/openrisc/kernel/smp.c | 2 +-
arch/parisc/kernel/smp.c | 4 ++--
arch/powerpc/kernel/smp.c | 4 ++--
arch/riscv/kernel/smp.c | 4 ++--
arch/s390/kernel/smp.c | 2 +-
arch/sh/kernel/smp.c | 2 +-
arch/sparc/kernel/smp_32.c | 2 +-
arch/sparc/kernel/smp_64.c | 2 +-
arch/x86/include/asm/smp.h | 2 +-
arch/xtensa/kernel/smp.c | 2 +-
include/linux/smp.h | 1 +
kernel/smp.c | 6 ++++++
21 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index f4e20f75438f..38637eb9eebd 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -562,7 +562,7 @@ handle_ipi(struct pt_regs *regs)
}

void
-smp_send_reschedule(int cpu)
+arch_smp_send_reschedule(int cpu)
{
#ifdef DEBUG_IPI_MSG
if (cpu == hard_smp_processor_id())
diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
index ab9e75e90f72..ae2e6a312361 100644
--- a/arch/arc/kernel/smp.c
+++ b/arch/arc/kernel/smp.c
@@ -292,7 +292,7 @@ static void ipi_send_msg(const struct cpumask *callmap, enum ipi_msg_type msg)
ipi_send_msg_one(cpu, msg);
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
ipi_send_msg_one(cpu, IPI_RESCHEDULE);
}
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 3b280d55c1c4..f216ac890b6f 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -745,7 +745,7 @@ void __init set_smp_ipi_range(int ipi_base, int n)
ipi_setup(smp_processor_id());
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 937d2623e06b..8d108edc4a89 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -976,7 +976,7 @@ void __init set_smp_ipi_range(int ipi_base, int n)
ipi_setup(smp_processor_id());
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/arch/csky/kernel/smp.c b/arch/csky/kernel/smp.c
index 4b605aa2e1d6..fd7f81be16dd 100644
--- a/arch/csky/kernel/smp.c
+++ b/arch/csky/kernel/smp.c
@@ -140,7 +140,7 @@ void smp_send_stop(void)
on_each_cpu(ipi_stop, NULL, 1);
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
send_ipi_message(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
index 4ba93e59370c..4e8bee25b8c6 100644
--- a/arch/hexagon/kernel/smp.c
+++ b/arch/hexagon/kernel/smp.c
@@ -217,7 +217,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
send_ipi(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index e2cc59db86bc..ea4f009a232b 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -220,11 +220,11 @@ kdump_smp_send_init(void)
* Called with preemption disabled.
*/
void
-smp_send_reschedule (int cpu)
+arch_smp_send_reschedule (int cpu)
{
ia64_send_ipi(cpu, IA64_IPI_RESCHEDULE, IA64_IPI_DM_INT, 0);
}
-EXPORT_SYMBOL_GPL(smp_send_reschedule);
+EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);

/*
* Called with preemption disabled.
diff --git a/arch/loongarch/include/asm/smp.h b/arch/loongarch/include/asm/smp.h
index 71189b28bfb2..3fcca134dfb1 100644
--- a/arch/loongarch/include/asm/smp.h
+++ b/arch/loongarch/include/asm/smp.h
@@ -83,7 +83,7 @@ extern void show_ipi_list(struct seq_file *p, int prec);
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
*/
-static inline void smp_send_reschedule(int cpu)
+static inline void arch_smp_send_reschedule(int cpu)
{
loongson3_send_ipi_single(cpu, SMP_RESCHEDULE);
}
diff --git a/arch/mips/include/asm/smp.h b/arch/mips/include/asm/smp.h
index 5d9ff61004ca..9806e79895d9 100644
--- a/arch/mips/include/asm/smp.h
+++ b/arch/mips/include/asm/smp.h
@@ -66,7 +66,7 @@ extern void calculate_cpu_foreign_map(void);
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
*/
-static inline void smp_send_reschedule(int cpu)
+static inline void arch_smp_send_reschedule(int cpu)
{
extern const struct plat_smp_ops *mp_ops; /* private */

diff --git a/arch/openrisc/kernel/smp.c b/arch/openrisc/kernel/smp.c
index e1419095a6f0..0a7a059e2dff 100644
--- a/arch/openrisc/kernel/smp.c
+++ b/arch/openrisc/kernel/smp.c
@@ -173,7 +173,7 @@ void handle_IPI(unsigned int ipi_msg)
}
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/arch/parisc/kernel/smp.c b/arch/parisc/kernel/smp.c
index 7dbd92cafae3..b7fc859fa87d 100644
--- a/arch/parisc/kernel/smp.c
+++ b/arch/parisc/kernel/smp.c
@@ -246,8 +246,8 @@ void kgdb_roundup_cpus(void)
inline void
smp_send_stop(void) { send_IPI_allbutself(IPI_CPU_STOP); }

-void
-smp_send_reschedule(int cpu) { send_IPI_single(cpu, IPI_RESCHEDULE); }
+void
+arch_smp_send_reschedule(int cpu) { send_IPI_single(cpu, IPI_RESCHEDULE); }

void
smp_send_all_nop(void)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 169703fead57..2d7b217392f2 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -364,12 +364,12 @@ static inline void do_message_pass(int cpu, int msg)
#endif
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
if (likely(smp_ops))
do_message_pass(cpu, PPC_MSG_RESCHEDULE);
}
-EXPORT_SYMBOL_GPL(smp_send_reschedule);
+EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);

void arch_send_call_function_single_ipi(int cpu)
{
diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
index 760a64518c58..213602e89a8b 100644
--- a/arch/riscv/kernel/smp.c
+++ b/arch/riscv/kernel/smp.c
@@ -235,8 +235,8 @@ void smp_send_stop(void)
cpumask_pr_args(cpu_online_mask));
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
send_ipi_single(cpu, IPI_RESCHEDULE);
}
-EXPORT_SYMBOL_GPL(smp_send_reschedule);
+EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 30c91d565933..9d1c36571106 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -542,7 +542,7 @@ void arch_send_call_function_single_ipi(int cpu)
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
*/
-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
pcpu_ec_call(pcpu_devices + cpu, ec_schedule);
}
diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
index 65924d9ec245..5cf35a774dc7 100644
--- a/arch/sh/kernel/smp.c
+++ b/arch/sh/kernel/smp.c
@@ -256,7 +256,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
(bogosum / (5000/HZ)) % 100);
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
mp_ops->send_ipi(cpu, SMP_MSG_RESCHEDULE);
}
diff --git a/arch/sparc/kernel/smp_32.c b/arch/sparc/kernel/smp_32.c
index ad8094d955eb..87eaa7719fa2 100644
--- a/arch/sparc/kernel/smp_32.c
+++ b/arch/sparc/kernel/smp_32.c
@@ -120,7 +120,7 @@ void cpu_panic(void)

struct linux_prom_registers smp_penguin_ctable = { 0 };

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
/*
* CPU model dependent way of implementing IPI generation targeting
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index a55295d1b924..e5964d1d8b37 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1430,7 +1430,7 @@ static unsigned long send_cpu_poke(int cpu)
return hv_err;
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
if (cpu == smp_processor_id()) {
WARN_ON_ONCE(preemptible());
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index a73bced40e24..5ff5815149bd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -99,7 +99,7 @@ static inline void play_dead(void)
smp_ops.play_dead();
}

-static inline void smp_send_reschedule(int cpu)
+static inline void arch_smp_send_reschedule(int cpu)
{
smp_ops.smp_send_reschedule(cpu);
}
diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
index 4dc109dd6214..d95907b8e4d3 100644
--- a/arch/xtensa/kernel/smp.c
+++ b/arch/xtensa/kernel/smp.c
@@ -389,7 +389,7 @@ void arch_send_call_function_single_ipi(int cpu)
send_ipi_message(cpumask_of(cpu), IPI_CALL_FUNC);
}

-void smp_send_reschedule(int cpu)
+void arch_smp_send_reschedule(int cpu)
{
send_ipi_message(cpumask_of(cpu), IPI_RESCHEDULE);
}
diff --git a/include/linux/smp.h b/include/linux/smp.h
index a80ab58ae3f1..a67e7aad17b9 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -125,6 +125,7 @@ extern void smp_send_stop(void);
/*
* sends a 'reschedule' event to another CPU:
*/
+extern void arch_smp_send_reschedule(int cpu);
extern void smp_send_reschedule(int cpu);


diff --git a/kernel/smp.c b/kernel/smp.c
index 387735180aed..9dfe057424f8 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -166,6 +166,12 @@ static inline void send_call_function_ipi_mask(const struct cpumask *mask)
arch_send_call_function_ipi_mask(mask);
}

+void smp_send_reschedule(int cpu)
+{
+ trace_ipi_send_cpu(_RET_IP_, cpu);
+ arch_smp_send_reschedule(cpu);
+}
+
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG

static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
--
2.31.1

2022-10-07 21:01:47

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

Hi Valentin,

On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
> Background
> ==========
>
> Detecting IPI *reception* is relatively easy, e.g. using
> trace_irq_handler_{entry,exit} or even just function-trace
> flush_smp_call_function_queue() for SMP calls.
>
> Figuring out their *origin*, is trickier as there is no generic tracepoint tied
> to e.g. smp_call_function():
>
> o AFAIA x86 has no tracepoint tied to sending IPIs, only receiving them
> (cf. trace_call_function{_single}_entry()).
> o arm/arm64 do have trace_ipi_raise(), which gives us the target cpus but also a
> mostly useless string (smp_calls will all be "Function call interrupts").
> o Other architectures don't seem to have any IPI-sending related tracepoint.
>
> I believe one reason those tracepoints used by arm/arm64 ended up as they were
> is because these archs used to handle IPIs differently from regular interrupts
> (the IRQ driver would directly invoke an IPI-handling routine), which meant they
> never showed up in trace_irq_handler_{entry, exit}. The trace_ipi_{entry,exit}
> tracepoints gave a way to trace IPI reception but those have become redundant as
> of:
>
> 56afcd3dbd19 ("ARM: Allow IPIs to be handled as normal interrupts")
> d3afc7f12987 ("arm64: Allow IPIs to be handled as normal interrupts")
>
> which gave IPIs a "proper" handler function used through
> generic_handle_domain_irq(), which makes them show up via
> trace_irq_handler_{entry, exit}.
>
> Changing stuff up
> =================
>
> Per the above, it would make sense to reshuffle trace_ipi_raise() and move it
> into generic code. This also came up during Daniel's talk on Osnoise at the CPU
> isolation MC of LPC 2022 [1].
>
> Now, to be useful, such a tracepoint needs to export:
> o targeted CPU(s)
> o calling context
>
> The only way to get the calling context with trace_ipi_raise() is to trigger a
> stack dump, e.g. $(trace-cmd -e ipi* -T echo 42).
>
> As for the targeted CPUs, the existing tracepoint does export them, albeit in
> cpumask form, which is quite inconvenient from a tooling perspective. For
> instance, as far as I'm aware, it's not possible to do event filtering on a
> cpumask via trace-cmd.

https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html

-f filter
Specify a filter for the previous event. This must come after
a -e. This will filter what events get recorded based on the
content of the event. Filtering is passed to the kernel
directly so what filtering is allowed may depend on what
version of the kernel you have. Basically, it will let you
use C notation to check if an event should be processed or
not.

==, >=, <=, >, <, &, |, && and ||

The above are usually safe to use to compare fields.

This looks overkill to me (consider large number of bits set in mask).

+#define trace_ipi_send_cpumask(callsite, mask) do { \
+ if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \
+ int cpu; \
+ for_each_cpu(cpu, mask) \
+ trace_ipi_send_cpu(callsite, cpu); \
+ } \
+} while (0)


>
> Because of the above points, this is introducing a new tracepoint.
>
> Patches
> =======
>
> This results in having trace events for:
>
> o smp_call_function*()
> o smp_send_reschedule()
> o irq_work_queue*()
>
> This is incomplete, just looking at arm64 there's more IPI types that aren't covered:
>
> IPI_CPU_STOP,
> IPI_CPU_CRASH_STOP,
> IPI_TIMER,
> IPI_WAKEUP,
>
> ... But it feels like a good starting point.

Can't you have a single tracepoint (or variant with cpumask) that would
cover such cases as well?

Maybe (as parameters for tracepoint):

* type (reschedule, smp_call_function, timer, wakeup, ...).

* function address: valid for smp_call_function, irq_work_queue
types.

> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
> fed to the tracepoint is not always useful - generic_exec_single() doesn't tell
> you much about the actual callback being sent via IPI, so there might be value
> in exploding the single tracepoint into at least one variant for smp_calls.

Not sure i grasp what you mean by "exploding the single tracepoint...",
but yes knowing the function or irq work function is very useful.

>
> Links
> =====
>
> [1]: https://youtu.be/5gT57y4OzBM?t=14234
>
> Valentin Schneider (5):
> trace: Add trace_ipi_send_{cpu, cpumask}
> sched, smp: Trace send_call_function_single_ipi()
> smp: Add a multi-CPU variant to send_call_function_single_ipi()
> irq_work: Trace calls to arch_irq_work_raise()
> treewide: Rename and trace arch-definitions of smp_send_reschedule()
>
> arch/alpha/kernel/smp.c | 2 +-
> arch/arc/kernel/smp.c | 2 +-
> arch/arm/kernel/smp.c | 5 +----
> arch/arm64/kernel/smp.c | 3 +--
> arch/csky/kernel/smp.c | 2 +-
> arch/hexagon/kernel/smp.c | 2 +-
> arch/ia64/kernel/smp.c | 4 ++--
> arch/loongarch/include/asm/smp.h | 2 +-
> arch/mips/include/asm/smp.h | 2 +-
> arch/openrisc/kernel/smp.c | 2 +-
> arch/parisc/kernel/smp.c | 4 ++--
> arch/powerpc/kernel/smp.c | 4 ++--
> arch/riscv/kernel/smp.c | 4 ++--
> arch/s390/kernel/smp.c | 2 +-
> arch/sh/kernel/smp.c | 2 +-
> arch/sparc/kernel/smp_32.c | 2 +-
> arch/sparc/kernel/smp_64.c | 2 +-
> arch/x86/include/asm/smp.h | 2 +-
> arch/xtensa/kernel/smp.c | 2 +-
> include/linux/smp.h | 1 +
> include/trace/events/ipi.h | 27 +++++++++++++++++++++++++++
> kernel/irq_work.c | 12 +++++++++++-
> kernel/sched/core.c | 7 +++++--
> kernel/smp.c | 18 +++++++++++++++++-
> 24 files changed, 84 insertions(+), 31 deletions(-)
>
> --
> 2.31.1
>
>

2022-10-08 00:58:40

by Guo Ren

[permalink] [raw]
Subject: Re: [RFC PATCH 5/5] treewide: Rename and trace arch-definitions of smp_send_reschedule()

On Fri, Oct 7, 2022 at 11:46 PM Valentin Schneider <[email protected]> wrote:
>
> To be able to trace invocations of smp_send_reschedule(), rename the
> arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
> into an smp_send_reschedule() that contains a tracepoint.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> arch/alpha/kernel/smp.c | 2 +-
> arch/arc/kernel/smp.c | 2 +-
> arch/arm/kernel/smp.c | 2 +-
> arch/arm64/kernel/smp.c | 2 +-
> arch/csky/kernel/smp.c | 2 +-
> arch/hexagon/kernel/smp.c | 2 +-
> arch/ia64/kernel/smp.c | 4 ++--
> arch/loongarch/include/asm/smp.h | 2 +-
> arch/mips/include/asm/smp.h | 2 +-
> arch/openrisc/kernel/smp.c | 2 +-
> arch/parisc/kernel/smp.c | 4 ++--
> arch/powerpc/kernel/smp.c | 4 ++--
> arch/riscv/kernel/smp.c | 4 ++--
> arch/s390/kernel/smp.c | 2 +-
> arch/sh/kernel/smp.c | 2 +-
> arch/sparc/kernel/smp_32.c | 2 +-
> arch/sparc/kernel/smp_64.c | 2 +-
> arch/x86/include/asm/smp.h | 2 +-
> arch/xtensa/kernel/smp.c | 2 +-
> include/linux/smp.h | 1 +
> kernel/smp.c | 6 ++++++
> 21 files changed, 30 insertions(+), 23 deletions(-)
>
> diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
> index f4e20f75438f..38637eb9eebd 100644
> --- a/arch/alpha/kernel/smp.c
> +++ b/arch/alpha/kernel/smp.c
> @@ -562,7 +562,7 @@ handle_ipi(struct pt_regs *regs)
> }
>
> void
> -smp_send_reschedule(int cpu)
> +arch_smp_send_reschedule(int cpu)
> {
> #ifdef DEBUG_IPI_MSG
> if (cpu == hard_smp_processor_id())
> diff --git a/arch/arc/kernel/smp.c b/arch/arc/kernel/smp.c
> index ab9e75e90f72..ae2e6a312361 100644
> --- a/arch/arc/kernel/smp.c
> +++ b/arch/arc/kernel/smp.c
> @@ -292,7 +292,7 @@ static void ipi_send_msg(const struct cpumask *callmap, enum ipi_msg_type msg)
> ipi_send_msg_one(cpu, msg);
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> ipi_send_msg_one(cpu, IPI_RESCHEDULE);
> }
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index 3b280d55c1c4..f216ac890b6f 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -745,7 +745,7 @@ void __init set_smp_ipi_range(int ipi_base, int n)
> ipi_setup(smp_processor_id());
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 937d2623e06b..8d108edc4a89 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -976,7 +976,7 @@ void __init set_smp_ipi_range(int ipi_base, int n)
> ipi_setup(smp_processor_id());
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/arch/csky/kernel/smp.c b/arch/csky/kernel/smp.c
> index 4b605aa2e1d6..fd7f81be16dd 100644
> --- a/arch/csky/kernel/smp.c
> +++ b/arch/csky/kernel/smp.c
> @@ -140,7 +140,7 @@ void smp_send_stop(void)
> on_each_cpu(ipi_stop, NULL, 1);
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
For csky part, Acked-by: Guo Ren <[email protected]>

> {
> send_ipi_message(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/arch/hexagon/kernel/smp.c b/arch/hexagon/kernel/smp.c
> index 4ba93e59370c..4e8bee25b8c6 100644
> --- a/arch/hexagon/kernel/smp.c
> +++ b/arch/hexagon/kernel/smp.c
> @@ -217,7 +217,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> }
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> send_ipi(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
> index e2cc59db86bc..ea4f009a232b 100644
> --- a/arch/ia64/kernel/smp.c
> +++ b/arch/ia64/kernel/smp.c
> @@ -220,11 +220,11 @@ kdump_smp_send_init(void)
> * Called with preemption disabled.
> */
> void
> -smp_send_reschedule (int cpu)
> +arch_smp_send_reschedule (int cpu)
> {
> ia64_send_ipi(cpu, IA64_IPI_RESCHEDULE, IA64_IPI_DM_INT, 0);
> }
> -EXPORT_SYMBOL_GPL(smp_send_reschedule);
> +EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
>
> /*
> * Called with preemption disabled.
> diff --git a/arch/loongarch/include/asm/smp.h b/arch/loongarch/include/asm/smp.h
> index 71189b28bfb2..3fcca134dfb1 100644
> --- a/arch/loongarch/include/asm/smp.h
> +++ b/arch/loongarch/include/asm/smp.h
> @@ -83,7 +83,7 @@ extern void show_ipi_list(struct seq_file *p, int prec);
> * it goes straight through and wastes no time serializing
> * anything. Worst case is that we lose a reschedule ...
> */
> -static inline void smp_send_reschedule(int cpu)
> +static inline void arch_smp_send_reschedule(int cpu)
> {
> loongson3_send_ipi_single(cpu, SMP_RESCHEDULE);
> }
> diff --git a/arch/mips/include/asm/smp.h b/arch/mips/include/asm/smp.h
> index 5d9ff61004ca..9806e79895d9 100644
> --- a/arch/mips/include/asm/smp.h
> +++ b/arch/mips/include/asm/smp.h
> @@ -66,7 +66,7 @@ extern void calculate_cpu_foreign_map(void);
> * it goes straight through and wastes no time serializing
> * anything. Worst case is that we lose a reschedule ...
> */
> -static inline void smp_send_reschedule(int cpu)
> +static inline void arch_smp_send_reschedule(int cpu)
> {
> extern const struct plat_smp_ops *mp_ops; /* private */
>
> diff --git a/arch/openrisc/kernel/smp.c b/arch/openrisc/kernel/smp.c
> index e1419095a6f0..0a7a059e2dff 100644
> --- a/arch/openrisc/kernel/smp.c
> +++ b/arch/openrisc/kernel/smp.c
> @@ -173,7 +173,7 @@ void handle_IPI(unsigned int ipi_msg)
> }
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/arch/parisc/kernel/smp.c b/arch/parisc/kernel/smp.c
> index 7dbd92cafae3..b7fc859fa87d 100644
> --- a/arch/parisc/kernel/smp.c
> +++ b/arch/parisc/kernel/smp.c
> @@ -246,8 +246,8 @@ void kgdb_roundup_cpus(void)
> inline void
> smp_send_stop(void) { send_IPI_allbutself(IPI_CPU_STOP); }
>
> -void
> -smp_send_reschedule(int cpu) { send_IPI_single(cpu, IPI_RESCHEDULE); }
> +void
> +arch_smp_send_reschedule(int cpu) { send_IPI_single(cpu, IPI_RESCHEDULE); }
>
> void
> smp_send_all_nop(void)
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 169703fead57..2d7b217392f2 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -364,12 +364,12 @@ static inline void do_message_pass(int cpu, int msg)
> #endif
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> if (likely(smp_ops))
> do_message_pass(cpu, PPC_MSG_RESCHEDULE);
> }
> -EXPORT_SYMBOL_GPL(smp_send_reschedule);
> +EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
>
> void arch_send_call_function_single_ipi(int cpu)
> {
> diff --git a/arch/riscv/kernel/smp.c b/arch/riscv/kernel/smp.c
> index 760a64518c58..213602e89a8b 100644
> --- a/arch/riscv/kernel/smp.c
> +++ b/arch/riscv/kernel/smp.c
> @@ -235,8 +235,8 @@ void smp_send_stop(void)
> cpumask_pr_args(cpu_online_mask));
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> send_ipi_single(cpu, IPI_RESCHEDULE);
> }
> -EXPORT_SYMBOL_GPL(smp_send_reschedule);
> +EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);
> diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> index 30c91d565933..9d1c36571106 100644
> --- a/arch/s390/kernel/smp.c
> +++ b/arch/s390/kernel/smp.c
> @@ -542,7 +542,7 @@ void arch_send_call_function_single_ipi(int cpu)
> * it goes straight through and wastes no time serializing
> * anything. Worst case is that we lose a reschedule ...
> */
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> pcpu_ec_call(pcpu_devices + cpu, ec_schedule);
> }
> diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c
> index 65924d9ec245..5cf35a774dc7 100644
> --- a/arch/sh/kernel/smp.c
> +++ b/arch/sh/kernel/smp.c
> @@ -256,7 +256,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> (bogosum / (5000/HZ)) % 100);
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> mp_ops->send_ipi(cpu, SMP_MSG_RESCHEDULE);
> }
> diff --git a/arch/sparc/kernel/smp_32.c b/arch/sparc/kernel/smp_32.c
> index ad8094d955eb..87eaa7719fa2 100644
> --- a/arch/sparc/kernel/smp_32.c
> +++ b/arch/sparc/kernel/smp_32.c
> @@ -120,7 +120,7 @@ void cpu_panic(void)
>
> struct linux_prom_registers smp_penguin_ctable = { 0 };
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> /*
> * CPU model dependent way of implementing IPI generation targeting
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index a55295d1b924..e5964d1d8b37 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1430,7 +1430,7 @@ static unsigned long send_cpu_poke(int cpu)
> return hv_err;
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> if (cpu == smp_processor_id()) {
> WARN_ON_ONCE(preemptible());
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index a73bced40e24..5ff5815149bd 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -99,7 +99,7 @@ static inline void play_dead(void)
> smp_ops.play_dead();
> }
>
> -static inline void smp_send_reschedule(int cpu)
> +static inline void arch_smp_send_reschedule(int cpu)
> {
> smp_ops.smp_send_reschedule(cpu);
> }
> diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c
> index 4dc109dd6214..d95907b8e4d3 100644
> --- a/arch/xtensa/kernel/smp.c
> +++ b/arch/xtensa/kernel/smp.c
> @@ -389,7 +389,7 @@ void arch_send_call_function_single_ipi(int cpu)
> send_ipi_message(cpumask_of(cpu), IPI_CALL_FUNC);
> }
>
> -void smp_send_reschedule(int cpu)
> +void arch_smp_send_reschedule(int cpu)
> {
> send_ipi_message(cpumask_of(cpu), IPI_RESCHEDULE);
> }
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index a80ab58ae3f1..a67e7aad17b9 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -125,6 +125,7 @@ extern void smp_send_stop(void);
> /*
> * sends a 'reschedule' event to another CPU:
> */
> +extern void arch_smp_send_reschedule(int cpu);
> extern void smp_send_reschedule(int cpu);
>
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 387735180aed..9dfe057424f8 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -166,6 +166,12 @@ static inline void send_call_function_ipi_mask(const struct cpumask *mask)
> arch_send_call_function_ipi_mask(mask);
> }
>
> +void smp_send_reschedule(int cpu)
> +{
> + trace_ipi_send_cpu(_RET_IP_, cpu);
> + arch_smp_send_reschedule(cpu);
> +}
> +
> #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
>
> static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
> --
> 2.31.1
>


--
Best Regards
Guo Ren

2022-10-08 19:49:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] irq_work: Trace calls to arch_irq_work_raise()

On Fri, 7 Oct 2022 16:45:32 +0100
Valentin Schneider <[email protected]> wrote:
> }
>
> +static inline void irq_work_raise(void)
> +{
> + if (arch_irq_work_has_interrupt())
> + trace_ipi_send_cpu(_RET_IP_, smp_processor_id());

To save on the branch, let's make the above:

if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())

As the "trace_*_enabled()" is a static branch that will make it a nop
when the tracepoint is not enabled.

-- Steve


> +
> + arch_irq_work_raise();
> +}
> +
> /* Enqueue on current CPU, work must already be claimed and preempt disabled */

2022-10-08 19:50:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

On Fri, 7 Oct 2022 17:01:33 -0300
Marcelo Tosatti <[email protected]> wrote:

> > As for the targeted CPUs, the existing tracepoint does export them, albeit in
> > cpumask form, which is quite inconvenient from a tooling perspective. For
> > instance, as far as I'm aware, it's not possible to do event filtering on a
> > cpumask via trace-cmd.
>
> https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html
>
> -f filter
> Specify a filter for the previous event. This must come after
> a -e. This will filter what events get recorded based on the
> content of the event. Filtering is passed to the kernel
> directly so what filtering is allowed may depend on what
> version of the kernel you have. Basically, it will let you
> use C notation to check if an event should be processed or
> not.
>
> ==, >=, <=, >, <, &, |, && and ||
>
> The above are usually safe to use to compare fields.

We could always add an "isset(x)" filter ;-)

-- Steve

2022-10-11 16:15:57

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] irq_work: Trace calls to arch_irq_work_raise()

On 08/10/22 15:34, Steven Rostedt wrote:
> On Fri, 7 Oct 2022 16:45:32 +0100
> Valentin Schneider <[email protected]> wrote:
>> }
>>
>> +static inline void irq_work_raise(void)
>> +{
>> + if (arch_irq_work_has_interrupt())
>> + trace_ipi_send_cpu(_RET_IP_, smp_processor_id());
>
> To save on the branch, let's make the above:
>
> if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())
>
> As the "trace_*_enabled()" is a static branch that will make it a nop
> when the tracepoint is not enabled.
>

Makes sense, thanks for the suggestion.

> -- Steve
>
>
>> +
>> + arch_irq_work_raise();
>> +}
>> +
>> /* Enqueue on current CPU, work must already be claimed and preempt disabled */

2022-10-11 16:34:33

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

+Cc Douglas

On 07/10/22 17:01, Marcelo Tosatti wrote:
> Hi Valentin,
>
> On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
>> Background
>> ==========
>>
>> As for the targeted CPUs, the existing tracepoint does export them, albeit in
>> cpumask form, which is quite inconvenient from a tooling perspective. For
>> instance, as far as I'm aware, it's not possible to do event filtering on a
>> cpumask via trace-cmd.
>
> https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html
>
> -f filter
> Specify a filter for the previous event. This must come after
> a -e. This will filter what events get recorded based on the
> content of the event. Filtering is passed to the kernel
> directly so what filtering is allowed may depend on what
> version of the kernel you have. Basically, it will let you
> use C notation to check if an event should be processed or
> not.
>
> ==, >=, <=, >, <, &, |, && and ||
>
> The above are usually safe to use to compare fields.
>
> This looks overkill to me (consider large number of bits set in mask).
>
> +#define trace_ipi_send_cpumask(callsite, mask) do { \
> + if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \
> + int cpu; \
> + for_each_cpu(cpu, mask) \
> + trace_ipi_send_cpu(callsite, cpu); \
> + } \
> +} while (0)
>

Indeed, I expected pushback on this :-)

I went for this due to how much simpler an int is to process/use compared
to a cpumask. There is the trigger example I listed above, but the
consumption of the trace event itself as well.

Consider this event collected on an arm64 QEMU instance (output from trace-cmd)

<...>-234 [001] 37.251567: ipi_raise: target_mask=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000004 (Function call interrupts)

That sort of formatting has been an issue downstream for things like LISA
[1] where events are aggregated into Pandas tables, and we need to play
silly games for performance reason because bitmasks aren't a native Python
type.

I had a look at libtraceevent to see how this data is exposed and if the
answer would be better tooling:

tep_get_field_val() just yields an unsigned long long of value 0x200018,
which AFAICT is just the [length, offset] thing associated with dynamic
arrays. Not really usable, and I don't see anything exported in the lib to
extract and use those values.

tep_get_field_raw() is better, it handles the dynamic array for us and
yields a pointer to the cpumask array at the tail of the record. With that
it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
this isn't a native type for many programming languages.

In contrast, this is immediately readable and consumable by userspace tools

<...>-234 [001] 37.250882: ipi_send_cpu: callsite=__smp_call_single_queue+0x5c target_cpu=2

Thinking out loud, it makes way more sense to record a cpumask in the
tracepoint, but perhaps we could have a postprocessing step to transform
those into N events each targeting a single CPU?

[1]: https://github.com/ARM-software/lisa/blob/37b51243a94b27ea031ff62bb4ce818a59a7f6ef/lisa/trace.py#L4756

>
>>
>> Because of the above points, this is introducing a new tracepoint.
>>
>> Patches
>> =======
>>
>> This results in having trace events for:
>>
>> o smp_call_function*()
>> o smp_send_reschedule()
>> o irq_work_queue*()
>>
>> This is incomplete, just looking at arm64 there's more IPI types that aren't covered:
>>
>> IPI_CPU_STOP,
>> IPI_CPU_CRASH_STOP,
>> IPI_TIMER,
>> IPI_WAKEUP,
>>
>> ... But it feels like a good starting point.
>
> Can't you have a single tracepoint (or variant with cpumask) that would
> cover such cases as well?
>
> Maybe (as parameters for tracepoint):
>
> * type (reschedule, smp_call_function, timer, wakeup, ...).
>
> * function address: valid for smp_call_function, irq_work_queue
> types.
>

That's a good point, I wasn't sure about having a parameter serving as
discriminant for another, but the function address would be either valid or
NULL which is fine. So perhaps:
o callsite (i.e. _RET_IP_), serves as type
o address of callback tied to IPI, if any
o target CPUs

>> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
>> fed to the tracepoint is not always useful - generic_exec_single() doesn't tell
>> you much about the actual callback being sent via IPI, so there might be value
>> in exploding the single tracepoint into at least one variant for smp_calls.
>
> Not sure i grasp what you mean by "exploding the single tracepoint...",
> but yes knowing the function or irq work function is very useful.
>

Sorry; I meant having several "specialized" tracepoints instead of a single one.

Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

On 10/11/22 18:17, Valentin Schneider wrote:
> Thinking out loud, it makes way more sense to record a cpumask in the
> tracepoint, but perhaps we could have a postprocessing step to transform
> those into N events each targeting a single CPU?

My approach on the tracers/rtla is to make the simple things in kernel, and beautify
things in user-space.

You could keep the tracepoint as a mask, and then make it pretty, like cpus=3-5,8
in user-space. For example with a trace-cmd/perf loadable plugin, libtracefs helper.

For rtla I was thinking to make a new tool to parse them. and make it pretty there.

-- Daniel

2022-10-11 16:47:28

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

On 11/10/22 18:22, Daniel Bristot de Oliveira wrote:
> On 10/11/22 18:17, Valentin Schneider wrote:
>> Thinking out loud, it makes way more sense to record a cpumask in the
>> tracepoint, but perhaps we could have a postprocessing step to transform
>> those into N events each targeting a single CPU?
>
> My approach on the tracers/rtla is to make the simple things in kernel, and beautify
> things in user-space.
>
> You could keep the tracepoint as a mask, and then make it pretty, like cpus=3-5,8
> in user-space. For example with a trace-cmd/perf loadable plugin, libtracefs helper.
>

That's a nice idea, the one downside I see is that means registering an
event handler for all events with cpumasks rather than directly targeting
cpumask fields, but that doesn't look too horrible. I'll dig a bit in that
direction.

> For rtla I was thinking to make a new tool to parse them. and make it pretty there.
>
> -- Daniel

2022-10-11 20:41:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

On Tue, 11 Oct 2022 17:17:04 +0100
Valentin Schneider <[email protected]> wrote:

> tep_get_field_val() just yields an unsigned long long of value 0x200018,
> which AFAICT is just the [length, offset] thing associated with dynamic
> arrays. Not really usable, and I don't see anything exported in the lib to
> extract and use those values.

Correct.

>
> tep_get_field_raw() is better, it handles the dynamic array for us and
> yields a pointer to the cpumask array at the tail of the record. With that
> it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
> this isn't a native type for many programming languages.

Yeah, this is the interface that I would have used. And it would likely
require some kind of wrapper to make it into what you prefer.

Note, I've been complaining for some time now how much I hate the
libtraceevent interface, and want to rewrite it. (I just spoke to
someone that wants to implement it in Rust). But the question comes
down to how to make it backward compatible. Perhaps we don't and just
up the major version number (libtraceevent 2.0).

What would you recommend as an API to process cpumasks better?

-- Steve

2022-10-11 20:44:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

On Tue, 11 Oct 2022 17:40:26 +0100
Valentin Schneider <[email protected]> wrote:

> > You could keep the tracepoint as a mask, and then make it pretty, like cpus=3-5,8
> > in user-space. For example with a trace-cmd/perf loadable plugin, libtracefs helper.
> >
>
> That's a nice idea, the one downside I see is that means registering an
> event handler for all events with cpumasks rather than directly targeting
> cpumask fields, but that doesn't look too horrible. I'll dig a bit in that
> direction.

We could just make all all dynamic array's of unsigned long use that
format? I don't know of any other event that has dynamic arrays of
unsigned longs. And doing a search doesn't come up with any.

-- Steve