When running RT workloads in isolated CPUs, many cases of deadline misses
are caused by remote CPU requests such as smp_call_function*().
For those cases, having the names of the functions running around the
deadline-miss moment can help find a target for the next improvements.
Add tracepoints that record the function name & argument before entering
and after exiting the called function.
Signed-off-by: Leonardo Bras <[email protected]>
---
include/trace/events/smp.h | 56 ++++++++++++++++++++++++++++++++++++++
kernel/smp.c | 11 ++++++++
2 files changed, 67 insertions(+)
create mode 100644 include/trace/events/smp.h
diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
new file mode 100644
index 0000000000000..94aae8d71705d
--- /dev/null
+++ b/include/trace/events/smp.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM smp
+
+#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SMP_H
+
+#include <linux/tracepoint.h>
+#include <linux/smp.h>
+
+/*
+ * Tracepoints for a function which is called as an effect of smp_call_function.*
+ */
+TRACE_EVENT(smp_call_function_entry,
+
+ TP_PROTO(smp_call_func_t func, void *info),
+
+ TP_ARGS(func, info),
+
+ TP_STRUCT__entry(
+ __field(void *, func)
+ __field(void *, info)
+ ),
+
+ TP_fast_assign(
+ __entry->func = func;
+ __entry->info = info;
+ ),
+
+ TP_printk("function %ps, argument = %p", __entry->func, __entry->info)
+);
+
+
+TRACE_EVENT(smp_call_function_exit,
+
+ TP_PROTO(smp_call_func_t func, void *info),
+
+ TP_ARGS(func, info),
+
+ TP_STRUCT__entry(
+ __field(void *, func)
+ __field(void *, info)
+ ),
+
+ TP_fast_assign(
+ __entry->func = func;
+ __entry->info = info;
+ ),
+
+ TP_printk("function %ps with argument = %p", __entry->func, __entry->info)
+);
+
+#endif /* _TRACE_SMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/smp.c b/kernel/smp.c
index 06a413987a14a..38d8dec28c39c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -29,6 +29,9 @@
#include "smpboot.h"
#include "sched/smp.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/smp.h>
+
#define CSD_TYPE(_csd) ((_csd)->node.u_flags & CSD_FLAG_TYPE_MASK)
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
@@ -517,7 +520,9 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
csd_lock_record(csd);
csd_unlock(csd);
local_irq_save(flags);
+ trace_smp_call_function_entry(func, info);
func(info);
+ trace_smp_call_function_exit(func, info);
csd_lock_record(NULL);
local_irq_restore(flags);
return 0;
@@ -627,7 +632,9 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
}
csd_lock_record(csd);
+ trace_smp_call_function_entry(func, info);
func(info);
+ trace_smp_call_function_exit(func, info);
csd_unlock(csd);
csd_lock_record(NULL);
} else {
@@ -662,7 +669,9 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
csd_lock_record(csd);
csd_unlock(csd);
+ trace_smp_call_function_entry(func, info);
func(info);
+ trace_smp_call_function_exit(func, info);
csd_lock_record(NULL);
} else if (type == CSD_TYPE_IRQ_WORK) {
irq_work_single(csd);
@@ -975,7 +984,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
unsigned long flags;
local_irq_save(flags);
+ trace_smp_call_function_entry(func, info);
func(info);
+ trace_smp_call_function_exit(func, info);
local_irq_restore(flags);
}
--
2.40.0
On 2023-04-06 04:57:18 [-0300], Leonardo Bras wrote:
> When running RT workloads in isolated CPUs, many cases of deadline misses
> are caused by remote CPU requests such as smp_call_function*().
>
> For those cases, having the names of those functions running around the
> deadline miss moment could help finding a target for the next improvements.
>
> Add tracepoints for acquiring the funtion name & argument before entry and
> after exitting the called function.
>
> Signed-off-by: Leonardo Bras <[email protected]>
Check
https://lore.kernel.org/[email protected]
Sebastian
On Thu, 2023-04-06 at 10:15 +0200, Sebastian Andrzej Siewior wrote:
> On 2023-04-06 04:57:18 [-0300], Leonardo Bras wrote:
> > When running RT workloads in isolated CPUs, many cases of deadline misses
> > are caused by remote CPU requests such as smp_call_function*().
> >
> > For those cases, having the names of those functions running around the
> > deadline miss moment could help finding a target for the next improvements.
> >
> > Add tracepoints for acquiring the funtion name & argument before entry and
> > after exitting the called function.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
>
> Check
> https://lore.kernel.org/[email protected]
>
> Sebastian
>
Hello Sebastian, thanks for the heads up!
IIUC Valentin is adding/improving tracepoints that are collected in the
requesting CPU, at the moment of scheduling the IPI, which are also useful.
But in my scenario, it would be nice to actually have the information on the
time window in which the requested function is running on the requested CPU.
This would allow us to check how many IPIs, and which ones, took too much
time to complete and caused the RT deadline miss.
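For instance, a minimal usage sketch with the events from this patch applied
(standard tracefs paths assumed; N is the isolated CPU):
# cd /sys/kernel/tracing
# echo 1 > events/smp/smp_call_function_entry/enable
# echo 1 > events/smp/smp_call_function_exit/enable
# cat per_cpu/cpuN/trace
The delta between the entry and exit timestamps of a given function then
shows how long that remote request ran on the isolated CPU.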
(I noticed there is an include/trace/events/ipi.h that could be used instead
of creating a new include/trace/events/smp.h; what is your opinion on this?)
Please let me know if I am missing something.
Thank you!
Leo
On Thu, Apr 06, 2023 at 04:57:18AM -0300, Leonardo Bras wrote:
> When running RT workloads in isolated CPUs, many cases of deadline misses
> are caused by remote CPU requests such as smp_call_function*().
>
> For those cases, having the names of those functions running around the
> deadline miss moment could help finding a target for the next improvements.
>
> Add tracepoints for acquiring the funtion name & argument before entry and
> after exitting the called function.
>
> Signed-off-by: Leonardo Bras <[email protected]>
How are the patches queued there not sufficient?
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=smp/core
On Thu, 06 Apr 2023 05:42:13 -0300
Leonardo Brás <[email protected]> wrote:
> Hello Sebastian, thanks for the heads up!
>
> IIUC Valentin is adding/improving tracepoints that are collected in the
> requesting CPU, at the moment of scheduling the IPI, which are also useful.
>
> But on my scenario, would be nice to actually have the information on the time
> window in which the requested function is running in the requested CPU.
>
Well, once you know the functions that are being called (running your test
case with the IPI trace events), then you can get the timings of those same
functions by:
# cd /sys/kernel/tracing
# echo '<space-delimited-list-of-functions>' > set_ftrace_filter
# echo function_graph > current_tracer
Then the function graph will give you those timings.
-- Steve
On Thu, 2023-04-06 at 11:55 +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 04:57:18AM -0300, Leonardo Bras wrote:
> > When running RT workloads in isolated CPUs, many cases of deadline misses
> > are caused by remote CPU requests such as smp_call_function*().
> >
> > For those cases, having the names of those functions running around the
> > deadline miss moment could help finding a target for the next improvements.
> >
> > Add tracepoints for acquiring the funtion name & argument before entry and
> > after exitting the called function.
> >
> > Signed-off-by: Leonardo Bras <[email protected]>
>
> How are the patches queued there not sufficient?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=smp/core
>
IIUC the last commits add tracepoints that are collected in the
requesting CPU, at the moment of scheduling the IPI, which are also useful in
some scenarios.
In my scenario, it could help a little, since it makes it possible to filter
what all other CPUs are scheduling on the requested CPU. OTOH it could also be
misleading, as the requested CPU could be running something that was scheduled
way before.
The change I propose does exactly what my scenario needs: track exactly which
function was running at a given time on the requested CPU. With this info, we
can check which (if any) remotely requested function was running in a given
time window.
(An unrelated thought: we could even use the commits you pointed to together
with my proposed change in order to measure how long it takes for a requested
function to run / complete on the requested CPU.)
Thanks for reviewing!
Leo
On Thu, 2023-04-06 at 09:49 -0400, Steven Rostedt wrote:
> On Thu, 06 Apr 2023 05:42:13 -0300
> Leonardo Brás <[email protected]> wrote:
>
> > Hello Sebastian, thanks for the heads up!
> >
> > IIUC Valentin is adding/improving tracepoints that are collected in the
> > requesting CPU, at the moment of scheduling the IPI, which are also useful.
> >
> > But on my scenario, would be nice to actually have the information on the time
> > window in which the requested function is running in the requested CPU.
> >
Hello Steven,
>
> Well once you know the functions that are being called (running your test
> case with the IPI trace events), then you can get the timings of those same
> functions by:
>
> # cd /sys/kernel/tracing
> # echo '<space-delimited-list-of-functions>' > set_ftrace_filter
> # echo function_graph > current_tracer
>
> Then the function graph will give you those timings.
IIUC, the use scenario of your suggestion would be something like:
1) Run something until we get a deadline miss on a given CPU
2) Get the log -> which functions were scheduled on that CPU in the last N
time units
3) Add those functions to set_ftrace_filter
4) Run again until we hit a deadline miss again
If that's correct, there are some points worth mentioning:
a) It requires reproducing the deadline miss again, which may take
non-negligible time, slowing down the debug process. The suggested change
would get the functions in the first run.
b) There is no guarantee the next deadline time window will contain (only)
those functions. In case some are missing, we would need to add the new
functions and run the test once more.
c) Those functions could also be called outside of the remote-request
context, which would dirty our log.
>
> -- Steve
>
Thanks for reviewing and contributing, Steve!
Leo
On Wed, 2023-04-19 at 00:45 -0300, Leonardo Brás wrote:
> On Thu, 2023-04-06 at 11:55 +0200, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 04:57:18AM -0300, Leonardo Bras wrote:
> > > When running RT workloads in isolated CPUs, many cases of deadline misses
> > > are caused by remote CPU requests such as smp_call_function*().
> > >
> > > For those cases, having the names of those functions running around the
> > > deadline miss moment could help finding a target for the next improvements.
> > >
> > > Add tracepoints for acquiring the funtion name & argument before entry and
> > > after exitting the called function.
> > >
> > > Signed-off-by: Leonardo Bras <[email protected]>
> >
> > How are the patches queued there not sufficient?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=smp/core
Hello Peter,
The trace I proposed will run on the CPU which received the request, not on
the one which originated it; that is, the trace will run on "cpu" from
smp_call_function*(cpu, ...).
IIUC the trace introduced by the commits in tip/smp/core (which you pointed
to) will run on the CPU which calls smp_call_function*(), which is different
from the above.
Did I get it right?
Could you please help me understand how the commits in tip/smp/core are
sufficient for my scenario?
Thanks!
Leo
> >
>
> IIUC the last commits add tracepoints that are collected in the
> requesting CPU, at the moment of scheduling the IPI, which are also useful in
> some scenarios.
>
> On my scenario, it could help a little, since it makes possible to filter what
> all other cpus are scheduling on the requested cpu. OTOH it could be also be
> misleading, as the requested cpu could be running something that was scheduled
> way before.
>
> The change I propose does exactly what my scenario need: track exactly which
> function was running at given time in the requested CPU. With this info, we can
> check which (if any) remotely requested function was running on given time
> window.
>
> (An unrelated thing I just thought: We could even use the commits you pointed
> together with my proposed change in order to measure how long does it take for a
> requested function to run / complete in the requested cpu)
>
> Thanks for reviewing!
>
> Leo
On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
> On Thu, 2023-04-06 at 11:55 +0200, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 04:57:18AM -0300, Leonardo Bras wrote:
> > > When running RT workloads in isolated CPUs, many cases of deadline misses
> > > are caused by remote CPU requests such as smp_call_function*().
> > >
> > > For those cases, having the names of those functions running around the
> > > deadline miss moment could help finding a target for the next improvements.
> > >
> > > Add tracepoints for acquiring the funtion name & argument before entry and
> > > after exitting the called function.
> > >
> > > Signed-off-by: Leonardo Bras <[email protected]>
> >
> > How are the patches queued there not sufficient?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=smp/core
> >
>
> IIUC the last commits add tracepoints that are collected in the
> requesting CPU, at the moment of scheduling the IPI, which are also useful in
> some scenarios.
>
> On my scenario, it could help a little, since it makes possible to filter what
> all other cpus are scheduling on the requested cpu. OTOH it could be also be
> misleading, as the requested cpu could be running something that was scheduled
> way before.
>
> The change I propose does exactly what my scenario need: track exactly which
> function was running at given time in the requested CPU. With this info, we can
> check which (if any) remotely requested function was running on given time
> window.
I was thinking you could simply (graph)-trace
__flush_smp_call_function_queue() with a max_graph_depth or so (Steve
says that ought to work).
But even that might be too specific; your use case sounds more like what
we have the irq-off latency tracer for, and that thing will immediately
tell you what functions were being run.
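Something like this, for example (a rough sketch using the standard tracefs
knobs, untested here):
# cd /sys/kernel/tracing
# echo __flush_smp_call_function_queue > set_graph_function
# echo 2 > max_graph_depth
# echo function_graph > current_tracer
or, for the latency tracer (needs CONFIG_IRQSOFF_TRACER):
# echo irqsoff > current_tracer
# echo 0 > tracing_max_latency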
> (An unrelated thing I just thought: We could even use the commits you pointed
> together with my proposed change in order to measure how long does it take for a
> requested function to run / complete in the requested cpu)
I don't think you could actually do that; the send tracepoints Valentin
added don't log the csd address, which means you cannot distinguish
two CSDs with the same function sent from different CPUs.
To do this you'd need to add the csd address to the ipi_send
tracepoints and your own (possibly replacing info -- because I don't
think that's too useful).
Valentin -- is any of this something you'd also find useful?
On Wed, May 3, 2023 at 12:17 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
> > On Thu, 2023-04-06 at 11:55 +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 06, 2023 at 04:57:18AM -0300, Leonardo Bras wrote:
> > > > When running RT workloads in isolated CPUs, many cases of deadline misses
> > > > are caused by remote CPU requests such as smp_call_function*().
> > > >
> > > > For those cases, having the names of those functions running around the
> > > > deadline miss moment could help finding a target for the next improvements.
> > > >
> > > > Add tracepoints for acquiring the funtion name & argument before entry and
> > > > after exitting the called function.
> > > >
> > > > Signed-off-by: Leonardo Bras <[email protected]>
> > >
> > > How are the patches queued there not sufficient?
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=smp/core
> > >
> >
> > IIUC the last commits add tracepoints that are collected in the
> > requesting CPU, at the moment of scheduling the IPI, which are also useful in
> > some scenarios.
> >
> > On my scenario, it could help a little, since it makes possible to filter what
> > all other cpus are scheduling on the requested cpu. OTOH it could be also be
> > misleading, as the requested cpu could be running something that was scheduled
> > way before.
> >
> > The change I propose does exactly what my scenario need: track exactly which
> > function was running at given time in the requested CPU. With this info, we can
> > check which (if any) remotely requested function was running on given time
> > window.
>
> I was thinking you could simply (graph)-trace
> __flush_smp_call_function_queue() with a max_graph_depth or so (Steve
> says that ought to work).
>
> But even that might be too specific, your use case sounds more like what
> we have the irq-off latency tracer for, and that thing will immediately
> tell you what functions were being ran.
>
> > (An unrelated thing I just thought: We could even use the commits you pointed
> > together with my proposed change in order to measure how long does it take for a
> > requested function to run / complete in the requested cpu)
>
> I don't think you could actually do that; the send tracepoints Valentin
> added don't log the csd address, this means you cannot distinguish
> two CSDs with the same function send from different CPUs.
>
> To do this you'd need to add the csd address to the the ipi_send
> tracepoints and your own (possibly replacing info -- because I don't
> think that's too useful).
Sure, I will improve this in my patch.
Do you think I should rebase it on top of tip/smp/core in order to add it
to the set?
Thanks!
Leo
>
> Valentin -- is any of this something you'd also find useful?
>
+Daniel
On 03/05/23 16:59, Peter Zijlstra wrote:
> On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
>>
>> IIUC the last commits add tracepoints that are collected in the
>> requesting CPU, at the moment of scheduling the IPI, which are also useful in
>> some scenarios.
>>
>> On my scenario, it could help a little, since it makes possible to filter what
>> all other cpus are scheduling on the requested cpu. OTOH it could be also be
>> misleading, as the requested cpu could be running something that was scheduled
>> way before.
>>
>> The change I propose does exactly what my scenario need: track exactly which
>> function was running at given time in the requested CPU. With this info, we can
>> check which (if any) remotely requested function was running on given time
>> window.
>
> I was thinking you could simply (graph)-trace
> __flush_smp_call_function_queue() with a max_graph_depth or so (Steve
> says that ought to work).
>
> But even that might be too specific, your use case sounds more like what
> we have the irq-off latency tracer for, and that thing will immediately
> tell you what functions were being ran.
>
>> (An unrelated thing I just thought: We could even use the commits you pointed
>> together with my proposed change in order to measure how long does it take for a
>> requested function to run / complete in the requested cpu)
>
> I don't think you could actually do that; the send tracepoints Valentin
> added don't log the csd address, this means you cannot distinguish
> two CSDs with the same function send from different CPUs.
>
> To do this you'd need to add the csd address to the the ipi_send
> tracepoints and your own (possibly replacing info -- because I don't
> think that's too useful).
>
> Valentin -- is any of this something you'd also find useful?
Conceptually yeah, however:
With isolcpus + NOHZ_FULL, *any* IPI sent to an isolated CPU is game over,
you interrupt the userspace task and you eat the full meal that is
NOHZ_FULL kernel entry. Pretty much any such IPI will show up when
evaluating your setup with rteval/rtla/whatever, so having an event at
IPI reception is redundant.
IIUC Leonardo's looking at a setup with isolcpus but not necessarily
NOHZ_FULL, so he wants to look at *which* IPIs (if any) are pushing the
isolated task over its deadline/period. I would argue any IPI received
there is at risk of doing that, so it's the same fight as with NOHZ_FULL.
With that said, I suppose this could still be helpful for e.g. osnoise to
hook into and point the finger at which CPU/context sent the problematic
IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
times.
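For instance, with the events sketched below, something like this could
correlate queueing and execution by csd address (a rough sketch):
# cd /sys/kernel/tracing
# echo 1 > events/smp/csd_queue_cpu/enable
# echo 1 > events/smp/csd_function_entry/enable
# echo 1 > events/smp/csd_function_exit/enable
# cat trace
Matching the csd pointer between csd_queue_cpu on the sending CPU and
csd_function_entry on the target CPU gives the delivery latency; the
entry/exit pair gives the handler runtime.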
One thing though is that trace_ipi_send_cpu*() is not used solely for
CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().
Something like so...
---
diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
new file mode 100644
index 0000000000000..8fc725a2b45b9
--- /dev/null
+++ b/include/trace/events/smp.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM smp
+
+#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SMP_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(csd_queue_cpu,
+
+ TP_PROTO(const unsigned int cpu,
+ unsigned long callsite,
+ smp_call_func_t func,
+ call_single_data_t *csd),
+
+ TP_ARGS(cpu, callsite, func, csd),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, cpu)
+ __field(void *, callsite)
+ __field(void *, func)
+ __field(void *, csd)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->callsite = (void *)callsite;
+ __entry->func = func;
+ __entry->csd = csd;
+ ),
+
+ TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
+ __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
+);
+
+DECLARE_EVENT_CLASS(csd_function,
+
+ TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+
+ TP_ARGS(func, csd),
+
+ TP_STRUCT__entry(
+ __field(void *, func)
+ __field(void *, csd)
+ ),
+
+ TP_fast_assign(
+ __entry->func = func;
+ __entry->csd = csd;
+ ),
+
+ TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_entry,
+ TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+ TP_ARGS(func, csd)
+);
+
+DEFINE_EVENT(csd_function, csd_function_exit,
+ TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
+ TP_ARGS(func, csd)
+);
+
+#endif /* _TRACE_SMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/smp.c b/kernel/smp.c
index ab3e5dad6cfe9..7d28db303e9bc 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -27,6 +27,9 @@
#include <linux/jump_label.h>
#include <trace/events/ipi.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/smp.h>
+#undef CREATE_TRACE_POINTS
#include "smpboot.h"
#include "sched/smp.h"
@@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
arch_send_call_function_ipi_mask(mask);
}
+static __always_inline void
+csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
+{
+ trace_csd_function_entry(func, csd);
+ func(info);
+ trace_csd_function_exit(func, csd);
+}
+
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
@@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
* even if we haven't sent the smp_call IPI yet (e.g. the stopper
* executes migration_cpu_stop() on the remote CPU).
*/
- if (trace_ipi_send_cpu_enabled()) {
+ if (trace_csd_queue_cpu_enabled()) {
call_single_data_t *csd;
smp_call_func_t func;
@@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
sched_ttwu_pending : csd->func;
- trace_ipi_send_cpu(cpu, _RET_IP_, func);
+ trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
}
/*
@@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
csd_lock_record(csd);
csd_unlock(csd);
local_irq_save(flags);
- func(info);
+ csd_do_func(func, info, NULL);
csd_lock_record(NULL);
local_irq_restore(flags);
return 0;
@@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
}
csd_lock_record(csd);
- func(info);
+ csd_do_func(func, info, csd);
csd_unlock(csd);
csd_lock_record(NULL);
} else {
@@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
csd_lock_record(csd);
csd_unlock(csd);
- func(info);
+ csd_do_func(func, info, csd);
csd_lock_record(NULL);
} else if (type == CSD_TYPE_IRQ_WORK) {
irq_work_single(csd);
@@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
/*
* Third; only CSD_TYPE_TTWU is left, issue those.
*/
- if (entry)
- sched_ttwu_pending(entry);
+ if (entry) {
+ csd = llist_entry(entry, typeof(*csd), node.llist);
+ csd_do_func(sched_ttwu_pending, entry, csd);
+ }
}
@@ -624,7 +637,7 @@ EXPORT_SYMBOL(smp_call_function_single);
/**
* smp_call_function_single_async() - Run an asynchronous function on a
- * specific CPU.
+ * specific CPU.
* @cpu: The CPU to run on.
* @csd: Pre-allocated and setup data structure
*
@@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
bool wait = scf_flags & SCF_WAIT;
- int nr_cpus = 0, nr_queued = 0;
+ int nr_cpus = 0;
bool run_remote = false;
bool run_local = false;
@@ -786,21 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
csd->node.src = smp_processor_id();
csd->node.dst = cpu;
#endif
+
+ trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
+
if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
nr_cpus++;
last_cpu = cpu;
}
- nr_queued++;
}
- /*
- * Trace each smp_function_call_*() as an IPI, actual IPIs
- * will be traced with func==generic_smp_call_function_single_ipi().
- */
- if (nr_queued)
- trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
-
/*
* Choose the most efficient way to send an IPI. Note that the
* number of CPUs might be zero due to concurrent changes to the
@@ -816,7 +824,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
unsigned long flags;
local_irq_save(flags);
- func(info);
+ csd_do_func(func, info, NULL);
local_irq_restore(flags);
}
On Thu, 04 May 2023 12:59:41 +0100
Valentin Schneider <[email protected]> wrote:
> +TRACE_EVENT(csd_queue_cpu,
> +
> + TP_PROTO(const unsigned int cpu,
Why the "const" ?
-- Steve
> + unsigned long callsite,
> + smp_call_func_t func,
> + call_single_data_t *csd),
> +
> + TP_ARGS(cpu, callsite, func, csd),
> +
> + TP_STRUCT__entry(
> + __field(unsigned int, cpu)
> + __field(void *, callsite)
> + __field(void *, func)
> + __field(void *, csd)
> + ),
> +
> + TP_fast_assign(
> + __entry->cpu = cpu;
> + __entry->callsite = (void *)callsite;
> + __entry->func = func;
> + __entry->csd = csd;
> + ),
> +
> + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
> + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
> +);
> +
On Thu, May 04, 2023 at 12:59:41PM +0100, Valentin Schneider wrote:
> With that said, I suppose this could still be helpful for e.g. osnoise to
> hook into and point the finger at which CPU/context sent the problematic
> IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
> times.
>
> One thing though is that trace_ipi_send_cpu*() is not used solely for
> CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
> split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().
>
> Something like so...
>
Yeah, looks about right I suppose... it generates more events, but given
we need the csd, that's inevitable.
On Thu, 2023-05-04 at 12:59 +0100, Valentin Schneider wrote:
> +Daniel
>
> On 03/05/23 16:59, Peter Zijlstra wrote:
> > On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
> > >
> > > IIUC the last commits add tracepoints that are collected in the
> > > requesting CPU, at the moment of scheduling the IPI, which are also useful in
> > > some scenarios.
> > >
> > > On my scenario, it could help a little, since it makes possible to filter what
> > > all other cpus are scheduling on the requested cpu. OTOH it could be also be
> > > misleading, as the requested cpu could be running something that was scheduled
> > > way before.
> > >
> > > The change I propose does exactly what my scenario need: track exactly which
> > > function was running at given time in the requested CPU. With this info, we can
> > > check which (if any) remotely requested function was running on given time
> > > window.
> >
> > I was thinking you could simply (graph)-trace
> > __flush_smp_call_function_queue() with a max_graph_depth or so (Steve
> > says that ought to work).
> >
> > But even that might be too specific, your use case sounds more like what
> > we have the irq-off latency tracer for, and that thing will immediately
> > tell you what functions were being ran.
> >
> > > (An unrelated thing I just thought: We could even use the commits you pointed
> > > together with my proposed change in order to measure how long does it take for a
> > > requested function to run / complete in the requested cpu)
> >
> > I don't think you could actually do that; the send tracepoints Valentin
> > added don't log the csd address, this means you cannot distinguish
> > two CSDs with the same function send from different CPUs.
> >
> > To do this you'd need to add the csd address to the the ipi_send
> > tracepoints and your own (possibly replacing info -- because I don't
> > think that's too useful).
> >
> > Valentin -- is any of this something you'd also find useful?
>
> Conceptually yeah, however:
>
> With isolcpus + NOHZ_FULL, *any* IPI sent to an isolated CPU is game over,
> you interrupt the userspace task and you eat the full meal that is
> NOHZ_FULL kernel entry. Pretty much any such IPI will show up when
> evaluating your setup with rteval/rlta/whatever, so having an event at
> IPI reception is redundant.
>
> IIUC Leonardo's looking at a setup with isolcpus but not necessarily
> NOHZ_FULL, so he wants to look at *which* IPIs (if any) are pushing the
> isolated task over its deadline/period. I would argue any IPI received
> there is at risk of doing that, so it's the same fight as with NOHZ_FULL.
>
> With that said, I suppose this could still be helpful for e.g. osnoise to
> hook into and point the finger at which CPU/context sent the problematic
> IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
> times.
>
> One thing though is that trace_ipi_send_cpu*() is not used solely for
> CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
> split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().
>
> Something like so...
>
> ---
> diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
> new file mode 100644
> index 0000000000000..8fc725a2b45b9
> --- /dev/null
> +++ b/include/trace/events/smp.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM smp
> +
> +#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SMP_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(csd_queue_cpu,
> +
> + TP_PROTO(const unsigned int cpu,
> + unsigned long callsite,
> + smp_call_func_t func,
> + call_single_data_t *csd),
> +
> + TP_ARGS(cpu, callsite, func, csd),
> +
> + TP_STRUCT__entry(
> + __field(unsigned int, cpu)
> + __field(void *, callsite)
> + __field(void *, func)
> + __field(void *, csd)
> + ),
> +
> + TP_fast_assign(
> + __entry->cpu = cpu;
> + __entry->callsite = (void *)callsite;
> + __entry->func = func;
> + __entry->csd = csd;
> + ),
> +
> + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
> + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
> +);
This is for the caller side, right?
> +
> +DECLARE_EVENT_CLASS(csd_function,
> +
> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> +
> + TP_ARGS(func, csd),
> +
> + TP_STRUCT__entry(
> + __field(void *, func)
> + __field(void *, csd)
> + ),
> +
> + TP_fast_assign(
> + __entry->func = func;
> + __entry->csd = csd;
> + ),
> +
> + TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
> +);
> +
> +DEFINE_EVENT(csd_function, csd_function_entry,
> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> + TP_ARGS(func, csd)
> +);
> +
> +DEFINE_EVENT(csd_function, csd_function_exit,
> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> + TP_ARGS(func, csd)
> +);
Oh, this is what event_class is for. Thanks for the example :)
> +
> +#endif /* _TRACE_SMP_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index ab3e5dad6cfe9..7d28db303e9bc 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -27,6 +27,9 @@
> #include <linux/jump_label.h>
>
> #include <trace/events/ipi.h>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/smp.h>
> +#undef CREATE_TRACE_POINTS
>
> #include "smpboot.h"
> #include "sched/smp.h"
> @@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
> arch_send_call_function_ipi_mask(mask);
> }
>
> +static __always_inline void
> +csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
> +{
> + trace_csd_function_entry(func, csd);
> + func(info);
> + trace_csd_function_exit(func, csd);
> +}
> +
Good one, a helper to avoid calling those traces everywhere.
Honest question:
Since info == csd->info and func == csd->func, we could just pass csd, right?
I suppose the suggestion on the 3-argument version is to use the values already
fetched from memory instead of fetching them again. Is that correct?
> #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
>
> static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
> @@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> * executes migration_cpu_stop() on the remote CPU).
> */
> - if (trace_ipi_send_cpu_enabled()) {
> + if (trace_csd_queue_cpu_enabled()) {
> call_single_data_t *csd;
> smp_call_func_t func;
>
> @@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> sched_ttwu_pending : csd->func;
>
> - trace_ipi_send_cpu(cpu, _RET_IP_, func);
> + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> }
>
> /*
> @@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
> csd_lock_record(csd);
> csd_unlock(csd);
> local_irq_save(flags);
> - func(info);
> + csd_do_func(func, info, NULL);
> csd_lock_record(NULL);
> local_irq_restore(flags);
> return 0;
> @@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> }
>
> csd_lock_record(csd);
> - func(info);
> + csd_do_func(func, info, csd);
> csd_unlock(csd);
> csd_lock_record(NULL);
> } else {
> @@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
>
> csd_lock_record(csd);
> csd_unlock(csd);
> - func(info);
> + csd_do_func(func, info, csd);
> csd_lock_record(NULL);
> } else if (type == CSD_TYPE_IRQ_WORK) {
> irq_work_single(csd);
> @@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> /*
> * Third; only CSD_TYPE_TTWU is left, issue those.
> */
> - if (entry)
> - sched_ttwu_pending(entry);
> + if (entry) {
> + csd = llist_entry(entry, typeof(*csd), node.llist);
> + csd_do_func(sched_ttwu_pending, entry, csd);
> + }
> }
>
>
> @@ -624,7 +637,7 @@ EXPORT_SYMBOL(smp_call_function_single);
>
> /**
> * smp_call_function_single_async() - Run an asynchronous function on a
> - * specific CPU.
> + * specific CPU.
> * @cpu: The CPU to run on.
> * @csd: Pre-allocated and setup data structure
> *
> @@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> int cpu, last_cpu, this_cpu = smp_processor_id();
> struct call_function_data *cfd;
> bool wait = scf_flags & SCF_WAIT;
> - int nr_cpus = 0, nr_queued = 0;
> + int nr_cpus = 0;
> bool run_remote = false;
> bool run_local = false;
>
> @@ -786,21 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> csd->node.src = smp_processor_id();
> csd->node.dst = cpu;
> #endif
> +
> + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> +
> if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
> __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> nr_cpus++;
> last_cpu = cpu;
> }
> - nr_queued++;
> }
>
> - /*
> - * Trace each smp_function_call_*() as an IPI, actual IPIs
> - * will be traced with func==generic_smp_call_function_single_ipi().
> - */
> - if (nr_queued)
> - trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
> -
> /*
> * Choose the most efficient way to send an IPI. Note that the
> * number of CPUs might be zero due to concurrent changes to the
> @@ -816,7 +824,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> unsigned long flags;
>
> local_irq_save(flags);
> - func(info);
> + csd_do_func(func, info, NULL);
> local_irq_restore(flags);
> }
>
>
I will rebase my patch on top of tip/smp/core and apply the suggested changes.
A v2 will be sent shortly.
Thank you all for reviewing!
Leo
On Wed, May 10, 2023 at 5:27 PM Leonardo Brás <[email protected]> wrote:
>
> On Thu, 2023-05-04 at 12:59 +0100, Valentin Schneider wrote:
> > +Daniel
> >
> > On 03/05/23 16:59, Peter Zijlstra wrote:
> > > On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
> > > >
> > > > IIUC the last commits add tracepoints that are collected in the
> > > > requesting CPU, at the moment of scheduling the IPI, which are also useful in
> > > > some scenarios.
> > > >
> > > > On my scenario, it could help a little, since it makes possible to filter what
> > > > all other cpus are scheduling on the requested cpu. OTOH it could be also be
> > > > misleading, as the requested cpu could be running something that was scheduled
> > > > way before.
> > > >
> > > > The change I propose does exactly what my scenario need: track exactly which
> > > > function was running at given time in the requested CPU. With this info, we can
> > > > check which (if any) remotely requested function was running on given time
> > > > window.
> > >
> > > I was thinking you could simply (graph)-trace
> > > __flush_smp_call_function_queue() with a max_graph_depth or so (Steve
> > > says that ought to work).
> > >
> > > But even that might be too specific, your use case sounds more like what
> > > we have the irq-off latency tracer for, and that thing will immediately
> > > tell you what functions were being ran.
> > >
> > > > (An unrelated thing I just thought: We could even use the commits you pointed
> > > > together with my proposed change in order to measure how long does it take for a
> > > > requested function to run / complete in the requested cpu)
> > >
> > > I don't think you could actually do that; the send tracepoints Valentin
> > > added don't log the csd address, this means you cannot distinguish
> > > two CSDs with the same function send from different CPUs.
> > >
> > > To do this you'd need to add the csd address to the the ipi_send
> > > tracepoints and your own (possibly replacing info -- because I don't
> > > think that's too useful).
> > >
> > > Valentin -- is any of this something you'd also find useful?
> >
> > Conceptually yeah, however:
> >
> > With isolcpus + NOHZ_FULL, *any* IPI sent to an isolated CPU is game over,
> > you interrupt the userspace task and you eat the full meal that is
> > NOHZ_FULL kernel entry. Pretty much any such IPI will show up when
> > evaluating your setup with rteval/rlta/whatever, so having an event at
> > IPI reception is redundant.
> >
> > IIUC Leonardo's looking at a setup with isolcpus but not necessarily
> > NOHZ_FULL, so he wants to look at *which* IPIs (if any) are pushing the
> > isolated task over its deadline/period. I would argue any IPI received
> > there is at risk of doing that, so it's the same fight as with NOHZ_FULL.
> >
> > With that said, I suppose this could still be helpful for e.g. osnoise to
> > hook into and point the finger at which CPU/context sent the problematic
> > IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
> > times.
> >
> > One thing though is that trace_ipi_send_cpu*() is not used solely for
> > CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
> > split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().
> >
> > Something like so...
> >
> > ---
> > diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
> > new file mode 100644
> > index 0000000000000..8fc725a2b45b9
> > --- /dev/null
> > +++ b/include/trace/events/smp.h
> > @@ -0,0 +1,69 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#undef TRACE_SYSTEM
> > +#define TRACE_SYSTEM smp
> > +
> > +#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > +#define _TRACE_SMP_H
> > +
> > +#include <linux/tracepoint.h>
> > +
> > +TRACE_EVENT(csd_queue_cpu,
> > +
> > + TP_PROTO(const unsigned int cpu,
> > + unsigned long callsite,
> > + smp_call_func_t func,
> > + call_single_data_t *csd),
> > +
> > + TP_ARGS(cpu, callsite, func, csd),
> > +
> > + TP_STRUCT__entry(
> > + __field(unsigned int, cpu)
> > + __field(void *, callsite)
> > + __field(void *, func)
> > + __field(void *, csd)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->cpu = cpu;
> > + __entry->callsite = (void *)callsite;
> > + __entry->func = func;
> > + __entry->csd = csd;
> > + ),
> > +
> > + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
> > + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
> > +);
>
> This is for the caller side, right?
>
> > +
> > +DECLARE_EVENT_CLASS(csd_function,
> > +
> > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > +
> > + TP_ARGS(func, csd),
> > +
> > + TP_STRUCT__entry(
> > + __field(void *, func)
> > + __field(void *, csd)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->func = func;
> > + __entry->csd = csd;
> > + ),
> > +
> > + TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
> > +);
> > +
> > +DEFINE_EVENT(csd_function, csd_function_entry,
> > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > + TP_ARGS(func, csd)
> > +);
> > +
> > +DEFINE_EVENT(csd_function, csd_function_exit,
> > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > + TP_ARGS(func, csd)
> > +);
>
> Oh, this is what event_class is for. Thanks for the example :)
>
> > +
> > +#endif /* _TRACE_SMP_H */
> > +
> > +/* This part must be outside protection */
> > +#include <trace/define_trace.h>
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index ab3e5dad6cfe9..7d28db303e9bc 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -27,6 +27,9 @@
> > #include <linux/jump_label.h>
> >
> > #include <trace/events/ipi.h>
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/smp.h>
> > +#undef CREATE_TRACE_POINTS
> >
> > #include "smpboot.h"
> > #include "sched/smp.h"
> > @@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
> > arch_send_call_function_ipi_mask(mask);
> > }
> >
> > +static __always_inline void
> > +csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
> > +{
> > + trace_csd_function_entry(func, csd);
> > + func(info);
> > + trace_csd_function_exit(func, csd);
> > +}
> > +
>
> Good one, a helper to avoid calling those traces everywhere.
>
> Honest question:
> Since info == csd->info and func == csd->func, we could just pass csd, right?
> I suppose the suggestion on the 3-argument version is to use the values already
> fetched from memory instead of fetching them again. Is that correct?
>
> > #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> >
> > static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
> > @@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> > * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> > * executes migration_cpu_stop() on the remote CPU).
> > */
> > - if (trace_ipi_send_cpu_enabled()) {
> > + if (trace_csd_queue_cpu_enabled()) {
> > call_single_data_t *csd;
> > smp_call_func_t func;
> >
> > @@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> > func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> > sched_ttwu_pending : csd->func;
> >
> > - trace_ipi_send_cpu(cpu, _RET_IP_, func);
> > + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> > }
> >
> > /*
> > @@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
> > csd_lock_record(csd);
> > csd_unlock(csd);
> > local_irq_save(flags);
> > - func(info);
> > + csd_do_func(func, info, NULL);
> > csd_lock_record(NULL);
> > local_irq_restore(flags);
> > return 0;
> > @@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> > }
> >
> > csd_lock_record(csd);
> > - func(info);
> > + csd_do_func(func, info, csd);
> > csd_unlock(csd);
> > csd_lock_record(NULL);
> > } else {
> > @@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> >
> > csd_lock_record(csd);
> > csd_unlock(csd);
> > - func(info);
> > + csd_do_func(func, info, csd);
> > csd_lock_record(NULL);
> > } else if (type == CSD_TYPE_IRQ_WORK) {
> > irq_work_single(csd);
> > @@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> > /*
> > * Third; only CSD_TYPE_TTWU is left, issue those.
> > */
> > - if (entry)
> > - sched_ttwu_pending(entry);
> > + if (entry) {
> > + csd = llist_entry(entry, typeof(*csd), node.llist);
> > + csd_do_func(sched_ttwu_pending, entry, csd);
> > + }
> > }
> >
> >
> > @@ -624,7 +637,7 @@ EXPORT_SYMBOL(smp_call_function_single);
> >
> > /**
> > * smp_call_function_single_async() - Run an asynchronous function on a
> > - * specific CPU.
> > + * specific CPU.
> > * @cpu: The CPU to run on.
> > * @csd: Pre-allocated and setup data structure
> > *
> > @@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > int cpu, last_cpu, this_cpu = smp_processor_id();
> > struct call_function_data *cfd;
> > bool wait = scf_flags & SCF_WAIT;
> > - int nr_cpus = 0, nr_queued = 0;
> > + int nr_cpus = 0;
> > bool run_remote = false;
> > bool run_local = false;
> >
> > @@ -786,21 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > csd->node.src = smp_processor_id();
> > csd->node.dst = cpu;
> > #endif
> > +
> > + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> > +
> > if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
> > __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> > nr_cpus++;
> > last_cpu = cpu;
> > }
> > - nr_queued++;
> > }
> >
> > - /*
> > - * Trace each smp_function_call_*() as an IPI, actual IPIs
> > - * will be traced with func==generic_smp_call_function_single_ipi().
> > - */
> > - if (nr_queued)
> > - trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
> > -
> > /*
> > * Choose the most efficient way to send an IPI. Note that the
> > * number of CPUs might be zero due to concurrent changes to the
> > @@ -816,7 +824,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > unsigned long flags;
> >
> > local_irq_save(flags);
> > - func(info);
> > + csd_do_func(func, info, NULL);
> > local_irq_restore(flags);
> > }
> >
> >
>
> I will rebase my patch on top of tip/smp/core and apply the suggested changes.
> A v2 will be send shortly.
RFC-v2 sent:
https://lore.kernel.org/lkml/[email protected]/
>
> Thank you all for reviewing!
> Leo
>
>
On Wed, May 10, 2023 at 7:18 PM Leonardo Bras Soares Passos
<[email protected]> wrote:
>
> On Wed, May 10, 2023 at 5:27 PM Leonardo Brás <[email protected]> wrote:
> >
> > On Thu, 2023-05-04 at 12:59 +0100, Valentin Schneider wrote:
> > > +Daniel
> > >
> > > On 03/05/23 16:59, Peter Zijlstra wrote:
> > > > On Wed, Apr 19, 2023 at 12:45:08AM -0300, Leonardo Brás wrote:
> > > > >
> > > > > IIUC the last commits add tracepoints that are collected in the
> > > > > requesting CPU, at the moment of scheduling the IPI, which are also useful in
> > > > > some scenarios.
> > > > >
> > > > > On my scenario, it could help a little, since it makes possible to filter what
> > > > > all other cpus are scheduling on the requested cpu. OTOH it could be also be
> > > > > misleading, as the requested cpu could be running something that was scheduled
> > > > > way before.
> > > > >
> > > > > The change I propose does exactly what my scenario need: track exactly which
> > > > > function was running at given time in the requested CPU. With this info, we can
> > > > > check which (if any) remotely requested function was running on given time
> > > > > window.
> > > >
> > > > I was thinking you could simply (graph)-trace
> > > > __flush_smp_call_function_queue() with a max_graph_depth or so (Steve
> > > > says that ought to work).
> > > >
> > > > But even that might be too specific, your use case sounds more like what
> > > > we have the irq-off latency tracer for, and that thing will immediately
> > > > tell you what functions were being ran.
> > > >
> > > > > (An unrelated thing I just thought: We could even use the commits you pointed
> > > > > together with my proposed change in order to measure how long does it take for a
> > > > > requested function to run / complete in the requested cpu)
> > > >
> > > > I don't think you could actually do that; the send tracepoints Valentin
> > > > added don't log the csd address, this means you cannot distinguish
> > > > two CSDs with the same function send from different CPUs.
> > > >
> > > > To do this you'd need to add the csd address to the the ipi_send
> > > > tracepoints and your own (possibly replacing info -- because I don't
> > > > think that's too useful).
> > > >
> > > > Valentin -- is any of this something you'd also find useful?
> > >
> > > Conceptually yeah, however:
> > >
> > > With isolcpus + NOHZ_FULL, *any* IPI sent to an isolated CPU is game over,
> > > you interrupt the userspace task and you eat the full meal that is
> > > NOHZ_FULL kernel entry. Pretty much any such IPI will show up when
> > > evaluating your setup with rteval/rlta/whatever, so having an event at
> > > IPI reception is redundant.
> > >
> > > IIUC Leonardo's looking at a setup with isolcpus but not necessarily
> > > NOHZ_FULL, so he wants to look at *which* IPIs (if any) are pushing the
> > > isolated task over its deadline/period. I would argue any IPI received
> > > there is at risk of doing that, so it's the same fight as with NOHZ_FULL.
> > >
> > > With that said, I suppose this could still be helpful for e.g. osnoise to
> > > hook into and point the finger at which CPU/context sent the problematic
> > > IPI. Or more generally, as Leonardo suggested, to measure CSD IPI delivery
> > > times.
> > >
> > > One thing though is that trace_ipi_send_cpu*() is not used solely for
> > > CSD's, cf. irq_work_raise() or smp_send_reschedule(). We might want to
> > > split that into e.g. trace_csd_queue_cpu*() + trace_ipi_send*().
> > >
> > > Something like so...
> > >
> > > ---
> > > diff --git a/include/trace/events/smp.h b/include/trace/events/smp.h
> > > new file mode 100644
> > > index 0000000000000..8fc725a2b45b9
> > > --- /dev/null
> > > +++ b/include/trace/events/smp.h
> > > @@ -0,0 +1,69 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#undef TRACE_SYSTEM
> > > +#define TRACE_SYSTEM smp
> > > +
> > > +#if !defined(_TRACE_SMP_H) || defined(TRACE_HEADER_MULTI_READ)
> > > +#define _TRACE_SMP_H
> > > +
> > > +#include <linux/tracepoint.h>
> > > +
> > > +TRACE_EVENT(csd_queue_cpu,
> > > +
> > > + TP_PROTO(const unsigned int cpu,
> > > + unsigned long callsite,
> > > + smp_call_func_t func,
> > > + call_single_data_t *csd),
> > > +
> > > + TP_ARGS(cpu, callsite, func, csd),
> > > +
> > > + TP_STRUCT__entry(
> > > + __field(unsigned int, cpu)
> > > + __field(void *, callsite)
> > > + __field(void *, func)
> > > + __field(void *, csd)
> > > + ),
> > > +
> > > + TP_fast_assign(
> > > + __entry->cpu = cpu;
> > > + __entry->callsite = (void *)callsite;
> > > + __entry->func = func;
> > > + __entry->csd = csd;
> > > + ),
> > > +
> > > + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
> > > + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
> > > +);
> >
> > This is for the caller side, right?
> >
> > > +
> > > +DECLARE_EVENT_CLASS(csd_function,
> > > +
> > > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > > +
> > > + TP_ARGS(func, csd),
> > > +
> > > + TP_STRUCT__entry(
> > > + __field(void *, func)
> > > + __field(void *, csd)
> > > + ),
> > > +
> > > + TP_fast_assign(
> > > + __entry->func = func;
> > > + __entry->csd = csd;
> > > + ),
> > > +
> > > + TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
> > > +);
> > > +
> > > +DEFINE_EVENT(csd_function, csd_function_entry,
> > > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > > + TP_ARGS(func, csd)
> > > +);
> > > +
> > > +DEFINE_EVENT(csd_function, csd_function_exit,
> > > + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
> > > + TP_ARGS(func, csd)
> > > +);
> >
> > Oh, this is what event_class is for. Thanks for the example :)
> >
> > > +
> > > +#endif /* _TRACE_SMP_H */
> > > +
> > > +/* This part must be outside protection */
> > > +#include <trace/define_trace.h>
> > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > index ab3e5dad6cfe9..7d28db303e9bc 100644
> > > --- a/kernel/smp.c
> > > +++ b/kernel/smp.c
> > > @@ -27,6 +27,9 @@
> > > #include <linux/jump_label.h>
> > >
> > > #include <trace/events/ipi.h>
> > > +#define CREATE_TRACE_POINTS
> > > +#include <trace/events/smp.h>
> > > +#undef CREATE_TRACE_POINTS
> > >
> > > #include "smpboot.h"
> > > #include "sched/smp.h"
> > > @@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
> > > arch_send_call_function_ipi_mask(mask);
> > > }
> > >
> > > +static __always_inline void
> > > +csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
> > > +{
> > > + trace_csd_function_entry(func, csd);
> > > + func(info);
> > > + trace_csd_function_exit(func, csd);
> > > +}
> > > +
> >
> > Good one, a helper to avoid calling those traces everywhere.
> >
> > Honest question:
> > Since info == csd->info and func == csd->func, we could just pass csd, right?
> > I suppose the suggestion on the 3-argument version is to use the values already
> > fetched from memory instead of fetching them again. Is that correct?
> >
> > > #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> > >
> > > static DEFINE_STATIC_KEY_MAYBE(CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT, csdlock_debug_enabled);
> > > @@ -329,7 +340,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> > > * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> > > * executes migration_cpu_stop() on the remote CPU).
> > > */
> > > - if (trace_ipi_send_cpu_enabled()) {
> > > + if (trace_csd_queue_cpu_enabled()) {
> > > call_single_data_t *csd;
> > > smp_call_func_t func;
> > >
> > > @@ -337,7 +348,7 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> > > func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> > > sched_ttwu_pending : csd->func;
> > >
> > > - trace_ipi_send_cpu(cpu, _RET_IP_, func);
> > > + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> > > }
> > >
> > > /*
> > > @@ -375,7 +386,7 @@ static int generic_exec_single(int cpu, struct __call_single_data *csd)
> > > csd_lock_record(csd);
> > > csd_unlock(csd);
> > > local_irq_save(flags);
> > > - func(info);
> > > + csd_do_func(func, info, NULL);
> > > csd_lock_record(NULL);
> > > local_irq_restore(flags);
> > > return 0;
> > > @@ -477,7 +488,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> > > }
> > >
> > > csd_lock_record(csd);
> > > - func(info);
> > > + csd_do_func(func, info, csd);
> > > csd_unlock(csd);
> > > csd_lock_record(NULL);
> > > } else {
> > > @@ -508,7 +519,7 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> > >
> > > csd_lock_record(csd);
> > > csd_unlock(csd);
> > > - func(info);
> > > + csd_do_func(func, info, csd);
> > > csd_lock_record(NULL);
> > > } else if (type == CSD_TYPE_IRQ_WORK) {
> > > irq_work_single(csd);
> > > @@ -522,8 +533,10 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
> > > /*
> > > * Third; only CSD_TYPE_TTWU is left, issue those.
> > > */
> > > - if (entry)
> > > - sched_ttwu_pending(entry);
> > > + if (entry) {
> > > + csd = llist_entry(entry, typeof(*csd), node.llist);
> > > + csd_do_func(sched_ttwu_pending, entry, csd);
> > > + }
> > > }
> > >
> > >
> > > @@ -624,7 +637,7 @@ EXPORT_SYMBOL(smp_call_function_single);
> > >
> > > /**
> > > * smp_call_function_single_async() - Run an asynchronous function on a
> > > - * specific CPU.
> > > + * specific CPU.
> > > * @cpu: The CPU to run on.
> > > * @csd: Pre-allocated and setup data structure
> > > *
> > > @@ -728,7 +741,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > > int cpu, last_cpu, this_cpu = smp_processor_id();
> > > struct call_function_data *cfd;
> > > bool wait = scf_flags & SCF_WAIT;
> > > - int nr_cpus = 0, nr_queued = 0;
> > > + int nr_cpus = 0;
> > > bool run_remote = false;
> > > bool run_local = false;
> > >
> > > @@ -786,21 +799,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > > csd->node.src = smp_processor_id();
> > > csd->node.dst = cpu;
> > > #endif
> > > +
> > > + trace_csd_queue_cpu(cpu, _RET_IP_, func, csd);
> > > +
> > > if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
> > > __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> > > nr_cpus++;
> > > last_cpu = cpu;
> > > }
> > > - nr_queued++;
> > > }
> > >
> > > - /*
> > > - * Trace each smp_function_call_*() as an IPI, actual IPIs
> > > - * will be traced with func==generic_smp_call_function_single_ipi().
> > > - */
> > > - if (nr_queued)
> > > - trace_ipi_send_cpumask(cfd->cpumask, _RET_IP_, func);
> > > -
> > > /*
> > > * Choose the most efficient way to send an IPI. Note that the
> > > * number of CPUs might be zero due to concurrent changes to the
> > > @@ -816,7 +824,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> > > unsigned long flags;
> > >
> > > local_irq_save(flags);
> > > - func(info);
> > > + csd_do_func(func, info, NULL);
> > > local_irq_restore(flags);
> > > }
> > >
> > >
> >
> > I will rebase my patch on top of tip/smp/core and apply the suggested changes.
> > A v2 will be sent shortly.
>
> RFC-v2 sent:
> https://lore.kernel.org/lkml/[email protected]/
>
Argh, I accidentally sent an unfinished patch, sorry about that.
Please disregard the v2 above and review the v3 instead:
https://lore.kernel.org/lkml/[email protected]/
> >
> > Thank you all for reviewing!
> > Leo
> >
> >
On 10/05/23 17:27, Leonardo Brás wrote:
> On Thu, 2023-05-04 at 12:59 +0100, Valentin Schneider wrote:
>> +TRACE_EVENT(csd_queue_cpu,
>> +
>> + TP_PROTO(const unsigned int cpu,
>> + unsigned long callsite,
>> + smp_call_func_t func,
>> + call_single_data_t *csd),
>> +
>> + TP_ARGS(cpu, callsite, func, csd),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned int, cpu)
>> + __field(void *, callsite)
>> + __field(void *, func)
>> + __field(void *, csd)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->cpu = cpu;
>> + __entry->callsite = (void *)callsite;
>> + __entry->func = func;
>> + __entry->csd = csd;
>> + ),
>> +
>> + TP_printk("cpu=%u callsite=%pS func=%pS csd=%p",
>> + __entry->cpu, __entry->callsite, __entry->func, __entry->csd)
>> +);
>
> This is for the caller side, right?
>
Yep, see usage lower down.
>> +
>> +DECLARE_EVENT_CLASS(csd_function,
>> +
>> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
>> +
>> + TP_ARGS(func, csd),
>> +
>> + TP_STRUCT__entry(
>> + __field(void *, func)
>> + __field(void *, csd)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->func = func;
>> + __entry->csd = csd;
>> + ),
>> +
>> + TP_printk("func=%pS csd=%p", __entry->func, __entry->csd)
>> +);
>> +
>> +DEFINE_EVENT(csd_function, csd_function_entry,
>> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
>> + TP_ARGS(func, csd)
>> +);
>> +
>> +DEFINE_EVENT(csd_function, csd_function_exit,
>> + TP_PROTO(smp_call_func_t func, call_single_data_t *csd),
>> + TP_ARGS(func, csd)
>> +);
>
> Oh, this is what event_class is for. Thanks for the example :)
>
>> +
>> +#endif /* _TRACE_SMP_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index ab3e5dad6cfe9..7d28db303e9bc 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -27,6 +27,9 @@
>> #include <linux/jump_label.h>
>>
>> #include <trace/events/ipi.h>
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/smp.h>
>> +#undef CREATE_TRACE_POINTS
>>
>> #include "smpboot.h"
>> #include "sched/smp.h"
>> @@ -121,6 +124,14 @@ send_call_function_ipi_mask(struct cpumask *mask)
>> arch_send_call_function_ipi_mask(mask);
>> }
>>
>> +static __always_inline void
>> +csd_do_func(smp_call_func_t func, void *info, call_single_data_t *csd)
>> +{
>> + trace_csd_function_entry(func, csd);
>> + func(info);
>> + trace_csd_function_exit(func, csd);
>> +}
>> +
>
> Good one, a helper to avoid calling those traces everywhere.
>
> Honest question:
> Since info == csd->info and func == csd->func, we could just pass csd, right?
> I suppose the suggestion on the 3-argument version is to use the values already
> fetched from memory instead of fetching them again. Is that correct?
>
There's also the special case of CSD_TYPE_TTWU, where there is no csd->func;
instead we have an implicit func mapping to sched_ttwu_pending. I think
it's preferable to feed the right values to the TP directly rather than
duplicate the "decoding" logic against the *csd passed as TP argument.
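For illustration only (this is not in the patch), a csd-only helper would
have to duplicate exactly that decoding, and would still not recover the
TTWU info (the whole llist is passed there) nor cover the call sites that
pass csd == NULL. A rough sketch, using only symbols already visible in
kernel/smp.c:

/*
 * Hypothetical csd-only variant: re-derives func from the csd type,
 * duplicating what __flush_smp_call_function_queue() already knows, and
 * still gets info wrong for CSD_TYPE_TTWU, where the argument is the
 * llist head rather than csd->info.
 */
static __always_inline void csd_do_func_csd_only(call_single_data_t *csd)
{
	smp_call_func_t func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
			       sched_ttwu_pending : csd->func;

	trace_csd_function_entry(func, csd);
	func(csd->info);	/* wrong for CSD_TYPE_TTWU */
	trace_csd_function_exit(func, csd);
}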
On Thu, May 11, 2023 at 5:13 AM Valentin Schneider <[email protected]> wrote:
>
> [...]
>
> There's also the special case of CSD_TYPE_TTWU, where there is no csd->func;
> instead we have an implicit func mapping to sched_ttwu_pending. I think
> it's preferable to feed the right values to the TP directly rather than
> duplicate the "decoding" logic against the *csd passed as TP argument.
>
Quite interesting, thank you for sharing!
I sent a v3, which got a warning from the kernel test robot. I have
already fixed the warning, so please provide feedback on the rest of
the patch.
About the warning:
it is an alignment mismatch between the 'struct __call_single_data'
used by generic_exec_single() and the 'call_single_data_t' expected by
csd_do_func(): the former is 8-byte aligned, while the latter is
32-byte aligned according to the typedef.
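For reference, the larger alignment comes from the typedef in
include/linux/smp.h, which (roughly, abridged) aligns the object to its
own size:

/*
 * On 64-bit this makes a csd 32 bytes in size and 32-byte aligned, while
 * a bare struct __call_single_data only gets the natural 8-byte
 * (pointer) alignment.
 */
typedef struct __call_single_data call_single_data_t
	__aligned(sizeof(struct __call_single_data));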
My first idea was to convert my patches' parameters from
call_single_data_t to 'struct __call_single_data', but then I found
out that the plain struct's weaker alignment allows a csd to be split
across 2 cachelines, which is usually bad: an 8-byte-aligned 32-byte
object can start at, say, offset 48 and straddle a 64-byte cacheline
boundary, whereas the 32-byte-aligned typedef always keeps it within
a single line.
Then I decided to send a patchset [1] fixing generic_exec_single() and
its callers. If it's accepted, the warning will go away, and v3 will
merge cleanly.
If not, I will send a v4 changing the parameters.
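For clarity, such a v4 would presumably look something like this
(hypothetical sketch; the csd_function tracepoint protos would need the
same type change):

/*
 * Hypothetical v4 shape: accept the plain (8-byte aligned) struct so
 * generic_exec_single() can pass its csd without an alignment mismatch.
 */
static __always_inline void
csd_do_func(smp_call_func_t func, void *info, struct __call_single_data *csd)
{
	trace_csd_function_entry(func, csd);
	func(info);
	trace_csd_function_exit(func, csd);
}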
Thanks for reviewing,
Leo
1. https://lore.kernel.org/lkml/[email protected]