2021-02-18 22:26:10

by Michael Jeanson

[permalink] [raw]
Subject: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

Formerly known as “Sleepable tracepoints”.

When invoked from system call enter/exit instrumentation, accessing
user-space data is a common use-case for tracers. However, tracepoints
currently disable preemption around iteration on the registered
tracepoint probes and invocation of the probe callbacks, which prevents
tracers from handling page faults.

Extend the tracepoint and trace event APIs to allow specific tracer
probes to take page faults. Adapt ftrace, perf, and ebpf to allow being
called from sleepable context, and convert the system call enter/exit
instrumentation to faultable tracepoints.
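
As an illustration of the intended usage, a probe attached to a faultable
syscall tracepoint could access user memory directly, roughly as follows
(a sketch only, with hypothetical probe and argument handling, not part of
this series):

#include <linux/ptrace.h>
#include <linux/uaccess.h>

/* Sketch: hypothetical probe reading a user-space string on sys_enter. */
static void probe_sys_enter(void *data, struct pt_regs *regs, long id)
{
	char buf[64];
	/* Assume the first syscall argument is a user string (x86-64 shown;
	 * real code would go through syscall_get_arguments()). */
	const char __user *ustr = (const char __user *)regs->di;
	long len;

	might_fault();	/* servicing the fault may sleep and take mmap_sem */
	len = strncpy_from_user(buf, ustr, sizeof(buf) - 1);
	if (len < 0)
		return;	/* unresolvable fault, e.g. bad pointer */
	buf[len] = '\0';
	/* hand 'buf' over to the tracer's buffering code here */
}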

This series only implements the tracepoint infrastructure required to
allow tracers to handle page faults. Modifying each tracer to handle
those page faults would be a next step after we all agree on this piece
of instrumentation infrastructure.

This patchset is based on v5.10.15.

Changes since v1 RFC:

- Rename "sleepable tracepoints" to "faultable tracepoints", MAYSLEEP to
MAYFAULT, and use might_fault() rather than might_sleep(), to properly
convey that the tracepoints are meant to be able to take a page fault,
which requires being able to sleep *and* to hold the mmap_sem.

Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: [email protected]

Mathieu Desnoyers (1):
tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints

Michael Jeanson (5):
tracing: introduce faultable tracepoints (v2)
tracing: ftrace: add support for faultable tracepoints
tracing: bpf-trace: add support for faultable tracepoints
tracing: perf: add support for faultable tracepoints
tracing: convert sys_enter/exit to faultable tracepoints

include/linux/tracepoint-defs.h | 11 ++++
include/linux/tracepoint.h | 111 +++++++++++++++++++++-----------
include/trace/bpf_probe.h | 23 ++++++-
include/trace/define_trace.h | 8 +++
include/trace/events/syscalls.h | 4 +-
include/trace/perf.h | 26 ++++++--
include/trace/trace_events.h | 79 +++++++++++++++++++++--
init/Kconfig | 1 +
kernel/trace/bpf_trace.c | 5 +-
kernel/trace/trace_events.c | 15 ++++-
kernel/trace/trace_syscalls.c | 84 ++++++++++++++++--------
kernel/tracepoint.c | 104 ++++++++++++++++++++++++------
12 files changed, 373 insertions(+), 98 deletions(-)

--
2.25.1


2021-02-18 22:27:45

by Michael Jeanson

[permalink] [raw]
Subject: [RFC PATCH 6/6] tracing: use Tasks Trace RCU instead of SRCU for rcuidle tracepoints

From: Mathieu Desnoyers <[email protected]>

Similarly to SRCU, Tasks Trace RCU can be used for rcuidle tracepoints.
It has the advantage of providing a faster RCU read-side than SRCU. Like
SRCU, Tasks Trace RCU grace periods are ready after core_initcall.

Now that Tasks Trace RCU is used for faultable tracepoints, using it for
rcuidle tracepoints is an overall simplification.

Some distinctions between SRCU and Tasks Trace RCU:

- Tasks Trace RCU can be used from NMI context, which was not possible
with SRCU,
- Tree SRCU has more scalable grace periods than Tasks Trace RCU, but it
should not matter for tracing use-cases,
- Tasks Trace RCU has slower grace periods than SRCU (similar to those
of regular RCU in upcoming kernels, and to those of Tasks RCU in current
kernels). This should also be OK for tracing,
- SRCU readers can be used in places where Tasks Trace RCU readers cannot,
but these places are also all places where tracing is prohibited.
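
To make the read-side comparison concrete, the two APIs look roughly like
this (illustration only, not part of this patch):

	/* SRCU read-side: needs an srcu_struct and an index token. */
	int idx = srcu_read_lock(&tracepoint_srcu);
	/* ... dereference the probe array ... */
	srcu_read_unlock(&tracepoint_srcu, idx);

	/* Tasks Trace RCU read-side: no domain, no token, usable from NMI. */
	rcu_read_lock_trace();
	/* ... dereference the probe array ... */
	rcu_read_unlock_trace();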

Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Michael Jeanson <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
Cc: Joel Fernandes <[email protected]>
---
include/linux/tracepoint.h | 34 ++++++++--------------------------
kernel/tracepoint.c | 25 +++++++++----------------
2 files changed, 17 insertions(+), 42 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 04079cbd2015..c22a87c34a22 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -13,7 +13,6 @@
*/

#include <linux/smp.h>
-#include <linux/srcu.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/cpumask.h>
@@ -34,8 +33,6 @@ struct trace_eval_map {

#define TRACEPOINT_DEFAULT_PRIO 10

-extern struct srcu_struct tracepoint_srcu;
-
extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
extern int
@@ -87,7 +84,6 @@ int unregister_tracepoint_module_notifier(struct notifier_block *nb)
static inline void tracepoint_synchronize_unregister(void)
{
synchronize_rcu_tasks_trace();
- synchronize_srcu(&tracepoint_srcu);
synchronize_rcu();
}
#else
@@ -176,30 +172,19 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
#define __DO_TRACE(name, proto, args, cond, rcuidle, tp_flags) \
do { \
struct tracepoint_func *it_func_ptr; \
- int __maybe_unused __idx = 0; \
void *__data; \
bool mayfault = (tp_flags) & TRACEPOINT_MAYFAULT; \
+ bool tasks_trace_rcu = mayfault || (rcuidle); \
\
if (!(cond)) \
return; \
\
- /* srcu can't be used from NMI */ \
- WARN_ON_ONCE(rcuidle && in_nmi()); \
- \
- if (mayfault) { \
- rcu_read_lock_trace(); \
- } else { \
- /* keep srcu and sched-rcu usage consistent */ \
+ if (!mayfault) \
preempt_disable_notrace(); \
- } \
- /* \
- * For rcuidle callers, use srcu since sched-rcu \
- * doesn't work from the idle path. \
- */ \
- if (rcuidle) { \
- __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
+ if (tasks_trace_rcu) \
+ rcu_read_lock_trace(); \
+ if (rcuidle) \
rcu_irq_enter_irqson(); \
- } \
\
it_func_ptr = \
rcu_dereference_raw((&__tracepoint_##name)->funcs); \
@@ -209,14 +194,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
__DO_TRACE_CALL(name)(args); \
} \
\
- if (rcuidle) { \
+ if (rcuidle) \
rcu_irq_exit_irqson(); \
- srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
- } \
- \
- if (mayfault) \
+ if (tasks_trace_rcu) \
rcu_read_unlock_trace(); \
- else \
+ if (!mayfault) \
preempt_enable_notrace(); \
} while (0)

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 41fc9c6e17f6..efa49f22d435 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -18,9 +18,6 @@
extern tracepoint_ptr_t __start___tracepoints_ptrs[];
extern tracepoint_ptr_t __stop___tracepoints_ptrs[];

-DEFINE_SRCU(tracepoint_srcu);
-EXPORT_SYMBOL_GPL(tracepoint_srcu);
-
/* Set to 1 to enable tracepoint debug output */
static const int tracepoint_debug;

@@ -65,14 +62,9 @@ static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
kfree(container_of(head, struct tp_probes, rcu));
}

-static void srcu_free_old_probes(struct rcu_head *head)
-{
- call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
-}
-
static void rcu_free_old_probes(struct rcu_head *head)
{
- call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+ call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
}

static __init int release_early_probes(void)
@@ -90,7 +82,7 @@ static __init int release_early_probes(void)
return 0;
}

-/* SRCU and Tasks Trace RCU are initialized at core_initcall */
+/* Tasks Trace RCU is initialized at core_initcall */
postcore_initcall(release_early_probes);

static inline void release_probes(struct tracepoint_func *old)
@@ -100,9 +92,8 @@ static inline void release_probes(struct tracepoint_func *old)
struct tp_probes, probes[0]);

/*
- * We can't free probes if SRCU and Tasks Trace RCU are not
- * initialized yet. Postpone the freeing till after both are
- * initialized.
+ * We can't free probes if Tasks Trace RCU is not initialized yet.
+ * Postpone the freeing till after Tasks Trace RCU is initialized.
*/
if (unlikely(!ok_to_free_tracepoints)) {
tp_probes->rcu.next = early_probes;
@@ -111,9 +102,11 @@ static inline void release_probes(struct tracepoint_func *old)
}

/*
- * Tracepoint probes are protected by sched RCU, SRCU and
- * Tasks Trace RCU by chaining the callbacks we cover all three
- * cases and wait for all three grace periods.
+ * Tracepoint probes are protected by both sched RCU and
+ * Tasks Trace RCU, by calling the Tasks Trace RCU callback in
+ * the sched RCU callback we cover both cases. So let us chain
+ * the Tasks Trace RCU and sched RCU callbacks to wait for both
+ * grace periods.
*/
call_rcu(&tp_probes->rcu, rcu_free_old_probes);
}
--
2.25.1

2021-02-24 06:55:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

On Thu, 18 Feb 2021 17:21:19 -0500
Michael Jeanson <[email protected]> wrote:

> This series only implements the tracepoint infrastructure required to
> allow tracers to handle page faults. Modifying each tracer to handle
> those page faults would be a next step after we all agree on this piece
> of instrumentation infrastructure.

I started taking a quick look at this, and came up with the question: how
do you allow preemption when dealing with per-cpu buffers or storage to
record the data?

That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
this is the reason for the need to disable preemption. What's the solution
that LTTng is using for this? I know it has per-cpu buffers too, but does
it have some kind of "per task" buffer that is being used to extract the
data that can fault?
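
The pattern I have in mind is basically this (just a sketch, with made up
names):

	preempt_disable();
	/* preemption must stay off so we keep writing into the buffer of
	 * the CPU we are actually running on */
	cpu_buf = this_cpu_ptr(&trace_cpu_buffer);	/* hypothetical per-cpu var */
	entry = reserve_event(cpu_buf, len);		/* hypothetical */
	memcpy(entry, data, len);			/* must not fault here */
	commit_event(cpu_buf, entry);			/* hypothetical */
	preempt_enable();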

-- Steve

2021-02-24 17:06:06

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson [email protected] wrote:

> [ Adding Mathieu Desnoyers in CC ]
>
> On 2021-02-23 21 h 16, Steven Rostedt wrote:
>> On Thu, 18 Feb 2021 17:21:19 -0500
>> Michael Jeanson <[email protected]> wrote:
>>
>>> This series only implements the tracepoint infrastructure required to
>>> allow tracers to handle page faults. Modifying each tracer to handle
>>> those page faults would be a next step after we all agree on this piece
>>> of instrumentation infrastructure.
>>
>> I started taking a quick look at this, and came up with the question: how
>> do you allow preemption when dealing with per-cpu buffers or storage to
>> record the data?
>>
>> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
>> this is the reason for the need to disable preemption. What's the solution
>> that LTTng is using for this? I know it has a per cpu buffers too, but does
>> it have some kind of "per task" buffer that is being used to extract the
>> data that can fault?

As a prototype solution, what I've done currently is to copy the user-space
data into a kmalloc'd buffer in a preparation step before disabling preemption
and copying data over into the per-cpu buffers. It works, but I think we should
be able to do it without the needless copy.
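
In rough code, the prototype approach is (a simplified sketch, helper name
made up):

	/* Preparation step: may fault, preemption still enabled. */
	kbuf = kmalloc(len, GFP_KERNEL);
	if (!kbuf)
		return;
	if (copy_from_user(kbuf, ustr, len)) {
		kfree(kbuf);
		return;
	}

	/* Fast path: copy into the per-cpu ring buffer, preemption disabled. */
	preempt_disable();
	lttng_event_reserve_and_write(kbuf, len);	/* hypothetical helper */
	preempt_enable();

	kfree(kbuf);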

What I have in mind as an efficient solution (not implemented yet) for the LTTng
kernel tracer goes as follows:

#define COMMIT_LOCAL 0
#define COMMIT_REMOTE 1

- faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
  - probe code calculates the length which needs to be reserved to store the event
    (e.g. user strlen),

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
  - reserve_cpu = smp_processor_id()
  - reserve space in the ring buffer for reserve_cpu
    [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
      from any cpu until we commit. ]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

- copy data from user-space to the ring buffer "slot",

- preempt disable -> [ preemption/blocking/migration is not allowed from here ]
    commit_cpu = smp_processor_id()
    if (commit_cpu == reserve_cpu)
      use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
    else
      use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
- preempt enable -> [ preemption/blocking/migration is allowed from here ]

Given that lttng uses per-buffer/per-sub-buffer commit counters as simple free-running
accumulators, the trick here is to use two commit counters rather than a single one for
each sub-buffer. Whenever we need to read a commit count value, we always sum the
LOCAL and REMOTE counters.
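
For the read side, that means something like this (sketch, made-up field names):

	/* Two counters per sub-buffer instead of one. */
	struct subbuf_commit {
		local_t       commit_local;	/* local_add() when committing on reserve_cpu */
		atomic_long_t commit_remote;	/* atomic_long_add() when migrated before commit */
	};

	/* Reader: the logical commit count is always the sum of both. */
	unsigned long commit_count =
		local_read(&sb->commit_local) +
		(unsigned long)atomic_long_read(&sb->commit_remote);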

This allows dealing with migration between reserve and commit without requiring the overhead
of an atomic operation on the fast-path (LOCAL case).

I had to design this kind of dual-counter trick in the context of user-space use of restartable
sequences. It looks like it may have a role to play in the kernel as well. :)

Or am I missing something important that would not survive real life?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-02-24 18:19:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
Mathieu Desnoyers <[email protected]> wrote:
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
> - probe code calculate the length which needs to be reserved to store the event
> (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
> [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
> from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>

So basically the commit position here doesn't move until this task is
scheduled back in and the commit (remote or local) is updated.

To put it in terms of the ftrace ring buffer: we have both a commit page
and a commit index, and they only get moved by the first writer to start a
commit stack (that is, interrupts that interrupted a write will not
increment the commit).

Now, I'm not sure how LTTng does it, but I could see issues with ftrace
trying to move the commit pointer (the pointer to the new commit page), as the
design is currently dependent on the fact that this can't happen while
commits are taking place.

Are the pages of the LTTng buffer indexed by an array of pages?

-- Steve

2021-02-25 01:03:57

by Michael Jeanson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

[ Adding Mathieu Desnoyers in CC ]

On 2021-02-23 21 h 16, Steven Rostedt wrote:
> On Thu, 18 Feb 2021 17:21:19 -0500
> Michael Jeanson <[email protected]> wrote:
>
>> This series only implements the tracepoint infrastructure required to
>> allow tracers to handle page faults. Modifying each tracer to handle
>> those page faults would be a next step after we all agree on this piece
>> of instrumentation infrastructure.
>
> I started taking a quick look at this, and came up with the question: how
> do you allow preemption when dealing with per-cpu buffers or storage to
> record the data?
>
> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> this is the reason for the need to disable preemption. What's the solution
> that LTTng is using for this? I know it has a per cpu buffers too, but does
> it have some kind of "per task" buffer that is being used to extract the
> data that can fault?
>
> -- Steve
>

2021-02-25 04:43:41

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)


----- Mathieu Desnoyers <[email protected]> wrote:
> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson [email protected] wrote:
>
> > [ Adding Mathieu Desnoyers in CC ]
> >
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <[email protected]> wrote:
> >>
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >>
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >>
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has a per cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]
> - probe code calculate the length which needs to be reserved to store the event
> (e.g. user strlen),
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
> [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
> from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> commit_cpu = smp_processor_id()
> if (commit_cpu == reserve_cpu)
> use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
> else
> use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]

The line above should increment reserve_cpu's buffer commit count, of course.
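
i.e. roughly this (sketch only, made-up structure/field names):

	/* Commit step, after the (possibly migrated) data copy. */
	preempt_disable();
	commit_cpu = smp_processor_id();
	sb = &buf[reserve_cpu].subbuffer[cur];	/* always the reserving cpu's buffer */
	if (commit_cpu == reserve_cpu)
		local_add(len, &sb->commit_local);		/* cheap, cpu-local op */
	else
		atomic_long_add(len, &sb->commit_remote);	/* rare migration case */
	preempt_enable();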

Thanks,

Mathieu

> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that lttng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counter.
>
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast-path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
>
> Or am I missing something important that would not survive real-life ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-02-25 21:50:29

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)



----- On Feb 24, 2021, at 1:14 PM, rostedt [email protected] wrote:

> On Wed, 24 Feb 2021 11:59:35 -0500 (EST)
> Mathieu Desnoyers <[email protected]> wrote:
>>
>> As a prototype solution, what I've done currently is to copy the user-space
>> data into a kmalloc'd buffer in a preparation step before disabling preemption
>> and copying data over into the per-cpu buffers. It works, but I think we should
>> be able to do it without the needless copy.
>>
>> What I have in mind as an efficient solution (not implemented yet) for the LTTng
>> kernel tracer goes as follows:
>>
>> #define COMMIT_LOCAL 0
>> #define COMMIT_REMOTE 1
>>
>> - faultable probe is called from system call tracepoint [
>> preemption/blocking/migration is allowed ]
>> - probe code calculate the length which needs to be reserved to store the event
>> (e.g. user strlen),
>>
>> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
>> - reserve_cpu = smp_processor_id()
>> - reserve space in the ring buffer for reserve_cpu
>> [ from that point on, we have _exclusive_ access to write into the ring buffer
>> "slot"
>> from any cpu until we commit. ]
>> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>>
>
> So basically the commit position here doesn't move until this task is
> scheduled back in and the commit (remote or local) is updated.

Indeed.

> To put it in terms of the ftrace ring buffer, where we have both a commit
> page and a commit index, and it only gets moved by the first one to start a
> commit stack (that is, interrupts that interrupted a write will not
> increment the commit).

The tricky part for ftrace is its reliance on the fact that the concurrent
users of the per-cpu ring buffer are all nested contexts. LTTng does not
assume that and has been designed to be used both in kernel and user-space:
lttng-modules and lttng-ust share a lot of ring buffer code. Therefore,
LTTng's ring buffer supports preemption/migration of concurrent contexts.

The fact that LTTng uses local-atomic-ops on its kernel ring buffers is just
an optimization on an overall ring buffer design meant to allow preemption.

> Now, I'm not sure how LTTng does it, but I could see issues for ftrace to
> try to move the commit pointer (the pointer to the new commit page), as the
> design is currently dependent on the fact that it can't happen while
> commits are taken place.

Indeed, what makes this easy for LTTng is that the ring buffer has been
designed to support preemption/migration from the ground up.

> Are the pages of the LTTng indexed by an array of pages?

Yes, they are. Handling the initial page allocation and then the tracer copy of data
to/from the ring buffer pages is the responsibility of the LTTng lib ring buffer "backend".
The LTTng lib ring buffer backend is somewhat similar to a page table done in software, where
the top level of the page table can be dynamically updated when doing flight recorder tracing.
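
Conceptually, the backend lookup works along these lines (a rough sketch,
not the actual LTTng data structures):

	/* "Software page table": an array of page pointers per sub-buffer. */
	struct subbuf_backend {
		struct page **pages;	/* pages backing this sub-buffer */
		size_t nr_pages;
	};

	/* Translate an offset within the sub-buffer into a kernel address. */
	static void *backend_address(struct subbuf_backend *bk, size_t offset)
	{
		size_t idx = offset >> PAGE_SHIFT;

		return page_address(bk->pages[idx]) + (offset & ~PAGE_MASK);
	}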

It is however completely separate from the space reservation/commit scheme which is handled
by the lib ring buffer "frontend".

The algorithm I described in my prior email is specifically targeted at the frontend layer,
leaving the "backend" unchanged.

For some reason, I suspect the ftrace ring buffer combined those two layers into a single
algorithm, which may have its advantages, but seems to strengthen its dependency on
only having nested contexts sharing a given per-cpu ring buffer.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-02-26 05:30:10

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] [RFC] Faultable tracepoints (v2)

On Thu, Feb 25, 2021 at 9:15 AM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Feb 24, 2021, at 11:22 AM, Michael Jeanson [email protected] wrote:
>
> > [ Adding Mathieu Desnoyers in CC ]
> >
> > On 2021-02-23 21 h 16, Steven Rostedt wrote:
> >> On Thu, 18 Feb 2021 17:21:19 -0500
> >> Michael Jeanson <[email protected]> wrote:
> >>
> >>> This series only implements the tracepoint infrastructure required to
> >>> allow tracers to handle page faults. Modifying each tracer to handle
> >>> those page faults would be a next step after we all agree on this piece
> >>> of instrumentation infrastructure.
> >>
> >> I started taking a quick look at this, and came up with the question: how
> >> do you allow preemption when dealing with per-cpu buffers or storage to
> >> record the data?
> >>
> >> That is, perf, bpf and ftrace are all using some kind of per-cpu data, and
> >> this is the reason for the need to disable preemption. What's the solution
> >> that LTTng is using for this? I know it has a per cpu buffers too, but does
> >> it have some kind of "per task" buffer that is being used to extract the
> >> data that can fault?
>
> As a prototype solution, what I've done currently is to copy the user-space
> data into a kmalloc'd buffer in a preparation step before disabling preemption
> and copying data over into the per-cpu buffers. It works, but I think we should
> be able to do it without the needless copy.
>
> What I have in mind as an efficient solution (not implemented yet) for the LTTng
> kernel tracer goes as follows:
>
> #define COMMIT_LOCAL 0
> #define COMMIT_REMOTE 1
>
> - faultable probe is called from system call tracepoint [ preemption/blocking/migration is allowed ]

label:
restart:

> - probe code calculate the length which needs to be reserved to store the event
> (e.g. user strlen),

Does "user strlen" makes the content fault in?

Is it possible to make the sleepable faulting happen only here, between
"restart" and the following "preempt disable"? The code here should
do a prefetch operation like "user strlen".

And we can keep preemption disabled when copying the data. If there
is a fault while copying, then we can restart from the label "restart".

Very immature thought.
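
Roughly something like this sketch (assuming a non-faulting copy such as
copy_from_user_nofault(), with made-up ring buffer helpers):

restart:
	/* Prefetch / fault in the user data while we may still sleep. */
	len = strnlen_user(ustr, MAX_LEN);	/* can take a page fault */
	if (!len)
		return;

	preempt_disable();
	slot = ring_buffer_reserve(len);	/* hypothetical */
	if (copy_from_user_nofault(slot, ustr, len)) {
		/* The page was evicted again: give up the slot and retry. */
		ring_buffer_discard(slot);	/* hypothetical */
		preempt_enable();
		goto restart;
	}
	ring_buffer_commit(slot);		/* hypothetical */
	preempt_enable();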

Thanks
Lai

>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> - reserve_cpu = smp_processor_id()
> - reserve space in the ring buffer for reserve_cpu
> [ from that point on, we have _exclusive_ access to write into the ring buffer "slot"
> from any cpu until we commit. ]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> - copy data from user-space to the ring buffer "slot",
>
> - preempt disable -> [ preemption/blocking/migration is not allowed from here ]
> commit_cpu = smp_processor_id()
> if (commit_cpu == reserve_cpu)
> use local_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_LOCAL]
> else
> use atomic_add to increment the buf[commit_cpu].subbuffer[current].commit_count[COMMIT_REMOTE]
> - preempt enable -> [ preemption/blocking/migration is allowed from here ]
>
> Given that lttng uses per-buffer/per-sub-buffer commit counters as simple free-running
> accumulators, the trick here is to use two commit counters rather than single one for each
> sub-buffer. Whenever we need to read a commit count value, we always sum the total of the
> LOCAL and REMOTE counter.
>
> This allows dealing with migration between reserve and commit without requiring the overhead
> of an atomic operation on the fast-path (LOCAL case).
>
> I had to design this kind of dual-counter trick in the context of user-space use of restartable
> sequences. It looks like it may have a role to play in the kernel as well. :)
>
> Or am I missing something important that would not survive real-life ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com