2020-02-21 13:53:55

by Peter Zijlstra

Subject: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
taught perf how to deal with not having an RCU context provided.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
---
include/linux/tracepoint.h | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
* For rcuidle callers, use srcu since sched-rcu \
* doesn't work from the idle path. \
*/ \
- if (rcuidle) { \
+ if (rcuidle) \
__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
- rcu_irq_enter_irqsave(); \
- } \
\
it_func_ptr = rcu_dereference_raw((tp)->funcs); \
\
@@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
} while ((++it_func_ptr)->func); \
} \
\
- if (rcuidle) { \
- rcu_irq_exit_irqsave(); \
+ if (rcuidle) \
srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
- } \
\
preempt_enable_notrace(); \
} while (0)



2020-03-06 10:45:03

by Peter Zijlstra

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Feb 21, 2020 at 02:34:32PM +0100, Peter Zijlstra wrote:
> Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
> rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
> taught perf how to deal with not having an RCU context provided.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Reviewed-by: Steven Rostedt (VMware) <[email protected]>
> ---
> include/linux/tracepoint.h | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
> * For rcuidle callers, use srcu since sched-rcu \
> * doesn't work from the idle path. \
> */ \
> - if (rcuidle) { \
> + if (rcuidle) \
> __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> - rcu_irq_enter_irqsave(); \
> - } \
> \
> it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> \
> @@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
> } while ((++it_func_ptr)->func); \
> } \
> \
> - if (rcuidle) { \
> - rcu_irq_exit_irqsave(); \
> + if (rcuidle) \
> srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> - } \
> \
> preempt_enable_notrace(); \
> } while (0)

So what happens when BPF registers for these tracepoints? BPF very much
wants RCU on AFAIU.

2020-03-06 11:32:48

by Peter Zijlstra

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 06, 2020 at 11:43:35AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 21, 2020 at 02:34:32PM +0100, Peter Zijlstra wrote:
> > Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
> > rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
> > taught perf how to deal with not having an RCU context provided.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > Reviewed-by: Steven Rostedt (VMware) <[email protected]>
> > ---
> > include/linux/tracepoint.h | 8 ++------
> > 1 file changed, 2 insertions(+), 6 deletions(-)
> >
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
> > * For rcuidle callers, use srcu since sched-rcu \
> > * doesn't work from the idle path. \
> > */ \
> > - if (rcuidle) { \
> > + if (rcuidle) \
> > __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> > - rcu_irq_enter_irqsave(); \
> > - } \
> > \
> > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> > \
> > @@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
> > } while ((++it_func_ptr)->func); \
> > } \
> > \
> > - if (rcuidle) { \
> > - rcu_irq_exit_irqsave(); \
> > + if (rcuidle) \
> > srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> > - } \
> > \
> > preempt_enable_notrace(); \
> > } while (0)
>
> So what happens when BPF registers for these tracepoints? BPF very much
> wants RCU on AFAIU.

I suspect we need something like this...

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a2f15222f205..67a39dbce0ce 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1475,11 +1475,13 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
static __always_inline
void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
+ int rcu_flags = trace_rcu_enter();
rcu_read_lock();
preempt_disable();
(void) BPF_PROG_RUN(prog, args);
preempt_enable();
rcu_read_unlock();
+ trace_rcu_exit(rcu_flags);
}

#define UNPACK(...) __VA_ARGS__

2020-03-06 15:52:14

by Alexei Starovoitov

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 6, 2020 at 3:31 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 06, 2020 at 11:43:35AM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 21, 2020 at 02:34:32PM +0100, Peter Zijlstra wrote:
> > > Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
> > > rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
> > > taught perf how to deal with not having an RCU context provided.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > > Reviewed-by: Steven Rostedt (VMware) <[email protected]>
> > > ---
> > > include/linux/tracepoint.h | 8 ++------
> > > 1 file changed, 2 insertions(+), 6 deletions(-)
> > >
> > > --- a/include/linux/tracepoint.h
> > > +++ b/include/linux/tracepoint.h
> > > @@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
> > > * For rcuidle callers, use srcu since sched-rcu \
> > > * doesn't work from the idle path. \
> > > */ \
> > > - if (rcuidle) { \
> > > + if (rcuidle) \
> > > __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> > > - rcu_irq_enter_irqsave(); \
> > > - } \
> > > \
> > > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> > > \
> > > @@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
> > > } while ((++it_func_ptr)->func); \
> > > } \
> > > \
> > > - if (rcuidle) { \
> > > - rcu_irq_exit_irqsave(); \
> > > + if (rcuidle) \
> > > srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> > > - } \
> > > \
> > > preempt_enable_notrace(); \
> > > } while (0)
> >
> > So what happens when BPF registers for these tracepoints? BPF very much
> > wants RCU on AFAIU.
>
> I suspect we need something like this...
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index a2f15222f205..67a39dbce0ce 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -1475,11 +1475,13 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> static __always_inline
> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> {
> + int rcu_flags = trace_rcu_enter();
> rcu_read_lock();
> preempt_disable();
> (void) BPF_PROG_RUN(prog, args);
> preempt_enable();
> rcu_read_unlock();
> + trace_rcu_exit(rcu_flags);

One big NACK.
I will not slow down 99% of cases because of one dumb user.
Absolutely no way.

2020-03-06 16:05:51

by Mathieu Desnoyers

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

----- On Mar 6, 2020, at 10:51 AM, Alexei Starovoitov [email protected] wrote:

> On Fri, Mar 6, 2020 at 3:31 AM Peter Zijlstra <[email protected]> wrote:
>>
>> On Fri, Mar 06, 2020 at 11:43:35AM +0100, Peter Zijlstra wrote:
>> > On Fri, Feb 21, 2020 at 02:34:32PM +0100, Peter Zijlstra wrote:
>> > > Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
>> > > rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
>> > > taught perf how to deal with not having an RCU context provided.
>> > >
>> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> > > Reviewed-by: Steven Rostedt (VMware) <[email protected]>
>> > > ---
>> > > include/linux/tracepoint.h | 8 ++------
>> > > 1 file changed, 2 insertions(+), 6 deletions(-)
>> > >
>> > > --- a/include/linux/tracepoint.h
>> > > +++ b/include/linux/tracepoint.h
>> > > @@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
>> > > * For rcuidle callers, use srcu since sched-rcu \
>> > > * doesn't work from the idle path. \
>> > > */ \
>> > > - if (rcuidle) { \
>> > > + if (rcuidle) \
>> > > __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
>> > > - rcu_irq_enter_irqsave(); \
>> > > - } \
>> > > \
>> > > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
>> > > \
>> > > @@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
>> > > } while ((++it_func_ptr)->func); \
>> > > } \
>> > > \
>> > > - if (rcuidle) { \
>> > > - rcu_irq_exit_irqsave(); \
>> > > + if (rcuidle) \
>> > > srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
>> > > - } \
>> > > \
>> > > preempt_enable_notrace(); \
>> > > } while (0)
>> >
>> > So what happens when BPF registers for these tracepoints? BPF very much
>> > wants RCU on AFAIU.
>>
>> I suspect we need something like this...
>>
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index a2f15222f205..67a39dbce0ce 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -1475,11 +1475,13 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map
>> *btp)
>> static __always_inline
>> void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
>> {
>> + int rcu_flags = trace_rcu_enter();
>> rcu_read_lock();
>> preempt_disable();
>> (void) BPF_PROG_RUN(prog, args);
>> preempt_enable();
>> rcu_read_unlock();
>> + trace_rcu_exit(rcu_flags);
>
> One big NACK.
> I will not slow down 99% of cases because of one dumb user.
> Absolutely no way.

If we care about not adding those extra branches on the fast-path, there is
an alternative way to do things: BPF could provide two distinct probe callbacks,
one meant for rcuidle tracepoints (which would have the trace_rcu_enter/exit), and
the other for the 99% of callsites which have RCU watching.

I would recommend performing benchmarks justifying the choice of one approach over
the other though.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-03-06 17:22:42

by Joel Fernandes

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 06, 2020 at 07:51:18AM -0800, Alexei Starovoitov wrote:
> On Fri, Mar 6, 2020 at 3:31 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Mar 06, 2020 at 11:43:35AM +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 21, 2020 at 02:34:32PM +0100, Peter Zijlstra wrote:
> > > > Effectively revert commit 865e63b04e9b2 ("tracing: Add back in
> > > > rcu_irq_enter/exit_irqson() for rcuidle tracepoints") now that we've
> > > > taught perf how to deal with not having an RCU context provided.
> > > >
> > > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > > > Reviewed-by: Steven Rostedt (VMware) <[email protected]>
> > > > ---
> > > > include/linux/tracepoint.h | 8 ++------
> > > > 1 file changed, 2 insertions(+), 6 deletions(-)
> > > >
> > > > --- a/include/linux/tracepoint.h
> > > > +++ b/include/linux/tracepoint.h
> > > > @@ -179,10 +179,8 @@ static inline struct tracepoint *tracepo
> > > > * For rcuidle callers, use srcu since sched-rcu \
> > > > * doesn't work from the idle path. \
> > > > */ \
> > > > - if (rcuidle) { \
> > > > + if (rcuidle) \
> > > > __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> > > > - rcu_irq_enter_irqsave(); \
> > > > - } \
> > > > \
> > > > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> > > > \
> > > > @@ -194,10 +192,8 @@ static inline struct tracepoint *tracepo
> > > > } while ((++it_func_ptr)->func); \
> > > > } \
> > > > \
> > > > - if (rcuidle) { \
> > > > - rcu_irq_exit_irqsave(); \
> > > > + if (rcuidle) \
> > > > srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> > > > - } \
> > > > \
> > > > preempt_enable_notrace(); \
> > > > } while (0)
> > >
> > > So what happens when BPF registers for these tracepoints? BPF very much
> > > wants RCU on AFAIU.
> >
> > I suspect we need something like this...
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index a2f15222f205..67a39dbce0ce 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -1475,11 +1475,13 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> > static __always_inline
> > void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> > {
> > + int rcu_flags = trace_rcu_enter();
> > rcu_read_lock();
> > preempt_disable();
> > (void) BPF_PROG_RUN(prog, args);
> > preempt_enable();
> > rcu_read_unlock();
> > + trace_rcu_exit(rcu_flags);
>
> One big NACK.
> I will not slowdown 99% of cases because of one dumb user.
> Absolutely no way.

For the 99% use cases, the above incurs an additional atomic_read and a
branch. Is that the concern? Just want to make sure we are talking about the
same thing.

Speaking of slowdowns, you don't really need that rcu_read_lock/unlock()
pair in __bpf_trace_run() AFAICS. The rcu_read_unlock() can run into the
rcu_read_unlock_special() slowpath and if not, at least has branches. Most
importantly, RCU is consolidated which means preempt_disable() implies
rcu_read_lock().

thanks,

- Joel

2020-03-06 17:55:48

by Steven Rostedt

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, 6 Mar 2020 11:04:28 -0500 (EST)
Mathieu Desnoyers <[email protected]> wrote:

> If we care about not adding those extra branches on the fast-path, there is
> an alternative way to do things: BPF could provide two distinct probe callbacks,
> one meant for rcuidle tracepoints (which would have the trace_rcu_enter/exit), and
> the other for the 99% of callsites which have RCU watching.
>
> I would recommend performing benchmarks justifying the choice of one approach over
> the other though.

I just whipped this up (haven't even tried to compile it), but this should
satisfy everyone. Those that register a callback that needs RCU protection
simply register with one of the _rcu versions, and all will be done. And
since DO_TRACE is a macro, and rcuidle is a constant, the RCU protection
code will be compiled out for locations where it is not needed.

With this, perf doesn't even need to do anything extra but register with
the "_rcu" version.

-- Steve

diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index b29950a19205..582dece30170 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -25,6 +25,7 @@ struct tracepoint_func {
void *func;
void *data;
int prio;
+ int requires_rcu;
};

struct tracepoint {
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 1fb11daa5c53..5f4de82ffa0f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -179,25 +179,28 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* For rcuidle callers, use srcu since sched-rcu \
* doesn't work from the idle path. \
*/ \
- if (rcuidle) { \
+ if (rcuidle) \
__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
- rcu_irq_enter_irqson(); \
- } \
\
it_func_ptr = rcu_dereference_raw((tp)->funcs); \
\
if (it_func_ptr) { \
do { \
+ int rcu_flags; \
it_func = (it_func_ptr)->func; \
+ if (rcuidle && \
+ (it_func_ptr)->requires_rcu) \
+ rcu_flags = trace_rcu_enter(); \
__data = (it_func_ptr)->data; \
((void(*)(proto))(it_func))(args); \
+ if (rcuidle && \
+ (it_func_ptr)->requires_rcu) \
+ trace_rcu_exit(rcu_flags); \
} while ((++it_func_ptr)->func); \
} \
\
- if (rcuidle) { \
- rcu_irq_exit_irqson(); \
+ if (rcuidle) \
srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
- } \
\
preempt_enable_notrace(); \
} while (0)
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 73956eaff8a9..1797e20fd471 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -295,6 +295,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
* @probe: probe handler
* @data: tracepoint data
* @prio: priority of this function over other registered functions
* @rcu: set to non-zero if the callback requires RCU protection
*
* Returns 0 if ok, error value on error.
* Note: if @tp is within a module, the caller is responsible for
@@ -302,8 +303,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
* performed either with a tracepoint module going notifier, or from
* within module exit functions.
*/
-int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
- void *data, int prio)
+int tracepoint_probe_register_prio_rcu(struct tracepoint *tp, void *probe,
+ void *data, int prio, int rcu)
{
struct tracepoint_func tp_func;
int ret;
@@ -312,12 +313,52 @@ int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
tp_func.func = probe;
tp_func.data = data;
tp_func.prio = prio;
+ tp_func.requires_rcu = rcu;
ret = tracepoint_add_func(tp, &tp_func, prio);
mutex_unlock(&tracepoints_mutex);
return ret;
}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_rcu);
+
+/**
+ * tracepoint_probe_register_prio - Connect a probe to a tracepoint with priority
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ * @prio: priority of this function over other registered functions
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
+ void *data, int prio)
+{
+ return tracepoint_probe_register_prio_rcu(tp, probe, data, prio, 0);
+}
EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);

+/**
+ * tracepoint_probe_register_rcu - Connect a probe to a tracepoint
+ * @tp: tracepoint
+ * @probe: probe handler
+ * @data: tracepoint data
+ *
+ * Returns 0 if ok, error value on error.
+ * Note: if @tp is within a module, the caller is responsible for
+ * unregistering the probe before the module is gone. This can be
+ * performed either with a tracepoint module going notifier, or from
+ * within module exit functions.
+ */
+int tracepoint_probe_register_rcu(struct tracepoint *tp, void *probe, void *data)
+{
+ return tracepoint_probe_register_prio_rcu(tp, probe, data,
+ TRACEPOINT_DEFAULT_PRIO, 1);
+}
+EXPORT_SYMBOL_GPL(tracepoint_probe_register_rcu);
+
/**
* tracepoint_probe_register - Connect a probe to a tracepoint
* @tp: tracepoint
@@ -332,7 +373,8 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
*/
int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
{
- return tracepoint_probe_register_prio(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
+ return tracepoint_probe_register_prio_rcu(tp, probe, data,
+ TRACEPOINT_DEFAULT_PRIO, 0);
}
EXPORT_SYMBOL_GPL(tracepoint_probe_register);

2020-03-06 18:46:15

by Joel Fernandes

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 06, 2020 at 12:55:00PM -0500, Steven Rostedt wrote:
> On Fri, 6 Mar 2020 11:04:28 -0500 (EST)
> Mathieu Desnoyers <[email protected]> wrote:
>
> > If we care about not adding those extra branches on the fast-path, there is
> > an alternative way to do things: BPF could provide two distinct probe callbacks,
> > one meant for rcuidle tracepoints (which would have the trace_rcu_enter/exit), and
> > the other for the 99% of callsites which have RCU watching.
> >
> > I would recommend performing benchmarks justifying the choice of one approach over
> > the other though.
>
> I just whipped this up (haven't even tried to compile it), but this should
> satisfy everyone. Those that register a callback that needs RCU protection
> simply register with one of the _rcu versions, and all will be done. And
> since DO_TRACE is a macro, and rcuidle is a constant, the RCU protection
> code will be compiled out for locations where it is not needed.
>
> With this, perf doesn't even need to do anything extra but register with
> the "_rcu" version.

Looks nice! Some comments below:

> -- Steve
>
> diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> index b29950a19205..582dece30170 100644
> --- a/include/linux/tracepoint-defs.h
> +++ b/include/linux/tracepoint-defs.h
> @@ -25,6 +25,7 @@ struct tracepoint_func {
> void *func;
> void *data;
> int prio;
> + int requires_rcu;
> };
>
> struct tracepoint {
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 1fb11daa5c53..5f4de82ffa0f 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -179,25 +179,28 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> * For rcuidle callers, use srcu since sched-rcu \
> * doesn't work from the idle path. \
> */ \
> - if (rcuidle) { \
> + if (rcuidle) \
> __idx = srcu_read_lock_notrace(&tracepoint_srcu);\

Small addition:
To prevent confusion, we could make it more clear that SRCU here is just to
protect the tracepoint function table and not the callbacks themselves.

> - rcu_irq_enter_irqson(); \
> - } \
> \
> it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> \
> if (it_func_ptr) { \
> do { \
> + int rcu_flags; \
> it_func = (it_func_ptr)->func; \
> + if (rcuidle && \
> + (it_func_ptr)->requires_rcu) \
> + rcu_flags = trace_rcu_enter(); \
> __data = (it_func_ptr)->data; \
> ((void(*)(proto))(it_func))(args); \
> + if (rcuidle && \
> + (it_func_ptr)->requires_rcu) \
> + trace_rcu_exit(rcu_flags); \

Nit: If we have incurred the cost of trace_rcu_enter() once, we can call
it only once and then call trace_rcu_exit() after the do-while loop. That way
we pay the price only once.

thanks,

- Joel

> } while ((++it_func_ptr)->func); \
> } \
> \
> - if (rcuidle) { \
> - rcu_irq_exit_irqson(); \
> + if (rcuidle) \
> srcu_read_unlock_notrace(&tracepoint_srcu, __idx);\
> - } \
> \
> preempt_enable_notrace(); \
> } while (0)
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 73956eaff8a9..1797e20fd471 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -295,6 +295,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> * @probe: probe handler
> * @data: tracepoint data
> * @prio: priority of this function over other registered functions
> + * @rcu: set to non zero if the callback requires RCU protection
> *
> * Returns 0 if ok, error value on error.
> * Note: if @tp is within a module, the caller is responsible for
> @@ -302,8 +303,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> * performed either with a tracepoint module going notifier, or from
> * within module exit functions.
> */
> -int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
> - void *data, int prio)
> +int tracepoint_probe_register_prio_rcu(struct tracepoint *tp, void *probe,
> + void *data, int prio, int rcu)
> {
> struct tracepoint_func tp_func;
> int ret;
> @@ -312,12 +313,52 @@ int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
> tp_func.func = probe;
> tp_func.data = data;
> tp_func.prio = prio;
> + tp_func.requires_rcu = rcu;
> ret = tracepoint_add_func(tp, &tp_func, prio);
> mutex_unlock(&tracepoints_mutex);
> return ret;
> }
> +EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_rcu);
> +
> +/**
> + * tracepoint_probe_register_prio - Connect a probe to a tracepoint with priority
> + * @tp: tracepoint
> + * @probe: probe handler
> + * @data: tracepoint data
> + * @prio: priority of this function over other registered functions
> + *
> + * Returns 0 if ok, error value on error.
> + * Note: if @tp is within a module, the caller is responsible for
> + * unregistering the probe before the module is gone. This can be
> + * performed either with a tracepoint module going notifier, or from
> + * within module exit functions.
> + */
> +int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
> + void *data, int prio)
> +{
> + return tracepoint_probe_register_prio_rcu(tp, probe, data, prio, 0);
> +}
> EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
>
> +/**
> + * tracepoint_probe_register_rcu - Connect a probe to a tracepoint
> + * @tp: tracepoint
> + * @probe: probe handler
> + * @data: tracepoint data
> + *
> + * Returns 0 if ok, error value on error.
> + * Note: if @tp is within a module, the caller is responsible for
> + * unregistering the probe before the module is gone. This can be
> + * performed either with a tracepoint module going notifier, or from
> + * within module exit functions.
> + */
> +int tracepoint_probe_register_rcu(struct tracepoint *tp, void *probe, void *data)
> +{
> + return tracepoint_probe_register_prio_rcu(tp, probe, data,
> + TRACEPOINT_DEFAULT_PRIO, 1);
> +}
> +EXPORT_SYMBOL_GPL(tracepoint_probe_register_rcu);
> +
> /**
> * tracepoint_probe_register - Connect a probe to a tracepoint
> * @tp: tracepoint
> @@ -332,7 +373,8 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio);
> */
> int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data)
> {
> - return tracepoint_probe_register_prio(tp, probe, data, TRACEPOINT_DEFAULT_PRIO);
> + return tracepoint_probe_register_prio_rcu(tp, probe, data,
> + TRACEPOINT_DEFAULT_PRIO, 0);
> }
> EXPORT_SYMBOL_GPL(tracepoint_probe_register);
>

2020-03-06 19:00:14

by Steven Rostedt

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, 6 Mar 2020 13:45:38 -0500
Joel Fernandes <[email protected]> wrote:

> On Fri, Mar 06, 2020 at 12:55:00PM -0500, Steven Rostedt wrote:
> > On Fri, 6 Mar 2020 11:04:28 -0500 (EST)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> > > If we care about not adding those extra branches on the fast-path, there is
> > > an alternative way to do things: BPF could provide two distinct probe callbacks,
> > > one meant for rcuidle tracepoints (which would have the trace_rcu_enter/exit), and
> > > the other for the 99% of callsites which have RCU watching.
> > >
> > > I would recommend performing benchmarks justifying the choice of one approach over
> > > the other though.
> >
> > I just whipped this up (haven't even tried to compile it), but this should
> > satisfy everyone. Those that register a callback that needs RCU protection
> > simply register with one of the _rcu versions, and all will be done. And
> > since DO_TRACE is a macro, and rcuidle is a constant, the RCU protection
> > code will be compiled out for locations where it is not needed.
> >
> > With this, perf doesn't even need to do anything extra but register with
> > the "_rcu" version.
>
> Looks nice! Some comments below:
>
> > -- Steve
> >
> > diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
> > index b29950a19205..582dece30170 100644
> > --- a/include/linux/tracepoint-defs.h
> > +++ b/include/linux/tracepoint-defs.h
> > @@ -25,6 +25,7 @@ struct tracepoint_func {
> > void *func;
> > void *data;
> > int prio;
> > + int requires_rcu;
> > };
> >
> > struct tracepoint {
> > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> > index 1fb11daa5c53..5f4de82ffa0f 100644
> > --- a/include/linux/tracepoint.h
> > +++ b/include/linux/tracepoint.h
> > @@ -179,25 +179,28 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> > * For rcuidle callers, use srcu since sched-rcu \
> > * doesn't work from the idle path. \
> > */ \
> > - if (rcuidle) { \
> > + if (rcuidle) \
> > __idx = srcu_read_lock_notrace(&tracepoint_srcu);\
>
> Small addition:
> To prevent confusion, we could make it more clear that SRCU here is just to
> protect the tracepoint function table and not the callbacks themselves.
>
> > - rcu_irq_enter_irqson(); \
> > - } \
> > \
> > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> > \
> > if (it_func_ptr) { \
> > do { \
> > + int rcu_flags; \
> > it_func = (it_func_ptr)->func; \
> > + if (rcuidle && \
> > + (it_func_ptr)->requires_rcu) \
> > + rcu_flags = trace_rcu_enter(); \
> > __data = (it_func_ptr)->data; \
> > ((void(*)(proto))(it_func))(args); \
> > + if (rcuidle && \
> > + (it_func_ptr)->requires_rcu) \
> > + trace_rcu_exit(rcu_flags); \
>
> Nit: If we have incurred the cost of trace_rcu_enter() once, we can call
> it only once and then call trace_rcu_exit() after the do-while loop. That way
> we pay the price only once.
>

I thought about that, but the common case is only one callback attached at
a time. To make the code complex for the uncommon case seemed like
overkill. If we find that it does help, it's best to do that as a
separate patch because then if something goes wrong we know where it
happened.

Currently, this provides the same overhead as if each callback did it
themselves like we were proposing (but without the added need to do it for
all instances of the callback).

-- Steve

2020-03-06 19:15:39

by Joel Fernandes

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 06, 2020 at 01:59:25PM -0500, Steven Rostedt wrote:
[snip]
> > > - rcu_irq_enter_irqson(); \
> > > - } \
> > > \
> > > it_func_ptr = rcu_dereference_raw((tp)->funcs); \
> > > \
> > > if (it_func_ptr) { \
> > > do { \
> > > + int rcu_flags; \
> > > it_func = (it_func_ptr)->func; \
> > > + if (rcuidle && \
> > > + (it_func_ptr)->requires_rcu) \
> > > + rcu_flags = trace_rcu_enter(); \
> > > __data = (it_func_ptr)->data; \
> > > ((void(*)(proto))(it_func))(args); \
> > > + if (rcuidle && \
> > > + (it_func_ptr)->requires_rcu) \
> > > + trace_rcu_exit(rcu_flags); \
> >
> > Nit: If we have incurred the cost of trace_rcu_enter() once, we can call
> > it only once and then call trace_rcu_exit() after the do-while loop. That way
> > we pay the price only once.
> >
>
> I thought about that, but the common case is only one callback attached at
> a time. To make the code complex for the uncommon case seemed like
> overkill. If we find that it does help, it's best to do that as a
> separate patch because then if something goes wrong we know where it
> happened.
>
> Currently, this provides the same overhead as if each callback did it
> themselves like we were proposing (but without the added need to do it for
> all instances of the callback).

That's ok, it could be a separate patch.

thanks,

- Joel

2020-03-06 20:23:25

by Mathieu Desnoyers

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

----- On Mar 6, 2020, at 12:55 PM, rostedt [email protected] wrote:

> On Fri, 6 Mar 2020 11:04:28 -0500 (EST)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> If we care about not adding those extra branches on the fast-path, there is
>> an alternative way to do things: BPF could provide two distinct probe callbacks,
>> one meant for rcuidle tracepoints (which would have the trace_rcu_enter/exit),
>> and
>> the other for the 99% of callsites which have RCU watching.
>>
>> I would recommend performing benchmarks justifying the choice of one approach
>> over
>> the other though.
>
> I just whipped this up (haven't even tried to compile it), but this should
> satisfy everyone. Those that register a callback that needs RCU protection
> simply registers with one of the _rcu versions, and all will be done. And
> since DO_TRACE is a macro, and rcuidle is a constant, the rcu protection
> code will be compiled out for locations that it is not needed.
>
> With this, perf doesn't even need to do anything extra but register with
> the "_rcu" version.
>
> -- Steve
>

[...]

> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 73956eaff8a9..1797e20fd471 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -295,6 +295,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> * @probe: probe handler
> * @data: tracepoint data
> * @prio: priority of this function over other registered functions
> > * @rcu: set to non-zero if the callback requires RCU protection
> *
> * Returns 0 if ok, error value on error.
> * Note: if @tp is within a module, the caller is responsible for
> @@ -302,8 +303,8 @@ static int tracepoint_remove_func(struct tracepoint *tp,
> * performed either with a tracepoint module going notifier, or from
> * within module exit functions.
> */
> -int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
> - void *data, int prio)
> +int tracepoint_probe_register_prio_rcu(struct tracepoint *tp, void *probe,
> + void *data, int prio, int rcu)

I agree with the overall approach. Just a bit of nitpicking on the API:

I understand that the "prio" argument is a separate argument because it can take
many values. However, "rcu" is just a boolean, so I wonder if we should not rather
introduce a "int flags" with a bitmask enum, e.g.

int tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe,
void *data, int prio, int flags)

where flags would be populated through OR between labels of this enum:

enum tracepoint_flags {
TRACEPOINT_FLAG_RCU = (1U << 0),
};

We can then be future-proof for additional flags without ending up calling e.g.

tracepoint_probe_register_featurea_featureb_featurec(tp, probe, data, 0, 1, 0, 1)

which seems rather error-prone and less readable than a set of flags.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-03-06 20:46:44

by Steven Rostedt

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, 6 Mar 2020 15:22:46 -0500 (EST)
Mathieu Desnoyers <[email protected]> wrote:

> I agree with the overall approach. Just a bit of nitpicking on the API:
>
> I understand that the "prio" argument is a separate argument because it can take
> many values. However, "rcu" is just a boolean, so I wonder if we should not rather
> introduce a "int flags" with a bitmask enum, e.g.

I thought about this approach, but thought it was a bit overkill. As the
kernel doesn't have a stable internal API, I figured we can switch this over to
flags when we get another flag to add. Unless you can think of one in the
near future.

>
> int tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe,
> void *data, int prio, int flags)
>
> where flags would be populated through OR between labels of this enum:
>
> enum tracepoint_flags {
> TRACEPOINT_FLAG_RCU = (1U << 0),
> };
>
> We can then be future-proof for additional flags without ending up calling e.g.
>
> tracepoint_probe_register_featurea_featureb_featurec(tp, probe, data, 0, 1, 0, 1)

No, as soon as there is another boolean to add, the rcu version would be
switched to flags. I even thought about making the rcu and prio into one,
capping prio at SHRT_MAX, and using the other 16 bits for
flags.

-- Steve


>
> which seems rather error-prone and less readable than a set of flags.

2020-03-06 20:56:38

by Mathieu Desnoyers

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

----- On Mar 6, 2020, at 3:45 PM, rostedt [email protected] wrote:

> On Fri, 6 Mar 2020 15:22:46 -0500 (EST)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> I agree with the overall approach. Just a bit of nitpicking on the API:
>>
>> I understand that the "prio" argument is a separate argument because it can take
>> many values. However, "rcu" is just a boolean, so I wonder if we should not
>> rather
>> introduce a "int flags" with a bitmask enum, e.g.
>
> I thought about this approach, but thought it was a bit overkill. As the
> kernel doesn't have an internal API, I figured we can switch this over to
> flags when we get another flag to add. Unless you can think of one in the
> near future.

The additional feature I have in mind for near future would be to register
a probe which can take a page fault to a "sleepable" tracepoint. This would
require preemption to be enabled and use of SRCU.

We can always change things when we get there.

Thanks,

Mathieu

>
>>
>> int tracepoint_probe_register_prio_flags(struct tracepoint *tp, void *probe,
>> void *data, int prio, int flags)
>>
>> where flags would be populated through OR between labels of this enum:
>>
>> enum tracepoint_flags {
>> TRACEPOINT_FLAG_RCU = (1U << 0),
>> };
>>
>> We can then be future-proof for additional flags without ending up calling e.g.
>>
>> tracepoint_probe_register_featurea_featureb_featurec(tp, probe, data, 0, 1, 0,
>> 1)
>
> No, as soon as there is another boolean to add, the rcu version would be
> switched to flags. I even thought about making the rcu and prio into one,
> and change prio to be a SHRT_MAX max, and have the other 16 bits be for
> flags.
>
> -- Steve
>
>
>>
>> which seems rather error-prone and less readable than a set of flags.

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-03-06 21:09:04

by Steven Rostedt

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, 6 Mar 2020 15:55:24 -0500 (EST)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Mar 6, 2020, at 3:45 PM, rostedt [email protected] wrote:
>
> > On Fri, 6 Mar 2020 15:22:46 -0500 (EST)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> I agree with the overall approach. Just a bit of nitpicking on the API:
> >>
> >> I understand that the "prio" argument is a separate argument because it can take
> >> many values. However, "rcu" is just a boolean, so I wonder if we should not
> >> rather
> >> introduce a "int flags" with a bitmask enum, e.g.
> >
> > I thought about this approach, but thought it was a bit overkill. As the
> > kernel doesn't have an internal API, I figured we can switch this over to
> > flags when we get another flag to add. Unless you can think of one in the
> > near future.
>
> The additional feature I have in mind for near future would be to register
> a probe which can take a page fault to a "sleepable" tracepoint. This would
> require preemption to be enabled and use of SRCU.
>
> We can always change things when we get there.

Yeah, let's rename it if we get there.

-- Steve

2020-03-06 23:11:18

by Alexei Starovoitov

Subject: Re: [PATCH v4 16/27] tracing: Remove regular RCU context for _rcuidle tracepoints (again)

On Fri, Mar 6, 2020 at 12:55 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Mar 6, 2020, at 3:45 PM, rostedt [email protected] wrote:
>
> > On Fri, 6 Mar 2020 15:22:46 -0500 (EST)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> I agree with the overall approach. Just a bit of nitpicking on the API:
> >>
> >> I understand that the "prio" argument is a separate argument because it can take
> >> many values. However, "rcu" is just a boolean, so I wonder if we should not
> >> rather
> >> introduce a "int flags" with a bitmask enum, e.g.
> >
> > I thought about this approach, but thought it was a bit overkill. As the
> > kernel doesn't have an internal API, I figured we can switch this over to
> > flags when we get another flag to add. Unless you can think of one in the
> > near future.
>
> The additional feature I have in mind for near future would be to register
> a probe which can take a page fault to a "sleepable" tracepoint. This would
> require preemption to be enabled and use of SRCU.

I'm working on sleepable bpf as well, and this extra flag for tracepoints
would come in very handy, so I would go with the flags approach right away.
We wouldn't need to touch the same protos multiple times,
fewer conflicts for us all, etc.