This patch adds code allowing the event filter to be set only if there's
no active tracing going on.
Signed-off-by: Tom Zanussi <[email protected]>
---
kernel/trace/trace.c | 13 +++++++++++++
kernel/trace/trace.h | 2 ++
kernel/trace/trace_events.c | 7 +++++++
kernel/trace/trace_events_filter.c | 12 ++++++++++++
4 files changed, 34 insertions(+), 0 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 3624b25..30530f7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -235,6 +235,19 @@ static struct tracer *trace_types __read_mostly;
/* current_trace points to the tracer that is currently active */
static struct tracer *current_trace __read_mostly;
+/**
+ * tracer_is_nop - return whether the nop tracer is the current tracer
+ *
+ * This function is used by other tracers to know whether there is
+ * currently a tracer set. Tracers may use it to decide whether they
+ * should enable their features when starting up. See the event tracer
+ * for an example (event_filter_write).
+ */
+int tracer_is_nop(void)
+{
+ return current_trace == &nop_trace;
+}
+
/*
* max_tracer_type_len is used to simplify the allocating of
* buffers to read userspace tracer names. We keep track of
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cfb07ef..6834345 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -466,6 +466,7 @@ struct trace_iterator {
int tracer_init(struct tracer *t, struct trace_array *tr);
int tracing_is_enabled(void);
+int tracer_is_nop(void);
void trace_wake_up(void);
void tracing_reset(struct trace_array *tr, int cpu);
void tracing_reset_online_cpus(struct trace_array *tr);
@@ -859,6 +860,7 @@ extern int filter_match_preds(struct ftrace_event_call *call, void *rec);
extern void filter_free_subsystem_preds(struct event_subsystem *system);
extern int filter_add_subsystem_pred(struct event_subsystem *system,
struct filter_pred *pred);
+extern int subsystem_events_enabled(struct event_subsystem *system);
static inline void
filter_check_discard(struct ftrace_event_call *call, void *rec,
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index be9299a..eb1d363 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
struct filter_pred *pred;
int err;
+ if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
+ return -EBUSY;
+
if (cnt >= sizeof(buf))
return -EINVAL;
@@ -564,6 +567,10 @@ subsystem_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
struct filter_pred *pred;
int err;
+ if (tracing_is_enabled() &&
+ (!tracer_is_nop() || subsystem_events_enabled(system)))
+ return -EBUSY;
+
if (cnt >= sizeof(buf))
return -EINVAL;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 470ad94..a706e0b 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -172,6 +172,18 @@ void filter_free_preds(struct ftrace_event_call *call)
}
}
+int subsystem_events_enabled(struct event_subsystem *system)
+{
+ struct ftrace_event_call *call = __start_ftrace_events;
+
+ events_for_each(call) {
+ if (!strcmp(call->system, system->name) && call->enabled)
+ return 1;
+ }
+
+ return 0;
+}
+
void filter_free_subsystem_preds(struct event_subsystem *system)
{
struct ftrace_event_call *call = __start_ftrace_events;
--
1.5.6.3
* Tom Zanussi <[email protected]> wrote:
> This patch adds code allowing the event filter to be set only if
> there's no active tracing going on.
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> struct filter_pred *pred;
> int err;
>
> + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> + return -EBUSY;
hm, but it would be the normal use-case to set filters on the fly.
To experiment around with them and shape them until the output is
just right. Having to turn the tracer on/off during that seems quite
counterproductive to that use-case.
Ingo
On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> * Tom Zanussi <[email protected]> wrote:
>
> > This patch adds code allowing the event filter to be set only if
> > there's no active tracing going on.
>
> > --- a/kernel/trace/trace_events.c
> > +++ b/kernel/trace/trace_events.c
> > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > struct filter_pred *pred;
> > int err;
> >
> > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > + return -EBUSY;
>
> hm, but it would be the normal use-case to set filters on the fly.
> To experiment around with them and shape them until the output is
> just right. Having to turn the tracer on/off during that seems quite
> counterproductive to that use-case.
>
I didn't see anything that could be used to temporarily disable tracing
(tracing_stop() and tracing_start() are 'quick' versions that mostly
just disable recording), so did it this way to avoid adding any overhead
to the filter-checking code.
But anyway, I'll post a new patch shortly that uses rcu and does allow
the filters to be set on the fly.
Tom
> Ingo
* Tom Zanussi <[email protected]> wrote:
> On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> > * Tom Zanussi <[email protected]> wrote:
> >
> > > This patch adds code allowing the event filter to be set only if
> > > there's no active tracing going on.
> >
> > > --- a/kernel/trace/trace_events.c
> > > +++ b/kernel/trace/trace_events.c
> > > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > > struct filter_pred *pred;
> > > int err;
> > >
> > > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > > + return -EBUSY;
> >
> > hm, but it would be the normal use-case to set filters on the fly.
> > To experiment around with them and shape them until the output is
> > just right. Having to turn the tracer on/off during that seems quite
> > counterproductive to that use-case.
> >
>
> I didn't see anything that could be used to temporarily disable
> tracing (tracing_stop() and tracing_start() are 'quick' versions
> that mostly just disable recording), so did it this way to avoid
> adding any overhead to the filter-checking code.
>
> But anyway, I'll post a new patch shortly that uses rcu and does
> allow the filters to be set on the fly.
that's a very intelligent way to do it!
There's a theoretical problem though: what if we put a filtered
tracepoint into the RCU code? Especially if that tracepoint is in
the common function-tracer callback affecting all kernel functions.
I've Cc:-ed Paul. I think the quiescent state logic should handle
this just fine, but i'm not 100% sure.
Ingo
On Fri, 3 Apr 2009, Ingo Molnar wrote:
>
> * Tom Zanussi <[email protected]> wrote:
>
> > On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> > > * Tom Zanussi <[email protected]> wrote:
> > >
> > > > This patch adds code allowing the event filter to be set only if
> > > > there's no active tracing going on.
> > >
> > > > --- a/kernel/trace/trace_events.c
> > > > +++ b/kernel/trace/trace_events.c
> > > > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > > > struct filter_pred *pred;
> > > > int err;
> > > >
> > > > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > > > + return -EBUSY;
> > >
> > > hm, but it would be the normal use-case to set filters on the fly.
> > > To experiment around with them and shape them until the output is
> > > just right. Having to turn the tracer on/off during that seems quite
> > > counterproductive to that use-case.
> > >
> >
> > I didn't see anything that could be used to temporarily disable
> > tracing (tracing_stop() and tracing_start() are 'quick' versions
> > that mostly just disable recording), so did it this way to avoid
> > adding any overhead to the filter-checking code.
> >
> > But anyway, I'll post a new patch shortly that uses rcu and does
> > allow the filters to be set on the fly.
>
> that's a very intelligent way to do it!
>
> There's a theoretical problem though: what if we put a filtered
> tracepoint into the RCU code? Especially if that tracepoint is in
> the common function-tracer callback affecting all kernel functions.
> I've Cc:-ed Paul. I think the quiescent state logic should handle
> this just fine, but i'm not 100% sure.
I commented about this too. I feel the safest way is to simply use
preempt_disable, and instead of synchronize_rcu, we can use
synchronize_sched, which should have the same effect.
-- Steve
On Fri, Apr 03, 2009 at 03:59:56PM +0200, Ingo Molnar wrote:
>
> * Tom Zanussi <[email protected]> wrote:
>
> > On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> > > * Tom Zanussi <[email protected]> wrote:
> > >
> > > > This patch adds code allowing the event filter to be set only if
> > > > there's no active tracing going on.
> > >
> > > > --- a/kernel/trace/trace_events.c
> > > > +++ b/kernel/trace/trace_events.c
> > > > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > > > struct filter_pred *pred;
> > > > int err;
> > > >
> > > > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > > > + return -EBUSY;
> > >
> > > hm, but it would be the normal use-case to set filters on the fly.
> > > To experiment around with them and shape them until the output is
> > > just right. Having to turn the tracer on/off during that seems quite
> > > counterproductive to that use-case.
> > >
> >
> > I didn't see anything that could be used to temporarily disable
> > tracing (tracing_stop() and tracing_start() are 'quick' versions
> > that mostly just disable recording), so did it this way to avoid
> > adding any overhead to the filter-checking code.
> >
> > But anyway, I'll post a new patch shortly that uses rcu and does
> > allow the filters to be set on the fly.
>
> that's a very intelligent way to do it!
>
> There's a theoretical problem though: what if we put a filtered
> tracepoint into the RCU code? Especially if that tracepoint is in
> the common function-tracer callback affecting all kernel functions.
> I've Cc:-ed Paul. I think the quiescent state logic should handle
> this just fine, but i'm not 100% sure.
Disclaimer: I don't claim to fully understand the code. I do not see any
problems tracing the quiescent-state logic. However, it appears to me
that you don't get to set tracepoints in the idle loop for rcuclassic,
rcutree, and rcutiny.
My kneejerk reaction is "why would anyone want to trace the idle loop?"
Thanx, Paul
* Paul E. McKenney <[email protected]> wrote:
> On Fri, Apr 03, 2009 at 03:59:56PM +0200, Ingo Molnar wrote:
> >
> > * Tom Zanussi <[email protected]> wrote:
> >
> > > On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> > > > * Tom Zanussi <[email protected]> wrote:
> > > >
> > > > > This patch adds code allowing the event filter to be set only if
> > > > > there's no active tracing going on.
> > > >
> > > > > --- a/kernel/trace/trace_events.c
> > > > > +++ b/kernel/trace/trace_events.c
> > > > > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > > > > struct filter_pred *pred;
> > > > > int err;
> > > > >
> > > > > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > > > > + return -EBUSY;
> > > >
> > > > hm, but it would be the normal use-case to set filters on the fly.
> > > > To experiment around with them and shape them until the output is
> > > > just right. Having to turn the tracer on/off during that seems quite
> > > > counterproductive to that use-case.
> > > >
> > >
> > > I didn't see anything that could be used to temporarily disable
> > > tracing (tracing_stop() and tracing_start() are 'quick' versions
> > > that mostly just disable recording), so did it this way to avoid
> > > adding any overhead to the filter-checking code.
> > >
> > > But anyway, I'll post a new patch shortly that uses rcu and does
> > > allow the filters to be set on the fly.
> >
> > that's a very intelligent way to do it!
> >
> > There's a theoretical problem though: what if we put a filtered
> > tracepoint into the RCU code? Especially if that tracepoint is in
> > the common function-tracer callback affecting all kernel functions.
> > I've Cc:-ed Paul. I think the quiescent state logic should handle
> > this just fine, but i'm not 100% sure.
>
> Disclaimer: I don't claim to fully understand the code. I do not see any
> problems tracing the quiescent-state logic. However, it appears to me
> that you don't get to set tracepoints in the idle loop for rcuclassic,
> rcutree, and rcutiny.
>
> My kneejerk reaction is "why would anyone want to trace the idle
> loop?"
heh :-)
Idle enter/exit events are useful to tune power use for example. The
more events we have there, the more we prevent the CPU from slowly
going into deep sleep mode.
Ingo
On Fri, 3 Apr 2009, Ingo Molnar wrote:
> >
> > My kneejerk reaction is "why would anyone want to trace the idle
> > loop?"
>
> heh :-)
>
> Idle enter/exit events are useful to tune power use for example. The
> more events we have there, the more we prevent the CPU from slowly
> going into deep sleep mode.
I can say I use it a lot. I'm still needing a way to set the function pid
recorder to 0 since the conversion to the pid structure. There are times
I only want to trace the interrupts that happen in the idle loop.
-- Steve
On Fri, Apr 03, 2009 at 12:43:15PM -0400, Steven Rostedt wrote:
>
> On Fri, 3 Apr 2009, Ingo Molnar wrote:
> > >
> > > My kneejerk reaction is "why would anyone want to trace the idle
> > > loop?"
> >
> > heh :-)
> >
> > Idle enter/exit events are useful to tune power use for example. The
> > more events we have there, the more we prevent the CPU from slowly
> > going into deep sleep mode.
But we could in principle trace idle enter/exit from the scheduler, correct?
That said, it would be possible to allow much of the idle loop to contain
RCU read-side critical sections, but this requires putting rcu_qsctr_inc()
in each and every idle loop, plus catching the cases where idle loops
shut down the CPU. Note that this applies to synchronize_sched() as
well as synchronize_rcu().
> I can say I use it a lot. I'm still needing a way to set the function pid
> recorder to 0 since the conversion to the pid structure. There are times
> I only want to trace the interrupts that happen in the idle loop.
It would be OK to trace the interrupts, just not the idle loop itself.
Thanx, Paul
On Fri, 2009-04-03 at 10:12 -0400, Steven Rostedt wrote:
> On Fri, 3 Apr 2009, Ingo Molnar wrote:
>
> >
> > * Tom Zanussi <[email protected]> wrote:
> >
> > > On Wed, 2009-04-01 at 14:24 +0200, Ingo Molnar wrote:
> > > > * Tom Zanussi <[email protected]> wrote:
> > > >
> > > > > This patch adds code allowing the event filter to be set only if
> > > > > there's no active tracing going on.
> > > >
> > > > > --- a/kernel/trace/trace_events.c
> > > > > +++ b/kernel/trace/trace_events.c
> > > > > @@ -498,6 +498,9 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
> > > > > struct filter_pred *pred;
> > > > > int err;
> > > > >
> > > > > + if (tracing_is_enabled() && (!tracer_is_nop() || call->enabled))
> > > > > + return -EBUSY;
> > > >
> > > > hm, but it would be the normal use-case to set filters on the fly.
> > > > To experiment around with them and shape them until the output is
> > > > just right. Having to turn the tracer on/off during that seems quite
> > > > counterproductive to that use-case.
> > > >
> > >
> > > I didn't see anything that could be used to temporarily disable
> > > tracing (tracing_stop() and tracing_start() are 'quick' versions
> > > that mostly just disable recording), so did it this way to avoid
> > > adding any overhead to the filter-checking code.
> > >
> > > But anyway, I'll post a new patch shortly that uses rcu and does
> > > allow the filters to be set on the fly.
> >
> > that's a very intelligent way to do it!
> >
> > There's a theoretical problem though: what if we put a filtered
> > tracepoint into the RCU code? Especially if that tracepoint is in
> > the common function-tracer callback affecting all kernel functions.
> > I've Cc:-ed Paul. I think the quiescent state logic should handle
> > this just fine, but i'm not 100% sure.
>
> I commented about this too. I feel the safest way is to simply use
> preempt_disable, and instead of synchronize_rcu, we can use
> synchronize_sched, which should have the same effect.
>
Hmm, after reading Paul's replies, it sounds like this approach might be
more trouble than it's worth. Maybe going back to the idea of
temporarily stopping/starting tracing would be a better idea, but with a
little more heavyweight version of the current 'quick' tracing
start/stop (one that would prevent entering the tracing functions and
filter_check_discard()).
I was thinking it would be something like:
stop_tracing();
current_tracer->stop(); /* unregister tracepoints, etc */
remove filter
current_tracer->start(); /* reregister tracepoints, etc */
start_tracing();
The struct tracer comments suggest that the stop()/start()
ops are meant for pausing (I'd guess for things like this), but some of
the tracers don't implement them.
For the events in the event tracer, it would be something like:
stop_tracing();
call->unregfunc(); /* unregister tracepoint */
remove filter
call->regfunc(); /* reregister tracepoint */
start_tracing();
If that makes sense, I can try it that way instead.
Tom
On Sat, 4 Apr 2009, Tom Zanussi wrote:
>
> Hmm, after reading Paul's replies, it sounds like this approach might be
> more trouble than it's worth. Maybe going back to the idea of
> temporarily stopping/starting tracing would be a better idea, but with a
> little more heavyweight version of the current 'quick' tracing
> start/stop (one that would prevent entering the tracing functions and
> filter_check_discard()).
Actually, I forgot what general problem we are avoiding here with the
RCU locks. Could you explain that again, just so that I can get a better
idea without having to read between the lines of the previous messages in
this thread?
>
> I was thinking it would be something like:
>
> stop_tracing();
> current_tracer->stop(); /* unregister tracepoints, etc */
>
> remove filter
>
> current_tracer->start(); /* reregister tracepoints, etc */
> start_tracing();
This used to be the way start and stop worked, but I'm trying to
make them more lightweight. I've been wanting start/stop to be called
by start_tracing() and stop_tracing(), and those should be able to be
called in any context. But registering and unregistering tracepoints takes
mutexes, which cannot be done in atomic context.
There are still some tracers whose start/stop uses sleeping code, but
in the long run I want the tracers' start/stop functions to be light.
>
> The struct tracer comments suggest that the stop()/start()
> ops are meant for pausing, I'd guess for things like this, but some of
> the tracers don't implement them.
Yeah, they should, but leaving out the start/stop functions is just
the tracer's way of saying "I don't need to pause". :-/
>
> For the events in the event tracer, it would be something like:
>
> stop_tracing();
> call->unregfunc(); /* unregister tracepoint */
>
> remove filter
>
> call->regfunc(); /* reregister tracepoint */
> start_tracing();
>
> If that makes sense, I can try it that way instead.
>
I'll comment about this if I get that explanation of the problem again ;-)
-- Steve
On Sat, Apr 04, 2009 at 11:49:30AM -0400, Steven Rostedt wrote:
> On Sat, 4 Apr 2009, Tom Zanussi wrote:
> >
> > Hmm, after reading Paul's replies, it sounds like this approach might be
> > more trouble than it's worth. Maybe going back to the idea of
> > temporarily stopping/starting tracing would be a better idea, but with a
> > little more heavyweight version of the current 'quick' tracing
> > start/stop (one that would prevent entering the tracing functions and
> > filter_check_discard()).
>
> Actually, I forgot what the general problem we are avoiding here with the
> RCU locks. Could you explain that again. Just so that I can get a better
> idea without having to read between the lines of the previous messages in
> this thread.
I would be interested in hearing this as well.
Thanx, Paul
> > I was thinking it would be something like:
> >
> > stop_tracing();
> > current_tracer->stop(); /* unregister tracepoints, etc */
> >
> > remove filter
> >
> > current_tracer->start(); /* reregister tracepoints, etc */
> > start_tracing();
>
> This use to be the way start and stop worked, but I'm trying to
> make them more light weight. I've been wanting start/stop to be called
> by start_tracing() and stop_tracing() and those should be able to be
> called in any context. But registering and unregistering tracepoints calls
> mutexes, which can not be done in atomic context.
>
> There are still some tracers that have start/stop using sleeping code, but
> in the long run I want the tracers start/stop functions to be light.
>
>
> >
> > The struct tracer comments suggest that the stop()/start()
> > ops are meant for pausing, I'd guess for things like this, but some of
> > the tracers don't implement them.
>
> Yeah, They should, but leaving out the start/stop functions, is just
> the tracers way of saying, I don't need to pause. :-/
>
> >
> > For the events in the event tracer, it would be something like:
> >
> > stop_tracing();
> > call->unregfunc(); /* unregister tracepoint */
> >
> > remove filter
> >
> > call->regfunc(); /* reregister tracepoint */
> > start_tracing();
> >
> > If that makes sense, I can try it that way instead.
> >
>
> I'll comment about this if I get that explanation of the problem again ;-)
>
> -- Steve
>
On Sat, 2009-04-04 at 11:49 -0400, Steven Rostedt wrote:
> On Sat, 4 Apr 2009, Tom Zanussi wrote:
> >
> > Hmm, after reading Paul's replies, it sounds like this approach might be
> > more trouble than it's worth. Maybe going back to the idea of
> > temporarily stopping/starting tracing would be a better idea, but with a
> > little more heavyweight version of the current 'quick' tracing
> > start/stop (one that would prevent entering the tracing functions and
> > filter_check_discard()).
>
>
> Actually, I forgot what the general problem we are avoiding here with the
> RCU locks. Could you explain that again. Just so that I can get a better
> idea without having to read between the lines of the previous messages in
> this thread.
>
Basically the problem is that the tracing functions call
filter_match_preds(call,...) where call->preds is an array of predicates
that get checked to determine whether the current event matches or not.
When an existing filter is deleted (or an old one replaced), the
call->preds array is freed and set to NULL (which happens only via a
write to the 'filter' debugfs file). So without any protection, while
one cpu is freeing the preds array, the others may still be using it,
and if so, it will crash the box. You can easily see the problem with
e.g. the function tracer:
# echo function > /debug/tracing/current_tracer
Function tracing is now live
# echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
No problem, no preds are freed the first time
# echo 0 > /debug/tracing/events/ftrace/function/filter
Crash.
My first patch took the safe route and completely disallowed filters
from being set while any tracing was live, i.e. you had to, for example,
echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
This wasn't great for usability, though - it would be much nicer to be
able to remove or set new filters on the fly, while tracing is active,
which rcu seemed perfect for - the preds wouldn't actually be destroyed
until all the current users were finished with them. My second patch
implemented that and it seemed to nicely fix the problem, but it
apparently can cause other problems...
So assuming we can't use rcu for this, it would be nice to have a way to
'pause' tracing so the current filter can be removed i.e. some version
of stop_trace()/start_trace() that make sure nothing is still executing
or can enter filter_match_preds() while the current call->preds is being
destroyed. Seems like it would be straightforward to implement for the
event tracer, since each event maps to a tracepoint that could be
temporarily unregistered/reregistered, but maybe not so easy for the
ftrace tracers...
Tom
>
> >
> > I was thinking it would be something like:
> >
> > stop_tracing();
> > current_tracer->stop(); /* unregister tracepoints, etc */
> >
> > remove filter
> >
> > current_tracer->start(); /* reregister tracepoints, etc */
> > start_tracing();
>
> This use to be the way start and stop worked, but I'm trying to
> make them more light weight. I've been wanting start/stop to be called
> by start_tracing() and stop_tracing() and those should be able to be
> called in any context. But registering and unregistering tracepoints calls
> mutexes, which can not be done in atomic context.
> There are still some tracers that have start/stop using sleeping code, but
> in the long run I want the tracers start/stop functions to be light.
>
>
> >
> > The struct tracer comments suggest that the stop()/start()
> > ops are meant for pausing, I'd guess for things like this, but some of
> > the tracers don't implement them.
>
> Yeah, They should, but leaving out the start/stop functions, is just
> the tracers way of saying, I don't need to pause. :-/
>
> >
> > For the events in the event tracer, it would be something like:
> >
> > stop_tracing();
> > call->unregfunc(); /* unregister tracepoint */
> >
> > remove filter
> >
> > call->regfunc(); /* reregister tracepoint */
> > start_tracing();
> >
> > If that makes sense, I can try it that way instead.
> >
>
> I'll comment about this if I get that explanation of the problem again ;-)
>
> -- Steve
>
On Sun, Apr 05, 2009 at 02:34:25AM -0500, Tom Zanussi wrote:
> On Sat, 2009-04-04 at 11:49 -0400, Steven Rostedt wrote:
> > On Sat, 4 Apr 2009, Tom Zanussi wrote:
> > >
> > > Hmm, after reading Paul's replies, it sounds like this approach might be
> > > more trouble than it's worth. Maybe going back to the idea of
> > > temporarily stopping/starting tracing would be a better idea, but with a
> > > little more heavyweight version of the current 'quick' tracing
> > > start/stop (one that would prevent entering the tracing functions and
> > > filter_check_discard()).
> >
> >
> > Actually, I forgot what the general problem we are avoiding here with the
> > RCU locks. Could you explain that again. Just so that I can get a better
> > idea without having to read between the lines of the previous messages in
> > this thread.
> >
>
> Basically the problem is that the tracing functions call
> filter_match_preds(call,...) where call->preds is an array of predicates
> that get checked to determine whether the current event matches or not.
> When an existing filter is deleted (or an old one replaced), the
> call->preds array is freed and set to NULL (which happens only via a
> write to the 'filter' debugfs file). So without any protection, while
> one cpu is freeing the preds array, the others may still be using it,
> and if so, it will crash the box. You can easily see the problem with
> e.g. the function tracer:
>
> # echo function > /debug/tracing/current_tracer
>
> Function tracing is now live
>
> # echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
>
> No problem, no preds are freed the first time
>
> # echo 0 > /debug/tracing/events/ftrace/function/filter
>
> Crash.
>
> My first patch took the safe route and completely disallowed filters
> from being set when any tracing was live i.e. you had to for example
> echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
>
> This wasn't great for usability, though - it would be much nicer to be
> able to remove or set new filters on the fly, while tracing is active,
> which rcu seemed perfect for - the preds wouldn't actually be destroyed
> until all the current users were finished with them. My second patch
> implemented that and it seemed to nicely fix the problem, but it
> apparently can cause other problems...
>
> So assuming we can't use rcu for this, it would be nice to have a way to
> 'pause' tracing so the current filter can be removed i.e. some version
> of stop_trace()/start_trace() that make sure nothing is still executing
> or can enter filter_match_preds() while the current call->preds is being
> destroyed. Seems like it would be straightforward to implement for the
> event tracer, since each event maps to a tracepoint that could be
> temporarily unregistered/reregistered, but maybe not so easy for the
> ftrace tracers...
In principle, it would be possible to rework RCU so that instead of the
whole idle loop being a quiescent state, there is a single quiescent state
at one point in each idle loop. The reason that I have been avoiding this
is that there are a lot of idle loops out there, and it would be a bit
annoying to (1) find them all and update them and (2) keep track of all of
them to ensure that new ones cannot slip in without the quiescent state.
But it could be done if the need is there. Simple enough change.
The following patch shows the general approach, assuming that CPUs
are never put to sleep without entering nohz mode.
Thoughts?
Thanx, Paul
From 7e08c37b20cb3d93ba67f8ad5d46f2c38acb8fe5 Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <[email protected]>
Date: Sun, 5 Apr 2009 10:09:54 -0700
Subject: [PATCH] Make idle loop (mostly) safe for RCU read-side critical sections.
Not for inclusion, demo only. Untested, probably fails to compile.
This patch is for demonstration purposes only. It adds a facility to
rcutree.c to allow RCU read-side critical sections to be used in
idle loops, as long as those RCU read-side critical sections do not
lap over the call to rcu_idle().
If this were a real patch, it would have the following:
o A config variable to allow architectures to opt out of this
sort of behavior. (But then again, maybe not.)
o Follow-up patches that added a call to rcu_idle() to each
idle loop in the kernel, probably grouped by architecture.
o Documentation updates to explain the new loosened restrictions
regarding RCU read-side critical sections and idle loops.
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/process.c | 1 +
include/linux/rcupdate.h | 1 +
kernel/rcutree.c | 21 ++++++++++++++-------
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 156f875..adbaf13 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -310,6 +310,7 @@ void default_idle(void)
current_thread_info()->status |= TS_POLLING;
trace_power_end(&it);
} else {
+ rcu_idle();
local_irq_enable();
/* loop is done by the caller */
cpu_relax();
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 528343e..3905f54 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -265,6 +265,7 @@ extern void synchronize_rcu(void);
extern void rcu_barrier(void);
extern void rcu_barrier_bh(void);
extern void rcu_barrier_sched(void);
+extern void rcu_idle(void);
/* Internal to kernel */
extern void rcu_init(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 97ce315..4c61b71 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -937,6 +937,17 @@ static void rcu_do_batch(struct rcu_data *rdp)
}
/*
+ * Called from each idle loop to enable RCU to treat the idle loop as
+ * a quiescent state. Note that this code assumes that idle CPUs continue
+ * executing instructions until they enter nohz mode.
+ */
+void rcu_idle(void)
+{
+ rcu_qsctr_inc(cpu);
+ rcu_bh_qsctr_inc(cpu);
+}
+
+/*
* Check to see if this CPU is in a non-context-switch quiescent state
* (user mode or idle loop for rcu, non-softirq execution for rcu_bh).
* Also schedule the RCU softirq handler.
@@ -947,15 +958,11 @@ static void rcu_do_batch(struct rcu_data *rdp)
*/
void rcu_check_callbacks(int cpu, int user)
{
- if (user ||
- (idle_cpu(cpu) && rcu_scheduler_active &&
- !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
+ if (user) {
/*
- * Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so count it.
+ * Get here if this CPU took its interrupt from user mode.
+ * In this case, the CPU is in a quiescent state, so count it.
*
* No memory barrier is required here because both
* rcu_qsctr_inc() and rcu_bh_qsctr_inc() reference
--
1.5.2.5
On Sun, 5 Apr 2009, Paul E. McKenney wrote:
> > Basically the problem is that the tracing functions call
> > filter_match_preds(call,...) where call->preds is an array of predicates
> > that get checked to determine whether the current event matches or not.
> > When an existing filter is deleted (or an old one replaced), the
> > call->preds array is freed and set to NULL (which happens only via a
> > write to the 'filter' debugfs file). So without any protection, while
> > one cpu is freeing the preds array, the others may still be using it,
> > and if so, it will crash the box. You can easily see the problem with
> > e.g. the function tracer:
> >
> > # echo function > /debug/tracing/current_tracer
> >
> > Function tracing is now live
> >
> > # echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
> >
> > No problem, no preds are freed the first time
> >
> > # echo 0 > /debug/tracing/events/ftrace/function/filter
> >
> > Crash.
> >
> > My first patch took the safe route and completely disallowed filters
> > from being set when any tracing was live i.e. you had to for example
> > echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
> >
> > This wasn't great for usability, though - it would be much nicer to be
> > able to remove or set new filters on the fly, while tracing is active,
> > which rcu seemed perfect for - the preds wouldn't actually be destroyed
> > until all the current users were finished with them. My second patch
> > implemented that and it seemed to nicely fix the problem, but it
> > apparently can cause other problems...
The problem is that function tracing also traces the rcu calls. Even though
the function tracer protects against recursion, by adding rcu locks to the
function tracer we have just doubled the overhead for it. Every function
trace will call rcu_read_lock(), then that would be traced too, and the
function tracer would see that it is recursive and return. All this is
added overhead to _every_ function!
I do not understand why my recommendation is not used. All tracers require
preemption to be disabled. Simply remove the pred from the list, do a
synchronize_sched(), then set it to NULL. The update is done by userland;
synchronizing a schedule should not be that noticeable.
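[Sketched as kernel-style code, the suggestion above looks roughly like this.
This is a hedged sketch, not the actual implementation: filter_replace_preds()
and free_preds() are hypothetical names, and the snippet will not compile
outside a kernel tree of that era.]

```c
/* Sketch only -- hypothetical helpers, kernel APIs of the era.
 * Tracers call filter_match_preds() with preemption disabled, so
 * synchronize_sched() acts as a full grace period: once it returns,
 * no CPU can still be walking the old predicate array. */
static void filter_replace_preds(struct ftrace_event_call *call,
				 struct filter_pred **new_preds)
{
	struct filter_pred **old_preds = call->preds;

	call->preds = new_preds;	/* readers now see the new array (or NULL) */
	synchronize_sched();		/* wait out all preempt-disabled readers */
	free_preds(old_preds);		/* hypothetical: free the old array */
}
```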
> >
> > So assuming we can't use rcu for this, it would be nice to have a way to
> > 'pause' tracing so the current filter can be removed i.e. some version
> > of stop_trace()/start_trace() that make sure nothing is still executing
> > or can enter filter_match_preds() while the current call->preds is being
> > destroyed. Seems like it would be straightforward to implement for the
> > event tracer, since each event maps to a tracepoint that could be
> > temporarily unregistered/reregistered, but maybe not so easy for the
> > ftrace tracers...
>
> In principle, it would be possible to rework RCU so that instead of the
> whole idle loop being a quiescent state, there is a single quiescent state
> at one point in each idle loop. The reason that I have been avoiding this
> is that there are a lot of idle loops out there, and it would be a bit
> annoying to (1) find them all and update them and (2) keep track of all of
> them to ensure that new ones cannot slip in without the quiescent state.
>
> But it could be done if the need is there. Simple enough change.
> The following patch shows the general approach, assuming that CPUs
> are never put to sleep without entering nohz mode.
>
> Thoughts?
I think using synchronize_sched() should be good enough for what we need.
-- Steve
On Mon, Apr 06, 2009 at 11:59:30AM -0400, Steven Rostedt wrote:
> On Sun, 5 Apr 2009, Paul E. McKenney wrote:
> > > Basically the problem is that the tracing functions call
> > > filter_match_preds(call,...) where call->preds is an array of predicates
> > > that get checked to determine whether the current event matches or not.
> > > When an existing filter is deleted (or an old one replaced), the
> > > call->preds array is freed and set to NULL (which happens only via a
> > > write to the 'filter' debugfs file). So without any protection, while
> > > one cpu is freeing the preds array, the others may still be using it,
> > > and if so, it will crash the box. You can easily see the problem with
> > > e.g. the function tracer:
> > >
> > > # echo function > /debug/tracing/current_tracer
> > >
> > > Function tracing is now live
> > >
> > > # echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
> > >
> > > No problem, no preds are freed the first time
> > >
> > > # echo 0 > /debug/tracing/events/ftrace/function/filter
> > >
> > > Crash.
> > >
> > > My first patch took the safe route and completely disallowed filters
> > > from being set when any tracing was live i.e. you had to for example
> > > echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
> > >
> > > This wasn't great for usability, though - it would be much nicer to be
> > > able to remove or set new filters on the fly, while tracing is active,
> > > which rcu seemed perfect for - the preds wouldn't actually be destroyed
> > > until all the current users were finished with them. My second patch
> > > implemented that and it seemed to nicely fix the problem, but it
> > > apparently can cause other problems...
>
> The problem is that function tracing also traces the rcu calls. Even though
> the function tracer protects against recursion, by adding rcu locks to the
> function tracer we have just doubled the overhead for it. Every function
> trace will call rcu_read_lock(), then that would be traced too, and the
> function tracer would see that it is recursive and return. All this is
> added overhead to _every_ function!
>
> I do not understand why my recommendation is not used. All tracers require
> preemption to be disabled. Simply remove the pred from the list, do a
> synchronize_sched(), then set it to NULL. The update is done by userland;
> synchronizing a schedule should not be that noticeable.
The only caution is that synchronize_sched() ignores preempt-disable
sequences in the idle loop. The reason for this is that synchronize_sched()
maps to synchronize_rcu() for rcuclassic and rcutree.
So, if you need to make synchronize_sched() pay attention to
preempt-disable sequences in the idle loop, something similar to the
patch to RCU that I sent earlier (adding explicit rcu_idle() call to
each idle loop) would be required.
> > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > 'pause' tracing so the current filter can be removed i.e. some version
> > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > or can enter filter_match_preds() while the current call->preds is being
> > > destroyed. Seems like it would be straightforward to implement for the
> > > event tracer, since each event maps to a tracepoint that could be
> > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > ftrace tracers...
> >
> > In principle, it would be possible to rework RCU so that instead of the
> > whole idle loop being a quiescent state, there is a single quiescent state
> > at one point in each idle loop. The reason that I have been avoiding this
> > is that there are a lot of idle loops out there, and it would be a bit
> > annoying to (1) find them all and update them and (2) keep track of all of
> > them to ensure that new ones cannot slip in without the quiescent state.
> >
> > But it could be done if the need is there. Simple enough change.
> > The following patch shows the general approach, assuming that CPUs
> > are never put to sleep without entering nohz mode.
> >
> > Thoughts?
>
> I think using synchronize_sched() should be good enough for what we need.
Again, as long as either (1) you are OK with synchronize_sched()
ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
to add something like an rcu_idle() call in each idle loop.
Thanx, Paul
On Mon, 6 Apr 2009, Paul E. McKenney wrote:
> On Mon, Apr 06, 2009 at 11:59:30AM -0400, Steven Rostedt wrote:
> > On Sun, 5 Apr 2009, Paul E. McKenney wrote:
> > > > Basically the problem is that the tracing functions call
> > > > filter_match_preds(call,...) where call->preds is an array of predicates
> > > > that get checked to determine whether the current event matches or not.
> > > > When an existing filter is deleted (or an old one replaced), the
> > > > call->preds array is freed and set to NULL (which happens only via a
> > > > write to the 'filter' debugfs file). So without any protection, while
> > > > one cpu is freeing the preds array, the others may still be using it,
> > > > and if so, it will crash the box. You can easily see the problem with
> > > > e.g. the function tracer:
> > > >
> > > > # echo function > /debug/tracing/current_tracer
> > > >
> > > > Function tracing is now live
> > > >
> > > > # echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
> > > >
> > > > No problem, no preds are freed the first time
> > > >
> > > > # echo 0 > /debug/tracing/events/ftrace/function/filter
> > > >
> > > > Crash.
> > > >
> > > > My first patch took the safe route and completely disallowed filters
> > > > from being set when any tracing was live i.e. you had to for example
> > > > echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
> > > >
> > > > This wasn't great for usability, though - it would be much nicer to be
> > > > able to remove or set new filters on the fly, while tracing is active,
> > > > which rcu seemed perfect for - the preds wouldn't actually be destroyed
> > > > until all the current users were finished with them. My second patch
> > > > implemented that and it seemed to nicely fix the problem, but it
> > > > apparently can cause other problems...
> >
> > The problem is that function tracing also traces the rcu calls. Even though
> > the function tracer protects against recursion, by adding rcu locks to the
> > function tracer we have just doubled the overhead for it. Every function
> > trace will call rcu_read_lock(), then that would be traced too, and the
> > function tracer would see that it is recursive and return. All this is
> > added overhead to _every_ function!
> >
> > I do not understand why my recommendation is not used. All tracers require
> > preemption to be disabled. Simply remove the pred from the list, do a
> > synchronize_sched(), then set it to NULL. The update is done by userland;
> > synchronizing a schedule should not be that noticeable.
>
> The only caution is that synchronize_sched() ignores preempt-disable
> sequences in the idle loop. The reason for this is that synchronize_sched()
> maps to synchronize_rcu() for rcuclassic and rcutree.
>
> So, if you need to make synchronize_sched() pay attention to
> preempt-disable sequences in the idle loop, something similar to the
> patch to RCU that I sent earlier (adding explicit rcu_idle() call to
> each idle loop) would be required.
Ah! OK, now I understand the issue :-)
>
> > > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > > 'pause' tracing so the current filter can be removed i.e. some version
> > > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > > or can enter filter_match_preds() while the current call->preds is being
> > > > destroyed. Seems like it would be straightforward to implement for the
> > > > event tracer, since each event maps to a tracepoint that could be
> > > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > > ftrace tracers...
> > >
> > > In principle, it would be possible to rework RCU so that instead of the
> > > whole idle loop being a quiescent state, there is a single quiescent state
> > > at one point in each idle loop. The reason that I have been avoiding this
> > > is that there are a lot of idle loops out there, and it would be a bit
> > > annoying to (1) find them all and update them and (2) keep track of all of
> > > them to ensure that new ones cannot slip in without the quiescent state.
> > >
> > > But it could be done if the need is there. Simple enough change.
> > > The following patch shows the general approach, assuming that CPUs
> > > are never put to sleep without entering nohz mode.
> > >
> > > Thoughts?
> >
> > I think using synchronize_sched() should be good enough for what we need.
>
> Again, as long as either (1) you are OK with synchronize_sched()
> ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
> to add something like an rcu_idle() call in each idle loop.
3) add "notrace" to the idle functions ;-)
But perhaps the rcu_idle might be the best idea.
-- Steve
On Mon, Apr 06, 2009 at 03:30:32PM -0400, Steven Rostedt wrote:
>
> On Mon, 6 Apr 2009, Paul E. McKenney wrote:
>
> > On Mon, Apr 06, 2009 at 11:59:30AM -0400, Steven Rostedt wrote:
> > > On Sun, 5 Apr 2009, Paul E. McKenney wrote:
> > > > > Basically the problem is that the tracing functions call
> > > > > filter_match_preds(call,...) where call->preds is an array of predicates
> > > > > that get checked to determine whether the current event matches or not.
> > > > > When an existing filter is deleted (or an old one replaced), the
> > > > > call->preds array is freed and set to NULL (which happens only via a
> > > > > write to the 'filter' debugfs file). So without any protection, while
> > > > > one cpu is freeing the preds array, the others may still be using it,
> > > > > and if so, it will crash the box. You can easily see the problem with
> > > > > e.g. the function tracer:
> > > > >
> > > > > # echo function > /debug/tracing/current_tracer
> > > > >
> > > > > Function tracing is now live
> > > > >
> > > > > # echo 'common_pid == 0' > /debug/tracing/events/ftrace/function/filter
> > > > >
> > > > > No problem, no preds are freed the first time
> > > > >
> > > > > # echo 0 > /debug/tracing/events/ftrace/function/filter
> > > > >
> > > > > Crash.
> > > > >
> > > > > My first patch took the safe route and completely disallowed filters
> > > > > from being set when any tracing was live i.e. you had to for example
> > > > > echo 0 > tracing_enabled or echo 0 > enable for a particular event, etc.
> > > > >
> > > > > This wasn't great for usability, though - it would be much nicer to be
> > > > > able to remove or set new filters on the fly, while tracing is active,
> > > > > which rcu seemed perfect for - the preds wouldn't actually be destroyed
> > > > > until all the current users were finished with them. My second patch
> > > > > implemented that and it seemed to nicely fix the problem, but it
> > > > > apparently can cause other problems...
> > >
> > > The problem is that function tracing also traces the rcu calls. Even though
> > > the function tracer protects against recursion, by adding rcu locks to the
> > > function tracer we have just doubled the overhead for it. Every function
> > > trace will call rcu_read_lock(), then that would be traced too, and the
> > > function tracer would see that it is recursive and return. All this is
> > > added overhead to _every_ function!
> > >
> > > I do not understand why my recommendation is not used. All tracers require
> > > preemption to be disabled. Simply remove the pred from the list, do a
> > > synchronize_sched(), then set it to NULL. The update is done by userland;
> > > synchronizing a schedule should not be that noticeable.
> >
> > The only caution is that synchronize_sched() ignores preempt-disable
> > sequences in the idle loop. The reason for this is that synchronize_sched()
> > maps to synchronize_rcu() for rcuclassic and rcutree.
> >
> > So, if you need to make synchronize_sched() pay attention to
> > preempt-disable sequences in the idle loop, something similar to the
> > patch to RCU that I sent earlier (adding explicit rcu_idle() call to
> > each idle loop) would be required.
>
> Ah! OK, now I understand the issue :-)
>
> >
> > > > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > > > 'pause' tracing so the current filter can be removed i.e. some version
> > > > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > > > or can enter filter_match_preds() while the current call->preds is being
> > > > > destroyed. Seems like it would be straightforward to implement for the
> > > > > event tracer, since each event maps to a tracepoint that could be
> > > > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > > > ftrace tracers...
> > > >
> > > > In principle, it would be possible to rework RCU so that instead of the
> > > > whole idle loop being a quiescent state, there is a single quiescent state
> > > > at one point in each idle loop. The reason that I have been avoiding this
> > > > is that there are a lot of idle loops out there, and it would be a bit
> > > > annoying to (1) find them all and update them and (2) keep track of all of
> > > > them to ensure that new ones cannot slip in without the quiescent state.
> > > >
> > > > But it could be done if the need is there. Simple enough change.
> > > > The following patch shows the general approach, assuming that CPUs
> > > > are never put to sleep without entering nohz mode.
> > > >
> > > > Thoughts?
> > >
> > > I think using synchronize_sched() should be good enough for what we need.
> >
> > Again, as long as either (1) you are OK with synchronize_sched()
> > ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
> > to add something like an rcu_idle() call in each idle loop.
>
> 3) add "notrace" to the idle functions ;-)
>
> But perhaps the rcu_idle might be the best idea.
And tracing the idle time is also sometimes very useful :-)
> -- Steve
>
On Mon, 6 Apr 2009, Frederic Weisbecker wrote:
> > >
> > > > > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > > > > 'pause' tracing so the current filter can be removed i.e. some version
> > > > > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > > > > or can enter filter_match_preds() while the current call->preds is being
> > > > > > destroyed. Seems like it would be straightforward to implement for the
> > > > > > event tracer, since each event maps to a tracepoint that could be
> > > > > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > > > > ftrace tracers...
> > > > >
> > > > > In principle, it would be possible to rework RCU so that instead of the
> > > > > whole idle loop being a quiescent state, there is a single quiescent state
> > > > > at one point in each idle loop. The reason that I have been avoiding this
> > > > > is that there are a lot of idle loops out there, and it would be a bit
> > > > > annoying to (1) find them all and update them and (2) keep track of all of
> > > > > them to ensure that new ones cannot slip in without the quiescent state.
> > > > >
> > > > > But it could be done if the need is there. Simple enough change.
> > > > > The following patch shows the general approach, assuming that CPUs
> > > > > are never put to sleep without entering nohz mode.
> > > > >
> > > > > Thoughts?
> > > >
> > > > I think using synchronize_sched() should be good enough for what we need.
> > >
> > > Again, as long as either (1) you are OK with synchronize_sched()
> > > ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
> > > to add something like an rcu_idle() call in each idle loop.
> >
> > 3) add "notrace" to the idle functions ;-)
> >
> > But perhaps the rcu_idle might be the best idea.
>
>
> And tracing the idle time is also sometimes very useful :-)
Agreed. I guess choice 2 is the best answer.
-- Steve
On Mon, Apr 06, 2009 at 03:52:55PM -0400, Steven Rostedt wrote:
>
> On Mon, 6 Apr 2009, Frederic Weisbecker wrote:
> > > >
> > > > > > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > > > > > 'pause' tracing so the current filter can be removed i.e. some version
> > > > > > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > > > > > or can enter filter_match_preds() while the current call->preds is being
> > > > > > > destroyed. Seems like it would be straightforward to implement for the
> > > > > > > event tracer, since each event maps to a tracepoint that could be
> > > > > > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > > > > > ftrace tracers...
> > > > > >
> > > > > > In principle, it would be possible to rework RCU so that instead of the
> > > > > > whole idle loop being a quiescent state, there is a single quiescent state
> > > > > > at one point in each idle loop. The reason that I have been avoiding this
> > > > > > is that there are a lot of idle loops out there, and it would be a bit
> > > > > > annoying to (1) find them all and update them and (2) keep track of all of
> > > > > > them to ensure that new ones cannot slip in without the quiescent state.
> > > > > >
> > > > > > But it could be done if the need is there. Simple enough change.
> > > > > > The following patch shows the general approach, assuming that CPUs
> > > > > > are never put to sleep without entering nohz mode.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > I think using synchronize_sched() should be good enough for what we need.
> > > >
> > > > Again, as long as either (1) you are OK with synchronize_sched()
> > > > ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
> > > > to add something like an rcu_idle() call in each idle loop.
> > >
> > > 3) add "notrace" to the idle functions ;-)
> > >
> > > But perhaps the rcu_idle might be the best idea.
> >
> >
> > And tracing the idle time is also sometimes very useful :-)
>
> Agreed. I guess choice 2 is the best answer.
Fair enough!
Would one of you please check the placement of the rcu_idle() in the
patch? Patch reproduced below for convenience.
Thanx, Paul
------------------------------------------------------------------------
From 7e08c37b20cb3d93ba67f8ad5d46f2c38acb8fe5 Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <[email protected]>
Date: Sun, 5 Apr 2009 10:09:54 -0700
Subject: [PATCH] Make idle loop (mostly) safe for RCU read-side critical sections.
Not for inclusion, demo only. Untested, probably fails to compile.
This patch is for demonstration purposes only. It adds a facility to
rcutree.c to allow RCU read-side critical sections to be used in
idle loops, as long as those RCU read-side critical sections do not
lap over the call to rcu_idle().
If this were a real patch, it would have the following:
o A config variable to allow architectures to opt out of this
sort of behavior. (But then again, maybe not.)
o Follow-up patches that added a call to rcu_idle() to each
idle loop in the kernel, probably grouped by architecture.
o Documentation updates to explain the new loosened restrictions
regarding RCU read-side critical sections and idle loops.
Signed-off-by: Paul E. McKenney <[email protected]>
---
arch/x86/kernel/process.c | 1 +
include/linux/rcupdate.h | 1 +
kernel/rcutree.c | 21 ++++++++++++++-------
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 156f875..adbaf13 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -310,6 +310,7 @@ void default_idle(void)
current_thread_info()->status |= TS_POLLING;
trace_power_end(&it);
} else {
+ rcu_idle();
local_irq_enable();
/* loop is done by the caller */
cpu_relax();
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 528343e..3905f54 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -265,6 +265,7 @@ extern void synchronize_rcu(void);
extern void rcu_barrier(void);
extern void rcu_barrier_bh(void);
extern void rcu_barrier_sched(void);
+extern void rcu_idle(void);
/* Internal to kernel */
extern void rcu_init(void);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 97ce315..4c61b71 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -937,6 +937,19 @@ static void rcu_do_batch(struct rcu_data *rdp)
}
/*
+ * Called from each idle loop to enable RCU to treat the idle loop as
+ * a quiescent state. Note that this code assumes that idle CPUs continue
+ * executing instructions until they enter nohz mode.
+ */
+void rcu_idle(void)
+{
+ int cpu = smp_processor_id();
+
+ rcu_qsctr_inc(cpu);
+ rcu_bh_qsctr_inc(cpu);
+}
+
+/*
* Check to see if this CPU is in a non-context-switch quiescent state
* (user mode or idle loop for rcu, non-softirq execution for rcu_bh).
* Also schedule the RCU softirq handler.
@@ -947,15 +958,11 @@ static void rcu_do_batch(struct rcu_data *rdp)
*/
void rcu_check_callbacks(int cpu, int user)
{
- if (user ||
- (idle_cpu(cpu) && rcu_scheduler_active &&
- !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
+ if (user) {
/*
- * Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so count it.
+ * Get here if this CPU took its interrupt from user mode.
+ * In this case, the CPU is in a quiescent state, so count it.
*
* No memory barrier is required here because both
* rcu_qsctr_inc() and rcu_bh_qsctr_inc() reference
--
1.5.2.5
On Mon, Apr 06, 2009 at 01:15:24PM -0700, Paul E. McKenney wrote:
> On Mon, Apr 06, 2009 at 03:52:55PM -0400, Steven Rostedt wrote:
> >
> > On Mon, 6 Apr 2009, Frederic Weisbecker wrote:
> > > > >
> > > > > > > > So assuming we can't use rcu for this, it would be nice to have a way to
> > > > > > > > 'pause' tracing so the current filter can be removed i.e. some version
> > > > > > > > of stop_trace()/start_trace() that make sure nothing is still executing
> > > > > > > > or can enter filter_match_preds() while the current call->preds is being
> > > > > > > > destroyed. Seems like it would be straightforward to implement for the
> > > > > > > > event tracer, since each event maps to a tracepoint that could be
> > > > > > > > temporarily unregistered/reregistered, but maybe not so easy for the
> > > > > > > > ftrace tracers...
> > > > > > >
> > > > > > > In principle, it would be possible to rework RCU so that instead of the
> > > > > > > whole idle loop being a quiescent state, there is a single quiescent state
> > > > > > > at one point in each idle loop. The reason that I have been avoiding this
> > > > > > > is that there are a lot of idle loops out there, and it would be a bit
> > > > > > > annoying to (1) find them all and update them and (2) keep track of all of
> > > > > > > them to ensure that new ones cannot slip in without the quiescent state.
> > > > > > >
> > > > > > > But it could be done if the need is there. Simple enough change.
> > > > > > > The following patch shows the general approach, assuming that CPUs
> > > > > > > are never put to sleep without entering nohz mode.
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > I think using synchronize_sched() should be good enough for what we need.
> > > > >
> > > > > Again, as long as either (1) you are OK with synchronize_sched()
> > > > > ignoring preempt-disable sequences in the idle loop or (2) we rework RCU
> > > > > to add something like an rcu_idle() call in each idle loop.
> > > >
> > > > 3) add "notrace" to the idle functions ;-)
> > > >
> > > > But perhaps the rcu_idle might be the best idea.
> > >
> > >
> > > And tracing the idle time is also sometimes very useful :-)
> >
> > Agreed. I guess choice 2 is the best answer.
>
> Fair enough!
>
> Would one of you please check the placement of the rcu_idle() in the
> patch? Patch reproduced below for convenience.
Hmmm... Do the start_critical_timings() and stop_critical_timings()
functions have anything to do with ftrace()?
It does not look like I can just bury RCU-idle controls in these
functions, given that they are also invoked around call_console_drivers(),
but if all the idle loops are surrounded by stop_critical_timings() and
start_critical_timings(), that would ease location of all the idle
loops. ;-)
Thanx, Paul
On Mon, 6 Apr 2009, Paul E. McKenney wrote:
>
> Hmmm... Do the start_critical_timings() and stop_critical_timings()
> functions have anything to do with ftrace()?
Yep.
>
> It does not look like I can just bury RCU-idle controls in these
> functions, given that they are also invoked around call_console_drivers(),
> but if all the idle loops are surrounded by stop_critical_timings() and
> start_critical_timings(), that would ease location of all the idle
> loops. ;-)
They are for the irqsoff tracer. Several archs will disable interrupts and
call an assembly "halt" type instruction that can be woken up by
interrupts. But this skips the irqs off check, and we end up with irqsoff
showing the idle loop with huge interrupts off latencies.
I'm not sure they are included in all archs. They may be.
-- Steve
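[For reference, the idle-loop pattern described above has roughly this shape.
This is a simplified sketch, not the literal code of any one architecture;
stop_critical_timings()/start_critical_timings() are the real ftrace hooks,
safe_halt() stands in for the arch's "sleep until interrupt" instruction.]

```c
static void arch_idle_with_halt(void)
{
	local_irq_disable();
	stop_critical_timings();	/* exclude the halt from irqsoff latency */
	safe_halt();			/* e.g. x86 "sti; hlt": wakes on an interrupt */
	start_critical_timings();	/* resume irqsoff accounting */
}
```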
On Mon, Apr 06, 2009 at 08:34:20PM -0400, Steven Rostedt wrote:
>
>
> On Mon, 6 Apr 2009, Paul E. McKenney wrote:
> >
> > Hmmm... Do the start_critical_timings() and stop_critical_timings()
> > functions have anything to do with ftrace()?
>
> Yep.
>
> > It does not look like I can just bury RCU-idle controls in these
> > functions, given that they are also invoked around call_console_drivers(),
> > but if all the idle loops are surrounded by stop_critical_timings() and
> > start_critical_timings(), that would ease location of all the idle
> > loops. ;-)
>
> They are for the irqsoff tracer. Several archs will disable interrupts and
> call an assembly "halt" type instruction that can be woken up by
> interrupts. But this skips the irqs off check, and we end up with irqsoff
> showing the idle loop with huge interrupts off latencies.
>
> I'm not sure they are included in all archs. They may be.
Not a chance! :-(
Looks like one can poke into cpu_idle() on all arches, but it looks like
I cannot even assume that I can leverage CONFIG_NO_HZ, due to the
rcu_needs_cpu() case. See below for my rough (and almost certainly
erroneous) notes thus far.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
o Have rcu_idle() call from spinning idle loops. This would
invoke rcu_qsctr_inc() and rcu_bh_qsctr_inc().
In this case, rcu_check_callbacks() need only check for
interrupt from user-level code.
o But this won't work for idle loops that yield to a hypervisor
without being in nohz mode. For these, do rcu_idle_start()
before the yield and rcu_idle_end() after the yield. These
would manipulate a per-CPU variable.
In this case, rcu_check_callbacks() needs to check for:
a. interrupt from user-level code
b. unnested interrupt while under the influence of
rcu_idle_start().
o Put rcu_idle_start() after stop_critical_timings() and put
rcu_idle_end() before start_critical_timings() in each arch's
idle loop. But maybe pick less confusing names...
o	powerpc: Looks OK.

o	s390: Looks OK -- always dyntick.

o	Super-H: Looks OK -- always dyntick.

o	x86 32-bit: Looks OK. This is cpu_idle() in
	arch/x86/kernel/process_32.c.

o	x86 64-bit: Looks OK.

o	ACPI: Looks OK.

o	x86 APM: invokes the system default idle, so OK if that is OK.

o	x86: does default_idle() in arch/x86/kernel/process.c
	need rcu_idle_start() and rcu_idle_end(), or should we
	ensure that the caller always does the right thing???

o	cpuidle_idle_call() in drivers/cpuidle/cpuidle.c
	needs rcu_idle_start() and rcu_idle_end()???

o	ARM: Looks OK -- always dyntick. See cpu_idle() in
	arch/arm/kernel/process.c.

o	Blackfin: Looks OK -- always dyntick. See cpu_idle() in
	arch/blackfin/kernel/process.c.
o	Cris: ??? cpu_idle() in arch/cris/kernel/process.c
	has a strange comment:

		* Mark this as an RCU critical section so that
		* synchronize_kernel() in the unload path waits
		* for our completion.

	Ain't gonna happen...

	But it looks like we need an rcu_idle_start() and an
	rcu_idle_end() around the call to idle().

o	H8300: cpu_idle() in arch/h8300/kernel/process.c.
	Need an rcu_idle_start() and an rcu_idle_end() around
	the call to idle(). Hopefully not messing too much with
	the exit latency.
o	IA64: cpu_idle() in arch/ia64/kernel/process.c.
	Need an rcu_idle_start() and an rcu_idle_end() around
	the call to idle(). Not sure how far around...

o	M68k: cpu_idle() in arch/m68k/kernel/process.c.
	Need an rcu_idle_start() and an rcu_idle_end() around
	the call to idle(). Hopefully not messing too much with
	the exit latency.

o	M68Knommu: Ditto, but arch/m68knommu/kernel/process.c.

o	UM: Looks OK -- always nohz.

o	M32r: cpu_idle() in arch/m32r/kernel/process.c.
	Need an rcu_idle_start() and an rcu_idle_end() around
	the call to idle(). Hopefully not messing too much with
	the exit latency.

o	MN10300: cpu_idle() in arch/mn10300/kernel/process.c.
	Need an rcu_idle_start() and an rcu_idle_end() around
	the call to idle(). Hopefully not messing too much with
	the exit latency.
@@@ Need to recheck the "nohz OK" assumption -- might have disabled it.
    Maybe make rcu_idle_start() and rcu_idle_end() no-ops in the
    CONFIG_NO_HZ case -- but the rcu_needs_cpu() case still needs to
    work. So follow up on the cpu_idle() cases.
o	Idle loops (for posterity's sake):

	o	Generic: default_idle() gets rcu_idle().

	o	Power: All invoked from cpu_idle(), which already
		contains stop_critical_timings() and
		start_critical_timings().

		o	44x: ppc44x_idle() absolutely unclear.

		o	powermac: power4_idle() is assembly code???
			Appears to need a call to rcu_idle() in
			the loop headed by label 1b:.

		o	maple: same as powermac.

		o	pseries: pSeries_setup_arch()

			o	pseries_shared_idle_sleep() needs
				rcu_idle_start() and rcu_idle_end()
				surrounding the cede_processor() call.

			o	pseries_dedicated_idle_sleep() needs
				rcu_idle_start() and rcu_idle_end()
				surrounding the cede_processor() call.
				Might also need rcu_idle() in the
				"while" loop...

		o	iseries: iSeries_setup_arch()

			o	iseries_shared_idle() seems to need
				rcu_idle_start() and rcu_idle_end()
				surrounding the yield_shared_processor()
				call.

			o	iseries_dedicated_idle() needs no
				action, as it forces nohz. ;-)

		o	pasemi: modes[] array in platforms/pasemi/idle.c

			o	idle_spin() assembly function, needs a
				call to rcu_idle() in the loop headed
				by label 1:.