Message-ID: <546DD445.5080108@hitachi.com>
Date: Thu, 20 Nov 2014 20:45:09 +0900
From: Masami Hiramatsu
Organization: Hitachi, Ltd., Japan
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Namhyung Kim, "Paul E. McKenney"
Subject: Re: [PATCH] tracing: Fix race of function probes counting
References: <20141118134643.4b550ee4@gandalf.local.home>
In-Reply-To: <20141118134643.4b550ee4@gandalf.local.home>

(2014/11/19 3:46), Steven Rostedt wrote:
> 
> The function probe counting for traceon and traceoff suffered a race
> condition where if the probe was executing on two or more CPUs at the
> same time, it could decrement the counter by more than one when
> disabling (or enabling) the tracer only once.
> 
> The way the traceon and traceoff probes are suppose to work is that
> they disable (or enable) tracing once per count. If a user were to
> echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
> schedule function was called, it would disable tracing. But the count
> should only be decremented once (to 2). Then if the user enabled tracing
> again (via tracing_on file), the next call to schedule would disable
> tracing again and the count would be decremented to 1.
> 
> But if multiple CPUS called schedule at the same time, it is possible
> that the count would be decremented more than once because of the
> simple "count--" used.
> 
> By reading the count into a local variable and using memory barriers
> we can guarantee that the count would only be decremented once per
> disable (or enable).
> 
> The stack trace probe had a similar race, but here the stack trace will
> decrement for each time it is called. But this had the read-modify-
> write race, where it could stack trace more than the number of times
> that was specified. This case we use a cmpxchg to stack trace only the
> number of times specified.
> 
> The dump probes can still use the old "update_count()" function as
> they only run once, and that is controlled by the dump logic
> itself.
> 
> Signed-off-by: Steven Rostedt

I have found a couple of typos, but basically, it looks OK.

Reviewed-by: Masami Hiramatsu

> ---
>  kernel/trace/trace_functions.c | 117 +++++++++++++++++++++++++++++++++--------
>  1 file changed, 96 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
> index a8e0c7666164..973db52eb070 100644
> --- a/kernel/trace/trace_functions.c
> +++ b/kernel/trace/trace_functions.c
> @@ -261,37 +261,74 @@ static struct tracer function_trace __tracer_data =
>  };
> 
>  #ifdef CONFIG_DYNAMIC_FTRACE
> -static int update_count(void **data)
> +static void update_traceon_count(void **data, int on)

btw, why don't you use bool instead of int?

>  {
> -	unsigned long *count = (long *)data;
> +	long *count = (long *)data;
> +	long old_count = *count;
> 
> -	if (!*count)
> -		return 0;
> +	/*
> +	 * Tracing gets disabled (or enabled) once per count.
> +	 * This function can be called at the same time on mulitple CPUs.
                                                            ^^^^^^^^ multiple
> +	 * It is fine if both disable (or enable) tracing, as disabling
> +	 * (or enabling) the second time doesn't do anything as the
> +	 * state of the tracer is already disabled (or enabled).
> +	 * What needs to be synchronized in this case is that the count
> +	 * only gets decremented once, even if the tracer is disabled
> +	 * (or enabled) twice, as the second one is really a nop.
> +	 *
> +	 * The memory barriers guarantee that we only decrement the
> +	 * counter once. First the count is read to a local variable
> +	 * and a read barrier is used to make sure that it is loaded
> +	 * before checking if the tracer is in the state we want.
> +	 * If the tracer is not in the state we want, then the count
> +	 * is guaranteed to be the old count.
> +	 *
> +	 * Next the tracer is set to the state we want (disabled or enabled)
> +	 * then a write memory barrier is used to make sure that
> +	 * the new state is visible before changing the counter by
> +	 * one minus the old counter. This guarantees that another CPU
> +	 * executing this code will see the new state before seeing
> +	 * the new counter value, and would not do anthing if the new
                                                    ^^^^^^^anything?
> +	 * counter is seen.
> +	 *
> +	 * Note, there is no synchronization between this and a user
> +	 * setting the tracing_on file. But we currently don't care
> +	 * about that.
> +	 */
> +	if (!old_count)
> +		return;
> 
> -	if (*count != -1)
> -		(*count)--;
> +	/* Make sure we see count before checking tracing state */
> +	smp_rmb();
> 
> -	return 1;
> +	if (on == !!tracing_is_on())
> +		return;
> +
> +	if (on)
> +		tracing_on();
> +	else
> +		tracing_off();
> +
> +	/* unlimited? */
> +	if (old_count == -1)
> +		return;
> +
> +	/* Make sure tracing state is visible before updating count */
> +	smp_wmb();
> +
> +	*count = old_count - 1;
>  }
> 
>  static void
>  ftrace_traceon_count(unsigned long ip, unsigned long parent_ip, void **data)
>  {
> -	if (tracing_is_on())
> -		return;
> -
> -	if (update_count(data))
> -		tracing_on();
> +	update_traceon_count(data, 1);
>  }
> 
>  static void
>  ftrace_traceoff_count(unsigned long ip, unsigned long parent_ip, void **data)
>  {
> -	if (!tracing_is_on())
> -		return;
> -
> -	if (update_count(data))
> -		tracing_off();
> +	update_traceon_count(data, 0);
>  }
> 
>  static void
> @@ -330,11 +367,49 @@ ftrace_stacktrace(unsigned long ip, unsigned long parent_ip, void **data)
>  static void
>  ftrace_stacktrace_count(unsigned long ip, unsigned long parent_ip, void **data)
>  {
> -	if (!tracing_is_on())
> -		return;
> +	long *count = (long *)data;
> +	long old_count;
> +	long new_count;
> 
> -	if (update_count(data))
> -		trace_dump_stack(STACK_SKIP);
> +	/*
> +	 * Stack traces should only execute the number of times the
> +	 * user specified in the counter.
> +	 */
> +	do {
> +
> +		if (!tracing_is_on())
> +			return;
> +
> +		old_count = *count;
> +
> +		if (!old_count)
> +			return;
> +
> +		/* unlimited? */
> +		if (old_count == -1) {
> +			trace_dump_stack(STACK_SKIP);
> +			return;
> +		}
> +
> +		new_count = old_count - 1;
> +		new_count = cmpxchg(count, old_count, new_count);
> +		if (new_count == old_count)
> +			trace_dump_stack(STACK_SKIP);
> +
> +	} while (new_count != old_count);
> +}
> +
> +static int update_count(void **data)
> +{
> +	unsigned long *count = (long *)data;
> +
> +	if (!*count)
> +		return 0;
> +
> +	if (*count != -1)
> +		(*count)--;
> +
> +	return 1;
>  }
> 
>  static void
> 
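For anyone following along, this is how I read the ordering in
update_traceon_count(). Below is only a user-space sketch of the same
idea, with C11 fences standing in for smp_rmb()/smp_wmb(); the names
tracing_state, probe_count and toggle_once() are invented for the
sketch and are not part of the patch.

/* User-space analogy of update_traceon_count(), for illustration only.
 * C11 fences stand in for the kernel's smp_rmb()/smp_wmb(), and the
 * variable names below are invented for this sketch.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool tracing_state;       /* stands in for tracing_is_on() */
static atomic_long probe_count;         /* the probe's remaining count   */

static void toggle_once(bool on)
{
	long old_count = atomic_load_explicit(&probe_count,
					      memory_order_relaxed);

	if (!old_count)
		return;

	/* Load the count before looking at the tracer state (smp_rmb()). */
	atomic_thread_fence(memory_order_acquire);

	/* Another caller already switched the state; that caller is the
	 * one that gets to consume a count, so do not decrement again. */
	if (on == atomic_load(&tracing_state))
		return;

	atomic_store(&tracing_state, on);

	if (old_count == -1)            /* -1 means unlimited */
		return;

	/* Publish the new state before the new count (smp_wmb()), so a
	 * caller that still sees the old count also sees the new state
	 * and returns above instead of decrementing a second time. */
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&probe_count, old_count - 1,
			      memory_order_relaxed);
}

(This is only meant to illustrate the load/fence/store ordering, not to
be a drop-in for the kernel code.)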
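And a similar user-space sketch of the bounded stack-trace counting,
where the cmpxchg() retry loop is what keeps the number of dumps from
exceeding the requested count even with concurrent callers. It is
simplified (the tracing_is_on() check is dropped), and do_stacktrace(),
trace_limit and stacktrace_probe() are invented names for this example:

/* User-space sketch of the cmpxchg() loop in ftrace_stacktrace_count().
 * Only a caller whose compare-and-swap wins may dump a stack trace, so
 * at most 'trace_limit' dumps happen no matter how many callers race.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_long trace_limit = 3;     /* like 'schedule:stacktrace:3' */

static void do_stacktrace(void)
{
	printf("stack trace\n");        /* stands in for trace_dump_stack() */
}

static void stacktrace_probe(void)
{
	long old_count;

	do {
		old_count = atomic_load(&trace_limit);

		if (!old_count)         /* budget is used up */
			return;

		if (old_count == -1) {  /* -1 means unlimited */
			do_stacktrace();
			return;
		}

		/* A loser of the compare-and-swap re-reads the count and
		 * retries, so the budget is never over-consumed. */
	} while (!atomic_compare_exchange_weak(&trace_limit, &old_count,
					       old_count - 1));

	do_stacktrace();
}

int main(void)
{
	for (int i = 0; i < 10; i++)
		stacktrace_probe();     /* dumps only 3 times */
	return 0;
}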
-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com