DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=date:from:to:subject:message-id:references:mime-version
         :content-type:content-disposition:in-reply-to:user-agent;
        b=kQAvzHFObEwap3j951449jtRfd3nCqlSiAa4wJAN99P2uxX+atPvwccRXNoGJmATzh
         ywbQdDdsoBik/X3P/x2Y75j0bZ6jP9e3UfbcHlu2ZTc0VV9vkiKgW3TRDaptXwZZGto7
         A2o/4RtmphH1A5GQkmcJ1lZ4P/9zSRDyDhdLw=
Date: Sat, 17 Jan 2009 14:37:39 +0100
From: Frederic Weisbecker <fweisbec@gmail.com>
To: Arnaldo Carvalho de Melo <acme@redhat.com>,
       Jens Axboe <jens.axboe@oracle.com>, Ingo Molnar <mingo@elte.hu>,
       Steven Rostedt <rostedt@goodmis.org>, linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH] ftrace interface for blktrace
Message-ID: <20090117133737.GA8413@nowhere>
References: <20090116181611.GE6562@ghostprotocols.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090116181611.GE6562@ghostprotocols.net>
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 25448
Lines: 879

On Fri, Jan 16, 2009 at 04:16:11PM -0200, Arnaldo Carvalho de Melo wrote:
> Hi Jens,
> 
> 	The patch below adds a ftrace interface for blktrace, allowing
> people to use it without any required userspace tools, using sysfs to setup the
> act_mask, pid, start_lba, end_lba.
> 	Example usage:
> 
> [root@f10-1 ~]# mount -t debugfs none /sys/kernel/debug/
> [root@f10-1 ~]# ls /sys/block/sda/sda1/trace/
> act_mask  enable  end_lba  pid  start_lba


That reminds me it would be good to provide a more powerful tracer options
Api, to set numeric or string values for the options, like the kernel
parameters..... So you would avoid to implement your own way to do it.
I should put my hands into it one day.


> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/available_tracers 
> blk function_graph wakeup function sched_switch nop
> [root@f10-1 ~]# echo blk > /sys/kernel/debug/tracing/current_tracer 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace
> # tracer: blk
> #
> # DEV   CPU TIMESTAMP     PID ACT FLG
> #  |     |     |           |   |   |
> [root@f10-1 ~]# echo 1 > /sys/block/sda/sda1/trace/enable 
> [root@f10-1 ~]# tail /sys/kernel/debug/tracing/trace
>   8,1    0   143.070085  1208  P  NS [tail]
>   8,1    0   143.070087  1208  I RBS 4154375 + 8 [tail]
>   8,1    0   143.070091  1208  A WBS 4155391 + 16 <- (8,1) 4155328
>   8,1    0   143.070091  1208  Q   R 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  G  RB 4155391 + 16 [tail]
>   8,1    0   143.070094  1208  I RBS 4155391 + 16 [tail]
>   8,1    0   143.070097  1208  U  WS [tail] 2
>   8,1    0   143.070101  1208  D  WB 4154375 + 8 [tail]
>   8,1    0   143.070341  1208  D  WB 4155391 + 16 [tail]
>   8,1    0   143.070635     0  C  RS 4154375 + 8 [0]


nice!


> [root@f10-1 ~]#
> 
> 	Filters can be set using:
> 
> [root@f10-1 ~]# echo write,sync > /sys/block/sda/sda1/trace/act_mask 
> [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/act_mask 
> 0xa

That would be better to see the thing as a string, like you set it.

 
> 	One can go on interactively adding more filters, for instance:
> 
> [root@f10-1 ~]# echo `pidof kjournald` > /sys/block/sda/sda1/trace/pid
> [root@f10-1 ~]# cat /sys/block/sda/sda1/trace/pid
> 423


Or why not simply kjournald? :-)


> 	And use the existing ftrace infrastructure to, for instance,
> replicate the btrace (blktrace -o - | blkparse) functionality by simply
> using cat:
> 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
>   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
>   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
>   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
>   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
>   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
>   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
>   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
>   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288
> CONTROL+C
> [root@f10-1 ~]#
> 
> 	This patch was done on Linus tree, I'll now work with it on
> linux-tip so that I can experiment with binary tracing, i.e to ask for
> the output of /sys/kernel/debug/tracing/trace_pipe to be raw, exactly
> the one provide by relayfs based variant, so that we will be able to
> continue using blkparse to validate the trace entries, as a debugging
> aid.


But I guess there will be some changes to bring on the tracing to make
it support binary flow...


> 	Also there are other interesting things that one gets from using
> the ftrace infrastructure, such as being able to set see traffic that
> happens in just a cpuset (/sys/kernel/debug/tracing/tracing_cpumask),
> and others that I plan to add such as what Steve did in "[PATCH 2/2]
> ftrace: add stack trace to function tracer", so that we could have
> things like (made up example):
> 
> [root@f10-1 ~]# cat /sys/kernel/debug/tracing/trace_pipe 
>   8,1    0   457.253356   423  A WBS 6335 + 8 <- (8,1) 6272
>   8,1    0   457.253360   423  Q   R 6335 + 8 [kjournald]
>   8,1    0   457.253367   423  G  RB 6335 + 8 [kjournald]
>   8,1    0   457.253372   423  I RBS 6335 + 8 [kjournald]
>  <= sync_buffer
>  <= __wait_on_bit
>  <= out_of_line_wait_on_bit
>  <= __wait_on_buffer
>  <= sync_dirty_buffer
>  <= journal_commit_transaction
>  <= kjournald
>   8,1    0   457.253376   423  A WBS 6343 + 8 <- (8,1) 6280
>   8,1    0   457.253376   423  Q   R 6343 + 8 [kjournald]
>   8,1    0   457.253378   423  M   W 6343 + 8 [kjournald]
>   8,1    0   457.253380   423  A WBS 6351 + 8 <- (8,1) 6288


Cool, but I guess you will not need this ftrace plugin to do it.
stack traces entries for tracing are already supported for a while.

 
> 	More polishing is needed, I wrote the patch this way just to
> show where things are different, with new code, for registering the
> tracer, bits stolen from the userspace blktrace, allowing config via
> sysfs at the bottom.
> 
> 	Ideally we would move this to kernel/trace/block.c, break it
> into smaller self contained patches, etc.
> 
> Comments?
> 
> - Arnaldo
> 
>  block/blktrace.c      |  578 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/partitions/check.c |    7 
>  2 files changed, 583 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blktrace.c b/block/blktrace.c
> index b0a2cae..10985f9 100644
> --- a/block/blktrace.c
> +++ b/block/blktrace.c
> @@ -25,9 +25,13 @@
>  #include <linux/time.h>
>  #include <trace/block.h>
>  #include <asm/uaccess.h>
> +#include <../kernel/trace/trace.h>


Yeah, it really needs to be moved on kernel/trace :-)


>  static unsigned int blktrace_seq __read_mostly = 1;
>  
> +static struct trace_array *blk_tr;
> +static int __read_mostly  blk_tracer_enabled;
> +
>  /* Global reference count of probes */
>  static DEFINE_MUTEX(blk_probe_mutex);
>  static atomic_t blk_probes_ref = ATOMIC_INIT(0);
> @@ -43,6 +47,9 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
>  {
>  	struct blk_io_trace *t;
>  
> +	if (!bt->rchan)
> +		return;
> +
>  	t = relay_reserve(bt->rchan, sizeof(*t) + len);
>  	if (t) {
>  		const int cpu = smp_processor_id();
> @@ -90,6 +97,9 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
>  	unsigned long flags;
>  	char *buf;
>  
> +	if (!bt->msg_data)
> +		return;
> +
>  	local_irq_save(flags);
>  	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
>  	va_start(args, fmt);
> @@ -131,13 +141,14 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		     int rw, u32 what, int error, int pdu_len, void *pdu_data)
>  {
>  	struct task_struct *tsk = current;
> +	struct ring_buffer_event *event = NULL;
>  	struct blk_io_trace *t;
>  	unsigned long flags;
>  	unsigned long *sequence;
>  	pid_t pid;
>  	int cpu;
>  
> -	if (unlikely(bt->trace_state != Blktrace_running))
> +	if (unlikely(bt->trace_state != Blktrace_running || !blk_tracer_enabled))
>  		return;
>  
>  	what |= ddir_act[rw & WRITE];
> @@ -158,6 +169,25 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	 */
>  	local_irq_save(flags);


Is the local_irq_save still needed here? Note that the ring buffer is safe by itself.

> +	if (blk_tr) {
> +		struct trace_array_cpu *data;
> +
> +		tracing_record_cmdline(current);
> +		cpu = raw_smp_processor_id();


I think you can use smp_processor_id(). There is no risk of tracing recursion here.


> +		data = blk_tr->data[cpu];
> +
> +		if (unlikely(atomic_read(&data->disabled)))
> +			goto out_local_irq_restore;
> +
> +		event = ring_buffer_lock_reserve(blk_tr->buffer,
> +						 sizeof(*t) + pdu_len, &flags);
> +		if (!event)
> +			goto out_local_irq_restore;
> +		
> +		t = ring_buffer_event_data(event);
> +		goto record_it;
> +	}
> +
>  	if (unlikely(tsk->btrace_seq != blktrace_seq))
>  		trace_note_tsk(bt, tsk);
>  
> @@ -169,19 +199,26 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
>  		t->sequence = ++(*sequence);
>  		t->time = ktime_to_ns(ktime_get());

You already have a timestamp for each entry on the ring buffer.

> +		t->cpu = cpu;


Ditto. But perhaps you can be preempted between the block action and the trace insertion.
I'm not sure of the context here.. So perhaps you must keep it.


> +record_it:
>  		t->sector = sector;
>  		t->bytes = bytes;
>  		t->action = what;
>  		t->pid = pid;
>  		t->device = bt->dev;
> -		t->cpu = cpu;
>  		t->error = error;
>  		t->pdu_len = pdu_len;
>  
>  		if (pdu_len)
>  			memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
> +
> +		if (blk_tr) {
> +			ring_buffer_unlock_commit(blk_tr->buffer, event, flags);
> +			trace_wake_up();
> +		}
>  	}
>  
> +out_local_irq_restore:
>  	local_irq_restore(flags);
>  }
>  
> @@ -888,3 +925,540 @@ static void blk_unregister_tracepoints(void)
>  
>  	tracepoint_synchronize_unregister();
>  }
> +
> +/*
> + * struct blk_io_tracer formatting routines
> + */
> +
> +static void fill_rwbs(char *rwbs, struct blk_io_trace *t)
> +{
> +        int i = 0;
> +
> +        if (t->action & BLK_TC_DISCARD)	   rwbs[i++] = 'D';
> +        else if (t->action & BLK_TC_WRITE) rwbs[i++] = 'W';
> +        else if (t->bytes)		   rwbs[i++] = 'R';
> +        else				   rwbs[i++] = 'N';
> +
> +        if (t->action & BLK_TC_AHEAD)	   rwbs[i++] = 'A';
> +        if (t->action & BLK_TC_BARRIER)	   rwbs[i++] = 'B';
> +        if (t->action & BLK_TC_SYNC)	   rwbs[i++] = 'S';
> +        if (t->action & BLK_TC_META)	   rwbs[i++] = 'M';
> +
> +        rwbs[i] = '\0';
> +}
> +
> +static inline void *pdu_start(struct blk_io_trace *t)
> +{
> +	return t + 1;
> +}
> +
> +static inline u32 t_sec(const struct blk_io_trace *t)
> +{
> +	return t->bytes >> 9;
> +}
> +
> +static __u64 get_pdu_int(struct blk_io_trace *t)
> +{
> +	__u64 *val = pdu_start(t);
> +
> +	return be64_to_cpu(*val);
> +}
> +
> +static void get_pdu_remap(struct blk_io_trace *t, struct blk_io_trace_remap *r)
> +{
> +	struct blk_io_trace_remap *__r = pdu_start(t);
> +	__u64 sector = __r->sector;
> +
> +	r->device = be32_to_cpu(__r->device);
> +	r->device_from = be32_to_cpu(__r->device_from);
> +	r->sector = be64_to_cpu(sector);
> +}
> +
> +static int blk_log_action(struct trace_iterator *iter, struct blk_io_trace *t,
> +			  const char *act)
> +{
> +	char rwbs[6];
> +	unsigned long long ts	= ns2usecs(iter->ts);
> +	unsigned long usec_rem	= do_div(ts, USEC_PER_SEC);
> +	unsigned secs		= (unsigned long)ts;
> +
> +	fill_rwbs(rwbs, t);
> +	return trace_seq_printf(&iter->seq,
> +				"%3d,%-3d %2d %5d.%06lu %5u %2s %3s ",
> +				MAJOR(t->device), MINOR(t->device), iter->cpu,
> +				secs, usec_rem, t->pid, act, rwbs);
> +}
> +
> +static int blk_log_generic(struct trace_iterator *iter, struct blk_io_trace *t,
> +			   const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret) {
> +		const char *cmd = trace_find_cmdline(t->pid);
> +
> +		if (t_sec(t))
> +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%s]\n",
> +					       (unsigned long long)t->sector,
> +					       t_sec(t), cmd);
> +		else
> +			ret = trace_seq_printf(&iter->seq, "[%s]\n", cmd);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_with_err(struct trace_iterator *iter, struct blk_io_trace *t,
> +			    const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret) {
> +		if (t_sec(t))
> +			ret = trace_seq_printf(&iter->seq, "%llu + %u [%d]\n",
> +					       (unsigned long long)t->sector,
> +					       t_sec(t), t->error);
> +		else
> +			ret = trace_seq_printf(&iter->seq, "%llu [%d]\n",
> +					       (unsigned long long)t->sector,
> +					       t->error);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_remap(struct trace_iterator *iter, struct blk_io_trace *t)
> +{
> +	int ret = blk_log_action(iter, t, "A");
> +	if (ret) {
> +		struct blk_io_trace_remap r = { .device = 0, };
> +
> +		get_pdu_remap(t, &r);
> +		t->device = r.device_from;
> +		ret = trace_seq_printf(&iter->seq, "%llu + %u <- (%d,%d) %llu\n",
> +				       (unsigned long long)t->sector,
> +				       t_sec(t), MAJOR(r.device), MINOR(r.device),
> +				       (unsigned long long)r.sector);
> +	}
> +	return ret;
> +}
> +
> +static int blk_log_plug(struct trace_iterator *iter, struct blk_io_trace *t)
> +{
> +	int ret = blk_log_action(iter, t, "P");
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "[%s]\n",
> +				       trace_find_cmdline(t->pid));
> +	return ret;
> +}
> +
> +static int blk_log_unplug(struct trace_iterator *iter, struct blk_io_trace *t,
> +			  const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "[%s] %u\n",
> +				       trace_find_cmdline(t->pid),
> +				       get_pdu_int(t));
> +	return ret;
> +}
> +
> +static int blk_log_split(struct trace_iterator *iter, struct blk_io_trace *t,
> +			 const char *act)
> +{
> +	int ret = blk_log_action(iter, t, act);
> +	if (ret)
> +		ret = trace_seq_printf(&iter->seq, "%llu / %u [%s]\n",
> +				       (unsigned long long)t->sector,
> +				       get_pdu_int(t),
> +				       trace_find_cmdline(t->pid));
> +	return ret;
> +}
> +
> +/*
> + * struct tracer operations
> + */
> +
> +static void blk_tracer_print_header(struct seq_file *m)
> +{
> +	seq_puts(m, "# DEV   CPU TIMESTAMP     PID ACT FLG\n"
> +		    "#  |     |     |           |   |   |\n");
> +}
> +
> +static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct blk_io_trace *t = (struct blk_io_trace *)iter->ent;
> +	const u16 what = t->action & ((1 << BLK_TC_SHIFT) - 1);
> +	int ret;
> +
> +	switch (what) {
> +	case __BLK_TA_ISSUE:	    ret = blk_log_generic(iter, t, "D"); break;
> +	case __BLK_TA_INSERT:	    ret = blk_log_generic(iter, t, "I"); break;
> +	case __BLK_TA_QUEUE:	    ret = blk_log_generic(iter, t, "Q"); break;
> +	case __BLK_TA_BOUNCE:	    ret = blk_log_generic(iter, t, "B"); break;
> +	case __BLK_TA_BACKMERGE:    ret = blk_log_generic(iter, t, "M"); break;
> +	case __BLK_TA_FRONTMERGE:   ret = blk_log_generic(iter, t, "F"); break;
> +	case __BLK_TA_GETRQ:	    ret = blk_log_generic(iter, t, "G"); break;
> +	case __BLK_TA_SLEEPRQ:	    ret = blk_log_generic(iter, t, "S"); break;
> +	case __BLK_TA_REQUEUE:	    ret = blk_log_with_err(iter, t, "R"); break;
> +	case __BLK_TA_COMPLETE:	    ret = blk_log_with_err(iter, t, "C"); break;
> +	case __BLK_TA_PLUG:	    ret = blk_log_plug(iter, t); break;
> +	case __BLK_TA_UNPLUG_IO:    ret = blk_log_unplug(iter, t, "U"); break;
> +	case __BLK_TA_UNPLUG_TIMER: ret = blk_log_unplug(iter, t, "UT"); break;
> +	case __BLK_TA_REMAP:	    ret = blk_log_remap(iter, t); break;
> +	case __BLK_TA_SPLIT:	    ret = blk_log_split(iter, t, "X"); break;
> +	default:
> +		ret = trace_seq_printf(s, "Bad pc action %x\n", what);
> +		break;
> +	}
> +
> +	if (ret == 0)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return TRACE_TYPE_HANDLED;
> +}
> +
> +static void blk_tracer_start(struct trace_array *tr)
> +{
> +	int cpu;
> +
> +	tr->time_start = ftrace_now(tr->cpu);
> +
> +	for_each_online_cpu(cpu)
> +		tracing_reset(tr, cpu);
> +
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_add_return(1, &blk_probes_ref) == 1)
> +		if (blk_register_tracepoints())
> +			atomic_dec(&blk_probes_ref);
> +	mutex_unlock(&blk_probe_mutex);


Is this mutex really needed? You already use an atomic test that make
your wall.


> +}
> +
> +static int blk_tracer_init(struct trace_array *tr)
> +{
> +	blk_tr = tr;
> +	blk_tracer_start(tr);
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled++;


I think you will not have more than one blk_tracer_init() call
while your tracer is enabled.

So, unless there are other callsites that use your tracer, I guess this counter
is useless.

> +	mutex_unlock(&blk_probe_mutex);
> +	return 0;
> +}
> +
> +static void blk_tracer_stop(struct trace_array *tr)
> +{
> +	mutex_lock(&blk_probe_mutex);
> +	if (atomic_dec_and_test(&blk_probes_ref))
> +		blk_unregister_tracepoints();
> +	mutex_unlock(&blk_probe_mutex);
> +}
> +
> +static void blk_tracer_reset(struct trace_array *tr)
> +{
> +	if (!atomic_read(&blk_probes_ref))
> +		return;
> +
> +	mutex_lock(&blk_probe_mutex);
> +	blk_tracer_enabled--;
> +	WARN_ON(blk_tracer_enabled < 0);


Ditto, I don't think that can happen.


> +	mutex_unlock(&blk_probe_mutex);
> +
> +	blk_tracer_stop(tr);
> +}
> +
> +static struct tracer blk_tracer __read_mostly = {
> +	.name		= "blk",
> +	.init		= blk_tracer_init,
> +	.reset		= blk_tracer_reset,
> +	.start		= blk_tracer_start,
> +	.stop		= blk_tracer_stop,
> +	.print_header	= blk_tracer_print_header,
> +	.print_line	= blk_tracer_print_line,
> +};
> +
> +static int __init init_blk_tracer(void)
> +{
> +	return register_tracer(&blk_tracer);
> +}
> +
> +device_initcall(init_blk_tracer);
> +
> +static int blk_trace_remove_queue(struct request_queue *q)
> +{
> +	struct blk_trace *bt;
> +
> +	bt = xchg(&q->blk_trace, NULL);
> +	if (bt == NULL)
> +		return -EINVAL;
> +
> +	kfree(bt);
> +	return 0;
> +}
> +
> +/*
> + * Setup everything required to start tracing
> + */
> +static int blk_trace_setup_queue(struct request_queue *q, dev_t dev)
> +{
> +	struct blk_trace *old_bt, *bt = NULL;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
> +	if (!bt)
> +		goto err;
> +
> +	bt->dev = dev;
> +	bt->act_mask = (u16)-1;

That looks strange on a first view.
Why not ~0 ? But that doesn't really matter...

> +	bt->end_lba = -1ULL;
> +	bt->trace_state = Blktrace_running;
> +
> +	old_bt = xchg(&q->blk_trace, bt);
> +	if (old_bt != NULL) {
> +		(void)xchg(&q->blk_trace, old_bt);
> +		kfree(bt);
> +		ret = -EBUSY;
> +	}
> +	return 0;
> +err:
> +	return ret;
> +}
> +
> +/*
> + * sysfs interface to enable and configure tracing
> + */
> +
> +static ssize_t sysfs_blk_trace_enable_show(struct device *dev,
> +					   struct device_attribute *attr,
> +					   char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();

Ouch, I hope you will add a comment to explain why you are using the bkl
here, I guess this is block related but I'm not sure of anything :-)

> +	bdev = bdget(part_devt(p));
> +	if (bdev != NULL) {
> +		struct request_queue *q = bdev_get_queue(bdev);
> +
> +		if (q != NULL) {
> +			mutex_lock(&bdev->bd_mutex);
> +			ret = sprintf(buf, "%u\n", !!q->blk_trace);

I guess you can be a bit racy here and drop the mutex. That wouldn't matter.

> +			mutex_unlock(&bdev->bd_mutex);
> +		}
> +
> +		bdput(bdev);
> +	}
> +
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_enable_store(struct device *dev,
> +					    struct device_attribute *attr,
> +					    const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	int value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0 || sscanf(buf, "%d", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	if (value)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +	else
> +		ret = blk_trace_remove_queue(q);
> +	mutex_unlock(&bdev->bd_mutex);
> +
> +	if (ret == 0)
> +		ret = count;
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count);
> +#define BLK_TRACE_DEVICE_ATTR(_name) \
> +	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
> +		    sysfs_blk_trace_attr_show, \
> +		    sysfs_blk_trace_attr_store)
> +
> +static DEVICE_ATTR(enable, S_IRUGO | S_IWUSR,
> +		   sysfs_blk_trace_enable_show, sysfs_blk_trace_enable_store);
> +static BLK_TRACE_DEVICE_ATTR(act_mask);
> +static BLK_TRACE_DEVICE_ATTR(pid);
> +static BLK_TRACE_DEVICE_ATTR(start_lba);
> +static BLK_TRACE_DEVICE_ATTR(end_lba);
> +
> +static struct attribute *blk_trace_attrs[] = {
> +	&dev_attr_enable.attr,
> +	&dev_attr_act_mask.attr,
> +	&dev_attr_pid.attr,
> +	&dev_attr_start_lba.attr,
> +	&dev_attr_end_lba.attr,
> +	NULL
> +};
> +
> +struct attribute_group blk_trace_attr_group = {
> +	.name  = "trace",
> +	.attrs = blk_trace_attrs,
> +};
> +
> +static int blk_str2act_mask(const char *str)
> +{
> +	int mask = 0;
> +	char *copy = kstrdup(str, GFP_KERNEL), *s;
> +
> +	if (copy == NULL)
> +		return -ENOMEM;
> +
> +	s = strstrip(copy);
> +
> +	while (1) {
> +		char *sep = strchr(s, ',');
> +
> +		if (sep != NULL)
> +			*sep = '\0';
> +
> +		if (strcasecmp(s, "barrier") == 0)
> +			mask |= BLK_TC_BARRIER;
> +		else if (strcasecmp(s, "complete") == 0)
> +			mask |= BLK_TC_COMPLETE;
> +		else if (strcasecmp(s, "fs") == 0)
> +			mask |= BLK_TC_FS;
> +		else if (strcasecmp(s, "issue") == 0)
> +			mask |= BLK_TC_ISSUE;
> +		else if (strcasecmp(s, "pc") == 0)
> +			mask |= BLK_TC_PC;
> +		else if (strcasecmp(s, "queue") == 0)
> +			mask |= BLK_TC_QUEUE;
> +		else if (strcasecmp(s, "read") == 0)
> +			mask |= BLK_TC_READ;
> +		else if (strcasecmp(s, "requeue") == 0)
> +			mask |= BLK_TC_REQUEUE;
> +		else if (strcasecmp(s, "sync") == 0)
> +			mask |= BLK_TC_SYNC;
> +		else if (strcasecmp(s, "write") == 0)
> +			mask |= BLK_TC_WRITE;
> +
> +		if (sep == NULL)
> +			break;
> +
> +		s = sep + 1;
> +	}
> +	kfree(copy);
> +
> +	return mask;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct hd_struct *p = dev_to_part(dev);
> +	struct request_queue *q;
> +	struct block_device *bdev;
> +	ssize_t ret = -ENXIO;
> +
> +	lock_kernel();
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +	mutex_lock(&bdev->bd_mutex);
> +	if (q->blk_trace == NULL)
> +		ret = sprintf(buf, "disabled\n");
> +	else if (attr == &dev_attr_act_mask)
> +		ret = sprintf(buf, "%#x\n", q->blk_trace->act_mask);
> +	else if (attr == &dev_attr_pid)
> +		ret = sprintf(buf, "%u\n", q->blk_trace->pid);
> +	else if (attr == &dev_attr_start_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->start_lba);
> +	else if (attr == &dev_attr_end_lba)
> +		ret = sprintf(buf, "%llu\n", q->blk_trace->end_lba);
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +	return ret;
> +}
> +
> +static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct hd_struct *p;
> +	u64 value;
> +	ssize_t ret = -ENXIO;
> +
> +	if (count == 0)
> +		goto out;
> +
> +	if (attr == &dev_attr_act_mask) {
> +		if (sscanf(buf, "%llx", &value) != 1) {
> +			/* Assume it is a list of trace category names */
> +			value = blk_str2act_mask(buf);
> +			if (value < 0)
> +				goto out;
> +		}
> +	} else if (sscanf(buf, "%llu", &value) != 1)
> +		goto out;
> +
> +	lock_kernel();
> +	p = dev_to_part(dev);
> +	bdev = bdget(part_devt(p));
> +	if (bdev == NULL)
> +		goto out_unlock_kernel;
> +
> +	q = bdev_get_queue(bdev);
> +	if (q == NULL)
> +		goto out_bdput;
> +
> +	mutex_lock(&bdev->bd_mutex);
> +	ret = 0;
> +	if (q->blk_trace == NULL)
> +		ret = blk_trace_setup_queue(q, bdev->bd_dev);
> +
> +	if (ret == 0) {
> +		if (attr == &dev_attr_act_mask)
> +			q->blk_trace->act_mask = value;
> +		else if (attr == &dev_attr_pid)
> +			q->blk_trace->pid = value;
> +		else if (attr == &dev_attr_start_lba)
> +			q->blk_trace->start_lba = value;
> +		else if (attr == &dev_attr_end_lba)
> +			q->blk_trace->end_lba = value;
> +		ret = count;
> +	}
> +	mutex_unlock(&bdev->bd_mutex);
> +out_bdput:
> +	bdput(bdev);
> +out_unlock_kernel:
> +	unlock_kernel();
> +out:
> +	return ret;
> +}
> diff --git a/fs/partitions/check.c b/fs/partitions/check.c
> index 6d72024..01714ef 100644
> --- a/fs/partitions/check.c
> +++ b/fs/partitions/check.c
> @@ -268,6 +268,10 @@ ssize_t part_fail_store(struct device *dev,
>  }
>  #endif
>  
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +extern struct attribute_group blk_trace_attr_group;
> +#endif
> +
>  static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL);
>  static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL);
>  static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL);
> @@ -294,6 +298,9 @@ static struct attribute_group part_attr_group = {
>  
>  static struct attribute_group *part_attr_groups[] = {
>  	&part_attr_group,
> +#ifdef CONFIG_BLK_DEV_IO_TRACE
> +	&blk_trace_attr_group,
> +#endif
>  	NULL
>  };


This tracer looks good :-)

  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/