From: Daniel Wagner
To: Alexei Starovoitov
Subject: Re: [PATCH v0] bpf: BPF based latency tracing
Date: Fri, 19 Jun 2015 08:07:37 +0200
Message-ID: <5583B1A9.60503@bmw-carit.de>
In-Reply-To: <5582FA7A.4050505@plumgrid.com>
References: <1434627604-9624-1-git-send-email-daniel.wagner@bmw-carit.de>
 <5582FA7A.4050505@plumgrid.com>

On 06/18/2015 07:06 PM, Alexei Starovoitov wrote:
> On 6/18/15 4:40 AM, Daniel Wagner wrote:
>> BPF offers another way to generate latency histograms. We attach
>> kprobes at trace_preempt_off and trace_preempt_on and calculate the
>> time between seeing the off and the on transition.
>>
>> The first array is used to store the start time stamp. The key is the
>> CPU id. The second array stores the log2(time diff). We need to use
>> static allocation here (arrays and not hash tables). The kprobes
>> hooking into trace_preempt_on|off must not call into any dynamic
>> memory allocation or free path, because we need to avoid getting
>> called recursively. Besides that, it reduces jitter in the measurement.
>>
>> CPU 0
>>      latency        : count     distribution
>>        1 -> 1       : 0        |                                        |
>>        2 -> 3       : 0        |                                        |
>>        4 -> 7       : 0        |                                        |
>>        8 -> 15      : 0        |                                        |
>>       16 -> 31      : 0        |                                        |
>>       32 -> 63      : 0        |                                        |
>>       64 -> 127     : 0        |                                        |
>>      128 -> 255     : 0        |                                        |
>>      256 -> 511     : 0        |                                        |
>>      512 -> 1023    : 0        |                                        |
>>     1024 -> 2047    : 0        |                                        |
>>     2048 -> 4095    : 166723   |*************************************** |
>>     4096 -> 8191    : 19870    |***                                     |
>>     8192 -> 16383   : 6324     |                                        |
>>    16384 -> 32767   : 1098     |                                        |
>
> nice useful sample indeed!
> The numbers are non-JITed, right?

You are right. I forgot to enable the JIT compiler.

> JIT should reduce the measurement cost 2-3x, but preempt_on/off
> latency probably will stay in 2k range.

Here is an example output with JIT enabled, after a few seconds:

     latency        : count     distribution
       1 -> 1       : 0        |                                        |
       2 -> 3       : 0        |                                        |
       4 -> 7       : 0        |                                        |
       8 -> 15      : 0        |                                        |
      16 -> 31      : 0        |                                        |
      32 -> 63      : 0        |                                        |
      64 -> 127     : 0        |                                        |
     128 -> 255     : 0        |                                        |
     256 -> 511     : 0        |                                        |
     512 -> 1023    : 0        |                                        |
    1024 -> 2047    : 0        |                                        |
    2048 -> 4095    : 11595    |*************************************** |
    4096 -> 8191    : 3086     |*********                               |
    8192 -> 16383   : 392      |                                        |
   16384 -> 32767   : 114      |                                        |
   32768 -> 65535   : 16       |                                        |
   65536 -> 131071  : 1        |                                        |
  131072 -> 262143  : 0        |                                        |
  262144 -> 524287  : 0        |                                        |
  524288 -> 1048575 : 34       |                                        |

As you predicted, it won't go below the 2k range.

>> I am not sure if it is really worth spending more time getting
>> the hash table working for the trace_preempt_[on|off] kprobes.
>> There are so many things which could go wrong, so going with
>> a static version seems to me the right choice.
>
> agree. for this use case arrays are better choice anyway.
> But I'll keep working on getting hash tables working even
> in these extreme conditions. bpf should always be rock solid.
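Just to make the mechanism easier to follow: the hunk quoted further down
only shows the trace_preempt_off side. Roughly, the two arrays and the
matching trace_preempt_on probe look like this. This is only a sketch of
what the commit message describes; map names, sizes and the open-coded
log2 helper are illustrative (the log2 is unrolled because the verifier
rejects loops), not necessarily what the final patch uses:

#include <linux/ptrace.h>
#include <linux/version.h>
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define MAX_CPU		128	/* illustrative sizes */
#define MAX_BUCKETS	64	/* one slot per possible log2(delta) value */

/* start timestamps, key = CPU id (used by the trace_preempt_off probe) */
struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u64),
	.max_entries = MAX_CPU,
};

/* latency histogram, MAX_BUCKETS counters per CPU */
struct bpf_map_def SEC("maps") my_hist = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(long),
	.max_entries = MAX_CPU * MAX_BUCKETS,
};

/* branchless log2 for a u64, since loops are not allowed */
static unsigned int log2l(u64 v)
{
	unsigned int r, shift;

	r     = (v > 0xFFFFFFFF) << 5; v >>= r;
	shift = (v > 0xFFFF)     << 4; v >>= shift; r |= shift;
	shift = (v > 0xFF)       << 3; v >>= shift; r |= shift;
	shift = (v > 0xF)        << 2; v >>= shift; r |= shift;
	shift = (v > 0x3)        << 1; v >>= shift; r |= shift;
	return r | (v >> 1);
}

SEC("kprobe/trace_preempt_on")
int bpf_prog2(struct pt_regs *ctx)
{
	int cpu = bpf_get_smp_processor_id();
	u64 *ts = bpf_map_lookup_elem(&my_map, &cpu);

	if (ts) {
		/* delta between the off and the on transition */
		u64 delta = bpf_ktime_get_ns() - *ts;
		int key = cpu * MAX_BUCKETS + log2l(delta);
		long *cnt = bpf_map_lookup_elem(&my_hist, &key);

		if (cnt)
			__sync_fetch_and_add(cnt, 1);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
u32 _version SEC("version") = LINUX_VERSION_CODE;

The user space part then just walks the second array per CPU and prints
the distribution shown above.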
> I'm only a bit suspicious of kprobes, since we have:
> NOKPROBE_SYMBOL(preempt_count_sub)
> but trace_preempt_on() called by preempt_count_sub()
> doesn't have this mark...

The original commit indicates that anything called from preempt_disable()
should also be marked NOKPROBE_SYMBOL:

  commit 43627582799db317e966ecb0002c2c3c9805ec0f
  Author:    Srinivasa Ds      Sun Feb 24 00:24:04 2008
  Committer: Linus Torvalds    Sun Feb 24 02:13:24 2008
  Original File: kernel/sched.c

      kprobes: refuse kprobe insertion on add/sub_preempt_counter()

      Kprobes makes use of preempt_disable(),preempt_enable_noresched()
      and these functions inturn call add/sub_preempt_count(). So we
      need to refuse user from inserting probe in to these functions.
      This patch disallows user from probing add/sub_preempt_count().

Obviously, marking them would render this patch useless.

>> +SEC("kprobe/trace_preempt_off")

BTW, is there a reason for not supporting built-in tracepoints/events? It
looks like it is only an artificial limitation of bpf_helpers.

>> +int bpf_prog1(struct pt_regs *ctx)
>> +{
>> +	int cpu = bpf_get_smp_processor_id();
>> +	u64 *ts = bpf_map_lookup_elem(&my_map, &cpu);
>> +
>> +	if (ts)
>> +		*ts = bpf_ktime_get_ns();
>
> btw, I'm planning to add native per-cpu maps which will
> speed up things more and reduce measurement overhead.

Funny, I was about to suggest something like this :)

> I think you can retarget this patch to net-next and send
> it to netdev. It's not too late for this merge window.

I'll rebase it onto net-next.

cheers,
daniel