Subject: Re: [PATCH -tip v6 00/22] kprobes: introduce NOKPROBE_SYMBOL(), cleanup and fixes crash bugs

Hi Ingo,

(2013/12/20 19:46), Ingo Molnar wrote:
>
> * Masami Hiramatsu <[email protected]> wrote:
>
>> (2013/12/20 17:20), Ingo Molnar wrote:
>>>
>>> * Masami Hiramatsu <[email protected]> wrote:
>>>
>>>>> But a closer look indicates that the insertion of kprobes is
>>>>> taking about three (!!) orders of magnitude longer than before, as
>>>>> judged by the rate of increase of 'wc -l
>>>>> /sys/kernel/debug/kprobes/list'.
>>>>
>>>> Right, because kprobes are not designed for thousands of probes.
>>>
>>> Then this needs to be fixed, because right now this bug is making it
>>> near impossible to properly test kprobes robustness.
>>>
>>> For example a hash table (hashed by probe address) could be used in
>>> addition to the list, to speed up basic operations.
>>
>> kprobe itself is already using an hlist (a 6-bit hash table).
>> Maybe we'd better expand the table bits. However, the iteration
>> of the list on debugfs is just doing seq_printf()s. I'm not exactly
>> sure what Frank is complaining about...
>
> Well, Frank reported that the test he performed takes hours to finish,
> and he mentioned a specific script line he used to produce that:
>
> # stap -te "probe kprobe.function("*") {}"
>
> I suspect an equivalent perf probe sequence would be something like:
>
> # for FUNC in $(grep -iw t /proc/kallsyms | cut -d' ' -f3); do date; perf probe -a $FUNC; done
>
> (totally untested.)
>
> Can you reproduce that slowdown, using his method?

OK, when I used the ftrace interface, it didn't slow things down at all
(at this point :))
----
# time (grep -iw t /proc/kallsyms | awk '{print "p:"$3,"0x"$1}' | xargs --max-lines=1 echo >> /sys/kernel/debug/tracing/kprobe_events )

real 0m36.303s
user 0m0.420s
sys 0m2.428s

# wc -l /sys/kernel/debug/tracing/kprobe_events
26980 /sys/kernel/debug/tracing/kprobe_events

# time cat /sys/kernel/debug/tracing/kprobe_events > /dev/null

real 0m0.054s
user 0m0.000s
sys 0m0.052s
----

This is because of two reasons:
- the ftrace interface doesn't access any other files; it just parses
  the event definition and sets it up.
- ftrace (and perf probe) doesn't enable the event while setting it up.

I guess the second reason is why stap takes so long to set probes:
stap tries to register kprobes without the disabled flag, which means
it enables thousands of probes (and pays their overhead) up front.
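
(Just to illustrate the difference: IIRC the ftrace/perf path registers
each kprobe with KPROBE_FLAG_DISABLED set and only arms it when the
event is actually switched on. Below is a rough, hypothetical sketch of
that pattern; it is not taken from this patch set, and the target
symbol and handler names are made up.)
----
#include <linux/kprobes.h>

/* Hypothetical handler: do nothing when the probe is hit. */
static int dummy_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "do_sys_open",		/* hypothetical target */
	.pre_handler = dummy_pre_handler,
	.flags	     = KPROBE_FLAG_DISABLED,	/* registered, but not armed */
};

static int setup_probe_example(void)
{
	/* Relatively cheap: the breakpoint is not patched into the text yet. */
	int ret = register_kprobe(&kp);

	if (ret)
		return ret;

	/* The expensive part, deferred until the event is really enabled. */
	return enable_kprobe(&kp);
}
----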

So a similar thing happens when we enable the events, as below:

# for i in /sys/kernel/debug/tracing/events/kprobes/* ; do date; echo 1 > $i; done
Wed Jan 29 10:44:50 UTC 2014
...

I tried it and canceled it after 4 minutes had passed. It had enabled
about 17k events and slowed my system down very badly (I almost
triggered the hang-check timer).

I think we should keep some performance statistics (a hit count?) and,
if they go over a threshold, stop enabling further events.
(Note that kprobes and other events take time. One event may consume
only a small amount of time, usually less than 0.5 usec, but what
happens if it is hit 1,000,000 times per second? That alone is 0.5 sec
of overhead every second.)


> I can reproduce one weirdness, with just 13 probes added, 'perf probe
> -l' [which should really be 'perf probe list'!] executes very slowly:
>
> # perf stat --null --repeat 3 perf probe -l
>
> Performance counter stats for 'perf probe -l' (3 runs):
>
> 0.763640098 seconds time elapsed ( +- 1.61% )
>
> 0.7 seconds is ridiculously long.

As you can see, listing the probes via the ftrace interface takes just
0.05 sec for 26980 probes. perf probe may take longer to list the
events because it tries to find symbols/lines from kallsyms or
debuginfo.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]


2014-02-09 14:37:15, by Ingo Molnar

Subject: Re: [PATCH -tip v6 00/22] kprobes: introduce NOKPROBE_SYMBOL(), cleanup and fixes crash bugs


* Masami Hiramatsu <[email protected]> wrote:

> I guess the second reason is why stap takes so long to set probes:
> stap tries to register kprobes without the disabled flag, which means
> it enables thousands of probes (and pays their overhead) up front.
>
> So a similar thing happens when we enable the events, as below:
>
> # for i in /sys/kernel/debug/tracing/events/kprobes/* ; do date; echo 1 > $i; done
> Wed Jan 29 10:44:50 UTC 2014
> ...
>
> I tried it and canceled it after 4 minutes had passed. It had enabled
> about 17k events and slowed my system down very badly (I almost
> triggered the hang-check timer).

Ok, I guess that's the slowdown bug that Frank reported.

> I think we should keep some performance statistics (a hit count?) and,
> if they go over a threshold, stop enabling further events.

That really feels like a hack. How about fixing the root cause? Does
the enabling of all probes have to be so slow?

Thanks,

Ingo

2014-02-09 21:27:43, by Frank Ch. Eigler

Subject: Re: [PATCH -tip v6 00/22] kprobes: introduce NOKPROBE_SYMBOL(), cleanup and fixes crash bugs

Hi -

> > So a similar thing happens when we enable the events, as below:
> >
> > # for i in /sys/kernel/debug/tracing/events/kprobes/* ; do date; echo 1 > $i; done
> > Wed Jan 29 10:44:50 UTC 2014
> > ...
> >
> > I tried it and canceled it after 4 minutes had passed. It had enabled
> > about 17k events and slowed my system down very badly (I almost
> > triggered the hang-check timer).
>
> Ok, I guess that's the slowdown bug that Frank reported.

It could be, but it feels a bit different. In my testing from
December, it's as though it wasn't the activated probes *hitting* that
were associated with the slowdown, but merely their being activated.
It was as though something in the kprobes/ftrace probe-registration
code performed a lot more work than it used to. (One way to test this
could be to be more careful in the selection of the kprobes being
enabled: for example, emplace thousands, but only in unused loaded
modules.) I'm sorry I didn't get time to investigate further.

- FChE

Subject: Re: [PATCH -tip v6 00/22] kprobes: introduce NOKPROBE_SYMBOL(), cleanup and fixes crash bugs

(2014/02/09 23:37), Ingo Molnar wrote:
>
> * Masami Hiramatsu <[email protected]> wrote:
>
>> I guess the second reason is why stap takes so long to set probes:
>> stap tries to register kprobes without the disabled flag, which means
>> it enables thousands of probes (and pays their overhead) up front.
>>
>> So a similar thing happens when we enable the events, as below:
>>
>> # for i in /sys/kernel/debug/tracing/events/kprobes/* ; do date; echo 1 > $i; done
>> Wed Jan 29 10:44:50 UTC 2014
>> ...
>>
>> I tried it and canceled it after 4 minutes had passed. It had enabled
>> about 17k events and slowed my system down very badly (I almost
>> triggered the hang-check timer).
>
> Ok, I guess that's the slowdown bug that Frank reported.
>
>> I think we should keep some performance statistics (a hit count?) and,
>> if they go over a threshold, stop enabling further events.
>
> That really feels like a hack. How about fixing the root cause? Does
> the enabling of all probes have to be so slow?

When I tried perf top, most of the time was consumed in
kprobe_ftrace_handler and optimized_callback, both of which are kprobe
handlers. Since I only tried this on a VM guest, which supports neither
NMI nor PMU, I will have to use a bare-metal machine for a deeper
investigation (and I will).
And yes, it seems that the performance problem comes from the probing
and tracing itself.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

Subject: Re: [PATCH -tip v6 00/22] kprobes: introduce NOKPROBE_SYMBOL(), cleanup and fixes crash bugs

(2014/02/10 13:02), Masami Hiramatsu wrote:
> (2014/02/09 23:37), Ingo Molnar wrote:
>>
>> * Masami Hiramatsu <[email protected]> wrote:
>>
>>> I guess the second reason is why stap takes so long to set probes:
>>> stap tries to register kprobes without the disabled flag, which means
>>> it enables thousands of probes (and pays their overhead) up front.
>>>
>>> So a similar thing happens when we enable the events, as below:
>>>
>>> # for i in /sys/kernel/debug/tracing/events/kprobes/* ; do date; echo 1 > $i; done
>>> Wed Jan 29 10:44:50 UTC 2014
>>> ...
>>>
>>> I tried it and canceled it after 4 minutes had passed. It had enabled
>>> about 17k events and slowed my system down very badly (I almost
>>> triggered the hang-check timer).
>>
>> Ok, I guess that's the slowdown bug that Frank reported.
>>
>>> I think we should keep some performance statistics (a hit count?) and,
>>> if they go over a threshold, stop enabling further events.
>>
>> That really feels like a hack. How about fixing the root cause? Does
>> the enabling of all probes have to be so slow?
>
> When I tried perf top, most of the time was consumed in
> kprobe_ftrace_handler and optimized_callback, both of which are kprobe
> handlers. Since I only tried this on a VM guest, which supports neither
> NMI nor PMU, I will have to use a bare-metal machine for a deeper
> investigation (and I will).
> And yes, it seems that the performance problem comes from the probing
> and tracing itself.

OK, I've tested and collected some information.

First, to evaluate the performance of kprobes itself, I set ftrace
tracing_on=0 (no recording) before testing.

To set probes on all symbols, I ran the script below:
----
#!/bin/bash
TRACE_DIR=/sys/kernel/debug/tracing/

echo "Disable tracing to remove tracing overhead"
echo 0 > $TRACE_DIR/tracing_on

echo > $TRACE_DIR/kprobe_events
echo "Setup events on all functions"
grep -iw t /proc/kallsyms | awk 'BEGIN{i=0};{print("p:"$3"_"i, "0x"$1); i++}' | \
while read l; do echo $l >> $TRACE_DIR/kprobe_events || echo "Failed: $l" ; done
----

Then I enabled just 2000 (2k) of the probes and ran perf top. Here are
the results:
----
Samples: 12K of event 'cycles', Event count (approx.): 3526908766
77.41% [kernel] [k] get_kprobe
3.17% [kernel] [k] acpi_processor_ffh_cstate_enter
0.93% [kernel] [k] acpi_os_read_port

Samples: 9K of event 'cache-misses', Event count (approx.): 6180338
80.66% [kernel] [k] get_kprobe
2.41% [kernel] [k] clear_page_c
0.48% perf [.] rb_next
----

Even with just 2k probes, get_kprobe has very high overhead!
The reason is that we currently have just 64 hash buckets for the
hlist (perf annotate showed that most of the cache misses happened
while accessing hlist->next).
This is far too small for handling more than 10k probes.
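
(For reference, the lookup is basically a hash of the probed address
into kprobe_table[] followed by a walk of that bucket's hlist, roughly
like the simplified sketch below; it is close to, but not an exact
copy of, kernel/kprobes.c.)
----
#include <linux/hash.h>
#include <linux/kprobes.h>
#include <linux/rculist.h>

#define KPROBE_HASH_BITS	6	/* currently only 64 hlist buckets */
#define KPROBE_TABLE_SIZE	(1 << KPROBE_HASH_BITS)

static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];

/* Simplified: look up the kprobe registered at the given address. */
struct kprobe *get_kprobe(void *addr)
{
	struct hlist_head *head;
	struct kprobe *p;

	head = &kprobe_table[hash_ptr(addr, KPROBE_HASH_BITS)];
	hlist_for_each_entry_rcu(p, head, hlist) {
		if (p->addr == addr)
			return p;
	}
	return NULL;
}
----
With tens of thousands of registered probes spread over only 64
buckets, each bucket holds hundreds of entries, and every hop through
hlist->next is a likely cache miss, which matches the perf annotate
result above.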

Thus, I enlarged the hash table to 4096 entries and measured the
performance again.

With 2k probes, I got the numbers below:
----
Samples: 12K of event 'cycles', Event count (approx.): 1952272318
19.08% [kernel] [k] get_kprobe
6.51% [kernel] [k] acpi_processor_ffh_cstate_enter
2.51% [kernel] [k] avtab_search_node

Samples: 1K of event 'cache-misses', Event count (approx.): 1016756
31.81% [kernel] [k] clear_page_c
5.39% [kernel] [k] shmem_getpage_gfp
3.04% [kernel] [k] get_kprobe
----
OK, now get_kprobe has cooled down.
With 8k probes it is still better than before, although we can see the
overhead percentage going up again:
----
Samples: 13K of event 'cycles', Event count (approx.): 2119001528
33.17% [kernel] [k] get_kprobe
6.98% [kernel] [k] acpi_processor_ffh_cstate_enter
3.36% [kernel] [k] ftrace_lookup_ip
Samples: 6K of event 'cache-misses', Event count (approx.): 2662552
19.14% [kernel] [k] clear_page_c
15.26% [kernel] [k] get_kprobe
4.16% [kernel] [k] shmem_getpage_gfp
----

And finally, when I tried 16k probes, the system slowed down too much
and I couldn't run perf.
I have another idea to address this: a per-cpu cache of active kprobes,
which could reduce the random memory accesses on the hlist (but would
consume more memory).
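
(Very roughly, the idea is something like the illustrative sketch
below: a tiny per-cpu cache consulted before the hash walk. This is
just a sketch of the concept, not a patch; invalidating the cache when
a kprobe is unregistered is hand-waved away here.)
----
#include <linux/kprobes.h>
#include <linux/percpu.h>

/* Illustrative only: a one-entry per-cpu cache in front of get_kprobe(). */
struct kprobe_cache {
	void		*addr;
	struct kprobe	*kp;
};
static DEFINE_PER_CPU(struct kprobe_cache, kprobe_cache);

static struct kprobe *get_kprobe_cached(void *addr)
{
	/* kprobe handlers run with preemption disabled, so this is safe. */
	struct kprobe_cache *c = this_cpu_ptr(&kprobe_cache);
	struct kprobe *p;

	if (likely(c->addr == addr))
		return c->kp;		/* hot path: no hlist walk */

	p = get_kprobe(addr);		/* slow path: hash-table lookup */
	if (p) {
		c->addr = addr;
		c->kp = p;
	}
	return p;
}
----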

But anyway, for handling a massive number of probes, we cannot
completely escape the slowdown. As I said, even if each probe hit
consumes only 0.1 usec, the probes can be hit more than 10,000,000
times/sec (because every function call hits one). As a result, they
consume 1 sec of CPU time every second (which means your machine slows
down to half speed). And cache misses make things much worse than that.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]