Hello,
Following this huge discussion thread, I tried to come up with a marker
mechanism (something everyone seems to agree is a necessity) that would be
useful to each kind of tracing, dynamic and static (projects concerned:
SystemTAP, LKET, LKST, LTTng), and even to combinations of those. Religious
considerations aside, I really think that this kind of generic markup is
necessary to meet *everybody*'s needs. If I forgot about a specific genericity
aspect, please tell me.
I take it as agreed that both static and dynamic tracing are useful for
different needs and that a full markup must support both and combinations,
letting the user or the distribution choose.
If you like it, please add the right menuconfig lines in arch/*/Kconfig and a
NOPS macro in include/asm-*/marker.h.
Comments are, as always, welcome.
Mathieu
--- BEGIN ---
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -1082,6 +1082,8 @@ config KPROBES
for kernel debugging, non-intrusive instrumentation and testing.
If in doubt, say "N".
+source "kernel/Kconfig.marker"
+
source "ltt/Kconfig"
endmenu
--- /dev/null
+++ b/include/asm-i386/marker.h
@@ -0,0 +1,12 @@
+/*****************************************************************************
+ * marker.h
+ *
+ * Code markup for dynamic and static tracing. i386 support.
+ *
+ * Mathieu Desnoyers <[email protected]>
+ *
+ * September 2006
+ */
+
+#define JPROBE_TARGET \
+ __asm__ ( GENERIC_NOP5 )
--- /dev/null
+++ b/include/linux/marker.h
@@ -0,0 +1,77 @@
+/*****************************************************************************
+ * marker.h
+ *
+ * Code markup for dynamic and static tracing.
+ *
+ * Use either :
+ * MARK
+ * MARK_NOPRINT (will never call printk)
+ * MARK_STATIC (not dynamically instrumentable, will never call printk)
+ *
+ * Example :
+ *
+ * MARK(subsystem_event, "Event happened %d %s", someint, somestring);
+ * Where :
+ * - subsystem is the name of your subsystem.
+ * - event is the name of the event to mark.
+ * - "Event happened %d %s" is the formatted string for printk.
+ * - someint is an integer.
+ * - somestring is a char *.
+ *
+ * Mathieu Desnoyers <[email protected]>
+ *
+ * September 2006
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+
+#include <asm/marker.h>
+
+#define MARK_SYM(event) \
+ __asm__ ( "__mark_" KBUILD_BASENAME "_" #event ":" )
+
+#define MARK_INACTIVE(event, format, args...)
+
+#define MARK_PRINT(event, format, args...) printk(format, ##args)
+
+#define MARK_FPROBE(event, format, args...) fprobe_##event(args)
+
+#define MARK_KPROBE(event, format, args...) MARK_SYM(event)
+
+#define MARK_JPROBE(event, format, args...) \
+ do { \
+ MARK_SYM(event); \
+ JPROBE_TARGET; \
+ } while(0)
+
+/* Menu configured markers */
+#ifndef CONFIG_MARK
+#define MARK MARK_INACTIVE
+#elif defined(CONFIG_MARK_PRINT)
+#define MARK MARK_PRINT
+#elif defined(CONFIG_MARK_FPROBE)
+#define MARK MARK_FPROBE
+#elif defined(CONFIG_MARK_KPROBE)
+#define MARK MARK_KPROBE
+#elif defined(CONFIG_MARK_JPROBE)
+#define MARK MARK_JPROBE
+#endif
+
+#ifndef CONFIG_MARK_NOPRINT
+#define MARK_NOPRINT MARK_INACTIVE
+#elif defined(CONFIG_MARK_NOPRINT_FPROBE)
+#define MARK_NOPRINT MARK_FPROBE
+#elif defined(CONFIG_MARK_NOPRINT_KPROBE)
+#define MARK_NOPRINT MARK_KPROBE
+#elif defined(CONFIG_MARK_NOPRINT_JPROBE)
+#define MARK_NOPRINT MARK_JPROBE
+#endif
+
+#ifndef CONFIG_MARK_STATIC
+#define MARK_STATIC MARK_INACTIVE
+#else
+#define MARK_STATIC MARK_FPROBE
+#endif
+
+
--- /dev/null
+++ b/kernel/Kconfig.marker
@@ -0,0 +1,75 @@
+# Code markers configuration
+
+menu "Marker configuration"
+
+
+config MARK
+ bool "Enable MARK code markers"
+ default y
+ help
+ Activate markers that can call printk or can be instrumented
+ dynamically.
+
+choice
+ prompt "MARK code marker behavior"
+ default MARK_KPROBE
+ depends on MARK
+ help
+ Configuration of markers that can call printk or can be
+ instrumented dynamically.
+
+config MARK_KPROBE
+ bool "KPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event".
+config MARK_JPROBE
+ bool "JPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event"
+ and create a target for a high speed dynamic probe.
+config MARK_FPROBE
+ bool "FPROBE"
+ ---help---
+ Change markers for a function call.
+config MARK_PRINT
+ bool "PRINT"
+ ---help---
+ Call printk from the marker.
+endchoice
+
+config MARK_NOPRINT
+ bool "Enable MARK_NOPRINT code markers"
+ default y
+ help
+ Activate markers that cannot call printk.
+
+choice
+ prompt "MARK_NOPRINT code marker behavior"
+ default MARK_NOPRINT_KPROBE
+ depends on MARK_NOPRINT
+ help
+ Configuration of markers that cannot call printk.
+
+config MARK_NOPRINT_KPROBE
+ bool "KPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event".
+config MARK_NOPRINT_JPROBE
+ bool "JPROBE"
+ ---help---
+ Change markers for a symbol "__mark_modulename_event"
+ and create a target for a high speed dynamic probe.
+config MARK_NOPRINT_FPROBE
+ bool "FPROBE"
+ ---help---
+ Change markers for a function call.
+endchoice
+
+config MARK_STATIC
+ bool "Enable MARK_STATIC code markers"
+ default y
+ help
+ Activate markers that cannot be instrumented dynamically. They will
+ generate function calls to each function-style probe.
+
+endmenu
--- END ---
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
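A minimal userspace sketch of the compile-time selection done in
include/linux/marker.h above, for readers who want to try the macro
dispatch outside the kernel. Assumptions: printf stands in for printk, the
CONFIG_* symbols are defined by hand instead of coming from Kconfig, and
mark_hits and demo_subsystem_event() are hypothetical names that exist only
so the effect can be observed.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the Kconfig output; normally set by the build. */
#define CONFIG_MARK 1
#define CONFIG_MARK_PRINT 1

/* Counts how many markers actually emitted something (demo only). */
static int mark_hits;

/* Inactive markers expand to an empty statement. */
#define MARK_INACTIVE(event, format, ...) do { } while (0)

/* printf replaces printk in this userspace sketch. */
#define MARK_PRINT(event, format, ...) \
	do { mark_hits++; printf(format, ##__VA_ARGS__); } while (0)

/* Same selection ladder as the patch, reduced to two behaviors. */
#ifndef CONFIG_MARK
#define MARK MARK_INACTIVE
#elif defined(CONFIG_MARK_PRINT)
#define MARK MARK_PRINT
#else
#define MARK MARK_INACTIVE
#endif

/* A marked code path; returns the number of markers that fired so far. */
static int demo_subsystem_event(int someint, const char *somestring)
{
	MARK(subsystem_event, "Event happened %d %s\n", someint, somestring);
	return mark_hits;
}
```

Removing the CONFIG_MARK_PRINT define makes the same call site compile to
nothing, which is the property the menu options are meant to give.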
Ar Llu, 2006-09-18 am 19:45 -0400, ysgrifennodd Mathieu Desnoyers:
> +#define MARK_KPROBE(event, format, args...) MARK_SYM(event);
> +
> +#define MARK_JPROBE(event, format, args...) \
> + do { \
> + MARK_SYM(event); \
> + JPROBE_TARGET; \
> + } while(0)
Seems a good path and has scope to be combined with some of our debug
trace printks to take them out into trace tool space instead of
cluttering up mainstream.
On Mon, Sep 18, 2006 at 07:45:02PM -0400, Mathieu Desnoyers wrote:
> + * Mathieu Desnoyers <[email protected]>
> + *
> + * September 2006
> + */
> +
> +#include <linux/config.h>
> +#include <linux/kernel.h>
config.h is automatically included in the build process.
kernel.h is too iirc.
Dave
* Mathieu Desnoyers <[email protected]> wrote:
> +choice
> + prompt "MARK code marker behavior"
> +config MARK_KPROBE
> +config MARK_JPROBE
> +config MARK_FPROBE
> + Change markers for a function call.
> +config MARK_PRINT
as indicated before in great detail, NACK on this proliferation of
marker options, especially the function call one. I'd like to see _one_
marker mechanism that distros could enable, preferably with zero (or at
most one NOP) in-code overhead. (You can of course patch whatever
extension on top of it, in out-of-tree code, to gain further performance
advantage by generating direct system-calls.)
There might be a hodgepodge of methods and tools in userspace to do
debugging, but in the kernel we should get our act together and only
take _one_ (or none at all), and then spend all our efforts on improving
that primary method of debug instrumentation. As kprobes/SystemTap has
proven, it is possible to have zero-overhead inactive probes.
Furthermore, for such a patch to make sense in the upstream kernel,
downstream tracing code has to make actual use of that NOP-marker. I.e.
a necessary (but not sufficient) requirement for upstream inclusion (in
my view) would be for this mechanism to be used by LTT and LKST. (again,
you can patch LTT for your own purposes in your own patchset if you
think the performance overhead of probes is too much)
Ingo
* Ingo Molnar <[email protected]> wrote:
> > +choice
> > + prompt "MARK code marker behavior"
>
> > +config MARK_KPROBE
> > +config MARK_JPROBE
> > +config MARK_FPROBE
> > + Change markers for a function call.
> > +config MARK_PRINT
>
> as indicated before in great detail, NACK on this proliferation of
> marker options, especially the function call one. I'd like to see _one_
> marker mechanism that distros could enable, preferably with zero (or at
> most one NOP) in-code overhead. (You can of course patch whatever
> extension on top of it, in out-of-tree code, to gain further performance
> advantage by generating direct system-calls.)
^---function
Ingo
Ingo Molnar wrote:
> * Mathieu Desnoyers <[email protected]> wrote:
>
>> +choice
>> + prompt "MARK code marker behavior"
>
>> +config MARK_KPROBE
>> +config MARK_JPROBE
>> +config MARK_FPROBE
>> + Change markers for a function call.
>> +config MARK_PRINT
>
> as indicated before in great detail, NACK on this proliferation of
> marker options, especially the function call one. I'd like to see _one_
> marker mechanism that distros could enable, preferably with zero (or at
> most one NOP) in-code overhead. (You can of course patch whatever
> extension on top of it, in out-of-tree code, to gain further performance
> advantage by generating direct system-calls.)
>
> There might be a hodgepodge of methods and tools in userspace to do
> debugging, but in the kernel we should get our act together and only
> take _one_ (or none at all), and then spend all our efforts on improving
> that primary method of debug instrumentation. As kprobes/SystemTap has
> proven, it is possible to have zero-overhead inactive probes.
>
> Furthermore, for such a patch to make sense in the upstream kernel,
> downstream tracing code has to make actual use of that NOP-marker. I.e.
> a necessary (but not sufficient) requirement for upstream inclusion (in
> my view) would be for this mechanism to be used by LTT and LKST. (again,
> you can patch LTT for your own purposes in your own patchset if you
> think the performance overhead of probes is too much)
You know ... it strikes me that there's another way to do this, that's
zero overhead when not enabled, and gets rid of the inflexibility in
kprobes. It might not work well in all cases, but at least for simple
non-inlined functions, it'd seem to.
Why don't we just copy the whole damned function somewhere else, and
make an instrumented copy (as a kernel module)? Then reroute all the
function calls through it, instead of the original version. OK, it's
not completely trivial to do, but simpler than kprobes (probably
doing the switchover atomically is the hard part, but not impossible).
There's NO overhead when not using, and much lower than probes when
you are.
That way we can do whatever the hell we please with internal variables,
however GCC optimises it, can write flexible instrumenting code to just
about anything, program in C as God intended, etc, etc. No, it probably
won't fix every case under the sun, but hopefully most of them, and we
can still use kprobes/djprobes/bodilyprobes for the rest of the cases.
M.
Mathieu Desnoyers <[email protected]> writes:
> [...] I take for agreed that both static and dynamic tracing are
> useful for different needs and that a full markup must support both
> and combinations, letting the user or the distribution choose.
Elaborating on Ingo's "one mechanism" comments, I believe a marker
widget needs to be generic at run time. We're not just looking for a
way of hiding direct calls to lttng in a marker macro. We're looking
for a way of marking spots & data in a uniform way, then later
(run-time) binding each of those markers to (tools such as) lttng
and/or systemtap.
- FChE
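A minimal userspace sketch of what "binding at run time" could look like,
assuming each marker is a named slot holding a probe pointer that a tool
fills in later. Everything here (struct marker, marker_bind(),
marker_fire(), demo_probe()) is hypothetical and only illustrates the idea;
it is not an interface proposed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical probe signature: marker name plus one integer payload. */
typedef void (*marker_probe_t)(const char *name, int value);

struct marker {
	const char *name;
	marker_probe_t probe;	/* NULL while no tool is attached */
};

/* Markers are declared once; tools attach to them later by name. */
static struct marker markers[] = {
	{ "subsystem_event", NULL },
	{ "subsystem_other", NULL },
};

#define NUM_MARKERS (sizeof(markers) / sizeof(markers[0]))

/* Bind (or unbind, with probe == NULL) a marker; returns -1 if unknown. */
static int marker_bind(const char *name, marker_probe_t probe)
{
	size_t i;

	for (i = 0; i < NUM_MARKERS; i++) {
		if (strcmp(markers[i].name, name) == 0) {
			markers[i].probe = probe;
			return 0;
		}
	}
	return -1;
}

/* What a marked site would do: call the probe only if one is bound. */
static void marker_fire(size_t idx, int value)
{
	marker_probe_t probe = markers[idx].probe;

	if (probe)
		probe(markers[idx].name, value);
}

/* Demo probe standing in for a tracer; records the last value seen. */
static int last_seen;

static void demo_probe(const char *name, int value)
{
	(void)name;
	last_seen = value;
}
```

The marked site stays the same whichever tool is attached; only the bound
pointer changes. The unbound case costs a load and a test in this sketch,
which is more than the NOP-only overhead asked for earlier in the thread.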
* Martin J. Bligh <[email protected]> wrote:
> You know ... it strikes me that there's another way to do this, that's
> zero overhead when not enabled, and gets rid of the inflexibility in
> kprobes. It might not work well in all cases, but at least for simple
> non-inlined functions, it'd seem to.
>
> Why don't we just copy the whole damned function somewhere else, and
> make an instrumented copy (as a kernel module)? Then reroute all the
> function calls through it, instead of the original version. OK, it's
> not completely trivial to do, but simpler than kprobes (probably doing
> the switchover atomically is the hard part, but not impossible).
> There's NO overhead when not using, and much lower than probes when
> you are.
>
> That way we can do whatever the hell we please with internal
> variables, however GCC optimises it, can write flexible instrumenting
> code to just about anything, program in C as God intended, etc, etc.
> No, it probably won't fix every case under the sun, but hopefully most
> of them, and we can still use kprobes/djprobes/bodilyprobes for the
> rest of the cases.
yeah, this would be nice - if it weren't for function pointers, and if
all kernel functions were relocatable. But if you can think of a method
to do this, it would be nice.
Ingo
Hi -
On Tue, Sep 19, 2006 at 08:11:40AM -0700, Martin J. Bligh wrote:
> [...] Why don't we just copy the whole damned function somewhere
> else, and make an instrumented copy (as a kernel module)? Then
> reroute all the function calls through it [...]
Interesting idea. Are you imagining this instrumented copy being
built at kernel compile time (something like building a "-g -O0"
parallel)? Or compiled anew from original sources after deployment?
Or on-the-fly binary-level rewriting a la SPIN?
> OK, it's not completely trivial to do, but simpler than kprobes [...]
None of the three above are that easy. Do you have an implementation
idea?
- FChE
Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, Sep 19, 2006 at 08:11:40AM -0700, Martin J. Bligh wrote:
>
>
>>[...] Why don't we just copy the whole damned function somewhere
>>else, and make an instrumented copy (as a kernel module)? Then
>>reroute all the function calls through it [...]
>
>
> Interesting idea. Are you imagining this instrumented copy being
> built at kernel compile time (something like building a "-g -O0"
> parallel)? Or compiled anew from original sources after deployment?
> Or on-the-fly binary-level rewriting a la SPIN?
"compiled anew from original sources after deployment" seems the most
practical to do to me. From second hand info on using systemtap, you
seem to need the same compiler and source tree to work from anyway, so
this doesn't seem much of a burden.
>>OK, it's not completely trivial to do, but simpler than kprobes [...]
>
> None of the three above are that easy. Do you have an implementation
> idea?
not in detail, but given the problems that the other probe technologies
solved, it seems easy in comparison. It seems like all we'd need to do
is "list all references to function, freeze kernel, update all
references, continue", but perhaps I'm oversimplifying it ... if it's
all just straight calls, it'd seem easy. The freeze would be very short,
it's just poking a few addresses.
Having multiple hooks inside the same function pieced in at different
times, etc gets tricky, but you can always fall back on one of the other
methods if you get something complicated (or enforce some self-discipline
in userspace on how to compound them together).
Ingo Molnar wrote:
> yeah, this would be nice - if it weren't for function pointers,
> and if all kernel functions were relocatable. But if you can think of
> a method to do this, it would be nice.
Well, it doesn't have to work for everything. But would be much nicer
for when it does work, it seems to me. Which functions are not
relocatable? Function pointers are indeed a problem, for the functions
they're used on, but they're not common. Some simple markup for these
types of functions would fix it easily enough, I'd think.
A more common problem would seem to me to be instrumenting an inlined
function that was pulled into multiple places, but even that doesn't
seem particularly difficult.
M.
Martin J. Bligh wrote:
> Ingo Molnar wrote:
>
>> * Mathieu Desnoyers <[email protected]> wrote:
>>
>>> +choice
>>> + prompt "MARK code marker behavior"
>>
>>
>>> +config MARK_KPROBE
>>> +config MARK_JPROBE
>>> +config MARK_FPROBE
>>> + Change markers for a function call.
>>> +config MARK_PRINT
>>
>>
>> as indicated before in great detail, NACK on this proliferation of
>> marker options, especially the function call one. I'd like to see
>> _one_ marker mechanism that distros could enable, preferably with
>> zero (or at most one NOP) in-code overhead. (You can of course patch
>> whatever extension on top of it, in out-of-tree code, to gain further
>> performance advantage by generating direct system-calls.)
>>
>> There might be a hodgepodge of methods and tools in userspace to do
>> debugging, but in the kernel we should get our act together and only
>> take _one_ (or none at all), and then spend all our efforts on
>> improving that primary method of debug instrumentation. As
>> kprobes/SystemTap has proven, it is possible to have zero-overhead
>> inactive probes.
>>
>> Furthermore, for such a patch to make sense in the upstream kernel,
>> downstream tracing code has to make actual use of that NOP-marker.
>> I.e. a necessary (but not sufficient) requirement for upstream
>> inclusion (in my view) would be for this mechanism to be used by LTT
>> and LKST. (again, you can patch LTT for your own purposes in your own
>> patchset if you think the performance overhead of probes is too much)
>
>
> You know ... it strikes me that there's another way to do this, that's
> zero overhead when not enabled, and gets rid of the inflexibility in
> kprobes. It might not work well in all cases, but at least for simple
> non-inlined functions, it'd seem to.
>
> Why don't we just copy the whole damned function somewhere else, and
> make an instrumented copy (as a kernel module)? Then reroute all the
> function calls through it, instead of the original version. OK, it's
> not completely trivial to do, but simpler than kprobes (probably
> doing the switchover atomically is the hard part, but not impossible).
> There's NO overhead when not using, and much lower than probes when
> you are.
>
> That way we can do whatever the hell we please with internal variables,
> however GCC optimises it, can write flexible instrumenting code to just
> about anything, program in C as God intended, etc, etc. No, it probably
> won't fix every case under the sun, but hopefully most of them, and we
> can still use kprobes/djprobes/bodilyprobes for the rest of the cases.
>
> M.
It is an interesting idea, but there appear to be the following hard issues
(some of which you have already listed), and I am not able to see how we can
overcome them:
1) We are going to have a duplicate of the whole function, which means
any significant change in the original function needs to be made in the
copy as well. Do you think maintainers would like this double-work idea?
2) Inline functions are often the place where we need a fast path to
overcome the current kprobes overhead.
3) As you said, it is not trivial across all the platforms to switch
to the instrumented function from the original during execution.
This problem is similar to the issue we are dealing with in djprobes.
Martin J. Bligh wrote:
> Why don't we just copy the whole damned function somewhere else, and
> make an instrumented copy (as a kernel module)?
If you're going to go with that, then why not just use a comment-based
markup? Then your alternate copy gets to be generated from the same
codebase. It also solves the inherent problem of deciding whether
a macro-based markup is far too intrusive, since you can mildly allow
yourself more verbosity in a comment. Not only that, but if it's
comment-based, it's even foreseeable, though maybe not desirable, that
*everything* that deals with this type of markup be maintained out
of tree (i.e. scripts generating alternate functions and all.)
Karim
> It is an interesting idea but there appears to be following hard issues
> (some of which you have already listed) i am not able to see how we can
> overcome them
>
> 1) We are going to have a duplicate of the whole function which means
> any significant changes in the original function needs to be done on the
> copy as well, you think maintainers would like this double work idea.
No, no ... the duplicate function isn't duplicated source code, only
object code. Either a config option via the markup macros that we've
been discussing, or something I hack up on the fly to debug a problem
dynamically. In terms of how the debugging-type source code is kept,
it's no different than something like systemtap or LTT (either would
work, and a normal diff could be used to keep out of tree stuff),
it's just how it hooks in is different to kprobes.
> 2) Inline functions is often the place where we need a fast path to
> overcome the current kprobes overhead.
You can still instrument inline functions, you just need to hook all
the callers, not the inline itself.
> 3) As you said it is not trivial across all the platforms to do a switch
> to the instrumented function from the original during the execution.
> This problem is similar to the issue we are dealing with djprobes.
If we just freeze all kernel operations for a split second whilst we do
this, does it matter? Or even if we don't ... there's a brief race where
some calls are traced, and some are not ... does that even matter?
Doesn't seem like most usages would care.
M.
Karim Yaghmour wrote:
> Martin J. Bligh wrote:
>
>>Why don't we just copy the whole damned function somewhere else, and
>>make an instrumented copy (as a kernel module)?
>
>
> If you're going to go with that, then why not just use a comment-based
> markup?
Comment, marker macro, flat patch, don't care much. All would work.
> Then your alternate copy gets to be generated from the same codebase.
That was always the intent, or codebase + flat patch if really
necessary. Sorry if that wasn't clear.
> It also solves the inherent problem of deciding whether
> a macro-based markup is far too intrusive, since you can mildly allow
> yourself more verbosity in a comment. Not only that, but if it's
> comment-based, it's even foreseeable, though maybe not desirable, that
> *everything* that deals with this type of markup be maintained out
> of tree (i.e. scripts generating alternate functions and all.)
Not sure we need scripts, just a normal patch diff would do. I'm not
sure any of this alters the markup debate much ... it just would seem
to provide a simpler, faster, and more flexible way of hooking in than
kprobes.
M.
Martin Bligh wrote:
> That was always the intent, or codebase + flat patch if really
> necessary. Sorry if that wasn't clear.
Ah, ok.
> Not sure we need scripts, just a normal patch diff would do. I'm not
> sure any of this alters the markup debate much ...
It doesn't, just wasn't clear on the function duplication part.
> it just would seem
> to provide a simpler, faster, and more flexible way of hooking in than
> kprobes.
Sure.
Karim
On Tue, 19 Sep 2006 09:04:43 -0700
Martin Bligh <[email protected]> wrote:
> It seems like all we'd need to do
> is "list all references to function, freeze kernel, update all
> references, continue"
"overwrite first 5 bytes of old function with `jmp new_function'".
Martin Bligh <[email protected]> wrote on 19/09/2006 17:04:43:
> Frank Ch. Eigler wrote:
> > Hi -
> >
> > On Tue, Sep 19, 2006 at 08:11:40AM -0700, Martin J. Bligh wrote:
> >
> >
> >>[...] Why don't we just copy the whole damned function somewhere
> >>else, and make an instrumented copy (as a kernel module)? Then
> >>reroute all the function calls through it [...]
> >
> >
> > Interesting idea. Are you imagining this instrumented copy being
> > built at kernel compile time (something like building a "-g -O0"
> > parallel)? Or compiled anew from original sources after deployment?
> > Or on-the-fly binary-level rewriting a la SPIN?
>
> "compiled anew from original sources after deployment" seems the most
> practical to do to me. From second hand info on using systemtap, you
> seem to need the same compiler and source tree to work from anyway, so
> this doesn't seem much of a burden.
>
If I'm not mistaken, this has been done before under the guise of dynamic
patch. Doesn't Solaris have the capability? I'm certain that some UNIXes do
as well as non-UNIX O/Ss.
Richard
Andrew Morton wrote:
> On Tue, 19 Sep 2006 09:04:43 -0700
> Martin Bligh <[email protected]> wrote:
>
>
>>It seems like all we'd need to do
>>is "list all references to function, freeze kernel, update all
>>references, continue"
>
>
> "overwrite first 5 bytes of old function with `jmp new_function'".
Yes, that's simple, but slower, as you have a double jump. Probably
a damned sight faster than int3 though.
M.
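For reference, a minimal sketch of how the 5-byte `jmp new_function`
quoted above is encoded on i386: opcode 0xE9 followed by a little-endian
32-bit displacement measured from the end of the jump instruction. The
function name encode_jmp_rel32() and the addresses used are made up for
illustration. The sketch only builds the bytes in a buffer; it does not
patch live code and says nothing about making the write safe against
other CPUs.

```c
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9	/* x86 near jump with 32-bit displacement */
#define JMP_REL32_LEN    5	/* one opcode byte plus four displacement bytes */

/*
 * Build the bytes of "jmp to" as they would sit at address "from".
 * The displacement is relative to the address of the next instruction,
 * i.e. from + 5, and wraps modulo 2^32 for backward jumps.
 */
static void encode_jmp_rel32(uint8_t out[JMP_REL32_LEN],
			     uint32_t from, uint32_t to)
{
	uint32_t rel = to - (from + JMP_REL32_LEN);

	out[0] = JMP_REL32_OPCODE;
	out[1] = (uint8_t)(rel & 0xff);		/* least significant byte first */
	out[2] = (uint8_t)((rel >> 8) & 0xff);
	out[3] = (uint8_t)((rel >> 16) & 0xff);
	out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

With from = 0x1000 and to = 0x2000 the displacement is 0x2000 - 0x1005 =
0xFFB, so the bytes are E9 FB 0F 00 00.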
Martin Bligh wrote:
> That was always the intent, or codebase + flat patch if really
> necessary. Sorry if that wasn't clear.
Actually rereading through your posts with this correction in mind
I find this to actually be one of the most interesting ideas I've
seen of late. There's probably not a 1-to-1 correlation here, but
some of the problems mentioned seem similar to RCU stuff (modify
pointer, make sure nobody's got a copy of it, etc.), though I could
be wrong.
Random thoughts -- no guarantees:
Instead of freezing everything and making sure all text refs to
function are modified, you might just be able to use kprobes (on
the architectures that have it) as a trampoline for on-the-fly
address call modifications. And on the archs that don't have
kprobes, you could at build time degrade this by replacing direct
calls to instrumented functions by function pointers or localized
ifs.
Not sure.
Karim
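A minimal userspace sketch of the build-time fallback mentioned above,
assuming callers reach the function through a pointer so that switching to
the instrumented copy is a single pointer store. All names (do_work,
do_work_instrumented, do_work_ptr, set_tracing, caller) are hypothetical.

```c
/* Counts calls that went through the instrumented copy (demo only). */
static int calls_traced;

/* Original function. */
static int do_work(int x)
{
	return x * 2;
}

/* Instrumented copy: same behavior plus the tracing side effect. */
static int do_work_instrumented(int x)
{
	calls_traced++;
	return x * 2;
}

/* Callers go through this pointer instead of calling do_work() directly. */
static int (*do_work_ptr)(int) = do_work;

/* Switch every caller at once by replacing the pointer. */
static void set_tracing(int enable)
{
	do_work_ptr = enable ? do_work_instrumented : do_work;
}

/* An ordinary call site, unaware of which copy it reaches. */
static int caller(int x)
{
	return do_work_ptr(x);
}
```

The cost is an indirect call at every call site even when tracing is off,
which is why this reads as a degraded mode for architectures without
kprobes, not as the preferred mechanism.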
* Martin Bligh <[email protected]> wrote:
> Andrew Morton wrote:
> >On Tue, 19 Sep 2006 09:04:43 -0700
> >Martin Bligh <[email protected]> wrote:
> >
> >
> >>It seems like all we'd need to do
> >>is "list all references to function, freeze kernel, update all
> >>references, continue"
> >
> >
> >"overwrite first 5 bytes of old function with `jmp new_function'".
>
> Yes, that's simple. but slower, as you have a double jump. Probably a
> damned sight faster than int3 though.
modern CPUs will probably even optimize that intermediate jump away in
their BTB-ish caches. But in any case this would solve the function
pointer problem too.
Ingo
Martin Bligh <[email protected]> writes:
> [...] "compiled anew from original sources after deployment" seems
> the most practical to do to me. From second hand info on using
> systemtap, you seem to need the same compiler and source tree to
> work from anyway [...]
Not quite. Systemtap does not look at sources, only object code and
its embedded debugging information. (How many distributions keep
around compilable source trees?)
> [...] It seems like all we'd need to do is "list all references to
> function, freeze kernel, update all references, continue", [...]
One additional problem is the external references made *by* the function.
Those too would all have to be relocated to the live data.
Live code patching is theoretically useful for all kinds of things,
but I've never heard it described as relatively simple before! :-)
- FChE
Frank Ch. Eigler wrote:
> Martin Bligh <[email protected]> writes:
>
>
>>[...] "compiled anew from original sources after deployment" seems
>>the most practical to do to me. From second hand info on using
>>systemtap, you seem to need the same compiler and source tree to
>>work from anyway [...]
>
>
> Not quite. Systemtap does not look at sources, only object code and
> its embedded debugging information. (How many distributions keep
> around compilable source trees?)
???? Boggle. Any distro that cannot find the source code for its kernel
deserves a swift kick to the head, plus a red hot poker somewhere else.
>>[...] It seems like all we'd need to do is "list all references to
>>function, freeze kernel, update all references, continue", [...]
>
> One additional problem are external references made *by* the function.
> Those too would all have to be relocated to the live data.
Not sure what you mean ... could you give a quick example?
> Live code patching is theoretically useful for all kinds of things,
> but I've never heard it described as relatively simple before! :-)
well, on a whole-function basis, it seems somewhat simpler.
M.
Hi -
On Tue, Sep 19, 2006 at 09:52:55AM -0700, Martin Bligh wrote:
> >[...] (How many distributions keep around compilable source
> >trees?)
>
> ???? Boggle. Any distro that cannot find the source code for its kernel
> deserves a swift kick to the head, plus a red hot poker somewhere else.
My question is more whether they package up such a buildable
configured patched source tree (/usr/src/redhat/BUILD/* in RH-speak),
or just some extract like the .c/.h files.
> >>[...] It seems like all we'd need to do is "list all references to
> >>function, freeze kernel, update all references, continue", [...]
> >
> >One additional problem are external references made *by* the function.
> >Those too would all have to be relocated to the live data.
>
> Not sure what you mean ... could you give a quick example?
Think about stuff that any function does. It calls other functions,
and manipulates global data, which all show up as external references
in the object code. All those references would have to be patched to
refer to the live running copy of the original compilation unit.
- FChE
On Tue, Sep 19, 2006 at 09:41:30AM -0700, Martin Bligh wrote:
> Andrew Morton wrote:
> >On Tue, 19 Sep 2006 09:04:43 -0700
> >Martin Bligh <[email protected]> wrote:
> >
> >
> >>It seems like all we'd need to do
> >>is "list all references to function, freeze kernel, update all
> >>references, continue"
> >
> >
> >"overwrite first 5 bytes of old function with `jmp new_function'".
>
> Yes, that's simple. but slower, as you have a double jump. Probably
> a damned sight faster than int3 though.
>
> M.
The advantage of using int3 over jmp to launch the instrumented
module is that int3 (or breakpoint in most architectures) is an
atomic operation to insert.
I am getting some more ideas...
1. Copy the original functions, instrument them and insert them as
a part of kernel module with different name prefix.
2. Insert breakpoint only on those routines at runtime.
3. When the breakpoint gets hit, change the instruction pointer to
the instrumented routine. No need to single step at all.
Adv:
Can be enabled/disabled dynamically by inserting/removing
breakpoints. No overhead of single stepping.
No restriction of running the handler in interrupt context.
You can have pre-compiled instrumented routines.
This mechanism can be used for pre-defined set of routines and for
arbitrary probe points, you can use kprobes/jprobes/systemtap.
No need to be super-user for predefined breakpoints.
Dis:
Maintenance of the code, since the code base needs to be
duplicated and instrumented.
The above idea is similar to runtime or dynamic patching, but here we
use int3(breakpoint) rather than jump instruction.
Please correct me if I am wrong.
Please let me know if you need more information.
Thanks
Prasanna
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329
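The three steps above can be simulated in userspace as a lookup table plus
a handler that rewrites the saved instruction pointer. This is only a
sketch under stated assumptions: struct fake_regs stands in for the real
register frame, regs->ip is assumed to already hold the address of the
probed function entry when the handler runs, and redirect_add(),
redirect_remove() and breakpoint_hit() are hypothetical names, not kprobes
interfaces.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REDIRECTS 8

/* Stand-in for the saved register frame; only the instruction pointer. */
struct fake_regs {
	uintptr_t ip;
};

/* One entry per routine that has a breakpoint at its entry. */
struct redirect {
	uintptr_t orig;		/* address of the original routine */
	uintptr_t instrumented;	/* address of the copy in the module */
};

static struct redirect redirects[MAX_REDIRECTS];
static size_t nr_redirects;

/* Step 2: "insert a breakpoint" by registering the pair. */
static int redirect_add(uintptr_t orig, uintptr_t instrumented)
{
	if (nr_redirects == MAX_REDIRECTS)
		return -1;
	redirects[nr_redirects].orig = orig;
	redirects[nr_redirects].instrumented = instrumented;
	nr_redirects++;
	return 0;
}

/* Disabling tracing is just removing the entry again. */
static int redirect_remove(uintptr_t orig)
{
	size_t i;

	for (i = 0; i < nr_redirects; i++) {
		if (redirects[i].orig == orig) {
			redirects[i] = redirects[nr_redirects - 1];
			nr_redirects--;
			return 0;
		}
	}
	return -1;
}

/*
 * Step 3: on a breakpoint hit, point the saved ip at the instrumented
 * routine so execution resumes there. Returns 1 if redirected, 0 if the
 * address is not one of ours (no single-stepping in either case).
 */
static int breakpoint_hit(struct fake_regs *regs)
{
	size_t i;

	for (i = 0; i < nr_redirects; i++) {
		if (redirects[i].orig == regs->ip) {
			regs->ip = redirects[i].instrumented;
			return 1;
		}
	}
	return 0;
}
```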
>>>>It seems like all we'd need to do
>>>>is "list all references to function, freeze kernel, update all
>>>>references, continue"
>>>
>>>
>>>"overwrite first 5 bytes of old function with `jmp new_function'".
>>
>>Yes, that's simple. but slower, as you have a double jump. Probably
>>a damned sight faster than int3 though.
>
>
> The advantage of using int3 over jmp to launch the instrumented
> module is that int3 (or breakpoint in most architectures) is an
> atomic operation to insert.
Ah, good point. Though ... how much do we care what the speed of
insertion/removal actually is? If we can tolerate it being slow,
then just sync everyone up in an IPI to freeze them out whilst
doing the insert.
> I am getting some more ideas...
>
> 1. Copy the original functions, instrument them and insert them as
> a part of kernel module with different name prefix.
> 2. Insert breakpoint only on those routines at runtime.
> 3. When the breakpoint gets hit, change the instruction pointer to
> the instrumented routine. No need to single step at all.
Surely this still carries the overhead of doing the breakpoint,
which was part of what we were trying to get away from? I suppose
we get more flexibility this way. Or does the slowness not actually
come from the int3, but only the single-stepping?
How about we combine all three ideas together ...
1. Load modified copy of the function in question.
2. overwrite the first instruction of the routine with an int3 that
does what you say (atomically)
3. Then overwrite the second instruction with a jump that's faster
4. Now atomically overwrite the int3 with a nop, and let the jump
take over.
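The four steps above can be sketched on a plain byte buffer. This is only a
user-space illustration: reroute_entry() and the opcode macros are made-up
names, the buffer stands in for the start of the original function, and nothing
here models other CPUs, preemption or the int3 handler itself.

```c
#include <stdint.h>
#include <string.h>

#define OP_INT3 0xCC	/* one-byte breakpoint */
#define OP_NOP  0x90	/* one-byte nop */
#define OP_JMP  0xE9	/* jmp rel32, five bytes in total */

/*
 * Hypothetical helper: apply steps 2-4 to a byte buffer. "rel" is the
 * displacement to the instrumented copy, measured from the end of the
 * jmp (offset 6 in this buffer). Bytes 1..5 of the old code are lost.
 */
static void reroute_entry(uint8_t *code, int32_t rel)
{
	code[0] = OP_INT3;			/* step 2: single-byte store */
	code[1] = OP_JMP;			/* step 3: jmp placed behind the int3 */
	memcpy(&code[2], &rel, sizeof(rel));	/* step 3: jmp displacement */
	code[0] = OP_NOP;			/* step 4: int3 becomes a nop */
}
```

Only the stores in steps 2 and 4 are single-byte writes; the five bytes written
in step 3 are the part that would need care on a live kernel.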
> Adv:
> Can be enabled/disabled dynamically by inserting/removing
> breakpoints. No overhead of single stepping.
> No restriction of running the handler in interrupt context.
> You can have pre-compiled instrumented routines.
> This mechanism can be used for pre-defined set of routines and for
> arbiratory probe points, you can use kprobes/jprobes/systemtap.
> No need to be super-user for predefined breakpoints.
>
> Dis:
> Maintainence of the code, since it can code base need to be
> duplicated and instrumented.
CONFIG_FOO_BAR .... turn it on or off to turn on the instrumentation.
compiled out by default. Compiled in when making the tracing functions.
> The above idea is similar to runtime or dynamic patching, but here we
> use int3(breakpoint) rather than jump instruction.
Depends what we're trying to fix. I was trying to fix two things:
1. Flexibility - kprobes seem unable to access all local variables etc
easily, and go anywhere inside the function. Plus keeping low overhead
for doing things like keeping counters in a function (see previous
example I mentioned for counting pages in shrink_list).
2. Overhead of the int3, which was allegedly 1000 cycles or so, though
faster after Ingo had played with it, it's still significant.
M.
On Tue, Sep 19, 2006 at 10:17:53AM -0700, Martin Bligh wrote:
> >>>>It seems like all we'd need to do
> >>>>is "list all references to function, freeze kernel, update all
> >>>>references, continue"
> >>>
> >>>
> >>>"overwrite first 5 bytes of old function with `jmp new_function'".
> >>
> >>Yes, that's simple. but slower, as you have a double jump. Probably
> >>a damned sight faster than int3 though.
> >
> >
> >The advantage of using int3 over jmp to launch the instrumented
> >module is that int3 (or breakpoint in most architectures) is an
> >atomic operation to insert.
>
> Ah, good point. Though ... how much do we care what the speed of
> insertion/removal actually is? If we can tolerate it being slow,
> then just sync everyone up in an IPI to freeze them out whilst
> doing the insert.
>
I guess using an IPI occasionally would be acceptable. But I think
using an IPI for each probe would add a lot of overhead.
>
> Surely this still carries the overhead of doing the breakpoint,
> which was part of what we were trying to get away from? I suppose
> we get more flexibility this way. Or does the slowness not actually
> come from the int3, but only the single-stepping?
Yes, it comes from int3 as well.
>
> How about we combine all three ideas together ...
>
> 1. Load modified copy of the function in question.
> 2. overwrite the first instruction of the routine with an int3 that
> does what you say (atomically)
> 3. Then overwrite the second instruction with a jump that's faster
> 4. Now atomically overwrite the int3 with a nop, and let the jump
> take over.
>
That's a good solution.
Thanks
Prasanna
> >Adv:
> >Can be enabled/disabled dynamically by inserting/removing
> >breakpoints. No overhead of single stepping.
> >No restriction of running the handler in interrupt context.
> >You can have pre-compiled instrumented routines.
> >This mechanism can be used for pre-defined set of routines and for
> >arbiratory probe points, you can use kprobes/jprobes/systemtap.
> >No need to be super-user for predefined breakpoints.
> >
> >Dis:
> >Maintainence of the code, since it can code base need to be
> >duplicated and instrumented.
>
> CONFIG_FOO_BAR .... turn it on or off to turn on the instrumentation.
> compiled out by default. Compiled in when making the tracing functions.
>
> >The above idea is similar to runtime or dynamic patching, but here we
> >use int3(breakpoint) rather than jump instruction.
>
> Depends what we're trying to fix. I was trying to fix two things:
>
> 1. Flexibility - kprobes seem unable to access all local variables etc
> easily, and go anywhere inside the function. Plus keeping low overhead
> for doing things like keeping counters in a function (see previous
> example I mentioned for counting pages in shrink_list).
>
> 2. Overhead of the int3, which was allegedly 1000 cycles or so, though
> faster after Ingo had played with it, it's still significant.
>
> M.
Hi Martin,
* Martin J. Bligh ([email protected]) wrote:
> Why don't we just copy the whole damned function somewhere else, and
> make an instrumented copy (as a kernel module)? Then reroute all the
> function calls through it, instead of the original version. OK, it's
> not completely trivial to do, but simpler than kprobes (probably
> doing the switchover atomically is the hard part, but not impossible).
> There's NO overhead when not using, and much lower than probes when
> you are.
>
I just thought about your idea and I think it can be very powerful. I think it
can be a lot easier with a probe at the beginning of the function than changing
function pointers everywhere. First of all, if we just think about easily
accessing internal variables, we could think of this simple trampoline scheme :
1 - load the instrumented function with modprobe
2 - use kprobe to reroute the first instructions of the original function to the
new one.
3 - _not_ use the special kprobe_ret, simply return at the end of the
instrumented function.
Then, if we want to optimize the speed of this mechanism, we can deploy
djprobes : it would greatly help them to know in advance where the probe is
located. We would have to see if the prologue of a function is a good spot to
put a jump (it does not seem to be the case however) :( .
To stop this tracing behavior, we would just have to remove the kprobe.
Unloading of the instrumented module can be difficult though (we have to be sure
the code will no longer be executed).
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Vara Prasad ([email protected]) wrote:
> It is an interesting idea but there appears to be following hard issues
> (some of which you have already listed) i am not able to see how we can
> overcome them
>
> 1) We are going to have a duplicate of the whole function which means
> any significant changes in the original function needs to be done on the
> copy as well, you think maintainers would like this double work idea.
>
Not with my marker proposal. It only needs to be compiled with different
flags.
> 2) Inline functions is often the place where we need a fast path to
> overcome the current kprobes overhead.
>
> 3) As you said it is not trivial across all the platforms to do a switch
> to the instrumented function from the original during the execution.
> This problem is similar to the issue we are dealing with djprobes.
>
I would really like to know how good djprobes is at instrumenting the
prologue of a function.
Mathieu
* Martin Bligh ([email protected]) wrote:
> How about we combine all three ideas together ...
>
> 1. Load modified copy of the function in question.
> 2. overwrite the first instruction of the routine with an int3 that
> does what you say (atomically)
> 3. Then overwrite the second instruction with a jump that's faster
> 4. Now atomically overwrite the int3 with a nop, and let the jump
> take over.
>
Very good idea.. However, overwriting the second instruction with a jump could
be dangerous on preemptible and SMP kernels, because we never know if a thread
has an IP in any of its contexts that would return exactly at the middle of the
jump. I think it would be doable to overwrite a 5+ bytes instruction with a NOP
non-atomically in all cases, but as the instructions in the prologue seem to
be smaller :
prologue on x86
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
epilogue on x86
3: 5d pop %ebp
4: c3 ret
Then it can be a problem. Ideas are welcome.
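To make the concern concrete, here is a small sketch that counts how many
instruction boundaries end up inside a 5-byte jump. The helper name and the
length table are hypothetical; the 1- and 2-byte lengths come from the prologue
listing above, and the 3-byte third instruction used in the example is an
assumption.

```c
#include <stddef.h>

/*
 * Hypothetical check: insn_len[] holds the lengths of consecutive
 * instructions starting at the patch address. Returns the number of
 * instruction boundaries that fall strictly inside the first patch_len
 * bytes, i.e. addresses a saved IP could hold but that would land in
 * the middle of the new jump.
 */
static int boundaries_inside_patch(const unsigned char *insn_len, size_t n,
				   unsigned int patch_len)
{
	unsigned int off = 0;
	int count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		off += insn_len[i];
		if (off >= patch_len)
			break;
		count++;	/* the boundary at "off" is inside the patch */
	}
	return count;
}
```

With lengths {1, 2, 3} and a 5-byte patch, offsets 1 and 3 are both possible
return addresses inside the jump; with a single 5-byte instruction there are
none.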
Mathieu
>>Ah, good point. Though ... how much do we care what the speed of
>>insertion/removal actually is? If we can tolerate it being slow,
>>then just sync everyone up in an IPI to freeze them out whilst
>>doing the insert.
>>
>
> I guess using IPI occasionally would be acceptable. But I think
> using IPI for each probes will lots of overhead.
Depends how often you're inserting/removing probes, I guess.
Aren't these being done manually, in which case it really can't
be that many? Still doesn't fix the problem Mathieu just pointed
out though. Humpf.
>>How about we combine all three ideas together ...
>>
>>1. Load modified copy of the function in question.
>>2. overwrite the first instruction of the routine with an int3 that
>>does what you say (atomically)
>>3. Then overwrite the second instruction with a jump that's faster
>>4. Now atomically overwrite the int3 with a nop, and let the jump
>>take over.
>
> That's a good solution.
It's not exactly elegant or simple, but I guess it'd work if we have
to go to that extent. Seems like a lot of complexity though, I'd
rather get rid of the int3 trap if we can.
M.
Mathieu Desnoyers wrote:
> * Martin Bligh ([email protected]) wrote:
>
>>How about we combine all three ideas together ...
>>
>>1. Load modified copy of the function in question.
>>2. overwrite the first instruction of the routine with an int3 that
>>does what you say (atomically)
>>3. Then overwrite the second instruction with a jump that's faster
>>4. Now atomically overwrite the int3 with a nop, and let the jump
>>take over.
>>
>
>
> Very good idea.. However, overwriting the second instruction with a jump could
> be dangerous on preemptible and SMP kernels, because we never know if a thread
> has an IP in any of its contexts that would return exactly at the middle of the
> jump. I think it would be doable to overwrite a 5+ bytes instruction with a NOP
> non-atomically in all cases, but as the instructions nin the prologue seems to
> be smaller :
>
> prologue on x86
> 0: 55 push %ebp
> 1: 89 e5 mov %esp,%ebp
> epilogue on x86
> 3: 5d pop %ebp
> 4: c3 ret
>
> Then is can be a problem. Ideas are welcome.
Ugh, yes that's somewhat problematic. It does seem rather unlikely that
there's a function call in the function prologue when we're busy
offloading stuff onto the stack, but still ...
For the cases where we're prepared to overwrite the call instruction in
the caller, rather than insert an extra jump in the callee, can we not
do that atomically by overwriting the address we're jumping to (the
call is obviously there already)? Doesn't fix function pointers, etc,
but might work well for the simple case at least.
M.
* Martin Bligh ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> >* Martin Bligh ([email protected]) wrote:
> >
> >jump. I think it would be doable to overwrite a 5+ bytes instruction with
> >a NOP
> >non-atomically in all cases, but as the instructions in the prologue
> >seems to
> >be smaller :
> >
> >prologue on x86
> > 0: 55 push %ebp
> > 1: 89 e5 mov %esp,%ebp
> >epilogue on x86
> > 3: 5d pop %ebp
> > 4: c3 ret
> >
> >Then is can be a problem. Ideas are welcome.
>
> Ugh, yes that's somewhat problematic. It does seem rather unlikely that
> there's a function call in the function prologue when we're busy
> offloading stuff onto the stack, but still ...
>
A function call is not the cause of the problem : an interrupt/trap is.
> For the cases where we're prepared to overwrite the call instruction in
> the caller, rather than insert an extra jump in the callee, can we not
> do that atomically by overwriting the address we're jumping to (the
> call is obviously there already)? Doesn't fix function pointers, etc,
> but might work well for the simple case at least.
>
I don't think we have any guarantee that the function pointer in the call is
aligned, so I guess it would not be an atomic replacement.
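A rough sketch of the alignment question, assuming the usual i386 near call
encoding (opcode E8 followed by a 32-bit displacement). Both helper names are
made up, and whether an aligned 32-bit store is enough with respect to
instruction fetch on other CPUs is a separate question this does not answer.

```c
#include <stdint.h>

/*
 * The displacement starts one byte after the call opcode. Hypothetical
 * check: does that 4-byte operand sit on a 4-byte boundary, so that one
 * aligned 32-bit store could replace it?
 */
static int call_operand_is_aligned(uintptr_t call_addr)
{
	return ((call_addr + 1) & 3) == 0;
}

/* Displacement that makes the call at call_addr reach new_target. */
static int32_t call_displacement(uintptr_t call_addr, uintptr_t new_target)
{
	/* relative to the address of the instruction after the 5-byte call */
	return (int32_t)((uint32_t)new_target - (uint32_t)(call_addr + 5));
}
```

Only one call address in four passes the check, and nothing places call
instructions on such addresses by default.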
Mathieu
Martin Bligh wrote:
> [...]
> Depends what we're trying to fix. I was trying to fix two things:
>
> 1. Flexibility - kprobes seem unable to access all local variables etc
> easily, and go anywhere inside the function. Plus keeping low overhead
> for doing things like keeping counters in a function (see previous
> example I mentioned for counting pages in shrink_list).
>
Using tools like systemtap one can consult DWARF information, put
probes in the middle of the function and access local variables as well;
that is not the real problem. The issue here is that the compiler doesn't seem
to generate the required DWARF information in some cases due to optimizations.
The other related problem is that even when debug information exists, the
way to specify the breakpoint location is by line number, which is not
maintainable; having a marker solves this problem as well. Your proposal
still doesn't solve the need for markers, if I understood correctly.
> 2. Overhead of the int3, which was allegedly 1000 cycles or so, though
> faster after Ingo had played with it, it's still significant.
The reason kprobes uses a breakpoint instruction, as pointed out by Prasanna,
is that it is atomic to insert on most platforms. We are already working on an
improved idea using a jump instruction, with which the overhead is less than
100 cycles on modern CPUs, but it has some limitations and issues
related to preemption and SMP.
You can get a glimpse of some of the issues here
http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
For more details, search for djprobe in the systemtap mailing list
(sorry, I am not able to find a few threads that summarize all the issues).
Here is the algorithm djprobes uses to
IA
|
[-2][-1][0][1][2][3][4][5][6][7]
[ins1][ins2][ ins3 ]
[<- DCR ->]
[<- JTPR ->]
ins1: 1st Instruction
ins2: 2nd Instruction
ins3: 3rd Instruction
IA: Insertion Address
JTPR: Jump Target Prohibition Region
DCR: Detoured Code Region
The replacement procedure of djprobes is the following (I have simplified
the actual steps djprobes uses for readability):
(1) copy the instruction(s) in the DCR
(2) put a breakpoint instruction at IA
(3) make sure no CPU has the instructions being replaced in its cache, to avoid
a jump into the middle of the jmp instruction
(4) replace the original instruction(s) with the jump instruction
As you can see from the above, your suggestion is very similar to
djprobes, hence I believe all the issues related to djprobes will be
valid for yours as well.
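As an illustration of step (1), the size of the detoured code region can be
worked out from the instruction lengths at the insertion address. This is a
sketch with a made-up helper name, not the djprobes code: whole instructions
are taken until a 5-byte jump fits.

```c
#include <stddef.h>

#define JMP_REL32_LEN 5	/* size of the jump that replaces the region */

/*
 * Hypothetical helper: given the lengths of the instructions starting
 * at the insertion address, return how many bytes of whole instructions
 * have to be copied out so that a 5-byte jump fits, or 0 if the listed
 * instructions do not cover 5 bytes.
 */
static unsigned int detour_region_len(const unsigned char *insn_len, size_t n)
{
	unsigned int len = 0;
	size_t i;

	for (i = 0; i < n && len < JMP_REL32_LEN; i++)
		len += insn_len[i];

	return len >= JMP_REL32_LEN ? len : 0;
}
```

For instruction lengths {2, 2, 4} the region is 8 bytes, so three bytes past
the jump are left over and must never be executed or jumped to.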
> M.
* Vara Prasad ([email protected]) wrote:
> Martin Bligh wrote:
>
> >[...]
> >Depends what we're trying to fix. I was trying to fix two things:
> >
> >1. Flexibility - kprobes seem unable to access all local variables etc
> >easily, and go anywhere inside the function. Plus keeping low overhead
> >for doing things like keeping counters in a function (see previous
> >example I mentioned for counting pages in shrink_list).
> >
> Using tools like systemtap on can consult DWARF information and put
> probes in the middle of the function and access local variables as well,
> that is not the real problem. The issue here is compiler doesn't seem to
> generate required DWARF information in some cases due to optimizations.
> The other related problem is when there exists debug information, the
> way to specify the breakpoint location is using line number which is not
> maintainable, having a marker solves this problem as well. Your proposal
> still doesn't solve the need for markers if i understood correctly.
>
His implementation makes heavy use of a marker mechanism : this is exactly
what makes it possible to create the instrumented objects from the same source
code, but with different #defines.
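A minimal sketch of what "same source, different #defines" could look like.
MARK_SKETCH, MARKER_SKETCH_ENABLED, marker_hits and scan_pages() are all
hypothetical names for illustration, not the macros from the patch; in a real
build the switch would come from the compiler flags instead of being set in
the file.

```c
#include <stdio.h>

/* Hypothetical switch; the instrumented object would be built with it set. */
#define MARKER_SKETCH_ENABLED 1

static unsigned long marker_hits;	/* stand-in for a real trace backend */

#if MARKER_SKETCH_ENABLED
#define MARK_SKETCH(name, fmt, ...) \
	do { marker_hits++; printf(#name ": " fmt "\n", __VA_ARGS__); } while (0)
#else
#define MARK_SKETCH(name, fmt, ...) do { } while (0)
#endif

/* Toy function with one marker in its loop. */
static int scan_pages(int nr_pages)
{
	int i, scanned = 0;

	for (i = 0; i < nr_pages; i++) {
		scanned++;
		MARK_SKETCH(mm_page_scanned, "page %d of %d", i, nr_pages);
	}
	return scanned;
}
```

Building the same file with the switch set to 0 gives the uninstrumented
object, with the marker compiled out entirely.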
Mathieu
Mathieu Desnoyers wrote:
> * Vara Prasad ([email protected]) wrote:
>
>>Martin Bligh wrote:
>>
>>
>>>[...]
>>>Depends what we're trying to fix. I was trying to fix two things:
>>>
>>>1. Flexibility - kprobes seem unable to access all local variables etc
>>>easily, and go anywhere inside the function. Plus keeping low overhead
>>>for doing things like keeping counters in a function (see previous
>>>example I mentioned for counting pages in shrink_list).
>>>
>>
>>Using tools like systemtap on can consult DWARF information and put
>>probes in the middle of the function and access local variables as well,
>>that is not the real problem. The issue here is compiler doesn't seem to
>>generate required DWARF information in some cases due to optimizations.
>>The other related problem is when there exists debug information, the
>>way to specify the breakpoint location is using line number which is not
>>maintainable, having a marker solves this problem as well. Your proposal
>>still doesn't solve the need for markers if i understood correctly.
>
> His implementation makes a heavy use of a marker mechanism : this is exactly
> what permits to create the instrumented objects from the same source code, but
> with different #defines.
I don't think it ties us to markers, though I think they're superior for
maintenance, personally. It could equally well be an out of tree
normal flat patch with all the tracing in, which would make Andrew
happy, even if I think it sucks ;-)
M.
Vara Prasad wrote:
> Martin Bligh wrote:
>
>> [...]
>> Depends what we're trying to fix. I was trying to fix two things:
>>
>> 1. Flexibility - kprobes seem unable to access all local variables etc
>> easily, and go anywhere inside the function. Plus keeping low overhead
>> for doing things like keeping counters in a function (see previous
>> example I mentioned for counting pages in shrink_list).
>>
> Using tools like systemtap on can consult DWARF information and put
> probes in the middle of the function and access local variables as well,
> that is not the real problem. The issue here is compiler doesn't seem to
> generate required DWARF information in some cases due to optimizations.
It seems difficult to separate those two from each other. If the
subsystem you're relying on doesn't work, then ....
> The other related problem is when there exists debug information, the
> way to specify the breakpoint location is using line number which is not
> maintainable, having a marker solves this problem as well. Your proposal
> still doesn't solve the need for markers if i understood correctly.
It could, but I think we're better off with the markers, yes.
>> 2. Overhead of the int3, which was allegedly 1000 cycles or so, though
>> faster after Ingo had played with it, it's still significant.
>
> The reason Kprobes use breakpoint instruction as pointed out by Prasanna
> is, it is atomic on most platforms. We are already working on an
> improved idea using jump instruction with which overhead is less than
> 100 cycles on modern CPU's but it has some limitations and issues
> related to preemption and SMP.
>
> You can get a glimpse of some of the issues here
> http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
> http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
> For more details do a search for djprobe in the systemtap mailing list
> (sorry i am not able to find few threads to summarize all the issues).
"This djprobe is NOT a replacement of kprobes. Djprobe and kprobes
have complementary qualities. (ex: djprobe's overhead is low, and
kprobes can be inserted in anywhere.)". Hmm. that seems problematic.
From what I was describing for function replacement, we could do an NMI
IPI to everyone, and lock them in there whilst we insert the probe, but
it's a bit sucky.
> Here is the algorithm djprobes uses to
>
> IA
> | [-2][-1][0][1][2][3][4][5][6][7]
> [ins1][ins2][ ins3 ]
> [<- DCR ->]
> [<- JTPR ->]
>
> ins1: 1st Instruction
> ins2: 2nd Instruction
> ins3: 3rd Instruction
> IA: Insertion Address
> JTPR: Jump Target Prohibition Region
> DCR: Detoured Code Region
>
>
> The replacement procedure of djpopbes is the following (i have
> simplified for readability the actual steps djprobes uses)
>
> (1) copying instruction(s) in DCR
> (2) putting break point instruction at IA
> (3) make sure no cpu's have replacing instructions in the cache to avoid
> jump to the middle of jmp instruction
> (4) replacing original instruction(s) with jump instruction
>
> As you can see from the above your suggestion is very similar to the
> djprobes hence i believe all the issues related to djprobes will be
> valid for yours as well.
The hooking seems very similar, yes, perhaps I can be lazy and just
steal djprobes for this. The difference is that if we just replace the
whole function, we can just shove arbitrary changes into functions, and
do whatever we please. Plus we don't have to worry about locating
internal variables, etc.
M.
On Tue, Sep 19, 2006 at 12:26:32PM -0700, Martin Bligh wrote:
> Vara Prasad wrote:
> >Martin Bligh wrote:
> >
> >>[...]
> >>Depends what we're trying to fix. I was trying to fix two things:
> >>
> >>1. Flexibility - kprobes seem unable to access all local variables etc
> >>easily, and go anywhere inside the function. Plus keeping low overhead
> >>for doing things like keeping counters in a function (see previous
> >>example I mentioned for counting pages in shrink_list).
> >>
> >Using tools like systemtap on can consult DWARF information and put
> >probes in the middle of the function and access local variables as well,
> >that is not the real problem. The issue here is compiler doesn't seem to
> >generate required DWARF information in some cases due to optimizations.
>
> It seems difficult to seperate those two from each other. If the
> subsystem you're relying on doesn't work, then ....
>
> >The other related problem is when there exists debug information, the
> >way to specify the breakpoint location is using line number which is not
> >maintainable, having a marker solves this problem as well. Your proposal
> >still doesn't solve the need for markers if i understood correctly.
>
> It could, but I think we're better off with the markers, yes.
>
> >>2. Overhead of the int3, which was allegedly 1000 cycles or so, though
> >>faster after Ingo had played with it, it's still significant.
> >
> >The reason Kprobes use breakpoint instruction as pointed out by Prasanna
> >is, it is atomic on most platforms. We are already working on an
> >improved idea using jump instruction with which overhead is less than
> >100 cycles on modern CPU's but it has some limitations and issues
> >related to preemption and SMP.
> >
> >You can get a glimpse of some of the issues here
> >http://sourceware.org/ml/systemtap/2006-q3/msg00507.html
> >http://sourceware.org/ml/systemtap/2005-q4/msg00117.html
> >For more details do a search for djprobe in the systemtap mailing list
> >(sorry i am not able to find few threads to summarize all the issues).
>
> "This djprobe is NOT a replacement of kprobes. Djprobe and kprobes
> have complementary qualities. (ex: djprobe's overhead is low, and
> kprobes can be inserted in anywhere.)". Hmm. that seems problematic.
>
> From what I was describing for function replacement, we could do an NMI
> IPI to everyone, and lock them in there whilst we insert the probe, but
> it's a bit sucky.
We can do batch processing here. Send one IPI to everyone
and then insert a bunch of jump instructions. This will reduce the number
of IPIs required.
>
> >Here is the algorithm djprobes uses to
> >
> > IA
> > | [-2][-1][0][1][2][3][4][5][6][7]
> > [ins1][ins2][ ins3 ]
> > [<- DCR ->]
> > [<- JTPR ->]
> >
> >ins1: 1st Instruction
> >ins2: 2nd Instruction
> >ins3: 3rd Instruction
> >IA: Insertion Address
> >JTPR: Jump Target Prohibition Region
> >DCR: Detoured Code Region
> >
> >
> >The replacement procedure of djpopbes is the following (i have
> >simplified for readability the actual steps djprobes uses)
> >
> >(1) copying instruction(s) in DCR
> >(2) putting break point instruction at IA
> >(3) make sure no cpu's have replacing instructions in the cache to avoid
> >jump to the middle of jmp instruction
> >(4) replacing original instruction(s) with jump instruction
> >
> >As you can see from the above your suggestion is very similar to the
> >djprobes hence i believe all the issues related to djprobes will be
> >valid for yours as well.
>
> The hooking seems very similar, yes, perhaps I can be lazy and just
> steal djprobes for this. The difference is that if we just replace the
> whole function, we can just shove arbitrary changes into functions, and
> do whatever we please. Plus we don't have to worry about locating
> internal variables, etc.
>
A somewhat more complicated method:
How about inserting (instruction size) breakpoints and
waiting until all the threads have been scheduled at least once (so that
threads would hit a breakpoint if their IPs are in the middle of the
instructions we want to replace with the jump), and then replacing them
with the jump instruction?
Thanks
Prasanna
* S. P. Prasanna ([email protected]) wrote:
>
> Some more coplicated method.
> How about inserting a (instruction size) number of breakpoints and
> wait untill all the threads gets scheduled atleast once (so that
> threads would hit the breakpoint, if their IPs are in the middle of
> instruction we want to replace with jump) and then replace with
> jump instruction.
>
What happens if a thread is stopped ?
Mathieu
Martin Bligh wrote:
> be that many? Still doesn't fix the problem Matieu just pointed
> out though. Humpf.
There's one possibility if we're willing to insert a placeholder
at function entry that allows us to essentially do what Andrew
suggests without much impact. Specifically, if you need a 5-byte
operation to jump to the alternate instrumented function, you
can then do something like:
1- At build time insert 5-byte unconditional jump to instruction
right after placeholder.
2- At runtime for diverting flow:
- Replace first byte with int3 (atomically)
- Replace next 4 bytes with instrumented function destination
- Replace first byte with the jump opcode again
3- At runtime for returning flow:
- Do #2 but for the original placeholder jump.
There's no race condition here or fear of an interrupt return in
the middle of anything, or any need to stop the kernel from
operating and the like, or even a dependency on kprobes or a need
for dprobes, at least as far as I can see -- so this should
be trivial on m68k ;). The price to pay is an additional
unconditional jump at all times, which should be optimized at
runtime by the CPU. Benchmarks could help show the real impact,
but as Ingo said, these things should be minimal.
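The placeholder scheme can be sketched on a byte buffer as follows. The
function names are invented for illustration, the buffer stands in for the
5-byte slot at function entry, and the sketch assumes that an int3 handler
(not shown) does something sensible for callers that arrive while the operand
is being rewritten.

```c
#include <stdint.h>
#include <string.h>

#define OP_INT3 0xCC	/* one-byte breakpoint */
#define OP_JMP  0xE9	/* jmp rel32 */

/* Step 1: build-time placeholder, "jmp +0", which just falls through. */
static void placeholder_init(uint8_t *site)
{
	const int32_t zero = 0;

	site[0] = OP_JMP;
	memcpy(&site[1], &zero, sizeof(zero));
}

/* Steps 2 and 3: retarget the slot; rel = 0 restores the placeholder. */
static void placeholder_retarget(uint8_t *site, int32_t rel)
{
	site[0] = OP_INT3;			/* callers trap instead of jumping */
	memcpy(&site[1], &rel, sizeof(rel));	/* operand changes behind the int3 */
	site[0] = OP_JMP;			/* jump is live again, new target */
}
```

Because the slot always holds exactly one 5-byte instruction, no return address
can point into the middle of it.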
In sum, this would work for function pointers and wouldn't
require having to walk the code in search of instances of
"call foo" to replace.
Just a thought.
Karim
Mathieu Desnoyers wrote:
> * Vara Prasad ([email protected]) wrote:
>> Martin Bligh wrote:
>>
>>> [...]
>>> Depends what we're trying to fix. I was trying to fix two things:
>>>
>>> 1. Flexibility - kprobes seem unable to access all local variables etc
>>> easily, and go anywhere inside the function. Plus keeping low overhead
>>> for doing things like keeping counters in a function (see previous
>>> example I mentioned for counting pages in shrink_list).
>>>
>> Using tools like systemtap on can consult DWARF information and put
>> probes in the middle of the function and access local variables as well,
>> that is not the real problem. The issue here is compiler doesn't seem to
>> generate required DWARF information in some cases due to optimizations.
>> The other related problem is when there exists debug information, the
>> way to specify the breakpoint location is using line number which is not
>> maintainable, having a marker solves this problem as well. Your proposal
>> still doesn't solve the need for markers if i understood correctly.
>>
>
> His implementation makes a heavy use of a marker mechanism : this is exactly
> what permits to create the instrumented objects from the same source code, but
> with different #defines.
Djprobes don't depend on markers. Actually, markers help to find a
safe place to probe, but they are not necessary. At least, instructions
that are longer than 4 bytes are probeable.
As Vara pointed out, we are developing tools that find
safe places for djprobes.
Satoshi OSHIMA
Ar Maw, 2006-09-19 am 13:54 -0400, ysgrifennodd Mathieu Desnoyers:
> Very good idea.. However, overwriting the second instruction with a jump could
> be dangerous on preemptible and SMP kernels, because we never know if a thread
> has an IP in any of its contexts that would return exactly at the middle of the
> jump.
No: on x86 it is the *same* case for all of these even writing an int3.
One byte or a megabyte,
You MUST ensure that every CPU executes a serializing instruction before
it hits code that was modified by another processor. Otherwise you get
CPU errata and the CPU produces results which vendors like to describe
as "undefined".
Thus you have to serialize, and if you are serializing it really doesn't
matter if you write a byte, a paragraph or a page.
Alan
Alan Cox wrote:
> Ar Maw, 2006-09-19 am 13:54 -0400, ysgrifennodd Mathieu Desnoyers:
>> Very good idea.. However, overwriting the second instruction with a jump could
>> be dangerous on preemptible and SMP kernels, because we never know if a thread
>> has an IP in any of its contexts that would return exactly at the middle of the
>> jump.
>
> No: on x86 it is the *same* case for all of these even writing an int3.
> One byte or a megabyte,
>
> You MUST ensure that every CPU executes a serializing instruction before
> it hits code that was modified by another processor. Otherwise you get
> CPU errata and the CPU produces results which vendors like to describe
> as "undefined".
I was aware that this erratum existed, but never knew the
actual specifics of it. Are these two separate problems or just
one?
a) the errata & a possible thread having an IP leading back within (not
at the start of) the range to be replaced.
b) the errata & replacing single instruction with single instruction of
same size.
In a), there's almost an intractable problem of making sure no IP leads
back within the range to be replaced. In b) we still have to take care
of the errata part, but no worry about the stalled thread with invalid
IP.
> Thus you have to serialize, and if you are serializing it really doesn't
> matter if you write a byte, a paragraph or a page.
I was vaguely aware of the issue on x86. Do you know if this applies the
same on other architectures?
Also, this is SMP-only, right? (Not that single UP matters for desktop
anymore, but just checking.)
Any pointers to the errata?
Karim
--
President / Opersys Inc.
Embedded Linux Training and Expertise
http://www.opersys.com / 1.866.677.4546
Hi Alan,
On Wed, Sep 20, 2006 at 01:08:45AM +0100, Alan Cox wrote:
> Ar Maw, 2006-09-19 am 13:54 -0400, ysgrifennodd Mathieu Desnoyers:
> > Very good idea.. However, overwriting the second instruction with a jump could
> > be dangerous on preemptible and SMP kernels, because we never know if a thread
> > has an IP in any of its contexts that would return exactly at the middle of the
> > jump.
>
> No: on x86 it is the *same* case for all of these even writing an int3.
> One byte or a megabyte,
>
> You MUST ensure that every CPU executes a serializing instruction before
> it hits code that was modified by another processor. Otherwise you get
> CPU errata and the CPU produces results which vendors like to describe
> as "undefined".
Are you referring to Intel erratum "unsynchronized cross-modifying code"
- where it refers to the practice of modifying code on one processor
where another has prefetched the unmodified version of the code?
Thanks
Prasanna
>
> Thus you have to serialize, and if you are serializing it really doesn't
> matter if you write a byte, a paragraph or a page.
>
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329
* Alan Cox ([email protected]) wrote:
> Ar Maw, 2006-09-19 am 13:54 -0400, ysgrifennodd Mathieu Desnoyers:
> > Very good idea.. However, overwriting the second instruction with a jump could
> > be dangerous on preemptible and SMP kernels, because we never know if a thread
> > has an IP in any of its contexts that would return exactly at the middle of the
> > jump.
>
> No: on x86 it is the *same* case for all of these even writing an int3.
> One byte or a megabyte,
>
> You MUST ensure that every CPU executes a serializing instruction before
> it hits code that was modified by another processor. Otherwise you get
> CPU errata and the CPU produces results which vendors like to describe
> as "undefined".
>
> Thus you have to serialize, and if you are serializing it really doesn't
> matter if you write a byte, a paragraph or a page.
>
Hi Alan,
What I am trying to address is not "code patching with INT3", but "code patching
with a 5-byte JMP". The erratum you point to applies to both, and the kprobes
mechanism already takes care of it with the serialization method you describe.
However, there is an additional problem because a JMP is 5 bytes,
not 1. You are right that overwriting code with any number of
*int3* instructions does not matter, but what happens when you put one or more
5-byte jumps instead?
Think about it: if you are replacing 1-, 2-, 3- or 4-byte instructions and,
unluckily, on any stack of any thread preempted on any CPU, you have a
saved instruction pointer pointing into the middle of the region where you want
to put the 5-byte JMP, the processor will likely trigger an illegal
instruction fault when that particular thread is scheduled back.
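To make the hazard concrete, here is a minimal sketch in plain user-space C
(not kernel code) of the condition that would have to be false for every saved
instruction pointer before the range is rewritten. The helper name
ip_inside_patch is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when a saved instruction pointer would resume
 * strictly inside [start, start + len), i.e. neither at the first byte of
 * the range nor at or past its end. Resuming there after the range has been
 * rewritten with one longer instruction means decoding from the middle of
 * that instruction.
 */
static bool ip_inside_patch(uintptr_t ip, uintptr_t start, size_t len)
{
	return ip > start && ip < start + len;
}
```

A thread whose saved IP equals start or start + len is fine; only the bytes in
between are dangerous.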
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
[email protected] wrote on 20/09/2006 02:08:52:
> Hi Alan,
>
> On Wed, Sep 20, 2006 at 01:08:45AM +0100, Alan Cox wrote:
> > Ar Maw, 2006-09-19 am 13:54 -0400, ysgrifennodd Mathieu Desnoyers:
> > > Very good idea.. However, overwriting the second instruction with a jump could
> > > be dangerous on preemptible and SMP kernels, because we never know if a thread
> > > has an IP in any of its contexts that would return exactly at the middle of the
> > > jump.
> >
> > No: on x86 it is the *same* case for all of these even writing an int3.
> > One byte or a megabyte,
> >
> > You MUST ensure that every CPU executes a serializing instruction before
> > it hits code that was modified by another processor. Otherwise you get
> > CPU errata and the CPU produces results which vendors like to describe
> > as "undefined".
>
> Are you referring to Intel erratum "unsynchronized cross-modifying code"
> - where it refers to the practice of modifying code on one processor
> where another has prefetched the unmodified version of the code.
>
> Thanks
> Prasanna
In the special case of replacing an opcode with int3 that erratum doesn't
apply. I know that's not in the manuals but it has been confirmed by the
Intel microarchitecture group. And it's not reasonable for it to be any
other way.
--
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072
S. P. Prasanna wrote:
>>
>> Yes, that's simple. but slower, as you have a double jump. Probably
>> a damned sight faster than int3 though.
>>
>> M.
>>
>
> The advantage of using int3 over jmp to launch the instrumented
> module is that int3 (or breakpoint in most architectures) is an
> atomic operation to insert.
>
Yes, 5 bytes is not an atomic write except on 64-bit. So a race is possible.
How about this workaround:
1. Overwrite the start of the function with a hlt, which is atomic.
2. Write that 5-byte jump after the hlt.
3. Overwrite the hlt with nop so things will work
4. interrupt any cpus that got stuck on the hlt - or just wait for the
timer.
Helge Hafting
Ar Mer, 2006-09-20 am 11:39 +0200, ysgrifennodd Helge Hafting:
> Yes, 5 bytes is not an atomic write except on 64-bit. So a race is possible.
Untrue as well. Pentium and later have CMPXCHG8B.
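As an illustration of the 8-byte compare-and-swap point, a rough user-space
sketch follows. It uses the GCC/Clang __atomic builtins on an ordinary data
buffer; patch5_in_word is a hypothetical name and the slot is assumed to be
8-byte aligned. It only shows that five bytes can be stored without tearing.
It does nothing about the serialization requirement or about a thread
resuming in the middle of the range.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Replace the first 5 bytes (lowest addresses) of the 8-byte word at *slot
 * with insn[0..4] using a single 64-bit compare-and-swap, leaving the last
 * 3 bytes as they were. Returns false if the word changed between the load
 * and the swap, in which case nothing is written.
 */
static bool patch5_in_word(uint64_t *slot, const uint8_t insn[5])
{
	uint64_t old = __atomic_load_n(slot, __ATOMIC_RELAXED);
	uint64_t updated = old;

	/* Overwrite the 5 lowest-addressed bytes of the local copy. */
	memcpy(&updated, insn, 5);

	return __atomic_compare_exchange_n(slot, &old, updated, false,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```

On 32-bit x86 a compiler would typically implement this swap with a locked
cmpxchg8b; that is an expectation about code generation, not something this
sketch checks.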
> How about this workaround:
> 1. Overwrite the start of the function with a hlt, which is atomic.
> 2. Write that 5-byte jump after the hlt.
> 3. Overwrite the hlt with nop so things will work
> 4. interrupt any cpus that got stuck on the hlt - or just wait for the
> timer.
CPU errata time again. You have to synchronize.
Alan
Ar Mer, 2006-09-20 am 09:18 +0100, ysgrifennodd Richard J Moore:
> > Are you referring to Intel erratum "unsynchronized cross-modifying code"
> > - where it refers to the practice of modifying code on one processor
> > where another has prefetched the unmodified version of the code.
> In the special case of replacing an opcode with int3 that erratum doesn't
> apply. I know that's not in the manuals but it has been confirmed by the
> Intel microarchitecture group. And it's not reasonable for it to be any
> other way.
OK, that's cool to know and I wish they'd documented it. Is the same true
for AMD?
Alan
Ar Maw, 2006-09-19 am 20:52 -0400, ysgrifennodd Karim Yaghmour:
> a) the errata & a possible thread having an IP leading back within (not
> at the start of) the range to be replaced.
> b) the errata & replacing single instruction with single instruction of
> same size.
Intel don't distinguish. Richard's reply later in the thread answers a
lot more, including what Intel's architecture team said about int3 being a
specific safe case for some reason.
> I was vaguely aware of the issue on x86. Do you know if this applies the
> same on other architectures?
I wouldn't know.
> Also, this is SMP-only, right? (Not that single UP matters for desktop
> anymore, but just checking.)
There are some uniprocessor errata but I cannot see how you could patch
code, somehow take an interrupt (or return from one) without executing a
serializing instruction, so I likewise think it's SMP-only.
> Any pointers to the errata?
developer.intel.com 'specification update' documents (which are always
good reading).
Hi,
S. P. Prasanna wrote:
> A somewhat more complicated method:
> how about inserting an (instruction size) number of breakpoints and
> waiting until all the threads get scheduled at least once (so that
> threads would hit the breakpoint if their IPs are in the middle of
> the instruction we want to replace with a jump) and then replacing it
> with the jump instruction?
I think there is no need to insert so many breakpoints.
Instead, you merely wait until all the threads that were
running on each processor at that time get scheduled, if the kernel
is *NOT* preemptive.
If the kernel is preemptive, some threads might be sleeping at the target
address. In this case, we can use freeze_processes() to ensure safety.
This idea was proposed by Ingo.
Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
Ingo Molnar <[email protected]> writes:
>
> yeah, this would be nice - if it weren't for function pointers, and if
> all kernel functions were relocatable. But if you can think of a method
> to do this, it would be nice.
x86-64 did it for some time statically to replace mem copies and some other
functions. Basically it just patches the beginning of the other function
to a jump. However this assumes that the code doesn't contain absolute addresses
(e.g. no switches). On x86-64 it's easy because only assembly functions
are treated this way.
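For reference, patching the beginning of a function to a jump on x86 comes
down to writing an E9 opcode followed by a 32-bit displacement measured from
the end of the jump. A small sketch, using 32-bit addresses and a made-up
helper name (encode_jmp_rel32); this is not the actual x86-64 code:

```c
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_SIZE   5

/*
 * Encode "jmp target" as it would sit at address `from`.
 * The displacement is relative to the address of the next instruction,
 * i.e. from + 5, and is stored little-endian after the opcode.
 */
static void encode_jmp_rel32(uint8_t out[JMP_REL32_SIZE],
			     uint32_t from, uint32_t target)
{
	uint32_t rel = target - (from + JMP_REL32_SIZE); /* wraps modulo 2^32 */

	out[0] = JMP_REL32_OPCODE;
	out[1] = (uint8_t)(rel & 0xff);
	out[2] = (uint8_t)((rel >> 8) & 0xff);
	out[3] = (uint8_t)((rel >> 16) & 0xff);
	out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

Because the displacement is relative, the copied function body must not depend
on its own absolute address, which is the restriction mentioned above.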
-Andi
Alan Cox <[email protected]> writes:
> Ar Mer, 2006-09-20 am 09:18 +0100, ysgrifennodd Richard J Moore:
> > > Are you referring to Intel erratum "unsynchronized cross-modifying code"
> > > - where it refers to the practice of modifying code on one processor
> > > where another has prefetched the unmodified version of the code.
>
> > In the special case of replacing an opcode with int3 that erratum doesn't
> > apply. I know that's not in the manuals but it has been confirmed by the
> > Intel microarchitecture group. And it's not reasonable for it to be any
> > other way.
>
> Ok thats cool to know and I wish they'd documented it. Is the same true
> for AMD ?
It pretty much has to, otherwise lots of debuggers would be unhappy.
-Andi
Hi,
Mathieu Desnoyers wrote:
> Hello,
>
> Following this huge discussion thread, I tried to come with a marker mechanism
> (which is something everyone seems to agree that is a necessity) that would be
> useful to each kind of tracing (dynamic and static) (concerned projects :
> SystemTAP, LKET, LKST, LTTng) and even combinations of those. Religious
> considerations aside, I really think that this kind of generic markup is
> necessary to fill *everybody*'s need. If I forgot about a specific genericity
> aspect, please tell me.
>
> I take for agreed that both static and dynamic tracing are useful for different
> needs and that a full markup must support both and combinations, letting the
> user or the distribution choose.
Basically, I like this static marker concept.
But I wonder why you wouldn't use the architecture-independent
marker which SystemTap already supports.
NOPs depend heavily on the architecture and are hard
to port.
Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
Hi,
Alan Cox wrote:
> Ar Mer, 2006-09-20 am 11:39 +0200, ysgrifennodd Helge Hafting:
>> How about this workaround:
>> 1. Overwrite the start of the function with a hlt, which is atomic.
>> 2. Write that 5-byte jump after the hlt.
>> 3. Overwrite the hlt with nop so things will work
>> 4. interrupt any cpus that got stuck on the hlt - or just wait for the
>> timer.
>
> CPU errata time again. You have to synchronize.
Sure, and the djprobe method that I developed handles it as below:
1. Overwrite the 1st instruction with int3. (atomic)
2. Wait until all processes running on every cpu are scheduled.
(I'm using synchronize_sched(). This step ensures no one is executing
the instructions which will be overwritten by the dest-addr.)
3. Write the destination address.
4. Interrupt all cpus to serialize their caches (using CPUID).
5. Overwrite the int3 with jmp opcode. (atomic)
In this method, the instructions are updated like below:
0. [ insn1 ][ insn2]
1. [int3]1 ][ insn2]
2. wait
3. [int3][ destaddr]
4. sync
5. [jmp to destaddr]
Actually, #2 is not enough for a preemptive kernel. So, the current
djprobe doesn't support CONFIG_PREEMPT. But Ingo proposed some
good ideas (use freeze_processes()). I'll try his ideas.
What do you think about djprobe's method?
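The five steps above can be sketched on a plain byte buffer. This is a
user-space illustration only: wait_for_all_cpus_to_schedule() and
serialize_all_cpus() are empty, hypothetical stand-ins for synchronize_sched()
and the IPI + CPUID step, and patch_site is not djprobe's real interface.

```c
#include <stdint.h>
#include <string.h>

enum { SITE_LEN = 5, OP_INT3 = 0xCC, OP_JMP_REL32 = 0xE9 };

/* Hypothetical stand-in for step 2; a no-op in this sketch. */
static void wait_for_all_cpus_to_schedule(void) { }

/* Hypothetical stand-in for step 4; a no-op in this sketch. */
static void serialize_all_cpus(void) { }

/*
 * Turn the 5 bytes at `site` into "jmp rel32" in the order described above:
 * the first byte is an int3 for as long as the other four are in flux.
 */
static void patch_site(uint8_t site[SITE_LEN], const uint8_t rel32[4])
{
	site[0] = OP_INT3;               /* 1. one-byte store            */
	wait_for_all_cpus_to_schedule(); /* 2. nobody left inside 1..4   */
	memcpy(&site[1], rel32, 4);      /* 3. write the destination     */
	serialize_all_cpus();            /* 4. sync before the final flip */
	site[0] = OP_JMP_REL32;          /* 5. one-byte store            */
}
```

Both visible transitions of the first byte (insn to int3, int3 to jmp) are
single-byte stores, which is what the scheme relies on.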
Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
Hi Karim,
Karim Yaghmour wrote:
> Martin Bligh wrote:
>> be that many? Still doesn't fix the problem Mathieu just pointed
>> out though. Humpf.
>
> There's one possibility if we're willing to insert a placeholder
> at function entry that allows to essentially do what Andrew
> suggests without much impact. Specifically, if you need a 5-byte
> operation to jump to the alternate instrumented function, you
> can then do something like:
This method is very similar to djprobe.
And I had the same idea for supporting preemptive kernels.
> 1- At build time insert 5-byte unconditional jump to instruction
> right after placeholder.
This means the code below, doesn't it?
---
jmp 1f /* short jump consumes 2 bytes */
nop
nop
nop
1:
---
> 2- At runtime for diverting flow:
> - Replace first byte with int3 (atomically)
> - Replace next 4 bytes with instrumented function destination
- Serialize all processor's cache by using IPI and cpuid.
> - Replace first byte
> 3- At runtime for returning flow:
> - Do #2 but for the original placeholder jump.
I think djprobe can provide most of the functionality that
your idea requires.
I'll update djprobe for 2.6.17 or later as soon as
possible. Would you try using it?
Thanks,
--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
* Masami Hiramatsu ([email protected]) wrote:
> Hi,
>
> Mathieu Desnoyers wrote:
> > Hello,
> >
> > Following this huge discussion thread, I tried to come with a marker mechanism
> > (which is something everyone seems to agree that is a necessity) that would be
> > useful to each kind of tracing (dynamic and static) (concerned projects :
> > SystemTAP, LKET, LKST, LTTng) and even combinations of those. Religious
> > considerations aside, I really think that this kind of generic markup is
> > necessary to fill *everybody*'s need. If I forgot about a specific genericity
> > aspect, please tell me.
> >
> > I take for agreed that both static and dynamic tracing are useful for different
> > needs and that a full markup must support both and combinations, letting the
> > user or the distribution choose.
>
> Basically, I like this static marker concept.
> But I wonder why you wouldn't use the architecture-independent
> marker which SystemTap already supports.
> NOPs depend heavily on the architecture and are hard
> to port.
>
Hi Masami,
Are you talking about the marker presented by Frank in his OLS paper (
void dest() = NULL; if(dest) dest()) ? I think it is a very good idea to use it
instead of nops.
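A minimal sketch of that style of marker, assuming a single hook per marker
site. The names (marker_handler_t, sched_hook, marked_code_path, count_hit)
are invented here and are not taken from SystemTap or from the OLS paper:

```c
#include <stddef.h>

typedef void (*marker_handler_t)(int value);

/* Hypothetical per-marker hook; NULL means no probe is attached. */
static marker_handler_t sched_hook = NULL;

/* Example handler, only here so the sketch has something to attach. */
static int hits;
static void count_hit(int value)
{
	hits += value;
}

/* The marked code path: one load and one test when nothing is attached. */
static void marked_code_path(int value)
{
	marker_handler_t handler = sched_hook; /* read the hook once */

	if (handler)
		handler(value);
}
```

Attaching is then just "sched_hook = count_hit;" and detaching is
"sched_hook = NULL;", with no code patching and nothing architecture-specific.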
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Alan Cox <[email protected]> wrote on 20/09/2006 11:32:07:
> Ar Mer, 2006-09-20 am 09:18 +0100, ysgrifennodd Richard J Moore:
> > > Are you referring to Intel erratum "unsynchronized cross-modifying code"
> > > - where it refers to the practice of modifying code on one processor
> > > where another has prefetched the unmodified version of the code.
>
> > In the special case of replacing an opcode with int3 that erratum doesn't
> > apply. I know that's not in the manuals but it has been confirmed by the
> > Intel microarchitecture group. And it's not reasonable for it to be any
> > other way.
>
> Ok thats cool to know and I wish they'd documented it. Is the same true
> for AMD ?
>
> Alan
>
Not sure; probably. I can ask.
Intel explained it to me thus:
When the i-fetch has been done and the micro-ops are in the trace cache
then there's no longer a direct correlation between the original machine
instruction boundaries and the micro ops. This is due to optimization. For
example (artificial one for illustrative purposes):
mov eax,ebx
mov memory,eax
mov eax,1
(using intel notation not ATT - force of habit)
In the trace cache there would be no micro ops to update eax with ebx.
Altering the "mov eax,ebx" to "mov ecx,ebx" on the fly invalidates the
optimized trace cache, hence the only recourse is a GPF.
If the modification doesn't invalidate the trace cache then no GPF. The
question is: "can we predict the circumstances when the trace cache has not
been invalidated", and the answer in general is no since the
microarchitecture is not public. But one can guess that modifying the single
byte opcode with an interrupting instruction - int3 - doesn't cause an
inconsistency that can't be handled. And that's what Intel confirmed. Go
ahead and store int3 without the need to synchronise (i.e. force the trace
cache to be flushed).
My guess is that AMD behaves exactly the same way. But I'll check.
Richard
Masami Hiramatsu wrote:
> This method is very similar to djprobe.
> And I had the same idea for supporting preemptive kernels.
...
> This means the code below, doesn't it?
> ---
> jmp 1f /* short jump consumes 2 bytes */
> nop
> nop
> nop
> 1:
Actually this is slightly different (and requires more support
on behalf of the underlying mechanism than what I was suggesting.)
Basically, as was discussed elsewhere, there are some complex
mechanisms required for taking care of the case where you got
an interrupt at, say, the second or third nop. With the
mechanism I'm suggesting (replacing a 5 byte jmp with a 5 byte
jmp), the underlying mechanics do not require having to take
care of the above-mentioned case.
> - Serialize all processor's cache by using IPI and cpuid.
Yes.
> I think djprobe can provide most of the functionality that
> your idea requires.
> I'll update djprobe for 2.6.17 or later as soon as
> possible. Would you try using it?
Basically I'm trying to come up with a mechanism that will be
relatively trivial to implement on any architecture. My
understanding is that kprobes/djprobes combo do not necessarily
fit this description. Of course, that's not a justification for
not trying to get it to work, but my understanding is that
Martin's proposal, if it were implemented, would have a number
of advantages over just having kprobes/djprobes.
Though, in fact, djprobes can be used on the x86 (since it
already works on that) for doing exactly what I'm looking
for: replacing a 5 byte jmp with a 5 byte jmp. My understanding
is that djprobes doesn't need any special intelligence (even
on preemptable kernels) here since it shouldn't need to worry
about an IP back anywhere inside a series of nops. IOW, we
should be able to do what Martin suggests fairly easily (if
we agree on a 5-byte "null" jump at the entry of functions
of interest). Right?
Karim
* Karim Yaghmour ([email protected]) wrote:
>
> Masami Hiramatsu wrote:
> > This method is very similar to djprobe.
> > And I had the same idea for supporting preemptive kernels.
> ...
> > This means the code below, doesn't it?
> > ---
> > jmp 1f /* short jump consumes 2 bytes */
> > nop
> > nop
> > nop
> > 1:
>
> Actually this is slightly different (and requires more support
> on behalf of the underlying mechanism than what I was suggesting.)
> Basically, as was discussed elsewhere, there are some complex
> mechanisms required for taking care of the case where you got
> an interrupt at, say, the second or third nop. With the
> mechanism I'm suggesting (replacing a 5 byte jmp with a 5 byte
> jmp), the underlying mechanics do not require having to take
> care of the above-mentioned case.
>
Karim, the jmp already there targets the end of the region: no possible
execution of the three following nops. Clever :)
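Spelled out as bytes, the filler would be a short jmp (opcode EB with an 8-bit
displacement of 3) followed by three one-byte nops (90). A small sketch,
assuming that encoding, showing that execution resumes at offset 5 and never
at offsets 2 to 4:

```c
#include <stdint.h>

/* jmp 1f ; nop ; nop ; nop ; 1:   (5 bytes in total) */
static const uint8_t filler[5] = { 0xEB, 0x03, 0x90, 0x90, 0x90 };

/*
 * Offset, from the start of `code`, at which execution continues after the
 * short jmp (EB disp8) located at offset 0. The displacement is signed and
 * counted from the end of the 2-byte instruction.
 */
static int short_jmp_target(const uint8_t *code)
{
	return 2 + (int8_t)code[1];
}
```

With this layout the only instruction boundary inside the 5 bytes that a CPU
can ever be at is offset 0, so a saved IP cannot point at the nops.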
Mathieu
OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Martin J. Bligh wrote:
> can still use kprobes/djprobes/bodilyprobes for the rest of the cases.
May I suggest we call your mechanism "bprobes" ... which stands for "branch"
probes of course ;)
Karim
Mathieu Desnoyers wrote:
> Karim, the jmp already there targets the end of the region: no possible
> execution of the three following nops. Clever :)
Must get more coffee ...
Karim
Hello Hiramatsu-san,
So here's a more intelligent answer than last time :)
Masami Hiramatsu wrote:
> This method is very similar to djprobe.
> And I had the same idea for supporting preemptive kernels.
...
> This means the code below, doesn't it?
> ---
> jmp 1f /* short jump consumes 2 bytes */
> nop
> nop
> nop
> 1:
> ---
YES, as pointed out by Mathieu, this does essentially the same.
And, yes, as mentioned earlier, this should work fine on
preemptable kernels.
> I think djprobe can provide most of the functionality that
> your idea requires.
Indeed.
Thanks,
Karim
Hi -
On Wed, Sep 20, 2006 at 01:21:52PM -0400, Karim Yaghmour wrote:
> [...] IOW, we should be able to do what Martin suggests fairly
> easily (if we agree on a 5-byte "null" jump at the entry of
> functions of interest). Right? [...]
My interpretation of Martin's Monday proposal is that, if implemented,
we wouldn't need any of this nop/int3 stuff. If function being
instrumented were recompiled on-the-fly, then it could sport plain &
direct C-level calls to the instrumentation handlers.
- FChE
Hello Frank,
Frank Ch. Eigler wrote:
> My interpretation of Martin's Monday proposal is that, if implemented,
> we wouldn't need any of this nop/int3 stuff. If function being
> instrumented were recompiled on-the-fly, then it could sport plain &
> direct C-level calls to the instrumentation handlers.
Absolutely. I guess the length of these threads is just fertile
ground for misunderstandings. Basically what Hiramatsu-san and
myself were discussing was just the mechanism for selecting/
forking in between the uninstrumented function and the instrumented
one.
So, to recap:
If you had 100,000 instrumentation points in the scheduler (obviously
a totally bogus number here ...) you'd have 2 functions:
1- one with no instrumentation at all, but with a 5byte filler such
as the one presented by Hiramatsu-san.
2- one with the instrumentation.
Early in the proposal, the mechanics of switching in between "1" and "2"
seemed to be problematic, but I think with Hiramatsu-san's proposal
and, on the x86, djprobes, we've got it figured out.
Let me know if I'm not providing enough detail.
Thanks,
Karim
Frank Ch. Eigler wrote:
> Hi -
>
> On Wed, Sep 20, 2006 at 01:21:52PM -0400, Karim Yaghmour wrote:
>
>
>>[...] IOW, we should be able to do what Martin suggests fairly
>>easily (if we agree on a 5-byte "null" jump at the entry of
>>functions of interest). Right? [...]
>
>
> My interpretation of Martin's Monday proposal is that, if implemented,
> we wouldn't need any of this nop/int3 stuff. If function being
> instrumented were recompiled on-the-fly, then it could sport plain &
> direct C-level calls to the instrumentation handlers.
It's looking to me like it might still need djprobes to implement, in
order to get the atomic and safe switchover from the original function
into the traced one. All rather sad, but seems to be true from all the
CPU errata, etc. If anyone can see a way round that, I'd love to hear
it.
What it would give you above and beyond djprobes is an easier and more
flexible way to actually do the instrumentation itself.
M.
Martin Bligh wrote:
> It's looking to me like it might still need djprobes to implement, in
> order to get the atomic and safe switchover from the original function
> into the traced one. All rather sad, but seems to be true from all the
> CPU errata, etc. If anyone can see a way round that, I'd love to hear
> it.
But we don't need to fight the errata, there are fortunately solutions
that take care of it where it does exist (x86: djprobes/kprobes.)
What's more interesting, though, is that the method as it is proposed
at this stage *seems* to be easily portable to other archs. And where
such binary trickery is difficult to pull off, nothing precludes
having a universally "portable" mechanism including something akin to
switching between instrumented vs. normal function at function entry.
Even such conditional ifs can be optimized by the CPU nowadays.
The picture is, nevertheless, very bright at the moment (I think).
Just have a 5byte filler at function entry such as Hiramatsu-san
suggested, and use djprobes to fork to instrumented function. The
unconditional jump in the filler will most likely be utterly
unmeasurable, and benchmarks should confirm this.
So:
On x86: use 5byte filler and djprobes.
On "sane" archs: use filler and override as explained earlier.
Elsewhere: use standard "if" or function pointer at function entry.
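For the "elsewhere" case, the switch at function entry could look roughly like
this. All names are hypothetical, and __builtin_expect is the GCC/Clang hint
that the kernel's unlikely() macro wraps:

```c
#include <stdbool.h>

/* Hypothetical global switch, flipped when tracing is turned on. */
static bool use_instrumented;

/* Counter standing in for whatever the instrumentation would record. */
static unsigned long events;

/* Variant 1: the function body with no instrumentation at all. */
static int do_work_plain(int x)
{
	return x * 2;
}

/* Variant 2: same result, plus direct C-level instrumentation. */
static int do_work_instrumented(int x)
{
	events++;
	return x * 2;
}

/* Entry point: a predictable branch picks the variant. */
static int do_work(int x)
{
	if (__builtin_expect(use_instrumented, 0))
		return do_work_instrumented(x);
	return do_work_plain(x);
}
```

The cost when tracing is off is one load and one well-predicted branch per
call, which is the trade-off against the patched 5-byte filler.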
> What it would give you above and beyond djprobes is an easier and more
> flexible way to actually do the instrumentation itself.
Absolutely agree.
Karim
Karim Yaghmour wrote:
> Martin Bligh wrote:
>
>>It's looking to me like it might still need djprobes to implement, in
>>order to get the atomic and safe switchover from the original function
>>into the traced one. All rather sad, but seems to be true from all the
>>CPU errata, etc. If anyone can see a way round that, I'd love to hear
>>it.
>
>
> But we don't need to fight the errata, there are fortunately solutions
> that take care of it where it does exist (x86: djprobes/kprobes.)
> What's more interesting, though, is that the method as it is proposed
> at this stage *seems* to be easily portable to other archs. And where
> such binary trickery is difficult to pull off, nothing precludes
> having a universally "portable" mechanism including something akin to
> switching between instrumented vs. normal function at function entry.
> Even such conditional ifs can be optimized by the CPU nowadays.
>
> The picture is, nevertheless, very bright at the moment (I think).
> Just have a 5byte filler at function entry such as Hiramatsu-san
> suggested, and use djprobes to fork to instrumented function. The
> unconditional jump in the filler will most likely be utterly
> unmeasurable, and benchmarks should confirm this.
>
> So:
> On x86: use 5byte filler and djprobes.
> On "sane" archs: use filler and override as explained earlier.
> Elsewhere: use standard "if" or function pointer at function entry.
Do we even need the filler padding? I thought we could insert kprobes
at the beginning of any function without that ... it was only a
requirement for mid-function (sometimes). If we copy the whole function,
we don't even need that any more ...
if kprobes can do it, I don't see why djprobes can't ... after all, it
just seems to use kprobes to insert a jump, AFAICS.
M.
Martin Bligh wrote:
> Do we even need the filler padding? I thought we could insert kprobes
> at the beginning of any function without that ... it was only a
> requirement for mid-function (sometimes). If we copy the whole function,
> we don't even need that any more ...
>
> if kprobes can do it, I don't see why djprobes can't ... after all, it
> just seems to use kprobes to insert a jump, AFAICS.
I guess I must not be explaining myself properly.
The padding is for one purpose and one purpose only: having
a known-to-be-good location at the beginning of the
uninstrumented function for later using djprobes on. Once
you've got that, then you can indeed copy the entire
function and do whatever you want *without* using djprobes
or kprobes, but using direct calls.
If you don't have the padding, then you might find yourself in
a case where you're replacing bytes from multiple instructions
where something somewhere may have an IP within the replaced
range. And to get around that you have to pull a few magic
tricks *and* make a few assumptions. But if you replace a
5 bytes instruction (or the equivalent as in Hiramatsu-san's
proposal) with another 5 bytes instruction, none of that is
needed and djprobes can be used *today* to do that.
Using this, you've got an arguably non-existent penalty
for the function with the filler and a very fast jump to
the instrumented function. The best of both worlds
actually.
Let me know if I'm still not being clear.
Karim
Karim Yaghmour wrote:
> Martin Bligh wrote:
>
>>Do we even need the filler padding? I thought we could insert kprobes
>>at the beginning of any function without that ... it was only a
>>requirement for mid-function (sometimes). If we copy the whole function,
>>we don't even need that any more ...
>>
>>if kprobes can do it, I don't see why djprobes can't ... after all, it
>>just seems to use kprobes to insert a jump, AFAICS.
>
>
> I guess I must not be explaining myself properly.
>
> The padding is for one purpose and one purpose only: having
> a known-to-be-good location at the beginning of the
> uninstrumented function for later using djprobes on. Once
> you've got that, then you can indeed copy the entire
> function and do whatever you want *without* using djprobes
> or kprobes, but using direct calls.
>
> If you don't have the padding, then you might find yourself in
> a case where you're replacing bytes from multiple instructions
> where something somewhere may have an IP within the replaced
> range. And to get around that you have to pull a few magic
> tricks *and* make a few assumptions. But if you replace a
> 5 bytes instruction (or the equivalent as in Hiramatsu-san's
> proposal) with another 5 bytes instruction, none of that is
> needed and djprobes can be used *today* to do that.
>
> Using this, you've got an arguably non-existent penalty
> for the function with the filler and a very fast jump to
> the instrumented function. The best of both worlds
> actually.
>
> Let me know if I'm still not being clear.
You mean using the jump-over thing that was posted earlier?
I thought the CPU errata prevented doing that atomically
properly. From my understanding of the last 24 hours discussion,
it seemed like the ONLY thing we could do safely atomically was
insert an int3. Which sucks, frankly, but still.
Or are we talking about locking everyone in an NMI? Having
proposed that, I now think it doesn't work ... we still return
from it when it's done, and might be in the middle of the
instruction stream we just crapped on.
So, maybe I missed a bit of the conversation, or didn't understand
it, but I was trying to follow it pretty closely. Even with the
padding, I don't see how overwriting it is atomic ... they could
be off processing an interrupt / NMI or whatever when you were
in the midst of it.
One thing Michael (cc'ed) pointed out was the possibility of using
"jump to self" as a small marker instruction, where we set the
function in busy wait at the start as we overwrite the next few,
then overwrite the jump-to-self with a nop to liberate it again.
But I'm unconvinced that gets around the CPU errata Alan was
pointing to.
M.
Martin Bligh wrote:
> You mean using the jump-over thing that was posted earlier?
> I thought the CPU errata prevented doing that atomically
> properly. From my understanding of the last 24 hours discussion,
> it seemed like the ONLY thing we could do safely atomically was
> insert an int3. Which sucks, frankly, but still.
No. djprobes already does safely insert other stuff than just
int3, that's the whole point.
Here are the relevant postings by Hiramatsu-san:
http://marc.theaimsgroup.com/?l=linux-kernel&m=115875912510827&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=115875867519302&w=2
Unless there's something *I* fundamentally misunderstood from
Hiramatsu-san's implementation and input, djprobes can replace
the 5-byte filler with a 5-byte unconditional jump. IOW your
mechanism works, no int3s involved.
Karim
Alan Cox <[email protected]> wrote on 20/09/2006 11:44:29:
> Ar Maw, 2006-09-19 am 20:52 -0400, ysgrifennodd Karim Yaghmour:
> > a) the errata & a possible thread having an IP leading back within (not
> > at the start of) the range to be replaced.
> > b) the errata & replacing single instruction with single instruction of
> > same size.
>
> Intel don't distinguish. Richard's reply later in the thread answers a
> lot more, including what Intel's architecture team said about int3 being a
> specific safe case for some reason.
>
> > I was vaguely aware of the issue on x86. Do you know if this applies the
> > same on other architectures?
>
> I wouldn't know.
It can for another reason - score-boarding: that's where a byte being
stored assumes intermediate values due to the bits not being set
simultaneously. Generally this doesn't cause a problem because data across
processors is serialised for update by mutexes. However, when applied to
code all sorts of interesting instructions can execute before the bits
settle down. I haven't heard of this troubling Intel, but it does occur on
some current architectures.
Richard
Hi!
> > > > Very good idea.. However, overwriting the second instruction
> > > > with a jump could be dangerous on preemptible and SMP kernels,
> > > > because we never know if a thread has an IP in any of its
> > > > contexts that would return exactly at the middle of the jump.
> > >
> > > No: on x86 it is the *same* case for all of these even writing an int3.
> > > One byte or a megabyte,
> > >
> > > You MUST ensure that every CPU executes a serializing instruction
> > > before it hits code that was modified by another processor.
> > > Otherwise you get CPU errata and the CPU produces results which
> > > vendors like to describe as "undefined".
> >
> > Are you referring to the Intel erratum "unsynchronized cross-modifying
> > code", i.e. the practice of modifying code on one processor where
> > another has prefetched the unmodified version of the code?
>
> In the special case of replacing an opcode with int3 that erratum doesn't
> apply. I know that's not in the manuals but it has been confirmed by the
> Intel microarchitecture group. And it's not reasonable for it to be any
> other way.
What about replacing int3 with the old instruction (i.e. the marker
being deleted)?
Pavel
--
Thanks for all the (sleeping) penguins.
On Thu, 21 Sep 2006, Richard J Moore wrote:
>
> It can for another reason - score-boarding: that's where a byte being
> stored assumes intermediate values due to the bits not being set
> simultaneously. Generally this doesn't cause a problem because data across
> processors is serialised for update by mutexes. However, when applied to
> code all sorts of interesting instructions can execute before the bits
> settle down. I haven't heard of this troubling Intel, but it does occur on
> some current architectures.
I'd not heard of this phenomenon, and it worries me. There are places
in kernel code where we peek at some volatile variable (perhaps a long)
without locking, and expect to see it in any one of several well-defined
states. Are you saying that there are architectures supported by Linux,
on which we might see an "impossible" mix of states, due to score-boarding?
Hugh
Hugh Dickins <[email protected]> wrote on 23/09/2006 16:34:33:
> On Thu, 21 Sep 2006, Richard J Moore wrote:
> >
> > It can for another reason - score-boarding: that's where a byte being
> > stored assumes intermediate values due to the bits not being set
> > simultaneously. Generally this doesn't cause a problem because data
> > across processors is serialised for update by mutexes. However, when
> > applied to code all sorts of interesting instructions can execute
> > before the bits settle down. I haven't heard of this troubling Intel,
> > but it does occur on some current architectures.
>
> I'd not heard of this phenomenon, and it worries me. There are places
> in kernel code where we peek at some volatile variable (perhaps a long)
> without locking, and expect to see it in any one of several well-defined
> states. Are you saying that there are architectures supported by Linux,
> on which we might see an "impossible" mix of states, due to
> score-boarding?
>
> Hugh
These things tend not to be discussed in specific detail in the processor
reference manuals. If there are exposures they are generally covered by
blanket statements about the need to ensure correct serialization between
processors when reading from, and writing to, the same location. As far as
I am aware Linux is protected from such effects because we do use locks, or
serializing instructions, to protect the updating of variables that are
accessed by multiple processors. My guess is that the exposure to
score-boarding, if it exists at all, tends to be limited to concurrent
bitwise operations.
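For the lockless "peek" case Hugh asks about, one way to get the protection described above without a lock is to make both the update and the read explicitly atomic. The sketch below uses C11 <stdatomic.h> as a userspace stand-in; this is an assumption made for illustration, since kernel code would use its own primitives, and the state names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical state word with a few well-defined values. */
enum { STATE_IDLE = 0, STATE_BUSY = 1, STATE_DONE = 2 };

static _Atomic long state = STATE_IDLE;

/* Writer: publish the new state as one indivisible store. */
static void state_set(long new_state)
{
    assert(new_state >= STATE_IDLE && new_state <= STATE_DONE);
    atomic_store_explicit(&state, new_state, memory_order_release);
}

/* Reader: peek without taking a lock; the load returns a whole value. */
static long state_peek(void)
{
    return atomic_load_explicit(&state, memory_order_acquire);
}
```

With this arrangement a reader that peeks at the variable sees one of the values some writer stored, never a mixture of two of them.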
Richard