2006-09-24 07:49:03

by Karim Yaghmour

[permalink] [raw]
Subject: Does this work? "dcprobes" an x86-hack simple djprobes-equivalent?


Richard, Hiramatsu-san,

First, I _don't know if the following works_, I just got a
hunch, and I thought I'd run this by you since you've both
done extensive work on binary patching on x86. If this works,
though, I think this would provide much of the same advantages
as djprobes without any of its disadvantages. Note: this is
very much an x86 hack, but my understanding is that djrobes'
difficulties specifically stem from the variable-length
instruction set found on the x86, so other archs should be
easier to adapt.

The problem I'm trying to solve:

Since its inception, djprobes has run into quite a few problems
in trying to replace arbitrary instruction spans because of
the inherent problem of squashing an instruction to which
someone/somewhere may sometime in the future directly jump
(say insert 5 byte jump over a sequence of bytes having multiple
instructions.) Solving this requires making quite a few
assumptions about the program. Let's just say that, as I said
elsewhere, I'm not too comfortable with the idea of letting
djprobes replace any instruction that is less than 5 bytes --
but as I explained elsewhere, it can be very useful in cases
where the replaced instruction stream is tailored for being
replaced by djprobes; see my earlier "bprobes" proposal.

So here's what I got in mind:

There is, though, one single case that I can think of where
replacing a multi-byte span will just work (in as far as my
current analysis allows me to see) without caring if anything
anywhere will ever jump back into the span -- regardless of
whether preemption is active or not. If fact, there would be
no need to freeze any process or try to look up any process'
stack while doing this.

As you mentioned earlier Richard, we can just go ahead and
stick an int3 almost anywhere. Now, the opcode for that is
CC. So what if we tried this:
- For address starting at A and spanning S bytes (where S cannot
be smaller than 7), create entry in hash table for that
replacement target, which includes a copy of the bytes found
in that span.
- Starting at A, insert S int3s. That is if these are 7 bytes,
we now have:
CC CC CC CC CC CC CC
- Replace first byte with "far call" OP:
9A CC CC CC CC CC CC

Hence the name, dcprobes: dynamic call probes.

So now, we've got a far call to 0xCCCC,0xCCCCCCCC at every probe point.
And since this is a call, the caller's EIP is on the stack, and
we can then mux based on that EIP to direct to the appropriate
callback, possibly using the hash table. And because every
component of the address is actually int3, we don't really care
if anybody has a reference to any byte within A+1, A+2, A+3, etc.
because as soon as that thing runs, it'll trigger an int3 where
we'll be able to detect that using the hash table, fix things
up and the system continue running. That fixes one of the major
problems with djprobes since we would be able to arbitrarily
insert probe points anywhere without having to make any
assumptions about the code.

Of course, this means hardwiring a multiplexing function at
0xCCCC,0xCCCCCCCC, if that makes any sense (offset 0xCCCCCCCC
of code segment entry 7,099 of the LDT with an RPL of 1).

[ If the far call thing didn't work, we could resort to a
"standard" call (E8 instead of 9A and 5 bytes instead of 7) and
then hack up the destination address based on the current
location to make sure that at offset current+0xCCCCCCCC
something meaningful is there ... with the appropriate
protection ... this may be simpler than the far call, but may
require a specially-built kernel that changes the layout of
the memory map for user-space ... ]

For removal, things are slightly more complicated. We can go
ahead and replace the old bytecode back so long as we do it
operands first and opcodes last starting by the opcode of the
last instruction in the stream. In that way, no freezing of
tasks is necessary since no instruction will actually execute
until the opcode is back. And since the addresses where valid
opcodes were expected did, at all times, have valid opcodes, then
there is no need to make any educated guesses about the system
or do anything about references tasks/threads may have -- making
this mechanism workable even for assembly and preemptable kernels.

If replacing the bytecode back as explained above is too
complicated (since it requires some disassembly), we could just
suspend all processes, force all CPUs to rendez-vous, modify the
byte span, and let everything run. Again, without needing to look
up any tasks' stack or anything like that.

Also, because this is a call, we can keep track of threads that
go in the multiplexer and therefore implement some form of
locking in order to preserve the ability to remove the probe
sanely.

If this works, djprobes should be easily modified for this,
and I'm sure the djprobes team has got ample experience to do
this.

Of course, if it works. Hence my question: does this make sense?

[ Obviously, this is not a replacement for what Mathieu is working
on right now, but it would be useful for significantly speeding up
the generic SystemTap probes, which was one of djprobes' much
anticipated goals. ]

Karim
--
President / Opersys Inc.
Embedded Linux Training and Expertise
http://www.opersys.com / 1.866.677.4546


2006-09-25 04:25:51

by Karim Yaghmour

[permalink] [raw]
Subject: Re: Does this work? "dcprobes" an x86-hack simple djprobes-equivalent?


Slight binary typo ...

Karim Yaghmour wrote:
> Of course, this means hardwiring a multiplexing function at
> 0xCCCC,0xCCCCCCCC, if that makes any sense (offset 0xCCCCCCCC
> of code segment entry 7,099 of the LDT with an RPL of 1).

Actually 0xCCCC is code segment entry 6,553 of the LDT with an
RPL of 0.

Karim
--
President / Opersys Inc.
Embedded Linux Training and Expertise
http://www.opersys.com / 1.866.677.4546

Subject: Re: Does this work? "dcprobes" an x86-hack simple djprobes-equivalent?

Hi Karim,

Thank you for new idea.
I discussed your proposal deeply with my coworkers.

I think your approach has following advantages/disadvantages/problem;
<advantages>
(a) Able to be inserted into the target address of the branch.
(b) So, binary analysis tool becomes simple.
<disadvantages>
(c) Implementation is much complicated.
(d) Highly depend on the x86 arch.
(e) Bigger overhead than djprobe.
(f) There will be side effect(*)
<problem>
(g) User applications can modify LDT. (ex. wine)

I think the dcprobe will work, but, unfortunately, it has
an vulnerability by the problem (g).

(*) In the following code:
---
a=0
do {
...
a++;
}while (a <= 100)
---
In case of inserting dcprobe at the 1st line (a=0),
it will replace 2nd (or more) instructions.
In this case, the fix up routine (based on int3)
will be invoked one hundred times.

Thanks,



--
Masami HIRAMATSU
2nd Research Dept.
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]