2005-11-08 04:42:08

by Zachary Amsden

Subject: [PATCH 19/21] i386 Kprobes semaphore fix

IA-32 linear address translation is loads of fun.

While cleaning up the LDT code, I noticed that the kprobes code was very bogus
with respect to segment handling. Many, many bugs are fixed here. I chose
to combine the three separate functions that try to do linear address
conversion into one nice, working function. All of the versions had
bugs.

1) Taking an int3 from v8086 mode could cause the kprobes code to read a
non-existent LDT.

2) The CS value was not truncated to 16 bits, which could cause an access
beyond the bounds of the LDT.

3) The LDT was being read without taking the mm->context semaphore, which
means bogus and/or non-existent vmalloc()ed pages could be read.

4) 16-bit code segments do not truncate EIP to 16 bits; it is perfectly
valid to issue an instruction at 0xffff, and there is no wraparound
of EIP in protected mode.

5) V8086 mode does truncate EIP to 16 bits.

6) Taking the mm->context semaphore requires interrupts to be enabled.

7) Do not assume the GDT TLS descriptors are flat.

8) Racy testing of segment access rights without holding the LDT semaphore.

9) The segment limit for V8086 code is the USER limit.

Kprobes was still broken; it would try to read userspace memory directly.
Since I'm already here, I might as well fix that too.
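
To make the two translation rules concrete, here is a minimal illustrative
sketch (not the patch code below; the function name, the seg_base parameter,
and its caller are made up for illustration, and VM_MASK comes from
<asm/vm86.h>): in v8086 mode the linear address is (CS << 4) plus the low
16 bits of IP, while in protected mode it is the descriptor base plus the
full, untruncated EIP.

	/*
	 * Illustrative only: seg_base is the CS descriptor base, which the
	 * real code must read under the LDT semaphore or with preemption
	 * disabled (see the patch below).
	 */
	static unsigned long eip_to_linear_sketch(unsigned long eip,
						  unsigned long cs,
						  unsigned long eflags,
						  unsigned long seg_base)
	{
		/* v8086 mode: real-mode style translation, IP wraps at 16 bits */
		if (eflags & VM_MASK)
			return ((cs & 0xffff) << 4) + (eip & 0xffff);
		/* Protected mode: base + EIP, no 16-bit wraparound of EIP */
		return seg_base + eip;
	}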

Signed-off-by: Zachary Amsden <[email protected]>
Index: linux-2.6.14-zach-work/arch/i386/kernel/kprobes.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/kernel/kprobes.c 2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/kernel/kprobes.c 2005-11-04 19:26:37.000000000 -0800
@@ -156,23 +156,25 @@ static int __kprobes kprobe_handler(stru
struct kprobe *p;
int ret = 0;
kprobe_opcode_t *addr = NULL;
- unsigned long *lp;
+ unsigned long limit;

- /* We're in an interrupt, but this is clear and BUG()-safe. */
- preempt_disable();
- /* Check if the application is using LDT entry for its code segment and
- * calculate the address by reading the base address from the LDT entry.
+ /*
+ * Getting the address may require getting the LDT semaphore, so
+ * wait to disable preempt since we may take interrupts here.
*/
- if ((regs->xcs & 4) && (current->mm)) {
- lp = (unsigned long *) ((unsigned long)((regs->xcs >> 3) * 8)
- + (char *) current->mm->context.ldt);
- addr = (kprobe_opcode_t *) (get_desc_base(lp) + regs->eip -
- sizeof(kprobe_opcode_t));
- } else {
- addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
- }
- /* Check we're not actually recursing */
- if (kprobe_running()) {
+ addr = (kprobe_opcode_t *)convert_eip_to_linear(regs,
+ regs->eip - sizeof(kprobe_opcode_t),
+ &current->mm->context, &limit);
+
+ /* Don't let userspace races re-address into kernel space */
+ if ((unsigned long)addr > limit)
+ return 0;
+
+ /* We're in an interrupt, but this is clear and BUG()-safe. */
+ preempt_disable();
+
+ /* Check we're not actually recursing */
+ if (kprobe_running()) {
/* We *are* holding lock here, so this is safe.
Disarm the probe we just hit, and ignore it. */
p = get_kprobe(addr);
@@ -209,13 +211,20 @@ static int __kprobes kprobe_handler(stru
lock_kprobes();
p = get_kprobe(addr);
if (!p) {
+ unsigned char instr;
unlock_kprobes();
if (regs->eflags & VM_MASK) {
/* We are in virtual-8086 mode. Return 0 */
goto no_kprobe;
}

- if (*addr != BREAKPOINT_INSTRUCTION) {
+ instr = BREAKPOINT_INSTRUCTION;
+ if (user_mode(regs))
+ __get_user(instr, (unsigned char __user *) addr);
+ else
+ instr = *addr;
+
+ if (instr != BREAKPOINT_INSTRUCTION) {
/*
* The breakpoint instruction was removed right
* after we hit it. Another cpu has removed
Index: linux-2.6.14-zach-work/arch/i386/kernel/ptrace.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/kernel/ptrace.c 2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/kernel/ptrace.c 2005-11-04 19:26:37.000000000 -0800
@@ -146,46 +146,76 @@ static unsigned long getreg(struct task_
return retval;
}

-static unsigned long convert_eip_to_linear(struct task_struct *child, struct pt_regs *regs)
+/*
+ * Get the GDT/LDT descriptor base. When you look for races in this code
+ * remember that LDT and other horrors are only used in user space. Must
+ * disable pre-emption while reading the GDT, and must take the LDT semaphore
+ * for LDT segments. The fast path handles standard kernel and user CS
+ * as well as V8086 mode.
+ */
+unsigned long convert_eip_to_linear_slow(unsigned long eip, unsigned long seg,
+ mm_context_t *context, unsigned long *eip_limit)
{
- unsigned long addr, seg;
+ unsigned long base, seg_limit;
+ u32 seg_ar;
+ struct desc_struct *desc;
+ unsigned long flags;

- addr = regs->eip;
- seg = regs->xcs & 0xffff;
- if (regs->eflags & VM_MASK) {
- addr = (addr & 0xffff) + (seg << 4);
- return addr;
+ if (segment_from_ldt(seg)) {
+ /*
+ * Horrors abound. Must enable IRQs to take the LDT
+ * semaphore. Ok, since LDT from a faulting CS can only
+ * be from userspace, or this is from ptrace operating
+ * on a child context directly from a system call.
+ * This unfortunate mess is needed to deal with int3
+ * kprobes which enter with IRQs disabled.
+ */
+ local_save_flags(flags);
+ local_irq_enable();
+ down(&context->sem);
+ desc = get_ldt_desc(context, seg);
+ } else {
+ /* Must disable preemption while reading the GDT. */
+ desc = get_gdt_desc(get_cpu(), seg);
+ flags = 0; /* silence compiler */
}

- /*
- * We'll assume that the code segments in the GDT
- * are all zero-based. That is largely true: the
- * TLS segments are used for data, and the PNPBIOS
- * and APM bios ones we just ignore here.
- */
- if (seg & LDT_SEGMENT) {
- u32 *desc;
- unsigned long base;
-
- down(&child->mm->context.sem);
- desc = child->mm->context.ldt + (seg & ~7);
- base = (desc[0] >> 16) | ((desc[1] & 0xff) << 16) | (desc[1] & 0xff000000);
-
- /* 16-bit code segment? */
- if (!((desc[1] >> 22) & 1))
- addr &= 0xffff;
- addr += base;
- up(&child->mm->context.sem);
+ /* Check the segment exists, is within the current LDT/GDT size,
+ that kernel/user (ring 0..3) has the appropriate privilege,
+ that it's a code segment, and get the limit. */
+ asm ("larl %3,%0; lsll %3,%1"
+ : "=&r" (seg_ar), "=r" (seg_limit) : "0" (0), "rm" (seg));
+ if ((~seg_ar & 0x9800) || eip > seg_limit) {
+ if (eip_limit)
+ *eip_limit = 0;
+ eip = 1; /* So that returned eip > *eip_limit. */
+ base = 0;
+ } else {
+ base = get_desc_base(desc);
+ eip += base;
+ seg_limit += base;
+ }
+
+ if (segment_from_ldt(seg)) {
+ up(&context->sem);
+ local_irq_restore(flags);
+ } else {
+ put_cpu();
}
- return addr;
+
+ /* Adjust EIP and segment limit, and clamp at the kernel limit.
+ It's legitimate for segments to wrap at 0xffffffff. */
+ if (eip_limit && seg_limit < *eip_limit && seg_limit >= base)
+ *eip_limit = seg_limit;
+
+ return eip;
}

static inline int is_at_popf(struct task_struct *child, struct pt_regs *regs)
{
int i, copied;
unsigned char opcode[16];
- unsigned long addr = convert_eip_to_linear(child, regs);
-
+ unsigned long addr = convert_eip_to_linear(regs, regs->eip, &child->mm->context, NULL);
copied = access_process_vm(child, addr, opcode, sizeof(opcode), 0);
for (i = 0; i < copied; i++) {
switch (opcode[i]) {
Index: linux-2.6.14-zach-work/arch/i386/mm/fault.c
===================================================================
--- linux-2.6.14-zach-work.orig/arch/i386/mm/fault.c 2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/arch/i386/mm/fault.c 2005-11-04 19:26:37.000000000 -0800
@@ -56,77 +56,6 @@ void bust_spinlocks(int yes)
console_loglevel = loglevel_save;
}

-/*
- * Return EIP plus the CS segment base. The segment limit is also
- * adjusted, clamped to the kernel/user address space (whichever is
- * appropriate), and returned in *eip_limit.
- *
- * The segment is checked, because it might have been changed by another
- * task between the original faulting instruction and here.
- *
- * If CS is no longer a valid code segment, or if EIP is beyond the
- * limit, or if it is a kernel address when CS is not a kernel segment,
- * then the returned value will be greater than *eip_limit.
- *
- * This is slow, but is very rarely executed.
- */
-static inline unsigned long get_segment_eip(struct pt_regs *regs,
- unsigned long *eip_limit)
-{
- unsigned long eip = regs->eip;
- unsigned seg = regs->xcs & 0xffff;
- u32 seg_ar, seg_limit, base, *desc;
-
- /* The standard kernel/user address space limit. */
- *eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
-
- /* Unlikely, but must come before segment checks. */
- if (unlikely((regs->eflags & VM_MASK) != 0))
- return eip + (seg << 4);
-
- /* By far the most common cases. */
- if (likely(seg == __USER_CS || seg == __KERNEL_CS))
- return eip;
-
- /* Check the segment exists, is within the current LDT/GDT size,
- that kernel/user (ring 0..3) has the appropriate privilege,
- that it's a code segment, and get the limit. */
- __asm__ ("larl %3,%0; lsll %3,%1"
- : "=&r" (seg_ar), "=r" (seg_limit) : "0" (0), "rm" (seg));
- if ((~seg_ar & 0x9800) || eip > seg_limit) {
- *eip_limit = 0;
- return 1; /* So that returned eip > *eip_limit. */
- }
-
- /* Get the GDT/LDT descriptor base.
- When you look for races in this code remember that
- LDT and other horrors are only used in user space. */
- if (seg & (1<<2)) {
- /* Must lock the LDT while reading it. */
- down(&current->mm->context.sem);
- desc = current->mm->context.ldt;
- desc = (void *)desc + (seg & ~7);
- } else {
- /* Must disable preemption while reading the GDT. */
- desc = (u32 *)get_cpu_gdt_table(get_cpu());
- desc = (void *)desc + (seg & ~7);
- }
-
- /* Decode the code segment base from the descriptor */
- base = get_desc_base(desc);
-
- if (seg & (1<<2)) {
- up(&current->mm->context.sem);
- } else
- put_cpu();
-
- /* Adjust EIP and segment limit, and clamp at the kernel limit.
- It's legitimate for segments to wrap at 0xffffffff. */
- seg_limit += base;
- if (seg_limit < *eip_limit && seg_limit >= base)
- *eip_limit = seg_limit;
- return eip + base;
-}

/*
* Sometimes AMD Athlon/Opteron CPUs report invalid exceptions on prefetch.
@@ -135,7 +64,8 @@ static inline unsigned long get_segment_
static int __is_prefetch(struct pt_regs *regs, unsigned long addr)
{
unsigned long limit;
- unsigned long instr = get_segment_eip (regs, &limit);
+ unsigned long instr = convert_eip_to_linear (regs, regs->eip,
+ &current->mm->context, &limit);
int scan_more = 1;
int prefetch = 0;
int i;
Index: linux-2.6.14-zach-work/include/asm-i386/ptrace.h
===================================================================
--- linux-2.6.14-zach-work.orig/include/asm-i386/ptrace.h 2005-11-04 19:25:27.000000000 -0800
+++ linux-2.6.14-zach-work/include/asm-i386/ptrace.h 2005-11-04 19:26:37.000000000 -0800
@@ -54,9 +54,11 @@ struct pt_regs {
#define PTRACE_GET_THREAD_AREA 25
#define PTRACE_SET_THREAD_AREA 26

-#ifdef __KERNEL__
+#if defined(__KERNEL__) && !defined(__arch_um__)

#include <asm/vm86.h>
+#include <asm/mmu.h>
+#include <asm/uaccess.h>

struct task_struct;
extern void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code);
@@ -82,6 +84,45 @@ extern unsigned long profile_pc(struct p
#else
#define profile_pc(regs) instruction_pointer(regs)
#endif
-#endif /* __KERNEL__ */
+
+/*
+ * Return EIP plus the CS segment base. The segment limit is also
+ * adjusted, clamped to the kernel/user address space (whichever is
+ * appropriate), and returned in *eip_limit.
+ *
+ * The segment is checked, because it might have been changed by another
+ * task between the original faulting instruction and here.
+ *
+ * If CS is no longer a valid code segment, or if EIP is beyond the
+ * limit, or if it is a kernel address when CS is not a kernel segment,
+ * then the returned value will be greater than *eip_limit.
+ *
+ * This is slow, but is very rarely executed.
+ */
+extern unsigned long convert_eip_to_linear_slow(unsigned long addr, unsigned long seg, mm_context_t *context, unsigned long *eip_limit);
+
+static inline unsigned long convert_eip_to_linear(struct pt_regs *regs, unsigned long addr, mm_context_t *context, unsigned long * const eip_limit)
+{
+ unsigned long seg;
+
+ seg = regs->xcs & 0xffff;
+
+ /* The standard kernel/user address space limit. */
+ if (eip_limit)
+ *eip_limit = user_mode(regs) ? USER_DS.seg : KERNEL_DS.seg;
+
+ if (regs->eflags & VM_MASK) {
+ addr = (addr & 0xffff) + (seg << 4);
+ return addr;
+ }
+
+ /* By far the most common cases. */
+ if (likely(seg == __USER_CS || seg == __KERNEL_CS))
+ return addr;
+
+ return convert_eip_to_linear_slow(addr, seg, context, eip_limit);
+}
+
+#endif /* defined(__KERNEL__) && !defined(__arch_um__) */

#endif


2005-11-08 13:12:36

by Andi Kleen

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
> IA-32 linear address translation is loads of fun.

Thanks for doing that audit work. Can you please double check that the x86-64
code is ok?

Actually, given all that complexity, maybe it would be better to just
stop handling the case and remove all that. I'm not sure what kprobes needs it
for - it doesn't even handle user space yet, and even if it ever does it is
unlikely that handling 16bit code makes much sense. And the prefetch
workaround does it, but 16bit DOS code is unlikely to contain prefetches
anyway. And for ptrace - well, who cares? I suppose dosemu has its own
debugger anyway, and it could be handled in user space (I suppose
they still have that code from 2.4 anyway).

> While cleaning up the LDT code, I noticed that kprobes code was very bogus
> with respect to segment handling. Many, many bugs are fixed here. I chose
> to combine the three separate functions that try to do linear address
> conversion into one, nice and working functions. All of the versions had
> bugs.
>
> 1) Taking an int3 from v8086 mode could cause the kprobes code to read a
> non-existent LDT.
>
> 2) The CS value was not truncated to 16 bit, which could cause an access
> beyond the bounds of the LDT.

That's a (small) security hole, isn't it?


-Andi

2005-11-08 13:26:13

by Zachary Amsden

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Ingo Molnar wrote:

>could these bugs lead to local rootholes, if kprobes are enabled?
>
>The threat model we care about is: administrator has a few dozen kprobes
>activated, which sample various aspects of the system. Unprivileged
>userspace tries to exploit this fact: it has full control over its own
>conduct (including the ability to create LDTs, vm86 mode and weird
>segments), but unprivileged userspace has no ability to
>activate/deactivate kprobes.
>
>could this cause problems?
>
>

Good - it was thinking along these lines that led me to this patch. It
is still possible to defeat the EIP checking if you are crafty enough,
because there is a window of opportunity from the time a fault is
induced until the LDT is read where another CPU can come along and
modify the LDT. Note the GDT does not have this problem, as remote
changes (via ptrace; I think I saw code that makes it possible) will not
take effect until the thread is rescheduled. But with kernel
preemption, even GDT-based segments have the race. Pardon my thinking
out loud - this is a very sticky situation.

Actually, perhaps this window could be closed; we will have a pending
LDT reload posted by IPI, or a reschedule (and thus TLS reload) pending,
and we can hold processing of that even while enabling interrupts.
There is a problem - the LDT will have already been modified, and we
don't always re-allocate a new page for it, so the old data may be
gone. Seems like we could find ways to close the race, but it seems
difficult to verify as correct, and may cost performance.

So my approach is to ignore the race; instead, I use limit checking.
Note the limit here is specified in linear (non-segmented) virtual
memory, and is clamped to MIN(limit,PAGE_OFFSET) for userspace probes.

+
+ /* Don't let userspace races re-address into kernel space */
+ if ((unsigned long)addr > limit)
+ return 0;
+

Also, I added this test:

+ if (user_mode(regs))
+ __get_user(instr, (unsigned char __user *) addr);
+ else
+ instr = *addr;
+


This saves the kernel from faulting due to "impersonated non-linear
breakpoints". Note the oops should have been caught and safely dealt
with anyway.

So there are still a couple of crafty tricks that can cause you to
re-address into kernel range EIPs, but limit checking protects us, and
in general, perhaps we need not worry about strict correctness of
kprobes for these edge cases. Now that the issue has my attention,
is it even possible or useful to set kprobes in a specific userspace
context, or should only kernel breakpoints be considered for kprobes?
Reading the code, it was not clear, and I could not find proper
documentation, so I left the semantics as I saw them.

If indeed, as I suspect, kprobes are only for kernel code, then the code
here can be greatly simplified by discarding user code segments right
away and doing flat EIP conversion.

Zach

2005-11-08 13:36:57

by Zachary Amsden

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Andi Kleen wrote:

>On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
>
>
>>IA-32 linear address translation is loads of fun.
>>
>>
>
>Thanks for doing that audit work. Can you please double check x86-64 code is
>ok?
>
>Actually giving all that complexity maybe it would be better to just
>stop handling the case and remove all that. I'm not sure what kprobes needs it
>for - it doesn't even handle user space yet and even if it ever does it is
>unlikely that handling 16bit code makes much sense. And the prefetch
>workaround does it, but 16bit DOS code is unlikely to contain prefetches
>anyways. And for ptrace - well, who cares? I suppose dosemu has an own
>debugger anyways and it could be handled in user space (i suppose
>they still have that code from 2.4 anyways)
>
>

I got the idea from the x86-64 code; prompted by you, I looked at it,
and it appears correct, but I would like to give it a full audit as well
(especially regarding 32-bit compatibility segments).

About the three cases here:

The prefetch workaround should be harmless; again, because of limit
checking, the kernel is safe even in the racy cases.

One can imagine clever uses for ptrace to do, say, user space
virtualization (since I'm on the topic), or other neat things. So there
is nothing really wrong with having the fully correct EIP conversion
(and here we shouldn't need to worry about races causing some issues
with strict correctness, since there can be one external control thread).

But were kprobes even intended for userspace? There are races here
that are difficult to close without some heavy machinery, and I would
rather not put the machinery in place if simplifying the code is the
right answer.

Zach

2005-11-09 13:38:52

by Andi Kleen

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:

> One can imagine clever uses for ptrace to do, say user space
> virtualization (since I'm on the topic), or other neat things. So there
> is nothing really wrong about having the fully correct EIP conversion
> (and here we shouldn't need to worry about races causing some issues
> with strict correctness, since there can be one external control thread).

Well, the code still scares me a bit, but ok. x86-64 left at least one case
intentionally out.

> But were kprobes even inteneded for userspace? There are races here
> that are difficult to close without some heavy machinery, and I would
> rather not put the machinery in place if simplifying the code is the
> right answer.

I believe user space kprobes are being worked on by some IBM India folks, yes.

-Andi

2005-11-09 16:48:52

by Zachary Amsden

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Andi Kleen wrote:

>On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:
>
>
>
>>One can imagine clever uses for ptrace to do, say user space
>>virtualization (since I'm on the topic), or other neat things. So there
>>is nothing really wrong about having the fully correct EIP conversion
>>(and here we shouldn't need to worry about races causing some issues
>>with strict correctness, since there can be one external control thread).
>>
>>
>
>Well, the code still scaries me a bit, but ok. x86-64 left at least one case
>intentionally out.
>
>
>
>>But were kprobes even inteneded for userspace? There are races here
>>that are difficult to close without some heavy machinery, and I would
>>rather not put the machinery in place if simplifying the code is the
>>right answer.
>>
>>
>
>I believe user space kprobes are being worked on by some IBM India folks yes.
>
>

I'm convinced this is pointless. What does it buy you over a ptrace
based debugger? Why would you want extra code running in the kernel
that can be done perfectly well in userspace?

Let me stress that if you are running on modified segment state, you
have no way to safely determine the virtual address on which you took an
instruction trap (int3, general protection, etc.). If you can't
determine the virtual address safely, you can't back out your code patch
to remove the breakpoint. At this point, you can't execute the next
instruction; you must wait for a _policy_ decision to be made. Adding
policy decisions like this to the kernel surely seems like a bad idea.
If the fallback is to have a debugger running in userspace that has a
user or script attached that can make the interactive decision, then why
not solve the entire problem in userspace from the start? It's a lot
safer, it simplifies the kernel kprobes code a lot, and it leaves the
debugger implementation free to do anything it chooses, as well as not
locking the solution to a particular kernel.

Zach

2005-11-09 16:58:21

by Ingo Molnar

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix


* Zachary Amsden <[email protected]> wrote:

> >I believe user space kprobes are being worked on by some IBM India folks
> >yes.
>
> I'm convinced this is pointless. What does it buy you over a ptrace
> based debugger? Why would you want extra code running in the kernel
> that can be done perfectly well in userspace?

kprobes are not just for 'debuggers'; they are also used for tracing and
other dynamic instrumentation in projects like systemtap. Ptrace is way
too slow and limited for things like that.

Ingo

2005-11-09 17:55:37

by Zachary Amsden

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Ingo Molnar wrote:

>* Zachary Amsden <[email protected]> wrote:
>
>
>
>>>I believe user space kprobes are being worked on by some IBM India folks
>>>yes.
>>>
>>>
>>I'm convinced this is pointless. What does it buy you over a ptrace
>>based debugger? Why would you want extra code running in the kernel
>>that can be done perfectly well in userspace?
>>
>>
>
>kprobes are not just for 'debuggers', they are also used for tracing and
>other dynamic instrumentation in projects like systemtap. Ptrace is way
>too slow and limited for things like that.
>
>

Well, if there is a justification for it, that means we really should
handle all the nasty EIP conversion cases due to segmentation and v8086
mode in the kprobes code. I was hoping that might not be the case.

Zach

2005-11-10 05:33:55

by S. P. Prasanna

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Resending this mail, since my earlier email did not reach lkml.
On Wed, Nov 09, 2005 at 03:07:55PM +0530, Prasanna S Panchamukhi wrote:
> Zach,
>
> Thanks for doing this.
>
> On Tue, Nov 08, 2005 at 05:36:53AM -0800, Zachary Amsden wrote:
> > Andi Kleen wrote:
> >
> > >On Tuesday 08 November 2005 05:39, Zachary Amsden wrote:
> > >
> > >
> > >>IA-32 linear address translation is loads of fun.
> > >>
> > >>
> > >
> > >Thanks for doing that audit work. Can you please double check x86-64 code
> > >is
> > >ok?
> > >
> > >Actually giving all that complexity maybe it would be better to just
> > >stop handling the case and remove all that. I'm not sure what kprobes
> > >needs it for - it doesn't even handle user space yet and even if it ever
> > >does it is unlikely that handling 16bit code makes much sense. And the
>
>
> The code was added to address the problem related to stealing of interrupts from
> VM86. Please see the discussion thread for more details from the URL below
> http://lkml.org/lkml/2004/11/9/214
>
> > But were kprobes even inteneded for userspace? There are races here
> > that are difficult to close without some heavy machinery, and I would
> > rather not put the machinery in place if simplifying the code is the
> > right answer.
>
> Presently kprobes supports only kernel space probes. Work is in progress
> for user space probes support.
>
> >+ addr = (kprobe_opcode_t *)convert_eip_to_linear(regs,
> >+ regs->eip - sizeof(kprobe_opcode_t),
> >+ &current->mm->context, &limit);
> >+
>
> Instead you can check if it is in kernel mode and calculate the address directly
> first, since it is in the fast path.
> 	if (kernel mode)
> 		addr = regs->eip - sizeof(kprobe_opcode_t);
> 	else
> 		addr = convert_eip_to_linear(..);
>
> there by avoiding calling convert_eip_to_linear () in case of every kernel probes.
>
>
> >+ /* Don't let userspace races re-address into kernel space */
> >+ if ((unsigned long)addr > limit)
> >+ return 0;
>
> there is no need for this check here in the fast path, because kprobes handles this
> case by checking if the address is on the kprobes hash list and later returning
> from that point.
>
> Please make sure it pass the test case discussed in the thread, URL is below.
> http://lkml.org/lkml/2004/11/9/214
>
> Thanks
> -Prasanna
> --
> Prasanna S Panchamukhi
> Linux Technology Center
> India Software Labs, IBM Bangalore
> Ph: 91-80-25044636
> <[email protected]>

--
Have a Nice Day!

Thanks & Regards
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<[email protected]>

2005-11-10 07:10:47

by S. P. Prasanna

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

On Wed, Nov 09, 2005 at 09:52:40AM -0800, Zachary Amsden wrote:
> Ingo Molnar wrote:
>
> >* Zachary Amsden <[email protected]> wrote:
> >
> >
> >
> >>>I believe user space kprobes are being worked on by some IBM India folks
> >>>yes.
> >>>
> >>>
> >>I'm convinced this is pointless. What does it buy you over a ptrace
> >>based debugger? Why would you want extra code running in the kernel
> >>that can be done perfectly well in userspace?
> >>
> >>
> >
> >kprobes are not just for 'debuggers', they are also used for tracing and
> >other dynamic instrumentation in projects like systemtap. Ptrace is way
> >too slow and limited for things like that.
> >
> >
>
> Well, if there is a justification for it, that means we really should
> handle all the nasty EIP conversion cases due to segmentation and v8086
> mode in the kprobes code. I was hoping that might not be the case.
>

As Ingo mentioned above, Systemtap uses the kprobes infrastructure to provide
dynamic kernel instrumentation. With it, users can add lots of probes
easily, so we need to take care of this fast path.

Instead of calling convert_eip_to_linear() for all cases, you can
just check whether we are in kernel mode and calculate the address directly:

	if (kernel mode)
		addr = regs->eip - sizeof(kprobe_opcode_t);
	else
		addr = convert_eip_to_linear(..);

thereby avoiding a call to convert_eip_to_linear() for every kernel probe.
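
A rough sketch of how that fast path could look at the top of
kprobe_handler() (illustrative only, not the final patch; the user_mode()
check is an assumed way of detecting "kernel mode"):

	/* Fast path: a probe hit in kernel mode always uses a flat EIP,
	 * so only user-mode traps need the segment-aware conversion. */
	if (!user_mode(regs)) {
		addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t));
	} else {
		addr = (kprobe_opcode_t *)convert_eip_to_linear(regs,
				regs->eip - sizeof(kprobe_opcode_t),
				&current->mm->context, &limit);
		/* Don't let userspace races re-address into kernel space */
		if ((unsigned long)addr > limit)
			return 0;
	}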

As Andi mentioned, user space probes support is in progress, and this
address conversion will help in the case of user space probes as well.


Thanks
Prasanna
--

Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Ph: 91-80-25044636
<[email protected]>

2005-11-10 14:58:29

by Zachary Amsden

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Prasanna S Panchamukhi wrote:

>On Wed, Nov 09, 2005 at 09:52:40AM -0800, Zachary Amsden wrote:
>
>
>>Ingo Molnar wrote:
>>
>>
>>
>>>* Zachary Amsden <[email protected]> wrote:
>>>
>>>
>>>
>>>
>>>
>>>>>I believe user space kprobes are being worked on by some IBM India folks
>>>>>yes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>I'm convinced this is pointless. What does it buy you over a ptrace
>>>>based debugger? Why would you want extra code running in the kernel
>>>>that can be done perfectly well in userspace?
>>>>
>>>>
>>>>
>>>>
>>>kprobes are not just for 'debuggers', they are also used for tracing and
>>>other dynamic instrumentation in projects like systemtap. Ptrace is way
>>>too slow and limited for things like that.
>>>
>>>
>>>
>>>
>>Well, if there is a justification for it, that means we really should
>>handle all the nasty EIP conversion cases due to segmentation and v8086
>>mode in the kprobes code. I was hoping that might not be the case.
>>
>>
>>
>
>As Ingo mentioned above, Systemtap uses kprobes infrastructure to provide
>dynamic kernel instrumentation. Using which user can add lots of probes
>easily, so we need to take care of this fast path.
>
>Instead of calling convert_eip_to_linear() for all cases, you can
>just check if it is in kernel mode and calculate the address directly
>
> if (kernel mode)
> addr = regs->eip - sizeof(kprobe_opcode_t);
> else
> addr = convert_eip_to_linear(..);
>
>there by avoiding call to convert_eip_to_linear () for every kernel probes.
>
>As Andi mentioned user space probes support is in progress and
>this address conversion will help in case of user space probes as well.
>

I like this better. I have to rework that patch anyways, since it no
longer applies cleanly.

Zach

2005-11-10 16:18:05

by H. Peter Anvin

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

Prasanna S Panchamukhi wrote:
>
> As Ingo mentioned above, Systemtap uses kprobes infrastructure to provide
> dynamic kernel instrumentation. Using which user can add lots of probes
> easily, so we need to take care of this fast path.
>
> Instead of calling convert_eip_to_linear() for all cases, you can
> just check if it is in kernel mode and calculate the address directly
>
> if (kernel mode)
> addr = regs->eip - sizeof(kprobe_opcode_t);
> else
> addr = convert_eip_to_linear(..);
>
> there by avoiding call to convert_eip_to_linear () for every kernel probes.
>
> As Andi mentioned user space probes support is in progress and
> this address conversion will help in case of user space probes as well.
>

Would it make sense for this to be part of convert_eip_to_linear?

-hpa

2005-11-11 15:27:32

by Andi Kleen

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

On Wednesday 09 November 2005 18:52, Zachary Amsden wrote:

> Well, if there is a justification for it, that means we really should
> handle all the nasty EIP conversion cases due to segmentation and v8086
> mode in the kprobes code. I was hoping that might not be the case.

Or just forbid kprobes for 16-bit processes (or anything running in a non-GDT
code segment). Would be perfectly reasonable IMHO.
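
(As a sketch of that check, assuming it sits at the top of kprobe_handler()
and simply rejects the probe, something like the following would do; the TI
bit test on the CS selector mirrors the old regs->xcs & 4 check:)

	/* Reject v8086 mode and any CS loaded from the LDT up front;
	 * everything that remains is a flat GDT code segment. */
	if ((regs->eflags & VM_MASK) || (regs->xcs & 4))
		return 0;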

-Andi

2005-11-11 15:26:35

by Andi Kleen

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

On Wednesday 09 November 2005 17:46, Zachary Amsden wrote:
> Andi Kleen wrote:
> >On Tuesday 08 November 2005 14:36, Zachary Amsden wrote:
> >>One can imagine clever uses for ptrace to do, say user space
> >>virtualization (since I'm on the topic), or other neat things. So there
> >>is nothing really wrong about having the fully correct EIP conversion
> >>(and here we shouldn't need to worry about races causing some issues
> >>with strict correctness, since there can be one external control thread).
> >
> >Well, the code still scaries me a bit, but ok. x86-64 left at least one
> > case intentionally out.
> >
> >>But were kprobes even inteneded for userspace? There are races here
> >>that are difficult to close without some heavy machinery, and I would
> >>rather not put the machinery in place if simplifying the code is the
> >>right answer.
> >
> >I believe user space kprobes are being worked on by some IBM India folks
> > yes.
>
> I'm convinced this is pointless. What does it buy you over a ptrace
> based debugger? Why would you want extra code running in the kernel
> that can be done perfectly well in userspace?

People might want to do something like "attach trace point to glibc function
X" for all processes on the system. Attaching ptrace to everything would
cause large overhead. systemtap really handles a different need - just low
overhead tracing, not heavyweight debugging.

> Let me stress that if you are running on modified segment state, you
> have no way to safely determine the virtual address on which you took an
> instruction trap (int3, general protection, etc..). If you can't
> determine the virtual address safely, you can't back out your code patch
> to remove the breakpoint. At this point, you can't execute the next

Kernel kprobes solves this by executing the code out of line. I don't know
how they want to do that in user space though (need a safe address for that),
but somehow that can likely be done.

> instruction; you must wait for a _policy_ decision to be made. Adding
> policy decisions like this to the kernel surely seems like a bad idea.
> If the fallback is to have a debugger running in userspace that has a
> user or script attached that can make the interactive decision, then why
> not solve the entire problem in userspace from the start? It's a lot

Doing it in user space would make it hard to do global tracing, and it
also likely would have much higher overhead.

-Andi

2005-11-14 05:52:34

by S. P. Prasanna

Subject: Re: [PATCH 19/21] i386 Kprobes semaphore fix

> > Let me stress that if you are running on modified segment state, you
> > have no way to safely determine the virtual address on which you took an
> > instruction trap (int3, general protection, etc..). If you can't
> > determine the virtual address safely, you can't back out your code patch
> > to remove the breakpoint. At this point, you can't execute the next
>
> Kernel kprobes solves this by executing the code out of line. I don't know
> how they want to do that in user space though (need a safe address for that),
> but somehow that can be likely done.

In the case of user space probes we adopt a similar method for executing the
code out of line. We find free space in the current process address space,
copy the original instruction to that location, and execute the instruction
from there. User processes use stack space to store local variables,
arguments, and return values; normally the stack space either below or above
the stack pointer is free. If there is no free stack space, we can expand the
process stack, copy the instruction, and execute it from that location.
Details about this method are discussed on the systemtap mailing list; the
URL is below.

http://sourceware.org/ml/systemtap/2005-q3/msg00542.html

Please let me know if you have any other solution to the above problem.

Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-25044636