2010-01-11 12:25:28

by Srikar Dronamraju

Subject: [RFC] [PATCH 0/7] UBP, XOL and Uprobes

Hi,

This patchset implements Uprobes, which enables you to dynamically
break into any routine in a user-space application and collect
information non-disruptively. Uprobes is based on utrace and uses
the x86 instruction decoder.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction, stops the probed application, replaces the first
byte(s) of the probed instruction with a breakpoint instruction, and
allows the probed application to continue. (Uprobes relies on the
kernel's copy-on-write mechanism, so the breakpoint affects only
that process.)

When a CPU hits the breakpoint instruction, Uprobes intercepts the
SIGTRAP, finds the associated uprobe, and runs the uprobe's
handler. Uprobes then single-steps its copy of the probed
instruction and resumes execution of the probed process at the
instruction following the probepoint. Instruction copies to be
single-stepped are stored in a per-process "single-step out of line"
(XOL) area.

Uprobes can be used to take advantage of static markers available
in user-space applications.

Advantages of uprobes over conventional debugging include:
1. Non-disruptive.
2. Uses execution out of line (XOL).
3. Much better handling of multithreaded programs, because of XOL.
4. No context switch between tracer and tracee.

Here is the list of TODO items:

- Provide a perf interface to uprobes.
- Return probes.
- Support for other architectures.
- Jump optimization.


This patchset provides:
Subject: [RFC] [PATCH 0/7] UBP, XOL and Uprobes
Subject: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Subject: [RFC] [PATCH 2/7] x86 support for UBP
Subject: [RFC] [PATCH 3/7] Execution out of line (XOL)
Subject: [RFC] [PATCH 4/7] Uprobes Implementation
Subject: [RFC] [PATCH 5/7] X86 Support for Uprobes
Subject: [RFC] [PATCH 6/7] Uprobes Documentation
Subject: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

This patchset is based on
git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-utrace.git

If and when utrace gets accepted into the tip tree or mainline, I
will rebase this patchset.

Please do provide your valuable comments.

--
Thanks and Regards
Srikar


2010-01-11 12:25:49

by Srikar Dronamraju

Subject: [RFC] [PATCH 2/7] x86 support for UBP

x86 support for the user-space breakpoint infrastructure

This patch provides the x86-specific implementation details of
user-space breakpoint assistance. It requires the
"x86: instruction decoder API" patch:
http://lkml.org/lkml/2009/6/1/459

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/ubp.h | 40 +++
arch/x86/kernel/Makefile | 2
arch/x86/kernel/ubp_x86.c | 577 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 620 insertions(+)

Index: new_uprobes.git/arch/x86/Kconfig
===================================================================
--- new_uprobes.git.orig/arch/x86/Kconfig
+++ new_uprobes.git/arch/x86/Kconfig
@@ -50,6 +50,7 @@ config X86
select HAVE_KERNEL_BZIP2
select HAVE_KERNEL_LZMA
select HAVE_HW_BREAKPOINT
+ select HAVE_UBP
select HAVE_ARCH_KMEMCHECK
select HAVE_USER_RETURN_NOTIFIER

Index: new_uprobes.git/arch/x86/include/asm/ubp.h
===================================================================
--- /dev/null
+++ new_uprobes.git/arch/x86/include/asm/ubp.h
@@ -0,0 +1,40 @@
+#ifndef _ASM_UBP_H
+#define _ASM_UBP_H
+/*
+ * User-space BreakPoint support (ubp) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008, 2009
+ */
+
+typedef u8 ubp_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UBP_XOL_SLOT_BYTES (MAX_UINSN_BYTES)
+
+#ifdef CONFIG_X86_64
+struct ubp_bkpt_arch_info {
+ unsigned long rip_target_address;
+ u8 orig_insn[MAX_UINSN_BYTES];
+};
+struct ubp_task_arch_info {
+ unsigned long saved_scratch_register;
+};
+#else
+struct ubp_bkpt_arch_info {};
+struct ubp_task_arch_info {};
+#endif
+
+#endif /* _ASM_UBP_H */
Index: new_uprobes.git/arch/x86/kernel/Makefile
===================================================================
--- new_uprobes.git.orig/arch/x86/kernel/Makefile
+++ new_uprobes.git/arch/x86/kernel/Makefile
@@ -116,6 +116,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION)

obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o

+obj-$(CONFIG_UBP) += ubp_x86.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
Index: new_uprobes.git/arch/x86/kernel/ubp_x86.c
===================================================================
--- /dev/null
+++ new_uprobes.git/arch/x86/kernel/ubp_x86.c
@@ -0,0 +1,577 @@
+/*
+ * User-space BreakPoint support (ubp) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008, 2009
+ */
+
+#define UBP_IMPLEMENTATION 1
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/ubp.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UBP_FIX_RIP_AX 0x8000
+#define UBP_FIX_RIP_CX 0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+static void set_ip(struct pt_regs *regs, unsigned long vaddr)
+{
+ regs->ip = vaddr;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct ubp_bkpt *ubp)
+{
+ return ((ubp->fixups & (UBP_FIX_RIP_AX | UBP_FIX_RIP_CX)) != 0);
+}
+
+static void cancel_xol(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ if (is_riprel_insn(ubp)) {
+ /*
+ * We rewrote ubp->insn to use indirect addressing rather
+ * than rip-relative addressing for XOL. For
+ * single-stepping inline, put back the original instruction.
+ */
+ memcpy(ubp->insn, ubp->arch_info.orig_insn, MAX_UINSN_BYTES);
+ ubp->strategy &= ~UBP_HNT_TSKINFO;
+ }
+}
+#endif /* CONFIG_X86_64 */
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
+static const u32 good_insns_64[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+ W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Good-instruction tables for 32-bit apps -- copied from i386 uprobes */
+
+static const u32 good_insns_32[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field. On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes. These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ * but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+ int i;
+
+ for (i = 0; i < insn->prefixes.nbytes; i++) {
+ switch (insn->prefixes.bytes[i]) {
+ case 0x26: /*INAT_PFX_ES */
+ case 0x2E: /*INAT_PFX_CS */
+ case 0x36: /*INAT_PFX_DS */
+ case 0x3E: /*INAT_PFX_SS */
+ case 0xF0: /*INAT_PFX_LOCK */
+ return true;
+ }
+ }
+ return false;
+}
+
+static void report_bad_prefix(void)
+{
+ printk(KERN_ERR "ubp does not currently support probing "
+ "instructions with any of the following prefixes: "
+ "cs:, ds:, es:, ss:, lock:\n");
+}
+
+static void report_bad_1byte_opcode(int mode, ubp_opcode_t op)
+{
+ printk(KERN_ERR "In %d-bit apps, "
+ "ubp does not currently support probing "
+ "instructions whose first byte is 0x%2.2x\n", mode, op);
+}
+
+static void report_bad_2byte_opcode(ubp_opcode_t op)
+{
+ printk(KERN_ERR "ubp does not currently support probing "
+ "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
+}
+
+static int validate_insn_32bits(struct ubp_bkpt *ubp, struct insn *insn)
+{
+ insn_init(insn, ubp->insn, false);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -EPERM;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(32, OPCODE1(insn));
+ return -EPERM;
+}
+
+static int validate_insn_64bits(struct ubp_bkpt *ubp, struct insn *insn)
+{
+ insn_init(insn, ubp->insn, true);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -EPERM;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(64, OPCODE1(insn));
+ return -EPERM;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * ubp->fixups accordingly. To start with, ubp->fixups is either zero or
+ * it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct ubp_bkpt *ubp, struct insn *insn)
+{
+ bool fix_ip = true, fix_call = false; /* defaults */
+ insn_get_opcode(insn); /* should be a nop */
+
+ switch (OPCODE1(insn)) {
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ /* ip is correct */
+ fix_ip = false;
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ fix_call = true;
+ break;
+ case 0x9a: /* call absolute - Fix return addr, not ip */
+ fix_call = true;
+ fix_ip = false;
+ break;
+ case 0xff:
+ {
+ int reg;
+ insn_get_modrm(insn);
+ reg = MODRM_REG(insn);
+ if (reg == 2 || reg == 3) {
+ /* call or lcall, indirect */
+ /* Fix return addr; ip is correct. */
+ fix_call = true;
+ fix_ip = false;
+ } else if (reg == 4 || reg == 5) {
+ /* jmp or ljmp, indirect */
+ /* ip is correct. */
+ fix_ip = false;
+ }
+ break;
+ }
+ case 0xea: /* jmp absolute -- ip is correct */
+ fix_ip = false;
+ break;
+ default:
+ break;
+ }
+ if (fix_ip)
+ ubp->fixups |= UBP_FIX_IP;
+ if (fix_call)
+ ubp->fixups |= UBP_FIX_CALL;
+}
+
+#ifdef CONFIG_X86_64
+static int handle_riprel_insn(struct ubp_bkpt *ubp, struct insn *insn);
+#endif
+
+static int analyze_insn(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ int ret;
+ struct insn insn;
+
+ ubp->fixups = 0;
+#ifdef CONFIG_X86_64
+ ubp->arch_info.rip_target_address = 0x0;
+#endif
+
+ if (is_32bit_app(tsk)) {
+ ret = validate_insn_32bits(ubp, &insn);
+ if (ret != 0)
+ return ret;
+ } else {
+ ret = validate_insn_64bits(ubp, &insn);
+ if (ret != 0)
+ return ret;
+ }
+ if (ubp->strategy & UBP_HNT_INLINE)
+ return 0;
+#ifdef CONFIG_X86_64
+ ret = handle_riprel_insn(ubp, &insn);
+ if (ret == -1)
+ /* rip-relative; can't XOL */
+ return 0;
+ else if (ret == 0)
+ /* not rip-relative */
+ ubp->strategy &= ~UBP_HNT_TSKINFO;
+#endif
+ prepare_fixups(ubp, &insn);
+ return 0;
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If ubp->insn doesn't use rip-relative addressing, return 0. Otherwise,
+ * rewrite the instruction so that it accesses its memory operand
+ * indirectly through a scratch register. Set ubp->fixups and
+ * ubp->arch_info.rip_target_address accordingly. (The contents of the
+ * scratch register will be saved before we single-step the modified
+ * instruction, and restored afterward.) Return 1.
+ *
+ * (... except if the client doesn't support our UBP_HNT_TSKINFO strategy,
+ * we must suppress XOL for rip-relative instructions: return -1.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area. At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static int handle_riprel_insn(struct ubp_bkpt *ubp, struct insn *insn)
+{
+ u8 *cursor;
+ u8 reg;
+
+ if (!insn_rip_relative(insn))
+ return 0;
+
+ /*
+ * We have a rip-relative instruction. To allow this instruction
+ * to be single-stepped out of line, the client must provide us
+ * with a per-task ubp_task_arch_info object.
+ */
+ if (!(ubp->strategy & UBP_HNT_TSKINFO)) {
+ ubp->strategy |= UBP_HNT_INLINE;
+ return -1;
+ }
+ memcpy(ubp->arch_info.orig_insn, ubp->insn, MAX_UINSN_BYTES);
+
+ /*
+ * Point cursor at the modrm byte. The next 4 bytes are the
+ * displacement. Beyond the displacement, for some instructions,
+ * is the immediate operand.
+ */
+ cursor = ubp->insn + insn->prefixes.nbytes + insn->rex_prefix.nbytes
+ + insn->opcode.nbytes;
+ insn_get_length(insn);
+
+ /*
+ * Convert from rip-relative addressing to indirect addressing
+ * via a scratch register. Change the r/m field from 0x5 (%rip)
+ * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+ */
+ reg = MODRM_REG(insn);
+ if (reg == 0) {
+ /*
+ * The register operand (if any) is either the A register
+ * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+ * REX prefix) %r8. In any case, we know the C register
+ * is NOT the register operand, so we use %rcx (register
+ * #1) for the scratch register.
+ */
+ ubp->fixups = UBP_FIX_RIP_CX;
+ /* Change modrm from 00 000 101 to 00 000 001. */
+ *cursor = 0x1;
+ } else {
+ /* Use %rax (register #0) for the scratch register. */
+ ubp->fixups = UBP_FIX_RIP_AX;
+ /* Change modrm from 00 xxx 101 to 00 xxx 000 */
+ *cursor = (reg << 3);
+ }
+
+ /* Target address = address of next instruction + (signed) offset */
+ ubp->arch_info.rip_target_address = (long) ubp->vaddr +
+ insn->length + insn->displacement.value;
+ /* Displacement field is gone; slide immediate field (if any) over. */
+ if (insn->immediate.nbytes) {
+ cursor++;
+ memmove(cursor, cursor + insn->displacement.nbytes,
+ insn->immediate.nbytes);
+ }
+ return 1;
+}
+
+/*
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ */
+static int pre_xol(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs)
+{
+ BUG_ON(!ubp->xol_vaddr);
+ regs->ip = ubp->xol_vaddr;
+ if (ubp->fixups & UBP_FIX_RIP_AX) {
+ tskinfo->saved_scratch_register = regs->ax;
+ regs->ax = ubp->arch_info.rip_target_address;
+ } else if (ubp->fixups & UBP_FIX_RIP_CX) {
+ tskinfo->saved_scratch_register = regs->cx;
+ regs->cx = ubp->arch_info.rip_target_address;
+ }
+ return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(struct task_struct *tsk, unsigned long sp,
+ long correction)
+{
+ int rasize, ncopied;
+ long ra = 0;
+
+ if (is_32bit_app(tsk))
+ rasize = 4;
+ else
+ rasize = 8;
+ ncopied = ubp_read_vm(tsk, sp, &ra, rasize);
+ if (unlikely(ncopied != rasize))
+ goto fail;
+ ra += correction;
+ ncopied = ubp_write_data(tsk, sp, &ra, rasize);
+ if (unlikely(ncopied != rasize))
+ goto fail;
+ return 0;
+
+fail:
+ printk(KERN_ERR
+ "ubp: Failed to adjust return address after"
+ " single-stepping call instruction;"
+ " pid=%d, sp=%#lx\n", tsk->pid, sp);
+ return -EFAULT;
+}
+
+/*
+ * Called after single-stepping. ubp->vaddr is the address of the
+ * instruction whose first byte has been replaced by the "int3"
+ * instruction. To avoid the SMP problems that can occur when we
+ * temporarily put back the original opcode to single-step, we
+ * single-stepped a copy of the instruction. The address of this
+ * copy is ubp->xol_vaddr.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction. We need
+ * to make it relative to the original instruction (FIX_IP). Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction. We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field). (FIX_RIP_AX or FIX_RIP_CX)
+ */
+static int post_xol(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs)
+{
+ /* Typically, the XOL vma is at a high addr, so correction < 0. */
+ long correction = (long) (ubp->vaddr - ubp->xol_vaddr);
+ int result = 0;
+
+#ifdef CONFIG_X86_64
+ if (is_riprel_insn(ubp)) {
+ if (ubp->fixups & UBP_FIX_RIP_AX)
+ regs->ax = tskinfo->saved_scratch_register;
+ else
+ regs->cx = tskinfo->saved_scratch_register;
+ /*
+ * The original instruction includes a displacement, and so
+ * is 4 bytes longer than what we've just single-stepped.
+ * Fall through to handle stuff like "jmpq *...(%rip)" and
+ * "callq *...(%rip)".
+ */
+ correction += 4;
+ }
+#endif
+ if (ubp->fixups & UBP_FIX_IP)
+ regs->ip += correction;
+ if (ubp->fixups & UBP_FIX_CALL)
+ result = adjust_ret_addr(tsk, regs->sp, correction);
+ return result;
+}
+
+struct ubp_arch_info ubp_arch_info = {
+ .bkpt_insn = 0xcc,
+ .ip_advancement_by_bkpt_insn = 1,
+ .max_insn_bytes = MAX_UINSN_BYTES,
+#ifdef CONFIG_X86_32
+ .strategies = 0x0,
+#else
+ /* rip-relative instructions require special handling. */
+ .strategies = UBP_HNT_TSKINFO,
+ .pre_xol = pre_xol,
+ .cancel_xol = cancel_xol,
+#endif
+ .set_ip = set_ip,
+ .analyze_insn = analyze_insn,
+ .post_xol = post_xol,
+};
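
To make the rip-relative handling above concrete, here is a worked
example (a sketch based on the x86 encoding; the bytes are
illustrative, not taken from this patch). The instruction

	8b 05 44 33 22 11    mov 0x11223344(%rip),%eax

has modrm byte 0x05 (mod=00, reg=000, r/m=101). Since the reg field
is 0, handle_riprel_insn() picks %rcx as the scratch register,
changes the modrm byte to 0x01, and squeezes out the 4-byte
displacement, leaving

	8b 01                mov (%rcx),%eax

in ubp->insn. pre_xol() then saves %rcx in the task's
ubp_task_arch_info and loads it with the target address (the address
of the next instruction plus 0x11223344); post_xol() restores %rcx
and adds 4 to the ip correction, because the single-stepped copy is
4 bytes shorter than the original instruction.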

2010-01-11 12:25:39

by Srikar Dronamraju

Subject: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

User Space Breakpoint Assistance Layer (UBP)

The user-space breakpoint infrastructure provides kernel subsystems
with an architecture-independent interface for establishing
breakpoints in user applications. This patch provides the core
implementation of UBP, along with wrappers for the
architecture-dependent methods.

UBP currently supports both the single-stepping-inline and
execution-out-of-line (XOL) strategies. Two different probepoints
in the same process can use two different strategies.
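
For illustration, here is a minimal sketch of how a client (such as
uprobes) might drive this API; locking, thread-stopping and XOL-slot
allocation are elided, and the client_* helper names are
hypothetical:

	#include <linux/ubp.h>

	static u16 client_strategies = UBP_HNT_TSKINFO;

	/* Once, at client init: negotiate strategies with the arch layer. */
	int client_init(void)
	{
		return ubp_init(&client_strategies);
	}

	/* Insert a breakpoint at vaddr in tsk's process; all threads
	 * of the probed process must already be stopped. */
	int client_insert(struct task_struct *tsk, struct ubp_bkpt *ubp,
			  unsigned long vaddr)
	{
		ubp->vaddr = vaddr;
		ubp->strategy = client_strategies;
		return ubp_insert_bkpt(tsk, ubp);
	}

When the probed task later hits the breakpoint, the client would
call ubp_pre_sstep() before the single-step and ubp_post_sstep()
after it, passing a per-task ubp_task_arch_info object if
UBP_HNT_TSKINFO is in effect.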

You need to follow this up with the UBP patch for your architecture.

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 12 +
include/linux/ubp.h | 282 ++++++++++++++++++++++++++++++
kernel/Makefile | 1
kernel/ubp_core.c | 479 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 774 insertions(+)

Index: new_uprobes.git/arch/Kconfig
===================================================================
--- new_uprobes.git.orig/arch/Kconfig
+++ new_uprobes.git/arch/Kconfig
@@ -57,6 +57,15 @@ config KPROBES
for kernel debugging, non-intrusive instrumentation and testing.
If in doubt, say "N".

+config UBP
+ bool "User-space breakpoint assistance (EXPERIMENTAL)"
+ depends on MODULES
+ depends on HAVE_UBP
+ help
+ Ubp enables kernel subsystems to establish breakpoints
+ in user applications. This service is used by components
+ such as uprobes. If in doubt, say "N".
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
@@ -90,6 +99,9 @@ config USER_RETURN_NOTIFIER
Provide a kernel-internal notification when a cpu is about to
switch to user mode.

+config HAVE_UBP
+ def_bool n
+
config HAVE_IOREMAP_PROT
bool

Index: new_uprobes.git/include/linux/ubp.h
===================================================================
--- /dev/null
+++ new_uprobes.git/include/linux/ubp.h
@@ -0,0 +1,282 @@
+#ifndef _LINUX_UBP_H
+#define _LINUX_UBP_H
+/*
+ * User-space BreakPoint support (ubp)
+ * include/linux/ubp.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008, 2009
+ */
+
+#include <asm/ubp.h>
+struct task_struct;
+struct pt_regs;
+
+/**
+ * Strategy hints:
+ *
+ * %UBP_HNT_INLINE: Specifies that the instruction must
+ * be single-stepped inline. Can be set by the caller of
+ * @arch->analyze_insn() -- e.g., if caller is out of XOL slots --
+ * or by @arch->analyze_insn() if there's no viable XOL strategy
+ * for that instruction. Set in arch->strategies if the architecture
+ * doesn't implement XOL.
+ *
+ * %UBP_HNT_PERMSL: Specifies that the instruction slot whose
+ * address is @ubp->xol_vaddr is assigned to @ubp for the life of
+ * the process. Can be used by @arch->analyze_insn() to simplify
+ * XOL in some cases. Ignored in @arch->strategies.
+ *
+ * %UBP_HNT_TSKINFO: Set in @arch->strategies if the architecture's
+ * XOL handling requires the preservation of special
+ * task-specific info between the calls to @arch->pre_xol()
+ * and @arch->post_xol(). (E.g., XOL of x86_64 rip-relative
+ * instructions uses a scratch register, whose value is saved
+ * by pre_xol() and restored by post_xol().) The caller
+ * of @arch->analyze_insn() should set %UBP_HNT_TSKINFO in
+ * @ubp->strategy if it's set in @arch->strategies and the caller
+ * can maintain a @ubp_task_arch_info object for each probed task.
+ * @arch->analyze_insn() should leave this flag set in @ubp->strategy
+ * if it needs to use the per-task @ubp_task_arch_info object.
+ */
+#define UBP_HNT_INLINE 0x1 /* Single-step this insn inline. */
+#define UBP_HNT_TSKINFO 0x2 /* XOL requires ubp_task_arch_info */
+#define UBP_HNT_PERMSL 0x4 /* XOL slot assignment is permanent */
+
+#define UBP_HNT_MASK 0x7
+
+/**
+ * struct ubp_bkpt - user-space breakpoint/probepoint
+ *
+ * @vaddr: virtual address of probepoint
+ * @xol_vaddr: virtual address of XOL slot assigned to this probepoint
+ * @opcode: copy of opcode at @vaddr
+ * @insn: typically a copy of the instruction at @vaddr. More
+ * precisely, this is the instruction (stream) that will be
+ * executed in place of the original instruction.
+ * @strategy: hints about how this instruction will be executed
+ * @fixups: set of fixups to be executed by @arch->post_xol()
+ * @arch_info: architecture-specific info about this probepoint
+ */
+struct ubp_bkpt {
+ unsigned long vaddr;
+ unsigned long xol_vaddr;
+ ubp_opcode_t opcode;
+ u8 insn[UBP_XOL_SLOT_BYTES];
+ u16 strategy;
+ u16 fixups;
+ struct ubp_bkpt_arch_info arch_info;
+};
+
+/* Post-execution fixups. Some architectures may define others. */
+#define UBP_FIX_NONE 0x0 /* No fixup needed */
+#define UBP_FIX_IP 0x1 /* Adjust IP back to vicinity of actual insn */
+#define UBP_FIX_CALL 0x2 /* Adjust the return address of a call insn */
+
+#ifndef UBP_FIX_DEFAULT
+#define UBP_FIX_DEFAULT UBP_FIX_IP
+#endif
+
+#if defined(CONFIG_UBP)
+extern int ubp_init(u16 *strategies);
+extern int ubp_insert_bkpt(struct task_struct *tsk, struct ubp_bkpt *ubp);
+extern unsigned long ubp_get_bkpt_addr(struct pt_regs *regs);
+extern int ubp_pre_sstep(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs);
+extern int ubp_post_sstep(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs);
+extern int ubp_cancel_xol(struct task_struct *tsk, struct ubp_bkpt *ubp);
+extern int ubp_remove_bkpt(struct task_struct *tsk, struct ubp_bkpt *ubp);
+extern int ubp_validate_insn_addr(struct task_struct *tsk,
+ unsigned long vaddr);
+extern void ubp_set_ip(struct pt_regs *regs, unsigned long vaddr);
+#else /* CONFIG_UBP */
+static inline int ubp_init(u16 *strategies)
+{
+ return -ENOSYS;
+}
+static inline int ubp_insert_bkpt(struct task_struct *tsk,
+ struct ubp_bkpt *ubp)
+{
+ return -ENOSYS;
+}
+static inline unsigned long ubp_get_bkpt_addr(struct pt_regs *regs)
+{
+ return -ENOSYS;
+}
+static inline int ubp_pre_sstep(struct task_struct *tsk,
+ struct ubp_bkpt *ubp, struct ubp_task_arch_info *tskinfo,
+ struct pt_regs *regs)
+{
+ return -ENOSYS;
+}
+static inline int ubp_post_sstep(struct task_struct *tsk,
+ struct ubp_bkpt *ubp, struct ubp_task_arch_info *tskinfo,
+ struct pt_regs *regs)
+{
+ return -ENOSYS;
+}
+static inline int ubp_cancel_xol(struct task_struct *tsk,
+ struct ubp_bkpt *ubp)
+{
+ return -ENOSYS;
+}
+static inline int ubp_remove_bkpt(struct task_struct *tsk,
+ struct ubp_bkpt *ubp)
+{
+ return -ENOSYS;
+}
+static inline int ubp_validate_insn_addr(struct task_struct *tsk,
+ unsigned long vaddr)
+{
+ return -ENOSYS;
+}
+static inline void ubp_set_ip(struct pt_regs *regs, unsigned long vaddr)
+{
+}
+#endif /* CONFIG_UBP */
+
+#ifdef UBP_IMPLEMENTATION
+/**
+ * struct ubp_arch_info - architecture-specific parameters and functions
+ *
+ * Most architectures can use the default versions of @read_opcode(),
+ * @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn(); ia64 is an
+ * exception. All functions (including @validate_address()) can assume
+ * that the caller has verified that the probepoint's virtual address
+ * resides in an executable VM area.
+ *
+ * @bkpt_insn:
+ * The architecture's breakpoint instruction. This is used by
+ * the default versions of @set_bkpt(), @set_orig_insn(), and
+ * @is_bkpt_insn().
+ * @ip_advancement_by_bkpt_insn:
+ * The number of bytes the instruction pointer is advanced by
+ * this architecture's breakpoint instruction. For example, after
+ * the powerpc trap instruction executes, the ip still points to the
+ * breakpoint instruction (ip_advancement_by_bkpt_insn = 0); but the
+ * x86 int3 instruction (1 byte) advances the ip past the int3
+ * (ip_advancement_by_bkpt_insn = 1).
+ * @max_insn_bytes:
+ * The maximum length, in bytes, of an instruction in this
+ * architecture. This must be <= UBP_XOL_SLOT_BYTES;
+ * @strategies:
+ * Bit-map of %UBP_HNT_* values recognized by this architecture.
+ * Include %UBP_HNT_INLINE iff this architecture doesn't support
+ * execution out of line. Include %UBP_HNT_TSKINFO if
+ * XOL of at least some instructions requires communication of
+ * per-task state between @pre_xol() and @post_xol().
+ * @set_ip:
+ * Set the instruction pointer in @regs to @vaddr.
+ * @validate_address:
+ * Return 0 if @vaddr is a valid instruction address, or a negative
+ * errno (typically -%EINVAL) otherwise. If you don't provide
+ * @validate_address(), any address will be accepted. Caller
+ * guarantees that @vaddr is in an executable VM area. This
+ * function typically just enforces arch-specific instruction
+ * alignment.
+ * @read_opcode:
+ * For task @tsk, read the opcode at @vaddr and store it in
+ * @opcode. Return 0 (success) or a negative errno. Defaults to
+ * @ubp_read_opcode().
+ * @set_bkpt:
+ * For task @tsk, store @bkpt_insn at @ubp->vaddr. Return 0
+ * (success) or a negative errno. Defaults to @ubp_set_bkpt().
+ * @set_orig_insn:
+ * For task @tsk, restore the original opcode (@ubp->opcode) at
+ * @ubp->vaddr. If @check is true, first verify that there's
+ * actually a breakpoint instruction there. Return 0 (success) or
+ * a negative errno. Defaults to @ubp_set_orig_insn().
+ * @is_bkpt_insn:
+ * Return %true if @ubp->opcode is @bkpt_insn. Defaults to
+ * @ubp_is_bkpt_insn(), which just tests (ubp->opcode ==
+ * arch->bkpt_insn).
+ * @analyze_insn:
+ * Analyze @ubp->insn. Return 0 if @ubp->insn is an instruction
+ * you can probe, or a negative errno (typically -%EPERM)
+ * otherwise. The caller sets @ubp->strategy to %UBP_HNT_INLINE
+ * to suppress XOL for this instruction (e.g., because we're
+ * out of XOL slots). If the instruction can be probed but
+ * can't be executed out of line, set @ubp->strategy to
+ * %UBP_HNT_INLINE. Otherwise, determine what sort of XOL-related
+ * fixups @post_xol() (and possibly @pre_xol()) will need
+ * to do for this instruction, and annotate @ubp accordingly.
+ * You may modify @ubp->insn (e.g., the x86_64 port does this
+ * for rip-relative instructions), but if you do so, you should
+ * retain a copy in @ubp->arch_info in case you have to revert
+ * to single-stepping inline (see @cancel_xol()).
+ * @pre_xol:
+ * Called just before executing the instruction associated
+ * with @ubp out of line. @ubp->xol_vaddr is the address in
+ * @tsk's virtual address space where @ubp->insn has been copied.
+ * @pre_xol() should at least set the instruction pointer in
+ * @regs to @ubp->xol_vaddr -- which is what the default,
+ * @ubp_pre_xol(), does. If @ubp->strategy includes the
+ * %UBP_HNT_TSKINFO flag, then @tskinfo points to a per-task
+ * copy of struct ubp_task_arch_info.
+ * @post_xol:
+ * Called after executing the instruction associated with
+ * @ubp out of line. @post_xol() should perform the fixups
+ * specified in @ubp->fixups, which includes ensuring that the
+ * instruction pointer in @regs points at the next instruction in
+ * the probed instruction stream. @tskinfo is as for @pre_xol().
+ * You must provide this function.
+ * @cancel_xol:
+ * The instruction associated with @ubp cannot be executed
+ * out of line after all. (This can happen when XOL slots
+ * are lazily assigned, and we run out of slots before we
+ * hit this breakpoint. This function should never be called
+ * if @analyze_insn() was previously called for @ubp with a
+ * non-zero value of @ubp->xol_vaddr and with %UBP_HNT_PERMSL
+ * set in @ubp->strategy.) Adjust @ubp as needed so it can be
+ * single-stepped inline. Omit this function if you don't need it.
+ */
+
+struct ubp_arch_info {
+ ubp_opcode_t bkpt_insn;
+ u8 ip_advancement_by_bkpt_insn;
+ u8 max_insn_bytes;
+ u16 strategies;
+ void (*set_ip)(struct pt_regs *regs, unsigned long vaddr);
+ int (*validate_address)(struct task_struct *tsk, unsigned long vaddr);
+ int (*read_opcode)(struct task_struct *tsk, unsigned long vaddr,
+ ubp_opcode_t *opcode);
+ int (*set_bkpt)(struct task_struct *tsk, struct ubp_bkpt *ubp);
+ int (*set_orig_insn)(struct task_struct *tsk,
+ struct ubp_bkpt *ubp, bool check);
+ bool (*is_bkpt_insn)(struct ubp_bkpt *ubp);
+ int (*analyze_insn)(struct task_struct *tsk, struct ubp_bkpt *ubp);
+ int (*pre_xol)(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo,
+ struct pt_regs *regs);
+ int (*post_xol)(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo,
+ struct pt_regs *regs);
+ void (*cancel_xol)(struct task_struct *tsk, struct ubp_bkpt *ubp);
+};
+
+/* Unexported functions & macros for use by arch-specific code */
+#define ubp_opcode_sz ((unsigned int)(sizeof(ubp_opcode_t)))
+extern int ubp_read_vm(struct task_struct *tsk, unsigned long vaddr,
+ void *kbuf, int nbytes);
+extern int ubp_write_data(struct task_struct *tsk, unsigned long vaddr,
+ const void *kbuf, int nbytes);
+
+extern struct ubp_arch_info ubp_arch_info;
+
+#endif /* UBP_IMPLEMENTATION */
+
+#endif /* _LINUX_UBP_H */
Index: new_uprobes.git/kernel/Makefile
===================================================================
--- new_uprobes.git.orig/kernel/Makefile
+++ new_uprobes.git/kernel/Makefile
@@ -102,6 +102,7 @@ obj-$(CONFIG_SLOW_WORK_DEBUG) += slow-wo
obj-$(CONFIG_PERF_EVENTS) += perf_event.o
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
+obj-$(CONFIG_UBP) += ubp_core.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: new_uprobes.git/kernel/ubp_core.c
===================================================================
--- /dev/null
+++ new_uprobes.git/kernel/ubp_core.c
@@ -0,0 +1,479 @@
+/*
+ * User-space BreakPoint support (ubp)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008, 2009
+ */
+
+#define UBP_IMPLEMENTATION 1
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/mm.h>
+#include <linux/ubp.h>
+#include <linux/uaccess.h>
+
+/*
+ * TODO: Resolve verbosity. ubp_insert_bkpt() is the only function
+ * that reports failures via printk.
+ */
+
+static struct ubp_arch_info *arch = &ubp_arch_info;
+
+static bool ubp_uses_xol(u16 strategy)
+{
+ return !(strategy & UBP_HNT_INLINE);
+}
+
+static bool validate_strategy(u16 strategy, u16 valid_bits)
+{
+ return ((strategy & (~valid_bits)) == 0);
+}
+
+/**
+ * ubp_init - initialize the ubp data structures
+ * @strategies indicates which breakpoint-related strategies are
+ * supported by the client:
+ * %UBP_HNT_INLINE: Client supports only single-stepping inline.
+ * Otherwise client must provide an instruction slot
+ * (UBP_XOL_SLOT_BYTES bytes) in the probed process's address
+ * space for each instruction to be executed out of line.
+ * %UBP_HNT_TSKINFO: Client can provide and maintain one
+ * @ubp_task_arch_info object for each probed task. (Failure to
+ * support this will prevent XOL of rip-relative instructions on
+ * x86_64, at least.)
+ * Upon return, @strategies is updated to reflect those strategies
+ * required by this particular architecture's implementation of ubp:
+ * %UBP_HNT_INLINE: Architecture or client supports only
+ * single-stepping inline.
+ * %UBP_HNT_TSKINFO: Architecture uses @ubp_task_arch_info, and will
+ * expect it to be passed to @ubp_pre_sstep() and @ubp_post_sstep()
+ * as needed (see @ubp_insert_bkpt()).
+ * Possible errors:
+ * -%ENOSYS: ubp not supported for this architecture.
+ * -%EINVAL: unrecognized flags in @strategies
+ */
+int ubp_init(u16 *strategies)
+{
+ u16 inline_bit, tskinfo_bit;
+ u16 client_strategies = *strategies;
+
+ if (!validate_strategy(client_strategies,
+ UBP_HNT_INLINE | UBP_HNT_TSKINFO))
+ return -EINVAL;
+
+ inline_bit = (client_strategies | arch->strategies) & UBP_HNT_INLINE;
+ tskinfo_bit = (client_strategies & arch->strategies) & UBP_HNT_TSKINFO;
+ *strategies = (inline_bit | tskinfo_bit);
+ return 0;
+}
+
+/*
+ * Read @nbytes at @vaddr from @tsk into @kbuf. Return number of bytes read.
+ * Not exported, but available for use by arch-specific ubp code.
+ */
+int ubp_read_vm(struct task_struct *tsk, unsigned long vaddr,
+ void *kbuf, int nbytes)
+{
+ if (tsk == current) {
+ int nleft = copy_from_user(kbuf, (void __user *) vaddr,
+ nbytes);
+ return nbytes - nleft;
+ } else
+ return access_process_vm(tsk, vaddr, kbuf, nbytes, 0);
+}
+
+/*
+ * Write @nbytes from @kbuf at @vaddr in @tsk. Return number of bytes written.
+ * Can be used to write to stack or data VM areas, but not instructions.
+ * Not exported, but available for use by arch-specific ubp code.
+ */
+int ubp_write_data(struct task_struct *tsk, unsigned long vaddr,
+ const void *kbuf, int nbytes)
+{
+ int nleft;
+
+ if (tsk == current) {
+ nleft = copy_to_user((void __user *) vaddr, kbuf, nbytes);
+ return nbytes - nleft;
+ } else
+ return access_process_vm(tsk, vaddr, (void *) kbuf,
+ nbytes, 1);
+}
+
+static int ubp_write_opcode(struct task_struct *tsk, unsigned long vaddr,
+ ubp_opcode_t opcode)
+{
+ int result;
+
+ result = access_process_vm(tsk, vaddr, &opcode, ubp_opcode_sz, 1);
+ return (result == ubp_opcode_sz ? 0 : -EFAULT);
+}
+
+/* Default implementation of arch->read_opcode */
+static int ubp_read_opcode(struct task_struct *tsk, unsigned long vaddr,
+ ubp_opcode_t *opcode)
+{
+ int bytes_read;
+
+ bytes_read = ubp_read_vm(tsk, vaddr, opcode, ubp_opcode_sz);
+ return (bytes_read == ubp_opcode_sz ? 0 : -EFAULT);
+}
+
+/* Default implementation of arch->set_bkpt */
+static int ubp_set_bkpt(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ return ubp_write_opcode(tsk, ubp->vaddr, arch->bkpt_insn);
+}
+
+/* Default implementation of arch->set_orig_insn */
+static int ubp_set_orig_insn(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ bool check)
+{
+ if (check) {
+ ubp_opcode_t opcode;
+ int result = arch->read_opcode(tsk, ubp->vaddr, &opcode);
+ if (result)
+ return result;
+ if (opcode != arch->bkpt_insn)
+ return -EINVAL;
+ }
+ return ubp_write_opcode(tsk, ubp->vaddr, ubp->opcode);
+}
+
+/* Return 0 if vaddr is in an executable VM area, or -EINVAL otherwise. */
+static inline int ubp_check_vma(struct task_struct *tsk, unsigned long vaddr)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ int ret = -EINVAL;
+
+ mm = get_task_mm(tsk);
+ if (!mm)
+ return -EINVAL;
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, vaddr);
+ if (vma && vaddr >= vma->vm_start && (vma->vm_flags & VM_EXEC))
+ ret = 0;
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
+
+/**
+ * ubp_validate_insn_addr - validate that the instruction address
+ * lies in an executable vma.
+ * Returns 0 if @vaddr is a valid instruction address.
+ * @tsk: the probed task
+ * @vaddr: virtual address of the instruction to be verified.
+ *
+ * Possible errors:
+ * -%EINVAL: @vaddr is not a valid instruction address.
+ */
+int ubp_validate_insn_addr(struct task_struct *tsk, unsigned long vaddr)
+{
+ int result;
+
+ result = ubp_check_vma(tsk, vaddr);
+ if (result != 0)
+ return result;
+ if (arch->validate_address)
+ result = arch->validate_address(tsk, vaddr);
+ return result;
+}
+
+static void ubp_bkpt_insertion_failed(struct task_struct *tsk,
+ struct ubp_bkpt *ubp, const char *why)
+{
+ printk(KERN_ERR "Can't place breakpoint at pid %d vaddr %#lx: %s\n",
+ tsk->pid, ubp->vaddr, why);
+}
+
+/**
+ * ubp_insert_bkpt - insert breakpoint
+ * Insert a breakpoint into the process that includes @tsk, at the
+ * virtual address @ubp->vaddr.
+ *
+ * @ubp->strategy affects how this breakpoint will be handled:
+ * %UBP_HNT_INLINE: Probed instruction will be single-stepped inline.
+ * %UBP_HNT_TSKINFO: As above.
+ * %UBP_HNT_PERMSL: An XOL instruction slot in the probed process's
+ * address space has been allocated to this probepoint, and will
+ * remain so allocated as long as it's needed. @ubp->xol_vaddr is
+ * its address. (This slot can be reallocated if
+ * @ubp_insert_bkpt() fails.) The client is NOT required to
+ * allocate an instruction slot before calling @ubp_insert_bkpt().
+ * @ubp_insert_bkpt() updates @ubp->strategy as needed:
+ * %UBP_HNT_INLINE: Architecture or client cannot do XOL for this
+ * probepoint.
+ * %UBP_HNT_TSKINFO: @ubp_task_arch_info will be used for this
+ * probepoint.
+ *
+ * All threads of the probed process must be stopped while
+ * @ubp_insert_bkpt() runs.
+ *
+ * Possible errors:
+ * -%ENOSYS: ubp not supported for this architecture
+ * -%EINVAL: unrecognized/invalid strategy flags
+ * -%EINVAL: invalid instruction address
+ * -%EEXIST: breakpoint instruction already exists at that address
+ * -%EPERM: cannot probe this instruction
+ * -%EFAULT: failed to insert breakpoint instruction
+ * [TBD: Validate xol_vaddr?]
+ */
+int ubp_insert_bkpt(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ int result, len;
+
+ BUG_ON(!tsk || !ubp);
+ if (!validate_strategy(ubp->strategy, UBP_HNT_MASK))
+ return -EINVAL;
+
+ result = ubp_validate_insn_addr(tsk, ubp->vaddr);
+ if (result != 0)
+ return result;
+
+ /*
+ * If ubp_read_vm() transfers fewer bytes than the maximum
+ * instruction size, assume that the probed instruction is smaller
+ * than the max and near the end of the last page of instructions.
+ * But there must be room at least for a breakpoint-size instruction.
+ */
+ len = ubp_read_vm(tsk, ubp->vaddr, ubp->insn, arch->max_insn_bytes);
+ if (len < ubp_opcode_sz) {
+ ubp_bkpt_insertion_failed(tsk, ubp,
+ "error reading original instruction");
+ return -EFAULT;
+ }
+ memcpy(&ubp->opcode, ubp->insn, ubp_opcode_sz);
+ if (arch->is_bkpt_insn(ubp)) {
+ ubp_bkpt_insertion_failed(tsk, ubp,
+ "bkpt already exists at that addr");
+ return -EEXIST;
+ }
+
+ result = arch->analyze_insn(tsk, ubp);
+ if (result < 0) {
+ ubp_bkpt_insertion_failed(tsk, ubp,
+ "instruction type cannot be probed");
+ return result;
+ }
+
+ result = arch->set_bkpt(tsk, ubp);
+ if (result < 0) {
+ ubp_bkpt_insertion_failed(tsk, ubp,
+ "failed to insert bkpt instruction");
+ return result;
+ }
+ return 0;
+}
+
+/**
+ * ubp_pre_sstep - prepare to single-step the probed instruction
+ * @tsk: the probed task
+ * @ubp: the probepoint information, as returned by @ubp_insert_bkpt().
+ * Unless the %UBP_HNT_INLINE flag is set in @ubp->strategy,
+ * @ubp->xol_vaddr must be the address of an XOL instruction slot
+ * that is allocated to this probepoint at least until after the
+ * completion of @ubp_post_sstep(), and populated with the contents
+ * of @ubp->insn. [Need to be more precise here to account for
+ * untimely exit or UBP_HNT_BOOSTED.]
+ * @tskinfo: points to a @ubp_task_arch_info object for @tsk, if
+ * the %UBP_HNT_TSKINFO flag is set in @ubp->strategy.
+ * @regs: reflects the saved user state of @tsk. @ubp_pre_sstep()
+ * adjusts this. In particular, the instruction pointer is set
+ * to the instruction to be single-stepped.
+ * Possible errors:
+ * -%EFAULT: Failed to read or write @tsk's address space as needed.
+ *
+ * The client must ensure that the contents of @ubp are not
+ * changed during the single-step operation -- i.e., between when
+ * @ubp_pre_sstep() is called and when @ubp_post_sstep() returns.
+ * Additionally, if single-stepping inline is used for this probepoint,
+ * the client must serialize the single-step operation (so multiple
+ * threads don't step on each other while the opcode replacement is
+ * taking place).
+ */
+int ubp_pre_sstep(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs)
+{
+ int result;
+
+ BUG_ON(!tsk || !ubp || !regs);
+ if (ubp_uses_xol(ubp->strategy)) {
+ BUG_ON(!ubp->xol_vaddr);
+ return arch->pre_xol(tsk, ubp, tskinfo, regs);
+ }
+
+ /*
+ * Single-step this instruction inline. Replace the breakpoint
+ * with the original opcode.
+ */
+ result = arch->set_orig_insn(tsk, ubp, false);
+ if (result == 0)
+ arch->set_ip(regs, ubp->vaddr);
+ return result;
+}
+
+/**
+ * ubp_post_sstep - prepare to resume execution after single-step
+ * @tsk: the probed task
+ * @ubp: the probepoint information, as with @ubp_pre_sstep()
+ * @tskinfo: the @ubp_task_arch_info object, if any, passed to
+ * @ubp_pre_sstep()
+ * @regs: reflects the saved state of @tsk after the single-step
+ * operation. @ubp_post_sstep() adjusts @tsk's state as needed,
+ * including pointing the instruction pointer at the instruction
+ * following the probed instruction.
+ * Possible errors:
+ * -%EFAULT: Failed to read or write @tsk's address space as needed.
+ */
+int ubp_post_sstep(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs)
+{
+ BUG_ON(!tsk || !ubp || !regs);
+ if (ubp_uses_xol(ubp->strategy))
+ return arch->post_xol(tsk, ubp, tskinfo, regs);
+
+ /*
+ * Single-stepped this instruction inline. Put the breakpoint
+ * instruction back.
+ */
+ return arch->set_bkpt(tsk, ubp);
+}
+
+/**
+ * ubp_cancel_xol - cancel XOL for this probepoint
+ * @tsk: a task in the probed process
+ * @ubp: the probepoint information
+ * Switch @ubp's single-stepping strategy from out-of-line to inline.
+ * If the client employs lazy XOL-slot allocation, it can call
+ * this function if it determines that it can't provide an XOL
+ * slot for @ubp. @ubp_cancel_xol() adjusts @ubp appropriately.
+ *
+ * @ubp_cancel_xol()'s behavior is undefined if @ubp_pre_sstep() has
+ * already been called for @ubp.
+ *
+ * Possible errors:
+ * Can't think of any yet.
+ */
+int ubp_cancel_xol(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ if (arch->cancel_xol)
+ arch->cancel_xol(tsk, ubp);
+ ubp->strategy |= UBP_HNT_INLINE;
+ return 0;
+}
+
+/**
+ * ubp_get_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction. Return the address of the breakpoint instruction.
+ */
+unsigned long ubp_get_bkpt_addr(struct pt_regs *regs)
+{
+ return instruction_pointer(regs) - arch->ip_advancement_by_bkpt_insn;
+}
+
+/**
+ * ubp_remove_bkpt - remove breakpoint
+ * For the process that includes @tsk, remove the breakpoint specified
+ * by @ubp, restoring the original opcode.
+ *
+ * Possible errors:
+ * -%EINVAL: @ubp->vaddr is not a valid instruction address.
+ * -%ENOENT: There is no breakpoint instruction at @ubp->vaddr.
+ * -%EFAULT: Failed to read/write @tsk's address space as needed.
+ */
+int ubp_remove_bkpt(struct task_struct *tsk, struct ubp_bkpt *ubp)
+{
+ if (ubp_validate_insn_addr(tsk, ubp->vaddr) != 0)
+ return -EINVAL;
+ return arch->set_orig_insn(tsk, ubp, true);
+}
+
+void ubp_set_ip(struct pt_regs *regs, unsigned long vaddr)
+{
+ arch->set_ip(regs, vaddr);
+}
+
+/* Default implementation of arch->is_bkpt_insn */
+static bool ubp_is_bkpt_insn(struct ubp_bkpt *ubp)
+{
+ return (ubp->opcode == arch->bkpt_insn);
+}
+
+/* Default implementation of arch->pre_xol */
+static int ubp_pre_xol(struct task_struct *tsk, struct ubp_bkpt *ubp,
+ struct ubp_task_arch_info *tskinfo, struct pt_regs *regs)
+{
+ arch->set_ip(regs, ubp->xol_vaddr);
+ return 0;
+}
+
+/* Validate arch-specific info during ubp initialization. */
+
+static int ubp_bad_arch_param(const char *param_name, int value)
+{
+ printk(KERN_ERR "ubp: bad value %d/%#x for parameter %s"
+ " in ubp_arch_info\n", value, value, param_name);
+ return -ENOSYS;
+}
+
+static int ubp_missing_arch_func(const char *func_name)
+{
+ printk(KERN_ERR "ubp: ubp_arch_info lacks required function: %s\n",
+ func_name);
+ return -ENOSYS;
+}
+
+static int __init init_ubp(void)
+{
+ int result = 0;
+
+ /* Accept any value of bkpt_insn. */
+ if (arch->max_insn_bytes < 1)
+ result = ubp_bad_arch_param("max_insn_bytes",
+ arch->max_insn_bytes);
+ if (arch->ip_advancement_by_bkpt_insn > arch->max_insn_bytes)
+ result = ubp_bad_arch_param("ip_advancement_by_bkpt_insn",
+ arch->ip_advancement_by_bkpt_insn);
+ /* Accept any value of strategies. */
+ if (!arch->set_ip)
+ result = ubp_missing_arch_func("set_ip");
+ /* Null validate_address() is OK. */
+ if (!arch->read_opcode)
+ arch->read_opcode = ubp_read_opcode;
+ if (!arch->set_bkpt)
+ arch->set_bkpt = ubp_set_bkpt;
+ if (!arch->set_orig_insn)
+ arch->set_orig_insn = ubp_set_orig_insn;
+ if (!arch->is_bkpt_insn)
+ arch->is_bkpt_insn = ubp_is_bkpt_insn;
+ if (!arch->analyze_insn)
+ result = ubp_missing_arch_func("analyze_insn");
+ if (!arch->pre_xol)
+ arch->pre_xol = ubp_pre_xol;
+ if (ubp_uses_xol(arch->strategies) && !arch->post_xol)
+ result = ubp_missing_arch_func("post_xol");
+ /* Null cancel_xol() is OK. */
+ return result;
+}
+
+module_init(init_ubp);

2010-01-11 12:25:55

by Srikar Dronamraju

Subject: [RFC] [PATCH 3/7] Execution out of line (XOL)

Execution out of line (XOL)

Slot allocation mechanism for the Execution Out of Line (XOL)
strategy in the user-space breakpoint infrastructure.

This patch provides the slot allocation mechanism for the execution
out of line strategy used by the user-space breakpoint
infrastructure. It requires utrace support in the kernel.

This patch provides five functions: xol_get_insn_slot(),
xol_free_insn_slot(), xol_put_area(), xol_get_area() and
xol_validate_vaddr(). (A usage sketch follows the list below.)

Current slot allocation mechanism:
1. Allocate one dedicated slot per user breakpoint.
2. If the allocated vma is completely used, expand the current vma.
3. If we can't expand the vma, allocate a new vma.
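
For illustration, a minimal sketch of how a client might use these
functions (the client_* helpers are hypothetical; error handling is
elided):

	#include <linux/ubp.h>
	#include <linux/ubp_xol.h>

	/* At probe-insertion time: get an XOL slot, or fall back
	 * to single-stepping inline if we can't. */
	static int client_assign_slot(struct pid *pid,
				      struct task_struct *tsk,
				      struct ubp_bkpt *ubp)
	{
		void *area = xol_get_area(pid);	/* find or create XOL area */

		if (area)
			ubp->xol_vaddr = xol_get_insn_slot(ubp, area);
		if (!area || !ubp->xol_vaddr)
			return ubp_cancel_xol(tsk, ubp);
		return 0;
	}

	/* When the probepoint is removed: */
	static void client_release_slot(struct ubp_bkpt *ubp, void *area)
	{
		xol_free_insn_slot(ubp->xol_vaddr, area);
		xol_put_area(area);
	}

Here pid is the struct pid of the probed process, tsk a task in it,
and ubp the struct ubp_bkpt for the probepoint, as in the UBP patch.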

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 4
include/linux/ubp_xol.h | 56 ++++
kernel/Makefile | 1
kernel/ubp_xol.c | 644 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 705 insertions(+)

Index: new_uprobes.git/arch/Kconfig
===================================================================
--- new_uprobes.git.orig/arch/Kconfig
+++ new_uprobes.git/arch/Kconfig
@@ -102,6 +102,10 @@ config USER_RETURN_NOTIFIER
config HAVE_UBP
def_bool n

+config UBP_XOL
+ def_bool y
+ depends on UBP && UTRACE
+
config HAVE_IOREMAP_PROT
bool

Index: new_uprobes.git/include/linux/ubp_xol.h
===================================================================
--- /dev/null
+++ new_uprobes.git/include/linux/ubp_xol.h
@@ -0,0 +1,56 @@
+#ifndef _LINUX_XOL_H
+#define _LINUX_XOL_H
+/*
+ * User-space BreakPoint support (ubp) -- Allocation of instruction
+ * slots for execution out of line (XOL)
+ * include/linux/ubp_xol.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ */
+
+
+#if defined(CONFIG_UBP_XOL)
+extern unsigned long xol_get_insn_slot(struct ubp_bkpt *ubp, void *xol_area);
+extern void xol_free_insn_slot(unsigned long, void *xol_area);
+extern int xol_validate_vaddr(struct pid *pid, unsigned long vaddr,
+ void *xol_area);
+extern void *xol_get_area(struct pid *pid);
+extern void xol_put_area(void *xol_area);
+#else /* CONFIG_UBP_XOL */
+static inline unsigned long xol_get_insn_slot(struct ubp_bkpt *ubp,
+ void *xol_area)
+{
+ return 0;
+}
+static inline void xol_free_insn_slot(unsigned long slot_addr, void *xol_area)
+{
+}
+static inline int xol_validate_vaddr(struct pid *pid, unsigned long vaddr,
+ void *xol_area)
+{
+ return -ENOSYS;
+}
+static inline void *xol_get_area(struct pid *pid)
+{
+ return NULL;
+}
+static inline void xol_put_area(void *xol_area)
+{
+}
+#endif /* CONFIG_UBP_XOL */
+
+#endif /* _LINUX_XOL_H */
Index: new_uprobes.git/kernel/Makefile
===================================================================
--- new_uprobes.git.orig/kernel/Makefile
+++ new_uprobes.git/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_PERF_EVENTS) += perf_event.
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_UBP) += ubp_core.o
+obj-$(CONFIG_UBP_XOL) += ubp_xol.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: new_uprobes.git/kernel/ubp_xol.c
===================================================================
--- /dev/null
+++ new_uprobes.git/kernel/ubp_xol.c
@@ -0,0 +1,644 @@
+/*
+ * User-space BreakPoint support (ubp) -- Allocation of instruction
+ * slots for execution out of line (XOL)
+ * kernel/ubp_xol.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ */
+
+/*
+ * Every probepoint gets its own slot. Once it's assigned a slot, it
+ * keeps that slot until the probepoint goes away. If we run out of
+ * slots in the XOL vma, we try to expand it by one page. If we can't
+ * expand it, we allocate an additional vma. Only the probed process
+ * itself can add or expand vmas.
+ */
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/kref.h>
+#include <linux/utrace.h>
+#include <linux/ubp.h>
+#include <linux/errno.h>
+#include <linux/mman.h>
+#include <linux/file.h>
+
+#define UINSNS_PER_PAGE (PAGE_SIZE/UBP_XOL_SLOT_BYTES)
+
+struct ubp_xol_vma {
+ struct list_head list;
+ unsigned long *bitmap; /* 0 = free slot */
+
+ /*
+ * We keep the vma's vm_start rather than a pointer to the vma
+ * itself. The probed process or a naughty kernel module could make
+ * the vma go away, and we must handle that reasonably gracefully.
+ */
+ unsigned long vaddr; /* Page(s) of instruction slots */
+ int npages;
+ int nslots;
+};
+
+struct ubp_xol_area {
+ struct list_head vmas;
+ struct mutex mutex; /* Serializes access to list of vmas */
+
+ /*
+ * We ref-count threads and clients. The xol_report_* callbacks
+ * are all about noticing when the last thread goes away.
+ */
+ struct kref kref;
+ struct ubp_xol_vma *last_vma;
+ pid_t tgid;
+ bool can_expand;
+};
+
+static const struct utrace_engine_ops xol_engine_ops;
+static void xol_free_area(struct kref *kref);
+
+/*
+ * xol_mutex allows creation of unique ubp_xol_area.
+ * Critical region for xol_mutex includes creation and initialization
+ * of ubp_xol_area and attaching an exclusive engine with
+ * xol_engine_ops for the thread whose pid is thread group id.
+ */
+static DEFINE_MUTEX(xol_mutex);
+
+/**
+ * xol_put_area - release a reference to ubp_xol_area.
+ * If this happens to be the last reference, free the ubp_xol_area.
+ * @xol_area: unique per process ubp_xol_area for this process.
+ */
+void xol_put_area(void *xol_area)
+{
+ struct ubp_xol_area *area = (struct ubp_xol_area *) xol_area;
+
+ if (unlikely(!area))
+ return;
+ kref_put(&area->kref, xol_free_area);
+}
+
+/*
+ * Need unique ubp_xol_area. This is achieved by using utrace engines.
+ * However code using utrace could be avoided if mm_struct /
+ * mm_context_t had a pointer to ubp_xol_area.
+ */
+
+/*
+ * xol_create_engine - add a thread to watch
+ * xol_create_engine can return these values:
+ * 0: successfully created an engine.
+ * -EEXIST: don't bother because an engine already exists for this
+ * thread.
+ * -ESRCH: Process or thread is exiting; don't need to create an
+ * engine.
+ * -ENOMEM: utrace can't allocate memory for the engine
+ *
+ * This function is called holding a reference to pid.
+ */
+static int xol_create_engine(struct pid *pid, struct ubp_xol_area *area)
+{
+ struct utrace_engine *engine;
+ int result;
+
+ engine = utrace_attach_pid(pid, UTRACE_ATTACH_CREATE |
+ UTRACE_ATTACH_EXCLUSIVE | UTRACE_ATTACH_MATCH_OPS,
+ &xol_engine_ops, area);
+ if (IS_ERR(engine)) {
+ put_pid(pid);
+ return PTR_ERR(engine);
+ }
+ result = utrace_set_events_pid(pid, engine,
+ UTRACE_EVENT(EXEC) | UTRACE_EVENT(CLONE) | UTRACE_EVENT(EXIT));
+ /*
+ * Since this is the first and only time we set events for this
+ * engine, there shouldn't be any callbacks in progress.
+ */
+ WARN_ON(result == -EINPROGRESS);
+ kref_get(&area->kref);
+ put_pid(pid);
+ utrace_engine_put(engine);
+ return 0;
+}
+
+/*
+ * If a thread clones while xol_get_area() is running, it's possible
+ * for xol_create_engine() to be called both from there and from
+ * here. No problem, since xol_create_engine() refuses to create (or
+ * ref-count) a second engine for the same task.
+ */
+static u32 xol_report_clone(u32 action,
+ struct utrace_engine *engine,
+ unsigned long clone_flags,
+ struct task_struct *child)
+{
+ if (clone_flags & CLONE_THREAD) {
+ struct pid *child_pid = get_pid(task_pid(child));
+
+ BUG_ON(!child_pid);
+ (void)xol_create_engine(child_pid,
+ (struct ubp_xol_area *) engine->data);
+ }
+ return UTRACE_RESUME;
+}
+
+/*
+ * When a multithreaded app execs, the exec-ing thread reports the
+ * exec, and the other threads report exit.
+ */
+static u32 xol_report_exec(u32 action,
+ struct utrace_engine *engine,
+ const struct linux_binfmt *fmt,
+ const struct linux_binprm *bprm,
+ struct pt_regs *regs)
+{
+ xol_put_area((struct ubp_xol_area *)engine->data);
+ return UTRACE_DETACH;
+}
+
+static u32 xol_report_exit(u32 action, struct utrace_engine *engine,
+ long orig_code, long *code)
+{
+ xol_put_area((struct ubp_xol_area *)engine->data);
+ return UTRACE_DETACH;
+}
+
+static const struct utrace_engine_ops xol_engine_ops = {
+ .report_exit = xol_report_exit,
+ .report_clone = xol_report_clone,
+ .report_exec = xol_report_exec
+};
+
+/*
+ * @start_pid is the pid for a thread in the traced process.
+ * Creating engines for a hugely multithreaded process can be
+ * time consuming. Hence engines for other threads are created
+ * outside the critical region.
+ */
+static void create_engine_sibling_threads(struct pid *start_pid,
+ struct ubp_xol_area *area)
+{
+ struct task_struct *t, *start;
+ struct utrace_engine *engine;
+ struct pid *pid = NULL;
+
+ rcu_read_lock();
+ start = pid_task(start_pid, PIDTYPE_PID);
+ t = start;
+ if (t) {
+ do {
+ if (t->exit_state) {
+ t = next_thread(t);
+ continue;
+ }
+
+ /*
+ * This doesn't sleep, does minimal error checking.
+ */
+ engine = utrace_attach_task(t,
+ UTRACE_ATTACH_MATCH_OPS,
+ &xol_engine_ops, NULL);
+ if (PTR_ERR(engine) == -ENOENT) {
+ pid = get_pid(task_pid(t));
+ (void)xol_create_engine(pid, area);
+ } else if (!IS_ERR(engine))
+ utrace_engine_put(engine);
+
+ t = next_thread(t);
+ } while (t != start);
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * xol_get_area - Get a reference to process's ubp_xol_area.
+ * If an ubp_xol_area doesn't exist for @tg_leader's process, create
+ * one. In any case, increment its refcount and return a pointer
+ * to it.
+ * @tg_leader: pointer to struct pid of a thread whose tid is the
+ * thread group id
+ */
+void *xol_get_area(struct pid *tg_leader)
+{
+ struct ubp_xol_area *area = NULL;
+ struct utrace_engine *engine;
+ struct pid *pid;
+ int ret;
+
+ pid = get_pid(tg_leader);
+ mutex_lock(&xol_mutex);
+ engine = utrace_attach_pid(tg_leader, UTRACE_ATTACH_MATCH_OPS,
+ &xol_engine_ops, NULL);
+ if (!IS_ERR(engine)) {
+ area = engine->data;
+ utrace_engine_put(engine);
+ mutex_unlock(&xol_mutex);
+ goto found_area;
+ }
+
+ area = kzalloc(sizeof(*area), GFP_USER);
+ if (unlikely(!area)) {
+ mutex_unlock(&xol_mutex);
+ return NULL;
+ }
+ mutex_init(&area->mutex);
+ kref_init(&area->kref);
+ area->last_vma = NULL;
+ area->can_expand = true;
+ area->tgid = pid_task(tg_leader, PIDTYPE_PID)->tgid;
+ INIT_LIST_HEAD(&area->vmas);
+ ret = xol_create_engine(pid, area);
+ mutex_unlock(&xol_mutex);
+
+ if (ret != 0) {
+ kfree(area);
+ return NULL;
+ }
+ create_engine_sibling_threads(pid, area);
+
+found_area:
+ if (likely(area))
+ kref_get(&area->kref);
+ return (void *) area;
+}
+
+static void xol_free_area(struct kref *kref)
+{
+ struct ubp_xol_vma *usv, *tmp;
+ struct ubp_xol_area *area;
+
+ area = container_of(kref, struct ubp_xol_area, kref);
+ list_for_each_entry_safe(usv, tmp, &area->vmas, list) {
+ kfree(usv->bitmap);
+ kfree(usv);
+ }
+ kfree(area);
+}
+
+/*
+ * Allocate a bitmap for a new vma, or expand an existing bitmap.
+ * If old_bitmap is non-NULL, xol_realloc_bitmap() never returns
+ * old_bitmap.
+ */
+static unsigned long *xol_realloc_bitmap(unsigned long *old_bitmap,
+ int old_nslots, int new_nslots)
+{
+ unsigned long *new_bitmap;
+
+ BUG_ON(new_nslots < old_nslots);
+
+ new_bitmap = kzalloc(BITS_TO_LONGS(new_nslots) * sizeof(long),
+ GFP_USER);
+ if (!new_bitmap) {
+ printk(KERN_ERR "ubp_xol: cannot %sallocate bitmap for XOL "
+ "area for pid/tgid %d/%d\n", (old_bitmap ? "re" : ""),
+ current->pid, current->tgid);
+ return NULL;
+ }
+ if (old_bitmap)
+ memcpy(new_bitmap, old_bitmap,
+ BITS_TO_LONGS(old_nslots) * sizeof(long));
+ return new_bitmap;
+}
+
+static struct ubp_xol_vma *xol_alloc_vma(void)
+{
+ struct ubp_xol_vma *usv;
+
+ usv = kzalloc(sizeof(struct ubp_xol_vma), GFP_USER);
+ if (!usv) {
+ printk(KERN_ERR "ubp_xol: cannot allocate kmem for XOL vma"
+ " for pid/tgid %d/%d\n", current->pid, current->tgid);
+ return NULL;
+ }
+ usv->bitmap = xol_realloc_bitmap(NULL, 0, UINSNS_PER_PAGE);
+ if (!usv->bitmap) {
+ kfree(usv);
+ return NULL;
+ }
+ return usv;
+}
+
+static inline struct ubp_xol_vma *xol_add_vma(struct ubp_xol_area *area)
+{
+ struct vm_area_struct *vma;
+ struct ubp_xol_vma *usv;
+ struct mm_struct *mm;
+ struct file *file;
+ unsigned long addr;
+
+ mm = get_task_mm(current);
+ if (!mm)
+ return ERR_PTR(-ESRCH);
+
+ usv = xol_alloc_vma();
+ if (!usv) {
+ mmput(mm);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ down_write(&mm->mmap_sem);
+ /*
+ * Find the end of the top mapping and skip a page.
+ * If there is no space for PAGE_SIZE above
+ * that, mmap will ignore our address hint.
+ *
+ * We allocate a "fake" unlinked shmem file because
+ * anonymous memory might not be granted execute
+ * permission when the selinux security hooks have
+ * their way.
+ */
+ vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
+ addr = vma->vm_end + PAGE_SIZE;
+ file = shmem_file_setup("uprobes/ssol", PAGE_SIZE, VM_NORESERVE);
+ if (!file) {
+ printk(KERN_ERR "ubp_xol failed to setup shmem_file while "
+ "allocating vma for pid/tgid %d/%d for "
+ "single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ addr = do_mmap_pgoff(file, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
+ fput(file);
+
+ if (addr & ~PAGE_MASK) {
+ printk(KERN_ERR "ubp_xol failed to allocate a vma for pid/tgid"
+ " %d/%d for single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ vma = find_vma(mm, addr);
+ BUG_ON(!vma);
+
+ /* Don't expand vma on mremap(). */
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+ usv->vaddr = vma->vm_start;
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ usv->npages = 1;
+ usv->nslots = UINSNS_PER_PAGE;
+ INIT_LIST_HEAD(&usv->list);
+ list_add_tail(&usv->list, &area->vmas);
+ area->last_vma = usv;
+ return usv;
+
+fail:
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ kfree(usv->bitmap);
+ kfree(usv);
+ return ERR_PTR(-ENOMEM);
+}
+
+/* Runs with area->mutex locked */
+static long xol_expand_vma(struct ubp_xol_vma *usv)
+{
+ struct vm_area_struct *vma;
+ unsigned long *new_bitmap;
+ struct mm_struct *mm;
+ unsigned long new_length, result;
+ int new_nslots;
+
+ new_length = PAGE_SIZE * (usv->npages + 1);
+ new_nslots = (int) ((usv->npages + 1) * UINSNS_PER_PAGE);
+
+ /* xol_realloc_bitmap() never returns usv->bitmap. */
+ new_bitmap = xol_realloc_bitmap(usv->bitmap, usv->nslots, new_nslots);
+ if (!new_bitmap)
+ return -ENOMEM;
+
+ mm = get_task_mm(current);
+ if (!mm) {
+ kfree(new_bitmap);
+ return -ESRCH;
+ }
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, usv->vaddr);
+ if (!vma) {
+ printk(KERN_ERR "pid/tgid %d/%d: ubp XOL vma at %#lx"
+ " has disappeared!\n", current->pid, current->tgid,
+ usv->vaddr);
+ result = -ENOMEM;
+ goto fail;
+ }
+ if (vma_pages(vma) != usv->npages || vma->vm_start != usv->vaddr) {
+ printk(KERN_ERR "pid/tgid %d/%d: ubp XOL vma has been"
+ " altered: %#lx/%ld pages; should be %#lx/%d pages\n",
+ current->pid, current->tgid, vma->vm_start,
+ vma_pages(vma), usv->vaddr, usv->npages);
+ result = -ENOMEM;
+ goto fail;
+ }
+ vma->vm_flags &= ~VM_DONTEXPAND;
+ result = do_mremap(usv->vaddr, usv->npages*PAGE_SIZE, new_length, 0, 0);
+ vma->vm_flags |= VM_DONTEXPAND;
+ if (IS_ERR_VALUE(result)) {
+ printk(KERN_WARNING "ubp_xol failed to expand the vma "
+ "for pid/tgid %d/%d for single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ BUG_ON(result != usv->vaddr);
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+
+ kfree(usv->bitmap);
+ usv->bitmap = new_bitmap;
+ usv->nslots = new_nslots;
+ usv->npages++;
+ return 0;
+
+fail:
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ kfree(new_bitmap);
+ return result;
+}
+
+/*
+ * Find a slot:
+ * - Search existing vmas for a free slot.
+ * - If no free slot in existing vmas, try expanding the last vma.
+ * - If unable to expand a vma, try adding a new vma.
+ *
+ * Runs with area->mutex locked.
+ */
+static unsigned long xol_take_insn_slot(struct ubp_xol_area *area)
+{
+ struct ubp_xol_vma *usv;
+ unsigned long slot_addr;
+ int slot_nr;
+
+ list_for_each_entry(usv, &area->vmas, list) {
+ slot_nr = find_first_zero_bit(usv->bitmap, usv->nslots);
+ if (slot_nr < usv->nslots) {
+ set_bit(slot_nr, usv->bitmap);
+ slot_addr = usv->vaddr +
+ (slot_nr * UBP_XOL_SLOT_BYTES);
+ return slot_addr;
+ }
+ }
+
+ /*
+ * All out of space. Need to allocate a new page.
+ * Only the probed process itself can add or expand vmas.
+ */
+ if (!area->can_expand || (area->tgid != current->tgid))
+ goto fail;
+
+ usv = area->last_vma;
+ if (usv) {
+ /* Expand vma, take first of newly added slots. */
+ slot_nr = usv->nslots;
+ if (xol_expand_vma(usv) != 0) {
+ printk(KERN_WARNING "Allocating additional vma.\n");
+ usv = NULL;
+ }
+ }
+ if (!usv) {
+ slot_nr = 0;
+ usv = xol_add_vma(area);
+ if (IS_ERR(usv))
+ goto cant_expand;
+ }
+
+ /* Take first slot of new page. */
+ set_bit(slot_nr, usv->bitmap);
+ slot_addr = usv->vaddr + (slot_nr * UBP_XOL_SLOT_BYTES);
+ return slot_addr;
+
+cant_expand:
+ area->can_expand = false;
+fail:
+ return 0;
+}
+
+/**
+ * xol_get_insn_slot - If @ubp has not been allocated a slot, allocate
+ * one. If ubp_insert_bkpt() has already been called (i.e.,
+ * ubp->vaddr != 0), copy the instruction into the slot.
+ * Allocating a free slot could result in
+ * - using a free slot in the current vma, or
+ * - expanding the last vma, or
+ * - adding a new vma.
+ * Returns the allocated slot address or 0.
+ * @ubp: probepoint information
+ * @xol_area: the unique per-process ubp_xol_area for this process.
+ */
+unsigned long xol_get_insn_slot(struct ubp_bkpt *ubp, void *xol_area)
+{
+ struct ubp_xol_area *area = (struct ubp_xol_area *) xol_area;
+ int len;
+
+ if (unlikely(!area))
+ return 0;
+ mutex_lock(&area->mutex);
+ if (likely(!ubp->xol_vaddr)) {
+ ubp->xol_vaddr = xol_take_insn_slot(area);
+ /*
+ * Initialize the slot if ubp->vaddr points to valid
+ * instruction slot.
+ */
+ if (likely(ubp->xol_vaddr) && ubp->vaddr) {
+ len = access_process_vm(current, ubp->xol_vaddr,
+ ubp->insn, UBP_XOL_SLOT_BYTES, 1);
+ if (unlikely(len < UBP_XOL_SLOT_BYTES))
+ printk(KERN_ERR "Failed to copy instruction"
+ " at %#lx len = %d\n",
+ ubp->vaddr, len);
+ }
+ }
+ mutex_unlock(&area->mutex);
+ return ubp->xol_vaddr;
+}
+
+/**
+ * xol_free_insn_slot - If the slot was earlier allocated by
+ * xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ * @slot_addr: slot address as returned by
+ * xol_get_insn_slot().
+ * @xol_area: the unique per-process ubp_xol_area for this process.
+ */
+void xol_free_insn_slot(unsigned long slot_addr, void *xol_area)
+{
+ struct ubp_xol_area *area = (struct ubp_xol_area *) xol_area;
+ struct ubp_xol_vma *usv;
+ int found = 0;
+
+ if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+ return;
+ if (unlikely(!area))
+ return;
+ mutex_lock(&area->mutex);
+ list_for_each_entry(usv, &area->vmas, list) {
+ unsigned long vma_end = usv->vaddr + usv->npages*PAGE_SIZE;
+ if (usv->vaddr <= slot_addr && slot_addr < vma_end) {
+ int slot_nr;
+ unsigned long offset = slot_addr - usv->vaddr;
+ BUG_ON(offset % UBP_XOL_SLOT_BYTES);
+ slot_nr = offset / UBP_XOL_SLOT_BYTES;
+ BUG_ON(slot_nr >= usv->nslots);
+ clear_bit(slot_nr, usv->bitmap);
+ found = 1;
+ }
+ }
+ mutex_unlock(&area->mutex);
+ if (!found)
+ printk(KERN_ERR "%s: no XOL vma for slot address %#lx\n",
+ __func__, slot_addr);
+}
+
+/**
+ * xol_validate_vaddr - Verify if the specified address is in an
+ * executable vma, but not in an XOL vma.
+ * - Return 0 if the specified virtual address is in an
+ * executable vma, but not in an XOL vma.
+ * - Return 1 if the specified virtual address is in an
+ * XOL vma.
+ * - Return -EINTR otherwise (i.e., a non-executable vma, or
+ * an invalid address).
+ * @pid: the probed process
+ * @vaddr: virtual address of the instruction to be validated.
+ * @xol_area: the unique per-process ubp_xol_area for this process.
+ */
+int xol_validate_vaddr(struct pid *pid, unsigned long vaddr, void *xol_area)
+{
+ struct ubp_xol_area *area = (struct ubp_xol_area *) xol_area;
+ struct ubp_xol_vma *usv;
+ struct task_struct *tsk;
+ int result;
+
+ tsk = pid_task(pid, PIDTYPE_PID);
+ result = ubp_validate_insn_addr(tsk, vaddr);
+ if (result != 0)
+ return result;
+
+ if (unlikely(!area))
+ return 0;
+ mutex_lock(&area->mutex);
+ list_for_each_entry(usv, &area->vmas, list) {
+ unsigned long vma_end = usv->vaddr + usv->npages*PAGE_SIZE;
+ if (usv->vaddr <= vaddr && vaddr < vma_end) {
+ result = 1;
+ break;
+ }
+ }
+ mutex_unlock(&area->mutex);
+ return result;
+}

2010-01-11 12:26:08

by Srikar Dronamraju

Subject: [RFC] [PATCH 5/7] X86 Support for Uprobes

[PATCH] x86 support for Uprobes

Signed-off-by: Jim Keniston <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/uprobes.h | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+)

Index: new_uprobes.git/arch/x86/Kconfig
===================================================================
--- new_uprobes.git.orig/arch/x86/Kconfig
+++ new_uprobes.git/arch/x86/Kconfig
@@ -51,6 +51,7 @@ config X86
select HAVE_KERNEL_LZMA
select HAVE_HW_BREAKPOINT
select HAVE_UBP
+ select HAVE_UPROBES
select HAVE_ARCH_KMEMCHECK
select HAVE_USER_RETURN_NOTIFIER

Index: new_uprobes.git/arch/x86/include/asm/uprobes.h
===================================================================
--- /dev/null
+++ new_uprobes.git/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,27 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ * uprobes.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008, 2009
+ */
+#include <linux/signal.h>
+
+#define BREAKPOINT_SIGNAL SIGTRAP
+#define SSTEP_SIGNAL SIGTRAP
+#endif /* _ASM_UPROBES_H */

2010-01-11 12:26:22

by Srikar Dronamraju

Subject: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

This patch implements an ftrace plugin for uprobes.

Description:
The ftrace plugin provides an interface to dump data at a given address,
the top of the stack, and function arguments when a user program calls a
specific function.

To dump the data at a given address, issue:
echo up <pid> <address to probe> D <data address> <size> >> /sys/kernel/debug/tracing/uprobe_events

To dump the data from the top of the stack, issue:
echo up <pid> <address to probe> S <size> >> /sys/kernel/debug/tracing/uprobe_events

To dump the function arguments, issue:
echo up <pid> <address to probe> A <num-args> >> /sys/kernel/debug/tracing/uprobe_events

D => Dump the data at a given address.
S => Dump the data from top of stack.
A => Dump probed function arguments. Supported only for x86_64 arch.

For example:
Input:
$ cd /sys/kernel/debug/tracing/
$ echo "up 6424 0x4004d8 S 100" > uprobe_events
$ echo "up 6424 0x4004d8 D 0x7fff6bf587d0 35" >> uprobe_events
$ echo "up 6424 0x4004d8 A 5" >> uprobe_events
$
$ cat uprobe_events
up 6424 0x4004d8 S 100
up 6424 0x4004d8 D 0x7fff6bf587d0 35
up 6424 0x4004d8 A 5
$
$ echo 1 > tracing_on


Output:

$ cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-6424 [004] 1156.853343: : 0x4004d8: S 0x7fff6bf587a8: 31 06 40 00 00 00 00 00 1.@.....
<...>-6424 [004] 1156.853348: : 0x4004d8: S 0x7fff6bf587b0: 00 00 00 00 00 00 00 00 ........
<...>-6424 [004] 1156.853350: : 0x4004d8: S 0x7fff6bf587b8: c0 bb c1 4a 3b 00 00 00 ...J;...
<...>-6424 [004] 1156.853352: : 0x4004d8: S 0x7fff6bf587c0: 50 06 40 00 c8 00 00 00 P.@.....
<...>-6424 [004] 1156.853353: : 0x4004d8: S 0x7fff6bf587c8: ed 00 00 ff 00 00 00 00 ........
<...>-6424 [004] 1156.853355: : 0x4004d8: S 0x7fff6bf587d0: 54 68 69 73 20 73 74 72 This str
<...>-6424 [004] 1156.853357: : 0x4004d8: S 0x7fff6bf587d8: 69 6e 67 20 69 73 20 6f ing is o
<...>-6424 [004] 1156.853359: : 0x4004d8: S 0x7fff6bf587e0: 6e 20 74 68 65 20 73 74 n the st
<...>-6424 [004] 1156.853361: : 0x4004d8: S 0x7fff6bf587e8: 61 63 6b 20 69 6e 20 6d ack in m
<...>-6424 [004] 1156.853363: : 0x4004d8: S 0x7fff6bf587f0: 61 69 6e 00 00 00 00 00 ain.....
<...>-6424 [004] 1156.853364: : 0x4004d8: S 0x7fff6bf587f8: 00 00 00 00 04 00 00 00 ........
<...>-6424 [004] 1156.853366: : 0x4004d8: S 0x7fff6bf58800: ff ff ff ff ff ff ff ff ........
<...>-6424 [004] 1156.853367: : 0x4004d8: S 0x7fff6bf58808: 00 00 00 00 ....
<...>-6424 [004] 1156.853388: : 0x4004d8: D 0x7fff6bf587d0: 54 68 69 73 20 73 74 72 This str
<...>-6424 [004] 1156.853389: : 0x4004d8: D 0x7fff6bf587d8: 69 6e 67 20 69 73 20 6f ing is o
<...>-6424 [004] 1156.853391: : 0x4004d8: D 0x7fff6bf587e0: 6e 20 74 68 65 20 73 74 n the st
<...>-6424 [004] 1156.853393: : 0x4004d8: D 0x7fff6bf587e8: 61 63 6b 20 69 6e 20 6d ack in m
<...>-6424 [004] 1156.853394: : 0x4004d8: D 0x7fff6bf587f0: 61 69 6e ain
<...>-6424 [004] 1156.853398: : 0x4004d8: A ARG 1: 0000000000000004
<...>-6424 [004] 1156.853399: : 0x4004d8: A ARG 2: 00000000000000c8
<...>-6424 [004] 1156.853400: : 0x4004d8: A ARG 3: 00000000ff0000ed
<...>-6424 [004] 1156.853401: : 0x4004d8: A ARG 4: ffffffffffffffff
<...>-6424 [004] 1156.853402: : 0x4004d8: A ARG 5: 0000000000000048

TODO:
- Use the ring buffer.
- Allow the user to specify a nickname for probe addresses.
- Dump arguments from floating point registers.
- Optimize code to use a single probe instead of multiple probes for the
 same probe address.

--
Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Documentation/trace/uprobes_trace.txt | 197 ++++++++++++
kernel/trace/Makefile | 1 +
kernel/trace/trace_uprobes.c | 537 +++++++++++++++++++++++++++++++++
3 files changed, 735 insertions(+), 0 deletions(-)

diff --git a/Documentation/trace/uprobes_trace.txt b/Documentation/trace/uprobes_trace.txt
new file mode 100644
index 0000000..3c4482b
--- /dev/null
+++ b/Documentation/trace/uprobes_trace.txt
@@ -0,0 +1,197 @@
+ Uprobes based Event Tracer
+ ==========================
+
+ Mahesh J Salgaonkar
+
+Overview
+--------
+This tracer, based on uprobes, enables a user to put a probe anywhere in a
+user process and dump values from a user-specified data address or from the
+top of the stack frame when the probe is hit.
+
+For 64-bit processes on x86_64, the tracer can also report function arguments
+when the probe is hit. Currently, this feature is not supported for 32-bit
+processes.
+
+To activate this tracer, just set a probe via
+/sys/kernel/debug/tracing/uprobe_events; the traced information can be seen
+via /sys/kernel/debug/tracing/trace.
+
+Users can specify probes for multiple processes concurrently.
+
+Synopsis
+--------
+up <pid> <address-to-probe> <type> [<data-address>] {<size>|<numargs>}
+
+up : set a user probe
+<pid> : Process ID.
+<address-to-probe> : Instruction address to probe in the user process.
+<type> : Type of data to dump.
+ D => Dump the data from the specified data address
+ S => Dump the data from the top of the stack
+ A => Dump the function arguments (x86_64 only).
+[<data-address>] : Data address. Applicable only for type 'D'.
+<size> : Number of bytes of data to dump.
+<numargs> : Number of arguments to dump.
+
+To dump the data at a given address when the probe is hit, run:
+echo up <pid> <address to probe> D <data address> <size> >> /sys/kernel/debug/tracing/uprobe_events
+
+To dump the data from the top of the stack when the probe is hit, run:
+echo up <pid> <address to probe> S <size> >> /sys/kernel/debug/tracing/uprobe_events
+
+To extract the function arguments when the probe is hit, run:
+echo up <pid> <address to probe> A <numargs> >> /sys/kernel/debug/tracing/uprobe_events
+
+Usage Examples
+--------------
+Let us consider the following sample C program:
+
+/* SAMPLE C PROGRAM */
+#include <stdio.h>
+#include <stdlib.h>
+
+char *global_str_p = "Global String pointer";
+char global_str[] = "Global String";
+
+int foo(int a, unsigned int b, unsigned long c, long d, char e)
+{
+ return 0;
+}
+
+int main()
+{
+ char str[] = "This string is on the stack in main";
+ int a = 4;
+ unsigned int b = 200;
+ unsigned long c = 0xff0000ed;
+ long d = -1;
+ char e = 'H';
+
+ while (getchar() != EOF)
+ foo(a, b, c, d, e);
+
+ return 0;
+}
+/* SAMPLE C PROGRAM */
+
+This example puts a probe at function foo() and dumps some data values, the
+top of the stack and all five arguments passed to function foo().
+
+The probe address for function foo can be acquired using the 'nm' utility on
+the executable file as below:
+
+ $ gcc sample.c -o sample
+ $ nm sample | grep foo
+ 0000000000400498 T foo
+
+We will also dump the data from the global variables 'global_str_p' and
+'global_str'. The data addresses for these variables can be acquired as below:
+
+ $ nm sample | grep global
+ 0000000000600960 D global_str
+ 0000000000600958 D global_str_p
+
+When setting the probe, you need to specify the process id of the user process
+to trace. The process id can be determined by using the 'ps' command.
+
+ $ ps -a | grep sample
+ 3906 pts/6 00:00:00 sample
+
+Now set a probe at function foo() as a new event that dumps 100 bytes from the
+stack as shown below:
+
+$ echo "up 3906 0x0000000000400498 S 100" > /sys/kernel/tracing/uprobes_events
+
+Set additional probes at function foo() to dump the data from the global
+variables as shown below:
+
+$ echo "up 3906 0x0000000000400498 D 0000000000600960 15" >> /sys/kernel/tracing/uprobes_events
+$ echo "up 3906 0x0000000000400498 D 0000000000600958 8" >> /sys/kernel/tracing/uprobes_events
+
+Set another probe at function foo() to dump all five arguments passed to
+function foo(). (This option is only valid for x86_64 architecture.)
+
+$ echo "up 3906 0x0000000000400498 A 5" >> /sys/kernel/tracing/uprobes_events
+
+To see all the current uprobe events:
+
+$ cat /sys/kernel/debug/tracing/uprobe_events
+up 3906 0x400498 S 100
+up 3906 0x400498 D 0x600960 15
+up 3906 0x400498 D 0x600958 8
+up 3906 0x400498 A 5
+
+When the function foo() gets called, all the above probes will hit and you can
+see the traced information via /sys/kernel/debug/tracing/trace:
+
+$ cat /sys/kernel/debug/tracing/trace
+# tracer: nop
+#
+# TASK-PID CPU# TIMESTAMP FUNCTION
+# | | | | |
+ <...>-3906 [001] 391.531431: : 0x400498: S 0x7fffd934eba8: 38 05 40 00 00 00 00 00 8.@.....
+ <...>-3906 [001] 391.531436: : 0x400498: S 0x7fffd934ebb0: 54 68 69 73 20 73 74 72 This str
+ <...>-3906 [001] 391.531438: : 0x400498: S 0x7fffd934ebb8: 69 6e 67 20 69 73 20 6f ing is o
+ <...>-3906 [001] 391.531439: : 0x400498: S 0x7fffd934ebc0: 6e 20 74 68 65 20 73 74 n the st
+ <...>-3906 [001] 391.531441: : 0x400498: S 0x7fffd934ebc8: 61 63 6b 20 69 6e 20 6d ack in m
+ <...>-3906 [001] 391.531443: : 0x400498: S 0x7fffd934ebd0: 61 69 6e 00 00 00 00 01 ain.....
+ <...>-3906 [001] 391.531445: : 0x400498: S 0x7fffd934ebd8: c0 bb c1 4a 3b 00 00 00 ...J;...
+ <...>-3906 [001] 391.531446: : 0x400498: S 0x7fffd934ebe0: 04 00 00 00 c8 00 00 00 ........
+ <...>-3906 [001] 391.531448: : 0x400498: S 0x7fffd934ebe8: ed 00 00 ff 00 00 00 00 ........
+ <...>-3906 [001] 391.531450: : 0x400498: S 0x7fffd934ebf0: ff ff ff ff ff ff ff ff ........
+ <...>-3906 [001] 391.531452: : 0x400498: S 0x7fffd934ebf8: 00 00 00 00 00 00 00 48 .......H
+ <...>-3906 [001] 391.531453: : 0x400498: S 0x7fffd934ec00: 00 00 00 00 00 00 00 00 ........
+ <...>-3906 [001] 391.531455: : 0x400498: S 0x7fffd934ec08: 74 d9 e1 4a t..J
+ <...>-3906 [001] 391.531489: : 0x400498: D 0x600960: 47 6c 6f 62 61 6c 20 53 Global S
+ <...>-3906 [001] 391.531491: : 0x400498: D 0x600968: 74 72 69 6e 67 00 00 tring..
+ <...>-3906 [001] 391.531500: : 0x400498: D 0x600958: 48 06 40 00 00 00 00 00 H.@.....
+ <...>-3906 [001] 391.531504: : 0x400498: A ARG 1: 0000000000000004
+ <...>-3906 [001] 391.531505: : 0x400498: A ARG 2: 00000000000000c8
+ <...>-3906 [001] 391.531505: : 0x400498: A ARG 3: 00000000ff0000ed
+ <...>-3906 [001] 391.531506: : 0x400498: A ARG 4: ffffffffffffffff
+ <...>-3906 [001] 391.531507: : 0x400498: A ARG 5: 0000000000000048
+
+Under the FUNCTION column, each line shows the probe address, type, data/stack
+address, and 8 bytes of data in hex followed by the ASCII representation of the
+hex values. If the size specified is more than 8 bytes, multiple lines
+are used to dump the data values. In the case of type 'A', one argument is
+shown per line.
+
+The lines with type 'S' in the tracer output display 100 bytes (8 bytes per
+line) from the top of the stack when the probed function foo() is hit. The
+lines with type 'A' dump all five arguments passed to the function foo(). The
+first two lines with type 'D' dump 15 bytes of data from the global variable
+'global_str' at data address 0x600960. The third line with type 'D' dumps 8
+bytes of data from the global string pointer variable 'global_str_p' at
+0x600958. The output shows that it holds the address 0x0000000000400648. As
+per the sample program, this should point to a const string of 21 characters.
+Let's dump the data values at this address.
+
+echo "up 3906 0x0000000000400498 D 0x0000000000400648 24" > /sys/kernel/tracing/uprobes_events
+
+Please note that we have not used '>>' operator here; as a result, all
+existing probes will be cleared before this new probe is set.
+
+Take a look at the tracer output.
+
+$ cat /sys/kernel/debug/tracing/trace
+# tracer: nop
+#
+# TASK-PID CPU# TIMESTAMP FUNCTION
+# | | | | |
+ <...>-3906 [001] 442.537669: : 0x400498: D 0x400648: 47 6c 6f 62 61 6c 20 53 Global S
+ <...>-3906 [001] 442.537674: : 0x400498: D 0x400650: 74 72 69 6e 67 20 70 6f tring po
+ <...>-3906 [001] 442.537676: : 0x400498: D 0x400658: 69 6e 74 65 72 00 00 00 inter...
+
+
+To clear all the probe events, run:
+
+echo > /sys/kernel/tracing/uprobes_events
+
+TODO:
+- Allow the user to attach a name to probe addresses for address translation.
+- Support reporting of arguments from 32-bit applications.
+- Dump arguments from floating point registers.
+- Optimize code to use a single probe instead of multiple probes for the
+ same probe address.
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 26f03ac..623541f 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -54,5 +54,6 @@ obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
obj-$(CONFIG_EVENT_PROFILE) += trace_event_profile.o
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_EVENT_TRACING) += power-traces.o
+obj-$(CONFIG_UPROBES) += trace_uprobes.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_uprobes.c b/kernel/trace/trace_uprobes.c
new file mode 100644
index 0000000..c6e3f90
--- /dev/null
+++ b/kernel/trace/trace_uprobes.c
@@ -0,0 +1,537 @@
+/*
+ * Ftrace plugin for Userspace Probes (UProbes)
+ * kernel/trace/trace_uprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ */
+#include <linux/uaccess.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/regset.h>
+#include <linux/pid.h>
+#include <linux/uprobes.h>
+
+#include "trace.h"
+
+struct trace_uprobe {
+ struct list_head list;
+ struct uprobe usp;
+ unsigned long daddr;
+ size_t length;
+
+#ifdef __x86_64__
+#define TYPE_ARG 'A'
+#endif
+#define TYPE_DATA 'D'
+#define TYPE_STACK 'S'
+ char type;
+};
+
+static DEFINE_MUTEX(trace_uprobe_lock);
+static LIST_HEAD(tu_list);
+
+#define NUMVALUES 8 /* Number of data values to print per line*/
+
+/* NUMVALUES*2 for hex values + NUMVALUES for spaces + 1 */
+#define HEXBUFSIZE ((NUMVALUES * 2) + NUMVALUES + 1)
+
+#define CHARBUFSIZE NUMVALUES /* NUMVALUES characters */
+#define BUFSIZE (HEXBUFSIZE + CHARBUFSIZE)
+
+/*
+ * uprobe handler to dump data values and the top of the
+ * stack frame through tracer.
+ *
+ * The output is pushed to tracer in following format:
+ *
+ * <probe-address>: <type> <data/stack-address>: <data o/p>
+ *
+ * The <data o/p> is divided into two parts - the hex area and
+ * the char area. The hex area contains hex data values.
+ * The number of hex data values contained are controlled
+ * by NUMVALUES. The char area is the ascii representation
+ * of hex data values.
+ *
+ * |<---------- BUFSIZE + 1------------>|
+ *
+ * +-----------------+---------------+--+
+ * obuf | HEX Area | CHAR Area |\0|
+ * +-----------------+---------------+--+
+ * ^ ^ ^
+ * |<--HEXBUFSIZE -->|<-CHARBUFSIZE->|
+ *
+ * 0x400498: S 0x7fffd934eba8: c8 00 00 00 ed 00 00 ff ........
+ * 0x400498: S 0x7fffd934ebb0: 54 68 69 73 20 73 74 72 This str
+ */
+
+static void uprobe_handler(struct uprobe *u, struct pt_regs *regs)
+{
+ struct trace_uprobe *tu;
+ char *buf;
+ unsigned long ip = instruction_pointer(regs), daddr;
+ int len;
+ char obuf[BUFSIZE + 1];
+
+ tu = container_of(u, struct trace_uprobe, usp);
+ buf = kzalloc(tu->length + 1, GFP_KERNEL);
+ if (!buf)
+ return;
+
+ if (tu->type == TYPE_STACK) {
+ /* Get Stack Pointer. Dump stack memory */
+ daddr = (unsigned long)user_stack_pointer(regs);
+ } else
+ daddr = tu->daddr;
+
+ len = tu->length;
+ if (!copy_from_user(buf, (void __user *)daddr, tu->length)) {
+ int pos = 0;
+
+ for (pos = 0; pos < len; pos += NUMVALUES) {
+ char *hp = obuf; /* Hex area buf pointer */
+ char *cp = hp + HEXBUFSIZE; /* char area buf pointer */
+ int i = 0, last;
+
+ memset(obuf, ' ', BUFSIZE);
+ obuf[BUFSIZE] = '\0';
+
+ last = pos + (NUMVALUES - 1);
+ if (last >= len)
+ last = len - 1;
+
+ for (i = pos; i <= last; i++) {
+ sprintf(hp, "%02x", (unsigned char)buf[i]);
+
+ /*
+ * Character representation..
+ * ignore non-printable chars
+ */
+ if ((buf[i] >= ' ') && (buf[i] <= '~'))
+ *cp = buf[i];
+ else
+ *cp = '.';
+
+ hp += 2;
+ *hp++ = ' ';
+ cp++;
+ }
+
+ __trace_bprintk(ip, "0x%lx: %c 0x%lx: %s\n",
+ tu->usp.vaddr, tu->type,
+ (daddr + pos), obuf);
+ }
+ } else {
+ __trace_bprintk(ip, "0x%lx: %c 0x%lx: "
+ "Data capture failed. Invalid address\n",
+ tu->usp.vaddr, tu->type, daddr);
+ }
+ kfree(buf);
+}
+
+#ifdef __x86_64__
+
+/*
+ * uprobe handler to dump function arguments through tracer.
+ * Currently, supported for x86_64 architecture.
+ * Argument extraction as per x86_64 ABI (Application Binary
+ * Interface) document Version 0.99.
+ *
+ * The output is pushed to tracer in following format:
+ *
+ * <probe-address>: A ARG #: <value>
+ *
+ * e.g.
+ * 0x400498: A ARG 1: 0000000000000004
+ * 0x400498: A ARG 2: 00000000000000c8
+ */
+static void uprobe_handler_args(struct uprobe *u, struct pt_regs *regs)
+{
+ struct trace_uprobe *tu;
+ unsigned long ip = instruction_pointer(regs);
+ unsigned long args[6];
+ int i;
+
+ tu = container_of(u, struct trace_uprobe, usp);
+
+ /* Function arguments */
+ args[0] = regs->di;
+ args[1] = regs->si;
+ args[2] = regs->dx;
+ args[3] = regs->cx;
+ args[4] = regs->r8;
+ args[5] = regs->r9;
+
+ for (i = 0; i < tu->length; i++) {
+ __trace_bprintk(ip, "0x%lx: %c ARG %d: %016lx\n",
+ u->vaddr, tu->type, i + 1, args[i]);
+ }
+}
+#endif
+
+/*
+ * Updates the size/numargs of an existing probe event, if found.
+ */
+static struct trace_uprobe *update_trace_probe(pid_t pid,
+ unsigned long taddr, unsigned long daddr, size_t length,
+ char type)
+{
+ struct trace_uprobe *tu, *tmp;
+
+ mutex_lock(&trace_uprobe_lock);
+ list_for_each_entry_safe(tu, tmp, &tu_list, list) {
+ if ((tu->usp.pid == pid) && (tu->usp.vaddr == taddr)
+ && (tu->type == type) && (tu->daddr == daddr)) {
+ tu->length = length;
+ mutex_unlock(&trace_uprobe_lock);
+ return tu;
+ }
+ }
+ mutex_unlock(&trace_uprobe_lock);
+ return NULL;
+}
+
+/*
+ * Creates a new probe event entry and sets the user probe by calling
+ * register_uprobe()
+ */
+static int trace_register_uprobe(pid_t pid, unsigned long taddr,
+ unsigned long daddr, size_t length, char type)
+{
+ struct trace_uprobe *tu;
+ int ret = 0;
+
+ /* Check for duplication. If a probe for the same data address
+ * already exists, just update the length.
+ */
+ tu = update_trace_probe(pid, taddr, daddr, length, type);
+ if (tu)
+ return 0;
+
+ /* This is a new probe. */
+ tu = kzalloc(sizeof(struct trace_uprobe), GFP_KERNEL);
+ if (!tu)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&tu->list);
+ tu->length = length;
+ tu->daddr = daddr;
+ tu->type = type;
+ tu->usp.pid = pid;
+ tu->usp.vaddr = taddr;
+#ifdef __x86_64__
+ tu->usp.handler = (tu->type == TYPE_ARG) ?
+ uprobe_handler_args : uprobe_handler;
+#else
+ tu->usp.handler = uprobe_handler;
+#endif
+ ret = register_uprobe(&tu->usp);
+
+ if (ret) {
+ pr_err("register_uprobe(pid=%d vaddr=%lx) = ret(%d) failed\n",
+ pid, taddr, ret);
+ kfree(tu);
+ return ret;
+ }
+ mutex_lock(&trace_uprobe_lock);
+ list_add_tail(&tu->list, &tu_list);
+ mutex_unlock(&trace_uprobe_lock);
+ return 0;
+}
+
+static void uprobes_clear_all_events(void)
+{
+ struct trace_uprobe *tu, *tmp;
+
+ mutex_lock(&trace_uprobe_lock);
+ list_for_each_entry_safe(tu, tmp, &tu_list, list) {
+ unregister_uprobe(&tu->usp);
+ list_del(&tu->list);
+ kfree(tu);
+ }
+ mutex_unlock(&trace_uprobe_lock);
+}
+
+/* User probes listing interfaces */
+static void *uprobes_seq_start(struct seq_file *m, loff_t *pos)
+{
+ mutex_lock(&trace_uprobe_lock);
+ return seq_list_start(&tu_list, *pos);
+}
+
+static void *uprobes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ return seq_list_next(v, &tu_list, pos);
+}
+
+static void uprobes_seq_stop(struct seq_file *m, void *v)
+{
+ mutex_unlock(&trace_uprobe_lock);
+}
+
+static int uprobes_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tu = v;
+
+ if (tu == NULL)
+ return 0;
+
+ if (tu->type == TYPE_DATA)
+ seq_printf(m, "%-3s%d 0x%lx D 0x%lx %zu\n",
+ "up", tu->usp.pid, tu->usp.vaddr, tu->daddr, tu->length);
+ else
+ seq_printf(m, "%-3s%d 0x%lx %c %zu\n",
+ "up", tu->usp.pid, tu->usp.vaddr, tu->type, tu->length);
+
+ return 0;
+}
+
+static const struct seq_operations uprobes_seq_ops = {
+ .start = uprobes_seq_start,
+ .next = uprobes_seq_next,
+ .stop = uprobes_seq_stop,
+ .show = uprobes_seq_show
+};
+
+static int uprobe_events_open(struct inode *inode, struct file *file)
+{
+ if ((file->f_mode & FMODE_WRITE) &&
+ !(file->f_flags & O_APPEND))
+ uprobes_clear_all_events();
+
+ return seq_open(file, &uprobes_seq_ops);
+}
+
+#ifdef __x86_64__
+static int process_check_64bit(pid_t p)
+{
+ struct pid *pid = NULL;
+ struct task_struct *tsk;
+ int ret = -ESRCH;
+
+ rcu_read_lock();
+ if (current->nsproxy)
+ pid = find_vpid(p);
+
+ if (pid) {
+ tsk = pid_task(pid, PIDTYPE_PID);
+
+ if (tsk) {
+ if (test_tsk_thread_flag(tsk, TIF_IA32)) {
+ pr_err("Option to dump arguments is"
+ "not supported for 32bit process\n");
+ ret = -EPERM;
+ } else
+ ret = 0;
+ }
+ }
+ rcu_read_unlock();
+ return ret;
+}
+#endif
+
+/*
+ * Input syntax:
+ * up <pid> <address-to-probe> <type> [<data-address>] <size>
+ */
+
+static int enable_uprobe_trace(int argc, char **argv)
+{
+ unsigned long taddr, daddr = 0, tmpval;
+ size_t dsize;
+ pid_t pid;
+ int ret = -EINVAL;
+ char type;
+
+ if ((argc < 5) || (argc > 6))
+ return -EINVAL;
+
+ if (strcmp(argv[0], "up"))
+ return -EINVAL;
+
+ /* get the pid */
+ ret = strict_strtoul(argv[1], 10, &tmpval);
+ if (ret)
+ return ret;
+
+ pid = (pid_t) tmpval;
+
+ /* get the address to probe */
+ ret = strict_strtoul(argv[2], 16, &taddr);
+ if (ret)
+ return ret;
+
+ /* See if user asked for Stack or Data address. */
+ if ((strlen(argv[3]) != 1) || (!isalpha(*argv[3])))
+ return -EINVAL;
+
+ switch (*argv[3]) {
+#ifdef __x86_64__
+ /*
+ * dumping of arguments supported only for x86_64 arch
+ */
+ case 'A':
+ case 'a':
+ type = TYPE_ARG;
+ if (argc > 5)
+ return -EINVAL;
+ /* Option 'A' is not supported for 32 bit process. */
+ ret = process_check_64bit(pid);
+ if (ret)
+ return ret;
+
+ daddr = 0;
+ break;
+#endif
+ case 'D':
+ case 'd':
+ type = TYPE_DATA;
+ if (argc < 6)
+ return -EINVAL;
+ /* get the data address */
+ ret = strict_strtoul(argv[4], 16, &daddr);
+ if (ret)
+ return ret;
+ break;
+ case 'S':
+ case 's':
+ type = TYPE_STACK;
+ if (argc > 5)
+ return -EINVAL;
+ daddr = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ /*
+ * In case of TYPE_DATA and TYPE_STACK: get the size of data to dump.
+ * In case of TYPE_ARG: this is the number of arguments to dump
+ */
+ ret = strict_strtoul(((type == TYPE_DATA) ?
+ argv[5] : argv[4]), 10, &tmpval);
+ if (ret)
+ return ret;
+
+ dsize = (size_t) tmpval;
+
+#ifdef __x86_64__
+ /* Only up to 6 args supported */
+ if ((type == TYPE_ARG) && (dsize > 6)) {
+ pr_err("Cannot dump more than 6 arguments\n");
+ return -EINVAL;
+ }
+#endif
+
+ ret = trace_register_uprobe(pid, taddr, daddr, dsize, type);
+ return ret;
+}
+
+/*
+ * Process commands written to /sys/kernel/debug/tracing/uprobe_events.
+ * Supports multiple lines. It reads the entire ubuf into a local buffer
+ * and then breaks the input into lines. Invokes enable_uprobe_trace()
+ * for each line after splitting them into args array.
+ */
+
+static ssize_t
+uprobe_events_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ char *kbuf, *start, *end = NULL, *tmp;
+ char **argv = NULL;
+ int argc = 0;
+ int ret = 0;
+ size_t done = 0;
+ size_t size;
+
+ if (!count)
+ return 0;
+
+ kbuf = kmalloc(count + 1, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ if (copy_from_user(kbuf, ubuf, count)) {
+ ret = -EFAULT;
+ goto err_out;
+ }
+
+ kbuf[count] = '\0';
+ for (start = kbuf; done < count; start = end + 1) {
+ end = strchr(start, '\n');
+ if (!end) {
+ pr_err("Line length is too long");
+ ret = -EINVAL;
+ goto err_out;
+ }
+ *end = '\0';
+ size = end - start + 1;
+ done += size;
+ /* Remove comments */
+ tmp = strchr(start, '#');
+ if (tmp)
+ *tmp = '\0';
+
+ argv = argv_split(GFP_KERNEL, start, &argc);
+ if (!argv) {
+ ret = -ENOMEM;
+ goto err_out;
+ }
+
+ if (argc)
+ ret = enable_uprobe_trace(argc, argv);
+
+ argv_free(argv);
+ if (ret < 0)
+ goto err_out;
+ }
+ ret = done;
+err_out:
+ kfree(kbuf);
+ return ret;
+}
+
+static const struct file_operations uprobes_events_ops = {
+ .open = uprobe_events_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = uprobe_events_write,
+};
+
+static __init int init_uprobe_trace(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *entry;
+
+ d_tracer = tracing_init_dentry();
+
+ entry = debugfs_create_file("uprobe_events", 0644, d_tracer,
+ NULL, &uprobes_events_ops);
+
+ if (!entry)
+ pr_warning("Could not create debugfs 'uprobe_events' entry\n");
+
+ return 0;
+}
+fs_initcall(init_uprobe_trace);

2010-01-11 12:26:10

by Srikar Dronamraju

Subject: [RFC] [PATCH 4/7] Uprobes Implementation

Uprobes Implementation

Uprobes Infrastructure enables users to dynamically establish
probepoints in user applications and collect information by executing
handler functions when the probepoints are hit.
Please refer to Documentation/uprobes.txt for more details.

This patch provides the core implementation of uprobes; a minimal
client sketch follows below. It builds on the utrace infrastructure.

You need to follow this up with the uprobes patch for your
architecture.
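
For illustration, here is a minimal sketch of a client module. It is a
sketch only: the pid and vaddr values are placeholders (in practice they
would be obtained with ps and nm, as the ftrace plugin documentation
shows), and it assumes only the struct uprobe fields and the
register_uprobe()/unregister_uprobe() calls declared in this patch.

#include <linux/module.h>
#include <linux/uprobes.h>

static void example_handler(struct uprobe *u, struct pt_regs *regs)
{
        /* Runs when the probed process hits the probepoint. */
        printk(KERN_INFO "probe hit: pid %d vaddr %#lx\n",
               u->pid, u->vaddr);
}

static struct uprobe example_probe = {
        .pid = 3906,       /* placeholder: tid of a thread in the process */
        .vaddr = 0x400498, /* placeholder: virtual address to probe */
        .handler = example_handler,
};

static int __init example_init(void)
{
        return register_uprobe(&example_probe);
}

static void __exit example_exit(void)
{
        unregister_uprobe(&example_probe);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");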

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 12
include/linux/uprobes.h | 292 ++++++
kernel/Makefile | 1
kernel/uprobes_core.c | 2017 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 2322 insertions(+)

Index: new_uprobes.git/arch/Kconfig
===================================================================
--- new_uprobes.git.orig/arch/Kconfig
+++ new_uprobes.git/arch/Kconfig
@@ -66,6 +66,16 @@ config UBP
in user applications. This service is used by components
such as uprobes. If in doubt, say "N".

+config UPROBES
+ bool "User-space probes (EXPERIMENTAL)"
+ depends on UTRACE && MODULES && UBP
+ depends on HAVE_UPROBES
+ help
+ Uprobes enables kernel modules to establish probepoints
+ in user applications and execute handler functions when
+ the probepoints are hit. For more information, refer to
+ Documentation/uprobes.txt. If in doubt, say "N".
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
@@ -115,6 +125,8 @@ config HAVE_KPROBES
config HAVE_KRETPROBES
bool

+config HAVE_UPROBES
+ def_bool n
#
# An arch should select this if it provides all these things:
#
Index: new_uprobes.git/include/linux/uprobes.h
===================================================================
--- /dev/null
+++ new_uprobes.git/include/linux/uprobes.h
@@ -0,0 +1,292 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ * include/linux/uprobes.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2009
+ */
+#include <linux/types.h>
+#include <linux/list.h>
+
+struct pt_regs;
+
+/* This is what the user supplies us. */
+struct uprobe {
+ /*
+ * The pid of the probed process. Currently, this can be the
+ * thread ID (task->pid) of any active thread in the process.
+ */
+ pid_t pid;
+
+ /* Location of the probepoint */
+ unsigned long vaddr;
+
+ /* Handler to run when the probepoint is hit */
+ void (*handler)(struct uprobe*, struct pt_regs*);
+
+ /*
+ * This function, if non-NULL, will be called upon completion of
+ * an ASYNCHRONOUS registration (i.e., one initiated by a uprobe
+ * handler). reg = 1 for register, 0 for unregister.
+ */
+ void (*registration_callback)(struct uprobe *u, int reg, int result);
+
+ /* Reserved for use by uprobes */
+ void *kdata;
+};
+
+#if defined(CONFIG_UPROBES)
+extern int register_uprobe(struct uprobe *u);
+extern void unregister_uprobe(struct uprobe *u);
+#else
+static inline int register_uprobe(struct uprobe *u)
+{
+ return -ENOSYS;
+}
+static inline void unregister_uprobe(struct uprobe *u)
+{
+}
+#endif /* CONFIG_UPROBES */
+
+#ifdef UPROBES_IMPLEMENTATION
+
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/wait.h>
+#include <asm/atomic.h>
+#include <linux/ubp.h>
+#include <linux/ubp_xol.h>
+#include <asm/uprobes.h>
+
+struct utrace_engine;
+struct task_struct;
+struct pid;
+
+enum uprobe_probept_state {
+ UPROBE_INSERTING, /* process quiescing prior to insertion */
+ UPROBE_BP_SET, /* breakpoint in place */
+ UPROBE_REMOVING, /* process quiescing prior to removal */
+ UPROBE_DISABLED /* removal completed */
+};
+
+enum uprobe_task_state {
+ UPTASK_QUIESCENT,
+ UPTASK_SLEEPING, /* See utask_fake_quiesce(). */
+ UPTASK_RUNNING,
+ UPTASK_BP_HIT,
+ UPTASK_SSTEP
+};
+
+enum uprobe_ssil_state {
+ SSIL_DISABLE,
+ SSIL_CLEAR,
+ SSIL_SET
+};
+
+#define UPROBE_HASH_BITS 5
+#define UPROBE_TABLE_SIZE (1 << UPROBE_HASH_BITS)
+
+/*
+ * uprobe_process -- not a user-visible struct.
+ * A uprobe_process represents a probed process. A process can have
+ * multiple probepoints (each represented by a uprobe_probept) and
+ * one or more threads (each represented by a uprobe_task).
+ */
+struct uprobe_process {
+ /*
+ * rwsem is write-locked for any change to the uprobe_process's
+ * graph (including uprobe_tasks, uprobe_probepts, and uprobe_kimgs) --
+ * e.g., due to probe [un]registration or special events like exit.
+ * It's read-locked during the whole time we process a probepoint hit.
+ */
+ struct rw_semaphore rwsem;
+
+ /* Table of uprobe_probepts registered for this process */
+ /* TODO: Switch to list_head[] per Ingo. */
+ struct hlist_head uprobe_table[UPROBE_TABLE_SIZE];
+
+ /* List of uprobe_probepts awaiting insertion or removal */
+ struct list_head pending_uprobes;
+
+ /* List of uprobe_tasks in this task group */
+ struct list_head thread_list;
+ int nthreads;
+ int n_quiescent_threads;
+
+ /* this goes on the uproc_table */
+ struct hlist_node hlist;
+
+ /*
+ * All threads (tasks) in a process share the same uprobe_process.
+ */
+ struct pid *tg_leader;
+ pid_t tgid;
+
+ /* Threads in UTASK_SLEEPING state wait here to be roused. */
+ wait_queue_head_t waitq;
+
+ /*
+ * We won't free the uprobe_process while...
+ * - any register/unregister operations on it are in progress; or
+ * - any uprobe_report_* callbacks are running; or
+ * - uprobe_table[] is not empty; or
+ * - any tasks are UTASK_SLEEPING in the waitq;
+ * refcount reflects this. We do NOT ref-count tasks (threads),
+ * since once the last thread has exited, the rest is academic.
+ */
+ atomic_t refcount;
+
+ /*
+ * finished = 1 means the process is execing or the last thread
+ * is exiting, and we're cleaning up the uproc. If the execed
+ * process is probed, a new uproc will be created.
+ */
+ bool finished;
+
+ /*
+ * 1 to single-step out of line; 0 for inline. This can drop to
+ * 0 if we can't set up the XOL area, but never goes from 0 to 1.
+ */
+ bool sstep_out_of_line;
+
+ /*
+ * Manages slots for instruction-copies to be single-stepped
+ * out of line.
+ */
+ void *xol_area;
+};
+
+/*
+ * uprobe_kimg -- not a user-visible struct.
+ * Holds implementation-only per-uprobe data.
+ * uprobe->kdata points to this.
+ */
+struct uprobe_kimg {
+ struct uprobe *uprobe;
+ struct uprobe_probept *ppt;
+
+ /*
+ * -EBUSY while we're waiting for all threads to quiesce so the
+ * associated breakpoint can be inserted or removed.
+ * 0 if the insert/remove operation has succeeded, or -errno
+ * otherwise.
+ */
+ int status;
+
+ /* on ppt's list */
+ struct list_head list;
+};
+
+/*
+ * uprobe_probept -- not a user-visible struct.
+ * A probepoint, at which several uprobes can be registered.
+ * Guarded by uproc->rwsem.
+ */
+struct uprobe_probept {
+ /* breakpoint/XOL details */
+ struct ubp_bkpt ubp;
+
+ /* The uprobe_kimg(s) associated with this uprobe_probept */
+ struct list_head uprobe_list;
+
+ enum uprobe_probept_state state;
+
+ /* The parent uprobe_process */
+ struct uprobe_process *uproc;
+
+ /*
+ * ppt goes in the uprobe_process->uprobe_table when registered --
+ * even before the breakpoint has been inserted.
+ */
+ struct hlist_node ut_node;
+
+ /*
+ * ppt sits in the uprobe_process->pending_uprobes queue while
+ * awaiting insertion or removal of the breakpoint.
+ */
+ struct list_head pd_node;
+
+ /* [un]register_uprobe() waits 'til bkpt inserted/removed */
+ wait_queue_head_t waitq;
+
+ /*
+ * ssil_lock, ssilq and ssil_state are used to serialize
+ * single-stepping inline, so threads don't clobber each other
+ * swapping the breakpoint instruction in and out. This helps
+ * prevent crashing the probed app, but it does NOT prevent
+ * probe misses while the breakpoint is swapped out.
+ * ssilq - threads wait for their chance to single-step inline.
+ */
+ spinlock_t ssil_lock;
+ wait_queue_head_t ssilq;
+ enum uprobe_ssil_state ssil_state;
+};
+
+/*
+ * uprobe_task -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ * Guarded by uproc->rwsem.
+ */
+struct uprobe_task {
+ /* Lives in the global utask_table */
+ struct hlist_node hlist;
+
+ /* Lives on the thread_list for the uprobe_process */
+ struct list_head list;
+
+ struct task_struct *tsk;
+ struct pid *pid;
+
+ /* The utrace engine for this task */
+ struct utrace_engine *engine;
+
+ /* Back pointer to the associated uprobe_process */
+ struct uprobe_process *uproc;
+
+ enum uprobe_task_state state;
+
+ /*
+ * quiescing = 1 means this task has been asked to quiesce.
+ * It may not be able to comply immediately if it's hit a bkpt.
+ */
+ bool quiescing;
+
+ /* Set before running handlers; cleared after single-stepping. */
+ struct uprobe_probept *active_probe;
+
+ /* Saved address of copied original instruction */
+ long singlestep_addr;
+
+ struct ubp_task_arch_info arch_info;
+
+ /*
+ * Unexpected error in probepoint handling has left task's
+ * text or stack corrupted. Kill task ASAP.
+ */
+ bool doomed;
+
+ /* [un]registrations initiated by handlers must be asynchronous. */
+ struct list_head deferred_registrations;
+
+ /* Delay handler-destined signals 'til after single-step done. */
+ struct list_head delayed_signals;
+};
+
+#endif /* UPROBES_IMPLEMENTATION */
+
+#endif /* _LINUX_UPROBES_H */
Index: new_uprobes.git/kernel/Makefile
===================================================================
--- new_uprobes.git.orig/kernel/Makefile
+++ new_uprobes.git/kernel/Makefile
@@ -104,6 +104,7 @@ obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_b
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_UBP) += ubp_core.o
obj-$(CONFIG_UBP_XOL) += ubp_xol.o
+obj-$(CONFIG_UPROBES) += uprobes_core.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: new_uprobes.git/kernel/uprobes_core.c
===================================================================
--- /dev/null
+++ new_uprobes.git/kernel/uprobes_core.c
@@ -0,0 +1,2017 @@
+/*
+ * Userspace Probes (UProbes)
+ * kernel/uprobes_core.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2009
+ */
+#include <linux/types.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+#include <linux/err.h>
+#include <linux/kref.h>
+#include <linux/utrace.h>
+#include <linux/regset.h>
+#define UPROBES_IMPLEMENTATION 1
+#include <linux/uprobes.h>
+#include <linux/tracehook.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/errno.h>
+#include <linux/mman.h>
+
+#define UPROBE_SET_FLAGS 1
+#define UPROBE_CLEAR_FLAGS 0
+
+#define MAX_XOL_SLOTS 1024
+
+static int utask_fake_quiesce(struct uprobe_task *utask);
+static int uprobe_post_ssout(struct uprobe_task *utask,
+ struct uprobe_probept *ppt, struct pt_regs *regs);
+
+typedef void (*uprobe_handler_t)(struct uprobe*, struct pt_regs*);
+
+/*
+ * Table of currently probed processes, hashed by task-group leader's
+ * struct pid.
+ */
+static struct hlist_head uproc_table[UPROBE_TABLE_SIZE];
+
+/* Protects uproc_table during uprobe (un)registration */
+static DEFINE_MUTEX(uproc_mutex);
+
+/* Table of uprobe_tasks, hashed by task_struct pointer. */
+static struct hlist_head utask_table[UPROBE_TABLE_SIZE];
+static DEFINE_SPINLOCK(utask_table_lock);
+
+/* p_uprobe_utrace_ops = &uprobe_utrace_ops. Fwd refs are a pain w/o this. */
+static const struct utrace_engine_ops *p_uprobe_utrace_ops;
+
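+/*
+ * A register/unregister request that can't be carried out immediately
+ * (e.g., because it was issued from a uprobe handler). It is queued on
+ * a uprobe_task and run later by uprobe_run_def_regs().
+ */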
+struct deferred_registration {
+ struct list_head list;
+ struct uprobe *uprobe;
+ int regflag; /* 0 - unregister, 1 - register */
+};
+
+/*
+ * Calling a signal handler cancels single-stepping, so uprobes delays
+ * calling the handler, as necessary, until after single-stepping is completed.
+ */
+struct delayed_signal {
+ struct list_head list;
+ siginfo_t info;
+};
+
+static u16 ubp_strategies;
+
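+/*
+ * Return the uprobe_task for the given task, or NULL if the task has
+ * none. Takes and releases utask_table_lock.
+ */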
+static struct uprobe_task *uprobe_find_utask(struct task_struct *tsk)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct uprobe_task *utask;
+ unsigned long flags;
+
+ head = &utask_table[hash_ptr(tsk, UPROBE_HASH_BITS)];
+ spin_lock_irqsave(&utask_table_lock, flags);
+ hlist_for_each_entry(utask, node, head, hlist) {
+ if (utask->tsk == tsk) {
+ spin_unlock_irqrestore(&utask_table_lock, flags);
+ return utask;
+ }
+ }
+ spin_unlock_irqrestore(&utask_table_lock, flags);
+ return NULL;
+}
+
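+/* Add utask to the global utask_table, hashed by its task_struct. */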
+static void uprobe_hash_utask(struct uprobe_task *utask)
+{
+ struct hlist_head *head;
+ unsigned long flags;
+
+ INIT_HLIST_NODE(&utask->hlist);
+ head = &utask_table[hash_ptr(utask->tsk, UPROBE_HASH_BITS)];
+ spin_lock_irqsave(&utask_table_lock, flags);
+ hlist_add_head(&utask->hlist, head);
+ spin_unlock_irqrestore(&utask_table_lock, flags);
+}
+
+static void uprobe_unhash_utask(struct uprobe_task *utask)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&utask_table_lock, flags);
+ hlist_del(&utask->hlist);
+ spin_unlock_irqrestore(&utask_table_lock, flags);
+}
+
+static inline void uprobe_get_process(struct uprobe_process *uproc)
+{
+ atomic_inc(&uproc->refcount);
+}
+
+/*
+ * Decrement uproc's refcount in a situation where we "know" it can't
+ * reach zero. It's OK to call this with uproc locked. Compare with
+ * uprobe_put_process().
+ */
+static inline void uprobe_decref_process(struct uprobe_process *uproc)
+{
+ if (atomic_dec_and_test(&uproc->refcount))
+ BUG();
+}
+
+/*
+ * Runs with the uproc_mutex held. Returns with uproc ref-counted and
+ * write-locked.
+ *
+ * Around exec time, briefly, it's possible to have one (finished) uproc
+ * for the old image and one for the new image. We find the latter.
+ */
+static struct uprobe_process *uprobe_find_process(struct pid *tg_leader)
+{
+ struct uprobe_process *uproc;
+ struct hlist_head *head;
+ struct hlist_node *node;
+
+ head = &uproc_table[hash_ptr(tg_leader, UPROBE_HASH_BITS)];
+ hlist_for_each_entry(uproc, node, head, hlist) {
+ if (uproc->tg_leader == tg_leader && !uproc->finished) {
+ uprobe_get_process(uproc);
+ down_write(&uproc->rwsem);
+ return uproc;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * In the given uproc's hash table of probepoints, find the one with the
+ * specified virtual address. Runs with uproc->rwsem locked.
+ */
+static struct uprobe_probept *uprobe_find_probept(struct uprobe_process *uproc,
+ unsigned long vaddr)
+{
+ struct uprobe_probept *ppt;
+ struct hlist_node *node;
+ struct hlist_head *head = &uproc->uprobe_table[hash_long(vaddr,
+ UPROBE_HASH_BITS)];
+
+ hlist_for_each_entry(ppt, node, head, ut_node) {
+ if (ppt->ubp.vaddr == vaddr && ppt->state != UPROBE_DISABLED)
+ return ppt;
+ }
+ return NULL;
+}
+
+/*
+ * Save a copy of the original instruction (so it can be single-stepped
+ * out of line), insert the breakpoint instruction, and awake
+ * register_uprobe().
+ */
+static void uprobe_insert_bkpt(struct uprobe_probept *ppt,
+ struct task_struct *tsk)
+{
+ struct uprobe_kimg *uk;
+ int result;
+
+ if (tsk)
+ result = ubp_insert_bkpt(tsk, &ppt->ubp);
+ else
+ /* No surviving tasks associated with ppt->uproc */
+ result = -ESRCH;
+ ppt->state = (result ? UPROBE_DISABLED : UPROBE_BP_SET);
+ list_for_each_entry(uk, &ppt->uprobe_list, list)
+ uk->status = result;
+ wake_up_all(&ppt->waitq);
+}
+
+/*
+ * Check whether the task has just hit a breakpoint instruction at the
+ * indicated address. If it has, reset the task's instruction pointer
+ * to that address.
+ *
+ * tsk must be either the current thread or an already-quiesced thread.
+ */
+static inline void reset_thread_ip(struct task_struct *tsk,
+ struct pt_regs *regs, unsigned long addr)
+{
+ if ((ubp_get_bkpt_addr(regs) == addr) &&
+ !test_tsk_thread_flag(tsk, TIF_SINGLESTEP))
+ ubp_set_ip(regs, addr);
+}
+
+/*
+ * ppt's breakpoint has been removed. If any threads are in the middle of
+ * single-stepping at this probepoint, fix things up so they can proceed.
+ * If any threads have just hit the breakpoint but have yet to start
+ * pre-processing, reset their instruction pointers.
+ *
+ * Runs with all of ppt->uproc's threads quiesced and ppt->uproc->rwsem
+ * write-locked
+ */
+static inline void adjust_trapped_thread_ip(struct uprobe_probept *ppt)
+{
+ struct uprobe_process *uproc = ppt->uproc;
+ struct uprobe_task *utask;
+ struct pt_regs *regs;
+
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ regs = task_pt_regs(utask->tsk);
+ if (utask->active_probe != ppt) {
+ reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
+ continue;
+ }
+
+ /*
+ * Current thread cannot have an active breakpoint
+ * and still request for a breakpoint removal. The
+ * above case is handled by utask_fake_quiesce().
+ */
+ BUG_ON(utask->tsk == current);
+
+#ifdef CONFIG_UBP_XOL
+ if (instruction_pointer(regs) == ppt->ubp.xol_vaddr)
+ /* adjust the ip to breakpoint addr. */
+ ubp_set_ip(regs, ppt->ubp.vaddr);
+ else
+ /* adjust the ip to next instruction. */
+ uprobe_post_ssout(utask, ppt, regs);
+#endif
+ }
+}
+
+static void uprobe_remove_bkpt(struct uprobe_probept *ppt,
+ struct task_struct *tsk)
+{
+ if (tsk) {
+ if (ubp_remove_bkpt(tsk, &ppt->ubp) != 0) {
+ printk(KERN_ERR
+ "Error removing uprobe at pid %d vaddr %#lx:"
+ " can't restore original instruction\n",
+ tsk->tgid, ppt->ubp.vaddr);
+ /*
+ * This shouldn't happen, since we were previously
+ * able to write the breakpoint at that address.
+ * There's not much we can do besides let the
+ * process die with a SIGTRAP the next time the
+ * breakpoint is hit.
+ */
+ }
+ adjust_trapped_thread_ip(ppt);
+ if (ppt->ubp.strategy & UBP_HNT_INLINE) {
+ unsigned long flags;
+ spin_lock_irqsave(&ppt->ssil_lock, flags);
+ ppt->ssil_state = SSIL_DISABLE;
+ wake_up_all(&ppt->ssilq);
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+ }
+ }
+ /* Wake up unregister_uprobe(). */
+ ppt->state = UPROBE_DISABLED;
+ wake_up_all(&ppt->waitq);
+}
+
+/*
+ * Runs with all of uproc's threads quiesced and uproc->rwsem write-locked.
+ * As specified, insert or remove the breakpoint instruction for each
+ * uprobe_probept on uproc's pending list.
+ * tsk = one of the tasks associated with uproc -- NULL if there are
+ * no surviving threads.
+ * It's OK for uproc->pending_uprobes to be empty here. It can happen
+ * if a register and an unregister are requested (by different probers)
+ * simultaneously for the same pid/vaddr.
+ */
+static void handle_pending_uprobes(struct uprobe_process *uproc,
+ struct task_struct *tsk)
+{
+ struct uprobe_probept *ppt, *tmp;
+
+ list_for_each_entry_safe(ppt, tmp, &uproc->pending_uprobes, pd_node) {
+ switch (ppt->state) {
+ case UPROBE_INSERTING:
+ uprobe_insert_bkpt(ppt, tsk);
+ break;
+ case UPROBE_REMOVING:
+ uprobe_remove_bkpt(ppt, tsk);
+ break;
+ default:
+ BUG();
+ }
+ list_del(&ppt->pd_node);
+ }
+}
+
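+/* Set or clear the given utrace event flags on utask's engine. */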
+static void utask_adjust_flags(struct uprobe_task *utask, int set,
+ unsigned long flags)
+{
+ unsigned long newflags, oldflags;
+
+ oldflags = utask->engine->flags;
+ newflags = oldflags;
+ if (set)
+ newflags |= flags;
+ else
+ newflags &= ~flags;
+ /*
+ * utrace_barrier[_pid] is not appropriate here. If we're
+ * adjusting current, it's not needed. And if we're adjusting
+ * some other task, we're holding utask->uproc->rwsem, which
+ * could prevent that task from completing the callback we'd
+ * be waiting on.
+ */
+ if (newflags != oldflags) {
+ if (utrace_set_events_pid(utask->pid, utask->engine,
+ newflags) != 0)
+ /* We don't care. */
+ ;
+ }
+}
+
+static inline void clear_utrace_quiesce(struct uprobe_task *utask, bool resume)
+{
+ utask_adjust_flags(utask, UPROBE_CLEAR_FLAGS, UTRACE_EVENT(QUIESCE));
+ if (resume) {
+ if (utrace_control_pid(utask->pid, utask->engine,
+ UTRACE_RESUME) != 0)
+ /* We don't care. */
+ ;
+ }
+}
+
+/* Opposite of quiesce_all_threads(). Same locking applies. */
+static void rouse_all_threads(struct uprobe_process *uproc)
+{
+ struct uprobe_task *utask;
+
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (utask->quiescing) {
+ utask->quiescing = false;
+ if (utask->state == UPTASK_QUIESCENT) {
+ utask->state = UPTASK_RUNNING;
+ uproc->n_quiescent_threads--;
+ clear_utrace_quiesce(utask, true);
+ }
+ }
+ }
+ /* Wake any threads that decided to sleep rather than quiesce. */
+ wake_up_all(&uproc->waitq);
+}
+
+/*
+ * If all of uproc's surviving threads have quiesced, do the necessary
+ * breakpoint insertions or removals, un-quiesce everybody, and return 1.
+ * tsk is a surviving thread, or NULL if there is none. Runs with
+ * uproc->rwsem write-locked.
+ */
+static int check_uproc_quiesced(struct uprobe_process *uproc,
+ struct task_struct *tsk)
+{
+ if (uproc->n_quiescent_threads >= uproc->nthreads) {
+ handle_pending_uprobes(uproc, tsk);
+ rouse_all_threads(uproc);
+ return 1;
+ }
+ return 0;
+}
+
+/* Direct the indicated thread to quiesce. */
+static void uprobe_stop_thread(struct uprobe_task *utask)
+{
+ int result;
+
+ /*
+ * As with utask_adjust_flags, calling utrace_barrier_pid below
+ * could deadlock.
+ */
+ BUG_ON(utask->tsk == current);
+ result = utrace_control_pid(utask->pid, utask->engine, UTRACE_STOP);
+ if (result == 0) {
+ /* Already stopped. */
+ utask->state = UPTASK_QUIESCENT;
+ utask->uproc->n_quiescent_threads++;
+ } else if (result == -EINPROGRESS) {
+ if (utask->tsk->state & TASK_INTERRUPTIBLE) {
+ /*
+ * Task could be in interruptible wait for a long
+ * time -- e.g., if stopped for I/O. But we know
+ * it's not going to run user code before all
+ * threads quiesce, so pretend it's quiesced.
+ * This avoids terminating a system call via
+ * UTRACE_INTERRUPT.
+ */
+ utask->state = UPTASK_QUIESCENT;
+ utask->uproc->n_quiescent_threads++;
+ } else {
+ /*
+ * Task will eventually stop, but it may be a long time.
+ * Don't wait.
+ */
+ result = utrace_control_pid(utask->pid, utask->engine,
+ UTRACE_INTERRUPT);
+ if (result != 0)
+ /* We don't care. */
+ ;
+ }
+ }
+}
+
+/*
+ * Quiesce all threads in the specified process -- e.g., prior to
+ * breakpoint insertion. Runs with uproc->rwsem write-locked.
+ * Returns false if all threads have died.
+ */
+static bool quiesce_all_threads(struct uprobe_process *uproc,
+ struct uprobe_task **cur_utask_quiescing)
+{
+ struct uprobe_task *utask;
+ struct task_struct *survivor = NULL; /* any survivor */
+ bool survivors = false;
+
+ *cur_utask_quiescing = NULL;
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (!survivors) {
+ survivor = pid_task(utask->pid, PIDTYPE_PID);
+ if (survivor)
+ survivors = true;
+ }
+ if (!utask->quiescing) {
+ /*
+ * If utask is currently handling a probepoint, it'll
+ * check utask->quiescing and quiesce when it's done.
+ */
+ utask->quiescing = true;
+ if (utask->tsk == current)
+ *cur_utask_quiescing = utask;
+ else if (utask->state == UPTASK_RUNNING) {
+ utask_adjust_flags(utask, UPROBE_SET_FLAGS,
+ UTRACE_EVENT(QUIESCE));
+ uprobe_stop_thread(utask);
+ }
+ }
+ }
+ /*
+ * If all the (other) threads are already quiesced, it's up to the
+ * current thread to do the necessary work.
+ */
+ check_uproc_quiesced(uproc, survivor);
+ return survivors;
+}
+
+/* Called with utask->uproc write-locked. */
+static void uprobe_free_task(struct uprobe_task *utask, bool in_callback)
+{
+ struct deferred_registration *dr, *d;
+ struct delayed_signal *ds, *ds2;
+ int result;
+
+ if (utask->engine && (utask->tsk != current || !in_callback)) {
+ /*
+ * No other tasks in this process should be running
+ * uprobe_report_* callbacks. (If they are, utrace_barrier()
+ * here could deadlock.)
+ */
+ result = utrace_control_pid(utask->pid, utask->engine,
+ UTRACE_DETACH);
+ BUG_ON(result == -EINPROGRESS);
+ }
+ put_pid(utask->pid); /* null pid OK */
+
+ uprobe_unhash_utask(utask);
+ list_del(&utask->list);
+ list_for_each_entry_safe(dr, d, &utask->deferred_registrations, list) {
+ list_del(&dr->list);
+ kfree(dr);
+ }
+
+ list_for_each_entry_safe(ds, ds2, &utask->delayed_signals, list) {
+ list_del(&ds->list);
+ kfree(ds);
+ }
+
+ kfree(utask);
+}
+
+/*
+ * Dismantle uproc and all its remaining uprobe_tasks.
+ * in_callback = 1 if the caller is a uprobe_report_* callback, which will
+ * handle the UTRACE_DETACH operation.
+ * Runs with uproc_mutex held; called with uproc->rwsem write-locked.
+ */
+static void uprobe_free_process(struct uprobe_process *uproc, int in_callback)
+{
+ struct uprobe_task *utask, *tmp;
+
+ if (!hlist_unhashed(&uproc->hlist))
+ hlist_del(&uproc->hlist);
+ list_for_each_entry_safe(utask, tmp, &uproc->thread_list, list)
+ uprobe_free_task(utask, in_callback);
+ put_pid(uproc->tg_leader);
+ if (uproc->xol_area)
+ xol_put_area(uproc->xol_area);
+ up_write(&uproc->rwsem); /* So kfree doesn't complain */
+ kfree(uproc);
+}
+
+/*
+ * Decrement uproc's ref count. If it's zero, free uproc and return
+ * 1. Else return 0. If uproc is locked, don't call this; use
+ * uprobe_decref_process().
+ */
+static int uprobe_put_process(struct uprobe_process *uproc, bool in_callback)
+{
+ int freed = 0;
+
+ if (atomic_dec_and_test(&uproc->refcount)) {
+ mutex_lock(&uproc_mutex);
+ down_write(&uproc->rwsem);
+ if (unlikely(atomic_read(&uproc->refcount) != 0)) {
+ /*
+			 * This works because uproc_mutex is held any
+ * time the ref count can go from 0 to 1 -- e.g.,
+ * register_uprobe() sneaks in with a new probe.
+ */
+ up_write(&uproc->rwsem);
+ } else {
+ uprobe_free_process(uproc, in_callback);
+ freed = 1;
+ }
+ mutex_unlock(&uproc_mutex);
+ }
+ return freed;
+}
+
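+/*
+ * Allocate and initialize a uprobe_kimg for u, and link it via u->kdata.
+ * uk->status stays -EBUSY until the breakpoint insertion completes.
+ */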
+static struct uprobe_kimg *uprobe_mk_kimg(struct uprobe *u)
+{
+	struct uprobe_kimg *uk = kzalloc(sizeof *uk, GFP_USER);
+
+ if (unlikely(!uk))
+ return ERR_PTR(-ENOMEM);
+ u->kdata = uk;
+ uk->uprobe = u;
+ uk->ppt = NULL;
+ INIT_LIST_HEAD(&uk->list);
+ uk->status = -EBUSY;
+ return uk;
+}
+
+/*
+ * Allocate a uprobe_task object for p and add it to uproc's list.
+ * Called with a reference held on p and uproc->rwsem write-locked,
+ * in one of
+ * the following cases:
+ * - before setting the first uprobe in p's process
+ * - we're in uprobe_report_clone() and p is the newly added thread
+ * Returns:
+ * - pointer to new uprobe_task on success
+ *	- NULL if the task dies before we can utrace_attach it
+ * - negative errno otherwise
+ */
+static struct uprobe_task *uprobe_add_task(struct pid *p,
+ struct uprobe_process *uproc)
+{
+ struct uprobe_task *utask;
+ struct utrace_engine *engine;
+ struct task_struct *t = pid_task(p, PIDTYPE_PID);
+
+ if (!t)
+ return NULL;
+ utask = kzalloc(sizeof *utask, GFP_USER);
+ if (unlikely(utask == NULL))
+ return ERR_PTR(-ENOMEM);
+
+ utask->pid = p;
+ utask->tsk = t;
+ utask->state = UPTASK_RUNNING;
+ utask->quiescing = false;
+ utask->uproc = uproc;
+ utask->active_probe = NULL;
+ utask->doomed = false;
+ INIT_LIST_HEAD(&utask->deferred_registrations);
+ INIT_LIST_HEAD(&utask->delayed_signals);
+ INIT_LIST_HEAD(&utask->list);
+ list_add_tail(&utask->list, &uproc->thread_list);
+ uprobe_hash_utask(utask);
+
+ engine = utrace_attach_pid(p, UTRACE_ATTACH_CREATE,
+ p_uprobe_utrace_ops, utask);
+ if (IS_ERR(engine)) {
+ long err = PTR_ERR(engine);
+		printk(KERN_ERR
+		       "uprobes: utrace_attach_task failed, returned %ld\n",
+		       err);
+ uprobe_free_task(utask, 0);
+ if (err == -ESRCH)
+ return NULL;
+ return ERR_PTR(err);
+ }
+ utask->engine = engine;
+ /*
+ * Always watch for traps, clones, execs and exits. Caller must
+ * set any other engine flags.
+ */
+ utask_adjust_flags(utask, UPROBE_SET_FLAGS,
+ UTRACE_EVENT(SIGNAL) | UTRACE_EVENT(SIGNAL_IGN) |
+ UTRACE_EVENT(SIGNAL_CORE) | UTRACE_EVENT(EXEC) |
+ UTRACE_EVENT(CLONE) | UTRACE_EVENT(EXIT));
+ /*
+ * Note that it's OK if t dies just after utrace_attach, because
+ * with the engine in place, the appropriate report_* callback
+ * should handle it after we release uproc->rwsem.
+ */
+ utrace_engine_put(engine);
+ return utask;
+}
+
+/*
+ * start_pid is the pid for a thread in the probed process. Find the
+ * next thread that doesn't have a corresponding uprobe_task yet. Return
+ * a ref-counted pid for that task, if any, else NULL.
+ */
+static struct pid *find_next_thread_to_add(struct uprobe_process *uproc,
+ struct pid *start_pid)
+{
+ struct task_struct *t, *start;
+ struct uprobe_task *utask;
+ struct pid *pid = NULL;
+
+ rcu_read_lock();
+ start = pid_task(start_pid, PIDTYPE_PID);
+ t = start;
+ if (t) {
+ do {
+ if (unlikely(t->flags & PF_EXITING))
+ goto dont_add;
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (utask->tsk == t)
+ /* Already added */
+ goto dont_add;
+ }
+ /* Found thread/task to add. */
+ pid = get_pid(task_pid(t));
+ break;
+dont_add:
+ t = next_thread(t);
+ } while (t != start);
+ }
+ rcu_read_unlock();
+ return pid;
+}
+
+/* Runs with uproc_mutex held; returns with uproc->rwsem write-locked. */
+static struct uprobe_process *uprobe_mk_process(struct pid *tg_leader)
+{
+ struct uprobe_process *uproc;
+ struct uprobe_task *utask;
+ struct pid *add_me;
+ int i;
+ long err;
+
+ uproc = kzalloc(sizeof *uproc, GFP_USER);
+ if (unlikely(uproc == NULL))
+ return ERR_PTR(-ENOMEM);
+
+ /* Initialize fields */
+ atomic_set(&uproc->refcount, 1);
+ init_rwsem(&uproc->rwsem);
+ down_write(&uproc->rwsem);
+ init_waitqueue_head(&uproc->waitq);
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++)
+ INIT_HLIST_HEAD(&uproc->uprobe_table[i]);
+ INIT_LIST_HEAD(&uproc->pending_uprobes);
+ INIT_LIST_HEAD(&uproc->thread_list);
+ uproc->nthreads = 0;
+ uproc->n_quiescent_threads = 0;
+ INIT_HLIST_NODE(&uproc->hlist);
+ uproc->tg_leader = get_pid(tg_leader);
+ uproc->tgid = pid_task(tg_leader, PIDTYPE_PID)->tgid;
+ uproc->finished = false;
+
+#ifdef CONFIG_UBP_XOL
+ if (!(ubp_strategies & UBP_HNT_INLINE))
+ uproc->sstep_out_of_line = true;
+ else
+#endif
+ uproc->sstep_out_of_line = false;
+
+ /*
+ * Create and populate one utask per thread in this process. We
+ * can't call uprobe_add_task() while holding RCU lock, so we:
+ * 1. rcu_read_lock()
+ * 2. Find the next thread, add_me, in this process that's not
+ * already on uproc's thread_list.
+ * 3. rcu_read_unlock()
+ * 4. uprobe_add_task(add_me, uproc)
+ * Repeat 1-4 'til we have utasks for all threads.
+ */
+ add_me = tg_leader;
+ while ((add_me = find_next_thread_to_add(uproc, add_me)) != NULL) {
+ utask = uprobe_add_task(add_me, uproc);
+ if (IS_ERR(utask)) {
+ err = PTR_ERR(utask);
+ goto fail;
+ }
+ if (utask)
+ uproc->nthreads++;
+ }
+
+ if (uproc->nthreads == 0) {
+ /* All threads -- even p -- are dead. */
+ err = -ESRCH;
+ goto fail;
+ }
+ return uproc;
+
+fail:
+ uprobe_free_process(uproc, 0);
+ return ERR_PTR(err);
+}
+
+/*
+ * Creates a uprobe_probept and connects it to uk and uproc. Runs with
+ * uproc->rwsem write-locked.
+ */
+static struct uprobe_probept *uprobe_add_probept(struct uprobe_kimg *uk,
+ struct uprobe_process *uproc)
+{
+ struct uprobe_probept *ppt;
+
+ ppt = kzalloc(sizeof *ppt, GFP_USER);
+ if (unlikely(ppt == NULL))
+ return ERR_PTR(-ENOMEM);
+ init_waitqueue_head(&ppt->waitq);
+ init_waitqueue_head(&ppt->ssilq);
+ spin_lock_init(&ppt->ssil_lock);
+ ppt->ssil_state = SSIL_CLEAR;
+
+ /* Connect to uk. */
+ INIT_LIST_HEAD(&ppt->uprobe_list);
+ list_add_tail(&uk->list, &ppt->uprobe_list);
+ uk->ppt = ppt;
+ uk->status = -EBUSY;
+ ppt->ubp.vaddr = uk->uprobe->vaddr;
+ ppt->ubp.xol_vaddr = 0;
+
+ /* Connect to uproc. */
+ if (!uproc->sstep_out_of_line)
+ ppt->ubp.strategy = UBP_HNT_INLINE;
+ else
+ ppt->ubp.strategy = ubp_strategies;
+ ppt->state = UPROBE_INSERTING;
+ ppt->uproc = uproc;
+ INIT_LIST_HEAD(&ppt->pd_node);
+ list_add_tail(&ppt->pd_node, &uproc->pending_uprobes);
+ INIT_HLIST_NODE(&ppt->ut_node);
+ hlist_add_head(&ppt->ut_node,
+ &uproc->uprobe_table[hash_long(ppt->ubp.vaddr,
+ UPROBE_HASH_BITS)]);
+ uprobe_get_process(uproc);
+ return ppt;
+}
+
+/*
+ * Runs with ppt->uproc write-locked. Frees ppt and decrements the ref
+ * count on ppt->uproc (but ref count shouldn't hit 0).
+ */
+static void uprobe_free_probept(struct uprobe_probept *ppt)
+{
+ struct uprobe_process *uproc = ppt->uproc;
+
+ xol_free_insn_slot(ppt->ubp.xol_vaddr, uproc->xol_area);
+ hlist_del(&ppt->ut_node);
+ kfree(ppt);
+ uprobe_decref_process(uproc);
+}
+
+static void uprobe_free_kimg(struct uprobe_kimg *uk)
+{
+ uk->uprobe->kdata = NULL;
+ kfree(uk);
+}
+
+/*
+ * Runs with uprobe_process write-locked.
+ * Note that we never free uk->uprobe, because the user owns that.
+ */
+static void purge_uprobe(struct uprobe_kimg *uk)
+{
+ struct uprobe_probept *ppt = uk->ppt;
+
+ list_del(&uk->list);
+ uprobe_free_kimg(uk);
+ if (list_empty(&ppt->uprobe_list))
+ uprobe_free_probept(ppt);
+}
+
+/*
+ * Runs with utask->uproc locked: read-locked if called from a uprobe
+ * handler, write-locked otherwise.
+ * Returns -EINPROGRESS on success (the request was queued).
+ * Returns -EBUSY if an identical deferral request is already queued.
+ * Returns 0 if this request cancels a queued opposite request -- the
+ * net effect of a successful register plus unregister.
+ */
+static int defer_registration(struct uprobe *u, int regflag,
+ struct uprobe_task *utask)
+{
+ struct deferred_registration *dr, *d;
+
+ /* Check if we already have such a defer request */
+ list_for_each_entry_safe(dr, d, &utask->deferred_registrations, list) {
+ if (dr->uprobe == u) {
+ if (dr->regflag != regflag) {
+ /* same as successful register + unregister */
+ list_del(&dr->list);
+ kfree(dr);
+ return 0;
+ } else
+ /* we already have identical request */
+ return -EBUSY;
+ }
+ }
+
+ /* We have a new unique request */
+ dr = kmalloc(sizeof(struct deferred_registration), GFP_USER);
+ if (!dr)
+ return -ENOMEM;
+ dr->uprobe = u;
+ dr->regflag = regflag;
+ INIT_LIST_HEAD(&dr->list);
+ list_add_tail(&dr->list, &utask->deferred_registrations);
+ return -EINPROGRESS;
+}
+
+/*
+ * Given a numeric thread ID, return a ref-counted struct pid for the
+ * task-group-leader thread.
+ */
+static struct pid *uprobe_get_tg_leader(pid_t p)
+{
+ struct pid *pid = NULL;
+
+ rcu_read_lock();
+ if (current->nsproxy)
+ pid = find_vpid(p);
+ if (pid) {
+ struct task_struct *t = pid_task(pid, PIDTYPE_PID);
+ if (t)
+ pid = task_tgid(t);
+ else
+ pid = NULL;
+ }
+ rcu_read_unlock();
+ return get_pid(pid); /* null pid OK here */
+}
+
+/* See Documentation/uprobes.txt. */
+int register_uprobe(struct uprobe *u)
+{
+ struct uprobe_task *cur_utask, *cur_utask_quiescing = NULL;
+ struct uprobe_process *uproc;
+ struct uprobe_probept *ppt;
+ struct uprobe_kimg *uk;
+ struct pid *p;
+ int ret = 0, uproc_is_new = 0;
+ bool survivors;
+#ifndef CONFIG_UBP_XOL
+ struct task_struct *tsk;
+#endif
+
+ if (!u || !u->handler)
+ return -EINVAL;
+
+ p = uprobe_get_tg_leader(u->pid);
+ if (!p)
+ return -ESRCH;
+
+ cur_utask = uprobe_find_utask(current);
+ if (cur_utask && cur_utask->active_probe) {
+ /*
+ * Called from handler; cur_utask->uproc is read-locked.
+ * Do this registration later.
+ */
+ put_pid(p);
+ return defer_registration(u, 1, cur_utask);
+ }
+
+ /* Get the uprobe_process for this pid, or make a new one. */
+ mutex_lock(&uproc_mutex);
+ uproc = uprobe_find_process(p);
+
+ if (uproc) {
+ struct uprobe_task *utask;
+
+ mutex_unlock(&uproc_mutex);
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (!utask->active_probe)
+ continue;
+ /*
+ * utask is at a probepoint, but has dropped
+ * uproc->rwsem to single-step. If utask is
+ * stopped, then it's probably because some
+ * other engine has asserted UTRACE_STOP;
+ * that engine may not allow UTRACE_RESUME
+ * until register_uprobe() returns. But, for
+ * reasons we won't go into here, utask wants
+ * to finish with utask->active_probe before
+ * allowing handle_pending_uprobes() to run
+ * (via utask_fake_quiesce()). So we defer this
+ * registration operation; it will be run after
+ * utask->active_probe is taken care of.
+ */
+ BUG_ON(utask->state != UPTASK_SSTEP);
+ if (task_is_stopped_or_traced(utask->tsk)) {
+ ret = defer_registration(u, 1, utask);
+ goto fail_uproc;
+ }
+ }
+ } else {
+ uproc = uprobe_mk_process(p);
+ if (IS_ERR(uproc)) {
+ ret = (int) PTR_ERR(uproc);
+ mutex_unlock(&uproc_mutex);
+ goto fail_tsk;
+ }
+ /* Hold uproc_mutex until we've added uproc to uproc_table. */
+ uproc_is_new = 1;
+ }
+
+#ifdef CONFIG_UBP_XOL
+ ret = xol_validate_vaddr(p, u->vaddr, uproc->xol_area);
+#else
+ tsk = pid_task(p, PIDTYPE_PID);
+ ret = ubp_validate_insn_addr(tsk, u->vaddr);
+#endif
+ if (ret < 0)
+ goto fail_uproc;
+
+ if (u->kdata) {
+ /*
+ * Probe is already/still registered. This is the only
+ * place we return -EBUSY to the user.
+ */
+ ret = -EBUSY;
+ goto fail_uproc;
+ }
+
+ uk = uprobe_mk_kimg(u);
+ if (IS_ERR(uk)) {
+ ret = (int) PTR_ERR(uk);
+ goto fail_uproc;
+ }
+
+ /* See if we already have a probepoint at the vaddr. */
+ ppt = (uproc_is_new ? NULL : uprobe_find_probept(uproc, u->vaddr));
+ if (ppt) {
+ /* Breakpoint is already in place, or soon will be. */
+ uk->ppt = ppt;
+ list_add_tail(&uk->list, &ppt->uprobe_list);
+ switch (ppt->state) {
+ case UPROBE_INSERTING:
+ uk->status = -EBUSY; /* in progress */
+ if (uproc->tg_leader == task_tgid(current)) {
+ cur_utask_quiescing = cur_utask;
+ BUG_ON(!cur_utask_quiescing);
+ }
+ break;
+ case UPROBE_REMOVING:
+ /* Wait! Don't remove that bkpt after all! */
+ ppt->state = UPROBE_BP_SET;
+ /* Remove from pending list. */
+ list_del(&ppt->pd_node);
+ /* Wake unregister_uprobe(). */
+ wake_up_all(&ppt->waitq);
+ /*FALLTHROUGH*/
+ case UPROBE_BP_SET:
+ uk->status = 0;
+ break;
+ default:
+ BUG();
+ }
+ up_write(&uproc->rwsem);
+ put_pid(p);
+ if (uk->status == 0) {
+ uprobe_decref_process(uproc);
+ return 0;
+ }
+ goto await_bkpt_insertion;
+ } else {
+ ppt = uprobe_add_probept(uk, uproc);
+ if (IS_ERR(ppt)) {
+ ret = (int) PTR_ERR(ppt);
+ goto fail_uk;
+ }
+ }
+
+ if (uproc_is_new) {
+ hlist_add_head(&uproc->hlist,
+ &uproc_table[hash_ptr(uproc->tg_leader,
+ UPROBE_HASH_BITS)]);
+ mutex_unlock(&uproc_mutex);
+ }
+ put_pid(p);
+ survivors = quiesce_all_threads(uproc, &cur_utask_quiescing);
+
+ if (!survivors) {
+ purge_uprobe(uk);
+ up_write(&uproc->rwsem);
+ uprobe_put_process(uproc, false);
+ return -ESRCH;
+ }
+ up_write(&uproc->rwsem);
+
+await_bkpt_insertion:
+ if (cur_utask_quiescing)
+ /* Current task is probing its own process. */
+ (void) utask_fake_quiesce(cur_utask_quiescing);
+ else
+ wait_event(ppt->waitq, ppt->state != UPROBE_INSERTING);
+ ret = uk->status;
+ if (ret != 0) {
+ down_write(&uproc->rwsem);
+ purge_uprobe(uk);
+ up_write(&uproc->rwsem);
+ }
+ uprobe_put_process(uproc, false);
+ return ret;
+
+fail_uk:
+ uprobe_free_kimg(uk);
+
+fail_uproc:
+ if (uproc_is_new) {
+ uprobe_free_process(uproc, 0);
+ mutex_unlock(&uproc_mutex);
+ } else {
+ up_write(&uproc->rwsem);
+ uprobe_put_process(uproc, false);
+ }
+
+fail_tsk:
+ put_pid(p);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(register_uprobe);
+
+/* See Documentation/uprobes.txt. */
+void unregister_uprobe(struct uprobe *u)
+{
+ struct pid *p;
+ struct uprobe_process *uproc;
+ struct uprobe_kimg *uk;
+ struct uprobe_probept *ppt;
+ struct uprobe_task *cur_utask, *cur_utask_quiescing = NULL;
+ struct uprobe_task *utask;
+
+ if (!u)
+ return;
+ p = uprobe_get_tg_leader(u->pid);
+ if (!p)
+ return;
+
+ cur_utask = uprobe_find_utask(current);
+ if (cur_utask && cur_utask->active_probe) {
+ /* Called from handler; uproc is read-locked; do this later */
+ put_pid(p);
+ (void) defer_registration(u, 0, cur_utask);
+ return;
+ }
+
+ /*
+ * Lock uproc before walking the graph, in case the process we're
+ * probing is exiting.
+ */
+ mutex_lock(&uproc_mutex);
+ uproc = uprobe_find_process(p);
+ mutex_unlock(&uproc_mutex);
+ put_pid(p);
+ if (!uproc)
+ return;
+
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (!utask->active_probe)
+ continue;
+
+ /* See comment in register_uprobe(). */
+ BUG_ON(utask->state != UPTASK_SSTEP);
+ if (task_is_stopped_or_traced(utask->tsk)) {
+ (void) defer_registration(u, 0, utask);
+ goto done;
+ }
+ }
+ uk = (struct uprobe_kimg *)u->kdata;
+ if (!uk)
+ /*
+ * This probe was never successfully registered, or
+ * has already been unregistered.
+ */
+ goto done;
+ if (uk->status == -EBUSY)
+ /* Looks like register or unregister is already in progress. */
+ goto done;
+ ppt = uk->ppt;
+
+ list_del(&uk->list);
+ uprobe_free_kimg(uk);
+
+ if (!list_empty(&ppt->uprobe_list))
+ goto done;
+
+ /*
+ * The last uprobe at ppt's probepoint is being unregistered.
+ * Queue the breakpoint for removal.
+ */
+ ppt->state = UPROBE_REMOVING;
+ list_add_tail(&ppt->pd_node, &uproc->pending_uprobes);
+
+ (void) quiesce_all_threads(uproc, &cur_utask_quiescing);
+ up_write(&uproc->rwsem);
+ if (cur_utask_quiescing)
+ /* Current task is probing its own process. */
+ (void) utask_fake_quiesce(cur_utask_quiescing);
+ else
+ wait_event(ppt->waitq, ppt->state != UPROBE_REMOVING);
+
+ if (likely(ppt->state == UPROBE_DISABLED)) {
+ down_write(&uproc->rwsem);
+		uprobe_free_probept(ppt);
+		up_write(&uproc->rwsem);
+	}
+	/* Else somebody else's register_uprobe() resurrected ppt. */
+ uprobe_put_process(uproc, false);
+ return;
+
+done:
+ up_write(&uproc->rwsem);
+ uprobe_put_process(uproc, false);
+}
+EXPORT_SYMBOL_GPL(unregister_uprobe);
+
+/* Find a surviving thread in uproc. Runs with uproc->rwsem locked. */
+static struct task_struct *find_surviving_thread(struct uprobe_process *uproc)
+{
+ struct uprobe_task *utask;
+
+ list_for_each_entry(utask, &uproc->thread_list, list) {
+ if (!(utask->tsk->flags & PF_EXITING))
+ return utask->tsk;
+ }
+ return NULL;
+}
+
+/*
+ * Run all the deferred_registrations previously queued by the current utask.
+ * Runs with no locks or mutexes held. The current utask's uprobe_process
+ * is ref-counted, so it won't disappear as the result of unregister_u*probe()
+ * called here.
+ */
+static void uprobe_run_def_regs(struct list_head *drlist)
+{
+ struct deferred_registration *dr, *d;
+
+ list_for_each_entry_safe(dr, d, drlist, list) {
+ int result = 0;
+ struct uprobe *u = dr->uprobe;
+
+ if (dr->regflag)
+ result = register_uprobe(u);
+ else
+ unregister_uprobe(u);
+ if (u && u->registration_callback)
+ u->registration_callback(u, dr->regflag, result);
+ list_del(&dr->list);
+ kfree(dr);
+ }
+}
+
+/*
+ * utrace engine report callbacks
+ */
+
+/*
+ * We've been asked to quiesce, but aren't in a position to do so.
+ * This could happen in either of the following cases:
+ *
+ * 1) Our own thread is doing a register or unregister operation --
+ * e.g., as called from a uprobe handler or a non-uprobes utrace
+ * callback. We can't wait_event() for ourselves in [un]register_uprobe().
+ *
+ * 2) We've been asked to quiesce, but we hit a probepoint first. Now
+ * we're in the report_signal callback, having handled the probepoint.
+ * We'd like to just turn on UTRACE_EVENT(QUIESCE) and coast into
+ * quiescence. Unfortunately, it's possible to hit a probepoint again
+ * before we quiesce. When processing the SIGTRAP, utrace would call
+ * uprobe_report_quiesce(), which must decline to take any action so
+ * as to avoid removing the uprobe just hit. As a result, we could
+ * keep hitting breakpoints and never quiescing.
+ *
+ * So here we do essentially what we'd prefer to do in uprobe_report_quiesce().
+ * If we're the last thread to quiesce, handle_pending_uprobes() and
+ * rouse_all_threads(). Otherwise, pretend we're quiescent and sleep until
+ * the last quiescent thread handles that stuff and then wakes us.
+ *
+ * Called and returns with no mutexes held. Returns 1 if we free utask->uproc,
+ * else 0.
+ */
+static int utask_fake_quiesce(struct uprobe_task *utask)
+{
+ struct uprobe_process *uproc = utask->uproc;
+ enum uprobe_task_state prev_state = utask->state;
+
+ down_write(&uproc->rwsem);
+
+ /* In case we're somehow set to quiesce for real... */
+ clear_utrace_quiesce(utask, false);
+
+ if (uproc->n_quiescent_threads == uproc->nthreads-1) {
+ /* We're the last thread to "quiesce." */
+ handle_pending_uprobes(uproc, utask->tsk);
+ rouse_all_threads(uproc);
+ up_write(&uproc->rwsem);
+ return 0;
+ } else {
+ utask->state = UPTASK_SLEEPING;
+ uproc->n_quiescent_threads++;
+ up_write(&uproc->rwsem);
+ /* We ref-count sleepers. */
+ uprobe_get_process(uproc);
+
+ wait_event(uproc->waitq, !utask->quiescing);
+
+ down_write(&uproc->rwsem);
+ utask->state = prev_state;
+ uproc->n_quiescent_threads--;
+ up_write(&uproc->rwsem);
+
+ /*
+ * If uproc's last uprobe has been unregistered, and
+ * unregister_uprobe() woke up before we did, it's up
+ * to us to free uproc.
+ */
+ return uprobe_put_process(uproc, false);
+ }
+}
+
+/* Prepare to single-step ppt's probed instruction inline. */
+static void uprobe_pre_ssin(struct uprobe_task *utask,
+ struct uprobe_probept *ppt, struct pt_regs *regs)
+{
+ unsigned long flags;
+
+ if (unlikely(ppt->ssil_state == SSIL_DISABLE)) {
+ reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
+ return;
+ }
+ spin_lock_irqsave(&ppt->ssil_lock, flags);
+ while (ppt->ssil_state == SSIL_SET) {
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+ up_read(&utask->uproc->rwsem);
+ wait_event(ppt->ssilq, ppt->ssil_state != SSIL_SET);
+ down_read(&utask->uproc->rwsem);
+ spin_lock_irqsave(&ppt->ssil_lock, flags);
+ }
+ if (unlikely(ppt->ssil_state == SSIL_DISABLE)) {
+ /*
+ * While waiting to single step inline, breakpoint has
+ * been removed. Thread continues as if nothing happened.
+ */
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+ reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
+ return;
+ }
+ ppt->ssil_state = SSIL_SET;
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+
+ if (unlikely(ubp_pre_sstep(utask->tsk, &ppt->ubp,
+ &utask->arch_info, regs) != 0)) {
+ printk(KERN_ERR "Failed to temporarily restore original "
+ "instruction for single-stepping: "
+ "pid/tgid=%d/%d, vaddr=%#lx\n",
+ utask->tsk->pid, utask->tsk->tgid, ppt->ubp.vaddr);
+ utask->doomed = true;
+ }
+}
+
+/* Prepare to continue execution after single-stepping inline. */
+static void uprobe_post_ssin(struct uprobe_task *utask,
+ struct uprobe_probept *ppt, struct pt_regs *regs)
+{
+ unsigned long flags;
+
+ if (unlikely(ubp_post_sstep(utask->tsk, &ppt->ubp,
+ &utask->arch_info, regs) != 0))
+		printk(KERN_ERR
+		       "Couldn't restore bp: pid/tgid=%d/%d, addr=%#lx\n",
+ utask->tsk->pid, utask->tsk->tgid, ppt->ubp.vaddr);
+ spin_lock_irqsave(&ppt->ssil_lock, flags);
+ if (likely(ppt->ssil_state == SSIL_SET)) {
+ ppt->ssil_state = SSIL_CLEAR;
+ wake_up(&ppt->ssilq);
+ }
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+}
+
+#ifdef CONFIG_UBP_XOL
+/*
+ * This architecture wants to do single-stepping out of line, but now we've
+ * discovered that it can't -- typically because we couldn't set up the XOL
+ * vma. Make all probepoints use inline single-stepping.
+ */
+static void uproc_cancel_xol(struct uprobe_process *uproc)
+{
+ down_write(&uproc->rwsem);
+ if (likely(uproc->sstep_out_of_line)) {
+ /* No other task beat us to it. */
+ int i;
+ struct uprobe_probept *ppt;
+ struct hlist_node *node;
+ struct hlist_head *head;
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
+ head = &uproc->uprobe_table[i];
+ hlist_for_each_entry(ppt, node, head, ut_node) {
+ if (!(ppt->ubp.strategy & UBP_HNT_INLINE))
+ ubp_cancel_xol(current, &ppt->ubp);
+ }
+ }
+ /* Do this last, so other tasks don't proceed too soon. */
+ uproc->sstep_out_of_line = false;
+ }
+ up_write(&uproc->rwsem);
+}
+
+/* Prepare to single-step ppt's probed instruction out of line. */
+static int uprobe_pre_ssout(struct uprobe_task *utask,
+ struct uprobe_probept *ppt, struct pt_regs *regs)
+{
+ if (!ppt->ubp.xol_vaddr)
+ ppt->ubp.xol_vaddr = xol_get_insn_slot(&ppt->ubp,
+ ppt->uproc->xol_area);
+ if (unlikely(!ppt->ubp.xol_vaddr)) {
+ ubp_cancel_xol(utask->tsk, &ppt->ubp);
+ return -1;
+ }
+ utask->singlestep_addr = ppt->ubp.xol_vaddr;
+ return ubp_pre_sstep(utask->tsk, &ppt->ubp, &utask->arch_info, regs);
+}
+
+/* Prepare to continue execution after single-stepping out of line. */
+static int uprobe_post_ssout(struct uprobe_task *utask,
+ struct uprobe_probept *ppt, struct pt_regs *regs)
+{
+ int ret;
+
+ ret = ubp_post_sstep(utask->tsk, &ppt->ubp, &utask->arch_info, regs);
+ return ret;
+}
+#endif
+
+/*
+ * If this thread is supposed to be quiescing, mark it quiescent; and
+ * if it was the last thread to quiesce, do the work we quiesced for.
+ * Runs with utask->uproc->rwsem write-locked. Returns true if we can
+ * let this thread resume.
+ */
+static bool utask_quiesce(struct uprobe_task *utask)
+{
+ if (utask->quiescing) {
+ if (utask->state != UPTASK_QUIESCENT) {
+ utask->state = UPTASK_QUIESCENT;
+ utask->uproc->n_quiescent_threads++;
+ }
+ return check_uproc_quiesced(utask->uproc, current);
+ } else {
+ clear_utrace_quiesce(utask, false);
+ return true;
+ }
+}
+
+/*
+ * Delay delivery of the indicated signal until after single-step.
+ * Otherwise single-stepping will be cancelled as part of calling
+ * the signal handler.
+ */
+static void uprobe_delay_signal(struct uprobe_task *utask, siginfo_t *info)
+{
+ struct delayed_signal *ds;
+
+ ds = kmalloc(sizeof(*ds), GFP_USER);
+ if (ds) {
+ ds->info = *info;
+ INIT_LIST_HEAD(&ds->list);
+ list_add_tail(&ds->list, &utask->delayed_signals);
+ }
+}
+
+static void uprobe_inject_delayed_signals(struct list_head *delayed_signals)
+{
+ struct delayed_signal *ds, *tmp;
+
+ list_for_each_entry_safe(ds, tmp, delayed_signals, list) {
+ send_sig_info(ds->info.si_signo, &ds->info, current);
+ list_del(&ds->list);
+ kfree(ds);
+ }
+}
+
+/*
+ * Use the instruction pointer to verify that the single-step actually
+ * occurred. If it did, do the post-single-step fix-ups.
+ */
+static bool validate_and_post_sstep(struct uprobe_task *utask,
+ struct pt_regs *regs,
+ struct uprobe_probept *ppt)
+{
+ unsigned long vaddr = instruction_pointer(regs);
+
+ if (ppt->ubp.strategy & UBP_HNT_INLINE) {
+ /*
+		 * If we have single-stepped, the instruction pointer
+		 * cannot still equal the probepoint's virtual address.
+ */
+ if (vaddr == ppt->ubp.vaddr)
+ return false;
+ uprobe_post_ssin(utask, ppt, regs);
+#ifdef CONFIG_UBP_XOL
+ } else {
+ /*
+		 * If we have executed out of line, the instruction
+		 * pointer cannot still equal the XOL slot's address.
+ */
+ if (vaddr == ppt->ubp.xol_vaddr)
+ return false;
+ uprobe_post_ssout(utask, ppt, regs);
+#endif
+ }
+ return true;
+}
+
+/*
+ * Helper routine for uprobe_report_signal().
+ * We get called here with:
+ * state = UPTASK_RUNNING => we are here due to a breakpoint hit
+ * - Read-lock the process
+ * - Figure out which probepoint, based on regs->IP
+ * - Set state = UPTASK_BP_HIT
+ * - Invoke handler for each uprobe at this probepoint
+ * - Reset regs->IP to beginning of the insn, if necessary
+ * - Start watching for quiesce events, in case another
+ * engine cancels our UTRACE_SINGLESTEP with a
+ * UTRACE_STOP.
+ * - Set singlestep in motion (UTRACE_SINGLESTEP),
+ * with state = UPTASK_SSTEP
+ * - Read-unlock the process
+ *
+ * state = UPTASK_SSTEP => here after single-stepping
+ * - Read-lock the process
+ * - Validate we are here per the state machine
+ * - Clean up after single-stepping
+ * - Set state = UPTASK_RUNNING
+ * - Read-unlock the process
+ * - If it's time to quiesce, take appropriate action.
+ * - If the handler(s) we ran called [un]register_uprobe(),
+ * complete those via uprobe_run_def_regs().
+ *
+ * state = ANY OTHER STATE
+ * - Not our signal, pass it on (UTRACE_RESUME)
+ */
+static u32 uprobe_handle_signal(u32 action,
+ struct uprobe_task *utask,
+ struct pt_regs *regs,
+ siginfo_t *info,
+ const struct k_sigaction *orig_ka)
+{
+ struct uprobe_probept *ppt;
+ struct uprobe_process *uproc;
+ struct uprobe_kimg *uk;
+ unsigned long probept;
+ enum utrace_resume_action resume_action;
+ enum utrace_signal_action signal_action = utrace_signal_action(action);
+
+ uproc = utask->uproc;
+
+ /*
+ * We may need to re-assert UTRACE_SINGLESTEP if this signal
+ * is not associated with the breakpoint.
+ */
+ if (utask->state == UPTASK_SSTEP)
+ resume_action = UTRACE_SINGLESTEP;
+ else
+ resume_action = UTRACE_RESUME;
+ /*
+	 * This might be a UTRACE_SIGNAL_REPORT request, but some other
+ * engine's callback might have changed the signal action to
+ * something other than UTRACE_SIGNAL_REPORT. Use orig_ka to figure
+ * out such cases.
+ */
+ if (unlikely(signal_action == UTRACE_SIGNAL_REPORT) || !orig_ka) {
+ /* This thread was quiesced using UTRACE_INTERRUPT. */
+ bool done_quiescing;
+ if (utask->active_probe)
+ /*
+ * We'll fake quiescence after we're done
+ * processing the probepoint.
+ */
+ return UTRACE_SIGNAL_IGN | resume_action;
+
+ down_write(&uproc->rwsem);
+ done_quiescing = utask_quiesce(utask);
+ up_write(&uproc->rwsem);
+ if (done_quiescing)
+ resume_action = UTRACE_RESUME;
+ else
+ resume_action = UTRACE_STOP;
+ return UTRACE_SIGNAL_IGN | resume_action;
+ }
+
+ /*
+ * info will be null if we're called with action=UTRACE_SIGNAL_HANDLER,
+ * which means that single-stepping has been disabled so a signal
+ * handler can be called in the probed process. That should never
+ * happen because we intercept and delay handled signals (action =
+ * UTRACE_RESUME) until after we're done single-stepping.
+ */
+ BUG_ON(!info);
+ if (signal_action == UTRACE_SIGNAL_DELIVER && utask->active_probe &&
+ info->si_signo != SSTEP_SIGNAL) {
+ uprobe_delay_signal(utask, info);
+ return UTRACE_SIGNAL_IGN | UTRACE_SINGLESTEP;
+ }
+
+ if (info->si_signo != BREAKPOINT_SIGNAL &&
+ info->si_signo != SSTEP_SIGNAL)
+ goto no_interest;
+
+ switch (utask->state) {
+ case UPTASK_RUNNING:
+ if (info->si_signo != BREAKPOINT_SIGNAL)
+ goto no_interest;
+
+#ifdef CONFIG_UBP_XOL
+ /*
+		 * Set up the XOL area if it's not already there. We do
+		 * it here because it must happen before the first
+		 * probepoint hit is handled, it must be done by the
+		 * probed process itself, and this may be the first
+		 * time the probed process runs uprobes code. We need
+		 * the XOL area for the uretprobe trampoline even if
+		 * this architecture doesn't single-step out of line.
+ */
+ if (uproc->sstep_out_of_line && !uproc->xol_area) {
+ uproc->xol_area = xol_get_area(uproc->tg_leader);
+ if (unlikely(uproc->sstep_out_of_line) &&
+ unlikely(!uproc->xol_area))
+ uproc_cancel_xol(uproc);
+ }
+#endif
+
+ down_read(&uproc->rwsem);
+ /* Don't quiesce while running handlers. */
+ clear_utrace_quiesce(utask, false);
+ probept = ubp_get_bkpt_addr(regs);
+ ppt = uprobe_find_probept(uproc, probept);
+ if (!ppt) {
+ up_read(&uproc->rwsem);
+ goto no_interest;
+ }
+ utask->active_probe = ppt;
+ utask->state = UPTASK_BP_HIT;
+
+ if (likely(ppt->state == UPROBE_BP_SET)) {
+ list_for_each_entry(uk, &ppt->uprobe_list, list) {
+ struct uprobe *u = uk->uprobe;
+ if (u->handler)
+ u->handler(u, regs);
+ }
+ }
+
+#ifdef CONFIG_UBP_XOL
+ if ((ppt->ubp.strategy & UBP_HNT_INLINE) ||
+ uprobe_pre_ssout(utask, ppt, regs) != 0)
+#endif
+ uprobe_pre_ssin(utask, ppt, regs);
+ if (unlikely(utask->doomed)) {
+ utask->active_probe = NULL;
+ utask->state = UPTASK_RUNNING;
+ up_read(&uproc->rwsem);
+ goto no_interest;
+ }
+ utask->state = UPTASK_SSTEP;
+ /* In case another engine cancels our UTRACE_SINGLESTEP... */
+ utask_adjust_flags(utask, UPROBE_SET_FLAGS,
+ UTRACE_EVENT(QUIESCE));
+ /* Don't deliver this signal to the process. */
+ resume_action = UTRACE_SINGLESTEP;
+ signal_action = UTRACE_SIGNAL_IGN;
+
+ up_read(&uproc->rwsem);
+ break;
+
+ case UPTASK_SSTEP:
+ if (info->si_signo != SSTEP_SIGNAL)
+ goto no_interest;
+
+ down_read(&uproc->rwsem);
+ ppt = utask->active_probe;
+ BUG_ON(!ppt);
+
+ /*
+		 * Haven't single-stepped yet? Then re-assert
+ * UTRACE_SINGLESTEP.
+ */
+ if (!validate_and_post_sstep(utask, regs, ppt)) {
+ up_read(&uproc->rwsem);
+ goto no_interest;
+ }
+
+ /* No further need to re-assert UTRACE_SINGLESTEP. */
+ clear_utrace_quiesce(utask, false);
+
+ utask->active_probe = NULL;
+ utask->state = UPTASK_RUNNING;
+ if (unlikely(utask->doomed)) {
+ up_read(&uproc->rwsem);
+ goto no_interest;
+ }
+
+ if (utask->quiescing) {
+ int uproc_freed;
+ up_read(&uproc->rwsem);
+ uproc_freed = utask_fake_quiesce(utask);
+ BUG_ON(uproc_freed);
+ } else
+ up_read(&uproc->rwsem);
+
+ /*
+ * We hold a ref count on uproc, so this should never
+ * make utask or uproc disappear.
+ */
+ uprobe_run_def_regs(&utask->deferred_registrations);
+
+ uprobe_inject_delayed_signals(&utask->delayed_signals);
+
+ resume_action = UTRACE_RESUME;
+ signal_action = UTRACE_SIGNAL_IGN;
+ break;
+ default:
+ goto no_interest;
+ }
+
+no_interest:
+ return signal_action | resume_action;
+}
+
+/*
+ * Signal callback:
+ */
+static u32 uprobe_report_signal(u32 action,
+ struct utrace_engine *engine,
+ struct pt_regs *regs,
+ siginfo_t *info,
+ const struct k_sigaction *orig_ka,
+ struct k_sigaction *return_ka)
+{
+ struct uprobe_task *utask;
+ struct uprobe_process *uproc;
+ bool doomed;
+ enum utrace_resume_action report_action;
+
+ utask = (struct uprobe_task *)rcu_dereference(engine->data);
+ BUG_ON(!utask);
+ uproc = utask->uproc;
+
+ /* Keep uproc intact until just before we return. */
+ uprobe_get_process(uproc);
+ report_action = uprobe_handle_signal(action, utask, regs, info,
+ orig_ka);
+ doomed = utask->doomed;
+
+ if (uprobe_put_process(uproc, true))
+ report_action = utrace_signal_action(report_action) |
+ UTRACE_DETACH;
+ if (doomed)
+ do_exit(SIGSEGV);
+ return report_action;
+}
+
+/*
+ * Quiesce callback: The associated process has one or more breakpoint
+ * insertions or removals pending. If we're the last thread in this
+ * process to quiesce, do the insertion(s) and/or removal(s).
+ */
+static u32 uprobe_report_quiesce(u32 action,
+ struct utrace_engine *engine,
+ unsigned long event)
+{
+ struct uprobe_task *utask;
+ struct uprobe_process *uproc;
+ bool done_quiescing = false;
+
+ utask = (struct uprobe_task *)rcu_dereference(engine->data);
+ BUG_ON(!utask);
+
+ if (utask->state == UPTASK_SSTEP)
+ /*
+ * We got a breakpoint trap and tried to single-step,
+ * but somebody else's report_signal callback overrode
+ * our UTRACE_SINGLESTEP with a UTRACE_STOP. Try again.
+ */
+ return UTRACE_SINGLESTEP;
+
+ BUG_ON(utask->active_probe);
+ uproc = utask->uproc;
+ down_write(&uproc->rwsem);
+ done_quiescing = utask_quiesce(utask);
+ up_write(&uproc->rwsem);
+ return done_quiescing ? UTRACE_RESUME : UTRACE_STOP;
+}
+
+/*
+ * uproc's process is exiting or exec-ing. Runs with uproc->rwsem
+ * write-locked. Caller must ref-count uproc before calling this
+ * function, to ensure that uproc doesn't get freed in the middle of
+ * this.
+ */
+static void uprobe_cleanup_process(struct uprobe_process *uproc)
+{
+ struct hlist_node *pnode1, *pnode2;
+ struct uprobe_kimg *uk, *unode;
+ struct uprobe_probept *ppt;
+ struct hlist_head *head;
+ int i;
+
+ uproc->finished = true;
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
+ head = &uproc->uprobe_table[i];
+ hlist_for_each_entry_safe(ppt, pnode1, pnode2, head, ut_node) {
+ if (ppt->state == UPROBE_INSERTING ||
+ ppt->state == UPROBE_REMOVING) {
+ /*
+ * This task is (exec/exit)ing with
+ * a [un]register_uprobe pending.
+ * [un]register_uprobe will free ppt.
+ */
+ ppt->state = UPROBE_DISABLED;
+ list_del(&ppt->pd_node);
+ list_for_each_entry_safe(uk, unode,
+ &ppt->uprobe_list, list)
+ uk->status = -ESRCH;
+ wake_up_all(&ppt->waitq);
+ } else if (ppt->state == UPROBE_BP_SET) {
+ list_for_each_entry_safe(uk, unode,
+ &ppt->uprobe_list, list) {
+ list_del(&uk->list);
+ uprobe_free_kimg(uk);
+ }
+				uprobe_free_probept(ppt);
+			}
+			/*
+			 * Else, if ppt is UPROBE_DISABLED, assume that
+			 * [un]register_uprobe() has been notified
+			 * and will free it soon.
+			 */
+ }
+ }
+}
+
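+/*
+ * Common teardown for exec and exit: detach this thread's utrace engine
+ * and free its utask; if this was the last thread, dismantle the whole
+ * uproc via uprobe_cleanup_process().
+ */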
+static u32 uprobe_exec_exit(struct utrace_engine *engine,
+ struct task_struct *tsk, int exit)
+{
+ struct uprobe_process *uproc;
+ struct uprobe_probept *ppt;
+ struct uprobe_task *utask;
+ bool utask_quiescing;
+
+ utask = (struct uprobe_task *)rcu_dereference(engine->data);
+ uproc = utask->uproc;
+ uprobe_get_process(uproc);
+
+ ppt = utask->active_probe;
+ if (ppt) {
+ printk(KERN_WARNING "Task handler called %s while at uprobe"
+ " probepoint: pid/tgid = %d/%d, probepoint"
+ " = %#lx\n", (exit ? "exit" : "exec"),
+ tsk->pid, tsk->tgid, ppt->ubp.vaddr);
+ /*
+ * Mutex cleanup depends on where do_execve()/do_exit() was
+ * called and on ubp strategy (XOL vs. SSIL).
+ */
+ if (ppt->ubp.strategy & UBP_HNT_INLINE) {
+ switch (utask->state) {
+ unsigned long flags;
+ case UPTASK_SSTEP:
+ spin_lock_irqsave(&ppt->ssil_lock, flags);
+ ppt->ssil_state = SSIL_CLEAR;
+ wake_up(&ppt->ssilq);
+ spin_unlock_irqrestore(&ppt->ssil_lock, flags);
+ break;
+ default:
+ break;
+ }
+ }
+ if (utask->state == UPTASK_BP_HIT) {
+ /* uprobe handler called do_exit()/do_execve(). */
+ up_read(&uproc->rwsem);
+ uprobe_decref_process(uproc);
+ }
+ }
+
+ down_write(&uproc->rwsem);
+ utask_quiescing = utask->quiescing;
+ uproc->nthreads--;
+ if (utrace_set_events_pid(utask->pid, engine, 0))
+ /* We don't care. */
+ ;
+ uprobe_free_task(utask, 1);
+ if (uproc->nthreads) {
+ /*
+ * In case other threads are waiting for us to quiesce...
+ */
+ if (utask_quiescing)
+ (void) check_uproc_quiesced(uproc,
+ find_surviving_thread(uproc));
+ } else
+ /*
+ * We were the last remaining thread - clean up the uprobe
+ * remnants a la unregister_uprobe(). We don't have to
+ * remove the breakpoints, though.
+ */
+ uprobe_cleanup_process(uproc);
+
+ up_write(&uproc->rwsem);
+ uprobe_put_process(uproc, true);
+ return UTRACE_DETACH;
+}
+
+/*
+ * Exit callback: The associated task/thread is exiting.
+ */
+static u32 uprobe_report_exit(u32 action,
+ struct utrace_engine *engine,
+ long orig_code, long *code)
+{
+ return uprobe_exec_exit(engine, current, 1);
+}
+
+/*
+ * Clone callback: The current task has spawned a thread/process.
+ * Utrace guarantees that parent and child pointers will be valid
+ * for the duration of this callback.
+ *
+ * NOTE: For now, we don't pass on uprobes from the parent to the
+ * child; instead, we clear the inherited breakpoints from the
+ * child's address space.
+ *
+ * TODO:
+ * - Provide option for child to inherit uprobes.
+ */
+static u32 uprobe_report_clone(u32 action,
+ struct utrace_engine *engine,
+ unsigned long clone_flags,
+ struct task_struct *child)
+{
+ struct uprobe_process *uproc;
+ struct uprobe_task *ptask, *ctask;
+
+ ptask = (struct uprobe_task *)rcu_dereference(engine->data);
+ uproc = ptask->uproc;
+
+ /*
+ * Lock uproc so no new uprobes can be installed 'til all
+ * report_clone activities are completed.
+ */
+ mutex_lock(&uproc_mutex);
+ down_write(&uproc->rwsem);
+
+ if (clone_flags & CLONE_THREAD) {
+ /* New thread in the same process. */
+ ctask = uprobe_find_utask(child);
+ if (unlikely(ctask)) {
+ /*
+ * uprobe_mk_process() ran just as this clone
+ * happened, and has already accounted for the
+ * new child.
+ */
+ } else {
+ struct pid *child_pid = get_pid(task_pid(child));
+ BUG_ON(!child_pid);
+ ctask = uprobe_add_task(child_pid, uproc);
+ BUG_ON(!ctask);
+ if (IS_ERR(ctask))
+ goto done;
+ uproc->nthreads++;
+ /*
+ * FIXME: Handle the case where uproc is quiescing
+ * (assuming it's possible to clone while quiescing).
+ */
+ }
+ } else {
+ /*
+ * New process spawned by parent. Remove the probepoints
+ * in the child's text.
+ *
+	 * It's not necessary to quiesce the child, as we are assured
+ * by utrace that this callback happens *before* the child
+ * gets to run userspace.
+ *
+ * We also hold the uproc->rwsem for the parent - so no
+ * new uprobes will be registered 'til we return.
+ */
+ int i;
+ struct uprobe_probept *ppt;
+ struct hlist_node *node;
+ struct hlist_head *head;
+
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
+ head = &uproc->uprobe_table[i];
+ hlist_for_each_entry(ppt, node, head, ut_node) {
+ if (ubp_remove_bkpt(child, &ppt->ubp) != 0) {
+ /* Ratelimit this? */
+ printk(KERN_ERR "Pid %d forked %d;"
+ " failed to remove probepoint"
+ " at %#lx in child\n",
+ current->pid, child->pid,
+ ppt->ubp.vaddr);
+ }
+ }
+ }
+ }
+
+done:
+ up_write(&uproc->rwsem);
+ mutex_unlock(&uproc_mutex);
+ return UTRACE_RESUME;
+}
+
+/*
+ * Exec callback: The associated process called execve() or friends
+ *
+ * The new program is about to start running, so no uprobe from the
+ * previous user address space can be hit.
+ *
+ * NOTE:
+ * Typically, this process would have passed through the clone
+ * callback, where the necessary action *should* have been
+ * taken. However, if we still end up at this callback:
+ * - We don't have to clear the uprobes - memory image
+ * will be overlaid.
+ * - We have to free up uprobe resources associated with
+ * this process.
+ */
+static u32 uprobe_report_exec(u32 action,
+ struct utrace_engine *engine,
+ const struct linux_binfmt *fmt,
+ const struct linux_binprm *bprm,
+ struct pt_regs *regs)
+{
+ return uprobe_exec_exit(engine, current, 0);
+}
+
+static const struct utrace_engine_ops uprobe_utrace_ops = {
+ .report_quiesce = uprobe_report_quiesce,
+ .report_signal = uprobe_report_signal,
+ .report_exit = uprobe_report_exit,
+ .report_clone = uprobe_report_clone,
+ .report_exec = uprobe_report_exec
+};
+
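+/*
+ * Module init: choose breakpoint-handling strategies via ubp_init(),
+ * and initialize the global uproc and utask hash tables.
+ */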
+static int __init init_uprobes(void)
+{
+ int ret, i;
+
+ ubp_strategies = UBP_HNT_TSKINFO;
+ ret = ubp_init(&ubp_strategies);
+ if (ret != 0) {
+ printk(KERN_ERR "Can't start uprobes: ubp_init() returned %d\n",
+ ret);
+ return ret;
+ }
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
+ INIT_HLIST_HEAD(&uproc_table[i]);
+ INIT_HLIST_HEAD(&utask_table[i]);
+ }
+
+ p_uprobe_utrace_ops = &uprobe_utrace_ops;
+ return 0;
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);

2010-01-11 12:26:40

by Srikar Dronamraju

[permalink] [raw]
Subject: [RFC] [PATCH 6/7] Uprobes Documentation

Uprobes documentation

Signed-off-by: Jim Keniston <[email protected]>
---
Documentation/uprobes.txt | 460 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 460 insertions(+)

Index: new_uprobes.git/Documentation/uprobes.txt
===================================================================
--- /dev/null
+++ new_uprobes.git/Documentation/uprobes.txt
@@ -0,0 +1,460 @@
+Title : User-Space Probes (Uprobes)
+Author : Jim Keniston <[email protected]>
+
+CONTENTS
+
+1. Concepts: Uprobes
+2. Architectures Supported
+3. Configuring Uprobes
+4. API Reference
+5. Uprobes Features and Limitations
+6. Interoperation with Kprobes
+7. Interoperation with Utrace
+8. Probe Overhead
+9. TODO
+10. Uprobes Team
+11. Uprobes Example
+
+1. Concepts: Uprobes
+
+Uprobes enables you to dynamically break into any routine in a
+user application and collect debugging and performance information
+non-disruptively. You can trap at any code address, specifying a
+kernel handler routine to be invoked when the breakpoint is hit.
+
+A uprobe can be inserted on any instruction in the application's
+virtual address space. The registration function
+register_uprobe() specifies which process is to be probed, where
+the probe is to be inserted, and what handler is to be called when
+the probe is hit.
+
+Typically, Uprobes-based instrumentation is packaged as a kernel
+module. In the simplest case, the module's init function installs
+("registers") one or more probes, and the exit function unregisters
+them. However, probes can be registered or unregistered in response
+to other events as well. For example:
+- A probe handler itself can register and/or unregister probes.
+- You can establish Utrace callbacks to register and/or unregister
+probes when a particular process forks, clones a thread,
+execs, enters a system call, receives a signal, exits, etc.
+See the utrace documentation in Documentation/DocBook.
+
+1.1 How Does a Uprobe Work?
+
+When a uprobe is registered, Uprobes makes a copy of the probed
+instruction, stops the probed application, replaces the first byte(s)
+of the probed instruction with a breakpoint instruction (e.g., int3
+on i386 and x86_64), and allows the probed application to continue.
+(When inserting the breakpoint, Uprobes uses the same copy-on-write
+mechanism that ptrace uses, so that the breakpoint affects only that
+process, and not any other process running that program. This is
+true even if the probed instruction is in a shared library.)
+
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+user-mode registers are saved, and a SIGTRAP signal is generated.
+Uprobes intercepts the SIGTRAP and finds the associated uprobe.
+It then executes the handler associated with the uprobe, passing the
+handler the addresses of the uprobe struct and the saved registers.
+The handler may block, but keep in mind that the probed thread remains
+stopped while your handler runs.
+
+Next, Uprobes single-steps its copy of the probed instruction and
+resumes execution of the probed process at the instruction following
+the probepoint. (It would be simpler to single-step the actual
+instruction in place, but then Uprobes would have to temporarily
+remove the breakpoint instruction. This would create problems in a
+multithreaded application. For example, it would open a time window
+when another thread could sail right past the probepoint.)
+
+Instruction copies to be single-stepped are stored in a per-process
+"single-step out of line (XOL) area," which is a little VM area
+created by Uprobes in each probed process's address space.
+
+1.2 The Role of Utrace
+
+When a probe is registered on a previously unprobed process,
+Uprobes establishes a tracing "engine" with Utrace (see
+Documentation/utrace.txt) for each thread (task) in the process.
+Uprobes uses the Utrace "quiesce" mechanism to stop all the threads
+prior to insertion or removal of a breakpoint. Utrace also notifies
+Uprobes of breakpoint and single-step traps and of other interesting
+events in the lifetime of the probed process, such as fork, clone,
+exec, and exit.
+
+1.3 Multithreaded Applications
+
+Uprobes supports the probing of multithreaded applications. Uprobes
+imposes no limit on the number of threads in a probed application.
+All threads in a process use the same text pages, so every probe
+in a process affects all threads; of course, each thread hits the
+probepoint (and runs the handler) independently. Multiple threads
+may run the same handler simultaneously. If you want a particular
+thread or set of threads to run a particular handler, your handler
+should check current or current->pid to determine which thread has
+hit the probepoint.
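+
+For example, a handler that acts only when one thread of interest
+hits the probepoint might look like this (a minimal sketch, not part
+of this patch set; "interesting_pid" is a hypothetical module
+parameter):
+
+static int interesting_pid;
+
+static void my_handler(struct uprobe *u, struct pt_regs *regs)
+{
+	/* Ignore hits from all other threads in the process. */
+	if (current->pid != interesting_pid)
+		return;
+	printk(KERN_INFO "thread %d hit probe at %#lx\n",
+	       current->pid, u->vaddr);
+}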
+
+When a process clones a new thread, that thread automatically shares
+all current and future probes established for that process.
+
+Keep in mind that when you register or unregister a probe, the
+breakpoint is not inserted or removed until Utrace has stopped all
+threads in the process. The register/unregister function returns
+after the breakpoint has been inserted/removed (but see the next
+section).
+
+1.4 Registering Probes within Probe Handlers
+
+A uprobe handler can call [un]register_uprobe() functions.
+A handler can even unregister its own probe. However, when invoked
+from a handler, the actual [un]register operations do not take
+place immediately. Rather, they are queued up and executed after
+all handlers for that probepoint have been run. In the handler,
+the [un]register call returns -EINPROGRESS. If you set the
+registration_callback field in the uprobe object, that callback will
+be called when the [un]register operation completes.
+
+2. Architectures Supported
+
+This ubp-based version of Uprobes is implemented on the following
+architectures:
+
+- x86
+
+3. Configuring Uprobes
+
+When configuring the kernel using make menuconfig/xconfig/oldconfig,
+ensure that CONFIG_UPROBES is set to "y". Select "Infrastructure for
+tracing and debugging user processes" to enable Utrace. Under "General
+setup" select "User-space breakpoint assistance" then select
+"User-space probes".
+
+So that you can load and unload Uprobes-based instrumentation modules,
+make sure "Loadable module support" (CONFIG_MODULES) and "Module
+unloading" (CONFIG_MODULE_UNLOAD) are set to "y".
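+
+For example, the relevant fragment of a resulting .config might look
+like this (a sketch; CONFIG_UTRACE and CONFIG_UBP are the options
+behind the menu entries named above):
+
+CONFIG_UTRACE=y
+CONFIG_UBP=y
+CONFIG_UPROBES=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y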
+
+4. API Reference
+
+The Uprobes API includes a "register" function and an "unregister"
+function for uprobes. Here are terse, mini-man-page specifications for
+these functions and the associated probe handlers that you'll write.
+See the latter half of this document for examples.
+
+4.1 register_uprobe
+
+#include <linux/uprobes.h>
+int register_uprobe(struct uprobe *u);
+
+Sets a breakpoint at virtual address u->vaddr in the process whose
+pid is u->pid. When the breakpoint is hit, Uprobes calls u->handler.
+
+register_uprobe() returns 0 on success, -EINPROGRESS if
+register_uprobe() was called from a uprobe handler (and therefore
+delayed), or a negative errno otherwise.
+
+Section 4.4, "User's Callback for Delayed Registrations",
+explains how to be notified upon completion of a delayed
+registration.
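+
+A minimal registration sketch ("target_pid", "probe_address" and
+"my_handler" are assumed to be supplied by the caller; "u" is a
+zeroed, caller-owned struct uprobe that must not be freed or
+modified while registered):
+
+	int ret;
+
+	u.pid = target_pid;
+	u.vaddr = probe_address;
+	u.handler = my_handler;
+	ret = register_uprobe(&u);
+	if (ret != 0 && ret != -EINPROGRESS)
+		printk(KERN_ERR "register_uprobe() failed: %d\n", ret);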
+
+4.2 User's handler (u->handler)
+
+#include <linux/uprobes.h>
+#include <linux/ptrace.h>
+void handler(struct uprobe *u, struct pt_regs *regs);
+
+Called with u pointing to the uprobe associated with the breakpoint,
+and regs pointing to the struct containing the registers saved when
+the breakpoint was hit.
+
+4.3 unregister_uprobe
+
+#include <linux/uprobes.h>
+void unregister_uprobe(struct uprobe *u);
+
+Removes the specified probe. The unregister function can be called
+at any time after the probe has been registered, and can be called
+from a uprobe handler.
+
+4.4 User's Callback for Delayed Registrations
+
+#include <linux/uprobes.h>
+void registration_callback(struct uprobe *u, int reg, int result);
+
+As previously mentioned, the functions described in Section 4 can
+be called from within a uprobe handler. When that happens, the
+[un]registration operation is delayed until all handlers
+associated with that handler's probepoint have been run. Upon
+completion of the [un]registration operation, Uprobes checks the
+registration_callback member of the associated uprobe:
+u->registration_callback for [un]register_uprobe. Uprobes calls
+that callback function, if any, passing it the following values:
+
+- u = the address of the uprobe object.
+
+- reg = 1 for register_uprobe() or 0 for unregister_uprobe()
+
+- result = the return value that register_uprobe() would have
+returned if this weren't a delayed operation. This is always 0
+for unregister_uprobe().
+
+NOTE: Uprobes calls the registration_callback ONLY in the case of a
+delayed [un]registration.
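+
+For example, a handler that unregisters its own probe might pair
+with a callback like this (a minimal sketch, not part of this
+patch set):
+
+static void my_reg_callback(struct uprobe *u, int reg, int result)
+{
+	/* reg == 0 here: a deferred unregister has completed. */
+	printk(KERN_INFO "probe at %#lx: %sregister done, result %d\n",
+	       u->vaddr, reg ? "" : "un", result);
+}
+
+static void my_handler(struct uprobe *u, struct pt_regs *regs)
+{
+	/* Deferred; completion is reported via my_reg_callback. */
+	unregister_uprobe(u);
+}
+
+Both u->handler and u->registration_callback would be set before
+calling register_uprobe().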
+
+5. Uprobes Features and Limitations
+
+The user is expected to assign values to the following members
+of struct uprobe: pid, vaddr, handler, and (as needed)
+registration_callback. Other members are reserved for Uprobes's use.
+Uprobes may produce unexpected results if you:
+- assign non-zero values to reserved members of struct uprobe;
+- change the contents of a uprobe object while it is registered; or
+- attempt to register a uprobe that is already registered.
+
+Uprobes allows any number of uprobes at a particular address. For
+a particular probepoint, handlers are run in the order in which
+they were registered.
+
+Any number of kernel modules may probe a particular process
+simultaneously, and a particular module may probe any number of
+processes simultaneously.
+
+Probes are shared by all threads in a process (including newly
+created threads).
+
+If a probed process exits or execs, Uprobes automatically
+unregisters all uprobes associated with that process. Subsequent
+attempts to unregister these probes will be treated as no-ops.
+
+On the other hand, if a probed memory area is removed from the
+process's virtual memory map (e.g., via dlclose(3) or munmap(2)),
+it's currently up to you to unregister the probes first.
+
+There is no way to specify that probes should be inherited across fork;
+Uprobes removes all probepoints in the newly created child process.
+See Section 7, "Interoperation with Utrace", for more information on
+this topic.
+
+On at least some architectures, Uprobes makes no attempt to verify
+that the probe address you specify actually marks the start of an
+instruction. If you get this wrong, chaos may ensue.
+
+To avoid interfering with interactive debuggers, Uprobes will refuse
+to insert a probepoint where a breakpoint instruction already exists,
+unless it was Uprobes that put it there. Some architectures may
+refuse to insert probes on other types of instructions.
+
+If you install a probe in an inline-able function, Uprobes makes
+no attempt to chase down all inline instances of the function and
+install probes there. gcc may inline a function without being asked,
+so keep this in mind if you're not seeing the probe hits you expect.
+
+A probe handler can modify the environment of the probed function
+-- e.g., by modifying data structures, or by modifying the
+contents of the pt_regs struct (which are restored to the registers
+upon return from the breakpoint). So Uprobes can be used, for example,
+to install a bug fix or to inject faults for testing. Uprobes, of
+course, has no way to distinguish the deliberately injected faults
+from the accidental ones. Don't drink and probe.
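+
+For instance, a crude x86 fault-injection sketch might look like
+this (hypothetical; it assumes the probepoint is placed where the
+application will next consume the value in the ax register, e.g.,
+just after a call returns):
+
+static void make_it_fail(struct uprobe *u, struct pt_regs *regs)
+{
+	/* The app sees this value in ax when it resumes. */
+	regs->ax = -EIO;
+}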
+
+When you register the first probe at a probepoint or unregister the
+last probe at a probepoint, Uprobes asks Utrace to "quiesce"
+the probed process so that Uprobes can insert or remove the breakpoint
+instruction. If the process is not already stopped, Utrace stops it.
+If the process is entering an interruptible system call at that instant,
+this may cause the system call to finish early or fail with EINTR.
+
+When Uprobes establishes a probepoint on a previously unprobed page
+of text, Linux creates a new copy of the page via its copy-on-write
+mechanism. When probepoints are removed, Uprobes makes no attempt
+to consolidate identical copies of the same page. This could affect
+memory availability if you probe many, many pages in many, many
+long-running processes.
+
+6. Interoperation with Kprobes
+
+Uprobes is intended to interoperate usefully with Kprobes (see
+Documentation/kprobes.txt). For example, an instrumentation module
+can make calls to both the Kprobes API and the Uprobes API.
+
+A uprobe handler can register or unregister kprobes,
+jprobes, and kretprobes, as well as uprobes. On the
+other hand, a kprobe, jprobe, or kretprobe handler must not sleep, and
+therefore cannot register or unregister any of these types of probes.
+(Ideas for removing this restriction are welcome.)
+
+Note that the overhead of a uprobe hit is several times that of
+a k[ret]probe hit.
+
+7. Interoperation with Utrace
+
+As mentioned in Section 1.2, Uprobes is a client of Utrace. For each
+probed thread, Uprobes establishes a Utrace engine, and registers
+callbacks for the following types of events: clone/fork, exec, exit,
+and "core-dump" signals (which include breakpoint traps). Uprobes
+establishes this engine when the process is first probed, or when
+Uprobes is notified of the thread's creation, whichever comes first.
+
+An instrumentation module can use both the Utrace and Uprobes APIs (as
+well as Kprobes). When you do this, keep the following facts in mind:
+
+- For a particular event, Utrace callbacks are called in the order in
+which the engines are established. Utrace does not currently provide
+a mechanism for altering this order.
+
+- When Uprobes learns that a probed process has forked, it removes
+the breakpoints in the child process.
+
+- When Uprobes learns that a probed process has exec-ed or exited,
+it disposes of its data structures for that process (first allowing
+any outstanding [un]registration operations to terminate).
+
+- When a probed thread hits a breakpoint or completes single-stepping
+of a probed instruction, engines with the UTRACE_EVENT(SIGNAL_CORE)
+flag set are notified.
+
+If you want to establish probes in a newly forked child, you can use
+the following procedure (a rough sketch appears after the list):
+
+- Register a report_clone callback with Utrace. In this callback,
+the CLONE_THREAD flag distinguishes between the creation of a new
+thread vs. a new process.
+
+- In your report_clone callback, call utrace_attach_task() to attach to
+the child process, and call utrace_control(..., UTRACE_REPORT).
+The child process will quiesce at a point where it is ready to
+be probed.
+
+- In your report_quiesce callback, register the desired probes.
+(Note that you cannot use the same probe object for both parent
+and child. If you want to duplicate the probepoints, you must
+create a new set of uprobe objects.)
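+
+Here is a rough sketch of such callbacks (illustrative only, not
+code from this patch set; the utrace callback signatures shown are
+abbreviated and may differ in your version of utrace):
+
+static const struct utrace_engine_ops my_ops;	/* defined below */
+
+static u32 my_report_clone(u32 action, struct utrace_engine *engine,
+	unsigned long clone_flags, struct task_struct *child)
+{
+	if (!(clone_flags & CLONE_THREAD)) {
+		/* New process, not just a new thread: attach to the
+		 * child and ask to be called back when it quiesces. */
+		struct utrace_engine *e;
+		e = utrace_attach_task(child, UTRACE_ATTACH_CREATE,
+			&my_ops, NULL);
+		if (!IS_ERR(e)) {
+			utrace_set_events(child, e, UTRACE_EVENT(QUIESCE));
+			utrace_control(child, e, UTRACE_REPORT);
+			utrace_engine_put(e);
+		}
+	}
+	return UTRACE_RESUME;
+}
+
+static u32 my_report_quiesce(u32 action, struct utrace_engine *engine,
+	unsigned long event)
+{
+	/* The child is stopped at a safe point: register the desired
+	 * probes here, with uprobe objects distinct from the parent's. */
+	return UTRACE_RESUME;
+}
+
+static const struct utrace_engine_ops my_ops = {
+	.report_clone = my_report_clone,
+	.report_quiesce = my_report_quiesce,
+};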
+
+8. Probe Overhead
+
+On a typical CPU in use in 2007, a uprobe hit takes about 3
+microseconds to process. Specifically, a benchmark that hits the
+same probepoint repeatedly, firing a simple handler each time, reports
+300,000 to 350,000 hits per second, depending on the architecture.
+
+Here are sample overhead figures (in usec) for x86 architecture.
+
+x86: Intel Pentium M, 1495 MHz, 2957.31 bogomips
+uprobe = 2.9 usec;
+
+9. TODO
+
+a. Support for other architectures.
+b. Support return probes.
+
+10. Uprobes Team
+
+The following people have made major contributions to Uprobes:
+Jim Keniston - [email protected]
+Srikar Dronamraju - [email protected]
+Ananth Mavinakayanahalli - [email protected]
+Prasanna Panchamukhi - [email protected]
+Dave Wilder - [email protected]
+
+11. Uprobes Example
+
+Here's a sample kernel module showing the use of Uprobes to count the
+number of times an instruction at a particular address is executed,
+and optionally (unless verbose=0) report each time it's executed.
+----- cut here -----
+/* uprobe_example.c */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/uprobes.h>
+
+/*
+ * Usage: insmod uprobe_example.ko pid=<pid> vaddr=<address> [verbose=0]
+ * where <pid> identifies the probed process and <address> is the virtual
+ * address of the probed instruction.
+ */
+
+static int pid = 0;
+module_param(pid, int, 0);
+MODULE_PARM_DESC(pid, "pid");
+
+static int verbose = 1;
+module_param(verbose, int, 0);
+MODULE_PARM_DESC(verbose, "verbose");
+
+static long vaddr = 0;
+module_param(vaddr, long, 0);
+MODULE_PARM_DESC(vaddr, "vaddr");
+
+static int nhits;
+static struct uprobe usp;
+
+static void uprobe_handler(struct uprobe *u, struct pt_regs *regs)
+{
+ nhits++;
+ if (verbose)
+ printk(KERN_INFO "Hit #%d on probepoint at %#lx\n",
+ nhits, u->vaddr);
+}
+
+int __init init_module(void)
+{
+ int ret;
+ usp.pid = pid;
+ usp.vaddr = vaddr;
+ usp.handler = uprobe_handler;
+ printk(KERN_INFO "Registering uprobe on pid %d, vaddr %#lx\n",
+ usp.pid, usp.vaddr);
+ ret = register_uprobe(&usp);
+ if (ret != 0) {
+ printk(KERN_ERR "register_uprobe() failed, returned %d\n", ret);
+ return ret;
+ }
+ return 0;
+}
+
+void __exit cleanup_module(void)
+{
+ printk(KERN_INFO "Unregistering uprobe on pid %d, vaddr %#lx\n",
+ usp.pid, usp.vaddr);
+ printk(KERN_INFO "Probepoint was hit %d times\n", nhits);
+ unregister_uprobe(&usp);
+}
+MODULE_LICENSE("GPL");
+----- cut here -----
+
+You can build the kernel module, uprobe_example.ko, using the following
+Makefile:
+----- cut here -----
+obj-m := uprobe_example.o
+KDIR := /lib/modules/$(shell uname -r)/build
+PWD := $(shell pwd)
+default:
+ $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
+clean:
+ rm -f *.mod.c *.ko *.o .*.cmd
+ rm -rf .tmp_versions
+----- cut here -----
+
+For example, if you want to run myprog and monitor its calls to myfunc(),
+you can do the following:
+
+$ make // Build the uprobe_example module.
+...
+$ nm -p myprog | awk '$3=="myfunc"'
+080484a8 T myfunc
+$ ./myprog &
+$ ps
+ PID TTY TIME CMD
+ 4367 pts/3 00:00:00 bash
+ 8156 pts/3 00:00:00 myprog
+ 8157 pts/3 00:00:00 ps
+$ su -
+...
+# insmod uprobe_example.ko pid=8156 vaddr=0x080484a8
+
+In /var/log/messages and on the console, you will see a message of the
+form "kernel: Hit #1 on probepoint at 0x80484a8" each time myfunc()
+is called. To turn off probing, remove the module:
+
+# rmmod uprobe_example
+
+In /var/log/messages and on the console, you will see a message of the
+form "Probepoint was hit 5 times".

2010-01-11 14:36:24

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes

Srikar Dronamraju wrote:
> Hi,
>
> This patchset implements Uprobes which enables you to dynamically
> break into any routine in a user space application and collect
> information non-disruptively. Uprobes is based on utrace and uses
> x86 instruction decoder.
>
> When a uprobe is registered, Uprobes makes a copy of the probed
> instruction, stops the probed application, replaces the first
> byte(s) of the probed instruction with a breakpoint instruction and
> allows the probed application to continue. (Uprobes uses the same
> copy-on-write mechanism so that the breakpoint affects only that
> process.)
>
> When a CPU hits the breakpoint instruction, Uprobes intercepts the
> SIGTRAP and finds the associated uprobe. It then executes the
> associated handler. Uprobes single-steps its copy of the probed
> instruction and resumes execution of the probed process at the
> instruction following the probepoint. Instruction copies to be
> single-stepped are stored in a per-process "single-step out of line
> (XOL) area,"
>
> Uprobes can be used to take advantage of static markers available
> in user space applications.
>
> Advantages of uprobes over conventional debugging include:
> 1. Non-disruptive.
> 2. Uses Execution out of line(XOL),
> 3. Much better handling of multithreaded programs because of XOL.
> 4. No context switch between tracer, tracee.

Hi Srikar and Jim,

Great work! Thanks for releasing it.

>
> Here is the list of TODO Items.
>
> - Provide a perf interface to uprobes.

I think we also need to integrate ftrace-kprobe/uprobe to support
dynamic trace events. That would help perf probe support uprobes
much more easily.

> - Return probes.

Hmm, I think we need some symbol information for supporting
return probes in user space. Could you tell me how that would work?
Does it require some user-space helper?

> - Support for Other Architectures.
> - Jump optimization.

I assume that you meant a "uprobe-booster" that skips the
single-stepping after probing, is that right?


Thank you,
--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-11 23:00:07

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes


On Mon, 2010-01-11 at 09:35 -0500, Masami Hiramatsu wrote:
> Srikar Dronamraju wrote:
> > Hi,
> >
> > This patchset implements Uprobes which enables you to dynamically
> > break into any routine in a user space application and collect
> > information non-disruptively. Uprobes is based on utrace and uses
> > x86 instruction decoder.
...
>
> > - Return probes.
>
> Hmm, I think we need some symbol information for supporting
> return probes in user space. Could you tell me how that would work?
> Does it require some user-space helper?

Return probes are on the TODO list, but we actually already have a
pretty solid implementation of that. It's held out for now because
Srikar's patch set is already big, and we want get a review of the basic
ubp/xol/uprobes feature.

For the most part, we don't need special symbol information for return
probes. We just do as we did in kretprobes: hijack the return address
and replace it with the address of a trampoline. In user-space return
probes, the trampoline is one of the instruction slots in the XOL vma,
and contains a breakpoint to trap us into the kernel. (Of course, as in
kretprobes, we need to know the address of the function so we can hijack
the return address upon entry to the function.)

One place where symbol info would come in handy is when a function
returns in a weird way. We handle longjmps by noticing that the task's
stack is smaller than expected, and presumably missing stack frames that
were bypassed by the longjmp. But this heuristic gets dicey when you
consider that in a 32-bit x86 app, a struct-returning function pops not
only the return address upon return, but also the address of the
returned struct value. So it'd be nice to know if a function returns a
struct.

Does this answer your question, or did I miss something?

>
> > - Support for Other Architectures.
> > - Jump optimization.
>
> I assume that you meant a "uprobe-booster" that skips the
> single-stepping after probing, is that right?

Yes, I think that's what Srikar meant: avoid single-stepping by adding a
jump instruction after the instruction-copy in the XOL slot -- as you
did in your kprobes-booster work. Your instruction-analysis work makes
this much more feasible.

>
>
> Thank you,

Jim Keniston

2010-01-12 02:02:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Mon, Jan 11, 2010 at 05:55:53PM +0530, Srikar Dronamraju wrote:
> Uprobes Implementation
>
> Uprobes Infrastructure enables users to dynamically establish
> probepoints in user applications and collect information by executing
> handler functions when the probepoints are hit.
> Please refer to Documentation/uprobes.txt for more details.
>
> This patch provides the core implementation of uprobes.
> This patch builds on utrace infrastructure.
>
> You need to follow this up with the uprobes patch for your
> architecture.

Good to see this!!! Several questions interspersed below.

Thanx, Paul

> Signed-off-by: Jim Keniston <[email protected]>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> arch/Kconfig | 12
> include/linux/uprobes.h | 292 ++++++
> kernel/Makefile | 1
> kernel/uprobes_core.c | 2017 ++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 2322 insertions(+)
>
> Index: new_uprobes.git/arch/Kconfig
> ===================================================================
> --- new_uprobes.git.orig/arch/Kconfig
> +++ new_uprobes.git/arch/Kconfig
> @@ -66,6 +66,16 @@ config UBP
> in user applications. This service is used by components
> such as uprobes. If in doubt, say "N".
>
> +config UPROBES
> + bool "User-space probes (EXPERIMENTAL)"
> + depends on UTRACE && MODULES && UBP
> + depends on HAVE_UPROBES
> + help
> + Uprobes enables kernel modules to establish probepoints
> + in user applications and execute handler functions when
> + the probepoints are hit. For more information, refer to
> + Documentation/uprobes.txt. If in doubt, say "N".
> +
> config HAVE_EFFICIENT_UNALIGNED_ACCESS
> bool
> help
> @@ -115,6 +125,8 @@ config HAVE_KPROBES
> config HAVE_KRETPROBES
> bool
>
> +config HAVE_UPROBES
> + def_bool n
> #
> # An arch should select this if it provides all these things:
> #
> Index: new_uprobes.git/include/linux/uprobes.h
> ===================================================================
> --- /dev/null
> +++ new_uprobes.git/include/linux/uprobes.h
> @@ -0,0 +1,292 @@
> +#ifndef _LINUX_UPROBES_H
> +#define _LINUX_UPROBES_H
> +/*
> + * Userspace Probes (UProbes)
> + * include/linux/uprobes.h
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2006, 2009
> + */
> +#include <linux/types.h>
> +#include <linux/list.h>
> +
> +struct pt_regs;
> +
> +/* This is what the user supplies us. */
> +struct uprobe {
> + /*
> + * The pid of the probed process. Currently, this can be the
> + * thread ID (task->pid) of any active thread in the process.
> + */
> + pid_t pid;
> +
> + /* Location of the probepoint */
> + unsigned long vaddr;
> +
> + /* Handler to run when the probepoint is hit */
> + void (*handler)(struct uprobe*, struct pt_regs*);
> +
> + /*
> + * This function, if non-NULL, will be called upon completion of
> + * an ASYNCHRONOUS registration (i.e., one initiated by a uprobe
> + * handler). reg = 1 for register, 0 for unregister.
> + */
> + void (*registration_callback)(struct uprobe *u, int reg, int result);
> +
> + /* Reserved for use by uprobes */
> + void *kdata;
> +};
> +
> +#if defined(CONFIG_UPROBES)
> +extern int register_uprobe(struct uprobe *u);
> +extern void unregister_uprobe(struct uprobe *u);
> +#else
> +static inline int register_uprobe(struct uprobe *u)
> +{
> + return -ENOSYS;
> +}
> +static inline void unregister_uprobe(struct uprobe *u)
> +{
> +}
> +#endif /* CONFIG_UPROBES */
> +
> +#ifdef UPROBES_IMPLEMENTATION
> +
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/wait.h>
> +#include <asm/atomic.h>
> +#include <linux/ubp.h>
> +#include <linux/ubp_xol.h>
> +#include <asm/uprobes.h>
> +
> +struct utrace_engine;
> +struct task_struct;
> +struct pid;
> +
> +enum uprobe_probept_state {
> + UPROBE_INSERTING, /* process quiescing prior to insertion */
> + UPROBE_BP_SET, /* breakpoint in place */
> + UPROBE_REMOVING, /* process quiescing prior to removal */
> + UPROBE_DISABLED /* removal completed */
> +};
> +
> +enum uprobe_task_state {
> + UPTASK_QUIESCENT,
> + UPTASK_SLEEPING, /* See utask_fake_quiesce(). */
> + UPTASK_RUNNING,
> + UPTASK_BP_HIT,
> + UPTASK_SSTEP
> +};
> +
> +enum uprobe_ssil_state {
> + SSIL_DISABLE,
> + SSIL_CLEAR,
> + SSIL_SET
> +};
> +
> +#define UPROBE_HASH_BITS 5
> +#define UPROBE_TABLE_SIZE (1 << UPROBE_HASH_BITS)
> +
> +/*
> + * uprobe_process -- not a user-visible struct.
> + * A uprobe_process represents a probed process. A process can have
> + * multiple probepoints (each represented by a uprobe_probept) and
> + * one or more threads (each represented by a uprobe_task).
> + */
> +struct uprobe_process {
> + /*
> + * rwsem is write-locked for any change to the uprobe_process's
> + * graph (including uprobe_tasks, uprobe_probepts, and uprobe_kimgs) --
> + * e.g., due to probe [un]registration or special events like exit.
> + * It's read-locked during the whole time we process a probepoint hit.
> + */
> + struct rw_semaphore rwsem;
> +
> + /* Table of uprobe_probepts registered for this process */
> + /* TODO: Switch to list_head[] per Ingo. */
> + struct hlist_head uprobe_table[UPROBE_TABLE_SIZE];
> +
> + /* List of uprobe_probepts awaiting insertion or removal */
> + struct list_head pending_uprobes;
> +
> + /* List of uprobe_tasks in this task group */
> + struct list_head thread_list;
> + int nthreads;
> + int n_quiescent_threads;
> +
> + /* this goes on the uproc_table */
> + struct hlist_node hlist;
> +
> + /*
> + * All threads (tasks) in a process share the same uprobe_process.
> + */
> + struct pid *tg_leader;
> + pid_t tgid;
> +
> + /* Threads in UTASK_SLEEPING state wait here to be roused. */
> + wait_queue_head_t waitq;
> +
> + /*
> + * We won't free the uprobe_process while...
> + * - any register/unregister operations on it are in progress; or
> + * - any uprobe_report_* callbacks are running; or
> + * - uprobe_table[] is not empty; or
> + * - any tasks are UTASK_SLEEPING in the waitq;
> + * refcount reflects this. We do NOT ref-count tasks (threads),
> + * since once the last thread has exited, the rest is academic.
> + */
> + atomic_t refcount;
> +
> + /*
> + * finished = 1 means the process is execing or the last thread
> + * is exiting, and we're cleaning up the uproc. If the execed
> + * process is probed, a new uproc will be created.
> + */
> + bool finished;
> +
> + /*
> + * 1 to single-step out of line; 0 for inline. This can drop to
> + * 0 if we can't set up the XOL area, but never goes from 0 to 1.
> + */
> + bool sstep_out_of_line;
> +
> + /*
> + * Manages slots for instruction-copies to be single-stepped
> + * out of line.
> + */
> + void *xol_area;
> +};
> +
> +/*
> + * uprobe_kimg -- not a user-visible struct.
> + * Holds implementation-only per-uprobe data.
> + * uprobe->kdata points to this.
> + */
> +struct uprobe_kimg {
> + struct uprobe *uprobe;
> + struct uprobe_probept *ppt;
> +
> + /*
> + * -EBUSY while we're waiting for all threads to quiesce so the
> + * associated breakpoint can be inserted or removed.
> + * 0 if the insert/remove operation has succeeded, or -errno
> + * otherwise.
> + */
> + int status;
> +
> + /* on ppt's list */
> + struct list_head list;
> +};
> +
> +/*
> + * uprobe_probept -- not a user-visible struct.
> + * A probepoint, at which several uprobes can be registered.
> + * Guarded by uproc->rwsem.
> + */
> +struct uprobe_probept {
> + /* breakpoint/XOL details */
> + struct ubp_bkpt ubp;
> +
> + /* The uprobe_kimg(s) associated with this uprobe_probept */
> + struct list_head uprobe_list;
> +
> + enum uprobe_probept_state state;
> +
> + /* The parent uprobe_process */
> + struct uprobe_process *uproc;
> +
> + /*
> + * ppt goes in the uprobe_process->uprobe_table when registered --
> + * even before the breakpoint has been inserted.
> + */
> + struct hlist_node ut_node;
> +
> + /*
> + * ppt sits in the uprobe_process->pending_uprobes queue while
> + * awaiting insertion or removal of the breakpoint.
> + */
> + struct list_head pd_node;
> +
> + /* [un]register_uprobe() waits 'til bkpt inserted/removed */
> + wait_queue_head_t waitq;
> +
> + /*
> + * ssil_lock, ssilq and ssil_state are used to serialize
> + * single-stepping inline, so threads don't clobber each other
> + * swapping the breakpoint instruction in and out. This helps
> + * prevent crashing the probed app, but it does NOT prevent
> + * probe misses while the breakpoint is swapped out.
> + * ssilq - threads wait for their chance to single-step inline.
> + */
> + spinlock_t ssil_lock;
> + wait_queue_head_t ssilq;
> + enum uprobe_ssil_state ssil_state;
> +};
> +
> +/*
> + * uprobe_task -- not a user-visible struct.
> + * Corresponds to a thread in a probed process.
> + * Guarded by uproc->rwsem.
> + */
> +struct uprobe_task {
> + /* Lives in the global utask_table */
> + struct hlist_node hlist;
> +
> + /* Lives on the thread_list for the uprobe_process */
> + struct list_head list;
> +
> + struct task_struct *tsk;
> + struct pid *pid;
> +
> + /* The utrace engine for this task */
> + struct utrace_engine *engine;
> +
> + /* Back pointer to the associated uprobe_process */
> + struct uprobe_process *uproc;
> +
> + enum uprobe_task_state state;
> +
> + /*
> + * quiescing = 1 means this task has been asked to quiesce.
> + * It may not be able to comply immediately if it's hit a bkpt.
> + */
> + bool quiescing;
> +
> + /* Set before running handlers; cleared after single-stepping. */
> + struct uprobe_probept *active_probe;
> +
> + /* Saved address of copied original instruction */
> + long singlestep_addr;
> +
> + struct ubp_task_arch_info arch_info;
> +
> + /*
> + * Unexpected error in probepoint handling has left task's
> + * text or stack corrupted. Kill task ASAP.
> + */
> + bool doomed;
> +
> + /* [un]registrations initiated by handlers must be asynchronous. */
> + struct list_head deferred_registrations;
> +
> + /* Delay handler-destined signals 'til after single-step done. */
> + struct list_head delayed_signals;
> +};
> +
> +#endif /* UPROBES_IMPLEMENTATION */
> +
> +#endif /* _LINUX_UPROBES_H */
> Index: new_uprobes.git/kernel/Makefile
> ===================================================================
> --- new_uprobes.git.orig/kernel/Makefile
> +++ new_uprobes.git/kernel/Makefile
> @@ -104,6 +104,7 @@ obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_b
> obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
> obj-$(CONFIG_UBP) += ubp_core.o
> obj-$(CONFIG_UBP_XOL) += ubp_xol.o
> +obj-$(CONFIG_UPROBES) += uprobes_core.o
>
> ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
> # According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
> Index: new_uprobes.git/kernel/uprobes_core.c
> ===================================================================
> --- /dev/null
> +++ new_uprobes.git/kernel/uprobes_core.c
> @@ -0,0 +1,2017 @@
> +/*
> + * Userspace Probes (UProbes)
> + * kernel/uprobes_core.c
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2006, 2009
> + */
> +#include <linux/types.h>
> +#include <linux/hash.h>
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/sched.h>
> +#include <linux/rcupdate.h>
> +#include <linux/err.h>
> +#include <linux/kref.h>
> +#include <linux/utrace.h>
> +#include <linux/regset.h>
> +#define UPROBES_IMPLEMENTATION 1
> +#include <linux/uprobes.h>
> +#include <linux/tracehook.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/errno.h>
> +#include <linux/mman.h>
> +
> +#define UPROBE_SET_FLAGS 1
> +#define UPROBE_CLEAR_FLAGS 0
> +
> +#define MAX_XOL_SLOTS 1024
> +
> +static int utask_fake_quiesce(struct uprobe_task *utask);
> +static int uprobe_post_ssout(struct uprobe_task *utask,
> + struct uprobe_probept *ppt, struct pt_regs *regs);
> +
> +typedef void (*uprobe_handler_t)(struct uprobe*, struct pt_regs*);
> +
> +/*
> + * Table of currently probed processes, hashed by task-group leader's
> + * struct pid.
> + */
> +static struct hlist_head uproc_table[UPROBE_TABLE_SIZE];
> +
> +/* Protects uproc_table during uprobe (un)registration */
> +static DEFINE_MUTEX(uproc_mutex);
> +
> +/* Table of uprobe_tasks, hashed by task_struct pointer. */
> +static struct hlist_head utask_table[UPROBE_TABLE_SIZE];
> +static DEFINE_SPINLOCK(utask_table_lock);
> +
> +/* p_uprobe_utrace_ops = &uprobe_utrace_ops. Fwd refs are a pain w/o this. */
> +static const struct utrace_engine_ops *p_uprobe_utrace_ops;
> +
> +struct deferred_registration {
> + struct list_head list;
> + struct uprobe *uprobe;
> + int regflag; /* 0 - unregister, 1 - register */
> +};
> +
> +/*
> + * Calling a signal handler cancels single-stepping, so uprobes delays
> + * calling the handler, as necessary, until after single-stepping is completed.
> + */
> +struct delayed_signal {
> + struct list_head list;
> + siginfo_t info;
> +};
> +
> +static u16 ubp_strategies;
> +
> +static struct uprobe_task *uprobe_find_utask(struct task_struct *tsk)
> +{
> + struct hlist_head *head;
> + struct hlist_node *node;
> + struct uprobe_task *utask;
> + unsigned long flags;
> +
> + head = &utask_table[hash_ptr(tsk, UPROBE_HASH_BITS)];
> + spin_lock_irqsave(&utask_table_lock, flags);
> + hlist_for_each_entry(utask, node, head, hlist) {
> + if (utask->tsk == tsk) {
> + spin_unlock_irqrestore(&utask_table_lock, flags);
> + return utask;
> + }
> + }
> + spin_unlock_irqrestore(&utask_table_lock, flags);
> + return NULL;
> +}
> +
> +static void uprobe_hash_utask(struct uprobe_task *utask)
> +{
> + struct hlist_head *head;
> + unsigned long flags;
> +
> + INIT_HLIST_NODE(&utask->hlist);
> + head = &utask_table[hash_ptr(utask->tsk, UPROBE_HASH_BITS)];
> + spin_lock_irqsave(&utask_table_lock, flags);
> + hlist_add_head(&utask->hlist, head);
> + spin_unlock_irqrestore(&utask_table_lock, flags);
> +}
> +
> +static void uprobe_unhash_utask(struct uprobe_task *utask)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&utask_table_lock, flags);
> + hlist_del(&utask->hlist);
> + spin_unlock_irqrestore(&utask_table_lock, flags);
> +}
> +
> +static inline void uprobe_get_process(struct uprobe_process *uproc)
> +{
> + atomic_inc(&uproc->refcount);
> +}
> +
> +/*
> + * Decrement uproc's refcount in a situation where we "know" it can't
> + * reach zero. It's OK to call this with uproc locked. Compare with
> + * uprobe_put_process().
> + */
> +static inline void uprobe_decref_process(struct uprobe_process *uproc)
> +{
> + if (atomic_dec_and_test(&uproc->refcount))
> + BUG();
> +}
> +
> +/*
> + * Runs with the uproc_mutex held. Returns with uproc ref-counted and
> + * write-locked.
> + *
> + * Around exec time, briefly, it's possible to have one (finished) uproc
> + * for the old image and one for the new image. We find the latter.
> + */
> +static struct uprobe_process *uprobe_find_process(struct pid *tg_leader)
> +{
> + struct uprobe_process *uproc;
> + struct hlist_head *head;
> + struct hlist_node *node;
> +
> + head = &uproc_table[hash_ptr(tg_leader, UPROBE_HASH_BITS)];
> + hlist_for_each_entry(uproc, node, head, hlist) {
> + if (uproc->tg_leader == tg_leader && !uproc->finished) {
> + uprobe_get_process(uproc);
> + down_write(&uproc->rwsem);
> + return uproc;
> + }
> + }
> + return NULL;
> +}
> +
> +/*
> + * In the given uproc's hash table of probepoints, find the one with the
> + * specified virtual address. Runs with uproc->rwsem locked.
> + */
> +static struct uprobe_probept *uprobe_find_probept(struct uprobe_process *uproc,
> + unsigned long vaddr)
> +{
> + struct uprobe_probept *ppt;
> + struct hlist_node *node;
> + struct hlist_head *head = &uproc->uprobe_table[hash_long(vaddr,
> + UPROBE_HASH_BITS)];
> +
> + hlist_for_each_entry(ppt, node, head, ut_node) {
> + if (ppt->ubp.vaddr == vaddr && ppt->state != UPROBE_DISABLED)
> + return ppt;
> + }
> + return NULL;
> +}
> +
> +/*
> + * Save a copy of the original instruction (so it can be single-stepped
> + * out of line), insert the breakpoint instruction, and awake
> + * register_uprobe().
> + */
> +static void uprobe_insert_bkpt(struct uprobe_probept *ppt,
> + struct task_struct *tsk)
> +{
> + struct uprobe_kimg *uk;
> + int result;
> +
> + if (tsk)
> + result = ubp_insert_bkpt(tsk, &ppt->ubp);
> + else
> + /* No surviving tasks associated with ppt->uproc */
> + result = -ESRCH;
> + ppt->state = (result ? UPROBE_DISABLED : UPROBE_BP_SET);
> + list_for_each_entry(uk, &ppt->uprobe_list, list)
> + uk->status = result;
> + wake_up_all(&ppt->waitq);
> +}
> +
> +/*
> + * Check if task has just stepped on a trap instruction at the
> + * indicated address. If it has indeed stepped on that address,
> + * then reset Instruction Pointer for the task.
> + *
> + * tsk should either be current thread or already quiesced thread.
> + */
> +static inline void reset_thread_ip(struct task_struct *tsk,
> + struct pt_regs *regs, unsigned long addr)
> +{
> + if ((ubp_get_bkpt_addr(regs) == addr) &&
> + !test_tsk_thread_flag(tsk, TIF_SINGLESTEP))
> + ubp_set_ip(regs, addr);
> +}
> +
> +/*
> + * ppt's breakpoint has been removed. If any threads are in the middle of
> + * single-stepping at this probepoint, fix things up so they can proceed.
> + * If any threads have just hit breakpoint but are yet to start
> + * pre-processing, reset their instruction pointers.
> + *
> + * Runs with all of ppt->uproc's threads quiesced and ppt->uproc->rwsem
> + * write-locked
> + */
> +static inline void adjust_trapped_thread_ip(struct uprobe_probept *ppt)
> +{
> + struct uprobe_process *uproc = ppt->uproc;
> + struct uprobe_task *utask;
> + struct pt_regs *regs;
> +
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + regs = task_pt_regs(utask->tsk);
> + if (utask->active_probe != ppt) {
> + reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
> + continue;
> + }
> +
> + /*
> + * Current thread cannot have an active breakpoint
> + * and still request for a breakpoint removal. The
> + * above case is handled by utask_fake_quiesce().
> + */
> + BUG_ON(utask->tsk == current);
> +
> +#ifdef CONFIG_UBP_XOL
> + if (instruction_pointer(regs) == ppt->ubp.xol_vaddr)
> + /* adjust the ip to breakpoint addr. */
> + ubp_set_ip(regs, ppt->ubp.vaddr);
> + else
> + /* adjust the ip to next instruction. */
> + uprobe_post_ssout(utask, ppt, regs);
> +#endif
> + }
> +}
> +
> +static void uprobe_remove_bkpt(struct uprobe_probept *ppt,
> + struct task_struct *tsk)
> +{
> + if (tsk) {
> + if (ubp_remove_bkpt(tsk, &ppt->ubp) != 0) {
> + printk(KERN_ERR
> + "Error removing uprobe at pid %d vaddr %#lx:"
> + " can't restore original instruction\n",
> + tsk->tgid, ppt->ubp.vaddr);
> + /*
> + * This shouldn't happen, since we were previously
> + * able to write the breakpoint at that address.
> + * There's not much we can do besides let the
> + * process die with a SIGTRAP the next time the
> + * breakpoint is hit.
> + */
> + }
> + adjust_trapped_thread_ip(ppt);
> + if (ppt->ubp.strategy & UBP_HNT_INLINE) {
> + unsigned long flags;
> + spin_lock_irqsave(&ppt->ssil_lock, flags);
> + ppt->ssil_state = SSIL_DISABLE;
> + wake_up_all(&ppt->ssilq);
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> + }
> + }
> + /* Wake up unregister_uprobe(). */
> + ppt->state = UPROBE_DISABLED;
> + wake_up_all(&ppt->waitq);
> +}
> +
> +/*
> + * Runs with all of uproc's threads quiesced and uproc->rwsem write-locked.
> + * As specified, insert or remove the breakpoint instruction for each
> + * uprobe_probept on uproc's pending list.
> + * tsk = one of the tasks associated with uproc -- NULL if there are
> + * no surviving threads.
> + * It's OK for uproc->pending_uprobes to be empty here. It can happen
> + * if a register and an unregister are requested (by different probers)
> + * simultaneously for the same pid/vaddr.
> + */
> +static void handle_pending_uprobes(struct uprobe_process *uproc,
> + struct task_struct *tsk)
> +{
> + struct uprobe_probept *ppt, *tmp;
> +
> + list_for_each_entry_safe(ppt, tmp, &uproc->pending_uprobes, pd_node) {
> + switch (ppt->state) {
> + case UPROBE_INSERTING:
> + uprobe_insert_bkpt(ppt, tsk);
> + break;
> + case UPROBE_REMOVING:
> + uprobe_remove_bkpt(ppt, tsk);
> + break;
> + default:
> + BUG();
> + }
> + list_del(&ppt->pd_node);
> + }
> +}
> +
> +static void utask_adjust_flags(struct uprobe_task *utask, int set,
> + unsigned long flags)
> +{
> + unsigned long newflags, oldflags;
> +
> + oldflags = utask->engine->flags;
> + newflags = oldflags;
> + if (set)
> + newflags |= flags;
> + else
> + newflags &= ~flags;
> + /*
> + * utrace_barrier[_pid] is not appropriate here. If we're
> + * adjusting current, it's not needed. And if we're adjusting
> + * some other task, we're holding utask->uproc->rwsem, which
> + * could prevent that task from completing the callback we'd
> + * be waiting on.
> + */
> + if (newflags != oldflags) {
> + if (utrace_set_events_pid(utask->pid, utask->engine,
> + newflags) != 0)
> + /* We don't care. */
> + ;
> + }
> +}
> +
> +static inline void clear_utrace_quiesce(struct uprobe_task *utask, bool resume)
> +{
> + utask_adjust_flags(utask, UPROBE_CLEAR_FLAGS, UTRACE_EVENT(QUIESCE));
> + if (resume) {
> + if (utrace_control_pid(utask->pid, utask->engine,
> + UTRACE_RESUME) != 0)
> + /* We don't care. */
> + ;
> + }
> +}
> +
> +/* Opposite of quiesce_all_threads(). Same locking applies. */
> +static void rouse_all_threads(struct uprobe_process *uproc)
> +{
> + struct uprobe_task *utask;
> +
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + if (utask->quiescing) {
> + utask->quiescing = false;
> + if (utask->state == UPTASK_QUIESCENT) {
> + utask->state = UPTASK_RUNNING;
> + uproc->n_quiescent_threads--;
> + clear_utrace_quiesce(utask, true);
> + }
> + }
> + }
> + /* Wake any threads that decided to sleep rather than quiesce. */
> + wake_up_all(&uproc->waitq);
> +}
> +
> +/*
> + * If all of uproc's surviving threads have quiesced, do the necessary
> + * breakpoint insertions or removals, un-quiesce everybody, and return 1.
> + * tsk is a surviving thread, or NULL if there is none. Runs with
> + * uproc->rwsem write-locked.
> + */
> +static int check_uproc_quiesced(struct uprobe_process *uproc,
> + struct task_struct *tsk)
> +{
> + if (uproc->n_quiescent_threads >= uproc->nthreads) {
> + handle_pending_uprobes(uproc, tsk);
> + rouse_all_threads(uproc);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Direct the indicated thread to quiesce. */
> +static void uprobe_stop_thread(struct uprobe_task *utask)
> +{
> + int result;
> +
> + /*
> + * As with utask_adjust_flags, calling utrace_barrier_pid below
> + * could deadlock.
> + */
> + BUG_ON(utask->tsk == current);
> + result = utrace_control_pid(utask->pid, utask->engine, UTRACE_STOP);
> + if (result == 0) {
> + /* Already stopped. */
> + utask->state = UPTASK_QUIESCENT;
> + utask->uproc->n_quiescent_threads++;
> + } else if (result == -EINPROGRESS) {
> + if (utask->tsk->state & TASK_INTERRUPTIBLE) {
> + /*
> + * Task could be in interruptible wait for a long
> + * time -- e.g., if stopped for I/O. But we know
> + * it's not going to run user code before all
> + * threads quiesce, so pretend it's quiesced.
> + * This avoids terminating a system call via
> + * UTRACE_INTERRUPT.
> + */
> + utask->state = UPTASK_QUIESCENT;
> + utask->uproc->n_quiescent_threads++;
> + } else {
> + /*
> + * Task will eventually stop, but it may be a long time.
> + * Don't wait.
> + */
> + result = utrace_control_pid(utask->pid, utask->engine,
> + UTRACE_INTERRUPT);
> + if (result != 0)
> + /* We don't care. */
> + ;
> + }
> + }
> +}
> +
> +/*
> + * Quiesce all threads in the specified process -- e.g., prior to
> + * breakpoint insertion. Runs with uproc->rwsem write-locked.
> + * Returns false if all threads have died.
> + */
> +static bool quiesce_all_threads(struct uprobe_process *uproc,
> + struct uprobe_task **cur_utask_quiescing)
> +{
> + struct uprobe_task *utask;
> + struct task_struct *survivor = NULL; /* any survivor */
> + bool survivors = false;
> +
> + *cur_utask_quiescing = NULL;
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + if (!survivors) {
> + survivor = pid_task(utask->pid, PIDTYPE_PID);
> + if (survivor)
> + survivors = true;
> + }
> + if (!utask->quiescing) {
> + /*
> + * If utask is currently handling a probepoint, it'll
> + * check utask->quiescing and quiesce when it's done.
> + */
> + utask->quiescing = true;
> + if (utask->tsk == current)
> + *cur_utask_quiescing = utask;
> + else if (utask->state == UPTASK_RUNNING) {
> + utask_adjust_flags(utask, UPROBE_SET_FLAGS,
> + UTRACE_EVENT(QUIESCE));
> + uprobe_stop_thread(utask);
> + }
> + }
> + }
> + /*
> + * If all the (other) threads are already quiesced, it's up to the
> + * current thread to do the necessary work.
> + */
> + check_uproc_quiesced(uproc, survivor);
> + return survivors;
> +}
> +
> +/* Called with utask->uproc write-locked. */
> +static void uprobe_free_task(struct uprobe_task *utask, bool in_callback)
> +{
> + struct deferred_registration *dr, *d;
> + struct delayed_signal *ds, *ds2;
> + int result;
> +
> + if (utask->engine && (utask->tsk != current || !in_callback)) {
> + /*
> + * No other tasks in this process should be running
> + * uprobe_report_* callbacks. (If they are, utrace_barrier()
> + * here could deadlock.)
> + */
> + result = utrace_control_pid(utask->pid, utask->engine,
> + UTRACE_DETACH);
> + BUG_ON(result == -EINPROGRESS);
> + }
> + put_pid(utask->pid); /* null pid OK */
> +
> + uprobe_unhash_utask(utask);
> + list_del(&utask->list);
> + list_for_each_entry_safe(dr, d, &utask->deferred_registrations, list) {
> + list_del(&dr->list);
> + kfree(dr);
> + }
> +
> + list_for_each_entry_safe(ds, ds2, &utask->delayed_signals, list) {
> + list_del(&ds->list);
> + kfree(ds);
> + }
> +
> + kfree(utask);
> +}
> +
> +/*
> + * Dismantle uproc and all its remaining uprobe_tasks.
> + * in_callback = 1 if the caller is a uprobe_report_* callback who will
> + * handle the UTRACE_DETACH operation.
> + * Runs with uproc_mutex held; called with uproc->rwsem write-locked.
> + */
> +static void uprobe_free_process(struct uprobe_process *uproc, int in_callback)
> +{
> + struct uprobe_task *utask, *tmp;
> +
> + if (!hlist_unhashed(&uproc->hlist))
> + hlist_del(&uproc->hlist);
> + list_for_each_entry_safe(utask, tmp, &uproc->thread_list, list)
> + uprobe_free_task(utask, in_callback);
> + put_pid(uproc->tg_leader);
> + if (uproc->xol_area)
> + xol_put_area(uproc->xol_area);
> + up_write(&uproc->rwsem); /* So kfree doesn't complain */
> + kfree(uproc);
> +}
> +
> +/*
> + * Decrement uproc's ref count. If it's zero, free uproc and return
> + * 1. Else return 0. If uproc is locked, don't call this; use
> + * uprobe_decref_process().
> + */
> +static int uprobe_put_process(struct uprobe_process *uproc, bool in_callback)
> +{
> + int freed = 0;
> +
> + if (atomic_dec_and_test(&uproc->refcount)) {
> + mutex_lock(&uproc_mutex);
> + down_write(&uproc->rwsem);
> + if (unlikely(atomic_read(&uproc->refcount) != 0)) {
> + /*
> + * This works because uproc_mutex is held any
> + * time the ref count can go from 0 to 1 -- e.g.,
> + * register_uprobe() sneaks in with a new probe.
> + */
> + up_write(&uproc->rwsem);
> + } else {
> + uprobe_free_process(uproc, in_callback);
> + freed = 1;
> + }
> + mutex_unlock(&uproc_mutex);
> + }
> + return freed;
> +}
> +
> +static struct uprobe_kimg *uprobe_mk_kimg(struct uprobe *u)
> +{
> + struct uprobe_kimg *uk = kzalloc(sizeof *uk,
> + GFP_USER);
> +
> + if (unlikely(!uk))
> + return ERR_PTR(-ENOMEM);
> + u->kdata = uk;
> + uk->uprobe = u;
> + uk->ppt = NULL;
> + INIT_LIST_HEAD(&uk->list);
> + uk->status = -EBUSY;
> + return uk;
> +}
> +
> +/*
> + * Allocate a uprobe_task object for p and add it to uproc's list.
> + * Called with p "got" and uproc->rwsem write-locked. Called in one of
> + * the following cases:
> + * - before setting the first uprobe in p's process
> + * - we're in uprobe_report_clone() and p is the newly added thread
> + * Returns:
> + * - pointer to new uprobe_task on success
> + * - NULL if t dies before we can utrace_attach it
> + * - negative errno otherwise
> + */
> +static struct uprobe_task *uprobe_add_task(struct pid *p,
> + struct uprobe_process *uproc)
> +{
> + struct uprobe_task *utask;
> + struct utrace_engine *engine;
> + struct task_struct *t = pid_task(p, PIDTYPE_PID);

What keeps the task_struct referenced by "t" from disappearing at this
point?

> +
> + if (!t)
> + return NULL;
> + utask = kzalloc(sizeof *utask, GFP_USER);
> + if (unlikely(utask == NULL))
> + return ERR_PTR(-ENOMEM);
> +
> + utask->pid = p;
> + utask->tsk = t;
> + utask->state = UPTASK_RUNNING;
> + utask->quiescing = false;
> + utask->uproc = uproc;
> + utask->active_probe = NULL;
> + utask->doomed = false;
> + INIT_LIST_HEAD(&utask->deferred_registrations);
> + INIT_LIST_HEAD(&utask->delayed_signals);
> + INIT_LIST_HEAD(&utask->list);
> + list_add_tail(&utask->list, &uproc->thread_list);
> + uprobe_hash_utask(utask);
> +
> + engine = utrace_attach_pid(p, UTRACE_ATTACH_CREATE,
> + p_uprobe_utrace_ops, utask);
> + if (IS_ERR(engine)) {
> + long err = PTR_ERR(engine);
> + printk("uprobes: utrace_attach_task failed, returned %ld\n",
> + err);
> + uprobe_free_task(utask, 0);
> + if (err == -ESRCH)
> + return NULL;
> + return ERR_PTR(err);
> + }
> + utask->engine = engine;
> + /*
> + * Always watch for traps, clones, execs and exits. Caller must
> + * set any other engine flags.
> + */
> + utask_adjust_flags(utask, UPROBE_SET_FLAGS,
> + UTRACE_EVENT(SIGNAL) | UTRACE_EVENT(SIGNAL_IGN) |
> + UTRACE_EVENT(SIGNAL_CORE) | UTRACE_EVENT(EXEC) |
> + UTRACE_EVENT(CLONE) | UTRACE_EVENT(EXIT));
> + /*
> + * Note that it's OK if t dies just after utrace_attach, because
> + * with the engine in place, the appropriate report_* callback
> + * should handle it after we release uproc->rwsem.
> + */
> + utrace_engine_put(engine);
> + return utask;
> +}
> +
> +/*
> + * start_pid is the pid for a thread in the probed process. Find the
> + * next thread that doesn't have a corresponding uprobe_task yet. Return
> + * a ref-counted pid for that task, if any, else NULL.
> + */
> +static struct pid *find_next_thread_to_add(struct uprobe_process *uproc,
> + struct pid *start_pid)
> +{
> + struct task_struct *t, *start;
> + struct uprobe_task *utask;
> + struct pid *pid = NULL;
> +
> + rcu_read_lock();
> + start = pid_task(start_pid, PIDTYPE_PID);
> + t = start;
> + if (t) {
> + do {
> + if (unlikely(t->flags & PF_EXITING))
> + goto dont_add;
> + list_for_each_entry(utask, &uproc->thread_list, list) {

Doesn't this need to be list_for_each_entry_rcu()?

Or do you have ->thread_list protected elsewise?

> + if (utask->tsk == t)
> + /* Already added */
> + goto dont_add;
> + }
> + /* Found thread/task to add. */
> + pid = get_pid(task_pid(t));
> + break;
> +dont_add:
> + t = next_thread(t);
> + } while (t != start);
> + }
> + rcu_read_unlock();

Now that we are outside of rcu_read_lock()'s protection, the task
indicated by "pid" might disappear, and the value of "pid" might well
be reused. Is this really OK?

> + return pid;
> +}
> +
> +/* Runs with uproc_mutex held; returns with uproc->rwsem write-locked. */
> +static struct uprobe_process *uprobe_mk_process(struct pid *tg_leader)
> +{
> + struct uprobe_process *uproc;
> + struct uprobe_task *utask;
> + struct pid *add_me;
> + int i;
> + long err;
> +
> + uproc = kzalloc(sizeof *uproc, GFP_USER);
> + if (unlikely(uproc == NULL))
> + return ERR_PTR(-ENOMEM);
> +
> + /* Initialize fields */
> + atomic_set(&uproc->refcount, 1);
> + init_rwsem(&uproc->rwsem);
> + down_write(&uproc->rwsem);
> + init_waitqueue_head(&uproc->waitq);
> + for (i = 0; i < UPROBE_TABLE_SIZE; i++)
> + INIT_HLIST_HEAD(&uproc->uprobe_table[i]);
> + INIT_LIST_HEAD(&uproc->pending_uprobes);
> + INIT_LIST_HEAD(&uproc->thread_list);
> + uproc->nthreads = 0;
> + uproc->n_quiescent_threads = 0;
> + INIT_HLIST_NODE(&uproc->hlist);
> + uproc->tg_leader = get_pid(tg_leader);
> + uproc->tgid = pid_task(tg_leader, PIDTYPE_PID)->tgid;
> + uproc->finished = false;
> +
> +#ifdef CONFIG_UBP_XOL
> + if (!(ubp_strategies & UBP_HNT_INLINE))
> + uproc->sstep_out_of_line = true;
> + else
> +#endif
> + uproc->sstep_out_of_line = false;
> +
> + /*
> + * Create and populate one utask per thread in this process. We
> + * can't call uprobe_add_task() while holding RCU lock, so we:
> + * 1. rcu_read_lock()
> + * 2. Find the next thread, add_me, in this process that's not
> + * already on uproc's thread_list.
> + * 3. rcu_read_unlock()
> + * 4. uprobe_add_task(add_me, uproc)
> + * Repeat 1-4 'til we have utasks for all threads.
> + */
> + add_me = tg_leader;
> + while ((add_me = find_next_thread_to_add(uproc, add_me)) != NULL) {
> + utask = uprobe_add_task(add_me, uproc);
> + if (IS_ERR(utask)) {
> + err = PTR_ERR(utask);
> + goto fail;
> + }
> + if (utask)
> + uproc->nthreads++;
> + }
> +
> + if (uproc->nthreads == 0) {
> + /* All threads -- even p -- are dead. */
> + err = -ESRCH;
> + goto fail;
> + }
> + return uproc;
> +
> +fail:
> + uprobe_free_process(uproc, 0);
> + return ERR_PTR(err);
> +}
> +
> +/*
> + * Creates a uprobe_probept and connects it to uk and uproc. Runs with
> + * uproc->rwsem write-locked.
> + */
> +static struct uprobe_probept *uprobe_add_probept(struct uprobe_kimg *uk,
> + struct uprobe_process *uproc)
> +{
> + struct uprobe_probept *ppt;
> +
> + ppt = kzalloc(sizeof *ppt, GFP_USER);
> + if (unlikely(ppt == NULL))
> + return ERR_PTR(-ENOMEM);
> + init_waitqueue_head(&ppt->waitq);
> + init_waitqueue_head(&ppt->ssilq);
> + spin_lock_init(&ppt->ssil_lock);
> + ppt->ssil_state = SSIL_CLEAR;
> +
> + /* Connect to uk. */
> + INIT_LIST_HEAD(&ppt->uprobe_list);
> + list_add_tail(&uk->list, &ppt->uprobe_list);
> + uk->ppt = ppt;
> + uk->status = -EBUSY;
> + ppt->ubp.vaddr = uk->uprobe->vaddr;
> + ppt->ubp.xol_vaddr = 0;
> +
> + /* Connect to uproc. */
> + if (!uproc->sstep_out_of_line)
> + ppt->ubp.strategy = UBP_HNT_INLINE;
> + else
> + ppt->ubp.strategy = ubp_strategies;
> + ppt->state = UPROBE_INSERTING;
> + ppt->uproc = uproc;
> + INIT_LIST_HEAD(&ppt->pd_node);
> + list_add_tail(&ppt->pd_node, &uproc->pending_uprobes);
> + INIT_HLIST_NODE(&ppt->ut_node);
> + hlist_add_head(&ppt->ut_node,
> + &uproc->uprobe_table[hash_long(ppt->ubp.vaddr,
> + UPROBE_HASH_BITS)]);
> + uprobe_get_process(uproc);
> + return ppt;
> +}
> +
> +/*
> + * Runs with ppt->uproc write-locked. Frees ppt and decrements the ref
> + * count on ppt->uproc (but ref count shouldn't hit 0).
> + */
> +static void uprobe_free_probept(struct uprobe_probept *ppt)
> +{
> + struct uprobe_process *uproc = ppt->uproc;
> +
> + xol_free_insn_slot(ppt->ubp.xol_vaddr, uproc->xol_area);
> + hlist_del(&ppt->ut_node);
> + kfree(ppt);
> + uprobe_decref_process(uproc);
> +}
> +
> +static void uprobe_free_kimg(struct uprobe_kimg *uk)
> +{
> + uk->uprobe->kdata = NULL;
> + kfree(uk);
> +}
> +
> +/*
> + * Runs with uprobe_process write-locked.
> + * Note that we never free uk->uprobe, because the user owns that.
> + */
> +static void purge_uprobe(struct uprobe_kimg *uk)
> +{
> + struct uprobe_probept *ppt = uk->ppt;
> +
> + list_del(&uk->list);
> + uprobe_free_kimg(uk);
> + if (list_empty(&ppt->uprobe_list))
> + uprobe_free_probept(ppt);
> +}
> +
> +/*
> + * Runs with utask->uproc locked: read-locked if called from a uprobe
> + * handler, else write-locked.
> + * Returns -EINPROGRESS on success.
> + * Returns -EBUSY if an identical deferred request already exists.
> + * Returns 0 if the new request cancels an existing opposite request
> + * (equivalent to a successful register + unregister).
> + */
> +static int defer_registration(struct uprobe *u, int regflag,
> + struct uprobe_task *utask)
> +{
> + struct deferred_registration *dr, *d;
> +
> + /* Check if we already have such a defer request */
> + list_for_each_entry_safe(dr, d, &utask->deferred_registrations, list) {
> + if (dr->uprobe == u) {
> + if (dr->regflag != regflag) {
> + /* same as successful register + unregister */
> + list_del(&dr->list);
> + kfree(dr);
> + return 0;
> + } else
> + /* we already have an identical request */
> + return -EBUSY;
> + }
> + }
> +
> + /* We have a new unique request */
> + dr = kmalloc(sizeof(struct deferred_registration), GFP_USER);
> + if (!dr)
> + return -ENOMEM;
> + dr->uprobe = u;
> + dr->regflag = regflag;
> + INIT_LIST_HEAD(&dr->list);
> + list_add_tail(&dr->list, &utask->deferred_registrations);
> + return -EINPROGRESS;
> +}
> +
> +/*
> + * Given a numeric thread ID, return a ref-counted struct pid for the
> + * task-group-leader thread.
> + */
> +static struct pid *uprobe_get_tg_leader(pid_t p)
> +{
> + struct pid *pid = NULL;
> +
> + rcu_read_lock();
> + if (current->nsproxy)
> + pid = find_vpid(p);
> + if (pid) {
> + struct task_struct *t = pid_task(pid, PIDTYPE_PID);
> + if (t)
> + pid = task_tgid(t);
> + else
> + pid = NULL;
> + }
> + rcu_read_unlock();

What happens if the thread disappears at this point? We are outside of
rcu_read_lock() protection, so all the structures could potentially be
freed up by other CPUs, especially if this CPU takes an interrupt or is
preempted.

> + return get_pid(pid); /* null pid OK here */
> +}
> +
> +/* See Documentation/uprobes.txt. */
> +int register_uprobe(struct uprobe *u)
> +{
> + struct uprobe_task *cur_utask, *cur_utask_quiescing = NULL;
> + struct uprobe_process *uproc;
> + struct uprobe_probept *ppt;
> + struct uprobe_kimg *uk;
> + struct pid *p;
> + int ret = 0, uproc_is_new = 0;
> + bool survivors;
> +#ifndef CONFIG_UBP_XOL
> + struct task_struct *tsk;
> +#endif
> +
> + if (!u || !u->handler)
> + return -EINVAL;
> +
> + p = uprobe_get_tg_leader(u->pid);
> + if (!p)
> + return -ESRCH;
> +
> + cur_utask = uprobe_find_utask(current);
> + if (cur_utask && cur_utask->active_probe) {
> + /*
> + * Called from handler; cur_utask->uproc is read-locked.
> + * Do this registration later.
> + */
> + put_pid(p);
> + return defer_registration(u, 1, cur_utask);
> + }
> +
> + /* Get the uprobe_process for this pid, or make a new one. */
> + mutex_lock(&uproc_mutex);
> + uproc = uprobe_find_process(p);
> +
> + if (uproc) {
> + struct uprobe_task *utask;
> +
> + mutex_unlock(&uproc_mutex);
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + if (!utask->active_probe)
> + continue;
> + /*
> + * utask is at a probepoint, but has dropped
> + * uproc->rwsem to single-step. If utask is
> + * stopped, then it's probably because some
> + * other engine has asserted UTRACE_STOP;
> + * that engine may not allow UTRACE_RESUME
> + * until register_uprobe() returns. But, for
> + * reasons we won't go into here, utask wants
> + * to finish with utask->active_probe before
> + * allowing handle_pending_uprobes() to run
> + * (via utask_fake_quiesce()). So we defer this
> + * registration operation; it will be run after
> + * utask->active_probe is taken care of.
> + */
> + BUG_ON(utask->state != UPTASK_SSTEP);
> + if (task_is_stopped_or_traced(utask->tsk)) {
> + ret = defer_registration(u, 1, utask);
> + goto fail_uproc;
> + }
> + }
> + } else {
> + uproc = uprobe_mk_process(p);
> + if (IS_ERR(uproc)) {
> + ret = (int) PTR_ERR(uproc);
> + mutex_unlock(&uproc_mutex);
> + goto fail_tsk;
> + }
> + /* Hold uproc_mutex until we've added uproc to uproc_table. */
> + uproc_is_new = 1;
> + }
> +
> +#ifdef CONFIG_UBP_XOL
> + ret = xol_validate_vaddr(p, u->vaddr, uproc->xol_area);
> +#else
> + tsk = pid_task(p, PIDTYPE_PID);
> + ret = ubp_validate_insn_addr(tsk, u->vaddr);
> +#endif
> + if (ret < 0)
> + goto fail_uproc;
> +
> + if (u->kdata) {
> + /*
> + * Probe is already/still registered. This is the only
> + * place we return -EBUSY to the user.
> + */
> + ret = -EBUSY;
> + goto fail_uproc;
> + }
> +
> + uk = uprobe_mk_kimg(u);
> + if (IS_ERR(uk)) {
> + ret = (int) PTR_ERR(uk);
> + goto fail_uproc;
> + }
> +
> + /* See if we already have a probepoint at the vaddr. */
> + ppt = (uproc_is_new ? NULL : uprobe_find_probept(uproc, u->vaddr));
> + if (ppt) {
> + /* Breakpoint is already in place, or soon will be. */
> + uk->ppt = ppt;
> + list_add_tail(&uk->list, &ppt->uprobe_list);
> + switch (ppt->state) {
> + case UPROBE_INSERTING:
> + uk->status = -EBUSY; /* in progress */
> + if (uproc->tg_leader == task_tgid(current)) {
> + cur_utask_quiescing = cur_utask;
> + BUG_ON(!cur_utask_quiescing);
> + }
> + break;
> + case UPROBE_REMOVING:
> + /* Wait! Don't remove that bkpt after all! */
> + ppt->state = UPROBE_BP_SET;
> + /* Remove from pending list. */
> + list_del(&ppt->pd_node);
> + /* Wake unregister_uprobe(). */
> + wake_up_all(&ppt->waitq);
> + /*FALLTHROUGH*/
> + case UPROBE_BP_SET:
> + uk->status = 0;
> + break;
> + default:
> + BUG();
> + }
> + up_write(&uproc->rwsem);
> + put_pid(p);
> + if (uk->status == 0) {
> + uprobe_decref_process(uproc);
> + return 0;
> + }
> + goto await_bkpt_insertion;
> + } else {
> + ppt = uprobe_add_probept(uk, uproc);
> + if (IS_ERR(ppt)) {
> + ret = (int) PTR_ERR(ppt);
> + goto fail_uk;
> + }
> + }
> +
> + if (uproc_is_new) {
> + hlist_add_head(&uproc->hlist,
> + &uproc_table[hash_ptr(uproc->tg_leader,
> + UPROBE_HASH_BITS)]);
> + mutex_unlock(&uproc_mutex);
> + }
> + put_pid(p);
> + survivors = quiesce_all_threads(uproc, &cur_utask_quiescing);
> +
> + if (!survivors) {
> + purge_uprobe(uk);
> + up_write(&uproc->rwsem);
> + uprobe_put_process(uproc, false);
> + return -ESRCH;
> + }
> + up_write(&uproc->rwsem);
> +
> +await_bkpt_insertion:
> + if (cur_utask_quiescing)
> + /* Current task is probing its own process. */
> + (void) utask_fake_quiesce(cur_utask_quiescing);
> + else
> + wait_event(ppt->waitq, ppt->state != UPROBE_INSERTING);
> + ret = uk->status;
> + if (ret != 0) {
> + down_write(&uproc->rwsem);
> + purge_uprobe(uk);
> + up_write(&uproc->rwsem);
> + }
> + uprobe_put_process(uproc, false);
> + return ret;
> +
> +fail_uk:
> + uprobe_free_kimg(uk);
> +
> +fail_uproc:
> + if (uproc_is_new) {
> + uprobe_free_process(uproc, 0);
> + mutex_unlock(&uproc_mutex);
> + } else {
> + up_write(&uproc->rwsem);
> + uprobe_put_process(uproc, false);
> + }
> +
> +fail_tsk:
> + put_pid(p);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(register_uprobe);
> +
> +/* See Documentation/uprobes.txt. */
> +void unregister_uprobe(struct uprobe *u)
> +{
> + struct pid *p;
> + struct uprobe_process *uproc;
> + struct uprobe_kimg *uk;
> + struct uprobe_probept *ppt;
> + struct uprobe_task *cur_utask, *cur_utask_quiescing = NULL;
> + struct uprobe_task *utask;
> +
> + if (!u)
> + return;
> + p = uprobe_get_tg_leader(u->pid);
> + if (!p)
> + return;
> +
> + cur_utask = uprobe_find_utask(current);
> + if (cur_utask && cur_utask->active_probe) {
> + /* Called from handler; uproc is read-locked; do this later */
> + put_pid(p);
> + (void) defer_registration(u, 0, cur_utask);
> + return;
> + }
> +
> + /*
> + * Lock uproc before walking the graph, in case the process we're
> + * probing is exiting.
> + */
> + mutex_lock(&uproc_mutex);
> + uproc = uprobe_find_process(p);
> + mutex_unlock(&uproc_mutex);
> + put_pid(p);
> + if (!uproc)
> + return;
> +
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + if (!utask->active_probe)
> + continue;
> +
> + /* See comment in register_uprobe(). */
> + BUG_ON(utask->state != UPTASK_SSTEP);
> + if (task_is_stopped_or_traced(utask->tsk)) {
> + (void) defer_registration(u, 0, utask);
> + goto done;
> + }
> + }
> + uk = (struct uprobe_kimg *)u->kdata;
> + if (!uk)
> + /*
> + * This probe was never successfully registered, or
> + * has already been unregistered.
> + */
> + goto done;
> + if (uk->status == -EBUSY)
> + /* Looks like register or unregister is already in progress. */
> + goto done;
> + ppt = uk->ppt;
> +
> + list_del(&uk->list);
> + uprobe_free_kimg(uk);
> +
> + if (!list_empty(&ppt->uprobe_list))
> + goto done;
> +
> + /*
> + * The last uprobe at ppt's probepoint is being unregistered.
> + * Queue the breakpoint for removal.
> + */
> + ppt->state = UPROBE_REMOVING;
> + list_add_tail(&ppt->pd_node, &uproc->pending_uprobes);
> +
> + (void) quiesce_all_threads(uproc, &cur_utask_quiescing);
> + up_write(&uproc->rwsem);
> + if (cur_utask_quiescing)
> + /* Current task is probing its own process. */
> + (void) utask_fake_quiesce(cur_utask_quiescing);
> + else
> + wait_event(ppt->waitq, ppt->state != UPROBE_REMOVING);
> +
> + if (likely(ppt->state == UPROBE_DISABLED)) {
> + down_write(&uproc->rwsem);
> + uprobe_free_probept(ppt);
> + /* else somebody else's register_uprobe() resurrected ppt. */
> + up_write(&uproc->rwsem);
> + }
> + uprobe_put_process(uproc, false);
> + return;
> +
> +done:
> + up_write(&uproc->rwsem);
> + uprobe_put_process(uproc, false);
> +}
> +EXPORT_SYMBOL_GPL(unregister_uprobe);
> +
> +/* Find a surviving thread in uproc. Runs with uproc->rwsem locked. */
> +static struct task_struct *find_surviving_thread(struct uprobe_process *uproc)
> +{
> + struct uprobe_task *utask;
> +
> + list_for_each_entry(utask, &uproc->thread_list, list) {
> + if (!(utask->tsk->flags & PF_EXITING))
> + return utask->tsk;
> + }
> + return NULL;
> +}
> +
> +/*
> + * Run all the deferred_registrations previously queued by the current utask.
> + * Runs with no locks or mutexes held. The current utask's uprobe_process
> + * is ref-counted, so it won't disappear as the result of unregister_u*probe()
> + * called here.
> + */
> +static void uprobe_run_def_regs(struct list_head *drlist)
> +{
> + struct deferred_registration *dr, *d;
> +
> + list_for_each_entry_safe(dr, d, drlist, list) {
> + int result = 0;
> + struct uprobe *u = dr->uprobe;
> +
> + if (dr->regflag)
> + result = register_uprobe(u);
> + else
> + unregister_uprobe(u);
> + if (u && u->registration_callback)
> + u->registration_callback(u, dr->regflag, result);
> + list_del(&dr->list);
> + kfree(dr);
> + }
> +}
> +
> +/*
> + * utrace engine report callbacks
> + */
> +
> +/*
> + * We've been asked to quiesce, but aren't in a position to do so.
> + * This could happen in either of the following cases:
> + *
> + * 1) Our own thread is doing a register or unregister operation --
> + * e.g., as called from a uprobe handler or a non-uprobes utrace
> + * callback. We can't wait_event() for ourselves in [un]register_uprobe().
> + *
> + * 2) We've been asked to quiesce, but we hit a probepoint first. Now
> + * we're in the report_signal callback, having handled the probepoint.
> + * We'd like to just turn on UTRACE_EVENT(QUIESCE) and coast into
> + * quiescence. Unfortunately, it's possible to hit a probepoint again
> + * before we quiesce. When processing the SIGTRAP, utrace would call
> + * uprobe_report_quiesce(), which must decline to take any action so
> + * as to avoid removing the uprobe just hit. As a result, we could
> + * keep hitting breakpoints and never quiescing.
> + *
> + * So here we do essentially what we'd prefer to do in uprobe_report_quiesce().
> + * If we're the last thread to quiesce, run handle_pending_uprobes() and
> + * rouse_all_threads(). Otherwise, pretend we're quiescent and sleep until
> + * the last quiescent thread handles that stuff and then wakes us.
> + *
> + * Called and returns with no mutexes held. Returns 1 if we free utask->uproc,
> + * else 0.
> + */
> +static int utask_fake_quiesce(struct uprobe_task *utask)
> +{
> + struct uprobe_process *uproc = utask->uproc;
> + enum uprobe_task_state prev_state = utask->state;
> +
> + down_write(&uproc->rwsem);
> +
> + /* In case we're somehow set to quiesce for real... */
> + clear_utrace_quiesce(utask, false);
> +
> + if (uproc->n_quiescent_threads == uproc->nthreads-1) {
> + /* We're the last thread to "quiesce." */
> + handle_pending_uprobes(uproc, utask->tsk);
> + rouse_all_threads(uproc);
> + up_write(&uproc->rwsem);
> + return 0;
> + } else {
> + utask->state = UPTASK_SLEEPING;
> + uproc->n_quiescent_threads++;
> + up_write(&uproc->rwsem);
> + /* We ref-count sleepers. */
> + uprobe_get_process(uproc);
> +
> + wait_event(uproc->waitq, !utask->quiescing);
> +
> + down_write(&uproc->rwsem);
> + utask->state = prev_state;
> + uproc->n_quiescent_threads--;
> + up_write(&uproc->rwsem);
> +
> + /*
> + * If uproc's last uprobe has been unregistered, and
> + * unregister_uprobe() woke up before we did, it's up
> + * to us to free uproc.
> + */
> + return uprobe_put_process(uproc, false);
> + }
> +}
> +
> +/* Prepare to single-step ppt's probed instruction inline. */
> +static void uprobe_pre_ssin(struct uprobe_task *utask,
> + struct uprobe_probept *ppt, struct pt_regs *regs)
> +{
> + unsigned long flags;
> +
> + if (unlikely(ppt->ssil_state == SSIL_DISABLE)) {
> + reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
> + return;
> + }
> + spin_lock_irqsave(&ppt->ssil_lock, flags);
> + while (ppt->ssil_state == SSIL_SET) {
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> + up_read(&utask->uproc->rwsem);
> + wait_event(ppt->ssilq, ppt->ssil_state != SSIL_SET);
> + down_read(&utask->uproc->rwsem);
> + spin_lock_irqsave(&ppt->ssil_lock, flags);
> + }
> + if (unlikely(ppt->ssil_state == SSIL_DISABLE)) {
> + /*
> + * While waiting to single step inline, breakpoint has
> + * been removed. Thread continues as if nothing happened.
> + */
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> + reset_thread_ip(utask->tsk, regs, ppt->ubp.vaddr);
> + return;
> + }
> + ppt->ssil_state = SSIL_SET;
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> +
> + if (unlikely(ubp_pre_sstep(utask->tsk, &ppt->ubp,
> + &utask->arch_info, regs) != 0)) {
> + printk(KERN_ERR "Failed to temporarily restore original "
> + "instruction for single-stepping: "
> + "pid/tgid=%d/%d, vaddr=%#lx\n",
> + utask->tsk->pid, utask->tsk->tgid, ppt->ubp.vaddr);
> + utask->doomed = true;
> + }
> +}
> +
> +/* Prepare to continue execution after single-stepping inline. */
> +static void uprobe_post_ssin(struct uprobe_task *utask,
> + struct uprobe_probept *ppt, struct pt_regs *regs)
> +{
> + unsigned long flags;
> +
> + if (unlikely(ubp_post_sstep(utask->tsk, &ppt->ubp,
> + &utask->arch_info, regs) != 0))
> + printk("Couldn't restore bp: pid/tgid=%d/%d, addr=%#lx\n",
> + utask->tsk->pid, utask->tsk->tgid, ppt->ubp.vaddr);
> + spin_lock_irqsave(&ppt->ssil_lock, flags);
> + if (likely(ppt->ssil_state == SSIL_SET)) {
> + ppt->ssil_state = SSIL_CLEAR;
> + wake_up(&ppt->ssilq);
> + }
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> +}
> +
> +#ifdef CONFIG_UBP_XOL
> +/*
> + * This architecture wants to do single-stepping out of line, but now we've
> + * discovered that it can't -- typically because we couldn't set up the XOL
> + * vma. Make all probepoints use inline single-stepping.
> + */
> +static void uproc_cancel_xol(struct uprobe_process *uproc)
> +{
> + down_write(&uproc->rwsem);
> + if (likely(uproc->sstep_out_of_line)) {
> + /* No other task beat us to it. */
> + int i;
> + struct uprobe_probept *ppt;
> + struct hlist_node *node;
> + struct hlist_head *head;
> + for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
> + head = &uproc->uprobe_table[i];
> + hlist_for_each_entry(ppt, node, head, ut_node) {
> + if (!(ppt->ubp.strategy & UBP_HNT_INLINE))
> + ubp_cancel_xol(current, &ppt->ubp);
> + }
> + }
> + /* Do this last, so other tasks don't proceed too soon. */
> + uproc->sstep_out_of_line = false;
> + }
> + up_write(&uproc->rwsem);
> +}
> +
> +/* Prepare to single-step ppt's probed instruction out of line. */
> +static int uprobe_pre_ssout(struct uprobe_task *utask,
> + struct uprobe_probept *ppt, struct pt_regs *regs)
> +{
> + if (!ppt->ubp.xol_vaddr)
> + ppt->ubp.xol_vaddr = xol_get_insn_slot(&ppt->ubp,
> + ppt->uproc->xol_area);
> + if (unlikely(!ppt->ubp.xol_vaddr)) {
> + ubp_cancel_xol(utask->tsk, &ppt->ubp);
> + return -1;
> + }
> + utask->singlestep_addr = ppt->ubp.xol_vaddr;
> + return ubp_pre_sstep(utask->tsk, &ppt->ubp, &utask->arch_info, regs);
> +}
> +
> +/* Prepare to continue execution after single-stepping out of line. */
> +static int uprobe_post_ssout(struct uprobe_task *utask,
> + struct uprobe_probept *ppt, struct pt_regs *regs)
> +{
> + int ret;
> +
> + ret = ubp_post_sstep(utask->tsk, &ppt->ubp, &utask->arch_info, regs);
> + return ret;
> +}
> +#endif
> +
> +/*
> + * If this thread is supposed to be quiescing, mark it quiescent; and
> + * if it was the last thread to quiesce, do the work we quiesced for.
> + * Runs with utask->uproc->rwsem write-locked. Returns true if we can
> + * let this thread resume.
> + */
> +static bool utask_quiesce(struct uprobe_task *utask)
> +{
> + if (utask->quiescing) {
> + if (utask->state != UPTASK_QUIESCENT) {
> + utask->state = UPTASK_QUIESCENT;
> + utask->uproc->n_quiescent_threads++;
> + }
> + return check_uproc_quiesced(utask->uproc, current);
> + } else {
> + clear_utrace_quiesce(utask, false);
> + return true;
> + }
> +}
> +
> +/*
> + * Delay delivery of the indicated signal until after single-step.
> + * Otherwise single-stepping will be cancelled as part of calling
> + * the signal handler.
> + */
> +static void uprobe_delay_signal(struct uprobe_task *utask, siginfo_t *info)
> +{
> + struct delayed_signal *ds;
> +
> + ds = kmalloc(sizeof(*ds), GFP_USER);
> + if (ds) {
> + ds->info = *info;
> + INIT_LIST_HEAD(&ds->list);
> + list_add_tail(&ds->list, &utask->delayed_signals);
> + }
> +}
> +
> +static void uprobe_inject_delayed_signals(struct list_head *delayed_signals)
> +{
> + struct delayed_signal *ds, *tmp;
> +
> + list_for_each_entry_safe(ds, tmp, delayed_signals, list) {
> + send_sig_info(ds->info.si_signo, &ds->info, current);
> + list_del(&ds->list);
> + kfree(ds);
> + }
> +}
> +
> +/*
> + * Verify from the instruction pointer whether the single-step has
> + * indeed occurred. If it has, do the post-single-step fix-ups.
> + */
> +static bool validate_and_post_sstep(struct uprobe_task *utask,
> + struct pt_regs *regs,
> + struct uprobe_probept *ppt)
> +{
> + unsigned long vaddr = instruction_pointer(regs);
> +
> + if (ppt->ubp.strategy & UBP_HNT_INLINE) {
> + /*
> + * If we have single-stepped, the instruction pointer cannot
> + * be the same as the virtual address of the probepoint.
> + */
> + if (vaddr == ppt->ubp.vaddr)
> + return false;
> + uprobe_post_ssin(utask, ppt, regs);
> +#ifdef CONFIG_UBP_XOL
> + } else {
> + /*
> + * If we have executed out of line, the instruction pointer
> + * cannot be the same as the virtual address of the XOL slot.
> + */
> + if (vaddr == ppt->ubp.xol_vaddr)
> + return false;
> + uprobe_post_ssout(utask, ppt, regs);
> +#endif
> + }
> + return true;
> +}
> +
> +/*
> + * Helper routine for uprobe_report_signal().
> + * We get called here with:
> + * state = UPTASK_RUNNING => we are here due to a breakpoint hit
> + * - Read-lock the process
> + * - Figure out which probepoint, based on regs->IP
> + * - Set state = UPTASK_BP_HIT
> + * - Invoke handler for each uprobe at this probepoint
> + * - Reset regs->IP to beginning of the insn, if necessary
> + * - Start watching for quiesce events, in case another
> + * engine cancels our UTRACE_SINGLESTEP with a
> + * UTRACE_STOP.
> + * - Set singlestep in motion (UTRACE_SINGLESTEP),
> + * with state = UPTASK_SSTEP
> + * - Read-unlock the process
> + *
> + * state = UPTASK_SSTEP => here after single-stepping
> + * - Read-lock the process
> + * - Validate we are here per the state machine
> + * - Clean up after single-stepping
> + * - Set state = UPTASK_RUNNING
> + * - Read-unlock the process
> + * - If it's time to quiesce, take appropriate action.
> + * - If the handler(s) we ran called [un]register_uprobe(),
> + * complete those via uprobe_run_def_regs().
> + *
> + * state = ANY OTHER STATE
> + * - Not our signal, pass it on (UTRACE_RESUME)
> + */
> +static u32 uprobe_handle_signal(u32 action,
> + struct uprobe_task *utask,
> + struct pt_regs *regs,
> + siginfo_t *info,
> + const struct k_sigaction *orig_ka)
> +{
> + struct uprobe_probept *ppt;
> + struct uprobe_process *uproc;
> + struct uprobe_kimg *uk;
> + unsigned long probept;
> + enum utrace_resume_action resume_action;
> + enum utrace_signal_action signal_action = utrace_signal_action(action);
> +
> + uproc = utask->uproc;
> +
> + /*
> + * We may need to re-assert UTRACE_SINGLESTEP if this signal
> + * is not associated with the breakpoint.
> + */
> + if (utask->state == UPTASK_SSTEP)
> + resume_action = UTRACE_SINGLESTEP;
> + else
> + resume_action = UTRACE_RESUME;
> + /*
> + * This might be UTRACE_SIGNAL_REPORT request but some other
> + * engine's callback might have changed the signal action to
> + * something other than UTRACE_SIGNAL_REPORT. Use orig_ka to figure
> + * out such cases.
> + */
> + if (unlikely(signal_action == UTRACE_SIGNAL_REPORT) || !orig_ka) {
> + /* This thread was quiesced using UTRACE_INTERRUPT. */
> + bool done_quiescing;
> + if (utask->active_probe)
> + /*
> + * We'll fake quiescence after we're done
> + * processing the probepoint.
> + */
> + return UTRACE_SIGNAL_IGN | resume_action;
> +
> + down_write(&uproc->rwsem);
> + done_quiescing = utask_quiesce(utask);
> + up_write(&uproc->rwsem);
> + if (done_quiescing)
> + resume_action = UTRACE_RESUME;
> + else
> + resume_action = UTRACE_STOP;
> + return UTRACE_SIGNAL_IGN | resume_action;
> + }
> +
> + /*
> + * info will be null if we're called with action=UTRACE_SIGNAL_HANDLER,
> + * which means that single-stepping has been disabled so a signal
> + * handler can be called in the probed process. That should never
> + * happen because we intercept and delay handled signals (action =
> + * UTRACE_RESUME) until after we're done single-stepping.
> + */
> + BUG_ON(!info);
> + if (signal_action == UTRACE_SIGNAL_DELIVER && utask->active_probe &&
> + info->si_signo != SSTEP_SIGNAL) {
> + uprobe_delay_signal(utask, info);
> + return UTRACE_SIGNAL_IGN | UTRACE_SINGLESTEP;
> + }
> +
> + if (info->si_signo != BREAKPOINT_SIGNAL &&
> + info->si_signo != SSTEP_SIGNAL)
> + goto no_interest;
> +
> + switch (utask->state) {
> + case UPTASK_RUNNING:
> + if (info->si_signo != BREAKPOINT_SIGNAL)
> + goto no_interest;
> +
> +#ifdef CONFIG_UBP_XOL
> + /*
> + * Set up the XOL area if it's not already there. We
> + * do this here because we have to do it before
> + * handling the first probepoint hit, the probed
> + * process has to do it, and this may be the first
> + * time our probed process runs uprobes code. We need
> + * the XOL area for the uretprobe trampoline even if
> + * this architecture doesn't single-step out of line.
> + */
> + if (uproc->sstep_out_of_line && !uproc->xol_area) {
> + uproc->xol_area = xol_get_area(uproc->tg_leader);
> + if (unlikely(uproc->sstep_out_of_line) &&
> + unlikely(!uproc->xol_area))
> + uproc_cancel_xol(uproc);
> + }
> +#endif
> +
> + down_read(&uproc->rwsem);
> + /* Don't quiesce while running handlers. */
> + clear_utrace_quiesce(utask, false);
> + probept = ubp_get_bkpt_addr(regs);
> + ppt = uprobe_find_probept(uproc, probept);
> + if (!ppt) {
> + up_read(&uproc->rwsem);
> + goto no_interest;
> + }
> + utask->active_probe = ppt;
> + utask->state = UPTASK_BP_HIT;
> +
> + if (likely(ppt->state == UPROBE_BP_SET)) {
> + list_for_each_entry(uk, &ppt->uprobe_list, list) {
> + struct uprobe *u = uk->uprobe;
> + if (u->handler)
> + u->handler(u, regs);
> + }
> + }
> +
> +#ifdef CONFIG_UBP_XOL
> + if ((ppt->ubp.strategy & UBP_HNT_INLINE) ||
> + uprobe_pre_ssout(utask, ppt, regs) != 0)
> +#endif
> + uprobe_pre_ssin(utask, ppt, regs);
> + if (unlikely(utask->doomed)) {
> + utask->active_probe = NULL;
> + utask->state = UPTASK_RUNNING;
> + up_read(&uproc->rwsem);
> + goto no_interest;
> + }
> + utask->state = UPTASK_SSTEP;
> + /* In case another engine cancels our UTRACE_SINGLESTEP... */
> + utask_adjust_flags(utask, UPROBE_SET_FLAGS,
> + UTRACE_EVENT(QUIESCE));
> + /* Don't deliver this signal to the process. */
> + resume_action = UTRACE_SINGLESTEP;
> + signal_action = UTRACE_SIGNAL_IGN;
> +
> + up_read(&uproc->rwsem);
> + break;
> +
> + case UPTASK_SSTEP:
> + if (info->si_signo != SSTEP_SIGNAL)
> + goto no_interest;
> +
> + down_read(&uproc->rwsem);
> + ppt = utask->active_probe;
> + BUG_ON(!ppt);
> +
> + /*
> + * Haven't single-stepped yet? Then re-assert
> + * UTRACE_SINGLESTEP.
> + */
> + if (!validate_and_post_sstep(utask, regs, ppt)) {
> + up_read(&uproc->rwsem);
> + goto no_interest;
> + }
> +
> + /* No further need to re-assert UTRACE_SINGLESTEP. */
> + clear_utrace_quiesce(utask, false);
> +
> + utask->active_probe = NULL;
> + utask->state = UPTASK_RUNNING;
> + if (unlikely(utask->doomed)) {
> + up_read(&uproc->rwsem);
> + goto no_interest;
> + }
> +
> + if (utask->quiescing) {
> + int uproc_freed;
> + up_read(&uproc->rwsem);
> + uproc_freed = utask_fake_quiesce(utask);
> + BUG_ON(uproc_freed);
> + } else
> + up_read(&uproc->rwsem);
> +
> + /*
> + * We hold a ref count on uproc, so this should never
> + * make utask or uproc disappear.
> + */
> + uprobe_run_def_regs(&utask->deferred_registrations);
> +
> + uprobe_inject_delayed_signals(&utask->delayed_signals);
> +
> + resume_action = UTRACE_RESUME;
> + signal_action = UTRACE_SIGNAL_IGN;
> + break;
> + default:
> + goto no_interest;
> + }
> +
> +no_interest:
> + return signal_action | resume_action;
> +}
> +
> +/*
> + * Signal callback:
> + */
> +static u32 uprobe_report_signal(u32 action,
> + struct utrace_engine *engine,
> + struct pt_regs *regs,
> + siginfo_t *info,
> + const struct k_sigaction *orig_ka,
> + struct k_sigaction *return_ka)
> +{
> + struct uprobe_task *utask;
> + struct uprobe_process *uproc;
> + bool doomed;
> + enum utrace_resume_action report_action;
> +
> + utask = (struct uprobe_task *)rcu_dereference(engine->data);

Are we really in an RCU read-side critical section here?

> + BUG_ON(!utask);
> + uproc = utask->uproc;
> +
> + /* Keep uproc intact until just before we return. */
> + uprobe_get_process(uproc);
> + report_action = uprobe_handle_signal(action, utask, regs, info,
> + orig_ka);
> + doomed = utask->doomed;
> +
> + if (uprobe_put_process(uproc, true))
> + report_action = utrace_signal_action(report_action) |
> + UTRACE_DETACH;
> + if (doomed)
> + do_exit(SIGSEGV);
> + return report_action;
> +}
> +
> +/*
> + * Quiesce callback: The associated process has one or more breakpoint
> + * insertions or removals pending. If we're the last thread in this
> + * process to quiesce, do the insertion(s) and/or removal(s).
> + */
> +static u32 uprobe_report_quiesce(u32 action,
> + struct utrace_engine *engine,
> + unsigned long event)
> +{
> + struct uprobe_task *utask;
> + struct uprobe_process *uproc;
> + bool done_quiescing = false;
> +
> + utask = (struct uprobe_task *)rcu_dereference(engine->data);

Are we really in an RCU read-side critical section here?

> + BUG_ON(!utask);
> +
> + if (utask->state == UPTASK_SSTEP)
> + /*
> + * We got a breakpoint trap and tried to single-step,
> + * but somebody else's report_signal callback overrode
> + * our UTRACE_SINGLESTEP with a UTRACE_STOP. Try again.
> + */
> + return UTRACE_SINGLESTEP;
> +
> + BUG_ON(utask->active_probe);
> + uproc = utask->uproc;
> + down_write(&uproc->rwsem);
> + done_quiescing = utask_quiesce(utask);
> + up_write(&uproc->rwsem);
> + return done_quiescing ? UTRACE_RESUME : UTRACE_STOP;
> +}
> +
> +/*
> + * uproc's process is exiting or exec-ing. Runs with uproc->rwsem
> + * write-locked. Caller must ref-count uproc before calling this
> + * function, to ensure that uproc doesn't get freed in the middle of
> + * this.
> + */
> +static void uprobe_cleanup_process(struct uprobe_process *uproc)
> +{
> + struct hlist_node *pnode1, *pnode2;
> + struct uprobe_kimg *uk, *unode;
> + struct uprobe_probept *ppt;
> + struct hlist_head *head;
> + int i;
> +
> + uproc->finished = true;
> + for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
> + head = &uproc->uprobe_table[i];
> + hlist_for_each_entry_safe(ppt, pnode1, pnode2, head, ut_node) {
> + if (ppt->state == UPROBE_INSERTING ||
> + ppt->state == UPROBE_REMOVING) {
> + /*
> + * This task is (exec/exit)ing with
> + * a [un]register_uprobe pending.
> + * [un]register_uprobe will free ppt.
> + */
> + ppt->state = UPROBE_DISABLED;
> + list_del(&ppt->pd_node);
> + list_for_each_entry_safe(uk, unode,
> + &ppt->uprobe_list, list)
> + uk->status = -ESRCH;
> + wake_up_all(&ppt->waitq);
> + } else if (ppt->state == UPROBE_BP_SET) {
> + list_for_each_entry_safe(uk, unode,
> + &ppt->uprobe_list, list) {
> + list_del(&uk->list);
> + uprobe_free_kimg(uk);
> + }
> + uprobe_free_probept(ppt);
> + /* else */
> + /*
> + * If ppt is UPROBE_DISABLED, assume that
> + * [un]register_uprobe() has been notified
> + * and will free it soon.
> + */
> + }
> + }
> + }
> +}
> +
> +static u32 uprobe_exec_exit(struct utrace_engine *engine,
> + struct task_struct *tsk, int exit)
> +{
> + struct uprobe_process *uproc;
> + struct uprobe_probept *ppt;
> + struct uprobe_task *utask;
> + bool utask_quiescing;
> +
> + utask = (struct uprobe_task *)rcu_dereference(engine->data);

Are we really in an RCU read-side critical section here?

> + uproc = utask->uproc;
> + uprobe_get_process(uproc);
> +
> + ppt = utask->active_probe;
> + if (ppt) {
> + printk(KERN_WARNING "Task handler called %s while at uprobe"
> + " probepoint: pid/tgid = %d/%d, probepoint"
> + " = %#lx\n", (exit ? "exit" : "exec"),
> + tsk->pid, tsk->tgid, ppt->ubp.vaddr);
> + /*
> + * Mutex cleanup depends on where do_execve()/do_exit() was
> + * called and on ubp strategy (XOL vs. SSIL).
> + */
> + if (ppt->ubp.strategy & UBP_HNT_INLINE) {
> + switch (utask->state) {
> + unsigned long flags;
> + case UPTASK_SSTEP:
> + spin_lock_irqsave(&ppt->ssil_lock, flags);
> + ppt->ssil_state = SSIL_CLEAR;
> + wake_up(&ppt->ssilq);
> + spin_unlock_irqrestore(&ppt->ssil_lock, flags);
> + break;
> + default:
> + break;
> + }
> + }
> + if (utask->state == UPTASK_BP_HIT) {
> + /* uprobe handler called do_exit()/do_execve(). */
> + up_read(&uproc->rwsem);
> + uprobe_decref_process(uproc);
> + }
> + }
> +
> + down_write(&uproc->rwsem);
> + utask_quiescing = utask->quiescing;
> + uproc->nthreads--;
> + if (utrace_set_events_pid(utask->pid, engine, 0))
> + /* We don't care. */
> + ;
> + uprobe_free_task(utask, 1);
> + if (uproc->nthreads) {
> + /*
> + * In case other threads are waiting for us to quiesce...
> + */
> + if (utask_quiescing)
> + (void) check_uproc_quiesced(uproc,
> + find_surviving_thread(uproc));
> + } else
> + /*
> + * We were the last remaining thread - clean up the uprobe
> + * remnants a la unregister_uprobe(). We don't have to
> + * remove the breakpoints, though.
> + */
> + uprobe_cleanup_process(uproc);
> +
> + up_write(&uproc->rwsem);
> + uprobe_put_process(uproc, true);
> + return UTRACE_DETACH;
> +}
> +
> +/*
> + * Exit callback: The associated task/thread is exiting.
> + */
> +static u32 uprobe_report_exit(u32 action,
> + struct utrace_engine *engine,
> + long orig_code, long *code)
> +{
> + return uprobe_exec_exit(engine, current, 1);
> +}
> +/*
> + * Clone callback: The current task has spawned a thread/process.
> + * Utrace guarantees that parent and child pointers will be valid
> + * for the duration of this callback.
> + *
> + * NOTE: For now, we don't pass on uprobes from the parent to the
> + * child; instead, we clear the breakpoints in the
> + * child's address space here.
> + *
> + * TODO:
> + * - Provide option for child to inherit uprobes.
> + */
> +static u32 uprobe_report_clone(u32 action,
> + struct utrace_engine *engine,
> + unsigned long clone_flags,
> + struct task_struct *child)
> +{
> + struct uprobe_process *uproc;
> + struct uprobe_task *ptask, *ctask;
> +
> + ptask = (struct uprobe_task *)rcu_dereference(engine->data);

Are we really in an RCU read-side critical section here?

> + uproc = ptask->uproc;
> +
> + /*
> + * Lock uproc so no new uprobes can be installed 'til all
> + * report_clone activities are completed.
> + */
> + mutex_lock(&uproc_mutex);
> + down_write(&uproc->rwsem);
> +
> + if (clone_flags & CLONE_THREAD) {
> + /* New thread in the same process. */
> + ctask = uprobe_find_utask(child);
> + if (unlikely(ctask)) {
> + /*
> + * uprobe_mk_process() ran just as this clone
> + * happened, and has already accounted for the
> + * new child.
> + */
> + } else {
> + struct pid *child_pid = get_pid(task_pid(child));
> + BUG_ON(!child_pid);
> + ctask = uprobe_add_task(child_pid, uproc);
> + BUG_ON(!ctask);
> + if (IS_ERR(ctask))
> + goto done;
> + uproc->nthreads++;
> + /*
> + * FIXME: Handle the case where uproc is quiescing
> + * (assuming it's possible to clone while quiescing).
> + */
> + }
> + } else {
> + /*
> + * New process spawned by parent. Remove the probepoints
> + * in the child's text.
> + *
> + * It's not necessary to quiesce the child, as we are assured
> + * by utrace that this callback happens *before* the child
> + * gets to run userspace.
> + *
> + * We also hold the uproc->rwsem for the parent - so no
> + * new uprobes will be registered 'til we return.
> + */
> + int i;
> + struct uprobe_probept *ppt;
> + struct hlist_node *node;
> + struct hlist_head *head;
> +
> + for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
> + head = &uproc->uprobe_table[i];
> + hlist_for_each_entry(ppt, node, head, ut_node) {
> + if (ubp_remove_bkpt(child, &ppt->ubp) != 0) {
> + /* Ratelimit this? */
> + printk(KERN_ERR "Pid %d forked %d;"
> + " failed to remove probepoint"
> + " at %#lx in child\n",
> + current->pid, child->pid,
> + ppt->ubp.vaddr);
> + }
> + }
> + }
> + }
> +
> +done:
> + up_write(&uproc->rwsem);
> + mutex_unlock(&uproc_mutex);
> + return UTRACE_RESUME;
> +}
> +
> +/*
> + * Exec callback: The associated process called execve() or friends
> + *
> + * The new program is about to start running, so there is no
> + * possibility that a uprobe from the previous user address space
> + * will be hit.
> + *
> + * NOTE:
> + * Typically, this process would have passed through the clone
> + * callback, where the necessary action *should* have been
> + * taken. However, if we still end up at this callback:
> + * - We don't have to clear the uprobes - memory image
> + * will be overlaid.
> + * - We have to free up uprobe resources associated with
> + * this process.
> + */
> +static u32 uprobe_report_exec(u32 action,
> + struct utrace_engine *engine,
> + const struct linux_binfmt *fmt,
> + const struct linux_binprm *bprm,
> + struct pt_regs *regs)
> +{
> + return uprobe_exec_exit(engine, current, 0);
> +}
> +
> +static const struct utrace_engine_ops uprobe_utrace_ops = {
> + .report_quiesce = uprobe_report_quiesce,
> + .report_signal = uprobe_report_signal,
> + .report_exit = uprobe_report_exit,
> + .report_clone = uprobe_report_clone,
> + .report_exec = uprobe_report_exec
> +};
> +
> +static int __init init_uprobes(void)
> +{
> + int ret, i;
> +
> + ubp_strategies = UBP_HNT_TSKINFO;
> + ret = ubp_init(&ubp_strategies);
> + if (ret != 0) {
> + printk(KERN_ERR "Can't start uprobes: ubp_init() returned %d\n",
> + ret);
> + return ret;
> + }
> + for (i = 0; i < UPROBE_TABLE_SIZE; i++) {
> + INIT_HLIST_HEAD(&uproc_table[i]);
> + INIT_HLIST_HEAD(&utask_table[i]);
> + }
> +
> + p_uprobe_utrace_ops = &uprobe_utrace_ops;
> + return 0;
> +}
> +
> +static void __exit exit_uprobes(void)
> +{
> +}
> +
> +module_init(init_uprobes);
> +module_exit(exit_uprobes);
> --

2010-01-12 04:55:01

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Mon, Jan 11, 2010 at 05:56:08PM +0530, Srikar Dronamraju wrote:
> This patch implements ftrace plugin for uprobes.
>
> Description:
> Ftrace plugin provides an interface to dump data at a given address, top of
> the stack and function arguments when a user program calls a specific
> function.


So, as mentioned before, ftrace plugins tend to be relegated to
obsolescence, and I first suggested plugging this into kprobe
events so that we have a unified interface to control/create
u|k|kret probe events.

But after digging beyond first appearances, uprobe creation
can follow the kprobes creation flow.

A kprobe can be created whenever we want: it probes kernel
text, which is already there, so we can set the probe,
deactivated by default, in advance.

This is much trickier in the case of uprobes, as I see two
ways to work with it:

- probing an already-running process
- probing a process we are about to run

Now say we want to create a uprobe trace event for an already-running
process. No problem in the workflow: we just need to
set the address and the pid. Fine.

Now what if I want to launch ls and profile a function
inside it? What can I do with a trace event? I can't create the
probe event based on a pid, as I don't know the pid in advance.
I could give it the ls cmdline and have it activate
on the next ls launched, but this is racy, as another ls can
be launched concurrently.

So I can only say that an ftrace plugin or an ftrace trace
event would be only a half-useful interface for exploiting utrace
possibilities, because it only lets us trace already-running
apps. Moreover, I bet the most common workflow for profiling/tracing
uprobes is launching an app and profiling it from the beginning,
not profiling an already-running one, which makes an ftrace
interface even less than half useful there.

ftrace is cool for tracing the kernel, but this kind of tricky
userspace tracing workflow is not suited to it.

What do you think?

2010-01-12 05:08:59

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote:

> Now what if I want to launch ls and profile a function
> inside it? What can I do with a trace event? I can't create the
> probe event based on a pid, as I don't know the pid in advance.
> I could give it the ls cmdline and have it activate
> on the next ls launched, but this is racy, as another ls can
> be launched concurrently.

You make a wrapper script:

#!/bin/sh
<add probe to ls with pid> $$
exec $*

I do this all the time to limit the function tracer to a specific app.

#!/bin/sh
echo $$ > /debug/tracing/set_ftrace_pid
echo function > /debug/tracing/current_tracer
exec $*


The exec will cause ls to run with the pid of $$ (the script's own pid).
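For example (the script name here is made up), saving the above as
ftrace-this.sh and running:

	./ftrace-this.sh ls -l

limits the function tracer to that one ls invocation.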

-- Steve



>
> So I can only say that an ftrace plugin or an ftrace trace
> event would be only a half-useful interface for exploiting utrace
> possibilities, because it only lets us trace already-running
> apps. Moreover, I bet the most common workflow for profiling/tracing
> uprobes is launching an app and profiling it from the beginning,
> not profiling an already-running one, which makes an ftrace
> interface even less than half useful there.
>
> ftrace is cool for tracing the kernel, but this kind of tricky
> userspace tracing workflow is not suited to it.
>
> What do you think?
>

2010-01-12 05:36:07

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Mon, Jan 11, 2010 at 05:55:53PM +0530, Srikar Dronamraju wrote:
> +static const struct utrace_engine_ops uprobe_utrace_ops = {
> + .report_quiesce = uprobe_report_quiesce,
> + .report_signal = uprobe_report_signal,
> + .report_exit = uprobe_report_exit,
> + .report_clone = uprobe_report_clone,
> + .report_exec = uprobe_report_exec
> +};


So, as stated before, uprobe seems to handle too many standalone
policies, such as freeing on exec and always inheriting on clone but
never on fork. Such rules should be decided by uprobe clients, not
by uprobe itself, and that makes it not flexible enough to
be usable for now.

For example if we want it to be usable by perf, we have two ways:

- a trace event. Unfortunately, like I explained in a previous
mail, this doesn't seem to be a suitable interface for this
particular case.

- a performance monitoring unit, with the existing unified interface
struct pmu, usable by perf.


Typically, to use it with perf toward a pmu, perf tools need to
create a uprobe on perf process and activate its hook on the next exec.
Thereafter, it's up to perf to decide if we inherit through clone
and fork.

Here I fear utrace and perf are going to collide.

See how the final struct pmu could look (we need to extend it
to support utrace):

struct pmu {
enable() -> called when we schedule in a context where we want
a uprobe to be active. Called very often
disable() -> the opposite of the above

/* Not yet existing callbacks */

hook_task() -> called when a process is created on which
we want to activate our hook;
would typically be called once on
exec if we have set enable_on_exec,
and also on clone()/fork()
if we want to inherit.
}


The above hook_task() (which could be divided into more precise
callbacks like hook_on_exec, hook_on_clone, etc.) would be needed by
perf to drive utrace correctly, and this is going to collide with the
utrace callbacks that notify execs and forks.

Probably utrace can be kept for all the breakpoint signal
handling and so on. But I guess the rest can be implemented on top
of a struct pmu and driven by perf, like we did with the hardware
breakpoints re-implementation.

Just an idea.

2010-01-12 05:44:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Tue, Jan 12, 2010 at 12:08:53AM -0500, Steven Rostedt wrote:
> On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote:
>
> > Now what if I want to launch ls and profile a function
> > inside it? What can I do with a trace event? I can't create the
> > probe event based on a pid, as I don't know the pid in advance.
> > I could give it the ls cmdline and have it activate
> > on the next ls launched, but this is racy, as another ls can
> > be launched concurrently.
>
> You make a wrapper script:
>
> #!/bin/sh
> <add probe to ls with pid> $$
> exec $*
>
> I do this all the time to limit the function tracer to a specific app.
>
> #!/bin/sh
> echo $$ > /debug/tracing/set_ftrace_pid
> echo function > /debug/tracing/current_tracer
> exec $*
>
>
> The exec will cause ls to run with the pid of $$ (the script's own pid).


Sounds like a good idea. In this case we could indeed
think about a trace event.

It would typically have the benefit of having the same
interface as kprobes.

We can use it with perf; the only constraint is that we need
to launch the record right after creating the trace event.
Or we can pre-create the events and set the pid of the target
later, when we launch perf record.

And we need an enable-on-exec option in the probe definition.

That's a nice option.
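(Just to sketch what I mean: the file name, syntax and event name
below are hypothetical, by analogy with kprobe_events, not an
existing interface.

	echo 'p:uprobes/myls /bin/ls:0x4004f0' > /debug/tracing/uprobe_events
	perf record -e uprobes:myls -- ls -l

Plus some flag in the probe definition for enable-on-exec.)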

Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Tue, Jan 12, 2010 at 06:36:00AM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 11, 2010 at 05:55:53PM +0530, Srikar Dronamraju wrote:
> > +static const struct utrace_engine_ops uprobe_utrace_ops = {
> > + .report_quiesce = uprobe_report_quiesce,
> > + .report_signal = uprobe_report_signal,
> > + .report_exit = uprobe_report_exit,
> > + .report_clone = uprobe_report_clone,
> > + .report_exec = uprobe_report_exec
> > +};
>
>
> So, as stated before, uprobe seems to handle too many standalone
> policies, such as freeing on exec and always inheriting on clone but
> never on fork. Such rules should be decided by uprobe clients, not
> by uprobe itself, and that makes it not flexible enough to
> be usable for now.

The freeing on exec is only housekeeping of uprobe data structures. And
probepoints are inherited only on CLONE_THREAD and not otherwise, simply
because the existing probes can be hit in the new thread's context. Not
sure what other policy you are hinting at.

> For example if we want it to be usable by perf, we have two ways:
>
> - a trace event. Unfortunately, like I explained in a previous
> mail, this doesn't seem to be a suitable interface for this
> particular case.
>
> - a performance monitoring unit, with the existing unified interface
> struct pmu, usable by perf.
>
>
> Typically, to use it with perf toward a pmu, perf tools need to
> create a uprobe on perf process and activate its hook on the next exec.
> Thereafter, it's up to perf to decide if we inherit through clone
> and fork.

As mentioned above, the inheritance is only for threads. It should be
fairly easy to inherit probes on fork, and that can be made a perf-based
policy decision.

> Here I fear utrace and perf are going to collide.

Utrace does not mandate any of the policies you've mentioned above.
Utrace just provides callbacks at the said events, and uprobes can be
tweaked to accommodate perf's requirements as feasible.

> See how the final struct pmu could look (we need to extend it
> to support utrace):
>
> struct pmu {
> enable() -> called when we schedule in a context where we want
> a uprobe to be active. Called very often
> disable() -> the opposite of the above
>
> /* Not yet existing callbacks */
>
> hook_task() -> called when a process is created on which
> we want to activate our hook;
> would typically be called once on
> exec if we have set enable_on_exec,
> and also on clone()/fork()
> if we want to inherit.
> }
>
>
> The above hook_task() (which could be divided into more precise
> callbacks like hook_on_exec, hook_on_clone, etc.) would be needed by
> perf to drive utrace correctly, and this is going to collide with the
> utrace callbacks that notify execs and forks.
>
> Probably utrace can be kept for all the breakpoint signal
> handling and so on. But I guess the rest can be implemented on top
> of a struct pmu and driven by perf, like we did with the hardware
> breakpoints re-implementation.
>
> Just an idea.

Well, I wonder if perf can ride on utrace's callbacks for
hook_task() in the clone/fork cases?

Ananth

2010-01-12 08:21:34

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi Paul,


> > +
> > +/*
> > + * Allocate a uprobe_task object for p and add it to uproc's list.
> > + * Called with p "got" and uproc->rwsem write-locked. Called in one of
> > + * the following cases:
> > + * - before setting the first uprobe in p's process
> > + * - we're in uprobe_report_clone() and p is the newly added thread
> > + * Returns:
> > + * - pointer to new uprobe_task on success
> > + * - NULL if t dies before we can utrace_attach it
> > + * - negative errno otherwise
> > + */
> > +static struct uprobe_task *uprobe_add_task(struct pid *p,
> > + struct uprobe_process *uproc)
> > +{
> > + struct uprobe_task *utask;
> > + struct utrace_engine *engine;
> > + struct task_struct *t = pid_task(p, PIDTYPE_PID);
>
> What keeps the task_struct referenced by "t" from disappearing at this
> point?

We have a ref-counted pid which is used for creation of the utrace
engine. If the task_struct disappears, then utrace would refuse to
create an engine and we wouldn't proceed further. We use the
task_struct and pid only when we have successfully created a utrace
engine. Once the utrace engine is created, utrace guarantees us that
the task will remain until uprobes is notified of the death/exit.
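A sketch of the pattern (simplified from the uprobe_add_task() code
being discussed):

	struct pid *p = get_pid(task_pid(t));	/* ref-counted pid */
	engine = utrace_attach_pid(p, UTRACE_ATTACH_CREATE,
				   p_uprobe_utrace_ops, utask);
	if (IS_ERR(engine)) {
		/* Task already gone (or error): we never touch t again. */
		uprobe_free_task(utask, 0);
		return ERR_PTR(PTR_ERR(engine));
	}
	/* Engine created: utrace keeps the task until it reports exit. */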

>
> > +
> > + if (!t)
> > + return NULL;
> > + utask = kzalloc(sizeof *utask, GFP_USER);
> > + if (unlikely(utask == NULL))
> > + return ERR_PTR(-ENOMEM);
> > +
> > + utask->pid = p;
> > + utask->tsk = t;
> > + utask->state = UPTASK_RUNNING;
> > + utask->quiescing = false;
> > + utask->uproc = uproc;
> > + utask->active_probe = NULL;
> > + utask->doomed = false;
> > + INIT_LIST_HEAD(&utask->deferred_registrations);
> > + INIT_LIST_HEAD(&utask->delayed_signals);
> > + INIT_LIST_HEAD(&utask->list);
> > + list_add_tail(&utask->list, &uproc->thread_list);
> > + uprobe_hash_utask(utask);
> > +
> > + engine = utrace_attach_pid(p, UTRACE_ATTACH_CREATE,
> > + p_uprobe_utrace_ops, utask);
> > + if (IS_ERR(engine)) {
> > + long err = PTR_ERR(engine);
> > + printk("uprobes: utrace_attach_task failed, returned %ld\n",
> > + err);
> > + uprobe_free_task(utask, 0);
> > + if (err == -ESRCH)
> > + return NULL;
> > + return ERR_PTR(err);
> > + }
> > + goto dont_add;
> > + list_for_each_entry(utask, &uproc->thread_list, list) {
>
> Doesn't this need to be list_for_each_entry_rcu()?
>
> Or do you have ->thread_list protected elsewise?

thread_list is protected by the write lock on uproc->rwsem.

>
> > + if (utask->tsk == t)
> > + /* Already added */
> > + goto dont_add;
> > + }
> > + /* Found thread/task to add. */
> > + pid = get_pid(task_pid(t));
> > + break;
> > +dont_add:
> > + t = next_thread(t);
> > + } while (t != start);
> > + }
> > + rcu_read_unlock();
>
> Now that we are outside of rcu_read_lock()'s protection, the task
> indicated by "pid" might disappear, and the value of "pid" might well
> be reused. Is this really OK?

We have a ref-counted pid, so the pid should not disappear.
And as I said earlier, once the utrace engine gets created, we are sure
that the task_struct lives until the engine gets detached. If an engine
is not created, we don't use the task_struct or the pid. We piggyback on
the guarantee that utrace provides.

>
> > + return pid;
> > +}
> > +
> > +/*
> > + * Given a numeric thread ID, return a ref-counted struct pid for the
> > + * task-group-leader thread.
> > + */
> > +static struct pid *uprobe_get_tg_leader(pid_t p)
> > +{
> > + struct pid *pid = NULL;
> > +
> > + rcu_read_lock();
> > + if (current->nsproxy)
> > + pid = find_vpid(p);
> > + if (pid) {
> > + struct task_struct *t = pid_task(pid, PIDTYPE_PID);
> > + if (t)
> > + pid = task_tgid(t);
> > + else
> > + pid = NULL;
> > + }
> > + rcu_read_unlock();
>
> What happens if the thread disappears at this point? We are outside of
> rcu_read_lock() protection, so all the structures could potentially be
> freed up by other CPUs, especially if this CPU takes an interrupt or is
> preempted.
>
> > + return get_pid(pid); /* null pid OK here */
> > +}


Same as above.

> > +/*
> > + * Signal callback:
> > + */
> > +static u32 uprobe_report_signal(u32 action,
> > + struct utrace_engine *engine,
> > + struct pt_regs *regs,
> > + siginfo_t *info,
> > + const struct k_sigaction *orig_ka,
> > + struct k_sigaction *return_ka)
> > +{
> > + struct uprobe_task *utask;
> > + struct uprobe_process *uproc;
> > + bool doomed;
> > + enum utrace_resume_action report_action;
> > +
> > + utask = (struct uprobe_task *)rcu_dereference(engine->data);
>
> Are we really in an RCU read-side critical section here?

Yeah, we don't need the rcu_dereference() here.

>
> > +static u32 uprobe_report_quiesce(u32 action,
> > + struct utrace_engine *engine,
> > + unsigned long event)
> > +{
> > + struct uprobe_task *utask;
> > + struct uprobe_process *uproc;
> > + bool done_quiescing = false;
> > +
> > + utask = (struct uprobe_task *)rcu_dereference(engine->data);
>
> Are we really in an RCU read-side critical section here?

Yeah, we don't need the rcu_dereference() here either.

>
> > +
> > +static u32 uprobe_exec_exit(struct utrace_engine *engine,
> > + struct task_struct *tsk, int exit)
> > +{
> > + struct uprobe_process *uproc;
> > + struct uprobe_probept *ppt;
> > + struct uprobe_task *utask;
> > + bool utask_quiescing;
> > +
> > + utask = (struct uprobe_task *)rcu_dereference(engine->data);
>
> Are we really in an RCU read-side critical section here?

Yeah, we don't need the rcu_dereference() here either.

>
> > + * - Provide option for child to inherit uprobes.
> > + */
> > +static u32 uprobe_report_clone(u32 action,
> > + struct utrace_engine *engine,
> > + unsigned long clone_flags,
> > + struct task_struct *child)
> > +{
> > + struct uprobe_process *uproc;
> > + struct uprobe_task *ptask, *ctask;
> > +
> > + ptask = (struct uprobe_task *)rcu_dereference(engine->data);
>
> Are we really in an RCU read-side critical section here?

Yeah, we don't need the rcu_dereference() here either.

--
Thanks and Regards
Srikar

2010-01-12 08:54:21

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation


Hi Frederic,


>
>
> So, as stated before, uprobe seems to handle too many standalone
> policies, such as freeing on exec and always inheriting on clone but
> never on fork. Such rules should be decided by uprobe clients, not
> by uprobe itself, and that makes it not flexible enough to
> be usable for now.

Let's say we were tracing process A and had inserted a few breakpoints.
If this process were to perform an exec, we would be loading a new
process image. The old breakpoints are actually detrimental now,
and hence all the breakpoints that we had installed would
have to be removed anyway, and new breakpoints installed at
different locations.

If this process were to fork, then we would have to create all the
per-process uprobes book-keeping, including the one-page-per-process
instruction store. In most cases fork is followed by exec,
which would mean we would have to trash the breakpoints that we
inherited.

Tracing a newly exec-ed process or a forked process is similar to
starting a new uprobes session.

Also, uprobes would allow more than one kernel module/plugin to trace
the same process; i.e., for the same process at the same breakpoint,
one client may want follow-on-fork or follow-on-exec while another
may not.

But I understand your requirement for tracing a session rather than
just a process. And that's where the utrace-based task finder or
something similar finds its application. This layer (task finder)
would be able to tell uprobes to start tracing a process based on
certain criteria.

Since uprobes uses a breakpoint instruction, all threads of a process
which is being traced would take an exception when passing through a
breakpoint; hence we always have to inherit on clone. If a client wants
to trace only certain threads of a process, it could filter them in
the uprobe handler, as sketched below.
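For example (a sketch only; wanted_tid is a made-up variable for
illustration), a client's handler could do:

static pid_t wanted_tid;	/* the thread this client cares about */

static void my_handler(struct uprobe *u, struct pt_regs *regs)
{
	/*
	 * The handler runs in the context of the thread that hit
	 * the breakpoint, so current identifies the thread.
	 */
	if (current->pid != wanted_tid)
		return;		/* ignore other threads */
	/* ... collect data for the thread of interest ... */
}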

I feel the current uprobes + task finder combination would be much more
flexible; perf could probably use it. Also, this approach would
avoid unnecessary creation of uprobes book-keeping for processes where
we may never place probes.

>
> For example if we want it to be usable by perf, we have two ways:
>
> - a trace event. Unfortunately, like I explained in a previous
> mail, this doesn't seem to be a suitable interface for this
> particular case.
>
> - a performance monitoring unit, with the existing unified interface
> struct pmu, usable by perf.
>
>
> Typically, to use it with perf toward a pmu, perf tools need to
> create a uprobe on perf process and activate its hook on the next exec.
> Thereafter, it's up to perf to decide if we inherit through clone
> and fork.
>
> Here I fear utrace and perf are going to collide.

I am not sure why utrace and perf would collide.
I think utrace is a layer below uprobes, so perf could use utrace
directly (if it implements the task-finder logic) or use utrace
through uprobes.

>
> See how could be the final struct pmu (we need to extend it
> to support utrace):
>
> struct pmu {
> enable() -> called we schedule in a context where we want
> a uprobe to be active. Called very often
> disable() -> the above opposite
>
> /* Not yet existing callbacks */
>
> hook_task() -> called when a process is created which
> we want to activate our hook
> would be typically called once on
> exec if we have set enable_on_exec
> and also on clone()/fork()
> if we want to inherit.
> }
>
>
> The above hook_task (could be divided in more precise callback events
> like hook_on_exec, hook_on_clone, etc...) would be needed by perf
> to drive correctly utrace and this is going to collide with utrace
> callbacks that notify execs and forks.

As pointed out by Ananth, this hook-on-exec and hook-on-fork is exactly
what the task-finder/perf would provide using the utrace APIs.

If there is any reason why utrace and perf could collide, can you
please provide more details? In such a case, Roland and others may have
more ideas on how to work around these issues.

Please let me know your thoughts.


--
Thanks and Regards
Srikar

2010-01-12 18:54:43

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes


Frederic Weisbecker <[email protected]> writes:

> [...]
> This is much more tricky in the case of uprobes as I see two
> ways to work with it:
> - probing on an already running process
> - probing on a process we are about to run
> [...]

As you might expect, in systemtap we've had to figure out this area
some time ago. We use another utrace consumer called "task finder"
that registers interest in present / future processes, and gives us
kernel-space callbacks when these come and go. You could merge it or
something like it.

- FChE

2010-01-12 19:12:51

by Tim Bird

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Steven Rostedt wrote:
> I do this all the time to limit the function tracer to a specific app.
>
> #!/bin/sh
> echo $$ > /debug/tracing/set_ftrace_pid
> echo function > /debug/tracing/current_tracer
> exec $*

This seems pretty handy. You should put this in scripts/tracing.
:-)

Do you have any other helper scripts you use commonly?
-- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================

2010-01-12 22:01:11

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Frank Ch. Eigler wrote:
>
> Frederic Weisbecker <[email protected]> writes:
>
>> [...]
>> This is much more tricky in the case of uprobes as I see two
>> ways to work with it:
>> - probing on an already running process
>> - probing on a process we are about to run
>> [...]
>
> As you might expect, in systemtap we've had to figure out this area
> some time ago. We use another utrace consumer called "task finder"
> that registers interest in present / future processes, and gives us
> kernel-space callbacks when these come and go. You could merge it or
> something like it.

So, could you tell us how the task-finder works and how it is
implemented? I think we'd better clarify what functions are required
for uprobes and pmu, and I think we may be able to re-implement an
improved pmu on top of utrace.

Thank you,


--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-12 22:16:50

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Hi -

> > As you might expect, in systemtap we've had to figure out this area
> > some time ago. We use another utrace consumer called "task finder" [...]
>
> So, could you tell us how the task-finder works and is implemented?

The code may be found at runtime/task_finder* in the systemtap sources.
There is a simple interest-registration structure/API that identifies
processes / shared libraries of interest, and a set of callbacks to be
invoked when said processes/shared libraries are mapped or unmapped.

It is implemented in terms of utrace callbacks for process/thread
lifetime monitoring, and utrace syscall callbacks for tracking shared
library segments being mapped and unmapped.

http://sourceware.org/git/?p=systemtap.git;a=tree;f=runtime


> I think we'd better clarify what functions are required for uprobes
> and pmu, and I think we may be able to re-implement improved pmu on
> utrace.

I don't see any collision between pmu / perf / utrace, so nothing is
really "required" for them or simple usage of uprobes. If you wish to
track dynamic process/shared-library lifetimes, then you need extra
code somewhere to respond to those changes. Layering this dynamic
capability seems like the natural way to go, and is easily done with
utrace and/or tracepoints.

- FChE
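
(To make that concrete: below is a minimal sketch, in the spirit of
systemtap's task finder, of what such an interest-registration API
could look like. All structure, field, and function names here are
hypothetical stand-ins, not the actual symbols from
runtime/task_finder*.)

#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical interest-registration API; names and signatures are
 * illustrative only. */
struct task_finder_target {
	const char *procname;	/* executable path of interest, or NULL */
	pid_t pid;		/* specific pid of interest, or 0 for any */

	/* invoked when a matching process appears (register_p = 1)
	 * or goes away (register_p = 0) */
	int (*task_callback)(struct task_finder_target *tgt,
			     struct task_struct *tsk, int register_p);

	/* invoked when such a process maps (map_p = 1) or unmaps
	 * (map_p = 0) a file of interest, e.g. a shared library */
	int (*mmap_callback)(struct task_finder_target *tgt,
			     struct task_struct *tsk, unsigned long addr,
			     unsigned long length, unsigned long offset,
			     int map_p);
};

int task_finder_register(struct task_finder_target *tgt);
void task_finder_unregister(struct task_finder_target *tgt);

A client such as uprobes could register one such target per process
or executable of interest and install or remove breakpoints from the
callbacks.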

2010-01-12 22:32:40

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Frank Ch. Eigler wrote:
> Hi -
>
>>> As you might expect, in systemtap we've had to figure out this area
>>> some time ago. We use another utrace consumer called "task finder" [...]
>>
>> So, could you tell us how the task-finder works and is implemented?
>
> The code may be found at runtime/task_finder* in the systemtap sources.
> There is a simple interest-registration structure/API that identifies
> processes / shared libraries of interest, and a set of callbacks to be
> invoked when said processes/shared libraries are mapped or unmapped.
>
> It is implemented in terms of utrace callbacks for process/thread
> lifetime monitoring, and utrace syscall callbacks for tracking shared
> library segments being mapped and unmapped.
>
> http://sourceware.org/git/?p=systemtap.git;a=tree;f=runtime

Nice! So we can set a probe by relative address within the library
segments.

>> I think we'd better clarify what functions are required for uprobes
>> and pmu, and I think we may be able to re-implement improved pmu on
>> utrace.
>
> I don't see any collision between pmu / perf / utrace, so nothing is
> really "required" for them or simple usage of uprobes. If you wish to
> track dynamic process/shared-library lifetimes, then you need extra
> code somewhere to respond to those changes.

And that code we can find in runtime/task_finder*, right? :-)

> Layering this dynamic
> capability seems like the natural way to go, and is easily done with
> utrace and/or tracepoints.

Sure, and I think that will allow us to use uprobe events as
trace-events, because we can set probes before executing programs. :-)

Thank you!


--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-13 00:54:01

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Tue, 2010-01-12 at 13:44 +0530, Ananth N Mavinakayanahalli wrote:
> On Tue, Jan 12, 2010 at 06:36:00AM +0100, Frederic Weisbecker wrote:
...
> > So, as stated before, uprobe seems to handle too much standalone
> > policies such as freeing on exec, always inherit on clone and never
> > on fork. Such rules should be decided from uprobe clients not
> > from uprobe itself and that makes it not enough flexible to
> > be usable for now.
>
> The freeing on exec is only housekeeping of uprobe data structures. And
> probepoints are inherited only on CLONE_THREAD and not otherwise, simply
> since the existing probes can be hit in the new thread's context. Not
> sure what other policy you are hinting at.
>
...
> >
> >
> > Typically, to use it with perf toward a pmu, perf tools need to
> > create a uprobe on perf process and activate its hook on the next exec.
> > Thereafter, it's up to perf to decide if we inherit through clone
> > and fork.
>
> As mentioned above, the inheritance is only for threads. It should be
> fairly easy to inherit probes on fork, and that can be made a perf based
> policy decision.
>

One reason we don't currently support inheritance (or cloning) of
uprobes across fork is that a uprobe object is (a) per-process (and I
think we want to keep it that way); and (b) owned by the uprobes client.
That is, the client creates and populates that uprobe object, and passes
a pointer to it to both register_uprobe() and unregister_uprobe(). We
could clone this object on fork, but then how would the client refer to
the cloned uprobes in the new process -- e.g., to unregister them?

I guess each cloned uprobe could remember its "patriarch" uprobe -- its
ultimate ancestor, the one created by the client; and we could add an
"unregister_uprobe_clone" function that takes both the address of the
patriarch uprobe and the pid of the (clone) uprobe to be unregistered.

It has also been suggested that it might be more user-friendly to
let the client discard (or reuse) the uprobe object as soon as
register_uprobe() returns. register_uprobe() would presumably copy
everything it needs from the uprobe to the uprobe_kimg, and pass back
a handle (e.g., the address of the uprobe_kimg) that the client can
later pass to unregister_uprobe() -- or unregister_uprobe_clone(). (In
this case, only the uprobe_kimg would be cloned.) It might be good to
consider both these enhancement requests together.

Anyway, as previously described, the clone-on-fork feature can be (and
has been) implemented by a utrace-based task-finder that notices forks,
and creates and registers a whole new set of uprobes for the new
process.

Jim
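
(To make the ownership problem concrete: a rough sketch of the
client-owned probe object and calls being discussed. The field names
and signatures below are assumptions for illustration, not
necessarily the exact ones in the patchset.)

#include <linux/kernel.h>
#include <linux/ptrace.h>
#include <linux/sched.h>

/* Assumed shape of the client-owned uprobe object. */
struct uprobe {
	pid_t pid;		/* probed process */
	unsigned long vaddr;	/* virtual address of the probepoint */
	void (*handler)(struct uprobe *u, struct pt_regs *regs);
};

/* Assumed registration API, provided by the uprobes layer. */
int register_uprobe(struct uprobe *u);
void unregister_uprobe(struct uprobe *u);

static void my_handler(struct uprobe *u, struct pt_regs *regs)
{
	pr_info("probe hit at %#lx in pid %d\n", u->vaddr, u->pid);
}

static struct uprobe my_probe = {
	.pid = 1234,		/* hypothetical target pid */
	.vaddr = 0x080484d0,	/* hypothetical probepoint */
	.handler = my_handler,
};

/*
 * The client keeps my_probe alive and later passes the same pointer
 * to unregister_uprobe().  For a probe cloned into a forked child the
 * client holds no such pointer, which is why something like the
 * proposed unregister_uprobe_clone(patriarch, pid) would be needed.
 */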

2010-01-13 21:58:59

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Steven Rostedt wrote:
> On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote:
>
>> Now what if I want to launch ls and want to profile a function
>> inside. What can I do with a trace event. I can't create the
>> probe event based on a pid as I don't know it in advance.
>> I could give it the ls cmdline and it manages to activate
>> on the next ls launched. This is racy as another ls can
>> be launched concurrently.
>
> You make a wrapper script:
>
> #!/bin/sh
> <add probe to ls with pid> $$
> exec $*
>
> I do this all the time to limit the function tracer to a specific app.
>
> #!/bin/sh
> echo $$ > /debug/tracing/set_ftrace_pid
> echo function > /debug/tracing/current_tracer
> exec $*

I recommend adding the line below at the end of the script,
from my experience. :)

echo nop > /debug/tracing/current_tracer

Thank you,


--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-13 22:13:15

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

Masami Hiramatsu wrote:
> Steven Rostedt wrote:
>> On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote:
>>
>>> Now what if I want to launch ls and want to profile a function
>>> inside. What can I do with a trace event. I can't create the
>>> probe event based on a pid as I don't know it in advance.
>>> I could give it the ls cmdline and it manages to activate
>>> on the next ls launched. This is racy as another ls can
>>> be launched concurrently.
>>
>> You make a wrapper script:
>>
>> #!/bin/sh
>> <add probe to ls with pid> $$
>> exec $*
>>
>> I do this all the time to limit the function tracer to a specific app.
>>
>> #!/bin/sh
>> echo $$ > /debug/tracing/set_ftrace_pid
>> echo function > /debug/tracing/current_tracer
>> exec $*
>
> I recommend adding the line below at the end of the script,
> from my experience. :)
>
> echo nop > /debug/tracing/current_tracer

Oops, my bad, it doesn't work after exec...
But it is very important to disable the function tracer after
tracing the target process.

So perhaps the script below may work.

#!/bin/bash
(echo $BASHPID > /debug/tracing/set_ftrace_pid
echo function > /debug/tracing/current_tracer
exec $*)
echo nop > /debug/tracing/current_tracer

Thanks,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-13 23:36:34

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Wed, 2010-01-13 at 17:12 -0500, Masami Hiramatsu wrote:
> Masami Hiramatsu wrote:
> > Steven Rostedt wrote:
> >> On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote:
> >>
> >>> Now what if I want to launch ls and want to profile a function
> >>> inside. What can I do with a trace event. I can't create the
> >>> probe event based on a pid as I don't know it in advance.
> >>> I could give it the ls cmdline and it manages to activate
> >>> on the next ls launched. This is racy as another ls can
> >>> be launched concurrently.
> >>
> >> You make a wrapper script:
> >>
> >> #!/bin/sh
> >> <add probe to ls with pid> $$
> >> exec $*
> >>
> >> I do this all the time to limit the function tracer to a specific app.
> >>
> >> #!/bin/sh
> >> echo $$ > /debug/tracing/set_ftrace_pid
> >> echo function > /debug/tracing/current_tracer
> >> exec $*
> >
> > I recommend adding the line below at the end of the script,
> > from my experience. :)
> >
> > echo nop > /debug/tracing/current_tracer
>
> Oops, my bad, it doesn't work after exec...
> But it is very important to disable the function tracer after
> tracing the target process.
>
> So perhaps the script below may work.
>
> #!/bin/bash
> (echo $BASHPID > /debug/tracing/set_ftrace_pid
> echo function > /debug/tracing/current_tracer
> exec $*)
> echo nop > /debug/tracing/current_tracer

Unfortunately, that would lose the entire trace you just recorded.

So perhaps adding:

trace-cmd extract
echo nop > /debug/tracing/current_tracer

would work better. The extract feature of trace-cmd pulls the data from
the kernel buffer and saves it in a file.

-- Steve

2010-01-14 11:08:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> User Space Breakpoint Assistance Layer (UBP)
>
> User space breakpointing Infrastructure provides kernel subsystems
> with an architecture-independent interface to establish breakpoints in
> user applications. This patch provides the core implementation of ubp and
> also wrappers for architecture-dependent methods.

So if this is the basic infrastructure to set userspace breakpoints,
then why not call this uprobe?

> UBP currently supports both single stepping inline and execution out
> of line strategies. Two different probepoints in the same process can
> have two different strategies.

maybe explain wth these are?

2010-01-14 11:09:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/7] Execution out of line (XOL)

On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> Execution out of line (XOL)
>
> Slot allocation mechanism for Execution Out of Line strategy in User
> space breakpointing Infrastructure. (XOL)
>
> This patch provides slot allocation mechanism for execution out of
> line strategy for use with user space breakpoint infrastructure.
> This patch requires utrace support in kernel.
>
> This patch provides five functions xol_get_insn_slot(),
> xol_free_insn_slot(), xol_put_area(), xol_get_area() and
> xol_validate_vaddr().
>
> Current slot allocation mechanism:
> 1. Allocate one dedicated slot per user breakpoint.
> 2. If the allocated vma is completely used, expand current vma.
> 3. If we can't expand the vma, allocate a new vma.


Say what?

I see the text, but none of it makes any sense at all.

2010-01-14 11:10:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
>
> > Uprobes Infrastructure enables users to dynamically establish
> > probepoints in user applications and collect information by executing
> > handler functions when the probepoints are hit.
> > Please refer to Documentation/uprobes.txt for more details.
>
> This patch provides the core implementation of uprobes.
> This patch builds on utrace infrastructure.
>
> You need to follow this up with the uprobes patch for your
> architecture.

So all this is basically some glue between what you call ubp (the real
userspace breakpoint stuff) and utrace? Or does it do more?

2010-01-14 11:12:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Tue, 2010-01-12 at 13:44 +0530, Ananth N Mavinakayanahalli wrote:
>
> Well, I wonder if perf can ride on utrace's callbacks for the
> hook_task() for the clone/fork cases?

Well it could, but using all of utrace to simply get simple callbacks
from copy_process() is just daft so we're not going to do that.

2010-01-14 11:14:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 5/7] X86 Support for Uprobes

On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> [PATCH] x86 support for Uprobes

So uhm,.. HAVE_UPROBE is basically HAVE_UBP?

> Signed-off-by: Jim Keniston <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/uprobes.h | 27 +++++++++++++++++++++++++++
> 2 files changed, 28 insertions(+)
>
> Index: new_uprobes.git/arch/x86/Kconfig
> ===================================================================
> --- new_uprobes.git.orig/arch/x86/Kconfig
> +++ new_uprobes.git/arch/x86/Kconfig
> @@ -51,6 +51,7 @@ config X86
> select HAVE_KERNEL_LZMA
> select HAVE_HW_BREAKPOINT
> select HAVE_UBP
> + select HAVE_UPROBES
> select HAVE_ARCH_KMEMCHECK
> select HAVE_USER_RETURN_NOTIFIER
>
> Index: new_uprobes.git/arch/x86/include/asm/uprobes.h
> ===================================================================
> --- /dev/null
> +++ new_uprobes.git/arch/x86/include/asm/uprobes.h
> @@ -0,0 +1,27 @@
> +#ifndef _ASM_UPROBES_H
> +#define _ASM_UPROBES_H
> +/*
> + * Userspace Probes (UProbes)
> + * uprobes.h
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2008, 2009
> + */
> +#include <linux/signal.h>
> +
> +#define BREAKPOINT_SIGNAL SIGTRAP
> +#define SSTEP_SIGNAL SIGTRAP
> +#endif /* _ASM_UPROBES_H */

2010-01-14 11:23:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> This patch implements ftrace plugin for uprobes.

Right, like others have said, trace events is a much saner interface.

So the easiest way I can see that working is to register uprobes against
a file (not a pid). Then on creation it uses rmap to find all current
maps of that file and install the probe if there is a consumer for that
map.

Then for each new mmap() of that file, we also need to check if there's
a consumer ready and install the probe.

The existence of the uprobe trace-event would keep a ref on the
dentry/inode, ensuring it remains around or something.

Consumers could be some utrace thing (currently called uprobe -- which
is a misnomer imo), or perf, or ftrace like.


2010-01-14 11:30:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, 2010-01-14 at 12:23 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > This patch implements ftrace plugin for uprobes.
>
> Right, like others have said, trace events is a much saner interface.
>
> So the easiest way I can see that working is to register uprobes against
> a file (not a pid).

Just to clarify, this means you can do things like:

p:uprobe_event dso:symbol[+offs]

Irrespective of whether there are any current users of that file.
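
For instance (with a made-up event name, file, and symbol), such a
definition might look like:

	p:my_probe /bin/ls:main+0x14

i.e. the probe is attached to the file itself and would fire in any
task that maps it, present or future. Only the dso:symbol[+offs]
shape comes from the proposal above; the exact spelling here is
illustrative.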

2010-01-14 11:35:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, Jan 14, 2010 at 12:23:11PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > This patch implements ftrace plugin for uprobes.
>
> Right, like others have said, trace events is a much saner interface.
>
> So the easiest way I can see that working is to register uprobes against
> a file (not a pid). Then on creation it uses rmap to find all current
> maps of that file and install the probe if there is a consumer for that
> map.
>
> Then for each new mmap() of that file, we also need to check if there's
> a consumer ready and install the probe.



That looks racy.

Say you first create a probe on /bin/ls:

perf probe p addr_in_ls /bin/ls

then something else launches /bin/ls behind you, probe
is set on it

then you launch:

perf record -e "probe:...." /bin/ls

Then it goes recording the previous instance.

2010-01-14 11:43:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, 2010-01-14 at 12:35 +0100, Frederic Weisbecker wrote:
> On Thu, Jan 14, 2010 at 12:23:11PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > > This patch implements ftrace plugin for uprobes.
> >
> > Right, like others have said, trace events is a much saner interface.
> >
> > So the easiest way I can see that working is to register uprobes against
> > a file (not a pid). Then on creation it uses rmap to find all current
> > maps of that file and install the probe if there is a consumer for that
> > map.
> >
> > Then for each new mmap() of that file, we also need to check if there's
> > a consumer ready and install the probe.
>
>
>
> That looks racy.
>
> Say you first create a probe on /bin/ls:
>
> perf probe p addr_in_ls /bin/ls
>
> then something else launches /bin/ls behind you, probe
> is set on it
>
> then you launch:
>
> perf record -e "probe:...." /bin/ls
>
> Then it goes recording the previous instance.

Uhm, why? Only the perf /bin/ls instance has a consumer and will thus
have a probe installed.

(Or if you want to use ftrace you need to always have all instances
probed anyway)

2010-01-14 12:17:27

by Mark Wielaard

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, 2010-01-14 at 12:29 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 12:23 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > > This patch implements ftrace plugin for uprobes.
> >
> > Right, like others have said, trace events is a much saner interface.
> >
> > So the easiest way I can see that working is to register uprobes against
> > a file (not a pid).
>
> Just to clarify, this means you can do things like:
>
> p:uprobe_event dso:symbol[+offs]
>
> Irrespective of whether there are any current users of that file.

Yes, that is a good idea; you can then also refine that with a filter on
a target pid. That is what systemtap also does: you define files
(whether they are executables or shared libraries, etc.) plus
symbols/offsets/etc. as targets and monitor when they get mapped in
(either system-wide, per-executable, or pid-based).

Cheers,

Mark

2010-01-14 12:20:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, 2010-01-14 at 13:16 +0100, Mark Wielaard wrote:
> On Thu, 2010-01-14 at 12:29 +0100, Peter Zijlstra wrote:
> > On Thu, 2010-01-14 at 12:23 +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > > > This patch implements ftrace plugin for uprobes.
> > >
> > > Right, like others have said, trace events is a much saner interface.
> > >
> > > So the easiest way I can see that working is to register uprobes against
> > > a file (not a pid).
> >
> > Just to clarify, this means you can do things like:
> >
> > p:uprobe_event dso:symbol[+offs]
> >
> > Irrespective of whether there are any current users of that file.
>
> Yes, that is a good idea; you can then also refine that with a filter on
> a target pid. That is what systemtap also does: you define files
> (whether they are executables or shared libraries, etc.) plus
> symbols/offsets/etc. as targets and monitor when they get mapped in
> (either system-wide, per-executable, or pid-based).

Well, the pid part is more the concern of the consumer of the
trace-event. The event itself is task invariant and only cares about the
particular code in question getting executed.

2010-01-14 12:23:37

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, Jan 14, 2010 at 12:43:01PM +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 12:35 +0100, Frederic Weisbecker wrote:
> > On Thu, Jan 14, 2010 at 12:23:11PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote:
> > > > This patch implements ftrace plugin for uprobes.
> > >
> > > Right, like others have said, trace events is a much saner interface.
> > >
> > > So the easiest way I can see that working is to register uprobes against
> > > a file (not a pid). Then on creation it uses rmap to find all current
> > > maps of that file and install the probe if there is a consumer for that
> > > map.
> > >
> > > Then for each new mmap() of that file, we also need to check if there's
> > > a consumer ready and install the probe.
> >
> >
> >
> > That looks racy.
> >
> > Say you first create a probe on /bin/ls:
> >
> > perf probe p addr_in_ls /bin/ls
> >
> > then something else launches /bin/ls behind you, probe
> > is set on it
> >
> > then you launch:
> >
> > perf record -e "probe:...." /bin/ls
> >
> > Then it goes recording the previous instance.
>
> Uhm, why? Only the perf /bin/ls instance has a consumer and will thus
> have a probe installed.
>
> (Or if you want to use ftrace you need to always have all instances
> probed anyway)


I see, so what you suggest is to have the probe set up
as generic first. Then the process that activates it
becomes a consumer, right?

Would work, yeah.

2010-01-14 12:29:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, 2010-01-14 at 13:23 +0100, Frederic Weisbecker wrote:
>
> I see, so what you suggest is to have the probe set up
> as generic first. Then the process that activates it
> becomes a consumer, right?

Right, so either we have it always on, for things like ftrace,

in which case the creation traverses rmap and installs the probes in
all existing mmap()s, and a mmap() hook will install it on all new
ones.

Or they're strictly consumer driven, like perf, in which case the act of
enabling the event will install the probe (if it's not there yet).

2010-01-14 19:46:47

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Thu, 2010-01-14 at 12:08 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> > User Space Breakpoint Assistance Layer (UBP)
> >
> > User space breakpointing Infrastructure provides kernel subsystems
> > with an architecture-independent interface to establish breakpoints in
> > user applications. This patch provides the core implementation of ubp and
> > also wrappers for architecture-dependent methods.
>
> So if this is the basic infrastructure to set userspace breakpoints,
> then why not call this uprobe?

Ubp is for setting and removing breakpoints, and for supporting the two
schemes (inline, out of line) for executing the probed instruction after
you hit the breakpoint.

Uprobes provides a higher-level API and deals with synchronization
issues, process-vs-thread issues, execution of the client's (potentially
buggy) probe handler, multiple probe clients, multiple probes at the
same location, thread- and process-lifetime events, etc.

>
> > UBP currently supports both single stepping inline and execution out
> > of line strategies. Two different probepoints in the same process can
> > have two different strategies.
>
> maybe explain wth these are?
>

Here's a partial explanation from patch #6, section 1.1:

+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+user-mode registers are saved, and a SIGTRAP signal is generated.
+Uprobes intercepts the SIGTRAP and finds the associated uprobe.
+It then executes the handler associated with the uprobe, passing the
+handler the addresses of the uprobe struct and the saved registers.
+...
+
+Next, Uprobes single-steps its copy of the probed instruction and
+resumes execution of the probed process at the instruction following
+the probepoint. (It would be simpler to single-step the actual
+instruction in place, but then Uprobes would have to temporarily
+remove the breakpoint instruction. This would create problems in a
+multithreaded application. For example, it would open a time window
+when another thread could sail right past the probepoint.)
+
+Instruction copies to be single-stepped are stored in a per-process
+"single-step out of line (XOL) area," which is a little VM area
+created by Uprobes in each probed process's address space.

This (single-stepping out of line = SSOL) is essentially what kprobes
does on most architectures. XOL (execution out of line) is actually a
broader category that could include other schemes, discussed elsewhere.

Jim

2010-01-14 22:43:56

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/7] Execution out of line (XOL)


On Thu, 2010-01-14 at 12:08 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> > Execution out of line (XOL)
> >
> > Slot allocation mechanism for Execution Out of Line strategy in User
> > space breakpointing Infrastructure. (XOL)
> >
> > This patch provides slot allocation mechanism for execution out of
> > line strategy for use with user space breakpoint infrastructure.
> > This patch requires utrace support in kernel.
> >
> > This patch provides five functions xol_get_insn_slot(),
> > xol_free_insn_slot(), xol_put_area(), xol_get_area() and
> > xol_validate_vaddr().
> >
> > Current slot allocation mechanism:
> > 1. Allocate one dedicated slot per user breakpoint.
> > 2. If the allocated vma is completely used, expand current vma.
> > 3. If we can't expand the vma, allocate a new vma.
>
>
> Say what?
>
> I see the text, but none of it makes any sense at all.
>

Yeah, there's not a lot of context there. I hope it will make more
sense if you read section 1.1 of Documentation/uprobes.txt (patch #6).
Or look at get_insn_slot() in kprobes, and understand that we're trying
to do something similar in uprobes, where the instruction copies have to
reside in the user address space of the probed process.

Jim
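
(To sketch how the pieces named above might fit together: a rough,
hypothetical flow for arming one probepoint for single-stepping out
of line. The prototypes are guesses based on the function names
listed in the patch description, not the actual signatures.)

#include <linux/mm_types.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* Assumed prototypes for the XOL allocator; the real ones may differ. */
struct xol_area;
struct xol_area *xol_get_area(struct mm_struct *mm);
void xol_put_area(struct xol_area *area);
unsigned long xol_get_insn_slot(struct xol_area *area);
void xol_free_insn_slot(struct xol_area *area, unsigned long vaddr);

/*
 * Take the per-process XOL area (creating its vma on first use),
 * grab a dedicated slot, and copy the original instruction into it
 * so the probed task can single-step it there.  Assumes it runs in
 * the context of the probed task (current->mm == mm).  Returns the
 * slot address, or 0 on failure.
 */
static unsigned long arm_ssol(struct mm_struct *mm, const u8 *insn,
			      int len)
{
	struct xol_area *area;
	unsigned long slot;

	area = xol_get_area(mm);
	if (!area)
		return 0;

	slot = xol_get_insn_slot(area);
	if (!slot) {
		xol_put_area(area);
		return 0;
	}

	if (copy_to_user((void __user *)slot, insn, len)) {
		xol_free_insn_slot(area, slot);
		xol_put_area(area);
		return 0;
	}
	return slot;
}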

2010-01-14 22:50:18

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation


On Thu, 2010-01-14 at 12:09 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> >
> > Uprobes Infrastructure enables users to dynamically establish
> > probepoints in user applications and collect information by executing
> > handler functions when the probepoints are hit.
> > Please refer to Documentation/uprobes.txt for more details.
> >
> > This patch provides the core implementation of uprobes.
> > This patch builds on utrace infrastructure.
> >
> > You need to follow this up with the uprobes patch for your
> > architecture.
>
> So all this is basically some glue between what you call ubp (the real
> userspace breakpoint stuff) and utrace? Or does it do more?
>

My reply in
http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/02483.html
addresses this.

Jim

2010-01-14 23:08:09

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 5/7] X86 Support for Uprobes

On Thu, 2010-01-14 at 12:13 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> > [PATCH] x86 support for Uprobes
>
> So uhm,.. HAVE_UPROBE is basically HAVE_UBP?

Certainly ALMOST all the architecture-specific stuff we've factored out
of the old uprobes resides in ubp. Because of how it exploits utrace,
uprobes also needs to know what signals to expect from breakpoint traps
and single-step traps. (E.g., the "breakpoint signal" in s390 is
SIGILL.)

I have no objection to moving BREAKPOINT_SIGNAL and SSTEP_SIGNAL
to .../asm/ubp.h, even though ubp doesn't actually use them. But until
we port ubp to other architectures, we won't know for sure whether we've
done a complete job of capturing all the arch-specific stuff there.

Jim

>
> > Signed-off-by: Jim Keniston <[email protected]>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/include/asm/uprobes.h | 27 +++++++++++++++++++++++++++
> > 2 files changed, 28 insertions(+)
> >
> > Index: new_uprobes.git/arch/x86/Kconfig
> > ===================================================================
> > --- new_uprobes.git.orig/arch/x86/Kconfig
> > +++ new_uprobes.git/arch/x86/Kconfig
> > @@ -51,6 +51,7 @@ config X86
> > select HAVE_KERNEL_LZMA
> > select HAVE_HW_BREAKPOINT
> > select HAVE_UBP
> > + select HAVE_UPROBES
> > select HAVE_ARCH_KMEMCHECK
> > select HAVE_USER_RETURN_NOTIFIER
> >
> > Index: new_uprobes.git/arch/x86/include/asm/uprobes.h
> > ===================================================================
> > --- /dev/null
> > +++ new_uprobes.git/arch/x86/include/asm/uprobes.h
> > @@ -0,0 +1,27 @@
> > +#ifndef _ASM_UPROBES_H
> > +#define _ASM_UPROBES_H
> > +/*
> > + * Userspace Probes (UProbes)
> > + * uprobes.h
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > + *
> > + * Copyright (C) IBM Corporation, 2008, 2009
> > + */
> > +#include <linux/signal.h>
> > +
> > +#define BREAKPOINT_SIGNAL SIGTRAP
> > +#define SSTEP_SIGNAL SIGTRAP
> > +#endif /* _ASM_UPROBES_H */
>
>

2010-01-16 23:48:41

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Quoting Peter Zijlstra <[email protected]>:

> On Fri, 2010-01-15 at 16:58 -0800, Jim Keniston wrote:
>> But here are some things to keep in mind about the
>> various approaches:
>>
>> 1. Single-stepping inline is easiest: you need to know very little about
>> the instruction set you're probing. But it's inadequate for
>> multithreaded apps.
>> 2. Single-stepping out of line solves the multithreading issue (as do #3
>> and #4), but requires more knowledge of the instruction set. (In
>> particular, calls, jumps, and returns need special care; as do
>> rip-relative instructions in x86_64.) I count 9 architectures that
>> support kprobes. I think most of these do SSOL.
>> 3. "Boosted" probes (where an appended jump instruction removes the need
>> for the single-step trap on many instructions) require even more
>> knowledge of the instruction set, and like SSOL, require XOL slots.
>> Right now, as far as I know, x86 is the only architecture with boosted
>> kprobes.
>> 4. Emulation removes the need for the XOL area, but requires pretty much
>> total knowledge of the instruction set. It's also a performance win for
>> architectures that can't do #3. I see kvm implemented on 4
>> architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
>> architectures to which uprobes (old uprobes, with ubp and xol bundled
>> in) has already been ported (though Intel hasn't been maintaining their
>> ia64 port).
>
> Right, so I was thinking a combination of 4 and execute from kernel
> space would be feasible. I would think most regular instructions are
> runnable from kernel space given that we provide the proper pt_regs
> environment.
>
> Although I just realize we need to fully emulate the address computation
> step for all memory writes, otherwise a wild userspace pointer might end
> up writing in your kernel image.

Correct.

>
> Also, don't we already need full knowledge of the instruction set in
> order to decode the instruction stream and find instruction boundaries.

Not really. For #3 (boosting), you need to know everything for #2,
plus be able to compute the length of each instruction -- which we can
now do for x86. To emulate an instruction (#4), you need to replicate
what it does, side-effects and all. The x86 instruction set seems to
be adding new floating-point instructions all the time, and I bet even
Masami doesn't know what they all do, but so far, they all seem to
adhere to the instruction-length rules encoded in Masami's instruction
decoder.

As you may have noted before, I think FP would be a special problem
for your approach. I'm not sure how folks would react to the idea of
executing FP instructions in kernel space. But emulating them is also
tough. There's an IEEE FP emulation package somewhere in one of the
Linux arch directories, but I'm not sure how precise it is, and
dropping even 1 bit of precision is unacceptable for many
applications, since such errors tend to grow in complex computations
employing many FP instructions.

Jim

2010-01-17 00:12:56

by Bryan Donlan

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, Jan 15, 2010 at 7:58 PM, Jim Keniston <[email protected]> wrote:

> 4. Emulation removes the need for the XOL area, but requires pretty much
> total knowledge of the instruction set. It's also a performance win for
> architectures that can't do #3. I see kvm implemented on 4
> architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
> architectures to which uprobes (old uprobes, with ubp and xol bundled
> in) has already been ported (though Intel hasn't been maintaining their
> ia64 port). So it sort of comes down to how objectionable the XOL vma
> (or page) really is.

On x86 at least, wouldn't one option be to run the instruction to
be emulated in CPL ('ring') 2, from an XOL page above the user-kernel
split, not accessible to userspace at CPL 3? Linux hasn't
traditionally used anything other than CPL 0 and CPL 3 (plus CPL 1 on
Xen), but it would seem to avoid many of the problems here - it's
invisible to normal userspace code and so doesn't pollute userspace
memory maps with kernel-private stuff, but since it's running at a
higher CPL than the kernel, we can still protect kernel memory and
protect against privileged instructions.

2010-01-17 14:37:44

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/16/2010 02:58 AM, Jim Keniston wrote:
>
> I hear (er, read) you. Emulation may turn out to be the answer for some
> architectures. But here are some things to keep in mind about the
> various approaches:
>
> 1. Single-stepping inline is easiest: you need to know very little about
> the instruction set you're probing. But it's inadequate for
> multithreaded apps.
> 2. Single-stepping out of line solves the multithreading issue (as do #3
> and #4), but requires more knowledge of the instruction set. (In
> particular, calls, jumps, and returns need special care; as do
> rip-relative instructions in x86_64.) I count 9 architectures that
> support kprobes. I think most of these do SSOL.
> 3. "Boosted" probes (where an appended jump instruction removes the need
> for the single-step trap on many instructions) require even more
> knowledge of the instruction set, and like SSOL, require XOL slots.
> Right now, as far as I know, x86 is the only architecture with boosted
> kprobes.
> 4. Emulation removes the need for the XOL area, but requires pretty much
> total knowledge of the instruction set. It's also a performance win for
> architectures that can't do #3. I see kvm implemented on 4
> architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
> architectures to which uprobes (old uprobes, with ubp and xol bundled
> in) has already been ported (though Intel hasn't been maintaining their
> ia64 port). So it sort of comes down to how objectionable the XOL vma
> (or page) really is.
>

The kvm emulator emulates only a subset of the x86 instruction set
(basically mmio instructions and commonly-used page-table manipulation
instructions, as well as some privileged instructions). It would take a
lot of work to expand it to be completely generic; and even then it will
fail if userspace uses an instruction set extension the kernel is not
aware of.

To me, boosted probes with a fallback to single-stepping seems to be the
better option by far.

--
error compiling committee.c: too many arguments to function

2010-01-17 14:40:46

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/15/2010 11:50 AM, Peter Zijlstra wrote:
> As previously stated, I think poking at a process's address space is an
> utter no-go.
>

Why not reserve an address space range for this, somewhere near the top
of memory? It doesn't have to be populated if it isn't used.

--
error compiling committee.c: too many arguments to function

2010-01-17 14:52:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote:
> On 01/15/2010 11:50 AM, Peter Zijlstra wrote:
> > As previously stated, I think poking at a process's address space is an
> > utter no-go.
> >
>
> Why not reserve an address space range for this, somewhere near the top
> of memory? It doesn't have to be populated if it isn't used.

Because I think poking at a process's address space like that is gross.
Also, if it's fixed size you're imposing artificial limits on the number
of possible probes.

2010-01-17 14:56:39

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/17/2010 04:52 PM, Peter Zijlstra wrote:
> On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote:
>
>> On 01/15/2010 11:50 AM, Peter Zijlstra wrote:
>>
>>> As previously stated, I think poking at a process's address space is an
>>> utter no-go.
>>>
>>>
>> Why not reserve an address space range for this, somewhere near the top
>> of memory? It doesn't have to be populated if it isn't used.
>>
> Because I think poking at a process's address space like that is gross.
>

If it's reserved, it's no longer the process' address space.

> Also, if it's fixed size you're imposing artificial limits on the number
> of possible probes.
>

Obviously we'll need a limit; a uprobe will also take kernel memory, and we
can't allow people to exhaust it.

--
error compiling committee.c: too many arguments to function

2010-01-17 14:59:55

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/17/2010 04:52 PM, Peter Zijlstra wrote:
> On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote:
>
>> On 01/15/2010 11:50 AM, Peter Zijlstra wrote:
>>
>>> As previously stated, I think poking at a process's address space is an
>>> utter no-go.
>>>
>>>
>> Why not reserve an address space range for this, somewhere near the top
>> of memory? It doesn't have to be populated if it isn't used.
>>
> Because I think poking at a process's address space like that is gross.
> Also, if it's fixed size you're imposing artificial limits on the number
> of possible probes.
>

btw, an alternative is to require the caller to provide the address
space for this. If the caller is in another process, we need to allow
it to play with the target's address space (i.e. mmap_process()). I
don't think uprobes justifies this by itself, but mmap_process() can be
very useful for sandboxing with seccomp.

--
error compiling committee.c: too many arguments to function

2010-01-17 15:01:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sun, 2010-01-17 at 16:56 +0200, Avi Kivity wrote:
> On 01/17/2010 04:52 PM, Peter Zijlstra wrote:

> > Also, if its fixed size you're imposing artificial limits on the number
> > of possible probes.
> >
>
> Obviously we'll need a limit; a uprobe will also take kernel memory, and we
> can't allow people to exhaust it.

Only if it's unprivileged; kernel and root should be able to place as
many probes as they like until the machine keels over.

2010-01-17 15:03:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sun, 2010-01-17 at 16:59 +0200, Avi Kivity wrote:
> On 01/17/2010 04:52 PM, Peter Zijlstra wrote:
> > On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote:
> >
> >> On 01/15/2010 11:50 AM, Peter Zijlstra wrote:
> >>
> >>> As previously stated, I think poking at a process's address space is an
> >>> utter no-go.
> >>>
> >>>
> >> Why not reserve an address space range for this, somewhere near the top
> >> of memory? It doesn't have to be populated if it isn't used.
> >>
> > Because I think poking at a process's address space like that is gross.
> > Also, if it's fixed size you're imposing artificial limits on the number
> > of possible probes.
> >
>
> btw, an alternative is to require the caller to provide the address
> space for this. If the caller is in another process, we need to allow
> it to play with the target's address space (i.e. mmap_process()). I
> don't think uprobes justifies this by itself, but mmap_process() can be
> very useful for sandboxing with seccomp.

mmap_process() sounds utterly gross, one process playing with another
process's address space.. yuck!

2010-01-17 19:34:29

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/17/2010 05:03 PM, Peter Zijlstra wrote:
>
>> btw, an alternative is to require the caller to provide the address
>> space for this. If the caller is in another process, we need to allow
>> it to play with the target's address space (i.e. mmap_process()). I
>> don't think uprobes justifies this by itself, but mmap_process() can be
>> very useful for sandboxing with seccomp.
>>
> mmap_process() sounds utterly gross, one process playing with another
> process's address space.. yuck!
>

This is debugging. We're playing with registers, we're playing with the
cpu, we're playing with memory contents. Why not the address space as well?

For seccomp, this really should be generalized. Run a system call on
behalf of another process, but don't let that process do anything to
affect it. I think Google is doing something clever with one thread in
seccomp mode and another unconstrained, but that's very hacky - you have
to stop the constrained thread so it can't interfere with the live one.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-18 07:23:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sat, 2010-01-16 at 18:48 -0500, Jim Keniston wrote:

> As you may have noted before, I think FP would be a special problem
> for your approach. I'm not sure how folks would react to the idea of
> executing FP instructions in kernel space. But emulating them is also
> tough. There's an IEEE FP emulation package somewhere in one of the
> Linux arch directories, but I'm not sure how precise it is, and
> dropping even 1 bit of precision is unacceptable for many
> applications, since such errors tend to grow in complex computations
> employing many FP instructions.

Well, we have kernel space using FP/MMX/SSE-like things; it's not hard if
you really need it, but in this case I think it's easier than normal,
because we'll just allow it to change the userspace state, since that
is exactly what we want it to do.

2010-01-18 07:37:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sat, 2010-01-16 at 19:12 -0500, Bryan Donlan wrote:
> On Fri, Jan 15, 2010 at 7:58 PM, Jim Keniston <[email protected]> wrote:
>
> > 4. Emulation removes the need for the XOL area, but requires pretty much
> > total knowledge of the instruction set. It's also a performance win for
> > architectures that can't do #3. I see kvm implemented on 4
> > architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
> > architectures to which uprobes (old uprobes, with ubp and xol bundled
> > in) has already been ported (though Intel hasn't been maintaining their
> > ia64 port). So it sort of comes down to how objectionable the XOL vma
> > (or page) really is.
>
> On x86 at least, wouldn't one option be to run the instruction to
> be emulated in CPL ('ring') 2, from an XOL page above the user-kernel
> split, not accessible to userspace at CPL 3? Linux hasn't
> traditionally used anything other than CPL 0 and CPL 3 (plus CPL 1 on
> Xen), but it would seem to avoid many of the problems here - it's
> invisible to normal userspace code and so doesn't pollute userspace
> memory maps with kernel-private stuff, but since it's running at a
> higher CPL than the kernel, we can still protect kernel memory and
> protect against privileged instructions.

Another option is to go play games with the RPL of the user data
segments when we load them. But yeah, something like this seems to
nicely deal with the protection issues.

2010-01-18 07:46:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sun, 2010-01-17 at 21:33 +0200, Avi Kivity wrote:
> On 01/17/2010 05:03 PM, Peter Zijlstra wrote:
> >
> >> btw, an alternative is to require the caller to provide the address
> >> space for this. If the caller is in another process, we need to allow
> >> it to play with the target's address space (i.e. mmap_process()). I
> >> don't think uprobes justifies this by itself, but mmap_process() can be
> >> very useful for sandboxing with seccomp.
> >>
> > mmap_process() sounds utterly gross, one process playing with another
> > process's address space.. yuck!
> >
>
> This is debugging. We're playing with registers, we're playing with the
> cpu, we're playing with memory contents. Why not the address space as well?

Because you want things to be as transparent as possible in order to
avoid heisenbugs. Sure we cannot avoid everything, but we should avoid
everything we possibly can.

Also, aside from the VDSO, we simply do not force-map things into address
spaces (and like I said before, I think the VDSO stinks for doing that),
and I think we don't want to create (more) precedents in this case.


2010-01-18 11:02:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 09:45 AM, Peter Zijlstra wrote:
>
>> This is debugging. We're playing with registers, we're playing with the
>> cpu, we're playing with memory contents. Why not the address space as well?
>>
> Because you want things to be as transparent as possible in order to
> avoid heisenbugs. Sure we cannot avoid everything, but we should avoid
> everything we possibly can.
>

If we reserve some address space, you don't add any heisenbugs (at
least, not any additional ones over emulation). Even if we don't,
address space layout randomization means we're not keeping the address
space layout constant between runs anyway.

> Also, aside from the VDSO, we simply do not force-map things into address
> spaces (and like I said before, I think the VDSO stinks for doing that),
> and I think we don't want to create (more) precedents in this case.
>

You've made it clear that you don't like it, but not why.

The kernel already manages the user's address space (except for
MAP_FIXED which is unreliable unless you've already reserved the address
space). I don't see why adding a vma for debugging is so horrible.

--
error compiling committee.c: too many arguments to function

2010-01-18 11:45:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
>
> You've made it clear that you don't like it, but not why.
>
> The kernel already manages the user's address space (except for
> MAP_FIXED which is unreliable unless you've already reserved the address
> space). I don't see why adding a vma for debugging is so horrible.

Well, the kernel only does what the user (and loader) tell it through
mmap(). Other than that we never (except this VDSO thing) inject vmas,
and I see no reason to start doing that now.

2010-01-18 11:46:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
> If we reserve some address space, you don't add any heisenbugs (at
> least, not any additional ones over emulation). Even if we don't,
> address space layout randomization means we're not keeping the address
> space layout constant between runs anyway.

Well, it still limits the number of probes to the reserved area. If you
want more you need to grow the area.. which then changes the state.

2010-01-18 12:01:48

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 01:44 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
>
>> You've made it clear that you don't like it, but not why.
>>
>> The kernel already manages the user's address space (except for
>> MAP_FIXED which is unreliable unless you've already reserved the address
>> space). I don't see why adding a vma for debugging is so horrible.
>>
> Well, the kernel only does what the user (and loader) tell it through
> mmap().

What I meant was that the kernel chooses the addresses (unless you go
the MAP_FIXED way). From the user's point of view, there is no change
in behaviour: the kernel picks an address. If the constraints have
changed (because we reserve a range), that doesn't affect the user.

> Other than that we never (except this VDSO thing) inject vmas,
> and I see no reason to start doing that now.
>

Maybe you place no value on uprobes. But people who debug userspace
likely will see a reason.

--
error compiling committee.c: too many arguments to function

2010-01-18 12:07:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
>
> Maybe you place no value on uprobes. But people who debug userspace
> likely will see a reason.

I do see value in uprobes, I just don't like it mucking about with the
address space. Nor does it appear required.

2010-01-18 12:10:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:06 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
>
>> Maybe you place no value on uprobes. But people who debug userspace
>> likely will see a reason.
>>
> I do see value in uprobes, I just don't like it mucking about with the
> address space. Nor does it appear required.
>

Well, the alternatives are very unappealing. Emulation and
single-stepping are going to be very slow compared to a couple of jumps.

--
error compiling committee.c: too many arguments to function

2010-01-18 12:13:29

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Hi Avi,

On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
>>> Maybe you place no value on uprobes. But people who debug userspace
>>> likely will see a reason.

On 01/18/2010 02:06 PM, Peter Zijlstra wrote:
>> I do see value in uprobes, I just don't like it mucking about with the
>> address space. Nor does it appear required.

On Mon, Jan 18, 2010 at 2:09 PM, Avi Kivity <[email protected]> wrote:
> Well, the alternatives are very unappealing. Emulation and single-stepping
> are going to be very slow compared to a couple of jumps.

So how big chunks of the address space are we talking here for uprobes?

2010-01-18 12:14:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:09 +0200, Avi Kivity wrote:
> On 01/18/2010 02:06 PM, Peter Zijlstra wrote:
> > On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
> >
> >> Maybe you place no value on uprobes. But people who debug userspace
> >> likely will see a reason.
> >>
> > I do see value in uprobes, I just don't like it mucking about with the
> > address space. Nor does it appear required.
> >
>
> Well, the alternatives are very unappealing. Emulation and
> single-stepping are going to be very slow compared to a couple of jumps.

With CPL2 or RPL on user segments the protection issue seems to be
manageable for running the instructions from kernel space.

2010-01-18 12:17:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:13 PM, Pekka Enberg wrote:
> So how big chunks of the address space are we talking here for uprobes?
>

That's for the authors to answer, but at a guess, 32 bytes per probe
(largest x86 instruction is 15 bytes), so 32 MB will give you a million
probes. That's a piece of cake for x86-64, probably harder to justify
for i386.

--
error compiling committee.c: too many arguments to function

2010-01-18 12:24:24

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:13 PM, Pekka Enberg wrote:
>> So how big chunks of the address space are we talking here for uprobes?

On Mon, Jan 18, 2010 at 2:17 PM, Avi Kivity <[email protected]> wrote:
> That's for the authors to answer, but at a guess, 32 bytes per probe
> (largest x86 instruction is 15 bytes), so 32 MB will give you a million
> probes. That's a piece of cake for x86-64, probably harder to justify for
> i386.

Yup, it's 32-bit that I worry about.

2010-01-18 12:24:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:17 +0200, Avi Kivity wrote:
> On 01/18/2010 02:13 PM, Pekka Enberg wrote:
> > So how big chunks of the address space are we talking here for uprobes?
> >
>
> That's for the authors to answer, but at a guess, 32 bytes per probe
> (largest x86 instruction is 15 bytes), so 32 MB will give you a million
> probes. That's a piece of cake for x86-64, probably harder to justify
> for i386.

Yeah, I'm aware of people turning off address space randomization to
gain more virtual space on i386, I'm pretty sure those folks aren't
going to be happy if we shrink it.

Let alone them trying to probe their app.

2010-01-18 12:37:49

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
>
>> Well, the alternatives are very unappealing. Emulation and
>> single-stepping are going to be very slow compared to a couple of jumps.
>>
> With CPL2 or RPL on user segments the protection issue seems to be
> manageable for running the instructions from kernel space.
>

CPL2 gives unrestricted access to the kernel address space; and RPL does
not affect page level protection. Segment limits don't work on x86-64.
But perhaps I missed something - these things are tricky.

It should be possible to translate the instruction into an address space
check, followed by the action, but that's still slower due to privilege
level switches.

--
error compiling committee.c: too many arguments to function

2010-01-18 12:44:25

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

* Avi Kivity <[email protected]> [2010-01-18 14:17:10]:

> On 01/18/2010 02:13 PM, Pekka Enberg wrote:
> >So how big chunks of the address space are we talking here for uprobes?
>
> That's for the authors to answer, but at a guess, 32 bytes per probe
> (largest x86 instruction is 15 bytes), so 32 MB will give you a
> million probes. That's a piece of cake for x86-64, probably harder
> to justify for i386.


On x86, each probe takes 16 bytes.
In the current implementation of XOL, the first hit of a breakpoint
requires us to allocate a page. If that page gets full with "active"
breakpoints, we expand / add a page. A bitmap keeps track of whether a
previously used breakpoint has been removed, so that its slot can be
reused. By active breakpoints, I refer to those that are inserted and
have been trapped at least once but not yet removed.

Jim did try a few other allocation techniques, but those that involved
slot stealing ended up needing locking. People who looked at that code
advised us to reduce the locking and keep the allocation simple (at
least for the first cut).
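
Roughly sketched (purely illustrative -- the names and details here are
not the actual patch code), the allocation amounts to a bitmap of
16-byte slots per XOL page:

	#include <linux/bitmap.h>
	#include <linux/errno.h>
	#include <linux/mm.h>

	/* Illustrative sketch only, not the real uprobes/XOL code. */
	#define UPROBE_SLOT_SIZE	16	/* per-probe slot on x86 */
	#define SLOTS_PER_PAGE		(PAGE_SIZE / UPROBE_SLOT_SIZE)

	struct xol_page {
		unsigned long	vaddr;	/* base address of the XOL page */
		unsigned long	bitmap[BITS_TO_LONGS(SLOTS_PER_PAGE)];
	};

	/* Find a free slot; on failure the caller expands/adds a page. */
	static long xol_alloc_slot(struct xol_page *p)
	{
		unsigned long slot;

		slot = find_first_zero_bit(p->bitmap, SLOTS_PER_PAGE);
		if (slot >= SLOTS_PER_PAGE)
			return -ENOMEM;
		set_bit(slot, p->bitmap);
		return p->vaddr + slot * UPROBE_SLOT_SIZE;
	}

	/* Removing a probe just returns its slot to the bitmap. */
	static void xol_free_slot(struct xol_page *p, unsigned long vaddr)
	{
		clear_bit((vaddr - p->vaddr) / UPROBE_SLOT_SIZE, p->bitmap);
	}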

--
Thanks and Regards
Srikar

>
> --
> error compiling committee.c: too many arguments to function
>

2010-01-18 12:51:13

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, Jan 18, 2010 at 2:44 PM, Srikar Dronamraju
<[email protected]> wrote:
> * Avi Kivity <[email protected]> [2010-01-18 14:17:10]:
>
>> On 01/18/2010 02:13 PM, Pekka Enberg wrote:
>> >So how big chunks of the address space are we talking here for uprobes?
>>
>> That's for the authors to answer, but at a guess, 32 bytes per probe
>> (largest x86 instruction is 15 bytes), so 32 MB will give you a
>> million probes. That's a piece of cake for x86-64, probably harder
>> to justify for i386.
>
> On x86, each probe takes 16 bytes.

And how many probes do we expect to be live at the same time in
real-world scenarios? I guess Avi's "one million" is more than enough?

2010-01-18 12:54:18

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:51 PM, Pekka Enberg wrote:
>
> And how many probes do we expected to be live at the same time in
> real-world scenarios? I guess Avi's "one million" is more than enough?
>

I don't think a user will ever come close to a million, but we can
expect some inflation from inlined functions (I don't know if uprobes
replicates such probes, but if it doesn't, it should).

--
error compiling committee.c: too many arguments to function

2010-01-18 12:57:59

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:51 PM, Pekka Enberg wrote:
>> And how many probes do we expected to be live at the same time in
>> real-world scenarios? I guess Avi's "one million" is more than enough?

Avi Kivity kirjoitti:
> I don't think a user will ever come close to a million, but we can
> expect some inflation from inlined functions (I don't know if uprobes
> replicates such probes, but if it doesn't, it should).

Right. I guess we're looking at a few megabytes of the address space for
normal scenarios, which doesn't seem too excessive.

However, as Peter pointed out, the bigger problem is that now we're
opening the door for other features to steal chunks of the address
space. And I think it's a legitimate worry that it's going to cause
problems for 32-bit in the future.

I don't like the idea but if the performance benefits are real (are
they?), maybe it's a worthwhile trade-off. Dunno.

Pekka

2010-01-18 13:00:59

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes

On Thu, Jan 14, 2010 at 01:29:09PM +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 13:23 +0100, Frederic Weisbecker wrote:
> >
> > I see, so what you suggest is to have the probe set up
> > as generic first. Then the process that activates it
> > becomes a consumer, right?
>
> Right, so either we have it always on, for things like ftrace,
>
> in which case the creation traverses rmap and installs the probes
> all existing mmap()s, and a mmap() hook will install it on all new
> ones.
>
> Or they're strictly consumer driver, like perf, in which case the act of
> enabling the event will install the probe (if its not there yet).
>


Looks like a good plan.

2010-01-18 13:05:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:53 +0200, Avi Kivity wrote:
> On 01/18/2010 02:51 PM, Pekka Enberg wrote:
> >
> > And how many probes do we expected to be live at the same time in
> > real-world scenarios? I guess Avi's "one million" is more than enough?
> >
>
> I don't think a user will ever come close to a million, but we can
> expect some inflation from inlined functions (I don't know if uprobes
> replicates such probes, but if it doesn't, it should).

That's up to the userspace creating the probes but yes, agreed.

2010-01-18 13:07:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 02:57 PM, Pekka Enberg wrote:
> On 01/18/2010 02:51 PM, Pekka Enberg wrote:
>>> And how many probes do we expected to be live at the same time in
>>> real-world scenarios? I guess Avi's "one million" is more than enough?
>
> Avi Kivity kirjoitti:
>> I don't think a user will ever come close to a million, but we can
>> expect some inflation from inlined functions (I don't know if uprobes
>> replicates such probes, but if it doesn't, it should).
>
> Right. I guess we're looking at few megabytes of the address space for
> normal scenarios which doesn't seem too excessive.
>
> However, as Peter pointed out, the bigger problem is that now we're
> opening the door for other features to steal chunks of the address
> space. And I think it's a legitimate worry that it's going to cause
> problems for 32-bit in the future.
>
> I don't like the idea but if the performance benefits are real (are
> they?), maybe it's a worthwhile trade-off. Dunno.

If uprobes can trace to a buffer in the process address space, I
think the win can be dramatic. Incidentally it will require injecting
even more vmas into a process.

Basically it means very low cost tracing, like the kernel tracers.

--
error compiling committee.c: too many arguments to function

2010-01-18 13:16:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote:
> On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
> >
> >> Well, the alternatives are very unappealing. Emulation and
> >> single-stepping are going to be very slow compared to a couple of jumps.
> >>
> > With CPL2 or RPL on user segments the protection issue seems to be
> > manageable for running the instructions from kernel space.
> >
>
> CPL2 gives unrestricted access to the kernel address space; and RPL does
> not affect page level protection. Segment limits don't work on x86-64.
> But perhaps I missed something - these things are tricky.

So setting RPL to 3 on the user segments allows access to kernel pages
just fine? How useful.. :/

> It should be possible to translate the instruction into an address space
> check, followed by the action, but that's still slower due to privilege
> level switches.

Well, if you manage to do the address validation you don't need the priv
level switch anymore, right?

Are the ins encodings sane enough to recognize mem parameters without
needing to know the actual ins?

How about using a hw-breakpoint to close the gap for the inline single
step? You could even re-insert the int3 lazily when you need the
hw-breakpoint again. It would consume one hw-breakpoint register for
each task/cpu that has probes though..
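
Roughly (purely illustrative sketch; the helpers and the uprobe fields
here are all hypothetical):

	/* int3 hit: run handlers, then step over the original inline. */
	void on_probe_hit(struct uprobe *u)
	{
		run_handlers(u);
		restore_original_insn(u->vaddr);
		arm_hw_breakpoint(u->vaddr + u->insn_len); /* one DR reg */
		/* resume; the original insn executes in place */
	}

	/* hw-breakpoint hit on the next insn: lazily re-arm the probe. */
	void on_hw_breakpoint_hit(struct uprobe *u)
	{
		disarm_hw_breakpoint();
		insert_int3(u->vaddr);
	}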

2010-01-18 13:34:10

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 03:15 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote:
>
>> On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
>>
>>>
>>>> Well, the alternatives are very unappealing. Emulation and
>>>> single-stepping are going to be very slow compared to a couple of jumps.
>>>>
>>>>
>>> With CPL2 or RPL on user segments the protection issue seems to be
>>> manageable for running the instructions from kernel space.
>>>
>>>
>> CPL2 gives unrestricted access to the kernel address space; and RPL does
>> not affect page level protection. Segment limits don't work on x86-64.
>> But perhaps I missed something - these things are tricky.
>>
> So setting RPL to 3 on the user segments allows access to kernel pages
> just fine? How useful.. :/
>

The further we stay away from segmentation, the better. Thankfully AMD
removed hardware task switching from x86-64 so we can't even think about
that.

>> It should be possible to translate the instruction into an address space
>> check, followed by the action, but that's still slower due to privilege
>> level switches.
>>
> Well, if you manage to do the address validation you don't need the priv
> level switch anymore, right?
>

Right.

> Are the ins encodings sane enough to recognize mem parameters without
> needing to know the actual ins?
>

No. You need to know whether the instruction accesses memory or not.

Look at the tables at the beginning of arch/x86/kvm/emulate.c. Opcodes
marked with ModRM, BitOp, MemAbs, String, Stack are all different styles
of memory instructions. You need to know the operand size for the edge
cases. And there are probably a few special cases in the code.

> How about using a hw-breakpoint to close the gap for the inline single
> step? You could even re-insert the int3 lazily when you need the
> hw-breakpoint again. It would consume one hw-breakpoint register for
> each task/cpu that has probes though..
>

If you have more than four threads, it breaks, no? And you need an IPI
each time you hit the breakpoint.

Ultimately I'd like to see the breakpoint avoided as well: use a jump to
the XOL area and trace in ~20 cycles instead of ~1000.

--
error compiling committee.c: too many arguments to function

2010-01-18 13:34:57

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, Jan 18, 2010 at 02:15:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote:
> > On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
> > >
> > >> Well, the alternatives are very unappealing. Emulation and
> > >> single-stepping are going to be very slow compared to a couple of jumps.
> > >>
> > > With CPL2 or RPL on user segments the protection issue seems to be
> > > manageable for running the instructions from kernel space.
> > >
> >
> > CPL2 gives unrestricted access to the kernel address space; and RPL does
> > not affect page level protection. Segment limits don't work on x86-64.
> > But perhaps I missed something - these things are tricky.
>
> So setting RPL to 3 on the user segments allows access to kernel pages
> just fine? How useful.. :/
>
> > It should be possible to translate the instruction into an address space
> > check, followed by the action, but that's still slower due to privilege
> > level switches.
>
> Well, if you manage to do the address validation you don't need the priv
> level switch anymore, right?
>
> Are the ins encodings sane enough to recognize mem parameters without
> needing to know the actual ins?
>
> How about using a hw-breakpoint to close the gap for the inline single
> step? You could even re-insert the int3 lazily when you need the
> hw-breakpoint again. It would consume one hw-breakpoint register for
> each task/cpu that has probes though..
>

Hw-breakpoint registers are a very scarce resource; sometimes all that
we might have is just one in the system (like older PPC64 with 1 IABR).
If one process/thread consumes it, then all other contenders (from
both kernel and user-space) are prevented from acquiring it.

Not to mention that some processors have no support for instruction
breakpoints at all.

Thanks,
K.Prasad

2010-01-18 13:35:14

by Mark Wielaard

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:53 +0200, Avi Kivity wrote:
> On 01/18/2010 02:51 PM, Pekka Enberg wrote:
> >
> > And how many probes do we expected to be live at the same time in
> > real-world scenarios? I guess Avi's "one million" is more than enough?
> >
> I don't think a user will ever come close to a million, but we can
> expect some inflation from inlined functions (I don't know if uprobes
> replicates such probes, but if it doesn't, it should).

SystemTap by default places probes on all instances of an inlined
function. It is still hard to get to a million probes though.
$ stap -v -l 'process("/usr/bin/emacs").function("*")'
[...]
Pass 2: analyzed script: 4359 probe(s)

You can try probing all statements (for every function, in every file,
on every line of source code), but even that only adds up to ten
thousands of probes:
$ stap -v -l 'process("/usr/bin/emacs").statement("*@*:*")'
[...]
Pass 2: analyzed script: 39603 probe(s)

So a million is pretty far out, even if you add larger programs and all
the shared libraries they are using.

As Srikar said, the current allocation technique is the simplest you can
do: one xol slot for each uprobe. But there are other techniques that
you can use. Theoretically you only need an xol slot for each thread of
a process that simultaneously hits a uprobe instance. That requires a
bit more bookkeeping. The variant of uprobes that systemtap uses at the
moment does that. But the locking in that case is pretty tricky, so it
seemed easier to first get the code with the simplest xol allocation
technique upstream. But if you do that, then you can use a very small
xol area to support millions of uprobes and only have to expand it when
there are hundreds of threads in a process all hitting the probes
simultaneously.

Cheers,

Mark

Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, Jan 18, 2010 at 02:13:25PM +0200, Pekka Enberg wrote:
> Hi Avi,
>
> On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
> >>> Maybe you place no value on uprobes. But people who debug userspace
> >>> likely will see a reason.
>
> On 01/18/2010 02:06 PM, Peter Zijlstra wrote:
> >> I do see value in uprobes, I just don't like it mucking about with the
> >> address space. Nor does it appear required.
>
> On Mon, Jan 18, 2010 at 2:09 PM, Avi Kivity <[email protected]> wrote:
> > Well, the alternatives are very unappealing. Emulation and single-stepping
> > are going to be very slow compared to a couple of jumps.
>
> So how big chunks of the address space are we talking here for uprobes?

As Srikar mentioned, the least we start with is 1 page. Though you can
have as many probes as you want, there are certain optimizations we can
do, depending on the most common use cases.

For example, if you consider the start of a routine to be the most
commonly traced location, most routines in a binary would generally
start with the same instruction (say push %ebp), and we can refcount a
slot with that instruction to be used for all probes of the same
instruction.
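
A rough sketch of that refcounting idea (illustrative only; the lookup
and allocation helpers are hypothetical):

	#include <linux/types.h>

	#define UINSN_BYTES	16	/* matches the 16-byte slot */

	struct shared_slot {
		u8		insn[UINSN_BYTES]; /* copied instruction */
		unsigned long	vaddr;		   /* its XOL slot */
		atomic_t	refcount;
	};

	/* Probes with an identical copied instruction share one slot. */
	static unsigned long get_insn_slot(u8 *insn, int len)
	{
		struct shared_slot *s = find_slot_by_insn(insn, len);

		if (s) {
			atomic_inc(&s->refcount);
			return s->vaddr;
		}
		return alloc_new_slot(insn, len);
	}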

Ananth

2010-01-18 15:59:13

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Jim Keniston wrote:
> Not really. For #3 (boosting), you need to know everything for #2,
> plus be able to compute the length of each instruction -- which we can
> now do for x86. To emulate an instruction (#4), you need to replicate
> what it does, side-effects and all. The x86 instruction set seems to
> be adding new floating-point instructions all the time, and I bet even
> Masami doesn't know what they all do, but so far, they all seem to
> adhere to the instruction-length rules encoded in Masami's instruction
> decoder.

Actually, the current x86 decoder doesn't support FP (x87) instructions
(even though it already supports AVX). But I think it's not so hard to
add them.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-18 16:53:08

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/18/2010 05:43 PM, Ananth N Mavinakayanahalli wrote:
>>
>>> Well, the alternatives are very unappealing. Emulation and single-stepping
>>> are going to be very slow compared to a couple of jumps.
>>>
>> So how big chunks of the address space are we talking here for uprobes?
>>
> As Srikar mentioned, the least we start with is 1 page. Though you can
> have as many probes as you want, there are certain optimizations we can
> do, depending on the most common usecases.
>
> For eg., if you'd consider the start of a routine to be the most
> commonly traced location, most routines in a binary would generally
> start with the same instruction (say push %ebp), and we can refcount a
> slot with that instruction to be used for all probes of the same
> instruction.
>

But then you can't follow the instruction with a jump back to the code...

--
error compiling committee.c: too many arguments to function

Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, Jan 18, 2010 at 06:52:32PM +0200, Avi Kivity wrote:
> On 01/18/2010 05:43 PM, Ananth N Mavinakayanahalli wrote:
>>>
>>>> Well, the alternatives are very unappealing. Emulation and single-stepping
>>>> are going to be very slow compared to a couple of jumps.
>>>>
>>> So how big chunks of the address space are we talking here for uprobes?
>>>
>> As Srikar mentioned, the least we start with is 1 page. Though you can
>> have as many probes as you want, there are certain optimizations we can
>> do, depending on the most common usecases.
>>
>> For eg., if you'd consider the start of a routine to be the most
>> commonly traced location, most routines in a binary would generally
>> start with the same instruction (say push %ebp), and we can refcount a
>> slot with that instruction to be used for all probes of the same
>> instruction.
>>
>
> But then you can't follow the instruction with a jump back to the code...

Right. This will work only for the non-boosted case, where
single-stepping is mandatory. I guess the trade-off is vma space versus
speed.

Ananth

2010-01-18 19:21:41

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 10:58 -0500, Masami Hiramatsu wrote:
> Jim Keniston wrote:
> > Not really. For #3 (boosting), you need to know everything for #2,
> > plus be able to compute the length of each instruction -- which we can
> > now do for x86. To emulate an instruction (#4), you need to replicate
> > what it does, side-effects and all. The x86 instruction set seems to
> > be adding new floating-point instructions all the time, and I bet even
> > Masami doesn't know what they all do, but so far, they all seem to
> > adhere to the instruction-length rules encoded in Masami's instruction
> > decoder.
>
> Actually, current x86 decoder doesn't support FP(x87) instructions.(even
> it already supported AVX) But I think it's not so hard to add it.
>

At one point I verified that it worked for all the x87 instructions in
libm:
https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html
I'm pretty sure I tested mmx instructions as well. But I guess this was
before you rearranged the opcode tables.

Yeah, it wouldn't be hard to add back in, at least for purposes of
computing instruction lengths.

Jim

2010-01-18 19:50:06

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:34 +0100, Mark Wielaard wrote:
> On Mon, 2010-01-18 at 14:53 +0200, Avi Kivity wrote:
> > On 01/18/2010 02:51 PM, Pekka Enberg wrote:
> > >
> > > And how many probes do we expected to be live at the same time in
> > > real-world scenarios? I guess Avi's "one million" is more than enough?
> > >
> > I don't think a user will ever come close to a million, but we can
> > expect some inflation from inlined functions (I don't know if uprobes
> > replicates such probes, but if it doesn't, it should).
>
> SystemTap by default places probes on all instances of an inlined
> function. It is still hard to get to a million probes though.
> $ stap -v -l 'process("/usr/bin/emacs").function("*")'
> [...]
> Pass 2: analyzed script: 4359 probe(s)
>
> You can try probing all statements (for every function, in every file,
> on every line of source code), but even that only adds up to ten
> thousands of probes:
> $ stap -v -l 'process("/usr/bin/emacs").statement("*@*:*")'
> [...]
> Pass 2: analyzed script: 39603 probe(s)
>
> So a million is pretty far out, even if you add larger programs and all
> the shared libraries they are using.

Thanks, Mark. One correction, below.

>
> As Srikar said the current allocation technique is the simplest you can
> do, one xol slot for each uprobe. But there are other techniques that
> you can use. Theoretically you only need a xol slot for each thread of a
> process that simultaneously hits a uprobe instance. That requires a bit
> more bookkeeping. The variant of uprobes that systemtap uses at the
> moment does that.

Actually, it's per-probepoint, with a fixed number of slots. If the
probepoint you just hit doesn't have a slot, and none are free, you
steal a slot from another probepoint. Yeah, it's messy.

We considered allocating slots per-thread, hoping to make it basically
lockless, but that way there's more likely to be constant scribbling on
the XOL area, as a thread with n slots cycles through n+m probepoints.
And of course, it gets dicey as the process clones more threads.

I guess the point is, there are a lot of ways to allocate slots, and we
haven't found the perfect algorithm yet, even if you accept the
existence of (and need for) the XOL area. Keep the ideas coming.

> But the locking in that case is pretty tricky, so it
> seemed easier to first get the code with the simplest xol allocation
> technique upstream. But if you do that than you can use a very small xol
> area to support millions of uprobes and only have to expand it when
> there are hundreds of threads in a process all hitting the probes
> simultaneously.
>
> Cheers,
>
> Mark
>

Jim

2010-01-18 21:23:13

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Jim Keniston wrote:
> On Mon, 2010-01-18 at 10:58 -0500, Masami Hiramatsu wrote:
>> Jim Keniston wrote:
>>> Not really. For #3 (boosting), you need to know everything for #2,
>>> plus be able to compute the length of each instruction -- which we can
>>> now do for x86. To emulate an instruction (#4), you need to replicate
>>> what it does, side-effects and all. The x86 instruction set seems to
>>> be adding new floating-point instructions all the time, and I bet even
>>> Masami doesn't know what they all do, but so far, they all seem to
>>> adhere to the instruction-length rules encoded in Masami's instruction
>>> decoder.
>>
>> Actually, current x86 decoder doesn't support FP(x87) instructions.(even
>> it already supported AVX) But I think it's not so hard to add it.
>>
>
> At one point I verified that it worked for all the x87 instructions in
> libm:
> https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html
> I'm pretty sure I tested mmx instructions as well. But I guess this was
> before you rearranged the opcode tables.
>
> Yeah, it wouldn't be hard to add back in, at least for purposes of
> computing instruction lengths.

objdump -d /lib/libm.so.6 | awk -f arch/x86/tools/distill.awk | ./test_get_len
Succeed: decoded and checked 37198 instructions

Hmm, yeah, that's already supported :-D.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-18 22:16:15

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, 2010-01-18 at 14:57 +0200, Pekka Enberg wrote:
> On 01/18/2010 02:51 PM, Pekka Enberg wrote:
> >> And how many probes do we expected to be live at the same time in
> >> real-world scenarios? I guess Avi's "one million" is more than enough?
>
> Avi Kivity kirjoitti:
> > I don't think a user will ever come close to a million, but we can
> > expect some inflation from inlined functions (I don't know if uprobes
> > replicates such probes, but if it doesn't, it should).
>
> Right. I guess we're looking at few megabytes of the address space for
> normal scenarios which doesn't seem too excessive.
>
> However, as Peter pointed out, the bigger problem is that now we're
> opening the door for other features to steal chunks of the address
> space. And I think it's a legitimate worry that it's going to cause
> problems for 32-bit in the future.
>
> I don't like the idea but if the performance benefits are real (are
> they?),

Based on what seems to be the closest thing to an apples-to-apples
comparison -- counting the number of calls to a specified function --
uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c.
And of course, uprobes provides much, much more flexibility, appears to
scale better, and works with multithreaded apps.

Likewise, FWIW, utrace is more than 10x faster than strace -c in
counting system calls.

> maybe it's a worthwhile trade-off. Dunno.
>
> Pekka

Jim

2010-01-19 08:08:20

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/19/2010 12:15 AM, Jim Keniston wrote:
>
>> I don't like the idea but if the performance benefits are real (are
>> they?),
>>
> Based on what seems to be the closest thing to an apples-to-apples
> comparison -- counting the number of calls to a specified function --
> uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c.
> And of course, uprobes provides much, much more flexibility, appears to
> scale better, and works with multithreaded apps.
>
> Likewise, FWIW, utrace is more than 10x faster than strace -c in
> counting system calls.
>
>

This is still with a kernel entry, yes? Do you have plans for a variant
that's completely in userspace?

--
error compiling committee.c: too many arguments to function

2010-01-19 17:48:17

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Tue, 2010-01-19 at 10:07 +0200, Avi Kivity wrote:
> On 01/19/2010 12:15 AM, Jim Keniston wrote:
> >
> >> I don't like the idea but if the performance benefits are real (are
> >> they?),
> >>
> > Based on what seems to be the closest thing to an apples-to-apples
> > comparison -- counting the number of calls to a specified function --
> > uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c.
> > And of course, uprobes provides much, much more flexibility, appears to
> > scale better, and works with multithreaded apps.
> >
> > Likewise, FWIW, utrace is more than 10x faster than strace -c in
> > counting system calls.
> >
> >
>
> This is still with a kernel entry, yes?

Yes, this involves setting a breakpoint and trapping into the kernel
when it's hit. The 6-7x figure is with the current 2-trap approach
(breakpoint, single-step). Boosting could presumably make that more
like 12-14x.

> Do you have plans for a variant
> that's completely in userspace?

I don't know of any such plans, but I'd be interested to read more of
your thoughts here. As I understand it, you've suggested replacing the
probed instruction with a jump into an instrumentation vma (the XOL
area, or something similar). Masami has demonstrated -- through his
djprobes enhancement to kprobes -- that this can be done for many x86
instructions.

What does the code in the jumped-to vma do? Is the instrumentation code
that corresponds to the uprobe handlers encoded in an ad hoc .so?

BTW, when some people say "completely in userspace," they mean something
like ptrace, where the kernel is still heavily involved but the
instrumentation code runs in user space. The ubp layer is intended to
support that model as well. In our various implementations of the XOL
vma/address area, however, the XOL area is either created on exec or
created/expanded only by the probed process.

Jim

2010-01-19 18:06:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote:
> > Do you have plans for a variant
> > that's completely in userspace?
>
> I don't know of any such plans, but I'd be interested to read more of
> your thoughts here. As I understand it, you've suggested replacing the
> probed instruction with a jump into an instrumentation vma (the XOL
> area, or something similar). Masami has demonstrated -- through his
> djprobes enhancement to kprobes -- that this can be done for many x86
> instructions.
>
> What does the code in the jumped-to vma do? Is the instrumentation code
> that corresponds to the uprobe handlers encoded in an ad hoc .so?


Once the instrumentation is requested by a process other than the
instrumented one, it looks impossible to set a uprobe without at least
minimal voluntary collaboration from the instrumented process
(events sent through IPC or whatever). So that looks too limited;
it is no longer a true dynamic uprobe.

2010-01-20 06:36:26

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

* Frederic Weisbecker <[email protected]> [2010-01-19 19:06:12]:

> On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote:
> >
> > What does the code in the jumped-to vma do? Is the instrumentation code
> > that corresponds to the uprobe handlers encoded in an ad hoc .so?
>
>
> Once the instrumentation is requested by a process that is not the
> instrumented one, this looks impossible to set a uprobe without a
> minimal voluntary collaboration from the instrumented process
> (events sent through IPC or whatever). So that looks too limited,
> this is not anymore a true dynamic uprobe.

I don't see a case where the thread being debugged refuses to place a
probe unless the process is exiting. The traced process doesn't decide
whether it wants to be probed or not. There could be a slight delay from
the time the tracer requests the probe to the time the probe is placed,
but this delay affects only the tracer and the tracee. This is in
contrast to, say, stop_machine, where the threads of other applications
are also affected.

2010-01-20 09:44:08

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/19/2010 07:47 PM, Jim Keniston wrote:
>
>> This is still with a kernel entry, yes?
>>
> Yes, this involves setting a breakpoint and trapping into the kernel
> when it's hit. The 6-7x figure is with the current 2-trap approach
> (breakpoint, single-step). Boosting could presumably make that more
> like 12-14x.
>

A trap is IIRC ~1000 cycles; we can reduce this to ~50 (totally
negligible from the executed code's point of view).

>> Do you have plans for a variant
>> that's completely in userspace?
>>
> I don't know of any such plans, but I'd be interested to read more of
> your thoughts here. As I understand it, you've suggested replacing the
> probed instruction with a jump into an instrumentation vma (the XOL
> area, or something similar). Masami has demonstrated -- through his
> djprobes enhancement to kprobes -- that this can be done for many x86
> instructions.
>
> What does the code in the jumped-to vma do?

1. Write a trace entry into shared memory, trap into the kernel on overflow.
2. Trap if a condition is satisfied (fast watchpoint implementation).
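
For (1), sketched from the userspace side (everything here is
hypothetical, just to show the shape of it):

	struct trace_entry { unsigned long ip, arg; };

	struct trace_buf {
		unsigned long		head;	/* next free entry */
		unsigned long		size;	/* capacity in entries */
		struct trace_entry	entries[];
	};

	static void trace_record(struct trace_buf *tb,
				 unsigned long ip, unsigned long arg)
	{
		if (tb->head == tb->size)
			trace_overflow_trap();	/* trap into the kernel,
						   which drains the buffer
						   and resets head */
		tb->entries[tb->head].ip = ip;
		tb->entries[tb->head].arg = arg;
		tb->head++;
	}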

> Is the instrumentation code
> that corresponds to the uprobe handlers encoded in an ad hoc .so?
>

Looks like a good idea, but it doesn't matter much to me.

--
error compiling committee.c: too many arguments to function

2010-01-20 09:58:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote:
> 1. Write a trace entry into shared memory, trap into the kernel on overflow.
> 2. Trap if a condition is satisfied (fast watchpoint implementation).

So now you want to consume more of a process' address space to store
trace data as well? Not to mention that that process could wreck the
trace data rendering it utterly unreliable.

2010-01-20 10:45:47

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

> >
> >What does the code in the jumped-to vma do?
>
> 1. Write a trace entry into shared memory, trap into the kernel on overflow.
> 2. Trap if a condition is satisfied (fast watchpoint implementation).
>
> >Is the instrumentation code
> >that corresponds to the uprobe handlers encoded in an ad hoc .so?
>
> Looks like a good idea, but it doesn't matter much to me.
>

That looks to be a nice idea. We should certainly look into this
possibility. However can we look at this option probably a little later?

Our plan was to do one step at a time i.e have the basic uprobes in
first and target the booster (i.e jump to the next instruction without
the need for single-stepping next).

We could look at this option of using jump instead of int3 after we are
done with the booster. Hope that's okay.

--
Thanks and Regards
Srikar

2010-01-20 10:52:06

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Wed, Jan 20, 2010 at 12:06:20PM +0530, Srikar Dronamraju wrote:
> * Frederic Weisbecker <[email protected]> [2010-01-19 19:06:12]:
>
> > On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote:
> > >
> > > What does the code in the jumped-to vma do? Is the instrumentation code
> > > that corresponds to the uprobe handlers encoded in an ad hoc .so?
> >
> >
> > Once the instrumentation is requested by a process that is not the
> > instrumented one, this looks impossible to set a uprobe without a
> > minimal voluntary collaboration from the instrumented process
> > (events sent through IPC or whatever). So that looks too limited,
> > this is not anymore a true dynamic uprobe.
>
> I dont see a case where the thread being debugged refuses to place a
> probe unless the process is exiting. The traced process doesnt decide
> if it wants to be probed or not. There could be a slight delay from the
> time the tracer requested to the time the probe is placed. But this
> delay in only affecting the tracer and the tracee. This is in contract
> to say stop_machine where the threads of other applications are also
> affected.


I did not think about that kind of trace point inserted in shared
memory. I was just confused :)

2010-01-20 12:23:11

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/20/2010 11:57 AM, Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote:
>
>> 1. Write a trace entry into shared memory, trap into the kernel on overflow.
>> 2. Trap if a condition is satisfied (fast watchpoint implementation).
>>
> So now you want to consume more of a process' address space to store
> trace data as well?

Yes. I know I'm bad.

> Not to mention that that process could wreck the
> trace data rendering it utterly unreliable.
>

It could, but it also might not. Are we going to deny high performance
tracing to users just because it doesn't work in all cases?

Note this applies to any kind of monitoring or debugging technology. A
process can be influenced by the debugger and render any debug info you
get out of it unreliable. One non-timing example is a process using a
checksum of its text as an input to some algorithm.

--
error compiling committee.c: too many arguments to function

2010-01-20 12:24:38

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/20/2010 12:45 PM, Srikar Dronamraju wrote:
>>> What does the code in the jumped-to vma do?
>>>
>> 1. Write a trace entry into shared memory, trap into the kernel on overflow.
>> 2. Trap if a condition is satisfied (fast watchpoint implementation).
>>
> That looks to be a nice idea. We should certainly look into this
> possibility. However can we look at this option probably a little later?
>
> Our plan was to do one step at a time i.e have the basic uprobes in
> first and target the booster (i.e jump to the next instruction without
> the need for single-stepping next).
>
> We could look at this option of using jump instead of int3 after we are
> done with the booster. Hope that's okay.
>

I'm all for incremental development and merging, as long as we keep the
interfaces flexible enough for the future.

--
error compiling committee.c: too many arguments to function

2010-01-20 15:58:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Mon, Jan 18, 2010 at 02:15:51PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote:
> > On 01/18/2010 02:14 PM, Peter Zijlstra wrote:
> > >
> > >> Well, the alternatives are very unappealing. Emulation and
> > >> single-stepping are going to be very slow compared to a couple of jumps.
> > >>
> > > With CPL2 or RPL on user segments the protection issue seems to be
> > > manageable for running the instructions from kernel space.
> > >
> >
> > CPL2 gives unrestricted access to the kernel address space; and RPL does
> > not affect page level protection. Segment limits don't work on x86-64.
> > But perhaps I missed something - these things are tricky.
>
> So setting RPL to 3 on the user segments allows access to kernel pages
> just fine? How useful.. :/
>
> > It should be possible to translate the instruction into an address space
> > check, followed by the action, but that's still slower due to privilege
> > level switches.
>
> Well, if you manage to do the address validation you don't need the priv
> level switch anymore, right?
>

It also starts becoming very x86-centric though, doesn't it? It might
kick other ports later.

What is there at the moment is storing the copied instructions in a VMA.
The most unpalatable part of that to me is that it's visible to
userspace, probably via /proc/; I didn't check, but I hope a
munmap() from userspace cannot delete it.

What the VMA has going for it is that it *appears* to be easier to port to
other architectures than the alternatives, certainly easier to handle than
instruction emulation.

> Are the ins encodings sane enough to recognize mem parameters without
> needing to know the actual ins?
>
> How about using a hw-breakpoint to close the gap for the inline single
> step? You could even re-insert the int3 lazily when you need the
> hw-breakpoint again. It would consume one hw-breakpoint register for
> each task/cpu that has probes though..
>

This feels very racy. Along with that, making this sort of change
was considered a risky venture on x86 and needed strong verification from
elsewhere (http://lkml.org/lkml/2010/1/12/300). There are probably similar
concerns on other architectures that would make a reliable port difficult.

Right now the approach is with VMAs. The alternatives are

1. reserved XOL page (similar disadvantages to the VMA)
2. emulated instructions
   This is an emulation bug waiting to happen in my opinion and makes
   porting uprobes a significantly more difficult undertaking than
   either the XOL-VMA or XOL-page approach
3. XOL page in kernel space available at a different CPL
   This assumes all target architectures have a usable privilege
   ring which may be the case. However, I would guess that it
   is going to perform worse than the current approach because
   of the change in privilege level. No idea what the cost of
   a privilege level change is, but I doubt it's free
4. Boosted probes (arch-specific, apparently only x86 does this for
   kprobes)

As unpalatable as the VMA is, I am failing to see why it's not a
reasonable starting point, with the understanding that 2 or 3 would be
implemented in the future, after the other architecture ports are in
place and both the reliability and the performance of the options can
be measured.

There would appear to be two classes of application that might suffer
from the VMA. The first needs absolutely every single ounce of address
space. The second introspects itself via /proc/self/maps and makes
decisions based on that. The first is unfortunate but should be a limited
number of use cases. The second could be fudged by simply not exporting
the information via /proc.

I'm of the opinion it would be reasonable to let the VMA go ahead, look
at the ports for the other architectures, and revisit options 2 and 3
above to see if the VMA can really be removed without a performance or
reliability penalty.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-01-20 18:31:50

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Jim Keniston <[email protected]> writes:
>
> I don't know of any such plans, but I'd be interested to read more of
> your thoughts here. As I understand it, you've suggested replacing the
> probed instruction with a jump into an instrumentation vma (the XOL
> area, or something similar). Masami has demonstrated -- through his
> djprobes enhancement to kprobes -- that this can be done for many x86
> instructions.

The big problem when doing this in user space is that for 64bit
it has to be within 2GB of the probed code, otherwise you would
need to rewrite the instruction to not use any rip relative addressing,
which can be rather complicated (needs registers, but the instruction
might already use them, so you would need a register allocator/spilling etc.)

And that 2GB can be anywhere in the address space for shared
libraries, which might well be already used. A lot of programs
need large VM areas without holes.

Also I personally would be uncomfortable letting the instruction
decoder be used by unprivileged code. Who knows how
many buffer overflows it has?

In general the trend has also been to make traps faster in the CPU;
make sure you're not optimizing for some old CPU here.

-Andi
--
[email protected] -- Speaking for myself only.

2010-01-20 18:33:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Peter Zijlstra <[email protected]> writes:
>
> With CPL2 or RPL on user segments the protection issue seems to be
> manageable for running the instructions from kernel space.

Nope -- it doesn't work on 64bit and even on 32bit can have large
costs on some CPUs.

Also designing 32bit only features in 2010 would seem rather ....
unfortunate.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-20 19:32:14

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

Frederic Weisbecker wrote:
> On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote:
>>> Do you have plans for a variant
>>> that's completely in userspace?
>>
>> I don't know of any such plans, but I'd be interested to read more of
>> your thoughts here. As I understand it, you've suggested replacing the
>> probed instruction with a jump into an instrumentation vma (the XOL
>> area, or something similar). Masami has demonstrated -- through his
>> djprobes enhancement to kprobes -- that this can be done for many x86
>> instructions.
>>
>> What does the code in the jumped-to vma do? Is the instrumentation code
>> that corresponds to the uprobe handlers encoded in an ad hoc .so?
>
>
> Once the instrumentation is requested by a process that is not the
> instrumented one, this looks impossible to set a uprobe without a
> minimal voluntary collaboration from the instrumented process
> (events sent through IPC or whatever). So that looks too limited,
> this is not anymore a true dynamic uprobe.

Agreed. Since uprobe handlers must run in the kernel, we need to jump
into kernel space anyway. A "booster" (which just skips the
single-stepping (trap) exception) may be useful for improving uprobe
performance.
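
For reference, a boosted XOL slot would be laid out roughly like this
(sketch only):

	/*
	 * [ copied original instruction ]  <- single-stepped today
	 * [ jmp <probed addr + insn len> ] <- booster: jump straight
	 *                                     back, skipping the second
	 *                                     (single-step) trap
	 */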

And also, as Andi said, using a jump instead of int3 in userspace has a
2GB address-space limitation. That's not a problem inside the kernel,
but it is a big problem in userspace.

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-20 19:34:43

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Wed, 2010-01-20 at 19:31 +0100, Andi Kleen wrote:
> Jim Keniston <[email protected]> writes:
> >
> > I don't know of any such plans, but I'd be interested to read more of
> > your thoughts here. As I understand it, you've suggested replacing the
> > probed instruction with a jump into an instrumentation vma (the XOL
> > area, or something similar). Masami has demonstrated -- through his
> > djprobes enhancement to kprobes -- that this can be done for many x86
> > instructions.
>
> The big problem when doing this in user space is that for 64bit
> it has to be within 2GB of the probed code, otherwise you would
> need to rewrite the instruction to not use any rip relative addressing,
> which can be rather complicated (needs registers, but the instruction
> might already use them, so you would need a register allocator/spilling etc.)

I'm probably telling you stuff you already know, but...

Re: jumps longer than 2GB: The following 14-byte sequence seems to work:
jmpq *(%rip)
.quad next_insn
where next_insn is the address of the instruction to which we want to
jump. We'd need this for boosting, anyway -- to jump from the XOL area
back to the probed instruction stream.

I think djprobes inserts a 5-byte jump at the probepoint; I don't know
whether a 14-byte jump would introduce new difficulties.

Re: rewriting instructions that use rip-relative addressing. We do that
now. See handle_riprel_insn() in patch #2. (As far as we can tell, it
works, but we'd appreciate your review of it.)

>
> And that 2GB can be anywhere in the address space for shared
> libraries, which might well be already used. A lot of programs
> need large VM areas without holes.
>
> Also I personally would be unconfortable to let the instruction
> decoder be used by unpriviledged code. Who knows how
> many buffer overflows it has?

The instruction decoder is used only during instruction analysis, while
registering the probe -- i.e., in kernel space.

>
> In general the trend has been also to make traps faster in the CPU, make
> sure you're not optimizing for some old CPU here.

I won't argue with that. What Avi seems to be proposing buys us a
speedup, but at the cost of increased complexity -- among other things,
splitting the instrumentation code between user space (in the "XOL" area
-- which would then be used for much more than XOL instruction slots)
and kernel space. The splitting would presumably be handled by
higher-level code -- SystemTap, perf, or whatever. It's a neat idea,
but it seems like a v2 kind of feature.

>
> -Andi

Jim

2010-01-20 19:58:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

> Re: rewriting instructions that use rip-relative addressing. We do that
> now. See handle_riprel_insn() in patch #2. (As far as we can tell, it
> works, but we'd appreciate your review of it.)

Yes, but how do you get within 2GB of it? Add lots of holes
in the address space?

> The instruction decoder is used only during instruction analysis, while
> registering the probe -- i.e., in kernel space.

Registering the user probe? That means if there's a buffer overflow
in there it would be exploitable.

> >
> > In general the trend has been also to make traps faster in the CPU, make
> > sure you're not optimizing for some old CPU here.
>
> I won't argue with that. What Avi seems to be proposing buys us a
> speedup, but at the cost of increased complexity -- among other things,
> splitting the instrumentation code between user space (in the "XOL" area
> -- which would then be used for much more than XOL instruction slots)

You can't have a single XOL area, at least not if you want to support
shared libraries on 64bit & rip relative.

> and kernel space. The splitting would presumably be handled by
> higher-level code -- SystemTap, perf, or whatever. It's a neat idea,
> but it seems like a v2 kind of feature.

I'm not sure it can even work, unless you severely limit the allowed
instructions.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-20 20:29:08

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Wed, 2010-01-20 at 20:58 +0100, Andi Kleen wrote:
> > Re: rewriting instructions that use rip-relative addressing. We do that
> > now. See handle_riprel_insn() in patch #2. (As far as we can tell, it
> > works, but we'd appreciate your review of it.)
>
> Yes, but how do you get within 2GB of it?

I'm not sure what you're asking.

To jump between the probed instruction stream and the XOL area, I've
proposed
jmpq *(%rip)
.quad next_insn
next_insn is a 64-bit address, which presumably allows you to jump to
anywhere in the address space.

To read/write the memory addressed by a rip-relative instruction, we
convert the rip-relative addressing to indirect addressing through a
64-bit scratch register (whose saved value we restore before returning
to the probed instruction stream).
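
Concretely (registers and offsets here are illustrative; the details in
handle_riprel_insn differ):

	/*
	 * original:    mov 0x1234(%rip),%rax  ; target = next IP + 0x1234
	 * in XOL slot: mov (%rbx),%rax        ; %rbx preloaded with the
	 *                                     ; absolute target; the old
	 *                                     ; %rbx value is saved and
	 *                                     ; restored around the step
	 */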

> Add lots of holes
> in the address space?

No.

>
> > The instruction decoder is used only during instruction analysis, while
> > registering the probe -- i.e., in kernel space.
>
> Registering the user probe? That means if there's a buffer overflow
> in there it would be exploitable.

Certainly a poorly written probe handler would be a problem. Could you
explain further what you mean? Are you talking about a buffer overflow
in the probed program? in the probe handler? in uprobes?

>
> > >
> > > In general the trend has been also to make traps faster in the CPU, make
> > > sure you're not optimizing for some old CPU here.
> >
> > I won't argue with that. What Avi seems to be proposing buys us a
> > speedup, but at the cost of increased complexity -- among other things,
> > splitting the instrumentation code between user space (in the "XOL" area
> > -- which would then be used for much more than XOL instruction slots)
>
> You can't have a single XOL area, at least not if you want to support
> shared libraries on 64bit & rip relative.

I disagree. See above.

>
> > and kernel space. The splitting would presumably be handled by
> > higher-level code -- SystemTap, perf, or whatever. It's a neat idea,
> > but it seems like a v2 kind of feature.
>
> I'm not sure it can even work, unless you severly limited the allowed
> instructions.

I'm not sure it can work, either. But I still believe that we've
addressed the known issues wrt the big x86_64 address space.

>
> -Andi
>

Thanks.
Jim

2010-01-22 07:02:42

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

Here is a summary of the Comments and actions that need to be taken for
the current uprobes patchset. Please let me know if I missed or
misunderstood any of your comments.

1. Uprobes depends on the trap signal.
Uprobes depends on the trap signal rather than hooking into the global
die notifier. It was suggested that we hook into the global die
notifier.

In the next version of the patches, Uprobes will use the global die
notifier and look at the per-task count of the probes in use to
see whether the trap has to be consumed.

However, this would reduce the ability of uprobe handlers to
sleep. Since we are dealing with userspace, sleeping in handlers
would have been a good feature. We are looking at ways to get
around this limitation.


2. XOL vma vs Emulation vs Single Stepping Inline vs using Protection
Rings.
The XOL vma is an additional vma in the process address space. There
is opposition to adding an extra vma without the user actually
requesting it.

The XOL vma and single-stepping inline are the two
architecture-independent implementations, while the other
implementations are more architecture-specific. Single-stepping
inline wouldn't work well with multithreaded processes.

Even though the XOL vma has its own issues, we will go with it, since
the other implementations seem to have more complications.

We look forward to implementing boosters later. If we later come
across other techniques with fewer side-effects than the XOL vma, we
would switch to using them.


3. Current Uprobes looks at process lifetimes and not vma lifetimes.
Also it needs threads to quiesce when inserting and removing
breakpoints.

Current uprobes was quiescing threads of a process before
insertion and deletion. This resulted in uprobes having to track
process lifetimes. An alternative method, tracking vma lifetimes,
was suggested.

The next version would update the copy of the page and flip the
pagetables as suggested by Peter. Hence it would no longer depend on
threads quiescing.

However this would need hooks in munmap/rmap so that uprobes can
remove breakpoints that are placed in that vma.

This would also mean removing the rcu_dereference we were using.

4. Move the ftrace plugin to use trace events.

Since ftrace plugins are relegated to obsolescence, it was
suggested we use trace events which would have much wider scope.

Next version will use trace events.

5. Rename UBP to user_bkpt.

6. Update the authors for all files that are being added.

I shall work towards v2 of uprobes and send out the patches as soon as
possible.

--
Thanks and Regards
Srikar

Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote:
> Here is a summary of the Comments and actions that need to be taken for
> the current uprobes patchset. Please let me know if I missed or
> misunderstood any of your comments.
>
> 1. Uprobes depends on trap signal.
> Uprobes depends on trap signal rather than hooking to the global
> die notifier. It was suggested that we hook to the global die notifier.
>
> In the next version of patches, Uprobes will use the global die
> notifier and look at the per-task count of the probes in use to
> see if it has to be consumed.
>
> However this would reduce the ability of uprobe handlers to
> sleep. Since we are dealing with userspace, sleeping in handlers
> would have been a good feature. We are looking at ways to get
> around this limitation.

We could set a TIF_ flag in the notifier to indicate a breakpoint hit
and process it in task context before the task heads into userspace.
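
One way this could look (a sketch only; TIF_UPROBE,
uprobe_handle_breakpoint() and uprobe_notify_resume() are invented
names, and the v2 code may well differ): the die notifier just marks
the task, and the breakpoint is processed later in task context, where
the handler is free to sleep.

    static int uprobe_exceptions_notify(struct notifier_block *self,
                                        unsigned long val, void *data)
    {
            struct die_args *args = data;

            if (val != DIE_INT3 || !user_mode(args->regs))
                    return NOTIFY_DONE;

            /* defer the real work to task context */
            set_thread_flag(TIF_UPROBE);            /* hypothetical flag */
            return NOTIFY_STOP;
    }

    /* called on the way back to userspace, e.g. from do_notify_resume() */
    void uprobe_notify_resume(struct pt_regs *regs)
    {
            clear_thread_flag(TIF_UPROBE);
            uprobe_handle_breakpoint(regs);         /* may sleep here */
    }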

Ananth

2010-01-22 10:48:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote:
> On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote:
> > Here is a summary of the Comments and actions that need to be taken for
> > the current uprobes patchset. Please let me know if I missed or
> > misunderstood any of your comments.
> >
> > 1. Uprobes depends on trap signal.
> > Uprobes depends on trap signal rather than hooking to the global
> > die notifier. It was suggested that we hook to the global die notifier.
> >
> > In the next version of patches, Uprobes will use the global die
> > notifier and look at the per-task count of the probes in use to
> > see if it has to be consumed.
> >
> > However this would reduce the ability of uprobe handlers to
> > sleep. Since we are dealing with userspace, sleeping in handlers
> > would have been a good feature. We are looking at ways to get
> > around this limitation.
>
> We could set a TIF_ flag in the notifier to indicate a breakpoint hit
> and process it in task context before the task heads into userspace.

Make that optional; not everybody might want it. Either provide a
simple trampoline, or use a flag at registration to indicate that the
callback should be called from process context.

2010-01-22 18:07:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

On Fri, 2010-01-22 at 12:32 +0530, Srikar Dronamraju wrote:

> 2. XOL vma vs Emulation vs Single Stepping Inline vs using Protection
> Rings.
> The XOL vma is an additional vma in the process address space. There
> is opposition to adding an extra vma without the user actually
> requesting it.
>
> The XOL vma and single-stepping inline are the two
> architecture-independent implementations, while the other
> implementations are more architecture-specific. Single-stepping
> inline wouldn't work well with multithreaded processes.
>
> Even though the XOL vma has its own issues, we will go with it, since
> the other implementations seem to have more complications.
>
> We look forward to implementing boosters later. If we later come
> across other techniques with fewer side-effects than the XOL vma, we
> would switch to using them.

How about modifying glibc to reserve like 64 bytes on the TLS structure
it has and storing the ins and possible boost jmp there? Since each
thread can only have a single trap at any one time that should be
enough.
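
Concretely, something like the following per-thread slot seems to be
what is meant (the layout, sizes and names here are guesses):

    /* reserved inside glibc's TLS block, one slot per thread */
    struct uprobe_tls_slot {
            unsigned char insn[16];      /* copy of the probed instruction */
            unsigned char boost_jmp[5];  /* optional jmp back past the probepoint */
    } __attribute__((aligned(64)));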

2010-01-22 18:37:49

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

Peter Zijlstra wrote:
> On Fri, 2010-01-22 at 12:32 +0530, Srikar Dronamraju wrote:
>
>> 2. XOL vma vs Emulation vs Single Stepping Inline vs using Protection
>> Rings.
>> The XOL vma is an additional vma in the process address space. There
>> is opposition to adding an extra vma without the user actually
>> requesting it.
>>
>> The XOL vma and single-stepping inline are the two
>> architecture-independent implementations, while the other
>> implementations are more architecture-specific. Single-stepping
>> inline wouldn't work well with multithreaded processes.
>>
>> Even though the XOL vma has its own issues, we will go with it, since
>> the other implementations seem to have more complications.
>>
>> We look forward to implementing boosters later. If we later come
>> across other techniques with fewer side-effects than the XOL vma, we
>> would switch to using them.
>
> How about modifying glibc to reserve like 64 bytes on the TLS structure
> it has and storing the ins and possible boost jmp there? Since each
> thread can only have a single trap at any one time that should be
> enough.

Hmm, it is a good idea. We'll still have a copy of the original insn
in the kernel, but it could be simpler than managing the XOL vma. :-)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-22 23:55:25

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]


On Fri, 2010-01-22 at 19:06 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-22 at 12:32 +0530, Srikar Dronamraju wrote:
>
> > 2. XOL vma vs Emulation vs Single Stepping Inline vs using Protection
> > Rings.
> > The XOL vma is an additional vma in the process address space. There
> > is opposition to adding an extra vma without the user actually
> > requesting it.
> >
> > The XOL vma and single-stepping inline are the two
> > architecture-independent implementations, while the other
> > implementations are more architecture-specific. Single-stepping
> > inline wouldn't work well with multithreaded processes.
> >
> > Even though the XOL vma has its own issues, we will go with it, since
> > the other implementations seem to have more complications.
> >
> > We look forward to implementing boosters later. If we later come
> > across other techniques with fewer side-effects than the XOL vma, we
> > would switch to using them.
>
> How about modifying glibc to reserve like 64 bytes on the TLS structure
> it has and storing the ins and possible boost jmp there? Since each
> thread can only have a single trap at any one time that should be
> enough.

We once implemented something similar, but using an area just beyond the
top of the stack instead of TLS. We figured it would never pass muster
because we have to temporarily map the page executable (and undo it
after the single-step), and this felt like a big security hole. I'd
think we'd have the same concern with TLS.

Jim

2010-01-24 23:41:28

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Sun 2010-01-17 16:01:46, Peter Zijlstra wrote:
> On Sun, 2010-01-17 at 16:56 +0200, Avi Kivity wrote:
> > On 01/17/2010 04:52 PM, Peter Zijlstra wrote:
>
> > > Also, if its fixed size you're imposing artificial limits on the number
> > > of possible probes.
> > >
> >
> > Obviously we'll need a limit, a uprobe will also take kernel memory, we
> > can't allow people to exhaust it.
>
> Only if it's unprivileged; kernel and root should be able to place as
> many probes as they like until the machine keels over.

Well, it is address space that limits you in both cases...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-01-27 06:54:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote:
> On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote:
> > Here is a summary of the Comments and actions that need to be taken for
> > the current uprobes patchset. Please let me know if I missed or
> > misunderstood any of your comments.
> >
> > 1. Uprobes depends on trap signal.
> > Uprobes depends on trap signal rather than hooking to the global
> > die notifier. It was suggested that we hook to the global die notifier.
> >
> > In the next version of patches, Uprobes will use the global die
> > notifier and look at the per-task count of the probes in use to
> > see if it has to be consumed.
> >
> > However this would reduce the ability of uprobe handlers to
> > sleep. Since we are dealing with userspace, sleeping in handlers
> > would have been a good feature. We are looking at ways to get
> > around this limitation.
>
> We could set a TIF_ flag in the notifier to indicate a breakpoint hit
> and process it in task context before the task heads into userspace.

OK, so we can go play stack games in the INT3 interrupt handler by
moving to a non IST stack when it comes from userspace, or move kprobes
over to INT1 or something.

2010-01-27 08:25:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


* Avi Kivity <[email protected]> wrote:

> On 01/20/2010 11:57 AM, Peter Zijlstra wrote:
> >On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote:
> >> 1. Write a trace entry into shared memory, trap into the kernel on
> >> overflow.
> >> 2. Trap if a condition is satisfied (fast watchpoint implementation).
> >
> > So now you want to consume more of a process' address space to store trace
> > data as well?
>
> Yes. I know I'm bad.

No, you are just wrong.

> > Not to mention that that process could wreck the trace data rendering it
> > utterly unreliable.
>
> It could, but it also might not. Are we going to deny high performance
> tracing to users just because it doesn't work in all cases?

Tracing and monitoring is foremost about being able to trust the instrument,
then about performance and usability. That's one of the big things about
ftrace and perf.

By proposing 'user space tracing' you are missing two big aspects:

- That self-contained, kernel-driven tracing can be replicated in user-space.
It cannot. Sharing and global state is much harder to maintain reliably,
but the bigger problem is that user-space can stomp on its own tracing
state and can make it unreliable. Tracing is often used to figure out bugs,
and tracers will be trusted less if they can stomp on themselves.

- That somehow it's much faster and that this edge matters. It isnt and it
doesnt matter. The few places that need very very fast tracing wont use any
of these facilities - it will use something specialized.

So you are creating a solution for special cases that dont need it, and you
are also ignoring prime qualities of a good tracing framework.

Ingo

2010-01-27 08:24:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]

On Wed, 2010-01-27 at 07:53 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote:
> > On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote:
> > > Here is a summary of the Comments and actions that need to be taken for
> > > the current uprobes patchset. Please let me know if I missed or
> > > misunderstood any of your comments.
> > >
> > > 1. Uprobes depends on trap signal.
> > > Uprobes depends on trap signal rather than hooking to the global
> > > die notifier. It was suggested that we hook to the global die notifier.
> > >
> > > In the next version of patches, Uprobes will use the global die
> > > notifier and look at the per-task count of the probes in use to
> > > see if it has to be consumed.
> > >
> > > However this would reduce the ability of uprobe handlers to
> > > sleep. Since we are dealing with userspace, sleeping in handlers
> > > would have been a good feature. We are looking at ways to get
> > > around this limitation.
> >
> > We could set a TIF_ flag in the notifier to indicate a breakpoint hit
> > and process it in task context before the task heads into userspace.
>
> OK, so we can go play stack games in the INT3 interrupt handler by
> moving to a non IST stack when it comes from userspace, or move kprobes
> over to INT1 or something.

Right, it just got pointed out that INT1 doesn't have a single-byte
encoding; only INTO and INT3 do :/

2010-01-27 08:36:43

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/27/2010 10:24 AM, Ingo Molnar wrote:
>
>
>>> Not to mention that that process could wreck the trace data rendering it
>>> utterly unreliable.
>>>
>> It could, but it also might not. Are we going to deny high performance
>> tracing to users just because it doesn't work in all cases?
>>
> Tracing and monitoring is foremost about being able to trust the instrument,
> then about performance and usability. That's one of the big things about
> ftrace and perf.
>
> By proposing 'user space tracing' you are missing two big aspects:
>
> - That self-contained, kernel-driven tracing can be replicated in user-space.
> It cannot. Sharing and global state is much harder to maintain reliably,
> but the bigger problem is that user-space can stomp on its own tracing
> state and can make it unreliable. Tracing is often used to figure out bugs,
> and tracers will be trusted less if they can stomp on themselves.
>
> - That somehow it's much faster and that this edge matters. It isnt and it
> doesnt matter. The few places that need very very fast tracing wont use any
> of these facilities - it will use something specialized.
>
> So you are creating a solution for special cases that dont need it, and you
> are also ignoring prime qualities of a good tracing framework.
>

I see it exactly the opposite. Only a very small minority of cases will
have such severe memory corruption that tracing will fall apart because
of random writes to memory; especially on 64-bit where the address space
is sparse. On the other hand, knowing that the cost is a few dozen
cycles rather than a thousand or so means that you can trace production
servers running full loads without worrying about whether tracing will
affect whatever it is you're trying to observe.

I'm not against slow reliable tracing, but we shouldn't ignore the need
for speed.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-27 09:08:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


* Avi Kivity <[email protected]> wrote:

> On 01/27/2010 10:24 AM, Ingo Molnar wrote:
> >
> >
> >>>Not to mention that that process could wreck the trace data rendering it
> >>>utterly unreliable.
> >>It could, but it also might not. Are we going to deny high performance
> >>tracing to users just because it doesn't work in all cases?
> >Tracing and monitoring is foremost about being able to trust the instrument,
> >then about performance and usability. That's one of the big things about
> >ftrace and perf.
> >
> >By proposing 'user space tracing' you are missing two big aspects:
> >
> > - That self-contained, kernel-driven tracing can be replicated in user-space.
> > It cannot. Sharing and global state is much harder to maintain reliably,
> > but the bigger problem is that user-space can stomp on its own tracing
> > state and can make it unreliable. Tracing is often used to figure out bugs,
> > and tracers will be trusted less if they can stomp on themselves.
> >
> > - That somehow it's much faster and that this edge matters. It isnt and it
> > doesnt matter. The few places that need very very fast tracing wont use any
> > of these facilities - it will use something specialized.
> >
> >So you are creating a solution for special cases that dont need it, and you
> >are also ignoring prime qualities of a good tracing framework.
>
> I see it exactly the opposite. Only a very small minority of cases will
> have such severe memory corruption that tracing will fall apart because of
> random writes to memory; especially on 64-bit where the address space is
> sparse. On the other hand, knowing that the cost is a few dozen cycles
> rather than a thousand or so means that you can trace production servers
> running full loads without worrying about whether tracing will affect
> whatever it is you're trying to observe.
>
> I'm not against slow reliable tracing, but we shouldn't ignore the need for
> speed.

I havent seen a concise summary of your points in this thread, so let me
summarize them as i've understood them (hopefully not putting words into your
mouth): AFAICS you are arguing for some crazy fragile architecture-specific
solution that traps INT3 into ring3 just to shave off a few cycles, and then
uses user-space state to trace into.

If so then you ignore the obvious solution to _that_ problem: dont use INT3 at
all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_
faster than _any_ breakpoint based solution - literally just the cost of a
function call (or not even that - i've written very fast inlined tracers -
they do rock when it comes to performance). Problem solved and none of the
INT3 details matters at all.

INT3 only matters to _transparent_ probing, and for that, the cost of INT3 is
almost _by definition_ less important than the fact that we can do transparent
tracing. If performance were the overriding issue they'd use dedicated
callbacks - and the INT3 technique wouldnt matter at all.

( Also, just like we were able to extend the kprobes code with more and more
optimizations, the same can be done with any user-space probing as well, to
make it faster. But at the core of it has to be a sane design that is
transparent and controlled by the kernel, so that it has the option to apply
more and more optimizations - yours isnt such and its limitations are
designed-in. Which is neither smart nor useful. )

Ingo

2010-01-27 09:27:59

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/27/2010 11:08 AM, Ingo Molnar wrote:
>
>> I see it exactly the opposite. Only a very small minority of cases will
>> have such severe memory corruption that tracing will fall apart because of
>> random writes to memory; especially on 64-bit where the address space is
>> sparse. On the other hand, knowing that the cost is a few dozen cycles
>> rather than a thousand or so means that you can trace production servers
>> running full loads without worrying about whether tracing will affect
>> whatever it is you're trying to observe.
>>
>> I'm not against slow reliable tracing, but we shouldn't ignore the need for
>> speed.
>>
> I havent seen a concise summary of your points in this thread, so let me
> summarize them as i've understood them (hopefully not putting words into your
> mouth): AFAICS you are arguing for some crazy fragile architecture-specific
> solution that traps INT3 into ring3 just to shave off a few cycles, and then
> uses user-space state to trace into.
>


That's a good summary, except for the words "crazy fragile", "trap INT3
into ring3" and "a few cycles".

Instead of using int 3, put a jump instruction in the program. This
shaves a lot more than a few cycles.

> If so then you ignore the obvious solution to _that_ problem: dont use INT3 at
> all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_
> faster than _any_ breakpoint based solution - literally just the cost of a
> function call (or not even that - i've written very fast inlined tracers -
> they do rock when it comes to performance). Problem solved and none of the
> INT3 details matters at all.
>

However did I not think of that? Yes, and let's rip off kprobes tracing
from the kernel, we can always rebuild it.

Well, I'm observing an issue in a production system now. I may not want
to take it down, or if I take it down I may not be able to observe it
again as the problem takes a couple of days to show up, or I may not
have the full source, or it takes 10 minutes to build and so an
iterative edit/build/run cycle can stretch for hours.

Adding a vma to a running program is very unlikely to affect it. If the
program makes random accesses to memory, it will likely segfault very
quickly before we ever get to trace it.

> INT3 only matters to _transparent_ probing, and for that, the cost of INT3 is
> almost _by definition_ less important than the fact that we can do transparent
> tracing. If performance were the overriding issue they'd use dedicated
> callbacks - and the INT3 technique wouldnt matter at all.
>

INT3 isn't transparent. The only thing that comes close to full
transparency is hardware breakpoints. So we have a tradeoff between
transparency and speed, and except for the weirdest bugs, this level of
transparency won't be needed.

> ( Also, just like we were able to extend the kprobes code with more and more
> optimizations, the same can be done with any user-space probing as well, to
> make it faster. But at the core of it has to be a sane design that is
> transparent and controlled by the kernel, so that it has the option to apply
> more and more optimizations - yours isnt such and its limitations are
> designed-in.

No design is fully transparent, and I don't see why my design can't be
controlled by the kernel?

> Which is neither smart nor useful. )
>

This style of arguing is neither smart nor useful, either.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-01-27 10:23:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


* Avi Kivity <[email protected]> wrote:

> > If so then you ignore the obvious solution to _that_ problem: dont use
> > INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks.
> > It's _MUCH_ faster than _any_ breakpoint based solution - literally just
> > the cost of a function call (or not even that - i've written very fast
> > inlined tracers - they do rock when it comes to performance). Problem
> > solved and none of the INT3 details matters at all.
>
> However did I not think of that? Yes, and let's rip off kprobes tracing
> from the kernel, we can always rebuild it.
>
> Well, I'm observing an issue in a production system now. I may not want to
> take it down, or if I take it down I may not be able to observe it again as
> the problem takes a couple of days to show up, or I may not have the full
> source, or it takes 10 minutes to build and so an iterative edit/build/run
> cycle can stretch for hours.

You have somewhat misconstrued my argument. What i said above is that _if_ you
need extreme levels of performance you always have the option to go even
faster via specialized tracing solutions. I did not promote it as a
replacement solution. Specialization obviously brings in a new set of
problems: inflexibility and non-transparency, examples of which you gave
above.

Your proposed solution brings in precisely such kinds of issues, on a
different level, just to improve performance at the cost of transparency and
at the cost of features and robustness.

It's btw rather ironic as your arguments are somewhat similar to the Xen vs.
KVM argument just turned around: KVM started out slower by relying on hardware
implementation for virtualization while Xen relied on a clever but limiting
hack. With each CPU generation the hardware got faster, while the various
design limitations of Xen are hurting it and KVM is winning that race.

A (partially) similar situation exists here: INT3 into ring 0 and handling it
there in a protected environment might be more expensive, but _if_ it matters
to performance it sure could be made faster in hardware (and in fact it will
become faster with every new generation of hardware).

Both Peter and I are telling you that we consider your solution too
specialized, at the cost of flexibility, features and robustness.

Thanks,

Ingo

2010-01-15 09:03:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
>
> +Instruction copies to be single-stepped are stored in a per-process
> +"single-step out of line (XOL) area," which is a little VM area
> +created by Uprobes in each probed process's address space.

I think tinkering with the probed process's address space is a no-no.
Have you run this by the linux mm folks? I'd be inclined to NAK this
straight out.

2010-01-15 09:04:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
>
> discussed elsewhere.

Thanks for the pointer...

2010-01-15 09:08:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/7] Execution out of line (XOL)

On Thu, 2010-01-14 at 14:43 -0800, Jim Keniston wrote:
>
> Yeah, there's not a lot of context there. I hope it will make more
> sense if you read section 1.1 of Documentation/uprobes.txt (patch #6).
> Or look at get_insn_slot() in kprobes, and understand that we're trying
> to do something similar in uprobes, where the instruction copies have to
> reside in the user address space of the probed process.

That's not the point; changelogs shouldn't be this cryptic. They
should stand alone and be descriptive of the what, why and how.

If you can't be bothered writing such for something you want reviewed
for inclusion then I might not be bothered looking at them at all.

2010-01-15 09:10:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Thu, 2010-01-14 at 14:49 -0800, Jim Keniston wrote:
> On Thu, 2010-01-14 at 12:09 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote:
> > >
> > > Uprobes Infrastructure enables user to dynamically establish
> > > probepoints in user applications and collect information by executing
> > > a handler functions when the probepoints are hit.
> > > Please refer Documentation/uprobes.txt for more details.
> > >
> > > This patch provides the core implementation of uprobes.
> > > This patch builds on utrace infrastructure.
> > >
> > > You need to follow this up with the uprobes patch for your
> > > architecture.
> >
> > So all this is basically some glue between what you call ubp (the real
> > userspace breakpoint stuff) and utrace? Or does it do more?
> >
>
> My reply in
> http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/02483.html
> addresses this.

Right, so all that need be done is add the multiple probe stuff to UBP
and it's a sane interface to use on its own, at which point I'd be
inclined to call that uprobes (UBP really is a crap name).

Then we can ditch the whole utrace muck as I see no reason to want to
use that, whereas the ubp (given a sane name) looks interesting.

2010-01-15 09:26:57

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Peter Zijlstra <[email protected]> writes:

> [...]
> Right, so all that need be done is add the multiple probe stuff to UBP
> and it's a sane interface to use on its own, at which point I'd be
> inclined to call that uprobes (UBP really is a crap name).

At one point ubp+uprobes were one piece. They were separated on the
suspicion that lkml would like them that way.

> Then we can ditch the whole utrace muck as I see no reason to want to
> use that, whereas the ubp (given a sane name) looks interesting.

Assuming you meant what you wrote, perhaps you misunderstand the
layering relationship of these pieces. utrace underlies uprobes and
other process manipulation functionality, present and future.

- FChE

2010-01-15 09:35:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 04:26 -0500, Frank Ch. Eigler wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > [...]
> > Right, so all that need be done is add the multiple probe stuff to UBP
> > and it's a sane interface to use on its own, at which point I'd be
> > inclined to call that uprobes (UBP really is a crap name).
>
> At one point ubp+uprobes were one piece. They were separated on the
> suspicion that lkml would like them that way.

Right, good thinking, that way we can use ubp without having to use
utrace ;-)

> > Then we can ditch the whole utrace muck as I see no reason to want to
> > use that, whereas the ubp (given a sane name) looks interesting.
>
> Assuming you meant what you wrote, perhaps you misunderstand the
> layering relationship of these pieces. utrace underlies uprobes and
> other process manipulation functionality, present and future.

Why, utrace doesn't at all look to bring a fundamental contribution to
all that. If there's a proper kernel interface to install probes on
userspace code (ubp seems to be mostly that) then I can use perf/ftrace
to do the rest of the state management, no need to use utrace there.

You can hardly force me to use utrace there, can you?

Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> >
> > discussed elsewhere.
>
> Thanks for the pointer...

:-)

Peter,
I think Jim was referring to
http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html

Ananth

2010-01-15 09:50:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 15:08 +0530, Ananth N Mavinakayanahalli wrote:
> On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote:
> > On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> > >
> > > discussed elsewhere.
> >
> > Thanks for the pointer...
>
> :-)
>
> Peter,
> I think Jim was referring to
> http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html

That's a 2007 email from some obscure list... that's hardly something
that can be referenced without a link.

As previously stated, I think poking at a process's address space is an
utter no-go.

Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, Jan 15, 2010 at 10:50:14AM +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 15:08 +0530, Ananth N Mavinakayanahalli wrote:
> > On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> > > >
> > > > discussed elsewhere.
> > >
> > > Thanks for the pointer...
> >
> > :-)
> >
> > Peter,
> > I think Jim was referring to
> > http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html
>
> That's a 2007 email from some obscure list... that's hardly something
> that can be referenced without a link.
>
> As previously stated, I think poking at a process's address space is an
> utter no-go.

In which case we'll need to find a different solution to it. The gdb
style of 'breakpoint hit' -> 'put original instruction back in place' ->
single-step -> 'put back the breakpoint' would be a big limiter,
especially for multithreaded cases.
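
Schematically, that gdb-style sequence looks like this (all names here
are illustrative only); between steps 1 and 3 the breakpoint is absent,
so other threads can race past the probepoint unprobed:

    void gdb_style_singlestep(struct probepoint *pp)
    {
            write_insn(pp->vaddr, pp->orig_insn);   /* 1. remove bkpt   */
            single_step_task(current, pp->vaddr);   /* 2. step original */
            write_insn(pp->vaddr, BREAKPOINT_INSN); /* 3. re-arm bkpt   */
    }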

The design here is to have a small vma sufficiently high in memory,
a la vDSO, that most apps won't reach it, though there is still no
ironclad guarantee.

Ideally, we will need to single-step on a copy of the instruction, in the
user address space of the traced process.

Ideas?

Ananth

2010-01-15 10:14:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote:

> Ideas?

emulate the one instruction?

Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote:
>
> > Ideas?
>
> emulate the one instruction?

In kernel? Generically? I don't think it's that easy for userspace --
you have the full gamut of instructions to emulate (fp, vector, etc);
further, the instruction could itself cause a page fault and the like.

2010-01-15 10:26:55

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi Peter,

> > >
> >
> > My reply in
> > http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/02483.html
> > addresses this.
>
> Right, so all that need be done is add the multiple probe stuff to UBP
> and it's a sane interface to use on its own, at which point I'd be
> inclined to call that uprobes (UBP really is a crap name).

I am fine with renaming ubp to a suggested name. The reason for splitting
uprobes into two layers was to allow others (currently none) to reuse the
current ubp layer. It was felt that there could be multiple clients for
ubp who could co-exist.

However ubp alone is not enough to provide userspace tracing.
Currently it wouldn't understand synchronization between different
threads of a process, process lifetime issues, or the context in
which the handler has to run.

As pointed out by Jim earlier, we have segregated the layer which
takes care of the above issues into the uprobes layer.

For example, while inserting a breakpoint, one of the threads of a
process could be running at the same place where we are trying to place
a breakpoint. Or there could be two threads that could be racing to
insert/delete a breakpoint. These synchronization issues are all handled
by the Uprobes layer.

Uprobes layer would need to be notified of process life-time events
like fork/clone/exec/exit.
It also needs to know
- when a breakpoint is hit
- stop and resume a thread.

The uprobes layer uses utrace to be notified of process lifetime events
and for the signal handling part.

--
Thanks and Regards
Srikar

>
> Then we can ditch the whole utrace muck as I see no reason to want to
> use that, whereas the ubp (given a sane name) looks interesting.
>


2010-01-15 10:33:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote:
> Hi Peter,
>
> Or there could be two threads that could be racing to
> insert/delete a breakpoint. These synchronization issues are all handled
> by the Uprobes layer.

Shouldn't be hard to put that in the ubp layer, right?

> Uprobes layer would need to be notified of process life-time events
> like fork/clone/exec/exit.

Not so much the process lifetimes as the vma lifetimes are interesting;
placing a hook in the vm code to track that isn't too hard.

> It also needs to know
> - when a breakpoint is hit
> - stop and resume a thread.

A simple hook in the trap code is done quickly enough, and there's no
reason to stop the thread; it's not going anywhere when it traps.

2010-01-15 10:57:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 15:52 +0530, Ananth N Mavinakayanahalli wrote:
> On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote:
> >
> > > Ideas?
> >
> > emulate the one instruction?
>
> In kernel? Generically? I don't think it's that easy for userspace --
> you have the full gamut of instructions to emulate (fp, vector, etc);
> further,

Can't you JIT a piece of code that wraps the one instruction: save the
full cpu state, set the userspace segments, have it load pt_regs (except
for the IP), execute the one insn, save the results, and restore the full
state?

Then replace pt_regs with the saved result and advance the stored IP by
the length of that one instruction and return to userspace?

All you need to take care of are the priv insns, but doesn't something
like kvm already have code to deal with that?

> the instruction could itself cause a page fault and the like.

Faults aren't a problem, we take faults from kernel space all the time.

2010-01-15 11:02:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 11:56 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 15:52 +0530, Ananth N Mavinakayanahalli wrote:
> > On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote:
> > > On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote:
> > >
> > > > Ideas?
> > >
> > > emulate the one instruction?
> >
> > In kernel? Generically? I don't think it's that easy for userspace --
> > you have the full gamut of instructions to emulate (fp, vector, etc);
> > further,
>
> Can't you JIT a piece of code that wraps the one instruction: save the
> full cpu state, set the userspace segments, have it load pt_regs (except
> for the IP), execute the one insn, save the results, and restore the full
> state?

Hmm, normally the problem with FP/vector state is that we don't
save/restore it going into/out of the kernel, so kernel-space can't use
it without changing the userspace state. But in this case we can simply
execute that one instruction and have it change user state, because
that's exactly what we want to do.

So we don't need to save/restore the full cpu state around that JIT'ed
piece of code, just the regular regs.

2010-01-15 11:05:59

by Maneesh Soni

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, Jan 15, 2010 at 11:33:27AM +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote:
> > Hi Peter,
> >
> > Or there could be two threads that could be racing to
> > insert/delete a breakpoint. These synchronization issues are all handled
> > by the Uprobes layer.
>
> Shouldn't be hard to put that in the ubp layer, right?
>
> > Uprobes layer would need to be notified of process life-time events
> > like fork/clone/exec/exit.
>
> Not so much the process lifetimes as the vma lifetimes are interesting;
> placing a hook in the vm code to track that isn't too hard.
>

I think similar hooks were given thumbs down in the previous incarnation
of uprobes (which was implemented without utrace).

http://lkml.indiana.edu/hypermail/linux/kernel/0603.2/1254.html

Thanks
Maneesh

--
Maneesh Soni
Linux Technology Center
IBM India Systems and Technology Lab,
Bangalore, India.

2010-01-15 11:12:36

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/7] Execution out of line (XOL)

* Peter Zijlstra <[email protected]> [2010-01-15 10:07:35]:

> On Thu, 2010-01-14 at 14:43 -0800, Jim Keniston wrote:
> >
> > Yeah, there's not a lot of context there. I hope it will make more
> > sense if you read section 1.1 of Documentation/uprobes.txt (patch #6).
> > Or look at get_insn_slot() in kprobes, and understand that we're trying
> > to do something similar in uprobes, where the instruction copies have to
> > reside in the user address space of the probed process.
>
> That's not the point; changelogs shouldn't be this cryptic. They
> should stand alone and be descriptive of the what, why and how.
>
> If you can't be bothered writing such for something you want reviewed
> for inclusion then I might not be bothered looking at them at all.
>

Okay, I shall add the information providing the context for this patch
to the changelog.

2010-01-15 11:13:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 16:35 +0530, Maneesh Soni wrote:
> On Fri, Jan 15, 2010 at 11:33:27AM +0100, Peter Zijlstra wrote:
> > On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote:
> > > Hi Peter,
> > >
> > > Or there could be two threads that could be racing to
> > > insert/delete a breakpoint. These synchronization issues are all handled
> > > by the Uprobes layer.
> >
> > Shouldn't be hard to put that in the ubp layer, right?
> >
> > > Uprobes layer would need to be notified of process life-time events
> > > like fork/clone/exec/exit.
> >
> > > Not so much the process lifetimes as the vma lifetimes are interesting;
> > > placing a hook in the vm code to track that isn't too hard.
> >
>
> I think similar hooks were given thumbs down in the previous incarnation
> of uprobes (which was implemented without utrace).
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0603.2/1254.html

I wasn't at all proposing to mess with a_ops, nor do you really need to.
I was more thinking of adding a callback like perf_event_mmap() and a
corresponding unmap(); that way you can track mapping lifetimes and
add/remove probes accordingly.

Adding the probe uses the fact that (most) executable mappings are
MAP_PRIVATE and CoWs a private copy of the page with the modified ins,
right?

What does it do for MAP_SHARED|MAP_EXECUTABLE sections -- simply fail to
add the probe?

2010-01-15 11:18:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 12:12 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 16:35 +0530, Maneesh Soni wrote:
> > On Fri, Jan 15, 2010 at 11:33:27AM +0100, Peter Zijlstra wrote:
> > > On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote:
> > > > Hi Peter,
> > > >
> > > > Or there could be two threads that could be racing to
> > > > insert/delete a breakpoint. These synchronization issues are all handled
> > > > by the Uprobes layer.
> > >
> > > Shouldn't be hard to put that in the ubp layer, right?
> > >
> > > > Uprobes layer would need to be notified of process life-time events
> > > > like fork/clone/exec/exit.
> > >
> > > Not so much the process lifetimes as the vma lifetimes are interesting;
> > > placing a hook in the vm code to track that isn't too hard.
> > >
> >
> > I think similar hooks were given thumbs down in the previous incarnation
> > of uprobes (which was implemented without utrace).
> >
> > http://lkml.indiana.edu/hypermail/linux/kernel/0603.2/1254.html
>
> I wasn't at all proposing to mess with a_ops, nor do you really need to.
> I was more thinking of adding a callback like perf_event_mmap() and a
> corresponding unmap(); that way you can track mapping lifetimes and
> add/remove probes accordingly.
>
> Adding the probe uses the fact that (most) executable mappings are
> MAP_PRIVATE and CoWs a private copy of the page with the modified ins,
> right?

Does it clean up the CoW'ed page on removing the probe? Does that
account for userspace having made other changes in between installing
and removing the probe (for PROT_WRITE mappings obviously)?

2010-01-15 13:08:10

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

* Peter Zijlstra <[email protected]> [2010-01-15 11:33:27]:

>
> > Uprobes layer would need to be notified of process life-time events
> > like fork/clone/exec/exit.
>
> Not so much the process lifetimes as the vma lifetimes are interesting;
> placing a hook in the vm code to track that isn't too hard.
>
> > It also needs to know
> > - when a breakpoint is hit
> > - stop and resume a thread.
>
> A simple hook in the trap code is done quickly enough, and there's no
> reason to stop the thread; it's not going anywhere when it traps.
>
>

Some of the threads could be executing in the vicinity of the
breakpoint when it is getting inserted or deleted. Won't we need to
stop/quiesce those threads?

2010-01-15 13:10:57

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi -

> > > Then we can ditch the whole utrace muck as I see no reason to want to
> > > use that, whereas the ubp (given a sane name) looks interesting.
> >
> > Assuming you meant what you wrote, perhaps you misunderstand the
> > layering relationship of these pieces. utrace underlies uprobes and
> > other process manipulation functionality, present and future.
>
> Why, utrace doesn't at all look to bring a fundamental contribution to
> all that. If there's a proper kernel interface to install probes on
> userspace code (ubp seems to be mostly that) then I can use perf/ftrace
> to do the rest of the state management, no need to use utrace there.

> You can hardly force me to use utrace there, can you?

utrace is not a form of punishment inflicted upon the undeserving. It
is a service layer that uprobes et alii are built upon. You as a
potential uprobes client need not also talk directly to it, if you
wish to reimplement task-finder-like services some other way.

- FChE

2010-01-15 13:17:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 18:38 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2010-01-15 11:33:27]:
>
> >
> > > Uprobes layer would need to be notified of process life-time events
> > > like fork/clone/exec/exit.
> >
> > Not so much the process lifetimes as the vma lifetimes are interesting;
> > placing a hook in the vm code to track that isn't too hard.
> >
> > > It also needs to know
> > > - when a breakpoint is hit
> > > - stop and resume a thread.
> >
> > A simple hook in the trap code is done quickly enough, and there's no
> > reason to stop the thread; it's not going anywhere when it traps.
> >
> >
>
> Some of the threads could be executing in the vicinity of the
> breakpoint when it is getting inserted or deleted. Won't we need to
> stop/quiesce those threads?

The easy answer is to use kstopmachine to patch the code; the slightly
more complex one would be using something like:

http://lkml.org/lkml/2010/1/12/300




2010-01-15 13:25:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 08:10 -0500, Frank Ch. Eigler wrote:
> Hi -
>
> > > > Then we can ditch the whole utrace muck as I see no reason to want to
> > > > use that, whereas the ubp (given a sane name) looks interesting.
> > >
> > > Assuming you meant what you write, perhaps you misunderstand the
> > > layering relationship of these pieces. utrace underlies uprobes and
> > > other process manipulation functionality, present and future.
> >
> > Why, utrace doesn't at all look to bring a fundamental contribution to
> > all that. If there's a proper kernel interface to install probes on
> > userspace code (ubp seems to be mostly that) then I can use perf/ftrace
> > to do the rest of the state management, no need to use utrace there.
>
> > You can hardly force me to use utrace there, can you?
>
> utrace is not a form of punishment inflicted upon the undeserving. It
> is a service layer that uprobes et alii are built upon. You as a
> potential uprobes client need not also talk directly to it, if you
> wish to reimplement task-finder-like services some other way.

I said I wanted to. I think the whole task orientation of user-space
probing is wrong; it's about text mappings.

But yes, I think that for most purposes utrace is a punishment, it's way
too heavy. I mean: trap, generate a signal, catch the signal; that's
an insane amount of code to jump through in order to get that trap.

Furthermore it requires stopping and resuming tasks and nonsense like
that, that's unwanted in many cases, just run stuff from the trap site
and you're done.

Yes, you can do this with utrace, and I'm not going to stop people from
using utrace for this, I'm just saying I'm not going to.

2010-01-15 13:38:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 14:16 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 18:38 +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <[email protected]> [2010-01-15 11:33:27]:
> >
> > >
> > > > Uprobes layer would need to be notified of process life-time events
> > > > like fork/clone/exec/exit.
> > >
> > > Not so much the process lifetimes as the vma lifetimes are interesting;
> > > placing a hook in the vm code to track that isn't too hard.
> > >
> > > > It also needs to know
> > > > - when a breakpoint is hit
> > > > - stop and resume a thread.
> > >
> > > A simple hook in the trap code is done quickly enough, and there's no
> > > reason to stop the thread; it's not going anywhere when it traps.
> > >
> > >
> >
> > Some of the threads could be executing in the vicinity of the
> > breakpoint when it is getting inserted or deleted. Won't we need to
> > stop/quiesce those threads?
>
> The easy answer is to use kstopmachine to patch the code; the slightly
> more complex one would be using something like:
>
> http://lkml.org/lkml/2010/1/12/300

Also, since it's userspace, can't you simply play games with the
pagetables? CoW a new private copy of the page and flip the pagetables
around to the new one, then flush the TLBs on all relevant cpus and
Bob's your uncle.

(You might have to play some games with making the page RO to trap
intermediate accesses, but that should work, I think.)
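
A rough sketch of that scheme (the helper names are invented, and
replace_page() is assumed to behave like KSM's PTE-swapping helper;
the eventual v2 code may look quite different):

    static int write_probepoint(struct task_struct *tsk,
                                struct vm_area_struct *vma,
                                unsigned long vaddr, u8 opcode)
    {
            struct page *old_page, *new_page;
            void *kaddr;
            int ret;

            /* pin the page currently mapped at vaddr */
            ret = get_user_pages(tsk, vma->vm_mm, vaddr, 1, 0, 0,
                                 &old_page, NULL);
            if (ret <= 0)
                    return -EINVAL;

            new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
            if (!new_page) {
                    put_page(old_page);
                    return -ENOMEM;
            }

            /* private copy with the breakpoint patched in */
            copy_highpage(new_page, old_page);
            kaddr = kmap_atomic(new_page, KM_USER0);
            *((u8 *)kaddr + (vaddr & ~PAGE_MASK)) = opcode;
            kunmap_atomic(kaddr, KM_USER0);

            /* hypothetical: swap the PTE and flush TLBs, KSM-style */
            ret = replace_page(vma, old_page, new_page);

            put_page(old_page);
            return ret;
    }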

2010-01-15 13:38:44

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi -

On Fri, Jan 15, 2010 at 02:25:30PM +0100, Peter Zijlstra wrote:
> [...]
> > utrace is not a form of punishment inflicted upon the undeserving. It
> > is a service layer that uprobes et alii are built upon. You as a
> > potential uprobes client need not also talk directly to it, if you
> > wish to reimplement task-finder-like services some other way.
>
> [...]
> But yes, I think that for most purposes utrace is a punishment, it's way
> too heavy. I mean: trap, generate a signal, catch the signal; that's
> an insane amount of code to jump through in order to get that trap.

At the bottom, there will be an int3 in the userspace text page.
There will be a trap taken, no matter what. Code must figure out
whether this trap came from an in-kernel client such as uprobes, or
whether it is to be passed through to a userspace debugger via ptrace
(or the gdbstub). This part is unavoidable if you wish to be
compatible.

I'm not sure, but it sounds like the part you're complaining about is
how utrace ultimately reports the trap to uprobes: i.e.,
utrace_get_signal()? Is that the "insane amount of code"?


> Furthermore it requires stopping and resuming tasks and nonsense like
> that, that's unwanted in many cases, just run stuff from the trap site
> and you're done.

I don't know what you mean exactly. A trapped task is already stopped.
utrace merely allows various clients to inspect/manipulate the state
of the task at that moment. It does not add any context switches or
spurious stop/resume operations.


- FChE

2010-01-15 13:48:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 08:38 -0500, Frank Ch. Eigler wrote:
> Hi -
>
> On Fri, Jan 15, 2010 at 02:25:30PM +0100, Peter Zijlstra wrote:
> > [...]
> > > utrace is not a form of punishment inflicted upon the undeserving. It
> > > is a service layer that uprobes et alii are built upon. You as a
> > > potential uprobes client need not also talk directly to it, if you
> > > wish to reimplement task-finder-like services some other way.
> >
> > [...]
> > But yes, I think that for most purposes utrace is a punishment, it's way
> > too heavy. I mean: trap, generate a signal, catch the signal; that's
> > an insane amount of code to jump through in order to get that trap.
>
> At the bottom, there will be an int3 in the userspace text page.
> There will be a trap taken, no matter what. Code must figure out
> whether this trap came from an in-kernel client such as uprobes, or
> whether it is to be passed through to a userspace debugger via ptrace
> (or the gdbstub). This part is unavoidable if you wish to be
> compatible.

Sure, a lookup against existing probe sites on trap is unavoidable. If
you find a match, you call a probe-specific handler and deal with it
there; if you don't, you'll eventually generate a SIGTRAP and fall back
to userspace.

Thing is, utrace doesn't do that (nor should it); it's something the
uprobe interface should implement, just like kprobes does.

> I'm not sure, but it sounds like the part you're complaining about is
> how utrace ultimately reports the trap to uprobes: i.e.,
> utrace_get_signal()? Is that the "insane amount of code"?

Well when tracing/profiling every instruction is too much. Having to
needlessly raise a signal only to catch it again a short bit later
sounds like obvious waste to me.

> > Furthermore it requires stopping and resuming tasks and nonsense like
> > that, that's unwanted in many cases, just run stuff from the trap site
> > and you're done.
>
> I don't know what you mean exactly. A trapped task is already stopped.
> utrace merely allows various clients to inspect/manipulate the state
> of the task at that moment. It does not add any context switches or
> spurious stop/resume operations.

Srikar seemed to suggest it needed stop/resume.

2010-01-15 14:01:04

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi -

On Fri, Jan 15, 2010 at 02:47:56PM +0100, Peter Zijlstra wrote:
> [...]
> > I'm not sure, but it sounds like the part you're complaining about is
> > how utrace ultimately reports the trap to uprobes: i.e.,
> > utrace_get_signal()? Is that the "insane amount of code"?
>
> Well when tracing/profiling every instruction is too much. Having to
> needlessly raise a signal only to catch it again a short bit later
> sounds like obvious waste to me.

Well, I'm not in a position to argue line by line about the necessity
or the cost of utrace low level guts, but this may represent the most
practical engineering balance between functionality / modularity /
undesirably intrusive modifications. Perhaps there exists a tool with
which one can confirm your worry about excess cost of this particular
piece.


> > > Furthermore it requires stopping and resuming tasks and nonsense like
> > > that, that's unwanted in many cases, just run stuff from the trap site
> > > and you're done.
> >
> > I don't know what you mean exactly. A trapped task is already stopped.
> > utrace merely allows various clients to inspect/manipulate the state
> > of the task at that moment. It does not add any context switches or
> > spurious stop/resume operations.
>
> Srikar seemed to suggest it needed stop/resume.

You may be confusing breakpoint insertion/removal operations versus
breakpoint hits.


- FChE

2010-01-15 14:06:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 09:00 -0500, Frank Ch. Eigler wrote:

> Well, I'm not in a position to argue line by line about the necessity
> or the cost of utrace low level guts, but this may represent the most
> practical engineering balance between functionality / modularity /
> undesirably intrusive modifications.

How intrusive and non-modular is installing a DIE_INT3 notifier?

2010-01-15 14:20:15

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

>
> > > Furthermore it requires stopping and resuming tasks and nonsense like
> > > that, which is unwanted in many cases; just run stuff from the trap site
> > > and you're done.
> >
> > I don't know what you mean exactly. A trap already stopped the task.
> > utrace merely allows various clients to inspect/manipulate the state
> > of the task at that moment. It does not add any context switches or
> > spurious stop/resume operations.
>
> Srikar seemed to suggest it needed stop/resume.
>

If a process traps, we don't need to stop/resume other threads.
Uprobes needs threads to quiesce when inserting/deleting the breakpoint.

2010-01-15 14:22:40

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

Hi -

> > Well, I'm not in a position to argue line by line about the necessity
> > or the cost of utrace low level guts, but this may represent the most
> > practical engineering balance between functionality / modularity /
> > undesirably intrusive modifications.
>
> How intrusive and non-modular is installing a DIE_INT3 notifier?

I'm not sure about all the reasons pro/con, but it looks like
installing such a system-wide hook would force every userspace
breakpoint or kprobe event, machine-wide, to pass through the
hypothetical uprobes layer, whether or not it is applicable to the
current task.

- FChE

2010-01-15 14:25:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 19:50 +0530, Srikar Dronamraju wrote:

> > Srikar seemed to suggest it needed stop/resume.
> >
>
> If a process traps, we don't need to stop/resume other threads.
> Uprobes needs threads to quiesce when inserting/deleting the breakpoint.

Right, I don't think we need to at all. See the CoW thing from previous
emails.

2010-01-15 14:40:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 09:22 -0500, Frank Ch. Eigler wrote:
> Hi -
>
> > > Well, I'm not in a position to argue line by line about the necessity
> > > or the cost of utrace low level guts, but this may represent the most
> > > practical engineering balance between functionality / modularity /
> > > undesirably intrusive modifications.
> >
> > How intrusive and non-modular is installing a DIE_INT3 notifier?
>
> I'm not sure about all the reasons pro/con, but it looks like
> installing such a system-wide hook would force every userspace
> breakpoint or kprobe event, machine-wide, to pass through the
> hypothetical uprobes layer, whether or not it is applicable to the
> current task.

Well, we'll have to pass through the global die notifier anyway, but a
quick per-task filter sounds like a good idea; we can do that by keeping
a per-task count of the number of uprobes in use.

Then the uprobe code can avoid the lookup if there are no task users and
no global users.

The advantage of this construct is that it easily allows for global
users, whereas a utrace-based one doesn't.
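
A minimal sketch of that fast path, assuming a hypothetical
uprobes_count field on task_struct plus a global counter, both bumped
on probe insertion and dropped on removal:

#include <linux/sched.h>
#include <asm/atomic.h>

static atomic_t uprobe_global_users = ATOMIC_INIT(0);

/* called at the top of an int3 notifier like the one sketched earlier */
static inline int uprobe_might_care(struct task_struct *task)
{
	/* no per-task users and no global users: skip the probe lookup */
	return task->uprobes_count ||		/* hypothetical field */
	       atomic_read(&uprobe_global_users);
}

With no users anywhere, the int3 path then falls through to normal
SIGTRAP delivery at the cost of two reads.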


2010-01-15 20:19:12

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 3/7] Execution out of line (XOL)


On Fri, 2010-01-15 at 10:07 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 14:43 -0800, Jim Keniston wrote:
> >
> > Yeah, there's not a lot of context there. I hope it will make more
> > sense if you read section 1.1 of Documentation/uprobes.txt (patch #6).
> > Or look at get_insn_slot() in kprobes, and understand that we're trying
> > to do something similar in uprobes, where the instruction copies have to
> > reside in the user address space of the probed process.
>
> That's not the point; changelogs shouldn't be this cryptic. They
> should be stand-alone and descriptive of the what, why, and how.

Point taken.

>
> If you can't be bothered writing such for something you want reviewed
> for inclusion then I might not be bothered looking at them at all.
>

We appreciate your persistence wrt this patch set. :-}

Jim

2010-01-15 21:07:42

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> >
> > +Instruction copies to be single-stepped are stored in a per-process
> > +"single-step out of line (XOL) area," which is a little VM area
> > +created by Uprobes in each probed process's address space.
>
> I think tinkering with the probed process's address space is a no-no.
> Have you ran this by the linux mm folks?

Sort of.

Back in 2007 (!), we were getting ready to post uprobes (which was then
essentially uprobes+xol+upb) to LKML, pondering XOL alternatives and
waiting for utrace to get pulled back into the -mm tree. (It turned out
to be a long wait.) I emailed Andrew Morton, inquiring about the
prospects for utrace and giving him a preview of utrace-based uprobes.
He expressed openness to the idea of allocating a piece of the user
address space for the XOL area, a la the vdso page.

With advice and review from Dave Hansen, we implemented an XOL page, set
up for every process (probed or not) along the same lines as the vdso
page.

About that time, Roland McGrath suggested using do_mmap_pgoff() to
create a separate vma on demand. This was the seed of the current
implementation. It had the advantages of being
architecture-independent, affecting only probed processes, and allowing
the allocation of more XOL slots. (Uprobes can make do with a fixed
number of XOL slots -- allowing one probepoint to steal another's slot
-- but it isn't pretty.)

As I recall, Dave preferred the other idea (1 XOL page for every
process, probed or not) -- mostly because he didn't like the idea of a
new vma popping into existence when the process gets probed -- but was
OK with us going ahead with Roland's idea.

(I'm not a VM guy; pardon any imprecision in my language.)

Jim

> I'd be inclined to NAK this
> straight out.
>


2010-01-15 21:19:41

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Fri, 2010-01-15 at 10:50 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 15:08 +0530, Ananth N Mavinakayanahalli wrote:
> > On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> > > >
> > > > discussed elsewhere.
> > >
> > > Thanks for the pointer...
> >
> > :-)
> >
> > Peter,
> > I think Jim was referring to
> > http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html
>
> That's a 2007 email from some obscure list... that's hardly something
> that can be referenced without a link.

Actually, I was referring to this
http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01120.html
from earlier (Monday) in this same discussion. But I'll be sure to
include pointers in the future.

For more thoughts on this approach, see
http://sourceware.org/bugzilla/show_bug.cgi?id=5509
(And no, I don't expect you to have seen that before. :-)) Most of the
troublesome issues mentioned in that enhancement request have since been
resolved.

Jim

2010-01-15 21:50:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 13:07 -0800, Jim Keniston wrote:
> On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote:
> > On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> > >
> > > +Instruction copies to be single-stepped are stored in a per-process
> > > +"single-step out of line (XOL) area," which is a little VM area
> > > +created by Uprobes in each probed process's address space.
> >
> > I think tinkering with the probed process's address space is a no-no.
> > Have you ran this by the linux mm folks?
>
> Sort of.
>
> Back in 2007 (!), we were getting ready to post uprobes (which was then
> essentially uprobes+xol+upb) to LKML, pondering XOL alternatives and
> waiting for utrace to get pulled back into the -mm tree. (It turned out
> to be a long wait.) I emailed Andrew Morton, inquiring about the
> prospects for utrace and giving him a preview of utrace-based uprobes.
> He expressed openness to the idea of allocating a piece of the user
> address space for the XOL area, a la the vdso page.
>
> With advice and review from Dave Hansen, we implemented an XOL page, set
> up for every process (probed or not) along the same lines as the vdso
> page.
>
> About that time, Roland McGrath suggested using do_mmap_pgoff() to
> create a separate vma on demand. This was the seed of the current
> implementation. It had the advantages of being
> architecture-independent, affecting only probed processes, and allowing
> the allocation of more XOL slots. (Uprobes can make do with a fixed
> number of XOL slots -- allowing one probepoint to steal another's slot
> -- but it isn't pretty.)
>
> As I recall, Dave preferred the other idea (1 XOL page for every
> process, probed or not) -- mostly because he didn't like the idea of a
> new vma popping into existence when the process gets probed -- but was
> OK with us going ahead with Roland's idea.

Well, I think it's all very gross. I would really like people to try to
'emulate' or plainly execute those original instructions from kernel
space.

As to the privileged instructions, I think qemu/kvm-like projects should
have pretty much all of that covered.

Nor do I think we need utrace at all to make user space probes useful.
Even stronger, I think the focus on utrace made you get some
fundamentals wrong. It's not mainly about task state but, as I said,
about text mappings, which is something utrace knows nothing about.

That is not to say you cannot build a useful interface from uprobes and
utrace, but it's not at all required or natural.

2010-01-15 22:28:12

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation


On Fri, 2010-01-15 at 12:18 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 12:12 +0100, Peter Zijlstra wrote:
...
> >
> > Adding the probe uses the fact that (most) executable mappings are
> > MAP_PRIVATE and CoWs a private copy of the page with the modified ins,
> > right?

We've just used access_process_vm() to insert the breakpoint
instruction. (If there are situations where that's not appropriate,
please advise.)
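
For reference, a sketch of that insertion with the 2.6-era
access_process_vm() signature and the x86 breakpoint opcode
(insert_bkpt() is a hypothetical helper name):

#include <linux/types.h>
#include <linux/mm.h>
#include <linux/sched.h>

#define BREAKPOINT_INSN 0xcc	/* int3 on x86 */

static int insert_bkpt(struct task_struct *tsk, unsigned long vaddr)
{
	u8 bp = BREAKPOINT_INSN;

	/*
	 * access_process_vm() goes through get_user_pages() with
	 * force=1, so a read-only MAP_PRIVATE text page is CoW'ed
	 * rather than written in place.
	 */
	if (access_process_vm(tsk, vaddr, &bp, 1, 1) != 1)
		return -EFAULT;
	return 0;
}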

>
> Does it clean up the CoW'ed page on removing the probe?

If I understand your question, the answer is no. We make no attempt to
reclaim CoW'ed pages, even after all the probes have been removed. In
fact, once the first probe is hit and the XOL vma is created, the XOL
vma hangs around for the life of the process.

> Does that
> account for userspace having made other changes in between installing
> and removing the probe (for PROT_WRITE mappings obviously)?

We don't attempt the aforementioned cleanup, so I think the answer is
"N/A."

Jim

2010-01-15 23:11:44

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation


On Fri, 2010-01-15 at 19:50 +0530, Srikar Dronamraju wrote:
> >
> > > > Furthermore it requires stopping and resuming tasks and nonsense like
> > > > that, which is unwanted in many cases; just run stuff from the trap site
> > > > and you're done.
> > >
> > > I don't know what you mean exactly. A trap already stopped the task.
> > > utrace merely allows various clients to inspect/manipulate the state
> > > of the task at that moment. It does not add any context switches or
> > > spurious stop/resume operations.
> >
> > Srikar seemed to suggest it needed stop/resume.
> >
>
> If a process traps, we don't need to stop/resume other threads.
> Uprobes needs threads to quiesce when inserting/deleting the breakpoint.
>

Years ago, we had pre-utrace versions of uprobes where the uprobes
breakpoint-handler code was dispatched from the die_notifier, before the
int3 turned into a SIGTRAP. I believe that's what Peter is
recommending. On my old Pentium M...
- a pre-utrace uprobe hit cost about 1 usec;
- a utrace-based uprobe hit cost about 3 usec;
- and an unboosted kprobe hit cost 0.57 usec.

So yeah, learning about the int3 via utrace after the SIGTRAP gets
created adds some overhead to uprobes. But as previously discussed in
this thread -- e.g.,
http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/02969.html --
there are ways to avoid the 2nd (single-step) trap, which should cut
overhead in half.

Jim

2010-01-15 23:44:55

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 12:12 +0100, Peter Zijlstra wrote:
...
>
> Adding the probe uses the fact that (most) executable mappings are
> MAP_PRIVATE and CoWs a private copy of the page with the modified ins,
> right?
>
> What does it do for MAP_SHARED|MAP_EXECUTABLE sections -- simply fail to
> add the probe?

If the vma containing the instruction to be probed has the VM_EXEC flag
set (and it's not in the XOL area), we go ahead and try to probe it. I'm
not familiar with the implications of MAP_SHARED|MAP_EXECUTABLE -- how
you would get such a combo, or what access_process_vm() would do with
it.

Jim

2010-01-16 00:58:31

by Jim Keniston

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)


On Fri, 2010-01-15 at 22:49 +0100, Peter Zijlstra wrote:
> On Fri, 2010-01-15 at 13:07 -0800, Jim Keniston wrote:
> > On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote:
> > > >
> > > > +Instruction copies to be single-stepped are stored in a per-process
> > > > +"single-step out of line (XOL) area," which is a little VM area
> > > > +created by Uprobes in each probed process's address space.
> > >
> > > I think tinkering with the probed process's address space is a no-no.
> > > Have you ran this by the linux mm folks?
> >
> > Sort of.
> >
> > Back in 2007 (!), we were getting ready to post uprobes (which was then
> > essentially uprobes+xol+upb) to LKML, pondering XOL alternatives and
> > waiting for utrace to get pulled back into the -mm tree. (It turned out
> > to be a long wait.) I emailed Andrew Morton, inquiring about the
> > prospects for utrace and giving him a preview of utrace-based uprobes.
> > He expressed openness to the idea of allocating a piece of the user
> > address space for the XOL area, a la the vdso page.
> >
> > With advice and review from Dave Hansen, we implemented an XOL page, set
> > up for every process (probed or not) along the same lines as the vdso
> > page.
> >
> > About that time, Roland McGrath suggested using do_mmap_pgoff() to
> > create a separate vma on demand. This was the seed of the current
> > implementation. It had the advantages of being
> > architecture-independent, affecting only probed processes, and allowing
> > the allocation of more XOL slots. (Uprobes can make do with a fixed
> > number of XOL slots -- allowing one probepoint to steal another's slot
> > -- but it isn't pretty.)
> >
> > As I recall, Dave preferred the other idea (1 XOL page for every
> > process, probed or not) -- mostly because he didn't like the idea of a
> > new vma popping into existence when the process gets probed -- but was
> > OK with us going ahead with Roland's idea.
>
> Well, I think it's all very gross. I would really like people to try to
> 'emulate' or plainly execute those original instructions from kernel
> space.
>
> As to the privileged instructions, I think qemu/kvm-like projects should
> have pretty much all of that covered.

I hear (er, read) you. Emulation may turn out to be the answer for some
architectures. But here are some things to keep in mind about the
various approaches:

1. Single-stepping inline is easiest: you need to know very little about
the instruction set you're probing. But it's inadequate for
multithreaded apps.
2. Single-stepping out of line solves the multithreading issue (as do #3
and #4), but requires more knowledge of the instruction set. (In
particular, calls, jumps, and returns need special care; as do
rip-relative instructions in x86_64.) I count 9 architectures that
support kprobes. I think most of these do SSOL.
3. "Boosted" probes (where an appended jump instruction removes the need
for the single-step trap on many instructions) require even more
knowledge of the instruction set, and like SSOL, require XOL slots.
Right now, as far as I know, x86 is the only architecture with boosted
kprobes.
4. Emulation removes the need for the XOL area, but requires pretty much
total knowledge of the instruction set. It's also a performance win for
architectures that can't do #3. I see kvm implemented on 4
architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
architectures to which uprobes (old uprobes, with ubp and xol bundled
in) has already been ported (though Intel hasn't been maintaining their
ia64 port). So it sort of comes down to how objectionable the XOL vma
(or page) really is.
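
Here is the slot sketch referred to in item 3 (not from this patch set;
the field names and the fixed jump size are assumptions for x86):

#include <linux/types.h>
#include <asm/kprobes.h>	/* MAX_INSN_SIZE */

struct xol_slot {
	/*
	 * copy of the probed instruction, possibly fixed up
	 * (e.g. rip-relative operands on x86_64)
	 */
	u8 insn[MAX_INSN_SIZE];
	/*
	 * boosted case (#3) only: e9 <rel32>, jumping back to
	 * probed_vaddr + insn_length; plain SSOL (#2) omits this
	 * and takes the single-step trap instead
	 */
	u8 jmp[5];
};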

Regarding your suggestion about executing the probed instruction in the
kernel, how widely do you think that can be applied: which
architectures? how much of the instruction set?

>
> Nor do I think we need utrace at all to make user space probes useful.
> Even stronger, I think the focus on utrace made you get some
> fundamentals wrong. It's not mainly about task state but, as I said,
> about text mappings, which is something utrace knows nothing about.

I think that's a useful insight. As mentioned, long ago we offered up a
version of uprobes where probes were per-executable rather than
per-process. The feedback from LKML was, in no uncertain terms, that
they should be per-process, and use access_process_vm(). Of course --
as we then argued -- sometimes you want to probe a process from the very
start, so the SystemTap folks had to invent the task-finder to allow
that.

>
> That is not to say you cannot build a useful interface from uprobes and
> utrace, but it's not at all required or natural.
>

Thanks again for your advice and ideas.

Jim

2010-01-16 10:04:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation

On Fri, 2010-01-15 at 15:44 -0800, Jim Keniston wrote:
> On Fri, 2010-01-15 at 12:12 +0100, Peter Zijlstra wrote:
> ....
> >
> > Adding the probe uses the fact that (most) executable mappings are
> > MAP_PRIVATE and CoWs a private copy of the page with the modified ins,
> > right?
> >
> > What does it do for MAP_SHARED|MAP_EXECUTABLE sections -- simply fail to
> > add the probe?
>
> If the vma containing the instruction to be probed has the VM_EXEC flag
> set (and it's not in the XOL area), we go ahead and try to probe it. I'm
> not familiar with the implications of MAP_SHARED|MAP_EXECUTABLE -- how
> you would get such a combo, or what access_process_vm() would do with
> it.

I'm not sure how you'd get one; the user has to explicitly create one, I
think. Regular loaders don't create such things, but maybe JITs do.

The problem is that for MAP_SHARED you cannot CoW the page; you have to
modify the original page, which might get written back into the file if
it's file-based -- not something you'd want to have happen, I guess.
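
A sketch of the register-time check that policy implies
(vma_is_probeable() is a hypothetical helper name):

#include <linux/mm.h>

static int vma_is_probeable(struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_EXEC))
		return 0;	/* not text */
	if (vma->vm_flags & VM_SHARED)
		return 0;	/* the int3 could be written back to the file */
	return 1;
}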

2010-01-16 10:33:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On Fri, 2010-01-15 at 16:58 -0800, Jim Keniston wrote:
> But here are some things to keep in mind about the
> various approaches:
>
> 1. Single-stepping inline is easiest: you need to know very little about
> the instruction set you're probing. But it's inadequate for
> multithreaded apps.
> 2. Single-stepping out of line solves the multithreading issue (as do #3
> and #4), but requires more knowledge of the instruction set. (In
> particular, calls, jumps, and returns need special care; as do
> rip-relative instructions in x86_64.) I count 9 architectures that
> support kprobes. I think most of these do SSOL.
> 3. "Boosted" probes (where an appended jump instruction removes the need
> for the single-step trap on many instructions) require even more
> knowledge of the instruction set, and like SSOL, require XOL slots.
> Right now, as far as I know, x86 is the only architecture with boosted
> kprobes.
> 4. Emulation removes the need for the XOL area, but requires pretty much
> total knowledge of the instruction set. It's also a performance win for
> architectures that can't do #3. I see kvm implemented on 4
> architectures (ia64, powerpc, s390, x86). Coincidentally, those are the
> architectures to which uprobes (old uprobes, with ubp and xol bundled
> in) has already been ported (though Intel hasn't been maintaining their
> ia64 port).

Right, so I was thinking a combination of #4 and executing from kernel
space would be feasible. I would think most regular instructions are
runnable from kernel space, given that we provide the proper pt_regs
environment.

Although I just realized we need to fully emulate the address-computation
step for all memory writes; otherwise a wild userspace pointer might end
up writing into your kernel image.

Also, don't we already need full knowledge of the instruction set in
order to decode the instruction stream and find instruction boundaries?
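
For reference, a sketch of that decode step using the x86 instruction
decoder the patch set depends on (2.6-era three-argument insn_init();
the bare length check stands in for a fuller validity test):

#include <linux/types.h>
#include <asm/insn.h>

static int insn_boundary(u8 *buf, int x86_64, unsigned int *len)
{
	struct insn insn;

	insn_init(&insn, buf, x86_64);
	insn_get_length(&insn);	/* decodes prefixes through immediate */
	if (!insn.length)
		return -EINVAL;	/* decoder could not make sense of it */
	*len = insn.length;
	return 0;
}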

> So it sort of comes down to how objectionable the XOL vma (or page)
> really is.

Well, I really hate touching the address space, and the fact that it
perturbs the probed application in very obvious ways.

FWIW, I think the VDSO is ugly too and would have objected to it were it
proposed now -- there are much better solutions for that
(/sys/lib/libkernel.so comes to mind).

> Regarding your suggestion about executing the probed instruction in the
> kernel, how widely do you think that can be applied: which
> architectures? how much of the instruction set?

I only know some of x86; I really couldn't tell for any other arch.

2010-01-16 15:51:11

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC] [PATCH 4/7] Uprobes Implementation


Jim Keniston <[email protected]> writes:

> [...]
> Years ago, we had pre-utrace versions of uprobes where the uprobes
> breakpoint-handler code was dispatched from the die_notifier, before the
> int3 turned into a SIGTRAP. I believe that's what Peter is
> recommending. On my old Pentium M...
> - a pre-utrace uprobe hit cost about 1 usec;
> - a utrace-based uprobe hit cost about 3 usec;
> [...]
> So yeah, learning about the int3 via utrace after the SIGTRAP gets
> created adds some overhead to uprobes. [...]

Was this test comparing like with like? For example, did it account
for factors such as other processes being gdb-int3-instrumented, or
lots of kprobes being active? Differently multithreaded? Demultiplexing
probes amongst multiple processes?

(It's counterintuitive that the utrace/kernel int3->sigtrap
dispatching code alone should cause thousands of extra instructions.)

- FChE

2010-02-07 13:49:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)

On 01/27/2010 12:23 PM, Ingo Molnar wrote:
> * Avi Kivity<[email protected]> wrote:
>

(back from vacation)

>>> If so then you ignore the obvious solution to _that_ problem: don't use
>>> INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks.
>>> It's _MUCH_ faster than _any_ breakpoint-based solution - literally just
>>> the cost of a function call (or not even that - I've written very fast
>>> inlined tracers - they do rock when it comes to performance). Problem
>>> solved, and none of the INT3 details matter at all.
>>>
>> However did I not think of that? Yes, and let's rip kprobes tracing
>> out of the kernel; we can always rebuild it.
>>
>> Well, I'm observing an issue in a production system now. I may not want to
>> take it down, or if I take it down I may not be able to observe it again as
>> the problem takes a couple of days to show up, or I may not have the full
>> source, or it takes 10 minutes to build and so an iterative edit/build/run
>> cycle can stretch for hours.
>>
> You have somewhat misconstrued my argument. What I said above is that _if_ you
> need extreme levels of performance you always have the option to go even
> faster via specialized tracing solutions. I did not promote it as a
> replacement solution. Specialization obviously brings in a new set of
> problems: inflexibility and non-transparency, of which you gave an example
> above.
>
> Your proposed solution brings in precisely such kinds of issues, on a
> different level, just to improve performance at the cost of transparency and
> at the cost of features and robustness.
>

We just disagree on the intrusiveness, then. IMO it will be a very rare
application that really suffers from a vma injection, since most apps
don't manage their vmas directly but leave it to the kernel and ld.so.

> It's btw rather ironic as your arguments are somewhat similar to the Xen vs.
> KVM argument just turned around: KVM started out slower by relying on hardware
> implementation for virtualization while Xen relied on a clever but limiting
> hack. With each CPU generation the hardware got faster, while the various
> design limitations of Xen are hurting it and KVM is winning that race.
>
> A (partially) similar situation exists here: INT3 into ring 0 and handling it
> there in a protected environment might be more expensive, but _if_ it matters
> to performance it sure could be made faster in hardware (and in fact it will
> become faster with every new generation of hardware).
>

Not at all. For kvm, hardware eliminates exits completely, whereas pv Xen
tries to reduce their cost; but an INT3 will forever be much more
expensive than a jump.

You are right, however, that we should favour hardware support where
available, and for high-bandwidth tracing it is available: the branch
trace store. With that, it is easy to know how many times the processor
passed through some code point, as well as to reconstruct the entire
call chain -- basically what the function tracer does for the kernel.

Do we have facilities for exposing that to userspace? It can also be
very useful for the kernel.

It will still be slower if we only trace a few points, and it can't
trace register and memory values, but it's a good tool to have IMO.

> Both Peter and I are telling you that we consider your solution too
> specialized, at the cost of flexibility, features, and robustness.
>

We'll agree to disagree on that then.

--
error compiling committee.c: too many arguments to function