2011-04-01 14:42:18

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 0/26] 0: Uprobes patchset with perf probe support


This patchset implements Uprobes, which enables you to dynamically
break into any routine in a user-space application and collect
information non-disruptively.

This patchset resolves most of the comments on the previous
posting http://lkml.org/lkml/2011/3/14/171/. It also adds perf probe
support for user space tracing. This patchset applies on top of tip
commit acf359c95697.

Uprobes Patches
This patchset implements inode-based uprobes, which are specified as
<file>:<offset> where offset is the offset from the start of the map.
The probe-hit overhead is around 3X the overhead of the pid-based
patchset.
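
As a rough sketch (illustrative helper, not part of the patchset): a
<file>:<offset> spec translates to a virtual address inside a mapping
of that file the same way the registration code later in this series
computes it:

/*
 * Sketch: translate a probe's file offset into a virtual address
 * within one mapping of the file (vm_pgoff is the file offset of
 * vma->vm_start, in pages).
 */
static unsigned long probe_vaddr(struct vm_area_struct *vma, loff_t offset)
{
	return vma->vm_start + offset - (vma->vm_pgoff << PAGE_SHIFT);
}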

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction
with a breakpoint instruction. (Uprobes uses a background page
replacement mechanism, which ensures that the breakpoint affects only
that process.)

When a CPU hits the breakpoint instruction, Uprobes is notified of the
trap and finds the associated uprobe. It then executes the associated
handler, single-steps its copy of the probed instruction, and resumes
execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.

For previous postings, please refer to: http://lkml.org/lkml/2010/12/16/65
http://lkml.org/lkml/2010/8/25/165 http://lkml.org/lkml/2010/7/27/121
http://lkml.org/lkml/2010/7/12/67 http://lkml.org/lkml/2010/7/8/239
http://lkml.org/lkml/2010/6/29/299 http://lkml.org/lkml/2010/6/14/41
http://lkml.org/lkml/2010/3/20/107 and http://lkml.org/lkml/2010/5/18/307

This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199). This implementation of uprobes
doesn't depend on utrace.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace-based mechanisms, uprobes tracing wouldn't
involve signals, stopping threads, or context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace-based mechanisms use inline single-stepping, i.e. they
copy back the original instruction on hitting a breakpoint. With such
mechanisms, tracers have to stop all threads on a breakpoint hit, or
they will not be able to handle all hits on the location of interest.
Uprobes uses execution out of line: the instruction to be traced is
analysed at the time of breakpoint insertion and a copy of the
instruction is stored at a different location. On a breakpoint hit,
uprobes jumps to that copied location, single-steps the instruction
there, and does the necessary fixups after single-stepping.

3. Multiple tracers for an application.
Multiple uprobes-based tracers could work in unison to trace an
application. There could be one tracer interested in generic events
for a particular set of processes, while another tracer is interested
only in one specific event of a particular process that is part of
that set.

4. Correlating events from kernel and userspace.
Uprobes could be used with other tools like kprobes and tracepoints,
or as part of higher-level tools like perf, to give a consolidated set
of events from kernel and userspace. In the future we could look at a
single backtrace showing application, library and kernel calls.

Here is the list of TODO items:

- Breakpoint handling should co-exist with singlestep/blockstep from
another tracer/debugger.
- Queue and dequeue signals delivered from the singlestep until
completion of post-processing.
- Prefiltering (i.e. filtering at the time of probe insertion).
- Return probes.
- Support for other architectures.
- Uprobes booster.
- Replace macro W with bits in the inat table.

To try it out, please fetch using:
git fetch \
git://git.kernel.org/pub/scm/linux/kernel/git/srikar/linux-uprobes.git \
tip_inode_uprobes_010411:tip_inode_uprobes

Please refer "[RFC] [PATCH 2.6.37-rc5-tip 20/20] 20: tracing: uprobes
trace_event infrastructure" on how to use uprobe_tracer.

Please do provide your valuable comments.

Thanks in advance.
Srikar

Srikar Dronamraju (26)
0: Uprobes patchset with perf probe support
1: mm: replace_page() loses static attribute
2: mm: Move replace_page() to mm/memory.c
3: X86 specific breakpoint definitions.
4: uprobes: Background page replacement.
5: uprobes: Add and remove a uprobe in an rb tree.
6: Uprobes: register/unregister probes.
7: x86: analyze instruction and determine fixups.
8: uprobes: store/restore original instruction.
9: uprobes: mmap and fork hooks.
10: x86: architecture specific task information.
11: uprobes: task specific information.
12: uprobes: slot allocation for uprobes
13: uprobes: get the breakpoint address.
14: x86: x86 specific probe handling
15: uprobes: Handling int3 and singlestep exceptions.
16: x86: uprobes exception notifier for x86.
17: uprobes: register a notifier for uprobes.
18: uprobes: commonly used filters.
19: tracing: Extract out common code for kprobes/uprobes traceevents.
20: tracing: uprobes trace_event interface
21: Signed-off-by: Srikar Dronamraju <[email protected]>
22: perf: rename target_module to target
23: perf: show possible probes in a given executable file or library.
24: perf: perf interface for uprobes
25: perf: Documentation for perf uprobes
26: uprobes: filter chain


Documentation/trace/uprobetrace.txt | 94 ++
arch/Kconfig | 4 +
arch/x86/Kconfig | 3 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/include/asm/uprobes.h | 55 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/signal.c | 14 +
arch/x86/kernel/uprobes.c | 614 +++++++++++++
include/linux/mm.h | 2 +
include/linux/mm_types.h | 9 +
include/linux/sched.h | 3 +
include/linux/uprobes.h | 197 ++++
kernel/Makefile | 1 +
kernel/fork.c | 10 +
kernel/trace/Kconfig | 20 +
kernel/trace/Makefile | 2 +
kernel/trace/trace.h | 5 +
kernel/trace/trace_kprobe.c | 861 +------------------
kernel/trace/trace_probe.c | 753 ++++++++++++++++
kernel/trace/trace_probe.h | 160 ++++
kernel/trace/trace_uprobe.c | 803 +++++++++++++++++
kernel/uprobes.c | 1474 +++++++++++++++++++++++++++++++
mm/ksm.c | 62 --
mm/memory.c | 62 ++
mm/mmap.c | 6 +
tools/perf/Documentation/perf-probe.txt | 21 +-
tools/perf/builtin-probe.c | 26 +-
tools/perf/util/probe-event.c | 431 ++++++++--
tools/perf/util/probe-event.h | 12 +-
tools/perf/util/symbol.c | 8 +
tools/perf/util/symbol.h | 1 +
31 files changed, 4714 insertions(+), 1002 deletions(-)


2011-04-01 14:42:25

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 1/26] 1: mm: replace_page() loses static attribute


User bkpt will use the background page replacement approach to
insert/delete breakpoints. The background page replacement approach is
based on replace_page(); hence replace_page() loses its static
attribute. This is a precursor to moving replace_page() to mm/memory.c.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
---
include/linux/mm.h | 2 ++
mm/ksm.c | 2 +-
2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7606d7d..089bda4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1008,6 +1008,8 @@ void account_page_writeback(struct page *page);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte);

/* Is the vma a continuation of the stack vma above it? */
static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
diff --git a/mm/ksm.c b/mm/ksm.c
index 1bbe785..f444158 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -760,7 +760,7 @@ out:
*
* Returns 0 on success, -EFAULT on failure.
*/
-static int replace_page(struct vm_area_struct *vma, struct page *page,
+int replace_page(struct vm_area_struct *vma, struct page *page,
struct page *kpage, pte_t orig_pte)
{
struct mm_struct *mm = vma->vm_mm;

2011-04-01 14:42:40

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 2/26] 2: mm: Move replace_page() to mm/memory.c


Move replace_page() out of mm/ksm.c to mm/memory.c so that
replace_page() can be used even when CONFIG_KSM is not defined.
replace_page() is used to implement background page replacement.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
---
mm/ksm.c | 62 -----------------------------------------------------------
mm/memory.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+), 62 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index f444158..61df1db 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -751,68 +751,6 @@ out:
return err;
}

-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
-
- addr = page_address_in_vma(page, vma);
- if (addr == -EFAULT)
- goto out;
-
- pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud))
- goto out;
-
- pmd = pmd_offset(pud, addr);
- BUG_ON(pmd_trans_huge(*pmd));
- if (!pmd_present(*pmd))
- goto out;
-
- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
- if (!pte_same(*ptep, orig_pte)) {
- pte_unmap_unlock(ptep, ptl);
- goto out;
- }
-
- get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
-
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
-
- page_remove_rmap(page);
- if (!page_mapped(page))
- try_to_free_swap(page);
- put_page(page);
-
- pte_unmap_unlock(ptep, ptl);
- err = 0;
-out:
- return err;
-}
-
static int page_trans_compound_anon_split(struct page *page)
{
int ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index 9da8cab..60b8494 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2733,6 +2733,68 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma: vma that holds the pte pointing to page
+ * @page: the page we are replacing by kpage
+ * @kpage: the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int err = -EFAULT;
+
+ addr = page_address_in_vma(page, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ BUG_ON(pmd_trans_huge(*pmd));
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ get_page(kpage);
+ page_add_anon_rmap(kpage, vma, addr);
+
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+ page_remove_rmap(page);
+ if (!page_mapped(page))
+ try_to_free_swap(page);
+ put_page(page);
+
+ pte_unmap_unlock(ptep, ptl);
+ err = 0;
+out:
+ return err;
+}
+
int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
{
struct address_space *mapping = inode->i_mapping;

2011-04-01 14:42:51

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 3/26] 3: X86 specific breakpoint definitions.


Provides definitions for the breakpoint instruction and the
x86-specific uprobe info structure.

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/Kconfig | 3 +++
arch/x86/include/asm/uprobes.h | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/uprobes.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da3b204..051fe74 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -238,6 +238,9 @@ config ARCH_CPU_PROBE_RELEASE
def_bool y
depends on HOTPLUG_CPU

+config ARCH_SUPPORTS_UPROBES
+ def_bool y
+
source "init/Kconfig"
source "kernel/Kconfig.freezer"

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..5026359
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,40 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES 128 /* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+#ifdef CONFIG_X86_64
+struct uprobe_arch_info {
+ unsigned long rip_rela_target_address;
+};
+#else
+struct uprobe_arch_info {};
+#endif
+#endif /* _ASM_UPROBES_H */

2011-04-01 14:43:03

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 4/26] 4: uprobes: Background page replacement.


Provides background page replacement using the replace_page() routine.
Also provides routines to read an opcode from a given virtual address
and to verify whether an instruction is a breakpoint instruction.
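
As an illustration of how these primitives compose, a minimal sketch
(poke_then_restore is a hypothetical caller, assumed to run with
tsk->mm->mmap_sem held for read as the function comments require):

static int poke_then_restore(struct task_struct *tsk, struct uprobe *uprobe,
			     unsigned long vaddr)
{
	uprobe_opcode_t opcode;
	int ret;

	ret = set_bkpt(tsk, uprobe, vaddr);	/* write UPROBES_BKPT_INSN */
	if (ret)
		return ret;

	/* reading back should now find the breakpoint opcode */
	ret = read_opcode(tsk, vaddr, &opcode);
	if (ret || !is_bkpt_insn(&opcode))
		return ret ? ret : -EINVAL;

	/* verify=true re-checks for the breakpoint before restoring */
	return set_orig_insn(tsk, uprobe, vaddr, true);
}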

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
arch/Kconfig | 12 ++
include/linux/uprobes.h | 81 +++++++++++++++
kernel/Makefile | 1
kernel/uprobes.c | 254 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 348 insertions(+), 0 deletions(-)
create mode 100644 include/linux/uprobes.h
create mode 100644 kernel/uprobes.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f78c2be..c681f16 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,18 @@ config OPTPROBES
depends on KPROBES && HAVE_OPTPROBES
depends on !PREEMPT

+config UPROBES
+ bool "User-space probes (EXPERIMENTAL)"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select MM_OWNER
+ help
+ Uprobes enables kernel subsystems to establish probepoints
+ in user applications and execute handler functions when
+ the probepoints are hit.
+
+ If in doubt, say "N".
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..ae134a5
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,81 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+/*
+ * ARCH_SUPPORTS_UPROBES is not defined.
+ */
+typedef u8 uprobe_opcode_t;
+#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
+
+/* Post-execution fixups. Some architectures may define others. */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE 0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP 0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL 0x2
+/* Might sleep while doing Fixup */
+#define UPROBES_FIX_SLEEPY 0x4
+
+#ifndef UPROBES_FIX_DEFAULT
+#define UPROBES_FIX_DEFAULT UPROBES_FIX_IP
+#endif
+
+/* Unexported functions & macros for use by arch-specific code */
+#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))
+
+/*
+ * Most architectures can use the default versions of @read_opcode(),
+ * @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
+ *
+ * @set_ip:
+ * Set the instruction pointer in @regs to @vaddr.
+ * @analyze_insn:
+ * Analyze @user_bkpt->insn. Return 0 if @user_bkpt->insn is an
+ * instruction you can probe, or a negative errno (typically -%EPERM)
+ * otherwise. Determine what sort of XOL-related fixups @post_xol()
+ * (and possibly @pre_xol()) will need to do for this instruction, and
+ * annotate @user_bkpt accordingly. You may modify @user_bkpt->insn
+ * (e.g., the x86_64 port does this for rip-relative instructions).
+ * @pre_xol:
+ * Called just before executing the instruction associated
+ * with @user_bkpt out of line. @user_bkpt->xol_vaddr is the address
+ * in @tsk's virtual address space where @user_bkpt->insn has been
+ * copied. @pre_xol() should at least set the instruction pointer in
+ * @regs to @user_bkpt->xol_vaddr -- which is what the default,
+ * @pre_xol(), does.
+ * @post_xol:
+ * Called after executing the instruction associated with
+ * @user_bkpt out of line. @post_xol() should perform the fixups
+ * specified in @user_bkpt->fixups, which includes ensuring that the
+ * instruction pointer in @regs points at the next instruction in
+ * the probed instruction stream. @tskinfo is as for @pre_xol().
+ * You must provide this function.
+ */
+#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..7c26588 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_UPROBES) += uprobes.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..9ef21a7
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,254 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uprobes.h>
+#include <linux/rmap.h> /* needed for anon_vma_prepare */
+
+struct uprobe {
+ u8 insn[MAX_UINSN_BYTES];
+ u16 fixups;
+};
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest-size instruction for
+ * the architecture. If an arch has variable-length instructions and the
+ * breakpoint instruction is not of the smallest length supported by that
+ * architecture, then we need to modify read_opcode/write_opcode
+ * accordingly. This would never be a problem for archs that have
+ * fixed-length instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with tsk->mm->mmap_sem held for read, and with a reference to
+ * tsk->mm.
+ *
+ * For task @tsk, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
+ unsigned long vaddr, uprobe_opcode_t opcode)
+{
+ struct page *old_page, *new_page;
+ void *vaddr_old, *vaddr_new;
+ struct vm_area_struct *vma;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ unsigned long addr;
+ int ret;
+
+ /* Read the page with vaddr into memory */
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
+ if (ret <= 0)
+ return -EINVAL;
+ ret = -EINVAL;
+
+ /*
+ * We are interested in text pages only. Our pages of interest
+ * should be mapped for read and execute only. We desist from
+ * adding probes in write mapped pages since the breakpoints
+ * might end up in the file copy.
+ */
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
+ (VM_READ|VM_EXEC))
+ goto put_out;
+
+ /* Allocate a page */
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+ if (!new_page) {
+ ret = -ENOMEM;
+ goto put_out;
+ }
+
+ /*
+ * lock page will serialize against do_wp_page()'s
+ * PageAnon() handling
+ */
+ lock_page(old_page);
+ /* copy the page now that we've got it stable */
+ vaddr_old = kmap_atomic(old_page, KM_USER0);
+ vaddr_new = kmap_atomic(new_page, KM_USER1);
+
+ memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+ /* poke the new insn in, ASSUMES we don't cross page boundary */
+ addr = vaddr;
+ vaddr &= ~PAGE_MASK;
+ memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+ kunmap_atomic(vaddr_new, KM_USER1);
+ kunmap_atomic(vaddr_old, KM_USER0);
+
+ orig_pte = page_check_address(old_page, tsk->mm, addr, &ptl, 0);
+ if (!orig_pte)
+ goto unlock_out;
+ pte_unmap_unlock(orig_pte, ptl);
+
+ lock_page(new_page);
+ ret = anon_vma_prepare(vma);
+ if (!ret)
+ ret = replace_page(vma, old_page, new_page, *orig_pte);
+
+ unlock_page(new_page);
+ if (ret != 0)
+ page_cache_release(new_page);
+unlock_out:
+ unlock_page(old_page);
+
+put_out:
+ put_page(old_page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with tsk->mm->mmap_sem held for read, and with a reference to
+ * tsk->mm.
+ *
+ * For task @tsk, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+ uprobe_opcode_t *opcode)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+ void *vaddr_new;
+ int ret;
+
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
+ if (ret <= 0)
+ return -EFAULT;
+ ret = -EFAULT;
+
+ /*
+ * We are interested in text pages only. Our pages of interest
+ * should be mapped for read and execute only. We desist from
+ * adding probes in write mapped pages since the breakpoints
+ * might end up in the file copy.
+ */
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
+ (VM_READ|VM_EXEC))
+ goto put_out;
+
+ lock_page(page);
+ vaddr_new = kmap_atomic(page, KM_USER0);
+ vaddr &= ~PAGE_MASK;
+ memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+ kunmap_atomic(vaddr_new, KM_USER0);
+ unlock_page(page);
+ ret = 0;
+
+put_out:
+ put_page(page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For task @tsk, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr)
+{
+ return write_opcode(tsk, uprobe, vaddr, UPROBES_BKPT_INSN);
+}
+
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existence of the breakpoint instruction.
+ *
+ * For task @tsk, restore the original opcode (opcode) at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr, bool verify)
+{
+ if (verify) {
+ uprobe_opcode_t opcode;
+ int result = read_opcode(tsk, vaddr, &opcode);
+ if (result)
+ return result;
+ if (opcode != UPROBES_BKPT_INSN)
+ return -EINVAL;
+ }
+ return write_opcode(tsk, uprobe, vaddr,
+ *(uprobe_opcode_t *) uprobe->insn);
+}
+
+static void print_insert_fail(struct task_struct *tsk,
+ unsigned long vaddr, const char *why)
+{
+ printk(KERN_ERR "Can't place breakpoint at pid %d vaddr %#lx: %s\n",
+ tsk->pid, vaddr, why);
+}
+
+/*
+ * uprobes_resume_can_sleep - Check if fixup might result in sleep.
+ * @uprobe: the probepoint information.
+ *
+ * Returns true if fixup might result in sleep.
+ */
+static bool uprobes_resume_can_sleep(struct uprobe *uprobe)
+{
+ return uprobe->fixups & UPROBES_FIX_SLEEPY;
+}
+
+/**
+ * is_bkpt_insn - check if instruction is breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default implementation of is_bkpt_insn
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+ uprobe_opcode_t opcode;
+
+ memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
+ return (opcode == UPROBES_BKPT_INSN);
+}

2011-04-01 14:43:15

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 5/26] 5: uprobes: Add and remove a uprobe in an rb tree.


Provides interfaces to add and remove uprobes from the global rb tree.
Also provides definitions for uprobe_consumer and interfaces to add and
remove a consumer from a uprobe. There is a unique uprobe element in
the rb tree for each unique inode:offset pair.

A uprobe gets added to the global rb tree when the first consumer for
that uprobe gets registered. It gets removed from the tree only when
all registered consumers are unregistered.

Multiple consumers can share the same probe. Each consumer provides a
handler that runs on probe hit and an optional filter callback that
limits the tasks on which the handler runs, as in the sketch below.
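
For illustration, a consumer that fires only for one process could look
like this minimal sketch (my_filter, my_handler and traced_tgid are
hypothetical names, not part of the patch):

static pid_t traced_tgid;	/* pid of the process we care about */

static bool my_filter(struct uprobe_consumer *self, struct task_struct *task)
{
	/* run the handler only for threads of one process */
	return task->tgid == traced_tgid;
}

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	pr_info("probe hit at ip %#lx\n", instruction_pointer(regs));
	return 0;
}

static struct uprobe_consumer my_consumer = {
	.handler = my_handler,
	.filter = my_filter,
};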

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 12 ++
kernel/uprobes.c | 226 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 234 insertions(+), 4 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index ae134a5..bfe2e9e 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -23,6 +23,7 @@
* Jim Keniston
*/

+#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
#else
@@ -50,6 +51,17 @@ typedef u8 uprobe_opcode_t;
/* Unexported functions & macros for use by arch-specific code */
#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))

+struct uprobe_consumer {
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+ /*
+ * filter is optional; If a filter exists, handler is run
+ * if and only if filter returns true.
+ */
+ bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+ struct uprobe_consumer *next;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9ef21a7..f37418b 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,10 +33,28 @@
#include <linux/rmap.h> /* needed for anon_vma_prepare */

struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref; /* lifetime muck */
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* we hold a ref */
+ loff_t offset;
u8 insn[MAX_UINSN_BYTES];
u16 fixups;
};

+static bool valid_vma(struct vm_area_struct *vma)
+{
+ if (!vma->vm_file)
+ return false;
+
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+ (VM_READ|VM_EXEC))
+ return true;
+
+ return false;
+}
+
/*
* NOTE:
* Expect the breakpoint instruction to be the smallest size instruction for
@@ -83,8 +101,7 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
* adding probes in write mapped pages since the breakpoints
* might end up in the file copy.
*/
- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
- (VM_READ|VM_EXEC))
+ if (!valid_vma(vma))
goto put_out;

/* Allocate a page */
@@ -164,8 +181,7 @@ int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
* adding probes in write mapped pages since the breakpoints
* might end up in the file copy.
*/
- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
- (VM_READ|VM_EXEC))
+ if (!valid_vma(vma))
goto put_out;

lock_page(page);
@@ -252,3 +268,205 @@ bool __weak is_bkpt_insn(u8 *insn)
memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
return (opcode == UPROBES_BKPT_INSN);
}
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(treelock);
+
+static int match_inode(struct uprobe *uprobe, struct inode *inode,
+ struct rb_node **p)
+{
+ struct rb_node *n = *p;
+
+ if (inode < uprobe->inode)
+ *p = n->rb_left;
+ else if (inode > uprobe->inode)
+ *p = n->rb_right;
+ else
+ return 1;
+ return 0;
+}
+
+static int match_offset(struct uprobe *uprobe, loff_t offset,
+ struct rb_node **p)
+{
+ struct rb_node *n = *p;
+
+ if (offset < uprobe->offset)
+ *p = n->rb_left;
+ else if (offset > uprobe->offset)
+ *p = n->rb_right;
+ else
+ return 1;
+ return 0;
+}
+
+
+/* Called with treelock held */
+static struct uprobe *__find_uprobe(struct inode * inode,
+ loff_t offset, struct rb_node **near_match)
+{
+ struct rb_node *n = uprobes_tree.rb_node;
+ struct uprobe *uprobe, *u = NULL;
+
+ while (n) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ if (match_inode(uprobe, inode, &n)) {
+ if (near_match)
+ *near_match = n;
+ if (match_offset(uprobe, offset, &n)) {
+ /* get access ref */
+ atomic_inc(&uprobe->ref);
+ u = uprobe;
+ break;
+ }
+ }
+ }
+ return u;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires treelock
+ */
+static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
+{
+ struct uprobe *uprobe;
+ unsigned long flags;
+
+ spin_lock_irqsave(&treelock, flags);
+ uprobe = __find_uprobe(inode, offset, NULL);
+ spin_unlock_irqrestore(&treelock, flags);
+ return uprobe;
+}
+
+/*
+ * Acquires treelock.
+ * Matching uprobe already exists in rbtree;
+ * increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ * get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+ struct rb_node **p = &uprobes_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct uprobe *u;
+ unsigned long flags;
+
+ spin_lock_irqsave(&treelock, flags);
+ while (*p) {
+ parent = *p;
+ u = rb_entry(parent, struct uprobe, rb_node);
+ if (u->inode > uprobe->inode)
+ p = &(*p)->rb_left;
+ else if (u->inode < uprobe->inode)
+ p = &(*p)->rb_right;
+ else {
+ if (u->offset > uprobe->offset)
+ p = &(*p)->rb_left;
+ else if (u->offset < uprobe->offset)
+ p = &(*p)->rb_right;
+ else {
+ /* get access ref */
+ atomic_inc(&u->ref);
+ goto unlock_return;
+ }
+ }
+ }
+ u = NULL;
+ rb_link_node(&uprobe->rb_node, parent, p);
+ rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+ /* get access + drop ref */
+ atomic_set(&uprobe->ref, 2);
+
+unlock_return:
+ spin_unlock_irqrestore(&treelock, flags);
+ return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+ if (atomic_dec_and_test(&uprobe->ref))
+ kfree(uprobe);
+}
+
+static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
+{
+ struct uprobe *uprobe, *cur_uprobe;
+
+ uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+ if (!uprobe)
+ return NULL;
+
+ __iget(inode);
+ uprobe->inode = inode;
+ uprobe->offset = offset;
+ init_rwsem(&uprobe->consumer_rwsem);
+
+ /* add to uprobes_tree, sorted on inode:offset */
+ cur_uprobe = insert_uprobe(uprobe);
+
+ /* a uprobe exists for this inode:offset combination */
+ if (cur_uprobe) {
+ kfree(uprobe);
+ uprobe = cur_uprobe;
+ iput(inode);
+ }
+ return uprobe;
+}
+
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_consumer *consumer;
+
+ down_read(&uprobe->consumer_rwsem);
+ consumer = uprobe->consumers;
+ while (consumer) {
+ if (!consumer->filter || consumer->filter(consumer, current))
+ consumer->handler(consumer, regs);
+
+ consumer = consumer->next;
+ }
+ up_read(&uprobe->consumer_rwsem);
+}
+
+static void add_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ down_write(&uprobe->consumer_rwsem);
+ consumer->next = uprobe->consumers;
+ uprobe->consumers = consumer;
+ up_write(&uprobe->consumer_rwsem);
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ struct uprobe_consumer *con;
+ bool ret = false;
+
+ down_write(&uprobe->consumer_rwsem);
+ con = uprobe->consumers;
+ if (consumer == con) {
+ uprobe->consumers = con->next;
+ if (!con->next)
+ put_uprobe(uprobe); /* drop creation ref */
+ ret = true;
+ } else {
+ for (; con; con = con->next) {
+ if (con->next == consumer) {
+ con->next = consumer->next;
+ ret = true;
+ break;
+ }
+ }
+ }
+ up_write(&uprobe->consumer_rwsem);
+ return ret;
+}

2011-04-01 14:43:28

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 6/26] 6: Uprobes: register/unregister probes.


A probe is specified by a file:offset pair. While registering, a
breakpoint is inserted for the first consumer; on subsequent
registrations the consumer gets appended to the existing consumers.
While unregistering, the breakpoint is removed if the consumer happens
to be the last one; for all other unregistrations, the consumer is
simply deleted from the list of consumers.

Probe specifications are maintained in an rb tree. A probe
specification is converted into a uprobe before being stored in the rb
tree. A uprobe can be shared by many consumers.

Given an inode, we get a list of mm's that have mapped the inode.
However, we want to limit the probes to certain processes/threads, and
the filtering should be at thread level. To limit the probes to certain
processes/threads, we would want to walk through the list of threads
whose mm member refers to a given mm.

Here are the options that I thought of:
1. Use mm->owner and walk through the thread group of mm->owner, the
siblings of mm->owner, and the siblings of the parent of mm->owner.
This should be a good list to traverse, but I am not sure it is
exhaustive enough to cover all tasks that have their mm set to this
mm_struct.

2. Install probes on all mm's that have mapped the inode and filter
only at probe-hit time.

3. Walk through do_each_thread; while_each_thread. I think this will
catch all tasks that have their mm set to the given mm. However, this
might be too heavy, especially if the mm corresponds to a library.

4. Add a list_head element to the mm struct and update the list
whenever task->mm gets updated. This could mean extending the current
mm->owner mechanism. However, there is some maintenance overhead.

Currently we use the second approach, i.e. we probe all mm's that have
mapped the inode and filter only at probe hit.

I would also be interested to know if there are ways to call
replace_page() without having to take mmap_sem.
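
For illustration, registering and unregistering through this interface
could look like the sketch below (the file name and offset are made up,
and my_consumer is the hypothetical consumer from the previous patch's
sketch):

static struct path probe_path;

static int __init example_init(void)
{
	int ret;

	/* resolve the file to an inode; needs <linux/namei.h> */
	ret = kern_path("/bin/ls", LOOKUP_FOLLOW, &probe_path);
	if (ret)
		return ret;

	ret = register_uprobe(probe_path.dentry->d_inode,
			      0x4710 /* offset into the file */,
			      &my_consumer);
	if (ret)
		path_put(&probe_path);
	return ret;
}

static void __exit example_exit(void)
{
	unregister_uprobe(probe_path.dentry->d_inode, 0x4710, &my_consumer);
	path_put(&probe_path);
}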

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/mm_types.h | 5 +
include/linux/uprobes.h | 32 +++++
kernel/uprobes.c | 280 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 306 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 02aa561..c691096 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -317,6 +317,11 @@ struct mm_struct {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
+#ifdef CONFIG_UPROBES
+ unsigned long uprobes_vaddr;
+ struct list_head uprobes_list; /* protected by uprobes_mutex */
+ atomic_t uprobes_count;
+#endif
};

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bfe2e9e..62036a0 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -31,6 +31,7 @@
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
+struct uprobe_arch_info {}; /* arch specific info*/
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */

/* Post-execution fixups. Some architectures may define others. */
@@ -62,6 +63,19 @@ struct uprobe_consumer {
struct uprobe_consumer *next;
};

+struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref;
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_arch_info arch_info; /* arch specific info if any */
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* Also hold a ref to inode */
+ loff_t offset;
+ u8 insn[MAX_UINSN_BYTES]; /* orig instruction */
+ u16 fixups;
+ int copy;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -90,4 +104,22 @@ struct uprobe_consumer {
* the probed instruction stream. @tskinfo is as for @pre_xol().
* You must provide this function.
*/
+
+#ifdef CONFIG_UPROBES
+extern int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+}
+
+#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index f37418b..ff3f15e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -32,17 +32,6 @@
#include <linux/uprobes.h>
#include <linux/rmap.h> /* needed for anon_vma_prepare */

-struct uprobe {
- struct rb_node rb_node; /* node in the rb tree */
- atomic_t ref; /* lifetime muck */
- struct rw_semaphore consumer_rwsem;
- struct uprobe_consumer *consumers;
- struct inode *inode; /* we hold a ref */
- loff_t offset;
- u8 insn[MAX_UINSN_BYTES];
- u16 fixups;
-};
-
static bool valid_vma(struct vm_area_struct *vma)
{
if (!vma->vm_file)
@@ -470,3 +459,272 @@ static bool del_consumer(struct uprobe *uprobe,
up_write(&uprobe->consumer_rwsem);
return ret;
}
+
+static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: install breakpoint */
+ if (!ret)
+ atomic_inc(&mm->uprobes_count);
+ return ret;
+}
+
+static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: remove breakpoint */
+ if (!ret)
+ atomic_dec(&mm->uprobes_count);
+
+ return ret;
+}
+
+static void delete_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ down_read(&mm->mmap_sem);
+ remove_uprobe(mm, uprobe);
+ list_del(&mm->uprobes_list);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+}
+
+/*
+ * There could be threads that have hit the breakpoint and are entering
+ * the notifier code and trying to acquire the treelock. The thread
+ * calling erase_uprobe() that is removing the uprobe from the rb_tree
+ * can race with these threads and might acquire the treelock compared
+ * to some of the breakpoint hit threads. In such a case, the breakpoint
+ * hit threads will not find the uprobe. Finding if a "trap" instruction
+ * was present at the interrupting address is racy. Hence provide some
+ * extra time (by way of synchronize_sched() for breakpoint hit threads
+ * to acquire the treelock before the uprobe is removed from the rbtree.
+ */
+static void erase_uprobe(struct uprobe *uprobe)
+{
+ unsigned long flags;
+
+ synchronize_sched();
+ spin_lock_irqsave(&treelock, flags);
+ rb_erase(&uprobe->rb_node, &uprobes_tree);
+ spin_unlock_irqrestore(&treelock, flags);
+ iput(uprobe->inode);
+}
+
+static DEFINE_MUTEX(uprobes_mutex);
+
+/*
+ * register_uprobe - register a probe
+ * @inode: the file in which the probe has to be placed.
+ * @offset: offset from the start of the file.
+ * @consumer: information on how to handle the probe.
+ *
+ * Apart from the access refcount, register_uprobe() takes a creation
+ * refcount (through uprobes_add) if and only if this @uprobe is getting
+ * inserted into the rbtree (i.e first consumer for a @inode:@offset
+ * tuple). Creation refcount stops unregister_uprobe from freeing the
+ * @uprobe even before the register operation is complete. Creation
+ * refcount is released when the last @consumer for the @uprobe
+ * unregisters.
+ *
+ * Return errno if it cannot successfully install probes,
+ * else return 0 (success)
+ */
+int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head try_list, success_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+ int ret = -1;
+
+ if (!inode || !consumer || consumer->next)
+ return -EINVAL;
+
+ uprobe = uprobes_add(inode, offset);
+ if (!uprobe)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&try_list);
+ INIT_LIST_HEAD(&success_list);
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers) {
+ ret = 0;
+ goto consumers_add;
+ }
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ loff_t vaddr;
+
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+ if (!valid_vma(vma)) {
+ mmput(mm);
+ continue;
+ }
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ mmput(mm);
+ continue;
+ }
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &try_list);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+
+ if (list_empty(&try_list)) {
+ ret = 0;
+ goto consumers_add;
+ }
+ list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
+ down_read(&mm->mmap_sem);
+ ret = install_uprobe(mm, uprobe);
+
+ if (ret && ret != -ESRCH && ret != -EEXIST) {
+ up_read(&mm->mmap_sem);
+ break;
+ }
+ if (!ret)
+ list_move(&mm->uprobes_list, &success_list);
+ else {
+ /*
+ * install_uprobe failed as there are no active
+ * threads for the mm; ignore the error.
+ */
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ up_read(&mm->mmap_sem);
+ }
+
+ if (list_empty(&try_list)) {
+ /*
+ * All install_uprobes were successful;
+ * cleanup successful entries.
+ */
+ ret = 0;
+ list_for_each_entry_safe(mm, tmpmm, &success_list,
+ uprobes_list) {
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ goto consumers_add;
+ }
+
+ /*
+ * At least one unsuccessful install_uprobe;
+ * remove successful probes and cleanup untried entries.
+ */
+ list_for_each_entry_safe(mm, tmpmm, &success_list, uprobes_list)
+ delete_uprobe(mm, uprobe);
+ list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ erase_uprobe(uprobe);
+ goto put_unlock;
+
+consumers_add:
+ add_consumer(uprobe, consumer);
+
+put_unlock:
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe); /* drop access ref */
+ return ret;
+}
+
+/*
+ * unregister_uprobe - unregister an already registered probe.
+ * @inode: the file in which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @consumer: identify which probe if multiple probes are colocated.
+ */
+void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head tmp_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+
+ if (!inode || !consumer)
+ return;
+
+ uprobe = find_uprobe(inode, offset);
+ if (!uprobe) {
+ printk(KERN_ERR "No uprobe found with inode:offset %p %lld\n",
+ inode, offset);
+ return;
+ }
+
+ if (!del_consumer(uprobe, consumer)) {
+ printk(KERN_ERR "No uprobe found with consumer %p\n",
+ consumer);
+ return;
+ }
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers)
+ goto put_unlock;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+
+ if (!atomic_read(&mm->uprobes_count)) {
+ mmput(mm);
+ continue;
+ }
+
+ if (valid_vma(vma)) {
+ loff_t vaddr;
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ mmput(mm);
+ continue;
+ }
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &tmp_list);
+ } else
+ mmput(mm);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list)
+ delete_uprobe(mm, uprobe);
+
+ erase_uprobe(uprobe);
+
+put_unlock:
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe); /* drop access ref */
+}

2011-04-01 14:43:34

by Srikar Dronamraju

Subject: [PATCH v3 2.6.39-rc1-tip 7/26] 7: x86: analyze instruction and determine fixups.


The instruction analysis is based on the x86 instruction decoder and
determines whether an instruction can be probed and which fixups are
necessary after single-stepping it. Instruction analysis is done at
probe insertion time so that we avoid having to repeat the same
analysis every time a probe is hit.
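
Conceptually, the validity check for a single-byte opcode reduces to a
bitmap lookup against the good-instruction tables added below; roughly
(illustrative helper name, not in the patch):

static bool good_1byte_opcode_64(u8 op)
{
	/* e.g. 0x90 (nop) is accepted; 0xcc (int3) is rejected */
	return test_bit(op, (unsigned long *)good_insns_64);
}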

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 2
arch/x86/kernel/Makefile | 1
arch/x86/kernel/uprobes.c | 414 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 417 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/uprobes.c

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 5026359..0063207 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -37,4 +37,6 @@ struct uprobe_arch_info {
#else
struct uprobe_arch_info {};
#endif
+struct uprobe;
+extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 758b379..ef8ffff 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o

obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
obj-$(CONFIG_OF) += devicetree.o
+obj-$(CONFIG_UPROBES) += uprobes.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..f81e940
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,414 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX 0x8000
+#define UPROBES_FIX_RIP_CX 0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
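+/*
+ * Each W() row packs 16 one-bit "good instruction" flags into half of
+ * a 32-bit table word; two consecutive rows share one u32, so each
+ * 256-opcode table below fits in 256/32 words.
+ */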
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
+
+static const u32 good_insns_64[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+ W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Good-instruction tables for 32-bit apps */
+
+static const u32 good_insns_32[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field. On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes. These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ * but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+ int i;
+
+ for (i = 0; i < insn->prefixes.nbytes; i++) {
+ switch (insn->prefixes.bytes[i]) {
+ case 0x26: /*INAT_PFX_ES */
+ case 0x2E: /*INAT_PFX_CS */
+ case 0x36: /*INAT_PFX_DS */
+ case 0x3E: /*INAT_PFX_SS */
+ case 0xF0: /*INAT_PFX_LOCK */
+ return true;
+ }
+ }
+ return false;
+}
+
+static void report_bad_prefix(void)
+{
+ printk(KERN_ERR "uprobes does not currently support probing "
+ "instructions with any of the following prefixes: "
+ "cs:, ds:, es:, ss:, lock:\n");
+}
+
+static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
+{
+ printk(KERN_ERR "In %d-bit apps, "
+ "uprobes does not currently support probing "
+ "instructions whose first byte is 0x%2.2x\n", mode, op);
+}
+
+static void report_bad_2byte_opcode(uprobe_opcode_t op)
+{
+ printk(KERN_ERR "uprobes does not currently support probing "
+ "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, false);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -ENOTSUPP;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(32, OPCODE1(insn));
+ return -ENOTSUPP;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, true);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -ENOTSUPP;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(64, OPCODE1(insn));
+ return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly. To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+ bool fix_ip = true, fix_call = false; /* defaults */
+ insn_get_opcode(insn); /* should be a nop */
+
+ switch (OPCODE1(insn)) {
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ /* ip is correct */
+ fix_ip = false;
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ fix_call = true;
+ break;
+ case 0x9a: /* call absolute - Fix return addr, not ip */
+ fix_call = true;
+ fix_ip = false;
+ break;
+ case 0xff:
+ {
+ int reg;
+ insn_get_modrm(insn);
+ reg = MODRM_REG(insn);
+ if (reg == 2 || reg == 3) {
+ /* call or lcall, indirect */
+ /* Fix return addr; ip is correct. */
+ fix_call = true;
+ fix_ip = false;
+ } else if (reg == 4 || reg == 5) {
+ /* jmp or ljmp, indirect */
+ /* ip is correct. */
+ fix_ip = false;
+ }
+ break;
+ }
+ case 0xea: /* jmp absolute -- ip is correct */
+ fix_ip = false;
+ break;
+ default:
+ break;
+ }
+ if (fix_ip)
+ uprobe->fixups |= UPROBES_FIX_IP;
+ if (fix_call)
+ uprobe->fixups |=
+ (UPROBES_FIX_CALL | UPROBES_FIX_SLEEPY);
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately. Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register. Set
+ * uprobe->fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly. (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area. At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+ u8 *cursor;
+ u8 reg;
+
+ uprobe->arch_info.rip_rela_target_address = 0x0;
+ if (!insn_rip_relative(insn))
+ return;
+
+ /*
+ * Point cursor at the modrm byte. The next 4 bytes are the
+ * displacement. Beyond the displacement, for some instructions,
+ * is the immediate operand.
+ */
+ cursor = uprobe->insn + insn->prefixes.nbytes
+ + insn->rex_prefix.nbytes + insn->opcode.nbytes;
+ insn_get_length(insn);
+
+ /*
+ * Convert from rip-relative addressing to indirect addressing
+ * via a scratch register. Change the r/m field from 0x5 (%rip)
+ * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+ */
+ reg = MODRM_REG(insn);
+ if (reg == 0) {
+ /*
+ * The register operand (if any) is either the A register
+ * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+ * REX prefix) %r8. In any case, we know the C register
+ * is NOT the register operand, so we use %rcx (register
+ * #1) for the scratch register.
+ */
+ uprobe->fixups = UPROBES_FIX_RIP_CX;
+ /* Change modrm from 00 000 101 to 00 000 001. */
+ *cursor = 0x1;
+ } else {
+ /* Use %rax (register #0) for the scratch register. */
+ uprobe->fixups = UPROBES_FIX_RIP_AX;
+ /* Change modrm from 00 xxx 101 to 00 xxx 000 */
+ *cursor = (reg << 3);
+ }
+
+ /* Target address = address of next instruction + (signed) offset */
+ uprobe->arch_info.rip_rela_target_address = (long) insn->length
+ + insn->displacement.value;
+ /* Displacement field is gone; slide immediate field (if any) over. */
+ if (insn->immediate.nbytes) {
+ cursor++;
+ memmove(cursor, cursor + insn->displacement.nbytes,
+ insn->immediate.nbytes);
+ }
+ return;
+}
+#else
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+ return;
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a -ve number on error.
+ */
+int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
+{
+ int ret;
+ struct insn insn;
+
+ uprobe->fixups = 0;
+ if (is_32bit_app(tsk))
+ ret = validate_insn_32bits(uprobe, &insn);
+ else
+ ret = validate_insn_64bits(uprobe, &insn);
+ if (ret != 0)
+ return ret;
+ if (!is_32bit_app(tsk))
+ handle_riprel_insn(uprobe, &insn);
+ prepare_fixups(uprobe, &insn);
+ return 0;
+}
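
For illustration, here is a minimal user-space sketch (not part of the
patch; the instruction bytes and printed mnemonics are hypothetical) of
the modrm rewrite that handle_riprel_insn() performs on an instruction
whose register operand is not the A register:

#include <stdio.h>

int main(void)
{
	/* 8b 15 34 12 00 00  =  mov 0x1234(%rip),%edx */
	unsigned char insn[] = { 0x8b, 0x15, 0x34, 0x12, 0x00, 0x00 };
	unsigned char reg = (insn[1] >> 3) & 0x7; /* modrm reg field: 2 */

	/*
	 * reg != 0, so %rax is free to serve as the scratch register:
	 * change modrm from 00 010 101 (%rip) to 00 010 000 (%rax).
	 * The 4-byte displacement is squeezed out; this instruction
	 * has no immediate, so nothing needs to be slid over.
	 */
	insn[1] = reg << 3;

	printf("%02x %02x  =  mov (%%rax),%%edx\n", insn[0], insn[1]);
	return 0;
}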

2011-04-01 14:43:45

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 8/26] 8: uprobes: store/restore original instruction.


On the first probe insertion, copy the original instruction and opcode.
If multiple vmas map the same text area corresponding to an inode, the
instruction need only be copied once.
The copied instruction is later copied to a designated slot on probe
hit. It is also used at the time of probe removal to restore the
original instruction.
The opcode is used to analyze the instruction and determine the fixups.
Determining the fixups at probe-hit time would mean repeating the same
work on every hit, so instruction analysis using the opcode is done
once, at probe insertion time.
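
As a quick illustration of the page-boundary case that copy_insn() in
the diff below handles, consider this user-space sketch (the probed
address is hypothetical):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define MAX_UINSN_BYTES	16UL

int main(void)
{
	unsigned long vaddr = 0x400ff8;	/* probe 8 bytes before a boundary */
	unsigned long nbytes = PAGE_SIZE - (vaddr & ~PAGE_MASK);

	if (nbytes < MAX_UINSN_BYTES)	/* straddles the boundary */
		printf("copy %lu bytes from the first page, %lu from the next\n",
		       nbytes, MAX_UINSN_BYTES - nbytes);
	else
		printf("single-page copy of %lu bytes\n", MAX_UINSN_BYTES);
	return 0;
}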

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 112 insertions(+), 7 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index ff3f15e..d3ae4cb 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -71,6 +71,7 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
unsigned long vaddr, uprobe_opcode_t opcode)
{
struct page *old_page, *new_page;
+ struct address_space *mapping;
void *vaddr_old, *vaddr_new;
struct vm_area_struct *vma;
spinlock_t *ptl;
@@ -93,6 +94,18 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
if (!valid_vma(vma))
goto put_out;

+ mapping = uprobe->inode->i_mapping;
+ if (mapping != vma->vm_file->f_mapping)
+ goto put_out;
+
+ addr = vma->vm_start + uprobe->offset;
+ addr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (addr > ULONG_MAX)
+ goto put_out;
+
+ if (vaddr != (unsigned long) addr)
+ goto put_out;
+
/* Allocate a page */
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
if (!new_page) {
@@ -111,7 +124,6 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,

memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
/* poke the new insn in, ASSUMES we don't cross page boundary */
- addr = vaddr;
vaddr &= ~PAGE_MASK;
memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);

@@ -460,24 +472,117 @@ static bool del_consumer(struct uprobe *uprobe,
return ret;
}

+static int __copy_insn(struct address_space *mapping, char *insn,
+ unsigned long nbytes, unsigned long offset)
+{
+ struct page *page;
+ void *vaddr;
+ unsigned long off1;
+ loff_t idx;
+
+ idx = offset >> PAGE_CACHE_SHIFT;
+ off1 = offset &= ~PAGE_MASK;
+ page = grab_cache_page(mapping, (unsigned long)idx);
+ if (!page)
+ return -ENOMEM;
+
+ vaddr = kmap_atomic(page, KM_USER0);
+ memcpy(insn, vaddr + off1, nbytes);
+ kunmap_atomic(vaddr, KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, unsigned long addr)
+{
+ struct address_space *mapping;
+ int bytes;
+ unsigned long nbytes;
+
+ addr &= ~PAGE_MASK;
+ nbytes = PAGE_SIZE - addr;
+ mapping = uprobe->inode->i_mapping;
+
+ /* Instruction at end of binary; copy only available bytes */
+ if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+ bytes = uprobe->inode->i_size - uprobe->offset;
+ else
+ bytes = MAX_UINSN_BYTES;
+
+ /* Instruction at the page-boundary; copy bytes in second page */
+ if (nbytes < bytes) {
+ if (__copy_insn(mapping, uprobe->insn + nbytes,
+ bytes - nbytes, uprobe->offset + nbytes))
+ return -ENOMEM;
+ bytes = nbytes;
+ }
+ return __copy_insn(mapping, uprobe->insn, bytes, uprobe->offset);
+}
+
+static struct task_struct *uprobes_get_mm_owner(struct mm_struct *mm)
+{
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = rcu_dereference(mm->owner);
+ if (tsk)
+ get_task_struct(tsk);
+ rcu_read_unlock();
+ return tsk;
+}
+
static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk = uprobes_get_mm_owner(mm);
+ int ret;

- /*TODO: install breakpoint */
- if (!ret)
+ if (!tsk) /* task is probably exiting; bail-out */
+ return -ESRCH;
+
+ if (!uprobe->copy) {
+ ret = copy_insn(uprobe, mm->uprobes_vaddr);
+ if (ret)
+ goto put_return;
+ if (is_bkpt_insn(uprobe->insn)) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "breakpoint instruction already exists");
+ ret = -EEXIST;
+ goto put_return;
+ }
+ ret = analyze_insn(tsk, uprobe);
+ if (ret) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "instruction type cannot be probed");
+ goto put_return;
+ }
+ uprobe->copy = 1;
+ }
+
+ ret = set_bkpt(tsk, uprobe, mm->uprobes_vaddr);
+ if (ret < 0)
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "failed to insert bkpt instruction");
+ else
atomic_inc(&mm->uprobes_count);
+
+put_return:
+ put_task_struct(tsk);
return ret;
}

static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk = uprobes_get_mm_owner(mm);
+ int ret;

- /*TODO: remove breakpoint */
+ if (!tsk) /* task is probably exiting; bail-out */
+ return -ESRCH;
+
+ ret = set_orig_insn(tsk, uprobe, mm->uprobes_vaddr, true);
if (!ret)
atomic_dec(&mm->uprobes_count);
-
+ put_task_struct(tsk);
return ret;
}

2011-04-01 14:44:00

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 9/26] 9: uprobes: mmap and fork hooks.


Provides hooks in mmap and fork.

On fork, after the new mm is created, we need to set the count of
uprobes. On mmap, check whether the mapped region is executable; if it
is, walk the rbtree and insert actual breakpoints for already
registered probes corresponding to this inode.
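
The address at which a breakpoint lands follows directly from the
probe's inode offset and the vma's layout; a minimal sketch of the
translation that uprobe_mmap() below performs (all values hypothetical):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long vm_start = 0x7f0000002000UL; /* where the vma begins */
	unsigned long vm_pgoff = 1;	/* vma maps the file from page 1 */
	unsigned long offset = 0x1a40;	/* probe's offset within the file */
	unsigned long vaddr;

	vaddr = vm_start + offset - (vm_pgoff << PAGE_SHIFT);
	printf("breakpoint at %#lx\n", vaddr);	/* 0x7f0000002a40 */
	return 0;
}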

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 14 ++++-
kernel/fork.c | 2 +
kernel/uprobes.c | 144 +++++++++++++++++++++++++++++++++++++++++++----
mm/mmap.c | 6 ++
4 files changed, 154 insertions(+), 12 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 62036a0..27496c6 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -66,6 +66,7 @@ struct uprobe_consumer {
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
+ struct list_head pending_list;
struct rw_semaphore consumer_rwsem;
struct uprobe_arch_info arch_info; /* arch specific info if any */
struct uprobe_consumer *consumers;
@@ -110,6 +111,10 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+
+struct vm_area_struct;
+extern int uprobe_mmap(struct vm_area_struct *vma);
+extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -120,6 +125,13 @@ static inline void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
{
}
-
+static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
+ struct mm_struct *mm)
+{
+}
+static inline int uprobe_mmap(struct vm_area_struct *vma)
+{
+ return 0;
+}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..2f1a16d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -68,6 +68,7 @@
#include <linux/user-return-notifier.h>
#include <linux/oom.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -425,6 +426,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
}
/* a new mm has just been created */
arch_dup_mmap(oldmm, mm);
+ uprobe_dup_mmap(oldmm, mm);
retval = 0;
out:
up_write(&mm->mmap_sem);
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index d3ae4cb..8cf38d6 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -404,6 +404,7 @@ static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
uprobe->inode = inode;
uprobe->offset = offset;
init_rwsem(&uprobe->consumer_rwsem);
+ INIT_LIST_HEAD(&uprobe->pending_list);

/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
@@ -472,17 +473,32 @@ static bool del_consumer(struct uprobe *uprobe,
return ret;
}

-static int __copy_insn(struct address_space *mapping, char *insn,
- unsigned long nbytes, unsigned long offset)
+static int __copy_insn(struct address_space *mapping,
+ struct vm_area_struct *vma, char *insn,
+ unsigned long nbytes, unsigned long offset)
{
struct page *page;
void *vaddr;
unsigned long off1;
- loff_t idx;
+ unsigned long idx;

- idx = offset >> PAGE_CACHE_SHIFT;
+ idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
off1 = offset &= ~PAGE_MASK;
- page = grab_cache_page(mapping, (unsigned long)idx);
+ if (vma) {
+ /*
+ * We get here from uprobe_mmap() -- the case where we
+ * are trying to copy an instruction from a page that's
+ * not yet in page cache.
+ *
+ * Read page in before copy.
+ */
+ struct file *filp = vma->vm_file;
+
+ if (!filp)
+ return -EINVAL;
+ page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
+ }
+ page = grab_cache_page(mapping, idx);
if (!page)
return -ENOMEM;

@@ -494,7 +510,8 @@ static int __copy_insn(struct address_space *mapping, char *insn,
return 0;
}

-static int copy_insn(struct uprobe *uprobe, unsigned long addr)
+static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
+ unsigned long addr)
{
struct address_space *mapping;
int bytes;
@@ -512,12 +529,12 @@ static int copy_insn(struct uprobe *uprobe, unsigned long addr)

/* Instruction at the page-boundary; copy bytes in second page */
if (nbytes < bytes) {
- if (__copy_insn(mapping, uprobe->insn + nbytes,
+ if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
bytes - nbytes, uprobe->offset + nbytes))
return -ENOMEM;
bytes = nbytes;
}
- return __copy_insn(mapping, uprobe->insn, bytes, uprobe->offset);
+ return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
}

static struct task_struct *uprobes_get_mm_owner(struct mm_struct *mm)
@@ -532,7 +549,8 @@ static struct task_struct *uprobes_get_mm_owner(struct mm_struct *mm)
return tsk;
}

-static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe,
+ struct vm_area_struct *vma)
{
struct task_struct *tsk = uprobes_get_mm_owner(mm);
int ret;
@@ -541,7 +559,7 @@ static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
return -ESRCH;

if (!uprobe->copy) {
- ret = copy_insn(uprobe, mm->uprobes_vaddr);
+ ret = copy_insn(uprobe, vma, mm->uprobes_vaddr);
if (ret)
goto put_return;
if (is_bkpt_insn(uprobe->insn)) {
@@ -698,7 +716,7 @@ int register_uprobe(struct inode *inode, loff_t offset,
}
list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
down_read(&mm->mmap_sem);
- ret = install_uprobe(mm, uprobe);
+ ret = install_uprobe(mm, uprobe, NULL);

if (ret && (ret != -ESRCH || ret != -EEXIST)) {
up_read(&mm->mmap_sem);
@@ -833,3 +851,107 @@ put_unlock:
mutex_unlock(&uprobes_mutex);
put_uprobe(uprobe); /* drop access ref */
}
+
+static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
+ struct list_head *tmp_list)
+{
+ struct uprobe *uprobe;
+ struct rb_node *n;
+ unsigned long flags;
+
+ n = uprobes_tree.rb_node;
+ spin_lock_irqsave(&treelock, flags);
+ uprobe = __find_uprobe(inode, 0, &n);
+ for (; n; n = rb_next(n)) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ if (match_inode(uprobe, inode, &n)) {
+ list_add(&uprobe->pending_list, tmp_list);
+ continue;
+ }
+ break;
+ }
+ spin_unlock_irqrestore(&treelock, flags);
+}
+
+/*
+ * Called from dup_mmap.
+ * called with mm->mmap_sem and old_mm->mmap_sem acquired.
+ */
+void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm)
+{
+ atomic_set(&old_mm->uprobes_count,
+ atomic_read(&mm->uprobes_count));
+}
+
+/*
+ * Called from mmap_region.
+ * called with mm->mmap_sem acquired.
+ *
+ * Return -ve no if we fail to insert probes and we cannot
+ * bail-out.
+ * Return 0 otherwise. i.e :
+ * - successful insertion of probes
+ * - no possible probes to be inserted.
+ * - insertion of probes failed but we can bail-out.
+ */
+int uprobe_mmap(struct vm_area_struct *vma)
+{
+ struct list_head tmp_list;
+ struct uprobe *uprobe, *u;
+ struct mm_struct *mm;
+ struct inode *inode;
+ unsigned long start;
+ unsigned long pgoff;
+ int ret = 0;
+
+ if (!valid_vma(vma))
+ return ret; /* Bail-out */
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mm = vma->vm_mm;
+ inode = vma->vm_file->f_mapping->host;
+ start = vma->vm_start;
+ pgoff = vma->vm_pgoff;
+ __iget(inode);
+
+ up_write(&mm->mmap_sem);
+ mutex_lock(&uprobes_mutex);
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start);
+ /* Not the same vma */
+ if (!vma || vma->vm_start != start ||
+ vma->vm_pgoff != pgoff || !valid_vma(vma) ||
+ inode->i_mapping != vma->vm_file->f_mapping)
+ goto mmap_out;
+
+ add_to_temp_list(vma, inode, &tmp_list);
+ list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+ loff_t vaddr;
+
+ list_del(&uprobe->pending_list);
+ if (ret)
+ continue;
+
+ vaddr = vma->vm_start + uprobe->offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX)
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ continue;
+ mm->uprobes_vaddr = (unsigned long)vaddr;
+ ret = install_uprobe(mm, uprobe, vma);
+ if (ret && (ret == -ESRCH || ret == -EEXIST))
+ ret = 0;
+ }
+
+mmap_out:
+ mutex_unlock(&uprobes_mutex);
+ iput(inode);
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ return ret;
+}
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ec8eb5..dcd0308 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
#include <linux/perf_event.h>
#include <linux/audit.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1366,6 +1367,11 @@ out:
mm->locked_vm += (len >> PAGE_SHIFT);
} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
+
+ if (file && uprobe_mmap(vma))
+ /* matching probes but cannot insert */
+ goto unmap_and_free_vma;
+
return addr;

unmap_and_free_vma:

2011-04-01 14:44:16

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 10/26] 10: x86: architecture specific task information.


On x86_64, we need to support rip-relative instructions.
Rip-relative instructions are handled by saving the scratch register on
probe hit and restoring the previously saved value after the
single-step. The value saved at probe hit is specific to each task, and
is therefore kept in uprobe_task_arch_info.

Since x86_32 has no rip-relative instructions, we don't need to bother
on x86_32.

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 0063207..e38950f 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -34,8 +34,13 @@ typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {
unsigned long rip_rela_target_address;
};
+
+struct uprobe_task_arch_info {
+ unsigned long saved_scratch_register;
+};
#else
struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);

2011-04-01 14:44:29

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 11/26] 11: uprobes: task specific information.


Uprobes needs to maintain some task-specific information, including
whether a task is currently uprobed, the uprobe currently being
handled, any arch-specific information (for example, to handle
rip-relative instructions), and the per-task slot to which the original
instruction is copied before single-stepping.

Provides routines to create, manage and free this task-specific
information.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/uprobes.h | 25 +++++++++++++++++++++++++
kernel/fork.c | 4 ++++
kernel/uprobes.c | 38 ++++++++++++++++++++++++++++++++++++++
4 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 83bd2e2..d4dcd23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1534,6 +1534,9 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
#endif
+#ifdef CONFIG_UPROBES
+ struct uprobe_task *utask;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 27496c6..8da993c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -26,12 +26,14 @@
#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
+struct uprobe_task_arch_info; /* arch specific task info */
#else
/*
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {}; /* arch specific info*/
+struct uprobe_task_arch_info {}; /* arch specific task info */
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */

/* Post-execution fixups. Some architectures may define others. */
@@ -77,6 +79,27 @@ struct uprobe {
int copy;
};

+enum uprobe_task_state {
+ UTASK_RUNNING,
+ UTASK_BP_HIT,
+ UTASK_SSTEP
+};
+
+/*
+ * uprobe_utask -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ * Guarded by uproc->mutex.
+ */
+struct uprobe_task {
+ unsigned long xol_vaddr;
+ unsigned long vaddr;
+
+ enum uprobe_task_state state;
+ struct uprobe_task_arch_info tskinfo;
+
+ struct uprobe *active_uprobe;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -111,6 +134,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+extern void uprobe_free_utask(struct task_struct *tsk);

struct vm_area_struct;
extern int uprobe_mmap(struct vm_area_struct *vma);
@@ -133,5 +157,6 @@ static inline int uprobe_mmap(struct vm_area_struct *vma)
{
return 0;
}
+static inline void uprobe_free_utask(struct task_struct *tsk) {}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2f1a16d..e25c29e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -197,6 +197,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);

+ uprobe_free_utask(tsk);
if (!profile_handoff_task(tsk))
free_task(tsk);
}
@@ -1218,6 +1219,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->pi_state_list);
p->pi_state_cache = NULL;
#endif
+#ifdef CONFIG_UPROBES
+ p->utask = NULL;
+#endif
/*
* sigaltstack should be cleared when sharing the same VM
*/
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 8cf38d6..f9fb7c2 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -955,3 +955,41 @@ mmap_out:
down_write(&mm->mmap_sem);
return ret;
}
+
+/*
+ * Called with no locks held.
+ * Called in context of a exiting or a exec-ing thread.
+ */
+void uprobe_free_utask(struct task_struct *tsk)
+{
+ struct uprobe_task *utask = tsk->utask;
+
+ if (!utask)
+ return;
+
+ if (utask->active_uprobe)
+ put_uprobe(utask->active_uprobe);
+ kfree(utask);
+ tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - negative errno otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+ struct uprobe_task *utask;
+
+ utask = kzalloc(sizeof *utask, GFP_KERNEL);
+ if (unlikely(utask == NULL))
+ return ERR_PTR(-ENOMEM);
+
+ utask->active_uprobe = NULL;
+ current->utask = utask;
+ return utask;
+}

2011-04-01 14:44:41

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes


Every task is allocated a fixed slot. When a probe is hit, the original
instruction corresponding to the probe hit is copied to the per-task
fixed slot. Currently we allocate one page of slots for each mm. A
bitmap tracks which slots are free. Each slot is 128 bytes so that it
is cache-aligned.

TODO: On massively threaded processes (or if a huge number of processes
share the same mm), there is a possibility of running out of slots. One
alternative could be to extend the slot area as more slots are
required.
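
A small sketch of the slot arithmetic this implies (the two constants
mirror the patch; the area address and slot number are illustrative):

#include <stdio.h>

#define PAGE_SIZE		4096
#define UPROBES_XOL_SLOT_BYTES	128
#define UINSNS_PER_PAGE		(PAGE_SIZE / UPROBES_XOL_SLOT_BYTES)

int main(void)
{
	unsigned long area_vaddr = 0x7f55e0000000UL; /* hypothetical XOL vma */
	int slot_nr = 5;	/* e.g. first zero bit in the bitmap */

	printf("%d slots per page\n", UINSNS_PER_PAGE);	/* 32 */
	printf("slot %d lives at %#lx\n", slot_nr,
	       area_vaddr + slot_nr * UPROBES_XOL_SLOT_BYTES);
	return 0;
}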

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/mm_types.h | 4 +
include/linux/uprobes.h | 21 ++++
kernel/fork.c | 4 +
kernel/uprobes.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 267 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c691096..ff4c72b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,9 @@
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
+#ifdef CONFIG_UPROBES
+#include <linux/uprobes.h>
+#endif
#include <asm/page.h>
#include <asm/mmu.h>

@@ -321,6 +324,7 @@ struct mm_struct {
unsigned long uprobes_vaddr;
struct list_head uprobes_list; /* protected by uprobes_mutex */
atomic_t uprobes_count;
+ struct uprobes_xol_area *uprobes_xol_area;
#endif
};

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8da993c..10647be 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -101,6 +101,25 @@ struct uprobe_task {
};

/*
+ * Every thread gets its own slot. Once it's assigned a slot, it
+ * keeps that slot until the thread exits. Only definite number
+ * of slots are allocated.
+ */
+
+struct uprobes_xol_area {
+ spinlock_t slot_lock; /* protects bitmap and slot (de)allocation*/
+ unsigned long *bitmap; /* 0 = free slot */
+ struct page *page;
+
+ /*
+ * We keep the vma's vm_start rather than a pointer to the vma
+ * itself. The probed process or a naughty kernel module could make
+ * the vma go away, and we must handle that reasonably gracefully.
+ */
+ unsigned long vaddr; /* Page(s) of instruction slots */
+};
+
+/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
*
@@ -139,6 +158,7 @@ extern void uprobe_free_utask(struct task_struct *tsk);
struct vm_area_struct;
extern int uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
+extern void uprobes_free_xol_area(struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -158,5 +178,6 @@ static inline int uprobe_mmap(struct vm_area_struct *vma)
return 0;
}
static inline void uprobe_free_utask(struct task_struct *tsk) {}
+static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index e25c29e..7131096 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -558,6 +558,7 @@ void mmput(struct mm_struct *mm)
might_sleep();

if (atomic_dec_and_test(&mm->mm_users)) {
+ uprobes_free_xol_area(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
@@ -690,6 +691,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
+#ifdef CONFIG_UPROBES
+ mm->uprobes_xol_area = NULL;
+#endif

if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index f9fb7c2..7663d18 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -31,12 +31,28 @@
#include <linux/slab.h>
#include <linux/uprobes.h>
#include <linux/rmap.h> /* needed for anon_vma_prepare */
+#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
+#include <linux/file.h> /* needed for fput() */

+#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma,
+ * but not an XOL vma.
+ * - Return 1 if the specified virtual address is in an
+ * executable vma, but not in an XOL vma.
+ */
static bool valid_vma(struct vm_area_struct *vma)
{
+ struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
+
if (!vma->vm_file)
return false;

+ if (area && (area->vaddr == vma->vm_start))
+ return false;
+
if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
(VM_READ|VM_EXEC))
return true;
@@ -956,6 +972,224 @@ mmap_out:
return ret;
}

+/* Slot allocation for XOL */
+
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct file *file;
+ unsigned long addr;
+ int ret = -ENOMEM;
+
+ mm = get_task_mm(current);
+ if (!mm)
+ return -ESRCH;
+
+ down_write(&mm->mmap_sem);
+ if (mm->uprobes_xol_area) {
+ ret = -EALREADY;
+ goto fail;
+ }
+
+ /*
+ * Find the end of the top mapping and skip a page.
+ * If there is no space for PAGE_SIZE above
+ * that, mmap will ignore our address hint.
+ *
+ * We allocate a "fake" unlinked shmem file because
+ * anonymous memory might not be granted execute
+ * permission when the selinux security hooks have
+ * their way.
+ */
+ vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
+ addr = vma->vm_end + PAGE_SIZE;
+ file = shmem_file_setup("uprobes/xol", PAGE_SIZE, VM_NORESERVE);
+ if (!file) {
+ printk(KERN_ERR "uprobes_xol failed to setup shmem_file "
+ "while allocating vma for pid/tgid %d/%d for "
+ "single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ addr = do_mmap_pgoff(file, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
+ fput(file);
+
+ if (addr & ~PAGE_MASK) {
+ printk(KERN_ERR "uprobes_xol failed to allocate a vma for "
+ "pid/tgid %d/%d for single-stepping out of "
+ "line.\n", current->pid, current->tgid);
+ goto fail;
+ }
+ vma = find_vma(mm, addr);
+
+ /* Don't expand vma on mremap(). */
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+ area->vaddr = vma->vm_start;
+ if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
+ &vma) > 0)
+ ret = 0;
+
+fail:
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+ struct uprobes_xol_area *area = NULL;
+
+ area = kzalloc(sizeof(*area), GFP_KERNEL);
+ if (unlikely(!area))
+ return NULL;
+
+ area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+ GFP_KERNEL);
+
+ if (!area->bitmap)
+ goto fail;
+
+ spin_lock_init(&area->slot_lock);
+ if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
+ task_lock(current);
+ if (!current->mm->uprobes_xol_area) {
+ current->mm->uprobes_xol_area = area;
+ task_unlock(current);
+ return area;
+ }
+ task_unlock(current);
+ }
+
+fail:
+ kfree(area->bitmap);
+ kfree(area);
+ return current->mm->uprobes_xol_area;
+}
+
+/*
+ * uprobes_free_xol_area - Free the area allocated for slots.
+ */
+void uprobes_free_xol_area(struct mm_struct *mm)
+{
+ struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+ if (!area)
+ return;
+
+ put_page(area->page);
+ kfree(area->bitmap);
+ kfree(area);
+}
+
+/*
+ * Find a slot
+ * - searching in existing vmas for a free slot.
+ * - If no free slot in existing vmas, return 0;
+ *
+ * Called when holding uprobes_xol_area->slot_lock
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+ unsigned long slot_addr;
+ int slot_nr;
+
+ slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+ if (slot_nr < UINSNS_PER_PAGE) {
+ __set_bit(slot_nr, area->bitmap);
+ slot_addr = area->vaddr +
+ (slot_nr * UPROBES_XOL_SLOT_BYTES);
+ return slot_addr;
+ }
+
+ return 0;
+}
+
+/*
+ * xol_get_insn_slot - If was not allocated a slot, then
+ * allocate a slot.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+ unsigned long slot_addr)
+{
+ struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+ unsigned long flags, xol_vaddr = current->utask->xol_vaddr;
+ void *vaddr;
+
+ if (!current->utask->xol_vaddr || !area) {
+ if (!area)
+ area = xol_alloc_area();
+
+ if (!area)
+ return 0;
+
+ spin_lock_irqsave(&area->slot_lock, flags);
+ xol_vaddr = xol_take_insn_slot(area);
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ current->utask->xol_vaddr = xol_vaddr;
+ }
+
+ /*
+ * Initialize the slot if xol_vaddr points to valid
+ * instruction slot.
+ */
+ if (unlikely(!xol_vaddr))
+ return 0;
+
+ current->utask->vaddr = slot_addr;
+ vaddr = kmap_atomic(area->page, KM_USER0);
+ xol_vaddr &= ~PAGE_MASK;
+ memcpy(vaddr + xol_vaddr, uprobe->insn, MAX_UINSN_BYTES);
+ kunmap_atomic(vaddr, KM_USER0);
+ return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If slot was earlier allocated by
+ * @xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk, unsigned long slot_addr)
+{
+ struct uprobes_xol_area *area;
+ unsigned long vma_end;
+
+ if (!tsk->mm || !tsk->mm->uprobes_xol_area)
+ return;
+
+ area = tsk->mm->uprobes_xol_area;
+
+ if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+ return;
+
+ vma_end = area->vaddr + PAGE_SIZE;
+ if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+ int slot_nr;
+ unsigned long offset = slot_addr - area->vaddr;
+ unsigned long flags;
+
+ BUG_ON(offset % UPROBES_XOL_SLOT_BYTES);
+
+ slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+ BUG_ON(slot_nr >= UINSNS_PER_PAGE);
+
+ spin_lock_irqsave(&area->slot_lock, flags);
+ __clear_bit(slot_nr, area->bitmap);
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ return;
+ }
+ printk(KERN_ERR "%s: no XOL vma for slot address %#lx\n",
+ __func__, slot_addr);
+}
+
/*
* Called with no locks held.
* Called in context of a exiting or a exec-ing thread.
@@ -963,14 +1197,18 @@ mmap_out:
void uprobe_free_utask(struct task_struct *tsk)
{
struct uprobe_task *utask = tsk->utask;
+ unsigned long xol_vaddr;

if (!utask)
return;

+ xol_vaddr = utask->xol_vaddr;
if (utask->active_uprobe)
put_uprobe(utask->active_uprobe);
+
kfree(utask);
tsk->utask = NULL;
+ xol_free_insn_slot(tsk, xol_vaddr);
}

/*

2011-04-01 14:44:51

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 13/26] 13: uprobes: get the breakpoint address.


On a breakpoint hit, return the address where the breakpoint was hit.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/uprobes.h | 5 +++++
kernel/uprobes.c | 11 +++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 10647be..91b808a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -157,6 +157,7 @@ extern void uprobe_free_utask(struct task_struct *tsk);

struct vm_area_struct;
extern int uprobe_mmap(struct vm_area_struct *vma);
+extern unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
extern void uprobes_free_xol_area(struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
@@ -179,5 +180,9 @@ static inline int uprobe_mmap(struct vm_area_struct *vma)
}
static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
+static inline unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
+{
+ return 0;
+}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 7663d18..dcae6dd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1190,6 +1190,17 @@ static void xol_free_insn_slot(struct task_struct *tsk, unsigned long slot_addr)
__func__, slot_addr);
}

+/**
+ * uprobes_get_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
+{
+ return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
/*
* Called with no locks held.
* Called in context of a exiting or a exec-ing thread.

2011-04-01 14:45:00

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 14/26] 14: x86: x86 specific probe handling


Provides x86-specific implementations for setting the current
instruction pointer, pre- and post-single-step handling, and enabling
and disabling single-stepping.

This patch also introduces TIF_UPROBE which is set by uprobes notifier
code. TIF_UPROBE indicates that there is pending work that needs to be
done at do_notify_resume time.
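
Part of the post-single-step handling is a simple rebasing of the
instruction pointer (the FIX_IP case in the diff below); a sketch with
hypothetical addresses:

#include <stdio.h>

int main(void)
{
	unsigned long vaddr = 0x400a40;	/* original probed instruction */
	unsigned long xol_vaddr = 0x7f55e0000080UL; /* its XOL slot */
	unsigned long ip = xol_vaddr + 3; /* ip after stepping a 3-byte copy */

	/* make ip relative to the original instruction, as post_xol() does */
	ip += (long)(vaddr - xol_vaddr);
	printf("resume at %#lx\n", ip);	/* 0x400a43 */
	return 0;
}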

Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2
arch/x86/include/asm/uprobes.h | 5 +
arch/x86/kernel/uprobes.c | 171 ++++++++++++++++++++++++++++++++++++
3 files changed, 178 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1f2e61e..2dc1921 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
+#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index e38950f..0c9c8b6 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -44,4 +44,9 @@ struct uprobe_task_arch_info {};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
+extern void set_ip(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern void arch_uprobe_enable_sstep(struct pt_regs *regs);
+extern void arch_uprobe_disable_sstep(struct pt_regs *regs);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index f81e940..bf98841 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>
+#include <linux/uaccess.h>

#include <linux/kdebug.h>
#include <asm/insn.h>
@@ -412,3 +413,173 @@ int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
prepare_fixups(uprobe, &insn);
return 0;
}
+
+/*
+ * @reg: reflects the saved state of the task
+ * @vaddr: the virtual address to jump to.
+ * Return 0 on success or a -ve number on error.
+ */
+void set_ip(struct pt_regs *regs, unsigned long vaddr)
+{
+ regs->ip = vaddr;
+}
+
+/*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of @tsk.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns true if @uprobe->opcode is @bkpt_insn.
+ */
+#ifdef CONFIG_X86_64
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+ regs->ip = current->utask->xol_vaddr;
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
+ tskinfo->saved_scratch_register = regs->ax;
+ regs->ax = current->utask->vaddr;
+ regs->ax += uprobe->arch_info.rip_rela_target_address;
+ } else if (uprobe->fixups & UPROBES_FIX_RIP_CX) {
+ tskinfo->saved_scratch_register = regs->cx;
+ regs->cx = current->utask->vaddr;
+ regs->cx += uprobe->arch_info.rip_rela_target_address;
+ }
+ return 0;
+}
+#else
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ regs->ip = current->utask->xol_vaddr;
+ return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+ int rasize, ncopied;
+ long ra = 0;
+
+ if (is_32bit_app(current))
+ rasize = 4;
+ else
+ rasize = 8;
+ ncopied = copy_from_user(&ra, (void __user *) sp, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ ra += correction;
+ ncopied = copy_to_user((void __user *) sp, &ra, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ return 0;
+
+fail:
+ printk(KERN_ERR
+ "uprobes: Failed to adjust return address after"
+ " single-stepping call instruction;"
+ " pid=%d, sp=%#lx\n", current->pid, sp);
+ return -EFAULT;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+ return ((uprobe->fixups &
+ (UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+ struct pt_regs *regs, long *correction)
+{
+ if (is_riprel_insn(uprobe)) {
+ struct uprobe_task_arch_info *tskinfo;
+ tskinfo = &current->utask->tskinfo;
+
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX)
+ regs->ax = tskinfo->saved_scratch_register;
+ else
+ regs->cx = tskinfo->saved_scratch_register;
+ /*
+ * The original instruction includes a displacement, and so
+ * is 4 bytes longer than what we've just single-stepped.
+ * Fall through to handle stuff like "jmpq *...(%rip)" and
+ * "callq *...(%rip)".
+ */
+ *correction += 4;
+ }
+}
+#else
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+ struct pt_regs *regs, long *correction)
+{
+}
+#endif
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction. We need
+ * to make it relative to the original instruction (FIX_IP). Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction. We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field). (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+ int result = 0;
+ long correction;
+
+ correction = (long)(utask->vaddr - utask->xol_vaddr);
+ handle_riprel_post_xol(uprobe, regs, &correction);
+ if (uprobe->fixups & UPROBES_FIX_IP)
+ regs->ip += correction;
+ if (uprobe->fixups & UPROBES_FIX_CALL)
+ result = adjust_ret_addr(regs->sp, correction);
+ return result;
+}
+
+void arch_uprobe_enable_sstep(struct pt_regs *regs)
+{
+ /*
+ * Enable single-stepping by
+ * - Set TF on stack
+ * - Set TIF_SINGLESTEP: Guarantees that TF is set when
+ * returning to user mode.
+ * - Indicate that TF is set by us.
+ */
+ regs->flags |= X86_EFLAGS_TF;
+ set_thread_flag(TIF_SINGLESTEP);
+ set_thread_flag(TIF_FORCED_TF);
+}
+
+void arch_uprobe_disable_sstep(struct pt_regs *regs)
+{
+ /* Disable single-stepping by clearing what we set */
+ clear_thread_flag(TIF_SINGLESTEP);
+ clear_thread_flag(TIF_FORCED_TF);
+ regs->flags &= ~X86_EFLAGS_TF;
+}

2011-04-01 14:45:20

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exceptions.


On int3, set the TIF_UPROBE flag and, if task-specific info is
available, mark the task state as breakpoint hit. Setting the
TIF_UPROBE flag results in uprobe_notify_resume being called.
uprobe_notify_resume walks through the list of vmas and matches the
inode and offset corresponding to the instruction pointer against
entries in the rbtree. Once a matching uprobe is found, run the
handlers for all the consumers that have registered.
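
The per-task states introduced earlier and the transitions this patch
drives can be summarized roughly as:

	UTASK_RUNNING --int3 trap---------------------------> UTASK_BP_HIT
	UTASK_BP_HIT  --handlers run, XOL slot prepared-----> UTASK_SSTEP
	UTASK_SSTEP   --single-step trap, fixups applied----> UTASK_RUNNING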

On a single-step exception, perform the necessary fixups and allow the
process to continue. The necessary fixups are determined at instruction
analysis time.

TODO: If there is no matching uprobe, signal a trap to the process.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 4 +
kernel/uprobes.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 149 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 91b808a..26c4d78 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -160,6 +160,9 @@ extern int uprobe_mmap(struct vm_area_struct *vma);
extern unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
extern void uprobes_free_xol_area(struct mm_struct *mm);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -180,6 +183,7 @@ static inline int uprobe_mmap(struct vm_area_struct *vma)
}
static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
+static inline void uprobe_notify_resume(struct pt_regs *regs) {}
static inline unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
{
return 0;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index dcae6dd..e0ff6ba 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1242,3 +1242,148 @@ static struct uprobe_task *add_utask(void)
current->utask = utask;
return utask;
}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+ unsigned long vaddr)
+{
+ xol_get_insn_slot(uprobe, vaddr);
+ BUG_ON(!current->utask->xol_vaddr);
+ if (!pre_xol(uprobe, regs)) {
+ set_ip(regs, current->utask->xol_vaddr);
+ return 0;
+ }
+ return -EFAULT;
+}
+
+/*
+ * Verify from Instruction Pointer if singlestep has indeed occurred.
+ * If Singlestep has occurred, then do post singlestep fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ unsigned long vaddr = instruction_pointer(regs);
+
+ /*
+ * If we have executed out of line, Instruction pointer
+ * cannot be same as virtual address of XOL slot.
+ */
+ if (vaddr == current->utask->xol_vaddr)
+ return false;
+ post_xol(uprobe, regs);
+ return true;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ * If its the first time the probepoint is hit, slot gets allocated here.
+ * If its the first time the thread hit a breakpoint, utask gets
+ * allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+ struct vm_area_struct *vma;
+ struct uprobe_task *utask;
+ struct mm_struct *mm;
+ struct uprobe *u = NULL;
+ unsigned long probept;
+
+ utask = current->utask;
+ mm = current->mm;
+ if (unlikely(!utask)) {
+ utask = add_utask();
+
+ /* Failed to allocate utask for the current task. */
+ BUG_ON(!utask);
+ utask->state = UTASK_BP_HIT;
+ }
+ if (utask->state == UTASK_BP_HIT) {
+ probept = uprobes_get_bkpt_addr(regs);
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!valid_vma(vma))
+ continue;
+ if (probept < vma->vm_start || probept > vma->vm_end)
+ continue;
+ u = find_uprobe(vma->vm_file->f_mapping->host,
+ probept - vma->vm_start);
+ break;
+ }
+ up_read(&mm->mmap_sem);
+ /*TODO Return SIGTRAP signal */
+ if (!u) {
+ set_ip(regs, probept);
+ utask->state = UTASK_RUNNING;
+ return;
+ }
+ /* TODO Start queueing signals. */
+ utask->active_uprobe = u;
+ handler_chain(u, regs);
+ utask->state = UTASK_SSTEP;
+ if (!pre_ssout(u, regs, probept))
+ arch_uprobe_enable_sstep(regs);
+ } else if (utask->state == UTASK_SSTEP) {
+ u = utask->active_uprobe;
+ if (sstep_complete(u, regs)) {
+ put_uprobe(u);
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ /* TODO Stop queueing signals. */
+ arch_uprobe_disable_sstep(regs);
+ }
+ }
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context
+ * it gets a reference to the ppt and sets TIF_UPROBE flag,
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+ struct uprobe_task *utask;
+
+ if (!current->mm || !atomic_read(&current->mm->uprobes_count))
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ if (utask)
+ utask->state = UTASK_BP_HIT;
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context.
+ * It completes the single step operation.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+ struct uprobe *uprobe;
+ struct uprobe_task *utask;
+
+ if (!current->mm || !current->utask || !current->utask->active_uprobe)
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ uprobe = utask->active_uprobe;
+ if (!uprobe)
+ return 0;
+
+ if (uprobes_resume_can_sleep(uprobe)) {
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+ }
+ if (sstep_complete(uprobe, regs)) {
+ put_uprobe(uprobe);
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ /* TODO Stop queueing signals. */
+ arch_uprobe_disable_sstep(regs);
+ return 1;
+ }
+ return 0;
+}

2011-04-01 14:45:28

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 16/26] 16: x86: uprobes exception notifier for x86.


Provides a uprobes exception notifier for x86. This notifier gets
called in interrupt context and routes int3 and single-step exceptions
to uprobes when a probed process encounters them.

Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 3 +++
arch/x86/kernel/signal.c | 14 ++++++++++++++
arch/x86/kernel/uprobes.c | 29 +++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 0c9c8b6..bb5c03c 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -22,6 +22,7 @@
* Srikar Dronamraju
* Jim Keniston
*/
+#include <linux/notifier.h>

typedef u8 uprobe_opcode_t;
#define MAX_UINSN_BYTES 16
@@ -49,4 +50,6 @@ extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
extern void arch_uprobe_enable_sstep(struct pt_regs *regs);
extern void arch_uprobe_disable_sstep(struct pt_regs *regs);
+extern int uprobes_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..0b06b94 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/personality.h>
#include <linux/uaccess.h>
#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>

#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -848,6 +849,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);

+ if (thread_info_flags & _TIF_UPROBE) {
+ clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+ /*
+ * On x86_32, do_notify_resume() gets called with
+ * interrupts disabled. Hence enable interrupts if they
+ * are still disabled.
+ */
+ local_irq_enable();
+#endif
+ uprobe_notify_resume(regs);
+ }
+
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index bf98841..e0ce207 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -583,3 +583,32 @@ void arch_uprobe_disable_sstep(struct pt_regs *regs)
clear_thread_flag(TIF_FORCED_TF);
regs->flags &= ~X86_EFLAGS_TF;
}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobes_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+ int ret = NOTIFY_DONE;
+
+ /* We are only interested in userspace traps */
+ if (regs && !user_mode_vm(regs))
+ return NOTIFY_DONE;
+
+ switch (val) {
+ case DIE_INT3:
+ /* Run your handler here */
+ if (uprobe_bkpt_notifier(regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_DEBUG:
+ if (uprobe_post_notifier(regs))
+ ret = NOTIFY_STOP;
+ default:
+ break;
+ }
+ return ret;
+}

2011-04-01 14:45:35

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 17/26] 17: uprobes: register a notifier for uprobes.


Uprobes needs to be notified on int3 and single-step exceptions.
Hence uprobes registers a die notifier so that it is notified of these
events.

Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index e0ff6ba..cdd52d0 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,6 +33,7 @@
#include <linux/rmap.h> /* needed for anon_vma_prepare */
#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
#include <linux/file.h> /* needed for fput() */
+#include <linux/kdebug.h> /* needed for notifier mechanism */

#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
@@ -1387,3 +1388,20 @@ int uprobe_post_notifier(struct pt_regs *regs)
}
return 0;
}
+
+struct notifier_block uprobes_exception_nb = {
+ .notifier_call = uprobes_exception_notify,
+ .priority = 0x7ffffff0,
+};
+
+static int __init init_uprobes(void)
+{
+ return register_die_notifier(&uprobes_exception_nb);
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);

2011-04-01 14:45:51

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.


Provides the most commonly used filters, which most users of uprobes
can simply reuse. These will become more useful once a filter can be
dynamically associated with a uprobe-event tracer; see the sketch below
for how a consumer is expected to use them.
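
As an illustration, a tracer could plug one of these filters into its
consumer roughly as below. This is only a sketch: the handler member of
struct uprobe_consumer and register_uprobe() are as defined earlier in
this series, and the tgid value is made up.

	static struct uprobe_simple_consumer usc = {
		/* .consumer.handler would be the tracer's callback */
		.consumer.filter = uprobes_pid_filter,
		.fvalue		 = 1234,	/* match only tgid 1234 */
	};

	/* then register usc.consumer against the <inode>:<offset> of
	 * interest via register_uprobe() */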

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 5 +++++
kernel/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 26c4d78..34b989f 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -65,6 +65,11 @@ struct uprobe_consumer {
struct uprobe_consumer *next;
};

+struct uprobe_simple_consumer {
+ struct uprobe_consumer consumer;
+ pid_t fvalue;
+};
+
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index cdd52d0..c950f13 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1389,6 +1389,56 @@ int uprobe_post_notifier(struct pt_regs *regs)
return 0;
}

+bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ if (t->tgid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_tid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ if (t->pid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_ppid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ pid_t pid;
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ rcu_read_lock();
+ pid = task_tgid_vnr(t->real_parent);
+ rcu_read_unlock();
+
+ if (pid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_sid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ pid_t pid;
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ rcu_read_lock();
+ pid = pid_vnr(task_session(t));
+ rcu_read_unlock();
+
+ if (pid == usc->fvalue)
+ return true;
+ return false;
+}
+
struct notifier_block uprobes_exception_nb = {
.notifier_call = uprobes_exception_notify,
.priority = 0x7ffffff0,

2011-04-01 14:46:07

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 19/26] 19: tracing: Extract out common code for kprobes/uprobes traceevents.


Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into the new files kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.
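
For reference, the fetch-argument syntax handled by the moved parser is
the existing kprobe-tracer syntax (see
Documentation/trace/kprobetrace.txt); this patch moves it unchanged:

  %REG		: fetch register REG
  @ADDR		: fetch memory at ADDR
  @SYM+offs	: fetch memory at SYM+offs
  $stackN	: fetch Nth entry of stack
  $stack	: fetch stack address
  $retval	: fetch return value (return probes only)
  +|-offs(ARG)	: fetch memory at ARG +|- offs
  ARG:TYPE	: cast ARG to TYPE (u8..u64, s8..s64, string, or a
		  bitfield b<bit-width>@<bit-offset>/<container-size>)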

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/trace/Kconfig | 4
kernel/trace/Makefile | 1
kernel/trace/trace_kprobe.c | 861 +------------------------------------------
kernel/trace/trace_probe.c | 747 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_probe.h | 158 ++++++++
5 files changed, 929 insertions(+), 842 deletions(-)
create mode 100644 kernel/trace/trace_probe.c
create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61d7d59..9c62806 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
depends on HAVE_REGS_AND_STACK_ACCESS_API
bool "Enable kprobes-based dynamic events"
select TRACING
+ select PROBE_EVENTS
default y
help
This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.

+config PROBE_EVENTS
+ def_bool n
+
config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..692223a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 8435b43..3f9189d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,525 +19,15 @@

#include <linux/module.h>
#include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
- "common_type",
- "common_flags",
- "common_preempt_count",
- "common_pid",
- "common_tgid",
- "common_lock_depth",
- FIELD_STRING_IP,
- FIELD_STRING_RETIP,
- FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
- void *);
-#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
-#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
-
-/* Printing in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
- const char *name, \
- void *data, void *ent)\
-{ \
- return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-} \
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs) \
- (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl) ((u32)(dl) >> 16)
-#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
- return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
- return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- * data_rloc stores the offset from data_rloc itself, but data_loc
- * stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
- const char *name,
- void *data, void *ent)
-{
- int len = *(u32 *)data >> 16;
-
- if (!len)
- return trace_seq_printf(s, " %s=(fault)", name);
- else
- return trace_seq_printf(s, " %s=\"%s\"", name,
- (const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
- fetch_func_t fn;
- void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
- struct pt_regs *regs, void *dest)
-{
- return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8) \
-DEFINE_FETCH_##method(u16) \
-DEFINE_FETCH_##method(u32) \
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn) \
- (((FETCH_FUNC_NAME(method, u8) == fn) || \
- (FETCH_FUNC_NAME(method, u16) == fn) || \
- (FETCH_FUNC_NAME(method, u32) == fn) || \
- (FETCH_FUNC_NAME(method, u64) == fn) || \
- (FETCH_FUNC_NAME(method, string) == fn) || \
- (FETCH_FUNC_NAME(method, string_size) == fn)) \
- && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type) \
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_register(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type) \
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type) \
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
- void *dummy, void *dest) \
-{ \
- *(type *)dest = (type)regs_return_value(regs); \
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type) \
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
- void *addr, void *dest) \
-{ \
- type retval; \
- if (probe_kernel_address(addr, retval)) \
- *(type *)dest = 0; \
- else \
- *(type *)dest = retval; \
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- long ret;
- int maxlen = get_rloc_len(*(u32 *)dest);
- u8 *dst = get_rloc_data(dest);
- u8 *src = addr;
- mm_segment_t old_fs = get_fs();
- if (!maxlen)
- return;
- /*
- * Try to get string again, since the string can be changed while
- * probing.
- */
- set_fs(KERNEL_DS);
- pagefault_disable();
- do
- ret = __copy_from_user_inatomic(dst++, src++, 1);
- while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
- dst[-1] = '\0';
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) { /* Failed to fetch string */
- ((u8 *)get_rloc_data(dest))[0] = '\0';
- *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
- } else
- *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
- get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- int ret, len = 0;
- u8 c;
- mm_segment_t old_fs = get_fs();
-
- set_fs(KERNEL_DS);
- pagefault_disable();
- do {
- ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
- len++;
- } while (c && ret == 0 && len < MAX_STRING_SIZE);
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) /* Failed to check the length */
- *(u32 *)dest = 0;
- else
- *(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
- char *symbol;
- long offset;
- unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
- sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
- if (sc->addr)
- sc->addr += sc->offset;
- return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
- kfree(sc->symbol);
- kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
- struct symbol_cache *sc;
-
- if (!sym || strlen(sym) == 0)
- return NULL;
- sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
- if (!sc)
- return NULL;
-
- sc->symbol = kstrdup(sym, GFP_KERNEL);
- if (!sc->symbol) {
- kfree(sc);
- return NULL;
- }
- sc->offset = offset;

- update_symbol_cache(sc);
- return sc;
-}
-
-#define DEFINE_FETCH_symbol(type) \
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct symbol_cache *sc = data; \
- if (sc->addr) \
- fetch_memory_##type(regs, (void *)sc->addr, dest); \
- else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
- struct fetch_param orig;
- long offset;
-};
-
-#define DEFINE_FETCH_deref(type) \
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct deref_fetch_param *dprm = data; \
- unsigned long addr; \
- call_fetch(&dprm->orig, regs, &addr); \
- if (addr) { \
- addr += dprm->offset; \
- fetch_memory_##type(regs, (void *)addr, dest); \
- } else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
- struct fetch_param orig;
- unsigned char hi_shift;
- unsigned char low_shift;
-};
-
-#define DEFINE_FETCH_bitfield(type) \
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct bitfield_fetch_param *bprm = data; \
- type buf = 0; \
- call_fetch(&bprm->orig, regs, &buf); \
- if (buf) { \
- buf <<= bprm->hi_shift; \
- buf >>= bprm->low_shift; \
- } \
- *(type *)dest = buf; \
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
- /*
- * Don't check the bitfield itself, because this must be the
- * last fetch function.
- */
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
- FETCH_MTD_reg = 0,
- FETCH_MTD_stack,
- FETCH_MTD_retval,
- FETCH_MTD_memory,
- FETCH_MTD_symbol,
- FETCH_MTD_deref,
- FETCH_MTD_bitfield,
- FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type) \
- [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
- {.name = _name, \
- .size = _size, \
- .is_signed = sign, \
- .print = PRINT_TYPE_FUNC_NAME(ptype), \
- .fmt = PRINT_TYPE_FMT_NAME(ptype), \
- .fmttype = _fmttype, \
- .fetch = { \
-ASSIGN_FETCH_FUNC(reg, ftype), \
-ASSIGN_FETCH_FUNC(stack, ftype), \
-ASSIGN_FETCH_FUNC(retval, ftype), \
-ASSIGN_FETCH_FUNC(memory, ftype), \
-ASSIGN_FETCH_FUNC(symbol, ftype), \
-ASSIGN_FETCH_FUNC(deref, ftype), \
-ASSIGN_FETCH_FUNC(bitfield, ftype), \
- } \
- }
+#include "trace_probe.h"

-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
- __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
- const char *name; /* Name of type */
- size_t size; /* Byte size of type */
- int is_signed; /* Signed flag */
- print_type_func_t print; /* Print functions */
- const char *fmt; /* Fromat string */
- const char *fmttype; /* Name in format file */
- /* Fetch functions */
- fetch_func_t fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
- /* Special types */
- [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
- sizeof(u32), 1, "__data_loc char[]"),
- [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
- string_size, sizeof(u32), 0, "u32"),
- /* Basic types */
- ASSIGN_FETCH_TYPE(u8, u8, 0),
- ASSIGN_FETCH_TYPE(u16, u16, 0),
- ASSIGN_FETCH_TYPE(u32, u32, 0),
- ASSIGN_FETCH_TYPE(u64, u64, 0),
- ASSIGN_FETCH_TYPE(s8, u8, 1),
- ASSIGN_FETCH_TYPE(s16, u16, 1),
- ASSIGN_FETCH_TYPE(s32, u32, 1),
- ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
- int i;
-
- if (!type)
- type = DEFAULT_FETCH_TYPE_STR;
-
- /* Special case: bitfield */
- if (*type == 'b') {
- unsigned long bs;
- type = strchr(type, '/');
- if (!type)
- goto fail;
- type++;
- if (strict_strtoul(type, 0, &bs))
- goto fail;
- switch (bs) {
- case 8:
- return find_fetch_type("u8");
- case 16:
- return find_fetch_type("u16");
- case 32:
- return find_fetch_type("u32");
- case 64:
- return find_fetch_type("u64");
- default:
- goto fail;
- }
- }
-
- for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
- if (strcmp(type, fetch_type_table[i].name) == 0)
- return &fetch_type_table[i];
-fail:
- return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
- void *dummy, void *dest)
-{
- *(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
- fetch_func_t orig_fn)
-{
- int i;
-
- if (type != &fetch_type_table[FETCH_TYPE_STRING])
- return NULL; /* Only string type needs size function */
- for (i = 0; i < FETCH_MTD_END; i++)
- if (type->fetch[i] == orig_fn)
- return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
- WARN_ON(1); /* This should not happen */
- return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"

/**
* Kprobe event core functions
*/

-struct probe_arg {
- struct fetch_param fetch;
- struct fetch_param fetch_size;
- unsigned int offset; /* Offset from argument entry */
- const char *name; /* Name of this argument */
- const char *comm; /* Command of this argument */
- const struct fetch_type *type; /* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE 1
-#define TP_FLAG_PROFILE 2
-
struct trace_probe {
struct list_head list;
struct kretprobe rp; /* Use rp.kp for kprobe use */
@@ -576,18 +66,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
static int kretprobe_dispatcher(struct kretprobe_instance *ri,
struct pt_regs *regs);

-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
- if (!isalpha(*name) && *name != '_')
- return 0;
- while (*++name != '\0') {
- if (!isalpha(*name) && !isdigit(*name) && *name != '_')
- return 0;
- }
- return 1;
-}
-
/*
* Allocate new trace_probe and initialize it (including kprobes).
*/
@@ -596,7 +74,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
void *addr,
const char *symbol,
unsigned long offs,
- int nargs, int is_return)
+ int nargs, bool is_return)
{
struct trace_probe *tp;
int ret = -ENOMEM;
@@ -647,24 +125,12 @@ error:
return ERR_PTR(ret);
}

-static void free_probe_arg(struct probe_arg *arg)
-{
- if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
- free_bitfield_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
- free_deref_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
- free_symbol_cache(arg->fetch.data);
- kfree(arg->name);
- kfree(arg->comm);
-}
-
static void free_trace_probe(struct trace_probe *tp)
{
int i;

for (i = 0; i < tp->nr_args; i++)
- free_probe_arg(&tp->args[i]);
+ traceprobe_free_probe_arg(&tp->args[i]);

kfree(tp->call.class->system);
kfree(tp->call.name);
@@ -738,227 +204,6 @@ end:
return ret;
}

-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
- char *tmp;
- int ret;
-
- if (!offset)
- return -EINVAL;
-
- tmp = strchr(symbol, '+');
- if (tmp) {
- /* skip sign because strict_strtol doesn't accept '+' */
- ret = strict_strtoul(tmp + 1, 0, offset);
- if (ret)
- return ret;
- *tmp = '\0';
- } else
- *offset = 0;
- return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
-
- if (strcmp(arg, "retval") == 0) {
- if (is_return)
- f->fn = t->fetch[FETCH_MTD_retval];
- else
- ret = -EINVAL;
- } else if (strncmp(arg, "stack", 5) == 0) {
- if (arg[5] == '\0') {
- if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
- f->fn = fetch_stack_address;
- else
- ret = -EINVAL;
- } else if (isdigit(arg[5])) {
- ret = strict_strtoul(arg + 5, 10, &param);
- if (ret || param > PARAM_MAX_STACK)
- ret = -EINVAL;
- else {
- f->fn = t->fetch[FETCH_MTD_stack];
- f->data = (void *)param;
- }
- } else
- ret = -EINVAL;
- } else
- ret = -EINVAL;
- return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
- long offset;
- char *tmp;
-
- switch (arg[0]) {
- case '$':
- ret = parse_probe_vars(arg + 1, t, f, is_return);
- break;
- case '%': /* named register */
- ret = regs_query_register_offset(arg + 1);
- if (ret >= 0) {
- f->fn = t->fetch[FETCH_MTD_reg];
- f->data = (void *)(unsigned long)ret;
- ret = 0;
- }
- break;
- case '@': /* memory or symbol */
- if (isdigit(arg[1])) {
- ret = strict_strtoul(arg + 1, 0, &param);
- if (ret)
- break;
- f->fn = t->fetch[FETCH_MTD_memory];
- f->data = (void *)param;
- } else {
- ret = split_symbol_offset(arg + 1, &offset);
- if (ret)
- break;
- f->data = alloc_symbol_cache(arg + 1, offset);
- if (f->data)
- f->fn = t->fetch[FETCH_MTD_symbol];
- }
- break;
- case '+': /* deref memory */
- arg++; /* Skip '+', because strict_strtol() rejects it. */
- case '-':
- tmp = strchr(arg, '(');
- if (!tmp)
- break;
- *tmp = '\0';
- ret = strict_strtol(arg, 0, &offset);
- if (ret)
- break;
- arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (tmp) {
- struct deref_fetch_param *dprm;
- const struct fetch_type *t2 = find_fetch_type(NULL);
- *tmp = '\0';
- dprm = kzalloc(sizeof(struct deref_fetch_param),
- GFP_KERNEL);
- if (!dprm)
- return -ENOMEM;
- dprm->offset = offset;
- ret = __parse_probe_arg(arg, t2, &dprm->orig,
- is_return);
- if (ret)
- kfree(dprm);
- else {
- f->fn = t->fetch[FETCH_MTD_deref];
- f->data = (void *)dprm;
- }
- }
- break;
- }
- if (!ret && !f->fn) { /* Parsed, but do not find fetch method */
- pr_info("%s type has no corresponding fetch method.\n",
- t->name);
- ret = -EINVAL;
- }
- return ret;
-}
-
-#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
- const struct fetch_type *t,
- struct fetch_param *f)
-{
- struct bitfield_fetch_param *bprm;
- unsigned long bw, bo;
- char *tail;
-
- if (*bf != 'b')
- return 0;
-
- bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
- if (!bprm)
- return -ENOMEM;
- bprm->orig = *f;
- f->fn = t->fetch[FETCH_MTD_bitfield];
- f->data = (void *)bprm;
-
- bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
- if (bw == 0 || *tail != '@')
- return -EINVAL;
-
- bf = tail + 1;
- bo = simple_strtoul(bf, &tail, 0);
- if (tail == bf || *tail != '/')
- return -EINVAL;
-
- bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
- bprm->low_shift = bprm->hi_shift + bo;
- return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
- struct probe_arg *parg, int is_return)
-{
- const char *t;
- int ret;
-
- if (strlen(arg) > MAX_ARGSTR_LEN) {
- pr_info("Argument is too long.: %s\n", arg);
- return -ENOSPC;
- }
- parg->comm = kstrdup(arg, GFP_KERNEL);
- if (!parg->comm) {
- pr_info("Failed to allocate memory for command '%s'.\n", arg);
- return -ENOMEM;
- }
- t = strchr(parg->comm, ':');
- if (t) {
- arg[t - parg->comm] = '\0';
- t++;
- }
- parg->type = find_fetch_type(t);
- if (!parg->type) {
- pr_info("Unsupported type: %s\n", t);
- return -EINVAL;
- }
- parg->offset = tp->size;
- tp->size += parg->type->size;
- ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
- if (ret >= 0 && t != NULL)
- ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
- if (ret >= 0) {
- parg->fetch_size.fn = get_fetch_size_function(parg->type,
- parg->fetch.fn);
- parg->fetch_size.data = parg->fetch.data;
- }
- return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
- struct probe_arg *args, int narg)
-{
- int i;
- for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
- if (strcmp(reserved_field_names[i], name) == 0)
- return 1;
- for (i = 0; i < narg; i++)
- if (strcmp(args[i].name, name) == 0)
- return 1;
- return 0;
-}
-
static int create_trace_probe(int argc, char **argv)
{
/*
@@ -981,7 +226,7 @@ static int create_trace_probe(int argc, char **argv)
*/
struct trace_probe *tp;
int i, ret = 0;
- int is_return = 0, is_delete = 0;
+ bool is_return = false, is_delete = false;
char *symbol = NULL, *event = NULL, *group = NULL;
char *arg;
unsigned long offset = 0;
@@ -990,11 +235,11 @@ static int create_trace_probe(int argc, char **argv)

/* argc must be >= 1 */
if (argv[0][0] == 'p')
- is_return = 0;
+ is_return = false;
else if (argv[0][0] == 'r')
- is_return = 1;
+ is_return = true;
else if (argv[0][0] == '-')
- is_delete = 1;
+ is_delete = true;
else {
pr_info("Probe definition must be started with 'p', 'r' or"
" '-'.\n");
@@ -1058,7 +303,7 @@ static int create_trace_probe(int argc, char **argv)
/* a symbol specified */
symbol = argv[1];
/* TODO: support .init module functions */
- ret = split_symbol_offset(symbol, &offset);
+ ret = traceprobe_split_symbol_offset(symbol, &offset);
if (ret) {
pr_info("Failed to parse symbol.\n");
return ret;
@@ -1120,7 +365,8 @@ static int create_trace_probe(int argc, char **argv)
goto error;
}

- if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
pr_info("Argument[%d] name '%s' conflicts with "
"another field.\n", i, argv[i]);
ret = -EINVAL;
@@ -1128,7 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}

/* Parse fetch argument */
- ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ is_return);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
@@ -1215,70 +462,11 @@ static int probes_open(struct inode *inode, struct file *file)
return seq_open(file, &probes_seq_op);
}

-static int command_trace_probe(const char *buf)
-{
- char **argv;
- int argc = 0, ret = 0;
-
- argv = argv_split(GFP_KERNEL, buf, &argc);
- if (!argv)
- return -ENOMEM;
-
- if (argc)
- ret = create_trace_probe(argc, argv);
-
- argv_free(argv);
- return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
static ssize_t probes_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
- char *kbuf, *tmp;
- int ret;
- size_t done;
- size_t size;
-
- kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
- if (!kbuf)
- return -ENOMEM;
-
- ret = done = 0;
- while (done < count) {
- size = count - done;
- if (size >= WRITE_BUFSIZE)
- size = WRITE_BUFSIZE - 1;
- if (copy_from_user(kbuf, buffer + done, size)) {
- ret = -EFAULT;
- goto out;
- }
- kbuf[size] = '\0';
- tmp = strchr(kbuf, '\n');
- if (tmp) {
- *tmp = '\0';
- size = tmp - kbuf + 1;
- } else if (done + size < count) {
- pr_warning("Line length is too long: "
- "Should be less than %d.", WRITE_BUFSIZE);
- ret = -EINVAL;
- goto out;
- }
- done += size;
- /* Remove comments */
- tmp = strchr(kbuf, '#');
- if (tmp)
- *tmp = '\0';
-
- ret = command_trace_probe(kbuf);
- if (ret)
- goto out;
- }
- ret = done;
-out:
- kfree(kbuf);
- return ret;
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_probe);
}

static const struct file_operations kprobe_events_ops = {
@@ -1536,17 +724,6 @@ static void probe_event_disable(struct ftrace_event_call *call)
}
}

-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed) \
- do { \
- ret = trace_define_field(event_call, #type, name, \
- offsetof(typeof(field), item), \
- sizeof(field.item), is_signed, \
- FILTER_OTHER); \
- if (ret) \
- return ret; \
- } while (0)
-
static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
{
int ret, i;
@@ -1887,7 +1064,7 @@ static __init int kprobe_trace_self_tests_init(void)

pr_info("Testing kprobe tracing: ");

- ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
+ ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
"$stack $stack0 +0($stack)");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function entry.\n");
@@ -1902,7 +1079,7 @@ static __init int kprobe_trace_self_tests_init(void)
probe_event_enable(&tp->call);
}

- ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
+ ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
"$retval");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function return.\n");
@@ -1922,13 +1099,13 @@ static __init int kprobe_trace_self_tests_init(void)

ret = target(1, 2, 3, 4, 5, 6);

- ret = command_trace_probe("-:testprobe");
+ ret = traceprobe_command("-:testprobe", create_trace_probe);
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
}

- ret = command_trace_probe("-:testprobe2");
+ ret = traceprobe_command("-:testprobe2", create_trace_probe);
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..0fcdb16
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,747 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+ "common_type",
+ "common_flags",
+ "common_preempt_count",
+ "common_pid",
+ "common_tgid",
+ "common_lock_depth",
+ FIELD_STRING_IP,
+ FIELD_STRING_RETIP,
+ FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
+#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
+
+/* Printing in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
+ const char *name, \
+ void *data, void *ent)\
+{ \
+ return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+} \
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+ return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+ return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+ const char *name,
+ void *data, void *ent)
+{
+ int len = *(u32 *)data >> 16;
+
+ if (!len)
+ return trace_seq_printf(s, " %s=(fault)", name);
+ else
+ return trace_seq_printf(s, " %s=\"%s\"", name,
+ (const char *)get_loc_data(data, ent));
+}
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8) \
+DEFINE_FETCH_##method(u16) \
+DEFINE_FETCH_##method(u32) \
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn) \
+ (((FETCH_FUNC_NAME(method, u8) == fn) || \
+ (FETCH_FUNC_NAME(method, u16) == fn) || \
+ (FETCH_FUNC_NAME(method, u32) == fn) || \
+ (FETCH_FUNC_NAME(method, u64) == fn) || \
+ (FETCH_FUNC_NAME(method, string) == fn) || \
+ (FETCH_FUNC_NAME(method, string_size) == fn)) \
+ && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type) \
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_register(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type) \
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type) \
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+ void *dummy, void *dest) \
+{ \
+ *(type *)dest = (type)regs_return_value(regs); \
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type) \
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+ void *addr, void *dest) \
+{ \
+ type retval; \
+ if (probe_kernel_address(addr, retval)) \
+ *(type *)dest = 0; \
+ else \
+ *(type *)dest = retval; \
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ long ret;
+ int maxlen = get_rloc_len(*(u32 *)dest);
+ u8 *dst = get_rloc_data(dest);
+ u8 *src = addr;
+ mm_segment_t old_fs = get_fs();
+ if (!maxlen)
+ return;
+ /*
+ * Try to get string again, since the string can be changed while
+ * probing.
+ */
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do
+ ret = __copy_from_user_inatomic(dst++, src++, 1);
+ while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+ dst[-1] = '\0';
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) { /* Failed to fetch string */
+ ((u8 *)get_rloc_data(dest))[0] = '\0';
+ *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+ } else
+ *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+ get_rloc_offs(*(u32 *)dest));
+}
+/* Return the length of string -- including null terminal byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ int ret, len = 0;
+ u8 c;
+ mm_segment_t old_fs = get_fs();
+
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do {
+ ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+ len++;
+ } while (c && ret == 0 && len < MAX_STRING_SIZE);
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) /* Failed to check the length */
+ *(u32 *)dest = 0;
+ else
+ *(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+ char *symbol;
+ long offset;
+ unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+ sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+ if (sc->addr)
+ sc->addr += sc->offset;
+ return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+ kfree(sc->symbol);
+ kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+ struct symbol_cache *sc;
+
+ if (!sym || strlen(sym) == 0)
+ return NULL;
+ sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+ if (!sc)
+ return NULL;
+
+ sc->symbol = kstrdup(sym, GFP_KERNEL);
+ if (!sc->symbol) {
+ kfree(sc);
+ return NULL;
+ }
+ sc->offset = offset;
+
+ update_symbol_cache(sc);
+ return sc;
+}
+
+#define DEFINE_FETCH_symbol(type) \
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct symbol_cache *sc = data; \
+ if (sc->addr) \
+ fetch_memory_##type(regs, (void *)sc->addr, dest); \
+ else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+ struct fetch_param orig;
+ long offset;
+};
+
+#define DEFINE_FETCH_deref(type) \
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct deref_fetch_param *dprm = data; \
+ unsigned long addr; \
+ call_fetch(&dprm->orig, regs, &addr); \
+ if (addr) { \
+ addr += dprm->offset; \
+ fetch_memory_##type(regs, (void *)addr, dest); \
+ } else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+ struct fetch_param orig;
+ unsigned char hi_shift;
+ unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type) \
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct bitfield_fetch_param *bprm = data; \
+ type buf = 0; \
+ call_fetch(&bprm->orig, regs, &buf); \
+ if (buf) { \
+ buf <<= bprm->hi_shift; \
+ buf >>= bprm->low_shift; \
+ } \
+ *(type *)dest = buf; \
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+ /*
+ * Don't check the bitfield itself, because this must be the
+ * last fetch function.
+ */
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
+#define ASSIGN_FETCH_FUNC(method, type) \
+ [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
+ {.name = _name, \
+ .size = _size, \
+ .is_signed = sign, \
+ .print = PRINT_TYPE_FUNC_NAME(ptype), \
+ .fmt = PRINT_TYPE_FMT_NAME(ptype), \
+ .fmttype = _fmttype, \
+ .fetch = { \
+ASSIGN_FETCH_FUNC(reg, ftype), \
+ASSIGN_FETCH_FUNC(stack, ftype), \
+ASSIGN_FETCH_FUNC(retval, ftype), \
+ASSIGN_FETCH_FUNC(memory, ftype), \
+ASSIGN_FETCH_FUNC(symbol, ftype), \
+ASSIGN_FETCH_FUNC(deref, ftype), \
+ASSIGN_FETCH_FUNC(bitfield, ftype), \
+ } \
+ }
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
+ __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+ /* Special types */
+ [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+ sizeof(u32), 1, "__data_loc char[]"),
+ [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+ string_size, sizeof(u32), 0, "u32"),
+ /* Basic types */
+ ASSIGN_FETCH_TYPE(u8, u8, 0),
+ ASSIGN_FETCH_TYPE(u16, u16, 0),
+ ASSIGN_FETCH_TYPE(u32, u32, 0),
+ ASSIGN_FETCH_TYPE(u64, u64, 0),
+ ASSIGN_FETCH_TYPE(s8, u8, 1),
+ ASSIGN_FETCH_TYPE(s16, u16, 1),
+ ASSIGN_FETCH_TYPE(s32, u32, 1),
+ ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+ int i;
+
+ if (!type)
+ type = DEFAULT_FETCH_TYPE_STR;
+
+ /* Special case: bitfield */
+ if (*type == 'b') {
+ unsigned long bs;
+ type = strchr(type, '/');
+ if (!type)
+ goto fail;
+ type++;
+ if (strict_strtoul(type, 0, &bs))
+ goto fail;
+ switch (bs) {
+ case 8:
+ return find_fetch_type("u8");
+ case 16:
+ return find_fetch_type("u16");
+ case 32:
+ return find_fetch_type("u32");
+ case 64:
+ return find_fetch_type("u64");
+ default:
+ goto fail;
+ }
+ }
+
+ for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+ if (strcmp(type, fetch_type_table[i].name) == 0)
+ return &fetch_type_table[i];
+fail:
+ return NULL;
+}
+
+/* Special function : only accept unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+ void *dummy, void *dest)
+{
+ *(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+ fetch_func_t orig_fn)
+{
+ int i;
+
+ if (type != &fetch_type_table[FETCH_TYPE_STRING])
+ return NULL; /* Only string type needs size function */
+ for (i = 0; i < FETCH_MTD_END; i++)
+ if (type->fetch[i] == orig_fn)
+ return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+ WARN_ON(1); /* This should not happen */
+ return NULL;
+}
+
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+ char *tmp;
+ int ret;
+
+ if (!offset)
+ return -EINVAL;
+
+ tmp = strchr(symbol, '+');
+ if (tmp) {
+ /* skip sign because strict_strtol doesn't accept '+' */
+ ret = strict_strtoul(tmp + 1, 0, offset);
+ if (ret)
+ return ret;
+ *tmp = '\0';
+ } else
+ *offset = 0;
+ return 0;
+}
+
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+
+ if (strcmp(arg, "retval") == 0) {
+ if (is_return)
+ f->fn = t->fetch[FETCH_MTD_retval];
+ else
+ ret = -EINVAL;
+ } else if (strncmp(arg, "stack", 5) == 0) {
+ if (arg[5] == '\0') {
+ if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+ f->fn = fetch_stack_address;
+ else
+ ret = -EINVAL;
+ } else if (isdigit(arg[5])) {
+ ret = strict_strtoul(arg + 5, 10, &param);
+ if (ret || param > PARAM_MAX_STACK)
+ ret = -EINVAL;
+ else {
+ f->fn = t->fetch[FETCH_MTD_stack];
+ f->data = (void *)param;
+ }
+ } else
+ ret = -EINVAL;
+ } else
+ ret = -EINVAL;
+ return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+ long offset;
+ char *tmp;
+
+ switch (arg[0]) {
+ case '$':
+ ret = parse_probe_vars(arg + 1, t, f, is_return);
+ break;
+ case '%': /* named register */
+ ret = regs_query_register_offset(arg + 1);
+ if (ret >= 0) {
+ f->fn = t->fetch[FETCH_MTD_reg];
+ f->data = (void *)(unsigned long)ret;
+ ret = 0;
+ }
+ break;
+ case '@': /* memory or symbol */
+ if (isdigit(arg[1])) {
+ ret = strict_strtoul(arg + 1, 0, &param);
+ if (ret)
+ break;
+ f->fn = t->fetch[FETCH_MTD_memory];
+ f->data = (void *)param;
+ } else {
+ ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+ if (ret)
+ break;
+ f->data = alloc_symbol_cache(arg + 1, offset);
+ if (f->data)
+ f->fn = t->fetch[FETCH_MTD_symbol];
+ }
+ break;
+ case '+': /* deref memory */
+ arg++; /* Skip '+', because strict_strtol() rejects it. */
+ case '-':
+ tmp = strchr(arg, '(');
+ if (!tmp)
+ break;
+ *tmp = '\0';
+ ret = strict_strtol(arg, 0, &offset);
+ if (ret)
+ break;
+ arg = tmp + 1;
+ tmp = strrchr(arg, ')');
+ if (tmp) {
+ struct deref_fetch_param *dprm;
+ const struct fetch_type *t2 = find_fetch_type(NULL);
+ *tmp = '\0';
+ dprm = kzalloc(sizeof(struct deref_fetch_param),
+ GFP_KERNEL);
+ if (!dprm)
+ return -ENOMEM;
+ dprm->offset = offset;
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ if (ret)
+ kfree(dprm);
+ else {
+ f->fn = t->fetch[FETCH_MTD_deref];
+ f->data = (void *)dprm;
+ }
+ }
+ break;
+ }
+ if (!ret && !f->fn) { /* Parsed, but do not find fetch method */
+ pr_info("%s type has no corresponding fetch method.\n",
+ t->name);
+ ret = -EINVAL;
+ }
+ return ret;
+}
+#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+ const struct fetch_type *t,
+ struct fetch_param *f)
+{
+ struct bitfield_fetch_param *bprm;
+ unsigned long bw, bo;
+ char *tail;
+
+ if (*bf != 'b')
+ return 0;
+
+ bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+ if (!bprm)
+ return -ENOMEM;
+ bprm->orig = *f;
+ f->fn = t->fetch[FETCH_MTD_bitfield];
+ f->data = (void *)bprm;
+
+ bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
+ if (bw == 0 || *tail != '@')
+ return -EINVAL;
+
+ bf = tail + 1;
+ bo = simple_strtoul(bf, &tail, 0);
+ if (tail == bf || *tail != '/')
+ return -EINVAL;
+
+ bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+ bprm->low_shift = bprm->hi_shift + bo;
+ return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return)
+{
+ const char *t;
+ int ret;
+
+ if (strlen(arg) > MAX_ARGSTR_LEN) {
+ pr_info("Argument is too long.: %s\n", arg);
+ return -ENOSPC;
+ }
+ parg->comm = kstrdup(arg, GFP_KERNEL);
+ if (!parg->comm) {
+ pr_info("Failed to allocate memory for command '%s'.\n", arg);
+ return -ENOMEM;
+ }
+ t = strchr(parg->comm, ':');
+ if (t) {
+ arg[t - parg->comm] = '\0';
+ t++;
+ }
+ parg->type = find_fetch_type(t);
+ if (!parg->type) {
+ pr_info("Unsupported type: %s\n", t);
+ return -EINVAL;
+ }
+ parg->offset = *size;
+ *size += parg->type->size;
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ if (ret >= 0 && t != NULL)
+ ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+ if (ret >= 0) {
+ parg->fetch_size.fn = get_fetch_size_function(parg->type,
+ parg->fetch.fn);
+ parg->fetch_size.data = parg->fetch.data;
+ }
+ return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg)
+{
+ int i;
+ for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+ if (strcmp(reserved_field_names[i], name) == 0)
+ return 1;
+ for (i = 0; i < narg; i++)
+ if (strcmp(args[i].name, name) == 0)
+ return 1;
+ return 0;
+}
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+ if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+ free_bitfield_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+ free_deref_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+ free_symbol_cache(arg->fetch.data);
+ kfree(arg->name);
+ kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char**))
+{
+ char **argv;
+ int argc = 0, ret = 0;
+
+ argv = argv_split(GFP_KERNEL, buf, &argc);
+ if (!argv)
+ return -ENOMEM;
+
+ if (argc)
+ ret = createfn(argc, argv);
+
+ argv_free(argv);
+ return ret;
+}
+
+#define WRITE_BUFSIZE 4096
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos, int (*createfn)(int, char**))
+{
+ char *kbuf, *tmp;
+ int ret = 0;
+ size_t done = 0;
+ size_t size;
+
+ kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ while (done < count) {
+ size = count - done;
+ if (size >= WRITE_BUFSIZE)
+ size = WRITE_BUFSIZE - 1;
+ if (copy_from_user(kbuf, buffer + done, size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ kbuf[size] = '\0';
+ tmp = strchr(kbuf, '\n');
+ if (tmp) {
+ *tmp = '\0';
+ size = tmp - kbuf + 1;
+ } else if (done + size < count) {
+ pr_warning("Line length is too long: "
+ "Should be less than %d.", WRITE_BUFSIZE);
+ ret = -EINVAL;
+ goto out;
+ }
+ done += size;
+ /* Remove comments */
+ tmp = strchr(kbuf, '#');
+ if (tmp)
+ *tmp = '\0';
+
+ ret = traceprobe_command(kbuf, createfn);
+ if (ret)
+ goto out;
+ }
+ ret = done;
+out:
+ kfree(kbuf);
+ return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..487514b
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,158 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed) \
+ do { \
+ ret = trace_define_field(event_call, #type, name, \
+ offsetof(typeof(field), item), \
+ sizeof(field.item), is_signed, \
+ FILTER_OTHER); \
+ if (ret) \
+ return ret; \
+ } while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE 1
+#define TP_FLAG_PROFILE 2
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs) \
+ (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl) ((u32)(dl) >> 16)
+#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ * data_rloc stores the offset from data_rloc itself, but data_loc
+ * stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+ void *);
+
+/* Fetch types */
+enum {
+ FETCH_MTD_reg = 0,
+ FETCH_MTD_stack,
+ FETCH_MTD_retval,
+ FETCH_MTD_memory,
+ FETCH_MTD_symbol,
+ FETCH_MTD_deref,
+ FETCH_MTD_bitfield,
+ FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+ const char *name; /* Name of type */
+ size_t size; /* Byte size of type */
+ int is_signed; /* Signed flag */
+ print_type_func_t print; /* Print functions */
+ const char *fmt; /* Format string */
+ const char *fmttype; /* Name in format file */
+ /* Fetch functions */
+ fetch_func_t fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+ fetch_func_t fn;
+ void *data;
+};
+
+struct probe_arg {
+ struct fetch_param fetch;
+ struct fetch_param fetch_size;
+ unsigned int offset; /* Offset from argument entry */
+ const char *name; /* Name of this argument */
+ const char *comm; /* Command of this argument */
+ const struct fetch_type *type; /* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+ struct pt_regs *regs, void *dest)
+{
+ return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static inline int is_good_name(const char *name)
+{
+ if (!isalpha(*name) && *name != '_')
+ return 0;
+ while (*++name != '\0') {
+ if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+ return 0;
+ }
+ return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg);
+
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+ const char __user *buffer, size_t count, loff_t *ppos,
+ int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));

2011-04-01 14:46:22

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 20/26] 20: tracing: uprobes trace_event interface


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

The following example shows how to dump the instruction pointer and the
%ax register at the probed text address. Here we probe zfree in
/bin/zsh; the offset 0x46420 used below is the zfree symbol address
0x446420 minus 0x400000, the start of zsh's executable mapping:

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
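
Arguments can also be named on the command line; the name then becomes the
field name in the trace output. A sketch of the same probe with the
hypothetical names ip and ax:

	# echo 'p:zfree_entry /bin/zsh:0x46420 ip=%ip ax=%ax' > uprobe_events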

TODO: Connect a filter to a consumer.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 10 -
kernel/trace/Kconfig | 16 +
kernel/trace/Makefile | 1
kernel/trace/trace.h | 5
kernel/trace/trace_kprobe.c | 4
kernel/trace/trace_probe.c | 14 +
kernel/trace/trace_probe.h | 6
kernel/trace/trace_uprobe.c | 803 +++++++++++++++++++++++++++++++++++++++++++
8 files changed, 842 insertions(+), 17 deletions(-)
create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/arch/Kconfig b/arch/Kconfig
index c681f16..c4e9663 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,16 +62,8 @@ config OPTPROBES
depends on !PREEMPT

config UPROBES
- bool "User-space probes (EXPERIMENTAL)"
- depends on ARCH_SUPPORTS_UPROBES
- depends on MMU
select MM_OWNER
- help
- Uprobes enables kernel subsystems to establish probepoints
- in user applications and execute handler functions when
- the probepoints are hit.
-
- If in doubt, say "N".
+ def_bool n

config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 9c62806..0b1b7d5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.

+config UPROBE_EVENT
+ bool "Enable uprobes-based dynamic events"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select UPROBES
+ select PROBE_EVENTS
+ select TRACING
+ default n
+ help
+ This allows the user to add tracing events on top of userspace dynamic
+ events (similar to tracepoints) on the fly via the trace events interface.
+ Those events can be inserted wherever uprobes can probe, and record
+ various registers.
+ This option is required if you plan to use the perf-probe subcommand of
+ perf tools on user space applications.
+
config PROBE_EVENTS
def_bool n

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 692223a..bb3d3ff 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -57,5 +57,6 @@ ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5e9dfc6..55249f8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
unsigned long ret_ip;
};

+struct uprobe_trace_entry_head {
+ struct trace_entry ent;
+ unsigned long ip;
+};
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 3f9189d..dec70fb 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -374,8 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}

/* Parse fetch argument */
- ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
- is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size,
+ &tp->args[i], is_return, true);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 0fcdb16..4499a2a 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -508,13 +508,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,

/* Recursive argument parser */
static int parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, bool is_return)
+ struct fetch_param *f, bool is_return, bool is_kprobe)
{
int ret = 0;
unsigned long param;
long offset;
char *tmp;

+ /* For now, uprobe_events supports only register arguments */
+ if (!is_kprobe && arg[0] != '%')
+ return -EINVAL;
+
switch (arg[0]) {
case '$':
ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -564,7 +568,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
if (!dprm)
return -ENOMEM;
dprm->offset = offset;
- ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+ is_kprobe);
if (ret)
kfree(dprm);
else {
@@ -618,7 +623,7 @@ static int __parse_bitfield_probe_arg(const char *bf,

/* String length checking wrapper */
int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return)
+ struct probe_arg *parg, bool is_return, bool is_kprobe)
{
const char *t;
int ret;
@@ -644,7 +649,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
}
parg->offset = *size;
*size += parg->type->size;
- ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+ is_kprobe);
if (ret >= 0 && t != NULL)
ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 487514b..f625d8d 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -48,6 +48,7 @@
#define FIELD_STRING_IP "__probe_ip"
#define FIELD_STRING_RETIP "__probe_ret_ip"
#define FIELD_STRING_FUNC "__probe_func"
+#define FIELD_STRING_PID "__probe_pid"

#undef DEFINE_FIELD
#define DEFINE_FIELD(type, item, name, is_signed) \
@@ -64,6 +65,7 @@
/* Flags for trace_probe */
#define TP_FLAG_TRACE 1
#define TP_FLAG_PROFILE 2
+#define TP_FLAG_UPROBE 4


/* data_rloc: data relative location, compatible with u32 */
@@ -130,7 +132,7 @@ static inline __kprobes void call_fetch(struct fetch_param *fprm,
}

/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
+static inline int is_good_name(const char *name)
{
if (!isalpha(*name) && *name != '_')
return 0;
@@ -142,7 +144,7 @@ static int is_good_name(const char *name)
}

extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return);
+ struct probe_arg *parg, bool is_return, bool is_kprobe);

extern int traceprobe_conflict_field_name(const char *name,
struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..8395cfc
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,803 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/*
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+ struct uprobe_consumer cons;
+ struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+ struct list_head list;
+ struct ftrace_event_class class;
+ struct ftrace_event_call call;
+ struct uprobe_trace_consumer *consumer;
+ struct inode *inode;
+ char *filename;
+ unsigned long offset;
+ unsigned long nhit;
+ unsigned int flags; /* For TP_FLAG_* */
+ ssize_t size; /* trace entry size */
+ unsigned int nr_args;
+ struct probe_arg args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n) \
+ (offsetof(struct trace_uprobe, args) + \
+ (sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+ const char *event, int nargs)
+{
+ struct trace_uprobe *tp;
+
+ if (!event || !is_good_name(event)) {
+ printk(KERN_ERR "%s\n", event);
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (!group || !is_good_name(group)) {
+ printk(KERN_ERR "%s\n", group);
+ return ERR_PTR(-EINVAL);
+ }
+
+ tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+ if (!tp)
+ return ERR_PTR(-ENOMEM);
+
+ tp->call.class = &tp->class;
+ tp->call.name = kstrdup(event, GFP_KERNEL);
+ if (!tp->call.name)
+ goto error;
+
+ tp->class.system = kstrdup(group, GFP_KERNEL);
+ if (!tp->class.system)
+ goto error;
+
+ INIT_LIST_HEAD(&tp->list);
+ return tp;
+error:
+ kfree(tp->call.name);
+ kfree(tp);
+ return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+ int i;
+
+ for (i = 0; i < tp->nr_args; i++)
+ traceprobe_free_probe_arg(&tp->args[i]);
+
+ iput(tp->inode);
+ kfree(tp->call.class->system);
+ kfree(tp->call.name);
+ kfree(tp->filename);
+ kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+ const char *group)
+{
+ struct trace_uprobe *tp;
+
+ list_for_each_entry(tp, &uprobe_list, list)
+ if (strcmp(tp->call.name, event) == 0 &&
+ strcmp(tp->call.class->system, group) == 0)
+ return tp;
+ return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+ list_del(&tp->list);
+ unregister_uprobe_event(tp);
+ free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+ struct trace_uprobe *old_tp;
+ int ret;
+
+ mutex_lock(&uprobe_lock);
+
+ /* register as an event */
+ old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+ if (old_tp)
+ /* delete old event */
+ unregister_trace_uprobe(old_tp);
+
+ ret = register_uprobe_event(tp);
+ if (ret) {
+ pr_warning("Failed to register probe event(%d)\n", ret);
+ goto end;
+ }
+
+ list_add_tail(&tp->list, &uprobe_list);
+end:
+ mutex_unlock(&uprobe_lock);
+ return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+ /*
+ * Argument syntax:
+ * - Add uprobe: p[:[GRP/]EVENT] PATH:OFFSET [%REG]
+ *
+ * - Remove uprobe: -:[GRP/]EVENT
+ */
+ struct path path;
+ struct inode *inode;
+ struct trace_uprobe *tp;
+ int i, ret = 0;
+ int is_delete = 0;
+ char *arg = NULL, *event = NULL, *group = NULL;
+ unsigned long offset;
+ char buf[MAX_EVENT_NAME_LEN];
+ char *filename;
+
+ /* argc must be >= 1 */
+ if (argv[0][0] == '-')
+ is_delete = 1;
+ else if (argv[0][0] != 'p') {
+ pr_info("Probe definition must be started with 'p', 'r' or"
+ " '-'.\n");
+ return -EINVAL;
+ }
+
+ if (argv[0][1] == ':') {
+ event = &argv[0][2];
+ if (strchr(event, '/')) {
+ group = event;
+ event = strchr(group, '/') + 1;
+ event[-1] = '\0';
+ if (strlen(group) == 0) {
+ pr_info("Group name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (strlen(event) == 0) {
+ pr_info("Event name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (!group)
+ group = UPROBE_EVENT_SYSTEM;
+
+ if (is_delete) {
+ if (!event) {
+ pr_info("Delete command needs an event name.\n");
+ return -EINVAL;
+ }
+ mutex_lock(&uprobe_lock);
+ tp = find_probe_event(event, group);
+ if (!tp) {
+ mutex_unlock(&uprobe_lock);
+ pr_info("Event %s/%s doesn't exist.\n", group, event);
+ return -ENOENT;
+ }
+ /* delete an event */
+ unregister_trace_uprobe(tp);
+ mutex_unlock(&uprobe_lock);
+ return 0;
+ }
+
+ if (argc < 2) {
+ pr_info("Probe point is not specified.\n");
+ return -EINVAL;
+ }
+ if (isdigit(argv[1][0])) {
+ pr_info("probe point must be have a filename.\n");
+ return -EINVAL;
+ }
+ arg = strchr(argv[1], ':');
+ if (!arg)
+ goto fail_address_parse;
+
+ *arg++ = '\0';
+ filename = argv[1];
+ ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+ if (ret)
+ goto fail_address_parse;
+ inode = path.dentry->d_inode;
+ __iget(inode);
+
+ ret = strict_strtoul(arg, 0, &offset);
+ if (ret)
+ goto fail_address_parse;
+ argc -= 2; argv += 2;
+
+ /* setup a probe */
+ if (!event) {
+ char *tail = strrchr(filename, '/');
+
+ snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p',
+ (tail ? tail + 1 : filename), offset);
+ event = buf;
+ }
+ tp = alloc_trace_uprobe(group, event, argc);
+ if (IS_ERR(tp)) {
+ pr_info("Failed to allocate trace_uprobe.(%d)\n",
+ (int)PTR_ERR(tp));
+ iput(inode);
+ return PTR_ERR(tp);
+ }
+ tp->offset = offset;
+ tp->inode = inode;
+ tp->filename = kstrdup(filename, GFP_KERNEL);
+ if (!tp->filename) {
+ pr_info("Failed to allocate filename.\n");
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ /* parse arguments */
+ ret = 0;
+ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+ /* Increment count for freeing args in error case */
+ tp->nr_args++;
+
+ /* Parse argument name */
+ arg = strchr(argv[i], '=');
+ if (arg) {
+ *arg++ = '\0';
+ tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+ } else {
+ arg = argv[i];
+ /* If argument name is omitted, set "argN" */
+ snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+ tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+ }
+
+ if (!tp->args[i].name) {
+ pr_info("Failed to allocate argument[%d] name.\n", i);
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (!is_good_name(tp->args[i].name)) {
+ pr_info("Invalid argument[%d] name: %s\n",
+ i, tp->args[i].name);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
+ pr_info("Argument[%d] name '%s' conflicts with "
+ "another field.\n", i, argv[i]);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ /* Parse fetch argument */
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ false, false);
+ if (ret) {
+ pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+ goto error;
+ }
+ }
+
+ ret = register_trace_uprobe(tp);
+ if (ret)
+ goto error;
+ return 0;
+
+error:
+ free_trace_uprobe(tp);
+ return ret;
+
+fail_address_parse:
+ pr_info("Failed to parse address.\n");
+ return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+ struct trace_uprobe *tp;
+
+ mutex_lock(&uprobe_lock);
+ while (!list_empty(&uprobe_list)) {
+ tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+ unregister_trace_uprobe(tp);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+ mutex_lock(&uprobe_lock);
+ return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+ mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+ int i;
+
+ seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+ seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+ for (i = 0; i < tp->nr_args; i++)
+ seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+ if ((file->f_mode & FMODE_WRITE) &&
+ (file->f_flags & O_TRUNC))
+ cleanup_all_probes();
+
+ return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+ .owner = THIS_MODULE,
+ .open = probes_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+
+ seq_printf(m, " %s %-44s %15lu\n", tp->filename, tp->call.name,
+ tp->nhit);
+ return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+ .owner = THIS_MODULE,
+ .open = profile_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+ struct uprobe_trace_entry_head *entry;
+ struct ring_buffer_event *event;
+ struct ring_buffer *buffer;
+ u8 *data;
+ int size, i, pc;
+ unsigned long irq_flags;
+ struct ftrace_event_call *call = &tp->call;
+
+ tp->nhit++;
+
+ local_save_flags(irq_flags);
+ pc = preempt_count();
+
+ size = sizeof(*entry) + tp->size;
+
+ event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+ size, irq_flags, pc);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->ip = uprobes_get_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ if (!filter_current_check_discard(buffer, call, entry, event))
+ trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct uprobe_trace_entry_head *field;
+ struct trace_seq *s = &iter->seq;
+ struct trace_uprobe *tp;
+ u8 *data;
+ int i;
+
+ field = (struct uprobe_trace_entry_head *)iter->ent;
+ tp = container_of(event, struct trace_uprobe, call.event);
+
+ if (!trace_seq_printf(s, "%s: (", tp->call.name))
+ goto partial;
+
+ if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+ goto partial;
+
+ if (!trace_seq_puts(s, ")"))
+ goto partial;
+
+ data = (u8 *)&field[1];
+ for (i = 0; i < tp->nr_args; i++)
+ if (!tp->args[i].type->print(s, tp->args[i].name,
+ data + tp->args[i].offset, field))
+ goto partial;
+
+ if (!trace_seq_puts(s, "\n"))
+ goto partial;
+
+ return TRACE_TYPE_HANDLED;
+partial:
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+
+static int probe_event_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_TRACE;
+ utc->tp = tp;
+ tp->consumer = utc;
+ return 0;
+}
+
+static void probe_event_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_TRACE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+ int ret, i;
+ struct uprobe_trace_entry_head field;
+ struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+ DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+ /* Set argument names as fields */
+ for (i = 0; i < tp->nr_args; i++) {
+ ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+ tp->args[i].name,
+ sizeof(field) + tp->args[i].offset,
+ tp->args[i].type->size,
+ tp->args[i].type->is_signed,
+ FILTER_OTHER);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+ int i;
+ int pos = 0;
+
+ const char *fmt, *arg;
+
+ fmt = "(%lx)";
+ arg = "REC->" FIELD_STRING_IP;
+
+ /* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+ tp->args[i].name, tp->args[i].type->fmt);
+ }
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+ tp->args[i].name);
+ }
+
+#undef LEN_OR_ZERO
+
+ /* return the length of print_fmt */
+ return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+ int len;
+ char *print_fmt;
+
+ /* First: called with 0 length to calculate the needed length */
+ len = __set_print_fmt(tp, NULL, 0);
+ print_fmt = kmalloc(len + 1, GFP_KERNEL);
+ if (!print_fmt)
+ return -ENOMEM;
+
+ /* Second: actually write the @print_fmt */
+ __set_print_fmt(tp, print_fmt, len + 1);
+ tp->call.print_fmt = print_fmt;
+
+ return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp,
+ struct pt_regs *regs)
+{
+ struct ftrace_event_call *call = &tp->call;
+ struct uprobe_trace_entry_head *entry;
+ struct hlist_head *head;
+ u8 *data;
+ int size, __size, i;
+ int rctx;
+
+ __size = sizeof(*entry) + tp->size;
+ size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+ if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+ "profile buffer not large enough"))
+ return;
+
+ entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+ if (!entry)
+ return;
+
+ entry->ip = uprobes_get_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ head = this_cpu_ptr(call->perf_events);
+ perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+
+static int probe_perf_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_PROFILE;
+ tp->consumer = utc;
+ utc->tp = tp;
+ return 0;
+}
+
+static void probe_perf_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_PROFILE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+#endif /* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+ switch (type) {
+ case TRACE_REG_REGISTER:
+ return probe_event_enable(event);
+ case TRACE_REG_UNREGISTER:
+ probe_event_disable(event);
+ return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+ case TRACE_REG_PERF_REGISTER:
+ return probe_perf_enable(event);
+ case TRACE_REG_PERF_UNREGISTER:
+ probe_perf_disable(event);
+ return 0;
+#endif
+ }
+ return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp;
+
+ utc = container_of(con, struct uprobe_trace_consumer, cons);
+ tp = utc->tp;
+ if (!tp || tp->consumer != utc)
+ return 0;
+
+ if (tp->flags & TP_FLAG_TRACE)
+ uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+ if (tp->flags & TP_FLAG_PROFILE)
+ uprobe_perf_func(tp, regs);
+#endif
+ return 0;
+}
+
+
+static struct trace_event_functions uprobe_funcs = {
+ .trace = print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+ struct ftrace_event_call *call = &tp->call;
+ int ret;
+
+ /* Initialize ftrace_event_call */
+ INIT_LIST_HEAD(&call->class->fields);
+ call->event.funcs = &uprobe_funcs;
+ call->class->define_fields = uprobe_event_define_fields;
+ if (set_print_fmt(tp) < 0)
+ return -ENOMEM;
+ ret = register_ftrace_event(&call->event);
+ if (!ret) {
+ kfree(call->print_fmt);
+ return -ENODEV;
+ }
+ call->flags = 0;
+ call->class->reg = uprobe_register;
+ call->data = tp;
+ ret = trace_add_event_call(call);
+ if (ret) {
+ pr_info("Failed to register uprobe event: %s\n", call->name);
+ kfree(call->print_fmt);
+ unregister_ftrace_event(&call->event);
+ }
+ return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+ /* tp->event is unregistered in trace_remove_event_call() */
+ trace_remove_event_call(&tp->call);
+ kfree(tp->call.print_fmt);
+ tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *entry;
+
+ d_tracer = tracing_init_dentry();
+ if (!d_tracer)
+ return 0;
+
+ entry = trace_create_file("uprobe_events", 0644, d_tracer,
+ NULL, &uprobe_events_ops);
+ /* Profile interface */
+ entry = trace_create_file("uprobe_profile", 0444, d_tracer,
+ NULL, &uprobe_profile_ops);
+ return 0;
+}
+fs_initcall(init_uprobe_trace);
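
Other in-kernel users would follow the same pattern the tracer uses above:
fill in a struct uprobe_consumer and register it against an inode:offset
pair. A minimal sketch, assuming the register_uprobe()/unregister_uprobe()
interface from earlier in this series (my_handler/my_consumer are made-up
names):

	static int my_handler(struct uprobe_consumer *con, struct pt_regs *regs)
	{
		/* called in task context when the breakpoint is hit */
		return 0;
	}

	static struct uprobe_consumer my_consumer = {
		.handler = my_handler,
	};

	/* with a valid inode and file offset in hand: */
	err = register_uprobe(inode, offset, &my_consumer);
	...
	unregister_uprobe(inode, offset, &my_consumer);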

2011-04-01 14:46:34

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 21/26] 21: tracing: uprobes trace_event documentation


Adds Documentation/trace/uprobetrace.txt describing the uprobe tracer
and its event syntax.

Signed-off-by: Srikar Dronamraju <[email protected]>

---
Documentation/trace/uprobetrace.txt | 94 +++++++++++++++++++++++++++++++++++
1 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/uprobetrace.txt

diff --git a/Documentation/trace/uprobetrace.txt b/Documentation/trace/uprobetrace.txt
new file mode 100644
index 0000000..6c18ffe
--- /dev/null
+++ b/Documentation/trace/uprobetrace.txt
@@ -0,0 +1,94 @@
+ Uprobe-tracer: Uprobe-based Event Tracing
+ =========================================
+ Documentation is written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable them via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enable.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+ p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a probe
+
+ GRP : Group name. If omitted, use "uprobes" for it.
+ EVENT : Event name. If omitted, the event name is generated
+ based on SYMBOL+offs.
+ PATH : path to an executable or a library.
+ SYMBOL[+offs] : Symbol+offset where the probe is inserted.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ %REG : Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is the filename, the second is the event name, and the
+third is the number of probe hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+ echo 'p /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash.
+
+
+ echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and the
+%ax register at the probed text address. Here we are trying to probe
+the function zfree in /bin/zsh:
+
+ # cd /sys/kernel/debug/tracing/
+ # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
+ 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+ # objdump -T /bin/zsh | grep -w zfree
+ 0000000000446420 g DF .text 0000000000000012 Base zfree
+
+0x46420 is the offset of zfree in the object /bin/zsh, which is loaded at
+0x00400000. Hence the command to probe would be:
+
+ # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+ # cat uprobe_events
+ p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. To trace these
+events, you need to enable them:
+
+ # echo 1 > events/uprobes/enable
+
+Let's disable the event after sleeping for some time.
+ # sleep 20
+ # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+ # cat trace
+ # tracer: nop
+ #
+ # TASK-PID CPU# TIMESTAMP FUNCTION
+ # | | | | |
+ zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows that the probe was triggered for pid 24842, with the
+instruction pointer being 0x446421 and the %ax register holding 79.
+

2011-04-01 14:46:42

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 22/26] 22: perf: rename target_module to target


This is a precursor patch that renames identifiers that so far refer only
to kernel/module targets so that they can also refer to user space targets.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 12 ++++++------
tools/perf/util/probe-event.c | 6 +++---
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 2c0e64d..98db08f 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -61,7 +61,7 @@ static struct {
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
struct line_range line_range;
- const char *target_module;
+ const char *target;
int max_probe_points;
struct strfilter *filter;
} params;
@@ -241,7 +241,7 @@ static const struct option options[] = {
"file", "vmlinux pathname"),
OPT_STRING('s', "source", &symbol_conf.source_prefix,
"directory", "path to kernel source"),
- OPT_STRING('m', "module", &params.target_module,
+ OPT_STRING('m', "module", &params.target,
"modname", "target module name"),
#endif
OPT__DRY_RUN(&probe_event_dry_run),
@@ -327,7 +327,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (!params.filter)
params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
NULL);
- ret = show_available_funcs(params.target_module,
+ ret = show_available_funcs(params.target,
params.filter);
strfilter__delete(params.filter);
if (ret < 0)
@@ -348,7 +348,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
usage_with_options(probe_usage, options);
}

- ret = show_line_range(&params.line_range, params.target_module);
+ ret = show_line_range(&params.line_range, params.target);
if (ret < 0)
pr_err(" Error: Failed to show lines. (%d)\n", ret);
return ret;
@@ -365,7 +365,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)

ret = show_available_vars(params.events, params.nevents,
params.max_probe_points,
- params.target_module,
+ params.target,
params.filter,
params.show_ext_vars);
strfilter__delete(params.filter);
@@ -387,7 +387,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (params.nevents) {
ret = add_perf_probe_events(params.events, params.nevents,
params.max_probe_points,
- params.target_module,
+ params.target,
params.force_add);
if (ret < 0) {
pr_err(" Error: Failed to add events. (%d)\n", ret);
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 5ddee66..09c53c1 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1979,7 +1979,7 @@ static int filter_available_functions(struct map *map __unused,
return 1;
}

-int show_available_funcs(const char *module, struct strfilter *_filter)
+int show_available_funcs(const char *elfobject, struct strfilter *_filter)
{
struct map *map;
int ret;
@@ -1990,9 +1990,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
if (ret < 0)
return ret;

- map = kernel_get_module_map(module);
+ map = kernel_get_module_map(elfobject);
if (!map) {
- pr_err("Failed to find %s map.\n", (module) ? : "kernel");
+ pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
return -EINVAL;
}
available_func_filter = _filter;

2011-04-01 14:46:52

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 23/26] 23: perf: show possible probes in a given executable file or library.


Enhances -F/--funcs option of "perf probe" to list possible probe points in
an executable file or library. A new option -e/--exe specifies the path of
the executable or library.


Show last 10 functions in /bin/zsh.

# perf probe -F -u -e /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

Show first 10 functions in /lib/libc.so.6

# perf probe -u -F -e /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf
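
The existing --filter option of perf probe can be combined with these
listings to narrow them down. A sketch, reusing the zsh binary from the
first example:

	# perf probe -u -F -e /bin/zsh --filter 'ztr*'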

Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 9 +++++--
tools/perf/util/probe-event.c | 56 +++++++++++++++++++++++++++++++----------
tools/perf/util/probe-event.h | 4 +--
tools/perf/util/symbol.c | 8 ++++++
tools/perf/util/symbol.h | 1 +
5 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 98db08f..6ceebea 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -57,6 +57,7 @@ static struct {
bool show_ext_vars;
bool show_funcs;
bool mod_events;
+ bool uprobes;
int nevents;
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
@@ -249,6 +250,10 @@ static const struct option options[] = {
"Set how many probe points can be found for a probe."),
OPT_BOOLEAN('F', "funcs", &params.show_funcs,
"Show potential probe-able functions."),
+ OPT_BOOLEAN('u', "uprobe", &params.uprobes,
+ "user space probe events"),
+ OPT_STRING('e', "exe", &params.target,
+ "executable", "userspace executable or library"),
OPT_CALLBACK('\0', "filter", NULL,
"[!]FILTER", "Set a filter (with --vars/funcs only)\n"
"\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
@@ -327,8 +332,8 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (!params.filter)
params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
NULL);
- ret = show_available_funcs(params.target,
- params.filter);
+ ret = show_available_funcs(params.target, params.filter,
+ params.uprobes);
strfilter__delete(params.filter);
if (ret < 0)
pr_err(" Error: Failed to show functions."
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 09c53c1..cf77feb 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -47,6 +47,7 @@
#include "trace-event.h" /* For __unused */
#include "probe-event.h"
#include "probe-finder.h"
+#include "session.h"

#define MAX_CMDLEN 256
#define MAX_PROBE_ARGS 128
@@ -1963,6 +1964,7 @@ int del_perf_probe_events(struct strlist *dellist)

return ret;
}
+
/* TODO: don't use a global variable for filter ... */
static struct strfilter *available_func_filter;

@@ -1979,30 +1981,58 @@ static int filter_available_functions(struct map *map __unused,
return 1;
}

-int show_available_funcs(const char *elfobject, struct strfilter *_filter)
+static int __show_available_funcs(struct map *map)
+{
+ if (map__load(map, filter_available_functions)) {
+ pr_err("Failed to load map.\n");
+ return -EINVAL;
+ }
+ if (!dso__sorted_by_name(map->dso, map->type))
+ dso__sort_by_name(map->dso, map->type);
+
+ dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
+ return 0;
+}
+
+static int available_kernel_funcs(const char *module)
{
struct map *map;
int ret;

- setup_pager();
-
ret = init_vmlinux();
if (ret < 0)
return ret;

- map = kernel_get_module_map(elfobject);
+ map = kernel_get_module_map(module);
if (!map) {
- pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
+ pr_err("Failed to find %s map.\n", (module) ? : "kernel");
return -EINVAL;
}
+ return __show_available_funcs(map);
+}
+
+int show_available_funcs(const char *elfobject, struct strfilter *_filter,
+ bool user)
+{
+ struct map *map;
+ int ret;
+
+ setup_pager();
available_func_filter = _filter;
- if (map__load(map, filter_available_functions)) {
- pr_err("Failed to load map.\n");
- return -EINVAL;
- }
- if (!dso__sorted_by_name(map->dso, map->type))
- dso__sort_by_name(map->dso, map->type);

- dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
- return 0;
+ if (!user)
+ return available_kernel_funcs(elfobject);
+
+ symbol_conf.try_vmlinux_path = false;
+ symbol_conf.sort_by_name = true;
+ ret = symbol__init();
+ if (ret < 0) {
+ pr_err("Failed to init symbol map.\n");
+ return ret;
+ }
+ map = dso__new_map(elfobject);
+ ret = __show_available_funcs(map);
+ dso__delete(map->dso);
+ map__delete(map);
+ return ret;
}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 3434fc9..4c24a85 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -128,8 +128,8 @@ extern int show_line_range(struct line_range *lr, const char *module);
extern int show_available_vars(struct perf_probe_event *pevs, int npevs,
int max_probe_points, const char *module,
struct strfilter *filter, bool externs);
-extern int show_available_funcs(const char *module, struct strfilter *filter);
-
+extern int show_available_funcs(const char *module, struct strfilter *filter,
+ bool user);

/* Maximum index number of event-name postfix */
#define MAX_EVENT_INDEX 1024
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index f06c10f..eefeab4 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -2606,3 +2606,11 @@ int machine__load_vmlinux_path(struct machine *self, enum map_type type,

return ret;
}
+
+struct map *dso__new_map(const char *name)
+{
+ struct dso *dso = dso__new(name);
+ struct map *map = map__new2(0, dso, MAP__FUNCTION);
+
+ return map;
+}
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 713b0b4..3838909 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -211,6 +211,7 @@ char dso__symtab_origin(const struct dso *self);
void dso__set_long_name(struct dso *self, char *name);
void dso__set_build_id(struct dso *self, void *build_id);
void dso__read_running_kernel_build_id(struct dso *self, struct machine *machine);
+struct map *dso__new_map(const char *name);
struct symbol *dso__find_symbol(struct dso *self, enum map_type type, u64 addr);
struct symbol *dso__find_symbol_by_name(struct dso *self, enum map_type type,
const char *name);

2011-04-01 14:47:05

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 24/26] 24: perf: perf interface for uprobes


Enhances perf probe to support probing user space executables and libraries.
Provides very basic support for uprobes.

[ Probing a function in the executable using function name ]
-------------------------------------------------------------
[root@localhost ~]# perf probe -u zfree@/bin/zsh
Add new event:
probe_zsh:zfree (on /bin/zsh:0x45400)

You can now use it on all perf tools, such as:

perf record -e probe_zsh:zfree -aR sleep 1

[root@localhost ~]# perf record -e probe_zsh:zfree -aR sleep 15
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.314 MB perf.data (~13715 samples) ]
[root@localhost ~]# perf report --stdio
# Events: 3K probe_zsh:zfree
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. ......
#
100.00% zsh zsh [.] zfree


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@localhost ~]

[ Probing a library function using function name ]
--------------------------------------------------
[root@localhost]#
[root@localhost]# perf probe -u malloc@/lib64/libc.so.6
Add new event:
probe_libc:malloc (on /lib64/libc-2.5.so:0x74dc0)

You can now use it on all perf tools, such as:

perf record -e probe_libc:malloc -aR sleep 1

[root@localhost]#
[root@localhost]# perf probe --list
probe_libc:malloc (on /lib64/libc-2.5.so:0x0000000000074dc0)
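
Probes added this way can be removed again with the existing -d/--del
option. A sketch, assuming the event names created above:

	# perf probe -d probe_zsh:zfree
	# perf probe -d probe_libc:malloc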

Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 9 +
tools/perf/util/probe-event.c | 387 +++++++++++++++++++++++++++++++++--------
tools/perf/util/probe-event.h | 8 +
3 files changed, 322 insertions(+), 82 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 6ceebea..ce9c1d8 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -79,6 +79,7 @@ static int parse_probe_event(const char *str)
return -1;
}

+ pev->uprobes = params.uprobes;
/* Parse a perf-probe command into event */
ret = parse_perf_probe_command(str, pev);
pr_debug("%d arguments\n", pev->nargs);
@@ -250,7 +251,7 @@ static const struct option options[] = {
"Set how many probe points can be found for a probe."),
OPT_BOOLEAN('F', "funcs", &params.show_funcs,
"Show potential probe-able functions."),
- OPT_BOOLEAN('u', "uprobe", &params.uprobes,
+ OPT_BOOLEAN('u', "uprobes", &params.uprobes,
"user space probe events"),
OPT_STRING('e', "exe", &params.target,
"executable", "userspace executable or library"),
@@ -309,6 +310,10 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
pr_err(" Error: Don't use --list with --funcs.\n");
usage_with_options(probe_usage, options);
}
+ if (params.uprobes) {
+ pr_warning(" Error: Don't use --list with --uprobes.\n");
+ usage_with_options(probe_usage, options);
+ }
ret = show_perf_probe_events();
if (ret < 0)
pr_err(" Error: Failed to show event list. (%d)\n",
@@ -342,7 +347,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
}

#ifdef DWARF_SUPPORT
- if (params.show_lines) {
+ if (params.show_lines && !params.uprobes) {
if (params.mod_events) {
pr_err(" Error: Don't use --line with"
" --add/--del.\n");
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index cf77feb..532c7d1 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -74,6 +74,7 @@ static int e_snprintf(char *str, size_t size, const char *format, ...)
}

static char *synthesize_perf_probe_point(struct perf_probe_point *pp);
+static int convert_name_to_addr(struct perf_probe_event *pev);
static struct machine machine;

/* Initialize symbol maps and path of vmlinux/modules */
@@ -170,6 +171,31 @@ const char *kernel_get_module_path(const char *module)
return (dso) ? dso->long_name : NULL;
}

+static int init_perf_uprobes(void)
+{
+ int ret = 0;
+
+ symbol_conf.try_vmlinux_path = false;
+ symbol_conf.sort_by_name = true;
+ ret = symbol__init();
+ if (ret < 0)
+ pr_debug("Failed to init symbol map.\n");
+
+ return ret;
+}
+
+static int convert_to_perf_probe_point(struct probe_trace_point *tp,
+ struct perf_probe_point *pp)
+{
+ pp->function = strdup(tp->symbol);
+ if (pp->function == NULL)
+ return -ENOMEM;
+ pp->offset = tp->offset;
+ pp->retprobe = tp->retprobe;
+
+ return 0;
+}
+
#ifdef DWARF_SUPPORT
static int open_vmlinux(const char *module)
{
@@ -223,6 +249,15 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
bool need_dwarf = perf_probe_event_need_dwarf(pev);
int fd, ntevs;

+ if (pev->uprobes) {
+ if (need_dwarf) {
+ pr_warning("Debuginfo-analysis is not yet supported"
+ " with -u/--uprobes option.\n");
+ return -ENOSYS;
+ }
+ return convert_name_to_addr(pev);
+ }
+
fd = open_vmlinux(module);
if (fd < 0) {
if (need_dwarf) {
@@ -541,13 +576,7 @@ static int kprobe_convert_to_perf_probe(struct probe_trace_point *tp,
pr_err("Failed to find symbol %s in kernel.\n", tp->symbol);
return -ENOENT;
}
- pp->function = strdup(tp->symbol);
- if (pp->function == NULL)
- return -ENOMEM;
- pp->offset = tp->offset;
- pp->retprobe = tp->retprobe;
-
- return 0;
+ return convert_to_perf_probe_point(tp, pp);
}

static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
@@ -558,6 +587,9 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
pr_warning("Debuginfo-analysis is not supported.\n");
return -ENOSYS;
}
+ if (pev->uprobes)
+ return convert_name_to_addr(pev);
+
return 0;
}

@@ -688,6 +720,7 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
struct perf_probe_point *pp = &pev->point;
char *ptr, *tmp;
char c, nc = 0;
+ bool found = false;
/*
* <Syntax>
* perf probe [EVENT=]SRC[:LN|;PTN]
@@ -726,8 +759,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
if (tmp == NULL)
return -ENOMEM;

- /* Check arg is function or file and copy it */
- if (strchr(tmp, '.')) /* File */
+ /*
+ * Check arg is function or file and copy it.
+ * If it is a uprobe, expect the function in the form function@file.
+ */
+ if (!pev->uprobes && strchr(tmp, '.')) /* File */
pp->file = tmp;
else /* Function */
pp->function = tmp;
@@ -765,6 +801,17 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
}
break;
case '@': /* File name */
+ if (pev->uprobes && !found) {
+ /* uprobes overloads @ operator */
+ tmp = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+ e_snprintf(tmp, MAX_PROBE_ARGS, "%s@%s",
+ pp->function, arg);
+ free(pp->function);
+ pp->function = tmp;
+ found = true;
+ break;
+ }
+
if (pp->file) {
semantic_error("SRC@SRC is not allowed.\n");
return -EINVAL;
@@ -822,6 +869,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
return -EINVAL;
}

+ if (pev->uprobes && !pp->function) {
+ semantic_error("No function specified for uprobes");
+ return -EINVAL;
+ }
+
if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
semantic_error("Offset/Line/Lazy pattern can't be used with "
"return probe.\n");
@@ -831,6 +883,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
pr_debug("symbol:%s file:%s line:%d offset:%lu return:%d lazy:%s\n",
pp->function, pp->file, pp->line, pp->offset, pp->retprobe,
pp->lazy_line);
+
+ if (pev->uprobes && perf_probe_event_need_dwarf(pev)) {
+ semantic_error("no dwarf based probes for uprobes.");
+ return -EINVAL;
+ }
return 0;
}

@@ -982,7 +1039,8 @@ bool perf_probe_event_need_dwarf(struct perf_probe_event *pev)
{
int i;

- if (pev->point.file || pev->point.line || pev->point.lazy_line)
+ if ((pev->point.file && !pev->uprobes) || pev->point.line ||
+ pev->point.lazy_line)
return true;

for (i = 0; i < pev->nargs; i++)
@@ -1273,10 +1331,21 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
if (buf == NULL)
return NULL;

- len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s+%lu",
- tp->retprobe ? 'r' : 'p',
- tev->group, tev->event,
- tp->symbol, tp->offset);
+ if (tev->uprobes)
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event, tp->symbol);
+ else if (tp->offset)
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s+%lu",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event,
+ tp->symbol, tp->offset);
+ else
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event,
+ tp->symbol);
+
if (len <= 0)
goto error;

@@ -1295,7 +1364,7 @@ error:
}

static int convert_to_perf_probe_event(struct probe_trace_event *tev,
- struct perf_probe_event *pev)
+ struct perf_probe_event *pev, bool is_kprobe)
{
char buf[64] = "";
int i, ret;
@@ -1307,7 +1376,11 @@ static int convert_to_perf_probe_event(struct probe_trace_event *tev,
return -ENOMEM;

/* Convert trace_point to probe_point */
- ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+ if (is_kprobe)
+ ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+ else
+ ret = convert_to_perf_probe_point(&tev->point, &pev->point);
+
if (ret < 0)
return ret;

@@ -1401,7 +1474,7 @@ static void clear_probe_trace_event(struct probe_trace_event *tev)
memset(tev, 0, sizeof(*tev));
}

-static int open_kprobe_events(bool readwrite)
+static int open_probe_events(bool readwrite, bool is_kprobe)
{
char buf[PATH_MAX];
const char *__debugfs;
@@ -1412,8 +1485,13 @@ static int open_kprobe_events(bool readwrite)
pr_warning("Debugfs is not mounted.\n");
return -ENOENT;
}
+ if (is_kprobe)
+ ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events",
+ __debugfs);
+ else
+ ret = e_snprintf(buf, PATH_MAX, "%stracing/uprobe_events",
+ __debugfs);

- ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events", __debugfs);
if (ret >= 0) {
pr_debug("Opening %s write=%d\n", buf, readwrite);
if (readwrite && !probe_event_dry_run)
@@ -1424,16 +1502,29 @@ static int open_kprobe_events(bool readwrite)

if (ret < 0) {
if (errno == ENOENT)
- pr_warning("kprobe_events file does not exist - please"
- " rebuild kernel with CONFIG_KPROBE_EVENT.\n");
+ pr_warning("%s file does not exist - please"
+ " rebuild kernel with CONFIG_%s_EVENT.\n",
+ is_kprobe ? "kprobe_events" : "uprobe_events",
+ is_kprobe ? "KPROBE" : "UPROBE");
else
- pr_warning("Failed to open kprobe_events file: %s\n",
- strerror(errno));
+ pr_warning("Failed to open %s file: %s\n",
+ is_kprobe ? "kprobe_events" : "uprobe_events",
+ strerror(errno));
}
return ret;
}

-/* Get raw string list of current kprobe_events */
+static int open_kprobe_events(bool readwrite)
+{
+ return open_probe_events(readwrite, 1);
+}
+
+static int open_uprobe_events(bool readwrite)
+{
+ return open_probe_events(readwrite, 0);
+}
+
+/* Get raw string list of current kprobe_events or uprobe_events */
static struct strlist *get_probe_trace_command_rawlist(int fd)
{
int ret, idx;
@@ -1498,36 +1589,26 @@ static int show_perf_probe_event(struct perf_probe_event *pev)
return ret;
}

-/* List up current perf-probe events */
-int show_perf_probe_events(void)
+static int __show_perf_probe_events(int fd, bool is_kprobe)
{
- int fd, ret;
+ int ret = 0;
struct probe_trace_event tev;
struct perf_probe_event pev;
struct strlist *rawlist;
struct str_node *ent;

- setup_pager();
- ret = init_vmlinux();
- if (ret < 0)
- return ret;
-
memset(&tev, 0, sizeof(tev));
memset(&pev, 0, sizeof(pev));

- fd = open_kprobe_events(false);
- if (fd < 0)
- return fd;
-
rawlist = get_probe_trace_command_rawlist(fd);
- close(fd);
if (!rawlist)
return -ENOENT;

strlist__for_each(ent, rawlist) {
ret = parse_probe_trace_command(ent->s, &tev);
if (ret >= 0) {
- ret = convert_to_perf_probe_event(&tev, &pev);
+ ret = convert_to_perf_probe_event(&tev, &pev,
+ is_kprobe);
if (ret >= 0)
ret = show_perf_probe_event(&pev);
}
@@ -1537,6 +1618,31 @@ int show_perf_probe_events(void)
break;
}
strlist__delete(rawlist);
+ return ret;
+}
+
+/* List up current perf-probe events */
+int show_perf_probe_events(void)
+{
+ int fd, ret;
+
+ setup_pager();
+ fd = open_kprobe_events(false);
+ if (fd < 0)
+ return fd;
+
+ ret = init_vmlinux();
+ if (ret < 0)
+ return ret;
+
+ ret = __show_perf_probe_events(fd, true);
+ close(fd);
+
+ fd = open_uprobe_events(false);
+ if (fd >= 0) {
+ ret = __show_perf_probe_events(fd, false);
+ close(fd);
+ }

return ret;
}
@@ -1646,7 +1752,10 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
const char *event, *group;
struct strlist *namelist;

- fd = open_kprobe_events(true);
+ if (pev->uprobes)
+ fd = open_uprobe_events(true);
+ else
+ fd = open_kprobe_events(true);
if (fd < 0)
return fd;
/* Get current event names */
@@ -1660,18 +1769,19 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
printf("Add new event%s\n", (ntevs > 1) ? "s:" : ":");
for (i = 0; i < ntevs; i++) {
tev = &tevs[i];
- if (pev->event)
- event = pev->event;
- else
- if (pev->point.function)
- event = pev->point.function;
- else
- event = tev->point.symbol;
+
if (pev->group)
group = pev->group;
else
group = PERFPROBE_GROUP;

+ if (pev->event)
+ event = pev->event;
+ else if (pev->point.function)
+ event = pev->point.function;
+ else
+ event = tev->point.symbol;
+
/* Get an unused new event name */
ret = get_new_event_name(buf, 64, event,
namelist, allow_suffix);
@@ -1749,6 +1859,7 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
tev->point.offset = pev->point.offset;
tev->point.retprobe = pev->point.retprobe;
tev->nargs = pev->nargs;
+ tev->uprobes = pev->uprobes;
if (tev->nargs) {
tev->args = zalloc(sizeof(struct probe_trace_arg)
* tev->nargs);
@@ -1779,6 +1890,9 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
}
}

+ if (pev->uprobes)
+ return 1;
+
/* Currently just checking function name from symbol map */
sym = __find_kernel_function_by_name(tev->point.symbol, NULL);
if (!sym) {
@@ -1805,15 +1919,19 @@ struct __event_package {
int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
int max_tevs, const char *module, bool force_add)
{
- int i, j, ret;
+ int i, j, ret = 0;
struct __event_package *pkgs;

pkgs = zalloc(sizeof(struct __event_package) * npevs);
if (pkgs == NULL)
return -ENOMEM;

- /* Init vmlinux path */
- ret = init_vmlinux();
+ if (!pevs->uprobes)
+ /* Init vmlinux path */
+ ret = init_vmlinux();
+ else
+ ret = init_perf_uprobes();
+
if (ret < 0) {
free(pkgs);
return ret;
@@ -1883,23 +2001,15 @@ error:
return ret;
}

-static int del_trace_probe_event(int fd, const char *group,
- const char *event, struct strlist *namelist)
+static int del_trace_probe_event(int fd, const char *buf,
+ struct strlist *namelist)
{
- char buf[128];
struct str_node *ent, *n;
- int found = 0, ret = 0;
-
- ret = e_snprintf(buf, 128, "%s:%s", group, event);
- if (ret < 0) {
- pr_err("Failed to copy event.\n");
- return ret;
- }
+ int ret = -1;

if (strpbrk(buf, "*?")) { /* Glob-exp */
strlist__for_each_safe(ent, n, namelist)
if (strglobmatch(ent->s, buf)) {
- found++;
ret = __del_trace_probe_event(fd, ent);
if (ret < 0)
break;
@@ -1908,40 +2018,42 @@ static int del_trace_probe_event(int fd, const char *group,
} else {
ent = strlist__find(namelist, buf);
if (ent) {
- found++;
ret = __del_trace_probe_event(fd, ent);
if (ret >= 0)
strlist__remove(namelist, ent);
}
}
- if (found == 0 && ret >= 0)
- pr_info("Info: Event \"%s\" does not exist.\n", buf);
-
return ret;
}

int del_perf_probe_events(struct strlist *dellist)
{
- int fd, ret = 0;
+ int ret = -1, ufd = -1, kfd = -1;
+ char buf[128];
const char *group, *event;
char *p, *str;
struct str_node *ent;
- struct strlist *namelist;
+ struct strlist *namelist = NULL, *unamelist = NULL;

- fd = open_kprobe_events(true);
- if (fd < 0)
- return fd;

/* Get current event names */
- namelist = get_probe_trace_event_names(fd, true);
- if (namelist == NULL)
- return -EINVAL;
+ kfd = open_kprobe_events(true);
+ if (kfd < 0)
+ return kfd;
+ namelist = get_probe_trace_event_names(kfd, true);
+
+ ufd = open_uprobe_events(true);
+ if (ufd >= 0)
+ unamelist = get_probe_trace_event_names(ufd, true);
+
+ if (namelist == NULL && unamelist == NULL)
+ goto error;

strlist__for_each(ent, dellist) {
str = strdup(ent->s);
if (str == NULL) {
ret = -ENOMEM;
- break;
+ goto error;
}
pr_debug("Parsing: %s\n", str);
p = strchr(str, ':');
@@ -1953,15 +2065,37 @@ int del_perf_probe_events(struct strlist *dellist)
group = "*";
event = str;
}
+
+ ret = e_snprintf(buf, 128, "%s:%s", group, event);
+ if (ret < 0) {
+ pr_err("Failed to copy event.");
+ free(str);
+ goto error;
+ }
+
pr_debug("Group: %s, Event: %s\n", group, event);
- ret = del_trace_probe_event(fd, group, event, namelist);
+ if (namelist)
+ ret = del_trace_probe_event(kfd, buf, namelist);
+ if (unamelist && ret != 0)
+ ret = del_trace_probe_event(ufd, buf, unamelist);
+
free(str);
- if (ret < 0)
- break;
+ if (ret != 0)
+ pr_info("Info: Event \"%s\" does not exist.\n", buf);
}
- strlist__delete(namelist);
- close(fd);

+error:
+ if (kfd >= 0) {
+ if (namelist)
+ strlist__delete(namelist);
+ close(kfd);
+ }
+
+ if (ufd >= 0) {
+ if (unamelist)
+ strlist__delete(unamelist);
+ close(ufd);
+ }
return ret;
}

@@ -2036,3 +2170,102 @@ int show_available_funcs(const char *elfobject, struct strfilter *_filter,
map__delete(map);
return ret;
}
+
+#define DEFAULT_FUNC_FILTER "!_*"
+
+/*
+ * uprobe_events only accepts address:
+ * Convert function and any offset to address
+ */
+static int convert_name_to_addr(struct perf_probe_event *pev)
+{
+ struct perf_probe_point *pp = &pev->point;
+ struct symbol *sym;
+ struct map *map;
+ char *name = NULL, *tmpname = NULL, *function = NULL;
+ int ret = -EINVAL;
+ unsigned long long vaddr = 0;
+
+ /* check if user has specified a virtual address
+ vaddr = strtoul(pp->function, NULL, 0); */
+ if (!pp->function)
+ goto out;
+
+ function = strdup(pp->function);
+ if (!function) {
+ pr_warning("Failed to allocate memory by strdup.\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ tmpname = strchr(function, '@');
+ if (!tmpname) {
+ pr_warning("Cannot break %s into function and file\n",
+ function);
+ goto out;
+ }
+
+ *tmpname = '\0';
+ name = realpath(tmpname + 1, NULL);
+ if (!name) {
+ pr_warning("Cannot find realpath for %s.\n", tmpname + 1);
+ goto out;
+ }
+
+ map = dso__new_map(name);
+ if (!map) {
+ pr_warning("Cannot find appropriate DSO for %s.\n", name);
+ goto out;
+ }
+ available_func_filter = strfilter__new(DEFAULT_FUNC_FILTER, NULL);
+ if (map__load(map, filter_available_functions)) {
+ pr_err("Failed to load map.\n");
+ return -EINVAL;
+ }
+
+ sym = map__find_symbol_by_name(map, function, NULL);
+ if (!sym) {
+ pr_warning("Cannot find %s in DSO %s\n", function, name);
+ goto out;
+ }
+
+ if (map->start > sym->start)
+ vaddr = map->start;
+ vaddr += sym->start + pp->offset + map->pgoff;
+ pp->offset = 0;
+
+ if (!pev->event) {
+ pev->event = function;
+ function = NULL;
+ }
+ if (!pev->group) {
+ char *ptr1, *ptr2;
+
+ pev->group = zalloc(sizeof(char *) * 64);
+ ptr1 = strdup(basename(name));
+ if (ptr1) {
+ ptr2 = strpbrk(ptr1, "-._");
+ if (ptr2)
+ *ptr2 = '\0';
+ e_snprintf(pev->group, 64, "%s_%s", PERFPROBE_GROUP,
+ ptr1);
+ free(ptr1);
+ }
+ }
+ free(pp->function);
+ pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+ if (!pp->function) {
+ ret = -ENOMEM;
+ pr_warning("Failed to allocate memory by zalloc.\n");
+ goto out;
+ }
+ e_snprintf(pp->function, MAX_PROBE_ARGS, "%s:0x%llx", name, vaddr);
+ ret = 0;
+
+out:
+ if (function)
+ free(function);
+ if (name)
+ free(name);
+ return ret;
+}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 4c24a85..5199df4 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -7,7 +7,7 @@

extern bool probe_event_dry_run;

-/* kprobe-tracer tracing point */
+/* kprobe-tracer and uprobe-tracer tracing point */
struct probe_trace_point {
char *symbol; /* Base symbol */
unsigned long offset; /* Offset from symbol */
@@ -20,7 +20,7 @@ struct probe_trace_arg_ref {
long offset; /* Offset value */
};

-/* kprobe-tracer tracing argument */
+/* kprobe-tracer and uprobe-tracer tracing argument */
struct probe_trace_arg {
char *name; /* Argument name */
char *value; /* Base value */
@@ -28,12 +28,13 @@ struct probe_trace_arg {
struct probe_trace_arg_ref *ref; /* Referencing offset */
};

-/* kprobe-tracer tracing event (point + arg) */
+/* kprobe-tracer and uprobe-tracer tracing event (point + arg) */
struct probe_trace_event {
char *event; /* Event name */
char *group; /* Group name */
struct probe_trace_point point; /* Trace point */
int nargs; /* Number of args */
+ bool uprobes; /* uprobes only */
struct probe_trace_arg *args; /* Arguments */
};

@@ -69,6 +70,7 @@ struct perf_probe_event {
char *group; /* Group name */
struct perf_probe_point point; /* Probe point */
int nargs; /* Number of arguments */
+ bool uprobes;
struct perf_probe_arg *args; /* Arguments */
};

2011-04-01 14:47:13

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 25/26] 25: perf: Documentation for perf uprobes


Modify perf-probe.txt to include uprobe documentation

Signed-off-by: Suzuki Poulose <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/Documentation/perf-probe.txt | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
index 02bafce..7a80a66 100644
--- a/tools/perf/Documentation/perf-probe.txt
+++ b/tools/perf/Documentation/perf-probe.txt
@@ -76,6 +76,8 @@ OPTIONS
-F::
--funcs::
Show available functions in given module or kernel.
+ With -e/--exe, it can also list functions in a user space executable or
+ shared library.

--filter=FILTER::
(Only for --vars and --funcs) Set filter. FILTER is a combination of glob
@@ -96,12 +98,21 @@ OPTIONS
--max-probes::
Set the maximum number of probe points for an event. Default is 128.

+-u::
+--uprobes::
+ Specify a uprobe based probe point.
+
+-e::
+--exe=PATH::
+ Specify path to the executable or shared library file for user
+ space tracing. Used with --funcs option only.
+
PROBE SYNTAX
------------
Probe points are defined by following syntax.

1) Define event based on function name
- [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
+ [EVENT=]FUNC[@EXE][@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]

2) Define event based on source file with line number
[EVENT=]SRC:ALN [ARG ...]
@@ -112,6 +123,7 @@ Probe points are defined by following syntax.

'EVENT' specifies the name of new event, if omitted, it will be set the name of the probed function. Currently, event group name is set as 'probe'.
'FUNC' specifies a probed function name, and it may have one of the following options; '+OFFS' is the offset from function entry address in bytes, ':RLN' is the relative-line number from function entry line, and '%return' means that it probes function return. And ';PTN' means lazy matching pattern (see LAZY MATCHING). Note that ';PTN' must be the end of the probe point definition. In addition, '@SRC' specifies a source file which has that function.
+'EXE' specifies the absolute or relative path of the user space executable or user space shared library.
It is also possible to specify a probe point by the source line number or lazy matching by using 'SRC:ALN' or 'SRC;PTN' syntax, where 'SRC' is the source file path, ':ALN' is the line number and ';PTN' is the lazy matching pattern.
'ARG' specifies the arguments of this probe point, (see PROBE ARGUMENT).

@@ -180,6 +192,13 @@ Delete all probes on schedule().

./perf probe --del='schedule*'

+Add probes at the zfree() function in /bin/zsh
+
+ ./perf probe -u zfree@/bin/zsh
+
+Add probes at the malloc() function in libc
+
+ ./perf probe -u malloc@/lib/libc.so.6

SEE ALSO
--------

2011-04-01 14:47:25

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v3 2.6.39-rc1-tip 26/26] 26: uprobes: filter chain


Loops through the filters callbacks of currently registered
consumers to see if any consumer is interested in tracing this task.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index c950f13..62ccb56 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -450,6 +450,23 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
up_read(&uprobe->consumer_rwsem);
}

+static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
+{
+ struct uprobe_consumer *consumer;
+ bool ret = false;
+
+ down_read(&uprobe->consumer_rwsem);
+ for (consumer = uprobe->consumers; consumer;
+ consumer = consumer->next) {
+ if (!consumer->filter || consumer->filter(consumer, t)) {
+ ret = true;
+ break;
+ }
+ }
+ up_read(&uprobe->consumer_rwsem);
+ return ret;
+}
+
static void add_consumer(struct uprobe *uprobe,
struct uprobe_consumer *consumer)
{

2011-04-01 14:57:05

by Srikar Dronamraju

[permalink] [raw]
Subject: [RESEND] [PATCH v3 2.6.39-rc1-tip 21/26] 21: Uprobe tracer documentation

[Resending the patch after correcting the Subject and with sign-off in
body.]

uprobetracer Documentation.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
Documentation/trace/uprobetrace.txt | 94 +++++++++++++++++++++++++++++++++++
1 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/uprobetrace.txt

diff --git a/Documentation/trace/uprobetrace.txt b/Documentation/trace/uprobetrace.txt
new file mode 100644
index 0000000..6c18ffe
--- /dev/null
+++ b/Documentation/trace/uprobetrace.txt
@@ -0,0 +1,94 @@
+ Uprobe-tracer: Uprobe-based Event Tracing
+ =========================================
+ Documentation is written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENTS=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead of that, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable it via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+ p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a probe
+
+ GRP : Group name. If omitted, use "uprobes" for it.
+ EVENT : Event name. If omitted, the event name is generated
+ based on SYMBOL+offs.
+ PATH : path to an executable or a library.
+ SYMBOL[+offs] : Symbol+offset where the probe is inserted.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ %REG : Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+ echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash
+
+
+ echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and the %ax
+register at the probed text address. Here we are trying to probe the
+function zfree in /bin/zsh:
+
+ # cd /sys/kernel/debug/tracing/
+ # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
+ 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+ # objdump -T /bin/zsh | grep -w zfree
+ 0000000000446420 g DF .text 0000000000000012 Base zfree
+
+0x46420 is the offset of zfree in the object /bin/zsh, which is loaded at
+0x00400000. Hence the command to probe would be:
+
+ # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+ # cat uprobe_events
+ p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. To trace these
+events, you need to enable them:
+
+ # echo 1 > events/uprobes/enable
+
+Let's disable the event after sleeping for some time:
+ # sleep 20
+ # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+ # cat trace
+ # tracer: nop
+ #
+ # TASK-PID CPU# TIMESTAMP FUNCTION
+ # | | | | |
+ zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows that the probe was triggered for pid 24842, with the ip being
+0x446421 and the contents of the ax register being 79.
+

2011-04-02 00:27:13

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 6/26] 6: Uprobes: register/unregister probes.

On Fri, Apr 01, 2011 at 08:03:38PM +0530, Srikar Dronamraju wrote:
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head try_list, success_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> + int ret = -1;
> +
> + if (!inode || !consumer || consumer->next)
> + return -EINVAL;
> +
> + uprobe = uprobes_add(inode, offset);
> + if (!uprobe)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&try_list);
> + INIT_LIST_HEAD(&success_list);
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers) {
> + ret = 0;
> + goto consumers_add;
> + }
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + loff_t vaddr;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> + if (!valid_vma(vma)) {
> + mmput(mm);
> + continue;
> + }
> +
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;

What happens here when someone passes an offset that is out of bounds
for the vma? Looks like we could oops when the kernel tries to set a
breakpoint. Perhaps check wrt ->vm_end?
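
A minimal sketch of the check I have in mind, reusing the vaddr
computation from the patch (hypothetical, not part of the posted code):

	vaddr = vma->vm_start + offset;
	vaddr -= vma->vm_pgoff << PAGE_SHIFT;
	if (vaddr < vma->vm_start || vaddr >= vma->vm_end) {
		mmput(mm);
		continue;	/* offset lies outside this mapping */
	}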

--
steve

2011-04-02 01:03:50

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 6/26] 6: Uprobes: register/unregister probes.

> > +
> > + mm = vma->vm_mm;
> > + if (!valid_vma(vma)) {
> > + mmput(mm);
> > + continue;
> > + }
> > +
> > + vaddr = vma->vm_start + offset;
> > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
>
> What happens here when someone passes an offset that is out of bounds
> for the vma? Looks like we could oops when the kernel tries to set a
> breakpoint. Perhaps check wrt ->vm_end?
>

If the offset is wrong, install_uprobe() will fail, since
grab_cache_page() should not be able to find that page for us,
and hence we return gracefully.

I will surely test this case and I am happy to add a check for
vm_end.

--
Thanks and Regards
Srikar

2011-04-02 01:29:52

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 6/26] 6: Uprobes: register/unregister probes.

On Sat, Apr 02, 2011 at 06:23:53AM +0530, Srikar Dronamraju wrote:
> > > +
> > > + mm = vma->vm_mm;
> > > + if (!valid_vma(vma)) {
> > > + mmput(mm);
> > > + continue;
> > > + }
> > > +
> > > + vaddr = vma->vm_start + offset;
> > > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> >
> > What happens here when someone passes an offset that is out of bounds
> > for the vma? Looks like we could oops when the kernel tries to set a
> > breakpoint. Perhaps check wrt ->vm_end?
> >
>
> If the offset is wrong, install_uprobe will fail, since
> grab_cache_page() should not be able to find that page for us.
> And hence we return gracefully.

Hummm. But grab_cache_page() just wraps find_or_create_page(), so I don't
think it will do what you want.


> I will surely test this case and I am happy to add a check for
> vm_end.

Thanks!


--
steve

Subject: Re: [PATCH v3 2.6.39-rc1-tip 22/26] 22: perf: rename target_module to target

(2011/04/01 23:36), Srikar Dronamraju wrote:
> This is a precursor patch that modifies names that refer to kernel/module
> to also refer to user space names.
>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> tools/perf/builtin-probe.c | 12 ++++++------
> tools/perf/util/probe-event.c | 6 +++---
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
> index 2c0e64d..98db08f 100644
> --- a/tools/perf/builtin-probe.c
> +++ b/tools/perf/builtin-probe.c
> @@ -61,7 +61,7 @@ static struct {
> struct perf_probe_event events[MAX_PROBES];
> struct strlist *dellist;
> struct line_range line_range;
> - const char *target_module;
> + const char *target;
> int max_probe_points;
> struct strfilter *filter;
> } params;
> @@ -241,7 +241,7 @@ static const struct option options[] = {
> "file", "vmlinux pathname"),
> OPT_STRING('s', "source", &symbol_conf.source_prefix,
> "directory", "path to kernel source"),
> - OPT_STRING('m', "module", &params.target_module,
> + OPT_STRING('m', "module", &params.target,
> "modname", "target module name"),
> #endif
> OPT__DRY_RUN(&probe_event_dry_run),
> @@ -327,7 +327,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
> if (!params.filter)
> params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
> NULL);
> - ret = show_available_funcs(params.target_module,
> + ret = show_available_funcs(params.target,
> params.filter);
> strfilter__delete(params.filter);
> if (ret < 0)
> @@ -348,7 +348,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
> usage_with_options(probe_usage, options);
> }
>
> - ret = show_line_range(&params.line_range, params.target_module);
> + ret = show_line_range(&params.line_range, params.target);
> if (ret < 0)
> pr_err(" Error: Failed to show lines. (%d)\n", ret);
> return ret;
> @@ -365,7 +365,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
>
> ret = show_available_vars(params.events, params.nevents,
> params.max_probe_points,
> - params.target_module,
> + params.target,
> params.filter,
> params.show_ext_vars);
> strfilter__delete(params.filter);
> @@ -387,7 +387,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
> if (params.nevents) {
> ret = add_perf_probe_events(params.events, params.nevents,
> params.max_probe_points,
> - params.target_module,
> + params.target,
> params.force_add);
> if (ret < 0) {
> pr_err(" Error: Failed to add events. (%d)\n", ret);
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index 5ddee66..09c53c1 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -1979,7 +1979,7 @@ static int filter_available_functions(struct map *map __unused,
> return 1;
> }
>
> -int show_available_funcs(const char *module, struct strfilter *_filter)
> +int show_available_funcs(const char *elfobject, struct strfilter *_filter)
> {
> struct map *map;
> int ret;
> @@ -1990,9 +1990,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
> if (ret < 0)
> return ret;
>
> - map = kernel_get_module_map(module);
> + map = kernel_get_module_map(elfobject);
> if (!map) {
> - pr_err("Failed to find %s map.\n", (module) ? : "kernel");
> + pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");

Hmm, these changes (module -> elfobject) are reverted by the next patch.
Could you check your patch stack?

Thank you,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

Subject: Re: [PATCH v3 2.6.39-rc1-tip 23/26] 23: perf: show possible probes in a given executable file or library.

(2011/04/01 23:37), Srikar Dronamraju wrote:
> Enhances -F/--funcs option of "perf probe" to list possible probe points in
> an executable file or library. A new option -e/--exe specifies the path of
> the executable or library.

I think you'd better use -x as the abbreviation for --exe, since -e is used
for --event in other subcommands.

And also, it seems this kind of patch should be placed after the perf-probe
uprobe support patch, because without uprobe support, user binary analysis
is meaningless. (As a result, this introduces the -u/--uprobe option without
uprobe support.)


> Show last 10 functions in /bin/zsh.
>
> # perf probe -F -u -e /bin/zsh | tail

I also can't understand why -u is required even if we have -x for user
binaries and -m for kernel modules.

Thanks,

> zstrtol
> ztrcmp
> ztrdup
> ztrduppfx
> ztrftime
> ztrlen
> ztrncpy
> ztrsub
> zwarn
> zwarnnam
>
> Show first 10 functions in /lib/libc.so.6
>
> # perf probe -u -F -e /lib/libc.so.6 | head
> _IO_adjust_column
> _IO_adjust_wcolumn
> _IO_default_doallocate
> _IO_default_finish
> _IO_default_pbackfail
> _IO_default_uflow
> _IO_default_xsgetn
> _IO_default_xsputn
> _IO_do_write@@GLIBC_2.2.5
> _IO_doallocbuf
>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> tools/perf/builtin-probe.c | 9 +++++--
> tools/perf/util/probe-event.c | 56 +++++++++++++++++++++++++++++++----------
> tools/perf/util/probe-event.h | 4 +--
> tools/perf/util/symbol.c | 8 ++++++
> tools/perf/util/symbol.h | 1 +
> 5 files changed, 61 insertions(+), 17 deletions(-)
>
> diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
> index 98db08f..6ceebea 100644
> --- a/tools/perf/builtin-probe.c
> +++ b/tools/perf/builtin-probe.c
> @@ -57,6 +57,7 @@ static struct {
> bool show_ext_vars;
> bool show_funcs;
> bool mod_events;
> + bool uprobes;
> int nevents;
> struct perf_probe_event events[MAX_PROBES];
> struct strlist *dellist;
> @@ -249,6 +250,10 @@ static const struct option options[] = {
> "Set how many probe points can be found for a probe."),
> OPT_BOOLEAN('F', "funcs", &params.show_funcs,
> "Show potential probe-able functions."),
> + OPT_BOOLEAN('u', "uprobe", &params.uprobes,
> + "user space probe events"),
> + OPT_STRING('e', "exe", &params.target,
> + "executable", "userspace executable or library"),
> OPT_CALLBACK('\0', "filter", NULL,
> "[!]FILTER", "Set a filter (with --vars/funcs only)\n"
> "\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
> @@ -327,8 +332,8 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
> if (!params.filter)
> params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
> NULL);
> - ret = show_available_funcs(params.target,
> - params.filter);
> + ret = show_available_funcs(params.target, params.filter,
> + params.uprobes);
> strfilter__delete(params.filter);
> if (ret < 0)
> pr_err(" Error: Failed to show functions."
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index 09c53c1..cf77feb 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -47,6 +47,7 @@
> #include "trace-event.h" /* For __unused */
> #include "probe-event.h"
> #include "probe-finder.h"
> +#include "session.h"
>
> #define MAX_CMDLEN 256
> #define MAX_PROBE_ARGS 128
> @@ -1963,6 +1964,7 @@ int del_perf_probe_events(struct strlist *dellist)
>
> return ret;
> }
> +
> /* TODO: don't use a global variable for filter ... */
> static struct strfilter *available_func_filter;
>
> @@ -1979,30 +1981,58 @@ static int filter_available_functions(struct map *map __unused,
> return 1;
> }
>
> -int show_available_funcs(const char *elfobject, struct strfilter *_filter)
> +static int __show_available_funcs(struct map *map)
> +{
> + if (map__load(map, filter_available_functions)) {
> + pr_err("Failed to load map.\n");
> + return -EINVAL;
> + }
> + if (!dso__sorted_by_name(map->dso, map->type))
> + dso__sort_by_name(map->dso, map->type);
> +
> + dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
> + return 0;
> +}
> +
> +static int available_kernel_funcs(const char *module)
> {
> struct map *map;
> int ret;
>
> - setup_pager();
> -
> ret = init_vmlinux();
> if (ret < 0)
> return ret;
>
> - map = kernel_get_module_map(elfobject);
> + map = kernel_get_module_map(module);
> if (!map) {
> - pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
> + pr_err("Failed to find %s map.\n", (module) ? : "kernel");
> return -EINVAL;
> }
> + return __show_available_funcs(map);
> +}
> +
> +int show_available_funcs(const char *elfobject, struct strfilter *_filter,
> + bool user)
> +{
> + struct map *map;
> + int ret;
> +
> + setup_pager();
> available_func_filter = _filter;
> - if (map__load(map, filter_available_functions)) {
> - pr_err("Failed to load map.\n");
> - return -EINVAL;
> - }
> - if (!dso__sorted_by_name(map->dso, map->type))
> - dso__sort_by_name(map->dso, map->type);
>
> - dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
> - return 0;
> + if (!user)
> + return available_kernel_funcs(elfobject);
> +
> + symbol_conf.try_vmlinux_path = false;
> + symbol_conf.sort_by_name = true;
> + ret = symbol__init();
> + if (ret < 0) {
> + pr_err("Failed to init symbol map.\n");
> + return ret;
> + }
> + map = dso__new_map(elfobject);
> + ret = __show_available_funcs(map);
> + dso__delete(map->dso);
> + map__delete(map);
> + return ret;
> }
> diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
> index 3434fc9..4c24a85 100644
> --- a/tools/perf/util/probe-event.h
> +++ b/tools/perf/util/probe-event.h
> @@ -128,8 +128,8 @@ extern int show_line_range(struct line_range *lr, const char *module);
> extern int show_available_vars(struct perf_probe_event *pevs, int npevs,
> int max_probe_points, const char *module,
> struct strfilter *filter, bool externs);
> -extern int show_available_funcs(const char *module, struct strfilter *filter);
> -
> +extern int show_available_funcs(const char *module, struct strfilter *filter,
> + bool user);
>
> /* Maximum index number of event-name postfix */
> #define MAX_EVENT_INDEX 1024
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index f06c10f..eefeab4 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -2606,3 +2606,11 @@ int machine__load_vmlinux_path(struct machine *self, enum map_type type,
>
> return ret;
> }
> +
> +struct map *dso__new_map(const char *name)
> +{
> + struct dso *dso = dso__new(name);
> + struct map *map = map__new2(0, dso, MAP__FUNCTION);
> +
> + return map;
> +}
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index 713b0b4..3838909 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -211,6 +211,7 @@ char dso__symtab_origin(const struct dso *self);
> void dso__set_long_name(struct dso *self, char *name);
> void dso__set_build_id(struct dso *self, void *build_id);
> void dso__read_running_kernel_build_id(struct dso *self, struct machine *machine);
> +struct map *dso__new_map(const char *name);
> struct symbol *dso__find_symbol(struct dso *self, enum map_type type, u64 addr);
> struct symbol *dso__find_symbol_by_name(struct dso *self, enum map_type type,
> const char *name);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

Subject: Re: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.

(2011/04/01 23:36), Srikar Dronamraju wrote:
> Provides most commonly used filters that most users of uprobes can
> reuse. However this would be useful once we can dynamically associate a
> filter with a uprobe-event tracer.

Hmm, do you mean that these filters are currently not used?
If so, it would be better to remove this from the series and
resend it with actual user code.

Thank you,

>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> include/linux/uprobes.h | 5 +++++
> kernel/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 55 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 26c4d78..34b989f 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -65,6 +65,11 @@ struct uprobe_consumer {
> struct uprobe_consumer *next;
> };
>
> +struct uprobe_simple_consumer {
> + struct uprobe_consumer consumer;
> + pid_t fvalue;
> +};
> +
> struct uprobe {
> struct rb_node rb_node; /* node in the rb tree */
> atomic_t ref;
> diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> index cdd52d0..c950f13 100644
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -1389,6 +1389,56 @@ int uprobe_post_notifier(struct pt_regs *regs)
> return 0;
> }
>
> +bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + if (t->tgid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> +bool uprobes_tid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + if (t->pid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> +bool uprobes_ppid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + pid_t pid;
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + rcu_read_lock();
> + pid = task_tgid_vnr(t->real_parent);
> + rcu_read_unlock();
> +
> + if (pid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> +bool uprobes_sid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + pid_t pid;
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + rcu_read_lock();
> + pid = pid_vnr(task_session(t));
> + rcu_read_unlock();
> +
> + if (pid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> struct notifier_block uprobes_exception_nb = {
> .notifier_call = uprobes_exception_notify,
> .priority = 0x7ffffff0,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

Subject: Re: [PATCH v3 2.6.39-rc1-tip 26/26] 26: uprobes: filter chain

(2011/04/01 23:37), Srikar Dronamraju wrote:
> Loops through the filters callbacks of currently registered
> consumers to see if any consumer is interested in tracing this task.
>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> kernel/uprobes.c | 17 +++++++++++++++++
> 1 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> index c950f13..62ccb56 100644
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -450,6 +450,23 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> up_read(&uprobe->consumer_rwsem);
> }
>
> +static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
> +{
> + struct uprobe_consumer *consumer;
> + bool ret = false;
> +
> + down_read(&uprobe->consumer_rwsem);
> + for (consumer = uprobe->consumers; consumer;
> + consumer = consumer->next) {
> + if (!consumer->filter || consumer->filter(consumer, t)) {
> + ret = true;
> + break;
> + }
> + }
> + up_read(&uprobe->consumer_rwsem);
> + return ret;
> +}
> +

Where is this function called from? This patch seems to be the last one of this series...

Thank you,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2011-04-06 22:42:07

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 26/26] 26: uprobes: filter chain

> > +static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
> > +{
> > + struct uprobe_consumer *consumer;
> > + bool ret = false;
> > +
> > + down_read(&uprobe->consumer_rwsem);
> > + for (consumer = uprobe->consumers; consumer;
> > + consumer = consumer->next) {
> > + if (!consumer->filter || consumer->filter(consumer, t)) {
> > + ret = true;
> > + break;
> > + }
> > + }
> > + up_read(&uprobe->consumer_rwsem);
> > + return ret;
> > +}
> > +
>
> Where is this function called from? This patch seems to be the last one of this series...
>

Sorry for the delayed reply, I was travelling to LFCS.
I still have to connect the filter from trace/perf probe.
That's listed as a todo and that's the next thing I am planning to work on.
Once we have that connection, this filter_chain and the filters that we
defined will be used. Till then, these two patches (one that defines
filter_chain and one that defines the filters) are unused.

--
Thanks and Regards
Srikar

2011-04-06 22:50:24

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 23/26] 23: perf: show possible probes in a given executable file or library.

* Masami Hiramatsu <[email protected]> [2011-04-04 19:15:11]:

> (2011/04/01 23:37), Srikar Dronamraju wrote:
> > Enhances -F/--funcs option of "perf probe" to list possible probe points in
> > an executable file or library. A new option -e/--exe specifies the path of
> > the executable or library.
>
> I think you'd better use -x as the abbreviation for --exe, since -e is used
> for --event in other subcommands.

Okay,

>
> And also, it seems this kind of patch should be placed after the perf-probe
> uprobe support patch, because without uprobe support, user binary analysis
> is meaningless. (As a result, this introduces the -u/--uprobe option without
> uprobe support.)
>

Okay, I can do that. Whether we should do the listing before or after uprobes
can place a breakpoint is arguable.

>
> > Show last 10 functions in /bin/zsh.
> >
> > # perf probe -F -u -e /bin/zsh | tail
>
> I also can't understand why -u is required even if we have -x for user
> binaries and -m for kernel modules.
>

Yes, for the listing we can certainly do without the -u option.

--
Thanks and Regards
Srikar

2011-04-06 23:46:36

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 22/26] 22: perf: rename target_module to target

> >
> > -int show_available_funcs(const char *module, struct strfilter *_filter)
> > +int show_available_funcs(const char *elfobject, struct strfilter *_filter)
> > {
> > struct map *map;
> > int ret;
> > @@ -1990,9 +1990,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
> > if (ret < 0)
> > return ret;
> >
> > - map = kernel_get_module_map(module);
> > + map = kernel_get_module_map(elfobject);
> > if (!map) {
> > - pr_err("Failed to find %s map.\n", (module) ? : "kernel");
> > + pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
>
> Hmm, these changes (module -> elfobject) are reverted by the next patch.
> Could you check your patch stack?
>

In the next patch, we move "map =
kernel_get_module_map(module/elfobject)" to a new function,
available_kernel_funcs(). Even after the next patch,
show_available_funcs() still takes elfobject and not module. If you want
to avoid this, then we would have to either introduce
available_kernel_funcs() in this patch or merge this and the
next patch. Neither of those solutions looks clean to me.

--
Thanks and Regards
Srikar

Subject: Re: [PATCH v3 2.6.39-rc1-tip 26/26] 26: uprobes: filter chain

(2011/04/07 7:41), Srikar Dronamraju wrote:
>>> +static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
>>> +{
>>> + struct uprobe_consumer *consumer;
>>> + bool ret = false;
>>> +
>>> + down_read(&uprobe->consumer_rwsem);
>>> + for (consumer = uprobe->consumers; consumer;
>>> + consumer = consumer->next) {
>>> + if (!consumer->filter || consumer->filter(consumer, t)) {
>>> + ret = true;
>>> + break;
>>> + }
>>> + }
>>> + up_read(&uprobe->consumer_rwsem);
>>> + return ret;
>>> +}
>>> +
>>
>> Where is this function called from? This patch seems to be the last one of this series...
>>
>
> Sorry for the delayed reply, I was travelling to LFCS.
> I still have to connect the filter from trace/perf probe.

I see, and I'd like to suggest separating that series
from this "uprobe" series.
need a consumer of the uprobe. However, it should be as simple
as possible, so that we can focus on reviewing uprobe itself.

> That's listed as a todo and that's the next thing I am planning to work on.

Interesting :-) Could you tell us what the plan will introduce?
How will it be connected? How will we use it?

Thank you,

--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]

2011-04-18 12:21:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 4/26] 4: uprobes: Breakground page replacement.

On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:

> +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> + unsigned long vaddr, uprobe_opcode_t opcode)
> +{
> + struct page *old_page, *new_page;
> + void *vaddr_old, *vaddr_new;
> + struct vm_area_struct *vma;
> + spinlock_t *ptl;
> + pte_t *orig_pte;
> + unsigned long addr;
> + int ret;
> +
> + /* Read the page with vaddr into memory */
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> + if (ret <= 0)
> + return -EINVAL;

Why not return the actual gup() error?
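
Something like this (a sketch; gup() returns the number of pinned pages
on success, so a zero return still needs mapping to an errno):

	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
	if (ret <= 0)
		return ret ? ret : -EINVAL;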

> + ret = -EINVAL;
> +
> + /*
> + * We are interested in text pages only. Our pages of interest
> + * should be mapped for read and execute only. We desist from
> + * adding probes in write mapped pages since the breakpoints
> + * might end up in the file copy.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
> + (VM_READ|VM_EXEC))
> + goto put_out;

Note how you return -EINVAL here when we're attempting to poke at the
wrong kind of mapping.

> + /* Allocate a page */
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
> + if (!new_page) {
> + ret = -ENOMEM;
> + goto put_out;
> + }
> +
> + /*
> + * lock page will serialize against do_wp_page()'s
> + * PageAnon() handling
> + */
> + lock_page(old_page);
> + /* copy the page now that we've got it stable */
> + vaddr_old = kmap_atomic(old_page, KM_USER0);
> + vaddr_new = kmap_atomic(new_page, KM_USER1);
> +
> + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> + /* poke the new insn in, ASSUMES we don't cross page boundary */

Why not test this assertion with a VM_BUG_ON() or something.
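
E.g. (a sketch, using the names from the patch):

	VM_BUG_ON((vaddr & ~PAGE_MASK) + uprobe_opcode_sz > PAGE_SIZE);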

> + addr = vaddr;
> + vaddr &= ~PAGE_MASK;
> + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> +
> + kunmap_atomic(vaddr_new, KM_USER1);
> + kunmap_atomic(vaddr_old, KM_USER0);

The use of KM_foo is obsolete and unneeded.
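
That is, the km_type-less forms suffice here (a sketch, reusing the
names from the patch):

	vaddr_old = kmap_atomic(old_page);
	vaddr_new = kmap_atomic(new_page);
	memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
	kunmap_atomic(vaddr_new);
	kunmap_atomic(vaddr_old);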

> + orig_pte = page_check_address(old_page, tsk->mm, addr, &ptl, 0);
> + if (!orig_pte)
> + goto unlock_out;
> + pte_unmap_unlock(orig_pte, ptl);
> +
> + lock_page(new_page);
> + ret = anon_vma_prepare(vma);
> + if (!ret)
> + ret = replace_page(vma, old_page, new_page, *orig_pte);
> +
> + unlock_page(new_page);
> + if (ret != 0)
> + page_cache_release(new_page);
> +unlock_out:
> + unlock_page(old_page);
> +
> +put_out:
> + put_page(old_page); /* we did a get_page in the beginning */
> + return ret;
> +}
> +
> +/**
> + * read_opcode - read the opcode at a given virtual address.
> + * @tsk: the probed task.
> + * @vaddr: the virtual address to read the opcode.
> + * @opcode: location to store the read opcode.
> + *
> + * Called with tsk->mm->mmap_sem held (for read and with a reference to
> + * tsk->mm.
> + *
> + * For task @tsk, read the opcode at @vaddr and store it in @opcode.
> + * Return 0 (success) or a negative errno.
> + */
> +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> + uprobe_opcode_t *opcode)
> +{
> + struct vm_area_struct *vma;
> + struct page *page;
> + void *vaddr_new;
> + int ret;
> +
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> + if (ret <= 0)
> + return -EFAULT;

Again, why not return the gup() error proper?

> + ret = -EFAULT;
> +
> + /*
> + * We are interested in text pages only. Our pages of interest
> + * should be mapped for read and execute only. We desist from
> + * adding probes in write mapped pages since the breakpoints
> + * might end up in the file copy.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
> + (VM_READ|VM_EXEC))
> + goto put_out;

But now you return -EFAULT if we peek at the wrong kind of mapping,
which is inconsistent with the -EINVAL of write_opcode().

> + lock_page(page);
> + vaddr_new = kmap_atomic(page, KM_USER0);
> + vaddr &= ~PAGE_MASK;
> + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> + kunmap_atomic(vaddr_new, KM_USER0);

Again, lose the KM_foo.

> + unlock_page(page);
> + ret = 0;
> +
> +put_out:
> + put_page(page); /* we did a get_page in the beginning */
> + return ret;
> +}

2011-04-18 12:21:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 6/26] 6: Uprobes: register/unregister probes.

On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:
> +static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> +{
> + int ret = 0;
> +
> + /*TODO: remove breakpoint */
> + if (!ret)
> + atomic_dec(&mm->uprobes_count);
> +
> + return ret;
> +}

> +static void delete_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> +{
> + down_read(&mm->mmap_sem);
> + remove_uprobe(mm, uprobe);
> + list_del(&mm->uprobes_list);
> + up_read(&mm->mmap_sem);
> + mmput(mm);
> +}

> +static void erase_uprobe(struct uprobe *uprobe)
> +{
> + unsigned long flags;
> +
> + synchronize_sched();
> + spin_lock_irqsave(&treelock, flags);
> + rb_erase(&uprobe->rb_node, &uprobes_tree);
> + spin_unlock_irqrestore(&treelock, flags);
> + iput(uprobe->inode);
> +}

remove, delete, erase.. head spins.. ;-)

2011-04-18 12:21:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 5/26] 5: uprobes: Adding and remove a uprobe in a rb tree.

On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:
> +static int match_inode(struct uprobe *uprobe, struct inode *inode,
> + struct rb_node **p)
> +{
> + struct rb_node *n = *p;
> +
> + if (inode < uprobe->inode)
> + *p = n->rb_left;
> + else if (inode > uprobe->inode)
> + *p = n->rb_right;
> + else
> + return 1;
> + return 0;
> +}
> +
> +static int match_offset(struct uprobe *uprobe, loff_t offset,
> + struct rb_node **p)
> +{
> + struct rb_node *n = *p;
> +
> + if (offset < uprobe->offset)
> + *p = n->rb_left;
> + else if (offset > uprobe->offset)
> + *p = n->rb_right;
> + else
> + return 1;
> + return 0;
> +}
>+
> +/* Called with treelock held */
> +static struct uprobe *__find_uprobe(struct inode * inode,
> + loff_t offset, struct rb_node **near_match)
> +{
> + struct rb_node *n = uprobes_tree.rb_node;
> + struct uprobe *uprobe, *u = NULL;
> +
> + while (n) {
> + uprobe = rb_entry(n, struct uprobe, rb_node);
> + if (match_inode(uprobe, inode, &n)) {
> + if (near_match)
> + *near_match = n;
> + if (match_offset(uprobe, offset, &n)) {
> + /* get access ref */
> + atomic_inc(&uprobe->ref);
> + u = uprobe;
> + break;
> + }
> + }
> + }
> + return u;
> +}

Here you break out the match functions for some reason.

> +/*
> + * Find a uprobe corresponding to a given inode:offset
> + * Acquires treelock
> + */
> +static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
> +{
> + struct uprobe *uprobe;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&treelock, flags);
> + uprobe = __find_uprobe(inode, offset, NULL);
> + spin_unlock_irqrestore(&treelock, flags);
> + return uprobe;
> +}
> +
> +/*
> + * Acquires treelock.
> + * Matching uprobe already exists in rbtree;
> + * increment (access refcount) and return the matching uprobe.
> + *
> + * No matching uprobe; insert the uprobe in rb_tree;
> + * get a double refcount (access + creation) and return NULL.
> + */
> +static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> +{
> + struct rb_node **p = &uprobes_tree.rb_node;
> + struct rb_node *parent = NULL;
> + struct uprobe *u;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&treelock, flags);
> + while (*p) {
> + parent = *p;
> + u = rb_entry(parent, struct uprobe, rb_node);
> + if (u->inode > uprobe->inode)
> + p = &(*p)->rb_left;
> + else if (u->inode < uprobe->inode)
> + p = &(*p)->rb_right;
> + else {
> + if (u->offset > uprobe->offset)
> + p = &(*p)->rb_left;
> + else if (u->offset < uprobe->offset)
> + p = &(*p)->rb_right;
> + else {
> + /* get access ref */
> + atomic_inc(&u->ref);
> + goto unlock_return;
> + }
> + }
> + }
> + u = NULL;
> + rb_link_node(&uprobe->rb_node, parent, p);
> + rb_insert_color(&uprobe->rb_node, &uprobes_tree);
> + /* get access + drop ref */
> + atomic_set(&uprobe->ref, 2);
> +
> +unlock_return:
> + spin_unlock_irqrestore(&treelock, flags);
> + return u;
> +}

And here you open-code the match functions..

Why not have something like:

static int match_probe(struct uprobe *l, struct uprobe *r)
{
if (l->inode < r->inode)
return -1;
else if (l->inode > r->inode)
return 1;
else {
if (l->offset < r->offset)
return -1;
else if (l->offset > r->offset)
return 1;
}

return 0;
}

And use that as:

static struct uprobe *
__find_uprobe(struct inode *inode, loff_t offset)
{
struct uprobe r = { .inode = inode, .offset = offset };
struct rb_node *n = uprobes_tree.rb_node;
struct uprobe *uprobe;
int match;

while (n) {
uprobe = rb_entry(n, struct uprobe, rb_node);
match = match_probe(uprobe, &r);
if (!match) {
atomic_inc(&uprobe->ref);
return uprobe;
}
if (match < 0)
n = n->rb_left;
else
n = n->rb_right;
}

return NULL;
}

static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
{
struct rb_node **p = &uprobes_tree.rb_node;
struct rb_node *parent = NULL;
struct uprobe *u;
int match;

while (*p) {
parent = *p;
u = rb_entry(parent, struct uprobe, rb_node);
match = match_probe(u, uprobe);
if (!match) {
atomic_inc(&u->ref);
return u;
}
if (match < 0)
p = &parent->rb_left;
else
p = &parent->rb_right;
}

atomic_set(&uprobe->ref, 2);
rb_link_node(&uprobe->rb_node, parent, p);
rb_insert_color(&uprobe->rb_node, &uprobes_tree);
return uprobe;
}

Isn't that much nicer?

2011-04-18 16:13:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 8/26] 8: uprobes: store/restore original instruction.

On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:

> +static int __copy_insn(struct address_space *mapping, char *insn,
> + unsigned long nbytes, unsigned long offset)
> +{
> + struct page *page;
> + void *vaddr;
> + unsigned long off1;
> + loff_t idx;
> +
> + idx = offset >> PAGE_CACHE_SHIFT;
> + off1 = offset &= ~PAGE_MASK;
> + page = grab_cache_page(mapping, (unsigned long)idx);

What if the page wasn't present due to being swapped out?
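
A sketch of an alternative that does read the page in when needed,
assuming read_mapping_page() is acceptable in this path:

	page = read_mapping_page(mapping, idx, NULL);
	if (IS_ERR(page))
		return PTR_ERR(page);
	/* note: unlike grab_cache_page(), the page comes back unlocked */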

> + if (!page)
> + return -ENOMEM;
> +
> + vaddr = kmap_atomic(page, KM_USER0);
> + memcpy(insn, vaddr + off1, nbytes);
> + kunmap_atomic(vaddr, KM_USER0);
> + unlock_page(page);
> + page_cache_release(page);
> + return 0;
> +}
> +
> +static int copy_insn(struct uprobe *uprobe, unsigned long addr)
> +{
> + struct address_space *mapping;
> + int bytes;
> + unsigned long nbytes;
> +
> + addr &= ~PAGE_MASK;
> + nbytes = PAGE_SIZE - addr;
> + mapping = uprobe->inode->i_mapping;
> +
> + /* Instruction at end of binary; copy only available bytes */
> + if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
> + bytes = uprobe->inode->i_size - uprobe->offset;
> + else
> + bytes = MAX_UINSN_BYTES;
> +
> + /* Instruction at the page-boundary; copy bytes in second page */
> + if (nbytes < bytes) {
> + if (__copy_insn(mapping, uprobe->insn + nbytes,
> + bytes - nbytes, uprobe->offset + nbytes))
> + return -ENOMEM;
> + bytes = nbytes;
> + }
> + return __copy_insn(mapping, uprobe->insn, bytes, uprobe->offset);
> +}

This all made me wonder why we implement read_opcode() again.. I know it's
all slightly different, but still.

> +static struct task_struct *uprobes_get_mm_owner(struct mm_struct *mm)
> +{
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = rcu_dereference(mm->owner);
> + if (tsk)
> + get_task_struct(tsk);
> + rcu_read_unlock();
> + return tsk;
> +}

Naming is somewhat inconsistent: most of your functions have the _uprobe
suffix and now it's a uprobes_ prefix all of a sudden.

> static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> {
> - int ret = 0;
> + struct task_struct *tsk = uprobes_get_mm_owner(mm);
> + int ret;
>
> - /*TODO: install breakpoint */
> - if (!ret)
> + if (!tsk) /* task is probably exiting; bail-out */
> + return -ESRCH;
> +
> + if (!uprobe->copy) {
> + ret = copy_insn(uprobe, mm->uprobes_vaddr);
> + if (ret)
> + goto put_return;
> + if (is_bkpt_insn(uprobe->insn)) {
> + print_insert_fail(tsk, mm->uprobes_vaddr,
> + "breakpoint instruction already exists");
> + ret = -EEXIST;
> + goto put_return;
> + }
> + ret = analyze_insn(tsk, uprobe);
> + if (ret) {
> + print_insert_fail(tsk, mm->uprobes_vaddr,
> + "instruction type cannot be probed");
> + goto put_return;
> + }

If you want to expose this functionality to !root users, printing stuff
to dmesg like that isn't a good idea.

> + uprobe->copy = 1;
> + }
> +
> + ret = set_bkpt(tsk, uprobe, mm->uprobes_vaddr);
> + if (ret < 0)
> + print_insert_fail(tsk, mm->uprobes_vaddr,
> + "failed to insert bkpt instruction");
> + else
> atomic_inc(&mm->uprobes_count);
> +
> +put_return:
> + put_task_struct(tsk);
> return ret;
> }

2011-04-18 16:22:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 9/26] 9: uprobes: mmap and fork hooks.

On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
> + if (vma) {
> + /*
> + * We get here from uprobe_mmap() -- the case where we
> + * are trying to copy an instruction from a page that's
> + * not yet in page cache.
> + *
> + * Read page in before copy.
> + */
> + struct file *filp = vma->vm_file;
> +
> + if (!filp)
> + return -EINVAL;
> + page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
> + }
> + page = grab_cache_page(mapping, idx);

So I don't see why that isn't so for the normal install_uprobe() <-
register_uprobe() path.

2011-04-18 16:30:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 9/26] 9: uprobes: mmap and fork hooks.

On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
> + if (vaddr > ULONG_MAX)
> + /*
> + * We cannot have a virtual address that is
> + * greater than ULONG_MAX
> + */
> + continue;

I'm having trouble with those checks.. while they're not wrong they're
not correct either. Mostly the top address space is where the kernel
lives and on 32-on-64 compat the boundary is much lower still. Ideally
it'd be TASK_SIZE, but that doesn't work since it assumes you're testing
for the current task.

2011-04-18 16:46:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
> Every task is allocated a fixed slot. When a probe is hit, the original
> instruction corresponding to the probe hit is copied to per-task fixed
> slot. Currently we allocate one page of slots for each mm. Bitmaps are
> used to know which slots are free. Each slot is made of 128 bytes so
> that it's cache aligned.
>
> TODO: On massively threaded processes (or if a huge number of processes
> share the same mm), there is a possibility of running out of slots.
> One alternative could be to extend the slots as and when slots are required.

As long as you're single stepping things and not using boosted probes
you can fully serialize the slot usage. Claim a slot on trap and release
the slot on finish. Claiming can wait on a free slot since you already
have the whole SLEEPY thing.
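
A sketch of what I mean; the slot_sem/slot_lock/bitmap/n_slots members
and UPROBES_XOL_SLOT_BYTES are made-up names here, not taken from your
patches:

	static unsigned long xol_claim_slot(struct uprobes_xol_area *area)
	{
		int slot;

		down(&area->slot_sem);	/* sleep until a slot is free */
		spin_lock(&area->slot_lock);
		slot = find_first_zero_bit(area->bitmap, area->n_slots);
		__set_bit(slot, area->bitmap);
		spin_unlock(&area->slot_lock);
		return area->vaddr + slot * UPROBES_XOL_SLOT_BYTES;
	}

	static void xol_release_slot(struct uprobes_xol_area *area,
				     unsigned long slot_addr)
	{
		int slot = (slot_addr - area->vaddr) / UPROBES_XOL_SLOT_BYTES;

		spin_lock(&area->slot_lock);
		__clear_bit(slot, area->bitmap);
		spin_unlock(&area->slot_lock);
		up(&area->slot_sem);	/* wake one waiter */
	}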


> +static int xol_add_vma(struct uprobes_xol_area *area)
> +{
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> + struct file *file;
> + unsigned long addr;
> + int ret = -ENOMEM;
> +
> + mm = get_task_mm(current);
> + if (!mm)
> + return -ESRCH;
> +
> + down_write(&mm->mmap_sem);
> + if (mm->uprobes_xol_area) {
> + ret = -EALREADY;
> + goto fail;
> + }
> +
> + /*
> + * Find the end of the top mapping and skip a page.
> + * If there is no space for PAGE_SIZE above
> + * that, mmap will ignore our address hint.
> + *
> + * We allocate a "fake" unlinked shmem file because
> + * anonymous memory might not be granted execute
> + * permission when the selinux security hooks have
> + * their way.
> + */

That just annoys me, so we're working around some stupid sekurity crap,
executable anonymous maps are perfectly fine, also what do JITs do?

> + vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> + addr = vma->vm_end + PAGE_SIZE;
> + file = shmem_file_setup("uprobes/xol", PAGE_SIZE, VM_NORESERVE);
> + if (!file) {
> + printk(KERN_ERR "uprobes_xol failed to setup shmem_file "
> + "while allocating vma for pid/tgid %d/%d for "
> + "single-stepping out of line.\n",
> + current->pid, current->tgid);
> + goto fail;
> + }
> + addr = do_mmap_pgoff(file, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> + fput(file);
> +
> + if (addr & ~PAGE_MASK) {
> + printk(KERN_ERR "uprobes_xol failed to allocate a vma for "
> + "pid/tgid %d/%d for single-stepping out of "
> + "line.\n", current->pid, current->tgid);
> + goto fail;
> + }
> + vma = find_vma(mm, addr);
> +
> + /* Don't expand vma on mremap(). */
> + vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
> + area->vaddr = vma->vm_start;
> + if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
> + &vma) > 0)
> + ret = 0;
> +
> +fail:
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + return ret;
> +}

2011-04-18 16:48:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 13/26] 13: uprobes: get the breakpoint address.

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> +/**
> + * uprobes_get_bkpt_addr - compute address of bkpt given post-bkpt regs
> + * @regs: Reflects the saved state of the task after it has hit a breakpoint
> + * instruction.
> + * Return the address of the breakpoint instruction.
> + */
> +unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
> +{
> + return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
> +}

This assumes the breakpoint instruction is trap like, not fault like, is
that true for all architectures?

2011-04-18 16:55:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 14/26] 14: x86: x86 specific probe handling

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> +/*
> + * @reg: reflects the saved state of the task
> + * @vaddr: the virtual address to jump to.
> + * Return 0 on success or a -ve number on error.
> + */
> +void set_ip(struct pt_regs *regs, unsigned long vaddr)
> +{
> + regs->ip = vaddr;
> +}

Since we have the cross-architecture function:
instruction_pointer(struct pt_regs*) to read the thing, this ought to be
called set_instruction_pointer(struct pt_regs*, unsigned long) or
somesuch.

2011-04-18 16:58:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 14/26] 14: x86: x86 specific probe handling

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> +void arch_uprobe_enable_sstep(struct pt_regs *regs)
> +{
> + /*
> + * Enable single-stepping by
> + * - Set TF on stack
> + * - Set TIF_SINGLESTEP: Guarantees that TF is set when
> + * returning to user mode.
> + * - Indicate that TF is set by us.
> + */
> + regs->flags |= X86_EFLAGS_TF;
> + set_thread_flag(TIF_SINGLESTEP);
> + set_thread_flag(TIF_FORCED_TF);
> +}
> +
> +void arch_uprobe_disable_sstep(struct pt_regs *regs)
> +{
> + /* Disable single-stepping by clearing what we set */
> + clear_thread_flag(TIF_SINGLESTEP);
> + clear_thread_flag(TIF_FORCED_TF);
> + regs->flags &= ~X86_EFLAGS_TF;
> +}

Don't you lose the single step flag if userspace was already
single-stepping when it hit your breakpoint? Also, you don't seem to
touch the blockstep settings.

2011-04-19 05:57:08

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 14/26] 14: x86: x86 specific probe handling

* Peter Zijlstra <[email protected]> [2011-04-18 18:55:00]:

> On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> > +/*
> > + * @reg: reflects the saved state of the task
> > + * @vaddr: the virtual address to jump to.
> > + * Return 0 on success or a -ve number on error.
> > + */
> > +void set_ip(struct pt_regs *regs, unsigned long vaddr)
> > +{
> > + regs->ip = vaddr;
> > +}
>
> Since we have the cross-architecture function:
> instruction_pointer(struct pt_regs*) to read the thing, this ought to be
> called set_instruction_pointer(struct pt_regs*, unsigned long) or
> somesuch.

Okay, will rename set_ip to set_instruction_pointer.

2011-04-19 06:40:41

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

* Peter Zijlstra <[email protected]> [2011-04-18 18:46:11]:

> On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
> > Every task is allocated a fixed slot. When a probe is hit, the original
> > instruction corresponding to the probe hit is copied to per-task fixed
> > slot. Currently we allocate one page of slots for each mm. Bitmaps are
> > used to know which slots are free. Each slot is made of 128 bytes so
> > that it's cache aligned.
> >
> > TODO: On massively threaded processes (or if a huge number of processes
> > share the same mm), there is a possibility of running out of slots.
> > One alternative could be to extend the slots as and when slots are required.
>
> As long as you're single stepping things and not using boosted probes
> you can fully serialize the slot usage. Claim a slot on trap and release
> the slot on finish. Claiming can wait on a free slot since you already
> have the whole SLEEPY thing.
>

Yes, that's certainly one approach, but that approach makes every
breakpoint hit contend for the spinlock. (In fact we will have to change it
to a mutex lock (as you rightly pointed out) so that we allow threads to
wait when slots are not free.)
applications that have less than 32 threads (which is probably the
default case). If we continue with the current approach, then we
could only add additional page(s) for apps which has more than 32
threads and only when more than 32 __live__ threads have actually hit a
breakpoint.

>
> > +static int xol_add_vma(struct uprobes_xol_area *area)
> > +{
> > + struct vm_area_struct *vma;
> > + struct mm_struct *mm;
> > + struct file *file;
> > + unsigned long addr;
> > + int ret = -ENOMEM;
> > +
> > + mm = get_task_mm(current);
> > + if (!mm)
> > + return -ESRCH;
> > +
> > + down_write(&mm->mmap_sem);
> > + if (mm->uprobes_xol_area) {
> > + ret = -EALREADY;
> > + goto fail;
> > + }
> > +
> > + /*
> > + * Find the end of the top mapping and skip a page.
> > + * If there is no space for PAGE_SIZE above
> > + * that, mmap will ignore our address hint.
> > + *
> > + * We allocate a "fake" unlinked shmem file because
> > + * anonymous memory might not be granted execute
> > + * permission when the selinux security hooks have
> > + * their way.
> > + */
>
> That just annoys me, so we're working around some stupid sekurity crap,
> executable anonymous maps are perfectly fine, also what do JITs do?

Yes, we are working around the selinux security hooks, but do we have a
choice?

James, can you please comment on this?

>
> > + vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> > + addr = vma->vm_end + PAGE_SIZE;
> > + file = shmem_file_setup("uprobes/xol", PAGE_SIZE, VM_NORESERVE);
> > + if (!file) {
> > + printk(KERN_ERR "uprobes_xol failed to setup shmem_file "
> > + "while allocating vma for pid/tgid %d/%d for "
> > + "single-stepping out of line.\n",
> > + current->pid, current->tgid);
> > + goto fail;
> > + }
> > + addr = do_mmap_pgoff(file, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> > + fput(file);
> > +

2011-04-19 06:58:53

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 9/26] 9: uprobes: mmap and fork hooks.

* Peter Zijlstra <[email protected]> [2011-04-18 18:29:23]:

> On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
> > + if (vaddr > ULONG_MAX)
> > + /*
> > + * We cannot have a virtual address that is
> > + * greater than ULONG_MAX
> > + */
> > + continue;
>
> I'm having trouble with those checks: while they're not wrong, they're
> not correct either. Mostly the top address space is where the kernel
> lives, and on 32-on-64 compat the boundary is much lower still. Ideally
> it'd be TASK_SIZE, but that doesn't work since it assumes you're testing
> for the current task.
>

I guess I can use TASK_SIZE_OF(tsk) instead of ULONG_MAX?
I think TASK_SIZE_OF() handles 32-on-64 correctly.
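
For the record, the check would then read something like this (sketch;
tsk being the task whose mappings are being walked):

	/* beyond the probed task's user address space */
	if (vaddr > TASK_SIZE_OF(tsk))
		continue;

On x86_64, TASK_SIZE_OF() tests the target task's TIF_IA32 flag, so a
32-bit task on a 64-bit kernel gets the lower limit.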

--
Thanks and Regards
Srikar

2011-04-19 09:03:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

On Tue, 2011-04-19 at 11:56 +0530, Srikar Dronamraju wrote:
> > > + /*
> > > + * Find the end of the top mapping and skip a page.
> > > + * If there is no space for PAGE_SIZE above
> > > + * that, mmap will ignore our address hint.
> > > + *
> > > + * We allocate a "fake" unlinked shmem file because
> > > + * anonymous memory might not be granted execute
> > > + * permission when the selinux security hooks have
> > > + * their way.
> > > + */
> >
> > That just annoys me, so we're working around some stupid sekurity crap,
> > executable anonymous maps are perfectly fine, also what do JITs do?
>
> Yes, we are working around the selinux security hooks, but do we have a
> choice?

Of course you have a choice, mark selinux broken and let them sort
it ;-)

Anyway, it looks like install_special_mapping(), the thing I think you
ought to use (and I'm sure I said that before), also wobbles around
selinux by using security_file_mmap() even though it's very clearly not a
file mmap (hint: vm_file == NULL).
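
For illustration, a minimal sketch of what that could look like; the
vaddr/page members of uprobes_xol_area are assumed here, and a one-page
special mapping only ever faults in pages[0], so passing &area->page is
safe:

static int xol_add_vma(struct uprobes_xol_area *area)
{
	struct mm_struct *mm = current->mm;
	int ret;

	down_write(&mm->mmap_sem);
	/* Let the mm pick a free range for the single XOL page. */
	area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
					PAGE_SIZE, 0, 0);
	if (area->vaddr & ~PAGE_MASK) {
		ret = area->vaddr;
		goto fail;
	}

	/* Special mapping: vm_file stays NULL, no shmem file needed. */
	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
				      VM_EXEC | VM_DONTCOPY,
				      &area->page);
 fail:
	up_write(&mm->mmap_sem);
	return ret;
}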

2011-04-19 09:13:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

(Dropped the systemtap list since it's misbehaving; please leave it out of future postings)

On Tue, 2011-04-19 at 11:56 +0530, Srikar Dronamraju wrote:
> > > TODO: On massively threaded processes (or if a huge number of processes
> > > share the same mm), there is a possibility of running out of slots.
> > > One alternative could be to extend the slots as and when they are required.
> >
> > As long as you're single stepping things and not using boosted probes
> > you can fully serialize the slot usage. Claim a slot on trap and release
> > the slot on finish. Claiming can wait on a free slot since you already
> > have the whole SLEEPY thing.
> >
>
> Yes, that's certainly one approach, but that approach makes every
> breakpoint hit contend for a spinlock. (In fact we will have to change it
> to a mutex, as you rightly pointed out, so that we allow threads to
> wait when slots are not free.) Assuming a 4K page, we would be taxing
> applications that have fewer than 32 threads (which is probably the
> common case). If we continue with the current approach, then we
> would only add additional page(s) for apps that have more than 32
> threads, and only when more than 32 __live__ threads have actually hit a
> breakpoint.

That very much depends on what you do; some folks think it's entirely
reasonable for processes to have thousands of threads. Now I completely
agree with you that that is not 'normal', but then I think using Java
isn't normal either ;-)

Anyway, avoiding that spinlock/mutex for each trap isn't hard; avoiding
a process-wide cacheline bounce is slightly harder but still not
impossible.

With 32 slots in a 4k page you have 128 bytes per slot to play with; all
we need is a single bit per slot to mark it in use. If a task remembers
which slot it used last and tries to claim it using an atomic test-and-set
of that bit, it will, in the 'normal' case, never contend on a
process-wide cacheline.

In case it does find the slot taken, it'll have to go the slow route and
scan for a free slot, possibly waiting for one to become free.
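
Roughly like this sketch, against the bitmap from the posted patch; the
last_slot hint on uprobe_task and the area's wait queue are additions for
illustration:

static int xol_take_slot(struct uprobes_xol_area *area,
			 struct uprobe_task *utask)
{
	int slot = utask->last_slot;	/* hint: slot we used last time */

	/*
	 * Fast path: in the 'normal' case this bit sits in a cacheline
	 * this task already owns, so nothing bounces.
	 */
	if (!test_and_set_bit(slot, area->bitmap))
		return slot;

	/* Slow path: scan for any free slot, sleep if there is none. */
	for (;;) {
		slot = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
		if (slot < UINSNS_PER_PAGE &&
		    !test_and_set_bit(slot, area->bitmap)) {
			utask->last_slot = slot;
			return slot;
		}
		wait_event(area->wait_q,
			   !bitmap_full(area->bitmap, UINSNS_PER_PAGE));
	}
}

static void xol_put_slot(struct uprobes_xol_area *area, int slot)
{
	clear_bit(slot, area->bitmap);
	wake_up(&area->wait_q);
}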

2011-04-19 13:04:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> + if (unlikely(!utask)) {
> + utask = add_utask();
> +
> + /* Failed to allocate utask for the current task. */
> + BUG_ON(!utask);

That's not really nice is it ;-) means I can make the kernel go BUG by
simply applying memory pressure.

> + utask->state = UTASK_BP_HIT;
> + }

2011-04-19 13:12:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

On Tue, 2011-04-19 at 15:03 +0200, Peter Zijlstra wrote:
> On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> > + if (unlikely(!utask)) {
> > + utask = add_utask();
> > +
> > + /* Failed to allocate utask for the current task. */
> > + BUG_ON(!utask);
>
> That's not really nice is it ;-) means I can make the kernel go BUG by
> simply applying memory pressure.

Agreed,

None of these patches should have a single BUG_ON(). They all must fail
nicely.

-- Steve

>
> > + utask->state = UTASK_BP_HIT;
> > + }

2011-04-19 13:29:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 7/26] 7: x86: analyze instruction and determine fixups.

On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:

> +
> +static void report_bad_prefix(void)
> +{
> + printk(KERN_ERR "uprobes does not currently support probing "
> + "instructions with any of the following prefixes: "
> + "cs:, ds:, es:, ss:, lock:\n");
> +}
> +
> +static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
> +{
> + printk(KERN_ERR "In %d-bit apps, "
> + "uprobes does not currently support probing "
> + "instructions whose first byte is 0x%2.2x\n", mode, op);
> +}
> +
> +static void report_bad_2byte_opcode(uprobe_opcode_t op)
> +{
> + printk(KERN_ERR "uprobes does not currently support probing "
> + "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
> +}

Should these really be KERN_ERR, or is KERN_WARNING a better fit?

Also, can a non-privileged user cause these printks to spam the console
and cause a DoS to the system?

-- Steve

2011-04-19 13:40:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> + probept = uprobes_get_bkpt_addr(regs);
> + down_read(&mm->mmap_sem);
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + if (!valid_vma(vma))
> + continue;
> + if (probept < vma->vm_start || probept > vma->vm_end)
> + continue;
> + u = find_uprobe(vma->vm_file->f_mapping->host,
> + probept - vma->vm_start);
> + break;
> + }

Why the linear vma walk? Surely find_vma() suffices, since there can
only be one vma that matches a particular vaddr.

> + up_read(&mm->mmap_sem);
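
I.e. something along these lines (a sketch reusing the helpers from the
posted patch):

	probept = uprobes_get_bkpt_addr(regs);
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, probept);
	/*
	 * find_vma() returns the first vma with vm_end > probept, so
	 * also check that probept actually lies inside it.
	 */
	if (vma && vma->vm_start <= probept && valid_vma(vma))
		u = find_uprobe(vma->vm_file->f_mapping->host,
				probept - vma->vm_start);
	up_read(&mm->mmap_sem);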

2011-04-19 13:54:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 17/26] 17: uprobes: register a notifier for uprobes.

On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> +struct notifier_block uprobes_exception_nb = {
> + .notifier_call = uprobes_exception_notify,
> + .priority = 0x7ffffff0,
> +};

What kind of magic number is that?

2011-04-19 13:58:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.

On Fri, 2011-04-01 at 20:06 +0530, Srikar Dronamraju wrote:
> Provides the most commonly used filters that most users of uprobes can
> reuse. However, this only becomes useful once we can dynamically associate a
> filter with a uprobe-event tracer.
>
> Signed-off-by: Srikar Dronamraju <[email protected]>
> ---
> include/linux/uprobes.h | 5 +++++
> kernel/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 55 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 26c4d78..34b989f 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -65,6 +65,11 @@ struct uprobe_consumer {
> struct uprobe_consumer *next;
> };
>
> +struct uprobe_simple_consumer {
> + struct uprobe_consumer consumer;
> + pid_t fvalue;
> +};
> +
> struct uprobe {
> struct rb_node rb_node; /* node in the rb tree */
> atomic_t ref;
> diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> index cdd52d0..c950f13 100644
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -1389,6 +1389,56 @@ int uprobe_post_notifier(struct pt_regs *regs)
> return 0;
> }
>
> +bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + if (t->tgid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> +bool uprobes_tid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + if (t->pid == usc->fvalue)
> + return true;
> + return false;
> +}

Pretty much everything using t->pid/t->tgid is doing it wrong.

> +bool uprobes_ppid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + pid_t pid;
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + rcu_read_lock();
> + pid = task_tgid_vnr(t->real_parent);
> + rcu_read_unlock();
> +
> + if (pid == usc->fvalue)
> + return true;
> + return false;
> +}
> +
> +bool uprobes_sid_filter(struct uprobe_consumer *self, struct task_struct *t)
> +{
> + pid_t pid;
> + struct uprobe_simple_consumer *usc;
> +
> + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> + rcu_read_lock();
> + pid = pid_vnr(task_session(t));
> + rcu_read_unlock();
> +
> + if (pid == usc->fvalue)
> + return true;
> + return false;
> +}

And there things go haywire too.

What you want is to save the pid-namespace of the task creating the
filter in your uprobe_simple_consumer and use that to obtain the task's
pid for matching with the provided number.
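
Concretely, something like this sketch; the ns member (and taking a
reference on it) is an addition to the posted struct:

struct uprobe_simple_consumer {
	struct uprobe_consumer consumer;
	pid_t fvalue;
	struct pid_namespace *ns;	/* ns of the filter's creator */
};

/* at filter-creation time; drop with put_pid_ns() on teardown */
usc->ns = get_pid_ns(task_active_pid_ns(current));

bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
{
	struct uprobe_simple_consumer *usc;

	usc = container_of(self, struct uprobe_simple_consumer, consumer);
	/* t's tgid as numbered in the creator's namespace */
	return task_tgid_nr_ns(t, usc->ns) == usc->fvalue;
}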


2011-04-20 13:40:59

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

On Tue, Apr 19, 2011 at 2:26 AM, Srikar Dronamraju
<[email protected]> wrote:
> * Peter Zijlstra <[email protected]> [2011-04-18 18:46:11]:
>
>> On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:

>> > +static int xol_add_vma(struct uprobes_xol_area *area)
>> > +{
>> > + struct vm_area_struct *vma;
>> > + struct mm_struct *mm;
>> > + struct file *file;
>> > + unsigned long addr;
>> > + int ret = -ENOMEM;
>> > +
>> > + mm = get_task_mm(current);
>> > + if (!mm)
>> > + return -ESRCH;
>> > +
>> > + down_write(&mm->mmap_sem);
>> > + if (mm->uprobes_xol_area) {
>> > + ret = -EALREADY;
>> > + goto fail;
>> > + }
>> > +
>> > + /*
>> > + * Find the end of the top mapping and skip a page.
>> > + * If there is no space for PAGE_SIZE above
>> > + * that, mmap will ignore our address hint.
>> > + *
>> > + * We allocate a "fake" unlinked shmem file because
>> > + * anonymous memory might not be granted execute
>> > + * permission when the selinux security hooks have
>> > + * their way.
>> > + */
>>
>> That just annoys me, so we're working around some stupid sekurity crap,
>> executable anonymous maps are perfectly fine, also what do JITs do?
>
> Yes, we are working around the selinux security hooks, but do we have a
> choice?
>
> James, can you please comment on this?

[added myself and stephen, the 2 SELinux maintainers]

This is just wrong. Anything done to 'work around' SELinux in the kernel
is wrong. SELinux access decisions are determined by policy, not by
dirty hacks in the code that subvert whatever security claims the
policy might hope to enforce.

[side note: security_file_mmap() is the right call whether there is a file
or not. It should just be called security_mmap(), but the _file_ has
been around a long time and there has just never been a need to change it]

Now, how to fix the problems you were seeing. If you run a modern
system with a GUI, I'm willing to bet the pop-up window told you
exactly how to fix your problem. If you are not on a GUI, I accept
it's more difficult, as you most likely don't have the setroubleshoot
tools installed to help you out. I'm just guessing what your problem
was, but I think you have two possible solutions:

1) chcon -t unconfined_execmem_t /path/to/your/binary
2) setsebool -P allow_execmem 1

The first will cause the binary to execute in a domain with
permissions to execute anonymous memory, the second will allow all
unconfined domains to execute anonymous memory.

I believe there was a question about how JITs work with SELinux
systems. They work mostly by method #1.

I did hear this question though: On a different but related note, how
is the use of uprobes controlled? Does it apply the same checking as
for ptrace?

Thanks guys! If you have SELinux or LSM problems in the future let me
know. It's likely the solution is easier than you imagine ;)

-Eric

2011-04-20 14:52:39

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes


eparis wrote:

> [...]
> Now, how to fix the problems you were seeing. If you run a modern
> system with a GUI, I'm willing to bet the pop-up window told you
> exactly how to fix your problem. [...]
>
> 1) chcon -t unconfined_execmem_t /path/to/your/binary
> 2) setsebool -P allow_execmem 1
> [...]
> I believe there was a question about how JITs work with SELinux
> systems. They work mostly by method #1.

Actually, that's a solution to a different problem. Here, it's not
particular /path/to/your/binaries that want/need selinux privileges.
It's a kernel-driven debugging facility that needs them temporarily for
arbitrary processes.

It's not like JITs, with known binary names. It's not like GDB, which
simply overwrites existing instructions in the text segment. To make
uprobes work fast (single-step out of line), one needs one or more
temporary pages with unusual mapping permissions.

- FChE

2011-04-20 15:25:30

by Stephen Smalley

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

On Wed, 2011-04-20 at 10:51 -0400, Frank Ch. Eigler wrote:
> eparis wrote:
>
> > [...]
> > Now, how to fix the problems you were seeing. If you run a modern
> > system with a GUI, I'm willing to bet the pop-up window told you
> > exactly how to fix your problem. [...]
> >
> > 1) chcon -t unconfined_execmem_t /path/to/your/binary
> > 2) setsebool -P allow_execmem 1
> > [...]
> > I believe there was a question about how JITs work with SELinux
> > systems. They work mostly by method #1.
>
> Actually, that's a solution to a different problem. Here, it's not
> particular /path/to/your/binaries that want/need selinux privileges.
> It's a kernel-driven debugging facility that needs them temporarily for
> arbitrary processes.
>
> It's not like JITs, with known binary names. It's not like GDB, which
> simply overwrites existing instructions in the text segment. To make
> uprobes work fast (single-step out of line), one needs one or more
> temporary pages with unusual mapping permissions.

I would expect that (2) would solve it, but it couldn't distinguish the
kernel-created mappings from userspace doing the same thing.
Alternatively, you could temporarily switch your credentials around the
mapping operation, e.g.:
old_cred = override_creds(&init_cred);
do_mmap_pgoff(...);
revert_creds(old_cred);

devtmpfs does something similar to avoid triggering permission checks on
userspace when it is internally creating and deleting nodes.

How is the ability to use this facility controlled?

--
Stephen Smalley
National Security Agency

2011-04-21 11:23:30

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.

* Peter Zijlstra <[email protected]> [2011-04-19 15:57:57]:

> On Fri, 2011-04-01 at 20:06 +0530, Srikar Dronamraju wrote:
> > Provides the most commonly used filters that most users of uprobes can
> > reuse. However, this only becomes useful once we can dynamically associate a
> > filter with a uprobe-event tracer.
> >
> > Signed-off-by: Srikar Dronamraju <[email protected]>
> > ---
> > include/linux/uprobes.h | 5 +++++
> > kernel/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 55 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index 26c4d78..34b989f 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -65,6 +65,11 @@ struct uprobe_consumer {
> > struct uprobe_consumer *next;
> > };
> >
> > +struct uprobe_simple_consumer {
> > + struct uprobe_consumer consumer;
> > + pid_t fvalue;
> > +};
> > +
> > struct uprobe {
> > struct rb_node rb_node; /* node in the rb tree */
> > atomic_t ref;
> > diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> > index cdd52d0..c950f13 100644
> > --- a/kernel/uprobes.c
> > +++ b/kernel/uprobes.c
> > @@ -1389,6 +1389,56 @@ int uprobe_post_notifier(struct pt_regs *regs)
> > return 0;
> > }
> >
> > +bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
> > +{
> > + struct uprobe_simple_consumer *usc;
> > +
> > + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> > + if (t->tgid == usc->fvalue)
> > + return true;
> > + return false;
> > +}
> > +
> > +bool uprobes_tid_filter(struct uprobe_consumer *self, struct task_struct *t)
> > +{
> > + struct uprobe_simple_consumer *usc;
> > +
> > + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> > + if (t->pid == usc->fvalue)
> > + return true;
> > + return false;
> > +}
>
> Pretty much everything using t->pid/t->tgid is doing it wrong.
>
> > +bool uprobes_ppid_filter(struct uprobe_consumer *self, struct task_struct *t)
> > +{
> > + pid_t pid;
> > + struct uprobe_simple_consumer *usc;
> > +
> > + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> > + rcu_read_lock();
> > + pid = task_tgid_vnr(t->real_parent);
> > + rcu_read_unlock();
> > +
> > + if (pid == usc->fvalue)
> > + return true;
> > + return false;
> > +}
> > +
> > +bool uprobes_sid_filter(struct uprobe_consumer *self, struct task_struct *t)
> > +{
> > + pid_t pid;
> > + struct uprobe_simple_consumer *usc;
> > +
> > + usc = container_of(self, struct uprobe_simple_consumer, consumer);
> > + rcu_read_lock();
> > + pid = pid_vnr(task_session(t));
> > + rcu_read_unlock();
> > +
> > + if (pid == usc->fvalue)
> > + return true;
> > + return false;
> > +}
>
> And there things go haywire too.
>
> What you want is to save the pid-namespace of the task creating the
> filter in your uprobe_simple_consumer and use that to obtain the task's
> pid for matching with the provided number.
>

Okay, will do, by adding the pid-namespace of the task creating the
filter to the uprobe_simple_consumer.


--
Thanks and Regards
Srikar

2011-04-21 11:34:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.

On Thu, 2011-04-21 at 16:39 +0530, Srikar Dronamraju wrote:
> > What you want is to save the pid-namespace of the task creating the
> > filter in your uprobe_simple_consumer and use that to obtain the task's
> > pid for matching with the provided number.
> >
>
> Okay, will do, by adding the pid-namespace of the task creating the
> filter to the uprobe_simple_consumer.

Maybe you could convert to the global pid namespace on construction and
always use that for comparison.

That would avoid the namespace muck on comparison..
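
I.e. translate once at construction time, sketchily ('user_value' stands
for the number the filter's creator passed in):

static int usc_init_filter_value(struct uprobe_simple_consumer *usc,
				 pid_t user_value)
{
	struct pid *pid = find_get_pid(user_value); /* creator's ns */

	if (!pid)
		return -ESRCH;
	usc->fvalue = pid_nr(pid);		    /* global number */
	put_pid(pid);
	return 0;
}

The filters then compare global numbers, e.g. task_tgid_nr(t) ==
usc->fvalue, with no per-hit namespace lookups.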

2011-04-21 12:03:53

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 18/26] 18: uprobes: commonly used filters.

* Peter Zijlstra <[email protected]> [2011-04-21 13:37:15]:

> On Thu, 2011-04-21 at 16:39 +0530, Srikar Dronamraju wrote:
> > > What you want is to save the pid-namespace of the task creating the
> > > filter in your uprobe_simple_consumer and use that to obtain the task's
> > > pid for matching with the provided number.
> > >
> >
> > Okay, will do, by adding the pid-namespace of the task creating the
> > filter to the uprobe_simple_consumer.
>
> Maybe you could convert to the global pid namespace on construction and
> always use that for comparison.
>
> That would avoid the namespace muck on comparison..
>

Yeah, this idea also seems feasible.

--
Thanks and Regards
Srikar

2011-04-21 14:25:47

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

* Eric Paris <[email protected]> [2011-04-20 09:40:57]:

> On Tue, Apr 19, 2011 at 2:26 AM, Srikar Dronamraju
> <[email protected]> wrote:
> > * Peter Zijlstra <[email protected]> [2011-04-18 18:46:11]:
> >
> >> On Fri, 2011-04-01 at 20:04 +0530, Srikar Dronamraju wrote:
>
> >> > +static int xol_add_vma(struct uprobes_xol_area *area)
> >> > +{
> >> > + struct vm_area_struct *vma;
> >> > + struct mm_struct *mm;
> >> > + struct file *file;
> >> > + unsigned long addr;
> >> > + int ret = -ENOMEM;
> >> > +
> >> > + mm = get_task_mm(current);
> >> > + if (!mm)
> >> > + return -ESRCH;
> >> > +
> >> > + down_write(&mm->mmap_sem);
> >> > + if (mm->uprobes_xol_area) {
> >> > + ret = -EALREADY;
> >> > + goto fail;
> >> > + }
> >> > +
> >> > + /*
> >> > + * Find the end of the top mapping and skip a page.
> >> > + * If there is no space for PAGE_SIZE above
> >> > + * that, mmap will ignore our address hint.
> >> > + *
> >> > + * We allocate a "fake" unlinked shmem file because
> >> > + * anonymous memory might not be granted execute
> >> > + * permission when the selinux security hooks have
> >> > + * their way.
> >> > + */
> >>
> >> That just annoys me, so we're working around some stupid sekurity crap,
> >> executable anonymous maps are perfectly fine, also what do JITs do?
> >
> > Yes, we are working around the selinux security hooks, but do we have a
> > choice?
> >
> > James, can you please comment on this?
>
> [added myself and stephen, the 2 SELinux maintainers]

Thanks for pitching in.

>
> This is just wrong. Anything done to 'work around' SELinux in the kernel
> is wrong. SELinux access decisions are determined by policy, not by
> dirty hacks in the code that subvert whatever security claims the
> policy might hope to enforce.
>
> [side note: security_file_mmap() is the right call whether there is a file
> or not. It should just be called security_mmap(), but the _file_ has
> been around a long time and there has just never been a need to change it]
>

Okay,

> Now, how to fix the problems you were seeing. If you run a modern
> system with a GUI, I'm willing to bet the pop-up window told you
> exactly how to fix your problem. If you are not on a GUI, I accept
> it's more difficult, as you most likely don't have the setroubleshoot
> tools installed to help you out. I'm just guessing what your problem
> was, but I think you have two possible solutions:

I am not running a GUI on my testbox, and I mostly disable selinux
unless I need to test whether uprobes works in a selinux environment.

>
> 1) chcon -t unconfined_execmem_t /path/to/your/binary
> 2) setsebool -P allow_execmem 1
>
> The first will cause the binary to execute in a domain with
> permissions to execute anonymous memory, the second will allow all
> unconfined domains to execute anonymous memory.

We aren't restricted to a particular binary or binaries. We want an
infrastructure that can trace all user-space applications, so the
first option doesn't seem to help us.

If I understand the second option correctly, we would want this command
to be run on any selinux-enabled machine that wants uprobes to work.

>
> I did hear this question though: On a different but related note, how
> is the use of uprobes controlled? Does it apply the same checking as
> for ptrace?
>

Uprobes is an infrastructure to trace user-space applications.
Uprobes uses a single-stepping-out-of-line approach, in contrast to
ptrace's approach of re-inserting the original instruction on a
breakpoint hit.

Uprobes inserts a breakpoint and registers itself with the notifier
mechanism, similar to kprobes. Once we hit the breakpoint, the
notifier callback determines that the breakpoint was indeed
inserted by uprobes, runs the handler, and then single-steps the
original instruction from a __different__ location. This approach
works much better for multithreaded applications and also reduces
context switches.

To achieve this for user-space applications, we need to create that
__different__ location from which we can single-step. Uprobes creates
this location by adding a new executable single-page vma. This page
has slots into which we copy the original instruction.
Once we single-step the original instruction at its reserved slot, we do
the necessary fixups.

Initially we created the single-page executable vma as an anonymous vma.
However, SELinux wasn't happy to see an executable anonymous vma, hence
we added the shmem file. The vma is semi-transparent to the user, i.e.,
the in-kernel uprobes infrastructure creates this vma as and when the
first breakpoint is hit for that process group. This vma is cleaned up
at process-group exit time.

Our idea is to export this to regular users through a system call so
that regular debuggers like gdb can benefit.

> Thanks guys! If you have SELinux or LSM problems in the future let me
> know. It's likely the solution is easier than you imagine ;)

Can I assume that you would be okay with Peter's suggestion of using
install_special_mapping() instead of us calling do_mmap_pgoff()?

--
Thanks and Regards
Srikar

2011-04-21 14:46:18

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

On Thu, 2011-04-21 at 19:41 +0530, Srikar Dronamraju wrote:
> * Eric Paris <[email protected]> [2011-04-20 09:40:57]:

> > Now, how to fix the problems you were seeing. If you run a modern
> > system with a GUI, I'm willing to bet the pop-up window told you
> > exactly how to fix your problem. If you are not on a GUI, I accept
> > it's more difficult, as you most likely don't have the setroubleshoot
> > tools installed to help you out. I'm just guessing what your problem
> > was, but I think you have two possible solutions:
>
> I am not running a GUI on my testbox, and I mostly disable selinux
> unless I need to test whether uprobes works in a selinux environment.
>
> >
> > 1) chcon -t unconfined_execmem_t /path/to/your/binary
> > 2) setsebool -P allow_execmem 1
> >
> > The first will cause the binary to execute in a domain with
> > permissions to execute anonymous memory, the second will allow all
> > unconfined domains to execute anonymous memory.
>
> We aren't restricted to a particular binary or binaries. We want an
> infrastructure that can trace all user-space applications, so the
> first option doesn't seem to help us.
>
> If I understand the second option correctly, we would want this command
> to be run on any selinux-enabled machine that wants uprobes to work.

The check which I'm guessing you ran into problems with is done using
the permissions of the 'current' task. I don't know how your code works
at all, but I would have guessed that the task allocating and adding this
magic page was either perf or systemtap or something like that, not the
'victim' task where the page would actually show up. Is that true? If
so, method 1 (or 2) works. If not, and 'current' is actually the victim,
method 2 would work some of the time, but not always.

If current can be the victim (or some completely random task), we can
probably switch the credentials of the current task to the initial creds
(or something like that) while we do the allocation, so that the
permission will be granted.

Unrelated note: I'd prefer to see that page be READ+EXEC only once it
has been mapped into the victim task. Obviously the portion of the code
that creates this page and sets up the instructions to run is going to
need write access. Maybe this isn't feasible. Maybe this magic page gets
written a lot even after it's been mapped in. But I'd rather, if
possible, know that my victim tasks didn't have a WRITE+EXEC page
available.

-Eric

2011-04-21 16:28:17

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

> Unrelated note: I'd prefer to see that page be READ+EXEC only once it
> has been mapped into the victim task. Obviously the portion of the code
> that creates this page and sets up the instructions to run is going to
> need write access. Maybe this isn't feasible. Maybe this magic page gets
> written a lot even after it's been mapped in. But I'd rather, if
> possible, know that my victim tasks didn't have a WRITE+EXEC page
> available.

AIUI the page never really needs to be writable in the page tables. It's
never written from user mode. It's only written by kernel code, and that
can use a separate momentary kmap to do its writing.
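
A sketch of that write path (the helper name is made up):

/*
 * Copy an instruction into its XOL slot through a momentary kernel
 * mapping, so the user-visible vma can stay read+exec throughout.
 */
static void xol_copy_insn(struct page *page, unsigned long offset,
			  const u8 *insn, int len)
{
	void *kaddr = kmap_atomic(page, KM_USER0);

	memcpy(kaddr + offset, insn, len);
	kunmap_atomic(kaddr, KM_USER0);
	/* keep the instruction cache coherent where that matters */
	flush_dcache_page(page);
}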


Thanks,
Roland

2011-04-21 17:14:33

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 12/26] 12: uprobes: slot allocation for uprobes

* Eric Paris <[email protected]> [2011-04-21 10:45:33]:

> > > 1) chcon -t unconfined_execmem_t /path/to/your/binary
> > > 2) setsebool -P allow_execmem 1
> > >
> > > The first will cause the binary to execute in a domain with
> > > permissions to execute anonymous memory, the second will allow all
> > > unconfined domains to execute anonymous memory.
> >
> > We aren't restricted to a particular binary or binaries. We want an
> > infrastructure that can trace all user-space applications, so the
> > first option doesn't seem to help us.
> >
> > If I understand the second option correctly, we would want this command
> > to be run on any selinux-enabled machine that wants uprobes to work.
>
> The check which I'm guessing you ran into problems with is done using
> the permissions of the 'current' task. I don't know how your code works
> at all, but I would have guessed that the task allocating and adding this
> magic page was either perf or systemtap or something like that, not the
> 'victim' task where the page would actually show up. Is that true? If
> so, method 1 (or 2) works. If not, and 'current' is actually the victim,
> method 2 would work some of the time, but not always.


The vma addition is done in the 'victim' context and not in the
debugger context: the victim hits the breakpoint and requests an
xol_vma for itself.

>
> If current can be the victim (or some completely random task), we can
> probably switch the credentials of the current task to the initial creds
> (or something like that) while we do the allocation, so that the
> permission will be granted.

Okay, I will use the credential-switching logic.

>
> Unrelated note: I'd prefer to see that page be READ+EXEC only once it
> has been mapped into the victim task. Obviously the portion of the code
> that creates this page and sets up the instructions to run is going to
> need write access. Maybe this isn't feasible. Maybe this magic page gets
> written a lot even after it's been mapped in. But I'd rather, if
> possible, know that my victim tasks didn't have a WRITE+EXEC page
> available.
>

We don't need write permission on this vma; READ+EXEC is good enough.
As Roland said, userspace never writes to this vma.

--
Thanks and Regards
Srikar

2011-04-21 17:19:11

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

* Peter Zijlstra <[email protected]> [2011-04-19 15:39:19]:

> On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> > + probept = uprobes_get_bkpt_addr(regs);
> > + down_read(&mm->mmap_sem);
> > + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > + if (!valid_vma(vma))
> > + continue;
> > + if (probept < vma->vm_start || probept > vma->vm_end)
> > + continue;
> > + u = find_uprobe(vma->vm_file->f_mapping->host,
> > + probept - vma->vm_start);
> > + break;
> > + }
>
> Why the linear vma walk? Surely find_vma() suffices, since there can
> only be one vma that matches a particular vaddr.


Agree, will incorporate.

--
Thanks and Regards
Srikar
>
> > + up_read(&mm->mmap_sem);

2011-04-21 17:25:12

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

* Peter Zijlstra <[email protected]> [2011-04-19 15:03:05]:

> On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> > + if (unlikely(!utask)) {
> > + utask = add_utask();
> > +
> > + /* Failed to allocate utask for the current task. */
> > + BUG_ON(!utask);
>
> That's not really nice is it ;-) means I can make the kernel go BUG by
> simply applying memory pressure.
>

The other option would be to remove the probe, set the ip back to
the breakpoint address, and restart the thread.
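
In the handler, that fallback would look roughly like this (sketch; the
probe-removal step is only indicated by a comment, and
set_instruction_pointer() is the rename agreed on earlier in the thread):

	utask = add_utask();
	if (unlikely(!utask)) {
		/*
		 * Allocation failed under memory pressure: don't BUG().
		 * Remove the probe (lookup/removal elided here), rewind
		 * the ip over the int3 and let the task re-execute the
		 * original instruction.
		 */
		set_instruction_pointer(regs, uprobes_get_bkpt_addr(regs));
		return;
	}
	utask->state = UTASK_BP_HIT;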


--
Thanks and Regards
Srikar

2011-04-21 17:39:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 15/26] 15: uprobes: Handling int3 and singlestep exception.

On Thu, 2011-04-21 at 22:40 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2011-04-19 15:03:05]:
>
> > On Fri, 2011-04-01 at 20:05 +0530, Srikar Dronamraju wrote:
> > > + if (unlikely(!utask)) {
> > > + utask = add_utask();
> > > +
> > > + /* Failed to allocate utask for the current task. */
> > > + BUG_ON(!utask);
> >
> > That's not really nice is it ;-) means I can make the kernel go BUG by
> > simply applying memory pressure.
> >
>
> The other option would be to remove the probe, set the ip back to
> the breakpoint address, and restart the thread.

While it's better than GFP_NOFAIL, since it's a return to userspace and
hence cannot be holding locks etc., it's still not pretty. But it's heaps
better than simply bailing the kernel.


2011-04-21 17:45:49

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 7/26] 7: x86: analyze instruction and determine fixups.

* Steven Rostedt <[email protected]> [2011-04-19 09:29:11]:

> On Fri, 2011-04-01 at 20:03 +0530, Srikar Dronamraju wrote:
>
> > +
> > +static void report_bad_prefix(void)
> > +{
> > + printk(KERN_ERR "uprobes does not currently support probing "
> > + "instructions with any of the following prefixes: "
> > + "cs:, ds:, es:, ss:, lock:\n");
> > +}
> > +
> > +static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
> > +{
> > + printk(KERN_ERR "In %d-bit apps, "
> > + "uprobes does not currently support probing "
> > + "instructions whose first byte is 0x%2.2x\n", mode, op);
> > +}
> > +
> > +static void report_bad_2byte_opcode(uprobe_opcode_t op)
> > +{
> > + printk(KERN_ERR "uprobes does not currently support probing "
> > + "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
> > +}
>
> Should these really be KERN_ERR, or is KERN_WARNING a better fit?
>
> Also, can a non-privileged user cause these printks to spam the console
> and cause a DoS to the system?
>

Sometimes the user might try registering a probe at a valid file +
valid offset + valid consumer, but at an instruction that we can't probe.
Trying to figure out why it is failing would then be very hard.

How about pr_warn_ratelimited()?
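
E.g. (sketch):

static void report_bad_prefix(void)
{
	pr_warn_ratelimited("uprobes does not currently support probing "
			    "instructions with any of the following prefixes: "
			    "cs:, ds:, es:, ss:, lock:\n");
}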

--
Thanks and Regards
Srikar

2011-04-21 17:49:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 2.6.39-rc1-tip 7/26] 7: x86: analyze instruction and determine fixups.

On Thu, 2011-04-21 at 23:01 +0530, Srikar Dronamraju wrote:
>
> Sometimes the user might try registering a probe at a valid file +
> valid offset + valid consumer, but at an instruction that we can't probe.
> Trying to figure out why it is failing would then be very hard.

Uhm, how about failing to create the probe to begin with?

You can even do that in userspace, as you can inspect the DSO you're
going to probe (and pretty much have to, since you'll have to pass the
kernel an fd to hand it the proper inode).
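
A sketch of such a register-time check; both helpers named here are
assumptions for illustration, not the posted API:

/*
 * Reject unprobeable instructions when the probe is registered, so
 * userspace gets an error back instead of a printk at hit time.
 */
static int validate_probed_insn(struct uprobe *uprobe)
{
	u8 insn[MAX_UINSN_BYTES];
	int ret;

	/* hypothetical: read the instruction bytes via the inode */
	ret = read_insn_from_inode(uprobe->inode, uprobe->offset,
				   insn, sizeof(insn));
	if (ret < 0)
		return ret;

	/* hypothetical: the opcode/prefix analysis from the arch patch */
	return analyze_insn(insn) ? -ENOTSUPP : 0;
}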