2011-03-14 13:40:12

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes


This patchset implements Uprobes, which enables you to dynamically
break into any routine in a user-space application and collect
information non-disruptively.

This patchset resolves most of the comments from Peter on the previous
posting https://lkml.org/lkml/2010/12/16/65.

Uprobes Patches
This patchset implements inode-based uprobes, which are specified as
<file>:<offset>, where offset is the offset from the start of the map.
For example, a routine that begins 0x570 bytes into a binary's text
mapping would be probed as <that binary>:0x570. The probe-hit overhead
is around 3X the overhead of the pid-based patchset.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction
with a breakpoint instruction. (Uprobes uses a background page
replacement mechanism, which ensures that the breakpoint affects only
that process.)

When a CPU hits the breakpoint instruction, Uprobes is notified of the
trap and finds the associated uprobe. It then executes the associated
handlers. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following
the probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.
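
In outline, the probe-hit path described above looks like the pseudo-C
sketch below. This is illustrative only, not code from this series:
find_uprobe_for(), xol_slot_vaddr(), set_instruction_pointer() and
arch_enable_singlestep() are hypothetical names standing in for the
lookup, XOL-slot and architecture pieces added by later patches.

/* Hypothetical sketch of the probe-hit path; helper names invented. */
static void uprobe_bkpt_hit(struct pt_regs *regs)
{
	struct uprobe *uprobe;
	/* On x86 the trap leaves the ip just past the int3 byte. */
	unsigned long bkpt_vaddr = instruction_pointer(regs) - 1;

	uprobe = find_uprobe_for(current->mm, bkpt_vaddr); /* rb-tree lookup */
	if (!uprobe)
		return;	/* breakpoint is not one of ours */

	handler_chain(uprobe, regs);	/* run the registered consumers */

	/* Single-step the saved copy from the per-mm XOL slot; the
	 * post-processing step then fixes up the ip (and, for calls,
	 * the return address) and resumes after the probepoint. */
	set_instruction_pointer(regs, xol_slot_vaddr(current->mm, uprobe));
	arch_enable_singlestep(regs);
}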

For previous postings, please refer to: http://lkml.org/lkml/2010/8/25/165,
http://lkml.org/lkml/2010/7/27/121, http://lkml.org/lkml/2010/7/12/67,
http://lkml.org/lkml/2010/7/8/239, http://lkml.org/lkml/2010/6/29/299,
http://lkml.org/lkml/2010/6/14/41, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/5/18/307.

This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199). This implementation of uprobes
doesn't depend on utrace.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace-based mechanisms, uprobes tracing doesn't
involve signals, stopping threads, or context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace-based mechanisms use inline single-stepping, i.e., they
copy back the original instruction on hitting a breakpoint. With such
mechanisms, tracers have to stop all threads on a breakpoint hit, or
they will miss some hits at the location of interest. Uprobes uses
execution out of line: the instruction to be traced is analysed at
breakpoint-insertion time and a copy of the instruction is stored at a
different location. On a breakpoint hit, uprobes jumps to that copied
location, single-steps the instruction there, and performs the
necessary fixups after single-stepping.

3. Multiple tracers for an application.
Multiple uprobes-based tracers can work in unison to trace an
application. For instance, one tracer could be interested in generic
events for a particular set of processes, while another tracer is
interested only in one specific event of a particular process that is
part of the previous set of processes.

4. Correlating events from kernel and userspace.
Uprobes can be used with other tools like kprobes and tracepoints, or
as part of higher-level tools like perf, to give a consolidated set of
events from kernel and userspace. In the future we could look at a
single backtrace showing application, library and kernel calls.

Here is the list of TODO items:

- Integrating perf probe with this patchset.
- Breakpoint handling should co-exist with singlestep/blockstep from
another tracer/debugger.
- Queue and dequeue signals delivered from the singlestep until
completion of post-processing.
- Prefiltering (i.e., filtering at the time of probe insertion).
- Return probes.
- Support for other architectures.
- Uprobes booster.
- Replace the W macro with bits in the inat table.

To try, please fetch using
git fetch \
git://git.kernel.org/pub/scm/linux/kernel/git/srikar/linux-uprobes.git \
tip_inode_uprobes_140311:tip_inode_uprobes

Please refer "[RFC] [PATCH 2.6.37-rc5-tip 20/20] 20: tracing: uprobes
trace_event infrastructure" on how to use uprobe_tracer.

Please do provide your valuable comments.

Thanks in advance.
Srikar

Srikar Dronamraju (20)
0: Inode based uprobes
1: mm: Move replace_page() to mm/memory.c
2: X86 specific breakpoint definitions.
3: uprobes: Background page replacement.
4: uprobes: Adding and removing a uprobe in an rb tree.
5: Uprobes: register/unregister probes.
6: x86: analyze instruction and determine fixups.
7: uprobes: store/restore original instruction.
8: uprobes: mmap and fork hooks.
9: x86: architecture specific task information.
10: uprobes: task specific information.
11: uprobes: slot allocation for uprobes
12: uprobes: get the breakpoint address.
13: x86: x86 specific probe handling
14: uprobes: Handling int3 and singlestep exception.
15: x86: uprobes exception notifier for x86.
16: uprobes: register a notifier for uprobes.
17: uprobes: filter chain
18: uprobes: commonly used filters.
19: tracing: Extract out common code for kprobes/uprobes traceevents.
20: tracing: uprobes trace_event interface

arch/Kconfig | 4 +
arch/x86/Kconfig | 3 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/include/asm/uprobes.h | 58 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/signal.c | 14 +
arch/x86/kernel/uprobes.c | 600 ++++++++++++++++
include/linux/mm.h | 2 +
include/linux/mm_types.h | 9 +
include/linux/sched.h | 3 +
include/linux/uprobes.h | 183 +++++
kernel/Makefile | 1 +
kernel/fork.c | 10 +
kernel/trace/Kconfig | 20 +
kernel/trace/Makefile | 2 +
kernel/trace/trace.h | 5 +
kernel/trace/trace_kprobe.c | 861 +-----------------------
kernel/trace/trace_probe.c | 753 ++++++++++++++++++++
kernel/trace/trace_probe.h | 160 +++++
kernel/trace/trace_uprobe.c | 800 ++++++++++++++++++++++
kernel/uprobes.c | 1331 ++++++++++++++++++++++++++++++++++++
mm/ksm.c | 62 --
mm/memory.c | 62 ++
mm/mmap.c | 2 +
24 files changed, 4044 insertions(+), 904 deletions(-)


2011-03-14 13:39:58

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 1/20] 1: mm: Move replace_page() to mm/memory.c


User bkpt will use the background page replacement approach to
insert/delete breakpoints. The background page replacement approach is
based on replace_page(). Hence replace_page() loses its static
attribute.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
---
include/linux/mm.h | 2 ++
mm/ksm.c | 62 ----------------------------------------------------
mm/memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 62 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 679300c..01a0740 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -984,6 +984,8 @@ void account_page_writeback(struct page *page);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte);

/* Is the vma a continuation of the stack vma above it? */
static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
diff --git a/mm/ksm.c b/mm/ksm.c
index c2b2a94..f46e20d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -765,68 +765,6 @@ out:
return err;
}

-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *ptep;
- spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
-
- addr = page_address_in_vma(page, vma);
- if (addr == -EFAULT)
- goto out;
-
- pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- goto out;
-
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud))
- goto out;
-
- pmd = pmd_offset(pud, addr);
- BUG_ON(pmd_trans_huge(*pmd));
- if (!pmd_present(*pmd))
- goto out;
-
- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
- if (!pte_same(*ptep, orig_pte)) {
- pte_unmap_unlock(ptep, ptl);
- goto out;
- }
-
- get_page(kpage);
- page_add_anon_rmap(kpage, vma, addr);
-
- flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
- set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
-
- page_remove_rmap(page);
- if (!page_mapped(page))
- try_to_free_swap(page);
- put_page(page);
-
- pte_unmap_unlock(ptep, ptl);
- err = 0;
-out:
- return err;
-}
-
static int page_trans_compound_anon_split(struct page *page)
{
int ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index 5823698..2a3021c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2669,6 +2669,68 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);

+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma: vma that holds the pte pointing to page
+ * @page: the page we are replacing by kpage
+ * @kpage: the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int err = -EFAULT;
+
+ addr = page_address_in_vma(page, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ BUG_ON(pmd_trans_huge(*pmd));
+ if (!pmd_present(*pmd))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte_same(*ptep, orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ get_page(kpage);
+ page_add_anon_rmap(kpage, vma, addr);
+
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+ page_remove_rmap(page);
+ if (!page_mapped(page))
+ try_to_free_swap(page);
+ put_page(page);
+
+ pte_unmap_unlock(ptep, ptl);
+ err = 0;
+out:
+ return err;
+}
+
int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
{
struct address_space *mapping = inode->i_mapping;

2011-03-14 13:40:20

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.


Provides background page replacement using the replace_page() routine.
Also provides routines to read an opcode from a given virtual address
and to verify whether an instruction is a breakpoint instruction.

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
arch/Kconfig | 12 ++
include/linux/uprobes.h | 70 ++++++++++++++
kernel/Makefile | 1
kernel/uprobes.c | 230 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 313 insertions(+), 0 deletions(-)
create mode 100644 include/linux/uprobes.h
create mode 100644 kernel/uprobes.c

diff --git a/arch/Kconfig b/arch/Kconfig
index f78c2be..c681f16 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,18 @@ config OPTPROBES
depends on KPROBES && HAVE_OPTPROBES
depends on !PREEMPT

+config UPROBES
+ bool "User-space probes (EXPERIMENTAL)"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select MM_OWNER
+ help
+ Uprobes enables kernel subsystems to establish probepoints
+ in user applications and execute handler functions when
+ the probepoints are hit.
+
+ If in doubt, say "N".
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..350ccb0
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,70 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+/*
+ * ARCH_SUPPORTS_UPROBES is not defined.
+ */
+typedef u8 uprobe_opcode_t;
+
+/* Post-execution fixups. Some architectures may define others. */
+#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE 0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP 0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL 0x2
+/* Might sleep while doing Fixup */
+#define UPROBES_FIX_SLEEPY 0x4
+
+#ifndef UPROBES_FIX_DEFAULT
+#define UPROBES_FIX_DEFAULT UPROBES_FIX_IP
+#endif
+
+/* Unexported functions & macros for use by arch-specific code */
+#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))
+
+/*
+ * Most architectures can use the default versions of @read_opcode(),
+ * @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
+ *
+ * @set_ip:
+ * Set the instruction pointer in @regs to @vaddr.
+ * @analyze_insn:
+ * Analyze @user_bkpt->insn. Return 0 if @user_bkpt->insn is an
+ * instruction you can probe, or a negative errno (typically -%EPERM)
+ * otherwise. Determine what sort of
+ * @pre_xol:
+ * @post_xol:
+ * XOL-related fixups @post_xol() (and possibly @pre_xol()) will need
+ * to do for this instruction, and annotate @user_bkpt accordingly.
+ * You may modify @user_bkpt->insn (e.g., the x86_64 port does this
+ * for rip-relative instructions).
+ */
+#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 353d3fe..d562285 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += perf_event.o
obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
+obj-$(CONFIG_UPROBES) += uprobes.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..4f0f61b
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,230 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uprobes.h>
+#include <linux/rmap.h> /* needed for anon_vma_prepare */
+
+struct uprobe {
+ u8 insn[MAX_UINSN_BYTES];
+ u16 fixups;
+};
+
+/*
+ * Called with tsk->mm->mmap_sem held (either for read or write and
+ * with a reference to tsk->mm.
+ */
+static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
+ unsigned long vaddr, uprobe_opcode_t opcode)
+{
+ struct page *old_page, *new_page;
+ void *vaddr_old, *vaddr_new;
+ struct vm_area_struct *vma;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ unsigned long addr;
+ int ret = -EINVAL;
+
+ /* Read the page with vaddr into memory */
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
+ if (ret <= 0)
+ return -EINVAL;
+ ret = -EINVAL;
+
+ /*
+ * check if the page we are interested is read-only mapped
+ * Since we are interested in text pages, Our pages of interest
+ * should be mapped read-only.
+ */
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+ (VM_READ|VM_EXEC))
+ goto put_out;
+
+ /* Allocate a page */
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+ if (!new_page) {
+ ret = -ENOMEM;
+ goto put_out;
+ }
+
+ /*
+ * lock page will serialize against do_wp_page()'s
+ * PageAnon() handling
+ */
+ lock_page(old_page);
+ /* copy the page now that we've got it stable */
+ vaddr_old = kmap_atomic(old_page, KM_USER0);
+ vaddr_new = kmap_atomic(new_page, KM_USER1);
+
+ memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+ /* poke the new insn in, ASSUMES we don't cross page boundary */
+ addr = vaddr;
+ vaddr &= ~PAGE_MASK;
+ memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+ kunmap_atomic(vaddr_new, KM_USER1);
+ kunmap_atomic(vaddr_old, KM_USER0);
+
+ orig_pte = page_check_address(old_page, tsk->mm, addr, &ptl, 0);
+ if (!orig_pte)
+ goto unlock_out;
+ pte_unmap_unlock(orig_pte, ptl);
+
+ lock_page(new_page);
+ if (!anon_vma_prepare(vma))
+ /* flip pages, do_wp_page() will fail pte_same() and bail */
+ ret = replace_page(vma, old_page, new_page, *orig_pte);
+
+ unlock_page(new_page);
+ if (ret != 0)
+ page_cache_release(new_page);
+unlock_out:
+ unlock_page(old_page);
+
+put_out:
+ put_page(old_page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * For task @tsk, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+ uprobe_opcode_t *opcode)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+ void *vaddr_new;
+ int ret;
+
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
+ if (ret <= 0)
+ return -EFAULT;
+ ret = -EFAULT;
+
+ /*
+ * check if the page we are interested is read-only mapped
+ * Since we are interested in text pages, Our pages of interest
+ * should be mapped read-only.
+ */
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+ (VM_READ|VM_EXEC))
+ goto put_out;
+
+ lock_page(page);
+ vaddr_new = kmap_atomic(page, KM_USER0);
+ vaddr &= ~PAGE_MASK;
+ memcpy(&opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+ kunmap_atomic(vaddr_new, KM_USER0);
+ unlock_page(page);
+ ret = uprobe_opcode_sz;
+
+put_out:
+ put_page(page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For task @tsk, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr)
+{
+ return write_opcode(tsk, uprobe, vaddr, UPROBES_BKPT_INSN);
+}
+
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existance of breakpoint instruction.
+ *
+ * For task @tsk, restore the original opcode (opcode) at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr, bool verify)
+{
+ if (verify) {
+ uprobe_opcode_t opcode;
+ int result = read_opcode(tsk, vaddr, &opcode);
+ if (result)
+ return result;
+ if (opcode != UPROBES_BKPT_INSN)
+ return -EINVAL;
+ }
+ return write_opcode(tsk, uprobe, vaddr,
+ *(uprobe_opcode_t *) uprobe->insn);
+}
+
+static void print_insert_fail(struct task_struct *tsk,
+ unsigned long vaddr, const char *why)
+{
+ printk(KERN_ERR "Can't place breakpoint at pid %d vaddr %#lx: %s\n",
+ tsk->pid, vaddr, why);
+}
+
+/*
+ * uprobes_resume_can_sleep - Check if fixup might result in sleep.
+ * @uprobes: the probepoint information.
+ *
+ * Returns true if fixup might result in sleep.
+ */
+static bool uprobes_resume_can_sleep(struct uprobe *uprobe)
+{
+ return uprobe->fixups & UPROBES_FIX_SLEEPY;
+}
+
+/**
+ * is_bkpt_insn - check if instruction is breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default implementation of is_bkpt_insn
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+ uprobe_opcode_t opcode;
+
+ memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
+ return (opcode == UPROBES_BKPT_INSN);
+}

2011-03-14 13:40:30

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and removing a uprobe in an rb tree.


Provides interfaces to add and remove uprobes from the global rb tree.
Also provides the definition of uprobe_consumer and interfaces to add
and remove a consumer from a uprobe. There is a unique uprobe element
in the rb tree for each unique inode:offset pair.

A uprobe gets added to the global rb tree when the first consumer for
that uprobe gets registered. It gets removed from the tree only when
all registered consumers are unregistered.

Multiple consumers can share the same probe. Each consumer provides a
handler that runs on a probe hit and an optional filter callback that
limits the tasks on which the handler runs; a sketch of such a
consumer follows.
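
As an illustration, a kernel client of this API could define a consumer
roughly as below. This is a hedged sketch against the uprobe_consumer
structure introduced in this patch; my_handler(), my_filter() and
target_tgid are made-up names, not part of this patchset.

/* Illustrative consumer: the handler fires only for threads of one
 * process. Names here are hypothetical. */
static pid_t target_tgid;

/* Optional filter: the handler runs iff this returns true. */
static bool my_filter(struct uprobe_consumer *self,
			struct task_struct *task)
{
	return task->tgid == target_tgid;
}

/* Runs on every probe hit that passes the filter. */
static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	pr_info("probe hit in pid %d\n", current->pid);
	return 0;
}

static struct uprobe_consumer my_consumer = {
	.handler	= my_handler,
	.filter		= my_filter,
};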

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 12 +++
kernel/uprobes.c | 225 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 350ccb0..f422bc6 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -23,6 +23,7 @@
* Jim Keniston
*/

+#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
#else
@@ -50,6 +51,17 @@ typedef u8 uprobe_opcode_t;
/* Unexported functions & macros for use by arch-specific code */
#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))

+struct uprobe_consumer {
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+ /*
+ * filter is optional; If a filter exists, handler is run
+ * if and only if filter returns true.
+ */
+ bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+ struct uprobe_consumer *next;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 4f0f61b..6e692a8 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,10 +33,28 @@
#include <linux/rmap.h> /* needed for anon_vma_prepare */

struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref; /* lifetime muck */
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* we hold a ref */
+ loff_t offset;
u8 insn[MAX_UINSN_BYTES];
u16 fixups;
};

+static int valid_vma(struct vm_area_struct *vma)
+{
+ if (!vma->vm_file)
+ return 0;
+
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+ (VM_READ|VM_EXEC))
+ return 1;
+
+ return 0;
+}
+
/*
* Called with tsk->mm->mmap_sem held (either for read or write and
* with a reference to tsk->mm.
@@ -63,8 +81,7 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
* Since we are interested in text pages, Our pages of interest
* should be mapped read-only.
*/
- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
- (VM_READ|VM_EXEC))
+ if (!valid_vma(vma))
goto put_out;

/* Allocate a page */
@@ -140,8 +157,7 @@ int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
* Since we are interested in text pages, Our pages of interest
* should be mapped read-only.
*/
- if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
- (VM_READ|VM_EXEC))
+ if (!valid_vma(vma))
goto put_out;

lock_page(page);
@@ -228,3 +244,204 @@ bool __weak is_bkpt_insn(u8 *insn)
memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
return (opcode == UPROBES_BKPT_INSN);
}
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_MUTEX(uprobes_mutex);
+static DEFINE_SPINLOCK(treelock);
+
+static int match_inode(struct uprobe *uprobe, struct inode *inode,
+ struct rb_node **p)
+{
+ struct rb_node *n = *p;
+
+ if (inode < uprobe->inode)
+ *p = n->rb_left;
+ else if (inode > uprobe->inode)
+ *p = n->rb_right;
+ else
+ return 1;
+ return 0;
+}
+
+static int match_offset(struct uprobe *uprobe, loff_t offset,
+ struct rb_node **p)
+{
+ struct rb_node *n = *p;
+
+ if (offset < uprobe->offset)
+ *p = n->rb_left;
+ else if (offset > uprobe->offset)
+ *p = n->rb_right;
+ else
+ return 1;
+ return 0;
+}
+
+
+/* Called with treelock held */
+static struct uprobe *__find_uprobe(struct inode * inode,
+ loff_t offset, struct rb_node **near_match)
+{
+ struct rb_node *n = uprobes_tree.rb_node;
+ struct uprobe *uprobe, *u = NULL;
+
+ while (n) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ if (match_inode(uprobe, inode, &n)) {
+ if (near_match)
+ *near_match = n;
+ if (match_offset(uprobe, offset, &n)) {
+ atomic_inc(&uprobe->ref);
+ u = uprobe;
+ break;
+ }
+ }
+ }
+ return u;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires treelock
+ */
+static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
+{
+ struct uprobe *uprobe;
+ unsigned long flags;
+
+ spin_lock_irqsave(&treelock, flags);
+ uprobe = __find_uprobe(inode, offset, NULL);
+ spin_unlock_irqrestore(&treelock, flags);
+ return uprobe;
+}
+
+/*
+ * Check if a uprobe is already inserted;
+ * If it does; return refcount incremented uprobe
+ * else add the current uprobe and return NULL
+ * Acquires treelock.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+ struct rb_node **p = &uprobes_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct uprobe *u;
+ unsigned long flags;
+
+ spin_lock_irqsave(&treelock, flags);
+ while (*p) {
+ parent = *p;
+ u = rb_entry(parent, struct uprobe, rb_node);
+ if (u->inode > uprobe->inode)
+ p = &(*p)->rb_left;
+ else if (u->inode < uprobe->inode)
+ p = &(*p)->rb_right;
+ else {
+ if (u->offset > uprobe->offset)
+ p = &(*p)->rb_left;
+ else if (u->offset < uprobe->offset)
+ p = &(*p)->rb_right;
+ else {
+ atomic_inc(&u->ref);
+ goto unlock_return;
+ }
+ }
+ }
+ u = NULL;
+ rb_link_node(&uprobe->rb_node, parent, p);
+ rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+ atomic_set(&uprobe->ref, 2);
+
+unlock_return:
+ spin_unlock_irqrestore(&treelock, flags);
+ return u;
+}
+
+/* Should be called lock-less */
+static void put_uprobe(struct uprobe *uprobe)
+{
+ if (atomic_dec_and_test(&uprobe->ref))
+ kfree(uprobe);
+}
+
+static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
+{
+ struct uprobe *uprobe, *cur_uprobe;
+
+ __iget(inode);
+ uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+
+ if (!uprobe) {
+ iput(inode);
+ return NULL;
+ }
+ uprobe->inode = inode;
+ uprobe->offset = offset;
+
+ /* add to uprobes_tree, sorted on inode:offset */
+ cur_uprobe = insert_uprobe(uprobe);
+
+ /* a uprobe exists for this inode:offset combination*/
+ if (cur_uprobe) {
+ kfree(uprobe);
+ uprobe = cur_uprobe;
+ iput(inode);
+ } else
+ init_rwsem(&uprobe->consumer_rwsem);
+
+ return uprobe;
+}
+
+/* Acquires uprobe->consumer_rwsem */
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_consumer *consumer;
+
+ down_read(&uprobe->consumer_rwsem);
+ consumer = uprobe->consumers;
+ while (consumer) {
+ if (!consumer->filter || consumer->filter(consumer, current))
+ consumer->handler(consumer, regs);
+
+ consumer = consumer->next;
+ }
+ up_read(&uprobe->consumer_rwsem);
+}
+
+/* Acquires uprobe->consumer_rwsem */
+static void add_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ down_write(&uprobe->consumer_rwsem);
+ consumer->next = uprobe->consumers;
+ uprobe->consumers = consumer;
+ up_write(&uprobe->consumer_rwsem);
+ return;
+}
+
+/* Acquires uprobe->consumer_rwsem */
+static int del_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ struct uprobe_consumer *con;
+ int ret = 0;
+
+ down_write(&uprobe->consumer_rwsem);
+ con = uprobe->consumers;
+ if (consumer == con) {
+ uprobe->consumers = con->next;
+ if (!con->next)
+ put_uprobe(uprobe);
+ ret = 1;
+ } else {
+ for (; con; con = con->next) {
+ if (con->next == consumer) {
+ con->next = consumer->next;
+ ret = 1;
+ break;
+ }
+ }
+ }
+ up_write(&uprobe->consumer_rwsem);
+ return ret;
+}

2011-03-14 13:40:46

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.


A probe is specified by a file:offset. On registration, a breakpoint
is inserted for the first consumer; on subsequent registrations, the
consumer gets appended to the existing list of consumers. On
unregistration, the breakpoint is removed only if the consumer happens
to be the last one; for all other unregistrations, the consumer is
simply deleted from the list of consumers.

Probe specifications are maintained in an rb tree. A probe
specification is converted into a uprobe before being stored in the rb
tree. A uprobe can be shared by many consumers; a registration sketch
follows below.
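
With the register_uprobe()/unregister_uprobe() interfaces added in this
patch, a hypothetical kernel client would attach and detach a probe
roughly as below. my_consumer is assumed to be a uprobe_consumer set up
as in the previous patch, and the offset 0x570 is just an example.

/* Sketch: attach/detach a probe at file offset 0x570 of an
 * already-resolved inode. Error handling kept minimal. */
static int attach_probe(struct inode *inode)
{
	int ret;

	ret = register_uprobe(inode, 0x570, &my_consumer);
	if (ret)
		pr_err("register_uprobe failed: %d\n", ret);
	return ret;
}

static void detach_probe(struct inode *inode)
{
	unregister_uprobe(inode, 0x570, &my_consumer);
}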

Given an inode, we get a list of mm's that have mapped the inode.
However, we want to limit the probes to certain processes/threads, and
the filtering should be at thread level. To limit the probes to
certain processes/threads, we would want to walk through the list of
threads whose mm member refers to a given mm.

Here are the options that I thought of:
1. Use mm->owner and walk through the thread group of mm->owner,
siblings of mm->owner, and siblings of the parent of mm->owner. This
should be a good list to traverse, but I am not sure it is
exhaustive enough that all tasks that have their mm set to this
mm_struct are walked through.

2. Install probes on all mm's that have mapped the inode and filter
only at probe-hit time.

3. Walk through do_each_thread; while_each_thread. I think this will
catch all tasks that have their mm set to the given mm. However, this
might be too heavy, especially if the mm corresponds to a library.

4. Add a list_head element to mm_struct and update the list whenever
task->mm gets updated. This could mean extending the current
mm->owner. However, there is some maintenance overhead.

Currently we use the second approach, i.e., probe all mm's that have
mapped the inode and filter only at probe hit.

I would also be interested to know if there are ways to call
replace_page() without having to take mmap_sem.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/mm_types.h | 5 +
include/linux/uprobes.h | 32 ++++++++
kernel/uprobes.c | 195 +++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 221 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26bc4e2..96e4a77 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -315,6 +315,11 @@ struct mm_struct {
#endif
/* How many tasks sharing this mm are OOM_DISABLE */
atomic_t oom_disable_count;
+#ifdef CONFIG_UPROBES
+ unsigned long uprobes_vaddr;
+ struct list_head uprobes_list;
+ atomic_t uprobes_count;
+#endif
};

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f422bc6..8654a06 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -31,6 +31,7 @@
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
+struct uprobe_arch_info {}; /* arch specific info*/

/* Post-execution fixups. Some architectures may define others. */
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
@@ -62,6 +63,19 @@ struct uprobe_consumer {
struct uprobe_consumer *next;
};

+struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref;
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_arch_info arch_info; /* arch specific info if any */
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* Also hold a ref to inode */
+ loff_t offset;
+ u8 insn[MAX_UINSN_BYTES]; /* orig instruction */
+ u16 fixups;
+ int copy;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -79,4 +93,22 @@ struct uprobe_consumer {
* You may modify @user_bkpt->insn (e.g., the x86_64 port does this
* for rip-relative instructions).
*/
+
+#ifdef CONFIG_UPROBES
+extern int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+}
+
+#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 6e692a8..4dbb90f 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -32,17 +32,6 @@
#include <linux/uprobes.h>
#include <linux/rmap.h> /* needed for anon_vma_prepare */

-struct uprobe {
- struct rb_node rb_node; /* node in the rb tree */
- atomic_t ref; /* lifetime muck */
- struct rw_semaphore consumer_rwsem;
- struct uprobe_consumer *consumers;
- struct inode *inode; /* we hold a ref */
- loff_t offset;
- u8 insn[MAX_UINSN_BYTES];
- u16 fixups;
-};
-
static int valid_vma(struct vm_area_struct *vma)
{
if (!vma->vm_file)
@@ -445,3 +434,187 @@ static int del_consumer(struct uprobe *uprobe,
up_write(&uprobe->consumer_rwsem);
return ret;
}
+
+static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: install breakpoint */
+ if (!ret)
+ atomic_inc(&mm->uprobes_count);
+ return ret;
+}
+
+static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: remove breakpoint */
+ if (!ret)
+ atomic_dec(&mm->uprobes_count);
+
+ return ret;
+}
+
+/* Returns 0 if it can install one probe */
+int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head tmp_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+ int ret = -1;
+
+ if (!inode || !consumer || consumer->next)
+ return -EINVAL;
+ uprobe = uprobes_add(inode, offset);
+ INIT_LIST_HEAD(&tmp_list);
+
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers) {
+ ret = 0;
+ goto consumers_add;
+ }
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ loff_t vaddr;
+
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+ if (!valid_vma(vma)) {
+ mmput(mm);
+ continue;
+ }
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ mmput(mm);
+ continue;
+ }
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &tmp_list);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+
+ if (list_empty(&tmp_list)) {
+ ret = 0;
+ goto consumers_add;
+ }
+ list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
+ down_read(&mm->mmap_sem);
+ if (!install_uprobe(mm, uprobe))
+ ret = 0;
+ list_del(&mm->uprobes_list);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+
+consumers_add:
+ add_consumer(uprobe, consumer);
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe);
+ return ret;
+}
+
+void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head tmp_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+ unsigned long flags;
+
+ if (!inode || !consumer)
+ return;
+
+ uprobe = find_uprobe(inode, offset);
+ if (!uprobe) {
+ printk(KERN_ERR "No uprobe found with inode:offset %p %lld\n",
+ inode, offset);
+ return;
+ }
+
+ if (!del_consumer(uprobe, consumer)) {
+ printk(KERN_ERR "No uprobe found with consumer %p\n",
+ consumer);
+ return;
+ }
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers)
+ goto put_unlock;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+
+ if (!atomic_read(&mm->uprobes_count)) {
+ mmput(mm);
+ continue;
+ }
+
+ if (valid_vma(vma)) {
+ loff_t vaddr;
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ mmput(mm);
+ continue;
+ }
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &tmp_list);
+ } else
+ mmput(mm);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
+ down_read(&mm->mmap_sem);
+ remove_uprobe(mm, uprobe);
+ list_del(&mm->uprobes_list);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ }
+
+ /*
+ * There could be other threads that could be spinning on
+ * treelock; some of these threads could be interested in this
+ * uprobe. Give these threads a chance to run.
+ */
+ synchronize_sched();
+ spin_lock_irqsave(&treelock, flags);
+ rb_erase(&uprobe->rb_node, &uprobes_tree);
+ spin_unlock_irqrestore(&treelock, flags);
+ iput(uprobe->inode);
+
+put_unlock:
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe);
+}

2011-03-14 13:40:56

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.


The instruction analysis is based on the x86 instruction decoder; it
determines whether an instruction can be probed and which fixups are
necessary after single-stepping. Instruction analysis is done at probe
insertion time so that we avoid repeating the same analysis every time
a probe is hit.
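
The "can this instruction be probed" check itself is a constant-time
bitmap lookup: the W() macro in this patch packs one bit per possible
first opcode byte into eight 32-bit words. Conceptually it reduces to
the stand-alone demonstration below (illustrative user-space C, not the
kernel code, which uses test_bit() on the good_insns_* tables).

#include <stdint.h>
#include <stdio.h>

/* One bit per opcode byte: conceptual equivalent of the kernel's
 * test_bit(OPCODE1(insn), (unsigned long *)good_insns_64). */
static int opcode_is_good(const uint32_t table[8], uint8_t opcode)
{
	return (table[opcode / 32] >> (opcode % 32)) & 1;
}

int main(void)
{
	uint32_t good[8] = { 0 };

	/* Mark 0xe8 (call rel32) probeable; leave 0xcc (int3) out. */
	good[0xe8 / 32] |= 1u << (0xe8 % 32);

	printf("0xe8 probeable: %d\n", opcode_is_good(good, 0xe8)); /* 1 */
	printf("0xcc probeable: %d\n", opcode_is_good(good, 0xcc)); /* 0 */
	return 0;
}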

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 2
arch/x86/kernel/Makefile | 1
arch/x86/kernel/uprobes.c | 414 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 417 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/uprobes.c

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 5026359..0063207 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -37,4 +37,6 @@ struct uprobe_arch_info {
#else
struct uprobe_arch_info {};
#endif
+struct uprobe;
+extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 743642f..32596aa 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -110,6 +110,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o

obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
obj-$(CONFIG_OF) += devicetree.o
+obj-$(CONFIG_UPROBES) += uprobes.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..cf223a4
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,414 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX 0x8000
+#define UPROBES_FIX_RIP_CX 0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
+
+static const u32 good_insns_64[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+ W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Good-instruction tables for 32-bit apps */
+
+static const u32 good_insns_32[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field. On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes. These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ * but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+ int i;
+
+ for (i = 0; i < insn->prefixes.nbytes; i++) {
+ switch (insn->prefixes.bytes[i]) {
+ case 0x26: /*INAT_PFX_ES */
+ case 0x2E: /*INAT_PFX_CS */
+ case 0x36: /*INAT_PFX_DS */
+ case 0x3E: /*INAT_PFX_SS */
+ case 0xF0: /*INAT_PFX_LOCK */
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static void report_bad_prefix(void)
+{
+ printk(KERN_ERR "uprobes does not currently support probing "
+ "instructions with any of the following prefixes: "
+ "cs:, ds:, es:, ss:, lock:\n");
+}
+
+static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
+{
+ printk(KERN_ERR "In %d-bit apps, "
+ "uprobes does not currently support probing "
+ "instructions whose first byte is 0x%2.2x\n", mode, op);
+}
+
+static void report_bad_2byte_opcode(uprobe_opcode_t op)
+{
+ printk(KERN_ERR "uprobes does not currently support probing "
+ "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, false);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -EPERM;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(32, OPCODE1(insn));
+ return -EPERM;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, true);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -EPERM;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(64, OPCODE1(insn));
+ return -EPERM;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly. To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+ bool fix_ip = true, fix_call = false; /* defaults */
+ insn_get_opcode(insn); /* should be a nop */
+
+ switch (OPCODE1(insn)) {
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ /* ip is correct */
+ fix_ip = false;
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ fix_call = true;
+ break;
+ case 0x9a: /* call absolute - Fix return addr, not ip */
+ fix_call = true;
+ fix_ip = false;
+ break;
+ case 0xff:
+ {
+ int reg;
+ insn_get_modrm(insn);
+ reg = MODRM_REG(insn);
+ if (reg == 2 || reg == 3) {
+ /* call or lcall, indirect */
+ /* Fix return addr; ip is correct. */
+ fix_call = true;
+ fix_ip = false;
+ } else if (reg == 4 || reg == 5) {
+ /* jmp or ljmp, indirect */
+ /* ip is correct. */
+ fix_ip = false;
+ }
+ break;
+ }
+ case 0xea: /* jmp absolute -- ip is correct */
+ fix_ip = false;
+ break;
+ default:
+ break;
+ }
+ if (fix_ip)
+ uprobe->fixups |= UPROBES_FIX_IP;
+ if (fix_call)
+ uprobe->fixups |=
+ (UPROBES_FIX_CALL | UPROBES_FIX_SLEEPY);
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return 0. Otherwise,
+ * rewrite the instruction so that it accesses its memory operand
+ * indirectly through a scratch register. Set uprobe->fixups and
+ * uprobe->arch_info.rip_rela_target_address accordingly. (The contents of the
+ * scratch register will be saved before we single-step the modified
+ * instruction, and restored afterward.) Return 1.
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area. At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static int handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+ u8 *cursor;
+ u8 reg;
+
+ if (!insn_rip_relative(insn))
+ return 0;
+ /*
+ * Point cursor at the modrm byte. The next 4 bytes are the
+ * displacement. Beyond the displacement, for some instructions,
+ * is the immediate operand.
+ */
+ cursor = uprobe->insn + insn->prefixes.nbytes
+ + insn->rex_prefix.nbytes + insn->opcode.nbytes;
+ insn_get_length(insn);
+
+ /*
+ * Convert from rip-relative addressing to indirect addressing
+ * via a scratch register. Change the r/m field from 0x5 (%rip)
+ * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+ */
+ reg = MODRM_REG(insn);
+ if (reg == 0) {
+ /*
+ * The register operand (if any) is either the A register
+ * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+ * REX prefix) %r8. In any case, we know the C register
+ * is NOT the register operand, so we use %rcx (register
+ * #1) for the scratch register.
+ */
+ uprobe->fixups = UPROBES_FIX_RIP_CX;
+ /* Change modrm from 00 000 101 to 00 000 001. */
+ *cursor = 0x1;
+ } else {
+ /* Use %rax (register #0) for the scratch register. */
+ uprobe->fixups = UPROBES_FIX_RIP_AX;
+ /* Change modrm from 00 xxx 101 to 00 xxx 000 */
+ *cursor = (reg << 3);
+ }
+
+ /* Target address = address of next instruction + (signed) offset */
+ uprobe->arch_info.rip_rela_target_address = (long) insn->length
+ + insn->displacement.value;
+ /* Displacement field is gone; slide immediate field (if any) over. */
+ if (insn->immediate.nbytes) {
+ cursor++;
+ memmove(cursor, cursor + insn->displacement.nbytes,
+ insn->immediate.nbytes);
+ }
+ return 1;
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a -ve number on error.
+ */
+int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
+{
+ int ret;
+ struct insn insn;
+
+ uprobe->fixups = 0;
+#ifdef CONFIG_X86_64
+ uprobe->arch_info.rip_rela_target_address = 0x0;
+#endif
+
+ if (is_32bit_app(tsk))
+ ret = validate_insn_32bits(uprobe, &insn);
+ else
+ ret = validate_insn_64bits(uprobe, &insn);
+ if (ret != 0)
+ return ret;
+#ifdef CONFIG_X86_64
+ ret = handle_riprel_insn(uprobe, &insn);
+ if (ret == -1)
+ /* rip-relative; can't XOL */
+ return 0;
+#endif
+ prepare_fixups(uprobe, &insn);
+ return 0;
+}

2011-03-14 13:41:11

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.


On the first probe insertion, copy the original instruction and
opcode. If multiple vmas map the same text area corresponding to an
inode, we only need to copy the instruction once.
The copied instruction is further copied to a designated slot on probe
hit. It's also used at the time of probe removal to restore the
original instruction.
The opcode is used to analyze the instruction and determine the
fixups. Determining the fixups at probe-hit time would mean repeating
the same operation on every probe hit. Hence instruction analysis
using the opcode is done at probe-insertion time.
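
One subtlety in the copy, handled by copy_insn()/__copy_insn() below,
is an instruction whose bytes straddle a page-cache page. The split
arithmetic reduces to the stand-alone demonstration that follows,
assuming x86's MAX_UINSN_BYTES of 16; the offset is illustrative.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define MAX_UINSN_BYTES	16UL	/* x86 value, assumed for illustration */

int main(void)
{
	unsigned long offset = 0x1ffa;	/* example probe offset in the file */
	unsigned long nbytes = PAGE_SIZE - (offset & (PAGE_SIZE - 1));

	if (nbytes < MAX_UINSN_BYTES)	/* instruction crosses the page */
		printf("copy %lu bytes from first page, %lu from next\n",
			nbytes, MAX_UINSN_BYTES - nbytes);
	else
		printf("instruction fits within one page\n");
	return 0;
}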

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 4dbb90f..8730633 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -52,11 +52,12 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
unsigned long vaddr, uprobe_opcode_t opcode)
{
struct page *old_page, *new_page;
+ struct address_space *mapping;
void *vaddr_old, *vaddr_new;
struct vm_area_struct *vma;
spinlock_t *ptl;
pte_t *orig_pte;
- unsigned long addr;
+ loff_t addr;
int ret = -EINVAL;

/* Read the page with vaddr into memory */
@@ -73,6 +74,18 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
if (!valid_vma(vma))
goto put_out;

+ mapping = uprobe->inode->i_mapping;
+ if (mapping != vma->vm_file->f_mapping)
+ goto put_out;
+
+ addr = vma->vm_start + uprobe->offset;
+ addr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (addr > ULONG_MAX)
+ goto put_out;
+
+ if (vaddr != (unsigned long) addr)
+ goto put_out;
+
/* Allocate a page */
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
if (!new_page) {
@@ -91,7 +104,6 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,

memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
/* poke the new insn in, ASSUMES we don't cross page boundary */
- addr = vaddr;
vaddr &= ~PAGE_MASK;
memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);

@@ -435,24 +447,109 @@ static int del_consumer(struct uprobe *uprobe,
return ret;
}

+static int __copy_insn(struct address_space *mapping, char *insn,
+ unsigned long nbytes, unsigned long offset)
+{
+ struct page *page;
+ void *vaddr;
+ unsigned long off1;
+ loff_t idx;
+
+ idx = offset >> PAGE_CACHE_SHIFT;
+ off1 = offset &= ~PAGE_MASK;
+ page = grab_cache_page(mapping, (unsigned long)idx);
+ if (!page)
+ return -ENOMEM;
+
+ vaddr = kmap_atomic(page, KM_USER0);
+ memcpy(insn, vaddr + off1, nbytes);
+ kunmap_atomic(vaddr, KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, unsigned long addr)
+{
+ struct address_space *mapping;
+ int bytes;
+ unsigned long nbytes;
+
+ addr &= ~PAGE_MASK;
+ nbytes = PAGE_SIZE - addr;
+ mapping = uprobe->inode->i_mapping;
+
+ /* Instruction at end of binary; copy only available bytes */
+ if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+ bytes = uprobe->inode->i_size - uprobe->offset;
+ else
+ bytes = MAX_UINSN_BYTES;
+
+ /* Instruction at the page-boundary; copy bytes in second page */
+ if (nbytes < bytes) {
+ if (__copy_insn(mapping, uprobe->insn + nbytes,
+ bytes - nbytes, uprobe->offset + nbytes))
+ return -ENOMEM;
+ bytes = nbytes;
+ }
+ return __copy_insn(mapping, uprobe->insn, bytes, uprobe->offset);
+}
+
static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk;
+ int ret = -EINVAL;

- /*TODO: install breakpoint */
- if (!ret)
+ get_task_struct(mm->owner);
+ tsk = mm->owner;
+ if (!tsk)
+ return ret;
+
+ if (!uprobe->copy) {
+ ret = copy_insn(uprobe, mm->uprobes_vaddr);
+ if (ret)
+ goto put_return;
+ if (is_bkpt_insn(uprobe->insn)) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "breakpoint instruction already exists");
+ ret = -EEXIST;
+ goto put_return;
+ }
+ ret = analyze_insn(tsk, uprobe);
+ if (ret) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "instruction type cannot be probed");
+ goto put_return;
+ }
+ uprobe->copy = 1;
+ }
+
+ ret = set_bkpt(tsk, uprobe, mm->uprobes_vaddr);
+ if (ret < 0)
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "failed to insert bkpt instruction");
+ else
atomic_inc(&mm->uprobes_count);
+
+put_return:
+ put_task_struct(tsk);
return ret;
}

static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk;
+ int ret;
+
+ get_task_struct(mm->owner);
+ tsk = mm->owner;
+ if (!tsk)
+ return -EINVAL;

- /*TODO: remove breakpoint */
+ ret = set_orig_insn(tsk, uprobe, mm->uprobes_vaddr, true);
if (!ret)
atomic_dec(&mm->uprobes_count);
-
+ put_task_struct(tsk);
return ret;
}

2011-03-14 13:41:25

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 8/20] 8: uprobes: mmap and fork hooks.


Provides hooks in mmap and fork.

On fork, after the new mm is created, we need to set its count of
uprobes. On mmap, check whether the mapped region is executable and, if
it is, walk the rbtree and insert actual breakpoints for the already
registered probes corresponding to this inode.
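
The per-probe virtual address follows from the vma layout. A small
user-space sketch of the translation that uprobe_mmap() below performs
(uprobe->offset is a file offset, vm_pgoff is the file page at which
the vma starts; the sample values are illustrative):

#include <stdio.h>

#define PAGE_SHIFT 12

/* File offset -> virtual address within a mapping. */
static unsigned long probe_vaddr(unsigned long vm_start,
                                 unsigned long vm_pgoff,
                                 unsigned long file_offset)
{
        return vm_start + file_offset - (vm_pgoff << PAGE_SHIFT);
}

int main(void)
{
        /* text mapped at 0x400000, vm_pgoff 0, probe at file offset 0x46420 */
        printf("%#lx\n", probe_vaddr(0x400000, 0, 0x46420));  /* 0x446420 */
        return 0;
}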

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 11 +++++-
kernel/fork.c | 2 +
kernel/uprobes.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++
mm/mmap.c | 2 +
4 files changed, 105 insertions(+), 1 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8654a06..1f54aae 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -66,6 +66,7 @@ struct uprobe_consumer {
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
+ struct list_head pending_list;
struct rw_semaphore consumer_rwsem;
struct uprobe_arch_info arch_info; /* arch specific info if any */
struct uprobe_consumer *consumers;
@@ -99,6 +100,10 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+
+struct vm_area_struct;
+extern void uprobe_mmap(struct vm_area_struct *vma);
+extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -109,6 +114,10 @@ static inline void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
{
}
-
+static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
+ struct mm_struct *mm)
+{
+}
+static inline void uprobe_mmap(struct vm_area_struct *vma) { }
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 25e4291..b6d6877 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -67,6 +67,7 @@
#include <linux/user-return-notifier.h>
#include <linux/oom.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -418,6 +419,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
}
/* a new mm has just been created */
arch_dup_mmap(oldmm, mm);
+ uprobe_dup_mmap(oldmm, mm);
retval = 0;
out:
up_write(&mm->mmap_sem);
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 8730633..8ed5b77 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -378,6 +378,7 @@ static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
}
uprobe->inode = inode;
uprobe->offset = offset;
+ INIT_LIST_HEAD(&uprobe->pending_list);

/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
@@ -715,3 +716,93 @@ put_unlock:
mutex_unlock(&uprobes_mutex);
put_uprobe(uprobe);
}
+
+static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
+ struct list_head *tmp_list)
+{
+ struct uprobe *uprobe;
+ struct rb_node *n;
+ unsigned long flags;
+
+ n = uprobes_tree.rb_node;
+ spin_lock_irqsave(&treelock, flags);
+ uprobe = __find_uprobe(inode, 0, &n);
+ for (; n; n = rb_next(n)) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ if (match_inode(uprobe, inode, &n)) {
+ list_add(&uprobe->pending_list, tmp_list);
+ continue;
+ }
+ break;
+ }
+ spin_unlock_irqrestore(&treelock, flags);
+}
+
+/*
+ * Called from dup_mmap.
+ * Called with mm->mmap_sem and old_mm->mmap_sem acquired.
+ */
+void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm)
+{
+ atomic_set(&old_mm->uprobes_count,
+ atomic_read(&mm->uprobes_count));
+}
+
+/*
+ * Called from mmap_region.
+ * Called with mm->mmap_sem acquired.
+ */
+void uprobe_mmap(struct vm_area_struct *vma)
+{
+ struct list_head tmp_list;
+ struct uprobe *uprobe, *u;
+ struct mm_struct *mm;
+ struct inode *inode;
+ unsigned long start;
+ unsigned long pgoff;
+
+ if (!valid_vma(vma))
+ return;
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mm = vma->vm_mm;
+ inode = vma->vm_file->f_mapping->host;
+ start = vma->vm_start;
+ pgoff = vma->vm_pgoff;
+ __iget(inode);
+
+ up_write(&mm->mmap_sem);
+ mutex_lock(&uprobes_mutex);
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start);
+ /* Not the same vma */
+ if (!vma || vma->vm_start != start ||
+ vma->vm_pgoff != pgoff || !valid_vma(vma) ||
+ inode->i_mapping != vma->vm_file->f_mapping)
+ goto mmap_out;
+
+ add_to_temp_list(vma, inode, &tmp_list);
+ list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+ loff_t vaddr;
+
+ list_del(&uprobe->pending_list);
+ vaddr = vma->vm_start + uprobe->offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr > ULONG_MAX)
+ /*
+ * We cannot have a virtual address that is
+ * greater than ULONG_MAX
+ */
+ continue;
+ mm->uprobes_vaddr = (unsigned long)vaddr;
+ install_uprobe(mm, uprobe);
+ }
+
+mmap_out:
+ mutex_unlock(&uprobes_mutex);
+ iput(inode);
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+}
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ec8eb5..3836c08 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
#include <linux/perf_event.h>
#include <linux/audit.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1366,6 +1367,7 @@ out:
mm->locked_vm += (len >> PAGE_SHIFT);
} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
+ uprobe_mmap(vma);
return addr;

unmap_and_free_vma:

2011-03-14 13:41:35

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 9/20] 9: x86: architecture specific task information.


On x86_64, we need to support rip-relative instructions.
Rip-relative instructions are handled by saving the scratch register on
probe hit and restoring it after the single-step. The value saved at
probe hit is specific to each task, hence it is kept as part of
uprobe_task_arch_info.

Since x86_32 has no rip-relative instructions, none of this is needed
on x86_32.
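
As an illustration (the mnemonics echo the examples in the x86
analysis code's comments, not new cases): an instruction such as

        movl %edx, 0x1234(%rip)

is rewritten at analysis time into the equivalent

        movl %edx, (%rax)

The target address (address of the next instruction + 0x1234) is
computed into %rax just before the single-step, with the task's own
%rax value preserved in saved_scratch_register and restored after the
step. %rcx is used as the scratch register instead when the
instruction's register operand is the A register.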

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 0063207..e38950f 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -34,8 +34,13 @@ typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {
unsigned long rip_rela_target_address;
};
+
+struct uprobe_task_arch_info {
+ unsigned long saved_scratch_register;
+};
#else
struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);

2011-03-14 13:41:46

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 10/20] 10: uprobes: task specific information.


Uprobes needs to maintain some task-specific information, including
whether a task is currently uprobed, the uprobe currently being
handled, any arch-specific information (for example, to handle
rip-relative instructions), and the per-task slot to which the original
instruction is copied before single-stepping.

Provides routines to create, manage and free this task-specific
information.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/uprobes.h | 25 +++++++++++++++++++++++++
kernel/fork.c | 4 ++++
kernel/uprobes.c | 38 ++++++++++++++++++++++++++++++++++++++
4 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c15936f..d7c9cd0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1527,6 +1527,9 @@ struct task_struct {
unsigned long memsw_bytes; /* uncharged mem+swap usage */
} memcg_batch;
#endif
+#ifdef CONFIG_UPROBES
+ struct uprobe_task *utask;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 1f54aae..d5be840 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -26,12 +26,14 @@
#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
+struct uprobe_task_arch_info; /* arch specific task info */
#else
/*
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {}; /* arch specific info*/
+struct uprobe_task_arch_info {}; /* arch specific task info */

/* Post-execution fixups. Some architectures may define others. */
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
@@ -77,6 +79,27 @@ struct uprobe {
int copy;
};

+enum uprobe_task_state {
+ UTASK_RUNNING,
+ UTASK_BP_HIT,
+ UTASK_SSTEP
+};
+
+/*
+ * uprobe_task -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ */
+struct uprobe_task {
+ unsigned long xol_vaddr;
+ unsigned long vaddr;
+
+ enum uprobe_task_state state;
+ struct uprobe_task_arch_info tskinfo;
+
+ struct uprobe *active_uprobe;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -100,6 +123,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+extern void uprobe_free_utask(struct task_struct *tsk);

struct vm_area_struct;
extern void uprobe_mmap(struct vm_area_struct *vma);
@@ -118,6 +142,7 @@ static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
struct mm_struct *mm)
{
}
+static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobe_mmap(struct vm_area_struct *vma) { }
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b6d6877..de3d10a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);

+ uprobe_free_utask(tsk);
if (!profile_handoff_task(tsk))
free_task(tsk);
}
@@ -1214,6 +1215,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->pi_state_list);
p->pi_state_cache = NULL;
#endif
+#ifdef CONFIG_UPROBES
+ p->utask = NULL;
+#endif
/*
* sigaltstack should be cleared when sharing the same VM
*/
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 8ed5b77..f3540ff 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -806,3 +806,41 @@ mmap_out:
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
}
+
+/*
+ * Called with no locks held.
+ * Called in the context of an exiting or an exec-ing thread.
+ */
+void uprobe_free_utask(struct task_struct *tsk)
+{
+ struct uprobe_task *utask = tsk->utask;
+
+ if (!utask)
+ return;
+
+ if (utask->active_uprobe)
+ put_uprobe(utask->active_uprobe);
+ kfree(utask);
+ tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - negative errno otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+ struct uprobe_task *utask;
+
+ utask = kzalloc(sizeof *utask, GFP_KERNEL);
+ if (unlikely(utask == NULL))
+ return ERR_PTR(-ENOMEM);
+
+ utask->active_uprobe = NULL;
+ current->utask = utask;
+ return utask;
+}

2011-03-14 13:41:55

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 2.6.38-rc8-tip 11/20] 11: uprobes: slot allocation for uprobes


Every task is allocated a fixed slot. When a probe is hit, the original
instruction corresponding to the probe hit is copied to the per-task
fixed slot. Currently we allocate one page of slots for each mm.
Bitmaps are used to track which slots are free. Each slot is 128 bytes
so that it is cache aligned.

TODO: On massively threaded processes (or if a huge number of processes
share the same mm), there is a possibility of running out of slots.
One alternative could be to extend the slots as and when required.
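
With 4096-byte pages and 128-byte slots this gives 32 slots per mm. A
minimal user-space sketch of the bitmap-based allocation (stand-ins for
the kernel bitmap helpers; the 4K page size and the XOL address are
assumptions for the example):

#include <stdio.h>

#define PAGE_SIZE       4096UL
#define SLOT_BYTES      128UL
#define NSLOTS          (PAGE_SIZE / SLOT_BYTES)        /* 32 */

static unsigned long bitmap;    /* bit n set => slot n in use */

/* Return the address of a free slot in the XOL page, or 0 if full. */
static unsigned long take_slot(unsigned long xol_vaddr)
{
        unsigned int n;

        for (n = 0; n < NSLOTS; n++) {
                if (!(bitmap & (1UL << n))) {
                        bitmap |= 1UL << n;
                        return xol_vaddr + n * SLOT_BYTES;
                }
        }
        return 0;
}

int main(void)
{
        printf("%#lx\n", take_slot(0x7f0000000000));    /* slot 0 */
        printf("%#lx\n", take_slot(0x7f0000000000));    /* slot 1: +0x80 */
        return 0;
}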

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/mm_types.h | 4 +
include/linux/uprobes.h | 21 ++++
kernel/fork.c | 4 +
kernel/uprobes.c | 241 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 270 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96e4a77..6dfa507 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,9 @@
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
+#ifdef CONFIG_UPROBES
+#include <linux/uprobes.h>
+#endif
#include <asm/page.h>
#include <asm/mmu.h>

@@ -319,6 +322,7 @@ struct mm_struct {
unsigned long uprobes_vaddr;
struct list_head uprobes_list;
atomic_t uprobes_count;
+ struct uprobes_xol_area *uprobes_xol_area;
#endif
};

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d5be840..c42975b 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -101,6 +101,25 @@ struct uprobe_task {
};

/*
+ * Every thread gets its own slot. Once it's assigned a slot, it
+ * keeps that slot until the thread exits. Only a fixed number
+ * of slots is allocated.
+ */
+
+struct uprobes_xol_area {
+ spinlock_t slot_lock; /* protects bitmap and slot (de)allocation*/
+ unsigned long *bitmap; /* 0 = free slot */
+ struct page *page;
+
+ /*
+ * We keep the vma's vm_start rather than a pointer to the vma
+ * itself. The probed process or a naughty kernel module could make
+ * the vma go away, and we must handle that reasonably gracefully.
+ */
+ unsigned long vaddr; /* Page(s) of instruction slots */
+};
+
+/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
*
@@ -128,6 +147,7 @@ extern void uprobe_free_utask(struct task_struct *tsk);
struct vm_area_struct;
extern void uprobe_mmap(struct vm_area_struct *vma);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
+extern void uprobes_free_xol_area(struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -144,5 +164,6 @@ static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
}
static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobe_mmap(struct vm_area_struct *vma) { }
+static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index de3d10a..0afa0cd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -551,6 +551,7 @@ void mmput(struct mm_struct *mm)
might_sleep();

if (atomic_dec_and_test(&mm->mm_users)) {
+ uprobes_free_xol_area(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
@@ -677,6 +678,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
memcpy(mm, oldmm, sizeof(*mm));

/* Initializing for Swap token stuff */
+#ifdef CONFIG_UPROBES
+ mm->uprobes_xol_area = NULL;
+#endif
mm->token_priority = 0;
mm->last_interval = 0;

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index f3540ff..9d3d402 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -31,12 +31,28 @@
#include <linux/slab.h>
#include <linux/uprobes.h>
#include <linux/rmap.h> /* needed for anon_vma_prepare */
+#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
+#include <linux/file.h> /* needed for fput() */

+#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma,
+ * but not an XOL vma.
+ * - Return 1 if the specified virtual address is in an
+ * executable vma, but not in an XOL vma.
+ */
static int valid_vma(struct vm_area_struct *vma)
{
+ struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+
if (!vma->vm_file)
return 0;

+ if (area && (area->vaddr == vma->vm_start))
+ return 0;
+
if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
(VM_READ|VM_EXEC))
return 1;
@@ -807,6 +823,227 @@ mmap_out:
down_write(&mm->mmap_sem);
}

+/* Slot allocation for XOL */
+
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct file *file;
+ unsigned long addr;
+ int ret = -ENOMEM;
+
+ mm = get_task_mm(current);
+ if (!mm)
+ return -ESRCH;
+
+ down_write(&mm->mmap_sem);
+ if (mm->uprobes_xol_area) {
+ ret = -EALREADY;
+ goto fail;
+ }
+
+ /*
+ * Find the end of the top mapping and skip a page.
+ * If there is no space for PAGE_SIZE above
+ * that, mmap will ignore our address hint.
+ *
+ * We allocate a "fake" unlinked shmem file because
+ * anonymous memory might not be granted execute
+ * permission when the selinux security hooks have
+ * their way.
+ */
+ vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
+ addr = vma->vm_end + PAGE_SIZE;
+ file = shmem_file_setup("uprobes/xol", PAGE_SIZE, VM_NORESERVE);
+ if (!file) {
+ printk(KERN_ERR "uprobes_xol failed to setup shmem_file "
+ "while allocating vma for pid/tgid %d/%d for "
+ "single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ addr = do_mmap_pgoff(file, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
+ fput(file);
+
+ if (addr & ~PAGE_MASK) {
+ printk(KERN_ERR "uprobes_xol failed to allocate a vma for "
+ "pid/tgid %d/%d for single-stepping out of "
+ "line.\n", current->pid, current->tgid);
+ goto fail;
+ }
+ vma = find_vma(mm, addr);
+
+ /* Don't expand vma on mremap(). */
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+ area->vaddr = vma->vm_start;
+ if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
+ &vma) > 0)
+ ret = 0;
+
+fail:
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+ struct uprobes_xol_area *area = NULL;
+
+ area = kzalloc(sizeof(*area), GFP_USER);
+ if (unlikely(!area))
+ return NULL;
+
+ area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+ GFP_USER);
+
+ if (!area->bitmap)
+ goto fail;
+
+ spin_lock_init(&area->slot_lock);
+ if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
+ task_lock(current);
+ if (!current->mm->uprobes_xol_area) {
+ current->mm->uprobes_xol_area = area;
+ task_unlock(current);
+ return area;
+ }
+ task_unlock(current);
+ }
+
+fail:
+ if (area) {
+ if (area->bitmap)
+ kfree(area->bitmap);
+ kfree(area);
+ }
+ return current->mm->uprobes_xol_area;
+}
+
+/*
+ * uprobes_free_xol_area - Free the area allocated for slots.
+ */
+void uprobes_free_xol_area(struct mm_struct *mm)
+{
+ struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+ if (!area)
+ return;
+
+ put_page(area->page);
+ kfree(area->bitmap);
+ kfree(area);
+}
+
+/*
+ * Find a free slot.
+ * - Search the bitmap for a free slot.
+ * - If no slot is free, return 0.
+ *
+ * Called when holding uprobes_xol_area->slot_lock
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+ unsigned long slot_addr;
+ int slot_nr;
+
+ slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+ if (slot_nr < UINSNS_PER_PAGE) {
+ __set_bit(slot_nr, area->bitmap);
+ slot_addr = area->vaddr +
+ (slot_nr * UPROBES_XOL_SLOT_BYTES);
+ return slot_addr;
+ }
+
+ return 0;
+}
+
+/*
+ * xol_get_insn_slot - if the task was not already allocated a slot,
+ * allocate one now.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+ unsigned long slot_addr)
+{
+ struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+ unsigned long flags, xol_vaddr = current->utask->xol_vaddr;
+ void *vaddr;
+
+ if (!current->utask->xol_vaddr) {
+ if (!area)
+ area = xol_alloc_area();
+
+ if (!area)
+ return 0;
+
+ spin_lock_irqsave(&area->slot_lock, flags);
+ xol_vaddr = xol_take_insn_slot(area);
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ current->utask->xol_vaddr = xol_vaddr;
+ }
+
+ /*
+ * Initialize the slot if xol_vaddr points to valid
+ * instruction slot.
+ */
+ if (unlikely(!xol_vaddr))
+ return 0;
+
+ current->utask->vaddr = slot_addr;
+ vaddr = kmap_atomic(area->page, KM_USER0);
+ xol_vaddr &= ~PAGE_MASK;
+ memcpy(vaddr + xol_vaddr, uprobe->insn, MAX_UINSN_BYTES);
+ kunmap_atomic(vaddr, KM_USER0);
+ return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If slot was earlier allocated by
+ * @xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk, unsigned long slot_addr)
+{
+ struct uprobes_xol_area *area;
+ unsigned long vma_end;
+
+ if (!tsk->mm || !tsk->mm->uprobes_xol_area)
+ return;
+
+ area = tsk->mm->uprobes_xol_area;
+
+ if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+ return;
+
+ vma_end = area->vaddr + PAGE_SIZE;
+ if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+ int slot_nr;
+ unsigned long offset = slot_addr - area->vaddr;
+ unsigned long flags;
+
+ BUG_ON(offset % UPROBES_XOL_SLOT_BYTES);
+
+ slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+ BUG_ON(slot_nr >= UINSNS_PER_PAGE);
+
+ spin_lock_irqsave(&area->slot_lock, flags);
+ __clear_bit(slot_nr, area->bitmap);
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ return;
+ }
+ printk(KERN_ERR "%s: no XOL vma for slot address %#lx\n",
+ __func__, slot_addr);
+}
+
/*
* Called with no locks held.
* Called in the context of an exiting or an exec-ing thread.
@@ -814,14 +1051,18 @@ mmap_out:
void uprobe_free_utask(struct task_struct *tsk)
{
struct uprobe_task *utask = tsk->utask;
+ unsigned long xol_vaddr;

if (!utask)
return;

+ xol_vaddr = utask->xol_vaddr;
if (utask->active_uprobe)
put_uprobe(utask->active_uprobe);
+
kfree(utask);
tsk->utask = NULL;
+ xol_free_insn_slot(tsk, xol_vaddr);
}

/*

2011-03-14 13:42:09

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 12/20] 12: uprobes: get the breakpoint address.


On a breakpoint hit, return the address where the breakpoint was hit.
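
On x86 the breakpoint is the one-byte int3 opcode, so there
UPROBES_BKPT_INSN_SIZE is 1 (an assumption based on the one-byte int3
encoding; the generic code relies only on the constant) and the
computation is a single subtraction:

        /* trap at ip = 0x446421 after executing the int3 at 0x446420 */
        bkpt_addr = instruction_pointer(regs) - 1;  /* 0x446421 - 1 = 0x446420 */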

Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/uprobes.h | 5 +++++
kernel/uprobes.c | 11 +++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index c42975b..aef55de 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -146,6 +146,7 @@ extern void uprobe_free_utask(struct task_struct *tsk);

struct vm_area_struct;
extern void uprobe_mmap(struct vm_area_struct *vma);
+extern unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
extern void uprobes_free_xol_area(struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
@@ -165,5 +166,9 @@ static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobe_mmap(struct vm_area_struct *vma) { }
static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
+static inline unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
+{
+ return 0;
+}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9d3d402..307f0cd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1044,6 +1044,17 @@ static void xol_free_insn_slot(struct task_struct *tsk, unsigned long slot_addr)
__func__, slot_addr);
}

+/**
+ * uprobes_get_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
+{
+ return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
/*
* Called with no locks held.
* Called in the context of an exiting or an exec-ing thread.

2011-03-14 13:42:18

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 13/20] 13: x86: x86 specific probe handling


Provides x86-specific implementations for setting the current
instruction pointer, pre-single-step and post-single-step handling, and
enabling and disabling single-stepping.

This patch also introduces TIF_UPROBE which is set by uprobes notifier
code. TIF_UPROBE indicates that there is pending work that needs to be
done at do_notify_resume time.
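
The post-single-step fixups boil down to one constant: the displacement
between the original probed address and the XOL slot. A user-space
sketch of the arithmetic that post_xol() below applies (all addresses
and instruction lengths are illustrative):

#include <stdio.h>

int main(void)
{
        unsigned long vaddr     = 0x446420;             /* probed address */
        unsigned long xol_vaddr = 0x7f0000000000;       /* XOL slot */
        long correction = (long)(vaddr - xol_vaddr);

        /* FIX_IP: ip advanced relative to the copy; shift it back. */
        unsigned long ip_after_step = xol_vaddr + 2;    /* 2-byte insn */
        printf("fixed ip  = %#lx\n", ip_after_step + correction);  /* 0x446422 */

        /* FIX_CALL: the return address pushed by the copied call gets
         * the same correction applied to the word on the stack. */
        unsigned long ra_pushed = xol_vaddr + 5;        /* 5-byte call */
        printf("fixed ret = %#lx\n", ra_pushed + correction);      /* 0x446425 */
        return 0;
}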

Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2
arch/x86/include/asm/uprobes.h | 10 ++
arch/x86/kernel/uprobes.c | 157 ++++++++++++++++++++++++++++++++++++
3 files changed, 168 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f0b6e5d..5b9c9f0 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
+#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index e38950f..0e1b23f 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -37,11 +37,19 @@ struct uprobe_arch_info {

struct uprobe_task_arch_info {
unsigned long saved_scratch_register;
+ int oflags;
};
#else
struct uprobe_arch_info {};
-struct uprobe_task_arch_info {};
+struct uprobe_task_arch_info {
+ int oflags;
+};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
+extern void set_ip(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern void arch_uprobe_enable_sstep(struct pt_regs *regs);
+extern void arch_uprobe_disable_sstep(struct pt_regs *regs);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index cf223a4..5667e90 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>
+#include <linux/uaccess.h>

#include <linux/kdebug.h>
#include <asm/insn.h>
@@ -412,3 +413,159 @@ int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
prepare_fixups(uprobe, &insn);
return 0;
}
+
+/*
+ * set_ip - set the instruction pointer.
+ * @regs: reflects the saved state of the task.
+ * @vaddr: the virtual address to jump to.
+ */
+void set_ip(struct pt_regs *regs, unsigned long vaddr)
+{
+ regs->ip = vaddr;
+}
+
+/*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of @tsk.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns 0 on success.
+ */
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+ regs->ip = current->utask->xol_vaddr;
+#ifdef CONFIG_X86_64
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
+ tskinfo->saved_scratch_register = regs->ax;
+ regs->ax = current->utask->vaddr;
+ regs->ax += uprobe->arch_info.rip_rela_target_address;
+ } else if (uprobe->fixups & UPROBES_FIX_RIP_CX) {
+ tskinfo->saved_scratch_register = regs->cx;
+ regs->cx = current->utask->vaddr;
+ regs->cx += uprobe->arch_info.rip_rela_target_address;
+ }
+#endif
+ return 0;
+}
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+ int rasize, ncopied;
+ long ra = 0;
+
+ if (is_32bit_app(current))
+ rasize = 4;
+ else
+ rasize = 8;
+ ncopied = copy_from_user(&ra, (void __user *) sp, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ ra += correction;
+ ncopied = copy_to_user((void __user *) sp, &ra, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ return 0;
+
+fail:
+ printk(KERN_ERR
+ "uprobes: Failed to adjust return address after"
+ " single-stepping call instruction;"
+ " pid=%d, sp=%#lx\n", current->pid, sp);
+ return -EFAULT;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+ return ((uprobe->fixups &
+ (UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+#endif /* CONFIG_X86_64 */
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction. We need
+ * to make it relative to the original instruction (FIX_IP). Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction. We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field). (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+ int result = 0;
+ long correction;
+
+ correction = (long)(utask->vaddr - utask->xol_vaddr);
+#ifdef CONFIG_X86_64
+ if (is_riprel_insn(uprobe)) {
+ struct uprobe_task_arch_info *tskinfo;
+ tskinfo = &current->utask->tskinfo;
+
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX)
+ regs->ax = tskinfo->saved_scratch_register;
+ else
+ regs->cx = tskinfo->saved_scratch_register;
+ /*
+ * The original instruction includes a displacement, and so
+ * is 4 bytes longer than what we've just single-stepped.
+ * Fall through to handle stuff like "jmpq *...(%rip)" and
+ * "callq *...(%rip)".
+ */
+ correction += 4;
+ }
+#endif
+ if (uprobe->fixups & UPROBES_FIX_IP)
+ regs->ip += correction;
+ if (uprobe->fixups & UPROBES_FIX_CALL)
+ result = adjust_ret_addr(regs->sp, correction);
+ return result;
+}
+
+void arch_uprobe_enable_sstep(struct pt_regs *regs)
+{
+ /*
+ * Enable single-stepping by
+ * - Set TF on stack
+ * - Set TIF_SINGLESTEP: Guarantees that TF is set when
+ * returning to user mode.
+ * - Indicate that TF is set by us.
+ */
+ regs->flags |= X86_EFLAGS_TF;
+ set_thread_flag(TIF_SINGLESTEP);
+ set_thread_flag(TIF_FORCED_TF);
+}
+
+void arch_uprobe_disable_sstep(struct pt_regs *regs)
+{
+ /* Disable single-stepping by clearing what we set */
+ clear_thread_flag(TIF_SINGLESTEP);
+ clear_thread_flag(TIF_FORCED_TF);
+ regs->flags &= ~X86_EFLAGS_TF;
+}

2011-03-14 13:42:31

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 14/20] 14: uprobes: Handling int3 and singlestep exception.


On int3, set the TIF_UPROBE flag and, if task-specific info is
available, mark the task state as breakpoint hit. Setting the
TIF_UPROBE flag results in uprobe_notify_resume being called.
uprobe_notify_resume walks through the list of vmas, matching the inode
and offset corresponding to the instruction pointer against entries in
the rbtree. Once a matching uprobe is found, the handlers of all
registered consumers are run.

On singlestep exception, perform the necessary fixups and allow the
process to continue. The necessary fixups are determined at instruction
analysis time.

TODO: If there is no matching uprobe, signal a trap to the process.
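
Schematically, the per-task state machine that uprobe_notify_resume()
drives looks like the following (a runnable sketch of just the
transitions, not the patch code):

#include <stdio.h>

enum uprobe_task_state { UTASK_RUNNING, UTASK_BP_HIT, UTASK_SSTEP };

/* One notify_resume pass: BP_HIT -> SSTEP -> RUNNING. */
static enum uprobe_task_state step(enum uprobe_task_state s)
{
        switch (s) {
        case UTASK_BP_HIT:      /* run handler chain, set up XOL slot */
                return UTASK_SSTEP;
        case UTASK_SSTEP:       /* single-step done, apply fixups */
                return UTASK_RUNNING;
        default:
                return s;
        }
}

int main(void)
{
        enum uprobe_task_state s = UTASK_BP_HIT;

        s = step(s);    /* handlers ran, now single-stepping */
        s = step(s);    /* fixups applied, back to normal execution */
        printf("%s\n", s == UTASK_RUNNING ? "UTASK_RUNNING" : "?");
        return 0;
}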

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 4 +
kernel/uprobes.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index aef55de..b7fd925 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -149,6 +149,9 @@ extern void uprobe_mmap(struct vm_area_struct *vma);
extern unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs);
extern void uprobe_dup_mmap(struct mm_struct *old_mm, struct mm_struct *mm);
extern void uprobes_free_xol_area(struct mm_struct *mm);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -166,6 +169,7 @@ static inline void uprobe_dup_mmap(struct mm_struct *old_mm,
static inline void uprobe_free_utask(struct task_struct *tsk) {}
static inline void uprobe_mmap(struct vm_area_struct *vma) { }
static inline void uprobes_free_xol_area(struct mm_struct *mm) {}
+static inline void uprobe_notify_resume(struct pt_regs *regs) {}
static inline unsigned long uprobes_get_bkpt_addr(struct pt_regs *regs)
{
return 0;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 307f0cd..d8d4574 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1096,3 +1096,149 @@ static struct uprobe_task *add_utask(void)
current->utask = utask;
return utask;
}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+ unsigned long vaddr)
+{
+ xol_get_insn_slot(uprobe, vaddr);
+ BUG_ON(!current->utask->xol_vaddr);
+ if (!pre_xol(uprobe, regs)) {
+ set_ip(regs, current->utask->xol_vaddr);
+ return 0;
+ }
+ return -EFAULT;
+}
+
+/*
+ * Verify from the instruction pointer whether the single-step has
+ * indeed occurred. If it has, do the post-single-step fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ unsigned long vaddr = instruction_pointer(regs);
+
+ /*
+ * If we have executed out of line, Instruction pointer
+ * cannot be same as virtual address of XOL slot.
+ */
+ if (vaddr == current->utask->xol_vaddr)
+ return false;
+ post_xol(uprobe, regs);
+ return true;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ * If it's the first time the probepoint is hit, a slot gets allocated here.
+ * If it's the first time the thread hits a breakpoint, a utask gets
+ * allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+ struct vm_area_struct *vma;
+ struct uprobe_task *utask;
+ struct mm_struct *mm;
+ struct uprobe *u = NULL;
+ unsigned long probept;
+
+ utask = current->utask;
+ mm = current->mm;
+ if (unlikely(!utask)) {
+ utask = add_utask();
+
+ /* Failed to allocate utask for the current task. */
+ BUG_ON(!utask);
+ utask->state = UTASK_BP_HIT;
+ }
+ if (utask->state == UTASK_BP_HIT) {
+ probept = uprobes_get_bkpt_addr(regs);
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!valid_vma(vma))
+ continue;
+ if (probept < vma->vm_start || probept > vma->vm_end)
+ continue;
+ u = find_uprobe(vma->vm_file->f_mapping->host,
+ probept - vma->vm_start);
+ if (u)
+ break;
+ }
+ up_read(&mm->mmap_sem);
+ /*TODO Return SIGTRAP signal */
+ if (!u) {
+ set_ip(regs, probept);
+ utask->state = UTASK_RUNNING;
+ return;
+ }
+ /* TODO Start queueing signals. */
+ utask->active_uprobe = u;
+ handler_chain(u, regs);
+ utask->state = UTASK_SSTEP;
+ if (!pre_ssout(u, regs, probept))
+ arch_uprobe_enable_sstep(regs);
+ } else if (utask->state == UTASK_SSTEP) {
+ u = utask->active_uprobe;
+ if (sstep_complete(u, regs)) {
+ put_uprobe(u);
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ /* TODO Stop queueing signals. */
+ arch_uprobe_disable_sstep(regs);
+ }
+ }
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context.
+ * It marks the task state as breakpoint hit and sets the TIF_UPROBE flag.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+ struct uprobe_task *utask;
+
+ if (!current->mm || !atomic_read(&current->mm->uprobes_count))
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ if (utask)
+ utask->state = UTASK_BP_HIT;
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context.
+ * It completes the single step operation.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+ struct uprobe *uprobe;
+ struct uprobe_task *utask;
+
+ if (!current->mm || !current->utask || !current->utask->active_uprobe)
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ uprobe = utask->active_uprobe;
+ if (!uprobe)
+ return 0;
+
+ if (uprobes_resume_can_sleep(uprobe)) {
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+ }
+ if (sstep_complete(uprobe, regs)) {
+ put_uprobe(uprobe);
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ /* TODO Stop queueing signals. */
+ arch_uprobe_disable_sstep(regs);
+ return 1;
+ }
+ return 0;
+}

2011-03-14 13:42:41

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 15/20] 15: x86: uprobes exception notifier for x86.


Provides a uprobes exception notifier for x86. This notifier gets
called in interrupt context and routes int3 and singlestep exceptions
to uprobes when a probed process encounters them.

Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 3 +++
arch/x86/kernel/signal.c | 14 ++++++++++++++
arch/x86/kernel/uprobes.c | 29 +++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 0e1b23f..f49cf49 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -22,6 +22,7 @@
* Srikar Dronamraju
* Jim Keniston
*/
+#include <linux/notifier.h>

typedef u8 uprobe_opcode_t;
#define MAX_UINSN_BYTES 16
@@ -52,4 +53,6 @@ extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
extern void arch_uprobe_enable_sstep(struct pt_regs *regs);
extern void arch_uprobe_disable_sstep(struct pt_regs *regs);
+extern int uprobes_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173c..0b06b94 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/personality.h>
#include <linux/uaccess.h>
#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>

#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -848,6 +849,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);

+ if (thread_info_flags & _TIF_UPROBE) {
+ clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+ /*
+ * On x86_32, do_notify_resume() gets called with
+ * interrupts disabled. Hence enable interrupts if they
+ * are still disabled.
+ */
+ local_irq_enable();
+#endif
+ uprobe_notify_resume(regs);
+ }
+
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 5667e90..b55469c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -569,3 +569,32 @@ void arch_uprobe_disable_sstep(struct pt_regs *regs)
clear_thread_flag(TIF_FORCED_TF);
regs->flags &= ~X86_EFLAGS_TF;
}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobes_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+ int ret = NOTIFY_DONE;
+
+ /* We are only interested in userspace traps */
+ if (regs && !user_mode_vm(regs))
+ return NOTIFY_DONE;
+
+ switch (val) {
+ case DIE_INT3:
+ /* int3 breakpoint: let uprobes try to handle it */
+ if (uprobe_bkpt_notifier(regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_DEBUG:
+ if (uprobe_post_notifier(regs))
+ ret = NOTIFY_STOP;
+ default:
+ break;
+ }
+ return ret;
+}

2011-03-14 13:42:55

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 16/20] 16: uprobes: register a notifier for uprobes.


Uprobes needs to be notified of int3 and singlestep exceptions, so it
registers a die notifier.

Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index d8d4574..bbedcef 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,6 +33,7 @@
#include <linux/rmap.h> /* needed for anon_vma_prepare */
#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
#include <linux/file.h> /* needed for fput() */
+#include <linux/kdebug.h> /* needed for notifier mechanism */

#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
@@ -1242,3 +1243,21 @@ int uprobe_post_notifier(struct pt_regs *regs)
}
return 0;
}
+
+struct notifier_block uprobes_exception_nb = {
+ .notifier_call = uprobes_exception_notify,
+ .priority = 0x7ffffff0,
+};
+
+static int __init init_uprobes(void)
+{
+ register_die_notifier(&uprobes_exception_nb);
+ return 0;
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);

2011-03-14 13:43:11

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 17/20] 17: uprobes: filter chain


Loops through the filter callbacks of the currently registered
consumers to see if any consumer is interested in tracing this task.
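
A hypothetical consumer showing where the filter slots in (a sketch:
my_filter, my_handler and target_tgid are made-up names, and the
callback signatures follow their use in filter_chain() and the
dispatcher rather than a documented API):

static pid_t target_tgid = 1234;        /* assumed: tgid we want to trace */

static bool my_filter(struct uprobe_consumer *self, struct task_struct *t)
{
        return t->tgid == target_tgid;
}

static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
        pr_info("probe hit, ip=%#lx\n", instruction_pointer(regs));
        return 0;
}

static struct uprobe_consumer my_consumer = {
        .handler = my_handler,
        .filter  = my_filter,   /* a NULL filter matches every task */
};

/* ... register_uprobe(inode, offset, &my_consumer); */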

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index bbedcef..e3a3051 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -428,6 +428,24 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
}

/* Acquires uprobe->consumer_rwsem */
+static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
+{
+ struct uprobe_consumer *consumer;
+ bool ret = false;
+
+ down_read(&uprobe->consumer_rwsem);
+ for (consumer = uprobe->consumers; consumer;
+ consumer = consumer->next) {
+ if (!consumer->filter || consumer->filter(consumer, t)) {
+ ret = true;
+ break;
+ }
+ }
+ up_read(&uprobe->consumer_rwsem);
+ return ret;
+}
+
+/* Acquires uprobe->consumer_rwsem */
static void add_consumer(struct uprobe *uprobe,
struct uprobe_consumer *consumer)
{

2011-03-14 13:43:31

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 18/20] 18: uprobes: commonly used filters.


Provides the most commonly used filters, which most uprobes users can
reuse. These become truly useful once a filter can be dynamically
associated with a uprobe-event tracer.
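
For example, a pid-filtered consumer built from these helpers (a
sketch; my_handler and the tgid value are illustrative):

static struct uprobe_simple_consumer my_usc = {
        .consumer = {
                .handler = my_handler,          /* as sketched earlier */
                .filter  = uprobes_pid_filter,  /* matches on tgid */
        },
        .fvalue = 1234,                         /* tgid to trace */
};

/* ... register_uprobe(inode, offset, &my_usc.consumer); */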

Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 5 +++++
kernel/uprobes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b7fd925..a7a8d5a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -65,6 +65,11 @@ struct uprobe_consumer {
struct uprobe_consumer *next;
};

+struct uprobe_simple_consumer {
+ struct uprobe_consumer consumer;
+ pid_t fvalue;
+};
+
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index e3a3051..328053e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1262,6 +1262,56 @@ int uprobe_post_notifier(struct pt_regs *regs)
return 0;
}

+bool uprobes_pid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ if (t->tgid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_tid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ if (t->pid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_ppid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ pid_t pid;
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ rcu_read_lock();
+ pid = task_tgid_vnr(t->real_parent);
+ rcu_read_unlock();
+
+ if (pid == usc->fvalue)
+ return true;
+ return false;
+}
+
+bool uprobes_sid_filter(struct uprobe_consumer *self, struct task_struct *t)
+{
+ pid_t pid;
+ struct uprobe_simple_consumer *usc;
+
+ usc = container_of(self, struct uprobe_simple_consumer, consumer);
+ rcu_read_lock();
+ pid = pid_vnr(task_session(t));
+ rcu_read_unlock();
+
+ if (pid == usc->fvalue)
+ return true;
+ return false;
+}
+
struct notifier_block uprobes_exception_nb = {
.notifier_call = uprobes_exception_notify,
.priority = 0x7ffffff0,

2011-03-14 13:43:41

by Srikar Dronamraju

Subject: [PATCH v2 2.6.38-rc8-tip 20/20] 20: tracing: uprobes trace_event interface


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

The following example shows how to dump the instruction pointer and the
%ax register at the probed text address. Here we are trying to probe
zfree in /bin/zsh:

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
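
The offset in the echo command follows from the two listings above:
zfree sits at virtual address 0x446420 and the text segment is mapped
at 0x400000 with file offset 0, so the file offset is 0x446420 -
0x400000 = 0x46420, which uprobe_events echoes back as
0x0000000000046420.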

TODO: Documentation/trace/uprobetrace.txt
TODO: Connect a filter to a consumer.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 10 -
kernel/trace/Kconfig | 16 +
kernel/trace/Makefile | 1
kernel/trace/trace.h | 5
kernel/trace/trace_kprobe.c | 4
kernel/trace/trace_probe.c | 14 +
kernel/trace/trace_probe.h | 6
kernel/trace/trace_uprobe.c | 800 +++++++++++++++++++++++++++++++++++++++++++
8 files changed, 839 insertions(+), 17 deletions(-)
create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/arch/Kconfig b/arch/Kconfig
index c681f16..c4e9663 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,16 +62,8 @@ config OPTPROBES
depends on !PREEMPT

config UPROBES
- bool "User-space probes (EXPERIMENTAL)"
- depends on ARCH_SUPPORTS_UPROBES
- depends on MMU
select MM_OWNER
- help
- Uprobes enables kernel subsystems to establish probepoints
- in user applications and execute handler functions when
- the probepoints are hit.
-
- If in doubt, say "N".
+ def_bool n

config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 78b9a05..5a64d1a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.

+config UPROBE_EVENT
+ bool "Enable uprobes-based dynamic events"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select UPROBES
+ select PROBE_EVENTS
+ select TRACING
+ default n
+ help
+ This allows the user to add tracing events on top of userspace dynamic
+ events (similar to tracepoints) on the fly via the traceevents interface.
+ Those events can be inserted wherever uprobes can probe, and record
+ various registers.
+ This option is required if you plan to use perf-probe subcommand of perf
+ tools on user space applications.
+
config PROBE_EVENTS
def_bool n

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 692223a..bb3d3ff 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -57,5 +57,6 @@ ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 5e9dfc6..55249f8 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
unsigned long ret_ip;
};

+struct uprobe_trace_entry_head {
+ struct trace_entry ent;
+ unsigned long ip;
+};
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 3f9189d..dec70fb 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -374,8 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}

/* Parse fetch argument */
- ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
- is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size,
+ &tp->args[i], is_return, true);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9c01aaf..fc8b110 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -508,13 +508,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,

/* Recursive argument parser */
static int parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, bool is_return)
+ struct fetch_param *f, bool is_return, bool is_kprobe)
{
int ret = 0;
unsigned long param;
long offset;
char *tmp;

+ /* For now, uprobe_events supports only register arguments */
+ if (!is_kprobe && arg[0] != '%')
+ return -EINVAL;
+
switch (arg[0]) {
case '$':
ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -564,7 +568,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
if (!dprm)
return -ENOMEM;
dprm->offset = offset;
- ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+ is_kprobe);
if (ret)
kfree(dprm);
else {
@@ -618,7 +623,7 @@ static int __parse_bitfield_probe_arg(const char *bf,

/* String length checking wrapper */
int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return)
+ struct probe_arg *parg, bool is_return, bool is_kprobe)
{
const char *t;
int ret;
@@ -644,7 +649,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
}
parg->offset = *size;
*size += parg->type->size;
- ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+ is_kprobe);
if (ret >= 0 && t != NULL)
ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 487514b..f625d8d 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -48,6 +48,7 @@
#define FIELD_STRING_IP "__probe_ip"
#define FIELD_STRING_RETIP "__probe_ret_ip"
#define FIELD_STRING_FUNC "__probe_func"
+#define FIELD_STRING_PID "__probe_pid"

#undef DEFINE_FIELD
#define DEFINE_FIELD(type, item, name, is_signed) \
@@ -64,6 +65,7 @@
/* Flags for trace_probe */
#define TP_FLAG_TRACE 1
#define TP_FLAG_PROFILE 2
+#define TP_FLAG_UPROBE 4


/* data_rloc: data relative location, compatible with u32 */
@@ -130,7 +132,7 @@ static inline __kprobes void call_fetch(struct fetch_param *fprm,
}

/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
+static inline int is_good_name(const char *name)
{
if (!isalpha(*name) && *name != '_')
return 0;
@@ -142,7 +144,7 @@ static int is_good_name(const char *name)
}

extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return);
+ struct probe_arg *parg, bool is_return, bool is_kprobe);

extern int traceprobe_conflict_field_name(const char *name,
struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..17567f5
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,800 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/**
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+ struct uprobe_consumer cons;
+ struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+ struct list_head list;
+ struct ftrace_event_class class;
+ struct ftrace_event_call call;
+ struct uprobe_trace_consumer *consumer;
+ struct inode *inode;
+ char *filename;
+ unsigned long offset;
+ unsigned long nhit;
+ unsigned int flags; /* For TP_FLAG_* */
+ ssize_t size; /* trace entry size */
+ unsigned int nr_args;
+ struct probe_arg args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n) \
+ (offsetof(struct trace_uprobe, args) + \
+ (sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate a new trace_uprobe and initialize it.
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+ const char *event, int nargs)
+{
+ struct trace_uprobe *tp;
+
+ if (!event || !is_good_name(event)) {
+ printk(KERN_ERR "Invalid event name: %s\n", event);
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (!group || !is_good_name(group)) {
+ printk(KERN_ERR "Invalid group name: %s\n", group);
+ return ERR_PTR(-EINVAL);
+ }
+
+ tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+ if (!tp)
+ return ERR_PTR(-ENOMEM);
+
+ tp->call.class = &tp->class;
+ tp->call.name = kstrdup(event, GFP_KERNEL);
+ if (!tp->call.name)
+ goto error;
+
+ tp->class.system = kstrdup(group, GFP_KERNEL);
+ if (!tp->class.system)
+ goto error;
+
+ INIT_LIST_HEAD(&tp->list);
+ return tp;
+error:
+ kfree(tp->call.name);
+ kfree(tp);
+ return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+ int i;
+
+ for (i = 0; i < tp->nr_args; i++)
+ traceprobe_free_probe_arg(&tp->args[i]);
+
+ iput(tp->inode);
+ kfree(tp->call.class->system);
+ kfree(tp->call.name);
+ kfree(tp->filename);
+ kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+ const char *group)
+{
+ struct trace_uprobe *tp;
+
+ list_for_each_entry(tp, &uprobe_list, list)
+ if (strcmp(tp->call.name, event) == 0 &&
+ strcmp(tp->call.class->system, group) == 0)
+ return tp;
+ return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event; caller must hold uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+ list_del(&tp->list);
+ unregister_uprobe_event(tp);
+ free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+ struct trace_uprobe *old_tp;
+ int ret;
+
+ mutex_lock(&uprobe_lock);
+
+ /* register as an event */
+ old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+ if (old_tp)
+ /* delete old event */
+ unregister_trace_uprobe(old_tp);
+
+ ret = register_uprobe_event(tp);
+ if (ret) {
+ pr_warning("Failed to register probe event(%d)\n", ret);
+ goto end;
+ }
+
+ list_add_tail(&tp->list, &uprobe_list);
+end:
+ mutex_unlock(&uprobe_lock);
+ return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+ /*
+ * Argument syntax:
+ * - Add uprobe: p[:[GRP/]EVENT] FILE:OFFSET [%REG]
+ *
+ * - Remove uprobe: -:[GRP/]EVENT
+ */
+ struct path path;
+ struct inode *inode;
+ struct trace_uprobe *tp;
+ int i, ret = 0;
+ int is_delete = 0;
+ char *arg = NULL, *event = NULL, *group = NULL;
+ unsigned long offset;
+ char buf[MAX_EVENT_NAME_LEN];
+ char *filename;
+
+ /* argc must be >= 1 */
+ if (argv[0][0] == '-')
+ is_delete = 1;
+ else if (argv[0][0] != 'p') {
+ pr_info("Probe definition must be started with 'p', 'r' or"
+ " '-'.\n");
+ return -EINVAL;
+ }
+
+ if (argv[0][1] == ':') {
+ event = &argv[0][2];
+ if (strchr(event, '/')) {
+ group = event;
+ event = strchr(group, '/') + 1;
+ event[-1] = '\0';
+ if (strlen(group) == 0) {
+ pr_info("Group name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (strlen(event) == 0) {
+ pr_info("Event name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (!group)
+ group = UPROBE_EVENT_SYSTEM;
+
+ if (is_delete) {
+ if (!event) {
+ pr_info("Delete command needs an event name.\n");
+ return -EINVAL;
+ }
+ mutex_lock(&uprobe_lock);
+ tp = find_probe_event(event, group);
+ if (!tp) {
+ mutex_unlock(&uprobe_lock);
+ pr_info("Event %s/%s doesn't exist.\n", group, event);
+ return -ENOENT;
+ }
+ /* delete an event */
+ unregister_trace_uprobe(tp);
+ mutex_unlock(&uprobe_lock);
+ return 0;
+ }
+
+ if (argc < 2) {
+ pr_info("Probe point is not specified.\n");
+ return -EINVAL;
+ }
+ if (isdigit(argv[1][0])) {
+ pr_info("probe point must be have a filename.\n");
+ return -EINVAL;
+ }
+ arg = strchr(argv[1], ':');
+ if (!arg) {
+ ret = -EINVAL;
+ goto fail_address_parse;
+ }
+
+ *arg++ = '\0';
+ filename = argv[1];
+ ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+ if (ret)
+ goto fail_address_parse;
+ inode = path.dentry->d_inode;
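+ /* Take a reference; dropped via iput() in free_trace_uprobe() or on error */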
+ __iget(inode);
+
+ ret = strict_strtoul(arg, 0, &offset);
+ if (ret) {
+ iput(inode);
+ goto fail_address_parse;
+ }
+ argc -= 2; argv += 2;
+
+ /* setup a probe */
+ if (!event) {
+ char *tail = strrchr(filename, '/');
+
+ snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p',
+ (tail ? tail + 1 : filename), offset);
+ event = buf;
+ }
+ tp = alloc_trace_uprobe(group, event, argc);
+ if (IS_ERR(tp)) {
+ pr_info("Failed to allocate trace_uprobe.(%d)\n",
+ (int)PTR_ERR(tp));
+ iput(inode);
+ return PTR_ERR(tp);
+ }
+ tp->offset = offset;
+ tp->inode = inode;
+ tp->filename = kstrdup(filename, GFP_KERNEL);
+ if (!tp->filename) {
+ pr_info("Failed to allocate filename.\n");
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ /* parse arguments */
+ ret = 0;
+ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+ /* Increment count for freeing args in error case */
+ tp->nr_args++;
+
+ /* Parse argument name */
+ arg = strchr(argv[i], '=');
+ if (arg) {
+ *arg++ = '\0';
+ tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+ } else {
+ arg = argv[i];
+ /* If argument name is omitted, set "argN" */
+ snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+ tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+ }
+
+ if (!tp->args[i].name) {
+ pr_info("Failed to allocate argument[%d] name.\n", i);
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (!is_good_name(tp->args[i].name)) {
+ pr_info("Invalid argument[%d] name: %s\n",
+ i, tp->args[i].name);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
+ pr_info("Argument[%d] name '%s' conflicts with "
+ "another field.\n", i, argv[i]);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ /* Parse fetch argument */
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ false, false);
+ if (ret) {
+ pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+ goto error;
+ }
+ }
+
+ ret = register_trace_uprobe(tp);
+ if (ret)
+ goto error;
+ return 0;
+
+error:
+ free_trace_uprobe(tp);
+ return ret;
+
+fail_address_parse:
+ pr_info("Failed to parse address.\n");
+ return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+ struct trace_uprobe *tp;
+
+ mutex_lock(&uprobe_lock);
+ while (!list_empty(&uprobe_list)) {
+ tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+ unregister_trace_uprobe(tp);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+ mutex_lock(&uprobe_lock);
+ return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+ mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+ int i;
+
+ seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+ seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+ for (i = 0; i < tp->nr_args; i++)
+ seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+ if ((file->f_mode & FMODE_WRITE) &&
+ (file->f_flags & O_TRUNC))
+ cleanup_all_probes();
+
+ return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+ .owner = THIS_MODULE,
+ .open = probes_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+
+ seq_printf(m, " %s %-44s %15lu\n", tp->filename, tp->call.name,
+ tp->nhit);
+ return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+ .owner = THIS_MODULE,
+ .open = profile_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+ struct uprobe_trace_entry_head *entry;
+ struct ring_buffer_event *event;
+ struct ring_buffer *buffer;
+ u8 *data;
+ int size, i, pc;
+ unsigned long irq_flags;
+ struct ftrace_event_call *call = &tp->call;
+
+ tp->nhit++;
+
+ local_save_flags(irq_flags);
+ pc = preempt_count();
+
+ size = sizeof(*entry) + tp->size;
+
+ event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+ size, irq_flags, pc);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->ip = uprobes_get_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ if (!filter_current_check_discard(buffer, call, entry, event))
+ trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct uprobe_trace_entry_head *field;
+ struct trace_seq *s = &iter->seq;
+ struct trace_uprobe *tp;
+ u8 *data;
+ int i;
+
+ field = (struct uprobe_trace_entry_head *)iter->ent;
+ tp = container_of(event, struct trace_uprobe, call.event);
+
+ if (!trace_seq_printf(s, "%s: (", tp->call.name))
+ goto partial;
+
+ if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+ goto partial;
+
+ if (!trace_seq_puts(s, ")"))
+ goto partial;
+
+ data = (u8 *)&field[1];
+ for (i = 0; i < tp->nr_args; i++)
+ if (!tp->args[i].type->print(s, tp->args[i].name,
+ data + tp->args[i].offset, field))
+ goto partial;
+
+ if (!trace_seq_puts(s, "\n"))
+ goto partial;
+
+ return TRACE_TYPE_HANDLED;
+partial:
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+static int probe_event_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_TRACE;
+ utc->tp = tp;
+ tp->consumer = utc;
+ return 0;
+}
+
+static void probe_event_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_TRACE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+ int ret, i;
+ struct uprobe_trace_entry_head field;
+ struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+ DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+ /* Set argument names as fields */
+ for (i = 0; i < tp->nr_args; i++) {
+ ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+ tp->args[i].name,
+ sizeof(field) + tp->args[i].offset,
+ tp->args[i].type->size,
+ tp->args[i].type->is_signed,
+ FILTER_OTHER);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+ int i;
+ int pos = 0;
+
+ const char *fmt, *arg;
+
+ fmt = "(%lx)";
+ arg = "REC->" FIELD_STRING_IP;
+
+ /* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+ tp->args[i].name, tp->args[i].type->fmt);
+ }
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+ tp->args[i].name);
+ }
+
+#undef LEN_OR_ZERO
+
+ /* return the length of print_fmt */
+ return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+ int len;
+ char *print_fmt;
+
+ /* First: called with 0 length to calculate the needed length */
+ len = __set_print_fmt(tp, NULL, 0);
+ print_fmt = kmalloc(len + 1, GFP_KERNEL);
+ if (!print_fmt)
+ return -ENOMEM;
+
+ /* Second: actually write the @print_fmt */
+ __set_print_fmt(tp, print_fmt, len + 1);
+ tp->call.print_fmt = print_fmt;
+
+ return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp,
+ struct pt_regs *regs)
+{
+ struct ftrace_event_call *call = &tp->call;
+ struct uprobe_trace_entry_head *entry;
+ struct hlist_head *head;
+ u8 *data;
+ int size, __size, i;
+ int rctx;
+
+ __size = sizeof(*entry) + tp->size;
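+ /* perf prefixes the raw record with a u32 size; pad so the total is u64 aligned */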
+ size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+ if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+ "profile buffer not large enough"))
+ return;
+
+ entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+ if (!entry)
+ return;
+
+ entry->ip = uprobes_get_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ head = this_cpu_ptr(call->perf_events);
+ perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+
+static int probe_perf_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_PROFILE;
+ tp->consumer = utc;
+ utc->tp = tp;
+ return 0;
+}
+
+static void probe_perf_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_PROFILE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+#endif /* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+ switch (type) {
+ case TRACE_REG_REGISTER:
+ return probe_event_enable(event);
+ case TRACE_REG_UNREGISTER:
+ probe_event_disable(event);
+ return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+ case TRACE_REG_PERF_REGISTER:
+ return probe_perf_enable(event);
+ case TRACE_REG_PERF_UNREGISTER:
+ probe_perf_disable(event);
+ return 0;
+#endif
+ }
+ return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp;
+
+ utc = container_of(con, struct uprobe_trace_consumer, cons);
+ tp = utc->tp;
+ if (!tp || tp->consumer != utc)
+ return 0;
+
+ if (tp->flags & TP_FLAG_TRACE)
+ uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+ if (tp->flags & TP_FLAG_PROFILE)
+ uprobe_perf_func(tp, regs);
+#endif
+ return 0;
+}
+
+static struct trace_event_functions uprobe_funcs = {
+ .trace = print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+ struct ftrace_event_call *call = &tp->call;
+ int ret;
+
+ /* Initialize ftrace_event_call */
+ INIT_LIST_HEAD(&call->class->fields);
+ call->event.funcs = &uprobe_funcs;
+ call->class->define_fields = uprobe_event_define_fields;
+ if (set_print_fmt(tp) < 0)
+ return -ENOMEM;
+ ret = register_ftrace_event(&call->event);
+ if (!ret) {
+ kfree(call->print_fmt);
+ return -ENODEV;
+ }
+ call->flags = 0;
+ call->class->reg = uprobe_register;
+ call->data = tp;
+ ret = trace_add_event_call(call);
+ if (ret) {
+ pr_info("Failed to register uprobe event: %s\n", call->name);
+ kfree(call->print_fmt);
+ unregister_ftrace_event(&call->event);
+ }
+ return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+ /* tp->event is unregistered in trace_remove_event_call() */
+ trace_remove_event_call(&tp->call);
+ kfree(tp->call.print_fmt);
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *entry;
+
+ d_tracer = tracing_init_dentry();
+ if (!d_tracer)
+ return 0;
+
+ entry = trace_create_file("uprobe_events", 0644, d_tracer,
+ NULL, &uprobe_events_ops);
+ /* Profile interface */
+ entry = trace_create_file("uprobe_profile", 0444, d_tracer,
+ NULL, &uprobe_profile_ops);
+ return 0;
+}
+fs_initcall(init_uprobe_trace);
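
For a feel of the control interface this file creates, here is a minimal
userspace sketch (illustrative only: the probed binary, offset and event
name are made up, and debugfs is assumed mounted at /sys/kernel/debug):

#include <stdio.h>

int main(void)
{
	FILE *f;

	/* Append a probe definition using the syntax parsed by
	 * create_trace_uprobe() above. Opening for append avoids O_TRUNC,
	 * which probes_open() interprets as "clear all probes". */
	f = fopen("/sys/kernel/debug/tracing/uprobe_events", "a");
	if (!f)
		return 1;
	fprintf(f, "p:uprobes/bash_probe /bin/bash:0x4245c0\n");
	fclose(f);

	/* Enable the new event through the regular ftrace event tree */
	f = fopen("/sys/kernel/debug/tracing/events/uprobes/bash_probe/enable", "w");
	if (!f)
		return 1;
	fprintf(f, "1\n");
	fclose(f);

	/* Hits now appear in tracing/trace; per-probe hit counts are in
	 * tracing/uprobe_profile. */
	return 0;
}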

2011-03-14 13:40:10

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 2.6.38-rc8-tip 2/20] 2: X86 specific breakpoint definitions.


Provides definitions for the breakpoint instruction and the x86-specific
uprobe info structure.

Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/Kconfig | 3 +++
arch/x86/include/asm/uprobes.h | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/uprobes.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0256aed..610c5e5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,9 @@ config ARCH_CPU_PROBE_RELEASE
def_bool y
depends on HOTPLUG_CPU

+config ARCH_SUPPORTS_UPROBES
+ def_bool y
+
source "init/Kconfig"
source "kernel/Kconfig.freezer"

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..5026359
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,40 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2010
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES 128 /* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+#ifdef CONFIG_X86_64
+struct uprobe_arch_info {
+ unsigned long rip_rela_target_address;
+};
+#else
+struct uprobe_arch_info {};
+#endif
+#endif /* _ASM_UPROBES_H */
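
To illustrate how the arch-independent uprobes layer is expected to consume
these definitions, here is a small hypothetical sketch (not part of the
patch; the helper names are invented):

#include <linux/string.h>
#include <linux/types.h>
#include <asm/uprobes.h>

/* Is the opcode at this instruction copy already our breakpoint? */
static bool is_uprobe_bkpt_insn(uprobe_opcode_t *insn)
{
	/* On x86 the breakpoint is the one-byte int3 opcode, 0xcc */
	return *insn == UPROBES_BKPT_INSN;
}

/* Stash the original instruction in an XOL slot for single-stepping.
 * A slot is UPROBES_XOL_SLOT_BYTES (128) wide: room for a
 * MAX_UINSN_BYTES (16) instruction while keeping slots cache aligned. */
static void copy_insn_to_xol_slot(void *slot, const u8 *orig_insn)
{
	memcpy(slot, orig_insn, MAX_UINSN_BYTES);
}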

2011-03-14 13:43:57

by Srikar Dronamraju

[permalink] [raw]
Subject: [PATCH v2 2.6.38-rc8-tip 19/20] 19: tracing: Extract out common code for kprobes/uprobes traceevents.


Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into common files: kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/trace/Kconfig | 4
kernel/trace/Makefile | 1
kernel/trace/trace_kprobe.c | 861 +------------------------------------------
kernel/trace/trace_probe.c | 747 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_probe.h | 158 ++++++++
5 files changed, 929 insertions(+), 842 deletions(-)
create mode 100644 kernel/trace/trace_probe.c
create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 14674dc..78b9a05 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
depends on HAVE_REGS_AND_STACK_ACCESS_API
bool "Enable kprobes-based dynamic events"
select TRACING
+ select PROBE_EVENTS
default y
help
This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.

+config PROBE_EVENTS
+ def_bool n
+
config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..692223a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 8435b43..3f9189d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,525 +19,15 @@

#include <linux/module.h>
#include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
- "common_type",
- "common_flags",
- "common_preempt_count",
- "common_pid",
- "common_tgid",
- "common_lock_depth",
- FIELD_STRING_IP,
- FIELD_STRING_RETIP,
- FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
- void *);
-#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
-#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
-
-/* Printing in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
- const char *name, \
- void *data, void *ent)\
-{ \
- return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-} \
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs) \
- (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl) ((u32)(dl) >> 16)
-#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
- return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
- return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- * data_rloc stores the offset from data_rloc itself, but data_loc
- * stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
- const char *name,
- void *data, void *ent)
-{
- int len = *(u32 *)data >> 16;
-
- if (!len)
- return trace_seq_printf(s, " %s=(fault)", name);
- else
- return trace_seq_printf(s, " %s=\"%s\"", name,
- (const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
- fetch_func_t fn;
- void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
- struct pt_regs *regs, void *dest)
-{
- return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8) \
-DEFINE_FETCH_##method(u16) \
-DEFINE_FETCH_##method(u32) \
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn) \
- (((FETCH_FUNC_NAME(method, u8) == fn) || \
- (FETCH_FUNC_NAME(method, u16) == fn) || \
- (FETCH_FUNC_NAME(method, u32) == fn) || \
- (FETCH_FUNC_NAME(method, u64) == fn) || \
- (FETCH_FUNC_NAME(method, string) == fn) || \
- (FETCH_FUNC_NAME(method, string_size) == fn)) \
- && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type) \
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_register(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type) \
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type) \
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
- void *dummy, void *dest) \
-{ \
- *(type *)dest = (type)regs_return_value(regs); \
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type) \
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
- void *addr, void *dest) \
-{ \
- type retval; \
- if (probe_kernel_address(addr, retval)) \
- *(type *)dest = 0; \
- else \
- *(type *)dest = retval; \
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- long ret;
- int maxlen = get_rloc_len(*(u32 *)dest);
- u8 *dst = get_rloc_data(dest);
- u8 *src = addr;
- mm_segment_t old_fs = get_fs();
- if (!maxlen)
- return;
- /*
- * Try to get string again, since the string can be changed while
- * probing.
- */
- set_fs(KERNEL_DS);
- pagefault_disable();
- do
- ret = __copy_from_user_inatomic(dst++, src++, 1);
- while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
- dst[-1] = '\0';
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) { /* Failed to fetch string */
- ((u8 *)get_rloc_data(dest))[0] = '\0';
- *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
- } else
- *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
- get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- int ret, len = 0;
- u8 c;
- mm_segment_t old_fs = get_fs();
-
- set_fs(KERNEL_DS);
- pagefault_disable();
- do {
- ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
- len++;
- } while (c && ret == 0 && len < MAX_STRING_SIZE);
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) /* Failed to check the length */
- *(u32 *)dest = 0;
- else
- *(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
- char *symbol;
- long offset;
- unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
- sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
- if (sc->addr)
- sc->addr += sc->offset;
- return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
- kfree(sc->symbol);
- kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
- struct symbol_cache *sc;
-
- if (!sym || strlen(sym) == 0)
- return NULL;
- sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
- if (!sc)
- return NULL;
-
- sc->symbol = kstrdup(sym, GFP_KERNEL);
- if (!sc->symbol) {
- kfree(sc);
- return NULL;
- }
- sc->offset = offset;
-
- update_symbol_cache(sc);
- return sc;
-}
-
-#define DEFINE_FETCH_symbol(type) \
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct symbol_cache *sc = data; \
- if (sc->addr) \
- fetch_memory_##type(regs, (void *)sc->addr, dest); \
- else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
- struct fetch_param orig;
- long offset;
-};
-
-#define DEFINE_FETCH_deref(type) \
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct deref_fetch_param *dprm = data; \
- unsigned long addr; \
- call_fetch(&dprm->orig, regs, &addr); \
- if (addr) { \
- addr += dprm->offset; \
- fetch_memory_##type(regs, (void *)addr, dest); \
- } else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
- struct fetch_param orig;
- unsigned char hi_shift;
- unsigned char low_shift;
-};
-
-#define DEFINE_FETCH_bitfield(type) \
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct bitfield_fetch_param *bprm = data; \
- type buf = 0; \
- call_fetch(&bprm->orig, regs, &buf); \
- if (buf) { \
- buf <<= bprm->hi_shift; \
- buf >>= bprm->low_shift; \
- } \
- *(type *)dest = buf; \
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
- /*
- * Don't check the bitfield itself, because this must be the
- * last fetch function.
- */
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
- FETCH_MTD_reg = 0,
- FETCH_MTD_stack,
- FETCH_MTD_retval,
- FETCH_MTD_memory,
- FETCH_MTD_symbol,
- FETCH_MTD_deref,
- FETCH_MTD_bitfield,
- FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type) \
- [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
- {.name = _name, \
- .size = _size, \
- .is_signed = sign, \
- .print = PRINT_TYPE_FUNC_NAME(ptype), \
- .fmt = PRINT_TYPE_FMT_NAME(ptype), \
- .fmttype = _fmttype, \
- .fetch = { \
-ASSIGN_FETCH_FUNC(reg, ftype), \
-ASSIGN_FETCH_FUNC(stack, ftype), \
-ASSIGN_FETCH_FUNC(retval, ftype), \
-ASSIGN_FETCH_FUNC(memory, ftype), \
-ASSIGN_FETCH_FUNC(symbol, ftype), \
-ASSIGN_FETCH_FUNC(deref, ftype), \
-ASSIGN_FETCH_FUNC(bitfield, ftype), \
- } \
- }
+#include "trace_probe.h"

-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
- __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
- const char *name; /* Name of type */
- size_t size; /* Byte size of type */
- int is_signed; /* Signed flag */
- print_type_func_t print; /* Print functions */
- const char *fmt; /* Fromat string */
- const char *fmttype; /* Name in format file */
- /* Fetch functions */
- fetch_func_t fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
- /* Special types */
- [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
- sizeof(u32), 1, "__data_loc char[]"),
- [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
- string_size, sizeof(u32), 0, "u32"),
- /* Basic types */
- ASSIGN_FETCH_TYPE(u8, u8, 0),
- ASSIGN_FETCH_TYPE(u16, u16, 0),
- ASSIGN_FETCH_TYPE(u32, u32, 0),
- ASSIGN_FETCH_TYPE(u64, u64, 0),
- ASSIGN_FETCH_TYPE(s8, u8, 1),
- ASSIGN_FETCH_TYPE(s16, u16, 1),
- ASSIGN_FETCH_TYPE(s32, u32, 1),
- ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
- int i;
-
- if (!type)
- type = DEFAULT_FETCH_TYPE_STR;
-
- /* Special case: bitfield */
- if (*type == 'b') {
- unsigned long bs;
- type = strchr(type, '/');
- if (!type)
- goto fail;
- type++;
- if (strict_strtoul(type, 0, &bs))
- goto fail;
- switch (bs) {
- case 8:
- return find_fetch_type("u8");
- case 16:
- return find_fetch_type("u16");
- case 32:
- return find_fetch_type("u32");
- case 64:
- return find_fetch_type("u64");
- default:
- goto fail;
- }
- }
-
- for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
- if (strcmp(type, fetch_type_table[i].name) == 0)
- return &fetch_type_table[i];
-fail:
- return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
- void *dummy, void *dest)
-{
- *(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
- fetch_func_t orig_fn)
-{
- int i;
-
- if (type != &fetch_type_table[FETCH_TYPE_STRING])
- return NULL; /* Only string type needs size function */
- for (i = 0; i < FETCH_MTD_END; i++)
- if (type->fetch[i] == orig_fn)
- return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
- WARN_ON(1); /* This should not happen */
- return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"

/**
* Kprobe event core functions
*/

-struct probe_arg {
- struct fetch_param fetch;
- struct fetch_param fetch_size;
- unsigned int offset; /* Offset from argument entry */
- const char *name; /* Name of this argument */
- const char *comm; /* Command of this argument */
- const struct fetch_type *type; /* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE 1
-#define TP_FLAG_PROFILE 2
-
struct trace_probe {
struct list_head list;
struct kretprobe rp; /* Use rp.kp for kprobe use */
@@ -576,18 +66,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
static int kretprobe_dispatcher(struct kretprobe_instance *ri,
struct pt_regs *regs);

-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
- if (!isalpha(*name) && *name != '_')
- return 0;
- while (*++name != '\0') {
- if (!isalpha(*name) && !isdigit(*name) && *name != '_')
- return 0;
- }
- return 1;
-}
-
/*
* Allocate new trace_probe and initialize it (including kprobes).
*/
@@ -596,7 +74,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
void *addr,
const char *symbol,
unsigned long offs,
- int nargs, int is_return)
+ int nargs, bool is_return)
{
struct trace_probe *tp;
int ret = -ENOMEM;
@@ -647,24 +125,12 @@ error:
return ERR_PTR(ret);
}

-static void free_probe_arg(struct probe_arg *arg)
-{
- if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
- free_bitfield_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
- free_deref_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
- free_symbol_cache(arg->fetch.data);
- kfree(arg->name);
- kfree(arg->comm);
-}
-
static void free_trace_probe(struct trace_probe *tp)
{
int i;

for (i = 0; i < tp->nr_args; i++)
- free_probe_arg(&tp->args[i]);
+ traceprobe_free_probe_arg(&tp->args[i]);

kfree(tp->call.class->system);
kfree(tp->call.name);
@@ -738,227 +204,6 @@ end:
return ret;
}

-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
- char *tmp;
- int ret;
-
- if (!offset)
- return -EINVAL;
-
- tmp = strchr(symbol, '+');
- if (tmp) {
- /* skip sign because strict_strtol doesn't accept '+' */
- ret = strict_strtoul(tmp + 1, 0, offset);
- if (ret)
- return ret;
- *tmp = '\0';
- } else
- *offset = 0;
- return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
-
- if (strcmp(arg, "retval") == 0) {
- if (is_return)
- f->fn = t->fetch[FETCH_MTD_retval];
- else
- ret = -EINVAL;
- } else if (strncmp(arg, "stack", 5) == 0) {
- if (arg[5] == '\0') {
- if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
- f->fn = fetch_stack_address;
- else
- ret = -EINVAL;
- } else if (isdigit(arg[5])) {
- ret = strict_strtoul(arg + 5, 10, &param);
- if (ret || param > PARAM_MAX_STACK)
- ret = -EINVAL;
- else {
- f->fn = t->fetch[FETCH_MTD_stack];
- f->data = (void *)param;
- }
- } else
- ret = -EINVAL;
- } else
- ret = -EINVAL;
- return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
- long offset;
- char *tmp;
-
- switch (arg[0]) {
- case '$':
- ret = parse_probe_vars(arg + 1, t, f, is_return);
- break;
- case '%': /* named register */
- ret = regs_query_register_offset(arg + 1);
- if (ret >= 0) {
- f->fn = t->fetch[FETCH_MTD_reg];
- f->data = (void *)(unsigned long)ret;
- ret = 0;
- }
- break;
- case '@': /* memory or symbol */
- if (isdigit(arg[1])) {
- ret = strict_strtoul(arg + 1, 0, &param);
- if (ret)
- break;
- f->fn = t->fetch[FETCH_MTD_memory];
- f->data = (void *)param;
- } else {
- ret = split_symbol_offset(arg + 1, &offset);
- if (ret)
- break;
- f->data = alloc_symbol_cache(arg + 1, offset);
- if (f->data)
- f->fn = t->fetch[FETCH_MTD_symbol];
- }
- break;
- case '+': /* deref memory */
- arg++; /* Skip '+', because strict_strtol() rejects it. */
- case '-':
- tmp = strchr(arg, '(');
- if (!tmp)
- break;
- *tmp = '\0';
- ret = strict_strtol(arg, 0, &offset);
- if (ret)
- break;
- arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (tmp) {
- struct deref_fetch_param *dprm;
- const struct fetch_type *t2 = find_fetch_type(NULL);
- *tmp = '\0';
- dprm = kzalloc(sizeof(struct deref_fetch_param),
- GFP_KERNEL);
- if (!dprm)
- return -ENOMEM;
- dprm->offset = offset;
- ret = __parse_probe_arg(arg, t2, &dprm->orig,
- is_return);
- if (ret)
- kfree(dprm);
- else {
- f->fn = t->fetch[FETCH_MTD_deref];
- f->data = (void *)dprm;
- }
- }
- break;
- }
- if (!ret && !f->fn) { /* Parsed, but do not find fetch method */
- pr_info("%s type has no corresponding fetch method.\n",
- t->name);
- ret = -EINVAL;
- }
- return ret;
-}
-
-#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
- const struct fetch_type *t,
- struct fetch_param *f)
-{
- struct bitfield_fetch_param *bprm;
- unsigned long bw, bo;
- char *tail;
-
- if (*bf != 'b')
- return 0;
-
- bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
- if (!bprm)
- return -ENOMEM;
- bprm->orig = *f;
- f->fn = t->fetch[FETCH_MTD_bitfield];
- f->data = (void *)bprm;
-
- bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
- if (bw == 0 || *tail != '@')
- return -EINVAL;
-
- bf = tail + 1;
- bo = simple_strtoul(bf, &tail, 0);
- if (tail == bf || *tail != '/')
- return -EINVAL;
-
- bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
- bprm->low_shift = bprm->hi_shift + bo;
- return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
- struct probe_arg *parg, int is_return)
-{
- const char *t;
- int ret;
-
- if (strlen(arg) > MAX_ARGSTR_LEN) {
- pr_info("Argument is too long.: %s\n", arg);
- return -ENOSPC;
- }
- parg->comm = kstrdup(arg, GFP_KERNEL);
- if (!parg->comm) {
- pr_info("Failed to allocate memory for command '%s'.\n", arg);
- return -ENOMEM;
- }
- t = strchr(parg->comm, ':');
- if (t) {
- arg[t - parg->comm] = '\0';
- t++;
- }
- parg->type = find_fetch_type(t);
- if (!parg->type) {
- pr_info("Unsupported type: %s\n", t);
- return -EINVAL;
- }
- parg->offset = tp->size;
- tp->size += parg->type->size;
- ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
- if (ret >= 0 && t != NULL)
- ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
- if (ret >= 0) {
- parg->fetch_size.fn = get_fetch_size_function(parg->type,
- parg->fetch.fn);
- parg->fetch_size.data = parg->fetch.data;
- }
- return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
- struct probe_arg *args, int narg)
-{
- int i;
- for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
- if (strcmp(reserved_field_names[i], name) == 0)
- return 1;
- for (i = 0; i < narg; i++)
- if (strcmp(args[i].name, name) == 0)
- return 1;
- return 0;
-}
-
static int create_trace_probe(int argc, char **argv)
{
/*
@@ -981,7 +226,7 @@ static int create_trace_probe(int argc, char **argv)
*/
struct trace_probe *tp;
int i, ret = 0;
- int is_return = 0, is_delete = 0;
+ bool is_return = false, is_delete = false;
char *symbol = NULL, *event = NULL, *group = NULL;
char *arg;
unsigned long offset = 0;
@@ -990,11 +235,11 @@ static int create_trace_probe(int argc, char **argv)

/* argc must be >= 1 */
if (argv[0][0] == 'p')
- is_return = 0;
+ is_return = false;
else if (argv[0][0] == 'r')
- is_return = 1;
+ is_return = true;
else if (argv[0][0] == '-')
- is_delete = 1;
+ is_delete = true;
else {
pr_info("Probe definition must be started with 'p', 'r' or"
" '-'.\n");
@@ -1058,7 +303,7 @@ static int create_trace_probe(int argc, char **argv)
/* a symbol specified */
symbol = argv[1];
/* TODO: support .init module functions */
- ret = split_symbol_offset(symbol, &offset);
+ ret = traceprobe_split_symbol_offset(symbol, &offset);
if (ret) {
pr_info("Failed to parse symbol.\n");
return ret;
@@ -1120,7 +365,8 @@ static int create_trace_probe(int argc, char **argv)
goto error;
}

- if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
pr_info("Argument[%d] name '%s' conflicts with "
"another field.\n", i, argv[i]);
ret = -EINVAL;
@@ -1128,7 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}

/* Parse fetch argument */
- ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ is_return);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
@@ -1215,70 +462,11 @@ static int probes_open(struct inode *inode, struct file *file)
return seq_open(file, &probes_seq_op);
}

-static int command_trace_probe(const char *buf)
-{
- char **argv;
- int argc = 0, ret = 0;
-
- argv = argv_split(GFP_KERNEL, buf, &argc);
- if (!argv)
- return -ENOMEM;
-
- if (argc)
- ret = create_trace_probe(argc, argv);
-
- argv_free(argv);
- return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
static ssize_t probes_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
- char *kbuf, *tmp;
- int ret;
- size_t done;
- size_t size;
-
- kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
- if (!kbuf)
- return -ENOMEM;
-
- ret = done = 0;
- while (done < count) {
- size = count - done;
- if (size >= WRITE_BUFSIZE)
- size = WRITE_BUFSIZE - 1;
- if (copy_from_user(kbuf, buffer + done, size)) {
- ret = -EFAULT;
- goto out;
- }
- kbuf[size] = '\0';
- tmp = strchr(kbuf, '\n');
- if (tmp) {
- *tmp = '\0';
- size = tmp - kbuf + 1;
- } else if (done + size < count) {
- pr_warning("Line length is too long: "
- "Should be less than %d.", WRITE_BUFSIZE);
- ret = -EINVAL;
- goto out;
- }
- done += size;
- /* Remove comments */
- tmp = strchr(kbuf, '#');
- if (tmp)
- *tmp = '\0';
-
- ret = command_trace_probe(kbuf);
- if (ret)
- goto out;
- }
- ret = done;
-out:
- kfree(kbuf);
- return ret;
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_probe);
}

static const struct file_operations kprobe_events_ops = {
@@ -1536,17 +724,6 @@ static void probe_event_disable(struct ftrace_event_call *call)
}
}

-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed) \
- do { \
- ret = trace_define_field(event_call, #type, name, \
- offsetof(typeof(field), item), \
- sizeof(field.item), is_signed, \
- FILTER_OTHER); \
- if (ret) \
- return ret; \
- } while (0)
-
static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
{
int ret, i;
@@ -1887,7 +1064,7 @@ static __init int kprobe_trace_self_tests_init(void)

pr_info("Testing kprobe tracing: ");

- ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
+ ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
"$stack $stack0 +0($stack)");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function entry.\n");
@@ -1902,7 +1079,7 @@ static __init int kprobe_trace_self_tests_init(void)
probe_event_enable(&tp->call);
}

- ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
+ ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
"$retval");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function return.\n");
@@ -1922,13 +1099,13 @@ static __init int kprobe_trace_self_tests_init(void)

ret = target(1, 2, 3, 4, 5, 6);

- ret = command_trace_probe("-:testprobe");
+ ret = traceprobe_command("-:testprobe");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
}

- ret = command_trace_probe("-:testprobe2");
+ ret = traceprobe_command("-:testprobe2");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..9c01aaf
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,747 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+ "common_type",
+ "common_flags",
+ "common_preempt_count",
+ "common_pid",
+ "common_tgid",
+ "common_lock_depth",
+ FIELD_STRING_IP,
+ FIELD_STRING_RETIP,
+ FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
+#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
+
+/* Printing in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
+ const char *name, \
+ void *data, void *ent)\
+{ \
+ return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+} \
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+ return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+ return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+ const char *name,
+ void *data, void *ent)
+{
+ int len = *(u32 *)data >> 16;
+
+ if (!len)
+ return trace_seq_printf(s, " %s=(fault)", name);
+ else
+ return trace_seq_printf(s, " %s=\"%s\"", name,
+ (const char *)get_loc_data(data, ent));
+}
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8) \
+DEFINE_FETCH_##method(u16) \
+DEFINE_FETCH_##method(u32) \
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn) \
+ (((FETCH_FUNC_NAME(method, u8) == fn) || \
+ (FETCH_FUNC_NAME(method, u16) == fn) || \
+ (FETCH_FUNC_NAME(method, u32) == fn) || \
+ (FETCH_FUNC_NAME(method, u64) == fn) || \
+ (FETCH_FUNC_NAME(method, string) == fn) || \
+ (FETCH_FUNC_NAME(method, string_size) == fn)) \
+ && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type) \
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_register(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type) \
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type) \
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+ void *dummy, void *dest) \
+{ \
+ *(type *)dest = (type)regs_return_value(regs); \
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type) \
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+ void *addr, void *dest) \
+{ \
+ type retval; \
+ if (probe_kernel_address(addr, retval)) \
+ *(type *)dest = 0; \
+ else \
+ *(type *)dest = retval; \
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ long ret;
+ int maxlen = get_rloc_len(*(u32 *)dest);
+ u8 *dst = get_rloc_data(dest);
+ u8 *src = addr;
+ mm_segment_t old_fs = get_fs();
+ if (!maxlen)
+ return;
+ /*
+ * Try to get string again, since the string can be changed while
+ * probing.
+ */
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do
+ ret = __copy_from_user_inatomic(dst++, src++, 1);
+ while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+ dst[-1] = '\0';
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) { /* Failed to fetch string */
+ ((u8 *)get_rloc_data(dest))[0] = '\0';
+ *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+ } else
+ *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+ get_rloc_offs(*(u32 *)dest));
+}
+/* Return the length of string -- including null terminal byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ int ret, len = 0;
+ u8 c;
+ mm_segment_t old_fs = get_fs();
+
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do {
+ ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+ len++;
+ } while (c && ret == 0 && len < MAX_STRING_SIZE);
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) /* Failed to check the length */
+ *(u32 *)dest = 0;
+ else
+ *(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+ char *symbol;
+ long offset;
+ unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+ sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+ if (sc->addr)
+ sc->addr += sc->offset;
+ return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+ kfree(sc->symbol);
+ kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+ struct symbol_cache *sc;
+
+ if (!sym || strlen(sym) == 0)
+ return NULL;
+ sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+ if (!sc)
+ return NULL;
+
+ sc->symbol = kstrdup(sym, GFP_KERNEL);
+ if (!sc->symbol) {
+ kfree(sc);
+ return NULL;
+ }
+ sc->offset = offset;
+
+ update_symbol_cache(sc);
+ return sc;
+}
+
+#define DEFINE_FETCH_symbol(type) \
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct symbol_cache *sc = data; \
+ if (sc->addr) \
+ fetch_memory_##type(regs, (void *)sc->addr, dest); \
+ else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
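+/* (backs args of the form "+|-offs(FETCHARG)", e.g. +4(%di) fetches the value at address %di + 4) */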
+struct deref_fetch_param {
+ struct fetch_param orig;
+ long offset;
+};
+
+#define DEFINE_FETCH_deref(type) \
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct deref_fetch_param *dprm = data; \
+ unsigned long addr; \
+ call_fetch(&dprm->orig, regs, &addr); \
+ if (addr) { \
+ addr += dprm->offset; \
+ fetch_memory_##type(regs, (void *)addr, dest); \
+ } else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+ struct fetch_param orig;
+ unsigned char hi_shift;
+ unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type) \
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct bitfield_fetch_param *bprm = data; \
+ type buf = 0; \
+ call_fetch(&bprm->orig, regs, &buf); \
+ if (buf) { \
+ buf <<= bprm->hi_shift; \
+ buf >>= bprm->low_shift; \
+ } \
+ *(type *)dest = buf; \
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+ /*
+ * Don't check the bitfield itself, because this must be the
+ * last fetch function.
+ */
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
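+/* e.g. with BITS_PER_LONG == 64 these expand to u64 and "u64" */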
+
+#define ASSIGN_FETCH_FUNC(method, type) \
+ [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
+ {.name = _name, \
+ .size = _size, \
+ .is_signed = sign, \
+ .print = PRINT_TYPE_FUNC_NAME(ptype), \
+ .fmt = PRINT_TYPE_FMT_NAME(ptype), \
+ .fmttype = _fmttype, \
+ .fetch = { \
+ASSIGN_FETCH_FUNC(reg, ftype), \
+ASSIGN_FETCH_FUNC(stack, ftype), \
+ASSIGN_FETCH_FUNC(retval, ftype), \
+ASSIGN_FETCH_FUNC(memory, ftype), \
+ASSIGN_FETCH_FUNC(symbol, ftype), \
+ASSIGN_FETCH_FUNC(deref, ftype), \
+ASSIGN_FETCH_FUNC(bitfield, ftype), \
+ } \
+ }
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
+ __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+ /* Special types */
+ [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+ sizeof(u32), 1, "__data_loc char[]"),
+ [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+ string_size, sizeof(u32), 0, "u32"),
+ /* Basic types */
+ ASSIGN_FETCH_TYPE(u8, u8, 0),
+ ASSIGN_FETCH_TYPE(u16, u16, 0),
+ ASSIGN_FETCH_TYPE(u32, u32, 0),
+ ASSIGN_FETCH_TYPE(u64, u64, 0),
+ ASSIGN_FETCH_TYPE(s8, u8, 1),
+ ASSIGN_FETCH_TYPE(s16, u16, 1),
+ ASSIGN_FETCH_TYPE(s32, u32, 1),
+ ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+ int i;
+
+ if (!type)
+ type = DEFAULT_FETCH_TYPE_STR;
+
+ /* Special case: bitfield */
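+	/*
+	 * A bitfield type is spelled e.g. "b4@3/32": a 4 bit wide field
+	 * at bit offset 3 within a 32 bit container. Only the container
+	 * size matters for the type lookup here.
+	 */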
+ if (*type == 'b') {
+ unsigned long bs;
+ type = strchr(type, '/');
+ if (!type)
+ goto fail;
+ type++;
+ if (strict_strtoul(type, 0, &bs))
+ goto fail;
+ switch (bs) {
+ case 8:
+ return find_fetch_type("u8");
+ case 16:
+ return find_fetch_type("u16");
+ case 32:
+ return find_fetch_type("u32");
+ case 64:
+ return find_fetch_type("u64");
+ default:
+ goto fail;
+ }
+ }
+
+ for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+ if (strcmp(type, fetch_type_table[i].name) == 0)
+ return &fetch_type_table[i];
+fail:
+ return NULL;
+}
+
+/* Special function: only accepts unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+ void *dummy, void *dest)
+{
+ *(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+ fetch_func_t orig_fn)
+{
+ int i;
+
+ if (type != &fetch_type_table[FETCH_TYPE_STRING])
+ return NULL; /* Only string type needs size function */
+ for (i = 0; i < FETCH_MTD_END; i++)
+ if (type->fetch[i] == orig_fn)
+ return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+ WARN_ON(1); /* This should not happen */
+ return NULL;
+}
+
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+ char *tmp;
+ int ret;
+
+ if (!offset)
+ return -EINVAL;
+
+ tmp = strchr(symbol, '+');
+ if (tmp) {
+		/* skip the sign because strict_strtoul() doesn't accept '+' */
+ ret = strict_strtoul(tmp + 1, 0, offset);
+ if (ret)
+ return ret;
+ *tmp = '\0';
+ } else
+ *offset = 0;
+ return 0;
+}
+
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+
+ if (strcmp(arg, "retval") == 0) {
+ if (is_return)
+ f->fn = t->fetch[FETCH_MTD_retval];
+ else
+ ret = -EINVAL;
+ } else if (strncmp(arg, "stack", 5) == 0) {
+ if (arg[5] == '\0') {
+ if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+ f->fn = fetch_stack_address;
+ else
+ ret = -EINVAL;
+ } else if (isdigit(arg[5])) {
+ ret = strict_strtoul(arg + 5, 10, &param);
+ if (ret || param > PARAM_MAX_STACK)
+ ret = -EINVAL;
+ else {
+ f->fn = t->fetch[FETCH_MTD_stack];
+ f->data = (void *)param;
+ }
+ } else
+ ret = -EINVAL;
+ } else
+ ret = -EINVAL;
+ return ret;
+}
+
+/* Recursive argument parser */
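+/* (handles e.g. $retval, $stack3, %ax, @symbol+8, @0x1000 and +4(%di)) */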
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+ long offset;
+ char *tmp;
+
+ switch (arg[0]) {
+ case '$':
+ ret = parse_probe_vars(arg + 1, t, f, is_return);
+ break;
+ case '%': /* named register */
+ ret = regs_query_register_offset(arg + 1);
+ if (ret >= 0) {
+ f->fn = t->fetch[FETCH_MTD_reg];
+ f->data = (void *)(unsigned long)ret;
+ ret = 0;
+ }
+ break;
+ case '@': /* memory or symbol */
+ if (isdigit(arg[1])) {
+ ret = strict_strtoul(arg + 1, 0, &param);
+ if (ret)
+ break;
+ f->fn = t->fetch[FETCH_MTD_memory];
+ f->data = (void *)param;
+ } else {
+ ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+ if (ret)
+ break;
+ f->data = alloc_symbol_cache(arg + 1, offset);
+ if (f->data)
+ f->fn = t->fetch[FETCH_MTD_symbol];
+ }
+ break;
+ case '+': /* deref memory */
+ arg++; /* Skip '+', because strict_strtol() rejects it. */
+ case '-':
+ tmp = strchr(arg, '(');
+ if (!tmp)
+ break;
+ *tmp = '\0';
+ ret = strict_strtol(arg, 0, &offset);
+ if (ret)
+ break;
+ arg = tmp + 1;
+ tmp = strrchr(arg, ')');
+ if (tmp) {
+ struct deref_fetch_param *dprm;
+ const struct fetch_type *t2 = find_fetch_type(NULL);
+ *tmp = '\0';
+ dprm = kzalloc(sizeof(struct deref_fetch_param),
+ GFP_KERNEL);
+ if (!dprm)
+ return -ENOMEM;
+ dprm->offset = offset;
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ if (ret)
+ kfree(dprm);
+ else {
+ f->fn = t->fetch[FETCH_MTD_deref];
+ f->data = (void *)dprm;
+ }
+ }
+ break;
+ }
+	if (!ret && !f->fn) { /* parsed, but no fetch method found */
+ pr_info("%s type has no corresponding fetch method.\n",
+ t->name);
+ ret = -EINVAL;
+ }
+ return ret;
+}
+#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+ const struct fetch_type *t,
+ struct fetch_param *f)
+{
+ struct bitfield_fetch_param *bprm;
+ unsigned long bw, bo;
+ char *tail;
+
+ if (*bf != 'b')
+ return 0;
+
+ bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+ if (!bprm)
+ return -ENOMEM;
+ bprm->orig = *f;
+ f->fn = t->fetch[FETCH_MTD_bitfield];
+ f->data = (void *)bprm;
+
+ bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
+ if (bw == 0 || *tail != '@')
+ return -EINVAL;
+
+ bf = tail + 1;
+ bo = simple_strtoul(bf, &tail, 0);
+ if (tail == bf || *tail != '/')
+ return -EINVAL;
+
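+	/*
+	 * e.g. "b4@3/32" gives hi_shift = 32 - (4 + 3) = 25 and
+	 * low_shift = 25 + 3 = 28, isolating bits 3..6 of the fetched
+	 * 32 bit word.
+	 */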
+ bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+ bprm->low_shift = bprm->hi_shift + bo;
+ return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return)
+{
+ const char *t;
+ int ret;
+
+ if (strlen(arg) > MAX_ARGSTR_LEN) {
+		pr_info("Argument is too long: %s\n", arg);
+ return -ENOSPC;
+ }
+ parg->comm = kstrdup(arg, GFP_KERNEL);
+ if (!parg->comm) {
+ pr_info("Failed to allocate memory for command '%s'.\n", arg);
+ return -ENOMEM;
+ }
+ t = strchr(parg->comm, ':');
+ if (t) {
+ arg[t - parg->comm] = '\0';
+ t++;
+ }
+ parg->type = find_fetch_type(t);
+ if (!parg->type) {
+ pr_info("Unsupported type: %s\n", t);
+ return -EINVAL;
+ }
+ parg->offset = *size;
+ *size += parg->type->size;
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ if (ret >= 0 && t != NULL)
+ ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+ if (ret >= 0) {
+ parg->fetch_size.fn = get_fetch_size_function(parg->type,
+ parg->fetch.fn);
+ parg->fetch_size.data = parg->fetch.data;
+ }
+ return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg)
+{
+ int i;
+ for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+ if (strcmp(reserved_field_names[i], name) == 0)
+ return 1;
+ for (i = 0; i < narg; i++)
+ if (strcmp(args[i].name, name) == 0)
+ return 1;
+ return 0;
+}
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+ if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+ free_bitfield_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+ free_deref_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+ free_symbol_cache(arg->fetch.data);
+ kfree(arg->name);
+ kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char**))
+{
+ char **argv;
+ int argc = 0, ret = 0;
+
+ argv = argv_split(GFP_KERNEL, buf, &argc);
+ if (!argv)
+ return -ENOMEM;
+
+ if (argc)
+ ret = createfn(argc, argv);
+
+ argv_free(argv);
+ return ret;
+}
+
+#define WRITE_BUFSIZE 128
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos, int (*createfn)(int, char**))
+{
+ char *kbuf, *tmp;
+ int ret = 0;
+ size_t done = 0;
+ size_t size;
+
+ kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ while (done < count) {
+ size = count - done;
+ if (size >= WRITE_BUFSIZE)
+ size = WRITE_BUFSIZE - 1;
+ if (copy_from_user(kbuf, buffer + done, size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ kbuf[size] = '\0';
+ tmp = strchr(kbuf, '\n');
+ if (tmp) {
+ *tmp = '\0';
+ size = tmp - kbuf + 1;
+ } else if (done + size < count) {
+ pr_warning("Line length is too long: "
+ "Should be less than %d.", WRITE_BUFSIZE);
+ ret = -EINVAL;
+ goto out;
+ }
+ done += size;
+ /* Remove comments */
+ tmp = strchr(kbuf, '#');
+ if (tmp)
+ *tmp = '\0';
+
+ ret = traceprobe_command(kbuf, createfn);
+ if (ret)
+ goto out;
+ }
+ ret = done;
+out:
+ kfree(kbuf);
+ return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..487514b
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,158 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed) \
+ do { \
+ ret = trace_define_field(event_call, #type, name, \
+ offsetof(typeof(field), item), \
+ sizeof(field.item), is_signed, \
+ FILTER_OTHER); \
+ if (ret) \
+ return ret; \
+ } while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE 1
+#define TP_FLAG_PROFILE 2
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs) \
+ (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl) ((u32)(dl) >> 16)
+#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
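+/*
+ * e.g. make_data_rloc(5, 0x20) == 0x00050020: 5 bytes of data stored
+ * 0x20 bytes past the data_rloc word itself.
+ */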
+
+/*
+ * Convert data_rloc to data_loc:
+ * data_rloc stores the offset from data_rloc itself, but data_loc
+ * stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+ void *);
+
+/* Fetch types */
+enum {
+ FETCH_MTD_reg = 0,
+ FETCH_MTD_stack,
+ FETCH_MTD_retval,
+ FETCH_MTD_memory,
+ FETCH_MTD_symbol,
+ FETCH_MTD_deref,
+ FETCH_MTD_bitfield,
+ FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+ const char *name; /* Name of type */
+ size_t size; /* Byte size of type */
+ int is_signed; /* Signed flag */
+ print_type_func_t print; /* Print functions */
+	const char *fmt;		/* Format string */
+ const char *fmttype; /* Name in format file */
+ /* Fetch functions */
+ fetch_func_t fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+ fetch_func_t fn;
+ void *data;
+};
+
+struct probe_arg {
+ struct fetch_param fetch;
+ struct fetch_param fetch_size;
+ unsigned int offset; /* Offset from argument entry */
+ const char *name; /* Name of this argument */
+ const char *comm; /* Command of this argument */
+ const struct fetch_type *type; /* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+ struct pt_regs *regs, void *dest)
+{
+ return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
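+/* (e.g. "counter_1" is accepted; "1counter" and "my-counter" are not) */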
+static int is_good_name(const char *name)
+{
+ if (!isalpha(*name) && *name != '_')
+ return 0;
+ while (*++name != '\0') {
+ if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+ return 0;
+ }
+ return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg);
+
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+ const char __user *buffer, size_t count, loff_t *ppos,
+ int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));

2011-03-14 14:16:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 1/20] 1: mm: Move replace_page() to mm/memory.c

On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> User bkpt will use background page replacement approach to insert/delete
> breakpoints. Background page replacement approach is based on
> replace_page. Hence replace_page() loses its static attribute.
>

Just a nitpick, but since replace_page() is being moved, could you
specify that in the change log. Something like:

"Hence, replace_page() is moved from ksm.c into memory.c and its static
attribute is removed."

I like to see in the change log "move x to y" when that is actually
done, because it is hard to see if anything actually changed when code
is moved. Ideally it is best to move code in one patch and make the
change in another. If you do cut another version of this patch set,
could you do that. This alone is not enough to require a new release.

Thanks,

-- Steve

> Signed-off-by: Srikar Dronamraju <[email protected]>
> Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
> ---
> include/linux/mm.h | 2 ++
> mm/ksm.c | 62 ----------------------------------------------------
> mm/memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 64 insertions(+), 62 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 679300c..01a0740 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -984,6 +984,8 @@ void account_page_writeback(struct page *page);
> int set_page_dirty(struct page *page);
> int set_page_dirty_lock(struct page *page);
> int clear_page_dirty_for_io(struct page *page);
> +int replace_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage, pte_t orig_pte);
>
> /* Is the vma a continuation of the stack vma above it? */
> static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
> diff --git a/mm/ksm.c b/mm/ksm.c
> index c2b2a94..f46e20d 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -765,68 +765,6 @@ out:
> return err;
> }
>
> -/**
> - * replace_page - replace page in vma by new ksm page
> - * @vma: vma that holds the pte pointing to page
> - * @page: the page we are replacing by kpage
> - * @kpage: the ksm page we replace page by
> - * @orig_pte: the original value of the pte
> - *
> - * Returns 0 on success, -EFAULT on failure.
> - */
> -static int replace_page(struct vm_area_struct *vma, struct page *page,
> - struct page *kpage, pte_t orig_pte)
> -{
> - struct mm_struct *mm = vma->vm_mm;
> - pgd_t *pgd;
> - pud_t *pud;
> - pmd_t *pmd;
> - pte_t *ptep;
> - spinlock_t *ptl;
> - unsigned long addr;
> - int err = -EFAULT;
> -
> - addr = page_address_in_vma(page, vma);
> - if (addr == -EFAULT)
> - goto out;
> -
> - pgd = pgd_offset(mm, addr);
> - if (!pgd_present(*pgd))
> - goto out;
> -
> - pud = pud_offset(pgd, addr);
> - if (!pud_present(*pud))
> - goto out;
> -
> - pmd = pmd_offset(pud, addr);
> - BUG_ON(pmd_trans_huge(*pmd));
> - if (!pmd_present(*pmd))
> - goto out;
> -
> - ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> - if (!pte_same(*ptep, orig_pte)) {
> - pte_unmap_unlock(ptep, ptl);
> - goto out;
> - }
> -
> - get_page(kpage);
> - page_add_anon_rmap(kpage, vma, addr);
> -
> - flush_cache_page(vma, addr, pte_pfn(*ptep));
> - ptep_clear_flush(vma, addr, ptep);
> - set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> -
> - page_remove_rmap(page);
> - if (!page_mapped(page))
> - try_to_free_swap(page);
> - put_page(page);
> -
> - pte_unmap_unlock(ptep, ptl);
> - err = 0;
> -out:
> - return err;
> -}
> -
> static int page_trans_compound_anon_split(struct page *page)
> {
> int ret = 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 5823698..2a3021c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2669,6 +2669,68 @@ void unmap_mapping_range(struct address_space *mapping,
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> +/**
> + * replace_page - replace page in vma by new ksm page
> + * @vma: vma that holds the pte pointing to page
> + * @page: the page we are replacing by kpage
> + * @kpage: the ksm page we replace page by
> + * @orig_pte: the original value of the pte
> + *
> + * Returns 0 on success, -EFAULT on failure.
> + */
> +int replace_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage, pte_t orig_pte)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int err = -EFAULT;
> +
> + addr = page_address_in_vma(page, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + BUG_ON(pmd_trans_huge(*pmd));
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!pte_same(*ptep, orig_pte)) {
> + pte_unmap_unlock(ptep, ptl);
> + goto out;
> + }
> +
> + get_page(kpage);
> + page_add_anon_rmap(kpage, vma, addr);
> +
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +
> + page_remove_rmap(page);
> + if (!page_mapped(page))
> + try_to_free_swap(page);
> + put_page(page);
> +
> + pte_unmap_unlock(ptep, ptl);
> + err = 0;
> +out:
> + return err;
> +}
> +
> int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
> {
> struct address_space *mapping = inode->i_mapping;

2011-03-14 15:29:14

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> +/*
> + * Most architectures can use the default versions of @read_opcode(),
> + * @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
> + *
> + * @set_ip:
> + * Set the instruction pointer in @regs to @vaddr.
> + * @analyze_insn:
> + * Analyze @user_bkpt->insn. Return 0 if @user_bkpt->insn is an
> + * instruction you can probe, or a negative errno (typically -%EPERM)
> + * otherwise. Determine what sort of

sort of ... what?

-- Steve

> + * @pre_xol:
> + * @post_xol:
> + * XOL-related fixups @post_xol() (and possibly @pre_xol()) will need
> + * to do for this instruction, and annotate @user_bkpt accordingly.
> + * You may modify @user_bkpt->insn (e.g., the x86_64 port does this
> + * for rip-relative instructions).
> + */

2011-03-14 15:39:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> +/*
> + * Called with tsk->mm->mmap_sem held (either for read or write and
> + * with a reference to tsk->mm.
> + */
> +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> + unsigned long vaddr, uprobe_opcode_t opcode)
> +{
> + struct page *old_page, *new_page;
> + void *vaddr_old, *vaddr_new;
> + struct vm_area_struct *vma;
> + spinlock_t *ptl;
> + pte_t *orig_pte;
> + unsigned long addr;
> + int ret = -EINVAL;
> +
> + /* Read the page with vaddr into memory */
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> + if (ret <= 0)
> + return -EINVAL;
> + ret = -EINVAL;
> +
> + /*
> + * check if the page we are interested is read-only mapped
> + * Since we are interested in text pages, Our pages of interest
> + * should be mapped read-only.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> + (VM_READ|VM_EXEC))
> + goto put_out;
> +

I'm confused by the above comment and code. You state we are only
interested in text pages mapped read-only, but then if the page is mapped
read/exec we exit out? It is fine if it is anything but READ/EXEC.

I'm also curious as to why we can't modify text code that is also mapped as
read/write.

-- Steve

2011-03-14 16:00:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> +/* Returns 0 if it can install one probe */
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head tmp_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> + int ret = -1;
> +
> + if (!inode || !consumer || consumer->next)
> + return -EINVAL;
> + uprobe = uprobes_add(inode, offset);

What happens if uprobes_add() returns NULL?

-- Steve

> + INIT_LIST_HEAD(&tmp_list);
> +
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers) {
> + ret = 0;
> + goto consumers_add;
> + }

2011-03-14 16:58:53

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, Mar 14, 2011 at 07:04:33PM +0530, Srikar Dronamraju wrote:
> +/**
> + * read_opcode - read the opcode at a given virtual address.
> + * @tsk: the probed task.
> + * @vaddr: the virtual address to store the opcode.
> + * @opcode: location to store the read opcode.
> + *
> + * For task @tsk, read the opcode at @vaddr and store it in @opcode.
> + * Return 0 (success) or a negative errno.
> + */
> +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> + uprobe_opcode_t *opcode)
> +{
> + struct vm_area_struct *vma;
> + struct page *page;
> + void *vaddr_new;
> + int ret;
> +
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> + if (ret <= 0)
> + return -EFAULT;
> + ret = -EFAULT;
> +
> + /*
> + * check if the page we are interested is read-only mapped
> + * Since we are interested in text pages, Our pages of interest
> + * should be mapped read-only.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> + (VM_READ|VM_EXEC))
> + goto put_out;
> +
> + lock_page(page);
> + vaddr_new = kmap_atomic(page, KM_USER0);
> + vaddr &= ~PAGE_MASK;
> + memcpy(&opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> + kunmap_atomic(vaddr_new, KM_USER0);
> + unlock_page(page);
> + ret = uprobe_opcode_sz;

This looks wrong. We should be setting ret = 0 on success here?

> +
> +put_out:
> + put_page(page); /* we did a get_page in the beginning */
> + return ret;
> +}

--
steve

2011-03-14 17:08:23

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 1/20] 1: mm: Move replace_page() to mm/memory.c

* Steven Rostedt <[email protected]> [2011-03-14 10:16:35]:

> On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> > User bkpt will use background page replacement approach to insert/delete
> > breakpoints. Background page replacement approach is based on
> > replace_page. Hence replace_page() loses its static attribute.
> >
>
> Just a nitpick, but since replace_page() is being moved, could you
> specify that in the change log. Something like:
>
> "Hence, replace_page() is moved from ksm.c into memory.c and its static
> attribute is removed."

Okay, I will take care to say "moved from ksm.c into memory.c" in the
next version of the patchset.

>
> I like to see in the change log "move x to y" when that is actually
> done, because it is hard to see if anything actually changed when code
> is moved. Ideally it is best to move code in one patch and make the

As discussed on IRC, moving and removing the static attribute had to
be one patch so that mm/ksm.c compiles correctly. The other option we
have is to remove the static attribute first and then move the
function.

> change in another. If you do cut another version of this patch set,
> could you do that. This alone is not enough to require a new release.
>

--
Thanks and Regards
Srikar

2011-03-14 17:14:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 1/20] 1: mm: Move replace_page() to mm/memory.c

On Mon, 2011-03-14 at 22:32 +0530, Srikar Dronamraju wrote:
> * Steven Rostedt <[email protected]> [2011-03-14 10:16:35]:
>
> > On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> > > User bkpt will use background page replacement approach to insert/delete
> > > breakpoints. Background page replacement approach is based on
> > > replace_page. Hence replace_page() loses its static attribute.
> > >
> >
> > Just a nitpick, but since replace_page() is being moved, could you
> > specify that in the change log. Something like:
> >
> > "Hence, replace_page() is moved from ksm.c into memory.c and its static
> > attribute is removed."
>
> Okay, I will take care to say "moved from ksm.c into memory.c" in the
> next version of the patchset.


Thanks!

> > I like to see in the change log "move x to y" when that is actually
> > done, because it is hard to see if anything actually changed when code
> > is moved. Ideally it is best to move code in one patch and make the
>
> As discussed on IRC, moving and removing the static attribute had to
> be one patch so that mm/ksm.c compiles correctly. The other option we
> have is to remove the static attribute first and then move the
> function.

Hmm, maybe that would be a good idea, since it is really two changes:
one is to make it global for other users. I'm not even sure why you
moved it; the change log for the move can explain that.

-- Steve

>
> > change in another. If you do cut another version of this patch set,
> > could you do that. This alone is not enough to require a new release.
> >
>

2011-03-14 17:30:38

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

* Steven Rostedt <[email protected]> [2011-03-14 11:38:57]:

> On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> > +/*
> > + * Called with tsk->mm->mmap_sem held (either for read or write and
> > + * with a reference to tsk->mm.
> > + */
> > +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> > + unsigned long vaddr, uprobe_opcode_t opcode)
> > +{
> > + struct page *old_page, *new_page;
> > + void *vaddr_old, *vaddr_new;
> > + struct vm_area_struct *vma;
> > + spinlock_t *ptl;
> > + pte_t *orig_pte;
> > + unsigned long addr;
> > + int ret = -EINVAL;
> > +
> > + /* Read the page with vaddr into memory */
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> > + if (ret <= 0)
> > + return -EINVAL;
> > + ret = -EINVAL;
> > +
> > + /*
> > + * check if the page we are interested is read-only mapped
> > + * Since we are interested in text pages, Our pages of interest
> > + * should be mapped read-only.
> > + */
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
> > + goto put_out;
> > +
>
> I'm confused by the above comment and code. You state we are only
> interested in text pages mapped read-only, but then if the page is mapped
> read/exec we exit out? It is fine if it is anything but READ/EXEC.

You are right, it should have been

	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
						(VM_READ|VM_EXEC))
		goto put_out;


Your comment applied for read_opcode function too.
Will correct in the next version of the patchset.

However in the next patch, where we replace the above with
valid_vma and that does the right thing.

>
> I'm also curious as to why we can't modify text code that is also mapped as
> read/write.
>

If text code is mapped read/write, then under memory pressure the page
gets written back to disk. Hence inserted breakpoints may end up in the
on-disk copy, modifying the actual file.

--
Thanks and Regards
Srikar

2011-03-14 17:35:01

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 1/20] 1: mm: Move replace_page() to mm/memory.c

> >
> > As discussed in IRC, moving and removing the static attribute had to
> > be one patch so that mm/ksm.c compiles correctly. The other option we
> > have is to remove the static attribute first and then moving the
> > function.
>
> Hmm, maybe that may be a good idea. Since it is really two changes. One
> is to make it global for other usages. I'm not even sure why you moved
> it. The change log for the move can explain that.
>

Unlike mm/memory.c, mm/ksm.c is compiled only when CONFIG_KSM is set.
So if we don't move it out of ksm.c, we will have to make uprobes
dependent on CONFIG_KSM. Since replace_page() is the only function we
are interested in, it's better to move it out of ksm.c rather than
making uprobes dependent on CONFIG_KSM.


> -- Steve
>
> >
>

2011-03-14 17:35:57

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

* Stephen Wilson <[email protected]> [2011-03-14 12:58:18]:

> On Mon, Mar 14, 2011 at 07:04:33PM +0530, Srikar Dronamraju wrote:
> > +/**
> > + * read_opcode - read the opcode at a given virtual address.
> > + * @tsk: the probed task.
> > + * @vaddr: the virtual address to store the opcode.
> > + * @opcode: location to store the read opcode.
> > + *
> > + * For task @tsk, read the opcode at @vaddr and store it in @opcode.
> > + * Return 0 (success) or a negative errno.
> > + */
> > +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> > + uprobe_opcode_t *opcode)
> > +{
> > + struct vm_area_struct *vma;
> > + struct page *page;
> > + void *vaddr_new;
> > + int ret;
> > +
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> > + if (ret <= 0)
> > + return -EFAULT;
> > + ret = -EFAULT;
> > +
> > + /*
> > + * check if the page we are interested is read-only mapped
> > + * Since we are interested in text pages, Our pages of interest
> > + * should be mapped read-only.
> > + */
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
> > + goto put_out;
> > +
> > + lock_page(page);
> > + vaddr_new = kmap_atomic(page, KM_USER0);
> > + vaddr &= ~PAGE_MASK;
> > + memcpy(&opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> > + kunmap_atomic(vaddr_new, KM_USER0);
> > + unlock_page(page);
> > + ret = uprobe_opcode_sz;
>
> This looks wrong. We should be setting ret = 0 on success here?

Right, I should have set ret = 0 here.

>
> > +
> > +put_out:
> > + put_page(page); /* we did a get_page in the beginning */
> > + return ret;
> > +}
>
> --
> steve
>

2011-03-14 17:38:31

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

* Steven Rostedt <[email protected]> [2011-03-14 12:00:33]:

> On Mon, 2011-03-14 at 19:04 +0530, Srikar Dronamraju wrote:
> > +/* Returns 0 if it can install one probe */
> > +int register_uprobe(struct inode *inode, loff_t offset,
> > + struct uprobe_consumer *consumer)
> > +{
> > + struct prio_tree_iter iter;
> > + struct list_head tmp_list;
> > + struct address_space *mapping;
> > + struct mm_struct *mm, *tmpmm;
> > + struct vm_area_struct *vma;
> > + struct uprobe *uprobe;
> > + int ret = -1;
> > +
> > + if (!inode || !consumer || consumer->next)
> > + return -EINVAL;
> > + uprobe = uprobes_add(inode, offset);
>
> What happens if uprobes_add() returns NULL?
>
Right again, I should have added a check for uprobes_add() returning
NULL.

--
Thanks and Regards
Srikar

2011-03-14 18:09:58

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Mon, Mar 14, 2011 at 07:05:22PM +0530, Srikar Dronamraju wrote:
> static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> {
> - int ret = 0;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
>
> - /*TODO: install breakpoint */
> - if (!ret)
> + get_task_struct(mm->owner);
> + tsk = mm->owner;
> + if (!tsk)
> + return ret;

I think you need to check that tsk != NULL before calling
get_task_struct()...


> static int remove_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> {
> - int ret = 0;
> + struct task_struct *tsk;
> + int ret;
> +
> + get_task_struct(mm->owner);
> + tsk = mm->owner;
> + if (!tsk)
> + return -EINVAL;

And here as well.

--
steve

2011-03-14 18:21:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, 2011-03-14 at 22:54 +0530, Srikar Dronamraju wrote:
> >
> > I'm also curious as to why we can't modify text code that is also mapped as
> > read/write.
> >
>
> If text code is mapped read/write, then under memory pressure the page
> gets written back to disk. Hence inserted breakpoints may end up in the
> on-disk copy, modifying the actual file.
>
>
OK, could you put a comment about that too.

Thanks,

-- Steve

2011-03-14 18:22:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, 2011-03-14 at 22:54 +0530, Srikar Dronamraju wrote:
> >
> > I'm confused by the above comment and code. You state we are only
> > interested in text pages mapped read-only, but then if the page is mapped
> > read/exec we exit out? It is fine if it is anything but READ/EXEC.
>
> You are right, it should have been
> 	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) !=
> 						(VM_READ|VM_EXEC))
> 		goto put_out;
>
>
Golden rule #12: When the comments do not match the code, they probably
are both wrong ;)

-- Steve

2011-03-14 23:31:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Mon, 14 Mar 2011 19:04:03 +0530
Srikar Dronamraju <[email protected]> wrote:

> Advantages of uprobes over conventional debugging include:
>
> 1. Non-disruptive.
> Unlike current ptrace based mechanisms, uprobes tracing wouldnt
> involve signals, stopping threads and context switching between the
> tracer and tracee.
>
> 2. Much better handling of multithreaded programs because of XOL.
> Current ptrace based mechanisms use single stepping inline, i.e they
> copy back the original instruction on hitting a breakpoint. In such
> mechanisms tracers have to stop all the threads on a breakpoint hit or
> tracers will not be able to handle all hits to the location of
> interest. Uprobes uses execution out of line, where the instruction to
> be traced is analysed at the time of breakpoint insertion and a copy
> of instruction is stored at a different location. On breakpoint hit,
> uprobes jumps to that copied location and singlesteps the same
> instruction and does the necessary fixups post singlestepping.
>
> 3. Multiple tracers for an application.
> Multiple uprobes based tracer could work in unison to trace an
> application. There could one tracer that could be interested in
> generic events for a particular set of process. While there could be
> another tracer that is just interested in one specific event of a
> particular process thats part of the previous set of process.
>
> 4. Corelating events from kernels and userspace.
> Uprobes could be used with other tools like kprobes, tracepoints or as
> part of higher level tools like perf to give a consolidated set of
> events from kernel and userspace. In future we could look at a single
> backtrace showing application, library and kernel calls.

How do you envisage these features actually get used? For example,
will gdb be modified? Will other debuggers be modified or written?

IOW, I'm trying to get an understanding of how you expect this feature
will actually become useful to end users - the kernel patch is only
part of the story.

2011-03-14 23:47:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

> IOW, I'm trying to get an understanding of how you expect this feature
> will actually become useful to end users - the kernel patch is only
> part of the story.

One user would be systemtap for user tracing as I understand. Systemtap has a
userbase (at least I use it, although not for user tracing)

Right now lots of distros apply ugly patchkits to handle this instead,
which is not good (tm).

-Andi
--
[email protected] -- Speaking for myself only.

2011-03-15 00:23:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 15 Mar 2011, Andi Kleen wrote:

> > IOW, I'm trying to get an understanding of how you expect this feature
> > will actually become useful to end users - the kernel patch is only
> > part of the story.
>
> One user would be systemtap for user tracing as I understand. Systemtap has a
> userbase (at least I use it, although not for user tracing)

That's a brilliant reason to drop it right away. systemtap is the
least of our worries.

> Right now lots of distros apply ugly patchkits to handle this instead,
> which is not good (tm).

s/lots of distros/some enterprise distros where the management still
believes that this is a valuable feature add/

You deliberately skipped the first part of akpms question:

> How do you envisage these features actually get used? For example,
> will gdb be modified? Will other debuggers be modified or written?

How about answering this question first _BEFORE_ advertising
systemtap?

Thanks,

tglx

2011-03-15 01:13:57

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes


akpm wrote:

> [...] How do you envisage these features actually get used?

Patch #20/20 in the set includes an ftrace-flavoured debugfs frontend.
Previous versions of the patchset included perf front-ends too, which
are probably to be seen again.

> For example, will gdb be modified? Will other debuggers be modified
> or written? [...]

The code is not currently useful to gdb. Perhaps ptrace or an
improved userspace ABI can get access to it in the form of a
breakpoint-management interface, though this inode+offset
style of uprobe addressing would require adaptation to the
process-virtual-address style used by debugging APIs.
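
For instance, given a vma that maps the probed file, the forward
translation would be roughly the following (a sketch only, untested,
and the helper name is made up):

	/* sketch: turn an inode+offset probe into a virtual address */
	static unsigned long probe_vaddr(struct vm_area_struct *vma,
					 loff_t offset)
	{
		/* file offset of the first byte mapped by this vma */
		loff_t vm_off = (loff_t)vma->vm_pgoff << PAGE_SHIFT;

		return vma->vm_start + (unsigned long)(offset - vm_off);
	}

with the reverse mapping needed whenever a debugger hands us a process
virtual address instead.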

- FChE

2011-03-15 01:36:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Mon, 14 Mar 2011, Frank Ch. Eigler wrote:
> akpm wrote:
>
> > [...] How do you envisage these features actually get used?
>
> Patch #20/20 in the set includes an ftrace-flavoured debugfs frontend.

And you really think that:

# cd /sys/kernel/debug/tracing/

# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh

# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree

# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events

# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420

> TODO: Documentation/trace/uprobetrace.txt

without reasonable documentation on how to use it is a brilliant
argument?

> Previous versions of the patchset included perf front-ends too, which
> are probably to be seen again.

Ahh, probably. What does that mean?

And if that probably happens, what interface is that supposed to
use?

The above magic wrapped into perf?

Or some sensible implementation?

Thanks,

tglx

2011-03-15 02:03:35

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

> >
> > 4. Corelating events from kernels and userspace.
> > Uprobes could be used with other tools like kprobes, tracepoints or as
> > part of higher level tools like perf to give a consolidated set of
> > events from kernel and userspace. In future we could look at a single
> > backtrace showing application, library and kernel calls.
>
> How do you envisage these features actually get used? For example,
> will gdb be modified? Will other debuggers be modified or written?
>
> IOW, I'm trying to get an understanding of how you expect this feature
> will actually become useful to end users - the kernel patch is only
> part of the story.
>

Right. So I am looking at having perf probe support for uprobes and
also at having a syscall, so that perf probe and other ptrace users can
use this infrastructure. Ingo has already asked for a syscall in one of
his replies to my previous patchset. From the discussions I have had
with ptrace users, they have shown interest in using this breakpoint
infrastructure.

I am still not sure if this feature should be exposed through a new
ptrace operation (something like SET_BP), a completely new syscall, or
both. I would be happy if people here could give input on which route
to take.
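
Just to make that concrete, a standalone syscall could look something
like the below. This is purely a hypothetical sketch; the name,
arguments and flags are only placeholders for discussion:

	/* hypothetical sketch, not part of this patchset */
	long sys_uprobe(int fd, loff_t offset, unsigned int flags);

where fd identifies the probed file (matching the inode:offset
addressing used by this patchset) and flags would select things like
delivery and filtering behaviour.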

We had perf probe support for uprobes implemented in a few of the
previous patchsets. When the underlying implementation changed from
pid:vaddr to file:offset, the way we invoke perf probe changed.

Previously we would do
"perf probe -p 3692 zfree@zsh"

Now we would do
"perf probe -u zfree@zsh"

The perf probe support for uprobes is still WIP. I will post the
patches when they are in a usable state. This should be pretty soon.

If you have suggestions on how this infrastructure could be used
above perf probe and syscall, then please do let me know.

--
Thanks and Regards
Srikar

2011-03-15 02:52:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Mon, 2011-03-14 at 16:30 -0700, Andrew Morton wrote:
>
> How do you envisage these features actually get used? For example,
> will gdb be modified? Will other debuggers be modified or written?
>
> IOW, I'm trying to get an understanding of how you expect this feature
> will actually become useful to end users - the kernel patch is only
> part of the story.

I'm hoping it solves this question:

https://lkml.org/lkml/2011/3/10/347

-- Steve

2011-03-15 05:27:33

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

Hi Thomas,

> >
> > > [...] How do you envisage these features actually get used?
> >
> > Patch #20/20 in the set includes an ftrace-flavoured debugfs frontend.
>
> And you really think that:
>
> # cd /sys/kernel/debug/tracing/
>
> # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
> 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
>
> # objdump -T /bin/zsh | grep -w zfree
> 0000000000446420 g DF .text 0000000000000012 Base zfree
>
> # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
>
> # cat uprobe_events
> p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
>
> > TODO: Documentation/trace/uprobetrace.txt
>
> without reasonable documentation on how to use it is a brilliant
> argument?

We had fairly decent documentation for uprobes and the uprobe tracer,
but it had to be changed along with the change in the underlying design
of the uprobes infrastructure. Since the uprobe tracer is one of the
user interfaces, I plan to document it soon. However, it would be great
to have input on how we should design the syscall.

>
> > Previous versions of the patchset included perf front-ends too, which
> > are probably to be seen again.
>
> Ahh, probably. What does that mean?
>
> And if that probably happens, what interface is that supposed to
> use?
>
> The above magic wrapped into perf?

For the initial perf probe support, yes, it would be a wrapper over the
uprobe tracer. That is because we can reuse most of the current perf
probe, and we could prototype and show users what can be achieved.
However, we plan to make perf probe depend on the syscall once the
syscall takes shape.

>
> Or some sensible implementation?

Would a syscall-based perf probe implementation count as a sensible
implementation? My current plan was to code up the perf probe for
uprobes and then draft a proposal for how the syscall should look.
There are still open questions on how we should allow filtering, and
what restrictions we should place on the syscall-defined handler. I
would like to hear from you and others on these. If you have ideas on
doing it other than using a syscall, then please do let me know.

I know that getting the user interface right is very important.
However, I think it also depends on what the infrastructure can
provide. So if we can decide on the kernel ABI and the underlying
design (i.e., can we use replace_page() based background page
replacement, are there issues with the XOL slot based mechanism that
we are using, etc.), we can work towards providing a stable user ABI
that even normal users can use. For now I am concentrating on getting
the underlying infrastructure correct.


Please let me know your thoughts.

--
Thanks and Regards
Srikar

2011-03-15 09:29:10

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

* Stephen Wilson <[email protected]> [2011-03-14 14:09:14]:

> On Mon, Mar 14, 2011 at 07:05:22PM +0530, Srikar Dronamraju wrote:
> > static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> > {
> > - int ret = 0;
> > + struct task_struct *tsk;
> > + int ret = -EINVAL;
> >
> > - /*TODO: install breakpoint */
> > - if (!ret)
> > + get_task_struct(mm->owner);
> > + tsk = mm->owner;
> > + if (!tsk)
> > + return ret;
>
> I think you need to check that tsk != NULL before calling
> get_task_struct()...
>

I guess checking for tsk != NULL only helps if we do it within an RCU
read-side critical section, i.e., we would have to change it to
something like this:

	tsk = NULL;
	rcu_read_lock();
	if (mm->owner) {
		get_task_struct(mm->owner);
		tsk = mm->owner;
	}
	rcu_read_unlock();
	if (!tsk)
		return ret;

Agree?

2011-03-15 11:04:14

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 15 Mar 2011, Srikar Dronamraju wrote:
> > > TODO: Documentation/trace/uprobetrace.txt
> >
> > without a reasonable documentation how to use that is a brilliant
> > argument?
>
> We had fairly decent documentation for uprobes and the uprobe tracer,
> but it had to be changed along with the change in the underlying design
> of the uprobes infrastructure. Since the uprobe tracer is one of the
> user interfaces, I plan to document it soon. However, it would be great
> to have input on how we should design the syscall.

Ok.

> > Or some sensible implementation ?
>
> Would a syscall-based perf probe implementation count as a sensible
> implementation? My current plan was to code up the perf probe for

Yes.

> uprobes and then draft a proposal for how the syscall should look.
> There are still open questions on how we should allow filtering, and
> what restrictions we should place on the syscall-defined handler. I
> would like to hear from you and others on these. If you have ideas on
> doing it other than using a syscall, then please do let me know.

I don't think that anything other than a proper syscall interface is
going to work out.

> I know that getting the user interface right is very important.
> However, I think it also depends on what the infrastructure can
> provide. So if we can decide on the kernel ABI and the underlying
> design (i.e., can we use replace_page() based background page
> replacement, are there issues with the XOL slot based mechanism that
> we are using, etc.), we can work towards providing a stable user ABI
> that even normal users can use. For now I am concentrating on getting
> the underlying infrastructure correct.

Fair enough. I'll go through the existing patchset and comment there.

Thanks,

tglx

2011-03-15 11:43:57

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

>
> > uprobes and then draft a proposal for how the syscall should look.
> > There are still open questions on how we should allow filtering, and
> > what restrictions we should place on the syscall-defined handler. I
> > would like to hear from you and others on these. If you have ideas on
> > doing it other than using a syscall, then please do let me know.
>
> I don't think that anything other than a proper syscall interface is
> going to work out.

Okay,

>
> > I know that getting the user interface right is very important.
> > However, I think it also depends on what the infrastructure can
> > provide. So if we can decide on the kernel ABI and the underlying
> > design (i.e., can we use replace_page() based background page
> > replacement, are there issues with the XOL slot based mechanism that
> > we are using, etc.), we can work towards providing a stable user ABI
> > that even normal users can use. For now I am concentrating on getting
> > the underlying infrastructure correct.
>
> Fair enough. I'll go through the existing patchset and comment there.
>

Thanks for taking a look at the code. I look forward to your
comments.

--
Thanks and Regards
Srikar

2011-03-15 13:23:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> +/*
> + * Called with tsk->mm->mmap_sem held (either for read or write and
> + * with a reference to tsk->mm

Hmm, why is holding it for read sufficient?
> + */
> +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> + unsigned long vaddr, uprobe_opcode_t opcode)
> +{
> + struct page *old_page, *new_page;
> + void *vaddr_old, *vaddr_new;
> + struct vm_area_struct *vma;
> + spinlock_t *ptl;
> + pte_t *orig_pte;
> + unsigned long addr;
> + int ret = -EINVAL;

That initialization is pointless.

> + /* Read the page with vaddr into memory */
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> + if (ret <= 0)
> + return -EINVAL;
> + ret = -EINVAL;
> +
> + /*
> + * check if the page we are interested is read-only mapped
> + * Since we are interested in text pages, Our pages of interest
> + * should be mapped read-only.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> + (VM_READ|VM_EXEC))
> + goto put_out;

IIRC, text pages are (VM_READ|VM_EXEC)

> + /* Allocate a page */
> + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
> + if (!new_page) {
> + ret = -ENOMEM;
> + goto put_out;
> + }
> +
> + /*
> + * lock page will serialize against do_wp_page()'s
> + * PageAnon() handling
> + */
> + lock_page(old_page);
> + /* copy the page now that we've got it stable */
> + vaddr_old = kmap_atomic(old_page, KM_USER0);
> + vaddr_new = kmap_atomic(new_page, KM_USER1);
> +
> + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> + /* poke the new insn in, ASSUMES we don't cross page boundary */

And what makes sure that we don't?

> + addr = vaddr;
> + vaddr &= ~PAGE_MASK;
> + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> +
> + kunmap_atomic(vaddr_new, KM_USER1);
> + kunmap_atomic(vaddr_old, KM_USER0);
> +
> + orig_pte = page_check_address(old_page, tsk->mm, addr, &ptl, 0);
> + if (!orig_pte)
> + goto unlock_out;
> + pte_unmap_unlock(orig_pte, ptl);
> +
> + lock_page(new_page);
> + if (!anon_vma_prepare(vma))

Why don't you get the error code of anon_vma_prepare()?

> + /* flip pages, do_wp_page() will fail pte_same() and bail */

-ENOPARSE

> + ret = replace_page(vma, old_page, new_page, *orig_pte);
> +
> + unlock_page(new_page);
> + if (ret != 0)
> + page_cache_release(new_page);
> +unlock_out:
> + unlock_page(old_page);
> +
> +put_out:
> + put_page(old_page); /* we did a get_page in the beginning */
> + return ret;
> +}
> +
> +/**
> + * read_opcode - read the opcode at a given virtual address.
> + * @tsk: the probed task.
> + * @vaddr: the virtual address to store the opcode.
> + * @opcode: location to store the read opcode.
> + *
> + * For task @tsk, read the opcode at @vaddr and store it in @opcode.
> + * Return 0 (success) or a negative errno.

Wants to be called with mmap_sem held as well, right ?

> + */
> +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> + uprobe_opcode_t *opcode)
> +{
> + struct vm_area_struct *vma;
> + struct page *page;
> + void *vaddr_new;
> + int ret;
> +
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> + if (ret <= 0)
> + return -EFAULT;
> + ret = -EFAULT;
> +
> + /*
> + * check if the page we are interested is read-only mapped
> + * Since we are interested in text pages, Our pages of interest
> + * should be mapped read-only.
> + */
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> + (VM_READ|VM_EXEC))
> + goto put_out;

Same as above

> + lock_page(page);
> + vaddr_new = kmap_atomic(page, KM_USER0);
> + vaddr &= ~PAGE_MASK;
> + memcpy(&opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> + kunmap_atomic(vaddr_new, KM_USER0);
> + unlock_page(page);
> + ret = uprobe_opcode_sz;

ret = 0 ?? At least, that's what the comment above says.

> +
> +put_out:
> + put_page(page); /* we did a get_page in the beginning */
> + return ret;
> +}
> +

2011-03-15 13:40:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
>
> +static int valid_vma(struct vm_area_struct *vma)

bool perhaps ?

> +{
> + if (!vma->vm_file)
> + return 0;
> +
> + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> + (VM_READ|VM_EXEC))

Looks more correct than the code it replaces :)

> + return 1;
> +
> + return 0;
> +}
> +

> +static struct rb_root uprobes_tree = RB_ROOT;
> +static DEFINE_MUTEX(uprobes_mutex);
> +static DEFINE_SPINLOCK(treelock);

Why do you need a mutex and a spinlock ? Also the mutex is not
referenced.

> +static int match_inode(struct uprobe *uprobe, struct inode *inode,
> + struct rb_node **p)
> +{
> + struct rb_node *n = *p;
> +
> + if (inode < uprobe->inode)
> + *p = n->rb_left;
> + else if (inode > uprobe->inode)
> + *p = n->rb_right;
> + else
> + return 1;
> + return 0;
> +}
> +
> +static int match_offset(struct uprobe *uprobe, loff_t offset,
> + struct rb_node **p)
> +{
> + struct rb_node *n = *p;
> +
> + if (offset < uprobe->offset)
> + *p = n->rb_left;
> + else if (offset > uprobe->offset)
> + *p = n->rb_right;
> + else
> + return 1;
> + return 0;
> +}
> +
> +
> +/* Called with treelock held */
> +static struct uprobe *__find_uprobe(struct inode * inode,
> + loff_t offset, struct rb_node **near_match)
> +{
> + struct rb_node *n = uprobes_tree.rb_node;
> + struct uprobe *uprobe, *u = NULL;
> +
> + while (n) {
> + uprobe = rb_entry(n, struct uprobe, rb_node);
> + if (match_inode(uprobe, inode, &n)) {
> + if (near_match)
> + *near_match = n;
> + if (match_offset(uprobe, offset, &n)) {
> + atomic_inc(&uprobe->ref);
> + u = uprobe;
> + break;
> + }
> + }
> + }
> + return u;
> +}
> +
> +/*
> + * Find a uprobe corresponding to a given inode:offset
> + * Acquires treelock
> + */
> +static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
> +{
> + struct uprobe *uprobe;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&treelock, flags);
> + uprobe = __find_uprobe(inode, offset, NULL);
> + spin_unlock_irqrestore(&treelock, flags);

What's the calling context ? Do we really need a spinlock here for
walking the rb tree ?

> +
> +/* Should be called lock-less */

-ENOPARSE

> +static void put_uprobe(struct uprobe *uprobe)
> +{
> + if (atomic_dec_and_test(&uprobe->ref))
> + kfree(uprobe);
> +}
> +
> +static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
> +{
> + struct uprobe *uprobe, *cur_uprobe;
> +
> + __iget(inode);
> + uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
> +
> + if (!uprobe) {
> + iput(inode);
> + return NULL;
> + }

Please move the __iget() after the kzalloc()

> + uprobe->inode = inode;
> + uprobe->offset = offset;
> +
> + /* add to uprobes_tree, sorted on inode:offset */
> + cur_uprobe = insert_uprobe(uprobe);
> +
> + /* a uprobe exists for this inode:offset combination*/
> + if (cur_uprobe) {
> + kfree(uprobe);
> + uprobe = cur_uprobe;
> + iput(inode);
> + } else
> + init_rwsem(&uprobe->consumer_rwsem);

Please init stuff _before_ inserting not afterwards.

> +
> + return uprobe;
> +}
> +
> +/* Acquires uprobe->consumer_rwsem */
> +static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> +{
> + struct uprobe_consumer *consumer;
> +
> + down_read(&uprobe->consumer_rwsem);
> + consumer = uprobe->consumers;
> + while (consumer) {
> + if (!consumer->filter || consumer->filter(consumer, current))
> + consumer->handler(consumer, regs);
> +
> + consumer = consumer->next;
> + }
> + up_read(&uprobe->consumer_rwsem);
> +}
> +
> +/* Acquires uprobe->consumer_rwsem */
> +static void add_consumer(struct uprobe *uprobe,
> + struct uprobe_consumer *consumer)
> +{
> + down_write(&uprobe->consumer_rwsem);
> + consumer->next = uprobe->consumers;
> + uprobe->consumers = consumer;
> + up_write(&uprobe->consumer_rwsem);
> + return;

pointless return

> +}
> +
> +/* Acquires uprobe->consumer_rwsem */

I'd prefer a comment about the return code over this redundant
information.

> +static int del_consumer(struct uprobe *uprobe,
> + struct uprobe_consumer *consumer)
> +{

2011-03-15 13:43:22

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Tue, 2011-03-15 at 14:38 +0100, Thomas Gleixner wrote:
> > +{
> > + if (!vma->vm_file)
> > + return 0;
> > +
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
>
> Looks more correct than the code it replaces :)
>
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
>
Yeah, we had this discussion already ;)

-- Steve

2011-03-15 13:48:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Tue, 2011-03-15 at 14:52 +0530, Srikar Dronamraju wrote:
> * Stephen Wilson <[email protected]> [2011-03-14 14:09:14]:
>

Nitpick:

> Guess checking for tsk != NULL would only help if and only if we are doing
> within rcu. i.e we have to change to something like this
>

tsk = NULL;

> rcu_read_lock()
> if (mm->owner) {
> get_task_struct(mm->owner)
> tsk = mm->owner;
> }
> rcu_read_unlock()
> if (!tsk)
> return ret;
>
> Agree?

Or:

rcu_read_lock();
tsk = mm->owner;
if (tsk)
get_task_struct(tsk);
rcu_read_unlock();
if (!tsk)
return ret;

Probably looks cleaner.

-- Steve

2011-03-15 14:30:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> +/* Returns 0 if it can install one probe */
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head tmp_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> + int ret = -1;
> +
> + if (!inode || !consumer || consumer->next)
> + return -EINVAL;
> + uprobe = uprobes_add(inode, offset);

Does uprobes_add() always succeed ?

> + INIT_LIST_HEAD(&tmp_list);
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers) {
> + ret = 0;
> + goto consumers_add;
> + }
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + loff_t vaddr;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> + if (!valid_vma(vma)) {
> + mmput(mm);
> + continue;
> + }
> +
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr > ULONG_MAX) {
> + /*
> + * We cannot have a virtual address that is
> + * greater than ULONG_MAX
> + */
> + mmput(mm);
> + continue;
> + }
> + mm->uprobes_vaddr = (unsigned long) vaddr;
> + list_add(&mm->uprobes_list, &tmp_list);
> + }
> + spin_unlock(&mapping->i_mmap_lock);
> +
> + if (list_empty(&tmp_list)) {
> + ret = 0;
> + goto consumers_add;
> + }
> + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> + down_read(&mm->mmap_sem);
> + if (!install_uprobe(mm, uprobe))
> + ret = 0;

Installing it once is success ?

> + list_del(&mm->uprobes_list);

Also the locking rules for mm->uprobes_list want to be
documented. They are completely non obvious.

> + up_read(&mm->mmap_sem);
> + mmput(mm);
> + }
> +
> +consumers_add:
> + add_consumer(uprobe, consumer);
> + mutex_unlock(&uprobes_mutex);
> + put_uprobe(uprobe);

Why do we drop the refcount here?

> + return ret;
> +}

> + /*
> + * There could be other threads that could be spinning on
> + * treelock; some of these threads could be interested in this
> + * uprobe. Give these threads a chance to run.
> + */
> + synchronize_sched();

This makes no sense at all. We are not holding treelock, we are about
to acquire it. Also what does it matter when they spin on treelock and
are interested in this uprobe. Either they find it before we remove it
or not. So why synchronize_sched()? I find the lifetime rules of
uprobe utterly confusing. Could you explain please ?

> + spin_lock_irqsave(&treelock, flags);
> + rb_erase(&uprobe->rb_node, &uprobes_tree);
> + spin_unlock_irqrestore(&treelock, flags);
> + iput(uprobe->inode);
> +
> +put_unlock:
> + mutex_unlock(&uprobes_mutex);
> + put_uprobe(uprobe);
> +}

2011-03-15 14:38:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> +/*
> + * TODO:
> + * - Where necessary, examine the modrm byte and allow only valid instructions
> + * in the different Groups and fpu instructions.
> + */
> +
> +static bool is_prefix_bad(struct insn *insn)
> +{
> + int i;
> +
> + for (i = 0; i < insn->prefixes.nbytes; i++) {
> + switch (insn->prefixes.bytes[i]) {
> + case 0x26: /*INAT_PFX_ES */
> + case 0x2E: /*INAT_PFX_CS */
> + case 0x36: /*INAT_PFX_DS */
> + case 0x3E: /*INAT_PFX_SS */
> + case 0xF0: /*INAT_PFX_LOCK */
> + return 1;

true

> + }
> + }
> + return 0;

false

> +}

> +static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
> +{
> + insn_init(insn, uprobe->insn, false);
> +
> + /* Skip good instruction prefixes; reject "bad" ones. */
> + insn_get_opcode(insn);
> + if (is_prefix_bad(insn)) {
> + report_bad_prefix();
> + return -EPERM;

-ENOTSUPP perhaps. That's not a permission problem

> + }

> +/**
> + * analyze_insn - instruction analysis including validity and fixups.
> + * @tsk: the probed task.
> + * @uprobe: the probepoint information.
> + * Return 0 on success or a -ve number on error.
> + */
> +int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
> +{
> + int ret;
> + struct insn insn;
> +
> + uprobe->fixups = 0;
> +#ifdef CONFIG_X86_64
> + uprobe->arch_info.rip_rela_target_address = 0x0;
> +#endif

Please get rid of this #ifdef and use inlines (empty for 32bit)

> +
> + if (is_32bit_app(tsk))
> + ret = validate_insn_32bits(uprobe, &insn);
> + else
> + ret = validate_insn_64bits(uprobe, &insn);
> + if (ret != 0)
> + return ret;
> +#ifdef CONFIG_X86_64

Ditto

> + ret = handle_riprel_insn(uprobe, &insn);
> + if (ret == -1)
> + /* rip-relative; can't XOL */
> + return 0;

So we return -1 from handle_riprel_insn() and return success? Btw how
does handle_riprel_insn() deal with 32bit user space ?

> +#endif
> + prepare_fixups(uprobe, &insn);
> + return 0;

Thanks,

tglx

2011-03-15 14:43:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> {
> - int ret = 0;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
>
> - /*TODO: install breakpoint */
> - if (!ret)
> + get_task_struct(mm->owner);

Increment task ref before checking for NULL ?

> + tsk = mm->owner;
> + if (!tsk)
> + return ret;
> +

Thanks,

tglx

2011-03-15 16:15:54

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Tue, Mar 15, 2011 at 09:47:59AM -0400, Steven Rostedt wrote:
> > rcu_read_lock()
> > if (mm->owner) {
> > get_task_struct(mm->owner)
> > tsk = mm->owner;
> > }
> > rcu_read_unlock()
> > if (!tsk)
> > return ret;
> >
> > Agree?
>
> Or:
>
> rcu_read_lock();
> tsk = mm->owner;
> if (tsk)
> get_task_struct(tsk);
> rcu_read_unlock();
> if (!tsk)
> return ret;
>
> Probably looks cleaner.

Yes, plus we should do "tsk = rcu_dereference(mm->owner);" and wrap the
whole thing in a static uprobes_get_mm_owner() or similar.


--
steve

2011-03-15 16:21:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Tue, 2011-03-15 at 12:15 -0400, Stephen Wilson wrote:

> Yes, plus we should do "tsk = rcu_dereference(mm->owner);" and wrap the
> whole thing in a static uprobes_get_mm_owner() or similar.

Ack.

-- Steve


2011-03-15 16:24:00

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

> >
> > rcu_read_lock();
> > tsk = mm->owner;
> > if (tsk)
> > get_task_struct(tsk);
> > rcu_read_unlock();
> > if (!tsk)
> > return ret;
> >
> > Probably looks cleaner.
>
> Yes, plus we should do "tsk = rcu_dereference(mm->owner);" and wrap the
> whole thing in a static uprobes_get_mm_owner() or similar.
>

Okay, will do that.

--
Thanks and Regards
Srikar

2011-03-15 16:31:10

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

* Thomas Gleixner <[email protected]> [2011-03-15 15:41:20]:

> On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> > static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> > {
> > - int ret = 0;
> > + struct task_struct *tsk;
> > + int ret = -EINVAL;
> >
> > - /*TODO: install breakpoint */
> > - if (!ret)
> > + get_task_struct(mm->owner);
>
> Increment task ref before checking for NULL ?

In response to earlier comments/suggestions from Stephen Wilson, we
resolved to handle it this way


static struct task_struct *uprobes_get_mm_owner(struct mm_struct *mm)
{
	struct task_struct *tsk;

	rcu_read_lock();
	tsk = rcu_dereference(mm->owner);
	if (tsk)
		get_task_struct(tsk);
	rcu_read_unlock();
	return tsk;
}

Both install_uprobe and remove_uprobe will end up calling uprobes_get_mm_owner().


--
Thanks and Regards
Srikar

2011-03-15 17:21:49

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

* Thomas Gleixner <[email protected]> [2011-03-15 15:28:04]:

> On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> > +/* Returns 0 if it can install one probe */
> > +int register_uprobe(struct inode *inode, loff_t offset,
> > + struct uprobe_consumer *consumer)
> > +{
> > + struct prio_tree_iter iter;
> > + struct list_head tmp_list;
> > + struct address_space *mapping;
> > + struct mm_struct *mm, *tmpmm;
> > + struct vm_area_struct *vma;
> > + struct uprobe *uprobe;
> > + int ret = -1;
> > +
> > + if (!inode || !consumer || consumer->next)
> > + return -EINVAL;
> > + uprobe = uprobes_add(inode, offset);
>
> Does uprobes_add() always succeed ?
>

Steve already made this comment. I am adding a check for a NULL return
from uprobes_add() and will return immediately in that case.

> > + INIT_LIST_HEAD(&tmp_list);
> > + mapping = inode->i_mapping;
> > +
> > + mutex_lock(&uprobes_mutex);
> > + if (uprobe->consumers) {
> > + ret = 0;
> > + goto consumers_add;
> > + }
> > +
> > + spin_lock(&mapping->i_mmap_lock);
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> > + loff_t vaddr;
> > +
> > + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> > + continue;
> > +
> > + mm = vma->vm_mm;
> > + if (!valid_vma(vma)) {
> > + mmput(mm);
> > + continue;
> > + }
> > +
> > + vaddr = vma->vm_start + offset;
> > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > + if (vaddr > ULONG_MAX) {
> > + /*
> > + * We cannot have a virtual address that is
> > + * greater than ULONG_MAX
> > + */
> > + mmput(mm);
> > + continue;
> > + }
> > + mm->uprobes_vaddr = (unsigned long) vaddr;
> > + list_add(&mm->uprobes_list, &tmp_list);
> > + }
> > + spin_unlock(&mapping->i_mmap_lock);
> > +
> > + if (list_empty(&tmp_list)) {
> > + ret = 0;
> > + goto consumers_add;
> > + }
> > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > + down_read(&mm->mmap_sem);
> > + if (!install_uprobe(mm, uprobe))
> > + ret = 0;
>
> Installing it once is success ?

This is a little tricky. My intention was to return success even if one
install is successful. If we return an error, then the caller can go
ahead and free the consumer. Since we return success if there are
currently no processes that have mapped this inode, I was tempted to
return success on at least one successful install.

>
> > + list_del(&mm->uprobes_list);
>
> Also the locking rules for mm->uprobes_list want to be
> documented. They are completely non obvious.
>
> > + up_read(&mm->mmap_sem);
> > + mmput(mm);
> > + }
> > +
> > +consumers_add:
> > + add_consumer(uprobe, consumer);
> > + mutex_unlock(&uprobes_mutex);
> > + put_uprobe(uprobe);
>
> Why do we drop the refcount here?

The first time uprobes_add gets called for a unique inode:offset
pair, it sets the refcount to 2 (one for the uprobe creation and the
other for the register activity). From then on it increments the
refcount by 1 for each register.
The refcount dropped here corresponds to the register activity.

Similarly unregister takes a refcount through find_uprobe and drops it
through del_consumer(). However it drops the creation refcount if and
only if there are no more consumers.

I thought of taking the refcount just for the first register and
decrementing it for the last unregister. However register/unregister
can race with each other, causing the refcount to drop to zero and free
the uprobe structure even though we were still registering the probe.
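
In other words, the intended lifetime is roughly this (a sketch of the
call flow only, with the names used in the patch):

	register_uprobe()
	    uprobes_add()	/* new uprobe: ref = 2 (creation + register) */
				/* existing uprobe: ref++ (register) */
	    add_consumer()
	    put_uprobe()	/* drop the register ref */

	unregister_uprobe()
	    find_uprobe()	/* ref++ */
	    del_consumer()	/* drop the ref taken via find_uprobe */
	    if no consumers are left:
	        put_uprobe()	/* drop the creation ref; frees the uprobe */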

>
> > + return ret;
> > +}
>
> > + /*
> > + * There could be other threads that could be spinning on
> > + * treelock; some of these threads could be interested in this
> > + * uprobe. Give these threads a chance to run.
> > + */
> > + synchronize_sched();
>
> This makes no sense at all. We are not holding treelock, we are about
> to acquire it. Also what does it matter when they spin on treelock and
> are interested in this uprobe. Either they find it before we remove it
> or not. So why synchronize_sched()? I find the lifetime rules of
> uprobe utterly confusing. Could you explain please ?

There could be threads that have hit the breakpoint and are
entering the notifier code (interrupt context) and then
do_notify_resume (task context), trying to acquire the treelock.
(treelock is held by the breakpoint-hit threads in
uprobe_notify_resume, which gets called from do_notify_resume().) The
current thread that is removing the uprobe from the rb_tree can race
with these threads and might acquire the treelock before some of the
breakpoint-hit threads. If this happens, the interrupted threads have
to re-read the opcode to see if the breakpoint location no longer has
the breakpoint instruction and retry the instruction. However, before a
thread can detect and retry, some other thread might insert a
breakpoint at that location. This can go on in a loop.

The other option would be for the interrupted threads to turn that into
a signal. However I am not sure if this is a good option at all,
especially since we already have the breakpoint removed.

To avoid this, I am planning to give some __extra__ time for the
breakpoint-hit threads to compete and win the race for the spinlock
with the unregistering thread.


--
Thanks and Regards
Srikar

>
> > + spin_lock_irqsave(&treelock, flags);
> > + rb_erase(&uprobe->rb_node, &uprobes_tree);
> > + spin_unlock_irqrestore(&treelock, flags);
> > + iput(uprobe->inode);
> > +

2011-03-15 17:36:53

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

* Thomas Gleixner <[email protected]> [2011-03-15 14:38:33]:

> On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> >
> > +static int valid_vma(struct vm_area_struct *vma)
>
> bool perpaps ?

Okay,

>
> > +{
> > + if (!vma->vm_file)
> > + return 0;
> > +
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
>
> Looks more correct than the code it replaces :)

Yes, Steven Rostedt has already pointed this out.

>
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
>
> > +static struct rb_root uprobes_tree = RB_ROOT;
> > +static DEFINE_MUTEX(uprobes_mutex);
> > +static DEFINE_SPINLOCK(treelock);
>
> Why do you need a mutex and a spinlock ? Also the mutex is not
> referenced.

We can move the mutex to the next patch (i.e. the register/unregister
patch), where it gets used. For now the mutex serializes
register_uprobe/unregister_uprobe/mmap_uprobe.
This mutex is the one that guards mm->uprobes_list.

>
> > +/*
> > + * Find a uprobe corresponding to a given inode:offset
> > + * Acquires treelock
> > + */
> > +static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
> > +{
> > + struct uprobe *uprobe;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&treelock, flags);
> > + uprobe = __find_uprobe(inode, offset, NULL);
> > + spin_unlock_irqrestore(&treelock, flags);
>
> What's the calling context ? Do we really need a spinlock here for
> walking the rb tree ?
>

find_uprobe() gets called from unregister_uprobe and, on a probe hit,
from uprobe_notify_resume. I am not sure it is a good idea to walk the
tree while the tree is changing because of an insertion or deletion of
a probe.

> > +
> > +/* Should be called lock-less */
>
> -ENOPARSE
>
> > +static void put_uprobe(struct uprobe *uprobe)
> > +{
> > + if (atomic_dec_and_test(&uprobe->ref))
> > + kfree(uprobe);
> > +}
> > +
> > +static struct uprobe *uprobes_add(struct inode *inode, loff_t offset)
> > +{
> > + struct uprobe *uprobe, *cur_uprobe;
> > +
> > + __iget(inode);
> > + uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
> > +
> > + if (!uprobe) {
> > + iput(inode);
> > + return NULL;
> > + }
>
> Please move the __iget() after the kzalloc()

Okay.

>
> > + uprobe->inode = inode;
> > + uprobe->offset = offset;
> > +
> > + /* add to uprobes_tree, sorted on inode:offset */
> > + cur_uprobe = insert_uprobe(uprobe);
> > +
> > + /* a uprobe exists for this inode:offset combination*/
> > + if (cur_uprobe) {
> > + kfree(uprobe);
> > + uprobe = cur_uprobe;
> > + iput(inode);
> > + } else
> > + init_rwsem(&uprobe->consumer_rwsem);
>
> Please init stuff _before_ inserting not afterwards.

Okay.

>
> > +
> > + return uprobe;
> > +}
> > +
> > +/* Acquires uprobe->consumer_rwsem */
> > +static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
> > +{
> > + struct uprobe_consumer *consumer;
> > +
> > + down_read(&uprobe->consumer_rwsem);
> > + consumer = uprobe->consumers;
> > + while (consumer) {
> > + if (!consumer->filter || consumer->filter(consumer, current))
> > + consumer->handler(consumer, regs);
> > +
> > + consumer = consumer->next;
> > + }
> > + up_read(&uprobe->consumer_rwsem);
> > +}
> > +
> > +/* Acquires uprobe->consumer_rwsem */
> > +static void add_consumer(struct uprobe *uprobe,
> > + struct uprobe_consumer *consumer)
> > +{
> > + down_write(&uprobe->consumer_rwsem);
> > + consumer->next = uprobe->consumers;
> > + uprobe->consumers = consumer;
> > + up_write(&uprobe->consumer_rwsem);
> > + return;
>
> pointless return

Okay,

>
> > +}
> > +
> > +/* Acquires uprobe->consumer_rwsem */
>
> I'd prefer a comment about the return code over this redundant
> information.
>

Okay,

> > +static int del_consumer(struct uprobe *uprobe,
> > + struct uprobe_consumer *consumer)
> > +{

2011-03-15 17:47:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Tue, 2011-03-15 at 22:45 +0530, Srikar Dronamraju wrote:
> > > + }
> > > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > > + down_read(&mm->mmap_sem);
> > > + if (!install_uprobe(mm, uprobe))
> > > + ret = 0;
> >
> > Installing it once is success ?
>
> This is a little tricky. My intention was to return success even if one
> install is successful. If we return error, then the caller can go
> ahead and free the consumer. Since we return success if there are
> currently no processes that have mapped this inode, I was tempted to
> return success on atleast one successful install.

What about an all or nothing approach. If one fails, remove all that
were installed, and give up.
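
Roughly, reusing the names from the patch (a sketch only;
remove_installed_uprobes() is a hypothetical rollback helper):

	ret = 0;
	list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
		down_read(&mm->mmap_sem);
		if (!ret)
			ret = install_uprobe(mm, uprobe);
		list_del(&mm->uprobes_list);
		up_read(&mm->mmap_sem);
		mmput(mm);	/* always drop the reference taken earlier */
	}
	if (ret)
		remove_installed_uprobes(uprobe); /* undo the successful ones */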

-- Steve

2011-03-15 17:51:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Tue, 2011-03-15 at 13:47 -0400, Steven Rostedt wrote:
> On Tue, 2011-03-15 at 22:45 +0530, Srikar Dronamraju wrote:
> > > > + }
> > > > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > > > + down_read(&mm->mmap_sem);
> > > > + if (!install_uprobe(mm, uprobe))
> > > > + ret = 0;
> > >
> > > Installing it once is success ?
> >
> > This is a little tricky. My intention was to return success even if one
> > install is successful. If we return error, then the caller can go
> > ahead and free the consumer. Since we return success if there are
> > currently no processes that have mapped this inode, I was tempted to
> > return success on atleast one successful install.
>
> What about an all or nothing approach. If one fails, remove all that
> were installed, and give up.

That sounds like a much saner semantic and is what we generally do in
the kernel.

2011-03-15 17:57:14

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

* Thomas Gleixner <[email protected]> [2011-03-15 14:22:09]:

> On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> > +/*
> > + * Called with tsk->mm->mmap_sem held (either for read or write and
> > + * with a reference to tsk->mm
>
> Hmm, why is holding it for read sufficient?

We are not adding a new vma to the mm, but just replacing a page with
another after holding the locks for the pages. Existing routines
doing close to similar things, like
access_process_vm/get_user_pages, seem to be taking the read lock. Do
you see a reason why the read lock wouldn't suffice?
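
For reference, access_process_vm() in mm/memory.c does essentially the
following around its page writes, taking only the read lock:

	down_read(&mm->mmap_sem);
	/* write = 1, force = 1, just like write_opcode() above */
	ret = get_user_pages(tsk, mm, addr, 1, 1, 1, &page, &vma);
	/* ... use the page ... */
	up_read(&mm->mmap_sem);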

> .
> > + */
> > +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> > + unsigned long vaddr, uprobe_opcode_t opcode)
> > +{
> > + struct page *old_page, *new_page;
> > + void *vaddr_old, *vaddr_new;
> > + struct vm_area_struct *vma;
> > + spinlock_t *ptl;
> > + pte_t *orig_pte;
> > + unsigned long addr;
> > + int ret = -EINVAL;
>
> That initialization is pointless.
Okay,

>
> > + /* Read the page with vaddr into memory */
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> > + if (ret <= 0)
> > + return -EINVAL;
> > + ret = -EINVAL;
> > +
> > + /*
> > + * check if the page we are interested is read-only mapped
> > + * Since we are interested in text pages, Our pages of interest
> > + * should be mapped read-only.
> > + */
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
> > + goto put_out;
>
> IIRC then text pages are (VM_READ|VM_EXEC)

Steven Rostedt already pointed this out.


>
> > + /* Allocate a page */
> > + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
> > + if (!new_page) {
> > + ret = -ENOMEM;
> > + goto put_out;
> > + }
> > +
> > + /*
> > + * lock page will serialize against do_wp_page()'s
> > + * PageAnon() handling
> > + */
> > + lock_page(old_page);
> > + /* copy the page now that we've got it stable */
> > + vaddr_old = kmap_atomic(old_page, KM_USER0);
> > + vaddr_new = kmap_atomic(new_page, KM_USER1);
> > +
> > + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> > + /* poke the new insn in, ASSUMES we don't cross page boundary */
>
> And what makes sure that we don't ?

We are expecting the breakpoint instruction to be the minimum-size
instruction for that architecture. This wouldn't be a problem for
architectures that have fixed-length instructions.
For architectures that have variable-size instructions, I am
hoping that the opcode size will be small enough that it never
crosses a page boundary, something like 0xCC on x86. If and when we
support architectures that have variable-length instructions whose
breakpoint instruction can span a page boundary, we will have to add
more meat to take care of that case.
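
If we ever want that assumption checked at runtime, a cheap guard
before the memcpy() would do (a sketch, not part of the patch):

	/* the opcode write must not cross a page boundary */
	if (((vaddr & ~PAGE_MASK) + uprobe_opcode_sz) > PAGE_SIZE)
		goto put_out;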

>
> > + addr = vaddr;
> > + vaddr &= ~PAGE_MASK;
> > + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> > +
> > + kunmap_atomic(vaddr_new, KM_USER1);
> > + kunmap_atomic(vaddr_old, KM_USER0);
> > +
> > + orig_pte = page_check_address(old_page, tsk->mm, addr, &ptl, 0);
> > + if (!orig_pte)
> > + goto unlock_out;
> > + pte_unmap_unlock(orig_pte, ptl);
> > +
> > + lock_page(new_page);
> > + if (!anon_vma_prepare(vma))
>
> Why don't you get the error code of anon_vma_prepare()?

Okay, will capture the error code of anon_vma_prepare().

>
> > + /* flip pages, do_wp_page() will fail pte_same() and bail */
>
> -ENOPARSE
>
> > + ret = replace_page(vma, old_page, new_page, *orig_pte);
> > +
> > + unlock_page(new_page);
> > + if (ret != 0)
> > + page_cache_release(new_page);
> > +unlock_out:
> > + unlock_page(old_page);
> > +
> > +put_out:
> > + put_page(old_page); /* we did a get_page in the beginning */
> > + return ret;
> > +}
> > +
> > +/**
> > + * read_opcode - read the opcode at a given virtual address.
> > + * @tsk: the probed task.
> > + * @vaddr: the virtual address to store the opcode.
> > + * @opcode: location to store the read opcode.
> > + *
> > + * For task @tsk, read the opcode at @vaddr and store it in @opcode.
> > + * Return 0 (success) or a negative errno.
>
> Wants to called with mmap_sem held as well, right ?

Yes, will document.

>
> > + */
> > +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> > + uprobe_opcode_t *opcode)
> > +{
> > + struct vm_area_struct *vma;
> > + struct page *page;
> > + void *vaddr_new;
> > + int ret;
> > +
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> > + if (ret <= 0)
> > + return -EFAULT;
> > + ret = -EFAULT;
> > +
> > + /*
> > + * check if the page we are interested is read-only mapped
> > + * Since we are interested in text pages, Our pages of interest
> > + * should be mapped read-only.
> > + */
> > + if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
> > + (VM_READ|VM_EXEC))
> > + goto put_out;
>
> Same as above
>
> > + lock_page(page);
> > + vaddr_new = kmap_atomic(page, KM_USER0);
> > + vaddr &= ~PAGE_MASK;
> > + memcpy(&opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> > + kunmap_atomic(vaddr_new, KM_USER0);
> > + unlock_page(page);
> > + ret = uprobe_opcode_sz;
>
> ret = 0 ?? At least, that's what the comment above says.

This has already been pointed out by Stephen Wilson.
Setting ret = 0 here.

>
> > +
> > +put_out:
> > + put_page(page); /* we did a get_page in the beginning */
> > + return ret;
> > +}
> > +

2011-03-15 17:58:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Tue, 2011-03-15 at 14:52 +0530, Srikar Dronamraju wrote:
> * Stephen Wilson <[email protected]> [2011-03-14 14:09:14]:
>
> > On Mon, Mar 14, 2011 at 07:05:22PM +0530, Srikar Dronamraju wrote:
> > > static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> > > {
> > > - int ret = 0;
> > > + struct task_struct *tsk;
> > > + int ret = -EINVAL;
> > >
> > > - /*TODO: install breakpoint */
> > > - if (!ret)
> > > + get_task_struct(mm->owner);
> > > + tsk = mm->owner;
> > > + if (!tsk)
> > > + return ret;
> >
> > I think you need to check that tsk != NULL before calling
> > get_task_struct()...
> >
>
> Guess checking for tsk != NULL would only help if and only if we are doing
> within rcu. i.e we have to change to something like this
>
> rcu_read_lock()
> if (mm->owner) {
> get_task_struct(mm->owner)
> tsk = mm->owner;
> }
> rcu_read_unlock()
> if (!tsk)
> return ret;

so the whole mm->owner semantics seem vague, memcontrol.c doesn't seem
consistent in itself, one site uses rcu_dereference() the other site
doesn't.

Also, the assignments in kernel/fork.c and kernel/exit.c don't use
rcu_assign_pointer() and therefore lack the needed write barrier.

Git blames Balbir for this.

2011-03-15 18:04:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

On Tue, 15 Mar 2011, Srikar Dronamraju wrote:

> * Thomas Gleixner <[email protected]> [2011-03-15 14:22:09]:
>
> > On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> > > +/*
> > > + * Called with tsk->mm->mmap_sem held (either for read or write and
> > > + * with a reference to tsk->mm
> >
> > Hmm, why is holding it for read sufficient?
>
> We are not adding a new vma to the mm; but just replacing a page with
> another after holding the locks for the pages. Existing routines
> doing close to similar things like the
> access_process_vm/get_user_pages seem to be taking the read_lock. Do
> you see a resaon why readlock wouldnt suffice?

No, I just was confused by the comment. Probably should have asked why
you want to call it write locked.

Thanks,

tglx

2011-03-15 18:06:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

Hi Thomas,

> That's a brilliant reason to drop it right away. systemtap is the
> least of our worries.

Feeling defensive about something?

>
> > How do you envisage these features actually get used? For example,
> > will gdb be modified? Will other debuggers be modified or written?
>
> How about answering this question first _BEFORE_ advertising
> systemtap?

I thought this was obvious. systemtap is essentially a script-driven
debugger.

-Andi

--
[email protected] -- Speaking for myself only.

2011-03-15 18:10:33

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

* Peter Zijlstra <[email protected]> [2011-03-15 18:50:11]:

> On Tue, 2011-03-15 at 13:47 -0400, Steven Rostedt wrote:
> > On Tue, 2011-03-15 at 22:45 +0530, Srikar Dronamraju wrote:
> > > > > + }
> > > > > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > > > > + down_read(&mm->mmap_sem);
> > > > > + if (!install_uprobe(mm, uprobe))
> > > > > + ret = 0;
> > > >
> > > > Installing it once is success ?
> > >
> > > This is a little tricky. My intention was to return success even if one
> > > install is successful. If we return error, then the caller can go
> > > ahead and free the consumer. Since we return success if there are
> > > currently no processes that have mapped this inode, I was tempted to
> > > return success on atleast one successful install.
> >
> > What about an all or nothing approach. If one fails, remove all that
> > were installed, and give up.
>
> That sounds like a much saner semantic and is what we generally do in
> the kernel.

One of the install_uprobe calls could be failing because the process
was almost exiting, for example because there was no mm->owner. Also,
let's assume that the first few install_uprobes go through and the last
install_uprobe fails. There could be breakpoint hits corresponding to
the already installed uprobes that get displayed; i.e. all
breakpoint hits from the first install_uprobe to the time we detect a
failed install_uprobe and revert all inserted breakpoints will be
shown as being captured.

Also, register_uprobe will only install probes for processes that are
active and have mapped the inode. However install_uprobe called from
uprobe_mmap, which also registers the probe, can fail. Now should we
take the same yardstick and remove all the probes corresponding to the
successful register_uprobe?

2011-03-15 18:13:50

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 3/20] 3: uprobes: Background page replacement.

* Thomas Gleixner <[email protected]> [2011-03-15 19:03:44]:

> On Tue, 15 Mar 2011, Srikar Dronamraju wrote:
>
> > * Thomas Gleixner <[email protected]> [2011-03-15 14:22:09]:
> >
> > > On Mon, 14 Mar 2011, Srikar Dronamraju wrote:
> > > > +/*
> > > > + * Called with tsk->mm->mmap_sem held (either for read or write and
> > > > + * with a reference to tsk->mm
> > >
> > > Hmm, why is holding it for read sufficient?
> >
> > We are not adding a new vma to the mm; but just replacing a page with
> > another after holding the locks for the pages. Existing routines
> > doing close to similar things like the
> > access_process_vm/get_user_pages seem to be taking the read_lock. Do
> > you see a resaon why readlock wouldnt suffice?
>
> No, I just was confused by the comment. Probably should have asked why
> you want to call it write locked.

We no longer call it write-locked, so we can drop the reference to
the write lock.

2011-03-15 18:16:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Tue, 2011-03-15 at 23:34 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2011-03-15 18:50:11]:
>
> > On Tue, 2011-03-15 at 13:47 -0400, Steven Rostedt wrote:
> > > On Tue, 2011-03-15 at 22:45 +0530, Srikar Dronamraju wrote:
> > > > > > + }
> > > > > > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > > > > > + down_read(&mm->mmap_sem);
> > > > > > + if (!install_uprobe(mm, uprobe))
> > > > > > + ret = 0;
> > > > >
> > > > > Installing it once is success ?
> > > >
> > > > This is a little tricky. My intention was to return success even if one
> > > > install is successful. If we return error, then the caller can go
> > > > ahead and free the consumer. Since we return success if there are
> > > > currently no processes that have mapped this inode, I was tempted to
> > > > return success on atleast one successful install.
> > >
> > > What about an all or nothing approach. If one fails, remove all that
> > > were installed, and give up.
> >
> > That sounds like a much saner semantic and is what we generally do in
> > the kernel.
>
> One of the install_uprobe could be failing because the process was
> almost exiting, something like there was no mm->owner. Also lets
> assume that the first few install_uprobes go thro and the last
> install_uprobe fails. There could be breakpoint hits corresponding to
> the already installed uprobes that get displayed. i.e all
> breakpoint hits from the first install_uprobe to the time we detect a
> failed a install_uprobe and revert all inserted breakpoints will be
> shown as being captured.

I think you can gracefully deal with the exit case and simply ignore
that one. But you cannot let arbitrary installs fail and still report
success, that gives very weak and nearly useless semantics.

> Also register_uprobe will only install probes for processes that are
> actively and have mapped the inode. However install_uprobe called from
> uprobe_mmap which also registers the probe can fail. Now should we take
> the same yardstick and remove all the probes corresponding to the
> successful register_uprobe?

No, if uprobe_mmap() fails we should fail the mmap(), that also
guarantees consistency.

2011-03-15 18:17:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

On Tue, 15 Mar 2011, Srikar Dronamraju wrote:
> * Thomas Gleixner <[email protected]> [2011-03-15 15:28:04]:
> > > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list) {
> > > + down_read(&mm->mmap_sem);
> > > + if (!install_uprobe(mm, uprobe))
> > > + ret = 0;
> >
> > Installing it once is success ?
>
> This is a little tricky. My intention was to return success even if one
> install is successful. If we return error, then the caller can go
> ahead and free the consumer. Since we return success if there are
> currently no processes that have mapped this inode, I was tempted to
> return success on atleast one successful install.

Ok. Wants to be documented in a comment.

> >
> > > + list_del(&mm->uprobes_list);
> >
> > Also the locking rules for mm->uprobes_list want to be
> > documented. They are completely non obvious.
> >
> > > + up_read(&mm->mmap_sem);
> > > + mmput(mm);
> > > + }
> > > +
> > > +consumers_add:
> > > + add_consumer(uprobe, consumer);
> > > + mutex_unlock(&uprobes_mutex);
> > > + put_uprobe(uprobe);
> >
> > Why do we drop the refcount here?
>
> The first time uprobe_add gets called for a unique inode:offset
> pair, it sets the refcount to 2 (One for the uprobe creation and the
> other for register activity). From next time onwards it
> increments the refcount by (for register activity) 1.
> The refcount dropped here corresponds to the register activity.
>
> Similarly unregister takes a refcount thro find_uprobe and drops it thro
> del_consumer(). However it drops the creation refcount if and if
> there are no more consumers.

Ok. That wants a few comments perhaps. It's not really obvious.

> I thought of just taking the refcount just for the first register and
> decrement for the last unregister. However register/unregister can race
> with each other causing the refcount to be zero and free the uprobe
> structure even though we were still registering the probe.

Right, that won't work.

> >
> > > + return ret;
> > > +}
> >
> > > + /*
> > > + * There could be other threads that could be spinning on
> > > + * treelock; some of these threads could be interested in this
> > > + * uprobe. Give these threads a chance to run.
> > > + */
> > > + synchronize_sched();
> >
> > This makes no sense at all. We are not holding treelock, we are about
> > to acquire it. Also what does it matter when they spin on treelock and
> > are interested in this uprobe. Either they find it before we remove it
> > or not. So why synchronize_sched()? I find the lifetime rules of
> > uprobe utterly confusing. Could you explain please ?
>
> There could be threads that have hit the breakpoint and are
> entering the notifier code(interrupt context) and then
> do_notify_resume(task context) and trying to acquire the treelock.
> (treelock is held by the breakpoint hit threads in
> uprobe_notify_resume which gets called in do_notify_resume()) The
> current thread that is removing the uprobe from the rb_tree can race
> with these threads and might acquire the treelock before some of the
> breakpoint hit threads. If this happens the interrupted threads have
> to re-read the opcode to see if the breakpoint location no more has the
> breakpoint instruction and retry the instruction. However before it can
> detect and retry, some other thread might insert a breakpoint at that
> location. This can go in a loop.

Ok, that makes sense, but you want to put a lengthy explanation into
the comment above the synchronize_sched() call.

Thanks,

tglx

2011-03-15 18:58:52

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

* Peter Zijlstra <[email protected]> [2011-03-15 18:57:42]:

> On Tue, 2011-03-15 at 14:52 +0530, Srikar Dronamraju wrote:
> > * Stephen Wilson <[email protected]> [2011-03-14 14:09:14]:
> >
> > > On Mon, Mar 14, 2011 at 07:05:22PM +0530, Srikar Dronamraju wrote:
> > > > static int install_uprobe(struct mm_struct *mm, struct uprobe *uprobe)
> > > > {
> > > > - int ret = 0;
> > > > + struct task_struct *tsk;
> > > > + int ret = -EINVAL;
> > > >
> > > > - /*TODO: install breakpoint */
> > > > - if (!ret)
> > > > + get_task_struct(mm->owner);
> > > > + tsk = mm->owner;
> > > > + if (!tsk)
> > > > + return ret;
> > >
> > > I think you need to check that tsk != NULL before calling
> > > get_task_struct()...
> > >
> >
> > Guess checking for tsk != NULL would only help if and only if we are doing
> > within rcu. i.e we have to change to something like this
> >
> > rcu_read_lock()
> > if (mm->owner) {
> > get_task_struct(mm->owner)
> > tsk = mm->owner;
> > }
> > rcu_read_unlock()
> > if (!tsk)
> > return ret;
>
> so the whole mm->owner semantics seem vague, memcontrol.c doesn't seem
> consistent in itself, one site uses rcu_dereference() the other site
> doesn't.
>

mm->owner should be under rcu_read_lock, unless the task is exiting
and mm_count is 1. mm->owner is updated under task_lock().

> Also, the assignments in kernel/fork.c and kernel/exit.c don't use
> rcu_assign_pointer() and therefore lack the needed write barrier.
>

Those are paths where the only context using mm->owner is a single task

> Git blames Balbir for this.

I accept the blame and am willing to fix anything incorrect found in
the code.


--
Three Cheers,
Balbir

2011-03-15 19:10:24

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 11/20] 11: uprobes: slot allocation for uprobes

Just a couple of minor notes while I was looking at this code...

> +static struct uprobes_xol_area *xol_alloc_area(void)
> +{
> + struct uprobes_xol_area *area = NULL;
> +
> + area = kzalloc(sizeof(*area), GFP_USER);
> + if (unlikely(!area))
> + return NULL;
> +
> + area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> + GFP_USER);

Why GFP_USER? That causes extra allocation limits to be enforced. Given
that in part 14 you have:

+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+ unsigned long vaddr)
+{
+ xol_get_insn_slot(uprobe, vaddr);
+ BUG_ON(!current->utask->xol_vaddr);

It seems to me that you really don't want those allocations to fail.
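
If so, plain GFP_KERNEL would be the usual choice, e.g.:

	area = kzalloc(sizeof(*area), GFP_KERNEL);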

back to xol_alloc_area():

> + if (!area->bitmap)
> + goto fail;
> +
> + spin_lock_init(&area->slot_lock);
> + if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> + task_lock(current);
> + if (!current->mm->uprobes_xol_area) {
> + current->mm->uprobes_xol_area = area;
> + task_unlock(current);
> + return area;
> + }
> + task_unlock(current);
> + }
> +
> +fail:
> + if (area) {
> + if (area->bitmap)
> + kfree(area->bitmap);
> + kfree(area);
> + }

You've already checked area against NULL, and kfree() can handle null
pointers, so both of those tests are unneeded.
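
i.e. the tail of the function could be reduced to:

	fail:
		kfree(area->bitmap);	/* kfree(NULL) is a no-op */
		kfree(area);
		return current->mm->uprobes_xol_area;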

> + return current->mm->uprobes_xol_area;
> +}

jon

2011-03-15 19:28:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Wed, 2011-03-16 at 00:28 +0530, Balbir Singh wrote:
>
> mm->owner should be under rcu_read_lock, unless the task is exiting
> and mm_count is 1. mm->owner is updated under task_lock().
>
> > Also, the assignments in kernel/fork.c and kernel/exit.c don't use
> > rcu_assign_pointer() and therefore lack the needed write barrier.
> >
>
> Those are paths when the only context using the mm->owner is single
>
> > Git blames Balbir for this.
>
> I accept the blame and am willing to fix anything incorrect found in
> the code.

:-), ok sounds right, just wasn't entirely obvious when having a quick
look.

2011-03-15 19:32:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Tue, 2011-03-15 at 20:30 +0100, Peter Zijlstra wrote:
> On Wed, 2011-03-16 at 00:28 +0530, Balbir Singh wrote:

> > I accept the blame and am willing to fix anything incorrect found in
> > the code.
>
> :-), ok sounds right, just wasn't entirely obvious when having a quick
> look.

Does that mean we should be adding a comment there?

-- Steve

2011-03-15 19:41:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Tue, 15 Mar 2011, Srikar Dronamraju wrote:
> * Thomas Gleixner <[email protected]> [2011-03-15 14:38:33]:
> > > +/*
> > > + * Find a uprobe corresponding to a given inode:offset
> > > + * Acquires treelock
> > > + */
> > > +static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
> > > +{
> > > + struct uprobe *uprobe;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&treelock, flags);
> > > + uprobe = __find_uprobe(inode, offset, NULL);
> > > + spin_unlock_irqrestore(&treelock, flags);
> >
> > What's the calling context ? Do we really need a spinlock here for
> > walking the rb tree ?
> >
>
> find_uprobe() gets called from unregister_uprobe and on probe hit from
> uprobe_notify_resume. I am not sure if its a good idea to walk the tree
> as and when the tree is changing either because of a insertion or
> deletion of a probe.

I know that you cannot walk the tree lockless except you would use
some rcu based container for your probes.

Though my question is more whether this needs to be a spinlock or if
that could be replaced by a mutex. At least there is no reason to
disable interrupts. You cannot trap into a probe from the thread in
which you are installing/removing it.

Thanks,

tglx

2011-03-15 19:46:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Tue, 2011-03-15 at 20:22 +0100, Thomas Gleixner wrote:
> I am not sure if its a good idea to walk the tree
> > as and when the tree is changing either because of a insertion or
> > deletion of a probe.
>
> I know that you cannot walk the tree lockless except you would use
> some rcu based container for your probes.

You can in fact combine a seqlock, rb-trees and RCU to do lockless
walks.

https://lkml.org/lkml/2010/10/20/160

and

https://lkml.org/lkml/2010/10/20/437

But doing that would be an optimization best done once we get all this
working nicely.
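
For the curious, the shape of such a walk would be roughly this (a
sketch only; treeseq is a hypothetical seqcount bumped around every
tree update, and taking the uprobe reference would need
atomic_inc_not_zero() to be safe against a concurrent free):

	static seqcount_t treeseq = SEQCNT_ZERO;

	static struct uprobe *find_uprobe_lockless(struct inode *inode,
						loff_t offset)
	{
		struct uprobe *uprobe;
		unsigned int seq;

		rcu_read_lock();
		do {
			seq = read_seqcount_begin(&treeseq);
			uprobe = __find_uprobe(inode, offset, NULL);
			/* retry failed lookups that may have raced an update */
		} while (!uprobe && read_seqcount_retry(&treeseq, seq));
		rcu_read_unlock();
		return uprobe;
	}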

2011-03-15 19:51:31

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 17/20] 17: uprobes: filter chain




On Mon, Mar 14, 2011 at 07:07:22PM +0530, Srikar Dronamraju wrote:
>
> Loops through the filters callbacks of currently registered
> consumers to see if any consumer is interested in tracing this task.

Should this be part of the series? It is not currently used.

> /* Acquires uprobe->consumer_rwsem */
> +static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
> +{
> + struct uprobe_consumer *consumer;
> + bool ret = false;
> +
> + down_read(&uprobe->consumer_rwsem);
> + for (consumer = uprobe->consumers; consumer;
> + consumer = consumer->next) {
> + if (!consumer->filter || consumer->filter(consumer, t)) {

The implementation does not seem to match the changelog description.
Should this not be:

if (consumer->filter && consumer->filter(consumer, t))

?

> + ret = true;
> + break;
> + }
> + }
> + up_read(&uprobe->consumer_rwsem);
> + return ret;
> +}
> +

--
steve

2011-03-15 19:57:03

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 16/20] 16: uprobes: register a notifier for uprobes.

On Mon, Mar 14, 2011 at 07:07:08PM +0530, Srikar Dronamraju wrote:
> +static int __init init_uprobes(void)
> +{
> + register_die_notifier(&uprobes_exception_nb);
> + return 0;
> +}
> +

Although not currently needed, perhaps it would be best to return the
result of register_die_notifier() ?
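
i.e. simply:

	static int __init init_uprobes(void)
	{
		return register_die_notifier(&uprobes_exception_nb);
	}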

--
steve

2011-03-15 20:01:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 2011-03-15 at 20:43 +0100, Thomas Gleixner wrote:
> On Tue, 15 Mar 2011, Andi Kleen wrote:
> > > > How do you envisage these features actually get used? For example,
> > > > will gdb be modified? Will other debuggers be modified or written?
> > >
> > > How about answering this question first _BEFORE_ advertising
> > > systemtap?
> >
> > I thought this was obvious. systemtap is essentially a script driven
> > debugger.
>
> Oh thanks for the clarification. I always wondered why a computer
> would need a tap.
>
> And it does not matter at all whether systemtap can use this or
> not. If the main debuggers used like gdb are not going to use it then
> it's a complete waste. We don't need another debugging interface just
> for a single esoteric use case.

The question is, can we have a tracing interface? I don't care about a
debugging interface as PTRACE (although the ABI sucks) is fine for that.
But any type of live tracing it really sucks for.

Hopefully this will allow perf (and yes even LTTng and systemtap) to be
finally able to do seamless tracing between userspace and kernel space.
The only other thing we have now is PTRACE, and if you think that's
sufficient, then spend a day programming with it.


-- Steve

2011-03-15 20:02:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 15 Mar 2011, Andi Kleen wrote:
> > > How do you envisage these features actually get used? For example,
> > > will gdb be modified? Will other debuggers be modified or written?
> >
> > How about answering this question first _BEFORE_ advertising
> > systemtap?
>
> I thought this was obvious. systemtap is essentially a script driven
> debugger.

Oh thanks for the clarification. I always wondered why a computer
would need a tap.

And it does not matter at all whether systemtap can use this or
not. If the main debuggers used like gdb are not going to use it then
it's a complete waste. We don't need another debugging interface just
for a single esoteric use case.

Thanks,

tglx

2011-03-15 20:25:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 15 Mar 2011, Steven Rostedt wrote:

> On Tue, 2011-03-15 at 20:43 +0100, Thomas Gleixner wrote:
> > On Tue, 15 Mar 2011, Andi Kleen wrote:
> > > > > How do you envisage these features actually get used? For example,
> > > > > will gdb be modified? Will other debuggers be modified or written?
> > > >
> > > > How about answering this question first _BEFORE_ advertising
> > > > systemtap?
> > >
> > > I thought this was obvious. systemtap is essentially a script driven
> > > debugger.
> >
> > Oh thanks for the clarification. I always wondered why a computer
> > would need a tap.
> >
> > And it does not matter at all whether systemtap can use this or
> > not. If the main debuggers used like gdb are not going to use it then
> > it's a complete waste. We don't need another debugging interface just
> > for a single esoteric use case.
>
> The question is, can we have a tracing interface? I don't care about a
> debugging interface as PTRACE (although the ABI sucks) is fine for that.
> But any type of live tracing it really sucks for.
>
> Hopefully this will allow perf (and yes even LTTng and systemtap) to be
> finally able to do seamless tracing between userspace and kernel space.
> The only other thing we have now is PTRACE, and if you think that's
> sufficient, then spend a day programming with it.

I didn't say that ptrace rocks.

All I'm saying is that we want a better argument than a single user
which is (and yes, I tried it more than once) assbackwards beyond all
imagination.

If gdb, perf, trace can and will make use of it then we have sensible
arguments enough to go there. If systemtap can use it as well then I
have no problem with that.

Thanks,

tglx

2011-03-15 20:32:33

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 11/20] 11: uprobes: slot allocation for uprobes


On Mon, Mar 14, 2011 at 07:06:10PM +0530, Srikar Dronamraju wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index de3d10a..0afa0cd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -551,6 +551,7 @@ void mmput(struct mm_struct *mm)
> might_sleep();
>
> if (atomic_dec_and_test(&mm->mm_users)) {
> + uprobes_free_xol_area(mm);
> exit_aio(mm);
> ksm_exit(mm);
> khugepaged_exit(mm); /* must run before exit_mmap */
> @@ -677,6 +678,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> memcpy(mm, oldmm, sizeof(*mm));
>
> /* Initializing for Swap token stuff */
> +#ifdef CONFIG_UPROBES
> + mm->uprobes_xol_area = NULL;
> +#endif
> mm->token_priority = 0;
> mm->last_interval = 0;

Perhaps move the uprobes_xol_area initialization away from that comment?
A few lines down beside the hugepage #ifdef would read a bit better.


--
steve

2011-03-15 20:44:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Tue, 2011-03-15 at 21:09 +0100, Thomas Gleixner wrote:
> On Tue, 15 Mar 2011, Steven Rostedt wrote:

> I didn't say that ptrace rocks.
>
> All I'm saying is that we want a better argument than a single user
> which is - and yes I tried it more than once - assbackwards beyond all
> imagination.
>
> If gdb, perf, trace can and will make use of it then we have sensible

I'm more interested in perf/trace than gdb, as the way gdb is mostly
used (at least now) is to debug problems in the code with a big hammer
(single step, look at registers/variables). That is, gdb is usually very
interactive, and it's best to "stop the code" from running to examine
what has happened. gdb is not something you will run on an application
that is being used by others.

With perf/trace, things are different: you want the task to be affected
as little as possible by the tracer while it runs, perhaps even in a
production environment. This is a completely different paradigm.

If gdb uses it, great, but I don't think we should bend over backwards
to make it usable by gdb. Debugging and tracing are different, with
different requirements and needs.

> arguments enough to go there. If systemtap can use it as well then I
> have no problem with that.

Fair enough.

-- Steve

2011-03-15 22:42:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Tuesday, 15 March 2011 at 20:48 +0100, Peter Zijlstra wrote:
> On Tue, 2011-03-15 at 20:22 +0100, Thomas Gleixner wrote:
> > > I am not sure if it's a good idea to walk the tree
> > > as and when the tree is changing, either because of an insertion or
> > > deletion of a probe.
> >
> > I know that you cannot walk the tree lockless unless you use
> > some RCU-based container for your probes.
>
> You can in fact combine a seqlock, rb-trees and RCU to do lockless
> walks.
>
> https://lkml.org/lkml/2010/10/20/160
>
> and
>
> https://lkml.org/lkml/2010/10/20/437
>
> But doing that would be an optimization best done once we get all this
> working nicely.
>

We have such a scheme in net/ipv4/inetpeer.c's inet_getpeer() (using
a seqlock on the latest net-next-2.6 tree), but we added a counter to
make sure a reader could not enter an infinite loop while traversing the
tree (an AVL tree in the inetpeer case).
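
For illustration, a minimal sketch of that kind of lookup (invented
names, not the inetpeer or uprobes code): a seqlock-validated walk,
bounded by a counter as in inetpeer, with RCU assumed to keep freed
nodes safe to dereference.

    #include <linux/rbtree.h>
    #include <linux/seqlock.h>
    #include <linux/rcupdate.h>

    struct probe {
        struct rb_node node;
        unsigned long key;
    };

    static struct rb_root probe_root;
    static DEFINE_SEQLOCK(probe_seqlock);   /* writers take write_seqlock() */

    #define MAX_WALK 64  /* bound the walk in case rotations mislead us */

    /* Caller holds rcu_read_lock(); nodes are freed via RCU, so even a
     * stale pointer stays safe to dereference during the walk. */
    static struct probe *probe_lookup(unsigned long key)
    {
        struct probe *found;
        struct rb_node *n;
        unsigned int seq;
        int i;

        do {
            found = NULL;
            seq = read_seqbegin(&probe_seqlock);
            for (n = probe_root.rb_node, i = 0; n && i < MAX_WALK; i++) {
                struct probe *p = rb_entry(n, struct probe, node);

                if (key == p->key) {
                    found = p;
                    break;
                }
                n = key < p->key ? n->rb_left : n->rb_right;
            }
            /* a concurrent insert/erase restarts the walk */
        } while (read_seqretry(&probe_seqlock, seq));

        return found;
    }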


2011-03-16 04:57:07

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 11/20] 11: uprobes: slot allocation for uprobes

* Stephen Wilson <[email protected]> [2011-03-15 16:31:17]:

>
> On Mon, Mar 14, 2011 at 07:06:10PM +0530, Srikar Dronamraju wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index de3d10a..0afa0cd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -551,6 +551,7 @@ void mmput(struct mm_struct *mm)
> > might_sleep();
> >
> > if (atomic_dec_and_test(&mm->mm_users)) {
> > + uprobes_free_xol_area(mm);
> > exit_aio(mm);
> > ksm_exit(mm);
> > khugepaged_exit(mm); /* must run before exit_mmap */
> > @@ -677,6 +678,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> > memcpy(mm, oldmm, sizeof(*mm));
> >
> > /* Initializing for Swap token stuff */
> > +#ifdef CONFIG_UPROBES
> > + mm->uprobes_xol_area = NULL;
> > +#endif
> > mm->token_priority = 0;
> > mm->last_interval = 0;
>
> Perhaps move the uprobes_xol_area initialization away from that comment?
> A few lines down beside the hugepage #ifdef would read a bit better.


Okay, will do.


--
Thanks and Regards
Srikar

2011-03-16 05:04:36

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 11/20] 11: uprobes: slot allocation for uprobes

* Jonathan Corbet <[email protected]> [2011-03-15 13:10:20]:

> Just a couple of minor notes while I was looking at this code...
>
> > +static struct uprobes_xol_area *xol_alloc_area(void)
> > +{
> > + struct uprobes_xol_area *area = NULL;
> > +
> > + area = kzalloc(sizeof(*area), GFP_USER);
> > + if (unlikely(!area))
> > + return NULL;
> > +
> > + area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> > + GFP_USER);
>
> Why GFP_USER? That causes extra allocation limits to be enforced. Given
> that in part 14 you have:

Okay, will use GFP_KERNEL.
We used GFP_USER because we thought it was going to represent part of
the process address space.
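
So the two allocations in xol_alloc_area() would become simply:

    area = kzalloc(sizeof(*area), GFP_KERNEL);
    if (unlikely(!area))
        return NULL;

    area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
                           GFP_KERNEL);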

>
> +/* Prepare to single-step probed instruction out of line. */
> +static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
> + unsigned long vaddr)
> +{
> + xol_get_insn_slot(uprobe, vaddr);
> + BUG_ON(!current->utask->xol_vaddr);
>
> It seems to me that you really don't want those allocations to fail.
>
> back to xol_alloc_area():
>
> > + if (!area->bitmap)
> > + goto fail;
> > +
> > + spin_lock_init(&area->slot_lock);
> > + if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> > + task_lock(current);
> > + if (!current->mm->uprobes_xol_area) {
> > + current->mm->uprobes_xol_area = area;
> > + task_unlock(current);
> > + return area;
> > + }
> > + task_unlock(current);
> > + }
> > +
> > +fail:
> > + if (area) {
> > + if (area->bitmap)
> > + kfree(area->bitmap);
> > + kfree(area);
> > + }
>
> You've already checked area against NULL, and kfree() can handle null
> pointers, so both of those tests are unneeded.

Okay,
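
With that, the failure path collapses to the following (area is always
non-NULL once we reach the label, since the early return above handles
its allocation failure, and kfree(NULL) is a no-op):

    fail:
        kfree(area->bitmap);
        kfree(area);
        return current->mm->uprobes_xol_area;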

>
> > + return current->mm->uprobes_xol_area;
> > +}
>
> jon

--
Thanks and Regards
Srikar

2011-03-16 05:51:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

* Steven Rostedt <[email protected]> [2011-03-15 15:32:40]:

> On Tue, 2011-03-15 at 20:30 +0100, Peter Zijlstra wrote:
> > On Wed, 2011-03-16 at 00:28 +0530, Balbir Singh wrote:
>
> > > I accept the blame and am willing to fix anything incorrect found in
> > > the code.
> >
> > :-), ok sounds right, just wasn't entirely obvious when having a quick
> > look.
>
> Does that mean we should be adding a comment there?
>

This is what the current documentation looks like.

#ifdef CONFIG_MM_OWNER
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
#endif

Do you want me to document the fork/exit case?


--
Three Cheers,
Balbir

2011-03-16 07:52:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 4/20] 4: uprobes: Adding and remove a uprobe in a rb tree.

On Tue, 2011-03-15 at 23:42 +0100, Eric Dumazet wrote:
> On Tuesday, 15 March 2011 at 20:48 +0100, Peter Zijlstra wrote:
> > On Tue, 2011-03-15 at 20:22 +0100, Thomas Gleixner wrote:
> > > > I am not sure if it's a good idea to walk the tree
> > > > as and when the tree is changing, either because of an insertion or
> > > > deletion of a probe.
> > >
> > > I know that you cannot walk the tree lockless unless you use
> > > some RCU-based container for your probes.
> >
> > You can in fact combine a seqlock, rb-trees and RCU to do lockless
> > walks.
> >
> > https://lkml.org/lkml/2010/10/20/160
> >
> > and
> >
> > https://lkml.org/lkml/2010/10/20/437
> >
> > But doing that would be an optimization best done once we get all this
> > working nicely.
> >
>
> We have such a scheme in net/ipv4/inetpeer.c's inet_getpeer() (using
> a seqlock on the latest net-next-2.6 tree), but we added a counter to
> make sure a reader could not enter an infinite loop while traversing the tree

Right, Linus suggested a single lockless iteration, but a limited count
works too.

> (an AVL tree in the inetpeer case).

Ooh, there's an AVL implementation in the kernel? I have to ask, why not
use the RB-tree? (I know AVL has a slightly stronger balancing condition,
which reduces the worst-case depth from about 2*log(n) to about
1.44*log(n).)

Also, if it does make sense to have both an AVL and an RB implementation,
does it then also make sense to lift the AVL thing into generic code in
lib/ ?

2011-03-16 17:29:33

by Tom Tromey

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

>> If gdb, perf, trace can and will make use of it then we have sensible
>> arguments enough to go there. If systemtap can use it as well then I
>> have no problem with that.

Yes, gdb would be able to use it.
I don't know of anybody working on it at present.

Tom

2011-03-16 17:32:36

by Tom Tromey

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

Steve> I'm more interested in perf/trace than gdb, as the way gdb is mostly
Steve> used (at least now) is to debug problems in the code with a big hammer
Steve> (single step, look at registers/variables). That is, gdb is usually very
Steve> interactive, and it's best to "stop the code" from running to examine
Steve> what has happened. gdb is not something you will run on an application
Steve> that is being used by others.

It depends. People do in fact do this stuff. In recent years gdb got
its own implementation of "always inserted" breakpoints (basically the
same idea as uprobes) to support some trickier multi-thread debugging
scenarios.

Tom

2011-03-16 17:40:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 7/20] 7: uprobes: store/restore original instruction.

On Wed, 2011-03-16 at 11:21 +0530, Balbir Singh wrote:
> * Steven Rostedt <[email protected]> [2011-03-15 15:32:40]:
>
> > On Tue, 2011-03-15 at 20:30 +0100, Peter Zijlstra wrote:
> > > On Wed, 2011-03-16 at 00:28 +0530, Balbir Singh wrote:
> >
> > > > I accept the blame and am willing to fix anything incorrect found in
> > > > the code.
> > >
> > > :-), ok sounds right, just wasn't entirely obvious when having a quick
> > > look.
> >
> > Does that mean we should be adding a comment there?
> >
>
> This is what the current documentation looks like.
>
> #ifdef CONFIG_MM_OWNER
> /*
> * "owner" points to a task that is regarded as the canonical
> * user/owner of this mm. All of the following must be true in
> * order for it to be changed:
> *
> * current == mm->owner
> * current->mm != mm
> * new_owner->mm == mm
> * new_owner->alloc_lock is held
> */
> struct task_struct __rcu *owner;
> #endif
>
> Do you want me to document the fork/exit case?
>

Ah, looking at the code, I guess comments are not needed.

Thanks,

-- Steve

2011-03-16 17:44:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 0/20] 0: Inode based uprobes

On Wed, 2011-03-16 at 11:32 -0600, Tom Tromey wrote:
> Steve> I'm more interested in perf/trace than gdb, as the way gdb is mostly
> Steve> used (at least now) is to debug problems in the code with a big hammer
> Steve> (single step, look at registers/variables). That is, gdb is usually very
> Steve> interactive, and it's best to "stop the code" from running to examine
> Steve> what has happened. gdb is not something you will run on an application
> Steve> that is being used by others.
>
> It depends. People do in fact do this stuff. In recent years gdb got
> its own implementation of "always inserted" breakpoints (basically the
> same idea as uprobes) to support some trickier multi-thread debugging
> scenarios.

Like I said, if it helps out gdb then great! My concern is that we will
want it to replace ptrace, which it may not be designed to do, and then
people will start NAKing it.

I hope that gdb uses it, and we don't add another interface that nobody
uses. But I don't see that happening with uprobes.

-- Steve

2011-03-18 18:31:52

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

>
> > + ret = handle_riprel_insn(uprobe, &insn);
> > + if (ret == -1)
> > + /* rip-relative; can't XOL */
> > + return 0;
>
> So we return -1 from handle_riprel_insn() and return success?


handle_riprel_insn() returns 0 if the instruction is not rip-relative,
returns 1 if it is rip-relative but can use XOL slots, and
returns -1 if it is rip-relative but cannot use XOL.

We don't see any instructions that are rip-relative and cannot use XOL,
so the check and return are redundant; I will remove them in the next
patch.
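
For reference, the convention as it stands, in comment form (the real
caller is in the hunk quoted above):

    /*
     * handle_riprel_insn() return values:
     *    0  - instruction is not rip-relative
     *    1  - rip-relative; the XOL copy has been rewritten to work
     *   -1  - rip-relative and cannot use XOL (never produced in
     *         practice, hence the dead check in the caller)
     */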


> Btw, how does handle_riprel_insn() deal with 32-bit user space?
>

handle_riprel_insn() calls insn_rip_relative(), which fails if the
instruction is not rip-relative, in which case handle_riprel_insn()
returns immediately.

The rest of your suggestions for this patch are taken.

> > +#endif
> > + prepare_fixups(uprobe, &insn);
> > + return 0;
>

--
Thanks and Regards
Srikar

2011-03-18 18:47:00

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

> handle_riprel_insn() returns 0 if the instruction is not rip-relative,
> returns 1 if it is rip-relative but can use XOL slots, and
> returns -1 if it is rip-relative but cannot use XOL.
>
> We don't see any instructions that are rip-relative and cannot use XOL,
> so the check and return are redundant; I will remove them in the next
> patch.

How is that? You can only adjust a rip-relative instruction correctly if
the instruction copy is within 2GB of the original target address, which
cannot be presumed to always be the case in user address space layout
(unlike the kernel).

Thanks,
Roland

2011-03-18 18:56:19

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

* Roland McGrath <[email protected]> [2011-03-18 11:36:29]:

> > handle_riprel_insn() returns 0 if the instruction is not rip-relative,
> > returns 1 if it is rip-relative but can use XOL slots, and
> > returns -1 if it is rip-relative but cannot use XOL.
> >
> > We don't see any instructions that are rip-relative and cannot use XOL,
> > so the check and return are redundant; I will remove them in the next
> > patch.
>
> How is that? You can only adjust a rip-relative instruction correctly if
> the instruction copy is within 2GB of the original target address, which
> cannot be presumed to always be the case in user address space layout
> (unlike the kernel).
>

So we rewrite the copy of the instruction (stored in the XOL area) such
that it accesses its memory operand indirectly through a scratch
register. The contents of the scratch register are saved before the
single-step and restored afterwards.
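
A sketch of what that fixup looks like (invented names; %rax as the
scratch register is just for illustration, not the actual patch code):

    /*
     * Probed instruction (rip-relative):   mov 0x200(%rip),%rdx
     * Rewritten copy in the XOL slot:      mov (%rax),%rdx
     */
    static void riprel_pre_sstep(struct pt_regs *regs, unsigned long orig_ip,
                                 long disp, int insn_len,
                                 unsigned long *saved_scratch)
    {
        *saved_scratch = regs->ax;  /* preserve the real %rax */

        /*
         * rip-relative operands are relative to the end of the
         * instruction, so compute the address the original form would
         * have used and hand it to the scratch register.
         */
        regs->ax = orig_ip + insn_len + disp;
    }

    static void riprel_post_sstep(struct pt_regs *regs,
                                  unsigned long saved_scratch)
    {
        regs->ax = saved_scratch;   /* restore before anyone looks */
    }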

Can you please tell us if this doesn't work?

--
Thanks and Regards
Srikar

2011-03-18 19:00:08

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 5/20] 5: Uprobes: register/unregister probes.

> >
> > One of the install_uprobe calls could be failing because the process was
> > almost exiting, e.g. because there was no mm->owner. Also, let's
> > assume that the first few install_uprobes go through and the last
> > install_uprobe fails. There could be breakpoint hits corresponding to
> > the already installed uprobes that get displayed, i.e. all
> > breakpoint hits from the first install_uprobe to the time we detect a
> > failed install_uprobe and revert all inserted breakpoints will be
> > shown as being captured.
>
> I think you can gracefully deal with the exit case and simply ignore
> that one. But you cannot let arbitrary installs fail and still report
> success, that gives very weak and nearly useless semantics.

If more than one instance of a process is running, and one instance has
a probe inserted through ptrace, install_uprobe would fail for that
process with -EEXIST, since we don't want to probe locations that
already have breakpoints. Should we then act as in the exit case and
deal with this gracefully as well?

--
Thanks and Regards
Srikar

2011-03-18 19:10:59

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

> So we rewrite the copy of the instruction (stored in the XOL area) such
> that it accesses its memory operand indirectly through a scratch
> register. The contents of the scratch register are saved before the
> single-step and restored afterwards.

I see. That should work fine in principle, assuming you use a register
that is not otherwise involved, of course. I hope you arrange to restore
the register if the copied instruction is never run because of a signal or
suchlike. In that case, it's important that the signal context get the
original register and original PC rather than the fiddled state for running
the copy. Likewise, if anyone is inspecting the registers right after the
step.


Thanks,
Roland

2011-03-18 19:13:57

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 6/20] 6: x86: analyze instruction and determine fixups.

* Srikar Dronamraju <[email protected]> [2011-03-19 00:19:22]:

> * Roland McGrath <[email protected]> [2011-03-18 11:36:29]:
>
> > > handle_riprel_insn() returns 0 if the instruction is not rip-relative,
> > > returns 1 if it is rip-relative but can use XOL slots, and
> > > returns -1 if it is rip-relative but cannot use XOL.
> > >
> > > We don't see any instructions that are rip-relative and cannot use XOL,
> > > so the check and return are redundant; I will remove them in the next
> > > patch.
> >
> > How is that? You can only adjust a rip-relative instruction correctly if
> > the instruction copy is within 2GB of the original target address, which
> > cannot be presumed to always be the case in user address space layout
> > (unlike the kernel).
> >
>
> So we rewrite the copy of the instruction (stored in the XOL area) such
> that it accesses its memory operand indirectly through a scratch
> register. The contents of the scratch register are saved before the
> single-step and restored afterwards.
>
> Can you please tell us if this doesn't work?
>

In fact, we have tested with rip-relative addresses and it has worked
very well, so we have verified that it does work. Can you please tell us
why you don't think this works?

--
Thanks and Regards
Srikar

2011-03-18 19:23:41

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 17/20] 17: uprobes: filter chain

* Stephen Wilson <[email protected]> [2011-03-15 15:49:14]:

>
>
>
> On Mon, Mar 14, 2011 at 07:07:22PM +0530, Srikar Dronamraju wrote:
> >
> > Loops through the filter callbacks of the currently registered
> > consumers to see if any consumer is interested in tracing this task.
>
> Should this be part of the series? It is not currently used.
>
> > /* Acquires uprobe->consumer_rwsem */
> > +static bool filter_chain(struct uprobe *uprobe, struct task_struct *t)
> > +{
> > + struct uprobe_consumer *consumer;
> > + bool ret = false;
> > +
> > + down_read(&uprobe->consumer_rwsem);
> > + for (consumer = uprobe->consumers; consumer;
> > + consumer = consumer->next) {
> > + if (!consumer->filter || consumer->filter(consumer, t)) {
>
> The implementation does not seem to match the changelog description.
> Should this not be:
>
> if (consumer->filter && consumer->filter(consumer, t))
>
> ?

filter is optional; if a filter is present, it means that the tracer is
interested in a specific set of processes that map this inode. If there
is no filter, it means that the tracer is interested in all processes
that map this inode.

filter_chain() should return true if any consumer is interested in
tracing this task. If there is a consumer that hasn't defined a filter,
then we don't need to loop through the remaining consumers.

Hence

if (!consumer->filter || consumer->filter(consumer, t)) {

seems better suited to me.
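
To illustrate the two cases (hypothetical consumers; my_handler and
my_traced_tgid are stand-ins, not part of the patchset):

    static bool only_my_tgid(struct uprobe_consumer *self,
                             struct task_struct *t)
    {
        return t->tgid == my_traced_tgid;   /* a specific process */
    }

    static struct uprobe_consumer narrow = {
        .handler = my_handler,
        .filter  = only_my_tgid,    /* consulted by filter_chain() */
    };

    static struct uprobe_consumer wide = {
        .handler = my_handler,
        .filter  = NULL,    /* no filter: interested in every task
                             * mapping the inode, so filter_chain()
                             * returns true without looping further */
    };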

--
Thanks and Regards
Srikar

2011-03-18 19:35:08

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 16/20] 16: uprobes: register a notifier for uprobes.

* Stephen Wilson <[email protected]> [2011-03-15 15:56:36]:

> On Mon, Mar 14, 2011 at 07:07:08PM +0530, Srikar Dronamraju wrote:
> > +static int __init init_uprobes(void)
> > +{
> > + register_die_notifier(&uprobes_exception_nb);
> > + return 0;
> > +}
> > +
>
> Although not currently needed, perhaps it would be best to return the
> result of register_die_notifier() ?
>

Okay, I can do that, but notifier_chain_register(), which gets called
from register_die_notifier(), always returns 0.
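
So the suggested change would just be:

    static int __init init_uprobes(void)
    {
        return register_die_notifier(&uprobes_exception_nb);
    }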

--
Thanks and Regards
Srikar

2011-03-18 22:11:54

by Stephen Wilson

[permalink] [raw]
Subject: Re: [PATCH v2 2.6.38-rc8-tip 17/20] 17: uprobes: filter chain


On Sat, Mar 19, 2011 at 12:46:48AM +0530, Srikar Dronamraju wrote:
> > > + for (consumer = uprobe->consumers; consumer;
> > > + consumer = consumer->next) {
> > > + if (!consumer->filter || consumer->filter(consumer, t)) {
> >
> > The implementation does not seem to match the changelog description.
> > Should this not be:
> >
> > if (consumer->filter && consumer->filter(consumer, t))
> >
> > ?
>
> filter is optional; if a filter is present, it means that the tracer is
> interested in a specific set of processes that map this inode. If there
> is no filter, it means that the tracer is interested in all processes
> that map this inode.

Ah OK. That does make sense then. Thanks!


> filter_chain() should return true if any consumer is interested in
> tracing this task. If there is a consumer that hasn't defined a filter,
> then we don't need to loop through the remaining consumers.
>
> Hence
>
> if (!consumer->filter || consumer->filter(consumer, t)) {
>
> seems better suited to me.

--
steve