This patchset implements uprobes, which enable you to dynamically break
into any routine in a user space application and collect information
non-disruptively.
This patchset addresses most of the comments on the previous posting
https://lkml.org/lkml/2011/4/1/176 and the input I received at LFCS. The
patchset applies on top of tip commit 59c5f46fbe01.
Uprobes Patches
This patchset implements inode based uprobes, which are specified as
<file>:<offset>, where offset is the offset from the start of the map.
The probe-hit overhead is around 3x the overhead of the earlier pid
based patchset.
When a uprobe is registered, uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction
with a breakpoint instruction. (Uprobes uses a background page
replacement mechanism, which ensures that the breakpoint affects only
that process.)
When a CPU hits the breakpoint instruction, uprobes is notified of the
trap and finds the associated uprobe. It then executes the associated
handler. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.
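To give a feel for the kernel-side interface introduced later in this
series, here is a minimal, hypothetical sketch of a module placing one
probe. Only the uprobe_consumer layout (patch 3) and the
register_uprobe()/unregister_uprobe() prototypes (patch 4) come from
this patchset; the probed file, the example offset 0x4a0 and the module
boilerplate are illustrative assumptions.

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/namei.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>

/* Sketch only -- not part of this patchset. */
static struct inode *probed_inode;

static int sample_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	pr_info("uprobe hit, ip=%lx\n", instruction_pointer(regs));
	return 0;
}

static struct uprobe_consumer sample_consumer = {
	.handler = sample_handler,
	/* .filter left NULL: run the handler for every task */
};

static int __init sample_init(void)
{
	struct path path;
	int ret;

	/* "/bin/ls" and offset 0x4a0 are made-up example values */
	ret = kern_path("/bin/ls", LOOKUP_FOLLOW, &path);
	if (ret)
		return ret;
	probed_inode = igrab(path.dentry->d_inode);
	path_put(&path);

	return register_uprobe(probed_inode, 0x4a0, &sample_consumer);
}

static void __exit sample_exit(void)
{
	unregister_uprobe(probed_inode, 0x4a0, &sample_consumer);
	iput(probed_inode);
}

module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");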
For previous postings, please refer to: http://lkml.org/lkml/2011/3/14/171/
http://lkml.org/lkml/2010/12/16/65 http://lkml.org/lkml/2010/8/25/165
http://lkml.org/lkml/2010/7/27/121 http://lkml.org/lkml/2010/7/12/67
http://lkml.org/lkml/2010/7/8/239 http://lkml.org/lkml/2010/6/29/299
http://lkml.org/lkml/2010/6/14/41 http://lkml.org/lkml/2010/3/20/107 and
http://lkml.org/lkml/2010/5/18/307
This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199). This implementation of uprobes
does not depend on utrace.
Advantages of uprobes over conventional debugging include:
1. Non-disruptive.
Unlike current ptrace based mechanisms, uprobes tracing does not
involve signals, stopping threads, or context switching between the
tracer and the tracee.
2. Much better handling of multithreaded programs because of XOL.
Current ptrace based mechanisms use inline single-stepping, i.e. they
copy back the original instruction on hitting a breakpoint. With such
mechanisms, tracers have to stop all threads on a breakpoint hit, or
they will not be able to handle all hits to the location of interest.
Uprobes uses execution out of line: the instruction to be traced is
analysed at the time of breakpoint insertion and a copy of the
instruction is stored at a different location. On a breakpoint hit,
uprobes jumps to that copied location, single-steps the copied
instruction and does the necessary fixups post single-stepping.
3. Multiple tracers for an application.
Multiple uprobes based tracers can work in unison to trace an
application. One tracer could be interested in generic events for a
particular set of processes, while another tracer could be interested
in just one specific event of a particular process that is part of
that set.
4. Correlating events from kernel and userspace.
Uprobes can be used with other tools like kprobes and tracepoints, or
as part of higher level tools like perf, to give a consolidated set of
events from kernel and userspace. In the future we could look at a
single backtrace showing application, library and kernel calls.
Here is the list of TODO items.
- Breakpoint handling should co-exist with singlestep/blockstep from
another tracer/debugger.
- Queue and dequeue signals delivered from the singlestep till
completion of postprocessing.
- Prefiltering (i.e. filtering at the time of probe insertion).
- Return probes.
- Support for other architectures.
- Uprobes booster.
- Replace macro W with bits in the inat table.
To try it out, please fetch using:
git fetch \
git://git.kernel.org/pub/scm/linux/kernel/git/srikar/linux-uprobes.git \
tip_inode_uprobes_070611:tip_inode_uprobes
Please refer "[RFC] [PATCH 3.0-rc2-tip 18/22] tracing: tracing: Uprobe
tracer documentation" on how to use uprobe_tracer.
Please refer "[RFC] [PATCH 3.0-rc2-tip 22/22] perf: Documentation for perf
uprobes" on how to use uprobe_tracer.
Please do provide your valuable comments.
Thanks in advance.
Srikar
Srikar Dronamraju (22)
0: Uprobes patchset with perf probe support
1: X86 specific breakpoint definitions.
2: uprobes: Background page replacement.
3: uprobes: Add and remove a uprobe in an rb tree.
4: Uprobes: register/unregister probes.
5: x86: analyze instruction and determine fixups.
6: uprobes: store/restore original instruction.
7: uprobes: mmap and fork hooks.
8: x86: architecture specific task information.
9: uprobes: task specific information.
10: uprobes: slot allocation for uprobes
11: uprobes: get the breakpoint address.
12: x86: x86 specific probe handling
13: uprobes: Handling int3 and singlestep exceptions.
14: x86: uprobes exception notifier for x86.
15: uprobes: register a notifier for uprobes.
16: tracing: Extract out common code for kprobes/uprobes traceevents.
17: tracing: uprobes trace_event interface
18: tracing: Uprobe tracer documentation
19: perf: rename target_module to target
20: perf: perf interface for uprobes
21: perf: show possible probes in a given executable file or library.
22: perf: Documentation for perf uprobes
Documentation/trace/uprobetrace.txt | 94 ++
arch/Kconfig | 4 +
arch/x86/Kconfig | 3 +
arch/x86/include/asm/thread_info.h | 2 +
arch/x86/include/asm/uprobes.h | 53 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/signal.c | 14 +
arch/x86/kernel/uprobes.c | 591 +++++++++++++
include/linux/mm_types.h | 9 +
include/linux/sched.h | 9 +-
include/linux/uprobes.h | 194 ++++
kernel/Makefile | 1 +
kernel/fork.c | 10 +
kernel/trace/Kconfig | 20 +
kernel/trace/Makefile | 2 +
kernel/trace/trace.h | 5 +
kernel/trace/trace_kprobe.c | 860 +------------------
kernel/trace/trace_probe.c | 752 ++++++++++++++++
kernel/trace/trace_probe.h | 160 ++++
kernel/trace/trace_uprobe.c | 812 +++++++++++++++++
kernel/uprobes.c | 1476 +++++++++++++++++++++++++++++++
mm/mmap.c | 6 +
tools/perf/Documentation/perf-probe.txt | 21 +-
tools/perf/builtin-probe.c | 77 ++-
tools/perf/util/probe-event.c | 431 ++++++++--
tools/perf/util/probe-event.h | 12 +-
tools/perf/util/symbol.c | 10 +-
tools/perf/util/symbol.h | 1 +
28 files changed, 4686 insertions(+), 944 deletions(-)
Provides definitions for the breakpoint instruction and x86 specific
uprobe info structure.
Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/Kconfig | 3 +++
arch/x86/include/asm/uprobes.h | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/uprobes.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 30041d8..7c843e2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -243,6 +243,9 @@ config ARCH_CPU_PROBE_RELEASE
def_bool y
depends on HOTPLUG_CPU
+config ARCH_SUPPORTS_UPROBES
+ def_bool y
+
source "init/Kconfig"
source "kernel/Kconfig.freezer"
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..192ba4a
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,40 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES 128 /* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+#ifdef CONFIG_X86_64
+struct uprobe_arch_info {
+ unsigned long rip_rela_target_address;
+};
+#else
+struct uprobe_arch_info {};
+#endif
+#endif /* _ASM_UPROBES_H */
Provides background page replacement by:
- COW the page that needs replacement.
- modify a copy of the COWed page.
- replace the COWed page with the modified page.
- flush the page tables.
Also provides routines to read an opcode from a given virtual address
and to verify whether an instruction is a breakpoint instruction.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
arch/Kconfig | 12 ++
include/linux/uprobes.h | 81 ++++++++++++
kernel/Makefile | 1
kernel/uprobes.c | 309 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 403 insertions(+), 0 deletions(-)
create mode 100644 include/linux/uprobes.h
create mode 100644 kernel/uprobes.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 26b0e23..3d91687 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,18 @@ config OPTPROBES
depends on KPROBES && HAVE_OPTPROBES
depends on !PREEMPT
+config UPROBES
+ bool "User-space probes (EXPERIMENTAL)"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select MM_OWNER
+ help
+ Uprobes enables kernel subsystems to establish probepoints
+ in user applications and execute handler functions when
+ the probepoints are hit.
+
+ If in doubt, say "N".
+
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
help
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..232ccea
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,81 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+/*
+ * ARCH_SUPPORTS_UPROBES is not defined.
+ */
+typedef u8 uprobe_opcode_t;
+#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
+
+/* Post-execution fixups. Some architectures may define others. */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE 0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP 0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL 0x2
+/* Might sleep while doing Fixup */
+#define UPROBES_FIX_SLEEPY 0x4
+
+#ifndef UPROBES_FIX_DEFAULT
+#define UPROBES_FIX_DEFAULT UPROBES_FIX_IP
+#endif
+
+/* Unexported functions & macros for use by arch-specific code */
+#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))
+
+/*
+ * Most architectures can use the default versions of @read_opcode(),
+ * @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
+ *
+ * @set_ip:
+ * Set the instruction pointer in @regs to @vaddr.
+ * @analyze_insn:
+ * Analyze @user_bkpt->insn. Return 0 if @user_bkpt->insn is an
+ * instruction you can probe, or a negative errno (typically -%EPERM)
+ * otherwise. Determine what sort of XOL-related fixups @post_xol()
+ * (and possibly @pre_xol()) will need to do for this instruction, and
+ * annotate @user_bkpt accordingly. You may modify @user_bkpt->insn
+ * (e.g., the x86_64 port does this for rip-relative instructions).
+ * @pre_xol:
+ * Called just before executing the instruction associated
+ * with @user_bkpt out of line. @user_bkpt->xol_vaddr is the address
+ * in @tsk's virtual address space where @user_bkpt->insn has been
+ * copied. @pre_xol() should at least set the instruction pointer in
+ * @regs to @user_bkpt->xol_vaddr -- which is what the default,
+ * @pre_xol(), does.
+ * @post_xol:
+ * Called after executing the instruction associated with
+ * @user_bkpt out of line. @post_xol() should perform the fixups
+ * specified in @user_bkpt->fixups, which includes ensuring that the
+ * instruction pointer in @regs points at the next instruction in
+ * the probed instruction stream. @tskinfo is as for @pre_xol().
+ * You must provide this function.
+ */
+#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 2d64cfc..2ca229a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_UPROBES) += uprobes.o
ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..7ef916e
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,309 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/slab.h>
+#include <linux/uprobes.h>
+#include <linux/rmap.h> /* needed for anon_vma_prepare */
+#include <linux/mmu_notifier.h> /* needed for set_pte_at_notify */
+#include <linux/swap.h> /* needed for try_to_free_swap */
+
+struct uprobe {
+ u8 insn[MAX_UINSN_BYTES];
+ u16 fixups;
+};
+
+static bool valid_vma(struct vm_area_struct *vma)
+{
+ if (!vma->vm_file)
+ return false;
+
+ if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+ (VM_READ|VM_EXEC))
+ return true;
+
+ return false;
+}
+
+/**
+ * __replace_page - replace page in vma by new page.
+ * based on replace_page in mm/ksm.c
+ *
+ * @vma: vma that holds the pte pointing to page
+ * @page: the cowed page we are replacing by kpage
+ * @kpage: the modified page we replace page by
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int __replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int err = -EFAULT;
+
+ addr = page_address_in_vma(page, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pgd = pgd_offset(mm, addr);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_trans_huge(*pmd) || (!pmd_present(*pmd)))
+ goto out;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out;
+
+ get_page(kpage);
+ page_add_new_anon_rmap(kpage, vma, addr);
+
+ flush_cache_page(vma, addr, pte_pfn(*ptep));
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+ page_remove_rmap(page);
+ if (!page_mapped(page))
+ try_to_free_swap(page);
+ put_page(page);
+ pte_unmap_unlock(ptep, ptl);
+ err = 0;
+
+out:
+ return err;
+}
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest size instruction for
+ * the architecture. If an arch has variable-length instructions and the
+ * breakpoint instruction is not the smallest-length instruction
+ * supported by that architecture, then we need to modify read_opcode /
+ * write_opcode accordingly. This would never be a problem for archs that
+ * have fixed length instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with tsk->mm->mmap_sem held (for read and with a reference to
+ * tsk->mm).
+ *
+ * For task @tsk, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
+ unsigned long vaddr, uprobe_opcode_t opcode)
+{
+ struct page *old_page, *new_page;
+ void *vaddr_old, *vaddr_new;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ int ret;
+
+ /* Read the page with vaddr into memory */
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
+ if (ret <= 0)
+ return ret;
+ ret = -EINVAL;
+
+ /*
+ * We are interested in text pages only. Our pages of interest
+ * should be mapped for read and execute only. We desist from
+ * adding probes in write mapped pages since the breakpoints
+ * might end up in the file copy.
+ */
+ if (!valid_vma(vma))
+ goto put_out;
+
+ /* Allocate a page */
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+ if (!new_page) {
+ ret = -ENOMEM;
+ goto put_out;
+ }
+
+ /*
+ * lock page will serialize against do_wp_page()'s
+ * PageAnon() handling
+ */
+ lock_page(old_page);
+ /* copy the page now that we've got it stable */
+ vaddr_old = kmap_atomic(old_page, KM_USER0);
+ vaddr_new = kmap_atomic(new_page, KM_USER1);
+
+ memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+ /* poke the new insn in, ASSUMES we don't cross page boundary */
+ addr = vaddr;
+ vaddr &= ~PAGE_MASK;
+ memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+ kunmap_atomic(vaddr_new);
+ kunmap_atomic(vaddr_old);
+
+ ret = anon_vma_prepare(vma);
+ if (ret)
+ goto unlock_out;
+
+ lock_page(new_page);
+ ret = __replace_page(vma, old_page, new_page);
+ unlock_page(new_page);
+ if (ret != 0)
+ page_cache_release(new_page);
+unlock_out:
+ unlock_page(old_page);
+
+put_out:
+ put_page(old_page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with tsk->mm->mmap_sem held (for read and with a reference to
+ * tsk->mm).
+ *
+ * For task @tsk, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+ uprobe_opcode_t *opcode)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+ void *vaddr_new;
+ int ret;
+
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
+ if (ret <= 0)
+ return ret;
+ ret = -EINVAL;
+
+ /*
+ * We are interested in text pages only. Our pages of interest
+ * should be mapped for read and execute only. We desist from
+ * adding probes in write mapped pages since the breakpoints
+ * might end up in the file copy.
+ */
+ if (!valid_vma(vma))
+ goto put_out;
+
+ lock_page(page);
+ vaddr_new = kmap_atomic(page, KM_USER0);
+ vaddr &= ~PAGE_MASK;
+ memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+ kunmap_atomic(vaddr_new);
+ unlock_page(page);
+ ret = 0;
+
+put_out:
+ put_page(page); /* we did a get_page in the beginning */
+ return ret;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For task @tsk, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr)
+{
+ return write_opcode(tsk, uprobe, vaddr, UPROBES_BKPT_INSN);
+}
+
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existence of the breakpoint instruction.
+ *
+ * For task @tsk, restore the original opcode (opcode) at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
+ unsigned long vaddr, bool verify)
+{
+ if (verify) {
+ uprobe_opcode_t opcode;
+ int result = read_opcode(tsk, vaddr, &opcode);
+ if (result)
+ return result;
+ if (opcode != UPROBES_BKPT_INSN)
+ return -EINVAL;
+ }
+ return write_opcode(tsk, uprobe, vaddr,
+ *(uprobe_opcode_t *) uprobe->insn);
+}
+
+static void print_insert_fail(struct task_struct *tsk,
+ unsigned long vaddr, const char *why)
+{
+ pr_warn_once("Can't place breakpoint at pid %d vaddr" " %#lx: %s\n",
+ tsk->pid, vaddr, why);
+}
+
+/**
+ * is_bkpt_insn - check if instruction is breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default implementation of is_bkpt_insn
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+ uprobe_opcode_t opcode;
+
+ memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
+ return (opcode == UPROBES_BKPT_INSN);
+}
Provides interfaces to add and remove uprobes from the global rb tree.
Also provides the definition of uprobe_consumer and interfaces to add
and remove a consumer to/from a uprobe. There is a unique uprobe element
in the rb tree for each unique inode:offset pair.
A uprobe gets added to the global rb tree when the first consumer for
that uprobe gets registered. It gets removed from the tree only when all
registered consumers have unregistered.
Multiple consumers can share the same probe. Each consumer provides a
handler that runs on probe hit and an optional filter callback that
limits the tasks on which the handler runs (see the sketch below).
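As a hypothetical illustration of consumer sharing (only the
uprobe_consumer fields and register_uprobe() are from this series; the
tracer names and the tgid-based policy are assumptions), two independent
tracers can register on the same inode:offset and the second
registration simply chains its consumer onto the existing uprobe:

/* Sketch only: two consumers sharing one inode:offset probe. */
static pid_t traced_tgid;	/* assumed: set up by the second tracer */

static int generic_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	/* generic tracer: interested in every hit, from any task */
	return 0;
}

static int specific_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	/* specific tracer: runs only when the filter below says yes */
	return 0;
}

static bool tgid_filter(struct uprobe_consumer *self, struct task_struct *task)
{
	return task->tgid == traced_tgid;
}

static struct uprobe_consumer generic_consumer = {
	.handler = generic_handler,	/* no filter: runs for all tasks */
};

static struct uprobe_consumer specific_consumer = {
	.handler = specific_handler,
	.filter  = tgid_filter,
};

static int register_both(struct inode *inode, loff_t offset)
{
	int err;

	/*
	 * The first call allocates the uprobe and inserts it into the
	 * rb tree; the second finds the existing inode:offset element
	 * and merely adds its consumer to the chain.
	 */
	err = register_uprobe(inode, offset, &generic_consumer);
	if (err)
		return err;
	return register_uprobe(inode, offset, &specific_consumer);
}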
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 12 +++
kernel/uprobes.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 222 insertions(+), 0 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 232ccea..9187df3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -23,6 +23,7 @@
* Jim Keniston
*/
+#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
#else
@@ -50,6 +51,17 @@ typedef u8 uprobe_opcode_t;
/* Unexported functions & macros for use by arch-specific code */
#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))
+struct uprobe_consumer {
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+ /*
+ * filter is optional; If a filter exists, handler is run
+ * if and only if filter returns true.
+ */
+ bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+ struct uprobe_consumer *next;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 7ef916e..aace4d9 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -35,6 +35,12 @@
#include <linux/swap.h> /* needed for try_to_free_swap */
struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref; /* lifetime muck */
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* we hold a ref */
+ loff_t offset;
u8 insn[MAX_UINSN_BYTES];
u16 fixups;
};
@@ -307,3 +313,207 @@ bool __weak is_bkpt_insn(u8 *insn)
memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
return (opcode == UPROBES_BKPT_INSN);
}
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(uprobes_treelock); /* serialize (un)register */
+
+static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
+{
+ if (match_inode)
+ *match_inode = 0;
+
+ if (l->inode < r->inode)
+ return -1;
+ if (l->inode > r->inode)
+ return 1;
+ else {
+ if (match_inode)
+ *match_inode = 1;
+
+ if (l->offset < r->offset)
+ return -1;
+
+ if (l->offset > r->offset)
+ return 1;
+ }
+
+ return 0;
+}
+
+/* Called with uprobes_treelock held */
+static struct uprobe *__find_uprobe(struct inode * inode,
+ loff_t offset, struct rb_node **close_match)
+{
+ struct uprobe r = { .inode = inode, .offset = offset };
+ struct rb_node *n = uprobes_tree.rb_node;
+ struct uprobe *uprobe;
+ int match, match_inode;
+
+ while (n) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ match = match_uprobe(uprobe, &r, &match_inode);
+ if (close_match && match_inode)
+ *close_match = n;
+
+ if (!match) {
+ atomic_inc(&uprobe->ref);
+ return uprobe;
+ }
+ if (match < 0)
+ n = n->rb_left;
+ else
+ n = n->rb_right;
+
+ }
+ return NULL;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires uprobes_treelock
+ */
+static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
+{
+ struct uprobe *uprobe;
+ unsigned long flags;
+
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ uprobe = __find_uprobe(inode, offset, NULL);
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+ return uprobe;
+}
+
+static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
+{
+ struct rb_node **p = &uprobes_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct uprobe *u;
+ int match;
+
+ while (*p) {
+ parent = *p;
+ u = rb_entry(parent, struct uprobe, rb_node);
+ match = match_uprobe(u, uprobe, NULL);
+ if (!match) {
+ atomic_inc(&u->ref);
+ return u;
+ }
+
+ if (match < 0)
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+
+ }
+ u = NULL;
+ rb_link_node(&uprobe->rb_node, parent, p);
+ rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+ /* get access + drop ref */
+ atomic_set(&uprobe->ref, 2);
+ return u;
+}
+
+/*
+ * Acquires uprobes_treelock.
+ * Matching uprobe already exists in rbtree;
+ * increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ * get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+ unsigned long flags;
+ struct uprobe *u;
+
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ u = __insert_uprobe(uprobe);
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+ return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+ if (atomic_dec_and_test(&uprobe->ref))
+ kfree(uprobe);
+}
+
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+{
+ struct uprobe *uprobe, *cur_uprobe;
+
+ uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+ if (!uprobe)
+ return NULL;
+
+ __iget(inode);
+ uprobe->inode = inode;
+ uprobe->offset = offset;
+ init_rwsem(&uprobe->consumer_rwsem);
+
+ /* add to uprobes_tree, sorted on inode:offset */
+ cur_uprobe = insert_uprobe(uprobe);
+
+ /* a uprobe exists for this inode:offset combination*/
+ if (cur_uprobe) {
+ kfree(uprobe);
+ uprobe = cur_uprobe;
+ iput(inode);
+ }
+ return uprobe;
+}
+
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_consumer *consumer;
+
+ down_read(&uprobe->consumer_rwsem);
+ consumer = uprobe->consumers;
+ while (consumer) {
+ if (!consumer->filter || consumer->filter(consumer, current))
+ consumer->handler(consumer, regs);
+
+ consumer = consumer->next;
+ }
+ up_read(&uprobe->consumer_rwsem);
+}
+
+static void add_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ down_write(&uprobe->consumer_rwsem);
+ consumer->next = uprobe->consumers;
+ uprobe->consumers = consumer;
+ up_write(&uprobe->consumer_rwsem);
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ struct uprobe_consumer *con;
+ bool ret = false;
+
+ down_write(&uprobe->consumer_rwsem);
+ con = uprobe->consumers;
+ if (consumer == con) {
+ uprobe->consumers = con->next;
+ if (!con->next)
+ put_uprobe(uprobe); /* drop creation ref */
+ ret = true;
+ } else {
+ for (; con; con = con->next) {
+ if (con->next == consumer) {
+ con->next = consumer->next;
+ ret = true;
+ break;
+ }
+ }
+ }
+ up_write(&uprobe->consumer_rwsem);
+ return ret;
+}
A probe is specified by a file:offset. On registration, a breakpoint
is inserted for the first consumer; on subsequent registrations, the
consumer is appended to the existing consumers. On unregistration, the
breakpoint is removed if the consumer happens to be the last consumer;
for all other unregistrations, the consumer is simply deleted from the
list of consumers.
Probe specifications are maintained in an rb tree. A probe specification
is converted into a uprobe before it is stored in the rb tree. A uprobe
can be shared by many consumers.
Given an inode, we get the list of mm's that have mapped the inode.
However, we want to limit the probes to certain processes/threads, and
the filtering should be at thread level. To limit the probes that way,
we would want to walk through the list of threads whose mm member refers
to a given mm.
Here are the options that I thought of:
1. Use mm->owner and walk through the thread_group of mm->owner, siblings
of mm->owner, and siblings of the parent of mm->owner. This should be a
good list to traverse, but I am not sure it is exhaustive enough to cover
all tasks that have their mm set to this mm_struct.
2. Install probes on all mm's that have mapped the inode and filter
only at probe-hit time.
3. Walk through do_each_thread/while_each_thread. I think this will catch
all tasks that have their mm set to the given mm. However this might
be too heavy, especially if the inode corresponds to a library.
4. Add a list_head element to the mm struct and update the list whenever
task->mm gets updated. This could mean extending the current
mm->owner. However there is some maintenance overhead.
Currently we use the second approach, i.e. probe all mm's that have
mapped the inode and filter only at probe-hit time (a sketch of such a
filter follows).
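Here is a minimal sketch of what that hit-time filtering can look like
(only the uprobe_consumer filter/handler signatures come from this
series; target_mm and the helper names are assumptions):

/* Sketch only: thread-level filtering done at probe-hit time. */
static struct mm_struct *target_mm;	/* assumed: set to the mm of interest */

static bool hit_time_filter(struct uprobe_consumer *self,
				struct task_struct *task)
{
	/*
	 * The breakpoint is present in every mm that maps the inode;
	 * let the handler run only for threads sharing the mm we care
	 * about.
	 */
	return task->mm == target_mm;
}

static int trace_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	/* record the event however the tracer wants */
	return 0;
}

static struct uprobe_consumer filtered_consumer = {
	.handler = trace_handler,
	.filter  = hit_time_filter,
};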
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/mm_types.h | 5 +
include/linux/uprobes.h | 32 +++++
kernel/uprobes.c | 314 ++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 340 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 027935c..7bfef2e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -316,6 +316,11 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_UPROBES
+ unsigned long uprobes_vaddr;
+ struct list_head uprobes_list; /* protected by uprobes_mutex */
+ atomic_t uprobes_count;
+#endif
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 9187df3..4087cc3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -31,6 +31,7 @@
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
+struct uprobe_arch_info {}; /* arch specific info*/
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
/* Post-execution fixups. Some architectures may define others. */
@@ -62,6 +63,19 @@ struct uprobe_consumer {
struct uprobe_consumer *next;
};
+struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref;
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_arch_info arch_info; /* arch specific info if any */
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* Also hold a ref to inode */
+ loff_t offset;
+ u8 insn[MAX_UINSN_BYTES]; /* orig instruction */
+ u16 fixups;
+ int copy;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -90,4 +104,22 @@ struct uprobe_consumer {
* the probed instruction stream. @tskinfo is as for @pre_xol().
* You must provide this function.
*/
+
+#ifdef CONFIG_UPROBES
+extern int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+}
+
+#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index aace4d9..c6c2f5e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -34,17 +34,6 @@
#include <linux/mmu_notifier.h> /* needed for set_pte_at_notify */
#include <linux/swap.h> /* needed for try_to_free_swap */
-struct uprobe {
- struct rb_node rb_node; /* node in the rb tree */
- atomic_t ref; /* lifetime muck */
- struct rw_semaphore consumer_rwsem;
- struct uprobe_consumer *consumers;
- struct inode *inode; /* we hold a ref */
- loff_t offset;
- u8 insn[MAX_UINSN_BYTES];
- u16 fixups;
-};
-
static bool valid_vma(struct vm_area_struct *vma)
{
if (!vma->vm_file)
@@ -517,3 +506,306 @@ static bool del_consumer(struct uprobe *uprobe,
up_write(&uprobe->consumer_rwsem);
return ret;
}
+
+static struct task_struct *get_mm_owner(struct mm_struct *mm)
+{
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = rcu_dereference(mm->owner);
+ if (tsk)
+ get_task_struct(tsk);
+ rcu_read_unlock();
+ return tsk;
+}
+
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: install breakpoint */
+ if (!ret)
+ atomic_inc(&mm->uprobes_count);
+ return ret;
+}
+
+static int __remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ int ret = 0;
+
+ /*TODO: remove breakpoint */
+ if (!ret)
+ atomic_dec(&mm->uprobes_count);
+
+ return ret;
+}
+
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+{
+ down_read(&mm->mmap_sem);
+ __remove_breakpoint(mm, uprobe);
+ list_del(&mm->uprobes_list);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+}
+
+/*
+ * There could be threads that have hit the breakpoint and are entering the
+ * notifier code and trying to acquire the uprobes_treelock. The thread
+ * calling delete_uprobe() that is removing the uprobe from the rb_tree can
+ * race with these threads and might acquire the uprobes_treelock ahead
+ * of some of the breakpoint-hit threads. In such a case, the breakpoint-hit
+ * threads will not find the uprobe. Finding whether a "trap" instruction was
+ * present at the interrupting address is racy. Hence provide some extra
+ * time (by way of synchronize_sched()) for breakpoint-hit threads to acquire
+ * the uprobes_treelock before the uprobe is removed from the rbtree.
+ */
+static void delete_uprobe(struct uprobe *uprobe)
+{
+ unsigned long flags;
+
+ synchronize_sched();
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ rb_erase(&uprobe->rb_node, &uprobes_tree);
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+ iput(uprobe->inode);
+}
+
+static DEFINE_MUTEX(uprobes_mutex);
+
+/*
+ * register_uprobe - register a probe
+ * @inode: the file in which the probe has to be placed.
+ * @offset: offset from the start of the file.
+ * @consumer: information on how to handle the probe.
+ *
+ * Apart from the access refcount, register_uprobe() takes a creation
+ * refcount (through alloc_uprobe) if and only if this @uprobe is getting
+ * inserted into the rbtree (i.e first consumer for a @inode:@offset
+ * tuple). Creation refcount stops unregister_uprobe from freeing the
+ * @uprobe even before the register operation is complete. Creation
+ * refcount is released when the last @consumer for the @uprobe
+ * unregisters.
+ *
+ * Return errno if it cannot successfully install probes
+ * else return 0 (success)
+ */
+int register_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head try_list, success_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+ int ret = -1;
+
+ if (!inode || !consumer || consumer->next)
+ return -EINVAL;
+
+ if (offset > inode->i_size)
+ return -EINVAL;
+
+ uprobe = alloc_uprobe(inode, offset);
+ if (!uprobe)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&try_list);
+ INIT_LIST_HEAD(&success_list);
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers) {
+ ret = 0;
+ goto consumers_add;
+ }
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ loff_t vaddr;
+ struct task_struct *tsk;
+
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+ if (!valid_vma(vma)) {
+ mmput(mm);
+ continue;
+ }
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
+ /* Not in this vma */
+ mmput(mm);
+ continue;
+ }
+ tsk = get_mm_owner(mm);
+ if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than TASK_SIZE_OF(tsk)
+ */
+ put_task_struct(tsk);
+ mmput(mm);
+ continue;
+ }
+ put_task_struct(tsk);
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &try_list);
+ }
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ if (list_empty(&try_list)) {
+ ret = 0;
+ goto consumers_add;
+ }
+ list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
+ down_read(&mm->mmap_sem);
+ ret = install_breakpoint(mm, uprobe);
+
+ if (ret && (ret != -ESRCH && ret != -EEXIST)) {
+ up_read(&mm->mmap_sem);
+ break;
+ }
+ if (!ret)
+ list_move(&mm->uprobes_list, &success_list);
+ else {
+ /*
+ * install_breakpoint failed as there are no active
+ * threads for the mm; ignore the error.
+ */
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ up_read(&mm->mmap_sem);
+ }
+
+ if (list_empty(&try_list)) {
+ /*
+ * All install_breakpoints were successful;
+ * cleanup successful entries.
+ */
+ ret = 0;
+ list_for_each_entry_safe(mm, tmpmm, &success_list,
+ uprobes_list) {
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ goto consumers_add;
+ }
+
+ /*
+ * At least one unsuccessful install_breakpoint;
+ * remove successful probes and cleanup untried entries.
+ */
+ list_for_each_entry_safe(mm, tmpmm, &success_list, uprobes_list)
+ remove_breakpoint(mm, uprobe);
+ list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
+ list_del(&mm->uprobes_list);
+ mmput(mm);
+ }
+ delete_uprobe(uprobe);
+ goto put_unlock;
+
+consumers_add:
+ add_consumer(uprobe, consumer);
+
+put_unlock:
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe); /* drop access ref */
+ return ret;
+}
+
+/*
+ * unregister_uprobe - unregister an already registered probe.
+ * @inode: the file in which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @consumer: identify which probe if multiple probes are colocated.
+ */
+void unregister_uprobe(struct inode *inode, loff_t offset,
+ struct uprobe_consumer *consumer)
+{
+ struct prio_tree_iter iter;
+ struct list_head tmp_list;
+ struct address_space *mapping;
+ struct mm_struct *mm, *tmpmm;
+ struct vm_area_struct *vma;
+ struct uprobe *uprobe;
+
+ if (!inode || !consumer)
+ return;
+
+ uprobe = find_uprobe(inode, offset);
+ if (!uprobe) {
+ pr_debug("No uprobe found with inode:offset %p %lld\n",
+ inode, offset);
+ return;
+ }
+
+ if (!del_consumer(uprobe, consumer)) {
+ pr_debug("No uprobe found with consumer %p\n",
+ consumer);
+ return;
+ }
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mapping = inode->i_mapping;
+
+ mutex_lock(&uprobes_mutex);
+ if (uprobe->consumers)
+ goto put_unlock;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
+ struct task_struct *tsk;
+
+ if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
+ continue;
+
+ mm = vma->vm_mm;
+
+ if (!atomic_read(&mm->uprobes_count)) {
+ mmput(mm);
+ continue;
+ }
+
+ if (valid_vma(vma)) {
+ loff_t vaddr;
+
+ vaddr = vma->vm_start + offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
+ /* Not in this vma */
+ mmput(mm);
+ continue;
+ }
+ tsk = get_mm_owner(mm);
+ if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
+ /*
+ * We cannot have a virtual address that is
+ * greater than TASK_SIZE_OF(tsk)
+ */
+ put_task_struct(tsk);
+ mmput(mm);
+ continue;
+ }
+ put_task_struct(tsk);
+ mm->uprobes_vaddr = (unsigned long) vaddr;
+ list_add(&mm->uprobes_list, &tmp_list);
+ } else
+ mmput(mm);
+ }
+ mutex_unlock(&mapping->i_mmap_mutex);
+ list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list)
+ remove_breakpoint(mm, uprobe);
+
+ delete_uprobe(uprobe);
+
+put_unlock:
+ mutex_unlock(&uprobes_mutex);
+ put_uprobe(uprobe); /* drop access ref */
+}
The instruction analysis is based on the x86 instruction decoder. It
determines whether an instruction can be probed and what fixups are
needed after single-stepping it. Instruction analysis is done at probe
insertion time so that we avoid having to repeat the same analysis every
time the probe is hit. (A rough illustration of how the computed fixups
get used follows.)
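To make the role of the fixup flags concrete, here is a rough,
hypothetical illustration (this is not the code of this patch, nor of
the later probe-handling patches) of how UPROBES_FIX_IP and
UPROBES_FIX_CALL, computed by analyze_insn()/prepare_fixups(), could be
consumed once the instruction copy has been single-stepped out of line;
vaddr, xol_vaddr and the helper name are assumptions:

/*
 * Illustration only: applying fixups after single-stepping the copy at
 * xol_vaddr. vaddr is the address of the original probed instruction.
 */
static void apply_fixups_example(struct uprobe *uprobe, struct pt_regs *regs,
				 unsigned long vaddr, unsigned long xol_vaddr)
{
	long correction = (long)(vaddr - xol_vaddr);

	if (uprobe->fixups & UPROBES_FIX_IP) {
		/*
		 * ip currently points just past the copy; move it back so
		 * that it points just past the original instruction.
		 */
		regs->ip += correction;
	}

	if (uprobe->fixups & UPROBES_FIX_CALL) {
		/*
		 * A call pushed a return address inside the XOL area;
		 * rewrite it so it points back into the probed text.
		 * (A user access like this can sleep, which is presumably
		 * why UPROBES_FIX_SLEEPY accompanies UPROBES_FIX_CALL.)
		 */
		unsigned long ra;

		if (!get_user(ra, (unsigned long __user *)regs->sp)) {
			ra += correction;
			put_user(ra, (unsigned long __user *)regs->sp);
		}
	}
}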
Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 2
arch/x86/kernel/Makefile | 1
arch/x86/kernel/uprobes.c | 414 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 417 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/uprobes.c
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 192ba4a..4295ce0 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -37,4 +37,6 @@ struct uprobe_arch_info {
#else
struct uprobe_arch_info {};
#endif
+struct uprobe;
+extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cc0469a..8be00b9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -116,6 +116,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
obj-$(CONFIG_OF) += devicetree.o
+obj-$(CONFIG_UPROBES) += uprobes.o
###
# 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..79f74c5
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,414 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX 0x8000
+#define UPROBES_FIX_RIP_CX 0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+
+
+static const u32 good_insns_64[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+ W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Good-instruction tables for 32-bit apps */
+
+static const u32 good_insns_32[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field. On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes. These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ * but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+ int i;
+
+ for (i = 0; i < insn->prefixes.nbytes; i++) {
+ switch (insn->prefixes.bytes[i]) {
+ case 0x26: /*INAT_PFX_ES */
+ case 0x2E: /*INAT_PFX_CS */
+ case 0x36: /*INAT_PFX_DS */
+ case 0x3E: /*INAT_PFX_SS */
+ case 0xF0: /*INAT_PFX_LOCK */
+ return true;
+ }
+ }
+ return false;
+}
+
+static void report_bad_prefix(void)
+{
+ pr_warn_once("uprobes does not currently support probing "
+ "instructions with any of the following prefixes: "
+ "cs:, ds:, es:, ss:, lock:\n");
+}
+
+static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
+{
+ pr_warn_once("In %d-bit apps, "
+ "uprobes does not currently support probing "
+ "instructions whose first byte is 0x%2.2x\n", mode, op);
+}
+
+static void report_bad_2byte_opcode(uprobe_opcode_t op)
+{
+ pr_warn_once("uprobes does not currently support probing "
+ "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, false);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -ENOTSUPP;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(32, OPCODE1(insn));
+ return -ENOTSUPP;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+ insn_init(insn, uprobe->insn, true);
+
+ /* Skip good instruction prefixes; reject "bad" ones. */
+ insn_get_opcode(insn);
+ if (is_prefix_bad(insn)) {
+ report_bad_prefix();
+ return -ENOTSUPP;
+ }
+ if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+ return 0;
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn),
+ (unsigned long *) good_2byte_insns))
+ return 0;
+ report_bad_2byte_opcode(OPCODE2(insn));
+ } else
+ report_bad_1byte_opcode(64, OPCODE1(insn));
+ return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly. To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+ bool fix_ip = true, fix_call = false; /* defaults */
+ insn_get_opcode(insn); /* should be a nop */
+
+ switch (OPCODE1(insn)) {
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ /* ip is correct */
+ fix_ip = false;
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ fix_call = true;
+ break;
+ case 0x9a: /* call absolute - Fix return addr, not ip */
+ fix_call = true;
+ fix_ip = false;
+ break;
+ case 0xff:
+ {
+ int reg;
+ insn_get_modrm(insn);
+ reg = MODRM_REG(insn);
+ if (reg == 2 || reg == 3) {
+ /* call or lcall, indirect */
+ /* Fix return addr; ip is correct. */
+ fix_call = true;
+ fix_ip = false;
+ } else if (reg == 4 || reg == 5) {
+ /* jmp or ljmp, indirect */
+ /* ip is correct. */
+ fix_ip = false;
+ }
+ break;
+ }
+ case 0xea: /* jmp absolute -- ip is correct */
+ fix_ip = false;
+ break;
+ default:
+ break;
+ }
+ if (fix_ip)
+ uprobe->fixups |= UPROBES_FIX_IP;
+ if (fix_call)
+ uprobe->fixups |=
+ (UPROBES_FIX_CALL | UPROBES_FIX_SLEEPY);
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately. Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register. Set
+ * uprobe->fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly. (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area. At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+ u8 *cursor;
+ u8 reg;
+
+ uprobe->arch_info.rip_rela_target_address = 0x0;
+ if (!insn_rip_relative(insn))
+ return;
+
+ /*
+ * Point cursor at the modrm byte. The next 4 bytes are the
+ * displacement. Beyond the displacement, for some instructions,
+ * is the immediate operand.
+ */
+ cursor = uprobe->insn + insn->prefixes.nbytes
+ + insn->rex_prefix.nbytes + insn->opcode.nbytes;
+ insn_get_length(insn);
+
+ /*
+ * Convert from rip-relative addressing to indirect addressing
+ * via a scratch register. Change the r/m field from 0x5 (%rip)
+ * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+ */
+ reg = MODRM_REG(insn);
+ if (reg == 0) {
+ /*
+ * The register operand (if any) is either the A register
+ * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+ * REX prefix) %r8. In any case, we know the C register
+ * is NOT the register operand, so we use %rcx (register
+ * #1) for the scratch register.
+ */
+ uprobe->fixups = UPROBES_FIX_RIP_CX;
+ /* Change modrm from 00 000 101 to 00 000 001. */
+ *cursor = 0x1;
+ } else {
+ /* Use %rax (register #0) for the scratch register. */
+ uprobe->fixups = UPROBES_FIX_RIP_AX;
+ /* Change modrm from 00 xxx 101 to 00 xxx 000 */
+ *cursor = (reg << 3);
+ }
+
+ /* Target address = address of next instruction + (signed) offset */
+ uprobe->arch_info.rip_rela_target_address = (long) insn->length
+ + insn->displacement.value;
+ /* Displacement field is gone; slide immediate field (if any) over. */
+ if (insn->immediate.nbytes) {
+ cursor++;
+ memmove(cursor, cursor + insn->displacement.nbytes,
+ insn->immediate.nbytes);
+ }
+ return;
+}
+#else
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+ return;
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a -ve number on error.
+ */
+int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
+{
+ int ret;
+ struct insn insn;
+
+ uprobe->fixups = 0;
+ if (is_32bit_app(tsk))
+ ret = validate_insn_32bits(uprobe, &insn);
+ else
+ ret = validate_insn_64bits(uprobe, &insn);
+ if (ret != 0)
+ return ret;
+ if (!is_32bit_app(tsk))
+ handle_riprel_insn(uprobe, &insn);
+ prepare_fixups(uprobe, &insn);
+ return 0;
+}
On the first probe insertion, copy the original instruction and opcode.
If multiple vmas map the same text area corresponding to an inode, we
need to copy the instruction only once.
The copied instruction is later copied to a designated slot on probe
hit. It is also used at the time of probe removal to restore the
original instruction.
The opcode is used to analyze the instruction and determine the fixups.
Determining the fixups at probe-hit time would mean repeating the same
work on every hit, so instruction analysis using the opcode is done at
probe insertion time.
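As a rough illustration of the arithmetic copy_insn() below performs
(this sketch is not part of the patch; the 4K page size and the 16-byte
maximum instruction length are assumptions of the example), the probe
offset is split into a page-cache index and an in-page offset, and an
instruction that straddles a page boundary is copied in two pieces:

	#include <stdio.h>

	#define ASSUMED_PAGE_SIZE	4096UL	/* assumption for this example */
	#define ASSUMED_MAX_INSN	16UL	/* MAX_UINSN_BYTES on x86 */

	int main(void)
	{
		unsigned long offset = 0x1ffa;	/* probe offset within the file */
		unsigned long idx = offset / ASSUMED_PAGE_SIZE;		/* page-cache index: 1 */
		unsigned long in_page = offset % ASSUMED_PAGE_SIZE;	/* 0xffa */
		unsigned long head = ASSUMED_PAGE_SIZE - in_page;	/* bytes left on this page: 6 */

		if (head < ASSUMED_MAX_INSN)
			printf("copy %lu bytes from page %lu, then %lu bytes from page %lu\n",
			       head, idx, ASSUMED_MAX_INSN - head, idx + 1);
		else
			printf("copy %lu bytes from page %lu\n", ASSUMED_MAX_INSN, idx);
		return 0;
	}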
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 114 insertions(+), 7 deletions(-)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index c6c2f5e..9564a78 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -133,6 +133,7 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
unsigned long vaddr, uprobe_opcode_t opcode)
{
struct page *old_page, *new_page;
+ struct address_space *mapping;
void *vaddr_old, *vaddr_new;
struct vm_area_struct *vma;
unsigned long addr;
@@ -153,6 +154,18 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
if (!valid_vma(vma))
goto put_out;
+ mapping = uprobe->inode->i_mapping;
+ if (mapping != vma->vm_file->f_mapping)
+ goto put_out;
+
+ addr = vma->vm_start + uprobe->offset;
+ addr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (addr > TASK_SIZE_OF(tsk))
+ goto put_out;
+
+ if (vaddr != (unsigned long) addr)
+ goto put_out;
+
/* Allocate a page */
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
if (!new_page) {
@@ -171,7 +184,6 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
/* poke the new insn in, ASSUMES we don't cross page boundary */
- addr = vaddr;
vaddr &= ~PAGE_MASK;
memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
@@ -507,6 +519,66 @@ static bool del_consumer(struct uprobe *uprobe,
return ret;
}
+static int __copy_insn(struct address_space *mapping,
+ struct vm_area_struct *vma, char *insn,
+ unsigned long nbytes, unsigned long offset)
+{
+ struct file *filp = vma->vm_file;
+ struct page *page;
+ void *vaddr;
+ unsigned long off1;
+ unsigned long idx;
+
+ if (!filp)
+ return -EINVAL;
+
+ idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
+ off1 = offset &= ~PAGE_MASK;
+
+ /*
+ * Ensure that the page that has the original instruction is
+ * populated and in page-cache.
+ */
+ page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
+ page = grab_cache_page(mapping, idx);
+ if (!page)
+ return -ENOMEM;
+
+ vaddr = kmap_atomic(page, KM_USER0);
+ memcpy(insn, vaddr + off1, nbytes);
+ kunmap_atomic(vaddr);
+ unlock_page(page);
+ page_cache_release(page);
+ return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct address_space *mapping;
+ int bytes;
+ unsigned long nbytes;
+
+ addr &= ~PAGE_MASK;
+ nbytes = PAGE_SIZE - addr;
+ mapping = uprobe->inode->i_mapping;
+
+ /* Instruction at end of binary; copy only available bytes */
+ if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+ bytes = uprobe->inode->i_size - uprobe->offset;
+ else
+ bytes = MAX_UINSN_BYTES;
+
+ /* Instruction at the page-boundary; copy bytes in second page */
+ if (nbytes < bytes) {
+ if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
+ bytes - nbytes, uprobe->offset + nbytes))
+ return -ENOMEM;
+ bytes = nbytes;
+ }
+ return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
+}
+
static struct task_struct *get_mm_owner(struct mm_struct *mm)
{
struct task_struct *tsk;
@@ -521,22 +593,57 @@ static struct task_struct *get_mm_owner(struct mm_struct *mm)
static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk = get_mm_owner(mm);
+ int ret;
- /*TODO: install breakpoint */
- if (!ret)
+ if (!tsk) /* task is probably exiting; bail-out */
+ return -ESRCH;
+
+ if (!uprobe->copy) {
+ struct vm_area_struct *vma = find_vma(mm, mm->uprobes_vaddr);
+
+ ret = copy_insn(uprobe, vma, mm->uprobes_vaddr);
+ if (ret)
+ goto put_return;
+ if (is_bkpt_insn(uprobe->insn)) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "breakpoint instruction already exists");
+ ret = -EEXIST;
+ goto put_return;
+ }
+ ret = analyze_insn(tsk, uprobe);
+ if (ret) {
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "instruction type cannot be probed");
+ goto put_return;
+ }
+ uprobe->copy = 1;
+ }
+
+ ret = set_bkpt(tsk, uprobe, mm->uprobes_vaddr);
+ if (ret < 0)
+ print_insert_fail(tsk, mm->uprobes_vaddr,
+ "failed to insert bkpt instruction");
+ else
atomic_inc(&mm->uprobes_count);
+
+put_return:
+ put_task_struct(tsk);
return ret;
}
static int __remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
{
- int ret = 0;
+ struct task_struct *tsk = get_mm_owner(mm);
+ int ret;
- /*TODO: remove breakpoint */
+ if (!tsk) /* task is probably exiting; bail-out */
+ return -ESRCH;
+
+ ret = set_orig_insn(tsk, uprobe, mm->uprobes_vaddr, true);
if (!ret)
atomic_dec(&mm->uprobes_count);
-
+ put_task_struct(tsk);
return ret;
}
Provides hooks in mmap and fork.
On fork, after the new mm is created, we need to set the count of
uprobes. On mmap, check whether the mapped region is executable; if it
is, walk the rbtree and insert the actual breakpoints for probes
already registered against this inode.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 14 ++++++
kernel/fork.c | 2 +
kernel/uprobes.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++
mm/mmap.c | 6 +++
4 files changed, 127 insertions(+), 1 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 4087cc3..fc2f9d2 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -66,6 +66,7 @@ struct uprobe_consumer {
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
atomic_t ref;
+ struct list_head pending_list;
struct rw_semaphore consumer_rwsem;
struct uprobe_arch_info arch_info; /* arch specific info if any */
struct uprobe_consumer *consumers;
@@ -110,6 +111,10 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+
+struct vm_area_struct;
+extern int mmap_uprobe(struct vm_area_struct *vma);
+extern void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -120,6 +125,13 @@ static inline void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
{
}
-
+static inline void dup_mmap_uprobe(struct mm_struct *old_mm,
+ struct mm_struct *mm)
+{
+}
+static inline int mmap_uprobe(struct vm_area_struct *vma)
+{
+ return 0;
+}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 0276c30..f6c7cb1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -67,6 +67,7 @@
#include <linux/user-return-notifier.h>
#include <linux/oom.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -423,6 +424,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
}
/* a new mm has just been created */
arch_dup_mmap(oldmm, mm);
+ dup_mmap_uprobe(oldmm, mm);
retval = 0;
out:
up_write(&mm->mmap_sem);
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9564a78..93a53c0 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -451,6 +451,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
uprobe->inode = inode;
uprobe->offset = offset;
init_rwsem(&uprobe->consumer_rwsem);
+ INIT_LIST_HEAD(&uprobe->pending_list);
/* add to uprobes_tree, sorted on inode:offset */
cur_uprobe = insert_uprobe(uprobe);
@@ -916,3 +917,108 @@ put_unlock:
mutex_unlock(&uprobes_mutex);
put_uprobe(uprobe); /* drop access ref */
}
+
+static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
+ struct list_head *tmp_list)
+{
+ struct uprobe *uprobe;
+ struct rb_node *n;
+ unsigned long flags;
+
+ n = uprobes_tree.rb_node;
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ uprobe = __find_uprobe(inode, 0, &n);
+ for (; n; n = rb_next(n)) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ if (uprobe->inode != inode)
+ break;
+ list_add(&uprobe->pending_list, tmp_list);
+ continue;
+ }
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+}
+
+/*
+ * Called from dup_mmap.
+ * called with mm->mmap_sem and old_mm->mmap_sem acquired.
+ */
+void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm)
+{
+ atomic_set(&mm->uprobes_count,
+ atomic_read(&old_mm->uprobes_count));
+}
+
+/*
+ * Called from mmap_region.
+ * called with mm->mmap_sem acquired.
+ *
+ * Return a -ve number if we fail to insert probes and we cannot
+ * bail out.
+ * Return 0 otherwise, i.e.:
+ * - successful insertion of probes
+ * - no possible probes to be inserted.
+ * - insertion of probes failed but we can bail-out.
+ */
+int mmap_uprobe(struct vm_area_struct *vma)
+{
+ struct list_head tmp_list;
+ struct uprobe *uprobe, *u;
+ struct mm_struct *mm;
+ struct inode *inode;
+ unsigned long start, pgoff;
+ int ret = 0;
+
+ if (!valid_vma(vma))
+ return ret; /* Bail-out */
+
+ INIT_LIST_HEAD(&tmp_list);
+
+ mm = vma->vm_mm;
+ inode = vma->vm_file->f_mapping->host;
+ start = vma->vm_start;
+ pgoff = vma->vm_pgoff;
+ __iget(inode);
+
+ up_write(&mm->mmap_sem);
+ mutex_lock(&uprobes_mutex);
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start);
+ /* Not the same vma */
+ if (!vma || vma->vm_start != start ||
+ vma->vm_pgoff != pgoff || !valid_vma(vma) ||
+ inode->i_mapping != vma->vm_file->f_mapping)
+ goto mmap_out;
+
+ add_to_temp_list(vma, inode, &tmp_list);
+ list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+ loff_t vaddr;
+
+ list_del(&uprobe->pending_list);
+ if (ret)
+ continue;
+
+ vaddr = vma->vm_start + uprobe->offset;
+ vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+ if (vaddr < vma->vm_start || vaddr > vma->vm_end)
+ /* Not in this vma */
+ continue;
+ if (vaddr > TASK_SIZE)
+ /*
+ * We cannot have a virtual address that is
+ * greater than TASK_SIZE
+ */
+ continue;
+ mm->uprobes_vaddr = (unsigned long)vaddr;
+ ret = install_breakpoint(mm, uprobe);
+ if (ret && (ret == -ESRCH || ret == -EEXIST))
+ ret = 0;
+ }
+
+mmap_out:
+ mutex_unlock(&uprobes_mutex);
+ iput(inode);
+ up_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ return ret;
+}
diff --git a/mm/mmap.c b/mm/mmap.c
index bbdc9af..3ff312f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
#include <linux/perf_event.h>
#include <linux/audit.h>
#include <linux/khugepaged.h>
+#include <linux/uprobes.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1344,6 +1345,11 @@ out:
mm->locked_vm += (len >> PAGE_SHIFT);
} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
make_pages_present(addr, addr + len);
+
+ if (file && mmap_uprobe(vma))
+ /* matching probes but cannot insert */
+ goto unmap_and_free_vma;
+
return addr;
unmap_and_free_vma:
On X86_64, we need to support rip-relative instructions.
Rip-relative instructions are handled by saving the scratch register
on probe hit and restoring the previously saved value after the
single-step. The value stored at probe hit is specific to each task,
so it is kept in uprobe_task_arch_info.
Since x86_32 has no rip-relative instructions, nothing needs to be done
there.
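For illustration only (not part of the patch; the address, length and
displacement below are made-up values), the target of a rip-relative
access is the address of the next instruction plus the signed 4-byte
displacement; analyze_insn() records (length + displacement) and
pre_xol() adds the original probe address before parking the result in
the scratch register for the duration of the single-step:

	#include <stdio.h>

	int main(void)
	{
		unsigned long probe_vaddr = 0x400b3cUL;	/* assumed address of probed insn */
		long insn_length = 6;			/* e.g. movl %edx,0x1234(%rip) */
		long displacement = 0x1234;		/* signed 4-byte rip displacement */

		/* rip_rela_target_address, as stored at analysis time */
		long rip_rela_target = insn_length + displacement;

		/* value loaded into the scratch register by pre_xol() */
		unsigned long operand = probe_vaddr + rip_rela_target;

		printf("scratch register holds %#lx during the single-step\n", operand);
		return 0;
	}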
Signed-off-by: Jim Keniston <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 4295ce0..2f3c64d 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -34,8 +34,13 @@ typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {
unsigned long rip_rela_target_address;
};
+
+struct uprobe_task_arch_info {
+ unsigned long saved_scratch_register;
+};
#else
struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
Uprobes needs to maintain some task-specific information: whether a
task is currently uprobed, the uprobe currently being handled, any
arch-specific information (for example to handle rip-relative
instructions), and the per-task slot to which the original instruction
is copied before single-stepping.
Provides routines to create, manage and free this task-specific
information.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/sched.h | 9 ++++++---
include/linux/uprobes.h | 25 +++++++++++++++++++++++++
kernel/fork.c | 4 ++++
kernel/uprobes.c | 38 ++++++++++++++++++++++++++++++++++++++
4 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a837b20..9af9c99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1304,9 +1304,9 @@ struct task_struct {
unsigned long stack_canary;
#endif
- /*
+ /*
* pointers to (original) parent process, youngest child, younger sibling,
- * older sibling, respectively. (p->father can be replaced with
+ * older sibling, respectively. (p->father can be replaced with
* p->real_parent->pid)
*/
struct task_struct *real_parent; /* real parent process */
@@ -1561,6 +1561,9 @@ struct task_struct {
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
#endif
+#ifdef CONFIG_UPROBES
+ struct uprobe_task *utask;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -2127,7 +2130,7 @@ static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, s
spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
return ret;
-}
+}
extern void block_all_signals(int (*notifier)(void *priv), void *priv,
sigset_t *mask);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index fc2f9d2..821e000 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -26,12 +26,14 @@
#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
+struct uprobe_task_arch_info; /* arch specific task info */
#else
/*
* ARCH_SUPPORTS_UPROBES is not defined.
*/
typedef u8 uprobe_opcode_t;
struct uprobe_arch_info {}; /* arch specific info*/
+struct uprobe_task_arch_info {}; /* arch specific task info */
#endif /* CONFIG_ARCH_SUPPORTS_UPROBES */
/* Post-execution fixups. Some architectures may define others. */
@@ -77,6 +79,27 @@ struct uprobe {
int copy;
};
+enum uprobe_task_state {
+ UTASK_RUNNING,
+ UTASK_BP_HIT,
+ UTASK_SSTEP
+};
+
+/*
+ * uprobe_task -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ * Guarded by uproc->mutex.
+ */
+struct uprobe_task {
+ unsigned long xol_vaddr;
+ unsigned long vaddr;
+
+ enum uprobe_task_state state;
+ struct uprobe_task_arch_info tskinfo;
+
+ struct uprobe *active_uprobe;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
@@ -111,6 +134,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
extern void unregister_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer);
+extern void free_uprobe_utask(struct task_struct *tsk);
struct vm_area_struct;
extern int mmap_uprobe(struct vm_area_struct *vma);
@@ -133,5 +157,6 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
{
return 0;
}
+static inline void free_uprobe_utask(struct task_struct *tsk) {}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index f6c7cb1..bf5999b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -196,6 +196,7 @@ void __put_task_struct(struct task_struct *tsk)
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+ free_uprobe_utask(tsk);
if (!profile_handoff_task(tsk))
free_task(tsk);
}
@@ -1268,6 +1269,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->pi_state_list);
p->pi_state_cache = NULL;
#endif
+#ifdef CONFIG_UPROBES
+ p->utask = NULL;
+#endif
/*
* sigaltstack should be cleared when sharing the same VM
*/
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 93a53c0..2bb2bd7 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1022,3 +1022,41 @@ mmap_out:
down_write(&mm->mmap_sem);
return ret;
}
+
+/*
+ * Called with no locks held.
+ * Called in context of a exiting or a exec-ing thread.
+ */
+void free_uprobe_utask(struct task_struct *tsk)
+{
+ struct uprobe_task *utask = tsk->utask;
+
+ if (!utask)
+ return;
+
+ if (utask->active_uprobe)
+ put_uprobe(utask->active_uprobe);
+ kfree(utask);
+ tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - negative errno otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+ struct uprobe_task *utask;
+
+ utask = kzalloc(sizeof *utask, GFP_KERNEL);
+ if (unlikely(utask == NULL))
+ return ERR_PTR(-ENOMEM);
+
+ utask->active_uprobe = NULL;
+ current->utask = utask;
+ return utask;
+}
Slots are allocated at probe-hit time and released after the
single-step. When a probe is hit, the original instruction
corresponding to the probe hit is copied to the allocated slot.
Currently we allocate one page of slots for each mm. A bitmap tracks
which slots are free. Each slot is 128 bytes so that it is
cache-aligned.
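To make the slot layout concrete (an illustration only, not part of the
patch; the 4K page size and the addresses are assumptions), a page
holds 4096/128 = 32 slots, and a slot number taken from the bitmap maps
to a virtual address inside the XOL area as follows:

	#include <stdio.h>

	#define XOL_SLOT_BYTES		128UL	/* UPROBES_XOL_SLOT_BYTES */
	#define ASSUMED_PAGE_SIZE	4096UL	/* assumption for this example */

	int main(void)
	{
		unsigned long slots_per_page = ASSUMED_PAGE_SIZE / XOL_SLOT_BYTES; /* 32 */
		unsigned long area_vaddr = 0x7f65d9a21000UL;	/* assumed XOL vma start */
		unsigned long slot_nr = 5;			/* first clear bit in the bitmap */
		unsigned long slot_vaddr = area_vaddr + slot_nr * XOL_SLOT_BYTES;

		printf("%lu slots per page; slot %lu is at %#lx\n",
		       slots_per_page, slot_nr, slot_vaddr);
		return 0;
	}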
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/mm_types.h | 4 +
include/linux/uprobes.h | 23 ++++
kernel/fork.c | 4 +
kernel/uprobes.c | 242 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 273 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7bfef2e..e016ac7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,9 @@
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
+#ifdef CONFIG_UPROBES
+#include <linux/uprobes.h>
+#endif
#include <asm/page.h>
#include <asm/mmu.h>
@@ -320,6 +323,7 @@ struct mm_struct {
unsigned long uprobes_vaddr;
struct list_head uprobes_list; /* protected by uprobes_mutex */
atomic_t uprobes_count;
+ struct uprobes_xol_area *uprobes_xol_area;
#endif
};
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 821e000..4590e9a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -101,6 +101,27 @@ struct uprobe_task {
};
/*
+ * On a breakpoint hit, the thread contends for a slot. It frees the
+ * slot after the single-step. Only a fixed number of slots are
+ * allocated.
+ */
+
+struct uprobes_xol_area {
+ spinlock_t slot_lock; /* protects bitmap and slot (de)allocation*/
+ wait_queue_head_t wq; /* if all slots are busy */
+ atomic_t slot_count; /* currently in use slots */
+ unsigned long *bitmap; /* 0 = free slot */
+ struct page *page;
+
+ /*
+ * We keep the vma's vm_start rather than a pointer to the vma
+ * itself. The probed process or a naughty kernel module could make
+ * the vma go away, and we must handle that reasonably gracefully.
+ */
+ unsigned long vaddr; /* Page(s) of instruction slots */
+};
+
+/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
*
@@ -139,6 +160,7 @@ extern void free_uprobe_utask(struct task_struct *tsk);
struct vm_area_struct;
extern int mmap_uprobe(struct vm_area_struct *vma);
extern void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm);
+extern void free_uprobes_xol_area(struct mm_struct *mm);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -158,5 +180,6 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
return 0;
}
static inline void free_uprobe_utask(struct task_struct *tsk) {}
+static inline void free_uprobes_xol_area(struct mm_struct *mm) {}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index bf5999b..c2790b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -557,6 +557,7 @@ void mmput(struct mm_struct *mm)
might_sleep();
if (atomic_dec_and_test(&mm->mm_users)) {
+ free_uprobes_xol_area(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
@@ -741,6 +742,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
+#ifdef CONFIG_UPROBES
+ mm->uprobes_xol_area = NULL;
+#endif
if (!mm_init(mm, tsk))
goto fail_nomem;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 2bb2bd7..d19c3b0 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,12 +33,29 @@
#include <linux/rmap.h> /* needed for anon_vma_prepare */
#include <linux/mmu_notifier.h> /* needed for set_pte_at_notify */
#include <linux/swap.h> /* needed for try_to_free_swap */
+#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
+#include <linux/file.h> /* needed for fput() */
+#include <linux/init_task.h> /* init_cred */
+#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma,
+ * but not an XOL vma.
+ * - Return true if the vma is executable and is not the XOL vma.
+ */
static bool valid_vma(struct vm_area_struct *vma)
{
+ struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
+
if (!vma->vm_file)
return false;
+ if (area && (area->vaddr == vma->vm_start))
+ return false;
+
if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
(VM_READ|VM_EXEC))
return true;
@@ -1023,6 +1040,229 @@ mmap_out:
return ret;
}
+/* Slot allocation for XOL */
+
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+ const struct cred *curr_cred;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ unsigned long addr;
+ int ret = -ENOMEM;
+
+ mm = get_task_mm(current);
+ if (!mm)
+ return -ESRCH;
+
+ down_write(&mm->mmap_sem);
+ if (mm->uprobes_xol_area) {
+ ret = -EALREADY;
+ goto fail;
+ }
+
+ /*
+ * Find the end of the top mapping and skip a page.
+ * If there is no space for PAGE_SIZE above
+ * that, mmap will ignore our address hint.
+ *
+ * override credentials otherwise anonymous memory might
+ * not be granted execute permission when the selinux
+ * security hooks have their way.
+ */
+ vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
+ addr = vma->vm_end + PAGE_SIZE;
+ curr_cred = override_creds(&init_cred);
+ addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
+ revert_creds(curr_cred);
+
+ if (addr & ~PAGE_MASK) {
+ pr_debug("uprobes_xol failed to allocate a vma for pid/tgid"
+ "%d/%d for single-stepping out of line.\n",
+ current->pid, current->tgid);
+ goto fail;
+ }
+ vma = find_vma(mm, addr);
+
+ /* Don't expand vma on mremap(). */
+ vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+ area->vaddr = vma->vm_start;
+ if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
+ &vma) > 0)
+ ret = 0;
+
+fail:
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+ struct uprobes_xol_area *area = NULL;
+
+ area = kzalloc(sizeof(*area), GFP_KERNEL);
+ if (unlikely(!area))
+ return NULL;
+
+ area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+ GFP_KERNEL);
+
+ if (!area->bitmap)
+ goto fail;
+
+ init_waitqueue_head(&area->wq);
+ spin_lock_init(&area->slot_lock);
+ if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
+ task_lock(current);
+ if (!current->mm->uprobes_xol_area) {
+ current->mm->uprobes_xol_area = area;
+ task_unlock(current);
+ return area;
+ }
+ task_unlock(current);
+ }
+
+fail:
+ kfree(area->bitmap);
+ kfree(area);
+ return current->mm->uprobes_xol_area;
+}
+
+/*
+ * free_uprobes_xol_area - Free the area allocated for slots.
+ */
+void free_uprobes_xol_area(struct mm_struct *mm)
+{
+ struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+ if (!area)
+ return;
+
+ put_page(area->page);
+ kfree(area->bitmap);
+ kfree(area);
+}
+
+static void xol_wait_event(struct uprobes_xol_area *area)
+{
+ if (atomic_read(&area->slot_count) >= UINSNS_PER_PAGE)
+ wait_event(area->wq,
+ (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+}
+
+/*
+ * - search for a free slot.
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+ unsigned long slot_addr, flags;
+ int slot_nr;
+
+ do {
+ spin_lock_irqsave(&area->slot_lock, flags);
+ slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+ if (slot_nr < UINSNS_PER_PAGE) {
+ __set_bit(slot_nr, area->bitmap);
+ slot_addr = area->vaddr +
+ (slot_nr * UPROBES_XOL_SLOT_BYTES);
+ atomic_inc(&area->slot_count);
+ }
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ if (slot_nr >= UINSNS_PER_PAGE)
+ xol_wait_event(area);
+
+ } while (slot_nr >= UINSNS_PER_PAGE);
+
+ return slot_addr;
+}
+
+/*
+ * xol_get_insn_slot - If the current task was not already allocated
+ * a slot, allocate one.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+ unsigned long slot_addr)
+{
+ struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+ unsigned long offset;
+ void *vaddr;
+
+ if (!area) {
+ area = xol_alloc_area();
+ if (!area)
+ return 0;
+ }
+ current->utask->xol_vaddr = xol_take_insn_slot(area);
+
+ /*
+ * Initialize the slot if xol_vaddr points to valid
+ * instruction slot.
+ */
+ if (unlikely(!current->utask->xol_vaddr))
+ return 0;
+
+ current->utask->vaddr = slot_addr;
+ offset = current->utask->xol_vaddr & ~PAGE_MASK;
+ vaddr = kmap_atomic(area->page, KM_USER0);
+ memcpy(vaddr + offset, uprobe->insn, MAX_UINSN_BYTES);
+ kunmap_atomic(vaddr);
+ return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If slot was earlier allocated by
+ * @xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk)
+{
+ struct uprobes_xol_area *area;
+ unsigned long vma_end;
+ unsigned long slot_addr;
+
+ if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
+ return;
+
+ slot_addr = tsk->utask->xol_vaddr;
+
+ if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+ return;
+
+ area = tsk->mm->uprobes_xol_area;
+ vma_end = area->vaddr + PAGE_SIZE;
+ if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+ int slot_nr;
+ unsigned long offset = slot_addr - area->vaddr;
+ unsigned long flags;
+
+ slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+ if (slot_nr >= UINSNS_PER_PAGE) {
+ pr_debug("%s: no XOL vma for slot address %#lx\n",
+ __func__, slot_addr);
+ return;
+ }
+
+ spin_lock_irqsave(&area->slot_lock, flags);
+ __clear_bit(slot_nr, area->bitmap);
+ spin_unlock_irqrestore(&area->slot_lock, flags);
+ atomic_dec(&area->slot_count);
+ if (waitqueue_active(&area->wq))
+ wake_up(&area->wq);
+ tsk->utask->xol_vaddr = 0;
+ return;
+ }
+ pr_debug("%s: no XOL vma for slot address %#lx\n",
+ __func__, slot_addr);
+}
+
/*
* Called with no locks held.
* Called in context of a exiting or a exec-ing thread.
@@ -1036,6 +1276,8 @@ void free_uprobe_utask(struct task_struct *tsk)
if (utask->active_uprobe)
put_uprobe(utask->active_uprobe);
+
+ xol_free_insn_slot(tsk);
kfree(utask);
tsk->utask = NULL;
}
On a breakpoint hit, return the address where the breakpoint was hit.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Jim Keniston <[email protected]>
---
include/linux/uprobes.h | 5 +++++
kernel/uprobes.c | 11 +++++++++++
2 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 4590e9a..838fbaa 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -161,6 +161,7 @@ struct vm_area_struct;
extern int mmap_uprobe(struct vm_area_struct *vma);
extern void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm);
extern void free_uprobes_xol_area(struct mm_struct *mm);
+extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -181,5 +182,9 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
}
static inline void free_uprobe_utask(struct task_struct *tsk) {}
static inline void free_uprobes_xol_area(struct mm_struct *mm) {}
+static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+ return 0;
+}
#endif /* CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index d19c3b0..fa9e9ba 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1263,6 +1263,17 @@ static void xol_free_insn_slot(struct task_struct *tsk)
__func__, slot_addr);
}
+/**
+ * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+ return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
/*
* Called with no locks held.
* Called in context of a exiting or a exec-ing thread.
Provides x86-specific implementations for setting the current
instruction pointer, pre- and post-single-step handling, and enabling
and disabling single-step.
This patch also introduces TIF_UPROBE, which is set by the uprobes
notifier code. TIF_UPROBE indicates that there is pending work to be
done at do_notify_resume time.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2
arch/x86/include/asm/uprobes.h | 3 +
arch/x86/kernel/uprobes.c | 148 ++++++++++++++++++++++++++++++++++++
3 files changed, 153 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 1f2e61e..2dc1921 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
+#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 2f3c64d..3a7833c 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -44,4 +44,7 @@ struct uprobe_task_arch_info {};
#endif
struct uprobe;
extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
+extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 79f74c5..8d90ff3 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>
+#include <linux/uaccess.h>
#include <linux/kdebug.h>
#include <asm/insn.h>
@@ -412,3 +413,150 @@ int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
prepare_fixups(uprobe, &insn);
return 0;
}
+
+/*
+ * @regs: reflects the saved state of the task.
+ * @vaddr: the virtual address to jump to.
+ */
+void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
+{
+ regs->ip = vaddr;
+}
+
+/*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of @tsk.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns 0 on success.
+ */
+#ifdef CONFIG_X86_64
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task_arch_info *tskinfo = ¤t->utask->tskinfo;
+
+ regs->ip = current->utask->xol_vaddr;
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
+ tskinfo->saved_scratch_register = regs->ax;
+ regs->ax = current->utask->vaddr;
+ regs->ax += uprobe->arch_info.rip_rela_target_address;
+ } else if (uprobe->fixups & UPROBES_FIX_RIP_CX) {
+ tskinfo->saved_scratch_register = regs->cx;
+ regs->cx = current->utask->vaddr;
+ regs->cx += uprobe->arch_info.rip_rela_target_address;
+ }
+ return 0;
+}
+#else
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ regs->ip = current->utask->xol_vaddr;
+ return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+ int rasize, ncopied;
+ long ra = 0;
+
+ if (is_32bit_app(current))
+ rasize = 4;
+ else
+ rasize = 8;
+ ncopied = copy_from_user(&ra, (void __user *) sp, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ ra += correction;
+ ncopied = copy_to_user((void __user *) sp, &ra, rasize);
+ if (unlikely(ncopied))
+ goto fail;
+ return 0;
+
+fail:
+ pr_warn_once("uprobes: Failed to adjust return address after"
+ " single-stepping call instruction;"
+ " pid=%d, sp=%#lx\n", current->pid, sp);
+ return -EFAULT;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+ return ((uprobe->fixups &
+ (UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+ struct pt_regs *regs, long *correction)
+{
+ if (is_riprel_insn(uprobe)) {
+ struct uprobe_task_arch_info *tskinfo;
+ tskinfo = ¤t->utask->tskinfo;
+
+ if (uprobe->fixups & UPROBES_FIX_RIP_AX)
+ regs->ax = tskinfo->saved_scratch_register;
+ else
+ regs->cx = tskinfo->saved_scratch_register;
+ /*
+ * The original instruction includes a displacement, and so
+ * is 4 bytes longer than what we've just single-stepped.
+ * Fall through to handle stuff like "jmpq *...(%rip)" and
+ * "callq *...(%rip)".
+ */
+ *correction += 4;
+ }
+}
+#else
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+ struct pt_regs *regs, long *correction)
+{
+}
+#endif
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction. We need
+ * to make it relative to the original instruction (FIX_IP). Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction. We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field). (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+ int result = 0;
+ long correction;
+
+ correction = (long)(utask->vaddr - utask->xol_vaddr);
+ handle_riprel_post_xol(uprobe, regs, &correction);
+ if (uprobe->fixups & UPROBES_FIX_IP)
+ regs->ip += correction;
+ if (uprobe->fixups & UPROBES_FIX_CALL)
+ result = adjust_ret_addr(regs->sp, correction);
+ return result;
+}
On int3, set the TIF_UPROBE flag and, if task-specific info is
available, mark the task state as breakpoint hit. Setting the
TIF_UPROBE flag results in uprobe_notify_resume being called.
uprobe_notify_resume walks through the vmas, matches the inode and
offset corresponding to the instruction pointer against entries in the
rbtree, and, once a matching uprobe is found, runs the handlers for all
the consumers that have registered.
On a single-step exception, perform the necessary fixups and allow the
process to continue. The necessary fixups are determined at instruction
analysis time.
TODO: If there is no matching uprobe, signal a trap to the process.
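A minimal sketch of the lookup key computed by uprobe_notify_resume()
(illustration only, not part of the patch; the addresses are made up
and the 1-byte breakpoint size is the x86 int3): the trap leaves the
instruction pointer just past the breakpoint, and the breakpoint
address is translated back into an inode offset for the rbtree search:

	#include <stdio.h>

	int main(void)
	{
		unsigned long ip = 0x400a3dUL;		/* regs->ip after the trap (assumed) */
		unsigned long bkpt_size = 1;		/* UPROBES_BKPT_INSN_SIZE on x86 */
		unsigned long vm_start = 0x400000UL;	/* assumed vma->vm_start */
		unsigned long vm_pgoff = 0;		/* assumed vma->vm_pgoff (in pages) */
		unsigned long page_shift = 12;		/* assumed PAGE_SHIFT */

		unsigned long bkpt_vaddr = ip - bkpt_size;	/* get_uprobe_bkpt_addr() */
		unsigned long offset = bkpt_vaddr - vm_start + (vm_pgoff << page_shift);

		printf("search the uprobes rbtree for <inode>:%#lx\n", offset);
		return 0;
	}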
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 4 +
kernel/uprobes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 147 insertions(+), 0 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 838fbaa..8581723 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -162,6 +162,9 @@ extern int mmap_uprobe(struct vm_area_struct *vma);
extern void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm);
extern void free_uprobes_xol_area(struct mm_struct *mm);
extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
#else /* CONFIG_UPROBES is not defined */
static inline int register_uprobe(struct inode *inode, loff_t offset,
struct uprobe_consumer *consumer)
@@ -182,6 +185,7 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
}
static inline void free_uprobe_utask(struct task_struct *tsk) {}
static inline void free_uprobes_xol_area(struct mm_struct *mm) {}
+static inline void uprobe_notify_resume(struct pt_regs *regs) {}
static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
{
return 0;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index fa9e9ba..1e88d64 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1313,3 +1313,146 @@ static struct uprobe_task *add_utask(void)
current->utask = utask;
return utask;
}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+ unsigned long vaddr)
+{
+ if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs)) {
+ set_instruction_pointer(regs, current->utask->xol_vaddr);
+ return 0;
+ }
+ return -EFAULT;
+}
+
+/*
+ * Verify from Instruction Pointer if singlestep has indeed occurred.
+ * If Singlestep has occurred, then do post singlestep fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ unsigned long vaddr = instruction_pointer(regs);
+
+ /*
+ * If we have executed out of line, Instruction pointer
+ * cannot be same as virtual address of XOL slot.
+ */
+ if (vaddr == current->utask->xol_vaddr)
+ return false;
+ post_xol(uprobe, regs);
+ return true;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ * If it is the first time the probepoint is hit, a slot gets
+ * allocated here. If it is the first time the thread hits a
+ * breakpoint, a utask gets allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+ struct vm_area_struct *vma;
+ struct uprobe_task *utask;
+ struct mm_struct *mm;
+ struct uprobe *u = NULL;
+ unsigned long probept;
+
+ utask = current->utask;
+ mm = current->mm;
+ if (!utask || utask->state == UTASK_BP_HIT) {
+ probept = get_uprobe_bkpt_addr(regs);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, probept);
+ if (vma && valid_vma(vma))
+ u = find_uprobe(vma->vm_file->f_mapping->host,
+ probept - vma->vm_start +
+ (vma->vm_pgoff << PAGE_SHIFT));
+ up_read(&mm->mmap_sem);
+ if (!u)
+ goto cleanup_ret;
+ if (!utask) {
+ utask = add_utask();
+ if (!utask)
+ goto cleanup_ret;
+ }
+ /* TODO Start queueing signals. */
+ utask->active_uprobe = u;
+ handler_chain(u, regs);
+ utask->state = UTASK_SSTEP;
+ if (!pre_ssout(u, regs, probept))
+ user_enable_single_step(current);
+ else
+ goto cleanup_ret;
+ } else if (utask->state == UTASK_SSTEP) {
+ u = utask->active_uprobe;
+ if (sstep_complete(u, regs)) {
+ put_uprobe(u);
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ user_disable_single_step(current);
+ xol_free_insn_slot(current);
+
+ /* TODO Stop queueing signals. */
+ }
+ }
+ return;
+
+cleanup_ret:
+ if (u) {
+ down_read(&mm->mmap_sem);
+ if (!set_orig_insn(current, u, probept, true))
+ atomic_dec(&mm->uprobes_count);
+ up_read(&mm->mmap_sem);
+ put_uprobe(u);
+ } else {
+ /*TODO Return SIGTRAP signal */
+ }
+ if (utask) {
+ utask->active_uprobe = NULL;
+ utask->state = UTASK_RUNNING;
+ }
+ set_instruction_pointer(regs, probept);
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context.
+ * It records that a breakpoint was hit and sets the TIF_UPROBE flag.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+ struct uprobe_task *utask;
+
+ if (!current->mm || !atomic_read(¤t->mm->uprobes_count))
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ if (utask)
+ utask->state = UTASK_BP_HIT;
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context.
+ * It completes the single step operation.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+ struct uprobe *uprobe;
+ struct uprobe_task *utask;
+
+ if (!current->mm || !current->utask || !current->utask->active_uprobe)
+ /* task is currently not uprobed */
+ return 0;
+
+ utask = current->utask;
+ uprobe = utask->active_uprobe;
+ if (!uprobe)
+ return 0;
+
+ set_thread_flag(TIF_UPROBE);
+ return 1;
+}
Provides a uprobes exception notifier for x86. This exception notifier
gets called in interrupt context and routes int3 and single-step
exceptions to uprobes when a probed process encounters them.
Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/x86/include/asm/uprobes.h | 3 +++
arch/x86/kernel/signal.c | 14 ++++++++++++++
arch/x86/kernel/uprobes.c | 29 +++++++++++++++++++++++++++++
3 files changed, 46 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 3a7833c..a5d9480 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -22,6 +22,7 @@
* Srikar Dronamraju
* Jim Keniston
*/
+#include <linux/notifier.h>
typedef u8 uprobe_opcode_t;
#define MAX_UINSN_BYTES 16
@@ -47,4 +48,6 @@ extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int uprobe_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data);
#endif /* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 40a2493..55db9f5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/personality.h>
#include <linux/uaccess.h>
#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -844,6 +845,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs);
+ if (thread_info_flags & _TIF_UPROBE) {
+ clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+ /*
+ * On x86_32, do_notify_resume() gets called with
+ * interrupts disabled. Hence enable interrupts if they
+ * are still disabled.
+ */
+ local_irq_enable();
+#endif
+ uprobe_notify_resume(regs);
+ }
+
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 8d90ff3..a323631 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -560,3 +560,32 @@ int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
result = adjust_ret_addr(regs->sp, correction);
return result;
}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobe_exception_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+ int ret = NOTIFY_DONE;
+
+ /* We are only interested in userspace traps */
+ if (regs && !user_mode_vm(regs))
+ return NOTIFY_DONE;
+
+ switch (val) {
+ case DIE_INT3:
+ /* Run your handler here */
+ if (uprobe_bkpt_notifier(regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_DEBUG:
+ if (uprobe_post_notifier(regs))
+ ret = NOTIFY_STOP;
+ default:
+ break;
+ }
+ return ret;
+}
Uprobes needs to be notified of int3 and single-step exceptions.
Hence uprobes registers a die notifier so that it is notified of these
events.
Signed-off-by: Ananth N Mavinakayanahalli <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/uprobes.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 1e88d64..95c16dd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -36,6 +36,7 @@
#include <linux/mman.h> /* needed for PROT_EXEC, MAP_PRIVATE */
#include <linux/file.h> /* needed for fput() */
#include <linux/init_task.h> /* init_cred */
+#include <linux/kdebug.h> /* for notifier mechanism */
#define UINSNS_PER_PAGE (PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
@@ -1456,3 +1457,20 @@ int uprobe_post_notifier(struct pt_regs *regs)
set_thread_flag(TIF_UPROBE);
return 1;
}
+
+struct notifier_block uprobe_exception_nb = {
+ .notifier_call = uprobe_exception_notify,
+ .priority = INT_MAX - 1, /* notified after kprobes, kgdb */
+};
+
+static int __init init_uprobes(void)
+{
+ return register_die_notifier(&uprobe_exception_nb);
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);
Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into common code in kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
kernel/trace/Kconfig | 4
kernel/trace/Makefile | 1
kernel/trace/trace_kprobe.c | 860 +------------------------------------------
kernel/trace/trace_probe.c | 746 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_probe.h | 158 ++++++++
5 files changed, 928 insertions(+), 841 deletions(-)
create mode 100644 kernel/trace/trace_probe.c
create mode 100644 kernel/trace/trace_probe.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 2ad39e5..f6bc439 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
depends on HAVE_REGS_AND_STACK_ACCESS_API
bool "Enable kprobes-based dynamic events"
select TRACING
+ select PROBE_EVENTS
default y
help
This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.
+config PROBE_EVENTS
+ def_bool n
+
config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..692223a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_TRACEPOINTS) += power-traces.o
ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
+obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index f925c45..704c04d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,524 +19,15 @@
#include <linux/module.h>
#include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
- "common_type",
- "common_flags",
- "common_preempt_count",
- "common_pid",
- "common_tgid",
- FIELD_STRING_IP,
- FIELD_STRING_RETIP,
- FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
- void *);
-#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
-#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
-
-/* Printing in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
- const char *name, \
- void *data, void *ent)\
-{ \
- return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-} \
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs) \
- (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl) ((u32)(dl) >> 16)
-#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
- return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
- return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- * data_rloc stores the offset from data_rloc itself, but data_loc
- * stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
- const char *name,
- void *data, void *ent)
-{
- int len = *(u32 *)data >> 16;
-
- if (!len)
- return trace_seq_printf(s, " %s=(fault)", name);
- else
- return trace_seq_printf(s, " %s=\"%s\"", name,
- (const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
- fetch_func_t fn;
- void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
- struct pt_regs *regs, void *dest)
-{
- return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8) \
-DEFINE_FETCH_##method(u16) \
-DEFINE_FETCH_##method(u32) \
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn) \
- (((FETCH_FUNC_NAME(method, u8) == fn) || \
- (FETCH_FUNC_NAME(method, u16) == fn) || \
- (FETCH_FUNC_NAME(method, u32) == fn) || \
- (FETCH_FUNC_NAME(method, u64) == fn) || \
- (FETCH_FUNC_NAME(method, string) == fn) || \
- (FETCH_FUNC_NAME(method, string_size) == fn)) \
- && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type) \
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_register(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type) \
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
- void *offset, void *dest) \
-{ \
- *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
- (unsigned int)((unsigned long)offset)); \
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type) \
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
- void *dummy, void *dest) \
-{ \
- *(type *)dest = (type)regs_return_value(regs); \
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type) \
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
- void *addr, void *dest) \
-{ \
- type retval; \
- if (probe_kernel_address(addr, retval)) \
- *(type *)dest = 0; \
- else \
- *(type *)dest = retval; \
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- long ret;
- int maxlen = get_rloc_len(*(u32 *)dest);
- u8 *dst = get_rloc_data(dest);
- u8 *src = addr;
- mm_segment_t old_fs = get_fs();
- if (!maxlen)
- return;
- /*
- * Try to get string again, since the string can be changed while
- * probing.
- */
- set_fs(KERNEL_DS);
- pagefault_disable();
- do
- ret = __copy_from_user_inatomic(dst++, src++, 1);
- while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
- dst[-1] = '\0';
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) { /* Failed to fetch string */
- ((u8 *)get_rloc_data(dest))[0] = '\0';
- *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
- } else
- *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
- get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
- void *addr, void *dest)
-{
- int ret, len = 0;
- u8 c;
- mm_segment_t old_fs = get_fs();
-
- set_fs(KERNEL_DS);
- pagefault_disable();
- do {
- ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
- len++;
- } while (c && ret == 0 && len < MAX_STRING_SIZE);
- pagefault_enable();
- set_fs(old_fs);
-
- if (ret < 0) /* Failed to check the length */
- *(u32 *)dest = 0;
- else
- *(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
- char *symbol;
- long offset;
- unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
- sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
- if (sc->addr)
- sc->addr += sc->offset;
- return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
- kfree(sc->symbol);
- kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
- struct symbol_cache *sc;
-
- if (!sym || strlen(sym) == 0)
- return NULL;
- sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
- if (!sc)
- return NULL;
-
- sc->symbol = kstrdup(sym, GFP_KERNEL);
- if (!sc->symbol) {
- kfree(sc);
- return NULL;
- }
- sc->offset = offset;
- update_symbol_cache(sc);
- return sc;
-}
-
-#define DEFINE_FETCH_symbol(type) \
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct symbol_cache *sc = data; \
- if (sc->addr) \
- fetch_memory_##type(regs, (void *)sc->addr, dest); \
- else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
- struct fetch_param orig;
- long offset;
-};
-
-#define DEFINE_FETCH_deref(type) \
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct deref_fetch_param *dprm = data; \
- unsigned long addr; \
- call_fetch(&dprm->orig, regs, &addr); \
- if (addr) { \
- addr += dprm->offset; \
- fetch_memory_##type(regs, (void *)addr, dest); \
- } else \
- *(type *)dest = 0; \
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
- struct fetch_param orig;
- unsigned char hi_shift;
- unsigned char low_shift;
-};
-
-#define DEFINE_FETCH_bitfield(type) \
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
- void *data, void *dest) \
-{ \
- struct bitfield_fetch_param *bprm = data; \
- type buf = 0; \
- call_fetch(&bprm->orig, regs, &buf); \
- if (buf) { \
- buf <<= bprm->hi_shift; \
- buf >>= bprm->low_shift; \
- } \
- *(type *)dest = buf; \
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
- /*
- * Don't check the bitfield itself, because this must be the
- * last fetch function.
- */
- if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
- free_deref_fetch_param(data->orig.data);
- else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
- free_symbol_cache(data->orig.data);
- kfree(data);
-}
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
- FETCH_MTD_reg = 0,
- FETCH_MTD_stack,
- FETCH_MTD_retval,
- FETCH_MTD_memory,
- FETCH_MTD_symbol,
- FETCH_MTD_deref,
- FETCH_MTD_bitfield,
- FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type) \
- [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
- {.name = _name, \
- .size = _size, \
- .is_signed = sign, \
- .print = PRINT_TYPE_FUNC_NAME(ptype), \
- .fmt = PRINT_TYPE_FMT_NAME(ptype), \
- .fmttype = _fmttype, \
- .fetch = { \
-ASSIGN_FETCH_FUNC(reg, ftype), \
-ASSIGN_FETCH_FUNC(stack, ftype), \
-ASSIGN_FETCH_FUNC(retval, ftype), \
-ASSIGN_FETCH_FUNC(memory, ftype), \
-ASSIGN_FETCH_FUNC(symbol, ftype), \
-ASSIGN_FETCH_FUNC(deref, ftype), \
-ASSIGN_FETCH_FUNC(bitfield, ftype), \
- } \
- }
+#include "trace_probe.h"
-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
- __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
- const char *name; /* Name of type */
- size_t size; /* Byte size of type */
- int is_signed; /* Signed flag */
- print_type_func_t print; /* Print functions */
- const char *fmt; /* Fromat string */
- const char *fmttype; /* Name in format file */
- /* Fetch functions */
- fetch_func_t fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
- /* Special types */
- [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
- sizeof(u32), 1, "__data_loc char[]"),
- [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
- string_size, sizeof(u32), 0, "u32"),
- /* Basic types */
- ASSIGN_FETCH_TYPE(u8, u8, 0),
- ASSIGN_FETCH_TYPE(u16, u16, 0),
- ASSIGN_FETCH_TYPE(u32, u32, 0),
- ASSIGN_FETCH_TYPE(u64, u64, 0),
- ASSIGN_FETCH_TYPE(s8, u8, 1),
- ASSIGN_FETCH_TYPE(s16, u16, 1),
- ASSIGN_FETCH_TYPE(s32, u32, 1),
- ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
- int i;
-
- if (!type)
- type = DEFAULT_FETCH_TYPE_STR;
-
- /* Special case: bitfield */
- if (*type == 'b') {
- unsigned long bs;
- type = strchr(type, '/');
- if (!type)
- goto fail;
- type++;
- if (strict_strtoul(type, 0, &bs))
- goto fail;
- switch (bs) {
- case 8:
- return find_fetch_type("u8");
- case 16:
- return find_fetch_type("u16");
- case 32:
- return find_fetch_type("u32");
- case 64:
- return find_fetch_type("u64");
- default:
- goto fail;
- }
- }
-
- for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
- if (strcmp(type, fetch_type_table[i].name) == 0)
- return &fetch_type_table[i];
-fail:
- return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
- void *dummy, void *dest)
-{
- *(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
- fetch_func_t orig_fn)
-{
- int i;
-
- if (type != &fetch_type_table[FETCH_TYPE_STRING])
- return NULL; /* Only string type needs size function */
- for (i = 0; i < FETCH_MTD_END; i++)
- if (type->fetch[i] == orig_fn)
- return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
- WARN_ON(1); /* This should not happen */
- return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"
/**
* Kprobe event core functions
*/
-struct probe_arg {
- struct fetch_param fetch;
- struct fetch_param fetch_size;
- unsigned int offset; /* Offset from argument entry */
- const char *name; /* Name of this argument */
- const char *comm; /* Command of this argument */
- const struct fetch_type *type; /* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE 1
-#define TP_FLAG_PROFILE 2
-
struct trace_probe {
struct list_head list;
struct kretprobe rp; /* Use rp.kp for kprobe use */
@@ -575,18 +66,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
static int kretprobe_dispatcher(struct kretprobe_instance *ri,
struct pt_regs *regs);
-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
- if (!isalpha(*name) && *name != '_')
- return 0;
- while (*++name != '\0') {
- if (!isalpha(*name) && !isdigit(*name) && *name != '_')
- return 0;
- }
- return 1;
-}
-
/*
* Allocate new trace_probe and initialize it (including kprobes).
*/
@@ -595,7 +74,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
void *addr,
const char *symbol,
unsigned long offs,
- int nargs, int is_return)
+ int nargs, bool is_return)
{
struct trace_probe *tp;
int ret = -ENOMEM;
@@ -646,24 +125,12 @@ error:
return ERR_PTR(ret);
}
-static void free_probe_arg(struct probe_arg *arg)
-{
- if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
- free_bitfield_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
- free_deref_fetch_param(arg->fetch.data);
- else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
- free_symbol_cache(arg->fetch.data);
- kfree(arg->name);
- kfree(arg->comm);
-}
-
static void free_trace_probe(struct trace_probe *tp)
{
int i;
for (i = 0; i < tp->nr_args; i++)
- free_probe_arg(&tp->args[i]);
+ traceprobe_free_probe_arg(&tp->args[i]);
kfree(tp->call.class->system);
kfree(tp->call.name);
@@ -737,227 +204,6 @@ end:
return ret;
}
-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
- char *tmp;
- int ret;
-
- if (!offset)
- return -EINVAL;
-
- tmp = strchr(symbol, '+');
- if (tmp) {
- /* skip sign because strict_strtol doesn't accept '+' */
- ret = strict_strtoul(tmp + 1, 0, offset);
- if (ret)
- return ret;
- *tmp = '\0';
- } else
- *offset = 0;
- return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
-
- if (strcmp(arg, "retval") == 0) {
- if (is_return)
- f->fn = t->fetch[FETCH_MTD_retval];
- else
- ret = -EINVAL;
- } else if (strncmp(arg, "stack", 5) == 0) {
- if (arg[5] == '\0') {
- if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
- f->fn = fetch_stack_address;
- else
- ret = -EINVAL;
- } else if (isdigit(arg[5])) {
- ret = strict_strtoul(arg + 5, 10, &param);
- if (ret || param > PARAM_MAX_STACK)
- ret = -EINVAL;
- else {
- f->fn = t->fetch[FETCH_MTD_stack];
- f->data = (void *)param;
- }
- } else
- ret = -EINVAL;
- } else
- ret = -EINVAL;
- return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, int is_return)
-{
- int ret = 0;
- unsigned long param;
- long offset;
- char *tmp;
-
- switch (arg[0]) {
- case '$':
- ret = parse_probe_vars(arg + 1, t, f, is_return);
- break;
- case '%': /* named register */
- ret = regs_query_register_offset(arg + 1);
- if (ret >= 0) {
- f->fn = t->fetch[FETCH_MTD_reg];
- f->data = (void *)(unsigned long)ret;
- ret = 0;
- }
- break;
- case '@': /* memory or symbol */
- if (isdigit(arg[1])) {
- ret = strict_strtoul(arg + 1, 0, &param);
- if (ret)
- break;
- f->fn = t->fetch[FETCH_MTD_memory];
- f->data = (void *)param;
- } else {
- ret = split_symbol_offset(arg + 1, &offset);
- if (ret)
- break;
- f->data = alloc_symbol_cache(arg + 1, offset);
- if (f->data)
- f->fn = t->fetch[FETCH_MTD_symbol];
- }
- break;
- case '+': /* deref memory */
- arg++; /* Skip '+', because strict_strtol() rejects it. */
- case '-':
- tmp = strchr(arg, '(');
- if (!tmp)
- break;
- *tmp = '\0';
- ret = strict_strtol(arg, 0, &offset);
- if (ret)
- break;
- arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (tmp) {
- struct deref_fetch_param *dprm;
- const struct fetch_type *t2 = find_fetch_type(NULL);
- *tmp = '\0';
- dprm = kzalloc(sizeof(struct deref_fetch_param),
- GFP_KERNEL);
- if (!dprm)
- return -ENOMEM;
- dprm->offset = offset;
- ret = __parse_probe_arg(arg, t2, &dprm->orig,
- is_return);
- if (ret)
- kfree(dprm);
- else {
- f->fn = t->fetch[FETCH_MTD_deref];
- f->data = (void *)dprm;
- }
- }
- break;
- }
- if (!ret && !f->fn) { /* Parsed, but do not find fetch method */
- pr_info("%s type has no corresponding fetch method.\n",
- t->name);
- ret = -EINVAL;
- }
- return ret;
-}
-
-#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
- const struct fetch_type *t,
- struct fetch_param *f)
-{
- struct bitfield_fetch_param *bprm;
- unsigned long bw, bo;
- char *tail;
-
- if (*bf != 'b')
- return 0;
-
- bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
- if (!bprm)
- return -ENOMEM;
- bprm->orig = *f;
- f->fn = t->fetch[FETCH_MTD_bitfield];
- f->data = (void *)bprm;
-
- bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
- if (bw == 0 || *tail != '@')
- return -EINVAL;
-
- bf = tail + 1;
- bo = simple_strtoul(bf, &tail, 0);
- if (tail == bf || *tail != '/')
- return -EINVAL;
-
- bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
- bprm->low_shift = bprm->hi_shift + bo;
- return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
- struct probe_arg *parg, int is_return)
-{
- const char *t;
- int ret;
-
- if (strlen(arg) > MAX_ARGSTR_LEN) {
- pr_info("Argument is too long.: %s\n", arg);
- return -ENOSPC;
- }
- parg->comm = kstrdup(arg, GFP_KERNEL);
- if (!parg->comm) {
- pr_info("Failed to allocate memory for command '%s'.\n", arg);
- return -ENOMEM;
- }
- t = strchr(parg->comm, ':');
- if (t) {
- arg[t - parg->comm] = '\0';
- t++;
- }
- parg->type = find_fetch_type(t);
- if (!parg->type) {
- pr_info("Unsupported type: %s\n", t);
- return -EINVAL;
- }
- parg->offset = tp->size;
- tp->size += parg->type->size;
- ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
- if (ret >= 0 && t != NULL)
- ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
- if (ret >= 0) {
- parg->fetch_size.fn = get_fetch_size_function(parg->type,
- parg->fetch.fn);
- parg->fetch_size.data = parg->fetch.data;
- }
- return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
- struct probe_arg *args, int narg)
-{
- int i;
- for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
- if (strcmp(reserved_field_names[i], name) == 0)
- return 1;
- for (i = 0; i < narg; i++)
- if (strcmp(args[i].name, name) == 0)
- return 1;
- return 0;
-}
-
static int create_trace_probe(int argc, char **argv)
{
/*
@@ -980,7 +226,7 @@ static int create_trace_probe(int argc, char **argv)
*/
struct trace_probe *tp;
int i, ret = 0;
- int is_return = 0, is_delete = 0;
+ bool is_return = false, is_delete = false;
char *symbol = NULL, *event = NULL, *group = NULL;
char *arg;
unsigned long offset = 0;
@@ -989,11 +235,11 @@ static int create_trace_probe(int argc, char **argv)
/* argc must be >= 1 */
if (argv[0][0] == 'p')
- is_return = 0;
+ is_return = false;
else if (argv[0][0] == 'r')
- is_return = 1;
+ is_return = true;
else if (argv[0][0] == '-')
- is_delete = 1;
+ is_delete = true;
else {
pr_info("Probe definition must be started with 'p', 'r' or"
" '-'.\n");
@@ -1057,7 +303,7 @@ static int create_trace_probe(int argc, char **argv)
/* a symbol specified */
symbol = argv[1];
/* TODO: support .init module functions */
- ret = split_symbol_offset(symbol, &offset);
+ ret = traceprobe_split_symbol_offset(symbol, &offset);
if (ret) {
pr_info("Failed to parse symbol.\n");
return ret;
@@ -1119,7 +365,8 @@ static int create_trace_probe(int argc, char **argv)
goto error;
}
- if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
pr_info("Argument[%d] name '%s' conflicts with "
"another field.\n", i, argv[i]);
ret = -EINVAL;
@@ -1127,7 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}
/* Parse fetch argument */
- ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ is_return);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
@@ -1214,70 +462,11 @@ static int probes_open(struct inode *inode, struct file *file)
return seq_open(file, &probes_seq_op);
}
-static int command_trace_probe(const char *buf)
-{
- char **argv;
- int argc = 0, ret = 0;
-
- argv = argv_split(GFP_KERNEL, buf, &argc);
- if (!argv)
- return -ENOMEM;
-
- if (argc)
- ret = create_trace_probe(argc, argv);
-
- argv_free(argv);
- return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
static ssize_t probes_write(struct file *file, const char __user *buffer,
size_t count, loff_t *ppos)
{
- char *kbuf, *tmp;
- int ret;
- size_t done;
- size_t size;
-
- kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
- if (!kbuf)
- return -ENOMEM;
-
- ret = done = 0;
- while (done < count) {
- size = count - done;
- if (size >= WRITE_BUFSIZE)
- size = WRITE_BUFSIZE - 1;
- if (copy_from_user(kbuf, buffer + done, size)) {
- ret = -EFAULT;
- goto out;
- }
- kbuf[size] = '\0';
- tmp = strchr(kbuf, '\n');
- if (tmp) {
- *tmp = '\0';
- size = tmp - kbuf + 1;
- } else if (done + size < count) {
- pr_warning("Line length is too long: "
- "Should be less than %d.", WRITE_BUFSIZE);
- ret = -EINVAL;
- goto out;
- }
- done += size;
- /* Remove comments */
- tmp = strchr(kbuf, '#');
- if (tmp)
- *tmp = '\0';
-
- ret = command_trace_probe(kbuf);
- if (ret)
- goto out;
- }
- ret = done;
-out:
- kfree(kbuf);
- return ret;
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_probe);
}
static const struct file_operations kprobe_events_ops = {
@@ -1535,17 +724,6 @@ static void probe_event_disable(struct ftrace_event_call *call)
}
}
-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed) \
- do { \
- ret = trace_define_field(event_call, #type, name, \
- offsetof(typeof(field), item), \
- sizeof(field.item), is_signed, \
- FILTER_OTHER); \
- if (ret) \
- return ret; \
- } while (0)
-
static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
{
int ret, i;
@@ -1886,7 +1064,7 @@ static __init int kprobe_trace_self_tests_init(void)
pr_info("Testing kprobe tracing: ");
- ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
+ ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
"$stack $stack0 +0($stack)");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function entry.\n");
@@ -1901,7 +1079,7 @@ static __init int kprobe_trace_self_tests_init(void)
probe_event_enable(&tp->call);
}
- ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
+ ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
"$retval");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on probing function return.\n");
@@ -1921,13 +1099,13 @@ static __init int kprobe_trace_self_tests_init(void)
ret = target(1, 2, 3, 4, 5, 6);
- ret = command_trace_probe("-:testprobe");
+ ret = traceprobe_command("-:testprobe");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
}
- ret = command_trace_probe("-:testprobe2");
+ ret = traceprobe_command("-:testprobe2");
if (WARN_ON_ONCE(ret)) {
pr_warning("error on deleting a probe.\n");
warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..d3828b2
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,746 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+ "common_type",
+ "common_flags",
+ "common_preempt_count",
+ "common_pid",
+ "common_tgid",
+ FIELD_STRING_IP,
+ FIELD_STRING_RETIP,
+ FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type) print_type_##type
+#define PRINT_TYPE_FMT_NAME(type) print_type_format_##type
+
+/* Printing in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast) \
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s, \
+ const char *name, \
+ void *data, void *ent)\
+{ \
+ return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+} \
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+ return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+ return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+ const char *name,
+ void *data, void *ent)
+{
+ int len = *(u32 *)data >> 16;
+
+ if (!len)
+ return trace_seq_printf(s, " %s=(fault)", name);
+ else
+ return trace_seq_printf(s, " %s=\"%s\"", name,
+ (const char *)get_loc_data(data, ent));
+}
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type) fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8) \
+DEFINE_FETCH_##method(u16) \
+DEFINE_FETCH_##method(u32) \
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn) \
+ (((FETCH_FUNC_NAME(method, u8) == fn) || \
+ (FETCH_FUNC_NAME(method, u16) == fn) || \
+ (FETCH_FUNC_NAME(method, u32) == fn) || \
+ (FETCH_FUNC_NAME(method, u64) == fn) || \
+ (FETCH_FUNC_NAME(method, string) == fn) || \
+ (FETCH_FUNC_NAME(method, string_size) == fn)) \
+ && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type) \
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs, \
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_register(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type) \
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+ void *offset, void *dest) \
+{ \
+ *(type *)dest = (type)regs_get_kernel_stack_nth(regs, \
+ (unsigned int)((unsigned long)offset)); \
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type) \
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+ void *dummy, void *dest) \
+{ \
+ *(type *)dest = (type)regs_return_value(regs); \
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type) \
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+ void *addr, void *dest) \
+{ \
+ type retval; \
+ if (probe_kernel_address(addr, retval)) \
+ *(type *)dest = 0; \
+ else \
+ *(type *)dest = retval; \
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ long ret;
+ int maxlen = get_rloc_len(*(u32 *)dest);
+ u8 *dst = get_rloc_data(dest);
+ u8 *src = addr;
+ mm_segment_t old_fs = get_fs();
+ if (!maxlen)
+ return;
+ /*
+ * Try to get string again, since the string can be changed while
+ * probing.
+ */
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do
+ ret = __copy_from_user_inatomic(dst++, src++, 1);
+ while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+ dst[-1] = '\0';
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) { /* Failed to fetch string */
+ ((u8 *)get_rloc_data(dest))[0] = '\0';
+ *(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+ } else
+ *(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+ get_rloc_offs(*(u32 *)dest));
+}
+/* Return the length of string -- including null terminal byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+ void *addr, void *dest)
+{
+ int ret, len = 0;
+ u8 c;
+ mm_segment_t old_fs = get_fs();
+
+ set_fs(KERNEL_DS);
+ pagefault_disable();
+ do {
+ ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+ len++;
+ } while (c && ret == 0 && len < MAX_STRING_SIZE);
+ pagefault_enable();
+ set_fs(old_fs);
+
+ if (ret < 0) /* Failed to check the length */
+ *(u32 *)dest = 0;
+ else
+ *(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+ char *symbol;
+ long offset;
+ unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+ sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+ if (sc->addr)
+ sc->addr += sc->offset;
+ return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+ kfree(sc->symbol);
+ kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+ struct symbol_cache *sc;
+
+ if (!sym || strlen(sym) == 0)
+ return NULL;
+ sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+ if (!sc)
+ return NULL;
+
+ sc->symbol = kstrdup(sym, GFP_KERNEL);
+ if (!sc->symbol) {
+ kfree(sc);
+ return NULL;
+ }
+ sc->offset = offset;
+
+ update_symbol_cache(sc);
+ return sc;
+}
+
+#define DEFINE_FETCH_symbol(type) \
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct symbol_cache *sc = data; \
+ if (sc->addr) \
+ fetch_memory_##type(regs, (void *)sc->addr, dest); \
+ else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+ struct fetch_param orig;
+ long offset;
+};
+
+#define DEFINE_FETCH_deref(type) \
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct deref_fetch_param *dprm = data; \
+ unsigned long addr; \
+ call_fetch(&dprm->orig, regs, &addr); \
+ if (addr) { \
+ addr += dprm->offset; \
+ fetch_memory_##type(regs, (void *)addr, dest); \
+ } else \
+ *(type *)dest = 0; \
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+ struct fetch_param orig;
+ unsigned char hi_shift;
+ unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type) \
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+ void *data, void *dest) \
+{ \
+ struct bitfield_fetch_param *bprm = data; \
+ type buf = 0; \
+ call_fetch(&bprm->orig, regs, &buf); \
+ if (buf) { \
+ buf <<= bprm->hi_shift; \
+ buf >>= bprm->low_shift; \
+ } \
+ *(type *)dest = buf; \
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+ /*
+ * Don't check the bitfield itself, because this must be the
+ * last fetch function.
+ */
+ if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+ free_deref_fetch_param(data->orig.data);
+ else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+ free_symbol_cache(data->orig.data);
+ kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
+#define ASSIGN_FETCH_FUNC(method, type) \
+ [FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype) \
+ {.name = _name, \
+ .size = _size, \
+ .is_signed = sign, \
+ .print = PRINT_TYPE_FUNC_NAME(ptype), \
+ .fmt = PRINT_TYPE_FMT_NAME(ptype), \
+ .fmttype = _fmttype, \
+ .fetch = { \
+ASSIGN_FETCH_FUNC(reg, ftype), \
+ASSIGN_FETCH_FUNC(stack, ftype), \
+ASSIGN_FETCH_FUNC(retval, ftype), \
+ASSIGN_FETCH_FUNC(memory, ftype), \
+ASSIGN_FETCH_FUNC(symbol, ftype), \
+ASSIGN_FETCH_FUNC(deref, ftype), \
+ASSIGN_FETCH_FUNC(bitfield, ftype), \
+ } \
+ }
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign) \
+ __ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+ /* Special types */
+ [FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+ sizeof(u32), 1, "__data_loc char[]"),
+ [FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+ string_size, sizeof(u32), 0, "u32"),
+ /* Basic types */
+ ASSIGN_FETCH_TYPE(u8, u8, 0),
+ ASSIGN_FETCH_TYPE(u16, u16, 0),
+ ASSIGN_FETCH_TYPE(u32, u32, 0),
+ ASSIGN_FETCH_TYPE(u64, u64, 0),
+ ASSIGN_FETCH_TYPE(s8, u8, 1),
+ ASSIGN_FETCH_TYPE(s16, u16, 1),
+ ASSIGN_FETCH_TYPE(s32, u32, 1),
+ ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+ int i;
+
+ if (!type)
+ type = DEFAULT_FETCH_TYPE_STR;
+
+ /* Special case: bitfield */
+ if (*type == 'b') {
+ unsigned long bs;
+ type = strchr(type, '/');
+ if (!type)
+ goto fail;
+ type++;
+ if (strict_strtoul(type, 0, &bs))
+ goto fail;
+ switch (bs) {
+ case 8:
+ return find_fetch_type("u8");
+ case 16:
+ return find_fetch_type("u16");
+ case 32:
+ return find_fetch_type("u32");
+ case 64:
+ return find_fetch_type("u64");
+ default:
+ goto fail;
+ }
+ }
+
+ for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+ if (strcmp(type, fetch_type_table[i].name) == 0)
+ return &fetch_type_table[i];
+fail:
+ return NULL;
+}
+
+/* Special function : only accept unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+ void *dummy, void *dest)
+{
+ *(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+ fetch_func_t orig_fn)
+{
+ int i;
+
+ if (type != &fetch_type_table[FETCH_TYPE_STRING])
+ return NULL; /* Only string type needs size function */
+ for (i = 0; i < FETCH_MTD_END; i++)
+ if (type->fetch[i] == orig_fn)
+ return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+ WARN_ON(1); /* This should not happen */
+ return NULL;
+}
+
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+ char *tmp;
+ int ret;
+
+ if (!offset)
+ return -EINVAL;
+
+ tmp = strchr(symbol, '+');
+ if (tmp) {
+ /* skip sign because strict_strtol doesn't accept '+' */
+ ret = strict_strtoul(tmp + 1, 0, offset);
+ if (ret)
+ return ret;
+ *tmp = '\0';
+ } else
+ *offset = 0;
+ return 0;
+}
+
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+
+ if (strcmp(arg, "retval") == 0) {
+ if (is_return)
+ f->fn = t->fetch[FETCH_MTD_retval];
+ else
+ ret = -EINVAL;
+ } else if (strncmp(arg, "stack", 5) == 0) {
+ if (arg[5] == '\0') {
+ if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+ f->fn = fetch_stack_address;
+ else
+ ret = -EINVAL;
+ } else if (isdigit(arg[5])) {
+ ret = strict_strtoul(arg + 5, 10, &param);
+ if (ret || param > PARAM_MAX_STACK)
+ ret = -EINVAL;
+ else {
+ f->fn = t->fetch[FETCH_MTD_stack];
+ f->data = (void *)param;
+ }
+ } else
+ ret = -EINVAL;
+ } else
+ ret = -EINVAL;
+ return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+ struct fetch_param *f, bool is_return)
+{
+ int ret = 0;
+ unsigned long param;
+ long offset;
+ char *tmp;
+
+ switch (arg[0]) {
+ case '$':
+ ret = parse_probe_vars(arg + 1, t, f, is_return);
+ break;
+ case '%': /* named register */
+ ret = regs_query_register_offset(arg + 1);
+ if (ret >= 0) {
+ f->fn = t->fetch[FETCH_MTD_reg];
+ f->data = (void *)(unsigned long)ret;
+ ret = 0;
+ }
+ break;
+ case '@': /* memory or symbol */
+ if (isdigit(arg[1])) {
+ ret = strict_strtoul(arg + 1, 0, &param);
+ if (ret)
+ break;
+ f->fn = t->fetch[FETCH_MTD_memory];
+ f->data = (void *)param;
+ } else {
+ ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+ if (ret)
+ break;
+ f->data = alloc_symbol_cache(arg + 1, offset);
+ if (f->data)
+ f->fn = t->fetch[FETCH_MTD_symbol];
+ }
+ break;
+ case '+': /* deref memory */
+ arg++; /* Skip '+', because strict_strtol() rejects it. */
+ case '-':
+ tmp = strchr(arg, '(');
+ if (!tmp)
+ break;
+ *tmp = '\0';
+ ret = strict_strtol(arg, 0, &offset);
+ if (ret)
+ break;
+ arg = tmp + 1;
+ tmp = strrchr(arg, ')');
+ if (tmp) {
+ struct deref_fetch_param *dprm;
+ const struct fetch_type *t2 = find_fetch_type(NULL);
+ *tmp = '\0';
+ dprm = kzalloc(sizeof(struct deref_fetch_param),
+ GFP_KERNEL);
+ if (!dprm)
+ return -ENOMEM;
+ dprm->offset = offset;
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ if (ret)
+ kfree(dprm);
+ else {
+ f->fn = t->fetch[FETCH_MTD_deref];
+ f->data = (void *)dprm;
+ }
+ }
+ break;
+ }
+ if (!ret && !f->fn) { /* Parsed, but do not find fetch method */
+ pr_info("%s type has no corresponding fetch method.\n",
+ t->name);
+ ret = -EINVAL;
+ }
+ return ret;
+}
+#define BYTES_TO_BITS(nb) ((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+ const struct fetch_type *t,
+ struct fetch_param *f)
+{
+ struct bitfield_fetch_param *bprm;
+ unsigned long bw, bo;
+ char *tail;
+
+ if (*bf != 'b')
+ return 0;
+
+ bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+ if (!bprm)
+ return -ENOMEM;
+ bprm->orig = *f;
+ f->fn = t->fetch[FETCH_MTD_bitfield];
+ f->data = (void *)bprm;
+
+ bw = simple_strtoul(bf + 1, &tail, 0); /* Use simple one */
+ if (bw == 0 || *tail != '@')
+ return -EINVAL;
+
+ bf = tail + 1;
+ bo = simple_strtoul(bf, &tail, 0);
+ if (tail == bf || *tail != '/')
+ return -EINVAL;
+
+ bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+ bprm->low_shift = bprm->hi_shift + bo;
+ return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return)
+{
+ const char *t;
+ int ret;
+
+ if (strlen(arg) > MAX_ARGSTR_LEN) {
+ pr_info("Argument is too long.: %s\n", arg);
+ return -ENOSPC;
+ }
+ parg->comm = kstrdup(arg, GFP_KERNEL);
+ if (!parg->comm) {
+ pr_info("Failed to allocate memory for command '%s'.\n", arg);
+ return -ENOMEM;
+ }
+ t = strchr(parg->comm, ':');
+ if (t) {
+ arg[t - parg->comm] = '\0';
+ t++;
+ }
+ parg->type = find_fetch_type(t);
+ if (!parg->type) {
+ pr_info("Unsupported type: %s\n", t);
+ return -EINVAL;
+ }
+ parg->offset = *size;
+ *size += parg->type->size;
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ if (ret >= 0 && t != NULL)
+ ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+ if (ret >= 0) {
+ parg->fetch_size.fn = get_fetch_size_function(parg->type,
+ parg->fetch.fn);
+ parg->fetch_size.data = parg->fetch.data;
+ }
+ return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg)
+{
+ int i;
+ for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+ if (strcmp(reserved_field_names[i], name) == 0)
+ return 1;
+ for (i = 0; i < narg; i++)
+ if (strcmp(args[i].name, name) == 0)
+ return 1;
+ return 0;
+}
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+ if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+ free_bitfield_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+ free_deref_fetch_param(arg->fetch.data);
+ else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+ free_symbol_cache(arg->fetch.data);
+ kfree(arg->name);
+ kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char**))
+{
+ char **argv;
+ int argc = 0, ret = 0;
+
+ argv = argv_split(GFP_KERNEL, buf, &argc);
+ if (!argv)
+ return -ENOMEM;
+
+ if (argc)
+ ret = createfn(argc, argv);
+
+ argv_free(argv);
+ return ret;
+}
+
+#define WRITE_BUFSIZE 4096
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos, int (*createfn)(int, char**))
+{
+ char *kbuf, *tmp;
+ int ret = 0;
+ size_t done = 0;
+ size_t size;
+
+ kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ while (done < count) {
+ size = count - done;
+ if (size >= WRITE_BUFSIZE)
+ size = WRITE_BUFSIZE - 1;
+ if (copy_from_user(kbuf, buffer + done, size)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ kbuf[size] = '\0';
+ tmp = strchr(kbuf, '\n');
+ if (tmp) {
+ *tmp = '\0';
+ size = tmp - kbuf + 1;
+ } else if (done + size < count) {
+ pr_warning("Line length is too long: "
+ "Should be less than %d.", WRITE_BUFSIZE);
+ ret = -EINVAL;
+ goto out;
+ }
+ done += size;
+ /* Remove comments */
+ tmp = strchr(kbuf, '#');
+ if (tmp)
+ *tmp = '\0';
+
+ ret = traceprobe_command(kbuf, createfn);
+ if (ret)
+ goto out;
+ }
+ ret = done;
+out:
+ kfree(kbuf);
+ return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..487514b
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,158 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <[email protected]>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed) \
+ do { \
+ ret = trace_define_field(event_call, #type, name, \
+ offsetof(typeof(field), item), \
+ sizeof(field.item), is_signed, \
+ FILTER_OTHER); \
+ if (ret) \
+ return ret; \
+ } while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE 1
+#define TP_FLAG_PROFILE 2
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs) \
+ (((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl) ((u32)(dl) >> 16)
+#define get_rloc_offs(dl) ((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ * data_rloc stores the offset from data_rloc itself, but data_loc
+ * stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs) ((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+ void *);
+
+/* Fetch types */
+enum {
+ FETCH_MTD_reg = 0,
+ FETCH_MTD_stack,
+ FETCH_MTD_retval,
+ FETCH_MTD_memory,
+ FETCH_MTD_symbol,
+ FETCH_MTD_deref,
+ FETCH_MTD_bitfield,
+ FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+ const char *name; /* Name of type */
+ size_t size; /* Byte size of type */
+ int is_signed; /* Signed flag */
+ print_type_func_t print; /* Print functions */
+ const char *fmt; /* Format string */
+ const char *fmttype; /* Name in format file */
+ /* Fetch functions */
+ fetch_func_t fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+ fetch_func_t fn;
+ void *data;
+};
+
+struct probe_arg {
+ struct fetch_param fetch;
+ struct fetch_param fetch_size;
+ unsigned int offset; /* Offset from argument entry */
+ const char *name; /* Name of this argument */
+ const char *comm; /* Command of this argument */
+ const struct fetch_type *type; /* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+ struct pt_regs *regs, void *dest)
+{
+ return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static int is_good_name(const char *name)
+{
+ if (!isalpha(*name) && *name != '_')
+ return 0;
+ while (*++name != '\0') {
+ if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+ return 0;
+ }
+ return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+ struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+ struct probe_arg *args, int narg);
+
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+ const char __user *buffer, size_t count, loff_t *ppos,
+ int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));
Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.
The following example shows how to dump the instruction pointer and %ax
register at the probed text address. Here we are trying to probe
zfree in /bin/zsh:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
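(For reference, the 0x46420 offset echoed into uprobe_events above is just the
zfree address reported by objdump minus the start of the executable r-xp
mapping shown in maps: 0x446420 - 0x400000 = 0x46420.)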
TODO: Connect a filter to a consumer.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
arch/Kconfig | 10 -
kernel/trace/Kconfig | 16 +
kernel/trace/Makefile | 1
kernel/trace/trace.h | 5
kernel/trace/trace_kprobe.c | 4
kernel/trace/trace_probe.c | 14 +
kernel/trace/trace_probe.h | 6
kernel/trace/trace_uprobe.c | 812 +++++++++++++++++++++++++++++++++++++++++++
8 files changed, 851 insertions(+), 17 deletions(-)
create mode 100644 kernel/trace/trace_uprobe.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 3d91687..9ba05e1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,16 +62,8 @@ config OPTPROBES
depends on !PREEMPT
config UPROBES
- bool "User-space probes (EXPERIMENTAL)"
- depends on ARCH_SUPPORTS_UPROBES
- depends on MMU
select MM_OWNER
- help
- Uprobes enables kernel subsystems to establish probepoints
- in user applications and execute handler functions when
- the probepoints are hit.
-
- If in doubt, say "N".
+ def_bool n
config HAVE_EFFICIENT_UNALIGNED_ACCESS
bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index f6bc439..84783ea 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
This option is also required by perf-probe subcommand of perf tools.
If you want to use perf tools, this option is strongly recommended.
+config UPROBE_EVENT
+ bool "Enable uprobes-based dynamic events"
+ depends on ARCH_SUPPORTS_UPROBES
+ depends on MMU
+ select UPROBES
+ select PROBE_EVENTS
+ select TRACING
+ default n
+ help
+ This allows the user to add tracing events on top of userspace dynamic
+ events (similar to tracepoints) on the fly via the traceevents interface.
+ Those events can be inserted wherever uprobes can probe, and record
+ various registers.
+ This option is required if you plan to use perf-probe subcommand of perf
+ tools on user space applications.
+
config PROBE_EVENTS
def_bool n
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 692223a..bb3d3ff 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -57,5 +57,6 @@ ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 229f859..19f4f23 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
unsigned long ret_ip;
};
+struct uprobe_trace_entry_head {
+ struct trace_entry ent;
+ unsigned long ip;
+};
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 704c04d..55fa2a6 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -374,8 +374,8 @@ static int create_trace_probe(int argc, char **argv)
}
/* Parse fetch argument */
- ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
- is_return);
+ ret = traceprobe_parse_probe_arg(arg, &tp->size,
+ &tp->args[i], is_return, true);
if (ret) {
pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index d3828b2..673ca69 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -507,13 +507,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
/* Recursive argument parser */
static int parse_probe_arg(char *arg, const struct fetch_type *t,
- struct fetch_param *f, bool is_return)
+ struct fetch_param *f, bool is_return, bool is_kprobe)
{
int ret = 0;
unsigned long param;
long offset;
char *tmp;
+ /* For now, uprobe_events supports only register arguments */
+ if (!is_kprobe && arg[0] != '%')
+ return -EINVAL;
+
switch (arg[0]) {
case '$':
ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -563,7 +567,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
if (!dprm)
return -ENOMEM;
dprm->offset = offset;
- ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+ ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+ is_kprobe);
if (ret)
kfree(dprm);
else {
@@ -617,7 +622,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
/* String length checking wrapper */
int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return)
+ struct probe_arg *parg, bool is_return, bool is_kprobe)
{
const char *t;
int ret;
@@ -643,7 +648,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
}
parg->offset = *size;
*size += parg->type->size;
- ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+ ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+ is_kprobe);
if (ret >= 0 && t != NULL)
ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 487514b..f625d8d 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -48,6 +48,7 @@
#define FIELD_STRING_IP "__probe_ip"
#define FIELD_STRING_RETIP "__probe_ret_ip"
#define FIELD_STRING_FUNC "__probe_func"
+#define FIELD_STRING_PID "__probe_pid"
#undef DEFINE_FIELD
#define DEFINE_FIELD(type, item, name, is_signed) \
@@ -64,6 +65,7 @@
/* Flags for trace_probe */
#define TP_FLAG_TRACE 1
#define TP_FLAG_PROFILE 2
+#define TP_FLAG_UPROBE 4
/* data_rloc: data relative location, compatible with u32 */
@@ -130,7 +132,7 @@ static inline __kprobes void call_fetch(struct fetch_param *fprm,
}
/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
+static inline int is_good_name(const char *name)
{
if (!isalpha(*name) && *name != '_')
return 0;
@@ -142,7 +144,7 @@ static int is_good_name(const char *name)
}
extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
- struct probe_arg *parg, bool is_return);
+ struct probe_arg *parg, bool is_return, bool is_kprobe);
extern int traceprobe_conflict_field_name(const char *name,
struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..122f1b6
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,812 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author: Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/**
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+ struct uprobe_consumer cons;
+ struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+ struct list_head list;
+ struct ftrace_event_class class;
+ struct ftrace_event_call call;
+ struct uprobe_trace_consumer *consumer;
+ struct inode *inode;
+ char *filename;
+ unsigned long offset;
+ unsigned long nhit;
+ unsigned int flags; /* For TP_FLAG_* */
+ ssize_t size; /* trace entry size */
+ unsigned int nr_args;
+ struct probe_arg args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n) \
+ (offsetof(struct trace_uprobe, args) + \
+ (sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+ const char *event, int nargs)
+{
+ struct trace_uprobe *tp;
+
+ if (!event || !is_good_name(event))
+ return ERR_PTR(-EINVAL);
+
+ if (!group || !is_good_name(group))
+ return ERR_PTR(-EINVAL);
+
+ tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+ if (!tp)
+ return ERR_PTR(-ENOMEM);
+
+ tp->call.class = &tp->class;
+ tp->call.name = kstrdup(event, GFP_KERNEL);
+ if (!tp->call.name)
+ goto error;
+
+ tp->class.system = kstrdup(group, GFP_KERNEL);
+ if (!tp->class.system)
+ goto error;
+
+ INIT_LIST_HEAD(&tp->list);
+ return tp;
+error:
+ kfree(tp->call.name);
+ kfree(tp);
+ return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+ int i;
+
+ for (i = 0; i < tp->nr_args; i++)
+ traceprobe_free_probe_arg(&tp->args[i]);
+
+ iput(tp->inode);
+ kfree(tp->call.class->system);
+ kfree(tp->call.name);
+ kfree(tp->filename);
+ kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+ const char *group)
+{
+ struct trace_uprobe *tp;
+
+ list_for_each_entry(tp, &uprobe_list, list)
+ if (strcmp(tp->call.name, event) == 0 &&
+ strcmp(tp->call.class->system, group) == 0)
+ return tp;
+ return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+ list_del(&tp->list);
+ unregister_uprobe_event(tp);
+ free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+ struct trace_uprobe *old_tp;
+ int ret;
+
+ mutex_lock(&uprobe_lock);
+
+ /* register as an event */
+ old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+ if (old_tp)
+ /* delete old event */
+ unregister_trace_uprobe(old_tp);
+
+ ret = register_uprobe_event(tp);
+ if (ret) {
+ pr_warning("Failed to register probe event(%d)\n", ret);
+ goto end;
+ }
+
+ list_add_tail(&tp->list, &uprobe_list);
+end:
+ mutex_unlock(&uprobe_lock);
+ return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+ /*
+ * Argument syntax:
+ * - Add uprobe: p[:[GRP/]EVENT] FILE:OFFSET [%REG]
+ *
+ * - Remove uprobe: -:[GRP/]EVENT
+ */
+ struct path path;
+ struct inode *inode;
+ struct trace_uprobe *tp;
+ int i, ret = 0;
+ int is_delete = 0;
+ char *arg = NULL, *event = NULL, *group = NULL;
+ unsigned long offset;
+ char buf[MAX_EVENT_NAME_LEN];
+ char *filename;
+
+ /* argc must be >= 1 */
+ if (argv[0][0] == '-')
+ is_delete = 1;
+ else if (argv[0][0] != 'p') {
+ pr_info("Probe definition must be started with 'p', 'r' or"
+ " '-'.\n");
+ return -EINVAL;
+ }
+
+ if (argv[0][1] == ':') {
+ event = &argv[0][2];
+ if (strchr(event, '/')) {
+ group = event;
+ event = strchr(group, '/') + 1;
+ event[-1] = '\0';
+ if (strlen(group) == 0) {
+ pr_info("Group name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (strlen(event) == 0) {
+ pr_info("Event name is not specified\n");
+ return -EINVAL;
+ }
+ }
+ if (!group)
+ group = UPROBE_EVENT_SYSTEM;
+
+ if (is_delete) {
+ if (!event) {
+ pr_info("Delete command needs an event name.\n");
+ return -EINVAL;
+ }
+ mutex_lock(&uprobe_lock);
+ tp = find_probe_event(event, group);
+ if (!tp) {
+ mutex_unlock(&uprobe_lock);
+ pr_info("Event %s/%s doesn't exist.\n", group, event);
+ return -ENOENT;
+ }
+ /* delete an event */
+ unregister_trace_uprobe(tp);
+ mutex_unlock(&uprobe_lock);
+ return 0;
+ }
+
+ if (argc < 2) {
+ pr_info("Probe point is not specified.\n");
+ return -EINVAL;
+ }
+ if (isdigit(argv[1][0])) {
+ pr_info("probe point must be have a filename.\n");
+ return -EINVAL;
+ }
+ arg = strchr(argv[1], ':');
+ if (!arg)
+ goto fail_address_parse;
+
+ *arg++ = '\0';
+ filename = argv[1];
+ ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+ if (ret)
+ goto fail_address_parse;
+ inode = path.dentry->d_inode;
+ __iget(inode);
+
+ ret = strict_strtoul(arg, 0, &offset);
+ if (ret)
+ goto fail_address_parse;
+ argc -= 2; argv += 2;
+
+ /* setup a probe */
+ if (!event) {
+ char *tail = strrchr(filename, '/');
+ char *ptr;
+
+ ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL);
+ if (!ptr) {
+ ret = -ENOMEM;
+ goto fail_address_parse;
+ }
+
+ tail = ptr;
+ ptr = strpbrk(tail, ".-_");
+ if (ptr)
+ *ptr = '\0';
+
+ snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail,
+ offset);
+ event = buf;
+ kfree(tail);
+ }
+ tp = alloc_trace_uprobe(group, event, argc);
+ if (IS_ERR(tp)) {
+ pr_info("Failed to allocate trace_uprobe.(%d)\n",
+ (int)PTR_ERR(tp));
+ iput(inode);
+ return PTR_ERR(tp);
+ }
+ tp->offset = offset;
+ tp->inode = inode;
+ tp->filename = kstrdup(filename, GFP_KERNEL);
+ if (!tp->filename) {
+ pr_info("Failed to allocate filename.\n");
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ /* parse arguments */
+ ret = 0;
+ for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+ /* Increment count for freeing args in error case */
+ tp->nr_args++;
+
+ /* Parse argument name */
+ arg = strchr(argv[i], '=');
+ if (arg) {
+ *arg++ = '\0';
+ tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+ } else {
+ arg = argv[i];
+ /* If argument name is omitted, set "argN" */
+ snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+ tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+ }
+
+ if (!tp->args[i].name) {
+ pr_info("Failed to allocate argument[%d] name.\n", i);
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (!is_good_name(tp->args[i].name)) {
+ pr_info("Invalid argument[%d] name: %s\n",
+ i, tp->args[i].name);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ if (traceprobe_conflict_field_name(tp->args[i].name,
+ tp->args, i)) {
+ pr_info("Argument[%d] name '%s' conflicts with "
+ "another field.\n", i, argv[i]);
+ ret = -EINVAL;
+ goto error;
+ }
+
+ /* Parse fetch argument */
+ ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+ false, false);
+ if (ret) {
+ pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+ goto error;
+ }
+ }
+
+ ret = register_trace_uprobe(tp);
+ if (ret)
+ goto error;
+ return 0;
+
+error:
+ free_trace_uprobe(tp);
+ return ret;
+
+fail_address_parse:
+ pr_info("Failed to parse address.\n");
+ return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+ struct trace_uprobe *tp;
+
+ mutex_lock(&uprobe_lock);
+ while (!list_empty(&uprobe_list)) {
+ tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+ unregister_trace_uprobe(tp);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+ mutex_lock(&uprobe_lock);
+ return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+ mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+ int i;
+
+ seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+ seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+ for (i = 0; i < tp->nr_args; i++)
+ seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+ if ((file->f_mode & FMODE_WRITE) &&
+ (file->f_flags & O_TRUNC))
+ cleanup_all_probes();
+
+ return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *ppos)
+{
+ return traceprobe_probes_write(file, buffer, count, ppos,
+ create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+ .owner = THIS_MODULE,
+ .open = probes_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+ .write = probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+ struct trace_uprobe *tp = v;
+
+ seq_printf(m, " %s %-44s %15lu\n", tp->filename, tp->call.name,
+ tp->nhit);
+ return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+ .start = probes_seq_start,
+ .next = probes_seq_next,
+ .stop = probes_seq_stop,
+ .show = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+ .owner = THIS_MODULE,
+ .open = profile_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+ struct uprobe_trace_entry_head *entry;
+ struct ring_buffer_event *event;
+ struct ring_buffer *buffer;
+ u8 *data;
+ int size, i, pc;
+ unsigned long irq_flags;
+ struct ftrace_event_call *call = &tp->call;
+
+ tp->nhit++;
+
+ local_save_flags(irq_flags);
+ pc = preempt_count();
+
+ size = sizeof(*entry) + tp->size;
+
+ event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+ size, irq_flags, pc);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ if (!filter_current_check_discard(buffer, call, entry, event))
+ trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct uprobe_trace_entry_head *field;
+ struct trace_seq *s = &iter->seq;
+ struct trace_uprobe *tp;
+ u8 *data;
+ int i;
+
+ field = (struct uprobe_trace_entry_head *)iter->ent;
+ tp = container_of(event, struct trace_uprobe, call.event);
+
+ if (!trace_seq_printf(s, "%s: (", tp->call.name))
+ goto partial;
+
+ if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+ goto partial;
+
+ if (!trace_seq_puts(s, ")"))
+ goto partial;
+
+ data = (u8 *)&field[1];
+ for (i = 0; i < tp->nr_args; i++)
+ if (!tp->args[i].type->print(s, tp->args[i].name,
+ data + tp->args[i].offset, field))
+ goto partial;
+
+ if (!trace_seq_puts(s, "\n"))
+ goto partial;
+
+ return TRACE_TYPE_HANDLED;
+partial:
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+
+static int probe_event_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_TRACE;
+ utc->tp = tp;
+ tp->consumer = utc;
+ return 0;
+}
+
+static void probe_event_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_TRACE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+ int ret, i;
+ struct uprobe_trace_entry_head field;
+ struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+ DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+ /* Set argument names as fields */
+ for (i = 0; i < tp->nr_args; i++) {
+ ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+ tp->args[i].name,
+ sizeof(field) + tp->args[i].offset,
+ tp->args[i].type->size,
+ tp->args[i].type->is_signed,
+ FILTER_OTHER);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+ int i;
+ int pos = 0;
+
+ const char *fmt, *arg;
+
+ fmt = "(%lx)";
+ arg = "REC->" FIELD_STRING_IP;
+
+ /* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+ tp->args[i].name, tp->args[i].type->fmt);
+ }
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+ for (i = 0; i < tp->nr_args; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+ tp->args[i].name);
+ }
+
+#undef LEN_OR_ZERO
+
+ /* return the length of print_fmt */
+ return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+ int len;
+ char *print_fmt;
+
+ /* First: called with 0 length to calculate the needed length */
+ len = __set_print_fmt(tp, NULL, 0);
+ print_fmt = kmalloc(len + 1, GFP_KERNEL);
+ if (!print_fmt)
+ return -ENOMEM;
+
+ /* Second: actually write the @print_fmt */
+ __set_print_fmt(tp, print_fmt, len + 1);
+ tp->call.print_fmt = print_fmt;
+
+ return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp,
+ struct pt_regs *regs)
+{
+ struct ftrace_event_call *call = &tp->call;
+ struct uprobe_trace_entry_head *entry;
+ struct hlist_head *head;
+ u8 *data;
+ int size, __size, i;
+ int rctx;
+
+ __size = sizeof(*entry) + tp->size;
+ size = ALIGN(__size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+ if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+ "profile buffer not large enough"))
+ return;
+
+ entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+ if (!entry)
+ return;
+
+ entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+ data = (u8 *)&entry[1];
+ for (i = 0; i < tp->nr_args; i++)
+ call_fetch(&tp->args[i].fetch, regs,
+ data + tp->args[i].offset);
+
+ head = this_cpu_ptr(call->perf_events);
+ perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+
+static int probe_perf_enable(struct ftrace_event_call *call)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+ int ret = 0;
+
+ if (!tp->inode || tp->consumer)
+ return -EINTR;
+
+ utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+ if (!utc)
+ return -EINTR;
+
+ utc->cons.handler = uprobe_dispatcher;
+ utc->cons.filter = NULL;
+ ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+ if (ret) {
+ kfree(utc);
+ return ret;
+ }
+
+ tp->flags |= TP_FLAG_PROFILE;
+ tp->consumer = utc;
+ utc->tp = tp;
+ return 0;
+}
+
+static void probe_perf_disable(struct ftrace_event_call *call)
+{
+ struct trace_uprobe *tp = (struct trace_uprobe *)call->data;
+
+ if (!tp->inode || !tp->consumer)
+ return;
+
+ unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+ tp->flags &= ~TP_FLAG_PROFILE;
+ kfree(tp->consumer);
+ tp->consumer = NULL;
+}
+#endif /* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+ switch (type) {
+ case TRACE_REG_REGISTER:
+ return probe_event_enable(event);
+ case TRACE_REG_UNREGISTER:
+ probe_event_disable(event);
+ return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+ case TRACE_REG_PERF_REGISTER:
+ return probe_perf_enable(event);
+ case TRACE_REG_PERF_UNREGISTER:
+ probe_perf_disable(event);
+ return 0;
+#endif
+ }
+ return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+ struct uprobe_trace_consumer *utc;
+ struct trace_uprobe *tp;
+
+ utc = container_of(con, struct uprobe_trace_consumer, cons);
+ tp = utc->tp;
+ if (!tp || tp->consumer != utc)
+ return 0;
+
+ if (tp->flags & TP_FLAG_TRACE)
+ uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+ if (tp->flags & TP_FLAG_PROFILE)
+ uprobe_perf_func(tp, regs);
+#endif
+ return 0;
+}
+
+
+static struct trace_event_functions uprobe_funcs = {
+ .trace = print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+ struct ftrace_event_call *call = &tp->call;
+ int ret;
+
+ /* Initialize ftrace_event_call */
+ INIT_LIST_HEAD(&call->class->fields);
+ call->event.funcs = &uprobe_funcs;
+ call->class->define_fields = uprobe_event_define_fields;
+ if (set_print_fmt(tp) < 0)
+ return -ENOMEM;
+ ret = register_ftrace_event(&call->event);
+ if (!ret) {
+ kfree(call->print_fmt);
+ return -ENODEV;
+ }
+ call->flags = 0;
+ call->class->reg = uprobe_register;
+ call->data = tp;
+ ret = trace_add_event_call(call);
+ if (ret) {
+ pr_info("Failed to register uprobe event: %s\n", call->name);
+ kfree(call->print_fmt);
+ unregister_ftrace_event(&call->event);
+ }
+ return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+ /* tp->event is unregistered in trace_remove_event_call() */
+ trace_remove_event_call(&tp->call);
+ kfree(tp->call.print_fmt);
+ tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *entry;
+
+ d_tracer = tracing_init_dentry();
+ if (!d_tracer)
+ return 0;
+
+ entry = trace_create_file("uprobe_events", 0644, d_tracer,
+ NULL, &uprobe_events_ops);
+ /* Profile interface */
+ entry = trace_create_file("uprobe_profile", 0444, d_tracer,
+ NULL, &uprobe_profile_ops);
+ return 0;
+}
+fs_initcall(init_uprobe_trace);
Signed-off-by: Srikar Dronamraju <[email protected]>
---
Documentation/trace/uprobetrace.txt | 94 +++++++++++++++++++++++++++++++++++
1 files changed, 94 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/uprobetrace.txt
diff --git a/Documentation/trace/uprobetrace.txt b/Documentation/trace/uprobetrace.txt
new file mode 100644
index 0000000..6c18ffe
--- /dev/null
+++ b/Documentation/trace/uprobetrace.txt
@@ -0,0 +1,94 @@
+ Uprobe-tracer: Uprobe-based Event Tracing
+ =========================================
+ Documentation is written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENTS=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead of that, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable it via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+ p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a probe
+
+ GRP : Group name. If omitted, use "uprobes" for it.
+ EVENT : Event name. If omitted, the event name is generated
+ based on the executable name and OFFSET.
+ PATH : path to an executable or a library.
+ OFFSET : Offset from the start of the file where the probe is inserted.
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ %REG : Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits per event via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is the filename, the second is the event name, and the
+third is the number of probe hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+ echo 'p /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash
+
+
+ echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and the
+%ax register at the probed text address. Here we are trying to probe the
+function zfree in /bin/zsh
+
+ # cd /sys/kernel/debug/tracing/
+ # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
+ 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+ # objdump -T /bin/zsh | grep -w zfree
+ 0000000000446420 g DF .text 0000000000000012 Base zfree
+
+0x46420 is the offset of zfree in object /bin/zsh that is loaded at
+0x00400000. Hence the command to probe would be :
+
+ # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+ # cat uprobe_events
+ p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. To trace these
+events, you need to enable them:
+
+ # echo 1 > events/uprobes/enable
+
+Let's disable the event after sleeping for some time.
+ # sleep 20
+ # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+ # cat trace
+ # tracer: nop
+ #
+ # TASK-PID CPU# TIMESTAMP FUNCTION
+ # | | | | |
+ zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+ zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows that the probe was triggered for pid 24842, with the ip
+being 0x446421 and the contents of the ax register being 79.
+
This is a precursor patch that renames identifiers that currently refer to
the kernel/module so that they can also refer to user space binaries.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 12 ++++++------
tools/perf/util/probe-event.c | 6 +++---
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 2c0e64d..98db08f 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -61,7 +61,7 @@ static struct {
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
struct line_range line_range;
- const char *target_module;
+ const char *target;
int max_probe_points;
struct strfilter *filter;
} params;
@@ -241,7 +241,7 @@ static const struct option options[] = {
"file", "vmlinux pathname"),
OPT_STRING('s', "source", &symbol_conf.source_prefix,
"directory", "path to kernel source"),
- OPT_STRING('m', "module", &params.target_module,
+ OPT_STRING('m', "module", &params.target,
"modname", "target module name"),
#endif
OPT__DRY_RUN(&probe_event_dry_run),
@@ -327,7 +327,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (!params.filter)
params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
NULL);
- ret = show_available_funcs(params.target_module,
+ ret = show_available_funcs(params.target,
params.filter);
strfilter__delete(params.filter);
if (ret < 0)
@@ -348,7 +348,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
usage_with_options(probe_usage, options);
}
- ret = show_line_range(&params.line_range, params.target_module);
+ ret = show_line_range(&params.line_range, params.target);
if (ret < 0)
pr_err(" Error: Failed to show lines. (%d)\n", ret);
return ret;
@@ -365,7 +365,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
ret = show_available_vars(params.events, params.nevents,
params.max_probe_points,
- params.target_module,
+ params.target,
params.filter,
params.show_ext_vars);
strfilter__delete(params.filter);
@@ -387,7 +387,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (params.nevents) {
ret = add_perf_probe_events(params.events, params.nevents,
params.max_probe_points,
- params.target_module,
+ params.target,
params.force_add);
if (ret < 0) {
pr_err(" Error: Failed to add events. (%d)\n", ret);
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index f022316..153e860 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1976,7 +1976,7 @@ static int filter_available_functions(struct map *map __unused,
return 1;
}
-int show_available_funcs(const char *module, struct strfilter *_filter)
+int show_available_funcs(const char *elfobject, struct strfilter *_filter)
{
struct map *map;
int ret;
@@ -1987,9 +1987,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
if (ret < 0)
return ret;
- map = kernel_get_module_map(module);
+ map = kernel_get_module_map(elfobject);
if (!map) {
- pr_err("Failed to find %s map.\n", (module) ? : "kernel");
+ pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
return -EINVAL;
}
available_func_filter = _filter;
Enhances perf probe to user space executables and libraries.
Provides very basic support for uprobes.
[ Probing a function in the executable using function name ]
-------------------------------------------------------------
[root@localhost ~]# perf probe -u zfree@/bin/zsh
Add new event:
probe_zsh:zfree (on /bin/zsh:0x45400)
You can now use it on all perf tools, such as:
perf record -e probe_zsh:zfree -aR sleep 1
[root@localhost ~]# perf record -e probe_zsh:zfree -aR sleep 15
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.314 MB perf.data (~13715 samples) ]
[root@localhost ~]# perf report --stdio
# Events: 3K probe_zsh:zfree
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. ......
#
100.00% zsh zsh [.] zfree
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@localhost ~]
[ Probing a library function using function name ]
--------------------------------------------------
[root@localhost]#
[root@localhost]# perf probe -u malloc@/lib64/libc.so.6
Add new event:
probe_libc:malloc (on /lib64/libc-2.5.so:0x74dc0)
You can now use it on all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
[root@localhost]#
[root@localhost]# perf probe --list
probe_libc:malloc (on /lib64/libc-2.5.so:0x0000000000074dc0)
Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 10 +
tools/perf/util/probe-event.c | 387 +++++++++++++++++++++++++++++++++--------
tools/perf/util/probe-event.h | 8 +
tools/perf/util/symbol.c | 10 +
tools/perf/util/symbol.h | 1
5 files changed, 334 insertions(+), 82 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 98db08f..a90ee01 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -57,6 +57,7 @@ static struct {
bool show_ext_vars;
bool show_funcs;
bool mod_events;
+ bool uprobes;
int nevents;
struct perf_probe_event events[MAX_PROBES];
struct strlist *dellist;
@@ -78,6 +79,7 @@ static int parse_probe_event(const char *str)
return -1;
}
+ pev->uprobes = params.uprobes;
/* Parse a perf-probe command into event */
ret = parse_perf_probe_command(str, pev);
pr_debug("%d arguments\n", pev->nargs);
@@ -249,6 +251,8 @@ static const struct option options[] = {
"Set how many probe points can be found for a probe."),
OPT_BOOLEAN('F', "funcs", &params.show_funcs,
"Show potential probe-able functions."),
+ OPT_BOOLEAN('u', "uprobes", &params.uprobes,
+ "user space probe events"),
OPT_CALLBACK('\0', "filter", NULL,
"[!]FILTER", "Set a filter (with --vars/funcs only)\n"
"\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
@@ -304,6 +308,10 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
pr_err(" Error: Don't use --list with --funcs.\n");
usage_with_options(probe_usage, options);
}
+ if (params.uprobes) {
+ pr_warning(" Error: Don't use --list with --uprobes.\n");
+ usage_with_options(probe_usage, options);
+ }
ret = show_perf_probe_events();
if (ret < 0)
pr_err(" Error: Failed to show event list. (%d)\n",
@@ -337,7 +345,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
}
#ifdef DWARF_SUPPORT
- if (params.show_lines) {
+ if (params.show_lines && !params.uprobes) {
if (params.mod_events) {
pr_err(" Error: Don't use --line with"
" --add/--del.\n");
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 153e860..30f9e2f 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -73,6 +73,7 @@ static int e_snprintf(char *str, size_t size, const char *format, ...)
}
static char *synthesize_perf_probe_point(struct perf_probe_point *pp);
+static int convert_name_to_addr(struct perf_probe_event *pev);
static struct machine machine;
/* Initialize symbol maps and path of vmlinux/modules */
@@ -169,6 +170,31 @@ const char *kernel_get_module_path(const char *module)
return (dso) ? dso->long_name : NULL;
}
+static int init_perf_uprobes(void)
+{
+ int ret = 0;
+
+ symbol_conf.try_vmlinux_path = false;
+ symbol_conf.sort_by_name = true;
+ ret = symbol__init();
+ if (ret < 0)
+ pr_debug("Failed to init symbol map.\n");
+
+ return ret;
+}
+
+static int convert_to_perf_probe_point(struct probe_trace_point *tp,
+ struct perf_probe_point *pp)
+{
+ pp->function = strdup(tp->symbol);
+ if (pp->function == NULL)
+ return -ENOMEM;
+ pp->offset = tp->offset;
+ pp->retprobe = tp->retprobe;
+
+ return 0;
+}
+
#ifdef DWARF_SUPPORT
static int open_vmlinux(const char *module)
{
@@ -222,6 +248,15 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
bool need_dwarf = perf_probe_event_need_dwarf(pev);
int fd, ntevs;
+ if (pev->uprobes) {
+ if (need_dwarf) {
+ pr_warning("Debuginfo-analysis is not yet supported"
+ " with -u/--uprobes option.\n");
+ return -ENOSYS;
+ }
+ return convert_name_to_addr(pev);
+ }
+
fd = open_vmlinux(module);
if (fd < 0) {
if (need_dwarf) {
@@ -537,13 +572,7 @@ static int kprobe_convert_to_perf_probe(struct probe_trace_point *tp,
pr_err("Failed to find symbol %s in kernel.\n", tp->symbol);
return -ENOENT;
}
- pp->function = strdup(tp->symbol);
- if (pp->function == NULL)
- return -ENOMEM;
- pp->offset = tp->offset;
- pp->retprobe = tp->retprobe;
-
- return 0;
+ return convert_to_perf_probe_point(tp, pp);
}
static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
@@ -554,6 +583,9 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
pr_warning("Debuginfo-analysis is not supported.\n");
return -ENOSYS;
}
+ if (pev->uprobes)
+ return convert_name_to_addr(pev);
+
return 0;
}
@@ -684,6 +716,7 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
struct perf_probe_point *pp = &pev->point;
char *ptr, *tmp;
char c, nc = 0;
+ bool found = false;
/*
* <Syntax>
* perf probe [EVENT=]SRC[:LN|;PTN]
@@ -722,8 +755,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
if (tmp == NULL)
return -ENOMEM;
- /* Check arg is function or file and copy it */
- if (strchr(tmp, '.')) /* File */
+ /*
+ * Check arg is function or file and copy it.
+ * If it's uprobes then expect the function to be of the form function@filename
+ */
+ if (!pev->uprobes && strchr(tmp, '.')) /* File */
pp->file = tmp;
else /* Function */
pp->function = tmp;
@@ -761,6 +797,17 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
}
break;
case '@': /* File name */
+ if (pev->uprobes && !found) {
+ /* uprobes overloads @ operator */
+ tmp = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+ e_snprintf(tmp, MAX_PROBE_ARGS, "%s@%s",
+ pp->function, arg);
+ free(pp->function);
+ pp->function = tmp;
+ found = true;
+ break;
+ }
+
if (pp->file) {
semantic_error("SRC@SRC is not allowed.\n");
return -EINVAL;
@@ -818,6 +865,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
return -EINVAL;
}
+ if (pev->uprobes && !pp->function) {
+ semantic_error("No function specified for uprobes");
+ return -EINVAL;
+ }
+
if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
semantic_error("Offset/Line/Lazy pattern can't be used with "
"return probe.\n");
@@ -827,6 +879,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
pr_debug("symbol:%s file:%s line:%d offset:%lu return:%d lazy:%s\n",
pp->function, pp->file, pp->line, pp->offset, pp->retprobe,
pp->lazy_line);
+
+ if (pev->uprobes && perf_probe_event_need_dwarf(pev)) {
+ semantic_error("no dwarf based probes for uprobes.");
+ return -EINVAL;
+ }
return 0;
}
@@ -978,7 +1035,8 @@ bool perf_probe_event_need_dwarf(struct perf_probe_event *pev)
{
int i;
- if (pev->point.file || pev->point.line || pev->point.lazy_line)
+ if ((pev->point.file && !pev->uprobes) || pev->point.line ||
+ pev->point.lazy_line)
return true;
for (i = 0; i < pev->nargs; i++)
@@ -1269,10 +1327,21 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
if (buf == NULL)
return NULL;
- len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s+%lu",
- tp->retprobe ? 'r' : 'p',
- tev->group, tev->event,
- tp->symbol, tp->offset);
+ if (tev->uprobes)
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event, tp->symbol);
+ else if (tp->offset)
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s+%lu",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event,
+ tp->symbol, tp->offset);
+ else
+ len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+ tp->retprobe ? 'r' : 'p',
+ tev->group, tev->event,
+ tp->symbol);
+
if (len <= 0)
goto error;
@@ -1291,7 +1360,7 @@ error:
}
static int convert_to_perf_probe_event(struct probe_trace_event *tev,
- struct perf_probe_event *pev)
+ struct perf_probe_event *pev, bool is_kprobe)
{
char buf[64] = "";
int i, ret;
@@ -1303,7 +1372,11 @@ static int convert_to_perf_probe_event(struct probe_trace_event *tev,
return -ENOMEM;
/* Convert trace_point to probe_point */
- ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+ if (is_kprobe)
+ ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+ else
+ ret = convert_to_perf_probe_point(&tev->point, &pev->point);
+
if (ret < 0)
return ret;
@@ -1397,7 +1470,7 @@ static void clear_probe_trace_event(struct probe_trace_event *tev)
memset(tev, 0, sizeof(*tev));
}
-static int open_kprobe_events(bool readwrite)
+static int open_probe_events(bool readwrite, bool is_kprobe)
{
char buf[PATH_MAX];
const char *__debugfs;
@@ -1408,8 +1481,13 @@ static int open_kprobe_events(bool readwrite)
pr_warning("Debugfs is not mounted.\n");
return -ENOENT;
}
+ if (is_kprobe)
+ ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events",
+ __debugfs);
+ else
+ ret = e_snprintf(buf, PATH_MAX, "%stracing/uprobe_events",
+ __debugfs);
- ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events", __debugfs);
if (ret >= 0) {
pr_debug("Opening %s write=%d\n", buf, readwrite);
if (readwrite && !probe_event_dry_run)
@@ -1420,16 +1498,29 @@ static int open_kprobe_events(bool readwrite)
if (ret < 0) {
if (errno == ENOENT)
- pr_warning("kprobe_events file does not exist - please"
- " rebuild kernel with CONFIG_KPROBE_EVENT.\n");
+ pr_warning("%s file does not exist - please"
+ " rebuild kernel with CONFIG_%s_EVENT.\n",
+ is_kprobe ? "kprobe_events" : "uprobe_events",
+ is_kprobe ? "KPROBE" : "UPROBE");
else
- pr_warning("Failed to open kprobe_events file: %s\n",
- strerror(errno));
+ pr_warning("Failed to open %s file: %s\n",
+ is_kprobe ? "kprobe_events" : "uprobe_events",
+ strerror(errno));
}
return ret;
}
-/* Get raw string list of current kprobe_events */
+static int open_kprobe_events(bool readwrite)
+{
+ return open_probe_events(readwrite, 1);
+}
+
+static int open_uprobe_events(bool readwrite)
+{
+ return open_probe_events(readwrite, 0);
+}
+
+/* Get raw string list of current kprobe_events or uprobe_events */
static struct strlist *get_probe_trace_command_rawlist(int fd)
{
int ret, idx;
@@ -1494,36 +1585,26 @@ static int show_perf_probe_event(struct perf_probe_event *pev)
return ret;
}
-/* List up current perf-probe events */
-int show_perf_probe_events(void)
+static int __show_perf_probe_events(int fd, bool is_kprobe)
{
- int fd, ret;
+ int ret = 0;
struct probe_trace_event tev;
struct perf_probe_event pev;
struct strlist *rawlist;
struct str_node *ent;
- setup_pager();
- ret = init_vmlinux();
- if (ret < 0)
- return ret;
-
memset(&tev, 0, sizeof(tev));
memset(&pev, 0, sizeof(pev));
- fd = open_kprobe_events(false);
- if (fd < 0)
- return fd;
-
rawlist = get_probe_trace_command_rawlist(fd);
- close(fd);
if (!rawlist)
return -ENOENT;
strlist__for_each(ent, rawlist) {
ret = parse_probe_trace_command(ent->s, &tev);
if (ret >= 0) {
- ret = convert_to_perf_probe_event(&tev, &pev);
+ ret = convert_to_perf_probe_event(&tev, &pev,
+ is_kprobe);
if (ret >= 0)
ret = show_perf_probe_event(&pev);
}
@@ -1533,6 +1614,31 @@ int show_perf_probe_events(void)
break;
}
strlist__delete(rawlist);
+ return ret;
+}
+
+/* List up current perf-probe events */
+int show_perf_probe_events(void)
+{
+ int fd, ret;
+
+ setup_pager();
+ fd = open_kprobe_events(false);
+ if (fd < 0)
+ return fd;
+
+ ret = init_vmlinux();
+ if (ret < 0)
+ return ret;
+
+ ret = __show_perf_probe_events(fd, true);
+ close(fd);
+
+ fd = open_uprobe_events(false);
+ if (fd >= 0) {
+ ret = __show_perf_probe_events(fd, false);
+ close(fd);
+ }
return ret;
}
@@ -1642,7 +1748,10 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
const char *event, *group;
struct strlist *namelist;
- fd = open_kprobe_events(true);
+ if (pev->uprobes)
+ fd = open_uprobe_events(true);
+ else
+ fd = open_kprobe_events(true);
if (fd < 0)
return fd;
/* Get current event names */
@@ -1656,18 +1765,19 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
printf("Add new event%s\n", (ntevs > 1) ? "s:" : ":");
for (i = 0; i < ntevs; i++) {
tev = &tevs[i];
- if (pev->event)
- event = pev->event;
- else
- if (pev->point.function)
- event = pev->point.function;
- else
- event = tev->point.symbol;
+
if (pev->group)
group = pev->group;
else
group = PERFPROBE_GROUP;
+ if (pev->event)
+ event = pev->event;
+ else if (pev->point.function)
+ event = pev->point.function;
+ else
+ event = tev->point.symbol;
+
/* Get an unused new event name */
ret = get_new_event_name(buf, 64, event,
namelist, allow_suffix);
@@ -1745,6 +1855,7 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
tev->point.offset = pev->point.offset;
tev->point.retprobe = pev->point.retprobe;
tev->nargs = pev->nargs;
+ tev->uprobes = pev->uprobes;
if (tev->nargs) {
tev->args = zalloc(sizeof(struct probe_trace_arg)
* tev->nargs);
@@ -1775,6 +1886,9 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
}
}
+ if (pev->uprobes)
+ return 1;
+
/* Currently just checking function name from symbol map */
sym = __find_kernel_function_by_name(tev->point.symbol, NULL);
if (!sym) {
@@ -1801,15 +1915,19 @@ struct __event_package {
int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
int max_tevs, const char *module, bool force_add)
{
- int i, j, ret;
+ int i, j, ret = 0;
struct __event_package *pkgs;
pkgs = zalloc(sizeof(struct __event_package) * npevs);
if (pkgs == NULL)
return -ENOMEM;
- /* Init vmlinux path */
- ret = init_vmlinux();
+ if (!pevs->uprobes)
+ /* Init vmlinux path */
+ ret = init_vmlinux();
+ else
+ ret = init_perf_uprobes();
+
if (ret < 0) {
free(pkgs);
return ret;
@@ -1879,23 +1997,15 @@ error:
return ret;
}
-static int del_trace_probe_event(int fd, const char *group,
- const char *event, struct strlist *namelist)
+static int del_trace_probe_event(int fd, const char *buf,
+ struct strlist *namelist)
{
- char buf[128];
struct str_node *ent, *n;
- int found = 0, ret = 0;
-
- ret = e_snprintf(buf, 128, "%s:%s", group, event);
- if (ret < 0) {
- pr_err("Failed to copy event.\n");
- return ret;
- }
+ int ret = -1;
if (strpbrk(buf, "*?")) { /* Glob-exp */
strlist__for_each_safe(ent, n, namelist)
if (strglobmatch(ent->s, buf)) {
- found++;
ret = __del_trace_probe_event(fd, ent);
if (ret < 0)
break;
@@ -1904,40 +2014,42 @@ static int del_trace_probe_event(int fd, const char *group,
} else {
ent = strlist__find(namelist, buf);
if (ent) {
- found++;
ret = __del_trace_probe_event(fd, ent);
if (ret >= 0)
strlist__remove(namelist, ent);
}
}
- if (found == 0 && ret >= 0)
- pr_info("Info: Event \"%s\" does not exist.\n", buf);
-
return ret;
}
int del_perf_probe_events(struct strlist *dellist)
{
- int fd, ret = 0;
+ int ret = -1, ufd = -1, kfd = -1;
+ char buf[128];
const char *group, *event;
char *p, *str;
struct str_node *ent;
- struct strlist *namelist;
+ struct strlist *namelist = NULL, *unamelist = NULL;
- fd = open_kprobe_events(true);
- if (fd < 0)
- return fd;
/* Get current event names */
- namelist = get_probe_trace_event_names(fd, true);
- if (namelist == NULL)
- return -EINVAL;
+ kfd = open_kprobe_events(true);
+ if (kfd < 0)
+ return kfd;
+ namelist = get_probe_trace_event_names(kfd, true);
+
+ ufd = open_uprobe_events(true);
+ if (ufd >= 0)
+ unamelist = get_probe_trace_event_names(ufd, true);
+
+ if (namelist == NULL && unamelist == NULL)
+ goto error;
strlist__for_each(ent, dellist) {
str = strdup(ent->s);
if (str == NULL) {
ret = -ENOMEM;
- break;
+ goto error;
}
pr_debug("Parsing: %s\n", str);
p = strchr(str, ':');
@@ -1949,15 +2061,37 @@ int del_perf_probe_events(struct strlist *dellist)
group = "*";
event = str;
}
+
+ ret = e_snprintf(buf, 128, "%s:%s", group, event);
+ if (ret < 0) {
+ pr_err("Failed to copy event.");
+ free(str);
+ goto error;
+ }
+
pr_debug("Group: %s, Event: %s\n", group, event);
- ret = del_trace_probe_event(fd, group, event, namelist);
+ if (namelist)
+ ret = del_trace_probe_event(kfd, buf, namelist);
+ if (unamelist && ret != 0)
+ ret = del_trace_probe_event(ufd, buf, unamelist);
+
free(str);
- if (ret < 0)
- break;
+ if (ret != 0)
+ pr_info("Info: Event \"%s\" does not exist.\n", buf);
}
- strlist__delete(namelist);
- close(fd);
+error:
+ if (kfd >= 0) {
+ if (namelist)
+ strlist__delete(namelist);
+ close(kfd);
+ }
+
+ if (ufd >= 0) {
+ if (unamelist)
+ strlist__delete(unamelist);
+ close(ufd);
+ }
return ret;
}
/* TODO: don't use a global variable for filter ... */
@@ -2003,3 +2137,102 @@ int show_available_funcs(const char *elfobject, struct strfilter *_filter)
dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
return 0;
}
+
+#define DEFAULT_FUNC_FILTER "!_*"
+
+/*
+ * uprobe_events only accepts address:
+ * Convert function and any offset to address
+ */
+static int convert_name_to_addr(struct perf_probe_event *pev)
+{
+ struct perf_probe_point *pp = &pev->point;
+ struct symbol *sym;
+ struct map *map;
+ char *name = NULL, *tmpname = NULL, *function = NULL;
+ int ret = -EINVAL;
+ unsigned long long vaddr = 0;
+
+ /* check if user has specified a virtual address
+ vaddr = strtoul(pp->function, NULL, 0); */
+ if (!pp->function)
+ goto out;
+
+ function = strdup(pp->function);
+ if (!function) {
+ pr_warning("Failed to allocate memory by strdup.\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ tmpname = strchr(function, '@');
+ if (!tmpname) {
+ pr_warning("Cannot break %s into function and file\n",
+ function);
+ goto out;
+ }
+
+ *tmpname = '\0';
+ name = realpath(tmpname + 1, NULL);
+ if (!name) {
+ pr_warning("Cannot find realpath for %s.\n", tmpname + 1);
+ goto out;
+ }
+
+ map = dso__new_map(name);
+ if (!map) {
+ pr_warning("Cannot find appropriate DSO for %s.\n", name);
+ goto out;
+ }
+ available_func_filter = strfilter__new(DEFAULT_FUNC_FILTER, NULL);
+ if (map__load(map, filter_available_functions)) {
+ pr_err("Failed to load map.\n");
+ return -EINVAL;
+ }
+
+ sym = map__find_symbol_by_name(map, function, NULL);
+ if (!sym) {
+ pr_warning("Cannot find %s in DSO %s\n", function, name);
+ goto out;
+ }
+
+ if (map->start > sym->start)
+ vaddr = map->start;
+ vaddr += sym->start + pp->offset + map->pgoff;
+ pp->offset = 0;
+
+ if (!pev->event) {
+ pev->event = function;
+ function = NULL;
+ }
+ if (!pev->group) {
+ char *ptr1, *ptr2;
+
+ pev->group = zalloc(sizeof(char *) * 64);
+ ptr1 = strdup(basename(name));
+ if (ptr1) {
+ ptr2 = strpbrk(ptr1, "-._");
+ if (ptr2)
+ *ptr2 = '\0';
+ e_snprintf(pev->group, 64, "%s_%s", PERFPROBE_GROUP,
+ ptr1);
+ free(ptr1);
+ }
+ }
+ free(pp->function);
+ pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+ if (!pp->function) {
+ ret = -ENOMEM;
+ pr_warning("Failed to allocate memory by zalloc.\n");
+ goto out;
+ }
+ e_snprintf(pp->function, MAX_PROBE_ARGS, "%s:0x%llx", name, vaddr);
+ ret = 0;
+
+out:
+ if (function)
+ free(function);
+ if (name)
+ free(name);
+ return ret;
+}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 3434fc9..365e016 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -7,7 +7,7 @@
extern bool probe_event_dry_run;
-/* kprobe-tracer tracing point */
+/* kprobe-tracer and uprobe-tracer tracing point */
struct probe_trace_point {
char *symbol; /* Base symbol */
unsigned long offset; /* Offset from symbol */
@@ -20,7 +20,7 @@ struct probe_trace_arg_ref {
long offset; /* Offset value */
};
-/* kprobe-tracer tracing argument */
+/* kprobe-tracer and uprobe-tracer tracing argument */
struct probe_trace_arg {
char *name; /* Argument name */
char *value; /* Base value */
@@ -28,12 +28,13 @@ struct probe_trace_arg {
struct probe_trace_arg_ref *ref; /* Referencing offset */
};
-/* kprobe-tracer tracing event (point + arg) */
+/* kprobe-tracer and uprobe-tracer tracing event (point + arg) */
struct probe_trace_event {
char *event; /* Event name */
char *group; /* Group name */
struct probe_trace_point point; /* Trace point */
int nargs; /* Number of args */
+ bool uprobes; /* uprobes only */
struct probe_trace_arg *args; /* Arguments */
};
@@ -69,6 +70,7 @@ struct perf_probe_event {
char *group; /* Group name */
struct perf_probe_point point; /* Probe point */
int nargs; /* Number of arguments */
+ bool uprobes;
struct perf_probe_arg *args; /* Arguments */
};
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index eec1963..6a52022 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -565,7 +565,7 @@ static int dso__split_kallsyms(struct dso *dso, struct map *map,
struct machine *machine = kmaps->machine;
struct map *curr_map = map;
struct symbol *pos;
- int count = 0, moved = 0;
+ int count = 0, moved = 0;
struct rb_root *root = &dso->symbols[map->type];
struct rb_node *next = rb_first(root);
int kernel_range = 0;
@@ -2665,3 +2665,11 @@ int machine__load_vmlinux_path(struct machine *machine, enum map_type type,
return ret;
}
+
+struct map *dso__new_map(const char *name)
+{
+ struct dso *dso = dso__new(name);
+ struct map *map = map__new2(0, dso, MAP__FUNCTION);
+
+ return map;
+}
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 325ee36..8824d57 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -215,6 +215,7 @@ void dso__set_long_name(struct dso *dso, char *name);
void dso__set_build_id(struct dso *dso, void *build_id);
void dso__read_running_kernel_build_id(struct dso *dso,
struct machine *machine);
+struct map *dso__new_map(const char *name);
struct symbol *dso__find_symbol(struct dso *dso, enum map_type type,
u64 addr);
struct symbol *dso__find_symbol_by_name(struct dso *dso, enum map_type type,
Enhances -F/--funcs option of "perf probe" to list possible probe points in
an executable file or library. A new option -x/--exe specifies the path of
the executable or library.
Show last 10 functions in /bin/zsh.
# perf probe -F -x /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam
Show first 10 functions in /lib/libc.so.6
# perf probe -F -x /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf
Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/builtin-probe.c | 63 +++++++++++++++++++++++++++++++++++++----
tools/perf/util/probe-event.c | 56 ++++++++++++++++++++++++++++--------
tools/perf/util/probe-event.h | 4 +--
3 files changed, 102 insertions(+), 21 deletions(-)
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index a90ee01..6aff85c 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -185,6 +185,55 @@ static int opt_set_filter(const struct option *opt __used,
return 0;
}
+static int opt_set_executable(const struct option *opt __used,
+ const char *str, int unset __used)
+{
+ if (params.target || !str)
+ return -EINVAL;
+
+ if (params.uprobes) {
+ pr_err(" Error: Don't use -m with --uprobes.\n");
+ return -EINVAL;
+ }
+
+ if (str) {
+ params.target = str;
+ params.uprobes = true;
+ }
+ return 0;
+}
+
+#ifdef DWARF_SUPPORT
+static int opt_set_module(const struct option *opt __used,
+ const char *str, int unset __used)
+{
+ if (params.target || !str)
+ return -EINVAL;
+
+ if (params.uprobes) {
+ pr_err(" Error: Don't use -m with --uprobes.\n");
+ return -EINVAL;
+ }
+
+ if (str)
+ params.target = str;
+
+ return 0;
+}
+#endif
+
+static int opt_set_uprobes(const struct option *opt __used,
+ const char *str __used, int unset __used)
+{
+ if (params.target) {
+ pr_err(" Error: Don't use --uprobes with -x/-m.\n");
+ return -EINVAL;
+ }
+
+ params.uprobes = true;
+ return 0;
+}
+
static const char * const probe_usage[] = {
"perf probe [<options>] 'PROBEDEF' ['PROBEDEF' ...]",
"perf probe [<options>] --add 'PROBEDEF' [--add 'PROBEDEF' ...]",
@@ -243,16 +292,18 @@ static const struct option options[] = {
"file", "vmlinux pathname"),
OPT_STRING('s', "source", &symbol_conf.source_prefix,
"directory", "path to kernel source"),
- OPT_STRING('m', "module", &params.target,
- "modname", "target module name"),
+ OPT_CALLBACK('m', "module", NULL, "modname", "target module name",
+ opt_set_module),
#endif
OPT__DRY_RUN(&probe_event_dry_run),
OPT_INTEGER('\0', "max-probes", &params.max_probe_points,
"Set how many probe points can be found for a probe."),
OPT_BOOLEAN('F', "funcs", &params.show_funcs,
"Show potential probe-able functions."),
- OPT_BOOLEAN('u', "uprobes", &params.uprobes,
- "user space probe events"),
+ OPT_CALLBACK_NOOPT('u', "uprobes", NULL, NULL,
+ "user space probe events", opt_set_uprobes),
+ OPT_CALLBACK('x', "exe", NULL, "/path/to/absolute/relative/file",
+ "target executable name", opt_set_executable),
OPT_CALLBACK('\0', "filter", NULL,
"[!]FILTER", "Set a filter (with --vars/funcs only)\n"
"\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
@@ -335,8 +386,8 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
if (!params.filter)
params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
NULL);
- ret = show_available_funcs(params.target,
- params.filter);
+ ret = show_available_funcs(params.target, params.filter,
+ params.uprobes);
strfilter__delete(params.filter);
if (ret < 0)
pr_err(" Error: Failed to show functions."
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 30f9e2f..d45dfb1 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -47,6 +47,7 @@
#include "trace-event.h" /* For __unused */
#include "probe-event.h"
#include "probe-finder.h"
+#include "session.h"
#define MAX_CMDLEN 256
#define MAX_PROBE_ARGS 128
@@ -2094,6 +2095,7 @@ error:
}
return ret;
}
+
/* TODO: don't use a global variable for filter ... */
static struct strfilter *available_func_filter;
@@ -2110,32 +2112,60 @@ static int filter_available_functions(struct map *map __unused,
return 1;
}
-int show_available_funcs(const char *elfobject, struct strfilter *_filter)
+static int __show_available_funcs(struct map *map)
+{
+ if (map__load(map, filter_available_functions)) {
+ pr_err("Failed to load map.\n");
+ return -EINVAL;
+ }
+ if (!dso__sorted_by_name(map->dso, map->type))
+ dso__sort_by_name(map->dso, map->type);
+
+ dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
+ return 0;
+}
+
+static int available_kernel_funcs(const char *module)
{
struct map *map;
int ret;
- setup_pager();
-
ret = init_vmlinux();
if (ret < 0)
return ret;
- map = kernel_get_module_map(elfobject);
+ map = kernel_get_module_map(module);
if (!map) {
- pr_err("Failed to find %s map.\n", (elfobject) ? : "kernel");
+ pr_err("Failed to find %s map.\n", (module) ? : "kernel");
return -EINVAL;
}
+ return __show_available_funcs(map);
+}
+
+int show_available_funcs(const char *elfobject, struct strfilter *_filter,
+ bool user)
+{
+ struct map *map;
+ int ret;
+
+ setup_pager();
available_func_filter = _filter;
- if (map__load(map, filter_available_functions)) {
- pr_err("Failed to load map.\n");
- return -EINVAL;
- }
- if (!dso__sorted_by_name(map->dso, map->type))
- dso__sort_by_name(map->dso, map->type);
- dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
- return 0;
+ if (!user)
+ return available_kernel_funcs(elfobject);
+
+ symbol_conf.try_vmlinux_path = false;
+ symbol_conf.sort_by_name = true;
+ ret = symbol__init();
+ if (ret < 0) {
+ pr_err("Failed to init symbol map.\n");
+ return ret;
+ }
+ map = dso__new_map(elfobject);
+ ret = __show_available_funcs(map);
+ dso__delete(map->dso);
+ map__delete(map);
+ return ret;
}
#define DEFAULT_FUNC_FILTER "!_*"
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 365e016..5199df4 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -130,8 +130,8 @@ extern int show_line_range(struct line_range *lr, const char *module);
extern int show_available_vars(struct perf_probe_event *pevs, int npevs,
int max_probe_points, const char *module,
struct strfilter *filter, bool externs);
-extern int show_available_funcs(const char *module, struct strfilter *filter);
-
+extern int show_available_funcs(const char *module, struct strfilter *filter,
+ bool user);
/* Maximum index number of event-name postfix */
#define MAX_EVENT_INDEX 1024
Modify perf-probe.txt to include uprobe documentation
Signed-off-by: Suzuki Poulose <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
---
tools/perf/Documentation/perf-probe.txt | 21 ++++++++++++++++++++-
1 files changed, 20 insertions(+), 1 deletions(-)
diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
index 02bafce..3db8d25 100644
--- a/tools/perf/Documentation/perf-probe.txt
+++ b/tools/perf/Documentation/perf-probe.txt
@@ -76,6 +76,8 @@ OPTIONS
-F::
--funcs::
Show available functions in given module or kernel.
+ With -x/--exe, can also list functions in a user space executable /
+ shared library.
--filter=FILTER::
(Only for --vars and --funcs) Set filter. FILTER is a combination of glob
@@ -96,12 +98,21 @@ OPTIONS
--max-probes::
Set the maximum number of probe points for an event. Default is 128.
+-u::
+--uprobes::
+ Specify a uprobe based probe point.
+
+-x::
+--exe=PATH::
+ Specify path to the executable or shared library file for user
+ space tracing. Used with --funcs option only.
+
PROBE SYNTAX
------------
Probe points are defined by following syntax.
1) Define event based on function name
- [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
+ [EVENT=]FUNC[@EXE][@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
2) Define event based on source file with line number
[EVENT=]SRC:ALN [ARG ...]
@@ -112,6 +123,7 @@ Probe points are defined by following syntax.
'EVENT' specifies the name of new event, if omitted, it will be set the name of the probed function. Currently, event group name is set as 'probe'.
'FUNC' specifies a probed function name, and it may have one of the following options; '+OFFS' is the offset from function entry address in bytes, ':RLN' is the relative-line number from function entry line, and '%return' means that it probes function return. And ';PTN' means lazy matching pattern (see LAZY MATCHING). Note that ';PTN' must be the end of the probe point definition. In addition, '@SRC' specifies a source file which has that function.
+'EXE' specifies the absolute or relative path of the user space executable or user space shared library.
It is also possible to specify a probe point by the source line number or lazy matching by using 'SRC:ALN' or 'SRC;PTN' syntax, where 'SRC' is the source file path, ':ALN' is the line number and ';PTN' is the lazy matching pattern.
'ARG' specifies the arguments of this probe point, (see PROBE ARGUMENT).
@@ -180,6 +192,13 @@ Delete all probes on schedule().
./perf probe --del='schedule*'
+Add probes at zfree() function on /bin/zsh
+
+ ./perf probe -u zfree@/bin/zsh
+
+Add probes at malloc() function on libc
+
+ ./perf probe -u malloc@/lib/libc.so.6
SEE ALSO
--------
On Tue, Jun 07, 2011 at 06:32:16PM +0530, Srikar Dronamraju wrote:
>
> Enhances perf probe to user space executables and libraries.
> Provides very basic support for uprobes.
Nice. Does this require full debug info for symbolic probes,
or can it also work with simple symbolic information?
On Tue, Jun 07, 2011 at 09:30:39AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 07, 2011 at 06:32:16PM +0530, Srikar Dronamraju wrote:
> >
> > Enhances perf probe to user space executables and libraries.
> > Provides very basic support for uprobes.
>
> Nice. Does this require full debug info for symbolic probes,
> or can it also work with simple symbolic information?
It works only with symbol information for now.
It doesn't (yet) know how to use debuginfo :-)
Ananth
Em Tue, Jun 07, 2011 at 07:08:53PM +0530, Ananth N Mavinakayanahalli escreveu:
> On Tue, Jun 07, 2011 at 09:30:39AM -0400, Christoph Hellwig wrote:
> > On Tue, Jun 07, 2011 at 06:32:16PM +0530, Srikar Dronamraju wrote:
> > > Enhances perf probe to user space executables and libraries.
> > > Provides very basic support for uprobes.
> > Nice. Does this require full debug info for symbolic probes,
> > or can it also work with simple symbolic information?
> It works only with symbol information for now.
> It doesn't (yet) know how to use debuginfo :-)
'perf probe' uses perf symbol library, so it really don't have to know
from where symbol resolution information is obtained, only if it needs
things that are _just_ on debuginfo, such as line information, etc.
But then that is also already supported in 'perf probe'.
Or is there something else in particular you're thinking?
- Arnaldo
* Arnaldo Carvalho de Melo <[email protected]> [2011-06-07 11:21:16]:
> Em Tue, Jun 07, 2011 at 07:08:53PM +0530, Ananth N Mavinakayanahalli escreveu:
> > On Tue, Jun 07, 2011 at 09:30:39AM -0400, Christoph Hellwig wrote:
> > > On Tue, Jun 07, 2011 at 06:32:16PM +0530, Srikar Dronamraju wrote:
> > > > Enhances perf probe to user space executables and libraries.
> > > > Provides very basic support for uprobes.
>
> > > Nice. Does this require full debug info for symbolic probes,
> > > or can it also work with simple symbolic information?
>
> > It works only with symbol information for now.
> > It doesn't (yet) know how to use debuginfo :-)
>
> 'perf probe' uses perf symbol library, so it really don't have to know
> from where symbol resolution information is obtained, only if it needs
> things that are _just_ on debuginfo, such as line information, etc.
>
> But then that is also already supported in 'perf probe'.
>
> Or is there something else in particular you're thinking?
>
What Ananth was saying was that perf probe for uprobes still has to take
advantage of tracing using line numbers. Also, it is still restricted to
showing only register contents, and we still have to add support for
variables. I know that the perf symbol library has support for that; we
just have to enable it for uprobes.
--
Thanks and Regards
Srikar
On 06/07/2011 06:02 AM, Srikar Dronamraju wrote:
> Enhances perf probe to user space executables and libraries.
> Provides very basic support for uprobes.
Hi Srikar,
This seems to have an issue with multiple active uprobes, whereas the v3
patchset handled this fine. I haven't tracked down the exact code
difference yet, but here's an example transcript of what I'm seeing:
# perf probe -l
probe_zsh:main (on /bin/zsh:0x000000000000e3f0)
probe_zsh:zalloc (on /bin/zsh:0x0000000000051120)
probe_zsh:zfree (on /bin/zsh:0x0000000000051c70)
# perf stat -e probe_zsh:main zsh -c true
Performance counter stats for 'zsh -c true':
1 probe_zsh:main
0.029387785 seconds time elapsed
# perf stat -e probe_zsh:zalloc zsh -c true
Performance counter stats for 'zsh -c true':
605 probe_zsh:zalloc
0.043836002 seconds time elapsed
# perf stat -e probe_zsh:zfree zsh -c true
Performance counter stats for 'zsh -c true':
36 probe_zsh:zfree
0.029445890 seconds time elapsed
# perf stat -e probe_zsh:* zsh -c true
Performance counter stats for 'zsh -c true':
0 probe_zsh:zalloc
1 probe_zsh:main
0 probe_zsh:zfree
0.030912587 seconds time elapsed
# perf stat -e probe_zsh:z* zsh -c true
Performance counter stats for 'zsh -c true':
605 probe_zsh:zalloc
0 probe_zsh:zfree
0.043774671 seconds time elapsed
It seems like, among the selected probes, only the one with the lowest
offset ever gets hit. Any ideas?
Thanks,
Josh
(2011/06/07 22:38), Ananth N Mavinakayanahalli wrote:
> On Tue, Jun 07, 2011 at 09:30:39AM -0400, Christoph Hellwig wrote:
>> On Tue, Jun 07, 2011 at 06:32:16PM +0530, Srikar Dronamraju wrote:
>>>
>>> Enhances perf probe to user space executables and libraries.
>>> Provides very basic support for uprobes.
>>
>> Nice. Does this require full debug info for symbolic probes,
>> or can it also work with simple symbolic information?
>
> It works only with symbol information for now.
> It doesn't (yet) know how to use debuginfo :-)
So we can enhance it to use debuginfo for obtaining
variables or line numbers. Since probe-finder already
has the analysis part, I think it's not hard to support
userspace apps.
Thanks!
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
>
> Hi Srikar,
>
> This seems to have an issue with multiple active uprobes, whereas the v3
> patchset handled this fine. I haven't tracked down the exact code
> difference yet, but here's an example transcript of what I'm seeing:
Okay, thanks for taking a look and trying it out. Since you have given a
way to reproduce the problem, I will track it down and post a fix.
--
Thanks and Regards
Srikar
Hi Srikar,
On Tue, Jun 07, 2011 at 06:28:50PM +0530, Srikar Dronamraju wrote:
> +/* Called with uprobes_treelock held */
> +static struct uprobe *__find_uprobe(struct inode * inode,
> + loff_t offset, struct rb_node **close_match)
> +{
> + struct uprobe r = { .inode = inode, .offset = offset };
> + struct rb_node *n = uprobes_tree.rb_node;
> + struct uprobe *uprobe;
> + int match, match_inode;
> +
> + while (n) {
> + uprobe = rb_entry(n, struct uprobe, rb_node);
> + match = match_uprobe(uprobe, &r, &match_inode);
> + if (close_match && match_inode)
> + *close_match = n;
> +
> + if (!match) {
> + atomic_inc(&uprobe->ref);
> + return uprobe;
> + }
> + if (match < 0)
> + n = n->rb_left;
> + else
> + n = n->rb_right;
> +
> + }
> + return NULL;
> +}
> +
I think there is a simple mistake in the search logic here. In particular, I
think the arguments to match_uprobe() should be swapped to give:
match = match_uprobe(&r, uprobe, NULL)
Otherwise, when we do not have an exact match, the next node to be considered
is the left child of 'uprobe' even though 'uprobe' is "smaller" than r (and
vice versa for the "larger" case).
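To make the direction of the walk concrete, here is a small illustration
(not part of the patch), using 'node' for the tree entry and 'key' for
the value being searched, with the same inode, key.offset = 10 and
node.offset = 5:
	match = match_uprobe(node, &key, NULL);	/* as posted: returns -1 (node < key),     */
						/* so match < 0 descends to node->rb_left, */
						/* away from where the key would be        */
	match = match_uprobe(&key, node, NULL);	/* swapped: returns 1 (key > node),        */
						/* so the walk descends to node->rb_right, */
						/* towards the key                         */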
> +static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> +{
> + struct rb_node **p = &uprobes_tree.rb_node;
> + struct rb_node *parent = NULL;
> + struct uprobe *u;
> + int match;
> +
> + while (*p) {
> + parent = *p;
> + u = rb_entry(parent, struct uprobe, rb_node);
> + match = match_uprobe(u, uprobe, NULL);
> + if (!match) {
> + atomic_inc(&u->ref);
> + return u;
> + }
> +
> + if (match < 0)
> + p = &parent->rb_left;
> + else
> + p = &parent->rb_right;
> +
> + }
I think the match_uprobe() arguments should be swapped here as well for
similar reasons as above.
Also, changing the argument order seems to solve the issue reported by
Josh Stone where only the uprobe with the lowest address was responding
(though I did not test with perf, just lightly with the trace_event
interface). In particular, iteration using rb_next() appears to work as
expected, thus allowing all breakpoints to be registered in
mmap_uprobe().
> + u = NULL;
> + rb_link_node(&uprobe->rb_node, parent, p);
> + rb_insert_color(&uprobe->rb_node, &uprobes_tree);
> + /* get access + drop ref */
> + atomic_set(&uprobe->ref, 2);
> + return u;
> +}
--
steve
On 06/07/2011 09:12 PM, Stephen Wilson wrote:
> Also, changing the argument order seems to solve the issue reported by
> Josh Stone where only the uprobe with the lowest address was responding
> (thou I did not test with perf, just lightly with the trace_event
> interface).
Makes sense, and indeed after swapping the arguments to both calls, the
perf test I gave now works as expected. Thanks!
Josh
* Josh Stone <[email protected]> [2011-06-08 00:04:39]:
> On 06/07/2011 09:12 PM, Stephen Wilson wrote:
> > Also, changing the argument order seems to solve the issue reported by
> > Josh Stone where only the uprobe with the lowest address was responding
> > (thou I did not test with perf, just lightly with the trace_event
> > interface).
>
> Makes sense, and indeed after swapping the arguments to both calls, the
> perf test I gave now works as expected. Thanks!
>
> Josh
Thanks Stephen for the fix and Josh for both reporting and confirming
that the fix works.
Stephen, along with the parameter interchange, I also renamed the local
variable so that it is not confused with the argument names in
match_uprobe; otherwise 'r' in __find_uprobe would correspond to 'l' in
match_uprobe. The result is something like below.
I am resending the faulty patch with the fix, and have also checked the
fix into my git tree.
--
Thanks and Regards
Srikar
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 95c16dd..72f21db 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -363,14 +363,14 @@ static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
static struct uprobe *__find_uprobe(struct inode * inode,
loff_t offset, struct rb_node **close_match)
{
- struct uprobe r = { .inode = inode, .offset = offset };
+ struct uprobe u = { .inode = inode, .offset = offset };
struct rb_node *n = uprobes_tree.rb_node;
struct uprobe *uprobe;
int match, match_inode;
while (n) {
uprobe = rb_entry(n, struct uprobe, rb_node);
- match = match_uprobe(uprobe, &r, &match_inode);
+ match = match_uprobe(&u, uprobe, &match_inode);
if (close_match && match_inode)
*close_match = n;
@@ -412,7 +412,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
while (*p) {
parent = *p;
u = rb_entry(parent, struct uprobe, rb_node);
- match = match_uprobe(u, uprobe, NULL);
+ match = match_uprobe(uprobe, u, NULL);
if (!match) {
atomic_inc(&u->ref);
return u;
Changelog: Includes a fix suggested by Stephen Wilson for a small
problem in the match_uprobe() argument order; the fix was to interchange
the parameters. The problem was reported by Josh Stone. (Thanks Josh, Stephen)
Provides interfaces to add and remove uprobes from the global rb tree.
Also provides definitions for uprobe_consumer, interfaces to add and
remove a consumer to a uprobe. There is a unique uprobe element in the
rbtree for each unique inode:offset pair.
Uprobe gets added to the global rb tree when the first consumer for that
uprobe gets registered. It gets removed from the tree only when all
registered consumers are unregistered.
Multiple consumers can share the same probe. Each consumer provides a
handler that runs on a probe hit and an optional filter callback that
limits the tasks on which the handler should run.
Signed-off-by: Srikar Dronamraju <[email protected]>
---
include/linux/uprobes.h | 12 +++
kernel/uprobes.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 222 insertions(+), 0 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 232ccea..9187df3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -23,6 +23,7 @@
* Jim Keniston
*/
+#include <linux/rbtree.h>
#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
#include <asm/uprobes.h>
#else
@@ -50,6 +51,17 @@ typedef u8 uprobe_opcode_t;
/* Unexported functions & macros for use by arch-specific code */
#define uprobe_opcode_sz (sizeof(uprobe_opcode_t))
+struct uprobe_consumer {
+ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+ /*
+ * filter is optional; If a filter exists, handler is run
+ * if and only if filter returns true.
+ */
+ bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+ struct uprobe_consumer *next;
+};
+
/*
* Most architectures can use the default versions of @read_opcode(),
* @set_bkpt(), @set_orig_insn(), and @is_bkpt_insn();
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 7ef916e..80ddaf3 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -35,6 +35,12 @@
#include <linux/swap.h> /* needed for try_to_free_swap */
struct uprobe {
+ struct rb_node rb_node; /* node in the rb tree */
+ atomic_t ref; /* lifetime muck */
+ struct rw_semaphore consumer_rwsem;
+ struct uprobe_consumer *consumers;
+ struct inode *inode; /* we hold a ref */
+ loff_t offset;
u8 insn[MAX_UINSN_BYTES];
u16 fixups;
};
@@ -307,3 +313,207 @@ bool __weak is_bkpt_insn(u8 *insn)
memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
return (opcode == UPROBES_BKPT_INSN);
}
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(uprobes_treelock); /* serialize (un)register */
+
+static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
+{
+ if (match_inode)
+ *match_inode = 0;
+
+ if (l->inode < r->inode)
+ return -1;
+ if (l->inode > r->inode)
+ return 1;
+ else {
+ if (match_inode)
+ *match_inode = 1;
+
+ if (l->offset < r->offset)
+ return -1;
+
+ if (l->offset > r->offset)
+ return 1;
+ }
+
+ return 0;
+}
+
+/* Called with uprobes_treelock held */
+static struct uprobe *__find_uprobe(struct inode * inode,
+ loff_t offset, struct rb_node **close_match)
+{
+ struct uprobe u = { .inode = inode, .offset = offset };
+ struct rb_node *n = uprobes_tree.rb_node;
+ struct uprobe *uprobe;
+ int match, match_inode;
+
+ while (n) {
+ uprobe = rb_entry(n, struct uprobe, rb_node);
+ match = match_uprobe(&u, uprobe, &match_inode);
+ if (close_match && match_inode)
+ *close_match = n;
+
+ if (!match) {
+ atomic_inc(&uprobe->ref);
+ return uprobe;
+ }
+ if (match < 0)
+ n = n->rb_left;
+ else
+ n = n->rb_right;
+
+ }
+ return NULL;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires uprobes_treelock
+ */
+static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
+{
+ struct uprobe *uprobe;
+ unsigned long flags;
+
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ uprobe = __find_uprobe(inode, offset, NULL);
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+ return uprobe;
+}
+
+static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
+{
+ struct rb_node **p = &uprobes_tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct uprobe *u;
+ int match;
+
+ while (*p) {
+ parent = *p;
+ u = rb_entry(parent, struct uprobe, rb_node);
+ match = match_uprobe(uprobe, u, NULL);
+ if (!match) {
+ atomic_inc(&u->ref);
+ return u;
+ }
+
+ if (match < 0)
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+
+ }
+ u = NULL;
+ rb_link_node(&uprobe->rb_node, parent, p);
+ rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+ /* get access + drop ref */
+ atomic_set(&uprobe->ref, 2);
+ return u;
+}
+
+/*
+ * Acquires uprobes_treelock.
+ * Matching uprobe already exists in rbtree;
+ * increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ * get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+ unsigned long flags;
+ struct uprobe *u;
+
+ spin_lock_irqsave(&uprobes_treelock, flags);
+ u = __insert_uprobe(uprobe);
+ spin_unlock_irqrestore(&uprobes_treelock, flags);
+ return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+ if (atomic_dec_and_test(&uprobe->ref))
+ kfree(uprobe);
+}
+
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+{
+ struct uprobe *uprobe, *cur_uprobe;
+
+ uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+ if (!uprobe)
+ return NULL;
+
+ __iget(inode);
+ uprobe->inode = inode;
+ uprobe->offset = offset;
+ init_rwsem(&uprobe->consumer_rwsem);
+
+ /* add to uprobes_tree, sorted on inode:offset */
+ cur_uprobe = insert_uprobe(uprobe);
+
+ /* a uprobe exists for this inode:offset combination*/
+ if (cur_uprobe) {
+ kfree(uprobe);
+ uprobe = cur_uprobe;
+ iput(inode);
+ }
+ return uprobe;
+}
+
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+ struct uprobe_consumer *consumer;
+
+ down_read(&uprobe->consumer_rwsem);
+ consumer = uprobe->consumers;
+ while (consumer) {
+ if (!consumer->filter || consumer->filter(consumer, current))
+ consumer->handler(consumer, regs);
+
+ consumer = consumer->next;
+ }
+ up_read(&uprobe->consumer_rwsem);
+}
+
+static void add_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ down_write(&uprobe->consumer_rwsem);
+ consumer->next = uprobe->consumers;
+ uprobe->consumers = consumer;
+ up_write(&uprobe->consumer_rwsem);
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+ struct uprobe_consumer *consumer)
+{
+ struct uprobe_consumer *con;
+ bool ret = false;
+
+ down_write(&uprobe->consumer_rwsem);
+ con = uprobe->consumers;
+ if (consumer == con) {
+ uprobe->consumers = con->next;
+ if (!con->next)
+ put_uprobe(uprobe); /* drop creation ref */
+ ret = true;
+ } else {
+ for (; con; con = con->next) {
+ if (con->next == consumer) {
+ con->next = consumer->next;
+ ret = true;
+ break;
+ }
+ }
+ }
+ up_write(&uprobe->consumer_rwsem);
+ return ret;
+}
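A brief sketch, not part of the patch, of how a client might wire up the
consumer interface introduced above. The handler and filter signatures
come from the uprobe_consumer definition in this patch;
register_uprobe()/unregister_uprobe() belong to a later patch in the
series (reviewed below), and 'my_tgid' is a hypothetical variable:
	static pid_t my_tgid;	/* hypothetical: the tgid we want to trace */

	static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
	{
		/* runs on each probe hit that passes the filter */
		return 0;
	}

	static bool my_filter(struct uprobe_consumer *self, struct task_struct *task)
	{
		return task->tgid == my_tgid;	/* limit the handler to one process */
	}

	static struct uprobe_consumer my_consumer = {
		.handler = my_handler,
		.filter  = my_filter,	/* optional; NULL means "run for all tasks" */
	};

	/* registration/unregistration against an inode:offset pair, e.g.:
	 *	err = register_uprobe(inode, offset, &my_consumer);
	 *	...
	 *	unregister_uprobe(inode, offset, &my_consumer);
	 */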
Hi Srikar,
Just a few questions/comments inline below.
On Tue, Jun 07, 2011 at 06:29:00PM +0530, Srikar Dronamraju wrote:
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head try_list, success_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> + int ret = -1;
> +
> + if (!inode || !consumer || consumer->next)
> + return -EINVAL;
> +
> + if (offset > inode->i_size)
> + return -EINVAL;
> +
> + uprobe = alloc_uprobe(inode, offset);
> + if (!uprobe)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&try_list);
> + INIT_LIST_HEAD(&success_list);
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers) {
> + ret = 0;
> + goto consumers_add;
> + }
> +
> + mutex_lock(&mapping->i_mmap_mutex);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + loff_t vaddr;
> + struct task_struct *tsk;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> + if (!valid_vma(vma)) {
> + mmput(mm);
> + continue;
> + }
> +
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
This check looks like it is off by one? vma->vm_end is already one byte
past the last valid address in the vma, so we should compare using ">="
here I think.
> + /* Not in this vma */
> + mmput(mm);
> + continue;
> + }
> + tsk = get_mm_owner(mm);
> + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
> + /*
> + * We cannot have a virtual address that is
> + * greater than TASK_SIZE_OF(tsk)
> + */
> + put_task_struct(tsk);
> + mmput(mm);
> + continue;
> + }
> + put_task_struct(tsk);
> + mm->uprobes_vaddr = (unsigned long) vaddr;
> + list_add(&mm->uprobes_list, &try_list);
> + }
> + mutex_unlock(&mapping->i_mmap_mutex);
> +
> + if (list_empty(&try_list)) {
> + ret = 0;
> + goto consumers_add;
> + }
> + list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
> + down_read(&mm->mmap_sem);
> + ret = install_breakpoint(mm, uprobe);
> +
> + if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> + up_read(&mm->mmap_sem);
> + break;
> + }
> + if (!ret)
> + list_move(&mm->uprobes_list, &success_list);
> + else {
> + /*
> + * install_breakpoint failed as there are no active
> + * threads for the mm; ignore the error.
> + */
> + list_del(&mm->uprobes_list);
> + mmput(mm);
> + }
> + up_read(&mm->mmap_sem);
> + }
> +
> + if (list_empty(&try_list)) {
> + /*
> + * All install_breakpoints were successful;
> + * cleanup successful entries.
> + */
> + ret = 0;
> + list_for_each_entry_safe(mm, tmpmm, &success_list,
> + uprobes_list) {
> + list_del(&mm->uprobes_list);
> + mmput(mm);
> + }
> + goto consumers_add;
> + }
> +
> + /*
> + * Atleast one unsuccessful install_breakpoint;
> + * remove successful probes and cleanup untried entries.
> + */
> + list_for_each_entry_safe(mm, tmpmm, &success_list, uprobes_list)
> + remove_breakpoint(mm, uprobe);
> + list_for_each_entry_safe(mm, tmpmm, &try_list, uprobes_list) {
> + list_del(&mm->uprobes_list);
> + mmput(mm);
> + }
> + delete_uprobe(uprobe);
> + goto put_unlock;
> +
> +consumers_add:
> + add_consumer(uprobe, consumer);
> +
> +put_unlock:
> + mutex_unlock(&uprobes_mutex);
> + put_uprobe(uprobe); /* drop access ref */
> + return ret;
> +}
> +
> +/*
> + * unregister_uprobe - unregister a already registered probe.
> + * @inode: the file in which the probe has to be removed.
> + * @offset: offset from the start of the file.
> + * @consumer: identify which probe if multiple probes are colocated.
> + */
> +void unregister_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head tmp_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> +
> + if (!inode || !consumer)
> + return;
> +
> + uprobe = find_uprobe(inode, offset);
> + if (!uprobe) {
> + pr_debug("No uprobe found with inode:offset %p %lld\n",
> + inode, offset);
> + return;
> + }
> +
> + if (!del_consumer(uprobe, consumer)) {
> + pr_debug("No uprobe found with consumer %p\n",
> + consumer);
> + return;
> + }
When del_consumer() fails, don't we still need to do a put_uprobe(uprobe)
to drop the extra access ref?
> +
> + INIT_LIST_HEAD(&tmp_list);
> +
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers)
> + goto put_unlock;
> +
> + mutex_lock(&mapping->i_mmap_mutex);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + struct task_struct *tsk;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> +
> + if (!atomic_read(&mm->uprobes_count)) {
> + mmput(mm);
> + continue;
> + }
> +
> + if (valid_vma(vma)) {
> + loff_t vaddr;
> +
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
Same issue with the comparison against vma->vm_end here as well.
Thanks,
> + /* Not in this vma */
> + mmput(mm);
> + continue;
> + }
> + tsk = get_mm_owner(mm);
> + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
> + /*
> + * We cannot have a virtual address that is
> + * greater than TASK_SIZE_OF(tsk)
> + */
> + put_task_struct(tsk);
> + mmput(mm);
> + continue;
> + }
> + put_task_struct(tsk);
> + mm->uprobes_vaddr = (unsigned long) vaddr;
> + list_add(&mm->uprobes_list, &tmp_list);
> + } else
> + mmput(mm);
> + }
> + mutex_unlock(&mapping->i_mmap_mutex);
> + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list)
> + remove_breakpoint(mm, uprobe);
> +
> + delete_uprobe(uprobe);
> +
> +put_unlock:
> + mutex_unlock(&uprobes_mutex);
> + put_uprobe(uprobe); /* drop access ref */
> +}
--
steve
On Tue, Jun 07, 2011 at 06:30:51PM +0530, Srikar Dronamraju wrote:
> +/*
> + * uprobe_post_notifier gets called in interrupt context.
> + * It completes the single step operation.
> + */
> +int uprobe_post_notifier(struct pt_regs *regs)
> +{
> + struct uprobe *uprobe;
> + struct uprobe_task *utask;
> +
> + if (!current->mm || !current->utask || !current->utask->active_uprobe)
> + /* task is currently not uprobed */
> + return 0;
> +
> + utask = current->utask;
> + uprobe = utask->active_uprobe;
> + if (!uprobe)
> + return 0;
> +
> + set_thread_flag(TIF_UPROBE);
> + return 1;
> +}
Looks like this can be simplified. If current->utask->active_uprobe is
non-null then surely the assignment to uprobe will be too?
--
steve
On Tue, Jun 07, 2011 at 06:29:31PM +0530, Srikar Dronamraju wrote:
> +static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
> + struct list_head *tmp_list)
> +{
> + struct uprobe *uprobe;
> + struct rb_node *n;
> + unsigned long flags;
> +
> + n = uprobes_tree.rb_node;
> + spin_lock_irqsave(&uprobes_treelock, flags);
> + uprobe = __find_uprobe(inode, 0, &n);
It is valid for a uprobe offset to be zero I guess, so perhaps we need
to do a put_uprobe() here when the result of __find_uprobe() is
non-null.
> + for (; n; n = rb_next(n)) {
> + uprobe = rb_entry(n, struct uprobe, rb_node);
> + if (uprobe->inode != inode)
> + break;
> + list_add(&uprobe->pending_list, tmp_list);
> + continue;
> + }
> + spin_unlock_irqrestore(&uprobes_treelock, flags);
> +}
> +
> +/*
> + * Called from dup_mmap.
> + * called with mm->mmap_sem and old_mm->mmap_sem acquired.
> + */
> +void dup_mmap_uprobe(struct mm_struct *old_mm, struct mm_struct *mm)
> +{
> + atomic_set(&old_mm->uprobes_count,
> + atomic_read(&mm->uprobes_count));
> +}
> +
> +/*
> + * Called from mmap_region.
> + * called with mm->mmap_sem acquired.
> + *
> + * Return -ve no if we fail to insert probes and we cannot
> + * bail-out.
> + * Return 0 otherwise. i.e :
> + * - successful insertion of probes
> + * - no possible probes to be inserted.
> + * - insertion of probes failed but we can bail-out.
> + */
> +int mmap_uprobe(struct vm_area_struct *vma)
> +{
> + struct list_head tmp_list;
> + struct uprobe *uprobe, *u;
> + struct mm_struct *mm;
> + struct inode *inode;
> + unsigned long start, pgoff;
> + int ret = 0;
> +
> + if (!valid_vma(vma))
> + return ret; /* Bail-out */
> +
> + INIT_LIST_HEAD(&tmp_list);
> +
> + mm = vma->vm_mm;
> + inode = vma->vm_file->f_mapping->host;
> + start = vma->vm_start;
> + pgoff = vma->vm_pgoff;
> + __iget(inode);
> +
> + up_write(&mm->mmap_sem);
> + mutex_lock(&uprobes_mutex);
> + down_read(&mm->mmap_sem);
> +
> + vma = find_vma(mm, start);
> + /* Not the same vma */
> + if (!vma || vma->vm_start != start ||
> + vma->vm_pgoff != pgoff || !valid_vma(vma) ||
> + inode->i_mapping != vma->vm_file->f_mapping)
> + goto mmap_out;
> +
> + add_to_temp_list(vma, inode, &tmp_list);
> + list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> + loff_t vaddr;
> +
> + list_del(&uprobe->pending_list);
> + if (ret)
> + continue;
> +
> + vaddr = vma->vm_start + uprobe->offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr < vma->vm_start || vaddr > vma->vm_end)
Another place where the check should be "vaddr >= vma->vm_end" I think?
Thanks,
> + /* Not in this vma */
> + continue;
> + if (vaddr > TASK_SIZE)
> + /*
> + * We cannot have a virtual address that is
> + * greater than TASK_SIZE
> + */
> + continue;
> + mm->uprobes_vaddr = (unsigned long)vaddr;
> + ret = install_breakpoint(mm, uprobe);
> + if (ret && (ret == -ESRCH || ret == -EEXIST))
> + ret = 0;
> + }
> +
> +mmap_out:
> + mutex_unlock(&uprobes_mutex);
> + iput(inode);
> + up_read(&mm->mmap_sem);
> + down_write(&mm->mmap_sem);
> + return ret;
> +}
--
steve
> > +
> > + mm = vma->vm_mm;
> > + if (!valid_vma(vma)) {
> > + mmput(mm);
> > + continue;
> > + }
> > +
> > + vaddr = vma->vm_start + offset;
> > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
>
> This check looks like it is off by one? vma->vm_end is already one byte
> past the last valid address in the vma, so we should compare using ">="
> here I think.
Right, we are off by one.
Will correct in the next patchset.
Will also correct the other places where we check for vm_end.
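For reference, a sketch of what the corrected range check would look
like in that loop (not the final patch):
	vaddr = vma->vm_start + offset;
	vaddr -= vma->vm_pgoff << PAGE_SHIFT;
	if (vaddr < vma->vm_start || vaddr >= vma->vm_end) {
		/* Not in this vma */
		mmput(mm);
		continue;
	}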
> > +
> > + if (!del_consumer(uprobe, consumer)) {
> > + pr_debug("No uprobe found with consumer %p\n",
> > + consumer);
> > + return;
> > + }
>
> When del_consumer() fails dont we still need to do a put_uprobe(uprobe)
> to drop the extra access ref?
>
Yes, we need to check and drop the reference.
Will correct in the next patchset.
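A sketch of the intended fix in unregister_uprobe(); find_uprobe() takes
an access ref, so the failure path must drop it:
	if (!del_consumer(uprobe, consumer)) {
		pr_debug("No uprobe found with consumer %p\n", consumer);
		put_uprobe(uprobe);	/* drop the access ref taken by find_uprobe() */
		return;
	}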
> > +
> > + INIT_LIST_HEAD(&tmp_list);
> > +
> > + mapping = inode->i_mapping;
--
Thanks and Regards
Srikar
> > + */
> > +int uprobe_post_notifier(struct pt_regs *regs)
> > +{
> > + struct uprobe *uprobe;
> > + struct uprobe_task *utask;
> > +
> > + if (!current->mm || !current->utask || !current->utask->active_uprobe)
> > + /* task is currently not uprobed */
> > + return 0;
> > +
> > + utask = current->utask;
> > + uprobe = utask->active_uprobe;
> > + if (!uprobe)
> > + return 0;
> > +
> > + set_thread_flag(TIF_UPROBE);
> > + return 1;
> > +}
>
> Looks like this can be simplified. If current->utask->active_uprobe is
> non-null then surely the assignment to uprobe will be too?
>
Yes, the two lines where we check for !uprobe and return are redundant
and can be removed. Will be corrected in the next patchset.
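A sketch of how the notifier might look once the redundant check (and
the then-unused locals) are dropped:
	int uprobe_post_notifier(struct pt_regs *regs)
	{
		if (!current->mm || !current->utask || !current->utask->active_uprobe)
			/* task is currently not uprobed */
			return 0;

		set_thread_flag(TIF_UPROBE);
		return 1;
	}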
--
Thanks and Regards
Srikar
> > +static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
> > + struct list_head *tmp_list)
> > +{
> > + struct uprobe *uprobe;
> > + struct rb_node *n;
> > + unsigned long flags;
> > +
> > + n = uprobes_tree.rb_node;
> > + spin_lock_irqsave(&uprobes_treelock, flags);
> > + uprobe = __find_uprobe(inode, 0, &n);
>
> It is valid for a uprobe offset to be zero I guess, so perhaps we need
> to do a put_uprobe() here when the result of __find_uprobe() is
> non-null.
Right, will check for the result of __find_uprobe and do a put_uprobe().
Will correct this in the next patchset.
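A sketch of the intended fix in add_to_temp_list(); on an exact match
__find_uprobe() takes an access ref, which must be dropped here:
	uprobe = __find_uprobe(inode, 0, &n);
	if (uprobe)
		put_uprobe(uprobe);	/* drop the access ref taken on an exact match */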
--
Thanks and Regards
Srikar
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> - Breakpoint handling should co-exist with singlestep/blockstep from
> another tracer/debugger.
> - Queue and dequeue signals delivered from the singlestep till
> completion of postprocessing.
These two are important to sort before we can think of merging this
right?
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> Please do provide your valuable comments.
Your patch split-up is complete crap. I'm about to simply fold all of
them just to be able to read anything.
The split-up appears to do its best to make it absolutely impossible to
get a sane overview of things, tons of things are out of order, either
it relies on future patches filling out things or modifies stuff in
previous patches.
Its a complete pain to read..
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> +static void report_bad_prefix(void)
> +{
> + pr_warn_once("uprobes does not currently support probing "
> + "instructions with any of the following prefixes: "
> + "cs:, ds:, es:, ss:, lock:\n");
> +}
> +
> +static void report_bad_1byte_opcode(int mode, uprobe_opcode_t op)
> +{
> + pr_warn_once("In %d-bit apps, "
> + "uprobes does not currently support probing "
> + "instructions whose first byte is 0x%2.2x\n", mode, op);
> +}
> +
> +static void report_bad_2byte_opcode(uprobe_opcode_t op)
> +{
> + pr_warn_once("uprobes does not currently support probing "
> + "instructions with the 2-byte opcode 0x0f 0x%2.2x\n", op);
> +}
I really don't like all that dmesg muck; why not simply fail the op?
This _once stuff is pretty useless too: once you've had them all,
subsequent probe attempts will not say anything and will leave you in
the dark anyway.
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> + vaddr_old = kmap_atomic(old_page, KM_USER0);
> + vaddr_new = kmap_atomic(new_page, KM_USER1);
> + vaddr_new = kmap_atomic(page, KM_USER0);
again, drop the KM_foo bits, those are obsolete.
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + loff_t vaddr;
> + struct task_struct *tsk;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> + if (!valid_vma(vma)) {
> + mmput(mm);
> + continue;
> + }
> +
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
> + /* Not in this vma */
> + mmput(mm);
> + continue;
> + }
> + tsk = get_mm_owner(mm);
> + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
> + /*
> + * We cannot have a virtual address that is
> + * greater than TASK_SIZE_OF(tsk)
> + */
> + put_task_struct(tsk);
> + mmput(mm);
> + continue;
> + }
> + put_task_struct(tsk);
> + mm->uprobes_vaddr = (unsigned long) vaddr;
> + list_add(&mm->uprobes_list, &try_list);
> + }
This still falls flat on its face when there's multiple maps of the same
text in one mm.
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> +/*
> + * There could be threads that have hit the breakpoint and are entering the
> + * notifier code and trying to acquire the uprobes_treelock. The thread
> + * calling delete_uprobe() that is removing the uprobe from the rb_tree can
> + * race with these threads and might acquire the uprobes_treelock compared
> + * to some of the breakpoint hit threads. In such a case, the breakpoint hit
> + * threads will not find the uprobe. Finding if a "trap" instruction was
> + * present at the interrupting address is racy. Hence provide some extra
> + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> + * the uprobes_treelock before the uprobe is removed from the rbtree.
> + */
'some' extra time doesn't really sound convincing to me. Either it is
sufficient to avoid the race or it is not. It reads to me like: we add a
delay so that the race mostly doesn't occur. Not good ;-)
> +static void delete_uprobe(struct uprobe *uprobe)
> +{
> + unsigned long flags;
> +
> + synchronize_sched();
> + spin_lock_irqsave(&uprobes_treelock, flags);
> + rb_erase(&uprobe->rb_node, &uprobes_tree);
> + spin_unlock_irqrestore(&uprobes_treelock, flags);
> + iput(uprobe->inode);
> +}
Also what are the uprobe lifetime rules here? Does it still exist after
this returns?
The comment in del_consumer() that says: 'drop creation ref' worries me
and makes me think that is the last reference around and the uprobe will
be freed right there, which clearly cannot happen since it's not yet
removed from the RB-tree.
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> + ret = anon_vma_prepare(vma);
> + if (ret)
> + goto unlock_out;
You just leaked new_page.
> +
> + lock_page(new_page);
> + ret = __replace_page(vma, old_page, new_page);
> + unlock_page(new_page);
> + if (ret != 0)
> + page_cache_release(new_page);
> +unlock_out:
> + unlock_page(old_page);
> +
> +put_out:
> + put_page(old_page); /* we did a get_page in the beginning */
> + return ret;
> +}
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> +static void print_insert_fail(struct task_struct *tsk,
> + unsigned long vaddr, const char *why)
> +{
> + pr_warn_once("Can't place breakpoint at pid %d vaddr" " %#lx: %s\n",
> + tsk->pid, vaddr, why);
> +}
Why would we ever want to print this instead of simply failing the
operation and returning an error code to userspace?
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> + vaddr_old = kmap_atomic(old_page, KM_USER0);
> + vaddr_new = kmap_atomic(new_page, KM_USER1);
> +
> + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> + /* poke the new insn in, ASSUMES we don't cross page boundary */
> + addr = vaddr;
> + vaddr &= ~PAGE_MASK;
> + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> +
> + kunmap_atomic(vaddr_new);
> + kunmap_atomic(vaddr_old);
> + vaddr_new = kmap_atomic(page, KM_USER0);
> + vaddr &= ~PAGE_MASK;
> + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> + kunmap_atomic(vaddr_new);
>
Both sequences in resp {write,read}_opcode() assume the opcode doesn't
cross page boundaries but don't in fact have any assertions validating
this assumption.
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> +/**
> + * __replace_page - replace page in vma by new page.
> + * based on replace_page in mm/ksm.c
> + *
> + * @vma: vma that holds the pte pointing to page
> + * @page: the cowed page we are replacing by kpage
> + * @kpage: the modified page we replace page by
> + *
> + * Returns 0 on success, -EFAULT on failure.
> + */
> +static int __replace_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage)
This is a verbatim copy of mm/ksm.c:replace_page(), I think I can
remember why you did this, but the changelog utterly fails to mention
why we need a second copy of this logic (or anything much at all).
On Thu, Jun 09, 2011 at 08:42:24PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > - Breakpoint handling should co-exist with singlestep/blockstep from
> > another tracer/debugger.
> > - Queue and dequeue signals delivered from the singlestep till
> > completion of postprocessing.
>
> These two are important to sort before we can think of merging this
> right?
Yup.
Guess Srikar missed updating this part, but the first of the issues
(sstep/blockstep) is now fixed. Signal queueing is a work-in-progress.
Ananth
(2011/06/10 8:03), Peter Zijlstra wrote:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
>> Please do provide your valuable comments.
>
> Your patch split-up is complete crap. I'm about to simply fold all of
> them just to be able to read anything.
>
> The split-up appears to do its best to make it absolutely impossible to
> get a sane overview of things, tons of things are out of order, either
> it relies on future patches filling out things or modifies stuff in
> previous patches.
>
> Its a complete pain to read..
Maybe for the uprobe part itself, you are right.
But I hope the tracing/perf parts are separated from that.
Thanks,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
(2011/06/07 22:02), Srikar Dronamraju wrote:
> Enhances perf probe to user space executables and libraries.
> Provides very basic support for uprobes.
>
> [ Probing a function in the executable using function name ]
> -------------------------------------------------------------
> [root@localhost ~]# perf probe -u zfree@/bin/zsh
Hmm, here I have a concern about the inconsistency of the probe point
syntax.
Since perf probe already supports debuginfo analysis,
it accepts the following syntax:
[EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
Thus, the "@" should take a source file path, not a binary path.
I think the -u option should take the path of the target binary, as below:
# perf probe -u /bin/zsh -a zfree
This will allow perf-probe to support user-space debuginfo
analysis. With it, we can do as below:
# perf probe -u /bin/zsh -a zfree@foo/bar.c:10
Please try to update tools/perf/Documentation/perf-probe.txt too,
then you'll see how the new syntax differs from the current one :)
Thanks,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
(2011/06/07 22:02), Srikar Dronamraju wrote:
> Modify perf-probe.txt to include uprobe documentation
Sorry, NAK. I hope to see a better, uniform interface for
both uprobes and kprobes, even after we support debuginfo for
uprobes.
I mean, if a uprobe always requires a user-space binary, it should
be specified with the -u option instead of --exe and @EXE.
I think this simplifies perf probe usage; it implies:
# perf probe [-k KERN] [...] // Control probes in kernel
# perf probe -m MOD [...] // Control probes in MOD module
# perf probe -u EXE [...] // Control probes on EXE file
Thank you,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
Hey Masami,
[ I have restricted the number of people in CC to those people who are
closely following perf. Please feel free to add people].
Thanks for the review.
> > Enhances perf probe to user space executables and libraries.
> > Provides very basic support for uprobes.
> >
> > [ Probing a function in the executable using function name ]
> > -------------------------------------------------------------
> > [root@localhost ~]# perf probe -u zfree@/bin/zsh
>
> Hmm, here, I have concern about the interface inconsistency
> of the probe point syntax.
>
> Since perf probe already supports debuginfo analysis,
> it accepts following syntax;
>
> [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
>
> Thus, The "@" should take a source file path, not binary path.
>
All the discussions in this forum happened with respect to the
<function>@<executable> form (especially since we started with the
inode-based uprobes approach). That's why I chose
perf probe -u func@execfile. This also gave us the flexibility to define
two probes in two different executables in the same session/command,
something like
perf probe -u func1@exec1 func2@exec2.
However, I am okay with changing to your suggested syntax, which is
"perf probe -u /bin/zsh -a zfree".
> I think -u option should have a path of the target binary, as below
>
> # perf probe -u /bin/zsh -a zfree
Will --uprobe work as the long name option for -u or do you suggest
something else?
>
> This will allow perf-probe to support user-space debuginfo
> analysis. With it, we can do as below;
>
While we discuss this, I would also like to get your thoughts on a
couple of issues.
1. Is it okay to use a switch like "-u" that restricts the probe
definition session to just user space tracing? I.e. we won't be able
to define a probe in a user space executable and also a kernel probe in
one single session.
2. Is it okay to have just a "target" and a flag to identify whether the
session is a userspace probing session, instead of having separate
fields for the userspace executable path and the module name?
I.e. are you okay with the patch "perf: rename target_module to target"?
3. Currently perf_probe_point has one field, "file", that was only used
to refer to the source file. Uprobes overrides it to use it for the
executable file too. One approach would be to add another field,
something like "execfile", that we use to refer to executable files, and
use the current field "file" for source files.
Also, do you suggest that we rename the current
"file" field to "srcfile"?
Based on the outcome of this discussion, I will modify the code and
documentation.
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:21]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > + vaddr_old = kmap_atomic(old_page, KM_USER0);
> > + vaddr_new = kmap_atomic(new_page, KM_USER1);
>
> > + vaddr_new = kmap_atomic(page, KM_USER0);
>
> again, drop the KM_foo bits, those are obsolete.
Oh okay, I had only dropped the KM_foo bits from kunmap_atomic.
Will do the same for kmap_atomic.
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:27]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > + ret = anon_vma_prepare(vma);
> > + if (ret)
> > + goto unlock_out;
>
> You just leaked new_page.
Right, Will fix this.
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:29]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > + vaddr_old = kmap_atomic(old_page, KM_USER0);
> > + vaddr_new = kmap_atomic(new_page, KM_USER1);
> > +
> > + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> > + /* poke the new insn in, ASSUMES we don't cross page boundary */
> > + addr = vaddr;
> > + vaddr &= ~PAGE_MASK;
> > + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> > +
> > + kunmap_atomic(vaddr_new);
> > + kunmap_atomic(vaddr_old);
>
>
> > + vaddr_new = kmap_atomic(page, KM_USER0);
> > + vaddr &= ~PAGE_MASK;
> > + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> > + kunmap_atomic(vaddr_new);
> >
>
> Both sequences in resp {write,read}_opcode() assume the opcode doesn't
> cross page boundaries but don't in fact have any assertions validating
> this assumption.
>
read_opcode and write_opcode read/write just one breakpoint instruction.
I had the below note just above the write_opcode definition:
/*
* NOTE:
* Expect the breakpoint instruction to be the smallest size instruction for
* the architecture. If an arch has variable length instruction and the
* breakpoint instruction is not of the smallest length instruction
* supported by that architecture then we need to modify read_opcode /
* write_opcode accordingly. This would never be a problem for archs that
* have fixed length instructions.
*/
Do we have archs whose breakpoint instruction isn't the smallest
instruction size for that arch? If we do, can we change
write_opcode/read_opcode when we add support for that architecture?
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:32]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > +/**
> > + * __replace_page - replace page in vma by new page.
> > + * based on replace_page in mm/ksm.c
> > + *
> > + * @vma: vma that holds the pte pointing to page
> > + * @page: the cowed page we are replacing by kpage
> > + * @kpage: the modified page we replace page by
> > + *
> > + * Returns 0 on success, -EFAULT on failure.
> > + */
> > +static int __replace_page(struct vm_area_struct *vma, struct page *page,
> > + struct page *kpage)
>
> This is a verbatim copy of mm/ksm.c:replace_page(), I think I can
> remember why you did this, but the changelog utterly fails to mention
> why we need a second copy of this logic (or anything much at all).
>
__replace_page is not an exact copy of replace_page; it's a slightly
modified copy. Here are the reasons for having this modified copy
instead of using replace_page:
replace_page was written specifically for ksm's purposes by Hugh
Dickins. Also, Hugh said he doesn't like replace_page being exposed for
other uses. He has plans to further modify replace_page for ksm-specific
purposes, which might not be aligned with how we are using it.
Further, for replace_page it's good enough to call page_add_anon_rmap().
However, for uprobes' needs we have to call page_add_new_anon_rmap(),
which will add the page to the right LRU list.
replace_page needs a reference to orig_pte while __replace_page doesn't
need one.
I can add the same to the changelog, but I am not sure it makes for good
reading; hence I had skipped it.
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-09 20:42:24]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > - Breakpoint handling should co-exist with singlestep/blockstep from
> > another tracer/debugger.
We can remove this now.
Prior to this patchset, the post notifier ran in interrupt
context, hence we couldn't call user_disable_single_step.
From this patchset onwards (due to the changes that do away with the
per-task slot), we run the post notifier in task context. Hence we can
now call user_enable_single_step/user_disable_single_step, which do the
right thing.
Please correct me if I am missing something.
> > - Queue and dequeue signals delivered from the singlestep till
> > completion of postprocessing.
>
I am working towards this.
> These two are important to sort before we can think of merging this
> right?
>
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:16]:
> On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > Please do provide your valuable comments.
>
> Your patch split-up is complete crap. I'm about to simply fold all of
> them just to be able to read anything.
>
> The split-up appears to do its best to make it absolutely impossible to
> get a sane overview of things, tons of things are out of order, either
> it relies on future patches filling out things or modifies stuff in
> previous patches.
>
> Its a complete pain to read..
>
We have two options:
combine patches 2-4, patches 6-7 and patches 9-11,
or
combine patches 1 to 15 into one patch.
Please let me know your preference.
--
Thanks and Regards
Srikar
On 06/07, Srikar Dronamraju wrote:
>
> +static int __replace_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int err = -EFAULT;
> +
> + addr = page_address_in_vma(page, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_trans_huge(*pmd) || (!pmd_present(*pmd)))
> + goto out;
Hmm. So it doesn't work with transhuge pages? Maybe the caller should
use __gup(FOLL_SPLIT), otherwise set_bkpt/etc can fail "mysteriously", no?
OTOH, I don't really understand how pmd_trans_huge() is possible: valid_vma()
checks ->vm_file != NULL, and IIUC transparent hugepages can only work
with anonymous mappings. Confused...
But the real problem (AFAICS) is VM_HUGETLB mappings. I can't understand
how __replace_page() can work in that case. Probably valid_vma() should
fail if is_vm_hugetlb_page()?
Oleg.
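A sketch of the check Oleg suggests; the rest of valid_vma() is not
shown in this thread, so its return type and remaining checks are
assumptions:
	static bool valid_vma(struct vm_area_struct *vma)
	{
		if (!vma->vm_file)
			return false;
		if (is_vm_hugetlb_page(vma))
			return false;	/* __replace_page() cannot handle hugetlb mappings */
		/* ... existing executable/permission checks ... */
		return true;
	}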
On Fri, 2011-06-10 at 01:03 +0200, Peter Zijlstra wrote:
> The comment in del_consumer() that says: 'drop creation ref' worries me
> and makes me think that is the last reference around and the uprobe will
> be freed right there, which clearly cannot happen since it's not yet
> removed from the RB-tree.
I agree about that comment. It scared me too, not only because of the RB
tree, but also because the uprobe is used later in that function to drop
the write_rwsem.
I think that comment needs to be changed to something like:
/* Have caller drop the creation ref */
-- Steve
On 06/07, Srikar Dronamraju wrote:
>
> +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> + unsigned long vaddr, uprobe_opcode_t opcode)
> +{
> + struct page *old_page, *new_page;
> + void *vaddr_old, *vaddr_new;
> + struct vm_area_struct *vma;
> + unsigned long addr;
> + int ret;
> +
> + /* Read the page with vaddr into memory */
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
Sorry if this was already discussed... But why are we using FOLL_WRITE here?
We are not going to write into this page, and this provokes an unnecessary
COW, no?
Also. This is called under down_read(mmap_sem), can't we race with
access_process_vm() modifying the same memory?
Oleg.
On 06/07, Srikar Dronamraju wrote:
>
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> + struct prio_tree_iter iter;
> + struct list_head try_list, success_list;
> + struct address_space *mapping;
> + struct mm_struct *mm, *tmpmm;
> + struct vm_area_struct *vma;
> + struct uprobe *uprobe;
> + int ret = -1;
> +
> + if (!inode || !consumer || consumer->next)
> + return -EINVAL;
> +
> + if (offset > inode->i_size)
> + return -EINVAL;
> +
> + uprobe = alloc_uprobe(inode, offset);
> + if (!uprobe)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&try_list);
> + INIT_LIST_HEAD(&success_list);
> + mapping = inode->i_mapping;
> +
> + mutex_lock(&uprobes_mutex);
> + if (uprobe->consumers) {
> + ret = 0;
> + goto consumers_add;
> + }
> +
> + mutex_lock(&mapping->i_mmap_mutex);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
I didn't actually read this patch yet, but this looks suspicious.
Why begin == end == 0? Doesn't this mean we are ignoring the mappings
with vm_pgoff != 0 ?
Perhaps this should be offset >> PAGE_SIZE?
Oleg.
(2011/06/13 17:41), Srikar Dronamraju wrote:
> Hey Masami,
> [ I have restricted the number of people in CC to those people who are
> closely following perf. Please feel free to add people].
>
> Thanks for the review.
>
>>> Enhances perf probe to user space executables and libraries.
>>> Provides very basic support for uprobes.
>>>
>>> [ Probing a function in the executable using function name ]
>>> -------------------------------------------------------------
>>> [root@localhost ~]# perf probe -u zfree@/bin/zsh
>>
>> Hmm, here, I have concern about the interface inconsistency
>> of the probe point syntax.
>>
>> Since perf probe already supports debuginfo analysis,
>> it accepts following syntax;
>>
>> [EVENT=]FUNC[@SRC][:RLN|+OFFS|%return|;PTN] [ARG ...]
>>
>> Thus, The "@" should take a source file path, not binary path.
>>
>
> All the discussions in this forum happened with respect to
> <function>@<executable> form (esp since we started with the inode based
> uprobes approach). Thats why I chose the
> perf probe -u func@execfile. This also gave us a flexibility to define
> two probes in two different executables in the same session/command.
> Something like
> perf probe -u func1@exec1 func2@exec2.
I see, and sorry for changing my mind, but I think the '-u execname symbol'
style will ultimately be good for uprobe users too, once perf probe -u
supports debuginfo, since they won't need to learn a different syntax.
>
> However I am okay to change to your suggested syntax which is
> "perf probe -u /bin/zsh -a zfree"
Thanks!
>
>> I think -u option should have a path of the target binary, as below
>>
>> # perf probe -u /bin/zsh -a zfree
>
> Will --uprobe work as the long name option for -u or do you suggest
> something else?
Hmm, good question. Maybe we can use -x|--exec to define a uprobe event,
because there is no need to give an executable file for kprobes events.
# so that -x implies user space event on given execfile
>> This will allow perf-probe to support user-space debuginfo
>> analysis. With it, we can do as below;
>>
>
> While we discuss this I would also like to get your thoughts on couple
> of issues.
>
> 1. Its okay to use a switch like "-u" that restricts the probe
> definition session to just the user space tracing. i.e We wont be able
> to define a probe in user space executable and also a kernel probe in
> one single session.
I'm OK with that. The usage of perf probe is different from systemtap;
perf probe does one thing at a time, systemtap does everything at once.
If someone wants to define various probes in the kernel, they may have
to call perf probe several times. And I'm OK with that, because all probe
definitions should be done before recording.
So with perf probe, tracing is done in the three steps below.
1. Definition
2. Recording
3. Analysis
And each phase is done separately.
> 2. Its okay to have just "target" and a flag to identify if the session
> is a userspace probing session, instead of having separate fields for
> userspace executable path and the module name.
> i.e you are okay with patch "perf: rename target_module to target"
Ah, that's good for me. Actually, I'm now trying to expand "target_module"
to accept the path of an offline module. It'll be better for that too. :-)
> 3. Currently perf_probe_point has one field "file" that was only used to
> refer to source file. Uprobes overrides it to use it for executable file
> too. One approach would be to add another field something like
> "execfile" that we use to refer to executable files and use the current
> field "file" for source files.
> Also do you suggest that we rename the current
> file field to "srcfile?
Ah, I see. From the viewpoint of implementation, we just introduce
a bool flag, which indicates a user probe, and a union, which holds
a "char *module" and "char *exec". :)
However, maybe we'd better look at this more carefully. Here, we have
a problem with listing userspace probes (I mean, how perf probe can
list the probes that are on a user app).
Currently, it just ignores the module name if a probe is on a module:
probe:fuse_do_open (on fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir)
One possible solution is to show the module name right before the
symbol, the same way the kernel does:
probe:fuse_do_open (on fuse:fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with
isdir)
It seems your current proposal does the same thing:
probe_zsh:zfree (on /bin/zsh:0x45400)
Another way is to show it more verbosely, like below:
probe:fuse_do_open (at fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir
on fuse)
probe_zsh:zfree (at 0x45400 on /bin/zsh)
Any thoughts?
>
> Based on the outcomes of this discussion, I will modify the code and
> Documentation.
>
Thank you!
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
> >
> > All the discussions in this forum happened with respect to
> > <function>@<executable> form (esp since we started with the inode based
> > uprobes approach). Thats why I chose the
> > perf probe -u func@execfile. This also gave us a flexibility to define
> > two probes in two different executables in the same session/command.
> > Something like
> > perf probe -u func1@exec1 func2@exec2.
>
> I see, and sorry for changing my mind, but I think '-u execname symbol'
> style finally be good also for uprobe users when perf probe -u supports
> debuginfo, since they don't need to learn a different syntax.
>
> >
> >> I think -u option should have a path of the target binary, as below
> >>
> >> # perf probe -u /bin/zsh -a zfree
> >
> > Will --uprobe work as the long name option for -u or do you suggest
> > something else?
>
> Hmm, good question. Maybe we can use -x|--exec to define a uprobe event,
> because there is no need to give an executable file for kprobes events.
> # so that -x implies user space event on given execfile
>
Okay, then let's stick with perf probe -x executable <function-name>.
> >
> > 1. Its okay to use a switch like "-u" that restricts the probe
> > definition session to just the user space tracing. i.e We wont be able
> > to define a probe in user space executable and also a kernel probe in
> > one single session.
>
> I'm OK for that. The usage of perf probe is different from systemtap;
> perf probe do one thing at a time, systemtap do everything at a time.
>
> If someone want to define various probes in the kernel, they may have
> to call perf probe several times. And I'm OK for that, because all probe
> definition should be done before recording.
Right.
>
> So with perf probe, tracing is done following below 3 steps.
> 1. Definition
> 2. Recording
> 3. Analysis
>
> And each phase is done separately.
>
> > 2. Its okay to have just "target" and a flag to identify if the session
> > is a userspace probing session, instead of having separate fields for
> > userspace executable path and the module name.
> > i.e you are okay with patch "perf: rename target_module to target"
>
> Ah, that's good to me. Actually, I'm now trying to expand "target_module"
> to receive a path of offline module. It'll be better for that too.:-)
Okay, works.
>
> > 3. Currently perf_probe_point has one field "file" that was only used to
> > refer to source file. Uprobes overrides it to use it for executable file
> > too. One approach would be to add another field something like
> > "execfile" that we use to refer to executable files and use the current
> > field "file" for source files.
> > Also do you suggest that we rename the current
> > file field to "srcfile?
>
> Ah, I see. From the viewpoint of implementation, we just introduce
> a bool flag, which indicates user-probe, and a union, which has
> a "char *module" and "char *exec". :)
Okay,
>
> However, Maybe we'd better look this more carefully. Here, we have
> a problem with listing userspace probes (I mean how perf probe can
> list up the probes which is on a user app)
>
> Currently, it just ignores module name if a probe on a module.
> probe:fuse_do_open (on fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir)
>
> One possible solution is to show the module name right before the
> symbol as same as the kernel does.
>
> probe:fuse_do_open (on fuse:fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with
> isdir)
This looks better to me.
>
> It seems that current your proposal doing same thing
>
> probe_zsh:zfree (on /bin/zsh:0x45400)
>
> Another way is to show it more verbosely, like below.
>
> probe:fuse_do_open (at fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir
> on fuse)
> probe_zsh:zfree (at 0x45400 on /bin/zsh)
>
But I am okay with changing to this format too.
--
Thanks and Regards
Srikar
* Oleg Nesterov <[email protected]> [2011-06-13 21:57:01]:
> On 06/07, Srikar Dronamraju wrote:
> >
> > +int register_uprobe(struct inode *inode, loff_t offset,
> > + struct uprobe_consumer *consumer)
> > +{
> > + struct prio_tree_iter iter;
> > + struct list_head try_list, success_list;
> > + struct address_space *mapping;
> > + struct mm_struct *mm, *tmpmm;
> > + struct vm_area_struct *vma;
> > + struct uprobe *uprobe;
> > + int ret = -1;
> > +
> > + if (!inode || !consumer || consumer->next)
> > + return -EINVAL;
> > +
> > + if (offset > inode->i_size)
> > + return -EINVAL;
> > +
> > + uprobe = alloc_uprobe(inode, offset);
> > + if (!uprobe)
> > + return -ENOMEM;
> > +
> > + INIT_LIST_HEAD(&try_list);
> > + INIT_LIST_HEAD(&success_list);
> > + mapping = inode->i_mapping;
> > +
> > + mutex_lock(&uprobes_mutex);
> > + if (uprobe->consumers) {
> > + ret = 0;
> > + goto consumers_add;
> > + }
> > +
> > + mutex_lock(&mapping->i_mmap_mutex);
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
>
> I didn't actually read this patch yet, but this looks suspicious.
> Why begin == end == 0? Doesn't this mean we are ignoring the mappings
> with vm_pgoff != 0 ?
>
> Perhaps this should be offset >> PAGE_SIZE?
>
Okay,
I guess you meant something like this.
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
where pgoff == offset >> PAGE_SIZE
Right?
--
Thanks and Regards
Srikar
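A sketch of what that would look like. Note the shift: since 'offset' is
a byte offset, the page index is presumably offset >> PAGE_SHIFT rather
than offset >> PAGE_SIZE:
	pgoff_t pgoff = offset >> PAGE_SHIFT;

	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		/* only mappings whose file range covers the probed page */
		...
	}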
> > +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> > + unsigned long vaddr, uprobe_opcode_t opcode)
> > +{
> > + struct page *old_page, *new_page;
> > + void *vaddr_old, *vaddr_new;
> > + struct vm_area_struct *vma;
> > + unsigned long addr;
> > + int ret;
> > +
> > + /* Read the page with vaddr into memory */
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
>
> Sorry if this was already discussed... But why we are using FOLL_WRITE here?
> We are not going to write into this page, and this provokes the unnecessary
> cow, no?
Yes, we are not going to write to the page returned by get_user_pages
but to a copy of that page. The idea was that if we cow the page here,
we don't need to cow it at replace_page time, and since get_user_pages
knows the right way to cow the page, we don't have to write another
routine to cow it.
I am still not clear on your concern.
Is it that we should delay cowing the page until the time we actually
write into it?
or
Is it that we don't need to cow at all, since we are replacing a
file-backed page with an anon page?
I think we have to cow the page either at page replacement time or at
the beginning. I had tried the option of not cowing the page and it
failed, but I don't recollect why; back then we used write_protect_page
and replace_page from ksm.c.
>
> Also. This is called under down_read(mmap_sem), can't we race with
> access_process_vm() modifying the same memory?
Yes, we could be racing with access_process_vm on the same memory.
Do we have any option other than calling write_opcode/read_opcode
under down_write(mmap_sem)? I know that write_opcode worked when we
took down_write(mmap_sem). It's just that anon_vma_prepare() documents
that it should be called with mmap_sem held for read.
Also, Thomas had once asked why we were calling it under down_write.
Maybe the race with access_process_vm is a good enough reason to call
it with down_write.
--
Thanks and Regards
Srikar
--
Thanks and Regards
Srikar
On Mon, 2011-06-13 at 14:29 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2011-06-10 01:03:29]:
>
> > On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> > > + vaddr_old = kmap_atomic(old_page, KM_USER0);
> > > + vaddr_new = kmap_atomic(new_page, KM_USER1);
> > > +
> > > + memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
> > > + /* poke the new insn in, ASSUMES we don't cross page boundary */
> > > + addr = vaddr;
> > > + vaddr &= ~PAGE_MASK;
> > > + memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
> > > +
> > > + kunmap_atomic(vaddr_new);
> > > + kunmap_atomic(vaddr_old);
> >
> >
> > > + vaddr_new = kmap_atomic(page, KM_USER0);
> > > + vaddr &= ~PAGE_MASK;
> > > + memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> > > + kunmap_atomic(vaddr_new);
> > >
>
>
> >
> > Both sequences in resp {write,read}_opcode() assume the opcode doesn't
> > cross page boundaries but don't in fact have any assertions validating
> > this assumption.
> >
>
> read_opcode and write_opcode reads/writes just one breakpoint instruction
> I had the below note just above the write_opcode definition.
>
> /*
> * NOTE:
> * Expect the breakpoint instruction to be the smallest size instruction for
> * the architecture. If an arch has variable length instruction and the
> * breakpoint instruction is not of the smallest length instruction
> * supported by that architecture then we need to modify read_opcode /
> * write_opcode accordingly. This would never be a problem for archs that
> * have fixed length instructions.
> */
Whoever reads comments anyway? :-)
> Do we have archs which have a breakpoint instruction which isnt of the
> smallest instruction size for that arch. If we do have can we change the
> write_opcode/read_opcode while we support that architecture?
Why not put a simple WARN_ON_ONCE() in there that checks the assumption?
On Mon, 2011-06-13 at 19:00 +0200, Oleg Nesterov wrote:
>
> Also. This is called under down_read(mmap_sem), can't we race with
> access_process_vm() modifying the same memory?
Shouldn't matter COW and similar things are serialized using the pte
lock.
On 06/14, Srikar Dronamraju wrote:
>
> > > +static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
> > > + unsigned long vaddr, uprobe_opcode_t opcode)
> > > +{
> > > + struct page *old_page, *new_page;
> > > + void *vaddr_old, *vaddr_new;
> > > + struct vm_area_struct *vma;
> > > + unsigned long addr;
> > > + int ret;
> > > +
> > > + /* Read the page with vaddr into memory */
> > > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> >
> > Sorry if this was already discussed... But why we are using FOLL_WRITE here?
> > We are not going to write into this page, and this provokes the unnecessary
> > cow, no?
>
> Yes, We are not going to write to the page returned by get_user_pages
> but a copy of that page.
Yes I see. But the page returned by get_user_pages(write => 1) is already
a cow'ed copy (this mapping should be read-only).
> The idea was if we cow the page then we dont
> need to cow it at the replace_page time
Yes, replace_page() shouldn't cow.
> and since get_user_pages knows
> the right way to cow the page, we dont have to write another routine to
> cow the page.
Confused. write_opcode() allocs another page and does memcpy. This is
correct, but I don't understand the first cow.
> I am still not clear on your concern.
Probably I missed something... but could you please explain why we can't
- ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
+ ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &old_page, &vma);
?
> > Also. This is called under down_read(mmap_sem), can't we race with
> > access_process_vm() modifying the same memory?
>
> Yes, we could be racing with access_process_vm on the same memory.
>
> Do we have any other option other than making write_opcode/read_opcode
> being called under down_write(mmap_sem)?
I dunno. Probably we can simply ignore this issue, there are other ways
to modify this memory.
Oleg.
On 06/14, Peter Zijlstra wrote:
>
> On Mon, 2011-06-13 at 19:00 +0200, Oleg Nesterov wrote:
> >
> > Also. This is called under down_read(mmap_sem), can't we race with
> > access_process_vm() modifying the same memory?
>
> Shouldn't matter COW and similar things are serialized using the pte
> lock.
Yes, but afaics this doesn't matter. Suppose that write_opcode() is
called when access_process_vm() does copy_to_user_page(). We can cow
the page before memcpy() completes.
Oleg.
On 06/14, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <[email protected]> [2011-06-13 21:57:01]:
>
> > > + mutex_lock(&mapping->i_mmap_mutex);
> > > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> >
> > I didn't actually read this patch yet, but this looks suspicious.
> > Why begin == end == 0? Doesn't this mean we are ignoring the mappings
> > with vm_pgoff != 0 ?
> >
> > Perhaps this should be offset >> PAGE_SIZE?
> >
>
> Okay,
> I guess you meant something like this.
>
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
>
> where pgoff == offset >> PAGE_SIZE
> Right?
Yes, modulo s/PAGE_SIZE/PAGE_SHIFT. But please double check, I can be
easily wrong ;)
Oleg.
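For reference, a minimal sketch of the corrected walk being agreed on
above (not from the posted series; it assumes the 2.6-era prio_tree
interface, and the helper name is invented for illustration):

static void walk_vmas_covering(struct address_space *mapping, loff_t offset)
{
	struct vm_area_struct *vma;
	struct prio_tree_iter iter;
	pgoff_t pgoff = offset >> PAGE_SHIFT;	/* PAGE_SHIFT, not PAGE_SIZE */

	mutex_lock(&mapping->i_mmap_mutex);
	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
		/*
		 * Only vmas whose file range covers pgoff are visited;
		 * vm_pgoff need not be 0, so compute vaddr per vma.
		 */
	}
	mutex_unlock(&mapping->i_mmap_mutex);
}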
> >
> > /*
> > * NOTE:
> > * Expect the breakpoint instruction to be the smallest size instruction for
> > * the architecture. If an arch has variable length instruction and the
> > * breakpoint instruction is not of the smallest length instruction
> > * supported by that architecture then we need to modify read_opcode /
> > * write_opcode accordingly. This would never be a problem for archs that
> > * have fixed length instructions.
> > */
>
> Whoever reads comments anyway? :-)
>
> > Do we have archs which have a breakpoint instruction which isnt of the
> > smallest instruction size for that arch. If we do have can we change the
> > write_opcode/read_opcode while we support that architecture?
>
> Why not put a simple WARN_ON_ONCE() in there that checks the assumption?
Okay, will do.
--
Thanks and Regards
Srikar
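A sketch of what that assertion could look like (hypothetical helper;
vaddr and uprobe_opcode_sz as used in the patch):

static void check_opcode_within_page(unsigned long vaddr)
{
	/* the breakpoint instruction must not straddle a page boundary */
	WARN_ON_ONCE((vaddr & ~PAGE_MASK) + uprobe_opcode_sz > PAGE_SIZE);
}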
On Tue, 2011-06-14 at 16:27 +0200, Oleg Nesterov wrote:
> On 06/14, Peter Zijlstra wrote:
> >
> > On Mon, 2011-06-13 at 19:00 +0200, Oleg Nesterov wrote:
> > >
> > > Also. This is called under down_read(mmap_sem), can't we race with
> > > access_process_vm() modifying the same memory?
> >
> > Shouldn't matter COW and similar things are serialized using the pte
> > lock.
>
> Yes, but afaics this doesn't matter. Suppose that write_opcode() is
> called when access_process_vm() does copy_to_user_page(). We can cow
> the page before memcpy() completes.
access_process_vm() will end up doing a FOLL_WRITE itself when
copy_to_user_page() is called since write=1 in that case.
At that point we have a COW-race, someone wins, but the other will then
return the same page.
At this point further PTRACE pokes can indeed race with the memcpy in
write_opcode(). A possible fix would be to lock_page() around
copy_to_user_page() (it's already done in set_page_dirty_lock(), so
pulling it out shouldn't matter much).
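For reference, a rough sketch of that suggestion (not a posted patch),
applied to the write path of access_process_vm(); variable names follow
that function:

	lock_page(page);		/* pulled out of set_page_dirty_lock() */
	copy_to_user_page(vma, page, addr, maddr + offset, buf, bytes);
	set_page_dirty(page);		/* page lock already held */
	unlock_page(page);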
On 06/14, Peter Zijlstra wrote:
>
> On Tue, 2011-06-14 at 16:27 +0200, Oleg Nesterov wrote:
> > On 06/14, Peter Zijlstra wrote:
> > >
> > > On Mon, 2011-06-13 at 19:00 +0200, Oleg Nesterov wrote:
> > > >
> > > > Also. This is called under down_read(mmap_sem), can't we race with
> > > > access_process_vm() modifying the same memory?
> > >
> > > Shouldn't matter COW and similar things are serialized using the pte
> > > lock.
> >
> > Yes, but afaics this doesn't matter. Suppose that write_opcode() is
> > called when access_process_vm() does copy_to_user_page(). We can cow
> > the page before memcpy() completes.
>
> access_process_vm() will end up doing a FOLL_WRITE itself when
> copy_to_user_page() is called since write=1 in that case.
>
> At that point we have a COW-race, someone wins, but the other will then
> return the same page.
>
> At this point further PTRACE pokes can indeed race with the memcpy in
> write_opcode().
Currently it can't, write_opcode() does another cow. But that cow can,
and this is the same, yes.
> A possible fix would be to lock_page() around
> copy_to_user_page() (its already done in set_page_dirty_lock(), so
> pulling it out shouldn't matter much).
Yes, or write_opcode() could take mmap_sem for writing as Srikar suggests.
But do we really care? Whatever we do we can race with the other updates
to this memory. Say, someone can write to vma->vm_file.
Oleg.
On Tue, 2011-06-14 at 17:40 +0200, Oleg Nesterov wrote:
> But do we really care?
probably not :-)
> > > > +
> > > > + /* Read the page with vaddr into memory */
> > > > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> > >
> > > Sorry if this was already discussed... But why we are using FOLL_WRITE here?
> > > We are not going to write into this page, and this provokes the unnecessary
> > > cow, no?
> >
> > Yes, We are not going to write to the page returned by get_user_pages
> > but a copy of that page.
>
> Yes I see. But the page returned by get_user_pages(write => 1) is already
> a cow'ed copy (this mapping should be read-only).
>
> > The idea was if we cow the page then we dont
> > need to cow it at the replace_page time
>
> Yes, replace_page() shouldn't cow.
>
> > and since get_user_pages knows
> > the right way to cow the page, we dont have to write another routine to
> > cow the page.
>
> Confused. write_opcode() allocs another page and does memcpy. This is
> correct, but I don't understand the first cow.
>
We decided on get_user_pages(FOLL_WRITE|FOLL_FORCE) based on discussions
in these threads: https://lkml.org/lkml/2010/4/23/327 and
https://lkml.org/lkml/2010/5/12/119
The summary of those two sub-threads, as I understand it, was to have
get_user_pages do the "real" cow for us.
If I understand correctly, your concern is about the extra overhead
added by get_user_pages. Other than that, is there any side effect of
our forcing the cow through get_user_pages?
> > I am still not clear on your concern.
>
> Probably I missed something... but could you please explain why we can't
>
> - ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &old_page, &vma);
>
> ?
I tried the code with this change and it works for regular cases.
I am not sure if it affects cases where programs do mprotect.
So I am okay with not forcing the cow through get_user_pages.
>
> > > Also. This is called under down_read(mmap_sem), can't we race with
> > > access_process_vm() modifying the same memory?
> >
> > Yes, we could be racing with access_process_vm on the same memory.
> >
> > Do we have any other option other than making write_opcode/read_opcode
> > being called under down_write(mmap_sem)?
>
> I dunno. Probably we can simply ignore this issue, there are other ways
> to modify this memory.
>
Okay.
--
Thanks and Regards
Srikar
I still didn't actually read this/next patches, but
On 06/07, Srikar Dronamraju wrote:
>
> +#ifdef CONFIG_UPROBES
> + unsigned long uprobes_vaddr;
Srikar, I know it is very easy to blame the patches ;) But why does this
patch add mm->uprobes_vaddr ? Look, it is write-only, register/unregister
do
mm->uprobes_vaddr = (unsigned long) vaddr;
and it is not used otherwise. It is not possible to understand its purpose
without reading the next patches. And the code above looks very strange,
the next vma can overwrite uprobes_vaddr.
If possible, please try to re-split this series. If uprobes_vaddr is used
in 6/22, then this patch should introduce this member. Note that this is
only one particular example, there are a lot more.
> +int register_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> ...
> + mutex_lock(&mapping->i_mmap_mutex);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + loff_t vaddr;
> + struct task_struct *tsk;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> + if (!valid_vma(vma)) {
> + mmput(mm);
This looks deadlockable. If mmput()->atomic_dec_and_test() succeeds
unlink_file_vma() needs the same ->i_mmap_mutex, no?
I think you can simply remove mmput(). Why do you increment ->mm_users
in advance? I think you can do this right before list_add(), after all
valid_vma/etc checks.
> + vaddr = vma->vm_start + offset;
> + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
> + /* Not in this vma */
> + mmput(mm);
> + continue;
> + }
Not sure that "Not in this vma" is possible if we pass the correct pgoff
to vma_prio_tree_foreach()... but OK, I forgot everything I knew about
vma prio_tree.
So, we verified that vaddr is valid. Then,
> + tsk = get_mm_owner(mm);
> + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
how is it possible to map ->vm_file above TASK_SIZE?
And why do you need get/put_task_struct? You could simply read
TASK_SIZE_OF(tsk) under rcu_read_lock.
> +void unregister_uprobe(struct inode *inode, loff_t offset,
> + struct uprobe_consumer *consumer)
> +{
> ...
> +
> + mutex_lock(&mapping->i_mmap_mutex);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> + struct task_struct *tsk;
> +
> + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> + continue;
> +
> + mm = vma->vm_mm;
> +
> + if (!atomic_read(&mm->uprobes_count)) {
> + mmput(mm);
Again, mmput() doesn't look safe.
> + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list)
> + remove_breakpoint(mm, uprobe);
What if the application, say, unmaps the vma with bkpt before
unregister_uprobe() ? Or it can do mprotect(PROT_WRITE), then valid_vma()
fails. Probably this is fine, but mm->uprobes_count becomes wrong, no?
Oleg.
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> 1. Use mm->owner and walk thro the thread_group of mm->owner, siblings
> of mm->owner, siblings of parent of mm->owner. This should be
> good list to traverse. Not sure if this is an exhaustive
> enough list that all tasks that have a mm set to this mm_struct are
> walked through.
As per copy_process():
	/*
	 * Thread groups must share signals as well, and detached threads
	 * can only be started up within the thread group.
	 */
	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
		return ERR_PTR(-EINVAL);

	/*
	 * Shared signal handlers imply shared VM. By way of the above,
	 * thread groups also imply shared VM. Blocking this case allows
	 * for various simplifications in other code.
	 */
	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
		return ERR_PTR(-EINVAL);
CLONE_THREAD implies CLONE_VM, but not the other way around, we
therefore would be able to CLONE_VM and not be part of the primary
owner's thread group.
This is of course all terribly sad..
On 06/15, Srikar Dronamraju wrote:
>
> > > > > +
> > > > > + /* Read the page with vaddr into memory */
> > > > > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> > > >
> > > > Sorry if this was already discussed... But why we are using FOLL_WRITE here?
> > > > We are not going to write into this page, and this provokes the unnecessary
> > > > cow, no?
> > >
> > > Yes, We are not going to write to the page returned by get_user_pages
> > > but a copy of that page.
> >
> > Yes I see. But the page returned by get_user_pages(write => 1) is already
> > a cow'ed copy (this mapping should be read-only).
> >
> > > The idea was if we cow the page then we dont
> > > need to cow it at the replace_page time
> >
> > Yes, replace_page() shouldn't cow.
> >
> > > and since get_user_pages knows
> > > the right way to cow the page, we dont have to write another routine to
> > > cow the page.
> >
> > Confused. write_opcode() allocs another page and does memcpy. This is
> > correct, but I don't understand the first cow.
> >
>
> we decided on get_user_pages(FOLL_WRITE|FOLL_FORCE) based on discussions
> in these threads https://lkml.org/lkml/2010/4/23/327 and
> https://lkml.org/lkml/2010/5/12/119
Failed to Connect.
> Summary of those two sub-threads as I understand was to have
> get_user_pages do the "real" cow for us.
>
> If I understand correctly, your concern is on the extra overhead added
> by the get_user_pages.
No. My main concern is that I do not understand why we need an extra cow.
This is fine, I am not a vm expert. But I think it is not fine that you
can't explain why your code needs it ;)
What does this 'get_user_pages do the "real" cow for us' actually mean?
It does not do the cow for us, __replace_page() does the actual/final
cow. It re-installs the modified copy of the page returned by
get_user_pages() at the same pte.
> > Probably I missed something... but could you please explain why we can't
> >
> > - ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 1, 1, &old_page, &vma);
> > + ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &old_page, &vma);
> >
> > ?
>
> I tried the code with this change and it works for regular cases.
> I am not sure if it affects cases where programs do mprotect
Hmm... How can mprotect make a difference? This mapping should be read
only, and we are not going to do pte_mkwrite.
> So I am okay to not force cow through get_user_pages.
I am okay either way ;) But, imho, if we use FOLL_WRITE|FOLL_FORCE then
it would be nice to document why this is needed.
Oleg.
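For reference, a sketch of the alternative being discussed (not the
posted code; the helper is hypothetical, and the old get_user_pages()
signature matches the one used in the patch):

static int grab_orig_page(struct task_struct *tsk, unsigned long vaddr,
			  struct page **old_page, struct vm_area_struct **vma)
{
	/* write=0, force=0: no cow here; __replace_page() installs the copy */
	int ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, old_page, vma);

	return (ret == 1) ? 0 : -EFAULT;
}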
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
>
> 2. Install probes on all mm's that have mapped the probes and filter
> only at probe hit time.
Right, seems sensible given the alternatives, we could always look at
optimizing this later.
On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> + up_write(&mm->mmap_sem);
> + mutex_lock(&uprobes_mutex);
> + down_read(&mm->mmap_sem);
egads, and all that without a comment explaining why you think that is
even remotely sane.
I'm not at all convinced, it would expose the mmap() even though you
could still decide to tear it down if this function were to fail, I bet
there's some funnies there.
* Peter Zijlstra <[email protected]> [2011-06-15 20:11:26]:
> On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > + up_write(&mm->mmap_sem);
> > + mutex_lock(&uprobes_mutex);
> > + down_read(&mm->mmap_sem);
>
> egads, and all that without a comment explaining why you think that is
> even remotely sane.
>
> I'm not at all convinced, it would expose the mmap() even though you
> could still decide to tear it down if this function were to fail, I bet
> there's some funnies there.
The problem is with lock ordering. The register/unregister operations
acquire uprobes_mutex (which serializes register, unregister and the
mmap hook) and then hold mmap_sem for read before they insert a
breakpoint.
But the mmap hook would be called with mmap_sem held for write. So
acquiring uprobes_mutex there can result in a deadlock. Hence we release
the mmap_sem, take the uprobes_mutex and then re-acquire the mmap_sem.
After we re-acquire the mmap_sem, we check that the vma is still valid.
Do we have better solutions?
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-15 19:41:59]:
> On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > 1. Use mm->owner and walk thro the thread_group of mm->owner, siblings
> > of mm->owner, siblings of parent of mm->owner. This should be
> > good list to traverse. Not sure if this is an exhaustive
> > enough list that all tasks that have a mm set to this mm_struct are
> > walked through.
>
> As per copy_process():
>
> /*
> * Thread groups must share signals as well, and detached threads
> * can only be started up within the thread group.
> */
> if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> return ERR_PTR(-EINVAL);
>
> /*
> * Shared signal handlers imply shared VM. By way of the above,
> * thread groups also imply shared VM. Blocking this case allows
> * for various simplifications in other code.
> */
> if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
> return ERR_PTR(-EINVAL);
>
> CLONE_THREAD implies CLONE_VM, but not the other way around, we
> therefore would be able to CLONE_VM and not be part of the primary
> owner's thread group.
>
> This is of course all terribly sad..
Agree.
If clone(CLONE_VM) were to be done by a thread_group leader, we can walk
through the siblings of the parent of mm->owner.
However, if clone(CLONE_VM) were to be done by a non-thread_group_leader
thread, then we don't even seem to add it to init_task, i.e. I don't
think we can reach such a thread even when we walk through
do_each_thread(g,t) { .. } while_each_thread(g,t);
right?
--
Thanks and Regards
Srikar
> >
> > +#ifdef CONFIG_UPROBES
> > + unsigned long uprobes_vaddr;
>
> Srikar, I know it is very easy to blame the patches ;) But why does this
> patch add mm->uprobes_vaddr ? Look, it is write-only, register/unregister
> do
>
> mm->uprobes_vaddr = (unsigned long) vaddr;
>
> and it is not used otherwise. It is not possible to understand its purpose
mm->uprobes_vaddr is used in the helper routines insert_breakpoint and
remove_breakpoint, which are just stubs here. mm->uprobes_vaddr caches
the vaddr for subsequent use in insert_breakpoint.
I could have moved mm->uprobes_vaddr to the 6th patch, which implements
the insert_breakpoint routine. However, at the time I felt that people
would comment back saying we do all the checks and compute the correct
vaddr, but don't cache it for subsequent use.
I will move the uprobes_vaddr initialization to the next patch. In fact,
I might remove mm->uprobes_vaddr altogether in the subsequent posting.
In one of the previous postings, I had the patches that implemented the
helper routines (like insert_breakpoint) first, with the patches for the
wrapper routines (like register/unregister) following. I was told that
it was tough to understand the context in which these helper routines
would be called. So I moved to having the wrapper routines with stubs
and implementing the stubs later.
> without reading the next patches. And the code above looks very strange,
> the next vma can overwrite uprobes_vaddr.
For this posting, handling two vmas for the same inode in the same mm
was a TODO. Since you and Peter have both raised this, I will handle it
in the next posting. I will give a brief description of how I plan to
implement this in my response to Peter's comments. Please do review and
comment on it.
>
> If possible, please try to re-split this series. If uprobes_vaddr is used
> in 6/22, then this patch should introduce this member. Note that this is
> only one particular example, there are a lot more.
>
> > +int register_uprobe(struct inode *inode, loff_t offset,
> > + struct uprobe_consumer *consumer)
> > +{
> > ...
> > + mutex_lock(&mapping->i_mmap_mutex);
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> > + loff_t vaddr;
> > + struct task_struct *tsk;
> > +
> > + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> > + continue;
> > +
> > + mm = vma->vm_mm;
> > + if (!valid_vma(vma)) {
> > + mmput(mm);
>
> This looks deadlockable. If mmput()->atomic_dec_and_test() succeeds
> unlink_file_vma() needs the same ->i_mmap_mutex, no?
okay,
>
> I think you can simply remove mmput(). Why do you increment ->mm_users
> in advance? I think you can do this right before list_add(), after all
> valid_vma/etc checks.
Okay, will modify as suggested.
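For reference, a minimal sketch of what that reordering could look like
inside the register_uprobe() vma walk (not from the posted patch; the
variables are the ones used in the quoted code):

	if (!valid_vma(vma))
		continue;
	vaddr = vma->vm_start + offset - (vma->vm_pgoff << PAGE_SHIFT);
	if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
		continue;
	/* take the mm reference only after all the checks have passed */
	if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
		continue;
	mm = vma->vm_mm;
	mm->uprobes_vaddr = vaddr;
	list_add(&mm->uprobes_list, &try_list);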
>
> > + vaddr = vma->vm_start + offset;
> > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
> > + /* Not in this vma */
> > + mmput(mm);
> > + continue;
> > + }
>
> Not sure that "Not in this vma" is possible if we pass the correct pgoff
> to vma_prio_tree_foreach()... but OK, I forgot everything I knew about
> vma prio_tree.
>
I was asked what would happen if the arithmetic to arrive at vaddr ended
up outside the vma's range.
> So, we verified that vaddr is valid. Then,
>
> > + tsk = get_mm_owner(mm);
> > + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
>
> how it it possible to map ->vm_file above TASK_SIZE ?
Same as above. I will rethink both of these checks.
>
> And why do you need get/put_task_struct? You could simply read
> TASK_SIZE_OF(tsk) under rcu_read_lock.
Yes, for the register/unregister case I could have just done the check
under rcu_read_lock instead of doing a get/put_task_struct. Since I
needed get_mm_owner() for insert/remove_breakpoint, I thought I would
reuse it here.
>
> > +void unregister_uprobe(struct inode *inode, loff_t offset,
> > + struct uprobe_consumer *consumer)
> > +{
> > ...
> > +
> > + mutex_lock(&mapping->i_mmap_mutex);
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> > + struct task_struct *tsk;
> > +
> > + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> > + continue;
> > +
> > + mm = vma->vm_mm;
> > +
> > + if (!atomic_read(&mm->uprobes_count)) {
> > + mmput(mm);
>
> Again, mmput() doesn't look safe.
Okay, I will increment the mm_users while adding to the list.
>
> > + list_for_each_entry_safe(mm, tmpmm, &tmp_list, uprobes_list)
> > + remove_breakpoint(mm, uprobe);
>
> What if the application, say, unmaps the vma with bkpt before
> unregister_uprobe() ? Or it can do mprotect(PROT_WRITE), then valid_vma()
> fails. Probably this is fine, but mm->uprobes_count becomes wrong, no?
Okay, will add a hook in unmap to keep the mm->uprobes_count sane.
>
> Oleg.
>
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:26]:
> On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > +/*
> > + * There could be threads that have hit the breakpoint and are entering the
> > + * notifier code and trying to acquire the uprobes_treelock. The thread
> > + * calling delete_uprobe() that is removing the uprobe from the rb_tree can
> > + * race with these threads and might acquire the uprobes_treelock compared
> > + * to some of the breakpoint hit threads. In such a case, the breakpoint hit
> > + * threads will not find the uprobe. Finding if a "trap" instruction was
> > + * present at the interrupting address is racy. Hence provide some extra
> > + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> > + * the uprobes_treelock before the uprobe is removed from the rbtree.
> > + */
>
> 'some' extra time doesn't really sound convincing to me. Either it is
> sufficient to avoid the race or it is not. It reads to me like: we add a
> delay so that the race mostly doesn't occur. Not good ;-)
The extra time provided is sufficient to avoid the race, so I will
modify the comment to say "sufficient" instead of "some".
>
> > +static void delete_uprobe(struct uprobe *uprobe)
> > +{
> > + unsigned long flags;
> > +
> > + synchronize_sched();
> > + spin_lock_irqsave(&uprobes_treelock, flags);
> > + rb_erase(&uprobe->rb_node, &uprobes_tree);
> > + spin_unlock_irqrestore(&uprobes_treelock, flags);
> > + iput(uprobe->inode);
> > +}
>
> Also what are the uprobe lifetime rules here? Does it still exist after
> this returns?
>
> The comment in del_consumer() that says: 'drop creation ref' worries me
> and makes me thing that is the last reference around and the uprobe will
> be freed right there, which clearly cannot happen since its not yet
> removed from the RB-tree.
>
When del_consumer() is called in unregister_uprobe(), the uprobe has at
least two references (or more if the uprobe is hit): one from creation
time and the other through find_uprobe(), called in unregister_uprobe
before del_consumer. So the reference dropped in del_consumer is never
the last reference. I commented it as the creation reference so that the
find_uprobe and the put_uprobe() before return would match.
If the comment is confusing I can delete it, or reword it as suggested
by Steven Rostedt, i.e. /* Have caller drop the creation ref */
I would prefer to delete the comment.
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-10 01:03:24]:
> On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, 0) {
> > + loff_t vaddr;
> > + struct task_struct *tsk;
> > +
> > + if (!atomic_inc_not_zero(&vma->vm_mm->mm_users))
> > + continue;
> > +
> > + mm = vma->vm_mm;
> > + if (!valid_vma(vma)) {
> > + mmput(mm);
> > + continue;
> > + }
> > +
> > + vaddr = vma->vm_start + offset;
> > + vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > + if (vaddr < vma->vm_start || vaddr > vma->vm_end) {
> > + /* Not in this vma */
> > + mmput(mm);
> > + continue;
> > + }
> > + tsk = get_mm_owner(mm);
> > + if (tsk && vaddr > TASK_SIZE_OF(tsk)) {
> > + /*
> > + * We cannot have a virtual address that is
> > + * greater than TASK_SIZE_OF(tsk)
> > + */
> > + put_task_struct(tsk);
> > + mmput(mm);
> > + continue;
> > + }
> > + put_task_struct(tsk);
> > + mm->uprobes_vaddr = (unsigned long) vaddr;
> > + list_add(&mm->uprobes_list, &try_list);
> > + }
>
> This still falls flat on its face when there's multiple maps of the same
> text in one mm.
>
To address this we will use a uprobe_info structure:
struct uprobe_info {
	unsigned long		uprobes_vaddr;
	struct mm_struct	*mm;
	struct list_head	uprobes_list;
};
and remove the uprobes_list and uprobes_vaddr entries from the mm
structure. The uprobe_info structures will be created in the
vma_prio_tree_foreach loop as and when required. Since we now have
i_mmap_mutex (a mutex, not a spinlock), allocating the uprobe_info
structure as and when required should be okay.
Agree?
--
Thanks and Regards
Srikar
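For reference, a sketch of how such a uprobe_info could be filled in
from the vma walk (not from a posted patch; the helper name is
hypothetical):

static int add_uprobe_info(struct vm_area_struct *vma, loff_t offset,
			   struct list_head *try_list)
{
	struct uprobe_info *ui = kzalloc(sizeof(*ui), GFP_KERNEL);

	if (!ui)
		return -ENOMEM;

	/* one entry per vma, so multiple maps in one mm each get a vaddr */
	ui->mm = vma->vm_mm;
	ui->uprobes_vaddr = vma->vm_start + offset -
					(vma->vm_pgoff << PAGE_SHIFT);
	list_add(&ui->uprobes_list, try_list);
	return 0;
}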
(2011/06/14 20:56), Srikar Dronamraju wrote:
>>>> I think -u option should have a path of the target binary, as below
>>>>
>>>> # perf probe -u /bin/zsh -a zfree
>>>
>>> Will --uprobe work as the long name option for -u or do you suggest
>>> something else?
>>
>> Hmm, good question. Maybe we can use -x|--exec to define a uprobe event,
>> because there is no need to give an executable file for kprobes events.
>> # so that -x implies user space event on given execfile
>>
>
> Okay, then lets stick with perf probe -x executable <function-name>
> then.
Thanks:-)
>> However, Maybe we'd better look this more carefully. Here, we have
>> a problem with listing userspace probes (I mean how perf probe can
>> list up the probes which is on a user app)
>>
>> Currently, it just ignores module name if a probe on a module.
>> probe:fuse_do_open (on fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir)
>>
>> One possible solution is to show the module name right before the
>> symbol as same as the kernel does.
>>
>> probe:fuse_do_open (on fuse:fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with
>> isdir)
>
> This looks better to me.
OK, this will be good for starting point. If someone complains,
we can switch to below format.
>> Another way is to show it more verbosely, like below.
>>
>> probe:fuse_do_open (at fuse_do_open@ksrc/linux-2.6/fs/fuse/file.c with isdir
>> on fuse)
>> probe_zsh:zfree (at 0x45400 on /bin/zsh)
>>
>
> But I am okay with changing to this format too.
>
Thanks!
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Thu, 2011-06-16 at 10:56 +0530, Srikar Dronamraju wrote:
> If the comment is confusing I can delete it or reword it as suggested by
> Steven Rostedt which is /* Have caller drop the creation ref */
>
> I would prefer to delete the comment.
Yeah, just drop the thing.
On Thu, 2011-06-16 at 09:41 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2011-06-15 19:41:59]:
>
> > On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > > 1. Use mm->owner and walk thro the thread_group of mm->owner, siblings
> > > of mm->owner, siblings of parent of mm->owner. This should be
> > > good list to traverse. Not sure if this is an exhaustive
> > > enough list that all tasks that have a mm set to this mm_struct are
> > > walked through.
> >
> > As per copy_process():
> >
> > /*
> > * Thread groups must share signals as well, and detached threads
> > * can only be started up within the thread group.
> > */
> > if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> > return ERR_PTR(-EINVAL);
> >
> > /*
> > * Shared signal handlers imply shared VM. By way of the above,
> > * thread groups also imply shared VM. Blocking this case allows
> > * for various simplifications in other code.
> > */
> > if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
> > return ERR_PTR(-EINVAL);
> >
> > CLONE_THREAD implies CLONE_VM, but not the other way around, we
> > therefore would be able to CLONE_VM and not be part of the primary
> > owner's thread group.
> >
> > This is of course all terribly sad..
>
> Agree,
>
> If clone(CLONE_VM) were to be done by a thread_group leader, we can walk
> thro the siblings of parent of mm->owner.
>
> However if clone(CLONE_VM) were to be done by non thread_group_leader
> thread, then we dont even seem to add it to the init_task. i.e I dont
> think we can refer to such a thread even when we walk thro
> do_each_thread(g,t) { .. } while_each_thread(g,t);
>
> right?
No, we initialize p->group_leader = p; and only change that for
CLONE_THREAD, so a clone without CLONE_THREAD always results in a new
thread group leader, which are always added to the init_task list.
Or I'm now confused, which isn't at all impossible with that code ;-)
* Peter Zijlstra <[email protected]> [2011-06-16 11:46:22]:
> On Thu, 2011-06-16 at 09:41 +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <[email protected]> [2011-06-15 19:41:59]:
> >
> > > On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > > > 1. Use mm->owner and walk thro the thread_group of mm->owner, siblings
> > > > of mm->owner, siblings of parent of mm->owner. This should be
> > > > good list to traverse. Not sure if this is an exhaustive
> > > > enough list that all tasks that have a mm set to this mm_struct are
> > > > walked through.
> > >
> > > As per copy_process():
> > >
> > > /*
> > > * Thread groups must share signals as well, and detached threads
> > > * can only be started up within the thread group.
> > > */
> > > if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> > > return ERR_PTR(-EINVAL);
> > >
> > > /*
> > > * Shared signal handlers imply shared VM. By way of the above,
> > > * thread groups also imply shared VM. Blocking this case allows
> > > * for various simplifications in other code.
> > > */
> > > if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
> > > return ERR_PTR(-EINVAL);
> > >
> > > CLONE_THREAD implies CLONE_VM, but not the other way around, we
> > > therefore would be able to CLONE_VM and not be part of the primary
> > > owner's thread group.
> > >
> > > This is of course all terribly sad..
> >
> > Agree,
> >
> > If clone(CLONE_VM) were to be done by a thread_group leader, we can walk
> > thro the siblings of parent of mm->owner.
> >
> > However if clone(CLONE_VM) were to be done by non thread_group_leader
> > thread, then we dont even seem to add it to the init_task. i.e I dont
> > think we can refer to such a thread even when we walk thro
> > do_each_thread(g,t) { .. } while_each_thread(g,t);
> >
> > right?
>
> No, we initialize p->group_leader = p; and only change that for
> CLONE_THREAD, so a clone without CLONE_THREAD always results in a new
> thread group leader, which are always added to the init_task list.
>
Ahh .. I missed the p->group_leader = p thing.
In which case, shouldn't traversing all the tasks of all siblings of the
parent of mm->owner provide us all the tasks that are linked to this mm?
Right?
Agree that we can bother about this a little later.
--
Thanks and Regards
Srikar
On Thu, 2011-06-16 at 15:24 +0530, Srikar Dronamraju wrote:
>
> Ahh .. I missed the p->group_leader = p thing.
>
> In which case, shouldnt traversing all the tasks of all siblings of
> parent of mm->owner should provide us all the the tasks that have linked
> to mm. Right?
Yes, I think so, stopping the hierarchy walk when we find a
sibling/child with a different mm.
> Agree that we can bother about this a little later.
*nod*
On Tue, 2011-06-07 at 18:30 +0530, Srikar Dronamraju wrote:
> +void uprobe_notify_resume(struct pt_regs *regs)
> +{
> + struct vm_area_struct *vma;
> + struct uprobe_task *utask;
> + struct mm_struct *mm;
> + struct uprobe *u = NULL;
> + unsigned long probept;
> +
> + utask = current->utask;
> + mm = current->mm;
> + if (!utask || utask->state == UTASK_BP_HIT) {
> + probept = get_uprobe_bkpt_addr(regs);
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, probept);
> + if (vma && valid_vma(vma))
> + u = find_uprobe(vma->vm_file->f_mapping->host,
> + probept - vma->vm_start +
> + (vma->vm_pgoff << PAGE_SHIFT));
> + up_read(&mm->mmap_sem);
> + if (!u)
> + goto cleanup_ret;
> + if (!utask) {
> + utask = add_utask();
> + if (!utask)
> + goto cleanup_ret;
So if we fail to allocate task state...
> + }
> + /* TODO Start queueing signals. */
> + utask->active_uprobe = u;
> + handler_chain(u, regs);
> + utask->state = UTASK_SSTEP;
> + if (!pre_ssout(u, regs, probept))
> + user_enable_single_step(current);
> + else
> + goto cleanup_ret;
> + } else if (utask->state == UTASK_SSTEP) {
> + u = utask->active_uprobe;
> + if (sstep_complete(u, regs)) {
> + put_uprobe(u);
> + utask->active_uprobe = NULL;
> + utask->state = UTASK_RUNNING;
> + user_disable_single_step(current);
> + xol_free_insn_slot(current);
> +
> + /* TODO Stop queueing signals. */
> + }
> + }
> + return;
> +
> +cleanup_ret:
> + if (u) {
> + down_read(&mm->mmap_sem);
> + if (!set_orig_insn(current, u, probept, true))
we try to undo the probe? That doesn't make any sense. I thought you
meant to return to userspace, let it re-take the trap and try again
until you do manage to allocate the user resource.
This behaviour makes probes totally unreliable under memory pressure.
> + atomic_dec(&mm->uprobes_count);
> + up_read(&mm->mmap_sem);
> + put_uprobe(u);
> + } else {
> + /*TODO Return SIGTRAP signal */
> + }
> + if (utask) {
> + utask->active_uprobe = NULL;
> + utask->state = UTASK_RUNNING;
> + }
> + set_instruction_pointer(regs, probept);
> +}
Also, there's a scary amount of TODO in there...
On Thu, 2011-06-16 at 08:56 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <[email protected]> [2011-06-15 20:11:26]:
>
> > On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > > + up_write(&mm->mmap_sem);
> > > + mutex_lock(&uprobes_mutex);
> > > + down_read(&mm->mmap_sem);
> >
> > egads, and all that without a comment explaining why you think that is
> > even remotely sane.
> >
> > I'm not at all convinced, it would expose the mmap() even though you
> > could still decide to tear it down if this function were to fail, I bet
> > there's some funnies there.
>
> The problem is with lock ordering. register/unregister operations
> acquire uprobes_mutex (which serializes register unregister and the
> mmap_hook) and then holds mmap_sem for read before they insert a
> breakpoint.
>
> But the mmap hook would be called with mmap_sem held for write. So
> acquiring uprobes_mutex can result in deadlock. Hence we release the
> mmap_sem, take the uprobes_mutex and then again hold the mmap_sem.
Sure, I saw why you wanted to do it, I'm just not quite convinced it's
safe to do, and something like this definitely wants a comment explaining
why it's safe to drop mmap_sem.
> After we re-acquire the mmap_sem, we do check if the vma is valid.
But you don't on the return path, and if !ret
mmap_region():unmap_and_free_vma will be touching vma again to remove
it.
> Do we have better solutions?
/me kicks the brain into gear and walks off to get a fresh cup of tea.
So the reason we take uprobes_mutex there is to avoid probes from going
away while you're installing them, right?
So we start by doing this add_to_temp_list() thing (horrid name), which
iterates the probes on this inode under uprobes_treelock and adds them
to a list.
Then we iterate the list, installing the probles.
How about we make the initial pass under uprobes_treelock take a
references on the probe, and then after install_breakpoint() succeeds we
again take uprobes_treelock and validate the uprobe still exists in the
tree and drop the extra reference, if not we simply remove the
breakpoint again and continue like it never existed.
That should avoid the need to take uprobes_mutex and not require
dropping mmap_sem, right?
> > +
> > +cleanup_ret:
> > + if (u) {
> > + down_read(&mm->mmap_sem);
> > + if (!set_orig_insn(current, u, probept, true))
>
> we try to undo the probe? That doesn't make any sense. I thought you
> meant to return to userspace, let it re-take the trap and try again
> until you do manage to allocate the user resource.
I meant removing the probe itself:
https://lkml.org/lkml/2011/4/21/279
We could try resetting and retrying the trap. Just that we might end up
looping under memory pressure.
>
> This behaviour makes probes totally unreliable under memory pressure.
Under memory pressure we could be unreliable.
>
> > + atomic_dec(&mm->uprobes_count);
> > + up_read(&mm->mmap_sem);
> > + put_uprobe(u);
> > + } else {
> > + /*TODO Return SIGTRAP signal */
> > + }
> > + if (utask) {
> > + utask->active_uprobe = NULL;
> > + utask->state = UTASK_RUNNING;
> > + }
> > + set_instruction_pointer(regs, probept);
> > +}
>
> Also, there's a scary amount of TODO in there...
All of those deal with delaying the signals. I am working on it at this
moment.
--
Thanks and Regards
Srikar
On Thu, 2011-06-16 at 17:34 +0530, Srikar Dronamraju wrote:
> > > +
> > > +cleanup_ret:
> > > + if (u) {
> > > + down_read(&mm->mmap_sem);
> > > + if (!set_orig_insn(current, u, probept, true))
> >
> > we try to undo the probe? That doesn't make any sense. I thought you
> > meant to return to userspace, let it re-take the trap and try again
> > until you do manage to allocate the user resource.
>
> I meant removing the probe itself
> https://lkml.org/lkml/2011/4/21/279
>
> We could try reseting and retrying the trap. Just that we might end up
> looping under memory pressure.
>
> >
> > This behaviour makes probes totally unreliable under memory pressure.
>
> Under memory pressure we could be unreliable.
But that is total crap, there's nothing worse than unreliable debug
tools.
On Tue, 2011-06-07 at 18:28 +0530, Srikar Dronamraju wrote:
> +static int __replace_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int err = -EFAULT;
> +
> + addr = page_address_in_vma(page, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (pmd_trans_huge(*pmd) || (!pmd_present(*pmd)))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out;
Shouldn't we verify that the obtained pte does indeed refer to our @page
here?
> + get_page(kpage);
> + page_add_new_anon_rmap(kpage, vma, addr);
> +
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
> +
> + page_remove_rmap(page);
> + if (!page_mapped(page))
> + try_to_free_swap(page);
> + put_page(page);
> + pte_unmap_unlock(ptep, ptl);
> + err = 0;
> +
> +out:
> + return err;
> +}
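For reference, a sketch of the verification being asked about above (not
from the posted patch), placed right after the pte_offset_map_lock() in
__replace_page():

	/* make sure the pte still maps the page we looked up earlier */
	if (!pte_present(*ptep) || pte_pfn(*ptep) != page_to_pfn(page)) {
		pte_unmap_unlock(ptep, ptl);
		goto out;
	}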
* Peter Zijlstra <[email protected]> [2011-06-16 14:00:26]:
> On Thu, 2011-06-16 at 08:56 +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <[email protected]> [2011-06-15 20:11:26]:
> >
> > > On Tue, 2011-06-07 at 18:29 +0530, Srikar Dronamraju wrote:
> > > > + up_write(&mm->mmap_sem);
> > > > + mutex_lock(&uprobes_mutex);
> > > > + down_read(&mm->mmap_sem);
> > >
> > > egads, and all that without a comment explaining why you think that is
> > > even remotely sane.
> > >
> > > I'm not at all convinced, it would expose the mmap() even though you
> > > could still decide to tear it down if this function were to fail, I bet
> > > there's some funnies there.
> >
> > The problem is with lock ordering. register/unregister operations
> > acquire uprobes_mutex (which serializes register unregister and the
> > mmap_hook) and then holds mmap_sem for read before they insert a
> > breakpoint.
> >
> > But the mmap hook would be called with mmap_sem held for write. So
> > acquiring uprobes_mutex can result in deadlock. Hence we release the
> > mmap_sem, take the uprobes_mutex and then again hold the mmap_sem.
>
> Sure, I saw why you wanted to do it, I'm just not quite convinced its
> safe to do and something like this definitely wants a comment explaining
> why its safe to drop mmap_sem.
>
> > After we re-acquire the mmap_sem, we do check if the vma is valid.
>
> But you don't on the return path, and if !ret
> mmap_region():unmap_and_free_vma will be touching vma again to remove
> it.
>
Agree.
> > Do we have better solutions?
>
> /me kicks the brain into gear and walks off to get a fresh cup of tea.
>
> So the reason we take uprobes_mutex there is to avoid probes from going
> away while you're installing them, right?
It serializes register/unregister/mmap operations.
>
> So we start by doing this add_to_temp_list() thing (horrid name), which
> iterates the probes on this inode under uprobes_treelock and adds them
> to a list.
>
> Then we iterate the list, installing the probles.
>
> How about we make the initial pass under uprobes_treelock take a
> references on the probe, and then after install_breakpoint() succeeds we
> again take uprobes_treelock and validate the uprobe still exists in the
> tree and drop the extra reference, if not we simply remove the
> breakpoint again and continue like it never existed.
>
> That should avoid the need to take uprobes_mutex and not require
> dropping mmap_sem, right?
Now, since register and mmap operations can run in parallel, we could
have subtle race conditions like this:
1. register_uprobe inserts the uprobe in the RB tree.
2. register_uprobe loops through the vmas and inserts breakpoints.
3. mmap is called for the same inode; mmap_uprobe() takes a reference.
4. mmap completes the insertion and releases the reference.
5. register_uprobe tries to install the breakpoint on one vma and fails,
and not due to -ESRCH or -EEXIST.
6. register_uprobe rolls back all the breakpoints it installed, except
the one inserted by mmap.
We end up with a breakpoint that we have inserted but haven't cleared.
Similarly, unregister_uprobe might be looping to remove the breakpoints
when mmap comes in, installs the breakpoint and returns.
unregister_uprobe might then erase the uprobe from the rbtree after mmap
is done.
--
Thanks and Regards
Srikar
On 06/16, Srikar Dronamraju wrote:
>
> In which case, shouldnt traversing all the tasks of all siblings of
> parent of mm->owner should provide us all the the tasks that have linked
> to mm. Right?
I don't think so.
Even if the initial mm->owner never exits (iow, mm->owner is never changed),
the "deep" CLONE_VM child can be reparented to init if its parent exits.
> Agree that we can bother about this a little later.
Agreed.
Oh. We should move ->mm from task_struct to signal_struct, but we need to
change the code like get_task_mm(). And then instead of mm->owner we can
have mm->processes list. Perhaps. This can be used by zap_threads() too.
Oleg.
On Thu, 2011-06-16 at 18:30 +0530, Srikar Dronamraju wrote:
> Now since a register and mmap operations can run in parallel, we could
> have subtle race conditions like this:
>
> 1. register_uprobe inserts the uprobe in RB tree.
> 2. register_uprobe loops thro vmas and inserts breakpoints.
>
> 3. mmap is called for same inode, mmap_uprobe() takes reference;
> 4. mmap completes insertion and releases reference.
>
> 5. register uprobe tries to install breakpoint on one vma fails and not
> due to -ESRCH or -EEXIST.
> 6. register_uprobe rolls back all install breakpoints except the one
> inserted by mmap.
>
> We end up with breakpoints that we have inserted by havent cleared.
>
> Similarly unregister_uprobe might be looping to remove the breakpoints
> when mmap comes in installs the breakpoint and returns.
> unregister_uprobe might erase the uprobe from rbtree after mmap is done.
Well yes, but that's mostly because of how you use those lists.
int __register_uprobe(...)
{
	uprobe = alloc_uprobe(...);	// find or insert in tree

	vma_prio_tree_foreach(..) {
		// get mm ref, add to list blah blah
	}

	list_for_each_entry_safe() {
		// del from list etc..
		down_read(mm->mmap_sem);
		ret = install_breakpoint();
		if (ret && ret != -ESRCH && ret != -EEXIST) {
			up_read(..);
			goto fail;
		}
		up_read(..);
	}

	return 0;

fail:
	list_for_each_entry_safe() {
		// del from list, put mm
	}

	return ret;
}

void __unregister_uprobe(...)
{
	uprobe = find_uprobe();		// ref++
	if (delete_consumer(...))	// includes tree removal on last consumer
					// implies we own the last ref
		return;			// consumers

	vma_prio_tree_foreach() {
		// create list
	}

	list_for_each_entry_safe() {
		// remove from list
		remove_breakpoint();	// unconditional, if it wasn't there
					// its a nop anyway, can't get any
					// new probes on account of holding
					// uprobes_mutex and mmap() doesn't
					// see it due to tree removal.
	}
}

int register_uprobe(...)
{
	int ret;

	mutex_lock(&uprobes_mutex);
	ret = __register_uprobe(...);
	if (ret)
		__unregister_uprobe(...);
	mutex_unlock(&uprobes_mutex);

	return ret;
}

int mmap_uprobe(...)
{
	spin_lock(&uprobes_treelock);
	for_each_probe_in_inode() {
		// create list;
	}
	spin_unlock(..);

	list_for_each_entry_safe() {
		// remove from list
		ret = install_breakpoint();
		if (ret)
			goto fail;
		if (!uprobe_still_there())	// takes treelock
			remove_breakpoint();
	}

	return 0;

fail:
	list_for_each_entry_safe() {
		// destroy list
	}
	return ret;
}
Should work I think, no?
On Thu, 2011-06-16 at 20:23 +0200, Peter Zijlstra wrote:
> int __register_uprobe(...)
> {
> uprobe = alloc_uprobe(...); // find or insert in tree
>
> vma_prio_tree_foreach(..) {
> // get mm ref, add to list blah blah
> }
>
> list_for_each_entry_safe() {
> // del from list etc..
> down_read(mm->mmap_sem);
> ret = install_breakpoint();
> if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> up_read(..);
> goto fail;
> }
>
> return 0;
>
> fail:
> list_for_each_entry_safe() {
> // del from list, put mm
> }
>
> return ret;
> }
>
> void __unregister_uprobe(...)
> {
> uprobe = find_uprobe(); // ref++
> if (delete_consumer(...)); // includes tree removal on last consumer
> // implies we own the last ref
> return; // consumers
>
> vma_prio_tree_foreach() {
> // create list
> }
>
> list_for_each_entry_safe() {
> // remove from list
> remove_breakpoint(); // unconditional, if it wasn't there
> // its a nop anyway, can't get any new
> // new probes on account of holding
> // uprobes_mutex and mmap() doesn't see
> // it due to tree removal.
> }
put_uprobe(); // last ref, *poof*
> }
>
> int register_uprobe(...)
> {
> int ret;
>
> mutex_lock(&uprobes_mutex);
> ret = __register_uprobe(...);
> if (!ret)
> __unregister_uprobe(...);
> mutex_unlock(&uprobes_mutex);
>
> ret;
> }
>
> int mmap_uprobe(...)
> {
> spin_lock(&uprobes_treelock);
> for_each_probe_in_inode() {
> // create list;
> }
> spin_unlock(..);
>
> list_for_each_entry_safe() {
> // remove from list
> ret = install_breakpoint();
> if (ret)
> goto fail;
> if (!uprobe_still_there()) // takes treelock
> remove_breakpoint();
> }
>
> return 0;
>
> fail:
> list_for_each_entry_safe() {
> // destroy list
> }
> return ret;
> }
>
> void __unregister_uprobe(...)
> {
> uprobe = find_uprobe(); // ref++
> if (delete_consumer(...)); // includes tree removal on last consumer
> // implies we own the last ref
> return; // consumers
>
> vma_prio_tree_foreach() {
> // create list
> }
>
> list_for_each_entry_safe() {
> // remove from list
> remove_breakpoint(); // unconditional, if it wasn't there
> // its a nop anyway, can't get any new
> // new probes on account of holding
> // uprobes_mutex and mmap() doesn't see
> // it due to tree removal.
> }
> }
>
This would have a bigger race.
A breakpoint might be hit after the node has been removed, and then we
have no way to find the uprobe. So we deliver an extra TRAP to the
app.
> int mmap_uprobe(...)
> {
> spin_lock(&uprobes_treelock);
> for_each_probe_in_inode() {
> // create list;
> }
> spin_unlock(..);
>
> list_for_each_entry_safe() {
> // remove from list
> ret = install_breakpoint();
> if (ret)
> goto fail;
> if (!uprobe_still_there()) // takes treelock
> remove_breakpoint();
> }
>
> return 0;
>
> fail:
> list_for_each_entry_safe() {
> // destroy list
> }
> return ret;
> }
>
register_uprobe will race with mmap_uprobe's first pass.
So we might end up with a vma that does not have the breakpoint inserted
while it is inserted in all the other vmas that map the same inode.
--
Thanks and Regards
Srikar
On Fri, 2011-06-17 at 10:20 +0530, Srikar Dronamraju wrote:
> >
> > void __unregister_uprobe(...)
> > {
> > uprobe = find_uprobe(); // ref++
> > if (delete_consumer(...)); // includes tree removal on last consumer
> > // implies we own the last ref
> > return; // consumers
> >
> > vma_prio_tree_foreach() {
> > // create list
> > }
> >
> > list_for_each_entry_safe() {
> > // remove from list
> > remove_breakpoint(); // unconditional, if it wasn't there
> > // its a nop anyway, can't get any new
> > // new probes on account of holding
> > // uprobes_mutex and mmap() doesn't see
> > // it due to tree removal.
> > }
> > }
> >
>
> This would have a bigger race.
> A breakpoint might be hit by which time the node is removed and we
> have no way to find out the uprobe. So we deliver an extra TRAP to the
> app.
Gah indeed. Back to the drawing board for me.
> > int mmap_uprobe(...)
> > {
> > spin_lock(&uprobes_treelock);
> > for_each_probe_in_inode() {
> > // create list;
> > }
> > spin_unlock(..);
> >
> > list_for_each_entry_safe() {
> > // remove from list
> > ret = install_breakpoint();
> > if (ret)
> > goto fail;
> > if (!uprobe_still_there()) // takes treelock
> > remove_breakpoint();
> > }
> >
> > return 0;
> >
> > fail:
> > list_for_each_entry_safe() {
> > // destroy list
> > }
> > return ret;
> > }
> >
>
>
> register_uprobe will race with mmap_uprobe's first pass.
> So we might end up with a vma that doesnot have a breakpoint inserted
> but inserted in all other vma that map to the same inode.
I'm not seeing this though; if mmap_uprobe() runs before register_uprobe()
inserts the probe in the tree, the vma is already in the rmap and
register_uprobe() will find it in its vma walk. If it runs after,
mmap_uprobe() will find the probe and install it, and if a concurrent
register_uprobe()'s vma walk also finds it, it will get -EEXIST and ignore
the error.
* Peter Zijlstra <[email protected]> [2011-06-17 10:03:56]:
> On Fri, 2011-06-17 at 10:20 +0530, Srikar Dronamraju wrote:
> > >
> > > void __unregister_uprobe(...)
> > > {
> > > uprobe = find_uprobe(); // ref++
> > > if (delete_consumer(...)); // includes tree removal on last consumer
> > > // implies we own the last ref
> > > return; // consumers
> > >
> > > vma_prio_tree_foreach() {
> > > // create list
> > > }
> > >
> > > list_for_each_entry_safe() {
> > > // remove from list
> > > remove_breakpoint(); // unconditional, if it wasn't there
> > > // its a nop anyway, can't get any new
> > > // new probes on account of holding
> > > // uprobes_mutex and mmap() doesn't see
> > > // it due to tree removal.
> > > }
> > > }
> > >
> >
> > This would have a bigger race.
> > A breakpoint might be hit by which time the node is removed and we
> > have no way to find out the uprobe. So we deliver an extra TRAP to the
> > app.
>
> Gah indeed. Back to the drawing board for me.
>
> > > int mmap_uprobe(...)
> > > {
> > > spin_lock(&uprobes_treelock);
> > > for_each_probe_in_inode() {
> > > // create list;
Here again, if we have multiple mmaps for the same inode occurring in two
process contexts (I mean two different mm's), we have to manage how we
add the same uprobe to more than one list. At least my current
uprobe->pending_list wouldn't work.
> > > }
> > > spin_unlock(..);
> > >
> > > list_for_each_entry_safe() {
> > > // remove from list
> > > ret = install_breakpoint();
> > > if (ret)
> > > goto fail;
> > > if (!uprobe_still_there()) // takes treelock
> > > remove_breakpoint();
> > > }
> > >
> > > return 0;
> > >
> > > fail:
> > > list_for_each_entry_safe() {
> > > // destroy list
> > > }
> > > return ret;
> > > }
> > >
> >
> >
> > register_uprobe will race with mmap_uprobe's first pass.
> > So we might end up with a vma that doesnot have a breakpoint inserted
> > but inserted in all other vma that map to the same inode.
>
> I'm not seeing this though, if mmap_uprobe() is before register_uprobe()
> inserts the probe in the tree, the vma is already in the rmap and
> register_uprobe() will find it in its vma walk. If its after,
> mmap_uprobe() will find it and install, if a concurrent
> register_uprobe()'s vma walk also finds it, it will -EEXISTS and ignore
> the error.
>
You are right here.
But what happens if register_uprobe comes first and walks the vmas, and
in between an mmap comes in, does the insertion including the second
pass, and returns? register_uprobe now finds that it cannot insert the
breakpoint on one of the vmas and hence has to roll back. The vma in
which mmap_uprobe inserted the breakpoint will not be in the list of
vmas from which we try to remove it. A possible interleaving is sketched
below.
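To make that concrete, here is the interleaving I have in mind (an
illustrative reconstruction of the scenario, not code from the patchset):

        register_uprobe()                       mmap() of the same inode
        -----------------                       ------------------------
        alloc_uprobe() puts probe in tree
        vma rmap walk collects the current
        vmas into its work list
                                                mmap_uprobe() first pass finds
                                                the probe in the tree
                                                second pass install_breakpoint()
                                                on the new vma; mmap() returns
        install_breakpoint() fails on one
        of the collected vmas
        roll-back removes breakpoints only
        from the collected vmas, so the new
        vma is left with a stale breakpoint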
How about something like this:
/* Changes from previous time:
 * - add an atomic counter to the inode (this is optional)
 * - trylock first.
 * - take down_write instead of down_read if we drop mmap_sem
 * - no releasing of mmap_sem a second time since we take a down_write.
 */
int mmap_uprobe(struct vm_area_struct *vma)
{
        struct list_head tmp_list;
        struct uprobe *uprobe, *u;
        struct mm_struct *mm;
        struct inode *inode;
        unsigned long start, pgoff;
        int ret = 0;

        if (!valid_vma(vma))
                return ret;     /* Bail-out */

        inode = vma->vm_file->f_mapping->host;
        if (!atomic_read(&inode->uprobes_count))
                return ret;

        INIT_LIST_HEAD(&tmp_list);

        mm = vma->vm_mm;
        start = vma->vm_start;
        pgoff = vma->vm_pgoff;
        __iget(inode);

        if (!mutex_trylock(&uprobes_mutex)) {
                /*
                 * Unable to get uprobes_mutex; probably contending with
                 * some other thread. Drop mmap_sem; acquire uprobes_mutex
                 * and mmap_sem and then verify the vma.
                 */
                up_write(&mm->mmap_sem);
                mutex_lock(&uprobes_mutex);
                down_write(&mm->mmap_sem);

                vma = find_vma(mm, start);
                /* Not the same vma */
                if (!vma || vma->vm_start != start ||
                                vma->vm_pgoff != pgoff || !valid_vma(vma) ||
                                inode->i_mapping != vma->vm_file->f_mapping)
                        goto mmap_out;
        }

        add_to_temp_list(vma, inode, &tmp_list);
        list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
                loff_t vaddr;

                list_del(&uprobe->pending_list);
                if (ret)
                        continue;

                vaddr = vma->vm_start + uprobe->offset;
                vaddr -= vma->vm_pgoff << PAGE_SHIFT;
                if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
                        /* Not in this vma */
                        continue;
                if (vaddr > TASK_SIZE)
                        /*
                         * We cannot have a virtual address that is
                         * greater than TASK_SIZE.
                         */
                        continue;

                ret = install_breakpoint(mm, uprobe, vaddr);
                if (ret && (ret == -ESRCH || ret == -EEXIST))
                        ret = 0;
        }

mmap_out:
        mutex_unlock(&uprobes_mutex);
        iput(inode);
        return ret;
}
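For reference, add_to_temp_list() above is not spelled out; a rough version
matching the "first pass" outlined earlier in this thread could look like
this (hypothetical helper; the iterator name is made up):

static void add_to_temp_list(struct vm_area_struct *vma, struct inode *inode,
                                struct list_head *tmp_list)
{
        struct uprobe *uprobe;

        /* vma is unused here; kept only to match the call site above */
        spin_lock(&uprobes_treelock);
        for_each_probe_in_inode(uprobe, inode)
                list_add(&uprobe->pending_list, tmp_list);
        spin_unlock(&uprobes_treelock);
}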
--
Thanks and Regards
Srikar
* Oleg Nesterov <[email protected]> [2011-06-16 15:51:14]:
> On 06/16, Srikar Dronamraju wrote:
> >
> > In which case, shouldnt traversing all the tasks of all siblings of
> > parent of mm->owner should provide us all the the tasks that have linked
> > to mm. Right?
>
> I don't think so.
>
> Even if the initial mm->ovner never exits (iow, mm->owner is never changed),
> the "deep" CLONE_VM child can be reparented to init if its parent exits.
>
oh right.
> > Agree that we can bother about this a little later.
>
> Agreed.
>
>
> Oh. We should move ->mm from task_struct to signal_struct, but we need to
> change the code like get_task_mm(). And then instead of mm->owner we can
> have mm->processes list. Perhaps. This can be used by zap_threads() too.
>
Okay, that's a nice idea. Roughly, I read it as something like the sketch below.
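(Only the new members are shown and the names are made up; this is just how
I understand the suggestion, not a proposal.)

struct signal_struct {
        struct mm_struct        *mm;            /* moved here from task_struct */
        struct list_head        mm_node;        /* entry in mm->processes */
};

struct mm_struct {
        struct list_head        processes;      /* all signal_structs sharing this mm */
};

Then "all tasks linked to this mm" becomes a walk of mm->processes instead
of guessing via mm->owner, and zap_threads() could reuse the same walk.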
--
Thanks and Regards
Srikar
On Fri, 2011-06-17 at 14:35 +0530, Srikar Dronamraju wrote:
> > > > int mmap_uprobe(...)
> > > > {
> > > > spin_lock(&uprobes_treelock);
> > > > for_each_probe_in_inode() {
> > > > // create list;
>
> Here again if we have multiple mmaps for the same inode occuring on two
> process contexts (I mean two different mm's), we have to manage how we
> add the same uprobe to more than one list. Atleast my current
> uprobe->pending_list wouldnt work.
Sure, wasn't concerned about that particular problem.
> > > > }
> > > > spin_unlock(..);
> > > >
> > > > list_for_each_entry_safe() {
> > > > // remove from list
> > > > ret = install_breakpoint();
> > > > if (ret)
> > > > goto fail;
> > > > if (!uprobe_still_there()) // takes treelock
> > > > remove_breakpoint();
> > > > }
> > > >
> > > > return 0;
> > > >
> > > > fail:
> > > > list_for_each_entry_safe() {
> > > > // destroy list
> > > > }
> > > > return ret;
> > > > }
> > > >
> > >
> > >
> > > register_uprobe will race with mmap_uprobe's first pass.
> > > So we might end up with a vma that doesnot have a breakpoint inserted
> > > but inserted in all other vma that map to the same inode.
> >
> > I'm not seeing this though, if mmap_uprobe() is before register_uprobe()
> > inserts the probe in the tree, the vma is already in the rmap and
> > register_uprobe() will find it in its vma walk. If its after,
> > mmap_uprobe() will find it and install, if a concurrent
> > register_uprobe()'s vma walk also finds it, it will -EEXISTS and ignore
> > the error.
> >
>
> You are right here.
>
> What happens if the register_uprobe comes first and walks around the
> vmas, Between mmap comes in does the insertion including the second pass
> and returns. register_uprobe now finds that it cannot insert breakpoint
> on one of the vmas and hence has to roll-back. The vma on which
> mmap_uprobe inserted will not be in the list of vmas from which we try
> to remove the breakpoint.
Yes it will. Remember, __register_uprobe() will call
__unregister_uprobe() on failure, which does a new vma-rmap walk that
will then see the newly added mmap.
> How about something like this:
> if (!mutex_trylock(uprobes_mutex)) {
>
> /*
> * Unable to get uprobes_mutex; Probably contending with
> * someother thread. Drop mmap_sem; acquire uprobes_mutex
> * and mmap_sem and then verify vma.
> */
>
> up_write(&mm->mmap_sem);
> mutex_lock&(uprobes_mutex);
> down_write(&mm->mmap_sem);
> vma = find_vma(mm, start);
> /* Not the same vma */
> if (!vma || vma->vm_start != start ||
> vma->vm_pgoff != pgoff || !valid_vma(vma) ||
> inode->i_mapping != vma->vm_file->f_mapping)
> goto mmap_out;
> }
Only if we have to; I really don't like dropping mmap_sem in the middle
of mmap. I'm fairly sure we can come up with some ordering scheme that
ought to make mmap_uprobe() work without the uprobes_mutex.
One thing I was thinking of to fix that initial problem of spurious traps
was to leave the uprobe in the tree but skip all probes without
consumers in mmap_uprobe().
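Roughly along these lines in mmap_uprobe()'s install loop (just a sketch of
the idea, reusing the pending_list walk from the pseudo-code in this thread):

        list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
                loff_t vaddr;

                list_del(&uprobe->pending_list);
                if (!uprobe->consumers)         /* being unregistered: skip */
                        continue;

                vaddr = vma->vm_start + uprobe->offset;
                vaddr -= vma->vm_pgoff << PAGE_SHIFT;
                ret = install_breakpoint(mm, uprobe, vaddr);
                if (ret == -ESRCH || ret == -EEXIST)
                        ret = 0;
        }

Since the consumer-less uprobe stays in the tree until unregister completes,
a breakpoint hit that races with unregister still finds the uprobe and no
spurious trap reaches the application.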
On Fri, 2011-06-17 at 11:41 +0200, Peter Zijlstra wrote:
>
> On thing I was thinking of to fix that initial problem of spurious traps
> was to leave the uprobe in the tree but skip all probes without
> consumers in mmap_uprobe().
Can you find fault with using __unregister_uprobe() as a cleanup path
for __register_uprobe() so that we do a second vma-rmap walk, and
ignoring empty probes on uprobe_mmap()?
We won't get spurious traps because the empty (no-consumers) uprobe is
still in the tree, and we won't get any 'lost' probe insns because the
cleanup does a second vma-rmap walk which will include the new mmap().
And double probe insertion is harmless.
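In other words, roughly this at the end of __register_uprobe() (a sketch of
the intent only, mirroring the pseudo-code later in this thread):

        /* some install_breakpoint() failed */
        if (ret)
                /*
                 * Fall back to the full unregister path; its fresh
                 * vma-rmap walk also covers any vma that a concurrent
                 * mmap_uprobe() installed into while we were running.
                 */
                __unregister_uprobe();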
On Tue, 2011-06-07 at 18:31 +0530, Srikar Dronamraju wrote:
> @@ -844,6 +845,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> if (thread_info_flags & _TIF_SIGPENDING)
> do_signal(regs);
>
> + if (thread_info_flags & _TIF_UPROBE) {
> + clear_thread_flag(TIF_UPROBE);
> +#ifdef CONFIG_X86_32
> + /*
> + * On x86_32, do_notify_resume() gets called with
> + * interrupts disabled. Hence enable interrupts if they
> + * are still disabled.
> + */
> + local_irq_enable();
> +#endif
> + uprobe_notify_resume(regs);
> + }
Would it make sense to handle TIF_UPROBE before TIF_SIGPENDING? That way,
when uprobe decides it ought to send a signal, we don't have to do
another loop through all this.
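Concretely, I mean roughly this ordering in do_notify_resume() (a sketch of
the suggestion only, not a tested patch; the x86_32 irq enabling is kept as
in the posted hunk):

        if (thread_info_flags & _TIF_UPROBE) {
                clear_thread_flag(TIF_UPROBE);
#ifdef CONFIG_X86_32
                /* x86_32 enters here with interrupts disabled, see the posted patch */
                local_irq_enable();
#endif
                uprobe_notify_resume(regs);
        }

        if (thread_info_flags & _TIF_SIGPENDING)
                do_signal(regs);

That way a signal queued by the uprobe handler is delivered by do_signal()
in the same pass instead of requiring another trip through this work loop.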
On Tue, 2011-06-21 at 15:31 +0200, Peter Zijlstra wrote:
> On Tue, 2011-06-07 at 18:31 +0530, Srikar Dronamraju wrote:
> > @@ -844,6 +845,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> > if (thread_info_flags & _TIF_SIGPENDING)
> > do_signal(regs);
> >
> > + if (thread_info_flags & _TIF_UPROBE) {
> > + clear_thread_flag(TIF_UPROBE);
> > +#ifdef CONFIG_X86_32
> > + /*
> > + * On x86_32, do_notify_resume() gets called with
> > + * interrupts disabled. Hence enable interrupts if they
> > + * are still disabled.
> > + */
> > + local_irq_enable();
> > +#endif
> > + uprobe_notify_resume(regs);
> > + }
>
> Would it make sense to do TIF_UPROBE before TIF_SIGPENDING? That way
> when uprobe decides it ought to have send a signal we don't have to do
> another loop through all this.
Also, it might be good to unify x86_64 and i386 on the interrupt thing,
instead of propagating this difference (unless of course there's a good
reason they're different, but I really don't know this code well).
* Peter Zijlstra <[email protected]> [2011-06-21 15:17:23]:
> On Fri, 2011-06-17 at 11:41 +0200, Peter Zijlstra wrote:
> >
> > On thing I was thinking of to fix that initial problem of spurious traps
> > was to leave the uprobe in the tree but skip all probes without
> > consumers in mmap_uprobe().
>
> Can you find fault with using __unregister_uprobe() as a cleanup path
> for __register_uprobe() so that we do a second vma-rmap walk, and
> ignoring empty probes on uprobe_mmap()?
It gets a little complicated to handle simultaneous mmaps of the same
inode/file in different processes.
- The same uprobe cannot be on two different temporary lists at the same
time (see the sketch below), so we have to serialize the mmap_uprobe hook.
- If we use auxiliary structures that refer to uprobes as nodes of the
temp list, we don't know how many of them to preallocate, and we cannot
allocate on demand since we traverse the RB tree with uprobes_treelock held.
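To illustrate the first point (only the relevant fields are shown; this
mirrors the structure used in the pseudo-code, not the exact patch):

struct uprobe {
        struct rb_node          rb_node;        /* node in the uprobes RB tree */
        struct list_head        pending_list;   /* can sit on only ONE temp list */
        loff_t                  offset;
};

Two mmap_uprobe() instances on different mm's would both want to do
list_add(&uprobe->pending_list, &their_tmp_list) for the same uprobe, and a
single embedded list_head can only be on one list at a time.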
>
> We won't get spurious traps because the empty (no consumers) uprobe is
> still in the tree, we won't get any 'lost' probe insn because the
> cleanup does a second vma-rmap walk which will include the new mmap().
> And double probe insertion is harmless.
>
So I am thinking of a solution that includes most of your ideas along
with using i_mmap_mutex in the mmap_uprobe path.
/*
Changes:
1. Uses inode->i_mutex instead of uprobes_mutex. (This is optional.)
2. Along with the vma rmap walk, i_mmap_mutex is now also held when we delete uprobes from the RB tree.
3. mmap_uprobe takes i_mmap_mutex.
4. inode->uprobes_count. (Again, this is optional.)
Advantages:
1. No need to drop mmap_sem.
2. register/unregister can now run in parallel (iff we use i_mutex).
3. No need to take an extra reference to the uprobe in mmap_uprobe().
*/
void _unregister_uprobe(...)
{
        if (!del_consumer(...)) { // includes tree removal on last consumer
                return;
        }
        if (uprobe->consumers)
                return;

        mutex_lock(&inode->i_mmap_mutex); // sync with mmap
        vma_prio_tree_foreach() {
                // create list
        }
        mutex_unlock(&inode->i_mmap_mutex);

        list_for_each_entry_safe() {
                // remove from list
                down_read(&mm->mmap_sem);
                remove_breakpoint(); // unconditional, if it wasn't there
                up_read(&mm->mmap_sem);
        }

        mutex_lock(&inode->i_mmap_mutex);
        delete_uprobe(uprobe);
        mutex_unlock(&inode->i_mmap_mutex);

        inode->uprobes_count--;
        mutex_unlock(&inode->i_mutex);
}
int register_uprobe(...)
{
        uprobe = alloc_uprobe(...); // find or insert in tree

        mutex_lock(&inode->i_mutex); // sync with register/unregister
        if (uprobe->consumers) {
                add_consumer();
                goto put_unlock;
        }
        add_consumer();
        inode->uprobes_count++;

        mutex_lock(&inode->i_mmap_mutex); // sync with mmap
        vma_prio_tree_foreach(..) {
                // get mm ref, add to list, blah blah
        }
        mutex_unlock(&inode->i_mmap_mutex);

        list_for_each_entry_safe() {
                if (ret) {
                        // del from list etc..
                        continue;
                }
                down_read(&mm->mmap_sem);
                ret = install_breakpoint();
                up_read(&mm->mmap_sem);
                // del from list etc..
                if (ret && (ret == -ESRCH || ret == -EEXIST))
                        ret = 0;
        }

        if (ret)
                _unregister_uprobe();

put_unlock:
        mutex_unlock(&inode->i_mutex);
        put_uprobe(uprobe);
        return ret;
}
void unregister_uprobe(...)
{
        mutex_lock(&inode->i_mutex); // sync with register/unregister
        uprobe = find_uprobe(); // ref++
        _unregister_uprobe();
        mutex_unlock(&inode->i_mutex);
        put_uprobe(uprobe);
}
int mmap_uprobe(struct vm_area_struct *vma)
{
        struct list_head tmp_list;
        struct uprobe *uprobe, *u;
        struct mm_struct *mm;
        struct inode *inode;
        int ret = 0;

        if (!valid_vma(vma))
                return ret; /* Bail-out */

        mm = vma->vm_mm;
        inode = vma->vm_file->f_mapping->host;
        if (inode->uprobes_count)
                return ret;
        __iget(inode);

        INIT_LIST_HEAD(&tmp_list);

        mutex_lock(&inode->i_mmap_mutex);
        add_to_temp_list(vma, inode, &tmp_list);
        list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
                loff_t vaddr;

                list_del(&uprobe->pending_list);
                if (ret)
                        continue;

                vaddr = vma->vm_start + uprobe->offset;
                vaddr -= vma->vm_pgoff << PAGE_SHIFT;
                ret = install_breakpoint(mm, uprobe, vaddr);
                if (ret && (ret == -ESRCH || ret == -EEXIST))
                        ret = 0;
        }
        mutex_unlock(&inode->i_mmap_mutex);
        iput(inode);
        return ret;
}
int munmap_uprobe(struct vm_area_struct *vma)
{
        struct list_head tmp_list;
        struct uprobe *uprobe, *u;
        struct mm_struct *mm;
        struct inode *inode;
        int ret = 0;

        if (!valid_vma(vma))
                return ret; /* Bail-out */

        mm = vma->vm_mm;
        inode = vma->vm_file->f_mapping->host;
        if (inode->uprobes_count)
                return ret;

        // walk thro RB tree and decrement mm->uprobes_count
        walk_rbtree_and_dec_uprobes_count(); // hold treelock
        return ret;
}
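walk_rbtree_and_dec_uprobes_count() is only named above; roughly I mean
something like this (a hypothetical sketch; the iterator and the exact
counter handling are made up):

static void walk_rbtree_and_dec_uprobes_count(struct vm_area_struct *vma,
                                                struct inode *inode)
{
        struct uprobe *uprobe;

        spin_lock(&uprobes_treelock);
        for_each_probe_in_inode(uprobe, inode) {
                loff_t vaddr = vma->vm_start + uprobe->offset;

                vaddr -= vma->vm_pgoff << PAGE_SHIFT;
                /* only probes that fall inside the vma being unmapped */
                if (vaddr >= vma->vm_start && vaddr < vma->vm_end)
                        vma->vm_mm->uprobes_count--;
        }
        spin_unlock(&uprobes_treelock);
}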
--
Thanks and Regards
Srikar
* Peter Zijlstra <[email protected]> [2011-06-21 15:32:47]:
> On Tue, 2011-06-21 at 15:31 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-06-07 at 18:31 +0530, Srikar Dronamraju wrote:
> > > @@ -844,6 +845,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> > > if (thread_info_flags & _TIF_SIGPENDING)
> > > do_signal(regs);
> > >
> > > + if (thread_info_flags & _TIF_UPROBE) {
> > > + clear_thread_flag(TIF_UPROBE);
> > > +#ifdef CONFIG_X86_32
> > > + /*
> > > + * On x86_32, do_notify_resume() gets called with
> > > + * interrupts disabled. Hence enable interrupts if they
> > > + * are still disabled.
> > > + */
> > > + local_irq_enable();
> > > +#endif
> > > + uprobe_notify_resume(regs);
> > > + }
> >
> > Would it make sense to do TIF_UPROBE before TIF_SIGPENDING? That way
> > when uprobe decides it ought to have send a signal we don't have to do
> > another loop through all this.
>
Okay,
>
> Also, it might be good to unify x86_86 and i386 on the interrupt thing,
> instead of propagating this difference (unless of course there's a good
> reason they're different, but I really don't know this code well).
I am not sure if this has changed lately, so I will try removing the
local_irq_enable().
Oleg, Roland, do you know why do_notify_resume() gets called with
interrupts disabled on i386?
--
Thanks and Regards
Srikar
> Oleg, Roland, do you know why do_notify_resume() gets called with
> interrupts disabled on i386?
It was that way for a long time. My impression was that it was just not
bothering to reenable before do_signal->get_signal_to_deliver would shortly
disable (via spin_lock_irq) anyway. It's possible there was something more
to it, but I don't know of anything.
Thanks,
Roland
>
> so I am thinking of a solution that includes most of your ideas along
> with using i_mmap_mutex in mmap_uprobe path.
>
Addressing Peter's comments given on IRC regarding i_mmap_mutex.
/*
Changes:
1. Uses inode->i_mutex instead of uprobes_mutex.
2. Along with the vma rmap walk, i_mmap_mutex is now also held when we delete uprobes from the RB tree.
3. mmap_uprobe takes i_mmap_mutex.
Advantages:
1. No need to drop mmap_sem.
2. register/unregister can now run in parallel.
*/
void _unregister_uprobe(...)
{
        if (!del_consumer(...)) { // includes tree removal on last consumer
                return;
        }
        if (uprobe->consumers)
                return;

        mutex_lock(&mapping->i_mmap_mutex); // sync with mmap
        vma_prio_tree_foreach() {
                // create list
        }
        mutex_unlock(&mapping->i_mmap_mutex);

        list_for_each_entry_safe() {
                // remove from list
                down_read(&mm->mmap_sem);
                remove_breakpoint(); // unconditional, if it wasn't there
                up_read(&mm->mmap_sem);
        }

        mutex_lock(&mapping->i_mmap_mutex);
        delete_uprobe(uprobe);
        mutex_unlock(&mapping->i_mmap_mutex);

        inode->uprobes_count--;
        mutex_unlock(&inode->i_mutex);
}
int register_uprobe(...)
{
        uprobe = alloc_uprobe(...); // find or insert in tree

        mutex_lock(&inode->i_mutex); // sync with register/unregister
        if (uprobe->consumers) {
                add_consumer();
                goto put_unlock;
        }
        add_consumer();
        inode->uprobes_count++;

        mutex_lock(&mapping->i_mmap_mutex); // sync with mmap
        vma_prio_tree_foreach(..) {
                // get mm ref, add to list blah blah
        }
        mutex_unlock(&mapping->i_mmap_mutex);

        list_for_each_entry_safe() {
                if (ret) {
                        // del from list etc..
                        continue;
                }
                down_read(mm->mmap_sem);
                ret = install_breakpoint();
                up_read(..);
                // del from list etc..
                if (ret && (ret == -ESRCH || ret == -EEXIST))
                        ret = 0;
        }

        if (ret)
                _unregister_uprobe();

put_unlock:
        mutex_unlock(&inode->i_mutex);
        put_uprobe(uprobe);
        return ret;
}
void unregister_uprobe(...)
{
        mutex_lock(&inode->i_mutex); // sync with register/unregister
        uprobe = find_uprobe(); // ref++
        _unregister_uprobe();
        mutex_unlock(&inode->i_mutex);
        put_uprobe(uprobe);
}
int mmap_uprobe(struct vm_area_struct *vma)
{
        struct list_head tmp_list;
        struct uprobe *uprobe, *u;
        struct mm_struct *mm;
        struct inode *inode;
        int ret = 0;

        if (!valid_vma(vma))
                return ret; /* Bail-out */

        mm = vma->vm_mm;
        inode = vma->vm_file->f_mapping->host;
        if (inode->uprobes_count)
                return ret;
        __iget(inode);

        INIT_LIST_HEAD(&tmp_list);

        mutex_lock(&mapping->i_mmap_mutex);
        add_to_temp_list(vma, inode, &tmp_list);
        list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
                loff_t vaddr;

                list_del(&uprobe->pending_list);
                if (ret)
                        continue;

                vaddr = vma->vm_start + uprobe->offset;
                vaddr -= vma->vm_pgoff << PAGE_SHIFT;
                ret = install_breakpoint(mm, uprobe, vaddr);
                if (ret && (ret == -ESRCH || ret == -EEXIST))
                        ret = 0;
        }
        mutex_unlock(&mapping->i_mmap_mutex);
        iput(inode);
        return ret;
}
int munmap_uprobe(struct vm_area_struct *vma)
{
        struct list_head tmp_list;
        struct uprobe *uprobe, *u;
        struct mm_struct *mm;
        struct inode *inode;
        int ret = 0;

        if (!valid_vma(vma))
                return ret; /* Bail-out */

        mm = vma->vm_mm;
        inode = vma->vm_file->f_mapping->host;
        if (inode->uprobes_count)
                return ret;

        // walk thro RB tree and decrement mm->uprobes_count
        walk_rbtree_and_dec_uprobes_count(); // hold treelock
        return ret;
}
On Fri, 2011-06-24 at 07:36 +0530, Srikar Dronamraju wrote:
> >
> > so I am thinking of a solution that includes most of your ideas along
> > with using i_mmap_mutex in mmap_uprobe path.
> >
>
> Addressing Peter's comments given on irc wrt i_mmap_mutex.
>
> void _unregister_uprobe(...)
> {
> if (!del_consumer(...)) { // includes tree removal on last consumer
> return;
> }
> if (uprobe->consumers)
> return;
>
> mutex_lock(&mapping->i_mmap_mutex); //sync with mmap.
> vma_prio_tree_foreach() {
> // create list
> }
>
> mutex_unlock(&mapping->i_mmap_mutex);
>
> list_for_each_entry_safe() {
> // remove from list
> down_read(&mm->mmap_sem);
> remove_breakpoint(); // unconditional, if it wasn't there
> up_read(&mm->mmap_sem);
> }
>
> mutex_lock(&mapping->i_mmap_mutex);
> delete_uprobe(uprobe);
> mutex_unlock(&mapping->i_mmap_mutex);
>
> inode->uprobes_count--;
> mutex_unlock(&inode->i_mutex);
Right, so this lonesome unlock got me puzzled for a while. I always find
it best not to do asymmetric locking like this; keep the lock and unlock
in the same function.
> }
>
> int register_uprobe(...)
> {
> uprobe = alloc_uprobe(...); // find or insert in tree
>
> mutex_lock(&inode->i_mutex); // sync with register/unregister
> if (uprobe->consumers) {
> add_consumer();
> goto put_unlock;
> }
> add_consumer();
> inode->uprobes_count++;
> mutex_lock(&mapping->i_mmap_mutex); //sync with mmap.
> vma_prio_tree_foreach(..) {
> // get mm ref, add to list blah blah
> }
>
> mutex_unlock(&mapping->i_mmap_mutex);
> list_for_each_entry_safe() {
> if (ret) {
> // del from list etc..
> //
> continue;
> }
> down_read(mm->mmap_sem);
> ret = install_breakpoint();
> up_read(..);
> // del from list etc..
> //
> if (ret && (ret == -ESRCH || ret == -EEXIST))
> ret = 0;
> }
>
> if (ret)
> _unregister_uprobe();
>
> put_unlock:
> mutex_unlock(&inode->i_mutex);
You see, now this is a double unlock
> put_uprobe(uprobe);
> return ret;
> }
>
> void unregister_uprobe(...)
> {
> mutex_lock(&inode->i_mutex); // sync with register/unregister
> uprobe = find_uprobe(); // ref++
> _unregister_uprobe();
> mutex_unlock(&inode->i_mutex);
idem
> put_uprobe(uprobe);
> }
>
> int mmap_uprobe(struct vm_area_struct *vma)
> {
> struct list_head tmp_list;
> struct uprobe *uprobe, *u;
> struct mm_struct *mm;
> struct inode *inode;
> int ret = 0;
>
> if (!valid_vma(vma))
> return ret; /* Bail-out */
>
> mm = vma->vm_mm;
> inode = vma->vm_file->f_mapping->host;
> if (inode->uprobes_count)
> return ret;
> __iget(inode);
>
> INIT_LIST_HEAD(&tmp_list);
>
> mutex_lock(&mapping->i_mmap_mutex);
> add_to_temp_list(vma, inode, &tmp_list);
> list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> loff_t vaddr;
>
> list_del(&uprobe->pending_list);
> if (ret)
> continue;
>
> vaddr = vma->vm_start + uprobe->offset;
> vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> ret = install_breakpoint(mm, uprobe, vaddr);
Right, so this is the problem: you cannot do allocations under
i_mmap_mutex; however, I think you can under i_mutex.
> if (ret && (ret == -ESRCH || ret == -EEXIST))
> ret = 0;
> }
>
> mutex_unlock(&mapping->i_mmap_mutex);
> iput(inode);
> return ret;
> }
>
> int munmap_uprobe(struct vm_area_struct *vma)
> {
> struct list_head tmp_list;
> struct uprobe *uprobe, *u;
> struct mm_struct *mm;
> struct inode *inode;
> int ret = 0;
>
> if (!valid_vma(vma))
> return ret; /* Bail-out */
>
> mm = vma->vm_mm;
> inode = vma->vm_file->f_mapping->host;
> if (inode->uprobes_count)
> return ret;
Should that be !->uprobes_count?
> // walk thro RB tree and decrement mm->uprobes_count
> walk_rbtree_and_dec_uprobes_count(); //hold treelock.
>
> return ret;
> }
> > mutex_lock(&mapping->i_mmap_mutex);
> > delete_uprobe(uprobe);
> > mutex_unlock(&mapping->i_mmap_mutex);
> >
> > inode->uprobes_count--;
> > mutex_unlock(&inode->i_mutex);
>
> Right, so this lonesome unlock got me puzzled for a while, I always find
> it best not to do asymmetric locking like this, keep the lock and unlock
> in the same function.
>
Okay, will do.
> > }
> >
> > int register_uprobe(...)
> > {
> > uprobe = alloc_uprobe(...); // find or insert in tree
> >
> > mutex_lock(&inode->i_mutex); // sync with register/unregister
> > if (uprobe->consumers) {
> > add_consumer();
> > goto put_unlock;
> > }
> > add_consumer();
> > inode->uprobes_count++;
> > mutex_lock(&mapping->i_mmap_mutex); //sync with mmap.
> > vma_prio_tree_foreach(..) {
> > // get mm ref, add to list blah blah
> > }
> >
> > mutex_unlock(&mapping->i_mmap_mutex);
> > list_for_each_entry_safe() {
> > if (ret) {
> > // del from list etc..
> > //
> > continue;
> > }
> > down_read(mm->mmap_sem);
> > ret = install_breakpoint();
> > up_read(..);
> > // del from list etc..
> > //
> > if (ret && (ret == -ESRCH || ret == -EEXIST))
> > ret = 0;
> > }
> >
> > if (ret)
> > _unregister_uprobe();
> >
> > put_unlock:
> > mutex_unlock(&inode->i_mutex);
>
> You see, now this is a double unlock
Hmm, will correct this.
>
> > put_uprobe(uprobe);
> > return ret;
> > }
> >
> > void unregister_uprobe(...)
> > {
> > mutex_lock(&inode->i_mutex); // sync with register/unregister
> > uprobe = find_uprobe(); // ref++
> > _unregister_uprobe();
> > mutex_unlock(&inode->i_mutex);
>
> idem
>
> > put_uprobe(uprobe);
> > }
> >
> > int mmap_uprobe(struct vm_area_struct *vma)
> > {
> > struct list_head tmp_list;
> > struct uprobe *uprobe, *u;
> > struct mm_struct *mm;
> > struct inode *inode;
> > int ret = 0;
> >
> > if (!valid_vma(vma))
> > return ret; /* Bail-out */
> >
> > mm = vma->vm_mm;
> > inode = vma->vm_file->f_mapping->host;
> > if (inode->uprobes_count)
> > return ret;
> > __iget(inode);
> >
> > INIT_LIST_HEAD(&tmp_list);
> >
> > mutex_lock(&mapping->i_mmap_mutex);
> > add_to_temp_list(vma, inode, &tmp_list);
> > list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > loff_t vaddr;
> >
> > list_del(&uprobe->pending_list);
> > if (ret)
> > continue;
> >
> > vaddr = vma->vm_start + uprobe->offset;
> > vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > ret = install_breakpoint(mm, uprobe, vaddr);
>
> Right, so this is the problem, you cannot do allocations under
> i_mmap_mutex, however I think you can under i_mutex.
I didn't know that we cannot do allocations under i_mmap_mutex.
Why is this?
I can't take i_mutex, because we would already be holding
down_write(mmap_sem) here.
>
> > if (ret && (ret == -ESRCH || ret == -EEXIST))
> > ret = 0;
> > }
> >
> > mutex_unlock(&mapping->i_mmap_mutex);
> > iput(inode);
> > return ret;
> > }
> >
> > int munmap_uprobe(struct vm_area_struct *vma)
> > {
> > struct list_head tmp_list;
> > struct uprobe *uprobe, *u;
> > struct mm_struct *mm;
> > struct inode *inode;
> > int ret = 0;
> >
> > if (!valid_vma(vma))
> > return ret; /* Bail-out */
> >
> > mm = vma->vm_mm;
> > inode = vma->vm_file->f_mapping->host;
> > if (inode->uprobes_count)
> > return ret;
>
> Should that be !->uprobes_count?
Yes, it should be !inode->uprobes_count
(both here and in mmap_uprobe).
>
>
--
Thanks and Regards
Srikar
On Mon, 2011-06-27 at 12:15 +0530, Srikar Dronamraju wrote:
> > > mutex_lock(&mapping->i_mmap_mutex);
> > > add_to_temp_list(vma, inode, &tmp_list);
> > > list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > > loff_t vaddr;
> > >
> > > list_del(&uprobe->pending_list);
> > > if (ret)
> > > continue;
> > >
> > > vaddr = vma->vm_start + uprobe->offset;
> > > vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > ret = install_breakpoint(mm, uprobe, vaddr);
> >
> > Right, so this is the problem, you cannot do allocations under
> > i_mmap_mutex, however I think you can under i_mutex.
>
> I didnt know that we cannot do allocations under i_mmap_mutex.
> Why is this?
Because we try to take i_mmap_mutex during reclaim, when trying to unmap
pages. So suppose we do an allocation while holding i_mmap_mutex, find
there's no free memory, try to unmap a page in order to free it, and
we're stuck. See the timeline sketched below.
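As a rough interleaving (illustrative only, not from the patches):

        mmap_uprobe()                           direct reclaim (same context)
        -------------                           -----------------------------
        mutex_lock(&mapping->i_mmap_mutex)
        allocate, e.g. kmalloc(GFP_KERNEL)
          no free memory -> enter reclaim
                                                try to unmap a file page:
                                                mutex_lock(&mapping->i_mmap_mutex)
                                                already held by us -> stuck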
> I cant take i_mutex, because we would have already held
> down_write(mmap_sem) here.
Right. So can we add a lock to the uprobe? All we need is per-uprobe
serialization, right? A rough sketch of what I mean follows.
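Something along these lines, perhaps (a sketch of the suggestion only; the
field name for the lock is invented and only the relevant members are shown):

struct uprobe {
        struct rb_node          rb_node;
        atomic_t                ref;
        struct uprobe_consumer  *consumers;
        struct list_head        pending_list;
        struct mutex            lock;           /* serializes breakpoint
                                                 * install/remove for this probe */
        loff_t                  offset;
};

register/unregister and mmap_uprobe() would then take uprobe->lock around
install_breakpoint()/remove_breakpoint() for that one probe, instead of
relying on a global mutex or on i_mmap_mutex for the serialization.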