This patch set provides the basic infrastructure for user-space probes
based on IBM's Dprobes. Similar to kprobes, user-space probes uses the
breakpoint mechanism to insert probes. User has to write a kernel module
to insert probes in the executable/library and specify the handlers in
the kernel module. Using this mechanism user can not only log user-space
data structure present in the memory when probe was hit, but also log
kernel data structures, stack traces, system registers etc.
User-space tools like systemtap can use this interface to probe applications
and libraries.
Design and development of this user-space probe mechanism has been
discussed in systemtap mailing lists.
a) Features supported:
1. Probes on application executable.
2. Probes on libraries.
3. Probes are visible across fork().
4. Probes can be inserted even on executables/libraries
that are not even started running(pages not present in memory).
5. Multiple handlers can be inserted/removed for each probe.
b) Interfaces:
1. register_uprobe(struct uprobe *uprobe) : accepts a pointer to uprobe.
User has to allocate the uprobes structure and initialize following
elements:
pathname - points to the application's pathname
offset - offset of the probe from the file beginning;
[It's still the case that the user has to specify the offset as well
as the address (see TODO list)]
In case of library calls, the offset is the
relative offset from the beginning of the of
the mapped library.
kp.addr - virtual address within the executable.
kp.pre_handler - handler to be executed when probe is fired.
kp.post_handler - handler to be executed after single stepping
the original instruction.
kp.fault_handler- handler to be executed if fault occurs while
executing the original instruction or the
handlers.
As with a kprobe, the user should not modify the uprobe while it is
registered. This routine returns zero on successful registeration.
2. unregister_uprobe(struct uprobe *uprobe) : accepts a pointer to uprobe.
c) Objects:
struct uprobe - Allocated per probe by the user.
struct uprobe_module - Allocated per application by the userspace probe
mechanism.
struct uprobe {
/* pointer to the pathname of the application */
char *pathname;
/* kprobe structure with user specified handlers */
struct kprobe kp;
/* hlist of all the userspace probes per application */
struct hlist_node ulist;
/* inode of the probed application */
struct inode *inode;
/* probe offset within the file */
unsigned long offset;
};
struct uprobe_module {
/* hlist head of all userspace probes per application */
struct hlist_head ulist_head;
/* list of all uprobe_module for probed application */
struct list_head mlist;
/* to hold path/dentry etc. */
struct nameidata nd;
/* original readpage operations */
struct address_space_operations *ori_a_ops;
/* readpage hooks added operations */
struct address_space_operations user_a_ops;
};
d) Usage:
/* Allocate a uprobe structure */
struct uprobe p;
/* Define pre handler */
int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
.............collect useful data..............
return 0;
}
void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
.............collect useful data..............
}
int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
........ release allocated resources & try to recover ....
return 0;
}
Before inserting the probe, specify the pathname of the application
on which the probe is to be inserted.
/*pointer to the pathname of the application */
p.pathname = "/home/prasanna/bin/myapp";
p.kp.pre_handler=handler_pre;
p.kp.post_handler=handler_post;
p.kp.fault_handler=handler_fault;
/* Specify the probe address */
/* $nm appln |grep func1 */
p.kp.addr = (kprobe_opcode_t *)0x080484d4;
/* Specify the offset within the application/executable*/
p.offset = (unsigned long)0x4d4;
/* Now register the userspace probe */
if (ret = register_uprobe(&p))
printk("register_uprobe: unsuccessful ret= %d\n", ret);
/* To unregister the registered probed, just call..*/
unregister_uprobe(&p);
e) TODO List:
1. Execution of probe handlers are serialized using a uprobe_lock,
need to make them scalable.
2. Provide jprobes type mechansim to allow the handlers to run in user
mode.
3. Insert probes on copy-on-write pages. Tracks all COW pages for the
page containing the specified probe point and inserts/removes all
the probe points for that page.
4. Optimize the insertion of probes through readpage hooks. Identify
all the probes to be inserted on the read page and insert them at
once.
5. A wrapper routine to calculate the offset from the probed file
beginning. In case of dynamic shared library, the offset is
calculated by subtracting the beginning address of the file mapped
from the address of the probe point.
6. Probes are visible if a copy of the probed executable is made,
remove probes from the new copied image.
7. Make user-space probes coexist with other debuggers like gdb etc.
8. Support probepoints within MAX_INSN_SIZE bytes of the end of a
vm area. (Right now we just copy MAX_INSN_SIZE bytes and assume we
won't read of the end of the vm area.) This would be useful for
return probes, where we'd like to put the trampoline between
mm->end_code and the end of that page, if possible. Really, all we
need is 1 byte.
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
This patch provides the feature of inserting probes on pages that are
not present in the memory during registration.
To add readpage and readpages() hooks, two new elements are added to
the uprobe_module object:
struct address_space_operations *ori_a_ops;
struct address_space_operations user_a_ops;
User-space probes allows probes to be inserted even in pages that are
not present in the memory at the time of registration. This is done
by adding hooks to the readpage and readpages routines. During
registration, the address space operation object is modified by
substituting user-space probes's specific readpage and readpages
routines. When the pages are read into memory through the readpage and
readpages address space operations, any associated probes are
automatically inserted into those pages. These user-space probes
readpage and readpages routines internally call the original
readpage() and readpages() routines, and then check whether probes are
to be added to these pages, inserting probes as necessary. The
overhead of adding these hooks is limited to the application on which
the probes are inserted.
During unregistration, care should be taken to replace the readpage and
readpages hooks with the original routines if no probes remain on that
application.
Signed-of-by: Prasanna S Panchamukhi <[email protected]>
kernel/uprobes.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 121 insertions(+)
diff -puN kernel/uprobes.c~kprobes_userspace_probes-hook-readpage kernel/uprobes.c
--- linux-2.6.16-rc6-mm2/kernel/uprobes.c~kprobes_userspace_probes-hook-readpage 2006-03-20 10:49:14.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/kernel/uprobes.c 2006-03-20 10:49:14.000000000 +0530
@@ -302,6 +302,115 @@ static struct uprobe_module __kprobes *g
return NULL;
}
+static inline void insert_readpage_uprobe(struct page *page,
+ struct address_space *mapping, struct uprobe *uprobe)
+{
+ unsigned long page_start = page->index << PAGE_CACHE_SHIFT;
+ unsigned long page_end = page_start + PAGE_SIZE;
+
+ if ((uprobe->offset >= page_start) && (uprobe->offset < page_end)) {
+ map_uprobe_page(page, uprobe, insert_kprobe_user);
+ flush_vma(mapping, page, uprobe);
+ }
+}
+
+/**
+ * This function hooks the readpages() of all modules that have active
+ * probes on them. The original readpages() is called for the given
+ * inode/address_space to actually read the pages into the memory.
+ * Then all probes that are specified on these pages are inserted.
+ */
+static int __kprobes uprobe_readpages(struct file *file,
+ struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ int retval = 0;
+ struct page *page;
+ struct uprobe_module *umodule;
+ struct uprobe *uprobe = NULL;
+ struct hlist_node *node;
+
+ mutex_lock(&uprobe_mutex);
+
+ umodule = get_module_by_inode(file->f_dentry->d_inode);
+ if (!umodule) {
+ /*
+ * No module associated with this file, call the
+ * original readpages().
+ */
+ retval = mapping->a_ops->readpages(file, mapping,
+ pages, nr_pages);
+ goto out;
+ }
+
+ /* call original readpages() */
+ retval = umodule->ori_a_ops->readpages(file, mapping, pages, nr_pages);
+ if (retval < 0)
+ goto out;
+
+ /*
+ * TODO: Walk through readpages page list and get
+ * pages with probes instead of find_get_page().
+ */
+ hlist_for_each_entry(uprobe, node, &umodule->ulist_head, ulist) {
+ page = find_get_page(mapping,
+ uprobe->offset >> PAGE_CACHE_SHIFT);
+ if (!page)
+ continue;
+
+ if (!uprobe->kp.opcode)
+ insert_readpage_uprobe(page, mapping, uprobe);
+ page_cache_release(page);
+ }
+
+out:
+ mutex_unlock(&uprobe_mutex);
+
+ return retval;
+}
+
+/**
+ * This function hooks the readpage() of all modules that have active
+ * probes on them. The original readpage() is called for the given
+ * inode/address_space to actually read the pages into the memory.
+ * Then all probes that are specified on this page are inserted.
+ */
+int __kprobes uprobe_readpage(struct file *file, struct page *page)
+{
+ int retval = 0;
+ struct uprobe_module *umodule;
+ struct uprobe *uprobe = NULL;
+ struct hlist_node *node;
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+
+ mutex_lock(&uprobe_mutex);
+
+ umodule = get_module_by_inode(file->f_dentry->d_inode);
+ if (!umodule) {
+ /*
+ * No module associated with this file, call the
+ * original readpage().
+ */
+ retval = mapping->a_ops->readpage(file, page);
+ goto out;
+ }
+
+ /* call original readpage() */
+ retval = umodule->ori_a_ops->readpage(file, page);
+ if (retval < 0)
+ goto out;
+
+ hlist_for_each_entry(uprobe, node, &umodule->ulist_head, ulist) {
+ if (!uprobe->kp.opcode)
+ insert_readpage_uprobe(page, mapping, uprobe);
+ }
+
+out:
+ mutex_unlock(&uprobe_mutex);
+
+ return retval;
+}
+
/**
* Gets exclusive write access to the given inode to ensure that the file
* on which probes are currently applied does not change. Use the function,
@@ -324,13 +433,23 @@ static inline int ex_write_unlock(struct
/**
* Add uprobe and uprobe_module to the appropriate hash list.
+ * Also swithces i_op to hooks into readpage and readpages().
*/
static void __kprobes get_inode_ops(struct uprobe *uprobe,
struct uprobe_module *umodule)
{
+ struct address_space *as;
+
INIT_HLIST_HEAD(&umodule->ulist_head);
hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
list_add(&umodule->mlist, &uprobe_module_list);
+ as = umodule->nd.dentry->d_inode->i_mapping;
+ umodule->ori_a_ops = as->a_ops;
+ umodule->user_a_ops = *as->a_ops;
+ umodule->user_a_ops.readpage = uprobe_readpage;
+ umodule->user_a_ops.readpages = uprobe_readpages;
+ as->a_ops = &umodule->user_a_ops;
+
}
/*
@@ -459,6 +578,8 @@ void __kprobes unregister_uprobe(struct
hlist_del(&uprobe->ulist);
if (hlist_empty(&umodule->ulist_head)) {
list_del(&umodule->mlist);
+ umodule->nd.dentry->d_inode->i_mapping->a_ops =
+ umodule->ori_a_ops;
ex_write_unlock(uprobe->inode);
path_release(&umodule->nd);
kfree(umodule);
_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
This patch provides two interfaces to insert and remove
user space probes. Each probe is uniquely identified by
inode and offset within that executable/library file.
Insertion of a probe involves getting the code page for
a given offset, mapping it into the memory and then insert
the breakpoint at the given offset. Also the probe is added
to the uprobe_table hash list. A uprobe_module data strcture
is allocated for every probed application/library image on disk.
Removal of a probe involves getting the code page for a given
offset, mapping that page into the memory and then replacing
the breakpoint instruction with a the original opcode.
This patch also provides aggrigate probe handler feature,
where user can define multiple handlers per probe.
Signed-off-by : Prasanna S Panchamukhi <[email protected]>
arch/i386/kernel/uprobes.c | 71 +++++
fs/namei.c | 13
include/linux/kprobes.h | 51 +++
include/linux/namei.h | 1
kernel/Makefile | 2
kernel/kprobes.c | 3
kernel/uprobes.c | 589 +++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 725 insertions(+), 5 deletions(-)
diff -puN include/linux/kprobes.h~kprobes_userspace_probes-base-interface include/linux/kprobes.h
--- linux-2.6.16-rc6-mm2/include/linux/kprobes.h~kprobes_userspace_probes-base-interface 2006-03-20 10:34:58.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/include/linux/kprobes.h 2006-03-20 10:35:29.000000000 +0530
@@ -37,6 +37,10 @@
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/dcache.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
#ifdef CONFIG_KPROBES
#include <asm/kprobes.h>
@@ -54,6 +58,7 @@ struct kprobe;
struct pt_regs;
struct kretprobe;
struct kretprobe_instance;
+extern struct uprobe *current_uprobe;
typedef int (*kprobe_pre_handler_t) (struct kprobe *, struct pt_regs *);
typedef int (*kprobe_break_handler_t) (struct kprobe *, struct pt_regs *);
typedef void (*kprobe_post_handler_t) (struct kprobe *, struct pt_regs *,
@@ -117,6 +122,32 @@ struct jprobe {
DECLARE_PER_CPU(struct kprobe *, current_kprobe);
DECLARE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
+struct uprobe {
+ /* pointer to the pathname of the application */
+ char *pathname;
+ /* kprobe structure with user specified handlers */
+ struct kprobe kp;
+ /* hlist of all the userspace probes per application */
+ struct hlist_node ulist;
+ /* inode of the probed application */
+ struct inode *inode;
+ /* probe offset within the file */
+ unsigned long offset;
+};
+
+struct uprobe_module {
+ /* hlist head of all userspace probes per application */
+ struct hlist_head ulist_head;
+ /* list of all uprobe_module for probed application */
+ struct list_head mlist;
+ /* to hold path/dentry etc. */
+ struct nameidata nd;
+ /* original readpage operations */
+ struct address_space_operations *ori_a_ops;
+ /* readpage hooks added operations */
+ struct address_space_operations user_a_ops;
+};
+
#ifdef ARCH_SUPPORTS_KRETPROBES
extern void arch_prepare_kretprobe(struct kretprobe *rp, struct pt_regs *regs);
#else /* ARCH_SUPPORTS_KRETPROBES */
@@ -162,9 +193,16 @@ extern void show_registers(struct pt_reg
extern kprobe_opcode_t *get_insn_slot(void);
extern void free_insn_slot(kprobe_opcode_t *slot);
extern void kprobes_inc_nmissed_count(struct kprobe *p);
+extern void copy_kprobe(struct kprobe *old_p, struct kprobe *p);
+extern int arch_copy_uprobe(struct kprobe *p, kprobe_opcode_t *address);
+extern void arch_arm_uprobe(kprobe_opcode_t *address);
+extern void arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address);
+extern void init_uprobes(void);
/* Get the kprobe at this addr (if any) - called with preemption disabled */
struct kprobe *get_kprobe(void *addr);
+struct kprobe *get_uprobe(void *addr);
+extern int arch_alloc_insn(struct kprobe *p);
struct hlist_head * kretprobe_inst_table_head(struct task_struct *tsk);
/* kprobe_running() will just return the current_kprobe on this CPU */
@@ -183,6 +221,16 @@ static inline struct kprobe_ctlblk *get_
return (&__get_cpu_var(kprobe_ctlblk));
}
+static inline void set_uprobe_instance(struct kprobe *p)
+{
+ current_uprobe = container_of(p, struct uprobe, kp);
+}
+
+static inline void reset_uprobe_instance(void)
+{
+ current_uprobe = NULL;
+}
+
int register_kprobe(struct kprobe *p);
void unregister_kprobe(struct kprobe *p);
int setjmp_pre_handler(struct kprobe *, struct pt_regs *);
@@ -194,6 +242,9 @@ void jprobe_return(void);
int register_kretprobe(struct kretprobe *rp);
void unregister_kretprobe(struct kretprobe *rp);
+int register_uprobe(struct uprobe *uprobe);
+void unregister_uprobe(struct uprobe *uprobe);
+
struct kretprobe_instance *get_free_rp_inst(struct kretprobe *rp);
void add_rp_inst(struct kretprobe_instance *ri);
void kprobe_flush_task(struct task_struct *tk);
diff -puN /dev/null kernel/uprobes.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/kernel/uprobes.c 2006-03-20 10:49:20.000000000 +0530
@@ -0,0 +1,589 @@
+/*
+ * User-space Probes (UProbes)
+ * kernel/uprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * 2006-Mar Created by Prasanna S Panchamukhi <[email protected]>
+ * User-space probes initial implementation based on IBM's
+ * Dprobes.
+ */
+#include <linux/kprobes.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/moduleloader.h>
+#include <asm-generic/sections.h>
+#include <asm/cacheflush.h>
+#include <asm/errno.h>
+#include <asm/kdebug.h>
+
+#define UPROBE_HASH_BITS 6
+#define UPROBE_TABLE_SIZE (1 << UPROBE_HASH_BITS)
+
+/* user space probes lists */
+static struct list_head uprobe_module_list;
+static struct hlist_head uprobe_table[UPROBE_TABLE_SIZE];
+DEFINE_SPINLOCK(uprobe_lock); /* Protects uprobe_table*/
+DEFINE_MUTEX(uprobe_mutex); /* Protects uprobe_module_table */
+
+/*
+ * Aggregate handlers for multiple uprobes support - these handlers
+ * take care of invoking the individual uprobe handlers on p->list
+ */
+static int __kprobes aggr_user_pre_handler(struct kprobe *p,
+ struct pt_regs *regs)
+{
+ struct kprobe *kp;
+
+ list_for_each_entry(kp, &p->list, list) {
+ if (kp->pre_handler) {
+ set_uprobe_instance(kp);
+ if (kp->pre_handler(kp, regs))
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static void __kprobes aggr_user_post_handler(struct kprobe *p,
+ struct pt_regs *regs, unsigned long flags)
+{
+ struct kprobe *kp;
+
+ list_for_each_entry(kp, &p->list, list) {
+ if (kp->post_handler) {
+ set_uprobe_instance(kp);
+ kp->post_handler(kp, regs, flags);
+ }
+ }
+}
+
+static int __kprobes aggr_user_fault_handler(struct kprobe *p,
+ struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur;
+
+ /*
+ * if we faulted "during" the execution of a user specified
+ * probe handler, invoke just that probe's fault handler
+ */
+ cur = ¤t_uprobe->kp;
+ if (cur && cur->fault_handler)
+ if (cur->fault_handler(cur, regs, trapnr))
+ return 1;
+ return 0;
+}
+
+/**
+ * This routine looks for an existing uprobe at the given offset and inode.
+ * If it's found, returns the corresponding kprobe pointer.
+ * This should be called with uprobe_lock held.
+ */
+static struct kprobe __kprobes *get_kprobe_user(struct inode *inode,
+ unsigned long offset)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct kprobe *p, *kpr;
+ struct uprobe *uprobe;
+
+ head = &uprobe_table[hash_ptr((kprobe_opcode_t *)
+ (((unsigned long)inode) * offset), UPROBE_HASH_BITS)];
+
+ hlist_for_each_entry(p, node, head, hlist) {
+ if (p->pre_handler == aggr_user_pre_handler) {
+ kpr = list_entry(p->list.next, typeof(*kpr), list);
+ uprobe = container_of(kpr, struct uprobe, kp);
+ } else
+ uprobe = container_of(p, struct uprobe, kp);
+
+ if ((uprobe->inode == inode) && (uprobe->offset == offset))
+ return p;
+ }
+
+ return NULL;
+}
+
+/**
+ * Finds a uprobe at the specified user-space address in the current task.
+ * Points current_uprobe at that uprobe and returns the corresponding kprobe.
+ */
+struct kprobe __kprobes *get_uprobe(void *addr)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ struct inode *inode;
+ unsigned long offset;
+ struct kprobe *p, *kpr;
+ struct uprobe *uprobe;
+
+ vma = find_vma(mm, (unsigned long)addr);
+
+ BUG_ON(!vma); /* this should not happen, not in our memory map */
+
+ offset = (unsigned long)addr - (vma->vm_start +
+ (vma->vm_pgoff << PAGE_SHIFT));
+ if (!vma->vm_file)
+ return NULL;
+
+ inode = vma->vm_file->f_dentry->d_inode;
+
+ p = get_kprobe_user(inode, offset);
+ if (!p)
+ return NULL;
+
+ if (p->pre_handler == aggr_user_pre_handler) {
+ /*
+ * Walk the uprobe aggregate list and return firt
+ * element on aggregate list.
+ */
+ kpr = list_entry((p)->list.next, typeof(*kpr), list);
+ uprobe = container_of(kpr, struct uprobe, kp);
+ } else
+ uprobe = container_of(p, struct uprobe, kp);
+
+ if (uprobe)
+ current_uprobe = uprobe;
+
+ return p;
+}
+
+/*
+ * Fill in the required fields of the "manager uprobe". Replace the
+ * earlier kprobe in the hlist with the manager uprobe
+ */
+static inline void add_aggr_uprobe(struct kprobe *ap, struct kprobe *p)
+{
+ copy_kprobe(p, ap);
+ ap->addr = p->addr;
+ ap->pre_handler = aggr_user_pre_handler;
+ ap->post_handler = aggr_user_post_handler;
+ ap->fault_handler = aggr_user_fault_handler;
+
+ INIT_LIST_HEAD(&ap->list);
+ list_add(&p->list, &ap->list);
+
+ hlist_replace_rcu(&p->hlist, &ap->hlist);
+}
+
+/*
+ * This is the second or subsequent uprobe at the address - handle
+ * the intricacies
+ */
+static int __kprobes register_aggr_uprobe(struct kprobe *old_p,
+ struct kprobe *p)
+{
+ struct kprobe *ap;
+
+ if (old_p->pre_handler == aggr_user_pre_handler) {
+ copy_kprobe(old_p, p);
+ list_add(&p->list, &old_p->list);
+ } else {
+ ap = kzalloc(sizeof(struct kprobe), GFP_ATOMIC);
+ if (!ap)
+ return -ENOMEM;
+ add_aggr_uprobe(ap, old_p);
+ copy_kprobe(ap, p);
+ list_add(&p->list, &old_p->list);
+ }
+ return 0;
+}
+
+typedef int (*process_uprobe_func_t)(struct uprobe *uprobe,
+ kprobe_opcode_t *address);
+
+/**
+ * Saves the original instruction in the uprobe structure and
+ * inserts the breakpoint at the given address.
+ */
+int __kprobes insert_kprobe_user(struct uprobe *uprobe,
+ kprobe_opcode_t *address)
+{
+ int ret = 0;
+
+ ret = arch_copy_uprobe(&uprobe->kp, address);
+ if (ret) {
+ printk("Breakpoint already present\n");
+ return ret;
+ }
+ arch_arm_uprobe(address);
+
+ return 0;
+}
+
+/**
+ * Wait for the page to be unlocked if someone else had locked it,
+ * then map the page and insert or remove the breakpoint.
+ */
+static int __kprobes map_uprobe_page(struct page *page, struct uprobe *uprobe,
+ process_uprobe_func_t process_kprobe_user)
+{
+ int ret = 0;
+ kprobe_opcode_t *uprobe_address;
+
+ if (!page)
+ return -EINVAL; /* TODO: more suitable errno */
+
+ wait_on_page_locked(page);
+ /* could probably retry readpage here. */
+ if (!PageUptodate(page))
+ return -EINVAL; /* TODO: more suitable errno */
+
+ lock_page(page);
+
+ uprobe_address = (kprobe_opcode_t *)kmap(page);
+ uprobe_address = (kprobe_opcode_t *)((unsigned long)uprobe_address +
+ (uprobe->offset & ~PAGE_MASK));
+ ret = (*process_kprobe_user)(uprobe, uprobe_address);
+ kunmap(page);
+
+ unlock_page(page);
+
+ return ret;
+}
+
+/**
+ * flush_vma walks through the list of process private mappings,
+ * gets the vma containing the offset and flushes all the vmas
+ * containing the probed page.
+ */
+static void __kprobes flush_vma(struct address_space *mapping,
+ struct page *page, struct uprobe *uprobe)
+{
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
+ struct prio_tree_root *head = &mapping->i_mmap;
+ struct mm_struct *mm;
+ unsigned long start, end, offset = uprobe->offset;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, head, offset, offset) {
+ mm = vma->vm_mm;
+ start = vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT);
+ end = vma->vm_end - (vma->vm_pgoff << PAGE_SHIFT);
+
+ if ((start + offset) < end)
+ flush_icache_user_range(vma, page,
+ (unsigned long)uprobe->kp.addr,
+ sizeof(kprobe_opcode_t));
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
+/**
+ * Walk the uprobe_module_list and return the uprobe module with matching
+ * inode.
+ */
+static struct uprobe_module __kprobes *get_module_by_inode(struct inode *inode)
+{
+ struct uprobe_module *umodule;
+
+ list_for_each_entry(umodule, &uprobe_module_list, mlist) {
+ if (umodule->nd.dentry->d_inode == inode)
+ return umodule;
+ }
+
+ return NULL;
+}
+
+/**
+ * Gets exclusive write access to the given inode to ensure that the file
+ * on which probes are currently applied does not change. Use the function,
+ * deny_write_access_to_inode() we added in fs/namei.c.
+ */
+static inline int ex_write_lock(struct inode *inode)
+{
+ return deny_write_access_to_inode(inode);
+}
+
+/**
+ * Called when removing user space probes to release the write lock on the
+ * inode.
+ */
+static inline int ex_write_unlock(struct inode *inode)
+{
+ atomic_inc(&inode->i_writecount);
+ return 0;
+}
+
+/**
+ * Add uprobe and uprobe_module to the appropriate hash list.
+ */
+static void __kprobes get_inode_ops(struct uprobe *uprobe,
+ struct uprobe_module *umodule)
+{
+ INIT_HLIST_HEAD(&umodule->ulist_head);
+ hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
+ list_add(&umodule->mlist, &uprobe_module_list);
+}
+
+/*
+ * Removes the specified uprobe from either aggregate uprobe list
+ * or individual uprobe hash table.
+ */
+
+static int __kprobes remove_uprobe(struct uprobe *uprobe)
+{
+ struct kprobe *old_p, *list_p, *p;
+ int ret = 0;
+
+ p = &uprobe->kp;
+ old_p = get_kprobe_user(uprobe->inode, uprobe->offset);
+ if (unlikely(!old_p))
+ return 0;
+
+ if (p != old_p) {
+ list_for_each_entry(list_p, &old_p->list, list)
+ if (list_p == p)
+ /* kprobe p is a valid probe */
+ goto valid_p;
+ return 0;
+ }
+
+valid_p:
+ if ((old_p == p) ||
+ ((old_p->pre_handler == aggr_user_pre_handler) &&
+ (p->list.next == &old_p->list) &&
+ (p->list.prev == &old_p->list))) {
+ /*
+ * Only probe on the hash list, mark the corresponding
+ * instruction slot for freeing by return 1.
+ */
+ ret = 1;
+ hlist_del(&old_p->hlist);
+ if (p != old_p) {
+ list_del(&p->list);
+ kfree(old_p);
+ }
+ } else
+ list_del(&p->list);
+
+ return ret;
+}
+
+/*
+ * Disarms the probe and frees the corresponding instruction slot.
+ */
+static int __kprobes remove_kprobe_user(struct uprobe *uprobe,
+ kprobe_opcode_t *address)
+{
+ struct kprobe *p = &uprobe->kp;
+
+ arch_disarm_uprobe(p, address);
+ arch_remove_kprobe(p);
+
+ return 0;
+}
+
+/*
+ * Adds the given uprobe to the uprobe_hash table if it is
+ * the first probe to be inserted at the given address else
+ * adds to the aggregate uprobe's list.
+ */
+static int __kprobes insert_uprobe(struct uprobe *uprobe)
+{
+ struct kprobe *old_p;
+ int ret = 0;
+ unsigned long offset = uprobe->offset;
+ unsigned long inode = (unsigned long) uprobe->inode;
+ struct hlist_head *head;
+ unsigned long flags;
+
+ spin_lock_irqsave(&uprobe_lock, flags);
+ uprobe->kp.nmissed = 0;
+
+ old_p = get_kprobe_user(uprobe->inode, uprobe->offset);
+
+ if (old_p)
+ register_aggr_uprobe(old_p, &uprobe->kp);
+ else {
+ head = &uprobe_table[hash_ptr((kprobe_opcode_t *)
+ (offset * inode), UPROBE_HASH_BITS)];
+ INIT_HLIST_NODE(&uprobe->kp.hlist);
+ hlist_add_head(&uprobe->kp.hlist, head);
+ /*
+ * The original instruction must be copied into the instruction
+ * slot, hence return 1.
+ */
+ ret = 1;
+ }
+
+ spin_unlock_irqrestore(&uprobe_lock, flags);
+
+ return ret;
+}
+
+/**
+ * unregister_uprobe: Disarms the probe, removes the uprobe
+ * pointers from the hash list and unhooks readpage routines.
+ */
+void __kprobes unregister_uprobe(struct uprobe *uprobe)
+{
+ struct address_space *mapping;
+ struct uprobe_module *umodule;
+ struct page *page;
+ unsigned long flags;
+ int ret = 0;
+
+ if (!uprobe->inode)
+ return;
+
+ mapping = uprobe->inode->i_mapping;
+
+ page = find_get_page(mapping, uprobe->offset >> PAGE_CACHE_SHIFT);
+
+ spin_lock_irqsave(&uprobe_lock, flags);
+ ret = remove_uprobe(uprobe);
+ spin_unlock_irqrestore(&uprobe_lock, flags);
+
+ mutex_lock(&uprobe_mutex);
+ if (!(umodule = get_module_by_inode(uprobe->inode)))
+ goto out;
+
+ hlist_del(&uprobe->ulist);
+ if (hlist_empty(&umodule->ulist_head)) {
+ list_del(&umodule->mlist);
+ ex_write_unlock(uprobe->inode);
+ path_release(&umodule->nd);
+ kfree(umodule);
+ }
+
+out:
+ mutex_unlock(&uprobe_mutex);
+ if (ret)
+ ret = map_uprobe_page(page, uprobe, remove_kprobe_user);
+
+ if (ret == -EINVAL)
+ return;
+ /*
+ * TODO: unregister_uprobe should not fail, need to handle
+ * if it fails.
+ */
+ flush_vma(mapping, page, uprobe);
+
+ if (page)
+ page_cache_release(page);
+}
+
+/**
+ * register_uprobe(): combination of inode and offset is used to
+ * identify each probe uniquely. Each uprobe can be found from the
+ * uprobes_hash table by using inode and offset. register_uprobe(),
+ * inserts the breakpoint at the given address by locating and mapping
+ * the page. return 0 on success and error on failure.
+ */
+int __kprobes register_uprobe(struct uprobe *uprobe)
+{
+ struct address_space *mapping;
+ struct uprobe_module *umodule = NULL;
+ struct inode *inode;
+ struct nameidata nd;
+ struct page *page;
+ int error = 0;
+
+ INIT_HLIST_NODE(&uprobe->ulist);
+
+ /*
+ * TODO: Need to calculate the absolute file offset for dynamic
+ * shared libraries.
+ */
+ if ((error = path_lookup(uprobe->pathname, LOOKUP_FOLLOW, &nd)))
+ return error;
+
+ mutex_lock(&uprobe_mutex);
+
+ inode = nd.dentry->d_inode;
+ error = ex_write_lock(inode);
+ if (error)
+ goto out;
+
+ /*
+ * Check if there are probes already on this application and
+ * add the corresponding uprobe to per application probe's list.
+ */
+ umodule = get_module_by_inode(inode);
+ if (!umodule) {
+
+ error = arch_alloc_insn(&uprobe->kp);
+ if (error)
+ goto out;
+
+ /*
+ * Allocate a uprobe_module structure for this
+ * application if not allocated before.
+ */
+ umodule = kzalloc(sizeof(struct uprobe_module), GFP_KERNEL);
+ if (!umodule) {
+ error = -ENOMEM;
+ ex_write_unlock(inode);
+ arch_remove_kprobe(&uprobe->kp);
+ goto out;
+ }
+ memcpy(&umodule->nd, &nd, sizeof(struct nameidata));
+ get_inode_ops(uprobe, umodule);
+ } else {
+ path_release(&nd);
+ ex_write_unlock(inode);
+ hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
+ }
+ mutex_unlock(&uprobe_mutex);
+
+ uprobe->inode = inode;
+ mapping = inode->i_mapping;
+ page = find_get_page(mapping, (uprobe->offset >> PAGE_CACHE_SHIFT));
+
+ if (insert_uprobe(uprobe))
+ error = map_uprobe_page(page, uprobe, insert_kprobe_user);
+
+ /*
+ * If error == -EINVAL, return success, probes will inserted by
+ * readpage hooks.
+ * TODO: Use a more suitable errno?
+ */
+ if (error == -EINVAL)
+ error = 0;
+ flush_vma(mapping, page, uprobe);
+
+ if (page)
+ page_cache_release(page);
+
+ return error;
+out:
+ path_release(&nd);
+ mutex_unlock(&uprobe_mutex);
+
+ return error;
+}
+
+void init_uprobes(void)
+{
+ int i;
+
+ /* FIXME allocate the probe table, currently defined statically */
+ /* initialize all list heads */
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++)
+ INIT_HLIST_HEAD(&uprobe_table[i]);
+
+ INIT_LIST_HEAD(&uprobe_module_list);
+}
+
+EXPORT_SYMBOL_GPL(register_uprobe);
+EXPORT_SYMBOL_GPL(unregister_uprobe);
+
+
diff -puN /dev/null arch/i386/kernel/uprobes.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/arch/i386/kernel/uprobes.c 2006-03-20 10:35:28.000000000 +0530
@@ -0,0 +1,71 @@
+/*
+ * User-space Probes (UProbes)
+ * arch/i386/kernel/uprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006.
+ *
+ * 2006-Mar Created by Prasanna S Panchamukhi <[email protected]>
+ * User-space probes initial implementation based on IBM's
+ * Dprobes.
+ */
+
+#include <linux/config.h>
+#include <linux/kprobes.h>
+#include <linux/ptrace.h>
+#include <linux/preempt.h>
+#include <asm/cacheflush.h>
+#include <asm/kdebug.h>
+#include <asm/desc.h>
+
+int __kprobes arch_alloc_insn(struct kprobe *p)
+{
+ mutex_lock(&kprobe_mutex);
+ p->ainsn.insn = get_insn_slot();
+ mutex_unlock(&kprobe_mutex);
+
+ if (!p->ainsn.insn)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
+{
+ if (p->opcode != BREAKPOINT_INSTRUCTION)
+ *address = p->opcode;
+}
+
+void __kprobes arch_arm_uprobe(kprobe_opcode_t *address)
+{
+ *address = BREAKPOINT_INSTRUCTION;
+}
+
+int __kprobes arch_copy_uprobe(struct kprobe *p, kprobe_opcode_t *address)
+{
+ int ret = 1;
+
+ /*
+ * TODO: Check if the given address is a valid to access user memory.
+ */
+ if (*address != BREAKPOINT_INSTRUCTION) {
+ memcpy(p->ainsn.insn, address, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+ ret = 0;
+ }
+ p->opcode = *(kprobe_opcode_t *)address;
+
+ return ret;
+}
diff -puN kernel/kprobes.c~kprobes_userspace_probes-base-interface kernel/kprobes.c
--- linux-2.6.16-rc6-mm2/kernel/kprobes.c~kprobes_userspace_probes-base-interface 2006-03-20 10:34:58.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/kernel/kprobes.c 2006-03-20 10:34:58.000000000 +0530
@@ -356,7 +356,7 @@ static inline void free_rp_inst(struct k
/*
* Keep all fields in the kprobe consistent
*/
-static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
+void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
{
memcpy(&p->opcode, &old_p->opcode, sizeof(kprobe_opcode_t));
memcpy(&p->ainsn, &old_p->ainsn, sizeof(struct arch_specific_insn));
@@ -650,6 +650,7 @@ static int __init init_kprobes(void)
INIT_HLIST_HEAD(&kretprobe_inst_table[i]);
}
+ init_uprobes();
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
diff -puN include/linux/namei.h~kprobes_userspace_probes-base-interface include/linux/namei.h
--- linux-2.6.16-rc6-mm2/include/linux/namei.h~kprobes_userspace_probes-base-interface 2006-03-20 10:34:58.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/include/linux/namei.h 2006-03-20 10:34:58.000000000 +0530
@@ -81,6 +81,7 @@ extern int follow_up(struct vfsmount **,
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
+extern int deny_write_access_to_inode(struct inode *inode);
static inline void nd_set_link(struct nameidata *nd, char *path)
{
diff -puN fs/namei.c~kprobes_userspace_probes-base-interface fs/namei.c
--- linux-2.6.16-rc6-mm2/fs/namei.c~kprobes_userspace_probes-base-interface 2006-03-20 10:34:58.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/fs/namei.c 2006-03-20 10:34:58.000000000 +0530
@@ -322,10 +322,9 @@ int get_write_access(struct inode * inod
return 0;
}
-int deny_write_access(struct file * file)
+/* This routine increments the writecount for a given inode. */
+int deny_write_access_to_inode(struct inode *inode)
{
- struct inode *inode = file->f_dentry->d_inode;
-
spin_lock(&inode->i_lock);
if (atomic_read(&inode->i_writecount) > 0) {
spin_unlock(&inode->i_lock);
@@ -337,6 +336,14 @@ int deny_write_access(struct file * file
return 0;
}
+/* Wrapper routine that increments the writecount for a given file pointer. */
+int deny_write_access(struct file * file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+
+ return deny_write_access_to_inode(inode);
+}
+
void path_release(struct nameidata *nd)
{
dput(nd->dentry);
diff -puN kernel/Makefile~kprobes_userspace_probes-base-interface kernel/Makefile
--- linux-2.6.16-rc6-mm2/kernel/Makefile~kprobes_userspace_probes-base-interface 2006-03-20 10:34:58.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/kernel/Makefile 2006-03-20 10:34:58.000000000 +0530
@@ -31,7 +31,7 @@ obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
-obj-$(CONFIG_KPROBES) += kprobes.o
+obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
obj-$(CONFIG_SYSFS) += ksysfs.o
obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
This patch provides a mechanism for probe handling and
executing the user-specified handlers.
Each userspace probe is uniquely identified by the combination of
inode and offset, hence during registeration the inode and offset
combination is added to uprobes hash table. Initially when
breakpoint instruction is hit, the uprobes hash table is looked up
for matching inode and offset. The pre_handlers are called in
sequence if multiple probes are registered. Similar to kprobes,
uprobes also adopts to single step out-of-line, so that probe miss in
SMP environment can be avoided. But for userspace probes, instruction
copied into kernel address space cannot be single stepped, hence the
instruction must be copied to user address space. The solution is to
find free space in the current process address space and then copy the
original instruction and single step that instruction.
User processes use stack space to store local variables, agruments and
return values. Normally the stack space either below or above the
stack pointer indicates the free stack space.
The instruction to be single stepped can modify the stack space,
hence before using the free stack space, sufficient stack space should
be left. The instruction is copied to the bottom of the page and check
is made such that the copied instruction does not cross the page
boundry. The copied instruction is then single stepped. Several
architectures does not allow the instruction to be executed from the
stack location, since no-exec bit is set for the stack pages. In those
architectures, the page table entry corresponding to the stack page is
identified and the no-exec bit is unset making the instruction on that
stack page to be executed.
There are situations where even the free stack space is not enough for
the user instruction to be copied and single stepped. In such
situations, the virtual memory area(vma) can be expanded beyond the
current stack vma. This expaneded stack can be used to copy the
original instruction and single step out-of-line.
Even if the vma cannot be extended then the instruction much be
executed inline, by replacing the breakpoint instruction with original
instruction.
Signed-off-by: Prasanna S Panchamukhi <[email protected]>
arch/i386/kernel/Makefile | 2
arch/i386/kernel/kprobes.c | 4
arch/i386/kernel/uprobes.c | 537 +++++++++++++++++++++++++++++++++++++++++++++
arch/i386/mm/fault.c | 3
include/asm-i386/kprobes.h | 20 +
include/linux/kprobes.h | 8
6 files changed, 569 insertions(+), 5 deletions(-)
diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line include/asm-i386/kprobes.h
--- linux-2.6.16-rc6-mm2/include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/include/asm-i386/kprobes.h 2006-03-20 10:49:29.000000000 +0530
@@ -26,6 +26,7 @@
*/
#include <linux/types.h>
#include <linux/ptrace.h>
+#include <asm/cacheflush.h>
#define __ARCH_WANT_KPROBES_INSN_SLOT
@@ -77,6 +78,19 @@ struct kprobe_ctlblk {
struct prev_kprobe prev_kprobe;
};
+/* per user probe control block */
+struct uprobe_ctlblk {
+ unsigned long uprobe_status;
+ unsigned long uprobe_saved_eflags;
+ unsigned long uprobe_old_eflags;
+ unsigned long singlestep_addr;
+ unsigned long flags;
+ struct kprobe *curr_p;
+ pte_t *upte;
+ struct page *upage;
+ struct task_struct *tsk;
+};
+
/* trap3/1 are intr gates for kprobes. So, restore the status of IF,
* if necessary, before executing the original int3/1 (trap) handler.
*/
@@ -88,4 +102,10 @@ static inline void restore_interrupts(st
extern int kprobe_exceptions_notify(struct notifier_block *self,
unsigned long val, void *data);
+extern int uprobe_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data);
+extern unsigned long get_segment_eip(struct pt_regs *regs,
+ unsigned long *eip_limit);
+extern int is_IF_modifier(kprobe_opcode_t opcode);
+
#endif /* _ASM_KPROBES_H */
diff -puN arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/uprobes.c
--- linux-2.6.16-rc6-mm2/arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/arch/i386/kernel/uprobes.c 2006-03-20 10:56:23.000000000 +0530
@@ -30,6 +30,10 @@
#include <asm/cacheflush.h>
#include <asm/kdebug.h>
#include <asm/desc.h>
+#include <asm/uaccess.h>
+
+static struct uprobe_ctlblk uprobe_ctlblk;
+struct uprobe *current_uprobe;
int __kprobes arch_alloc_insn(struct kprobe *p)
{
@@ -69,3 +73,536 @@ int __kprobes arch_copy_uprobe(struct kp
return ret;
}
+
+/**
+ * This routines get the pte of the page containing the specified address.
+ */
+static pte_t __kprobes *get_uprobe_pte(unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte = NULL;
+
+ pgd = pgd_offset(current->mm, address);
+ if (!pgd)
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud)
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd)
+ goto out;
+
+ pte = pte_alloc_map(current->mm, pmd, address);
+
+out:
+ return pte;
+}
+
+/**
+ * This routine check for space in the current process's stack
+ * address space. If enough address space is found, copy the original
+ * instruction on that page for single stepping out-of-line.
+ */
+static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
+ struct pt_regs *regs, struct vm_area_struct *vma)
+{
+ unsigned long addr, stack_addr = regs->esp;
+ int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+
+ if (vma->vm_flags & VM_GROWSDOWN) {
+ if (((stack_addr - sizeof(long long))) <
+ (vma->vm_start + size))
+ return -ENOMEM;
+ addr = vma->vm_start;
+ } else if (vma->vm_flags & VM_GROWSUP) {
+ if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
+ return -ENOMEM;
+ addr = vma->vm_end - size;
+ } else
+ return -EFAULT;
+
+ vma->vm_flags |= VM_LOCKED;
+
+ if (__copy_to_user_inatomic((unsigned long *)addr,
+ (unsigned long *)uprobe->kp.ainsn.insn, size))
+ return -EFAULT;
+
+ regs->eip = addr;
+
+ return 0;
+}
+
+/**
+ * This routine expands the stack beyond the present process address
+ * space and copies the instruction to that location, so that
+ * processor can single step out-of-line.
+ */
+static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
+ struct pt_regs *regs, struct vm_area_struct *vma)
+{
+ unsigned long addr, vm_addr;
+ int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+ struct vm_area_struct *new_vma;
+ struct mm_struct *mm = current->mm;
+
+
+ if (!down_read_trylock(¤t->mm->mmap_sem))
+ return -ENOMEM;
+
+ if (vma->vm_flags & VM_GROWSDOWN)
+ vm_addr = vma->vm_start - size;
+ else if (vma->vm_flags & VM_GROWSUP)
+ vm_addr = vma->vm_end + size;
+ else {
+ up_read(¤t->mm->mmap_sem);
+ return -EFAULT;
+ }
+
+ new_vma = find_extend_vma(mm, vm_addr);
+ if (!new_vma) {
+ up_read(¤t->mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ if (new_vma->vm_flags & VM_GROWSDOWN)
+ addr = new_vma->vm_start;
+ else
+ addr = new_vma->vm_end - size;
+
+ new_vma->vm_flags |= VM_LOCKED;
+ up_read(¤t->mm->mmap_sem);
+
+ if (__copy_to_user_inatomic((unsigned long *)addr,
+ (unsigned long *)uprobe->kp.ainsn.insn, size))
+ return -EFAULT;
+
+ regs->eip = addr;
+
+ return 0;
+}
+
+/**
+ * This routine checks for stack free space below the stack pointer
+ * and then copies the instructions at that location so that the
+ * processor can single step out-of-line. If there is not enough stack
+ * space or if copy_to_user fails or if the vma is invalid, it returns
+ * error.
+ */
+static int __kprobes copy_insn_onstack(struct uprobe *uprobe,
+ struct pt_regs *regs, unsigned long flags)
+{
+ unsigned long page_addr, stack_addr = regs->esp;
+ int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
+ unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
+
+ if (flags & VM_GROWSDOWN) {
+ page_addr = stack_addr & PAGE_MASK;
+
+ if (((stack_addr - sizeof(long long))) < (page_addr + size))
+ return -ENOMEM;
+
+ if (__copy_to_user_inatomic((unsigned long *)page_addr,
+ source, size))
+ return -EFAULT;
+
+ regs->eip = page_addr;
+ } else if (flags & VM_GROWSUP) {
+ page_addr = stack_addr & PAGE_MASK;
+
+ if (page_addr == stack_addr)
+ return -ENOMEM;
+ else
+ page_addr += PAGE_SIZE;
+
+ if ((page_addr - size) < (stack_addr + sizeof(long long)))
+ return -ENOMEM;
+
+ if (__copy_to_user_inatomic(
+ (unsigned long *)(page_addr - size), source, size))
+ return -EFAULT;
+
+ regs->eip = page_addr - size;
+ } else
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * This routines get the page containing the probe, maps it and
+ * replaced the instruction at the probed address with specified
+ * opcode.
+ */
+void __kprobes replace_original_insn(struct uprobe *uprobe,
+ struct pt_regs *regs, kprobe_opcode_t opcode)
+{
+ kprobe_opcode_t *addr;
+ struct page *page;
+
+ page = find_get_page(uprobe->inode->i_mapping,
+ uprobe->offset >> PAGE_CACHE_SHIFT);
+ BUG_ON(!page);
+
+ __lock_page(page);
+
+ addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
+ addr = (kprobe_opcode_t *)((unsigned long)addr +
+ (unsigned long)(uprobe->offset & ~PAGE_MASK));
+ *addr = opcode;
+ /*TODO: flush vma ? */
+ kunmap_atomic(addr, KM_USER1);
+
+ unlock_page(page);
+
+ if (page)
+ page_cache_release(page);
+ regs->eip = (unsigned long)uprobe->kp.addr;
+}
+
+/**
+ * This routine provides the functionality of single stepping
+ * out-of-line. If single stepping out-of-line cannot be achieved,
+ * it replaces with the original instruction allowing it to single
+ * step inline.
+ */
+static inline int prepare_singlestep_uprobe(struct uprobe *uprobe,
+ struct uprobe_ctlblk *ucb, struct pt_regs *regs)
+{
+ unsigned long stack_addr = regs->esp, flags;
+ struct vm_area_struct *vma = NULL;
+ int err = 0;
+
+ vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
+ if (!vma) {
+ /* TODO: Need better error reporting? */
+ goto no_vma;
+ }
+ flags = vma->vm_flags;
+
+ regs->eflags |= TF_MASK;
+ regs->eflags &= ~IF_MASK;
+
+ /*
+ * Copy_insn_on_stack tries to find some room for the instruction slot
+ * in the same page as the current esp.
+ */
+ err = copy_insn_onstack(uprobe, regs, flags);
+
+ /*
+ * If copy_insn_on_stack() fails, copy_insn_on_new_page() is called to
+ * try to find some room in the next pages below the current esp;
+ */
+ if (err)
+ err = copy_insn_on_new_page(uprobe, regs, vma);
+ /*
+ * If copy_insn_on_new_pagek() fails, copy_insn_on_expstack() is called to
+ * try to grow the stack's VM area by one page.
+ */
+ if (err)
+ err = copy_insn_onexpstack(uprobe, regs, vma);
+
+ ucb->uprobe_status = UPROBE_HIT_SS;
+
+ if (!err) {
+ ucb->upte = get_uprobe_pte(regs->eip);
+ if (!ucb->upte)
+ goto no_vma;
+ ucb->upage = pte_page(*ucb->upte);
+ __lock_page(ucb->upage);
+ }
+no_vma:
+ if (err) {
+ replace_original_insn(uprobe, regs, uprobe->kp.opcode);
+ ucb->uprobe_status = UPROBE_SS_INLINE;
+ }
+
+ ucb->singlestep_addr = regs->eip;
+
+ return 0;
+}
+
+/*
+ * uprobe_handler() executes the user specified handler and setup for
+ * single stepping the original instruction either out-of-line or inline.
+ */
+static int __kprobes uprobe_handler(struct pt_regs *regs)
+{
+ struct kprobe *p;
+ int ret = 0;
+ kprobe_opcode_t *addr = NULL;
+ struct uprobe_ctlblk *ucb = &uprobe_ctlblk;
+ unsigned long limit;
+
+ spin_lock_irqsave(&uprobe_lock, ucb->flags);
+ /* preemption is disabled, remains disabled
+ * until we single step on original instruction.
+ */
+ preempt_disable();
+
+ addr = (kprobe_opcode_t *)(get_segment_eip(regs, &limit) - 1);
+
+ p = get_uprobe(addr);
+ if (!p) {
+
+ if (*addr != BREAKPOINT_INSTRUCTION) {
+ /*
+ * The breakpoint instruction was removed right
+ * after we hit it. Another cpu has removed
+ * either a probepoint or a debugger breakpoint
+ * at this address. In either case, no further
+ * handling of this interrupt is appropriate.
+ * Back up over the (now missing) int3 and run
+ * the original instruction.
+ */
+ regs->eip -= sizeof(kprobe_opcode_t);
+ ret = 1;
+ }
+ /* Not one of ours: let kernel handle it */
+ goto no_uprobe;
+ }
+
+ if (p->opcode == BREAKPOINT_INSTRUCTION) {
+ /*
+ * Breakpoint was already present even before the probe
+ * was inserted, this might break some compatability with
+ * other debuggers like gdb etc. We dont handle such probes.
+ */
+ current_uprobe = NULL;
+ goto no_uprobe;
+ }
+
+ ucb->curr_p = p;
+ ucb->tsk = current;
+ ucb->uprobe_status = UPROBE_HIT_ACTIVE;
+ ucb->uprobe_saved_eflags = (regs->eflags & (TF_MASK | IF_MASK));
+ ucb->uprobe_old_eflags = (regs->eflags & (TF_MASK | IF_MASK));
+
+ if (p->pre_handler && p->pre_handler(p, regs))
+ /* handler has already set things up, so skip ss setup */
+ return 1;
+
+ prepare_singlestep_uprobe(current_uprobe, ucb, regs);
+ /*
+ * Avoid scheduling the current while returning from
+ * kernel to user mode.
+ */
+ clear_need_resched();
+ return 1;
+
+no_uprobe:
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ preempt_enable_no_resched();
+
+ return ret;
+}
+
+/*
+ * Called after single-stepping. p->addr is the address of the
+ * instruction whose first byte has been replaced by the "int 3"
+ * instruction. To avoid the SMP problems that can occur when we
+ * temporarily put back the original opcode to single-step, we
+ * single-stepped a copy of the instruction. The address of this
+ * copy is p->ainsn.insn.
+ *
+ * This function prepares to return from the post-single-step
+ * interrupt. We have to fix up the stack as follows:
+ *
+ * 0) Typically, the new eip is relative to the copied instruction. We
+ * need to make it relative to the original instruction. Exceptions are
+ * return instructions and absolute or indirect jump or call instructions.
+ *
+ * 1) If the single-stepped instruction was pushfl, then the TF and IF
+ * flags are set in the just-pushed eflags, and may need to be cleared.
+ *
+ * 2) If the single-stepped instruction was a call, the return address
+ * that is atop the stack is the address following the copied instruction.
+ * We need to make it the address following the original instruction.
+ */
+static void __kprobes resume_execution_user(struct kprobe *p,
+ struct pt_regs *regs, struct uprobe_ctlblk *ucb)
+{
+ unsigned long *tos = (unsigned long *)regs->esp;
+ unsigned long next_eip = 0;
+ unsigned long copy_eip = ucb->singlestep_addr;
+ unsigned long orig_eip = (unsigned long)p->addr;
+
+ switch (p->ainsn.insn[0]) {
+ case 0x9c: /* pushfl */
+ *tos &= ~(TF_MASK | IF_MASK);
+ *tos |= ucb->uprobe_old_eflags;
+ break;
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ next_eip = regs->eip;
+ /* eip is already adjusted, no more changes required*/
+ return;
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ *tos = orig_eip + (*tos - copy_eip);
+ break;
+ case 0xff:
+ if ((p->ainsn.insn[1] & 0x30) == 0x10) {
+ /* call absolute, indirect */
+ /* Fix return addr; eip is correct. */
+ next_eip = regs->eip;
+ *tos = orig_eip + (*tos - copy_eip);
+ } else if (((p->ainsn.insn[1] & 0x31) == 0x20) ||
+ ((p->ainsn.insn[1] & 0x31) == 0x21)) {
+ /* jmp near or jmp far absolute indirect */
+ /* eip is correct. */
+ next_eip = regs->eip;
+ }
+ break;
+ case 0xea: /* jmp absolute -- eip is correct */
+ next_eip = regs->eip;
+ break;
+ default:
+ break;
+ }
+
+ regs->eflags &= ~TF_MASK;
+ if (next_eip)
+ regs->eip = next_eip;
+ else
+ regs->eip = orig_eip + (regs->eip - copy_eip);
+}
+
+/*
+ * post_uprobe_handler(), executes the user specified handlers and
+ * resumes with the normal execution.
+ */
+static inline int post_uprobe_handler(struct pt_regs *regs)
+{
+ struct kprobe *cur;
+ struct uprobe_ctlblk *ucb;
+
+ if (!current_uprobe)
+ return 0;
+
+ ucb = &uprobe_ctlblk;
+ cur = ucb->curr_p;
+
+ if (!cur || ucb->tsk != current)
+ return 0;
+
+ if (cur->post_handler) {
+ if (ucb->uprobe_status == UPROBE_SS_INLINE)
+ ucb->uprobe_status = UPROBE_SSDONE_INLINE;
+ else
+ ucb->uprobe_status = UPROBE_HIT_SSDONE;
+ cur->post_handler(cur, regs, 0);
+ }
+
+ resume_execution_user(cur, regs, ucb);
+ regs->eflags |= ucb->uprobe_saved_eflags;
+
+ if (ucb->uprobe_status == UPROBE_SSDONE_INLINE)
+ replace_original_insn(current_uprobe, regs,
+ BREAKPOINT_INSTRUCTION);
+ else {
+ unlock_page(ucb->upage);
+ pte_unmap(ucb->upte);
+ }
+ current_uprobe = NULL;
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ preempt_enable_no_resched();
+ /*
+ * if somebody else is singlestepping across a probe point, eflags
+ * will have TF set, in which case, continue the remaining processing
+ * of do_debug, as if this is not a probe hit.
+ */
+ if (regs->eflags & TF_MASK)
+ return 0;
+
+ return 1;
+}
+
+static inline int uprobe_fault_handler(struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur;
+ struct uprobe_ctlblk *ucb;
+ int ret = 0;
+
+ ucb = &uprobe_ctlblk;
+ cur = ucb->curr_p;
+
+ if (ucb->tsk != current || !cur)
+ return 0;
+
+ switch(ucb->uprobe_status) {
+ case UPROBE_HIT_SS:
+ unlock_page(ucb->upage);
+ pte_unmap(ucb->upte);
+ /* TODO: All acceptable number of faults before disabling */
+ replace_original_insn(current_uprobe, regs, cur->opcode);
+ /* Fall through and reset the current probe */
+ case UPROBE_SS_INLINE:
+ regs->eip = (unsigned long)cur->addr;
+ regs->eflags |= ucb->uprobe_old_eflags;
+ regs->eflags &= ~TF_MASK;
+ current_uprobe = NULL;
+ ret = 1;
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ preempt_enable_no_resched();
+ break;
+ case UPROBE_HIT_ACTIVE:
+ case UPROBE_SSDONE_INLINE:
+ case UPROBE_HIT_SSDONE:
+ if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
+ return 1;
+
+ if (fixup_exception(regs))
+ return 1;
+ /*
+ * We must not allow the system page handler to continue while
+ * holding a lock, since page fault handler can sleep and
+ * reschedule it on different cpu. Hence return 1.
+ */
+ return 1;
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Wrapper routine to for handling exceptions.
+ */
+int __kprobes uprobe_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = (struct die_args *)data;
+ int ret = NOTIFY_DONE;
+
+ if (args->regs->eflags & VM_MASK) {
+ /* We are in virtual-8086 mode. Return NOTIFY_DONE */
+ return ret;
+ }
+
+ switch (val) {
+ case DIE_INT3:
+ if (uprobe_handler(args->regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_DEBUG:
+ if (post_uprobe_handler(args->regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_GPF:
+ case DIE_PAGE_FAULT:
+ if (current_uprobe &&
+ uprobe_fault_handler(args->regs, args->trapnr))
+ ret = NOTIFY_STOP;
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
diff -puN include/linux/kprobes.h~kprobes_userspace_probes-ss-out-of-line include/linux/kprobes.h
--- linux-2.6.16-rc6-mm2/include/linux/kprobes.h~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/include/linux/kprobes.h 2006-03-20 10:49:29.000000000 +0530
@@ -51,6 +51,13 @@
#define KPROBE_REENTER 0x00000004
#define KPROBE_HIT_SSDONE 0x00000008
+/* uprobe_status settings */
+#define UPROBE_HIT_ACTIVE 0x00000001
+#define UPROBE_HIT_SS 0x00000002
+#define UPROBE_HIT_SSDONE 0x00000004
+#define UPROBE_SS_INLINE 0x00000008
+#define UPROBE_SSDONE_INLINE 0x00000010
+
/* Attach to insert probes on any functions which should be ignored*/
#define __kprobes __attribute__((__section__(".kprobes.text")))
@@ -183,6 +190,7 @@ struct kretprobe_instance {
struct task_struct *task;
};
+extern spinlock_t uprobe_lock;
extern spinlock_t kretprobe_lock;
extern struct mutex kprobe_mutex;
extern int arch_prepare_kprobe(struct kprobe *p);
diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/kprobes.c
--- linux-2.6.16-rc6-mm2/arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/arch/i386/kernel/kprobes.c 2006-03-20 10:49:29.000000000 +0530
@@ -88,7 +88,7 @@ static inline int can_boost(kprobe_opcod
/*
* returns non-zero if opcode modifies the interrupt flag.
*/
-static inline int is_IF_modifier(kprobe_opcode_t opcode)
+int is_IF_modifier(kprobe_opcode_t opcode)
{
switch (opcode) {
case 0xfa: /* cli */
@@ -610,7 +610,7 @@ int __kprobes kprobe_exceptions_notify(s
int ret = NOTIFY_DONE;
if (args->regs && user_mode(args->regs))
- return ret;
+ return uprobe_exceptions_notify(self, val, data);
switch (val) {
case DIE_INT3:
diff -puN arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line arch/i386/mm/fault.c
--- linux-2.6.16-rc6-mm2/arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/arch/i386/mm/fault.c 2006-03-20 10:49:29.000000000 +0530
@@ -71,8 +71,7 @@ void bust_spinlocks(int yes)
*
* This is slow, but is very rarely executed.
*/
-static inline unsigned long get_segment_eip(struct pt_regs *regs,
- unsigned long *eip_limit)
+unsigned long get_segment_eip(struct pt_regs *regs, unsigned long *eip_limit)
{
unsigned long eip = regs->eip;
unsigned seg = regs->xcs & 0xffff;
diff -puN arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/Makefile
--- linux-2.6.16-rc6-mm2/arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line 2006-03-20 10:49:29.000000000 +0530
+++ linux-2.6.16-rc6-mm2-prasanna/arch/i386/kernel/Makefile 2006-03-20 10:49:29.000000000 +0530
@@ -29,7 +29,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o
-obj-$(CONFIG_KPROBES) += kprobes.o
+obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
obj-$(CONFIG_MODULES) += module.o
obj-y += sysenter.o vsyscall.o
obj-$(CONFIG_ACPI_SRAT) += srat.o
_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
Prasanna S Panchamukhi <[email protected]> wrote:
>
> This patch provides two interfaces to insert and remove
> user space probes. Each probe is uniquely identified by
> inode and offset within that executable/library file.
> Insertion of a probe involves getting the code page for
> a given offset, mapping it into the memory and then insert
> the breakpoint at the given offset. Also the probe is added
> to the uprobe_table hash list. A uprobe_module data strcture
> is allocated for every probed application/library image on disk.
> Removal of a probe involves getting the code page for a given
> offset, mapping that page into the memory and then replacing
> the breakpoint instruction with a the original opcode.
> This patch also provides aggrigate probe handler feature,
> where user can define multiple handlers per probe.
>
> +/**
> + * Finds a uprobe at the specified user-space address in the current task.
> + * Points current_uprobe at that uprobe and returns the corresponding kprobe.
> + */
> +struct kprobe __kprobes *get_uprobe(void *addr)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> + struct inode *inode;
> + unsigned long offset;
> + struct kprobe *p, *kpr;
> + struct uprobe *uprobe;
> +
> + vma = find_vma(mm, (unsigned long)addr);
> +
> + BUG_ON(!vma); /* this should not happen, not in our memory map */
> +
> + offset = (unsigned long)addr - (vma->vm_start +
> + (vma->vm_pgoff << PAGE_SHIFT));
> + if (!vma->vm_file)
> + return NULL;
All this appears to be happening without mmap_sem held?
> +/**
> + * Wait for the page to be unlocked if someone else had locked it,
> + * then map the page and insert or remove the breakpoint.
> + */
> +static int __kprobes map_uprobe_page(struct page *page, struct uprobe *uprobe,
> + process_uprobe_func_t process_kprobe_user)
> +{
> + int ret = 0;
> + kprobe_opcode_t *uprobe_address;
> +
> + if (!page)
> + return -EINVAL; /* TODO: more suitable errno */
> +
> + wait_on_page_locked(page);
> + /* could probably retry readpage here. */
> + if (!PageUptodate(page))
> + return -EINVAL; /* TODO: more suitable errno */
> +
> + lock_page(page);
That's a lot of fuss and might be racy with truncate.
Why not just run lock_page()?
> + uprobe_address = (kprobe_opcode_t *)kmap(page);
> + uprobe_address = (kprobe_opcode_t *)((unsigned long)uprobe_address +
> + (uprobe->offset & ~PAGE_MASK));
> + ret = (*process_kprobe_user)(uprobe, uprobe_address);
> + kunmap(page);
kmap_atomic() is preferred.
> +/**
> + * Gets exclusive write access to the given inode to ensure that the file
> + * on which probes are currently applied does not change. Use the function,
> + * deny_write_access_to_inode() we added in fs/namei.c.
> + */
> +static inline int ex_write_lock(struct inode *inode)
> +{
> + return deny_write_access_to_inode(inode);
> +}
hm, this code likes to poke around in VFS internals. It would be nice to
have an overall description of what it's trying to do, why and how.
> +/**
> + * unregister_uprobe: Disarms the probe, removes the uprobe
> + * pointers from the hash list and unhooks readpage routines.
> + */
> +void __kprobes unregister_uprobe(struct uprobe *uprobe)
> +{
> + struct address_space *mapping;
> + struct uprobe_module *umodule;
> + struct page *page;
> + unsigned long flags;
> + int ret = 0;
> +
> + if (!uprobe->inode)
> + return;
> +
> + mapping = uprobe->inode->i_mapping;
> +
> + page = find_get_page(mapping, uprobe->offset >> PAGE_CACHE_SHIFT);
> +
> + spin_lock_irqsave(&uprobe_lock, flags);
> + ret = remove_uprobe(uprobe);
> + spin_unlock_irqrestore(&uprobe_lock, flags);
> +
> + mutex_lock(&uprobe_mutex);
> + if (!(umodule = get_module_by_inode(uprobe->inode)))
> + goto out;
> +
> + hlist_del(&uprobe->ulist);
> + if (hlist_empty(&umodule->ulist_head)) {
> + list_del(&umodule->mlist);
> + ex_write_unlock(uprobe->inode);
> + path_release(&umodule->nd);
> + kfree(umodule);
> + }
> +
> +out:
> + mutex_unlock(&uprobe_mutex);
> + if (ret)
> + ret = map_uprobe_page(page, uprobe, remove_kprobe_user);
> +
> + if (ret == -EINVAL)
> + return;
> + /*
> + * TODO: unregister_uprobe should not fail, need to handle
> + * if it fails.
> + */
> + flush_vma(mapping, page, uprobe);
> +
> + if (page)
> + page_cache_release(page);
> +}
That's some pretty awkward coding. Buggy too. It doesn't drop the
refcount on the page if map_uprobe_page() returned -EINVAL because it's
assuming that EINVAL meant "there was no page". But there are multiple
reasons for map_uprobe_page() to return -EINVAL. If that page isn't
uptodate, we leak a ref.
This function should be doing the checking for a find_get_page() failure,
not map_uprobe_page().
Prasanna S Panchamukhi <[email protected]> wrote:
>
> This patch provides the feature of inserting probes on pages that are
> not present in the memory during registration.
>
> To add readpage and readpages() hooks, two new elements are added to
> the uprobe_module object:
> struct address_space_operations *ori_a_ops;
> struct address_space_operations user_a_ops;
>
> User-space probes allows probes to be inserted even in pages that are
> not present in the memory at the time of registration. This is done
> by adding hooks to the readpage and readpages routines. During
> registration, the address space operation object is modified by
> substituting user-space probes's specific readpage and readpages
> routines. When the pages are read into memory through the readpage and
> readpages address space operations, any associated probes are
> automatically inserted into those pages. These user-space probes
> readpage and readpages routines internally call the original
> readpage() and readpages() routines, and then check whether probes are
> to be added to these pages, inserting probes as necessary. The
> overhead of adding these hooks is limited to the application on which
> the probes are inserted.
>
> During unregistration, care should be taken to replace the readpage and
> readpages hooks with the original routines if no probes remain on that
> application.
>
So... The code's replacing the address_space's address_space_operations
with a copy which has .readpage() and .readpages() modified, because it
happens that filemap_nopage() uses those.
This is all rather hacky.
I think we need to step back and discuss what services this feature is
trying to provide, and how it is to provide them. Your covering
description didn't describe that - it dives straigt into details without
even describing what the patches are trying to achieve.
So. What are we trying to achieve here, and how are we trying to achieve
it? What problems were encountered in the development of the feature and
how were they solved? What alternative solutions were there?
I can mostly work all that out from background knowledge and looking at the
code, but I'd rather hear it from the designers please.
Prasanna S Panchamukhi <[email protected]> wrote:
>
> This patch provides a mechanism for probe handling and
> executing the user-specified handlers.
>
> Each userspace probe is uniquely identified by the combination of
> inode and offset, hence during registeration the inode and offset
> combination is added to uprobes hash table. Initially when
> breakpoint instruction is hit, the uprobes hash table is looked up
> for matching inode and offset. The pre_handlers are called in
> sequence if multiple probes are registered. Similar to kprobes,
> uprobes also adopts to single step out-of-line, so that probe miss in
> SMP environment can be avoided. But for userspace probes, instruction
> copied into kernel address space cannot be single stepped, hence the
> instruction must be copied to user address space. The solution is to
> find free space in the current process address space and then copy the
> original instruction and single step that instruction.
>
> User processes use stack space to store local variables, agruments and
> return values. Normally the stack space either below or above the
> stack pointer indicates the free stack space.
>
> The instruction to be single stepped can modify the stack space,
> hence before using the free stack space, sufficient stack space should
> be left. The instruction is copied to the bottom of the page and check
> is made such that the copied instruction does not cross the page
> boundry. The copied instruction is then single stepped. Several
> architectures does not allow the instruction to be executed from the
> stack location, since no-exec bit is set for the stack pages. In those
> architectures, the page table entry corresponding to the stack page is
> identified and the no-exec bit is unset making the instruction on that
> stack page to be executed.
>
> There are situations where even the free stack space is not enough for
> the user instruction to be copied and single stepped. In such
> situations, the virtual memory area(vma) can be expanded beyond the
> current stack vma. This expaneded stack can be used to copy the
> original instruction and single step out-of-line.
>
> Even if the vma cannot be extended then the instruction much be
> executed inline, by replacing the breakpoint instruction with original
> instruction.
>
> ...
>
> +
> +/**
> + * This routines get the pte of the page containing the specified address.
> + */
> +static pte_t __kprobes *get_uprobe_pte(unsigned long address)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte = NULL;
> +
> + pgd = pgd_offset(current->mm, address);
> + if (!pgd)
> + goto out;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud)
> + goto out;
> +
> + pmd = pmd_offset(pud, address);
> + if (!pmd)
> + goto out;
> +
> + pte = pte_alloc_map(current->mm, pmd, address);
> +
> +out:
> + return pte;
> +}
That's familiar looking code..
I guess this should be given a more generic name then placed in
mm/memory.c, which is where we do pagetable walking.
> +/**
> + * This routine check for space in the current process's stack
> + * address space. If enough address space is found, copy the original
> + * instruction on that page for single stepping out-of-line.
> + */
> +static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
> + struct pt_regs *regs, struct vm_area_struct *vma)
> +{
> + unsigned long addr, stack_addr = regs->esp;
> + int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> +
> + if (vma->vm_flags & VM_GROWSDOWN) {
> + if (((stack_addr - sizeof(long long))) <
> + (vma->vm_start + size))
> + return -ENOMEM;
> + addr = vma->vm_start;
> + } else if (vma->vm_flags & VM_GROWSUP) {
> + if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
> + return -ENOMEM;
> + addr = vma->vm_end - size;
> + } else
> + return -EFAULT;
> +
> + vma->vm_flags |= VM_LOCKED;
> +
> + if (__copy_to_user_inatomic((unsigned long *)addr,
> + (unsigned long *)uprobe->kp.ainsn.insn, size))
> + return -EFAULT;
> +
> + regs->eip = addr;
> +
> + return 0;
> +}
If we're going to use __copy_to_user_inatomic() then we'll need some nice
comments explaining why this is happening.
And we'll need to actually *be* in-atomic. That means we need an
open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
see them.
> +/**
> + * This routine expands the stack beyond the present process address
> + * space and copies the instruction to that location, so that
> + * processor can single step out-of-line.
> + */
> +static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
> + struct pt_regs *regs, struct vm_area_struct *vma)
> +{
> + unsigned long addr, vm_addr;
> + int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> + struct vm_area_struct *new_vma;
> + struct mm_struct *mm = current->mm;
> +
> +
> + if (!down_read_trylock(¤t->mm->mmap_sem))
> + return -ENOMEM;
> +
> + if (vma->vm_flags & VM_GROWSDOWN)
> + vm_addr = vma->vm_start - size;
> + else if (vma->vm_flags & VM_GROWSUP)
> + vm_addr = vma->vm_end + size;
> + else {
> + up_read(¤t->mm->mmap_sem);
> + return -EFAULT;
> + }
> +
> + new_vma = find_extend_vma(mm, vm_addr);
> + if (!new_vma) {
> + up_read(¤t->mm->mmap_sem);
> + return -ENOMEM;
> + }
> +
> + if (new_vma->vm_flags & VM_GROWSDOWN)
> + addr = new_vma->vm_start;
> + else
> + addr = new_vma->vm_end - size;
> +
> + new_vma->vm_flags |= VM_LOCKED;
> + up_read(¤t->mm->mmap_sem);
> +
> + if (__copy_to_user_inatomic((unsigned long *)addr,
> + (unsigned long *)uprobe->kp.ainsn.insn, size))
> + return -EFAULT;
> +
> + regs->eip = addr;
> +
> + return 0;
> +}
Why is VM_LOCKED being set? (It needs a comment).
Where does it get unset?
> +
> + if (__copy_to_user_inatomic((unsigned long *)page_addr,
> + source, size))
> + if (__copy_to_user_inatomic(
> + (unsigned long *)(page_addr - size), source, size))
See above.
> +
> +/**
> + * This routines get the page containing the probe, maps it and
> + * replaced the instruction at the probed address with specified
> + * opcode.
> + */
> +void __kprobes replace_original_insn(struct uprobe *uprobe,
> + struct pt_regs *regs, kprobe_opcode_t opcode)
> +{
> + kprobe_opcode_t *addr;
> + struct page *page;
> +
> + page = find_get_page(uprobe->inode->i_mapping,
> + uprobe->offset >> PAGE_CACHE_SHIFT);
> + BUG_ON(!page);
> +
> + __lock_page(page);
Whoa. Why is __lock_page() being used here? It looks like a bug is being
covered up.
> + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> + addr = (kprobe_opcode_t *)((unsigned long)addr +
> + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> + *addr = opcode;
> + /*TODO: flush vma ? */
flush_dcache_page() would be needed.
But then, what happens if the page is shared by other processes? Do they
all start taking debug traps?
> + kunmap_atomic(addr, KM_USER1);
> +
> + unlock_page(page);
> +
> + if (page)
> + page_cache_release(page);
> + regs->eip = (unsigned long)uprobe->kp.addr;
> +}
> +
> +/**
> + * This routine provides the functionality of single stepping
> + * out-of-line. If single stepping out-of-line cannot be achieved,
> + * it replaces with the original instruction allowing it to single
> + * step inline.
> + */
> +static inline int prepare_singlestep_uprobe(struct uprobe *uprobe,
> + struct uprobe_ctlblk *ucb, struct pt_regs *regs)
> +{
> + unsigned long stack_addr = regs->esp, flags;
> + struct vm_area_struct *vma = NULL;
> + int err = 0;
> +
> + vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
I don't think mmap_sem is held here?
> +static inline int uprobe_fault_handler(struct pt_regs *regs, int trapnr)
If, for some reason, the compiler decides to not inline this function then
you have a hunk of code which isn't marked __kprobes, but it should be.
I'd suggest that you remove all inlining from this code and add the
appropriate section markers.
Or I guess you could use __always_inline, but I'm not sure that it's really
worth the fuss and obscurity of doing that.
All kprobes-related code should be audited for this problem.
Prasanna S Panchamukhi <[email protected]> wrote:
>
> +/**
> + * This function hooks the readpages() of all modules that have active
> + * probes on them. The original readpages() is called for the given
> + * inode/address_space to actually read the pages into the memory.
> + * Then all probes that are specified on these pages are inserted.
> + */
The "/**" thing is designed to introduce a kerneldoc-style comment, but
these comments aren't using kerneldoc.
> And we'll need to actually *be* in-atomic. That means we need an
> open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> see them.
>
..
> Why is VM_LOCKED being set? (It needs a comment).
>
> Where does it get unset?
if this is an attempt to make the copy_in_atomic to be atomic, then it
is a bug; the user can unset this bit after all via mprotect, even from
another thread, and concurrently. UnGood(tm).
Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
>>+
>>+/**
>>+ * This routines get the pte of the page containing the specified address.
>>+ */
>>+static pte_t __kprobes *get_uprobe_pte(unsigned long address)
>>+{
>>+ pgd_t *pgd;
>>+ pud_t *pud;
>>+ pmd_t *pmd;
>>+ pte_t *pte = NULL;
>>+
>>+ pgd = pgd_offset(current->mm, address);
>>+ if (!pgd)
>>+ goto out;
>>+
>>+ pud = pud_offset(pgd, address);
>>+ if (!pud)
>>+ goto out;
>>+
>>+ pmd = pmd_offset(pud, address);
>>+ if (!pmd)
>>+ goto out;
>>+
>>+ pte = pte_alloc_map(current->mm, pmd, address);
>>+
>>+out:
>>+ return pte;
>>+}
>
>
> That's familiar looking code..
>
> I guess this should be given a more generic name then placed in
> mm/memory.c, which is where we do pagetable walking.
>
Apart from this, there looks like quite a bit of other mm code
that has been crammed into everywhere but mm/ (yes this has
happened before, but it shouldn't be encouraged in new code).
For this specific example, I'm not sure that a function returning
a pointer to a pte is a good idea to be exporting. I'd like to see
some good reasons why things like get_user_pages, find_*_page, and
other standard APIs can't be used. Then you can list those reasons
in an individual patch to add your required API to mm/. This can
be more easily reviewed by people who aren't as good at wading
through code as Andrew.
Also, adding your own mm code outside core files really makes
things hard to maintain and audit when somebody would like to
change anything.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Mon, Mar 20, 2006 at 02:53:11AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > This patch provides the feature of inserting probes on pages that are
> > not present in the memory during registration.
> >
> > To add readpage and readpages() hooks, two new elements are added to
> > the uprobe_module object:
> > struct address_space_operations *ori_a_ops;
> > struct address_space_operations user_a_ops;
> >
> > User-space probes allows probes to be inserted even in pages that are
> > not present in the memory at the time of registration. This is done
> > by adding hooks to the readpage and readpages routines. During
> > registration, the address space operation object is modified by
> > substituting user-space probes's specific readpage and readpages
> > routines. When the pages are read into memory through the readpage and
> > readpages address space operations, any associated probes are
> > automatically inserted into those pages. These user-space probes
> > readpage and readpages routines internally call the original
> > readpage() and readpages() routines, and then check whether probes are
> > to be added to these pages, inserting probes as necessary. The
> > overhead of adding these hooks is limited to the application on which
> > the probes are inserted.
> >
> > During unregistration, care should be taken to replace the readpage and
> > readpages hooks with the original routines if no probes remain on that
> > application.
> >
>
> So... The code's replacing the address_space's address_space_operations
> with a copy which has .readpage() and .readpages() modified, because it
> happens that filemap_nopage() uses those.
>
> This is all rather hacky.
>
> I think we need to step back and discuss what services this feature is
> trying to provide, and how it is to provide them. Your covering
> description didn't describe that - it dives straigt into details without
> even describing what the patches are trying to achieve.
>
> So. What are we trying to achieve here, and how are we trying to achieve
> it? What problems were encountered in the development of the feature and
> how were they solved? What alternative solutions were there?
>
The basic idea is to insert probes on user applications which may or
may not be in memory, at the time of probe insertion.
The base interface patch (1/3) allows the user to insert the probes on the
text pages that are already present in the memory. This is done by mapping
the page via kmap() and then insert the breakpoint instruction at the given
page offset.
Then comes the issue of inserting a probe on pages not currently in
memory. This is useful if we'd want to probe an application that hasn't
yet started executing. In such situations, we still want to insert the
probe, but defer insertion of actual probe until the text page is read
into the memory.
Here are a few ways to accomplish this:
1. Add a notifier in the readpage(), readpages() routine, which will
notify when the text pages are read into the memory.
2. Reading the page from the executable into memory and then insert
probes on that text page. But there are situations where the system
can run into low memory problems and those text pages get discarded.
3. Change the readpage() and readpages() routines only for that application
where probes will be inserted. This is done by hooking the readpage()
and readpages() routines, which is limited to the probed application
and will not change anything related to other applications.
The current patchset uses approach 3.
I'd like to know if there are better/cleaner ways to accomplish this?
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, Mar 20, 2006 at 02:42:48AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > This patch provides two interfaces to insert and remove
> > user space probes. Each probe is uniquely identified by
> > inode and offset within that executable/library file.
> > Insertion of a probe involves getting the code page for
> > a given offset, mapping it into the memory and then insert
> > the breakpoint at the given offset. Also the probe is added
> > to the uprobe_table hash list. A uprobe_module data strcture
> > is allocated for every probed application/library image on disk.
> > Removal of a probe involves getting the code page for a given
> > offset, mapping that page into the memory and then replacing
> > the breakpoint instruction with a the original opcode.
> > This patch also provides aggrigate probe handler feature,
> > where user can define multiple handlers per probe.
> >
> > +/**
> > + * Finds a uprobe at the specified user-space address in the current task.
> > + * Points current_uprobe at that uprobe and returns the corresponding kprobe.
> > + */
> > +struct kprobe __kprobes *get_uprobe(void *addr)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + struct vm_area_struct *vma;
> > + struct inode *inode;
> > + unsigned long offset;
> > + struct kprobe *p, *kpr;
> > + struct uprobe *uprobe;
> > +
> > + vma = find_vma(mm, (unsigned long)addr);
> > +
> > + BUG_ON(!vma); /* this should not happen, not in our memory map */
> > +
> > + offset = (unsigned long)addr - (vma->vm_start +
> > + (vma->vm_pgoff << PAGE_SHIFT));
> > + if (!vma->vm_file)
> > + return NULL;
>
> All this appears to be happening without mmap_sem held?
Yes, I will make the changes to hold the mmap_sem.
>
> > +/**
> > + * Wait for the page to be unlocked if someone else had locked it,
> > + * then map the page and insert or remove the breakpoint.
> > + */
> > +static int __kprobes map_uprobe_page(struct page *page, struct uprobe *uprobe,
> > + process_uprobe_func_t process_kprobe_user)
> > +{
> > + int ret = 0;
> > + kprobe_opcode_t *uprobe_address;
> > +
> > + if (!page)
> > + return -EINVAL; /* TODO: more suitable errno */
> > +
> > + wait_on_page_locked(page);
> > + /* could probably retry readpage here. */
> > + if (!PageUptodate(page))
> > + return -EINVAL; /* TODO: more suitable errno */
> > +
> > + lock_page(page);
>
> That's a lot of fuss and might be racy with truncate.
>
> Why not just run lock_page()?
Yes, I will do that.
>
> > + uprobe_address = (kprobe_opcode_t *)kmap(page);
> > + uprobe_address = (kprobe_opcode_t *)((unsigned long)uprobe_address +
> > + (uprobe->offset & ~PAGE_MASK));
> > + ret = (*process_kprobe_user)(uprobe, uprobe_address);
> > + kunmap(page);
>
> kmap_atomic() is preferred.
>
Agreed.
> > +/**
> > + * Gets exclusive write access to the given inode to ensure that the file
> > + * on which probes are currently applied does not change. Use the function,
> > + * deny_write_access_to_inode() we added in fs/namei.c.
> > + */
> > +static inline int ex_write_lock(struct inode *inode)
> > +{
> > + return deny_write_access_to_inode(inode);
> > +}
>
> hm, this code likes to poke around in VFS internals. It would be nice to
> have an overall description of what it's trying to do, why and how.
ok, I should probably separate VFS changes in a different patch with
proper comments. However this is required to ensure that the
executable associated with this inode on which probes are inserted
does not change. deny_write_access_to_inode() just decrements the
inode usage count to -1.
>
> > +/**
> > + * unregister_uprobe: Disarms the probe, removes the uprobe
> > + * pointers from the hash list and unhooks readpage routines.
> > + */
> > +void __kprobes unregister_uprobe(struct uprobe *uprobe)
> > +{
> > + struct address_space *mapping;
> > + struct uprobe_module *umodule;
> > + struct page *page;
> > + unsigned long flags;
> > + int ret = 0;
> > +
> > + if (!uprobe->inode)
> > + return;
> > +
> > + mapping = uprobe->inode->i_mapping;
> > +
> > + page = find_get_page(mapping, uprobe->offset >> PAGE_CACHE_SHIFT);
> > +
> > + spin_lock_irqsave(&uprobe_lock, flags);
> > + ret = remove_uprobe(uprobe);
> > + spin_unlock_irqrestore(&uprobe_lock, flags);
> > +
> > + mutex_lock(&uprobe_mutex);
> > + if (!(umodule = get_module_by_inode(uprobe->inode)))
> > + goto out;
> > +
> > + hlist_del(&uprobe->ulist);
> > + if (hlist_empty(&umodule->ulist_head)) {
> > + list_del(&umodule->mlist);
> > + ex_write_unlock(uprobe->inode);
> > + path_release(&umodule->nd);
> > + kfree(umodule);
> > + }
> > +
> > +out:
> > + mutex_unlock(&uprobe_mutex);
> > + if (ret)
> > + ret = map_uprobe_page(page, uprobe, remove_kprobe_user);
> > +
> > + if (ret == -EINVAL)
> > + return;
> > + /*
> > + * TODO: unregister_uprobe should not fail, need to handle
> > + * if it fails.
> > + */
> > + flush_vma(mapping, page, uprobe);
> > +
> > + if (page)
> > + page_cache_release(page);
> > +}
>
> That's some pretty awkward coding. Buggy too. It doesn't drop the
> refcount on the page if map_uprobe_page() returned -EINVAL because it's
> assuming that EINVAL meant "there was no page". But there are multiple
> reasons for map_uprobe_page() to return -EINVAL. If that page isn't
> uptodate, we leak a ref.
>
> This function should be doing the checking for a find_get_page() failure,
> not map_uprobe_page().
Yes, that is buggy, will fix.
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, Mar 20, 2006 at 03:09:22AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > This patch provides a mechanism for probe handling and
> > executing the user-specified handlers.
> >
> > Each userspace probe is uniquely identified by the combination of
> > inode and offset, hence during registeration the inode and offset
> > combination is added to uprobes hash table. Initially when
> > breakpoint instruction is hit, the uprobes hash table is looked up
> > for matching inode and offset. The pre_handlers are called in
> > sequence if multiple probes are registered. Similar to kprobes,
> > uprobes also adopts to single step out-of-line, so that probe miss in
> > SMP environment can be avoided. But for userspace probes, instruction
> > copied into kernel address space cannot be single stepped, hence the
> > instruction must be copied to user address space. The solution is to
> > find free space in the current process address space and then copy the
> > original instruction and single step that instruction.
> >
> > User processes use stack space to store local variables, agruments and
> > return values. Normally the stack space either below or above the
> > stack pointer indicates the free stack space.
> >
> > The instruction to be single stepped can modify the stack space,
> > hence before using the free stack space, sufficient stack space should
> > be left. The instruction is copied to the bottom of the page and check
> > is made such that the copied instruction does not cross the page
> > boundry. The copied instruction is then single stepped. Several
> > architectures does not allow the instruction to be executed from the
> > stack location, since no-exec bit is set for the stack pages. In those
> > architectures, the page table entry corresponding to the stack page is
> > identified and the no-exec bit is unset making the instruction on that
> > stack page to be executed.
> >
> > There are situations where even the free stack space is not enough for
> > the user instruction to be copied and single stepped. In such
> > situations, the virtual memory area(vma) can be expanded beyond the
> > current stack vma. This expaneded stack can be used to copy the
> > original instruction and single step out-of-line.
> >
> > Even if the vma cannot be extended then the instruction much be
> > executed inline, by replacing the breakpoint instruction with original
> > instruction.
> >
> > ...
> >
> > +
> > +/**
> > + * This routines get the pte of the page containing the specified address.
> > + */
> > +static pte_t __kprobes *get_uprobe_pte(unsigned long address)
> > +{
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > + pte_t *pte = NULL;
> > +
> > + pgd = pgd_offset(current->mm, address);
> > + if (!pgd)
> > + goto out;
> > +
> > + pud = pud_offset(pgd, address);
> > + if (!pud)
> > + goto out;
> > +
> > + pmd = pmd_offset(pud, address);
> > + if (!pmd)
> > + goto out;
> > +
> > + pte = pte_alloc_map(current->mm, pmd, address);
> > +
> > +out:
> > + return pte;
> > +}
>
> That's familiar looking code..
>
> I guess this should be given a more generic name then placed in
> mm/memory.c, which is where we do pagetable walking.
Agreed, I will send out a separate patch for the helpers.
>
> > +/**
> > + * This routine check for space in the current process's stack
> > + * address space. If enough address space is found, copy the original
> > + * instruction on that page for single stepping out-of-line.
> > + */
> > +static int __kprobes copy_insn_on_new_page(struct uprobe *uprobe ,
> > + struct pt_regs *regs, struct vm_area_struct *vma)
> > +{
> > + unsigned long addr, stack_addr = regs->esp;
> > + int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> > +
> > + if (vma->vm_flags & VM_GROWSDOWN) {
> > + if (((stack_addr - sizeof(long long))) <
> > + (vma->vm_start + size))
> > + return -ENOMEM;
> > + addr = vma->vm_start;
> > + } else if (vma->vm_flags & VM_GROWSUP) {
> > + if ((vma->vm_end - size) < (stack_addr + sizeof(long long)))
> > + return -ENOMEM;
> > + addr = vma->vm_end - size;
> > + } else
> > + return -EFAULT;
> > +
> > + vma->vm_flags |= VM_LOCKED;
> > +
> > + if (__copy_to_user_inatomic((unsigned long *)addr,
> > + (unsigned long *)uprobe->kp.ainsn.insn, size))
> > + return -EFAULT;
> > +
> > + regs->eip = addr;
> > +
> > + return 0;
> > +}
>
> If we're going to use __copy_to_user_inatomic() then we'll need some nice
> comments explaining why this is happening.
>
> And we'll need to actually *be* in-atomic. That means we need an
> open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> see them.
>
We come here, after probe is hit, through uporbe_handler() with
interrupts disabled (since it is a interrupt gate). In uprobe_handler()
preemption is disabled and remains disabled until original instruction
is single stepped.
I will add proper comments in next iteration.
> > +/**
> > + * This routine expands the stack beyond the present process address
> > + * space and copies the instruction to that location, so that
> > + * processor can single step out-of-line.
> > + */
> > +static int __kprobes copy_insn_onexpstack(struct uprobe *uprobe,
> > + struct pt_regs *regs, struct vm_area_struct *vma)
> > +{
> > + unsigned long addr, vm_addr;
> > + int size = MAX_INSN_SIZE * sizeof(kprobe_opcode_t);
> > + struct vm_area_struct *new_vma;
> > + struct mm_struct *mm = current->mm;
> > +
> > +
> > + if (!down_read_trylock(¤t->mm->mmap_sem))
> > + return -ENOMEM;
> > +
> > + if (vma->vm_flags & VM_GROWSDOWN)
> > + vm_addr = vma->vm_start - size;
> > + else if (vma->vm_flags & VM_GROWSUP)
> > + vm_addr = vma->vm_end + size;
> > + else {
> > + up_read(¤t->mm->mmap_sem);
> > + return -EFAULT;
> > + }
> > +
> > + new_vma = find_extend_vma(mm, vm_addr);
> > + if (!new_vma) {
> > + up_read(¤t->mm->mmap_sem);
> > + return -ENOMEM;
> > + }
> > +
> > + if (new_vma->vm_flags & VM_GROWSDOWN)
> > + addr = new_vma->vm_start;
> > + else
> > + addr = new_vma->vm_end - size;
> > +
> > + new_vma->vm_flags |= VM_LOCKED;
> > + up_read(¤t->mm->mmap_sem);
> > +
> > + if (__copy_to_user_inatomic((unsigned long *)addr,
> > + (unsigned long *)uprobe->kp.ainsn.insn, size))
> > + return -EFAULT;
> > +
> > + regs->eip = addr;
> > +
> > + return 0;
> > +}
>
> Why is VM_LOCKED being set? (It needs a comment).
>
> Where does it get unset?
As Arjan says, idea was to make copy_to_user_inatomic() succeed but
there is some issue here, I have to look at it again.
> > +
> > + if (__copy_to_user_inatomic((unsigned long *)page_addr,
> > + source, size))
> > + if (__copy_to_user_inatomic(
> > + (unsigned long *)(page_addr - size), source, size))
>
> See above.
>
> > +
> > +/**
> > + * This routines get the page containing the probe, maps it and
> > + * replaced the instruction at the probed address with specified
> > + * opcode.
> > + */
> > +void __kprobes replace_original_insn(struct uprobe *uprobe,
> > + struct pt_regs *regs, kprobe_opcode_t opcode)
> > +{
> > + kprobe_opcode_t *addr;
> > + struct page *page;
> > +
> > + page = find_get_page(uprobe->inode->i_mapping,
> > + uprobe->offset >> PAGE_CACHE_SHIFT);
> > + BUG_ON(!page);
> > +
> > + __lock_page(page);
>
> Whoa. Why is __lock_page() being used here? It looks like a bug is being
> covered up.
>
we come here with a spinlock held. I will add the comment.
> > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > + *addr = opcode;
> > + /*TODO: flush vma ? */
>
> flush_dcache_page() would be needed.
>
> But then, what happens if the page is shared by other processes? Do they
> all start taking debug traps?
Yes, you are right. I think single stepping inline was a bad idea, disarming
the probe looks to be a better option
> > + kunmap_atomic(addr, KM_USER1);
> > +
> > + unlock_page(page);
> > +
> > + if (page)
> > + page_cache_release(page);
> > + regs->eip = (unsigned long)uprobe->kp.addr;
> > +}
> > +
> > +/**
> > + * This routine provides the functionality of single stepping
> > + * out-of-line. If single stepping out-of-line cannot be achieved,
> > + * it replaces with the original instruction allowing it to single
> > + * step inline.
> > + */
> > +static inline int prepare_singlestep_uprobe(struct uprobe *uprobe,
> > + struct uprobe_ctlblk *ucb, struct pt_regs *regs)
> > +{
> > + unsigned long stack_addr = regs->esp, flags;
> > + struct vm_area_struct *vma = NULL;
> > + int err = 0;
> > +
> > + vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
>
> I don't think mmap_sem is held here?
Yes, this will be taken care.
>
> > +static inline int uprobe_fault_handler(struct pt_regs *regs, int trapnr)
>
> If, for some reason, the compiler decides to not inline this function then
> you have a hunk of code which isn't marked __kprobes, but it should be.
>
> I'd suggest that you remove all inlining from this code and add the
> appropriate section markers.
>
> Or I guess you could use __always_inline, but I'm not sure that it's really
> worth the fuss and obscurity of doing that.
>
> All kprobes-related code should be audited for this problem.
Yes, I will audit it and send out a patch if necessary.
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, Mar 20, 2006 at 10:32:40PM +1100, Nick Piggin wrote:
> Andrew Morton wrote:
> >Prasanna S Panchamukhi <[email protected]> wrote:
>
> >>+
> >>+/**
> >>+ * This routines get the pte of the page containing the specified
> >>address.
> >>+ */
> >>+static pte_t __kprobes *get_uprobe_pte(unsigned long address)
> >>+{
> >>+ pgd_t *pgd;
> >>+ pud_t *pud;
> >>+ pmd_t *pmd;
> >>+ pte_t *pte = NULL;
> >>+
> >>+ pgd = pgd_offset(current->mm, address);
> >>+ if (!pgd)
> >>+ goto out;
> >>+
> >>+ pud = pud_offset(pgd, address);
> >>+ if (!pud)
> >>+ goto out;
> >>+
> >>+ pmd = pmd_offset(pud, address);
> >>+ if (!pmd)
> >>+ goto out;
> >>+
> >>+ pte = pte_alloc_map(current->mm, pmd, address);
> >>+
> >>+out:
> >>+ return pte;
> >>+}
> >
> >
> >That's familiar looking code..
> >
> >I guess this should be given a more generic name then placed in
> >mm/memory.c, which is where we do pagetable walking.
> >
>
> Apart from this, there looks like quite a bit of other mm code
> that has been crammed into everywhere but mm/ (yes this has
> happened before, but it shouldn't be encouraged in new code).
>
> For this specific example, I'm not sure that a function returning
> a pointer to a pte is a good idea to be exporting. I'd like to see
> some good reasons why things like get_user_pages, find_*_page, and
> other standard APIs can't be used. Then you can list those reasons
> in an individual patch to add your required API to mm/. This can
> be more easily reviewed by people who aren't as good at wading
> through code as Andrew.
>
> Also, adding your own mm code outside core files really makes
> things hard to maintain and audit when somebody would like to
> change anything.
>
Nick,
I will send out a separate patch to add this piece of code with proper
comments.
Thanks
Prasanna
> --
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, Mar 20, 2006 at 03:10:17AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > +/**
> > + * This function hooks the readpages() of all modules that have active
> > + * probes on them. The original readpages() is called for the given
> > + * inode/address_space to actually read the pages into the memory.
> > + * Then all probes that are specified on these pages are inserted.
> > + */
>
> The "/**" thing is designed to introduce a kerneldoc-style comment, but
> these comments aren't using kerneldoc.
Yes. I will take care of this issue.
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, Mar 20, 2006 at 12:24:01PM +0100, Arjan van de Ven wrote:
>
> Lines: 16
>
> > And we'll need to actually *be* in-atomic. That means we need an
> > open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> > see them.
> >
>
> ..
>
> > Why is VM_LOCKED being set? (It needs a comment).
> >
> > Where does it get unset?
>
>
> if this is an attempt to make the copy_in_atomic to be atomic, then it
> is a bug; the user can unset this bit after all via mprotect, even from
> another thread, and concurrently. U
You are right, the purpose was to make copy_to_user_inatomic() to suceed. I
need to look at this issue again. Any suggestions?
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Mon, 2006-03-20 at 19:35 +0530, Prasanna S Panchamukhi wrote:
> On Mon, Mar 20, 2006 at 12:24:01PM +0100, Arjan van de Ven wrote:
> >
> > Lines: 16
> >
> > > And we'll need to actually *be* in-atomic. That means we need an
> > > open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> > > see them.
> > >
> >
> > ..
> >
> > > Why is VM_LOCKED being set? (It needs a comment).
> > >
> > > Where does it get unset?
> >
> >
> > if this is an attempt to make the copy_in_atomic to be atomic, then it
> > is a bug; the user can unset this bit after all via mprotect, even from
> > another thread, and concurrently. U
>
> You are right, the purpose was to make copy_to_user_inatomic() to suceed. I
> need to look at this issue again. Any suggestions?
get_user_pages seems to be the only viable API for this..
Prasanna S Panchamukhi <[email protected]> wrote:
>
> > > +
> > > + if (__copy_to_user_inatomic((unsigned long *)addr,
> > > + (unsigned long *)uprobe->kp.ainsn.insn, size))
> > > + return -EFAULT;
> > > +
> > > + regs->eip = addr;
> > > +
> > > + return 0;
> > > +}
> >
> > If we're going to use __copy_to_user_inatomic() then we'll need some nice
> > comments explaining why this is happening.
> >
> > And we'll need to actually *be* in-atomic. That means we need an
> > open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> > see them.
> >
>
> We come here, after probe is hit, through uporbe_handler() with
> interrupts disabled (since it is a interrupt gate). In uprobe_handler()
> preemption is disabled and remains disabled until original instruction
> is single stepped.
>
> I will add proper comments in next iteration.
preempt_disable() is insufficient - it is a no-op on !CONFIG_PREEMPT.
You _must_ run inc_preempt_count(). See how kmap_atomic() and
kunmap_atomic() work.
> > > + */
> > > +void __kprobes replace_original_insn(struct uprobe *uprobe,
> > > + struct pt_regs *regs, kprobe_opcode_t opcode)
> > > +{
> > > + kprobe_opcode_t *addr;
> > > + struct page *page;
> > > +
> > > + page = find_get_page(uprobe->inode->i_mapping,
> > > + uprobe->offset >> PAGE_CACHE_SHIFT);
> > > + BUG_ON(!page);
> > > +
> > > + __lock_page(page);
> >
> > Whoa. Why is __lock_page() being used here? It looks like a bug is being
> > covered up.
> >
>
> we come here with a spinlock held. I will add the comment.
Then the code is buggy. __lock_page() can schedule away, causing this CPU
to recur onto the same lock and deadlock.
> > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > + *addr = opcode;
> > > + /*TODO: flush vma ? */
> >
> > flush_dcache_page() would be needed.
> >
> > But then, what happens if the page is shared by other processes? Do they
> > all start taking debug traps?
>
> Yes, you are right. I think single stepping inline was a bad idea, disarming
> the probe looks to be a better option
>
You skipped my second question?
Prasanna S Panchamukhi <[email protected]> wrote:
>
> The basic idea is to insert probes on user applications which may or
> may not be in memory, at the time of probe insertion.
umm yes, but what for?
What does this entire feature *do*? Why does Linux need it?
OK, so it allows kernel modules to set breakpoints (via debug traps) into
user code. But why do we want to be able to do that? What are the
use-cases?
This may sound like boringly obvious stuff to you, but without a complete
problem statement from the designers, how are we to evaluate their proposed
solution?
Andrew, the need is probably better stated as one where system-wide probing
or tracing is possible. There are times where one needs a global view. Of
course one can use multiple tools to obtain such data e.g. probes in kernel
space strace in user space and so on. The advantages of supporting user
probes are as follows:
1) A single tool providing the data capture in a consistent manner eases
the problems of correlation of events across multiple tools (for kernel and
user space)
2) The dynamic aspect allows ad hoc probepoints to be inserted where no
existing instrumentation is provided (emergency debug scenario for
example).
3) The user probe distinguishes itself from all other externally managed
tracing mechanisms in that probepoints are globally applied - i.e. without
reference to PID. Compare this with ptrace breakpoints (hence strace and
gdb) where tracepoints and breakpoints are localized to a specified set of
processes. user-probes achieves this by design without the side effects
(privatization of pages) that ptrace has. Again this supports the global
view.
4) user-probes also supports the registering of the probepoints before an
the probed code is loaded. The clearly has advantages for catching
initialization problems.
A real life example of where this capability would have been very useful is
with a performance problem I am currently investigating. It involves a GPFS
+ SAMBA + TCPIP + RDAC and some user-space video serving application. We
are looking are where the latencies are accumulating in the system for the
specific user application. It's a very hard problem. Having multiple tools
serve up a partial view and having to coordinate these view from both data
analysis and data gathering perspectives is a real nightmare.
- -
Richard J Moore
IBM Advanced Linux Response Team - Linux Technology Centre
MOBEX: 264807; Mobile (+44) (0)7739-875237
Office: (+44) (0)1962-817072
Andrew Morton
<[email protected]>
To
21/03/2006 [email protected]
02:12 cc
[email protected], [email protected],
[email protected], Richard J
Moore/UK/IBM@IBMGB,
[email protected]
bcc
Subject
Re: [2/3 PATCH] Kprobes: User space
probes support- readpage hooks
Prasanna S Panchamukhi <[email protected]> wrote:
>
> The basic idea is to insert probes on user applications which may or
> may not be in memory, at the time of probe insertion.
umm yes, but what for?
What does this entire feature *do*? Why does Linux need it?
OK, so it allows kernel modules to set breakpoints (via debug traps) into
user code. But why do we want to be able to do that? What are the
use-cases?
This may sound like boringly obvious stuff to you, but without a complete
problem statement from the designers, how are we to evaluate their proposed
solution?
On Mon, Mar 20, 2006 at 12:40:49PM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > > > +
> > > > + if (__copy_to_user_inatomic((unsigned long *)addr,
> > > > + (unsigned long *)uprobe->kp.ainsn.insn, size))
> > > > + return -EFAULT;
> > > > +
> > > > + regs->eip = addr;
> > > > +
> > > > + return 0;
> > > > +}
> > >
> > > If we're going to use __copy_to_user_inatomic() then we'll need some nice
> > > comments explaining why this is happening.
> > >
> > > And we'll need to actually *be* in-atomic. That means we need an
> > > open-coded inc_preempt_count() and dec_preempt_count() in there and I don't
> > > see them.
> > >
> >
> > We come here, after probe is hit, through uporbe_handler() with
> > interrupts disabled (since it is a interrupt gate). In uprobe_handler()
> > preemption is disabled and remains disabled until original instruction
> > is single stepped.
> >
> > I will add proper comments in next iteration.
>
> preempt_disable() is insufficient - it is a no-op on !CONFIG_PREEMPT.
>
> You _must_ run inc_preempt_count(). See how kmap_atomic() and
> kunmap_atomic() work.
>
Yes, I will use inc_preempt_count() instead of preempt_disable().
> > > > + */
> > > > +void __kprobes replace_original_insn(struct uprobe *uprobe,
> > > > + struct pt_regs *regs, kprobe_opcode_t opcode)
> > > > +{
> > > > + kprobe_opcode_t *addr;
> > > > + struct page *page;
> > > > +
> > > > + page = find_get_page(uprobe->inode->i_mapping,
> > > > + uprobe->offset >> PAGE_CACHE_SHIFT);
> > > > + BUG_ON(!page);
> > > > +
> > > > + __lock_page(page);
> > >
> > > Whoa. Why is __lock_page() being used here? It looks like a bug is being
> > > covered up.
> > >
> >
> > we come here with a spinlock held. I will add the comment.
>
> Then the code is buggy. __lock_page() can schedule away, causing this CPU
> to recur onto the same lock and deadlock.
Agreed. I will look at this issue and remove the __lock_page().
>
> > > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > > + *addr = opcode;
> > > > + /*TODO: flush vma ? */
> > >
> > > flush_dcache_page() would be needed.
> > >
> > > But then, what happens if the page is shared by other processes? Do they
> > > all start taking debug traps?
> >
> > Yes, you are right. I think single stepping inline was a bad idea, disarming
> > the probe looks to be a better option
> >
>
> You skipped my second question?
There is a small window in which other processor will not be able to see
the breakpoint if we are single step inline. But since single stepping inline
is a bad idea, we will disarm the probe forever (replace with original instrcution) if we cannot single step out-of-line.
Any suggestions?
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
Prasanna S Panchamukhi <[email protected]> wrote:
>
> >
> > > > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > > > + *addr = opcode;
> > > > > + /*TODO: flush vma ? */
> > > >
> > > > flush_dcache_page() would be needed.
> > > >
> > > > But then, what happens if the page is shared by other processes? Do they
> > > > all start taking debug traps?
> > >
> > > Yes, you are right. I think single stepping inline was a bad idea, disarming
> > > the probe looks to be a better option
> > >
> >
> > You skipped my second question?
>
> There is a small window in which other processor will not be able to see
> the breakpoint if we are single step inline. But since single stepping inline
> is a bad idea, we will disarm the probe forever (replace with original instrcution) if we cannot single step out-of-line.
> Any suggestions?
This doesn't appear to be working.
Let's go back in time and pretend that these patches were never written,
OK? We're standing around the water cooler saying "hey, wouldn't it be
cool if someone did X". You guys are way too far into this and you keep on
leaving everyone else behind. When I try to drag you up, you resist ;)
So let me rephrase, and thrash around in the dark a little more.
>From my reading of the code (and thus far that's my _only_ source of this
information) it appears that a design decision has been made (for reasons
which have yet to be disclosed) that the way to implement this (yet to be
described) requirement is to set user breakpoints upon particular
instructions within executables. System-wide, for all processes and
threads.
There are other things that could have been done. For example, you might
have chosen to set breakpoints upon a particular virtual address within a
heavyweight process. That's a process-oriented viewpoint, rather than a
file-oriented one, if you like.
This raises interesting questions, like
- How come that decision was made? Why _is_ this an executable-oriented
rather than process-oriented thing? Richard has covered that somewhat.
- What happens if the executable is writeably mapped?
- What happens if someone writes to the executable? (I think both
of these were disallowed in the implementation-which-is-not-to-be-named).
- What happens if different processes map the executable at different
addresses?
- Various other things which you've thought of and I haven't and which it
would be REALLY interesting to hear about.
But this is just one example. I don't think I'm being too picky here - my
reaction on seeing all this stuff was, basically, "wtf is all this code
for?". References to dprobes won't help, sorry - I've never looked at
dprobes and I don't know anyone apart from you guys who has.
What I'm asking you for is a description of what problem we're trying to
solve and how this code solves that problem. It is hard, it is inefficient
and, worse, it is error-prone for us to try to work all that out from a
particular implementation.
Andrew Morton <[email protected]> wrote on 21/03/2006 10:05:21:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > >
> > > > > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > > > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > > > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > > > > + *addr = opcode;
> > > > > > + /*TODO: flush vma ? */
> > > > >
> > > > > flush_dcache_page() would be needed.
> > > > >
> > > > > But then, what happens if the page is shared by other
> processes? Do they
> > > > > all start taking debug traps?
> > > >
> > > > Yes, you are right. I think single stepping inline was a bad
> idea, disarming
> > > > the probe looks to be a better option
> > > >
> > >
> > > You skipped my second question?
> >
> > There is a small window in which other processor will not be able to
see
> > the breakpoint if we are single step inline. But since single
> stepping inline
> > is a bad idea, we will disarm the probe forever (replace with
> original instrcution) if we cannot single step out-of-line.
> > Any suggestions?
>
> This doesn't appear to be working.
>
> Let's go back in time and pretend that these patches were never written,
> OK? We're standing around the water cooler saying "hey, wouldn't it be
> cool if someone did X". You guys are way too far into this and you keep
on
> leaving everyone else behind. When I try to drag you up, you resist ;)
>
> So let me rephrase, and thrash around in the dark a little more.
>
> From my reading of the code (and thus far that's my _only_ source of this
> information) it appears that a design decision has been made (for reasons
> which have yet to be disclosed) that the way to implement this (yet to be
> described) requirement is to set user breakpoints upon particular
> instructions within executables. System-wide, for all processes and
> threads.
>
> There are other things that could have been done. For example, you might
> have chosen to set breakpoints upon a particular virtual address within a
> heavyweight process. That's a process-oriented viewpoint, rather than a
> file-oriented one, if you like.
>
> This raises interesting questions, like
>
> - How come that decision was made? Why _is_ this an executable-oriented
> rather than process-oriented thing? Richard has covered that somewhat.
>
> - What happens if the executable is writeably mapped?
>
> - What happens if someone writes to the executable? (I think both
> of these were disallowed in the
implementation-which-is-not-to-be-named).
>
> - What happens if different processes map the executable at different
> addresses?
>
> - Various other things which you've thought of and I haven't and which it
> would be REALLY interesting to hear about.
>
> But this is just one example. I don't think I'm being too picky here -
my
> reaction on seeing all this stuff was, basically, "wtf is all this code
> for?". References to dprobes won't help, sorry - I've never looked at
> dprobes and I don't know anyone apart from you guys who has.
>
> What I'm asking you for is a description of what problem we're trying to
> solve and how this code solves that problem. It is hard, it is
inefficient
> and, worse, it is error-prone for us to try to work all that out from a
> particular implementation.
This is completely reasonable. And we don't need to refer to dprobes - that
was merely an evolutionary step.
I do have a notes that captured most the early design discussions and
decisions, which Prasanna and I should dive into to help answer some of the
points you raise.
Let me suggest, we chat among ourselves and pull together the relevant
details to answer each of your questions.
Richard
On Tue, Mar 21, 2006 at 02:05:21AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > >
> > > > > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > > > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > > > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > > > > + *addr = opcode;
> > > > > > + /*TODO: flush vma ? */
> > > > >
> > > > > flush_dcache_page() would be needed.
> > > > >
> > > > > But then, what happens if the page is shared by other processes? Do they
> > > > > all start taking debug traps?
> > > >
> > > > Yes, you are right. I think single stepping inline was a bad idea, disarming
> > > > the probe looks to be a better option
> > > >
> > >
> > > You skipped my second question?
> >
> > There is a small window in which other processor will not be able to see
> > the breakpoint if we are single step inline. But since single stepping inline
> > is a bad idea, we will disarm the probe forever (replace with original instrcution) if we cannot single step out-of-line.
> > Any suggestions?
>
> This doesn't appear to be working.
>
> Let's go back in time and pretend that these patches were never written,
> OK? We're standing around the water cooler saying "hey, wouldn't it be
> cool if someone did X". You guys are way too far into this and you keep on
> leaving everyone else behind. When I try to drag you up, you resist ;)
>
> So let me rephrase, and thrash around in the dark a little more.
>
> >From my reading of the code (and thus far that's my _only_ source of this
> information) it appears that a design decision has been made (for reasons
> which have yet to be disclosed) that the way to implement this (yet to be
> described) requirement is to set user breakpoints upon particular
> instructions within executables. System-wide, for all processes and
> threads.
>
> There are other things that could have been done. For example, you might
> have chosen to set breakpoints upon a particular virtual address within a
> heavyweight process. That's a process-oriented viewpoint, rather than a
> file-oriented one, if you like.
>
> This raises interesting questions, like
>
> - How come that decision was made? Why _is_ this an executable-oriented
> rather than process-oriented thing? Richard has covered that somewhat.
>
> - What happens if the executable is writeably mapped?
>
> - What happens if someone writes to the executable? (I think both
> of these were disallowed in the implementation-which-is-not-to-be-named).
>
> - What happens if different processes map the executable at different
> addresses?
>
> - Various other things which you've thought of and I haven't and which it
> would be REALLY interesting to hear about.
>
> But this is just one example. I don't think I'm being too picky here - my
> reaction on seeing all this stuff was, basically, "wtf is all this code
> for?". References to dprobes won't help, sorry - I've never looked at
> dprobes and I don't know anyone apart from you guys who has.
>
> What I'm asking you for is a description of what problem we're trying to
> solve and how this code solves that problem. It is hard, it is inefficient
> and, worse, it is error-prone for us to try to work all that out from a
> particular implementation.
Prasanna, I guess putting together a short writeup on the problem
description and the thinking behind the design decisions, known flaws etc,
in the form of a Documentation patch may help for starters. These questions
are likely to come up everytime anyone looks at the code.
The key thinking behind a lot of the design decisions was the
need for a very low overhead probe mechanism that would allow thousands of
active probes on the system and could detect any instance which hits the probe,
including probes on shared libraries which may be loaded by lots of
processes. Thus, (1) no forcing copy-on-write pages, (2) no forcing
in executable pages in memory just to place a probe on them (hence
zero overhead for probes which are very unlikely to be hit), (3) no
restrictions on evicting a page with a probe on it from memory (4) probes
being tracked by an (inode, offset) tuple rather than by virtual address
so that they can be shared across all processes mapping the executable/library
even at different virtual addresses, etc.
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
> A real life example of where this capability would have been very useful is
> with a performance problem I am currently investigating. It involves a GPFS
> + SAMBA + TCPIP + RDAC
this pobablt tells more about the crappy code quality of your propritary
code than a real need for this. please argue without reference to huge
blobs of junk.
On Mon, Mar 20, 2006 at 06:12:55PM -0800, Andrew Morton wrote:
> What does this entire feature *do*? Why does Linux need it?
it's what dtrace does. thus the marketing departments at ibm and redhat
decided to copy the features 1:1 instead of thinking what problem they
want to solve.
On Mon, Mar 20, 2006 at 11:37:45AM +0530, Prasanna S Panchamukhi wrote:
> This patch set provides the basic infrastructure for user-space probes
> based on IBM's Dprobes. Similar to kprobes, user-space probes uses the
> breakpoint mechanism to insert probes. User has to write a kernel module
> to insert probes in the executable/library and specify the handlers in
> the kernel module.
Doing this from kernelspace is wrong. It should be done from userspace,
best using a systemcall. Of couse the handlers can't work as-is you'd
need to redo that to work similarlt to ptrace - which will hopefully
allow some code reuse aswell.
And please - independent of the best api in this case - please don't ever
submit large amounts of code again that don't have any in-tree user.
Christoph Hellwig <[email protected]> wrote on 21/03/2006 11:14:52:
> > A real life example of where this capability would have been very
useful is
> > with a performance problem I am currently investigating. It involves a
GPFS
> > + SAMBA + TCPIP + RDAC
>
> this pobablt tells more about the crappy code quality of your propritary
> code than a real need for this. please argue without reference to huge
> blobs of junk.
>
Damn! I knew there was an obvious answer. Thanks Christoph, I'll code a fix
over lunch. Your insight is as always most refreshing.
Richard
Christoph Hellwig <[email protected]> wrote on 21/03/2006 11:28:07:
> On Mon, Mar 20, 2006 at 06:12:55PM -0800, Andrew Morton wrote:
> > What does this entire feature *do*? Why does Linux need it?
>
> it's what dtrace does. thus the marketing departments at ibm and redhat
> decided to copy the features 1:1 instead of thinking what problem they
> want to solve.
>
I think you'll find it happened the other way round. Sun openly references
my white papers. They even stole the name of an ancestor to kprobes. But
who cares, it not relevant or particularly interesting whether the chicken
or the egg came first.
> I think you'll find it happened the other way round. Sun openly references
> my white papers. They even stole the name of an ancestor to kprobes. But
> who cares, it not relevant or particularly interesting whether the chicken
> or the egg came first.
I know your papers, too. In fact dprobes' RPN program downloads are a far
better design than systemtap's generation of kernel code. it's a pity that
you gave up on dprobes instead of applying the required work to it and
integrate it with other bits of a tracing framework.
On Tue, Mar 21, 2006 at 02:05:21AM -0800, Andrew Morton wrote:
> Prasanna S Panchamukhi <[email protected]> wrote:
> >
> > >
> > > > > > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > > > > > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > > > > > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > > > > > + *addr = opcode;
> > > > > > + /*TODO: flush vma ? */
> > > > >
> > > > > flush_dcache_page() would be needed.
> > > > >
> > > > > But then, what happens if the page is shared by other processes? Do they
> > > > > all start taking debug traps?
> > > >
> > > > Yes, you are right. I think single stepping inline was a bad idea, disarming
> > > > the probe looks to be a better option
> > > >
> > >
> > > You skipped my second question?
> >
> > There is a small window in which other processor will not be able to see
> > the breakpoint if we are single step inline. But since single stepping inline
> > is a bad idea, we will disarm the probe forever (replace with original instrcution) if we cannot single step out-of-line.
> > Any suggestions?
>
> This doesn't appear to be working.
>
> Let's go back in time and pretend that these patches were never written,
> OK? We're standing around the water cooler saying "hey, wouldn't it be
> cool if someone did X". You guys are way too far into this and you keep on
> leaving everyone else behind. When I try to drag you up, you resist ;)
>
> So let me rephrase, and thrash around in the dark a little more.
>
> >From my reading of the code (and thus far that's my _only_ source of this
> information) it appears that a design decision has been made (for reasons
> which have yet to be disclosed) that the way to implement this (yet to be
> described) requirement is to set user breakpoints upon particular
> instructions within executables. System-wide, for all processes and
> threads.
>
> There are other things that could have been done. For example, you might
> have chosen to set breakpoints upon a particular virtual address within a
> heavyweight process. That's a process-oriented viewpoint, rather than a
> file-oriented one, if you like.
>
> This raises interesting questions, like
>
> - How come that decision was made? Why _is_ this an executable-oriented
> rather than process-oriented thing? Richard has covered that somewhat.
>
> - What happens if the executable is writeably mapped?
>
> - What happens if someone writes to the executable? (I think both
> of these were disallowed in the implementation-which-is-not-to-be-named).
>
> - What happens if different processes map the executable at different
> addresses?
>
> - Various other things which you've thought of and I haven't and which it
> would be REALLY interesting to hear about.
>
> But this is just one example. I don't think I'm being too picky here - my
> reaction on seeing all this stuff was, basically, "wtf is all this code
> for?". References to dprobes won't help, sorry - I've never looked at
> dprobes and I don't know anyone apart from you guys who has.
>
> What I'm asking you for is a description of what problem we're trying to
> solve and how this code solves that problem. It is hard, it is inefficient
> and, worse, it is error-prone for us to try to work all that out from a
> particular implementation.
Andrew,
The basic need is to provide infrastructure for user-space dynamic
instrumentation. As with kprobes, there is no need to recompile or
restart applications for instrumentation, under a debugger for instance.
Possible approaches which were looked up
1. Attaching or loading the application into a tool.
In this method the user application must be loaded into a
tool or the tool is attached to already running application. Before
the user can instrument an application the user should decide what that
instrumentation will consist of. Dynaprof uses such a mechanism.
http://www.dyninst.org/tools.html
2. Using a "jump" instruction to a trampoline and trampoline executing
the instrumented code in user-space.
Eg: Paradyn tool. (http://www.paradyn.org/)
Issues with method 1 and 2 are:
* Induces Intel erratum E49 where the other processors see
stale data while one processor replaces the jump instruction.
* Instruction can only be replaced atomically if the size of
the jump instruction is greater than or equal to the original
instruction.
* Other processors need to be stopped if the "jump" instruction size
is less than the original instruction.
3. Using breakpoint instruction Using a breakpoint instruction and
executing the instrumentation code from within the breakpoint handler
in the interrupt context.
The advantages of the approach (3) taken, apart from what Suparna listed
earlier in this thread
- User is able to collect user-space data and kernel space data using the
same instrumenation code and getting a complete picture
- Probes are visible across fork() syscall.
Richard's mail earlier in this thread details the per process Vs per
executable file based probe.
This is just an RFC and we are looking for suggestions (like Christoph's).
Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-51776329
On Tue, Mar 21, 2006 at 11:14:52AM +0000, Christoph Hellwig wrote:
> > A real life example of where this capability would have been very useful is
> > with a performance problem I am currently investigating. It involves a GPFS
> > + SAMBA + TCPIP + RDAC
>
> this pobablt tells more about the crappy code quality of your propritary
> code than a real need for this. please argue without reference to huge
> blobs of junk.
In real life there are complicated stacks; sometimes they are open
source (for example, like JBoss or Tomcat), sometimes they are
propietary products, sometimes they are custom applications written by
the end-user. Sun has been making big hay about how with dtrace, you
can easily figure out what is going on. Systemtap is a tool that will
allow us to have have this kind of capability, and user space probes
is part of that project.
- Ted
72
Christoph Hellwig <[email protected]> wrote on 21/03/2006 12:15:50:
> > I think you'll find it happened the other way round. Sun openly
references
> > my white papers. They even stole the name of an ancestor to kprobes.
But
> > who cares, it not relevant or particularly interesting whether the
chicken
> > or the egg came first.
>
> I know your papers, too. In fact dprobes' RPN program downloads are a
far
> better design than systemtap's generation of kernel code. it's a pity
that
> you gave up on dprobes instead of applying the required work to it and
> integrate it with other bits of a tracing framework.
Fascinating, gave up on dprobes, not really! I thought the kernel community
felt it was the wrong implementation. We did remove all the RPN stuff to a
loadable kernel module and left behind a minimal API set - krpobes - which
comprised the kernel probing mechanism, user-space probes extensions and
watchpoint probes extension. The result was identical functionality to the
original dprobes but with a minimal patch to the mainline kernel. But in
addition it provided a very much more generalized interface that would
allow other utilities to exploit the kernel interface, which they have. In
this sense dprobes still exists and can be used on top of krpobes. What
would you recommend be retained from dprobes? And what further
modifications?