2006-05-09 06:55:00

by S. P. Prasanna

Subject: [RFC] [PATCH 0/6] Kprobes: User-space probes support for i386

Hi All,

As requested earlier on the mailing list, below is the detailed
description of user-space probes followed by patches.
Please review and provide your comments.

http://lkml.org/lkml/2006/3/20/2 is the earlier posting.

ChangeLog:

- Separate patches to move generic code to mm and vfs subsystem.(Nick)
- Use get_user_pages() for __copy_to_user_inatomic to succeed.(Arjan)
- Use kmap_atomic() instead of kmap().(Andrew)
- Remove __lock_page() usage.(Andrew)
- Use mmap_sem before calling find_vma().(Andrew)
- Use flush_dcache_page() in replace_original_insn().(Andrew)
- Remove docbook style comments.(Andrew)
- Use inc/dec_preempt_count() in uprobe_handlers().(Andrew)

Thanks
Prasanna
--

A. What is the problem we are trying to solve?

The primary intent is to provide a system-wide tracing framework for Linux.
This framework can be used in conjunction with (as an extension of) kprobes
to gather information from both kernel and user-space, removing the
need to collect the data separately and then correlate it. It provides a
system-wide view of the problem at hand.

Some use-cases could be:
- One process depletes a system-wide resource (dcache, etc)
- One process owns resources exclusively, causing others to wait
- One process hogs the CPU or I/O bandwidth.

B. Why does Linux need this feature?

As Linux gets deployed in bigger and more complicated computing environments,
more issues relating to performance are surfacing. Tools that provide a
holistic view of the system can provide invaluable insights to the problem
at hand. Some debug scenarios require system-wide instrumentation so that
thousands of active probes with low-overhead can co-exist and all the instances
of probe hits on any binary can be detected. There are situations where
existing tools like ptrace do not scale well.

For example:
- When working on networking-related performance problems, you need to
correlate instrumentation from multiple layers (the MAC layer
and IP stack in the kernel up to the application in question).
- Diagnosing problems with the X-Windows server, for example, might
require instrumenting all clients that connect to the server.
- When tackling issues relating to performance of distributed systems,
involving, say, a filesystem, samba, apache and the like, gathering
data independently and then correlating the same is going to be a
difficult task.


C. Design drivers

The primary drivers in arriving at this design were:
- Dynamic instrumentation that can be created, installed and removed as
needed without rebooting or restarting applications
- System-wide instrumentation, with the user free to retain or discard
data as desired, and the ability to gather both user and kernel data
with the same instrumentation code
- Not having to force COW on pages
- Not having to force pages into memory just to insert probes
- Not having to be concerned with evicting pages from memory under pressure
- Ability to probe shared libraries
- Ability to insert probes on applications that are yet to be started
- Probes are visible across fork() calls


D. Advantages of this approach

- No COW/privatization of pages or forcing of pages into memory just for
the sake of probe insertion
- No restriction on evicting pages with probes from memory
- Since probes are inserted based on the inode-offset tuple, all
instances of the program are instrumented -- user then has the
advantage of choosing what instances of the application he'd like
to instrument
- Probes can be inserted on applications residing on read-only mounts,
since the text pages are discarded post execution
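
The inode-offset tuple mentioned above can be illustrated with a small
user-space sketch. It assumes the usual relation between a mapping and
its backing file (file offset = address - vm_start + page offset of the
mapping); struct toy_vma and toy_file_offset() are hypothetical names
for illustration only and are not part of the patches.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12	/* 4 KiB pages, as on i386 */

/* Toy stand-in for the few vma fields that matter here (hypothetical). */
struct toy_vma {
    unsigned long vm_start;	/* first virtual address of the mapping */
    unsigned long vm_end;	/* one past the last virtual address */
    unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Translate a virtual address inside the mapping into a file offset. */
unsigned long toy_file_offset(const struct toy_vma *vma, unsigned long addr)
{
    assert(addr >= vma->vm_start && addr < vma->vm_end);
    return (addr - vma->vm_start) + (vma->vm_pgoff << TOY_PAGE_SHIFT);
}
```

With a text mapping starting at 0x08048000 and a page offset of 0, the
address 0x080484d4 resolves to offset 0x4d4. A second process mapping
the same file at another address resolves to the same offset, which is
why one probe covers all instances.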


E. The details

At the basic level, similar to kprobes, a breakpoint instruction or
watchpoint is inserted at the instrumentation location and handlers are
run when the breakpoint/watchpoint is hit.

In order to be able to insert probes on pages that aren't in memory
during registration, the readpage(s) hooks of struct
address_space_operations are modified for the inode in question so as to
be able to first insert the probes onto the page at the time it is read
into memory. This mechanism adds some overhead, but is restricted to the
probed binaries only.
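
The readpage hooking described above can be modelled in user-space
roughly as follows. This is only a sketch: the toy_* types and functions
are hypothetical, a single global stands in for the per-file lookup, and
no locking is shown. The ori_a_ops/user_a_ops field names mirror
struct uprobe_module from patch 3/6; 0xcc is the i386 int3 opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_BREAKPOINT 0xcc	/* int3 on i386 */

struct toy_page {
    unsigned long index;	/* page number within the file */
    unsigned char data[TOY_PAGE_SIZE];
};

/* Minimal stand-in for struct address_space_operations. */
struct toy_a_ops {
    int (*readpage)(struct toy_page *page);
};

struct toy_probe {
    unsigned long offset;	/* probe offset within the file */
    unsigned char saved_opcode;	/* byte displaced by the breakpoint */
};

struct toy_module {
    const struct toy_a_ops *ori_a_ops;	/* filesystem's own operations */
    struct toy_a_ops user_a_ops;	/* copy with readpage hooked */
    struct toy_probe *probes;
    size_t nprobes;
};

static struct toy_module *hooked;	/* one probed file, for brevity */

/* Pretend filesystem: fills the page with nop (0x90) bytes. */
int toy_fs_readpage(struct toy_page *page)
{
    memset(page->data, 0x90, TOY_PAGE_SIZE);
    return 0;
}

/* Hooked readpage: let the filesystem fill the page, then arm probes. */
int toy_uprobe_readpage(struct toy_page *page)
{
    int ret = hooked->ori_a_ops->readpage(page);
    size_t i;

    if (ret)
        return ret;
    for (i = 0; i < hooked->nprobes; i++) {
        struct toy_probe *p = &hooked->probes[i];

        if (p->offset / TOY_PAGE_SIZE != page->index)
            continue;	/* probe lives on another page */
        p->saved_opcode = page->data[p->offset % TOY_PAGE_SIZE];
        page->data[p->offset % TOY_PAGE_SIZE] = TOY_BREAKPOINT;
    }
    return 0;
}

/* Install the hook: copy the original ops, override readpage only. */
void toy_hook_readpage(struct toy_module *m, const struct toy_a_ops *ori)
{
    assert(ori && ori->readpage);
    m->ori_a_ops = ori;
    m->user_a_ops = *ori;
    m->user_a_ops.readpage = toy_uprobe_readpage;
    hooked = m;
}
```

Only reads that go through the hooked operations pay the extra cost of
scanning the probe list, which matches the point that the overhead is
restricted to probed binaries.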

The instrumented binary should not be allowed to change for the duration
of the instrumentation. This is achieved by decrementing the
inode->i_writecount of the instrumented binary, so we get exclusive
write access for the entire instrumentation duration.

When the breakpoint is hit, similar to a kprobe, its associated pre_handler
is invoked. The original instruction is then single-stepped out-of-line
so as to prevent any possible SMP misses. Single-stepping out-of-line
requires us to find an unused area in the process address space to which
we can copy the probed instruction.

- The application stack is checked to see if there is sufficient space
for the instruction copy. If so, the instruction is copied to the bottom
of the page. Some architectures have stack pages with no-exec set.
In such cases, the no-exec bit for the corresponding stack page is
temporarily unset.
- If there is insufficient space on stack, the vma is expanded beyond
the current stack's vma and that is used for the single-stepping.
- In cases where the vma can't be extended (the process has exhausted
all its virtual address space), we resort to single-stepping inline by
replacing the original instruction back at the probed location.
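
The three fallbacks above amount to a simple decision, sketched here in
user-space C. The names and the 16-byte bound are assumptions made for
illustration; the actual checks in the patches operate on the task's
stack vma and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed upper bound on the size of one copied i386 instruction. */
#define TOY_MAX_INSN_SIZE 16

enum toy_ss_mode {
    TOY_SS_ON_STACK,		/* copy fits in the free part of the stack page */
    TOY_SS_EXPANDED_VMA,	/* stack vma is grown to hold the copy */
    TOY_SS_INLINE,		/* original instruction is put back in place */
};

/* Pick where the probed instruction will be single-stepped. */
enum toy_ss_mode toy_pick_ss_mode(size_t free_stack_bytes, bool vma_can_grow)
{
    if (free_stack_bytes >= TOY_MAX_INSN_SIZE)
        return TOY_SS_ON_STACK;
    if (vma_can_grow)
        return TOY_SS_EXPANDED_VMA;
    return TOY_SS_INLINE;
}
```

The inline case is the only one that reopens the window for missed hits
on SMP, which is presumably why it is kept as the last resort.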



F. Known issues/flaws:

- Currently, applications that access the page-cache directly for I/O
will see the breakpoint instruction in text. Similar is the case of
text pages that are mmap'ed private.
- Arjan pointed out that tripwire-like tools can clearly detect the text
corruption.
- There is a way to fix these, albeit not too elegant.
- Modify the file_read_actor() to check if the read is
for a probed application and remove the breakpoints on
the copied image. This solution has been prototyped and
is known to work.
- There is a possibility that probes on an executable mmap'ed shared could
be written back to disk. The simplest solution is to disallow probes on
shared mmap objects.

- Instrumentation data that can be gathered is limited to pages resident
in memory when the probepoint is hit. A jprobe-like approach can be
used to collect data from pages that are not present in memory when
the probepoint is hit.
- The instrumentation handler runs in kernel context. As Arjan pointed
out in one of the earlier discussion threads on this topic, running a
handler in user-space makes better debug information available.
- A jprobe like approach has been prototyped using a system-call
interface. This provides for executing the instrumentation
code in the process context in userspace. Clearly this has
significant overhead
- We have to take a debug trap to return back to the
"normal" process context from the instrumentation context.
- The instrumentation code must be made part of the
address spaces of the processes that map the same
binary.

- Probes on text that is mapped at different addresses by different
processes need special handling. This could be solved by tracking
vmas that map the same text pages and inserting probes on them.
- Coexistence with debuggers is another issue. The simplest solution is
to fail registration of a breakpoint if one is already existing at the
location to be instrumented.
- Due to the system-wide approach to instrumentation, all processes
running the same executable end up having to pay the penalty of taking
the debug trap. Finer-grained controls can be provided to minimize
overhead by possibly filtering events based on pids of processes we
are interested in.

- It's been suggested that writing a kernel module to gather user-space
data isn't a great idea. However, with tools like systemtap, it is
possible for application programmers and system admins to just script
and gather data.


G. What alternative solutions were there?

As far as we know, there doesn't exist a system-wide, dynamic tracing
framework.

There are, of course, tools like ptrace(), that are suitable for per-process
instrumentation. But ptrace() has its own design/performance issues and
it's also well known that the ptrace approach won't scale well, especially
given the overhead of context switches and other issues with the current
implementation.

A short writeup on other approaches tried is available here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=114344261621050&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=114294391910100&w=2

The belief is, on Linux there is space for both types of instrumentation
to coexist, and a need for both. Hence the proposal.


H. Open questions:

1. What if the text is writably mapped?
- Fail inserting probes on them.
2. What are the typical cases when an executable (library?) is mmap'ed
shared?

I. Usage:
/* Allocate a uprobe structure */
struct uprobe p;

/* Define pre handler */
int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
.............collect useful data..............
return 0;
}

void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
.............collect useful data..............
}

int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
........ release allocated resources & try to recover ....
return 0;
}

Before inserting the probe, specify the pathname of the application
on which the probe is to be inserted.

/*pointer to the pathname of the application */
p.pathname = "/home/prasanna/bin/myapp";
p.kp.pre_handler = handler_pre;
p.kp.post_handler = handler_post;
p.kp.fault_handler = handler_fault;

/* Specify the probe address */
/* $nm appln |grep func1 */
p.kp.addr = (kprobe_opcode_t *)0x080484d4;
/* Specify the offset within the application/executable*/
p.offset = (unsigned long)0x4d4;
/* Now register the userspace probe */
ret = register_uprobe(&p);
if (ret)
	printk("register_uprobe: unsuccessful ret= %d\n", ret);

/* To unregister the registered probe, just call..*/
unregister_uprobe(&p);

--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329


2006-05-09 06:59:19

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 1/6] Kprobes: Allow/deny exclusive write access to inodes

This patch adds two new wrapper routines to namei.c to
decrement and increment the inode writecount. The existing
routine deny_write_access() decrements the inode
writecount for a given file pointer, but there is no
wrapper routine that decrements the inode writecount
for a given inode pointer. There is also no routine that
increments the inode writecount if it is less than zero.
The existing deny_write_access() is modified to use
the new wrapper routine. Kprobes' user-space probes use
these wrapper routines to get and release exclusive
write access to the probed binary.
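
The writecount convention these wrappers rely on can be modelled in
user-space as below. This is a sketch, not the kernel code: struct
toy_inode and the toy_* functions are hypothetical stand-ins, a plain
int replaces the atomic counter, and the i_lock spinlock is omitted.

```c
#include <assert.h>
#include <errno.h>

/* Toy inode: a positive count means writers, a negative one deniers. */
struct toy_inode {
    int i_writecount;
};

/* A writer wants the file: refused while write access is denied. */
int toy_get_write_access(struct toy_inode *inode)
{
    if (inode->i_writecount < 0)
        return -ETXTBSY;
    inode->i_writecount++;
    return 0;
}

/* Deny writes: refused while somebody has the file open for writing. */
int toy_deny_write_access_to_inode(struct toy_inode *inode)
{
    if (inode->i_writecount > 0)
        return -ETXTBSY;
    inode->i_writecount--;
    return 0;
}

/* Drop one denial: only valid while the count is negative. */
int toy_write_access_to_inode(struct toy_inode *inode)
{
    if (inode->i_writecount >= 0)
        return -ETXTBSY;
    inode->i_writecount++;
    return 0;
}
```

In this model a probe registration would call
toy_deny_write_access_to_inode() once per probed file, and the last
unregistration would pair it with toy_write_access_to_inode().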

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


fs/namei.c | 34 +++++++++++++++++++++++++++++++---
include/linux/namei.h | 2 ++
2 files changed, 33 insertions(+), 3 deletions(-)

diff -puN fs/namei.c~kprobes_userspace_probes-denywrite-to-inode fs/namei.c
--- linux-2.6.17-rc3-mm1/fs/namei.c~kprobes_userspace_probes-denywrite-to-inode 2006-05-09 10:08:38.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/fs/namei.c 2006-05-09 10:08:39.000000000 +0530
@@ -322,10 +322,13 @@ int get_write_access(struct inode * inod
return 0;
}

-int deny_write_access(struct file * file)
+/* This routine decrements the writecount for a given inode to
+ * get exclusive write access, so that the file on which probes
+ * are currently applied does not change. User-space probes
+ * use this routine.
+ */
+int deny_write_access_to_inode(struct inode *inode)
{
- struct inode *inode = file->f_dentry->d_inode;
-
spin_lock(&inode->i_lock);
if (atomic_read(&inode->i_writecount) > 0) {
spin_unlock(&inode->i_lock);
@@ -337,6 +340,31 @@ int deny_write_access(struct file * file
return 0;
}

+/* This routine increments the writecount for a given inode
+ * to release the write lock. User-space probes use this
+ * routine.
+ */
+int write_access_to_inode(struct inode *inode)
+{
+ spin_lock(&inode->i_lock);
+ if (atomic_read(&inode->i_writecount) >= 0) {
+ spin_unlock(&inode->i_lock);
+ return -ETXTBSY;
+ }
+ atomic_inc(&inode->i_writecount);
+ spin_unlock(&inode->i_lock);
+
+ return 0;
+}
+
+/* Wrapper routine that decrements the writecount for a given file pointer. */
+int deny_write_access(struct file * file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+
+ return deny_write_access_to_inode(inode);
+}
+
void path_release(struct nameidata *nd)
{
dput(nd->dentry);
diff -puN include/linux/namei.h~kprobes_userspace_probes-denywrite-to-inode include/linux/namei.h
--- linux-2.6.17-rc3-mm1/include/linux/namei.h~kprobes_userspace_probes-denywrite-to-inode 2006-05-09 10:08:38.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/linux/namei.h 2006-05-09 10:08:39.000000000 +0530
@@ -81,6 +81,8 @@ extern int follow_up(struct vfsmount **,

extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
+extern int deny_write_access_to_inode(struct inode *inode);
+extern int write_access_to_inode(struct inode *inode);

static inline void nd_set_link(struct nameidata *nd, char *path)
{

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 07:01:33

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 2/6] Kprobes: Get one pagetable entry

This patch provides a wrapper routine to allocate one page
table entry for a given virtual address. The kprobes
user-space probe mechanism uses this routine to get one
page table entry. As Nick Piggin suggested, this generic
routine can be used by routines like get_user_pages,
find_*_page, and other standard APIs.
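
For readers less familiar with the page table walk, the sketch below
shows how a virtual address splits into table indices. It assumes
classic i386 paging without PAE (10-bit directory index, 10-bit table
index, 12-bit byte offset), where the pud and pmd levels are folded
into the pgd. The toy_* names are hypothetical and this is not the
routine added by the patch.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT   12	/* 4 KiB pages */
#define TOY_PGDIR_SHIFT  22	/* each directory entry covers 4 MiB */
#define TOY_PTRS_PER_PTE 1024

struct toy_pt_index {
    unsigned int pgd;	/* slot in the page directory */
    unsigned int pte;	/* slot in the page table */
    unsigned int offset;	/* byte within the page */
};

/* Split a 32-bit virtual address into its page table indices. */
struct toy_pt_index toy_split_address(unsigned long address)
{
    struct toy_pt_index idx;

    assert(address <= 0xffffffffUL);	/* 32-bit addresses only */
    idx.pgd = (unsigned int)(address >> TOY_PGDIR_SHIFT);
    idx.pte = (unsigned int)((address >> TOY_PAGE_SHIFT) &
                             (TOY_PTRS_PER_PTE - 1));
    idx.offset = (unsigned int)(address & ((1UL << TOY_PAGE_SHIFT) - 1));
    return idx;
}
```

For the probe address 0x080484d4 used in the usage example of the cover
letter, this yields directory slot 32, table slot 0x48 and byte offset
0x4d4.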

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


mm/memory.c | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+)

diff -puN mm/memory.c~kprobes_userspace_probes-get-one-pagetable-entry mm/memory.c
--- linux-2.6.17-rc3-mm1/mm/memory.c~kprobes_userspace_probes-get-one-pagetable-entry 2006-05-09 10:08:44.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/mm/memory.c 2006-05-09 10:08:44.000000000 +0530
@@ -48,6 +48,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/kprobes.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -417,6 +418,34 @@ struct page *vm_normal_page(struct vm_ar
}

/*
+ * This routine gets the pte of the page containing the specified address.
+ */
+pte_t __kprobes *get_one_pte(unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte = NULL;
+
+ pgd = pgd_offset(current->mm, address);
+ if (!pgd)
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud)
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd)
+ goto out;
+
+ pte = pte_alloc_map(current->mm, pmd, address);
+
+out:
+ return pte;
+}
+
+/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 07:05:18

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

This patch provides two interfaces to insert and remove
user space probes. Each probe is uniquely identified by
inode and offset within that executable/library file.
Insertion of a probe involves getting the code page for
a given offset, mapping it into the memory and then inserting
the breakpoint at the given offset. Also the probe is added
to the uprobe_table hash list. A uprobe_module data structure
is allocated for every probed application/library image on disk.
Removal of a probe involves getting the code page for a given
offset, mapping that page into memory and then replacing
the breakpoint instruction with the original opcode.
This patch also provides an aggregate probe handler feature,
where the user can define multiple handlers per probe.
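
The aggregate handler idea can be sketched in user-space as follows.
This is a simplified model under stated assumptions: a singly linked
list replaces the kernel list_head, only the pre-handler path is shown,
and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_regs {
    unsigned long eip;
};

struct toy_probe;
typedef int (*toy_pre_handler_t)(struct toy_probe *, struct toy_regs *);

struct toy_probe {
    toy_pre_handler_t pre_handler;	/* may be NULL */
    struct toy_probe *next;		/* next probe at the same location */
    int hits;
};

/* Run each pre-handler in order; stop at the first nonzero return. */
int toy_aggr_pre_handler(struct toy_probe *head, struct toy_regs *regs)
{
    struct toy_probe *p;

    for (p = head; p != NULL; p = p->next) {
        if (p->pre_handler && p->pre_handler(p, regs))
            return 1;
    }
    return 0;
}

/* Example handler: count the hit and let later handlers run. */
int toy_count_hit(struct toy_probe *p, struct toy_regs *regs)
{
    (void)regs;
    p->hits++;
    return 0;
}

/* Example handler: count the hit and stop the chain. */
int toy_count_and_stop(struct toy_probe *p, struct toy_regs *regs)
{
    (void)regs;
    p->hits++;
    return 1;
}
```

As in aggr_user_pre_handler() in the diff below, a handler that returns
nonzero prevents the remaining handlers at that location from running.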

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


arch/i386/kernel/uprobes.c | 71 +++++
include/linux/kprobes.h | 61 ++++
include/linux/list.h | 16 +
kernel/Makefile | 4
kernel/kprobes.c | 19 +
kernel/uprobes.c | 579 +++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 748 insertions(+), 2 deletions(-)

diff -puN include/linux/kprobes.h~kprobes_userspace_probes-base-interface include/linux/kprobes.h
--- linux-2.6.17-rc3-mm1/include/linux/kprobes.h~kprobes_userspace_probes-base-interface 2006-05-09 10:08:47.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/linux/kprobes.h 2006-05-09 12:32:50.000000000 +0530
@@ -37,6 +37,10 @@
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/dcache.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>

#ifdef CONFIG_KPROBES
#include <asm/kprobes.h>
@@ -47,6 +51,13 @@
#define KPROBE_REENTER 0x00000004
#define KPROBE_HIT_SSDONE 0x00000008

+/* uprobe_status settings */
+#define UPROBE_HIT_ACTIVE 0x00000001
+#define UPROBE_HIT_SS 0x00000002
+#define UPROBE_HIT_SSDONE 0x00000004
+#define UPROBE_SS_INLINE 0x00000008
+#define UPROBE_SSDONE_INLINE 0x00000010
+
/* Attach to insert probes on any functions which should be ignored*/
#define __kprobes __attribute__((__section__(".kprobes.text")))

@@ -54,6 +65,7 @@ struct kprobe;
struct pt_regs;
struct kretprobe;
struct kretprobe_instance;
+extern struct uprobe *current_uprobe;
typedef int (*kprobe_pre_handler_t) (struct kprobe *, struct pt_regs *);
typedef int (*kprobe_break_handler_t) (struct kprobe *, struct pt_regs *);
typedef void (*kprobe_post_handler_t) (struct kprobe *, struct pt_regs *,
@@ -117,6 +129,32 @@ struct jprobe {
DECLARE_PER_CPU(struct kprobe *, current_kprobe);
DECLARE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);

+struct uprobe {
+ /* pointer to the pathname of the application */
+ char *pathname;
+ /* kprobe structure with user specified handlers */
+ struct kprobe kp;
+ /* hlist of all the userspace probes per application */
+ struct hlist_node ulist;
+ /* inode of the probed application */
+ struct inode *inode;
+ /* probe offset within the file */
+ unsigned long offset;
+};
+
+struct uprobe_module {
+ /* hlist head of all userspace probes per application */
+ struct hlist_head ulist_head;
+ /* list of all uprobe_module for probed application */
+ struct list_head mlist;
+ /* to hold path/dentry etc. */
+ struct nameidata nd;
+ /* original readpage operations */
+ struct address_space_operations *ori_a_ops;
+ /* readpage hooks added operations */
+ struct address_space_operations user_a_ops;
+};
+
#ifdef ARCH_SUPPORTS_KRETPROBES
extern void arch_prepare_kretprobe(struct kretprobe *rp, struct pt_regs *regs);
#else /* ARCH_SUPPORTS_KRETPROBES */
@@ -153,6 +191,7 @@ struct kretprobe_instance {
};

extern spinlock_t kretprobe_lock;
+extern spinlock_t uprobe_lock;
extern struct mutex kprobe_mutex;
extern int arch_prepare_kprobe(struct kprobe *p);
extern void arch_arm_kprobe(struct kprobe *p);
@@ -162,9 +201,16 @@ extern void show_registers(struct pt_reg
extern kprobe_opcode_t *get_insn_slot(void);
extern void free_insn_slot(kprobe_opcode_t *slot);
extern void kprobes_inc_nmissed_count(struct kprobe *p);
+extern void copy_kprobe(struct kprobe *old_p, struct kprobe *p);
+extern int arch_copy_uprobe(struct kprobe *p, kprobe_opcode_t *address);
+extern void arch_arm_uprobe(kprobe_opcode_t *address);
+extern void arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address);
+extern void init_uprobes(void);

/* Get the kprobe at this addr (if any) - called with preemption disabled */
struct kprobe *get_kprobe(void *addr);
+struct kprobe *get_uprobe(void *addr);
+extern int arch_alloc_insn(struct kprobe *p);
struct hlist_head * kretprobe_inst_table_head(struct task_struct *tsk);

/* kprobe_running() will just return the current_kprobe on this CPU */
@@ -183,6 +229,16 @@ static inline struct kprobe_ctlblk *get_
return (&__get_cpu_var(kprobe_ctlblk));
}

+static inline void set_uprobe_instance(struct kprobe *p)
+{
+ current_uprobe = container_of(p, struct uprobe, kp);
+}
+
+static inline void reset_uprobe_instance(void)
+{
+ current_uprobe = NULL;
+}
+
int register_kprobe(struct kprobe *p);
void unregister_kprobe(struct kprobe *p);
int setjmp_pre_handler(struct kprobe *, struct pt_regs *);
@@ -194,10 +250,15 @@ void jprobe_return(void);
int register_kretprobe(struct kretprobe *rp);
void unregister_kretprobe(struct kretprobe *rp);

+int register_uprobe(struct uprobe *uprobe);
+void unregister_uprobe(struct uprobe *uprobe);
+
struct kretprobe_instance *get_free_rp_inst(struct kretprobe *rp);
void add_rp_inst(struct kretprobe_instance *ri);
void kprobe_flush_task(struct task_struct *tk);
void recycle_rp_inst(struct kretprobe_instance *ri);
+void kprobes_add_pagefault_notifier(void);
+void kprobes_remove_pagefault_notifier(void);
#else /* CONFIG_KPROBES */

#define __kprobes /**/
diff -puN /dev/null kernel/uprobes.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/kernel/uprobes.c 2006-05-09 12:32:57.000000000 +0530
@@ -0,0 +1,579 @@
+/*
+ * User-space Probes (UProbes)
+ * kernel/uprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * 2006-Mar Created by Prasanna S Panchamukhi <[email protected]>
+ * User-space probes initial implementation based on IBM's
+ * Dprobes.
+ */
+#include <linux/kprobes.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/moduleloader.h>
+#include <asm-generic/sections.h>
+#include <asm/cacheflush.h>
+#include <asm/errno.h>
+#include <asm/kdebug.h>
+
+#define UPROBE_HASH_BITS 6
+#define UPROBE_TABLE_SIZE (1 << UPROBE_HASH_BITS)
+
+/* user space probes lists */
+static struct list_head uprobe_module_list;
+static struct hlist_head uprobe_table[UPROBE_TABLE_SIZE];
+DEFINE_SPINLOCK(uprobe_lock); /* Protects uprobe_table*/
+DEFINE_MUTEX(uprobe_mutex); /* Protects uprobe_module_list */
+
+/*
+ * Aggregate handlers for multiple uprobes support - these handlers
+ * take care of invoking the individual uprobe handlers on p->list
+ */
+static int __kprobes aggr_user_pre_handler(struct kprobe *p,
+ struct pt_regs *regs)
+{
+ struct kprobe *kp;
+
+ list_for_each_entry(kp, &p->list, list) {
+ if (kp->pre_handler) {
+ set_uprobe_instance(kp);
+ if (kp->pre_handler(kp, regs))
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static void __kprobes aggr_user_post_handler(struct kprobe *p,
+ struct pt_regs *regs, unsigned long flags)
+{
+ struct kprobe *kp;
+
+ list_for_each_entry(kp, &p->list, list) {
+ if (kp->post_handler) {
+ set_uprobe_instance(kp);
+ kp->post_handler(kp, regs, flags);
+ }
+ }
+}
+
+static int __kprobes aggr_user_fault_handler(struct kprobe *p,
+ struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur;
+
+ /*
+ * if we faulted "during" the execution of a user specified
+ * probe handler, invoke just that probe's fault handler
+ */
+ cur = &current_uprobe->kp;
+ if (cur && cur->fault_handler)
+ if (cur->fault_handler(cur, regs, trapnr))
+ return 1;
+ return 0;
+}
+
+/*
+ * This routine looks for an existing uprobe at the given offset and inode.
+ * If it's found, returns the corresponding kprobe pointer.
+ * This should be called with uprobe_lock held.
+ */
+static struct kprobe __kprobes *get_kprobe_user(struct inode *inode,
+ unsigned long offset)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct kprobe *p, *kpr;
+ struct uprobe *uprobe;
+
+ head = &uprobe_table[hash_ptr((kprobe_opcode_t *)
+ (((unsigned long)inode) * offset), UPROBE_HASH_BITS)];
+
+ hlist_for_each_entry(p, node, head, hlist) {
+ if (p->pre_handler == aggr_user_pre_handler) {
+ kpr = list_entry(p->list.next, typeof(*kpr), list);
+ uprobe = container_of(kpr, struct uprobe, kp);
+ } else
+ uprobe = container_of(p, struct uprobe, kp);
+
+ if ((uprobe->inode == inode) && (uprobe->offset == offset))
+ return p;
+ }
+
+ return NULL;
+}
+
+/*
+ * Finds a uprobe at the specified user-space address in the current task.
+ * Points current_uprobe at that uprobe and returns the corresponding kprobe.
+ */
+struct kprobe __kprobes *get_uprobe(void *addr)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ struct inode *inode;
+ unsigned long offset;
+ struct kprobe *p, *kpr;
+ struct uprobe *uprobe;
+
+ if (!down_read_trylock(&mm->mmap_sem))
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, (unsigned long)addr);
+
+ BUG_ON(!vma); /* this should not happen, not in our memory map */
+
+ offset = (unsigned long)addr - (vma->vm_start +
+ (vma->vm_pgoff << PAGE_SHIFT));
+ if (!vma->vm_file) {
+ up_read(&mm->mmap_sem);
+ return NULL;
+ }
+
+ inode = vma->vm_file->f_dentry->d_inode;
+ up_read(&mm->mmap_sem);
+
+ p = get_kprobe_user(inode, offset);
+ if (!p)
+ return NULL;
+
+ if (p->pre_handler == aggr_user_pre_handler) {
+ /*
+ * Walk the uprobe aggregate list and return the first
+ * element on the aggregate list.
+ */
+ kpr = list_entry((p)->list.next, typeof(*kpr), list);
+ uprobe = container_of(kpr, struct uprobe, kp);
+ } else
+ uprobe = container_of(p, struct uprobe, kp);
+
+ if (uprobe)
+ current_uprobe = uprobe;
+
+ return p;
+}
+
+/*
+ * Fill in the required fields of the "manager uprobe". Replace the
+ * earlier kprobe in the hlist with the manager uprobe
+ */
+static inline void add_aggr_uprobe(struct kprobe *ap, struct kprobe *p)
+{
+ copy_kprobe(p, ap);
+ ap->addr = p->addr;
+ ap->pre_handler = aggr_user_pre_handler;
+ ap->post_handler = aggr_user_post_handler;
+ ap->fault_handler = aggr_user_fault_handler;
+
+ INIT_LIST_HEAD(&ap->list);
+ list_add(&p->list, &ap->list);
+
+ hlist_replace(&p->hlist, &ap->hlist);
+}
+
+/*
+ * This is the second or subsequent uprobe at the address - handle
+ * the intricacies
+ */
+static int __kprobes register_aggr_uprobe(struct kprobe *old_p,
+ struct kprobe *p)
+{
+ struct kprobe *ap;
+
+ if (old_p->pre_handler == aggr_user_pre_handler) {
+ copy_kprobe(old_p, p);
+ list_add(&p->list, &old_p->list);
+ } else {
+ ap = kzalloc(sizeof(struct kprobe), GFP_ATOMIC);
+ if (!ap)
+ return -ENOMEM;
+ add_aggr_uprobe(ap, old_p);
+ copy_kprobe(ap, p);
+ list_add(&p->list, &old_p->list);
+ }
+ return 0;
+}
+
+typedef int (*process_uprobe_func_t)(struct uprobe *uprobe,
+ kprobe_opcode_t *address);
+
+/*
+ * Saves the original instruction in the uprobe structure and
+ * inserts the breakpoint at the given address.
+ */
+int __kprobes insert_kprobe_user(struct uprobe *uprobe,
+ kprobe_opcode_t *address)
+{
+ int ret = 0;
+
+ ret = arch_copy_uprobe(&uprobe->kp, address);
+ if (ret) {
+ printk("Breakpoint already present\n");
+ return ret;
+ }
+ arch_arm_uprobe(address);
+
+ return 0;
+}
+
+/*
+ * Wait for the page to be unlocked if someone else had locked it,
+ * then map the page and insert or remove the breakpoint.
+ */
+static int __kprobes map_uprobe_page(struct page *page, struct uprobe *uprobe,
+ process_uprobe_func_t process_kprobe_user)
+{
+ int ret = 0;
+ kprobe_opcode_t *uprobe_address;
+
+ if (!page)
+ return -EINVAL; /* TODO: more suitable errno */
+
+ lock_page(page);
+
+ uprobe_address = (kprobe_opcode_t *)kmap_atomic(page, KM_USER0);
+ uprobe_address = (kprobe_opcode_t *)((unsigned long)uprobe_address +
+ (uprobe->offset & ~PAGE_MASK));
+ ret = (*process_kprobe_user)(uprobe, uprobe_address);
+ kunmap_atomic(uprobe_address, KM_USER0);
+
+ unlock_page(page);
+
+ return ret;
+}
+
+/*
+ * flush_vma walks through the list of process private mappings,
+ * gets the vma containing the offset and flushes all the vmas
+ * containing the probed page.
+ */
+static void __kprobes flush_vma(struct address_space *mapping,
+ struct page *page, struct uprobe *uprobe)
+{
+ struct vm_area_struct *vma = NULL;
+ struct prio_tree_iter iter;
+ struct prio_tree_root *head = &mapping->i_mmap;
+ unsigned long start, end, offset = uprobe->offset;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, head, offset, offset) {
+ start = vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT);
+ end = vma->vm_end - (vma->vm_pgoff << PAGE_SHIFT);
+
+ if ((start + offset) < end)
+ flush_icache_user_range(vma, page,
+ (unsigned long)uprobe->kp.addr,
+ sizeof(kprobe_opcode_t));
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
+/*
+ * Walk the uprobe_module_list and return the uprobe module with matching
+ * inode.
+ */
+struct uprobe_module __kprobes *get_module_by_inode(struct inode *inode)
+{
+ struct uprobe_module *umodule;
+
+ list_for_each_entry(umodule, &uprobe_module_list, mlist) {
+ if (umodule->nd.dentry->d_inode == inode)
+ return umodule;
+ }
+
+ return NULL;
+}
+
+/*
+ * Add uprobe and uprobe_module to the appropriate hash list.
+ */
+static void __kprobes get_inode_ops(struct uprobe *uprobe,
+ struct uprobe_module *umodule)
+{
+ INIT_HLIST_HEAD(&umodule->ulist_head);
+ hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
+ list_add(&umodule->mlist, &uprobe_module_list);
+}
+
+/*
+ * Removes the specified uprobe from either aggregate uprobe list
+ * or individual uprobe hash table.
+ */
+
+static int __kprobes remove_uprobe(struct uprobe *uprobe)
+{
+ struct kprobe *old_p, *list_p, *p;
+ int ret = 0;
+
+ p = &uprobe->kp;
+ old_p = get_kprobe_user(uprobe->inode, uprobe->offset);
+ if (unlikely(!old_p))
+ return 0;
+
+ if (p != old_p) {
+ list_for_each_entry(list_p, &old_p->list, list)
+ if (list_p == p)
+ /* kprobe p is a valid probe */
+ goto valid_p;
+ return 0;
+ }
+
+valid_p:
+ if ((old_p == p) ||
+ ((old_p->pre_handler == aggr_user_pre_handler) &&
+ (p->list.next == &old_p->list) &&
+ (p->list.prev == &old_p->list))) {
+ /*
+ * Only probe on the hash list, mark the corresponding
+ * instruction slot for freeing by return 1.
+ */
+ ret = 1;
+ hlist_del(&old_p->hlist);
+ if (p != old_p) {
+ list_del(&p->list);
+ kfree(old_p);
+ }
+ } else
+ list_del(&p->list);
+
+ return ret;
+}
+
+/*
+ * Disarms the probe and frees the corresponding instruction slot.
+ */
+static int __kprobes remove_kprobe_user(struct uprobe *uprobe,
+ kprobe_opcode_t *address)
+{
+ struct kprobe *p = &uprobe->kp;
+
+ arch_disarm_uprobe(p, address);
+ arch_remove_kprobe(p);
+
+ return 0;
+}
+
+/*
+ * Adds the given uprobe to the uprobe_hash table if it is
+ * the first probe to be inserted at the given address else
+ * adds to the aggregate uprobe's list.
+ */
+static int __kprobes insert_uprobe(struct uprobe *uprobe)
+{
+ struct kprobe *old_p;
+ int ret = 0;
+ unsigned long offset = uprobe->offset;
+ unsigned long inode = (unsigned long) uprobe->inode;
+ struct hlist_head *head;
+ unsigned long flags;
+
+ spin_lock_irqsave(&uprobe_lock, flags);
+ uprobe->kp.nmissed = 0;
+
+ old_p = get_kprobe_user(uprobe->inode, uprobe->offset);
+
+ if (old_p)
+ register_aggr_uprobe(old_p, &uprobe->kp);
+ else {
+ head = &uprobe_table[hash_ptr((kprobe_opcode_t *)
+ (offset * inode), UPROBE_HASH_BITS)];
+ INIT_HLIST_NODE(&uprobe->kp.hlist);
+ hlist_add_head(&uprobe->kp.hlist, head);
+ /*
+ * The original instruction must be copied into the instruction
+ * slot, hence return 1.
+ */
+ ret = 1;
+ }
+
+ spin_unlock_irqrestore(&uprobe_lock, flags);
+
+ return ret;
+}
+
+/*
+ * unregister_uprobe: Disarms the probe, removes the uprobe
+ * pointers from the hash list and unhooks readpage routines.
+ */
+void __kprobes unregister_uprobe(struct uprobe *uprobe)
+{
+ struct address_space *mapping;
+ struct uprobe_module *umodule;
+ struct page *page;
+ unsigned long flags;
+ int ret = 0;
+
+ if (!uprobe->inode)
+ return;
+
+ mapping = uprobe->inode->i_mapping;
+
+ page = find_get_page(mapping, uprobe->offset >> PAGE_CACHE_SHIFT);
+
+ spin_lock_irqsave(&uprobe_lock, flags);
+ ret = remove_uprobe(uprobe);
+ spin_unlock_irqrestore(&uprobe_lock, flags);
+
+ mutex_lock(&uprobe_mutex);
+ if (!(umodule = get_module_by_inode(uprobe->inode)))
+ goto out;
+
+ hlist_del(&uprobe->ulist);
+ if (hlist_empty(&umodule->ulist_head)) {
+ list_del(&umodule->mlist);
+ write_access_to_inode(uprobe->inode);
+ path_release(&umodule->nd);
+ kfree(umodule);
+ }
+ /* Unregister pagefault notifier, if no probes. */
+ mutex_lock(&kprobe_mutex);
+ kprobes_remove_pagefault_notifier();
+ mutex_unlock(&kprobe_mutex);
+out:
+ mutex_unlock(&uprobe_mutex);
+ if (ret)
+ ret = map_uprobe_page(page, uprobe, remove_kprobe_user);
+
+ if (ret == -EINVAL)
+ return;
+ /*
+ * TODO: unregister_uprobe should not fail, need to handle
+ * if it fails.
+ */
+ flush_vma(mapping, page, uprobe);
+
+ if (page)
+ page_cache_release(page);
+}
+
+/*
+ * register_uprobe(): the combination of inode and offset is used to
+ * identify each probe uniquely. Each uprobe can be found from the
+ * uprobes_hash table by using inode and offset. register_uprobe()
+ * inserts the breakpoint at the given address by locating and mapping
+ * the page. Returns 0 on success and an error code on failure.
+ */
+int __kprobes register_uprobe(struct uprobe *uprobe)
+{
+ struct address_space *mapping;
+ struct uprobe_module *umodule = NULL;
+ struct inode *inode;
+ struct nameidata nd;
+ struct page *page;
+ int error = 0;
+
+ INIT_HLIST_NODE(&uprobe->ulist);
+
+ /*
+ * TODO: Need to calculate the absolute file offset for dynamic
+ * shared libraries.
+ */
+ if ((error = path_lookup(uprobe->pathname, LOOKUP_FOLLOW, &nd)))
+ return error;
+
+ mutex_lock(&uprobe_mutex);
+
+ inode = nd.dentry->d_inode;
+ error = deny_write_access_to_inode(inode);
+ if (error)
+ goto out;
+
+ error = arch_alloc_insn(&uprobe->kp);
+ if (error) {
+ write_access_to_inode(inode);
+ goto out;
+ }
+
+ /*
+ * Check if there are probes already on this application and
+ * add the corresponding uprobe to per application probe's list.
+ */
+ umodule = get_module_by_inode(inode);
+ if (!umodule) {
+ /*
+ * Allocate a uprobe_module structure for this
+ * application if not allocated before.
+ */
+ umodule = kzalloc(sizeof(struct uprobe_module), GFP_KERNEL);
+ if (!umodule) {
+ error = -ENOMEM;
+ write_access_to_inode(inode);
+ arch_remove_kprobe(&uprobe->kp);
+ goto out;
+ }
+ memcpy(&umodule->nd, &nd, sizeof(struct nameidata));
+ get_inode_ops(uprobe, umodule);
+ } else {
+ path_release(&nd);
+ write_access_to_inode(inode);
+ hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
+ }
+ mutex_unlock(&uprobe_mutex);
+ /*
+ * Register pagefault notifier, if one is not already registered.
+ */
+ mutex_lock(&kprobe_mutex);
+ kprobes_add_pagefault_notifier();
+ mutex_unlock(&kprobe_mutex);
+
+ uprobe->inode = inode;
+ mapping = inode->i_mapping;
+ page = find_get_page(mapping, (uprobe->offset >> PAGE_CACHE_SHIFT));
+
+ if (insert_uprobe(uprobe))
+ error = map_uprobe_page(page, uprobe, insert_kprobe_user);
+ else
+ arch_remove_kprobe(&uprobe->kp);
+
+ /*
+ * If error == -EINVAL, return success, probes will be inserted by
+ * readpage hooks.
+ * TODO: Use a more suitable errno?
+ */
+ if (error == -EINVAL)
+ error = 0;
+ flush_vma(mapping, page, uprobe);
+
+ if (page)
+ page_cache_release(page);
+
+ return error;
+out:
+ path_release(&nd);
+ mutex_unlock(&uprobe_mutex);
+
+ return error;
+}
+
+void init_uprobes(void)
+{
+ int i;
+
+ /* FIXME allocate the probe table, currently defined statically */
+ /* initialize all list heads */
+ for (i = 0; i < UPROBE_TABLE_SIZE; i++)
+ INIT_HLIST_HEAD(&uprobe_table[i]);
+
+ INIT_LIST_HEAD(&uprobe_module_list);
+}
+
+EXPORT_SYMBOL_GPL(register_uprobe);
+EXPORT_SYMBOL_GPL(unregister_uprobe);
+
+
diff -puN /dev/null arch/i386/kernel/uprobes.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/uprobes.c 2006-05-09 12:32:52.000000000 +0530
@@ -0,0 +1,71 @@
+/*
+ * User-space Probes (UProbes)
+ * arch/i386/kernel/uprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006.
+ *
+ * 2006-Mar Created by Prasanna S Panchamukhi <[email protected]>
+ * User-space probes initial implementation based on IBM's
+ * Dprobes.
+ */
+
+#include <linux/config.h>
+#include <linux/kprobes.h>
+#include <linux/ptrace.h>
+#include <linux/preempt.h>
+#include <asm/cacheflush.h>
+#include <asm/kdebug.h>
+#include <asm/desc.h>
+
+int __kprobes arch_alloc_insn(struct kprobe *p)
+{
+ mutex_lock(&kprobe_mutex);
+ p->ainsn.insn = get_insn_slot();
+ mutex_unlock(&kprobe_mutex);
+
+ if (!p->ainsn.insn)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void __kprobes arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address)
+{
+ if (p->opcode != BREAKPOINT_INSTRUCTION)
+ *address = p->opcode;
+}
+
+void __kprobes arch_arm_uprobe(kprobe_opcode_t *address)
+{
+ *address = BREAKPOINT_INSTRUCTION;
+}
+
+int __kprobes arch_copy_uprobe(struct kprobe *p, kprobe_opcode_t *address)
+{
+ int ret = 1;
+
+ /*
+ * TODO: Check that the given address is valid for accessing user memory.
+ */
+ if (*address != BREAKPOINT_INSTRUCTION) {
+ memcpy(p->ainsn.insn, address, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+ ret = 0;
+ }
+ p->opcode = *(kprobe_opcode_t *)address;
+
+ return ret;
+}
diff -puN kernel/kprobes.c~kprobes_userspace_probes-base-interface kernel/kprobes.c
--- linux-2.6.17-rc3-mm1/kernel/kprobes.c~kprobes_userspace_probes-base-interface 2006-05-09 10:08:47.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/kernel/kprobes.c 2006-05-09 10:08:47.000000000 +0530
@@ -47,7 +47,7 @@

static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
-static atomic_t kprobe_count;
+atomic_t kprobe_count;

DEFINE_MUTEX(kprobe_mutex); /* Protects kprobe_table */
DEFINE_SPINLOCK(kretprobe_lock); /* Protects kretprobe_inst_table */
@@ -58,6 +58,20 @@ static struct notifier_block kprobe_page
.priority = 0x7fffffff /* we need to notified first */
};

+void kprobes_add_pagefault_notifier(void)
+{
+ if (atomic_add_return(1, &kprobe_count) == \
+ (ARCH_INACTIVE_KPROBE_COUNT + 1))
+ register_page_fault_notifier(&kprobe_page_fault_nb);
+}
+
+void kprobes_remove_pagefault_notifier(void)
+{
+ if (atomic_add_return(-1, &kprobe_count) == \
+ ARCH_INACTIVE_KPROBE_COUNT)
+ unregister_page_fault_notifier(&kprobe_page_fault_nb);
+}
+
#ifdef __ARCH_WANT_KPROBES_INSN_SLOT
/*
* kprobe->ainsn.insn points to the copy of the instruction to be
@@ -362,7 +376,7 @@ static inline void free_rp_inst(struct k
/*
* Keep all fields in the kprobe consistent
*/
-static inline void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
+void copy_kprobe(struct kprobe *old_p, struct kprobe *p)
{
memcpy(&p->opcode, &old_p->opcode, sizeof(kprobe_opcode_t));
memcpy(&p->ainsn, &old_p->ainsn, sizeof(struct arch_specific_insn));
@@ -693,6 +707,7 @@ static int __init init_kprobes(void)
}
atomic_set(&kprobe_count, 0);

+ init_uprobes();
err = arch_init_kprobes();
if (!err)
err = register_die_notifier(&kprobe_exceptions_nb);
diff -puN include/linux/list.h~kprobes_userspace_probes-base-interface include/linux/list.h
--- linux-2.6.17-rc3-mm1/include/linux/list.h~kprobes_userspace_probes-base-interface 2006-05-09 10:08:47.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/linux/list.h 2006-05-09 10:08:47.000000000 +0530
@@ -655,6 +655,22 @@ static inline void hlist_replace_rcu(str
old->pprev = LIST_POISON2;
}

+/*
+ * The old entry will be replaced with the new entry atomically.
+ */
+static inline void hlist_replace(struct hlist_node *old,
+ struct hlist_node *new)
+{
+ struct hlist_node *next = old->next;
+
+ new->next = next;
+ new->pprev = old->pprev;
+ if (next)
+ new->next->pprev = &new->next;
+ *new->pprev = new;
+ old->pprev = LIST_POISON2;
+}
+
static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
struct hlist_node *first = h->first;
diff -puN kernel/Makefile~kprobes_userspace_probes-base-interface kernel/Makefile
--- linux-2.6.17-rc3-mm1/kernel/Makefile~kprobes_userspace_probes-base-interface 2006-05-09 10:08:47.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/kernel/Makefile 2006-05-09 10:08:47.000000000 +0530
@@ -35,7 +35,11 @@ obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
obj-$(CONFIG_AUDITSYSCALL) += auditsc.o
+ifeq ($(CONFIG_X86_32),y)
+obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
+else
obj-$(CONFIG_KPROBES) += kprobes.o
+endif
obj-$(CONFIG_KGDB) += kgdb.o
obj-$(CONFIG_SYSFS) += ksysfs.o
obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 07:10:52

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 4/6] Kprobes: Insert probes on non-memory resident pages

User-space probes also support registering probe points before the
probed code is loaded, which clearly has advantages for catching
initialization problems. This involves modifying the probed
application's address_space readpage() and readpages() operations.
The overhead of changing the address_space readpage()/readpages() is
limited to the probed application, and lasts only until all probes are
removed from that application.

This patch makes it possible to insert probes on pages that are not
resident in memory at registration time.

To add readpage and readpages() hooks, two new elements are added to
the uprobe_module object:
struct address_space_operations *ori_a_ops;
struct address_space_operations user_a_ops;

When pages are read into memory through the readpage() and
readpages() address space operations, any associated probes are
automatically inserted into those pages. The user-space probes
readpage() and readpages() routines internally call the original
readpage() and readpages() routines, then check whether any probes
belong on these pages and insert them as necessary.

During unregistration, the readpage() and readpages() hooks must be
replaced with the original routines once no probes remain on that
application.
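
The hook-and-restore scheme described above can be sketched in plain
user-space C. This is only an illustrative model, not the patch's code:
struct page_sim, struct aops_sim, struct mapping_sim, struct umodule_sim
and all helper names are hypothetical stand-ins for the kernel's page,
address_space_operations, address_space and uprobe_module objects, and a
single global module replaces the lookup by inode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct page_sim {
	unsigned long index;	/* page index within the file */
	int probed;		/* 1 once a breakpoint has been planted */
};

struct aops_sim {
	int (*readpage)(struct page_sim *page);
};

struct mapping_sim {
	const struct aops_sim *a_ops;
};

struct umodule_sim {
	const struct aops_sim *ori_a_ops;	/* saved original table */
	struct aops_sim user_a_ops;		/* private copy with hooks */
	unsigned long probe_page_index;		/* page holding the one probe */
};

/* One probed module only, to keep the sketch short. */
static struct umodule_sim *active_module;

/* Models the filesystem's own readpage(): fresh data, no breakpoint. */
static int fs_readpage(struct page_sim *page)
{
	page->probed = 0;
	return 0;
}

/* Hook: call the original first, then plant the probe if it lives here. */
static int hooked_readpage(struct page_sim *page)
{
	int ret = active_module->ori_a_ops->readpage(page);

	if (ret < 0)
		return ret;
	if (page->index == active_module->probe_page_index)
		page->probed = 1;
	return 0;
}

/* Registration side: save the table, copy it, override one entry. */
static void hook_mapping(struct mapping_sim *m, struct umodule_sim *um)
{
	um->ori_a_ops = m->a_ops;
	um->user_a_ops = *m->a_ops;
	um->user_a_ops.readpage = hooked_readpage;
	m->a_ops = &um->user_a_ops;
	active_module = um;
}

/* Unregistration side: put the original table back. */
static void unhook_mapping(struct mapping_sim *m, struct umodule_sim *um)
{
	m->a_ops = um->ori_a_ops;
	active_module = NULL;
}
```

Copying the whole operations table and overriding only the read
entries leaves every other operation of the mapping untouched, which is
why the overhead stays confined to the probed file.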

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


kernel/uprobes.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 121 insertions(+)

diff -puN kernel/uprobes.c~kprobes_userspace_probes-hook-readpage kernel/uprobes.c
--- linux-2.6.17-rc3-mm1/kernel/uprobes.c~kprobes_userspace_probes-hook-readpage 2006-05-09 10:08:49.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/kernel/uprobes.c 2006-05-09 10:08:49.000000000 +0530
@@ -300,15 +300,134 @@ struct uprobe_module __kprobes *get_modu
return NULL;
}

+static inline void insert_readpage_uprobe(struct page *page,
+ struct address_space *mapping, struct uprobe *uprobe)
+{
+ unsigned long page_start = page->index << PAGE_CACHE_SHIFT;
+ unsigned long page_end = page_start + PAGE_SIZE;
+
+ if ((uprobe->offset >= page_start) && (uprobe->offset < page_end)) {
+ map_uprobe_page(page, uprobe, insert_kprobe_user);
+ flush_vma(mapping, page, uprobe);
+ }
+}
+
+/*
+ * This function hooks the readpages() of all modules that have active
+ * probes on them. The original readpages() is called for the given
+ * inode/address_space to actually read the pages into the memory.
+ * Then all probes that are specified on these pages are inserted.
+ */
+static int __kprobes uprobe_readpages(struct file *file,
+ struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ int retval = 0;
+ struct page *page;
+ struct uprobe_module *umodule;
+ struct uprobe *uprobe = NULL;
+ struct hlist_node *node;
+
+ mutex_lock(&uprobe_mutex);
+
+ umodule = get_module_by_inode(file->f_dentry->d_inode);
+ if (!umodule) {
+ /*
+ * No module associated with this file, call the
+ * original readpages().
+ */
+ retval = mapping->a_ops->readpages(file, mapping,
+ pages, nr_pages);
+ goto out;
+ }
+
+ /* call original readpages() */
+ retval = umodule->ori_a_ops->readpages(file, mapping, pages, nr_pages);
+ if (retval < 0)
+ goto out;
+
+ /*
+ * TODO: Walk through readpages page list and get
+ * pages with probes instead of find_get_page().
+ */
+ hlist_for_each_entry(uprobe, node, &umodule->ulist_head, ulist) {
+ page = find_get_page(mapping,
+ uprobe->offset >> PAGE_CACHE_SHIFT);
+ if (!page)
+ continue;
+
+ if (!uprobe->kp.opcode)
+ insert_readpage_uprobe(page, mapping, uprobe);
+ page_cache_release(page);
+ }
+
+out:
+ mutex_unlock(&uprobe_mutex);
+
+ return retval;
+}
+
+/*
+ * This function hooks the readpage() of all modules that have active
+ * probes on them. The original readpage() is called for the given
+ * inode/address_space to actually read the pages into the memory.
+ * Then all probes that are specified on this page are inserted.
+ */
+int __kprobes uprobe_readpage(struct file *file, struct page *page)
+{
+ int retval = 0;
+ struct uprobe_module *umodule;
+ struct uprobe *uprobe = NULL;
+ struct hlist_node *node;
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+
+ mutex_lock(&uprobe_mutex);
+
+ umodule = get_module_by_inode(file->f_dentry->d_inode);
+ if (!umodule) {
+ /*
+ * No module associated with this file, call the
+ * original readpage().
+ */
+ retval = mapping->a_ops->readpage(file, page);
+ goto out;
+ }
+
+ /* call original readpage() */
+ retval = umodule->ori_a_ops->readpage(file, page);
+ if (retval < 0)
+ goto out;
+
+ hlist_for_each_entry(uprobe, node, &umodule->ulist_head, ulist) {
+ if (!uprobe->kp.opcode)
+ insert_readpage_uprobe(page, mapping, uprobe);
+ }
+
+out:
+ mutex_unlock(&uprobe_mutex);
+
+ return retval;
+}
+
/*
* Add uprobe and uprobe_module to the appropriate hash list.
+ * Also switches a_ops to hook readpage() and readpages().
*/
static void __kprobes get_inode_ops(struct uprobe *uprobe,
struct uprobe_module *umodule)
{
+ struct address_space *as;
+
INIT_HLIST_HEAD(&umodule->ulist_head);
hlist_add_head(&uprobe->ulist, &umodule->ulist_head);
list_add(&umodule->mlist, &uprobe_module_list);
+ as = umodule->nd.dentry->d_inode->i_mapping;
+ umodule->ori_a_ops = as->a_ops;
+ umodule->user_a_ops = *as->a_ops;
+ umodule->user_a_ops.readpage = uprobe_readpage;
+ umodule->user_a_ops.readpages = uprobe_readpages;
+ as->a_ops = &umodule->user_a_ops;
+
}

/*
@@ -437,6 +556,8 @@ void __kprobes unregister_uprobe(struct
hlist_del(&uprobe->ulist);
if (hlist_empty(&umodule->ulist_head)) {
list_del(&umodule->mlist);
+ umodule->nd.dentry->d_inode->i_mapping->a_ops =
+ umodule->ori_a_ops;
write_access_to_inode(uprobe->inode);
path_release(&umodule->nd);
kfree(umodule);

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 07:13:45

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 5/6] Kprobes: Single step the original instruction out-of-line


This patch provides a mechanism for handling probe hits and
executing the user-specified handlers.

Each user-space probe is uniquely identified by the combination of
inode and offset, so during registration the inode and offset
combination is added to the uprobes hash table. When a breakpoint
instruction is hit, the uprobes hash table is searched for a matching
inode and offset. If multiple probes are registered at that location,
their pre_handlers are called in sequence. Like kprobes, uprobes
single-steps the original instruction out-of-line, so that probe
misses in an SMP environment are avoided. For user-space probes,
however, an instruction copied into the kernel address space cannot be
single-stepped, so the instruction must be copied into the user
address space. The solution is to find free space in the current
process's address space, copy the original instruction there and
single-step that copy.
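
As a rough illustration of the inode-plus-offset lookup, here is a
minimal user-space sketch. It is not the patch's implementation:
struct probe_sim, bucket_of(), probe_insert() and probe_lookup() are
hypothetical names, the table size and hash constant are arbitrary, and
a simple chained list stands in for the kernel's hlist. The key is
formed by multiplying inode and offset, as insert_uprobe() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 6
#define TABLE_SIZE (1u << TABLE_BITS)

/* Hypothetical probe record; the real struct uprobe holds much more. */
struct probe_sim {
	uintptr_t inode;	/* stands in for the struct inode pointer */
	unsigned long offset;	/* file offset of the probed instruction */
	struct probe_sim *next;	/* bucket chain */
};

static struct probe_sim *table[TABLE_SIZE];

/* Multiplicative hash of (inode * offset); the constant is illustrative. */
static unsigned int bucket_of(uintptr_t inode, unsigned long offset)
{
	uint32_t key = (uint32_t)(inode * offset);

	return (key * 2654435761u) >> (32 - TABLE_BITS);
}

/* Add a probe at the head of its bucket. */
static void probe_insert(struct probe_sim *p)
{
	unsigned int b = bucket_of(p->inode, p->offset);

	p->next = table[b];
	table[b] = p;
}

/* Walk the bucket and compare both fields, since keys can collide. */
static struct probe_sim *probe_lookup(uintptr_t inode, unsigned long offset)
{
	struct probe_sim *p;

	for (p = table[bucket_of(inode, offset)]; p; p = p->next)
		if (p->inode == inode && p->offset == offset)
			return p;
	return NULL;
}
```

Because the key is a product, different pairs such as (0x1000, 0x80)
and (0x2000, 0x40) produce the same key, so the lookup has to compare
inode and offset explicitly rather than trust the bucket.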

User processes use stack space to store local variables, arguments and
return values. Normally the stack space beyond the stack pointer,
either below or above it, is free.

The instruction to be single-stepped can itself modify the stack,
so sufficient stack space must be left untouched before the free stack
space is used. The instruction is copied to the bottom of the page,
and a check is made that the copied instruction does not cross the
page boundary. The copied instruction is then single-stepped. Several
architectures do not allow instructions to be executed from the stack,
since the no-exec bit is set for stack pages. On those architectures,
the page table entry corresponding to the stack page is identified and
the no-exec bit is cleared, allowing the instruction on that stack
page to be executed.
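
The current-page check can be modelled with a few lines of address
arithmetic. The sketch below follows the comparison used by
find_stack_space_on_curr_page() in this patch, but it is a user-space
approximation: SIM_PAGE_SIZE, STACK_MARGIN and slot_on_current_page()
are hypothetical names, a 4096-byte page is assumed, and 0 is returned
where the patch returns NULL.

```c
#include <assert.h>

#define SIM_PAGE_SIZE	4096UL
#define SIM_PAGE_MASK	(~(SIM_PAGE_SIZE - 1))
#define STACK_MARGIN	sizeof(long long)	/* space kept below sp */

/*
 * Return the lowest address of the page containing sp if `size` bytes
 * placed there stay clear of sp minus the margin, or 0 if they do not.
 * Assumes a downward-growing stack and sp >= STACK_MARGIN.
 */
static unsigned long slot_on_current_page(unsigned long sp,
					  unsigned long size)
{
	unsigned long page_start = sp & SIM_PAGE_MASK;

	if (sp - STACK_MARGIN < page_start + size)
		return 0;

	return page_start;
}
```

For example, with sp at 0xbffff800 and a 16-byte slot the function
yields 0xbffff000, while with sp at 0xbffff010 the slot would run into
the margin below sp and the function reports failure.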

There are situations where even the free stack space is not enough for
the user instruction to be copied and single-stepped. In such
situations, the virtual memory area (vma) can be expanded beyond the
current stack vma. This expanded stack can be used to copy the
original instruction and single-step it out-of-line.

If even the vma cannot be extended, the instruction must be
executed inline, by replacing the breakpoint instruction with the
original instruction.
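
The order of fallbacks described above (current stack page, next stack
page, expanded vma, and finally inline execution) can be summarised as
a simple selection loop. This is a sketch under assumptions, not the
patch's control flow: enum ss_mode, struct ss_plan, slot_finder,
plan_single_step() and the two stub finders are hypothetical names
introduced only for illustration.

```c
#include <assert.h>

enum ss_mode { SS_OUT_OF_LINE, SS_INLINE };

struct ss_plan {
	enum ss_mode mode;
	unsigned long slot;	/* where the copy goes; 0 when inline */
};

/* A finder returns a usable address, or 0 if it has no room. */
typedef unsigned long (*slot_finder)(unsigned long sp, unsigned long size);

/* Stub finder that never has room (illustration only). */
static unsigned long never_finds(unsigned long sp, unsigned long size)
{
	(void)sp;
	(void)size;
	return 0;
}

/* Stub finder that always offers a fixed address (illustration only). */
static unsigned long finds_fixed(unsigned long sp, unsigned long size)
{
	(void)sp;
	(void)size;
	return 0x2000;
}

/*
 * Try each finder in order; the first one that yields a slot selects
 * out-of-line stepping. If none does, fall back to inline stepping.
 */
static struct ss_plan plan_single_step(const slot_finder *finders, int n,
				       unsigned long sp, unsigned long size)
{
	struct ss_plan plan = { SS_INLINE, 0 };
	int i;

	for (i = 0; i < n; i++) {
		unsigned long slot = finders[i](sp, size);

		if (slot) {
			plan.mode = SS_OUT_OF_LINE;
			plan.slot = slot;
			break;
		}
	}
	return plan;
}
```

Inline stepping is kept as the last resort because, while the
breakpoint is temporarily replaced by the original instruction, another
CPU running the same code can miss the probe.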

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


arch/i386/kernel/Makefile | 2
arch/i386/kernel/kprobes.c | 4
arch/i386/kernel/uprobes.c | 472 +++++++++++++++++++++++++++++++++++++++++++++
arch/i386/mm/fault.c | 3
include/asm-i386/kprobes.h | 21 ++
5 files changed, 497 insertions(+), 5 deletions(-)

diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line include/asm-i386/kprobes.h
--- linux-2.6.17-rc3-mm1/include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/asm-i386/kprobes.h 2006-05-09 12:40:48.000000000 +0530
@@ -26,6 +26,7 @@
*/
#include <linux/types.h>
#include <linux/ptrace.h>
+#include <asm/cacheflush.h>

#define __ARCH_WANT_KPROBES_INSN_SLOT

@@ -78,6 +79,19 @@ struct kprobe_ctlblk {
struct prev_kprobe prev_kprobe;
};

+/* per user probe control block */
+struct uprobe_ctlblk {
+ unsigned long uprobe_status;
+ unsigned long uprobe_saved_eflags;
+ unsigned long uprobe_old_eflags;
+ unsigned long singlestep_addr;
+ unsigned long flags;
+ struct kprobe *curr_p;
+ pte_t *upte;
+ struct page *upage;
+ struct task_struct *tsk;
+};
+
/* trap3/1 are intr gates for kprobes. So, restore the status of IF,
* if necessary, before executing the original int3/1 (trap) handler.
*/
@@ -89,4 +103,11 @@ static inline void restore_interrupts(st

extern int kprobe_exceptions_notify(struct notifier_block *self,
unsigned long val, void *data);
+extern int uprobe_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data);
+extern unsigned long get_segment_eip(struct pt_regs *regs,
+ unsigned long *eip_limit);
+extern int is_IF_modifier(kprobe_opcode_t opcode);
+
+extern pte_t *get_one_pte(unsigned long address);
#endif /* _ASM_KPROBES_H */
diff -puN arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/uprobes.c
--- linux-2.6.17-rc3-mm1/arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/uprobes.c 2006-05-09 12:40:48.000000000 +0530
@@ -30,6 +30,10 @@
#include <asm/cacheflush.h>
#include <asm/kdebug.h>
#include <asm/desc.h>
+#include <asm/uaccess.h>
+
+static struct uprobe_ctlblk uprobe_ctlblk;
+struct uprobe *current_uprobe;

int __kprobes arch_alloc_insn(struct kprobe *p)
{
@@ -69,3 +73,471 @@ int __kprobes arch_copy_uprobe(struct kp

return ret;
}
+
+/*
+ * This routine checks for space in the process's stack address space.
+ * If enough address space is found, returns the address of free stack
+ * space.
+ */
+unsigned long __kprobes *find_stack_space_on_next_page(unsigned long stack_addr,
+ int size, struct vm_area_struct *vma)
+{
+ unsigned long addr;
+ struct page *pg;
+ int retval = 0;
+
+ if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
+ return NULL;
+ addr = (stack_addr & PAGE_MASK) + PAGE_SIZE;
+
+ retval = get_user_pages(current, current->mm,
+ (unsigned long )addr, 1, 1, 0, &pg, NULL);
+ if (retval)
+ return NULL;
+
+ return (unsigned long *) addr;
+}
+
+/*
+ * This routine expands the stack beyond the present process address
+ * space and returns the address of free stack space. This routine
+ * must be called with mmap_sem held.
+ */
+unsigned long __kprobes *find_stack_space_in_expanded_vma(int size,
+ struct vm_area_struct *vma)
+{
+ unsigned long addr, vm_addr;
+ int retval = 0;
+ struct vm_area_struct *new_vma;
+ struct mm_struct *mm = current->mm;
+ struct page *pg;
+
+ vm_addr = vma->vm_start - size;
+ new_vma = find_extend_vma(mm, vm_addr);
+ if (!new_vma)
+ return NULL;
+
+ addr = new_vma->vm_start;
+ retval = get_user_pages(current, current->mm,
+ (unsigned long )addr, 1, 1, 0, &pg, NULL);
+ if (retval)
+ return NULL;
+
+ return (unsigned long *) addr;
+}
+
+/*
+ * This routine checks for stack free space below the stack pointer in the
+ * current stack page. If there is not enough stack space, it returns NULL.
+ */
+unsigned long __kprobes *find_stack_space_on_curr_page(unsigned long stack_addr,
+ int size)
+{
+ unsigned long page_addr;
+
+ page_addr = stack_addr & PAGE_MASK;
+
+ if (((stack_addr - sizeof(long long))) < (page_addr + size))
+ return NULL;
+
+ return (unsigned long *) page_addr;
+}
+
+/*
+ * This routine finds free stack space of the given size either on the
+ * current stack page or on the next stack page. If no free stack
+ * space is available, it expands the stack and returns the address of
+ * the free stack space.
+ */
+unsigned long __kprobes *find_stack_space(unsigned long stack_addr, int size)
+{
+ unsigned long *addr;
+ struct vm_area_struct *vma = NULL;
+
+ addr = find_stack_space_on_curr_page(stack_addr, size);
+ if (addr)
+ return addr;
+
+ if (!down_read_trylock(&current->mm->mmap_sem))
+ down_read(&current->mm->mmap_sem);
+
+ vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
+ if (!vma) {
+ up_read(&current->mm->mmap_sem);
+ return NULL;
+ }
+
+ addr = find_stack_space_on_next_page(stack_addr, size, vma);
+ if (addr) {
+ up_read(&current->mm->mmap_sem);
+ return addr;
+ }
+
+ addr = find_stack_space_in_expanded_vma(size, vma);
+ up_read(&current->mm->mmap_sem);
+
+ if (!addr)
+ return NULL;
+
+ return addr;
+}
+
+/*
+ * This routine gets the page containing the probe, maps it and
+ * replaces the instruction at the probed address with the specified
+ * opcode.
+ */
+void __kprobes replace_original_insn(struct uprobe *uprobe,
+ struct pt_regs *regs, kprobe_opcode_t opcode)
+{
+ kprobe_opcode_t *addr;
+ struct page *page;
+
+ page = find_get_page(uprobe->inode->i_mapping,
+ uprobe->offset >> PAGE_CACHE_SHIFT);
+ BUG_ON(!page);
+
+ addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
+ addr = (kprobe_opcode_t *)((unsigned long)addr +
+ (unsigned long)(uprobe->offset & ~PAGE_MASK));
+ *addr = opcode;
+ /*TODO: flush vma ? */
+ kunmap_atomic(addr, KM_USER1);
+
+ flush_dcache_page(page);
+
+ if (page)
+ page_cache_release(page);
+ regs->eip = (unsigned long)uprobe->kp.addr;
+}
+
+/*
+ * This routine provides the functionality of single stepping
+ * out-of-line. If single stepping out-of-line cannot be achieved,
+ * it puts back the original instruction so that it can be single
+ * stepped inline.
+ */
+static int __kprobes prepare_singlestep_uprobe(struct uprobe *uprobe,
+ struct uprobe_ctlblk *ucb, struct pt_regs *regs)
+{
+ unsigned long *addr = NULL, stack_addr = regs->esp;
+ int size = sizeof(kprobe_opcode_t) * MAX_INSN_SIZE;
+ unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
+
+ /*
+ * Get free stack space to copy original instruction, so as to
+ * single step out-of-line.
+ */
+ addr = find_stack_space(stack_addr, size);
+ if (!addr)
+ goto no_stack_space;
+
+ /*
+ * We are in_atomic and preemption is disabled at this point of
+ * time. Copy original instruction on this per process stack
+ * page so as to single step out-of-line.
+ */
+ if (__copy_to_user_inatomic((unsigned long *)addr, source, size))
+ goto no_stack_space;
+
+ regs->eip = (unsigned long)addr;
+
+ regs->eflags |= TF_MASK;
+ regs->eflags &= ~IF_MASK;
+ ucb->uprobe_status = UPROBE_HIT_SS;
+
+ ucb->upte = get_one_pte(regs->eip);
+ if (!ucb->upte)
+ goto no_stack_space;
+ ucb->upage = pte_page(*ucb->upte);
+ set_pte(ucb->upte, pte_mkdirty(*ucb->upte));
+ ucb->singlestep_addr = regs->eip;
+
+ return 0;
+
+no_stack_space:
+ replace_original_insn(uprobe, regs, uprobe->kp.opcode);
+ ucb->uprobe_status = UPROBE_SS_INLINE;
+ ucb->singlestep_addr = regs->eip;
+
+ return 0;
+}
+
+/*
+ * uprobe_handler() executes the user-specified handler and sets up
+ * single-stepping of the original instruction, either out-of-line or inline.
+ */
+static int __kprobes uprobe_handler(struct pt_regs *regs)
+{
+ struct kprobe *p;
+ int ret = 0;
+ kprobe_opcode_t *addr = NULL;
+ struct uprobe_ctlblk *ucb = &uprobe_ctlblk;
+ unsigned long limit;
+
+ spin_lock_irqsave(&uprobe_lock, ucb->flags);
+ /* preemption is disabled, remains disabled
+ * until we single step on original instruction.
+ */
+ inc_preempt_count();
+
+ addr = (kprobe_opcode_t *)(get_segment_eip(regs, &limit) - 1);
+
+ p = get_uprobe(addr);
+ if (!p) {
+
+ if (*addr != BREAKPOINT_INSTRUCTION) {
+ /*
+ * The breakpoint instruction was removed right
+ * after we hit it. Another cpu has removed
+ * either a probe point or a debugger breakpoint
+ * at this address. In either case, no further
+ * handling of this interrupt is appropriate.
+ * Back up over the (now missing) int3 and run
+ * the original instruction.
+ */
+ regs->eip -= sizeof(kprobe_opcode_t);
+ ret = 1;
+ }
+ /* Not one of ours: let kernel handle it */
+ goto no_uprobe;
+ }
+
+ if (p->opcode == BREAKPOINT_INSTRUCTION) {
+ /*
+ * Breakpoint was already present even before the probe
+ * was inserted, this might break some compatibility with
+ * other debuggers like gdb etc. We don't handle such probes.
+ */
+ current_uprobe = NULL;
+ goto no_uprobe;
+ }
+
+ ucb->curr_p = p;
+ ucb->tsk = current;
+ ucb->uprobe_status = UPROBE_HIT_ACTIVE;
+ ucb->uprobe_saved_eflags = (regs->eflags & (TF_MASK | IF_MASK));
+ ucb->uprobe_old_eflags = (regs->eflags & (TF_MASK | IF_MASK));
+
+ if (p->pre_handler && p->pre_handler(p, regs))
+ /* handler has already set things up, so skip ss setup */
+ return 1;
+
+ prepare_singlestep_uprobe(current_uprobe, ucb, regs);
+ /*
+ * Avoid scheduling the current task while returning from
+ * kernel to user mode.
+ */
+ clear_need_resched();
+ return 1;
+
+no_uprobe:
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ dec_preempt_count();
+
+ return ret;
+}
+
+/*
+ * Called after single-stepping. p->addr is the address of the
+ * instruction whose first byte has been replaced by the "int 3"
+ * instruction. To avoid the SMP problems that can occur when we
+ * temporarily put back the original opcode to single-step, we
+ * single-stepped a copy of the instruction. The address of this
+ * copy is p->ainsn.insn.
+ *
+ * This function prepares to return from the post-single-step
+ * interrupt. We have to fix up the stack as follows:
+ *
+ * 0) Typically, the new eip is relative to the copied instruction. We
+ * need to make it relative to the original instruction. Exceptions are
+ * return instructions and absolute or indirect jump or call instructions.
+ *
+ * 1) If the single-stepped instruction was pushfl, then the TF and IF
+ * flags are set in the just-pushed eflags, and may need to be cleared.
+ *
+ * 2) If the single-stepped instruction was a call, the return address
+ * that is atop the stack is the address following the copied instruction.
+ * We need to make it the address following the original instruction.
+ */
+static void __kprobes resume_execution_user(struct kprobe *p,
+ struct pt_regs *regs, struct uprobe_ctlblk *ucb)
+{
+ unsigned long *tos = (unsigned long *)regs->esp;
+ unsigned long next_eip = 0;
+ unsigned long copy_eip = ucb->singlestep_addr;
+ unsigned long orig_eip = (unsigned long)p->addr;
+
+ switch (p->ainsn.insn[0]) {
+ case 0x9c: /* pushfl */
+ *tos &= ~(TF_MASK | IF_MASK);
+ *tos |= ucb->uprobe_old_eflags;
+ break;
+ case 0xc3: /* ret/lret */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ next_eip = regs->eip;
+ /* eip is already adjusted, no more changes required*/
+ break;
+ case 0xe8: /* call relative - Fix return addr */
+ *tos = orig_eip + (*tos - copy_eip);
+ break;
+ case 0xff:
+ if ((p->ainsn.insn[1] & 0x30) == 0x10) {
+ /* call absolute, indirect */
+ /* Fix return addr; eip is correct. */
+ next_eip = regs->eip;
+ *tos = orig_eip + (*tos - copy_eip);
+ } else if (((p->ainsn.insn[1] & 0x31) == 0x20) ||
+ ((p->ainsn.insn[1] & 0x31) == 0x21)) {
+ /* jmp near or jmp far absolute indirect */
+ /* eip is correct. */
+ next_eip = regs->eip;
+ }
+ break;
+ case 0xea: /* jmp absolute -- eip is correct */
+ next_eip = regs->eip;
+ break;
+ default:
+ break;
+ }
+
+ regs->eflags &= ~TF_MASK;
+ if (next_eip)
+ regs->eip = next_eip;
+ else
+ regs->eip = orig_eip + (regs->eip - copy_eip);
+}
+
+/*
+ * post_uprobe_handler() executes the user-specified handlers and
+ * resumes normal execution.
+ */
+static int __kprobes post_uprobe_handler(struct pt_regs *regs)
+{
+ struct kprobe *cur;
+ struct uprobe_ctlblk *ucb;
+
+ if (!current_uprobe)
+ return 0;
+
+ ucb = &uprobe_ctlblk;
+ cur = ucb->curr_p;
+
+ if (!cur || ucb->tsk != current)
+ return 0;
+
+ if (cur->post_handler) {
+ if (ucb->uprobe_status == UPROBE_SS_INLINE)
+ ucb->uprobe_status = UPROBE_SSDONE_INLINE;
+ else
+ ucb->uprobe_status = UPROBE_HIT_SSDONE;
+ cur->post_handler(cur, regs, 0);
+ }
+
+ resume_execution_user(cur, regs, ucb);
+ regs->eflags |= ucb->uprobe_saved_eflags;
+
+ if (ucb->uprobe_status == UPROBE_SSDONE_INLINE)
+ replace_original_insn(current_uprobe, regs,
+ BREAKPOINT_INSTRUCTION);
+ else
+ pte_unmap(ucb->upte);
+
+ current_uprobe = NULL;
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ dec_preempt_count();
+ /*
+ * if somebody else is single stepping across a probe point, eflags
+ * will have TF set, in which case, continue the remaining processing
+ * of do_debug, as if this is not a probe hit.
+ */
+ if (regs->eflags & TF_MASK)
+ return 0;
+
+ return 1;
+}
+
+static int __kprobes uprobe_fault_handler(struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur;
+ struct uprobe_ctlblk *ucb;
+ int ret = 0;
+
+ ucb = &uprobe_ctlblk;
+ cur = ucb->curr_p;
+
+ if (ucb->tsk != current || !cur)
+ return 0;
+
+ switch(ucb->uprobe_status) {
+ case UPROBE_HIT_SS:
+ pte_unmap(ucb->upte);
+ /* TODO: Allow an acceptable number of faults before disabling */
+ replace_original_insn(current_uprobe, regs, cur->opcode);
+ /* Fall through and reset the current probe */
+ case UPROBE_SS_INLINE:
+ regs->eip = (unsigned long)cur->addr;
+ regs->eflags |= ucb->uprobe_old_eflags;
+ regs->eflags &= ~TF_MASK;
+ current_uprobe = NULL;
+ ret = 1;
+ spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
+ preempt_enable_no_resched();
+ break;
+ case UPROBE_HIT_ACTIVE:
+ case UPROBE_SSDONE_INLINE:
+ case UPROBE_HIT_SSDONE:
+ if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
+ return 1;
+
+ if (fixup_exception(regs))
+ return 1;
+ /*
+ * We must not allow the system page handler to continue while
+ * holding a lock, since page fault handler can sleep and
+ * reschedule it on different cpu. Hence return 1.
+ */
+ return 1;
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int __kprobes uprobe_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = (struct die_args *)data;
+ int ret = NOTIFY_DONE;
+
+ if (args->regs->eflags & VM_MASK) {
+ /* We are in virtual-8086 mode. Return NOTIFY_DONE */
+ return ret;
+ }
+
+ switch (val) {
+ case DIE_INT3:
+ if (uprobe_handler(args->regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_DEBUG:
+ if (post_uprobe_handler(args->regs))
+ ret = NOTIFY_STOP;
+ break;
+ case DIE_GPF:
+ case DIE_PAGE_FAULT:
+ if (current_uprobe &&
+ uprobe_fault_handler(args->regs, args->trapnr))
+ ret = NOTIFY_STOP;
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/kprobes.c
--- linux-2.6.17-rc3-mm1/arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/kprobes.c 2006-05-09 12:40:48.000000000 +0530
@@ -139,7 +139,7 @@ retry:
/*
* returns non-zero if opcode modifies the interrupt flag.
*/
-static int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
+int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
{
switch (opcode) {
case 0xfa: /* cli */
@@ -649,7 +649,7 @@ int __kprobes kprobe_exceptions_notify(s
int ret = NOTIFY_DONE;

if (args->regs && user_mode(args->regs))
- return ret;
+ return uprobe_exceptions_notify(self, val, data);

switch (val) {
case DIE_INT3:
diff -puN arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line arch/i386/mm/fault.c
--- linux-2.6.17-rc3-mm1/arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/mm/fault.c 2006-05-09 12:40:48.000000000 +0530
@@ -104,8 +104,7 @@ void bust_spinlocks(int yes)
*
* This is slow, but is very rarely executed.
*/
-static inline unsigned long get_segment_eip(struct pt_regs *regs,
- unsigned long *eip_limit)
+unsigned long get_segment_eip(struct pt_regs *regs, unsigned long *eip_limit)
{
unsigned long eip = regs->eip;
unsigned seg = regs->xcs & 0xffff;
diff -puN arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/Makefile
--- linux-2.6.17-rc3-mm1/arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/Makefile 2006-05-09 12:40:48.000000000 +0530
@@ -27,7 +27,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o
-obj-$(CONFIG_KPROBES) += kprobes.o
+obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
obj-$(CONFIG_MODULES) += module.o
obj-y += sysenter.o vsyscall.o
obj-$(CONFIG_ACPI_SRAT) += srat.o

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 07:15:23

by S. P. Prasanna

Subject: Re: [RFC] [PATCH 6/6] Kprobes: Remove breakpoints from the copied pages

This patch removes the breakpoints if the pages read from the page
cache contain breakpoints. If pages containing breakpoints are
copied from the page cache, the copied image would also contain
those breakpoints. This could be a major problem for tools like
tripwire and raise security concerns, hence it must be prevented.
This patch hooks the actor routine, checks whether the file being
read is a probed executable using the file inode, and then replaces
the breakpoints with the original opcodes in the copied image.
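
The check described above can be modelled in user space. The following
is only a sketch under stated assumptions, not the kernel code:
struct probe and restore_opcodes() are hypothetical stand-ins for
struct uprobe and remove_uprobe_breakpoints(), buf stands in for
desc->arg.buf, opcodes are one byte wide (as on i386), and the copied
range is treated as the half-open interval [offset, offset + size)
inside a single page.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096UL	/* assumed page size for this sketch */

/* Hypothetical stand-in for the per-probe data the kernel keeps. */
struct probe {
	unsigned long file_offset;	/* offset of the probe in the file */
	unsigned char opcode;		/* original byte at that offset */
};

/*
 * buf holds `size` bytes copied from page `page_index` of the file,
 * starting at `offset` within that page. Every probe that falls inside
 * the copied range gets its original opcode written back into buf.
 * Returns the number of bytes restored.
 */
static size_t restore_opcodes(unsigned char *buf, unsigned long page_index,
			      unsigned long offset, unsigned long size,
			      const struct probe *probes, size_t nprobes)
{
	size_t i, restored = 0;

	assert(buf != NULL);
	assert(probes != NULL || nprobes == 0);

	for (i = 0; i < nprobes; i++) {
		unsigned long pidx = probes[i].file_offset / PAGE_SZ;
		unsigned long poff = probes[i].file_offset % PAGE_SZ;

		if (pidx != page_index)
			continue;	/* probe lives on another page */
		if (poff < offset || poff - offset >= size)
			continue;	/* probe is outside the copied range */

		buf[poff - offset] = probes[i].opcode;
		restored++;
	}
	return restored;
}
```

The sketch rejects a probe located exactly at offset + size, since that
byte was never copied. The posted check, (page_off - offset) <= size,
appears to accept it, which may be worth a second look.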

Signed-off-by: Prasanna S Panchamukhi <[email protected]>


fs/nfsd/vfs.c | 4 +++-
include/asm-i386/kprobes.h | 1 +
include/linux/fs.h | 9 ++++++---
include/linux/kprobes.h | 17 ++++++++++++++++-
kernel/uprobes.c | 39 +++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 17 ++++++++++++++---
mm/shmem.c | 2 +-
7 files changed, 80 insertions(+), 9 deletions(-)

diff -puN kernel/uprobes.c~kprobes_userspace_probes-remove-breakpoints-on-copy kernel/uprobes.c
--- linux-2.6.17-rc3-mm1/kernel/uprobes.c~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/kernel/uprobes.c 2006-05-09 12:43:21.000000000 +0530
@@ -300,6 +300,45 @@ struct uprobe_module __kprobes *get_modu
return NULL;
}

+/*
+ * This routine checks if the given page contains breakpoints. It first
+ * checks if the file read is a probed executable and later checks
+ * if the page being read contains breakpoints. This routine is
+ * used by file_read_actor();
+ */
+void remove_uprobe_breakpoints(struct address_space *mapping,
+ struct page *page, unsigned long offset,
+ read_descriptor_t *desc, unsigned long size)
+{
+ struct inode *inode = mapping->host;
+ struct page *upage;
+ struct uprobe_module *umodule = NULL;
+ struct uprobe *uprobe = NULL;
+ struct hlist_node *node;
+ unsigned long page_off, ret;
+
+ mutex_lock(&uprobe_mutex);
+ umodule = get_module_by_inode(inode);
+ if (!umodule)
+ goto out;
+ hlist_for_each_entry(uprobe, node, &umodule->ulist_head, ulist) {
+ upage = find_get_page(mapping,
+ uprobe->offset >> PAGE_CACHE_SHIFT);
+ if (upage == page) {
+ page_off = uprobe->offset & ~PAGE_MASK;
+ if ((page_off >= offset) &&
+ (page_off < (offset + PAGE_SIZE)) &&
+ ((page_off - offset) <= size))
+ ret = __copy_to_user(desc->arg.buf +
+ (page_off - offset),
+ &(uprobe->kp.opcode),
+ sizeof(kprobe_opcode_t));
+ }
+ }
+out:
+ mutex_unlock(&uprobe_mutex);
+}
+
static inline void insert_readpage_uprobe(struct page *page,
struct address_space *mapping, struct uprobe *uprobe)
{
diff -puN mm/filemap.c~kprobes_userspace_probes-remove-breakpoints-on-copy mm/filemap.c
--- linux-2.6.17-rc3-mm1/mm/filemap.c~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/mm/filemap.c 2006-05-09 12:43:21.000000000 +0530
@@ -31,6 +31,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
+#include <linux/kprobes.h>
#include "filemap.h"
#include "internal.h"

@@ -884,7 +885,7 @@ page_ok:
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = actor(desc, page, offset, nr, mapping);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1012,7 +1013,8 @@ out:
EXPORT_SYMBOL(do_generic_mapping_read);

int file_read_actor(read_descriptor_t *desc, struct page *page,
- unsigned long offset, unsigned long size)
+ unsigned long offset, unsigned long size,
+ struct address_space *mapping)
{
char *kaddr;
unsigned long left, count = desc->count;
@@ -1043,6 +1045,13 @@ int file_read_actor(read_descriptor_t *d
desc->error = -EFAULT;
}
success:
+#ifdef CONFIG_KPROBES
+ /*
+ * Check if the data copied to the buffer contains breakpoint
+ * and overwrite the breakpoints with appropriate opcodes.
+ */
+ remove_uprobe_breakpoints(mapping, page, offset, desc, size);
+#endif
desc->count = count - size;
desc->written += size;
desc->arg.buf += size;
@@ -1159,7 +1168,9 @@ generic_file_read(struct file *filp, cha

EXPORT_SYMBOL(generic_file_read);

-int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
+int file_send_actor(read_descriptor_t * desc, struct page *page,
+ unsigned long offset, unsigned long size,
+ struct address_space *mapping)
{
ssize_t written;
unsigned long count = desc->count;
diff -puN mm/shmem.c~kprobes_userspace_probes-remove-breakpoints-on-copy mm/shmem.c
--- linux-2.6.17-rc3-mm1/mm/shmem.c~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/mm/shmem.c 2006-05-09 12:43:21.000000000 +0530
@@ -1588,7 +1588,7 @@ static void do_shmem_file_read(struct fi
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = actor(desc, page, offset, nr, mapping);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
diff -puN fs/nfsd/vfs.c~kprobes_userspace_probes-remove-breakpoints-on-copy fs/nfsd/vfs.c
--- linux-2.6.17-rc3-mm1/fs/nfsd/vfs.c~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/fs/nfsd/vfs.c 2006-05-09 12:43:21.000000000 +0530
@@ -785,7 +785,9 @@ found:
* directrly. They will be released after the sending has completed.
*/
static int
-nfsd_read_actor(read_descriptor_t *desc, struct page *page, unsigned long offset , unsigned long size)
+nfsd_read_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size,
+ struct address_space *mapping)
{
unsigned long count = desc->count;
struct svc_rqst *rqstp = desc->arg.data;
diff -puN include/linux/kprobes.h~kprobes_userspace_probes-remove-breakpoints-on-copy include/linux/kprobes.h
--- linux-2.6.17-rc3-mm1/include/linux/kprobes.h~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/linux/kprobes.h 2006-05-09 12:43:21.000000000 +0530
@@ -205,13 +205,13 @@ extern void copy_kprobe(struct kprobe *o
extern int arch_copy_uprobe(struct kprobe *p, kprobe_opcode_t *address);
extern void arch_arm_uprobe(kprobe_opcode_t *address);
extern void arch_disarm_uprobe(struct kprobe *p, kprobe_opcode_t *address);
-extern void init_uprobes(void);

/* Get the kprobe at this addr (if any) - called with preemption disabled */
struct kprobe *get_kprobe(void *addr);
struct kprobe *get_uprobe(void *addr);
extern int arch_alloc_insn(struct kprobe *p);
struct hlist_head * kretprobe_inst_table_head(struct task_struct *tsk);
+struct uprobe_module *get_module_by_inode(struct inode *inode);

/* kprobe_running() will just return the current_kprobe on this CPU */
static inline struct kprobe *kprobe_running(void)
@@ -239,6 +239,21 @@ static inline void reset_uprobe_instance
current_uprobe = NULL;
}

+#ifdef ARCH_SUPPORTS_UPROBES
+extern void init_uprobes(void);
+extern void remove_uprobe_breakpoints(struct address_space *mapping,
+ struct page *page, unsigned long offset,
+ read_descriptor_t *desc, unsigned long size);
+#else
+static inline void init_uprobes(void)
+{
+}
+static inline void remove_uprobe_breakpoints(struct address_space *mapping,
+ struct page *page, unsigned long offset,
+ read_descriptor_t *desc, unsigned long size)
+{
+}
+#endif
int register_kprobe(struct kprobe *p);
void unregister_kprobe(struct kprobe *p);
int setjmp_pre_handler(struct kprobe *, struct pt_regs *);
diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes-remove-breakpoints-on-copy include/asm-i386/kprobes.h
--- linux-2.6.17-rc3-mm1/include/asm-i386/kprobes.h~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/asm-i386/kprobes.h 2006-05-09 12:43:21.000000000 +0530
@@ -46,6 +46,7 @@ typedef u8 kprobe_opcode_t;
#define JPROBE_ENTRY(pentry) (kprobe_opcode_t *)pentry
#define ARCH_SUPPORTS_KRETPROBES
#define ARCH_INACTIVE_KPROBE_COUNT 0
+#define ARCH_SUPPORTS_UPROBES

void arch_remove_kprobe(struct kprobe *p);
void kretprobe_trampoline(void);
diff -puN include/linux/fs.h~kprobes_userspace_probes-remove-breakpoints-on-copy include/linux/fs.h
--- linux-2.6.17-rc3-mm1/include/linux/fs.h~kprobes_userspace_probes-remove-breakpoints-on-copy 2006-05-09 12:43:21.000000000 +0530
+++ linux-2.6.17-rc3-mm1-prasanna/include/linux/fs.h 2006-05-09 12:43:21.000000000 +0530
@@ -998,7 +998,8 @@ typedef struct {
int error;
} read_descriptor_t;

-typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
+typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long,
+ unsigned long, struct address_space *);

/* These macros are for out of kernel modules to test that
* the kernel supports the unlocked_ioctl and compat_ioctl
@@ -1595,8 +1596,10 @@ extern int sb_min_blocksize(struct super

extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
-extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
-extern int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
+extern int file_read_actor(read_descriptor_t * desc, struct page *page,
+ unsigned long offset, unsigned long size, struct address_space *mapping);
+extern int file_send_actor(read_descriptor_t * desc, struct page *page,
+ unsigned long offset, unsigned long size, struct address_space *mapping);
extern ssize_t generic_file_read(struct file *, char __user *, size_t, loff_t *);
int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
extern ssize_t generic_file_write(struct file *, const char __user *, size_t, loff_t *);

_
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-09 09:36:17

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

On Tue, May 09, 2006 at 12:35:08PM +0530, Prasanna S Panchamukhi wrote:
> This patch provides two interfaces to insert and remove
> user space probes. Each probe is uniquely identified by
> inode and offset within that executable/library file.
> Insertion of a probe involves getting the code page for
> a given offset, mapping it into the memory and then inserting
> the breakpoint at the given offset. Also the probe is added
> to the uprobe_table hash list. A uprobe_module data structure
> is allocated for every probed application/library image on disk.
> Removal of a probe involves getting the code page for a given
> offset, mapping that page into the memory and then replacing
> the breakpoint instruction with the original opcode.
> This patch also provides aggregate probe handler feature,
> where user can define multiple handlers per probe.

This introduces interfaces that aren't used anywhere in the following
patches. That is completely not acceptable. Please provide a proper
userspace interface to this functionality, e.g. something based on the
RPN code from Richard's dprobes.
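
For readers following the quoted description, the idea of identifying
each probe by inode and offset can be sketched as a small chained hash
table in user-space C. Everything here is an assumption made for
illustration: the names uprobe_entry, insert_probe() and lookup_probe(),
the table size and the hash function are hypothetical, and the sketch
rejects duplicate keys, whereas the patch would attach an additional
handler to the existing probe.

```c
#include <assert.h>
#include <stddef.h>

#define UPROBE_HASH_BITS 6
#define UPROBE_TABLE_SIZE (1u << UPROBE_HASH_BITS)

/* Hypothetical probe record keyed by (inode number, file offset). */
struct uprobe_entry {
	unsigned long ino;
	unsigned long offset;
	struct uprobe_entry *next;	/* next entry in the same bucket */
};

static struct uprobe_entry *table[UPROBE_TABLE_SIZE];

/* Arbitrary illustrative hash; any reasonable mixing function would do. */
static unsigned int hash_key(unsigned long ino, unsigned long offset)
{
	unsigned long h = ino * 31UL + offset;

	return (unsigned int)(h & (UPROBE_TABLE_SIZE - 1));
}

/* Returns the entry for (ino, offset), or NULL if none is registered. */
static struct uprobe_entry *lookup_probe(unsigned long ino,
					 unsigned long offset)
{
	struct uprobe_entry *e;

	for (e = table[hash_key(ino, offset)]; e; e = e->next)
		if (e->ino == ino && e->offset == offset)
			return e;
	return NULL;
}

/* Returns 0 on success, -1 if the same key is already registered. */
static int insert_probe(struct uprobe_entry *e)
{
	unsigned int bucket;

	assert(e != NULL);
	if (lookup_probe(e->ino, e->offset))
		return -1;

	bucket = hash_key(e->ino, e->offset);
	e->next = table[bucket];
	table[bucket] = e;
	return 0;
}
```

On a breakpoint hit the handler would compute the same key from the
faulting address and call the lookup; removal is the usual unlink from
the bucket chain and is omitted here.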

2006-05-09 09:34:48

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 2/6] Kprobes: Get one pagetable entry

On Tue, May 09, 2006 at 12:31:06PM +0530, Prasanna S Panchamukhi wrote:
> This patch provides a wrapper routine to allocate one page
> table entry for a given virtual address. Kprobe's
> user-space probe mechanism uses this routine to get one
> page table entry. As Nick Piggin suggested, this generic
> routine can be used by routines like get_user_pages,
> find_*_page, and other standard APIs.

In find_*_page it definitely cannot be used because these routines
are doing pagecache lookups and couldn't care less about users.

If you want to get this patch in, convert the places currently open-coding
this to use it; otherwise it just adds more bloat.

2006-05-09 09:37:10

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 4/6] Kprobes: Insert probes on non-memory resident pages

On Tue, May 09, 2006 at 12:39:11PM +0530, Prasanna S Panchamukhi wrote:
> User-space probes also supports the registering of the probe points
> before the probed code is loaded. This clearly has advantages for
> catching initialization problems. This involves modifying the probed
> applications address_space readpage() and readpages().

Patching file-system provided method tables is completely non-acceptable.
Exactly to prevent idiots like you from doing this we've started to mark
them const.

2006-05-09 09:38:29

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 5/6] Kprobes: Single step the original instruction out-of-line

On Tue, May 09, 2006 at 12:42:04PM +0530, Prasanna S Panchamukhi wrote:
> Even if the vma cannot be extended, then the instruction must be
> executed inline, by replacing the breakpoint instruction with the
> original instruction.

Patching pagecache content is not acceptable. Just fail the probe
in this case.
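
One way to read this suggestion, sketched in user-space C: try the
out-of-line locations in order and return an error when none is
available, rather than falling back to patching the page cache. The
names slot_avail, pick_ss_slot() and prepare_singlestep() are
hypothetical, the three availability flags stand in for the results of
the find_stack_space_*() helpers in the patch, and -ENOMEM is an
assumed error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Where the copied instruction would be single-stepped from. */
enum ss_slot {
	SS_CURR_PAGE,		/* free space on the current stack page */
	SS_NEXT_PAGE,		/* free space on the next stack page */
	SS_EXPANDED_VMA,	/* space gained by extending the stack vma */
	SS_NONE			/* no out-of-line slot available */
};

/* Hypothetical summary of which locations have enough room. */
struct slot_avail {
	int curr_page_ok;
	int next_page_ok;
	int expand_ok;
};

/* Pick the first usable location, in the order the patch tries them. */
static enum ss_slot pick_ss_slot(const struct slot_avail *a)
{
	if (a->curr_page_ok)
		return SS_CURR_PAGE;
	if (a->next_page_ok)
		return SS_NEXT_PAGE;
	if (a->expand_ok)
		return SS_EXPANDED_VMA;
	return SS_NONE;
}

/*
 * Returns 0 and sets *slot on success. Returns -ENOMEM when no slot
 * exists, so the caller can fail the probe instead of rewriting the
 * original instruction in the page cache.
 */
static int prepare_singlestep(const struct slot_avail *a, enum ss_slot *slot)
{
	assert(a != NULL && slot != NULL);

	*slot = pick_ss_slot(a);
	if (*slot == SS_NONE)
		return -ENOMEM;
	return 0;
}
```

With this shape the inline path, and with it replace_original_insn() on
the hit path, would no longer be needed for single-stepping.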

2006-05-09 15:11:58

by Richard J Moore

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

Christoph Hellwig <[email protected]> wrote on 09/05/2006 10:36:14:

> On Tue, May 09, 2006 at 12:35:08PM +0530, Prasanna S Panchamukhi wrote:
> > This patch provides two interfaces to insert and remove
> > user space probes. Each probe is uniquely identified by
> > inode and offset within that executable/library file.
> > Insertion of a probe involves getting the code page for
> > a given offset, mapping it into the memory and then inserting
> > the breakpoint at the given offset. Also the probe is added
> > to the uprobe_table hash list. A uprobe_module data structure
> > is allocated for every probed application/library image on disk.
> > Removal of a probe involves getting the code page for a given
> > offset, mapping that page into the memory and then replacing
> > the breakpoint instruction with the original opcode.
> > This patch also provides aggregate probe handler feature,
> > where user can define multiple handlers per probe.
>
> This introduces interfaces that aren't used anywhere in the following
> patches. That is completely not acceptable. Please provide a proper
> userspace interface to this functionality, e.g. something based on the
> RPN code from Richard's dprobes.
>

Christoph, what are you asking for here? Surely not the RPN interpreter. I
thought everyone agreed that that was massive bloatware and that a binary
interface viz kprobes was a much better implementation.

2006-05-09 15:19:08

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

On Tue, May 09, 2006 at 04:11:36PM +0100, Richard J Moore wrote:
> Christoph, what are you asking for here? Surely not the RPN interpreter. I
> thought everyone agreed that that was massive bloatware and that a binary
> interface viz kprobes was a much better implementation.

I don't know what interface would be best. I'm not pushing this big pile
of junk either. Unless you find a suitable interface that you include in
the patchkit we're not gonna add it, even after it's been rewritten to be
sane. So if you care to get this in find a suitable interface.

Why the hell do you guys expect to get a huge pile of flaky code integrated
that slows down pagecaches and adds thousands of lines of undebuggable and
untestable code without submitting something that actually calls it?

I'd love to see the crack that's handed out at your group.

2006-05-09 17:04:52

by Hugh Dickins

Subject: Re: [RFC] [PATCH 6/6] Kprobes: Remove breakpoints from the copied pages

On Tue, 9 May 2006, Prasanna S Panchamukhi wrote:
> This patch removes the breakpoints if the pages read from the page
> cache contains breakpoints. If the pages containing the breakpoints
> is copied from the page cache, the copied image would also contain
> breakpoints in them. This could be a major problem for tools like
> tripwire etc and cause security concerns, hence must be prevented.
> This patch hooks up the actor routine, checks if the executable was
> a probed executable using the file inode and then replaces the
> breakpoints with the original opcodes in the copied image.

You've done a nice job of making the code look like kernel code
throughout, it's a much tidier patchset than many.

With that said... it looks to me like one of the scariest and
most inappropriate sets I can remember. Getting the kernel to
connive in presenting an incoherent view of its pagecache:
I don't think we'd ever want that.

There's all kinds of things that could be said about the details
(your locking is often insufficient, for example); but there's a
lot going on, and it doesn't seem worth going through this line
by line, when the whole concept seems so unwelcome.

You've a big task to convince people that this is something the
Linux kernel will want: and perhaps you'll succeed - good luck.

But please approach what you're trying to do from userspace:
you can patch the binaries from there if you wish (but not on
my system, thanks). Or perhaps you can patch it all into the
kernel via kprobes itself, but I wouldn't recommend it.

Hugh

2006-05-09 17:41:25

by Frank Ch. Eigler

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes


hch wrote:

> [...] why the hell do you guys expect to get a huge pile of flaky
> code integrated that slows down pagecaches and adds thousands of
> lines of undebuggable and untestable code without submitting
> something that actually calls it. [...]

It is reasonable to want to see code that exercises this function.
Until systemtap does, hand-written examples can surely be provided.

- FChE

2006-05-09 20:40:51

by Adrian Bunk

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

On Tue, May 09, 2006 at 01:41:09PM -0400, Frank Ch. Eigler wrote:
>
> hch wrote:
>
> > [...] why the hell do you guys expect to get a huge pile of flaky
> > code integrated that slows down pagecaches and adds thousands of
> > lines of undebuggable and untestable code without submitting
> > something that actually calls it. [...]
>
> It is reasonable to want to see code that exercises this function.
> Until systemtap does, hand-written examples can surely be provided.

It's not about examples, it's about in-kernel users.

If the code using it is not yet ready for submission, there's no need to
add interfaces for it now.

> - FChE

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-05-09 21:00:33

by Frank Ch. Eigler

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

Hi -

On Tue, May 09, 2006 at 10:40:52PM +0200, Adrian Bunk wrote:
> > It is reasonable to want to see code that exercises this function.
> > Until systemtap does, hand-written examples can surely be provided.
>
> It's not about examples, it's about in-kernel users.

Just as for kprobes, this facility is for dynamically generated
kernel-resident code. Just as for kprobes, there is nothing there to
submit to lkml to be "in-kernel" in that sense.

- FChE

2006-05-09 22:35:38

by Christoph Hellwig

Subject: Re: [RFC] [PATCH 3/6] Kprobes: New interfaces for user-space probes

On Tue, May 09, 2006 at 04:58:35PM -0400, Frank Ch. Eigler wrote:
> Hi -
>
> On Tue, May 09, 2006 at 10:40:52PM +0200, Adrian Bunk wrote:
> > > It is reasonable to want to see code that exercises this function.
> > > Until systemtap does, hand-written examples can surely be provided.
> >
> > It's not about examples, it's about in-kernel users.
>
> Just as for kprobes, this facility is for dynamically generated
> kernel-resident code.

kprobes is not for "dynamically generated kernel-resident code". That's
just what you abuse it for.

2006-05-10 00:47:27

by bibo,mao

Subject: Re: [RFC] [PATCH 5/6] Kprobes: Single step the original instruction out-of-line

Previously, when I tested the uprobe patch on a specific IA32 machine with
the CONFIG_X86_PAE option on, uprobes could not be activated unless the
kernel option "noexec=off" was added at boot.
Without this option the user stack is non-executable, and the copied trap
instruction is placed in the process's stack space. Executing that
instruction on the stack causes a page fault for lack of execute permission.

Thanks
bibo,mao

Prasanna S Panchamukhi wrote:
> This patch provides a mechanism for probe handling and
> executing the user-specified handlers.
>
> Each userspace probe is uniquely identified by the combination of
> inode and offset, hence during registration the inode and offset
> combination is added to uprobes hash table. Initially when
> breakpoint instruction is hit, the uprobes hash table is looked up
> for matching inode and offset. The pre_handlers are called in
> sequence if multiple probes are registered. Similar to kprobes,
> uprobes also adopts to single step out-of-line, so that probe miss in
> SMP environment can be avoided. But for userspace probes, instruction
> copied into kernel address space cannot be single stepped, hence the
> instruction must be copied to user address space. The solution is to
> find free space in the current process address space and then copy the
> original instruction and single step that instruction.
>
> User processes use stack space to store local variables, arguments and
> return values. Normally the stack space either below or above the
> stack pointer indicates the free stack space.
>
> The instruction to be single stepped can modify the stack space,
> hence before using the free stack space, sufficient stack space must
> be left. The instruction is copied to the bottom of the page and check
> is made such that the copied instruction does not cross the page
> boundary. The copied instruction is then single stepped. Several
> architectures do not allow the instruction to be executed from the
> stack location, since the no-exec bit is set for the stack pages. On those
> architectures, the page table entry corresponding to the stack page is
> identified and the no-exec bit is cleared, allowing the instruction on
> that stack page to be executed.
>
> There are situations where even the free stack space is not enough for
> the user instruction to be copied and single stepped. In such
> situations, the virtual memory area(vma) can be expanded beyond the
> current stack vma. This expanded stack can be used to copy the
> original instruction and single step out-of-line.
>
> Even if the vma cannot be extended, then the instruction must be
> executed inline, by replacing the breakpoint instruction with the
> original instruction.
>
> Signed-off-by: Prasanna S Panchamukhi <[email protected]>
>
>
> arch/i386/kernel/Makefile | 2
> arch/i386/kernel/kprobes.c | 4
> arch/i386/kernel/uprobes.c | 472 +++++++++++++++++++++++++++++++++++++++++++++
> arch/i386/mm/fault.c | 3
> include/asm-i386/kprobes.h | 21 ++
> 5 files changed, 497 insertions(+), 5 deletions(-)
>
> diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line include/asm-i386/kprobes.h
> --- linux-2.6.17-rc3-mm1/include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> +++ linux-2.6.17-rc3-mm1-prasanna/include/asm-i386/kprobes.h 2006-05-09 12:40:48.000000000 +0530
> @@ -26,6 +26,7 @@
> */
> #include <linux/types.h>
> #include <linux/ptrace.h>
> +#include <asm/cacheflush.h>
>
> #define __ARCH_WANT_KPROBES_INSN_SLOT
>
> @@ -78,6 +79,19 @@ struct kprobe_ctlblk {
> struct prev_kprobe prev_kprobe;
> };
>
> +/* per user probe control block */
> +struct uprobe_ctlblk {
> + unsigned long uprobe_status;
> + unsigned long uprobe_saved_eflags;
> + unsigned long uprobe_old_eflags;
> + unsigned long singlestep_addr;
> + unsigned long flags;
> + struct kprobe *curr_p;
> + pte_t *upte;
> + struct page *upage;
> + struct task_struct *tsk;
> +};
> +
> /* trap3/1 are intr gates for kprobes. So, restore the status of IF,
> * if necessary, before executing the original int3/1 (trap) handler.
> */
> @@ -89,4 +103,11 @@ static inline void restore_interrupts(st
>
> extern int kprobe_exceptions_notify(struct notifier_block *self,
> unsigned long val, void *data);
> +extern int uprobe_exceptions_notify(struct notifier_block *self,
> + unsigned long val, void *data);
> +extern unsigned long get_segment_eip(struct pt_regs *regs,
> + unsigned long *eip_limit);
> +extern int is_IF_modifier(kprobe_opcode_t opcode);
> +
> +extern pte_t *get_one_pte(unsigned long address);
> #endif /* _ASM_KPROBES_H */
> diff -puN arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/uprobes.c
> --- linux-2.6.17-rc3-mm1/arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/uprobes.c 2006-05-09 12:40:48.000000000 +0530
> @@ -30,6 +30,10 @@
> #include <asm/cacheflush.h>
> #include <asm/kdebug.h>
> #include <asm/desc.h>
> +#include <asm/uaccess.h>
> +
> +static struct uprobe_ctlblk uprobe_ctlblk;
> +struct uprobe *current_uprobe;
>
> int __kprobes arch_alloc_insn(struct kprobe *p)
> {
> @@ -69,3 +73,471 @@ int __kprobes arch_copy_uprobe(struct kp
>
> return ret;
> }
> +
> +/*
> + * This routine checks for space in the process's stack address space.
> + * If enough address space is found, returns the address of free stack
> + * space.
> + */
> +unsigned long __kprobes *find_stack_space_on_next_page(unsigned long stack_addr,
> + int size, struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> + struct page *pg;
> + int retval = 0;
> +
> + if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
> + return NULL;
> + addr = (stack_addr & PAGE_MASK) + PAGE_SIZE;
> +
> + retval = get_user_pages(current, current->mm,
> + (unsigned long )addr, 1, 1, 0, &pg, NULL);
> + if (retval)
> + return NULL;
> +
> + return (unsigned long *) addr;
> +}
> +
> +/*
> + * This routine expands the stack beyond the present process address
> + * space and returns the address of free stack space. This routine
> + * must be called with mmap_sem held.
> + */
> +unsigned long __kprobes *find_stack_space_in_expanded_vma(int size,
> + struct vm_area_struct *vma)
> +{
> + unsigned long addr, vm_addr;
> + int retval = 0;
> + struct vm_area_struct *new_vma;
> + struct mm_struct *mm = current->mm;
> + struct page *pg;
> +
> + vm_addr = vma->vm_start - size;
> + new_vma = find_extend_vma(mm, vm_addr);
> + if (!new_vma)
> + return NULL;
> +
> + addr = new_vma->vm_start;
> + retval = get_user_pages(current, current->mm,
> + (unsigned long )addr, 1, 1, 0, &pg, NULL);
> + if (retval)
> + return NULL;
> +
> + return (unsigned long *) addr;
> +}
> +
> +/*
> + * This routine checks for stack free space below the stack pointer in the
> + * current stack page. If there is not enough stack space, it returns NULL.
> + */
> +unsigned long __kprobes *find_stack_space_on_curr_page(unsigned long stack_addr,
> + int size)
> +{
> + unsigned long page_addr;
> +
> + page_addr = stack_addr & PAGE_MASK;
> +
> + if (((stack_addr - sizeof(long long))) < (page_addr + size))
> + return NULL;
> +
> + return (unsigned long *) page_addr;
> +}
> +
> +/*
> + * This routine finds free stack space for a given size either on the
> + * current stack page, or on the next stack page. If no free stack
> + * space is available, it expands the stack and returns the address of
> + * free stack space.
> + */
> +unsigned long __kprobes *find_stack_space(unsigned long stack_addr, int size)
> +{
> + unsigned long *addr;
> + struct vm_area_struct *vma = NULL;
> +
> + addr = find_stack_space_on_curr_page(stack_addr, size);
> + if (addr)
> + return addr;
> +
> + if (!down_read_trylock(&current->mm->mmap_sem))
> + down_read(&current->mm->mmap_sem);
> +
> + vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
> + if (!vma) {
> + up_read(&current->mm->mmap_sem);
> + return NULL;
> + }
> +
> + addr = find_stack_space_on_next_page(stack_addr, size, vma);
> + if (addr) {
> + up_read(&current->mm->mmap_sem);
> + return addr;
> + }
> +
> + addr = find_stack_space_in_expanded_vma(size, vma);
> + up_read(&current->mm->mmap_sem);
> +
> + if (!addr)
> + return NULL;
> +
> + return addr;
> +}
> +
> +/*
> + * This routine gets the page containing the probe, maps it and
> + * replaces the instruction at the probed address with the specified
> + * opcode.
> + */
> +void __kprobes replace_original_insn(struct uprobe *uprobe,
> + struct pt_regs *regs, kprobe_opcode_t opcode)
> +{
> + kprobe_opcode_t *addr;
> + struct page *page;
> +
> + page = find_get_page(uprobe->inode->i_mapping,
> + uprobe->offset >> PAGE_CACHE_SHIFT);
> + BUG_ON(!page);
> +
> + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> + addr = (kprobe_opcode_t *)((unsigned long)addr +
> + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> + *addr = opcode;
> + /*TODO: flush vma ? */
> + kunmap_atomic(addr, KM_USER1);
> +
> + flush_dcache_page(page);
> +
> + if (page)
> + page_cache_release(page);
> + regs->eip = (unsigned long)uprobe->kp.addr;
> +}
> +
> +/*
> + * This routine sets up single stepping out-of-line. If single
> + * stepping out-of-line cannot be achieved, it restores the original
> + * instruction so that it can be single stepped inline.
> + */
> +static int __kprobes prepare_singlestep_uprobe(struct uprobe *uprobe,
> + struct uprobe_ctlblk *ucb, struct pt_regs *regs)
> +{
> + unsigned long *addr = NULL, stack_addr = regs->esp;
> + int size = sizeof(kprobe_opcode_t) * MAX_INSN_SIZE;
> + unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
> +
> + /*
> + * Get free stack space to copy original instruction, so as to
> + * single step out-of-line.
> + */
> + addr = find_stack_space(stack_addr, size);
> + if (!addr)
> + goto no_stack_space;
> +
> + /*
> + * We are in_atomic and preemption is disabled at this point of
> + * time. Copy original instruction on this per process stack
> + * page so as to single step out-of-line.
> + */
> + if (__copy_to_user_inatomic((unsigned long *)addr, source, size))
> + goto no_stack_space;
> +
> + regs->eip = (unsigned long)addr;
> +
> + regs->eflags |= TF_MASK;
> + regs->eflags &= ~IF_MASK;
> + ucb->uprobe_status = UPROBE_HIT_SS;
> +
> + ucb->upte = get_one_pte(regs->eip);
> + if (!ucb->upte)
> + goto no_stack_space;
> + ucb->upage = pte_page(*ucb->upte);
> + set_pte(ucb->upte, pte_mkdirty(*ucb->upte));
> + ucb->singlestep_addr = regs->eip;
> +
> + return 0;
> +
> +no_stack_space:
> + replace_original_insn(uprobe, regs, uprobe->kp.opcode);
> + ucb->uprobe_status = UPROBE_SS_INLINE;
> + ucb->singlestep_addr = regs->eip;
> +
> + return 0;
> +}
> +
> +/*
> + * uprobe_handler() executes the user-specified handler and sets up
> + * single stepping of the original instruction, either out-of-line or inline.
> + */
> +static int __kprobes uprobe_handler(struct pt_regs *regs)
> +{
> + struct kprobe *p;
> + int ret = 0;
> + kprobe_opcode_t *addr = NULL;
> + struct uprobe_ctlblk *ucb = &uprobe_ctlblk;
> + unsigned long limit;
> +
> + spin_lock_irqsave(&uprobe_lock, ucb->flags);
> + /* preemption is disabled, remains disabled
> + * until we single step on original instruction.
> + */
> + inc_preempt_count();
> +
> + addr = (kprobe_opcode_t *)(get_segment_eip(regs, &limit) - 1);
> +
> + p = get_uprobe(addr);
> + if (!p) {
> +
> + if (*addr != BREAKPOINT_INSTRUCTION) {
> + /*
> + * The breakpoint instruction was removed right
> + * after we hit it. Another cpu has removed
> + * either a probe point or a debugger breakpoint
> + * at this address. In either case, no further
> + * handling of this interrupt is appropriate.
> + * Back up over the (now missing) int3 and run
> + * the original instruction.
> + */
> + regs->eip -= sizeof(kprobe_opcode_t);
> + ret = 1;
> + }
> + /* Not one of ours: let kernel handle it */
> + goto no_uprobe;
> + }
> +
> + if (p->opcode == BREAKPOINT_INSTRUCTION) {
> + /*
> + * A breakpoint was already present before the probe was
> + * inserted. Handling it might break compatibility with other
> + * debuggers such as gdb, so we don't handle such probes.
> + */
> + current_uprobe = NULL;
> + goto no_uprobe;
> + }
> +
> + ucb->curr_p = p;
> + ucb->tsk = current;
> + ucb->uprobe_status = UPROBE_HIT_ACTIVE;
> + ucb->uprobe_saved_eflags = (regs->eflags & (TF_MASK | IF_MASK));
> + ucb->uprobe_old_eflags = (regs->eflags & (TF_MASK | IF_MASK));
> +
> + if (p->pre_handler && p->pre_handler(p, regs))
> + /* handler has already set things up, so skip ss setup */
> + return 1;
> +
> + prepare_singlestep_uprobe(current_uprobe, ucb, regs);
> + /*
> + * Avoid rescheduling the current task while returning from
> + * kernel to user mode.
> + */
> + clear_need_resched();
> + return 1;
> +
> +no_uprobe:
> + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> + dec_preempt_count();
> +
> + return ret;
> +}
> +
> +/*
> + * Called after single-stepping. p->addr is the address of the
> + * instruction whose first byte has been replaced by the "int 3"
> + * instruction. To avoid the SMP problems that can occur when we
> + * temporarily put back the original opcode to single-step, we
> + * single-stepped a copy of the instruction. The address of this
> + * copy is p->ainsn.insn.
> + *
> + * This function prepares to return from the post-single-step
> + * interrupt. We have to fix up the stack as follows:
> + *
> + * 0) Typically, the new eip is relative to the copied instruction. We
> + * need to make it relative to the original instruction. Exceptions are
> + * return instructions and absolute or indirect jump or call instructions.
> + *
> + * 1) If the single-stepped instruction was pushfl, then the TF and IF
> + * flags are set in the just-pushed eflags, and may need to be cleared.
> + *
> + * 2) If the single-stepped instruction was a call, the return address
> + * that is atop the stack is the address following the copied instruction.
> + * We need to make it the address following the original instruction.
> + */
> +static void __kprobes resume_execution_user(struct kprobe *p,
> + struct pt_regs *regs, struct uprobe_ctlblk *ucb)
> +{
> + unsigned long *tos = (unsigned long *)regs->esp;
> + unsigned long next_eip = 0;
> + unsigned long copy_eip = ucb->singlestep_addr;
> + unsigned long orig_eip = (unsigned long)p->addr;
> +
> + switch (p->ainsn.insn[0]) {
> + case 0x9c: /* pushfl */
> + *tos &= ~(TF_MASK | IF_MASK);
> + *tos |= ucb->uprobe_old_eflags;
> + break;
> + case 0xc3: /* ret/lret */
> + case 0xcb:
> + case 0xc2:
> + case 0xca:
> + next_eip = regs->eip;
> + /* eip is already adjusted, no more changes required*/
> + break;
> + case 0xe8: /* call relative - Fix return addr */
> + *tos = orig_eip + (*tos - copy_eip);
> + break;
> + case 0xff:
> + if ((p->ainsn.insn[1] & 0x30) == 0x10) {
> + /* call absolute, indirect */
> + /* Fix return addr; eip is correct. */
> + next_eip = regs->eip;
> + *tos = orig_eip + (*tos - copy_eip);
> + } else if (((p->ainsn.insn[1] & 0x31) == 0x20) ||
> + ((p->ainsn.insn[1] & 0x31) == 0x21)) {
> + /* jmp near or jmp far absolute indirect */
> + /* eip is correct. */
> + next_eip = regs->eip;
> + }
> + break;
> + case 0xea: /* jmp absolute -- eip is correct */
> + next_eip = regs->eip;
> + break;
> + default:
> + break;
> + }
> +
> + regs->eflags &= ~TF_MASK;
> + if (next_eip)
> + regs->eip = next_eip;
> + else
> + regs->eip = orig_eip + (regs->eip - copy_eip);
> +}
> +
> +/*
> + * post_uprobe_handler() executes the user-specified handlers and
> + * resumes normal execution.
> + */
> +static int __kprobes post_uprobe_handler(struct pt_regs *regs)
> +{
> + struct kprobe *cur;
> + struct uprobe_ctlblk *ucb;
> +
> + if (!current_uprobe)
> + return 0;
> +
> + ucb = &uprobe_ctlblk;
> + cur = ucb->curr_p;
> +
> + if (!cur || ucb->tsk != current)
> + return 0;
> +
> + if (cur->post_handler) {
> + if (ucb->uprobe_status == UPROBE_SS_INLINE)
> + ucb->uprobe_status = UPROBE_SSDONE_INLINE;
> + else
> + ucb->uprobe_status = UPROBE_HIT_SSDONE;
> + cur->post_handler(cur, regs, 0);
> + }
> +
> + resume_execution_user(cur, regs, ucb);
> + regs->eflags |= ucb->uprobe_saved_eflags;
> +
> + if (ucb->uprobe_status == UPROBE_SSDONE_INLINE)
> + replace_original_insn(current_uprobe, regs,
> + BREAKPOINT_INSTRUCTION);
> + else
> + pte_unmap(ucb->upte);
> +
> + current_uprobe = NULL;
> + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> + dec_preempt_count();
> + /*
> + * if somebody else is single stepping across a probe point, eflags
> + * will have TF set, in which case, continue the remaining processing
> + * of do_debug, as if this is not a probe hit.
> + */
> + if (regs->eflags & TF_MASK)
> + return 0;
> +
> + return 1;
> +}
> +
> +static int __kprobes uprobe_fault_handler(struct pt_regs *regs, int trapnr)
> +{
> + struct kprobe *cur;
> + struct uprobe_ctlblk *ucb;
> + int ret = 0;
> +
> + ucb = &uprobe_ctlblk;
> + cur = ucb->curr_p;
> +
> + if (ucb->tsk != current || !cur)
> + return 0;
> +
> + switch(ucb->uprobe_status) {
> + case UPROBE_HIT_SS:
> + pte_unmap(ucb->upte);
> + /* TODO: Allow an acceptable number of faults before disabling */
> + replace_original_insn(current_uprobe, regs, cur->opcode);
> + /* Fall through and reset the current probe */
> + case UPROBE_SS_INLINE:
> + regs->eip = (unsigned long)cur->addr;
> + regs->eflags |= ucb->uprobe_old_eflags;
> + regs->eflags &= ~TF_MASK;
> + current_uprobe = NULL;
> + ret = 1;
> + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> + preempt_enable_no_resched();
> + break;
> + case UPROBE_HIT_ACTIVE:
> + case UPROBE_SSDONE_INLINE:
> + case UPROBE_HIT_SSDONE:
> + if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
> + return 1;
> +
> + if (fixup_exception(regs))
> + return 1;
> + /*
> + * We must not allow the system page fault handler to continue
> + * while holding a lock, since the page fault handler can sleep
> + * and be rescheduled on a different cpu. Hence return 1.
> + */
> + return 1;
> + break;
> + default:
> + break;
> + }
> + return ret;
> +}
> +
> +/*
> + * Wrapper routine for handling exceptions.
> + */
> +int __kprobes uprobe_exceptions_notify(struct notifier_block *self,
> + unsigned long val, void *data)
> +{
> + struct die_args *args = (struct die_args *)data;
> + int ret = NOTIFY_DONE;
> +
> + if (args->regs->eflags & VM_MASK) {
> + /* We are in virtual-8086 mode. Return NOTIFY_DONE */
> + return ret;
> + }
> +
> + switch (val) {
> + case DIE_INT3:
> + if (uprobe_handler(args->regs))
> + ret = NOTIFY_STOP;
> + break;
> + case DIE_DEBUG:
> + if (post_uprobe_handler(args->regs))
> + ret = NOTIFY_STOP;
> + break;
> + case DIE_GPF:
> + case DIE_PAGE_FAULT:
> + if (current_uprobe &&
> + uprobe_fault_handler(args->regs, args->trapnr))
> + ret = NOTIFY_STOP;
> + break;
> + default:
> + break;
> + }
> + return ret;
> +}
> diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/kprobes.c
> --- linux-2.6.17-rc3-mm1/arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/kprobes.c 2006-05-09 12:40:48.000000000 +0530
> @@ -139,7 +139,7 @@ retry:
> /*
> * returns non-zero if opcode modifies the interrupt flag.
> */
> -static int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
> +int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
> {
> switch (opcode) {
> case 0xfa: /* cli */
> @@ -649,7 +649,7 @@ int __kprobes kprobe_exceptions_notify(s
> int ret = NOTIFY_DONE;
>
> if (args->regs && user_mode(args->regs))
> - return ret;
> + return uprobe_exceptions_notify(self, val, data);
>
> switch (val) {
> case DIE_INT3:
> diff -puN arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line arch/i386/mm/fault.c
> --- linux-2.6.17-rc3-mm1/arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/mm/fault.c 2006-05-09 12:40:48.000000000 +0530
> @@ -104,8 +104,7 @@ void bust_spinlocks(int yes)
> *
> * This is slow, but is very rarely executed.
> */
> -static inline unsigned long get_segment_eip(struct pt_regs *regs,
> - unsigned long *eip_limit)
> +unsigned long get_segment_eip(struct pt_regs *regs, unsigned long *eip_limit)
> {
> unsigned long eip = regs->eip;
> unsigned seg = regs->xcs & 0xffff;
> diff -puN arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/Makefile
> --- linux-2.6.17-rc3-mm1/arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/Makefile 2006-05-09 12:40:48.000000000 +0530
> @@ -27,7 +27,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
> obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> obj-$(CONFIG_X86_NUMAQ) += numaq.o
> obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o
> -obj-$(CONFIG_KPROBES) += kprobes.o
> +obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
> obj-$(CONFIG_MODULES) += module.o
> obj-y += sysenter.o vsyscall.o
> obj-$(CONFIG_ACPI_SRAT) += srat.o
>
> _
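
The address arithmetic in find_stack_space_on_curr_page() and
resume_execution_user() can be tried out in user space. The sketch below is
only an illustration under stated assumptions: the sketch_* names, the fixed
4096-byte page size and the use of uint32_t for i386 addresses are all
hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on i386 without large pages. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1u))

/*
 * Same test as find_stack_space_on_curr_page(): the copied instruction
 * goes at the bottom of the current stack page, so the stack pointer
 * (minus a small safety margin) must lie above page base + size.
 * Returns the page base, or 0 if there is not enough room.
 */
static uint32_t sketch_space_on_curr_page(uint32_t stack_addr, uint32_t size)
{
	uint32_t page_addr = stack_addr & SKETCH_PAGE_MASK;

	assert(size <= SKETCH_PAGE_SIZE);

	if (stack_addr - (uint32_t)sizeof(long long) < page_addr + size)
		return 0;

	return page_addr;
}

/*
 * Default eip fixup from resume_execution_user(): after stepping the
 * copy, eip is relative to the copy and is rebased onto the original
 * instruction. Unsigned 32-bit wrap-around also covers backward
 * relative jumps.
 */
static uint32_t sketch_fixup_eip(uint32_t orig_eip, uint32_t copy_eip,
				 uint32_t eip_after_step)
{
	return orig_eip + (eip_after_step - copy_eip);
}
```

For example, a 3-byte instruction stepped at copy address 0xbffff000 leaves
eip at 0xbffff003, which the fixup rebases to orig_eip + 3.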

2006-05-10 12:17:56

by S. P. Prasanna

[permalink] [raw]
Subject: Re: [RFC] [PATCH 6/6] Kprobes: Remove breakpoints from the copied pages

On Tue, May 09, 2006 at 06:04:50PM +0100, Hugh Dickins wrote:
> On Tue, 9 May 2006, Prasanna S Panchamukhi wrote:
> > This patch removes the breakpoints if the pages read from the page
> > cache contains breakpoints. If the pages containing the breakpoints
> > is copied from the page cache, the copied image would also contain
> > breakpoints in them. This could be a major problem for tools like
> > tripwire etc and cause security concerns, hence must be prevented.
> > This patch hooks up the actor routine, checks if the executable was
> > a probed executable using the file inode and then replaces the
> > breakpoints with the original opcodes in the copied image.
>
> You've done a nice job of making the code look like kernel code
> throughout, it's a much tidier patchset than many.
>
> With that said... it looks to me like one of the scariest and
> most inappropriate sets I can remember. Getting the kernel to
> connive in presenting an incoherent view of its pagecache:
> I don't think we'd ever want that.
>

As Andi Kleen and Christoph suggested, pagecache contention can be avoided
using the COW approach.

Advantages of COW:

1. No need to hook file_read_actor() to remove the breakpoints when a
probed page is read from the pagecache.
2. No need to hook readpage(s)() to insert probes when pages are
read into memory.

Some thoughts about COW implications, AFAIK:

1. Need to hook mmap() to make a per-process copy.
2. Pages must be brought in just to insert the probes.
3. All the text pages need to stay in memory until the process exits.
4. Free the per-process text pages by hooking exit() and exec().
5. Mask off probes visible across fork() by hooking fork().

Any implications ?
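
As a user-space analogy for the per-process copy described above (not the
kernel-side mechanism itself), a write through a MAP_PRIVATE mapping triggers
copy-on-write, so the file contents seen by other readers stay unchanged. The
function name, the temporary file path and the 0x90 stand-in opcode below are
assumptions made for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a breakpoint byte written through a private mapping
 * stays private (the file still holds the original byte), 0 if it
 * leaked into the file, and -1 on a setup error.
 */
static int sketch_private_write_stays_private(void)
{
	char path[] = "/tmp/cow_sketch_XXXXXX";	/* hypothetical scratch file */
	unsigned char original = 0x90;		/* stand-in original opcode */
	unsigned char on_disk = 0;
	unsigned char *map;
	int result = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);				/* file lives until close() */

	if (write(fd, &original, 1) != 1)
		goto out;

	/* Private mapping: the first write gives this process its own page. */
	map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	map[0] = 0xcc;				/* int3, as a probe would insert */

	/* Read the file itself: it should still hold the original byte. */
	if (pread(fd, &on_disk, 1, 0) == 1)
		result = (on_disk == original && map[0] == 0xcc) ? 1 : 0;

	munmap(map, 1);
out:
	close(fd);
	return result;
}
```

The same property is what would make hooking file_read_actor() unnecessary:
readers of the file never see the breakpoint, only the process holding the
private copy does.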


Thanks
Prasanna
--
Prasanna S Panchamukhi
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2006-05-10 14:19:49

by Richard J Moore

[permalink] [raw]
Subject: Re: [RFC] [PATCH 5/6] Kprobes: Single step the original instruction out-of-line


"bibo,mao" <[email protected]> wrote on 10/05/2006 01:47:14:

> Previously, when I tested the uprobe patch on a specific IA32 machine with
> the CONFIG_X86_PAE option on, uprobes could not be activated unless the
> kernel option "noexec=off" was added at boot.
> Without this option the user stack is non-executable, and the copied trap
> instruction is placed in the process's stack space. Executing the trap
> instruction on the stack causes a page fault for lack of execute privilege.

You won't cause a page fault. The only privilege checks from the page tables
are read-only/read-write and user/system.

The executable attribute is, or was when I last read the processor
reference manual, only present in the descriptor table entry. Violation of
that would cause a general protection fault.

That being said, if we are going to execute code on a stack, we should set
up an alias descriptor for it. We might as well set up a global descriptor
that maps all of user space as executable.

Richard



>
> Thanks
> bibo,mao
>
> Prasanna S Panchamukhi wrote:
> > This patch provides a mechanism for probe handling and
> > executing the user-specified handlers.
> >
> > Each userspace probe is uniquely identified by the combination of
> > inode and offset, hence during registration the inode and offset
> > combination is added to the uprobes hash table. When a breakpoint
> > instruction is hit, the uprobes hash table is looked up
> > for a matching inode and offset. The pre_handlers are called in
> > sequence if multiple probes are registered. Like kprobes,
> > uprobes single step out-of-line, so that probe misses in an
> > SMP environment can be avoided. But for userspace probes, an instruction
> > copied into kernel address space cannot be single stepped, hence the
> > instruction must be copied to user address space. The solution is to
> > find free space in the current process address space, copy the
> > original instruction there and single step that copy.
> >
> > User processes use stack space to store local variables, arguments and
> > return values. Normally the stack space either below or above the
> > stack pointer indicates the free stack space.
> >
> > The instruction to be single stepped can modify the stack space,
> > hence sufficient stack space must be left before using the free
> > stack space. The instruction is copied to the bottom of the page, and a
> > check is made that the copied instruction does not cross the page
> > boundary. The copied instruction is then single stepped. Several
> > architectures do not allow an instruction to be executed from a
> > stack location, since the no-exec bit is set for the stack pages. On those
> > architectures, the page table entry corresponding to the stack page is
> > identified and the no-exec bit is cleared, allowing the instruction on
> > that stack page to be executed.
> >
> > There are situations where even the free stack space is not enough for
> > the user instruction to be copied and single stepped. In such
> > situations, the virtual memory area(vma) can be expanded beyond the
> > current stack vma. This expanded stack can be used to copy the
> > original instruction and single step out-of-line.
> >
> > If the vma cannot be extended either, then the instruction must be
> > executed inline, by replacing the breakpoint instruction with the
> > original instruction.
> >
> > Signed-off-by: Prasanna S Panchamukhi <[email protected]>
> >
> >
> > arch/i386/kernel/Makefile | 2
> > arch/i386/kernel/kprobes.c | 4
> > arch/i386/kernel/uprobes.c | 472 +++++++++++++++++++++++++++++++++++++++++++++
> > arch/i386/mm/fault.c | 3
> > include/asm-i386/kprobes.h | 21 ++
> > 5 files changed, 497 insertions(+), 5 deletions(-)
> >
> > diff -puN include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line include/asm-i386/kprobes.h
> > --- linux-2.6.17-rc3-mm1/include/asm-i386/kprobes.h~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> > +++ linux-2.6.17-rc3-mm1-prasanna/include/asm-i386/kprobes.h 2006-05-09 12:40:48.000000000 +0530
> > @@ -26,6 +26,7 @@
> > */
> > #include <linux/types.h>
> > #include <linux/ptrace.h>
> > +#include <asm/cacheflush.h>
> >
> > #define __ARCH_WANT_KPROBES_INSN_SLOT
> >
> > @@ -78,6 +79,19 @@ struct kprobe_ctlblk {
> > struct prev_kprobe prev_kprobe;
> > };
> >
> > +/* per user probe control block */
> > +struct uprobe_ctlblk {
> > + unsigned long uprobe_status;
> > + unsigned long uprobe_saved_eflags;
> > + unsigned long uprobe_old_eflags;
> > + unsigned long singlestep_addr;
> > + unsigned long flags;
> > + struct kprobe *curr_p;
> > + pte_t *upte;
> > + struct page *upage;
> > + struct task_struct *tsk;
> > +};
> > +
> > /* trap3/1 are intr gates for kprobes. So, restore the status of IF,
> > * if necessary, before executing the original int3/1 (trap) handler.
> > */
> > @@ -89,4 +103,11 @@ static inline void restore_interrupts(st
> >
> > extern int kprobe_exceptions_notify(struct notifier_block *self,
> > unsigned long val, void *data);
> > +extern int uprobe_exceptions_notify(struct notifier_block *self,
> > + unsigned long val, void *data);
> > +extern unsigned long get_segment_eip(struct pt_regs *regs,
> > + unsigned long *eip_limit);
> > +extern int is_IF_modifier(kprobe_opcode_t opcode);
> > +
> > +extern pte_t *get_one_pte(unsigned long address);
> > #endif /* _ASM_KPROBES_H */
> > diff -puN arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/uprobes.c
> > --- linux-2.6.17-rc3-mm1/arch/i386/kernel/uprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> > +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/uprobes.c 2006-05-09 12:40:48.000000000 +0530
> > @@ -30,6 +30,10 @@
> > #include <asm/cacheflush.h>
> > #include <asm/kdebug.h>
> > #include <asm/desc.h>
> > +#include <asm/uaccess.h>
> > +
> > +static struct uprobe_ctlblk uprobe_ctlblk;
> > +struct uprobe *current_uprobe;
> >
> > int __kprobes arch_alloc_insn(struct kprobe *p)
> > {
> > @@ -69,3 +73,471 @@ int __kprobes arch_copy_uprobe(struct kp
> >
> > return ret;
> > }
> > +
> > +/*
> > + * This routine checks for space in the process's stack address space.
> > + * If enough address space is found, it returns the address of the free
> > + * stack space.
> > + */
> > +unsigned long __kprobes *find_stack_space_on_next_page(unsigned long stack_addr,
> > + int size, struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > + struct page *pg;
> > + int retval = 0;
> > +
> > + if (((stack_addr - sizeof(long long))) < (vma->vm_start + size))
> > + return NULL;
> > + addr = (stack_addr & PAGE_MASK) + PAGE_SIZE;
> > +
> > + retval = get_user_pages(current, current->mm,
> > + (unsigned long )addr, 1, 1, 0, &pg, NULL);
> > + if (retval)
> > + return NULL;
> > +
> > + return (unsigned long *) addr;
> > +}
> > +
> > +/*
> > + * This routine expands the stack beyond the present process address
> > + * space and returns the address of free stack space. This routine
> > + * must be called with mmap_sem held.
> > + */
> > +unsigned long __kprobes *find_stack_space_in_expanded_vma(int size,
> > + struct vm_area_struct *vma)
> > +{
> > + unsigned long addr, vm_addr;
> > + int retval = 0;
> > + struct vm_area_struct *new_vma;
> > + struct mm_struct *mm = current->mm;
> > + struct page *pg;
> > +
> > + vm_addr = vma->vm_start - size;
> > + new_vma = find_extend_vma(mm, vm_addr);
> > + if (!new_vma)
> > + return NULL;
> > +
> > + addr = new_vma->vm_start;
> > + retval = get_user_pages(current, current->mm,
> > + (unsigned long )addr, 1, 1, 0, &pg, NULL);
> > + if (retval)
> > + return NULL;
> > +
> > + return (unsigned long *) addr;
> > +}
> > +
> > +/*
> > + * This routine checks for stack free space below the stack pointer in the
> > + * current stack page. If there is not enough stack space, it returns NULL.
> > + */
> > +unsigned long __kprobes *find_stack_space_on_curr_page(unsigned long stack_addr,
> > + int size)
> > +{
> > + unsigned long page_addr;
> > +
> > + page_addr = stack_addr & PAGE_MASK;
> > +
> > + if (((stack_addr - sizeof(long long))) < (page_addr + size))
> > + return NULL;
> > +
> > + return (unsigned long *) page_addr;
> > +}
> > +
> > +/*
> > + * This routine finds free stack space of a given size, either on the
> > + * current stack page or on the next stack page. If no free stack
> > + * space is available, it expands the stack and returns the address of
> > + * the free stack space.
> > + */
> > +unsigned long __kprobes *find_stack_space(unsigned long stack_addr, int size)
> > +{
> > + unsigned long *addr;
> > + struct vm_area_struct *vma = NULL;
> > +
> > + addr = find_stack_space_on_curr_page(stack_addr, size);
> > + if (addr)
> > + return addr;
> > +
> > + if (!down_read_trylock(&current->mm->mmap_sem))
> > + down_read(&current->mm->mmap_sem);
> > +
> > + vma = find_vma(current->mm, (stack_addr & PAGE_MASK));
> > + if (!vma) {
> > + up_read(&current->mm->mmap_sem);
> > + return NULL;
> > + }
> > +
> > + addr = find_stack_space_on_next_page(stack_addr, size, vma);
> > + if (addr) {
> > + up_read(&current->mm->mmap_sem);
> > + return addr;
> > + }
> > +
> > + addr = find_stack_space_in_expanded_vma(size, vma);
> > + up_read(&current->mm->mmap_sem);
> > +
> > + if (!addr)
> > + return NULL;
> > +
> > + return addr;
> > +}
> > +
> > +/*
> > + * This routine gets the page containing the probe, maps it and
> > + * replaces the instruction at the probed address with the specified
> > + * opcode.
> > + */
> > +void __kprobes replace_original_insn(struct uprobe *uprobe,
> > + struct pt_regs *regs, kprobe_opcode_t opcode)
> > +{
> > + kprobe_opcode_t *addr;
> > + struct page *page;
> > +
> > + page = find_get_page(uprobe->inode->i_mapping,
> > + uprobe->offset >> PAGE_CACHE_SHIFT);
> > + BUG_ON(!page);
> > +
> > + addr = (kprobe_opcode_t *)kmap_atomic(page, KM_USER1);
> > + addr = (kprobe_opcode_t *)((unsigned long)addr +
> > + (unsigned long)(uprobe->offset & ~PAGE_MASK));
> > + *addr = opcode;
> > + /*TODO: flush vma ? */
> > + kunmap_atomic(addr, KM_USER1);
> > +
> > + flush_dcache_page(page);
> > +
> > + if (page)
> > + page_cache_release(page);
> > + regs->eip = (unsigned long)uprobe->kp.addr;
> > +}
> > +
> > +/*
> > + * This routine sets up single stepping out-of-line. If single
> > + * stepping out-of-line cannot be achieved, it restores the original
> > + * instruction so that it can be single stepped inline.
> > + */
> > +static int __kprobes prepare_singlestep_uprobe(struct uprobe *uprobe,
> > + struct uprobe_ctlblk *ucb, struct pt_regs *regs)
> > +{
> > + unsigned long *addr = NULL, stack_addr = regs->esp;
> > + int size = sizeof(kprobe_opcode_t) * MAX_INSN_SIZE;
> > + unsigned long *source = (unsigned long *)uprobe->kp.ainsn.insn;
> > +
> > + /*
> > + * Get free stack space to copy original instruction, so as to
> > + * single step out-of-line.
> > + */
> > + addr = find_stack_space(stack_addr, size);
> > + if (!addr)
> > + goto no_stack_space;
> > +
> > + /*
> > + * We are in_atomic and preemption is disabled at this point of
> > + * time. Copy original instruction on this per process stack
> > + * page so as to single step out-of-line.
> > + */
> > + if (__copy_to_user_inatomic((unsigned long *)addr, source, size))
> > + goto no_stack_space;
> > +
> > + regs->eip = (unsigned long)addr;
> > +
> > + regs->eflags |= TF_MASK;
> > + regs->eflags &= ~IF_MASK;
> > + ucb->uprobe_status = UPROBE_HIT_SS;
> > +
> > + ucb->upte = get_one_pte(regs->eip);
> > + if (!ucb->upte)
> > + goto no_stack_space;
> > + ucb->upage = pte_page(*ucb->upte);
> > + set_pte(ucb->upte, pte_mkdirty(*ucb->upte));
> > + ucb->singlestep_addr = regs->eip;
> > +
> > + return 0;
> > +
> > +no_stack_space:
> > + replace_original_insn(uprobe, regs, uprobe->kp.opcode);
> > + ucb->uprobe_status = UPROBE_SS_INLINE;
> > + ucb->singlestep_addr = regs->eip;
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * uprobe_handler() executes the user-specified handler and sets up
> > + * single stepping of the original instruction, either out-of-line or inline.
> > + */
> > +static int __kprobes uprobe_handler(struct pt_regs *regs)
> > +{
> > + struct kprobe *p;
> > + int ret = 0;
> > + kprobe_opcode_t *addr = NULL;
> > + struct uprobe_ctlblk *ucb = &uprobe_ctlblk;
> > + unsigned long limit;
> > +
> > + spin_lock_irqsave(&uprobe_lock, ucb->flags);
> > + /* preemption is disabled, remains disabled
> > + * until we single step on original instruction.
> > + */
> > + inc_preempt_count();
> > +
> > + addr = (kprobe_opcode_t *)(get_segment_eip(regs, &limit) - 1);
> > +
> > + p = get_uprobe(addr);
> > + if (!p) {
> > +
> > + if (*addr != BREAKPOINT_INSTRUCTION) {
> > + /*
> > + * The breakpoint instruction was removed right
> > + * after we hit it. Another cpu has removed
> > + * either a probe point or a debugger breakpoint
> > + * at this address. In either case, no further
> > + * handling of this interrupt is appropriate.
> > + * Back up over the (now missing) int3 and run
> > + * the original instruction.
> > + */
> > + regs->eip -= sizeof(kprobe_opcode_t);
> > + ret = 1;
> > + }
> > + /* Not one of ours: let kernel handle it */
> > + goto no_uprobe;
> > + }
> > +
> > + if (p->opcode == BREAKPOINT_INSTRUCTION) {
> > + /*
> > + * A breakpoint was already present before the probe was
> > + * inserted. Handling it might break compatibility with other
> > + * debuggers such as gdb, so we don't handle such probes.
> > + */
> > + current_uprobe = NULL;
> > + goto no_uprobe;
> > + }
> > +
> > + ucb->curr_p = p;
> > + ucb->tsk = current;
> > + ucb->uprobe_status = UPROBE_HIT_ACTIVE;
> > + ucb->uprobe_saved_eflags = (regs->eflags & (TF_MASK | IF_MASK));
> > + ucb->uprobe_old_eflags = (regs->eflags & (TF_MASK | IF_MASK));
> > +
> > + if (p->pre_handler && p->pre_handler(p, regs))
> > + /* handler has already set things up, so skip ss setup */
> > + return 1;
> > +
> > + prepare_singlestep_uprobe(current_uprobe, ucb, regs);
> > + /*
> > + * Avoid rescheduling the current task while returning from
> > + * kernel to user mode.
> > + */
> > + clear_need_resched();
> > + return 1;
> > +
> > +no_uprobe:
> > + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> > + dec_preempt_count();
> > +
> > + return ret;
> > +}
> > +
> > +/*
> > + * Called after single-stepping. p->addr is the address of the
> > + * instruction whose first byte has been replaced by the "int 3"
> > + * instruction. To avoid the SMP problems that can occur when we
> > + * temporarily put back the original opcode to single-step, we
> > + * single-stepped a copy of the instruction. The address of this
> > + * copy is p->ainsn.insn.
> > + *
> > + * This function prepares to return from the post-single-step
> > + * interrupt. We have to fix up the stack as follows:
> > + *
> > + * 0) Typically, the new eip is relative to the copied instruction. We
> > + * need to make it relative to the original instruction. Exceptions are
> > + * return instructions and absolute or indirect jump or call instructions.
> > + *
> > + * 1) If the single-stepped instruction was pushfl, then the TF and IF
> > + * flags are set in the just-pushed eflags, and may need to be cleared.
> > + *
> > + * 2) If the single-stepped instruction was a call, the return address
> > + * that is atop the stack is the address following the copied instruction.
> > + * We need to make it the address following the original instruction.
> > + */
> > +static void __kprobes resume_execution_user(struct kprobe *p,
> > + struct pt_regs *regs, struct uprobe_ctlblk *ucb)
> > +{
> > + unsigned long *tos = (unsigned long *)regs->esp;
> > + unsigned long next_eip = 0;
> > + unsigned long copy_eip = ucb->singlestep_addr;
> > + unsigned long orig_eip = (unsigned long)p->addr;
> > +
> > + switch (p->ainsn.insn[0]) {
> > + case 0x9c: /* pushfl */
> > + *tos &= ~(TF_MASK | IF_MASK);
> > + *tos |= ucb->uprobe_old_eflags;
> > + break;
> > + case 0xc3: /* ret/lret */
> > + case 0xcb:
> > + case 0xc2:
> > + case 0xca:
> > + next_eip = regs->eip;
> > + /* eip is already adjusted, no more changes required*/
> > + break;
> > + case 0xe8: /* call relative - Fix return addr */
> > + *tos = orig_eip + (*tos - copy_eip);
> > + break;
> > + case 0xff:
> > + if ((p->ainsn.insn[1] & 0x30) == 0x10) {
> > + /* call absolute, indirect */
> > + /* Fix return addr; eip is correct. */
> > + next_eip = regs->eip;
> > + *tos = orig_eip + (*tos - copy_eip);
> > + } else if (((p->ainsn.insn[1] & 0x31) == 0x20) ||
> > + ((p->ainsn.insn[1] & 0x31) == 0x21)) {
> > + /* jmp near or jmp far absolute indirect */
> > + /* eip is correct. */
> > + next_eip = regs->eip;
> > + }
> > + break;
> > + case 0xea: /* jmp absolute -- eip is correct */
> > + next_eip = regs->eip;
> > + break;
> > + default:
> > + break;
> > + }
> > +
> > + regs->eflags &= ~TF_MASK;
> > + if (next_eip)
> > + regs->eip = next_eip;
> > + else
> > + regs->eip = orig_eip + (regs->eip - copy_eip);
> > +}
> > +
> > +/*
> > + * post_uprobe_handler(), executes the user specified handlers and
> > + * resumes with the normal execution.
> > + */
> > +static int __kprobes post_uprobe_handler(struct pt_regs *regs)
> > +{
> > + struct kprobe *cur;
> > + struct uprobe_ctlblk *ucb;
> > +
> > + if (!current_uprobe)
> > + return 0;
> > +
> > + ucb = &uprobe_ctlblk;
> > + cur = ucb->curr_p;
> > +
> > + if (!cur || ucb->tsk != current)
> > + return 0;
> > +
> > + if (cur->post_handler) {
> > + if (ucb->uprobe_status == UPROBE_SS_INLINE)
> > + ucb->uprobe_status = UPROBE_SSDONE_INLINE;
> > + else
> > + ucb->uprobe_status = UPROBE_HIT_SSDONE;
> > + cur->post_handler(cur, regs, 0);
> > + }
> > +
> > + resume_execution_user(cur, regs, ucb);
> > + regs->eflags |= ucb->uprobe_saved_eflags;
> > +
> > + if (ucb->uprobe_status == UPROBE_SSDONE_INLINE)
> > + replace_original_insn(current_uprobe, regs,
> > + BREAKPOINT_INSTRUCTION);
> > + else
> > + pte_unmap(ucb->upte);
> > +
> > + current_uprobe = NULL;
> > + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> > + dec_preempt_count();
> > + /*
> > + * if somebody else is single stepping across a probe point, eflags
> > + * will have TF set, in which case, continue the remaining processing
> > + * of do_debug, as if this is not a probe hit.
> > + */
> > + if (regs->eflags & TF_MASK)
> > + return 0;
> > +
> > + return 1;
> > +}
> > +
> > +static int __kprobes uprobe_fault_handler(struct pt_regs *regs, int trapnr)
> > +{
> > + struct kprobe *cur;
> > + struct uprobe_ctlblk *ucb;
> > + int ret = 0;
> > +
> > + ucb = &uprobe_ctlblk;
> > + cur = ucb->curr_p;
> > +
> > + if (ucb->tsk != current || !cur)
> > + return 0;
> > +
> > + switch(ucb->uprobe_status) {
> > + case UPROBE_HIT_SS:
> > + pte_unmap(ucb->upte);
> > + /* TODO: Allow an acceptable number of faults before disabling */
> > + replace_original_insn(current_uprobe, regs, cur->opcode);
> > + /* Fall through and reset the current probe */
> > + case UPROBE_SS_INLINE:
> > + regs->eip = (unsigned long)cur->addr;
> > + regs->eflags |= ucb->uprobe_old_eflags;
> > + regs->eflags &= ~TF_MASK;
> > + current_uprobe = NULL;
> > + ret = 1;
> > + spin_unlock_irqrestore(&uprobe_lock, ucb->flags);
> > + preempt_enable_no_resched();
> > + break;
> > + case UPROBE_HIT_ACTIVE:
> > + case UPROBE_SSDONE_INLINE:
> > + case UPROBE_HIT_SSDONE:
> > + if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
> > + return 1;
> > +
> > + if (fixup_exception(regs))
> > + return 1;
> > + /*
> > + * We must not let the system page fault handler continue while
> > + * we hold a lock, since it can sleep and be rescheduled on a
> > + * different cpu. Hence return 1.
> > + */
> > + return 1;
> > + break;
> > + default:
> > + break;
> > + }
> > + return ret;
> > +}
> > +
> > +/*
> > + * Wrapper routine for handling exceptions.
> > + */
> > +int __kprobes uprobe_exceptions_notify(struct notifier_block *self,
> > + unsigned long val, void *data)
> > +{
> > + struct die_args *args = (struct die_args *)data;
> > + int ret = NOTIFY_DONE;
> > +
> > + if (args->regs->eflags & VM_MASK) {
> > + /* We are in virtual-8086 mode. Return NOTIFY_DONE */
> > + return ret;
> > + }
> > +
> > + switch (val) {
> > + case DIE_INT3:
> > + if (uprobe_handler(args->regs))
> > + ret = NOTIFY_STOP;
> > + break;
> > + case DIE_DEBUG:
> > + if (post_uprobe_handler(args->regs))
> > + ret = NOTIFY_STOP;
> > + break;
> > + case DIE_GPF:
> > + case DIE_PAGE_FAULT:
> > + if (current_uprobe &&
> > + uprobe_fault_handler(args->regs, args->trapnr))
> > + ret = NOTIFY_STOP;
> > + break;
> > + default:
> > + break;
> > + }
> > + return ret;
> > +}
> > diff -puN arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/kprobes.c
> > --- linux-2.6.17-rc3-mm1/arch/i386/kernel/kprobes.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> > +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/kprobes.c 2006-05-09 12:40:48.000000000 +0530
> > @@ -139,7 +139,7 @@ retry:
> > /*
> > * returns non-zero if opcode modifies the interrupt flag.
> > */
> > -static int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
> > +int __kprobes is_IF_modifier(kprobe_opcode_t opcode)
> > {
> > switch (opcode) {
> > case 0xfa: /* cli */
> > @@ -649,7 +649,7 @@ int __kprobes kprobe_exceptions_notify(s
> > int ret = NOTIFY_DONE;
> >
> > if (args->regs && user_mode(args->regs))
> > - return ret;
> > + return uprobe_exceptions_notify(self, val, data);
> >
> > switch (val) {
> > case DIE_INT3:
> > diff -puN arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line arch/i386/mm/fault.c
> > --- linux-2.6.17-rc3-mm1/arch/i386/mm/fault.c~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> > +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/mm/fault.c 2006-05-09 12:40:48.000000000 +0530
> > @@ -104,8 +104,7 @@ void bust_spinlocks(int yes)
> > *
> > * This is slow, but is very rarely executed.
> > */
> > -static inline unsigned long get_segment_eip(struct pt_regs *regs,
> > - unsigned long *eip_limit)
> > +unsigned long get_segment_eip(struct pt_regs *regs, unsigned long *eip_limit)
> > {
> > unsigned long eip = regs->eip;
> > unsigned seg = regs->xcs & 0xffff;
> > diff -puN arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line arch/i386/kernel/Makefile
> > --- linux-2.6.17-rc3-mm1/arch/i386/kernel/Makefile~kprobes_userspace_probes-ss-out-of-line 2006-05-09 12:40:48.000000000 +0530
> > +++ linux-2.6.17-rc3-mm1-prasanna/arch/i386/kernel/Makefile 2006-05-09 12:40:48.000000000 +0530
> > @@ -27,7 +27,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
> > obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> > obj-$(CONFIG_X86_NUMAQ) += numaq.o
> > obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o
> > -obj-$(CONFIG_KPROBES) += kprobes.o
> > +obj-$(CONFIG_KPROBES) += kprobes.o uprobes.o
> > obj-$(CONFIG_MODULES) += module.o
> > obj-y += sysenter.o vsyscall.o
> > obj-$(CONFIG_ACPI_SRAT) += srat.o
> >
> > _
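
The address fix-up in resume_execution_user() quoted above can be tried
out in user space. What follows is only a sketch of the arithmetic, not
the kernel code: relocate(), is_indirect_call() and is_indirect_jmp() are
names made up for illustration, and 32-bit addresses are assumed, as on
i386. It also assumes that the patch's two "& 0x31" comparisons for jmp
amount to the single "& 0x30" test used here.

```c
#include <stdint.h>

/*
 * Map an address expressed relative to the out-of-line copy back to the
 * original probe site: orig + (addr - copy). The patch applies this both
 * to regs->eip and to the return address pushed by a call.
 */
uint32_t relocate(uint32_t addr, uint32_t copy, uint32_t orig)
{
	return orig + (addr - copy);
}

/*
 * For opcode 0xff the ModRM reg field (bits 5:3) selects the operation.
 * /2 and /3 are call near/far indirect; masking with 0x30 ignores bit 3,
 * so both encodings compare equal to 0x10.
 */
int is_indirect_call(uint8_t modrm)
{
	return (modrm & 0x30) == 0x10;
}

/* /4 and /5 are jmp near/far indirect; both compare equal to 0x20. */
int is_indirect_jmp(uint8_t modrm)
{
	return (modrm & 0x30) == 0x20;
}
```

For instance, with the probe at 0x08048400 and the copy at 0xbffff000, a
five-byte relative call stepped in the copy pushes 0xbffff005, and
relocate() turns that into 0x08048405, the address following the original
instruction. "ff d0" (call *%eax) satisfies is_indirect_call(), so only
the pushed return address needs relocating and eip is left alone.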

2006-05-10 19:16:39

by Hugh Dickins

Subject: Re: [RFC] [PATCH 6/6] Kprobes: Remove breakpoints from the copied pages

On Wed, 10 May 2006, Prasanna S Panchamukhi wrote:
>
> As Andi Kleen and Christoph suggested pagecache contention can be avoided
> using the COW approach.

Yes, COWing is fine, I've no pagecache qualms if you go that way.

> Some thoughts about COW implications AFAIK
> 1. Need to hookup mmap() to make a per process copy.
I don't understand.
> 2. Bring in the pages just to insert the probes.
They'll be faulted in to insert probes, yes.
> 3. All the text pages need to be in memory until process exits.
No, COWed pages can go out to swap.
> 4. Free up the per process text pages by hooking exit() and exec().
exit_mmap frees process pages on exit and exec.
> 5. Maskoff probes visible across fork(), by hooking fork().
Depends on what semantics you want across fork.

Perhaps I don't understand, but you seem to be overcomplicating it,
seeing VM (mm) problems which would be automatically handled for you.

Wouldn't you just use ptrace(2) to insert and remove your uprobes?
See access_process_vm in kernel/ptrace.c for what that would do.
It goes on to get_user_pages to do the faulting and COWing.

Hugh
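
The ptrace(2) route Hugh describes can be observed from user space. The
sketch below is only an illustration under stated assumptions:
probe_target() and poke_is_private() are hypothetical names, Linux with
glibc is assumed, and the 0xcc byte is the i386 breakpoint opcode on a
little-endian machine. It forks a child, pokes a breakpoint byte into the
child's text with PTRACE_POKETEXT, and checks that the parent's copy of
the same word is unchanged, which is what COWing the page should give.

```c
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe target; its address just gives us a text page. */
int probe_target(int x) { return x + 1; }

/*
 * Returns 1 if the poked word is visible in the child but not in the
 * parent, 0 if not, and -1 if fork or ptrace is unavailable here.
 */
int poke_is_private(void)
{
	/* PEEKTEXT/POKETEXT move one long at a time; use an aligned word. */
	uintptr_t addr = (uintptr_t)&probe_target & ~(uintptr_t)(sizeof(long) - 1);
	long before, patched, after, mine;
	int status, result = -1;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: ask to be traced, then stop so the parent can poke. */
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
			_exit(1);
		raise(SIGSTOP);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
		return -1;	/* child never stopped; nothing left to clean up */

	errno = 0;
	before = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
	if (before == -1 && errno != 0)
		goto out;
	/* Put 0xcc in the lowest-addressed byte (little-endian assumption). */
	patched = (before & ~0xffL) | 0xcc;
	if (ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched) == -1)
		goto out;
	errno = 0;
	after = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
	if (after == -1 && errno != 0)
		goto out;
	/* Read the same word from our own address space for comparison. */
	memcpy(&mine, (const void *)addr, sizeof(mine));
	result = (after == patched && mine == before);
out:
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return result;
}
```

The child is killed before it runs again, so the breakpoint is never
executed. A return value of -1 only means ptrace was refused, for example
by a sandbox or a restrictive ptrace policy, and says nothing about COW.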