2018-02-28 07:54:17

by Ravi Bangoria

[permalink] [raw]
Subject: [RFC 0/4] trace_uprobe: Support SDT markers having semaphore

Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. These markers are added by developer at
important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. If this computaion is quite more, execution of it can be
ommited by runtime if() condition when no one is tracing on the marker:

if (semaphore > 0) {
Execute marker instructions;
}

Default value of semaphore is 0. Tracer has to increment the semaphore
before recording on a marker and decrement it at the end of tracing.

Currently, perf tool has limited supports for SDT markers. I.e. it
can not trace markers surrounded by semaphore. Also, it's not easy
to add semaphore flip logic in userspace tool like perf, so basic
idea for this patchset is to add semaphore flip logic in the
trace_uprobe infrastructure. Ex,[2]

# cat tick.c
...
for (i = 0; i < 100; i++) {
DTRACE_PROBE1(tick, loop1, i);
if (TICK_LOOP2_ENABLED()) {
DTRACE_PROBE1(tick, loop2, i);
}
printf("hi: %d\n", i);
sleep(1);
}
...

Here tick:loop1 is marker without semaphore where as tick:loop2 is
surrounded by semaphore.


# perf buildid-cache --add /tmp/tick
# perf probe sdt_tick:loop1
# perf probe sdt_tick:loop2

# perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop1
0 sdt_tick:loop2
2.747086086 seconds time elapsed


Perf failed to record data for tick:loop2. Same experiment with this
patch series:


# readelf -n ./tick
Provider: tick
Name: loop2
... Semaphore: 0x0000000010020036

# readelf -SW ./tick | grep probes
[25] .probes PROGBITS 0000000010020034 010034


Semaphore offset is 0x10036. I don't have educated 'perf probe'
about semaphore. So instead of using 'perf probe' command, I'm
manually adding entry in the <tracefs>/uprobe_events file.
Special char * denotes semaphore offset.


# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events

# perf stat -e sdt_tick:loop2 -- /tmp/tick
hi: 0
hi: 1
hi: 2
hi: 3
^C
Performance counter stats for '/tmp/tick':
4 sdt_tick:loop2
3.359047827 seconds time elapsed


Feedback?

TODO:
- Educate perf tool about semaphore.
- perf_event_open() now suppoers {k,u}probe event creation[3]. If we
can supply semaphore offset in perf_event_attr, perf_event_open()
can be educated to probe SDT marker having semaphore. Though, both
config1 and config2 are already being used for uprobe and I don't
see any other attribute which I can use for semaphore offset. Can
we introduce one more config there? config3?

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
[2] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
[3] https://lkml.org/lkml/2017/12/6/976


Ravi Bangoria (4):
Uprobe: Rename map_info to uprobe_map_info
Uprobe: Export few functions / data structures
trace_uprobe: Support SDT markers having semaphore
trace_uprobe: Fix multiple update of same semaphores

include/linux/uprobes.h | 25 +++++
kernel/events/uprobes.c | 43 ++++----
kernel/trace/trace_uprobe.c | 244 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 290 insertions(+), 22 deletions(-)

--
1.8.3.1



2018-02-28 07:53:04

by Ravi Bangoria

[permalink] [raw]
Subject: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore

Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. These markers are added by developer at
important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. If this computaion is quite more, execution of it can be
ommited by runtime if() condition when no one is tracing on the marker:

if (semaphore > 0) {
Execute marker instructions;
}

Default value of semaphore is 0. Tracer has to increment the semaphore
before recording on a marker and decrement it at the end of tracing.

Implement the semaphore flip logic in trace_uprobe, leaving core uprobe
infrastructure as is, except one new callback from uprobe_mmap() to
trace_uprobe.

There are two major scenarios while enabling the marker,
1. Trace already running process. In this case, find all suitable
processes and increment the semaphore value in them.
2. Trace is already enabled when target binary is executed. In this
case, all mmaps will get notified to trace_uprobe and trace_uprobe
will increment the semaphore if corresponding uprobe is enabled.

At the time of disabling probes, decrement semaphore in all existing
target processes.

Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/uprobes.h | 2 +
kernel/events/uprobes.c | 5 ++
kernel/trace/trace_uprobe.c | 145 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 152 insertions(+)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 06c169e..04e9d57 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -121,6 +121,8 @@ struct uprobe_map_info {
unsigned long vaddr;
};

+extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 56dd7af..81d8aaf 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1051,6 +1051,8 @@ static void build_probe_list(struct inode *inode,
spin_unlock(&uprobes_treelock);
}

+void (*uprobe_mmap_callback)(struct vm_area_struct *vma) = NULL;
+
/*
* Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
*
@@ -1063,6 +1065,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
struct uprobe *uprobe, *u;
struct inode *inode;

+ if (vma->vm_flags & VM_WRITE && uprobe_mmap_callback)
+ uprobe_mmap_callback(vma);
+
if (no_uprobe_events() || !valid_vma(vma, true))
return 0;

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 40592e7b..d14aafc 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -25,6 +25,7 @@
#include <linux/namei.h>
#include <linux/string.h>
#include <linux/rculist.h>
+#include <linux/sched/mm.h>

#include "trace_probe.h"

@@ -58,6 +59,7 @@ struct trace_uprobe {
struct inode *inode;
char *filename;
unsigned long offset;
+ unsigned long sdt_offset; /* sdt semaphore offset */
unsigned long nhit;
struct trace_probe tp;
};
@@ -502,6 +504,16 @@ static int create_trace_uprobe(int argc, char **argv)
for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
struct probe_arg *parg = &tu->tp.args[i];

+ /* This is not really an argument. */
+ if (argv[i][0] == '*') {
+ ret = kstrtoul(&(argv[i][1]), 0, &tu->sdt_offset);
+ if (ret) {
+ pr_info("Invalid semaphore address.\n");
+ goto error;
+ }
+ continue;
+ }
+
/* Increment count for freeing args in error case */
tu->tp.nr_args++;

@@ -894,6 +906,131 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
return trace_handle_return(s);
}

+static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
+{
+ unsigned long vaddr = offset_to_vaddr(vma, tu->sdt_offset);
+
+ return tu->sdt_offset &&
+ vma->vm_file &&
+ file_inode(vma->vm_file) == tu->inode &&
+ vma->vm_flags & VM_WRITE &&
+ vma->vm_start <= vaddr &&
+ vma->vm_end > vaddr;
+}
+
+static struct vm_area_struct *
+sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
+{
+ struct vm_area_struct *tmp;
+
+ for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
+ if (sdt_valid_vma(tu, tmp))
+ return tmp;
+
+ return NULL;
+}
+
+static int
+sdt_update_sem(struct mm_struct *mm, unsigned long vaddr, short val)
+{
+ struct page *page;
+ struct vm_area_struct *vma;
+ int ret = 0;
+ unsigned short orig = 0;
+
+ if (vaddr == 0)
+ return -EINVAL;
+
+ ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+ FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
+ if (ret <= 0)
+ return ret;
+
+ copy_from_page(page, vaddr, &orig, sizeof(orig));
+ orig += val;
+ copy_to_page(page, vaddr, &orig, sizeof(orig));
+ put_page(page);
+ return 0;
+}
+
+/*
+ * TODO: Adding this defination in include/linux/uprobes.h throws
+ * warnings about address_sapce. Adding it here for the time being.
+ */
+struct uprobe_map_info *build_uprobe_map_info(struct address_space *mapping, loff_t offset, bool is_register);
+
+static void sdt_increment_sem(struct trace_uprobe *tu)
+{
+ struct uprobe_map_info *info;
+ struct vm_area_struct *vma;
+ unsigned long vaddr;
+
+ uprobe_start_dup_mmap();
+ info = build_uprobe_map_info(tu->inode->i_mapping, tu->sdt_offset, false);
+ if (IS_ERR(info))
+ goto out;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+ vma = sdt_find_vma(info->mm, tu);
+ if (!vma)
+ goto cont;
+
+ vaddr = offset_to_vaddr(vma, tu->sdt_offset);
+ sdt_update_sem(info->mm, vaddr, 1);
+
+cont:
+ up_write(&info->mm->mmap_sem);
+ mmput(info->mm);
+ info = free_uprobe_map_info(info);
+ }
+
+out:
+ uprobe_end_dup_mmap();
+}
+
+/* Called with down_write(&vma->vm_mm->mmap_sem) */
+void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
+{
+ struct trace_uprobe *tu;
+ unsigned long vaddr;
+
+ mutex_lock(&uprobe_lock);
+ list_for_each_entry(tu, &uprobe_list, list) {
+ if (!sdt_valid_vma(tu, vma) ||
+ !trace_probe_is_enabled(&tu->tp))
+ continue;
+
+ vaddr = offset_to_vaddr(vma, tu->sdt_offset);
+ sdt_update_sem(vma->vm_mm, vaddr, 1);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+static void sdt_decrement_sem(struct trace_uprobe *tu)
+{
+ struct vm_area_struct *vma;
+ unsigned long vaddr;
+ struct uprobe_map_info *info;
+
+ info = build_uprobe_map_info(tu->inode->i_mapping, tu->sdt_offset, false);
+ if (IS_ERR(info))
+ return;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+ vma = sdt_find_vma(info->mm, tu);
+ if (vma) {
+ vaddr = offset_to_vaddr(vma, tu->sdt_offset);
+ sdt_update_sem(info->mm, vaddr, -1);
+ }
+ up_write(&info->mm->mmap_sem);
+
+ mmput(info->mm);
+ info = free_uprobe_map_info(info);
+ }
+}
+
typedef bool (*filter_func_t)(struct uprobe_consumer *self,
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
@@ -939,6 +1076,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
if (ret)
goto err_buffer;

+ if (tu->sdt_offset)
+ sdt_increment_sem(tu);
+
return 0;

err_buffer:
@@ -979,6 +1119,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,

WARN_ON(!uprobe_filter_is_empty(&tu->filter));

+ if (tu->sdt_offset)
+ sdt_decrement_sem(tu);
+
uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;

@@ -1353,6 +1496,8 @@ static __init int init_uprobe_trace(void)
/* Profile interface */
trace_create_file("uprobe_profile", 0444, d_tracer,
NULL, &uprobe_profile_ops);
+
+ uprobe_mmap_callback = trace_uprobe_mmap_callback;
return 0;
}

--
1.8.3.1


2018-02-28 07:54:00

by Ravi Bangoria

[permalink] [raw]
Subject: [RFC 2/4] Uprobe: Export few functions / data structures

These functions and data structures will be used by other files
in later patches.

Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/uprobes.h | 23 +++++++++++++++++++++++
kernel/events/uprobes.c | 20 ++++++--------------
2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e9..06c169e 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -115,6 +115,12 @@ struct uprobes_state {
struct xol_area *xol_area;
};

+struct uprobe_map_info {
+ struct uprobe_map_info *next;
+ struct mm_struct *mm;
+ unsigned long vaddr;
+};
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
@@ -149,6 +155,11 @@ struct uprobes_state {
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
+unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset);
+void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
+void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len);
+struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info);
+
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
@@ -203,5 +214,17 @@ static inline void uprobe_copy_process(struct task_struct *t, unsigned long flag
static inline void uprobe_clear_state(struct mm_struct *mm)
{
}
+unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+{
+}
+void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+{
+}
+void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+{
+}
+struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info)
+{
+}
#endif /* !CONFIG_UPROBES */
#endif /* _LINUX_UPROBES_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fcce25dd..56dd7af 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -130,7 +130,7 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
return vma->vm_file && (vma->vm_flags & flags) == VM_MAYEXEC;
}

-static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
{
return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
}
@@ -240,14 +240,14 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
return is_swbp_insn(insn);
}

-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
{
void *kaddr = kmap_atomic(page);
memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
kunmap_atomic(kaddr);
}

-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
{
void *kaddr = kmap_atomic(page);
memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
@@ -705,23 +705,15 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}

-struct uprobe_map_info {
- struct uprobe_map_info *next;
- struct mm_struct *mm;
- unsigned long vaddr;
-};
-
-static inline struct uprobe_map_info *
-free_uprobe_map_info(struct uprobe_map_info *info)
+struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info)
{
struct uprobe_map_info *next = info->next;
kfree(info);
return next;
}

-static struct uprobe_map_info *
-build_uprobe_map_info(struct address_space *mapping, loff_t offset,
- bool is_register)
+struct uprobe_map_info *build_uprobe_map_info(struct address_space *mapping,
+ loff_t offset, bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
--
1.8.3.1


2018-02-28 07:54:00

by Ravi Bangoria

[permalink] [raw]
Subject: [RFC 1/4] Uprobe: Rename map_info to uprobe_map_info

map_info is very generic name, rename it to uprobe_map_info.
Renaming will help to export this structure outside of the
file.

Signed-off-by: Ravi Bangoria <[email protected]>
---
kernel/events/uprobes.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 267f6ef..fcce25dd 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -705,27 +705,29 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}

-struct map_info {
- struct map_info *next;
+struct uprobe_map_info {
+ struct uprobe_map_info *next;
struct mm_struct *mm;
unsigned long vaddr;
};

-static inline struct map_info *free_map_info(struct map_info *info)
+static inline struct uprobe_map_info *
+free_uprobe_map_info(struct uprobe_map_info *info)
{
- struct map_info *next = info->next;
+ struct uprobe_map_info *next = info->next;
kfree(info);
return next;
}

-static struct map_info *
-build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
+static struct uprobe_map_info *
+build_uprobe_map_info(struct address_space *mapping, loff_t offset,
+ bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
- struct map_info *curr = NULL;
- struct map_info *prev = NULL;
- struct map_info *info;
+ struct uprobe_map_info *curr = NULL;
+ struct uprobe_map_info *prev = NULL;
+ struct uprobe_map_info *info;
int more = 0;

again:
@@ -739,7 +741,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
* Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
* reclaim. This is optimistic, no harm done if it fails.
*/
- prev = kmalloc(sizeof(struct map_info),
+ prev = kmalloc(sizeof(struct uprobe_map_info),
GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (prev)
prev->next = NULL;
@@ -772,7 +774,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
}

do {
- info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
+ info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL);
if (!info) {
curr = ERR_PTR(-ENOMEM);
goto out;
@@ -784,7 +786,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
goto again;
out:
while (prev)
- prev = free_map_info(prev);
+ prev = free_uprobe_map_info(prev);
return curr;
}

@@ -792,11 +794,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
{
bool is_register = !!new;
- struct map_info *info;
+ struct uprobe_map_info *info;
int err = 0;

percpu_down_write(&dup_mmap_sem);
- info = build_map_info(uprobe->inode->i_mapping,
+ info = build_uprobe_map_info(uprobe->inode->i_mapping,
uprobe->offset, is_register);
if (IS_ERR(info)) {
err = PTR_ERR(info);
@@ -835,7 +837,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
up_write(&mm->mmap_sem);
free:
mmput(mm);
- info = free_map_info(info);
+ info = free_uprobe_map_info(info);
}
out:
percpu_up_write(&dup_mmap_sem);
--
1.8.3.1


2018-02-28 08:04:14

by Ravi Bangoria

[permalink] [raw]
Subject: [RFC 4/4] trace_uprobe: Fix multiple update of same semaphores

For tiny binaries/libraries, different mmap regions points to
the same file portion. In such cases, we may increment semaphore
multiple times. But while de-registration, semaphore will get
decremented only once, leaving semaphore > 0 even if no one is
tracing on that marker.

Ensure increment and decrement happens in sync by keeping list
of mms in trace_uprobe. Increment semaphore only if mm is not
present in the list and decrement only if mm is present in the
list.

Example

# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events

Before patch:

# echo 1 > events/sdt_tick/loop2/enable
# ./Workspace/sdt_prog/tick &
# dd if=/proc//mem bs=1 count=1 skip=268566582 2>/dev/null | xxd
0000000: 02 .

# echo 0 > events/sdt_tick/loop2/enable
# dd if=/proc//mem bs=1 count=1 skip=268566582 2>/dev/null | xxd
0000000: 01 .

After patch:

# echo 1 > events/sdt_tick/loop2/enable
# ./Workspace/sdt_prog/tick &
# dd if=/proc//mem bs=1 count=1 skip=268566582 2>/dev/null | xxd
0000000: 01 .

# echo 0 > events/sdt_tick/loop2/enable
# dd if=/proc//mem bs=1 count=1 skip=268566582 2>/dev/null | xxd
0000000: 00 .

Signed-off-by: Ravi Bangoria <[email protected]>
---
kernel/trace/trace_uprobe.c | 105 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 102 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index d14aafc..3f1e8bd 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -49,6 +49,11 @@ struct trace_uprobe_filter {
struct list_head perf_events;
};

+struct sdt_mm_list {
+ struct mm_struct *mm;
+ struct sdt_mm_list *next;
+};
+
/*
* uprobe event core functions
*/
@@ -60,6 +65,8 @@ struct trace_uprobe {
char *filename;
unsigned long offset;
unsigned long sdt_offset; /* sdt semaphore offset */
+ struct sdt_mm_list *sml;
+ struct rw_semaphore sml_rw_sem;
unsigned long nhit;
struct trace_probe tp;
};
@@ -273,6 +280,7 @@ static inline bool is_ret_probe(struct trace_uprobe *tu)
if (is_ret)
tu->consumer.ret_handler = uretprobe_dispatcher;
init_trace_uprobe_filter(&tu->filter);
+ init_rwsem(&tu->sml_rw_sem);
return tu;

error:
@@ -953,6 +961,75 @@ static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
return 0;
}

+static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *tmp = tu->sml;
+
+ if (!tu->sml || !mm)
+ return false;
+
+ while (tmp) {
+ if (tmp->mm == mm)
+ return true;
+ tmp = tmp->next;
+ }
+
+ return false;
+}
+
+static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *tmp;
+
+ tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
+ if (!tmp) {
+ pr_info("sdt_add_mm_list failed.\n");
+ return;
+ }
+ tmp->mm = mm;
+ tmp->next = tu->sml;
+ tu->sml = tmp;
+}
+
+static void sdt_del_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *prev, *curr;
+
+ if (!tu->sml)
+ return;
+
+ if (tu->sml->mm == mm) {
+ curr = tu->sml;
+ tu->sml = tu->sml->next;
+ kfree(curr);
+ return;
+ }
+
+ prev = tu->sml;
+ curr = tu->sml->next;
+ while (curr) {
+ if (curr->mm == mm) {
+ prev->next = curr->next;
+ kfree(curr);
+ return;
+ }
+ prev = curr;
+ curr = curr->next;
+ }
+}
+
+static void sdt_flush_mm_list(struct trace_uprobe *tu)
+{
+ struct sdt_mm_list *next, *curr = tu->sml;
+
+ while (curr) {
+ next = curr->next;
+ kfree(curr);
+ curr = next;
+ }
+ tu->sml = NULL;
+}
+
/*
* TODO: Adding this defination in include/linux/uprobes.h throws
* warnings about address_sapce. Adding it here for the time being.
@@ -970,20 +1047,26 @@ static void sdt_increment_sem(struct trace_uprobe *tu)
if (IS_ERR(info))
goto out;

+ down_write(&tu->sml_rw_sem);
while (info) {
down_write(&info->mm->mmap_sem);
vma = sdt_find_vma(info->mm, tu);
if (!vma)
goto cont;

+ if (sdt_check_mm_list(tu, info->mm))
+ goto cont;
+
vaddr = offset_to_vaddr(vma, tu->sdt_offset);
- sdt_update_sem(info->mm, vaddr, 1);
+ if (!sdt_update_sem(info->mm, vaddr, 1))
+ sdt_add_mm_list(tu, info->mm);

cont:
up_write(&info->mm->mmap_sem);
mmput(info->mm);
info = free_uprobe_map_info(info);
}
+ up_write(&tu->sml_rw_sem);

out:
uprobe_end_dup_mmap();
@@ -1001,8 +1084,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
!trace_probe_is_enabled(&tu->tp))
continue;

+ down_write(&tu->sml_rw_sem);
+ if (sdt_check_mm_list(tu, vma->vm_mm))
+ goto cont;
+
vaddr = offset_to_vaddr(vma, tu->sdt_offset);
- sdt_update_sem(vma->vm_mm, vaddr, 1);
+ if (!sdt_update_sem(vma->vm_mm, vaddr, 1))
+ sdt_add_mm_list(tu, vma->vm_mm);
+
+cont:
+ up_write(&tu->sml_rw_sem);
}
mutex_unlock(&uprobe_lock);
}
@@ -1017,7 +1108,11 @@ static void sdt_decrement_sem(struct trace_uprobe *tu)
if (IS_ERR(info))
return;

+ down_write(&tu->sml_rw_sem);
while (info) {
+ if (!sdt_check_mm_list(tu, info->mm))
+ goto cont;
+
down_write(&info->mm->mmap_sem);
vma = sdt_find_vma(info->mm, tu);
if (vma) {
@@ -1025,10 +1120,14 @@ static void sdt_decrement_sem(struct trace_uprobe *tu)
sdt_update_sem(info->mm, vaddr, -1);
}
up_write(&info->mm->mmap_sem);
-
+ sdt_del_mm_list(tu, info->mm);
+
+cont:
mmput(info->mm);
info = free_uprobe_map_info(info);
}
+ sdt_flush_mm_list(tu);
+ up_write(&tu->sml_rw_sem);
}

typedef bool (*filter_func_t)(struct uprobe_consumer *self,
--
1.8.3.1


2018-02-28 12:07:22

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC 0/4] trace_uprobe: Support SDT markers having semaphore

* Ravi Bangoria <[email protected]> [2018-02-28 13:23:41]:

> # readelf -n ./tick
> Provider: tick
> Name: loop2
> ... Semaphore: 0x0000000010020036
>
> # readelf -SW ./tick | grep probes
> [25] .probes PROGBITS 0000000010020034 010034
>
>
> Semaphore offset is 0x10036. I don't have educated 'perf probe'
> about semaphore. So instead of using 'perf probe' command, I'm
> manually adding entry in the <tracefs>/uprobe_events file.
> Special char * denotes semaphore offset.
>
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events
>
> # perf stat -e sdt_tick:loop2 -- /tmp/tick
> hi: 0
> hi: 1
> hi: 2
> hi: 3
> ^C
> Performance counter stats for '/tmp/tick':
> 4 sdt_tick:loop2
> 3.359047827 seconds time elapsed
>
>
> Feedback?
>
> TODO:
> - Educate perf tool about semaphore.
>

Is it possible to extend perf buildcache with a new option to work with
semaphore?


2018-02-28 12:13:20

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC 1/4] Uprobe: Rename map_info to uprobe_map_info

>
> -static inline struct map_info *free_map_info(struct map_info *info)
> +static inline struct uprobe_map_info *
> +free_uprobe_map_info(struct uprobe_map_info *info)
> {
> - struct map_info *next = info->next;
> + struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct map_info *
> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
> +static struct uprobe_map_info *
> +build_uprobe_map_info(struct address_space *mapping, loff_t offset,
> + bool is_register)

Imho, uprobe_build_map_info may be better than build_uprobe_map_info,
similarly uprobe_free_map_info.


2018-02-28 12:26:35

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC 2/4] Uprobe: Export few functions / data structures

> @@ -149,6 +155,11 @@ struct uprobes_state {
> extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> void *src, unsigned long len);
> +unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset);
> +void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
> +void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len);
> +struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info);
> +
> #else /* !CONFIG_UPROBES */

If we have to export the above, we might have to work with mm maintainers and
see if we can move them there.

> -static inline struct uprobe_map_info *
> -free_uprobe_map_info(struct uprobe_map_info *info)
> +struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info)
> {
> struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct uprobe_map_info *
> -build_uprobe_map_info(struct address_space *mapping, loff_t offset,
> - bool is_register)
> +struct uprobe_map_info *build_uprobe_map_info(struct address_space *mapping,
> + loff_t offset, bool is_register)
> {
> unsigned long pgoff = offset >> PAGE_SHIFT;
> struct vm_area_struct *vma;

Instead of exporting, did you look at extending the uprobe consumer with
ops. i.e if the consumer detects that a probe is a semaphore and exports
a set of callbacks which can them be called from uprobe
insertion/deletion time. With such a thing, incrementing/decrementing
the semaphore and the insertion/deletion of the breakpoint can be done
at one shot. No?


2018-02-28 14:28:39

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC 0/4] trace_uprobe: Support SDT markers having semaphore

Hi Ravi,

Thank you for your great work!

On Wed, 28 Feb 2018 13:23:41 +0530
Ravi Bangoria <[email protected]> wrote:

> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. These markers are added by developer at
> important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. If this computaion is quite more, execution of it can be
> ommited by runtime if() condition when no one is tracing on the marker:
>
> if (semaphore > 0) {
> Execute marker instructions;
> }
>
> Default value of semaphore is 0. Tracer has to increment the semaphore
> before recording on a marker and decrement it at the end of tracing.
>
> Currently, perf tool has limited supports for SDT markers. I.e. it
> can not trace markers surrounded by semaphore. Also, it's not easy
> to add semaphore flip logic in userspace tool like perf, so basic
> idea for this patchset is to add semaphore flip logic in the
> trace_uprobe infrastructure. Ex,[2]
>
> # cat tick.c
> ...
> for (i = 0; i < 100; i++) {
> DTRACE_PROBE1(tick, loop1, i);
> if (TICK_LOOP2_ENABLED()) {
> DTRACE_PROBE1(tick, loop2, i);
> }
> printf("hi: %d\n", i);
> sleep(1);
> }
> ...
>
> Here tick:loop1 is marker without semaphore where as tick:loop2 is
> surrounded by semaphore.
>
>
> # perf buildid-cache --add /tmp/tick
> # perf probe sdt_tick:loop1
> # perf probe sdt_tick:loop2
>
> # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
> hi: 0
> hi: 1
> hi: 2
> ^C
> Performance counter stats for '/tmp/tick':
> 3 sdt_tick:loop1
> 0 sdt_tick:loop2
> 2.747086086 seconds time elapsed
>
>
> Perf failed to record data for tick:loop2. Same experiment with this
> patch series:
>
>
> # readelf -n ./tick
> Provider: tick
> Name: loop2
> ... Semaphore: 0x0000000010020036
>
> # readelf -SW ./tick | grep probes
> [25] .probes PROGBITS 0000000010020034 010034
>
>
> Semaphore offset is 0x10036. I don't have educated 'perf probe'
> about semaphore. So instead of using 'perf probe' command, I'm
> manually adding entry in the <tracefs>/uprobe_events file.

Ok, it is easy to pass semaphore address via perf probe :)

> Special char * denotes semaphore offset.
>
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events

IMHO, this syntax is no good, separate with space is only for arguments.
Since the semaphore is per-probe-point based, that should be specified with probe point.
(there are no 2 or more semaphores on 1 event, are there?)
So something like

# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events

would be better to me.

Thank you,

>
> # perf stat -e sdt_tick:loop2 -- /tmp/tick
> hi: 0
> hi: 1
> hi: 2
> hi: 3
> ^C
> Performance counter stats for '/tmp/tick':
> 4 sdt_tick:loop2
> 3.359047827 seconds time elapsed
>
>
> Feedback?
>
> TODO:
> - Educate perf tool about semaphore.
> - perf_event_open() now suppoers {k,u}probe event creation[3]. If we
> can supply semaphore offset in perf_event_attr, perf_event_open()
> can be educated to probe SDT marker having semaphore. Though, both
> config1 and config2 are already being used for uprobe and I don't
> see any other attribute which I can use for semaphore offset. Can
> we introduce one more config there? config3?
>
> [1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
> [2] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
> [3] https://lkml.org/lkml/2017/12/6/976
>
>
> Ravi Bangoria (4):
> Uprobe: Rename map_info to uprobe_map_info
> Uprobe: Export few functions / data structures
> trace_uprobe: Support SDT markers having semaphore
> trace_uprobe: Fix multiple update of same semaphores
>
> include/linux/uprobes.h | 25 +++++
> kernel/events/uprobes.c | 43 ++++----
> kernel/trace/trace_uprobe.c | 244 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 290 insertions(+), 22 deletions(-)
>
> --
> 1.8.3.1
>


--
Masami Hiramatsu <[email protected]>

2018-03-01 05:09:35

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 0/4] trace_uprobe: Support SDT markers having semaphore



On 02/28/2018 05:36 PM, Srikar Dronamraju wrote:
> * Ravi Bangoria <[email protected]> [2018-02-28 13:23:41]:
>
>> # readelf -n ./tick
>> Provider: tick
>> Name: loop2
>> ... Semaphore: 0x0000000010020036
>>
>> # readelf -SW ./tick | grep probes
>> [25] .probes PROGBITS 0000000010020034 010034
>>
>>
>> Semaphore offset is 0x10036. I don't have educated 'perf probe'
>> about semaphore. So instead of using 'perf probe' command, I'm
>> manually adding entry in the <tracefs>/uprobe_events file.
>> Special char * denotes semaphore offset.
>>
>>
>> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events
>>
>> # perf stat -e sdt_tick:loop2 -- /tmp/tick
>> hi: 0
>> hi: 1
>> hi: 2
>> hi: 3
>> ^C
>> Performance counter stats for '/tmp/tick':
>> 4 sdt_tick:loop2
>> 3.359047827 seconds time elapsed
>>
>>
>> Feedback?
>>
>> TODO:
>> - Educate perf tool about semaphore.
>>
> Is it possible to extend perf buildcache with a new option to work with
> semaphore?

Yes, that should be fairly easy to do. Will add a patch for that.

Thanks for the review,
Ravi


2018-03-01 05:10:44

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 1/4] Uprobe: Rename map_info to uprobe_map_info



On 02/28/2018 05:39 PM, Srikar Dronamraju wrote:
>> -static inline struct map_info *free_map_info(struct map_info *info)
>> +static inline struct uprobe_map_info *
>> +free_uprobe_map_info(struct uprobe_map_info *info)
>> {
>> - struct map_info *next = info->next;
>> + struct uprobe_map_info *next = info->next;
>> kfree(info);
>> return next;
>> }
>>
>> -static struct map_info *
>> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
>> +static struct uprobe_map_info *
>> +build_uprobe_map_info(struct address_space *mapping, loff_t offset,
>> + bool is_register)
> Imho, uprobe_build_map_info may be better than build_uprobe_map_info,
> similarly uprobe_free_map_info.

Sure, Will change it.

Thanks,
Ravi


2018-03-01 05:24:26

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 2/4] Uprobe: Export few functions / data structures



On 02/28/2018 05:54 PM, Srikar Dronamraju wrote:
>> @@ -149,6 +155,11 @@ struct uprobes_state {
>> extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
>> extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
>> void *src, unsigned long len);
>> +unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset);
>> +void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len);
>> +void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len);
>> +struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info);
>> +
>> #else /* !CONFIG_UPROBES */
> If we have to export the above, we might have to work with mm maintainers and
> see if we can move them there.

Adding
    [email protected]
    Michal Hocko <[email protected]>
    Andrew Morton <[email protected]>
in the cc.

>> -static inline struct uprobe_map_info *
>> -free_uprobe_map_info(struct uprobe_map_info *info)
>> +struct uprobe_map_info *free_uprobe_map_info(struct uprobe_map_info *info)
>> {
>> struct uprobe_map_info *next = info->next;
>> kfree(info);
>> return next;
>> }
>>
>> -static struct uprobe_map_info *
>> -build_uprobe_map_info(struct address_space *mapping, loff_t offset,
>> - bool is_register)
>> +struct uprobe_map_info *build_uprobe_map_info(struct address_space *mapping,
>> + loff_t offset, bool is_register)
>> {
>> unsigned long pgoff = offset >> PAGE_SHIFT;
>> struct vm_area_struct *vma;
> Instead of exporting, did you look at extending the uprobe consumer with
> ops. i.e if the consumer detects that a probe is a semaphore and exports
> a set of callbacks which can them be called from uprobe
> insertion/deletion time. With such a thing, incrementing/decrementing
> the semaphore and the insertion/deletion of the breakpoint can be done
> at one shot. No?

Yes, we tried that approach as well. Basically, when install_breakpoint() get called,
notify consumer about that. We can either use consumer_filter function or add a
new callback into uprobe_consumer which will get called if install_breakpoint()
succeeds. something like:

     if (install_breakpoint()) {
         /* Notify consumers right after patching instruction. */
         consumer->post_prepare();
     }

There are different problem with that approach. install_breakpoint() gets called in
very early stage of binary loading and vma that holds the semaphore won't be
present in the mm yet. I also tried to solve this by creating a task_work in
consumer callback. task_work handler will get called when process virtual memory
map is fully prepared and we are going back to userspace. But it will make design
quite complicated. Also, there is no way to know if mm_struct we got in task_work
handler is _still_ valid.

With unregister also, we first remove the "caller" consumer and then re-patch
original instruction. i.e.

     __uprobe_unregister()
     {
         if (WARN_ON(!consumer_del(uprobe, uc)))
             return;
         err = register_for_each_vma(uprobe, NULL);

We don't callback "caller" consumer at unregistration.

Our idea is to make changes in core uprobe as less as possible. And IMHO,
exporting build_map_info() helps to simplifies the implementation.

Let me know if I'm missing something.

Thanks for the review,
Ravi


2018-03-01 05:31:01

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 0/4] trace_uprobe: Support SDT markers having semaphore



On 02/28/2018 07:55 PM, Masami Hiramatsu wrote:
> Hi Ravi,
>
> Thank you for your great work!

Thanks Masami.

> On Wed, 28 Feb 2018 13:23:41 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> Userspace Statically Defined Tracepoints[1] are dtrace style markers
>> inside userspace applications. These markers are added by developer at
>> important places in the code. Each marker source expands to a single
>> nop instruction in the compiled code but there may be additional
>> overhead for computing the marker arguments which expands to couple of
>> instructions. If this computaion is quite more, execution of it can be
>> ommited by runtime if() condition when no one is tracing on the marker:
...
>>
>> Semaphore offset is 0x10036. I don't have educated 'perf probe'
>> about semaphore. So instead of using 'perf probe' command, I'm
>> manually adding entry in the <tracefs>/uprobe_events file.
> Ok, it is easy to pass semaphore address via perf probe :)

Yes, it should be fairly easy to parse semaphore at buildid-cache time.
Will add a patch for that.

>
>> Special char * denotes semaphore offset.
>>
>>
>> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4 *0x10036" > uprobe_events
> IMHO, this syntax is no good, separate with space is only for arguments.
> Since the semaphore is per-probe-point based, that should be specified with probe point.
> (there are no 2 or more semaphores on 1 event, are there?)
> So something like
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events

This is great suggestion. Will change it.

Please review patch 3 and 4 which contains actual implementation.

Thanks for the review,
Ravi


2018-03-01 14:09:09

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore

On Wed, 28 Feb 2018 13:23:44 +0530
Ravi Bangoria <[email protected]> wrote:

> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. These markers are added by developer at
> important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. If this computaion is quite more, execution of it can be
> ommited by runtime if() condition when no one is tracing on the marker:
>
> if (semaphore > 0) {
> Execute marker instructions;
> }
>
> Default value of semaphore is 0. Tracer has to increment the semaphore
> before recording on a marker and decrement it at the end of tracing.
>
> Implement the semaphore flip logic in trace_uprobe, leaving core uprobe
> infrastructure as is, except one new callback from uprobe_mmap() to
> trace_uprobe.
>
> There are two major scenarios while enabling the marker,
> 1. Trace already running process. In this case, find all suitable
> processes and increment the semaphore value in them.
> 2. Trace is already enabled when target binary is executed. In this
> case, all mmaps will get notified to trace_uprobe and trace_uprobe
> will increment the semaphore if corresponding uprobe is enabled.
>
> At the time of disabling probes, decrement semaphore in all existing
> target processes.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> include/linux/uprobes.h | 2 +
> kernel/events/uprobes.c | 5 ++
> kernel/trace/trace_uprobe.c | 145 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 152 insertions(+)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 06c169e..04e9d57 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -121,6 +121,8 @@ struct uprobe_map_info {
> unsigned long vaddr;
> };
>
> +extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
> +
> extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern bool is_swbp_insn(uprobe_opcode_t *insn);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 56dd7af..81d8aaf 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1051,6 +1051,8 @@ static void build_probe_list(struct inode *inode,
> spin_unlock(&uprobes_treelock);
> }
>
> +void (*uprobe_mmap_callback)(struct vm_area_struct *vma) = NULL;
> +
> /*
> * Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
> *
> @@ -1063,6 +1065,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
> struct uprobe *uprobe, *u;
> struct inode *inode;
>
> + if (vma->vm_flags & VM_WRITE && uprobe_mmap_callback)
> + uprobe_mmap_callback(vma);
> +
> if (no_uprobe_events() || !valid_vma(vma, true))
> return 0;
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 40592e7b..d14aafc 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -25,6 +25,7 @@
> #include <linux/namei.h>
> #include <linux/string.h>
> #include <linux/rculist.h>
> +#include <linux/sched/mm.h>
>
> #include "trace_probe.h"
>
> @@ -58,6 +59,7 @@ struct trace_uprobe {
> struct inode *inode;
> char *filename;
> unsigned long offset;
> + unsigned long sdt_offset; /* sdt semaphore offset */
> unsigned long nhit;
> struct trace_probe tp;
> };
> @@ -502,6 +504,16 @@ static int create_trace_uprobe(int argc, char **argv)
> for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
> struct probe_arg *parg = &tu->tp.args[i];
>
> + /* This is not really an argument. */
> + if (argv[i][0] == '*') {
> + ret = kstrtoul(&(argv[i][1]), 0, &tu->sdt_offset);
> + if (ret) {
> + pr_info("Invalid semaphore address.\n");
> + goto error;
> + }
> + continue;
> + }

So, this part should be done with parsing probe-point as I pointed.

> +
> /* Increment count for freeing args in error case */
> tu->tp.nr_args++;
>
> @@ -894,6 +906,131 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
> return trace_handle_return(s);
> }
>
> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
> +{
> + unsigned long vaddr = offset_to_vaddr(vma, tu->sdt_offset);
> +
> + return tu->sdt_offset &&
> + vma->vm_file &&
> + file_inode(vma->vm_file) == tu->inode &&
> + vma->vm_flags & VM_WRITE &&
> + vma->vm_start <= vaddr &&
> + vma->vm_end > vaddr;
> +}
> +
> +static struct vm_area_struct *
> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
> +{
> + struct vm_area_struct *tmp;
> +
> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
> + if (sdt_valid_vma(tu, tmp))
> + return tmp;
> +
> + return NULL;
> +}
> +
> +static int
> +sdt_update_sem(struct mm_struct *mm, unsigned long vaddr, short val)
> +{
> + struct page *page;
> + struct vm_area_struct *vma;
> + int ret = 0;
> + unsigned short orig = 0;
> +
> + if (vaddr == 0)
> + return -EINVAL;
> +
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> + FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
> + if (ret <= 0)
> + return ret;
> +
> + copy_from_page(page, vaddr, &orig, sizeof(orig));
> + orig += val;
> + copy_to_page(page, vaddr, &orig, sizeof(orig));
> + put_page(page);
> + return 0;
> +}
> +
> +/*
> + * TODO: Adding this defination in include/linux/uprobes.h throws
> + * warnings about address_sapce. Adding it here for the time being.
> + */
> +struct uprobe_map_info *build_uprobe_map_info(struct address_space *mapping, loff_t offset, bool is_register);
> +
> +static void sdt_increment_sem(struct trace_uprobe *tu)
> +{
> + struct uprobe_map_info *info;
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> +
> + uprobe_start_dup_mmap();
> + info = build_uprobe_map_info(tu->inode->i_mapping, tu->sdt_offset, false);
> + if (IS_ERR(info))
> + goto out;
> +
> + while (info) {
> + down_write(&info->mm->mmap_sem);
> + vma = sdt_find_vma(info->mm, tu);
> + if (!vma)
> + goto cont;
> +
> + vaddr = offset_to_vaddr(vma, tu->sdt_offset);
> + sdt_update_sem(info->mm, vaddr, 1);
> +
> +cont:

Why would you use goto here?

Thank you,

> + up_write(&info->mm->mmap_sem);
> + mmput(info->mm);
> + info = free_uprobe_map_info(info);
> + }
> +
> +out:
> + uprobe_end_dup_mmap();
> +}
> +
> +/* Called with down_write(&vma->vm_mm->mmap_sem) */
> +void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
> +{
> + struct trace_uprobe *tu;
> + unsigned long vaddr;
> +
> + mutex_lock(&uprobe_lock);
> + list_for_each_entry(tu, &uprobe_list, list) {
> + if (!sdt_valid_vma(tu, vma) ||
> + !trace_probe_is_enabled(&tu->tp))
> + continue;
> +
> + vaddr = offset_to_vaddr(vma, tu->sdt_offset);
> + sdt_update_sem(vma->vm_mm, vaddr, 1);
> + }
> + mutex_unlock(&uprobe_lock);
> +}
> +
> +static void sdt_decrement_sem(struct trace_uprobe *tu)
> +{
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> + struct uprobe_map_info *info;
> +
> + info = build_uprobe_map_info(tu->inode->i_mapping, tu->sdt_offset, false);
> + if (IS_ERR(info))
> + return;
> +
> + while (info) {
> + down_write(&info->mm->mmap_sem);
> + vma = sdt_find_vma(info->mm, tu);
> + if (vma) {
> + vaddr = offset_to_vaddr(vma, tu->sdt_offset);
> + sdt_update_sem(info->mm, vaddr, -1);
> + }
> + up_write(&info->mm->mmap_sem);
> +
> + mmput(info->mm);
> + info = free_uprobe_map_info(info);
> + }
> +}
> +
> typedef bool (*filter_func_t)(struct uprobe_consumer *self,
> enum uprobe_filter_ctx ctx,
> struct mm_struct *mm);
> @@ -939,6 +1076,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
> if (ret)
> goto err_buffer;
>
> + if (tu->sdt_offset)
> + sdt_increment_sem(tu);
> +
> return 0;
>
> err_buffer:
> @@ -979,6 +1119,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
>
> WARN_ON(!uprobe_filter_is_empty(&tu->filter));
>
> + if (tu->sdt_offset)
> + sdt_decrement_sem(tu);
> +
> uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;
>
> @@ -1353,6 +1496,8 @@ static __init int init_uprobe_trace(void)
> /* Profile interface */
> trace_create_file("uprobe_profile", 0444, d_tracer,
> NULL, &uprobe_profile_ops);
> +
> + uprobe_mmap_callback = trace_uprobe_mmap_callback;
> return 0;
> }
>
> --
> 1.8.3.1
>


--
Masami Hiramatsu <[email protected]>

2018-03-02 05:20:22

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore



On 03/01/2018 07:37 PM, Masami Hiramatsu wrote:
> On Wed, 28 Feb 2018 13:23:44 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> @@ -502,6 +504,16 @@ static int create_trace_uprobe(int argc, char **argv)
>> for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
>> struct probe_arg *parg = &tu->tp.args[i];
>>
>> + /* This is not really an argument. */
>> + if (argv[i][0] == '*') {
>> + ret = kstrtoul(&(argv[i][1]), 0, &tu->sdt_offset);
>> + if (ret) {
>> + pr_info("Invalid semaphore address.\n");
>> + goto error;
>> + }
>> + continue;
>> + }
> So, this part should be done with parsing probe-point as I pointed.

Yes, will change it.

>> +static void sdt_increment_sem(struct trace_uprobe *tu)
>> +{
>> + struct uprobe_map_info *info;
>> + struct vm_area_struct *vma;
>> + unsigned long vaddr;
>> +
>> + uprobe_start_dup_mmap();
>> + info = build_uprobe_map_info(tu->inode->i_mapping, tu->sdt_offset, false);
>> + if (IS_ERR(info))
>> + goto out;
>> +
>> + while (info) {
>> + down_write(&info->mm->mmap_sem);
>> + vma = sdt_find_vma(info->mm, tu);
>> + if (!vma)
>> + goto cont;
>> +
>> + vaddr = offset_to_vaddr(vma, tu->sdt_offset);
>> + sdt_update_sem(info->mm, vaddr, 1);
>> +
>> +cont:
> Why would you use goto here?

Hmm.. sdt_find_vma() must return vma. Sure, will remove the goto.

Should I add a WARN_ON(!vma) ?

Thanks for the review,
Ravi


2018-03-06 12:01:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore

On Wed, Feb 28, 2018 at 01:23:44PM +0530, Ravi Bangoria wrote:
> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. These markers are added by developer at
> important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. If this computaion is quite more, execution of it can be
> ommited by runtime if() condition when no one is tracing on the marker:
>
> if (semaphore > 0) {
> Execute marker instructions;
> }
>
> Default value of semaphore is 0. Tracer has to increment the semaphore
> before recording on a marker and decrement it at the end of tracing.
>
> Implement the semaphore flip logic in trace_uprobe, leaving core uprobe
> infrastructure as is, except one new callback from uprobe_mmap() to
> trace_uprobe.

W.T.H. is that called a semaphore? afaict its just a usage-counter.
There is no blocking, no releasing, nothing that would make it an actual
semaphore.

So please, remove all mention of semaphore from this code, because it,
most emphatically, is not one.

Also, would it not be much better to do userspace jump-labels for this?
That completely avoids the dynamic branch at the SDT site.

2018-03-07 08:45:27

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore



On 03/06/2018 05:29 PM, Peter Zijlstra wrote:
> On Wed, Feb 28, 2018 at 01:23:44PM +0530, Ravi Bangoria wrote:
>> Userspace Statically Defined Tracepoints[1] are dtrace style markers
>> inside userspace applications. These markers are added by developer at
>> important places in the code. Each marker source expands to a single
>> nop instruction in the compiled code but there may be additional
>> overhead for computing the marker arguments which expands to couple of
>> instructions. If this computaion is quite more, execution of it can be
>> ommited by runtime if() condition when no one is tracing on the marker:
>>
>> if (semaphore > 0) {
>> Execute marker instructions;
>> }
>>
>> Default value of semaphore is 0. Tracer has to increment the semaphore
>> before recording on a marker and decrement it at the end of tracing.
>>
>> Implement the semaphore flip logic in trace_uprobe, leaving core uprobe
>> infrastructure as is, except one new callback from uprobe_mmap() to
>> trace_uprobe.
> W.T.H. is that called a semaphore? afaict its just a usage-counter.

I totally agree with you. But it's not me who named it semaphore :)

Please refer to "Semaphore Handling" section at:
https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

We can surly name it differently in the kernel code and document it
properly in the Documents/tracing/

> There is no blocking, no releasing, nothing that would make it an actual
> semaphore.
>
> So please, remove all mention of semaphore from this code, because it,
> most emphatically, is not one.
>
> Also, would it not be much better to do userspace jump-labels for this?
> That completely avoids the dynamic branch at the SDT site.
>

Userspace jump-label is a good idea but...

Semaphore logic has already became a kinda ABI now. Tools like bcc,
gdb, systemtap etc. flip the semaphore while probing the marker.

Thanks,
Ravi


2018-03-07 08:58:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore

On Wed, Mar 07, 2018 at 02:16:13PM +0530, Ravi Bangoria wrote:
> > W.T.H. is that called a semaphore? afaict its just a usage-counter.
>
> I totally agree with you. But it's not me who named it semaphore :)
>
> Please refer to "Semaphore Handling" section at:
> https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
>
> We can surly name it differently in the kernel code and document it
> properly in the Documents/tracing/

Yes, just because the SDT people failed their CS101 class doesn't mean
we need to perpetuate that failure. Please name it sensibly in our code.

> > Also, would it not be much better to do userspace jump-labels for this?
> > That completely avoids the dynamic branch at the SDT site.
> >
>
> Userspace jump-label is a good idea but...
>
> Semaphore logic has already became a kinda ABI now. Tools like bcc,
> gdb, systemtap etc. flip the semaphore while probing the marker.

*groan*.. maybe suggest it for the next version; it appears we're
already at SDTv3, so surely there will be a v4 too.