Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. These markers are added by developer at
important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Currently, perf tool has limited supports for SDT markers. I.e. it
can not trace markers surrounded by reference counter. Also, it's
not easy to add reference counter logic in userspace tool like perf,
so basic idea for this patchset is to add reference counter logic in
the trace_uprobe infrastructure. Ex,[2]
# cat tick.c
...
for (i = 0; i < 100; i++) {
DTRACE_PROBE1(tick, loop1, i);
if (TICK_LOOP2_ENABLED()) {
DTRACE_PROBE1(tick, loop2, i);
}
printf("hi: %d\n", i);
sleep(1);
}
...
Here tick:loop1 is marker without reference counter where as tick:loop2
is surrounded by reference counter condition.
# perf buildid-cache --add /tmp/tick
# perf probe sdt_tick:loop1
# perf probe sdt_tick:loop2
# perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop1
0 sdt_tick:loop2
2.747086086 seconds time elapsed
Perf failed to record data for tick:loop2. Same experiment with this
patch series:
# ./perf buildid-cache --add /tmp/tick
# ./perf probe sdt_tick:loop2
# ./perf stat -e sdt_tick:loop2 /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop2
2.561851452 seconds time elapsed
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
[2] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
[3] https://lkml.org/lkml/2017/12/6/976
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are referring it as 'reference
counter' in kernel / perf code.
RFC series can be found at:
https://lkml.org/lkml/2018/2/28/76
Ravi Bangoria (8):
Uprobe: Export vaddr <-> offset conversion functions
mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
Uprobe: Rename map_info to uprobe_map_info
Uprobe: Export uprobe_map_info along with
uprobe_{build/free}_map_info()
trace_uprobe: Support SDT markers having reference count (semaphore)
trace_uprobe/sdt: Fix multiple update of same reference counter
perf probe: Support SDT markers having reference counter (semaphore)
trace_uprobe/sdt: Document about reference counter
Documentation/trace/uprobetracer.txt | 16 +-
include/linux/mm.h | 12 ++
include/linux/uprobes.h | 11 ++
kernel/events/uprobes.c | 62 ++++----
kernel/trace/trace.c | 2 +-
kernel/trace/trace_uprobe.c | 273 ++++++++++++++++++++++++++++++++++-
tools/perf/util/probe-event.c | 21 ++-
tools/perf/util/probe-event.h | 1 +
tools/perf/util/probe-file.c | 22 ++-
tools/perf/util/symbol-elf.c | 10 ++
tools/perf/util/symbol.h | 1 +
11 files changed, 382 insertions(+), 49 deletions(-)
--
1.8.3.1
No functionality changes.
Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/mm.h | 4 ++--
kernel/events/uprobes.c | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 95909f2..d7ee526 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2275,13 +2275,13 @@ struct vm_unmapped_area_info {
}
static inline unsigned long
-offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+vma_offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
{
return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
}
static inline loff_t
-vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
+vma_vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
{
return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bd6f230..535fd39 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -748,7 +748,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
curr = info;
info->mm = vma->vm_mm;
- info->vaddr = offset_to_vaddr(vma, offset);
+ info->vaddr = vma_offset_to_vaddr(vma, offset);
}
i_mmap_unlock_read(mapping);
@@ -807,7 +807,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
goto unlock;
if (vma->vm_start > info->vaddr ||
- vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
+ vma_vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
goto unlock;
if (is_register) {
@@ -977,7 +977,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
uprobe->offset >= offset + vma->vm_end - vma->vm_start)
continue;
- vaddr = offset_to_vaddr(vma, uprobe->offset);
+ vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
err |= remove_breakpoint(uprobe, mm, vaddr);
}
up_read(&mm->mmap_sem);
@@ -1023,7 +1023,7 @@ static void build_probe_list(struct inode *inode,
struct uprobe *u;
INIT_LIST_HEAD(head);
- min = vaddr_to_offset(vma, start);
+ min = vma_vaddr_to_offset(vma, start);
max = min + (end - start) - 1;
spin_lock(&uprobes_treelock);
@@ -1076,7 +1076,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
if (!fatal_signal_pending(current) &&
filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
- unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
+ unsigned long vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
}
put_uprobe(uprobe);
@@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
inode = file_inode(vma->vm_file);
- min = vaddr_to_offset(vma, start);
+ min = vma_vaddr_to_offset(vma, start);
max = min + (end - start) - 1;
spin_lock(&uprobes_treelock);
@@ -1730,7 +1730,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
if (vma && vma->vm_start <= bp_vaddr) {
if (valid_vma(vma, false)) {
struct inode *inode = file_inode(vma->vm_file);
- loff_t offset = vaddr_to_offset(vma, bp_vaddr);
+ loff_t offset = vma_vaddr_to_offset(vma, bp_vaddr);
uprobe = find_uprobe(inode, offset);
}
--
1.8.3.1
map_info is very generic name, rename it to uprobe_map_info.
Renaming will help to export this structure outside of the
file.
Also rename free_map_info() to uprobe_free_map_info() and
build_map_info() to uprobe_build_map_info().
No functionality changes.
Signed-off-by: Ravi Bangoria <[email protected]>
---
kernel/events/uprobes.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 535fd39..081b88c1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -695,27 +695,29 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}
-struct map_info {
- struct map_info *next;
+struct uprobe_map_info {
+ struct uprobe_map_info *next;
struct mm_struct *mm;
unsigned long vaddr;
};
-static inline struct map_info *free_map_info(struct map_info *info)
+static inline struct uprobe_map_info *
+uprobe_free_map_info(struct uprobe_map_info *info)
{
- struct map_info *next = info->next;
+ struct uprobe_map_info *next = info->next;
kfree(info);
return next;
}
-static struct map_info *
-build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
+static struct uprobe_map_info *
+uprobe_build_map_info(struct address_space *mapping, loff_t offset,
+ bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
- struct map_info *curr = NULL;
- struct map_info *prev = NULL;
- struct map_info *info;
+ struct uprobe_map_info *curr = NULL;
+ struct uprobe_map_info *prev = NULL;
+ struct uprobe_map_info *info;
int more = 0;
again:
@@ -729,7 +731,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
* Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
* reclaim. This is optimistic, no harm done if it fails.
*/
- prev = kmalloc(sizeof(struct map_info),
+ prev = kmalloc(sizeof(struct uprobe_map_info),
GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (prev)
prev->next = NULL;
@@ -762,7 +764,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
}
do {
- info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
+ info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL);
if (!info) {
curr = ERR_PTR(-ENOMEM);
goto out;
@@ -774,7 +776,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
goto again;
out:
while (prev)
- prev = free_map_info(prev);
+ prev = uprobe_free_map_info(prev);
return curr;
}
@@ -782,11 +784,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
{
bool is_register = !!new;
- struct map_info *info;
+ struct uprobe_map_info *info;
int err = 0;
percpu_down_write(&dup_mmap_sem);
- info = build_map_info(uprobe->inode->i_mapping,
+ info = uprobe_build_map_info(uprobe->inode->i_mapping,
uprobe->offset, is_register);
if (IS_ERR(info)) {
err = PTR_ERR(info);
@@ -825,7 +827,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
up_write(&mm->mmap_sem);
free:
mmput(mm);
- info = free_map_info(info);
+ info = uprobe_free_map_info(info);
}
out:
percpu_up_write(&dup_mmap_sem);
--
1.8.3.1
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. These markers are added by developer at
important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
ommited by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Implement the reference counter logic in trace_uprobe, leaving core
uprobe infrastructure as is, except one new callback from uprobe_mmap()
to trace_uprobe.
trace_uprobe definition with reference counter will now be:
<path>:<offset>[(ref_ctr_offset)]
There are two different cases while enabling the marker,
1. Trace existing process. In this case, find all suitable processes
and increment the reference counter in them.
2. Enable trace before running target binary. In this case, all mmaps
will get notified to trace_uprobe and trace_uprobe will increment
the reference counter if corresponding uprobe is enabled.
At the time of disabling probes, decrement reference counter in all
existing target processes.
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are referring it as 'reference
counter' in kernel / perf code.
Signed-off-by: Ravi Bangoria <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
[Fengguang reported/fixed build failure in RFC patch]
---
include/linux/uprobes.h | 2 +
kernel/events/uprobes.c | 6 ++
kernel/trace/trace_uprobe.c | 172 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 178 insertions(+), 2 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 7bd2760..2d4df65 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -122,6 +122,8 @@ struct uprobe_map_info {
unsigned long vaddr;
};
+extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e7830b8..06821bb 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1041,6 +1041,9 @@ static void build_probe_list(struct inode *inode,
spin_unlock(&uprobes_treelock);
}
+/* Rightnow the only user of this is trace_uprobe. */
+void (*uprobe_mmap_callback)(struct vm_area_struct *vma);
+
/*
* Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
*
@@ -1053,6 +1056,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
struct uprobe *uprobe, *u;
struct inode *inode;
+ if (uprobe_mmap_callback)
+ uprobe_mmap_callback(vma);
+
if (no_uprobe_events() || !valid_vma(vma, true))
return 0;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2014f43..b6c9b48 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -25,6 +25,8 @@
#include <linux/namei.h>
#include <linux/string.h>
#include <linux/rculist.h>
+#include <linux/sched/mm.h>
+#include <linux/highmem.h>
#include "trace_probe.h"
@@ -58,6 +60,7 @@ struct trace_uprobe {
struct inode *inode;
char *filename;
unsigned long offset;
+ unsigned long ref_ctr_offset;
unsigned long nhit;
struct trace_probe tp;
};
@@ -362,10 +365,10 @@ static int create_trace_uprobe(int argc, char **argv)
{
struct trace_uprobe *tu;
struct inode *inode;
- char *arg, *event, *group, *filename;
+ char *arg, *event, *group, *filename, *rctr, *rctr_end;
char buf[MAX_EVENT_NAME_LEN];
struct path path;
- unsigned long offset;
+ unsigned long offset, ref_ctr_offset;
bool is_delete, is_return;
int i, ret;
@@ -375,6 +378,7 @@ static int create_trace_uprobe(int argc, char **argv)
is_return = false;
event = NULL;
group = NULL;
+ ref_ctr_offset = 0;
/* argc must be >= 1 */
if (argv[0][0] == '-')
@@ -454,6 +458,26 @@ static int create_trace_uprobe(int argc, char **argv)
goto fail_address_parse;
}
+ /* Parse reference counter offset if specified. */
+ rctr = strchr(arg, '(');
+ if (rctr) {
+ rctr_end = strchr(arg, ')');
+ if (rctr > rctr_end || *(rctr_end + 1) != 0) {
+ ret = -EINVAL;
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+
+ *rctr++ = 0;
+ *rctr_end = 0;
+ ret = kstrtoul(rctr, 0, &ref_ctr_offset);
+ if (ret) {
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+ }
+
+ /* Parse uprobe offset. */
ret = kstrtoul(arg, 0, &offset);
if (ret)
goto fail_address_parse;
@@ -488,6 +512,7 @@ static int create_trace_uprobe(int argc, char **argv)
goto fail_address_parse;
}
tu->offset = offset;
+ tu->ref_ctr_offset = ref_ctr_offset;
tu->inode = inode;
tu->filename = kstrdup(filename, GFP_KERNEL);
@@ -620,6 +645,8 @@ static int probes_seq_show(struct seq_file *m, void *v)
break;
}
}
+ if (tu->ref_ctr_offset)
+ seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
for (i = 0; i < tu->tp.nr_args; i++)
seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
@@ -894,6 +921,139 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
return trace_handle_return(s);
}
+static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
+{
+ unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+
+ return tu->ref_ctr_offset &&
+ vma->vm_file &&
+ file_inode(vma->vm_file) == tu->inode &&
+ vma->vm_flags & VM_WRITE &&
+ vma->vm_start <= vaddr &&
+ vma->vm_end > vaddr;
+}
+
+static struct vm_area_struct *
+sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
+{
+ struct vm_area_struct *tmp;
+
+ for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
+ if (sdt_valid_vma(tu, tmp))
+ return tmp;
+
+ return NULL;
+}
+
+/*
+ * Reference count gate the invocation of probe. If present,
+ * by default reference count is 0. One needs to increment
+ * it before tracing the probe and decrement it when done.
+ */
+static int
+sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
+{
+ void *kaddr;
+ struct page *page;
+ struct vm_area_struct *vma;
+ int ret = 0;
+ unsigned short orig = 0;
+
+ if (vaddr == 0)
+ return -EINVAL;
+
+ ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+ FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
+ if (ret <= 0)
+ return ret;
+
+ kaddr = kmap_atomic(page);
+ memcpy(&orig, kaddr + (vaddr & ~PAGE_MASK), sizeof(orig));
+ orig += d;
+ memcpy(kaddr + (vaddr & ~PAGE_MASK), &orig, sizeof(orig));
+ kunmap_atomic(kaddr);
+
+ put_page(page);
+ return 0;
+}
+
+static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
+{
+ struct uprobe_map_info *info;
+ struct vm_area_struct *vma;
+ unsigned long vaddr;
+
+ uprobe_start_dup_mmap();
+ info = uprobe_build_map_info(tu->inode->i_mapping,
+ tu->ref_ctr_offset, false);
+ if (IS_ERR(info))
+ goto out;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+
+ vma = sdt_find_vma(info->mm, tu);
+ vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ sdt_update_ref_ctr(info->mm, vaddr, 1);
+
+ up_write(&info->mm->mmap_sem);
+ mmput(info->mm);
+ info = uprobe_free_map_info(info);
+ }
+
+out:
+ uprobe_end_dup_mmap();
+}
+
+/* Called with down_write(&vma->vm_mm->mmap_sem) */
+void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
+{
+ struct trace_uprobe *tu;
+ unsigned long vaddr;
+
+ if (!(vma->vm_flags & VM_WRITE))
+ return;
+
+ mutex_lock(&uprobe_lock);
+ list_for_each_entry(tu, &uprobe_list, list) {
+ if (!sdt_valid_vma(tu, vma) ||
+ !trace_probe_is_enabled(&tu->tp))
+ continue;
+
+ vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
+ }
+ mutex_unlock(&uprobe_lock);
+}
+
+static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
+{
+ struct vm_area_struct *vma;
+ unsigned long vaddr;
+ struct uprobe_map_info *info;
+
+ uprobe_start_dup_mmap();
+ info = uprobe_build_map_info(tu->inode->i_mapping,
+ tu->ref_ctr_offset, false);
+ if (IS_ERR(info))
+ goto out;
+
+ while (info) {
+ down_write(&info->mm->mmap_sem);
+
+ vma = sdt_find_vma(info->mm, tu);
+ vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ sdt_update_ref_ctr(info->mm, vaddr, -1);
+
+ up_write(&info->mm->mmap_sem);
+ mmput(info->mm);
+ info = uprobe_free_map_info(info);
+ }
+
+out:
+ uprobe_end_dup_mmap();
+}
+
typedef bool (*filter_func_t)(struct uprobe_consumer *self,
enum uprobe_filter_ctx ctx,
struct mm_struct *mm);
@@ -939,6 +1099,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
if (ret)
goto err_buffer;
+ if (tu->ref_ctr_offset)
+ sdt_increment_ref_ctr(tu);
+
return 0;
err_buffer:
@@ -979,6 +1142,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
WARN_ON(!uprobe_filter_is_empty(&tu->filter));
+ if (tu->ref_ctr_offset)
+ sdt_decrement_ref_ctr(tu);
+
uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;
@@ -1423,6 +1589,8 @@ static __init int init_uprobe_trace(void)
/* Profile interface */
trace_create_file("uprobe_profile", 0444, d_tracer,
NULL, &uprobe_profile_ops);
+
+ uprobe_mmap_callback = trace_uprobe_mmap_callback;
return 0;
}
--
1.8.3.1
With this, perf buildid-cache will save SDT markers with reference
counter in probe cache. Perf probe will be able to probe markers
having reference counter. Ex,
# readelf -n /tmp/tick | grep -A1 loop2
Name: loop2
... Semaphore: 0x0000000010020036
# ./perf buildid-cache --add /tmp/tick
# ./perf probe sdt_tick:loop2
# ./perf stat -e sdt_tick:loop2 /tmp/tick
hi: 0
hi: 1
hi: 2
^C
Performance counter stats for '/tmp/tick':
3 sdt_tick:loop2
2.561851452 seconds time elapsed
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/util/probe-event.c | 21 +++++++++++++++++----
tools/perf/util/probe-event.h | 1 +
tools/perf/util/probe-file.c | 22 +++++++++++++++++++---
tools/perf/util/symbol-elf.c | 10 ++++++++++
tools/perf/util/symbol.h | 1 +
5 files changed, 48 insertions(+), 7 deletions(-)
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index e1dbc98..2cbe68a 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1832,6 +1832,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
tp->offset = strtoul(fmt2_str, NULL, 10);
}
+ if (tev->uprobes) {
+ fmt2_str = strchr(p, '(');
+ if (fmt2_str)
+ tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
+ }
+
tev->nargs = argc - 2;
tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
if (tev->args == NULL) {
@@ -2054,15 +2060,22 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
}
/* Use the tp->address for uprobes */
- if (tev->uprobes)
- err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
- else if (!strncmp(tp->symbol, "0x", 2))
+ if (tev->uprobes) {
+ if (tp->ref_ctr_offset)
+ err = strbuf_addf(&buf, "%s:0x%lx(0x%lx)", tp->module,
+ tp->address, tp->ref_ctr_offset);
+ else
+ err = strbuf_addf(&buf, "%s:0x%lx", tp->module,
+ tp->address);
+ } else if (!strncmp(tp->symbol, "0x", 2)) {
/* Absolute address. See try_to_find_absolute_address() */
err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
tp->module ? ":" : "", tp->address);
- else
+ } else {
err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
tp->module ? ":" : "", tp->symbol, tp->offset);
+ }
+
if (err)
goto error;
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 45b14f0..15a98c3 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -27,6 +27,7 @@ struct probe_trace_point {
char *symbol; /* Base symbol */
char *module; /* Module name */
unsigned long offset; /* Offset from symbol */
+ unsigned long ref_ctr_offset; /* SDT reference counter offset */
unsigned long address; /* Actual address of the trace point */
bool retprobe; /* Return probe flag */
};
diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 4ae1123..08ba3a6 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -701,6 +701,12 @@ static unsigned long long sdt_note__get_addr(struct sdt_note *note)
: (unsigned long long)note->addr.a64[0];
}
+static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note)
+{
+ return note->bit32 ? (unsigned long long)note->addr.a32[2]
+ : (unsigned long long)note->addr.a64[2];
+}
+
static const char * const type_to_suffix[] = {
":s64", "", "", "", ":s32", "", ":s16", ":s8",
"", ":u8", ":u16", "", ":u32", "", "", "", ":u64"
@@ -776,14 +782,24 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note,
{
struct strbuf buf;
char *ret = NULL, **args;
- int i, args_count;
+ int i, args_count, err;
+ unsigned long long ref_ctr_offset;
if (strbuf_init(&buf, 32) < 0)
return NULL;
- if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
+ ref_ctr_offset = sdt_note__get_ref_ctr_offset(note);
+
+ if (ref_ctr_offset)
+ err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx(0x%llx)",
sdtgrp, note->name, pathname,
- sdt_note__get_addr(note)) < 0)
+ sdt_note__get_addr(note), ref_ctr_offset);
+ else
+ err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
+ sdtgrp, note->name, pathname,
+ sdt_note__get_addr(note));
+
+ if (err < 0)
goto error;
if (!note->args)
diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index 2de7705..76c7b54 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -1928,6 +1928,16 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
}
}
+ /* Adjust reference counter offset */
+ if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_PROBES_SCN, NULL)) {
+ if (shdr.sh_offset) {
+ if (tmp->bit32)
+ tmp->addr.a32[2] -= (shdr.sh_addr - shdr.sh_offset);
+ else
+ tmp->addr.a64[2] -= (shdr.sh_addr - shdr.sh_offset);
+ }
+ }
+
list_add_tail(&tmp->note_list, sdt_notes);
return 0;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 70c16741..ad0c4f2 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -384,6 +384,7 @@ struct sdt_note {
int cleanup_sdt_note_list(struct list_head *sdt_notes);
int sdt_notes__get_count(struct list_head *start);
+#define SDT_PROBES_SCN ".probes"
#define SDT_BASE_SCN ".stapsdt.base"
#define SDT_NOTE_SCN ".note.stapsdt"
#define SDT_NOTE_TYPE 3
--
1.8.3.1
No functionality changes.
Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
kernel/events/uprobes.c | 10 ----------
2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..95909f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2274,6 +2274,18 @@ struct vm_unmapped_area_info {
return unmapped_area(info);
}
+static inline unsigned long
+offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
+{
+ return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+}
+
+static inline loff_t
+vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
+{
+ return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
+}
+
/* truncate.c */
extern void truncate_inode_pages(struct address_space *, loff_t);
extern void truncate_inode_pages_range(struct address_space *,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ce6848e..bd6f230 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -130,16 +130,6 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
return vma->vm_file && (vma->vm_flags & flags) == VM_MAYEXEC;
}
-static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
-{
- return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-}
-
-static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
-{
- return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
-}
-
/**
* __replace_page - replace page in vma by new page.
* based on replace_page in mm/ksm.c
--
1.8.3.1
No functionality changes.
Signed-off-by: Ravi Bangoria <[email protected]>
---
Documentation/trace/uprobetracer.txt | 16 +++++++++++++---
kernel/trace/trace.c | 2 +-
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
index bf526a7c..8fb13b0 100644
--- a/Documentation/trace/uprobetracer.txt
+++ b/Documentation/trace/uprobetracer.txt
@@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object.
Synopsis of uprobe_tracer
-------------------------
- p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
- r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
- -:[GRP/]EVENT : Clear uprobe or uretprobe event
+ p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
+ r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
+ -:[GRP/]EVENT
+
+ p : Set a uprobe
+ r : Set a return uprobe (uretprobe)
+ - : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
EVENT : Event name. If omitted, the event name is generated based
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
+ REF_CTR_OFFSET: Reference counter offset. Optional field. Reference count
+ gate the invocation of probe. If present, by default
+ reference count is 0. Kernel needs to increment it before
+ tracing the probe and decrement it when done. This is
+ identical to semaphore in Userspace Statically Defined
+ Tracepoints (USDT).
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 20a2300..2104d03 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4604,7 +4604,7 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
"place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
#endif
#ifdef CONFIG_UPROBE_EVENTS
- "\t place: <path>:<offset>\n"
+ " place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
#endif
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
--
1.8.3.1
For tiny binaries/libraries, different mmap regions points to the
same file portion. In such cases, we may increment reference counter
multiple times. But while de-registration, reference counter will get
decremented only by once leaving reference counter > 0 even if no one
is tracing on that marker.
Ensure increment and decrement happens in sync by keeping list of
mms in trace_uprobe. Increment reference counter only if mm is not
present in the list and decrement only if mm is present in the list.
Example
# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
Before patch:
# perf stat -a -e sdt_tick:loop2
# /tmp/tick
# dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
0000000: 02 .
# pkill perf
# dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
0000000: 01 .
After patch:
# perf stat -a -e sdt_tick:loop2
# /tmp/tick
# dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
0000000: 01 .
# pkill perf
# dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
0000000: 00 .
Signed-off-by: Ravi Bangoria <[email protected]>
---
kernel/trace/trace_uprobe.c | 105 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 103 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index b6c9b48..9bf3f7a 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -50,6 +50,11 @@ struct trace_uprobe_filter {
struct list_head perf_events;
};
+struct sdt_mm_list {
+ struct mm_struct *mm;
+ struct sdt_mm_list *next;
+};
+
/*
* uprobe event core functions
*/
@@ -61,6 +66,8 @@ struct trace_uprobe {
char *filename;
unsigned long offset;
unsigned long ref_ctr_offset;
+ struct sdt_mm_list *sml;
+ struct rw_semaphore sml_rw_sem;
unsigned long nhit;
struct trace_probe tp;
};
@@ -274,6 +281,7 @@ static inline bool is_ret_probe(struct trace_uprobe *tu)
if (is_ret)
tu->consumer.ret_handler = uretprobe_dispatcher;
init_trace_uprobe_filter(&tu->filter);
+ init_rwsem(&tu->sml_rw_sem);
return tu;
error:
@@ -921,6 +929,74 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
return trace_handle_return(s);
}
+static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *tmp = tu->sml;
+
+ if (!tu->sml || !mm)
+ return false;
+
+ while (tmp) {
+ if (tmp->mm == mm)
+ return true;
+ tmp = tmp->next;
+ }
+
+ return false;
+}
+
+static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *tmp;
+
+ tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
+ if (!tmp)
+ return;
+
+ tmp->mm = mm;
+ tmp->next = tu->sml;
+ tu->sml = tmp;
+}
+
+static void sdt_del_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
+{
+ struct sdt_mm_list *prev, *curr;
+
+ if (!tu->sml)
+ return;
+
+ if (tu->sml->mm == mm) {
+ curr = tu->sml;
+ tu->sml = tu->sml->next;
+ kfree(curr);
+ return;
+ }
+
+ prev = tu->sml;
+ curr = tu->sml->next;
+ while (curr) {
+ if (curr->mm == mm) {
+ prev->next = curr->next;
+ kfree(curr);
+ return;
+ }
+ prev = curr;
+ curr = curr->next;
+ }
+}
+
+static void sdt_flush_mm_list(struct trace_uprobe *tu)
+{
+ struct sdt_mm_list *next, *curr = tu->sml;
+
+ while (curr) {
+ next = curr->next;
+ kfree(curr);
+ curr = next;
+ }
+ tu->sml = NULL;
+}
+
static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
{
unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
@@ -989,17 +1065,25 @@ static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
if (IS_ERR(info))
goto out;
+ down_write(&tu->sml_rw_sem);
while (info) {
+ if (sdt_check_mm_list(tu, info->mm))
+ goto cont;
+
down_write(&info->mm->mmap_sem);
vma = sdt_find_vma(info->mm, tu);
vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
- sdt_update_ref_ctr(info->mm, vaddr, 1);
+ if (!sdt_update_ref_ctr(info->mm, vaddr, 1))
+ sdt_add_mm_list(tu, info->mm);
up_write(&info->mm->mmap_sem);
+
+cont:
mmput(info->mm);
info = uprobe_free_map_info(info);
}
+ up_write(&tu->sml_rw_sem);
out:
uprobe_end_dup_mmap();
@@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
!trace_probe_is_enabled(&tu->tp))
continue;
+ down_write(&tu->sml_rw_sem);
+ if (sdt_check_mm_list(tu, vma->vm_mm))
+ goto cont;
+
vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
- sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
+ if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
+ sdt_add_mm_list(tu, vma->vm_mm);
+
+cont:
+ up_write(&tu->sml_rw_sem);
}
mutex_unlock(&uprobe_lock);
}
@@ -1038,7 +1130,11 @@ static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
if (IS_ERR(info))
goto out;
+ down_write(&tu->sml_rw_sem);
while (info) {
+ if (!sdt_check_mm_list(tu, info->mm))
+ goto cont;
+
down_write(&info->mm->mmap_sem);
vma = sdt_find_vma(info->mm, tu);
@@ -1046,9 +1142,14 @@ static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
sdt_update_ref_ctr(info->mm, vaddr, -1);
up_write(&info->mm->mmap_sem);
+ sdt_del_mm_list(tu, info->mm);
+
+cont:
mmput(info->mm);
info = uprobe_free_map_info(info);
}
+ sdt_flush_mm_list(tu);
+ up_write(&tu->sml_rw_sem);
out:
uprobe_end_dup_mmap();
--
1.8.3.1
These exported data structure and functions will be used by other
files in later patches.
No functionality changes.
Signed-off-by: Ravi Bangoria <[email protected]>
---
include/linux/uprobes.h | 9 +++++++++
kernel/events/uprobes.c | 14 +++-----------
2 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e9..7bd2760 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -109,12 +109,19 @@ enum rp_check {
RP_CHECK_RET,
};
+struct address_space;
struct xol_area;
struct uprobes_state {
struct xol_area *xol_area;
};
+struct uprobe_map_info {
+ struct uprobe_map_info *next;
+ struct mm_struct *mm;
+ unsigned long vaddr;
+};
+
extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
extern bool is_swbp_insn(uprobe_opcode_t *insn);
@@ -149,6 +156,8 @@ struct uprobes_state {
extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
void *src, unsigned long len);
+extern struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info);
+extern struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, loff_t offset, bool is_register);
#else /* !CONFIG_UPROBES */
struct uprobes_state {
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081b88c1..e7830b8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -695,23 +695,15 @@ static void delete_uprobe(struct uprobe *uprobe)
put_uprobe(uprobe);
}
-struct uprobe_map_info {
- struct uprobe_map_info *next;
- struct mm_struct *mm;
- unsigned long vaddr;
-};
-
-static inline struct uprobe_map_info *
-uprobe_free_map_info(struct uprobe_map_info *info)
+struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info)
{
struct uprobe_map_info *next = info->next;
kfree(info);
return next;
}
-static struct uprobe_map_info *
-uprobe_build_map_info(struct address_space *mapping, loff_t offset,
- bool is_register)
+struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping,
+ loff_t offset, bool is_register)
{
unsigned long pgoff = offset >> PAGE_SHIFT;
struct vm_area_struct *vma;
--
1.8.3.1
On Tue, Mar 13, 2018 at 06:25:56PM +0530, Ravi Bangoria wrote:
> No functionality changes.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
Reviewed-by: J?r?me Glisse <[email protected]>
> ---
> include/linux/mm.h | 12 ++++++++++++
> kernel/events/uprobes.c | 10 ----------
> 2 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad06d42..95909f2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2274,6 +2274,18 @@ struct vm_unmapped_area_info {
> return unmapped_area(info);
> }
>
> +static inline unsigned long
> +offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
> +{
> + return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +}
> +
> +static inline loff_t
> +vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
> +{
> + return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
> +}
> +
> /* truncate.c */
> extern void truncate_inode_pages(struct address_space *, loff_t);
> extern void truncate_inode_pages_range(struct address_space *,
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index ce6848e..bd6f230 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -130,16 +130,6 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register)
> return vma->vm_file && (vma->vm_flags & flags) == VM_MAYEXEC;
> }
>
> -static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
> -{
> - return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -}
> -
> -static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
> -{
> - return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
> -}
> -
> /**
> * __replace_page - replace page in vma by new page.
> * based on replace_page in mm/ksm.c
> --
> 1.8.3.1
>
On Tue, Mar 13, 2018 at 06:25:57PM +0530, Ravi Bangoria wrote:
> No functionality changes.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
Doing this with coccinelle would have been nicer but this is small
enough to review without too much fatigue :)
Reviewed-by: J?r?me Glisse <[email protected]>
> ---
> include/linux/mm.h | 4 ++--
> kernel/events/uprobes.c | 14 +++++++-------
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 95909f2..d7ee526 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2275,13 +2275,13 @@ struct vm_unmapped_area_info {
> }
>
> static inline unsigned long
> -offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
> +vma_offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
> {
> return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> }
>
> static inline loff_t
> -vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
> +vma_vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
> {
> return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
> }
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index bd6f230..535fd39 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -748,7 +748,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> curr = info;
>
> info->mm = vma->vm_mm;
> - info->vaddr = offset_to_vaddr(vma, offset);
> + info->vaddr = vma_offset_to_vaddr(vma, offset);
> }
> i_mmap_unlock_read(mapping);
>
> @@ -807,7 +807,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> goto unlock;
>
> if (vma->vm_start > info->vaddr ||
> - vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
> + vma_vaddr_to_offset(vma, info->vaddr) != uprobe->offset)
> goto unlock;
>
> if (is_register) {
> @@ -977,7 +977,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
> uprobe->offset >= offset + vma->vm_end - vma->vm_start)
> continue;
>
> - vaddr = offset_to_vaddr(vma, uprobe->offset);
> + vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
> err |= remove_breakpoint(uprobe, mm, vaddr);
> }
> up_read(&mm->mmap_sem);
> @@ -1023,7 +1023,7 @@ static void build_probe_list(struct inode *inode,
> struct uprobe *u;
>
> INIT_LIST_HEAD(head);
> - min = vaddr_to_offset(vma, start);
> + min = vma_vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> spin_lock(&uprobes_treelock);
> @@ -1076,7 +1076,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> if (!fatal_signal_pending(current) &&
> filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) {
> - unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset);
> + unsigned long vaddr = vma_offset_to_vaddr(vma, uprobe->offset);
> install_breakpoint(uprobe, vma->vm_mm, vma, vaddr);
> }
> put_uprobe(uprobe);
> @@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>
> inode = file_inode(vma->vm_file);
>
> - min = vaddr_to_offset(vma, start);
> + min = vma_vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> spin_lock(&uprobes_treelock);
> @@ -1730,7 +1730,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
> if (vma && vma->vm_start <= bp_vaddr) {
> if (valid_vma(vma, false)) {
> struct inode *inode = file_inode(vma->vm_file);
> - loff_t offset = vaddr_to_offset(vma, bp_vaddr);
> + loff_t offset = vma_vaddr_to_offset(vma, bp_vaddr);
>
> uprobe = find_uprobe(inode, offset);
> }
> --
> 1.8.3.1
>
On Tue, Mar 13, 2018 at 06:25:58PM +0530, Ravi Bangoria wrote:
> map_info is very generic name, rename it to uprobe_map_info.
> Renaming will help to export this structure outside of the
> file.
>
> Also rename free_map_info() to uprobe_free_map_info() and
> build_map_info() to uprobe_build_map_info().
>
> No functionality changes.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
Same coccinelle comments as previously.
Reviewed-by: J?r?me Glisse <[email protected]>
> ---
> kernel/events/uprobes.c | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 535fd39..081b88c1 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -695,27 +695,29 @@ static void delete_uprobe(struct uprobe *uprobe)
> put_uprobe(uprobe);
> }
>
> -struct map_info {
> - struct map_info *next;
> +struct uprobe_map_info {
> + struct uprobe_map_info *next;
> struct mm_struct *mm;
> unsigned long vaddr;
> };
>
> -static inline struct map_info *free_map_info(struct map_info *info)
> +static inline struct uprobe_map_info *
> +uprobe_free_map_info(struct uprobe_map_info *info)
> {
> - struct map_info *next = info->next;
> + struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct map_info *
> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
> +static struct uprobe_map_info *
> +uprobe_build_map_info(struct address_space *mapping, loff_t offset,
> + bool is_register)
> {
> unsigned long pgoff = offset >> PAGE_SHIFT;
> struct vm_area_struct *vma;
> - struct map_info *curr = NULL;
> - struct map_info *prev = NULL;
> - struct map_info *info;
> + struct uprobe_map_info *curr = NULL;
> + struct uprobe_map_info *prev = NULL;
> + struct uprobe_map_info *info;
> int more = 0;
>
> again:
> @@ -729,7 +731,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
> * reclaim. This is optimistic, no harm done if it fails.
> */
> - prev = kmalloc(sizeof(struct map_info),
> + prev = kmalloc(sizeof(struct uprobe_map_info),
> GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> if (prev)
> prev->next = NULL;
> @@ -762,7 +764,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> }
>
> do {
> - info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
> + info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL);
> if (!info) {
> curr = ERR_PTR(-ENOMEM);
> goto out;
> @@ -774,7 +776,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> goto again;
> out:
> while (prev)
> - prev = free_map_info(prev);
> + prev = uprobe_free_map_info(prev);
> return curr;
> }
>
> @@ -782,11 +784,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
> register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> {
> bool is_register = !!new;
> - struct map_info *info;
> + struct uprobe_map_info *info;
> int err = 0;
>
> percpu_down_write(&dup_mmap_sem);
> - info = build_map_info(uprobe->inode->i_mapping,
> + info = uprobe_build_map_info(uprobe->inode->i_mapping,
> uprobe->offset, is_register);
> if (IS_ERR(info)) {
> err = PTR_ERR(info);
> @@ -825,7 +827,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> up_write(&mm->mmap_sem);
> free:
> mmput(mm);
> - info = free_map_info(info);
> + info = uprobe_free_map_info(info);
> }
> out:
> percpu_up_write(&dup_mmap_sem);
> --
> 1.8.3.1
>
On Tue, Mar 13, 2018 at 06:25:59PM +0530, Ravi Bangoria wrote:
> These exported data structure and functions will be used by other
> files in later patches.
>
> No functionality changes.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
Reviewed-by: J?r?me Glisse <[email protected]>
> ---
> include/linux/uprobes.h | 9 +++++++++
> kernel/events/uprobes.c | 14 +++-----------
> 2 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 0a294e9..7bd2760 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -109,12 +109,19 @@ enum rp_check {
> RP_CHECK_RET,
> };
>
> +struct address_space;
> struct xol_area;
>
> struct uprobes_state {
> struct xol_area *xol_area;
> };
>
> +struct uprobe_map_info {
> + struct uprobe_map_info *next;
> + struct mm_struct *mm;
> + unsigned long vaddr;
> +};
> +
> extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern bool is_swbp_insn(uprobe_opcode_t *insn);
> @@ -149,6 +156,8 @@ struct uprobes_state {
> extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> void *src, unsigned long len);
> +extern struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info);
> +extern struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, loff_t offset, bool is_register);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 081b88c1..e7830b8 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -695,23 +695,15 @@ static void delete_uprobe(struct uprobe *uprobe)
> put_uprobe(uprobe);
> }
>
> -struct uprobe_map_info {
> - struct uprobe_map_info *next;
> - struct mm_struct *mm;
> - unsigned long vaddr;
> -};
> -
> -static inline struct uprobe_map_info *
> -uprobe_free_map_info(struct uprobe_map_info *info)
> +struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info)
> {
> struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct uprobe_map_info *
> -uprobe_build_map_info(struct address_space *mapping, loff_t offset,
> - bool is_register)
> +struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping,
> + loff_t offset, bool is_register)
> {
> unsigned long pgoff = offset >> PAGE_SHIFT;
> struct vm_area_struct *vma;
> --
> 1.8.3.1
>
Hi Ravi,
On Tue, 13 Mar 2018 18:26:00 +0530
Ravi Bangoria <[email protected]> wrote:
> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. These markers are added by developer at
> important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. In case the overhead is more, execution of it can be
> ommited by runtime if() condition when no one is tracing on the marker:
>
> if (reference_counter > 0) {
> Execute marker instructions;
> }
>
> Default value of reference counter is 0. Tracer has to increment the
> reference counter before tracing on a marker and decrement it when
> done with the tracing.
>
> Implement the reference counter logic in trace_uprobe, leaving core
> uprobe infrastructure as is, except one new callback from uprobe_mmap()
> to trace_uprobe.
>
> trace_uprobe definition with reference counter will now be:
>
> <path>:<offset>[(ref_ctr_offset)]
Would you mean
<path>:<offset>(<ref_ctr_offset>)
?
or use "[]" for delimiter?
Since,
> @@ -454,6 +458,26 @@ static int create_trace_uprobe(int argc, char **argv)
> goto fail_address_parse;
> }
>
> + /* Parse reference counter offset if specified. */
> + rctr = strchr(arg, '(');
This seems you choose "()" for delimiter.
> + if (rctr) {
> + rctr_end = strchr(arg, ')');
rctr_end = strchr(rctr, ')');
? since we are sure rctr != NULL.
> + if (rctr > rctr_end || *(rctr_end + 1) != 0) {
> + ret = -EINVAL;
> + pr_info("Invalid reference counter offset.\n");
> + goto fail_address_parse;
> + }
Also
> +
> + *rctr++ = 0;
> + *rctr_end = 0;
Please consider to use '\0' for nul;
Thanks,
> + ret = kstrtoul(rctr, 0, &ref_ctr_offset);
> + if (ret) {
> + pr_info("Invalid reference counter offset.\n");
> + goto fail_address_parse;
> + }
> + }
> +
> + /* Parse uprobe offset. */
> ret = kstrtoul(arg, 0, &offset);
> if (ret)
> goto fail_address_parse;
> @@ -488,6 +512,7 @@ static int create_trace_uprobe(int argc, char **argv)
> goto fail_address_parse;
> }
> tu->offset = offset;
> + tu->ref_ctr_offset = ref_ctr_offset;
> tu->inode = inode;
> tu->filename = kstrdup(filename, GFP_KERNEL);
>
> @@ -620,6 +645,8 @@ static int probes_seq_show(struct seq_file *m, void *v)
> break;
> }
> }
> + if (tu->ref_ctr_offset)
> + seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
>
> for (i = 0; i < tu->tp.nr_args; i++)
> seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
> @@ -894,6 +921,139 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
> return trace_handle_return(s);
> }
>
> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
> +{
> + unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> +
> + return tu->ref_ctr_offset &&
> + vma->vm_file &&
> + file_inode(vma->vm_file) == tu->inode &&
> + vma->vm_flags & VM_WRITE &&
> + vma->vm_start <= vaddr &&
> + vma->vm_end > vaddr;
> +}
> +
> +static struct vm_area_struct *
> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
> +{
> + struct vm_area_struct *tmp;
> +
> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
> + if (sdt_valid_vma(tu, tmp))
> + return tmp;
> +
> + return NULL;
> +}
> +
> +/*
> + * Reference count gate the invocation of probe. If present,
> + * by default reference count is 0. One needs to increment
> + * it before tracing the probe and decrement it when done.
> + */
> +static int
> +sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
> +{
> + void *kaddr;
> + struct page *page;
> + struct vm_area_struct *vma;
> + int ret = 0;
> + unsigned short orig = 0;
> +
> + if (vaddr == 0)
> + return -EINVAL;
> +
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> + FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
> + if (ret <= 0)
> + return ret;
> +
> + kaddr = kmap_atomic(page);
> + memcpy(&orig, kaddr + (vaddr & ~PAGE_MASK), sizeof(orig));
> + orig += d;
> + memcpy(kaddr + (vaddr & ~PAGE_MASK), &orig, sizeof(orig));
> + kunmap_atomic(kaddr);
> +
> + put_page(page);
> + return 0;
> +}
> +
> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> +{
> + struct uprobe_map_info *info;
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> +
> + uprobe_start_dup_mmap();
> + info = uprobe_build_map_info(tu->inode->i_mapping,
> + tu->ref_ctr_offset, false);
> + if (IS_ERR(info))
> + goto out;
> +
> + while (info) {
> + down_write(&info->mm->mmap_sem);
> +
> + vma = sdt_find_vma(info->mm, tu);
> + vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + sdt_update_ref_ctr(info->mm, vaddr, 1);
> +
> + up_write(&info->mm->mmap_sem);
> + mmput(info->mm);
> + info = uprobe_free_map_info(info);
> + }
> +
> +out:
> + uprobe_end_dup_mmap();
> +}
> +
> +/* Called with down_write(&vma->vm_mm->mmap_sem) */
> +void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
> +{
> + struct trace_uprobe *tu;
> + unsigned long vaddr;
> +
> + if (!(vma->vm_flags & VM_WRITE))
> + return;
> +
> + mutex_lock(&uprobe_lock);
> + list_for_each_entry(tu, &uprobe_list, list) {
> + if (!sdt_valid_vma(tu, vma) ||
> + !trace_probe_is_enabled(&tu->tp))
> + continue;
> +
> + vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
> + }
> + mutex_unlock(&uprobe_lock);
> +}
> +
> +static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
> +{
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> + struct uprobe_map_info *info;
> +
> + uprobe_start_dup_mmap();
> + info = uprobe_build_map_info(tu->inode->i_mapping,
> + tu->ref_ctr_offset, false);
> + if (IS_ERR(info))
> + goto out;
> +
> + while (info) {
> + down_write(&info->mm->mmap_sem);
> +
> + vma = sdt_find_vma(info->mm, tu);
> + vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + sdt_update_ref_ctr(info->mm, vaddr, -1);
> +
> + up_write(&info->mm->mmap_sem);
> + mmput(info->mm);
> + info = uprobe_free_map_info(info);
> + }
> +
> +out:
> + uprobe_end_dup_mmap();
> +}
> +
> typedef bool (*filter_func_t)(struct uprobe_consumer *self,
> enum uprobe_filter_ctx ctx,
> struct mm_struct *mm);
> @@ -939,6 +1099,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
> if (ret)
> goto err_buffer;
>
> + if (tu->ref_ctr_offset)
> + sdt_increment_ref_ctr(tu);
> +
> return 0;
>
> err_buffer:
> @@ -979,6 +1142,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
>
> WARN_ON(!uprobe_filter_is_empty(&tu->filter));
>
> + if (tu->ref_ctr_offset)
> + sdt_decrement_ref_ctr(tu);
> +
> uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
> tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;
>
> @@ -1423,6 +1589,8 @@ static __init int init_uprobe_trace(void)
> /* Profile interface */
> trace_create_file("uprobe_profile", 0444, d_tracer,
> NULL, &uprobe_profile_ops);
> +
> + uprobe_mmap_callback = trace_uprobe_mmap_callback;
> return 0;
> }
>
> --
> 1.8.3.1
>
--
Masami Hiramatsu <[email protected]>
On Tue, 13 Mar 2018 18:26:03 +0530
Ravi Bangoria <[email protected]> wrote:
> No functionality changes.
Please consider to describe what is this change and why, here.
>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> Documentation/trace/uprobetracer.txt | 16 +++++++++++++---
> kernel/trace/trace.c | 2 +-
> 2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
> index bf526a7c..8fb13b0 100644
> --- a/Documentation/trace/uprobetracer.txt
> +++ b/Documentation/trace/uprobetracer.txt
> @@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object.
>
> Synopsis of uprobe_tracer
> -------------------------
> - p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
> - r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
> - -:[GRP/]EVENT : Clear uprobe or uretprobe event
> + p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
> + r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
Ah, OK in this context, [] means optional syntax :)
> + -:[GRP/]EVENT
> +
> + p : Set a uprobe
> + r : Set a return uprobe (uretprobe)
> + - : Clear uprobe or uretprobe event
>
> GRP : Group name. If omitted, "uprobes" is the default value.
> EVENT : Event name. If omitted, the event name is generated based
> on PATH+OFFSET.
> PATH : Path to an executable or a library.
> OFFSET : Offset where the probe is inserted.
> + REF_CTR_OFFSET: Reference counter offset. Optional field. Reference count
> + gate the invocation of probe. If present, by default
> + reference count is 0. Kernel needs to increment it before
> + tracing the probe and decrement it when done. This is
> + identical to semaphore in Userspace Statically Defined
> + Tracepoints (USDT).
>
> FETCHARGS : Arguments. Each probe can have up to 128 args.
> %REG : Fetch register REG
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 20a2300..2104d03 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -4604,7 +4604,7 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
> "place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
> #endif
> #ifdef CONFIG_UPROBE_EVENTS
> - "\t place: <path>:<offset>\n"
> + " place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
> #endif
> "\t args: <name>=fetcharg[:type]\n"
> "\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
> --
> 1.8.3.1
>
--
Masami Hiramatsu <[email protected]>
Hi Ravi,
This code logic looks good. I just have several small comments for style.
On Tue, 13 Mar 2018 18:26:02 +0530
Ravi Bangoria <[email protected]> wrote:
> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
> index e1dbc98..2cbe68a 100644
> --- a/tools/perf/util/probe-event.c
> +++ b/tools/perf/util/probe-event.c
> @@ -1832,6 +1832,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
> tp->offset = strtoul(fmt2_str, NULL, 10);
> }
>
> + if (tev->uprobes) {
> + fmt2_str = strchr(p, '(');
> + if (fmt2_str)
> + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
> + }
> +
> tev->nargs = argc - 2;
> tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
> if (tev->args == NULL) {
> @@ -2054,15 +2060,22 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
> }
>
> /* Use the tp->address for uprobes */
> - if (tev->uprobes)
> - err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
> - else if (!strncmp(tp->symbol, "0x", 2))
> + if (tev->uprobes) {
> + if (tp->ref_ctr_offset)
> + err = strbuf_addf(&buf, "%s:0x%lx(0x%lx)", tp->module,
> + tp->address, tp->ref_ctr_offset);
> + else
> + err = strbuf_addf(&buf, "%s:0x%lx", tp->module,
> + tp->address);
> + } else if (!strncmp(tp->symbol, "0x", 2)) {
> /* Absolute address. See try_to_find_absolute_address() */
> err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
> tp->module ? ":" : "", tp->address);
> - else
> + } else {
> err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
> tp->module ? ":" : "", tp->symbol, tp->offset);
> + }
What the purpose of this {}?
> +
> if (err)
> goto error;
>
> diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
> index 45b14f0..15a98c3 100644
> --- a/tools/perf/util/probe-event.h
> +++ b/tools/perf/util/probe-event.h
> @@ -27,6 +27,7 @@ struct probe_trace_point {
> char *symbol; /* Base symbol */
> char *module; /* Module name */
> unsigned long offset; /* Offset from symbol */
> + unsigned long ref_ctr_offset; /* SDT reference counter offset */
> unsigned long address; /* Actual address of the trace point */
> bool retprobe; /* Return probe flag */
> };
> diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
> index 4ae1123..08ba3a6 100644
> --- a/tools/perf/util/probe-file.c
> +++ b/tools/perf/util/probe-file.c
> @@ -701,6 +701,12 @@ static unsigned long long sdt_note__get_addr(struct sdt_note *note)
> : (unsigned long long)note->addr.a64[0];
> }
>
> +static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note)
> +{
> + return note->bit32 ? (unsigned long long)note->addr.a32[2]
> + : (unsigned long long)note->addr.a64[2];
> +}
Could you please introduce an enum for specifying the index by name?
e.g.
enum {
SDT_NOTE_IDX_ADDR = 0,
SDT_NOTE_IDX_REFCTR = 2,
};
> +
> static const char * const type_to_suffix[] = {
> ":s64", "", "", "", ":s32", "", ":s16", ":s8",
> "", ":u8", ":u16", "", ":u32", "", "", "", ":u64"
> @@ -776,14 +782,24 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note,
> {
> struct strbuf buf;
> char *ret = NULL, **args;
> - int i, args_count;
> + int i, args_count, err;
> + unsigned long long ref_ctr_offset;
>
> if (strbuf_init(&buf, 32) < 0)
> return NULL;
>
> - if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
> + ref_ctr_offset = sdt_note__get_ref_ctr_offset(note);
> +
> + if (ref_ctr_offset)
> + err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx(0x%llx)",
> sdtgrp, note->name, pathname,
> - sdt_note__get_addr(note)) < 0)
> + sdt_note__get_addr(note), ref_ctr_offset);
> + else
> + err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
> + sdtgrp, note->name, pathname,
> + sdt_note__get_addr(note));
This can be minimized (and avoid repeating) by using 2 strbuf_addf()s, like
err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
sdtgrp, note->name, pathname,
sdt_note__get_addr(note));
if (ref_ctr_offset && !err < 0)
err = strbuf_addf("(0x%llx)", ref_ctr_offset);
> +
> + if (err < 0)
> goto error;
>
> if (!note->args)
> diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
> index 2de7705..76c7b54 100644
> --- a/tools/perf/util/symbol-elf.c
> +++ b/tools/perf/util/symbol-elf.c
> @@ -1928,6 +1928,16 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
> }
> }
>
> + /* Adjust reference counter offset */
> + if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_PROBES_SCN, NULL)) {
> + if (shdr.sh_offset) {
> + if (tmp->bit32)
> + tmp->addr.a32[2] -= (shdr.sh_addr - shdr.sh_offset);
> + else
> + tmp->addr.a64[2] -= (shdr.sh_addr - shdr.sh_offset);
Here we should use enum above too.
Thank you,
> + }
> + }
> +
> list_add_tail(&tmp->note_list, sdt_notes);
> return 0;
>
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index 70c16741..ad0c4f2 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -384,6 +384,7 @@ struct sdt_note {
> int cleanup_sdt_note_list(struct list_head *sdt_notes);
> int sdt_notes__get_count(struct list_head *start);
>
> +#define SDT_PROBES_SCN ".probes"
> #define SDT_BASE_SCN ".stapsdt.base"
> #define SDT_NOTE_SCN ".note.stapsdt"
> #define SDT_NOTE_TYPE 3
> --
> 1.8.3.1
>
--
Masami Hiramatsu <[email protected]>
On Tue, 13 Mar 2018 18:26:01 +0530
Ravi Bangoria <[email protected]> wrote:
> For tiny binaries/libraries, different mmap regions points to the
> same file portion. In such cases, we may increment reference counter
> multiple times. But while de-registration, reference counter will get
> decremented only by once leaving reference counter > 0 even if no one
> is tracing on that marker.
>
> Ensure increment and decrement happens in sync by keeping list of
> mms in trace_uprobe. Increment reference counter only if mm is not
> present in the list and decrement only if mm is present in the list.
>
> Example
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
>
> Before patch:
>
> # perf stat -a -e sdt_tick:loop2
> # /tmp/tick
> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
> 0000000: 02 .
>
> # pkill perf
> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
> 0000000: 01 .
>
> After patch:
>
> # perf stat -a -e sdt_tick:loop2
> # /tmp/tick
> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
> 0000000: 01 .
>
> # pkill perf
> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
> 0000000: 00 .
>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> kernel/trace/trace_uprobe.c | 105 +++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index b6c9b48..9bf3f7a 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -50,6 +50,11 @@ struct trace_uprobe_filter {
> struct list_head perf_events;
> };
>
> +struct sdt_mm_list {
> + struct mm_struct *mm;
> + struct sdt_mm_list *next;
> +};
Oh, please use struct list_head instead of defining your own pointer-chain :(
> +
> /*
> * uprobe event core functions
> */
> @@ -61,6 +66,8 @@ struct trace_uprobe {
> char *filename;
> unsigned long offset;
> unsigned long ref_ctr_offset;
> + struct sdt_mm_list *sml;
> + struct rw_semaphore sml_rw_sem;
BTW, is there any reason to use rw_semaphore? (mutex doesn't fit?)
Thank you,
> unsigned long nhit;
> struct trace_probe tp;
> };
> @@ -274,6 +281,7 @@ static inline bool is_ret_probe(struct trace_uprobe *tu)
> if (is_ret)
> tu->consumer.ret_handler = uretprobe_dispatcher;
> init_trace_uprobe_filter(&tu->filter);
> + init_rwsem(&tu->sml_rw_sem);
> return tu;
>
> error:
> @@ -921,6 +929,74 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
> return trace_handle_return(s);
> }
>
> +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
> +{
> + struct sdt_mm_list *tmp = tu->sml;
> +
> + if (!tu->sml || !mm)
> + return false;
> +
> + while (tmp) {
> + if (tmp->mm == mm)
> + return true;
> + tmp = tmp->next;
> + }
> +
> + return false;
> +}
> +
> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
> +{
> + struct sdt_mm_list *tmp;
> +
> + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> + if (!tmp)
> + return;
> +
> + tmp->mm = mm;
> + tmp->next = tu->sml;
> + tu->sml = tmp;
> +}
> +
> +static void sdt_del_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
> +{
> + struct sdt_mm_list *prev, *curr;
> +
> + if (!tu->sml)
> + return;
> +
> + if (tu->sml->mm == mm) {
> + curr = tu->sml;
> + tu->sml = tu->sml->next;
> + kfree(curr);
> + return;
> + }
> +
> + prev = tu->sml;
> + curr = tu->sml->next;
> + while (curr) {
> + if (curr->mm == mm) {
> + prev->next = curr->next;
> + kfree(curr);
> + return;
> + }
> + prev = curr;
> + curr = curr->next;
> + }
> +}
> +
> +static void sdt_flush_mm_list(struct trace_uprobe *tu)
> +{
> + struct sdt_mm_list *next, *curr = tu->sml;
> +
> + while (curr) {
> + next = curr->next;
> + kfree(curr);
> + curr = next;
> + }
> + tu->sml = NULL;
> +}
> +
> static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
> {
> unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> @@ -989,17 +1065,25 @@ static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> if (IS_ERR(info))
> goto out;
>
> + down_write(&tu->sml_rw_sem);
> while (info) {
> + if (sdt_check_mm_list(tu, info->mm))
> + goto cont;
> +
> down_write(&info->mm->mmap_sem);
>
> vma = sdt_find_vma(info->mm, tu);
> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> - sdt_update_ref_ctr(info->mm, vaddr, 1);
> + if (!sdt_update_ref_ctr(info->mm, vaddr, 1))
> + sdt_add_mm_list(tu, info->mm);
>
> up_write(&info->mm->mmap_sem);
> +
> +cont:
> mmput(info->mm);
> info = uprobe_free_map_info(info);
> }
> + up_write(&tu->sml_rw_sem);
>
> out:
> uprobe_end_dup_mmap();
> @@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
> !trace_probe_is_enabled(&tu->tp))
> continue;
>
> + down_write(&tu->sml_rw_sem);
> + if (sdt_check_mm_list(tu, vma->vm_mm))
> + goto cont;
> +
> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> - sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
> + if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
> + sdt_add_mm_list(tu, vma->vm_mm);
> +
> +cont:
> + up_write(&tu->sml_rw_sem);
> }
> mutex_unlock(&uprobe_lock);
> }
> @@ -1038,7 +1130,11 @@ static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
> if (IS_ERR(info))
> goto out;
>
> + down_write(&tu->sml_rw_sem);
> while (info) {
> + if (!sdt_check_mm_list(tu, info->mm))
> + goto cont;
> +
> down_write(&info->mm->mmap_sem);
>
> vma = sdt_find_vma(info->mm, tu);
> @@ -1046,9 +1142,14 @@ static void sdt_decrement_ref_ctr(struct trace_uprobe *tu)
> sdt_update_ref_ctr(info->mm, vaddr, -1);
>
> up_write(&info->mm->mmap_sem);
> + sdt_del_mm_list(tu, info->mm);
> +
> +cont:
> mmput(info->mm);
> info = uprobe_free_map_info(info);
> }
> + sdt_flush_mm_list(tu);
> + up_write(&tu->sml_rw_sem);
>
> out:
> uprobe_end_dup_mmap();
> --
> 1.8.3.1
>
--
Masami Hiramatsu <[email protected]>
Hi Masami,
On 03/14/2018 07:18 PM, Masami Hiramatsu wrote:
> Hi Ravi,
>
> On Tue, 13 Mar 2018 18:26:00 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> Userspace Statically Defined Tracepoints[1] are dtrace style markers
>> inside userspace applications. These markers are added by developer at
>> important places in the code. Each marker source expands to a single
>> nop instruction in the compiled code but there may be additional
>> overhead for computing the marker arguments which expands to couple of
>> instructions. In case the overhead is more, execution of it can be
>> ommited by runtime if() condition when no one is tracing on the marker:
>>
>> if (reference_counter > 0) {
>> Execute marker instructions;
>> }
>>
>> Default value of reference counter is 0. Tracer has to increment the
>> reference counter before tracing on a marker and decrement it when
>> done with the tracing.
>>
>> Implement the reference counter logic in trace_uprobe, leaving core
>> uprobe infrastructure as is, except one new callback from uprobe_mmap()
>> to trace_uprobe.
>>
>> trace_uprobe definition with reference counter will now be:
>>
>> <path>:<offset>[(ref_ctr_offset)]
> Would you mean
> <path>:<offset>(<ref_ctr_offset>)
> ?
>
> or use "[]" for delimiter?
[] indicates optional field.
> Since,
>
>> @@ -454,6 +458,26 @@ static int create_trace_uprobe(int argc, char **argv)
>> goto fail_address_parse;
>> }
>>
>> + /* Parse reference counter offset if specified. */
>> + rctr = strchr(arg, '(');
> This seems you choose "()" for delimiter.
Correct.
>> + if (rctr) {
>> + rctr_end = strchr(arg, ')');
> rctr_end = strchr(rctr, ')');
>
> ? since we are sure rctr != NULL.
Yes. we can use rctr instead of arg.
>> + if (rctr > rctr_end || *(rctr_end + 1) != 0) {
>> + ret = -EINVAL;
>> + pr_info("Invalid reference counter offset.\n");
>> + goto fail_address_parse;
>> + }
>
> Also
>
>> +
>> + *rctr++ = 0;
>> + *rctr_end = 0;
> Please consider to use '\0' for nul;
Sure. Will change it.
Thanks for the review :)
Ravi
On 03/14/2018 07:45 PM, Masami Hiramatsu wrote:
> On Tue, 13 Mar 2018 18:26:01 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> For tiny binaries/libraries, different mmap regions points to the
>> same file portion. In such cases, we may increment reference counter
>> multiple times. But while de-registration, reference counter will get
>> decremented only by once leaving reference counter > 0 even if no one
>> is tracing on that marker.
>>
>> Ensure increment and decrement happens in sync by keeping list of
>> mms in trace_uprobe. Increment reference counter only if mm is not
>> present in the list and decrement only if mm is present in the list.
>>
>> Example
>>
>> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
>>
>> Before patch:
>>
>> # perf stat -a -e sdt_tick:loop2
>> # /tmp/tick
>> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
>> 0000000: 02 .
>>
>> # pkill perf
>> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
>> 0000000: 01 .
>>
>> After patch:
>>
>> # perf stat -a -e sdt_tick:loop2
>> # /tmp/tick
>> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
>> 0000000: 01 .
>>
>> # pkill perf
>> # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
>> 0000000: 00 .
>>
>> Signed-off-by: Ravi Bangoria <[email protected]>
>> ---
>> kernel/trace/trace_uprobe.c | 105 +++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 103 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
>> index b6c9b48..9bf3f7a 100644
>> --- a/kernel/trace/trace_uprobe.c
>> +++ b/kernel/trace/trace_uprobe.c
>> @@ -50,6 +50,11 @@ struct trace_uprobe_filter {
>> struct list_head perf_events;
>> };
>>
>> +struct sdt_mm_list {
>> + struct mm_struct *mm;
>> + struct sdt_mm_list *next;
>> +};
> Oh, please use struct list_head instead of defining your own pointer-chain :(
Sure, will change it.
>> +
>> /*
>> * uprobe event core functions
>> */
>> @@ -61,6 +66,8 @@ struct trace_uprobe {
>> char *filename;
>> unsigned long offset;
>> unsigned long ref_ctr_offset;
>> + struct sdt_mm_list *sml;
>> + struct rw_semaphore sml_rw_sem;
> BTW, is there any reason to use rw_semaphore? (mutex doesn't fit?)
Hmm.. No specific reason.. will use a mutex instead.
Thanks for the review :)
Ravi
On 03/14/2018 07:39 PM, Masami Hiramatsu wrote:
> Hi Ravi,
>
> This code logic looks good. I just have several small comments for style.
>
> On Tue, 13 Mar 2018 18:26:02 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
>> index e1dbc98..2cbe68a 100644
>> --- a/tools/perf/util/probe-event.c
>> +++ b/tools/perf/util/probe-event.c
>> @@ -1832,6 +1832,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
>> tp->offset = strtoul(fmt2_str, NULL, 10);
>> }
>>
>> + if (tev->uprobes) {
>> + fmt2_str = strchr(p, '(');
>> + if (fmt2_str)
>> + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
>> + }
>> +
>> tev->nargs = argc - 2;
>> tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
>> if (tev->args == NULL) {
>> @@ -2054,15 +2060,22 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
>> }
>>
>> /* Use the tp->address for uprobes */
>> - if (tev->uprobes)
>> - err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
>> - else if (!strncmp(tp->symbol, "0x", 2))
>> + if (tev->uprobes) {
>> + if (tp->ref_ctr_offset)
>> + err = strbuf_addf(&buf, "%s:0x%lx(0x%lx)", tp->module,
>> + tp->address, tp->ref_ctr_offset);
>> + else
>> + err = strbuf_addf(&buf, "%s:0x%lx", tp->module,
>> + tp->address);
>> + } else if (!strncmp(tp->symbol, "0x", 2)) {
>> /* Absolute address. See try_to_find_absolute_address() */
>> err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
>> tp->module ? ":" : "", tp->address);
>> - else
>> + } else {
>> err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
>> tp->module ? ":" : "", tp->symbol, tp->offset);
>> + }
> What the purpose of this {}?
The starting if has multiple statements and thus it needs braces. So I added
braces is all other conditions.
>> +
>> if (err)
>> goto error;
>>
>> diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
>> index 45b14f0..15a98c3 100644
>> --- a/tools/perf/util/probe-event.h
>> +++ b/tools/perf/util/probe-event.h
>> @@ -27,6 +27,7 @@ struct probe_trace_point {
>> char *symbol; /* Base symbol */
>> char *module; /* Module name */
>> unsigned long offset; /* Offset from symbol */
>> + unsigned long ref_ctr_offset; /* SDT reference counter offset */
>> unsigned long address; /* Actual address of the trace point */
>> bool retprobe; /* Return probe flag */
>> };
>> diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
>> index 4ae1123..08ba3a6 100644
>> --- a/tools/perf/util/probe-file.c
>> +++ b/tools/perf/util/probe-file.c
>> @@ -701,6 +701,12 @@ static unsigned long long sdt_note__get_addr(struct sdt_note *note)
>> : (unsigned long long)note->addr.a64[0];
>> }
>>
>> +static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note)
>> +{
>> + return note->bit32 ? (unsigned long long)note->addr.a32[2]
>> + : (unsigned long long)note->addr.a64[2];
>> +}
> Could you please introduce an enum for specifying the index by name?
>
> e.g.
> enum {
> SDT_NOTE_IDX_ADDR = 0,
> SDT_NOTE_IDX_REFCTR = 2,
> };
That will be good. Will change it.
>> +
>> static const char * const type_to_suffix[] = {
>> ":s64", "", "", "", ":s32", "", ":s16", ":s8",
>> "", ":u8", ":u16", "", ":u32", "", "", "", ":u64"
>> @@ -776,14 +782,24 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note,
>> {
>> struct strbuf buf;
>> char *ret = NULL, **args;
>> - int i, args_count;
>> + int i, args_count, err;
>> + unsigned long long ref_ctr_offset;
>>
>> if (strbuf_init(&buf, 32) < 0)
>> return NULL;
>>
>> - if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
>> + ref_ctr_offset = sdt_note__get_ref_ctr_offset(note);
>> +
>> + if (ref_ctr_offset)
>> + err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx(0x%llx)",
>> sdtgrp, note->name, pathname,
>> - sdt_note__get_addr(note)) < 0)
>> + sdt_note__get_addr(note), ref_ctr_offset);
>> + else
>> + err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
>> + sdtgrp, note->name, pathname,
>> + sdt_note__get_addr(note));
> This can be minimized (and avoid repeating) by using 2 strbuf_addf()s, like
>
> err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
> sdtgrp, note->name, pathname,
> sdt_note__get_addr(note));
> if (ref_ctr_offset && !err < 0)
> err = strbuf_addf("(0x%llx)", ref_ctr_offset);
Sure. Will change it.
>
>> +
>> + if (err < 0)
>> goto error;
>>
>> if (!note->args)
>> diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
>> index 2de7705..76c7b54 100644
>> --- a/tools/perf/util/symbol-elf.c
>> +++ b/tools/perf/util/symbol-elf.c
>> @@ -1928,6 +1928,16 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
>> }
>> }
>>
>> + /* Adjust reference counter offset */
>> + if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_PROBES_SCN, NULL)) {
>> + if (shdr.sh_offset) {
>> + if (tmp->bit32)
>> + tmp->addr.a32[2] -= (shdr.sh_addr - shdr.sh_offset);
>> + else
>> + tmp->addr.a64[2] -= (shdr.sh_addr - shdr.sh_offset);
> Here we should use enum above too.
Sure.
Thanks for the review :)
Ravi
On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
> On Tue, 13 Mar 2018 18:26:03 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> No functionality changes.
> Please consider to describe what is this change and why, here.
Will add in next version.
>> Signed-off-by: Ravi Bangoria <[email protected]>
>> ---
>> Documentation/trace/uprobetracer.txt | 16 +++++++++++++---
>> kernel/trace/trace.c | 2 +-
>> 2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
>> index bf526a7c..8fb13b0 100644
>> --- a/Documentation/trace/uprobetracer.txt
>> +++ b/Documentation/trace/uprobetracer.txt
>> @@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object.
>>
>> Synopsis of uprobe_tracer
>> -------------------------
>> - p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
>> - r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
>> - -:[GRP/]EVENT : Clear uprobe or uretprobe event
>> + p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
>> + r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
> Ah, OK in this context, [] means optional syntax :)
Correct.
Thanks,
Ravi
On 03/13, Ravi Bangoria wrote:
>
> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
> +{
> + unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> +
> + return tu->ref_ctr_offset &&
> + vma->vm_file &&
> + file_inode(vma->vm_file) == tu->inode &&
> + vma->vm_flags & VM_WRITE &&
> + vma->vm_start <= vaddr &&
> + vma->vm_end > vaddr;
> +}
Perhaps in this case a simple
ref_ctr_offset < vma->vm_end - vma->vm_start
check without vma_offset_to_vaddr() makes more sense, but I won't insist.
> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> +{
> + struct uprobe_map_info *info;
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> +
> + uprobe_start_dup_mmap();
> + info = uprobe_build_map_info(tu->inode->i_mapping,
> + tu->ref_ctr_offset, false);
Hmm. This doesn't look right.
If you need to find all mappings (and avoid the races with fork/dup_mmap) you
need to take this semaphore for writing, uprobe_start_dup_mmap() can't help.
Oleg.
On Tue, 13 Mar 2018 18:26:00 +0530
Ravi Bangoria <[email protected]> wrote:
> include/linux/uprobes.h | 2 +
> kernel/events/uprobes.c | 6 ++
> kernel/trace/trace_uprobe.c | 172 +++++++++++++++++++++++++++++++++++++++++++-
I'm currently traveling, but I'll try to look at it in a week or two.
-- Steve
> 3 files changed, 178 insertions(+), 2 deletions(-)
>
>
On 03/14/2018 10:29 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
>> +{
>> + unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
>> +
>> + return tu->ref_ctr_offset &&
>> + vma->vm_file &&
>> + file_inode(vma->vm_file) == tu->inode &&
>> + vma->vm_flags & VM_WRITE &&
>> + vma->vm_start <= vaddr &&
>> + vma->vm_end > vaddr;
>> +}
> Perhaps in this case a simple
>
> ref_ctr_offset < vma->vm_end - vma->vm_start
>
> check without vma_offset_to_vaddr() makes more sense, but I won't insist.
>
Hmm... I'm not quite sure. Will rethink and get back to you.
>
>> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
>> +{
>> + struct uprobe_map_info *info;
>> + struct vm_area_struct *vma;
>> + unsigned long vaddr;
>> +
>> + uprobe_start_dup_mmap();
>> + info = uprobe_build_map_info(tu->inode->i_mapping,
>> + tu->ref_ctr_offset, false);
> Hmm. This doesn't look right.
>
> If you need to find all mappings (and avoid the races with fork/dup_mmap) you
> need to take this semaphore for writing, uprobe_start_dup_mmap() can't help.
Oops. Yes. Will change it.
Thanks for the review :)
Ravi
Hi Ravi,
On Wed, 14 Mar 2018 20:52:59 +0530
Ravi Bangoria <[email protected]> wrote:
> On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
> > On Tue, 13 Mar 2018 18:26:03 +0530
> > Ravi Bangoria <[email protected]> wrote:
> >
> >> No functionality changes.
> > Please consider to describe what is this change and why, here.
>
> Will add in next version.
Thanks, and could you also move this before perf-probe patch?
Also Could you make perf-probe check the tracing/README whether
the kernel supports reference counter syntax or not?
perf-tool can be used on older (or stable) kernel.
Thank you,
>
> >> Signed-off-by: Ravi Bangoria <[email protected]>
> >> ---
> >> Documentation/trace/uprobetracer.txt | 16 +++++++++++++---
> >> kernel/trace/trace.c | 2 +-
> >> 2 files changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
> >> index bf526a7c..8fb13b0 100644
> >> --- a/Documentation/trace/uprobetracer.txt
> >> +++ b/Documentation/trace/uprobetracer.txt
> >> @@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object.
> >>
> >> Synopsis of uprobe_tracer
> >> -------------------------
> >> - p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
> >> - r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
> >> - -:[GRP/]EVENT : Clear uprobe or uretprobe event
> >> + p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
> >> + r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS]
> > Ah, OK in this context, [] means optional syntax :)
>
> Correct.
>
> Thanks,
> Ravi
>
--
Masami Hiramatsu <[email protected]>
On 03/13, Ravi Bangoria wrote:
>
> @@ -1053,6 +1056,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
> struct uprobe *uprobe, *u;
> struct inode *inode;
>
> + if (uprobe_mmap_callback)
> + uprobe_mmap_callback(vma);
> +
> if (no_uprobe_events() || !valid_vma(vma, true))
> return 0;
probe_event_enable() does
uprobe_register();
/* WINDOW */
sdt_increment_ref_ctr();
what if uprobe_mmap() is called in between? The counter(s) in this vma
will be incremented twice, no?
> +static struct vm_area_struct *
> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
> +{
> + struct vm_area_struct *tmp;
> +
> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
> + if (sdt_valid_vma(tu, tmp))
> + return tmp;
> +
> + return NULL;
I can't understand the logic... Lets ignore sdt_valid_vma() for now.
The caller has uprobe_map_info, why it can't simply do
vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
Oleg.
On 03/15, Oleg Nesterov wrote:
>
> > +static struct vm_area_struct *
> > +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
> > +{
> > + struct vm_area_struct *tmp;
> > +
> > + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
> > + if (sdt_valid_vma(tu, tmp))
> > + return tmp;
> > +
> > + return NULL;
>
> I can't understand the logic... Lets ignore sdt_valid_vma() for now.
> The caller has uprobe_map_info, why it can't simply do
> vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
Note to mention that sdt_find_vma() can return NULL but the callers do
vma_offset_to_vaddr(vma) without any check.
Oleg.
On 03/13, Ravi Bangoria wrote:
>
> For tiny binaries/libraries, different mmap regions points to the
> same file portion. In such cases, we may increment reference counter
> multiple times.
Yes,
> But while de-registration, reference counter will get
> decremented only by once
could you explain why this happens? sdt_increment_ref_ctr() and
sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
the same mappings?
Ether way, this patch doesn't look right at first glance... Just
for example,
> +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
> +{
> + struct sdt_mm_list *tmp = tu->sml;
> +
> + if (!tu->sml || !mm)
> + return false;
> +
> + while (tmp) {
> + if (tmp->mm == mm)
> + return true;
> + tmp = tmp->next;
> + }
> +
> + return false;
...
> +}
> +
> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
> +{
> + struct sdt_mm_list *tmp;
> +
> + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
> + if (!tmp)
> + return;
> +
> + tmp->mm = mm;
> + tmp->next = tu->sml;
> + tu->sml = tmp;
> +}
> +
...
> @@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
> !trace_probe_is_enabled(&tu->tp))
> continue;
>
> + down_write(&tu->sml_rw_sem);
> + if (sdt_check_mm_list(tu, vma->vm_mm))
> + goto cont;
> +
> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> - sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
> + if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
> + sdt_add_mm_list(tu, vma->vm_mm);
> +
> +cont:
> + up_write(&tu->sml_rw_sem);
To simplify, suppose that tu->sml is empty.
Some process calls this function, increments the counter and adds its ->mm into
the list.
Then it exits, ->mm is freed.
The next fork/exec allocates the same memory for the new ->mm, the new process
calls trace_uprobe_mmap_callback() and sdt_check_mm_list() returns T?
Oleg.
On 03/13, Ravi Bangoria wrote:
>
> +sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
> +{
> + void *kaddr;
> + struct page *page;
> + struct vm_area_struct *vma;
> + int ret = 0;
> + unsigned short orig = 0;
> +
> + if (vaddr == 0)
> + return -EINVAL;
> +
> + ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> + FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
> + if (ret <= 0)
> + return ret;
> +
> + kaddr = kmap_atomic(page);
> + memcpy(&orig, kaddr + (vaddr & ~PAGE_MASK), sizeof(orig));
> + orig += d;
> + memcpy(kaddr + (vaddr & ~PAGE_MASK), &orig, sizeof(orig));
> + kunmap_atomic(kaddr);
Hmm. Why memcpy? You could simply do
kaddr = kmap_atomic();
unsigned short *ptr = kaddr + (vaddr & ~PAGE_MASK);
*ptr += d;
kunmap_atomic();
Oleg.
On Tue, 13 Mar 2018 18:25:56 +0530
Ravi Bangoria <[email protected]> wrote:
> No functionality changes.
Please add a detailed explanation why this patch is needed. All commits
should be self sufficient and stand on their own. If I were to come up
to this patch via a git blame, I would be clueless to why it was done.
-- Steve
>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
>
On Tue, 13 Mar 2018 18:25:57 +0530
Ravi Bangoria <[email protected]> wrote:
> No functionality changes.
Again, please add an explanation to why this patch is done.
-- Steve
>
> Signed-off-by: Ravi Bangoria <[email protected]>
On Tue, 13 Mar 2018 18:25:59 +0530
Ravi Bangoria <[email protected]> wrote:
> These exported data structure and functions will be used by other
> files in later patches.
I'm reluctantly OK with the above.
>
> No functionality changes.
Please remove this line. There are functionality changes. Turning a
static inline into an exported function *is* a functionality change.
-- Steve
>
> Signed-off-by: Ravi Bangoria <[email protected]>
> ---
> include/linux/uprobes.h | 9 +++++++++
> kernel/events/uprobes.c | 14 +++-----------
> 2 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index 0a294e9..7bd2760 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -109,12 +109,19 @@ enum rp_check {
> RP_CHECK_RET,
> };
>
> +struct address_space;
> struct xol_area;
>
> struct uprobes_state {
> struct xol_area *xol_area;
> };
>
> +struct uprobe_map_info {
> + struct uprobe_map_info *next;
> + struct mm_struct *mm;
> + unsigned long vaddr;
> +};
> +
> extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr);
> extern bool is_swbp_insn(uprobe_opcode_t *insn);
> @@ -149,6 +156,8 @@ struct uprobes_state {
> extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
> extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
> void *src, unsigned long len);
> +extern struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info);
> +extern struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, loff_t offset, bool is_register);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 081b88c1..e7830b8 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -695,23 +695,15 @@ static void delete_uprobe(struct uprobe *uprobe)
> put_uprobe(uprobe);
> }
>
> -struct uprobe_map_info {
> - struct uprobe_map_info *next;
> - struct mm_struct *mm;
> - unsigned long vaddr;
> -};
> -
> -static inline struct uprobe_map_info *
> -uprobe_free_map_info(struct uprobe_map_info *info)
> +struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info)
> {
> struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct uprobe_map_info *
> -uprobe_build_map_info(struct address_space *mapping, loff_t offset,
> - bool is_register)
> +struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping,
> + loff_t offset, bool is_register)
> {
> unsigned long pgoff = offset >> PAGE_SHIFT;
> struct vm_area_struct *vma;
On Tue, 13 Mar 2018 18:25:58 +0530
Ravi Bangoria <[email protected]> wrote:
> -static inline struct map_info *free_map_info(struct map_info *info)
> +static inline struct uprobe_map_info *
> +uprobe_free_map_info(struct uprobe_map_info *info)
> {
> - struct map_info *next = info->next;
> + struct uprobe_map_info *next = info->next;
> kfree(info);
> return next;
> }
>
> -static struct map_info *
> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
> +static struct uprobe_map_info *
> +uprobe_build_map_info(struct address_space *mapping, loff_t offset,
Also, as these functions have side effects (like you need to perform a
mmput(info->mm), you need to add kerneldoc type comments to these
functions, explaining how to use them.
When you upgrade a function from static to use cases outside the file,
it requires documenting that function for future users.
-- Steve
> + bool is_register)
> {
> unsigned long pgoff = offset >> PAGE_SHIFT;
> struct vm_area_struct *vma;
> - struct map_info *curr = NULL;
> - struct map_info *prev = NULL;
> - struct map_info *info;
> + struct uprobe_map_info *curr = NULL;
> + struct uprobe_map_info *prev = NULL;
> + struct uprobe_map_info *info;
> int more = 0;
>
> again:
> @@ -729,7 +731,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
> * reclaim. This is optimistic, no harm done if it fails.
> */
> - prev = kmalloc(sizeof(struct map_info),
> + prev = kmalloc(sizeof(struct uprobe_map_info),
> GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> if (prev)
> prev->next = NULL;
> @@ -762,7 +764,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> }
>
> do {
> - info = kmalloc(sizeof(struct map_info), GFP_KERNEL);
> + info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL);
> if (!info) {
> curr = ERR_PTR(-ENOMEM);
> goto out;
> @@ -774,7 +776,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> goto again;
> out:
> while (prev)
> - prev = free_map_info(prev);
> + prev = uprobe_free_map_info(prev);
> return curr;
> }
>
> @@ -782,11 +784,11 @@ static inline struct map_info *free_map_info(struct map_info *info)
> register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
> {
> bool is_register = !!new;
> - struct map_info *info;
> + struct uprobe_map_info *info;
> int err = 0;
>
> percpu_down_write(&dup_mmap_sem);
> - info = build_map_info(uprobe->inode->i_mapping,
> + info = uprobe_build_map_info(uprobe->inode->i_mapping,
> uprobe->offset, is_register);
> if (IS_ERR(info)) {
> err = PTR_ERR(info);
> @@ -825,7 +827,7 @@ static inline struct map_info *free_map_info(struct map_info *info)
> up_write(&mm->mmap_sem);
> free:
> mmput(mm);
> - info = free_map_info(info);
> + info = uprobe_free_map_info(info);
> }
> out:
> percpu_up_write(&dup_mmap_sem);
On Tue, 13 Mar 2018 18:26:00 +0530
Ravi Bangoria <[email protected]> wrote:
> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> +{
> + struct uprobe_map_info *info;
> + struct vm_area_struct *vma;
> + unsigned long vaddr;
> +
> + uprobe_start_dup_mmap();
Please add a comment here that this function ups the mm ref count for
each info returned. Otherwise it's hard to know what that mmput() below
matches.
-- Steve
> + info = uprobe_build_map_info(tu->inode->i_mapping,
> + tu->ref_ctr_offset, false);
> + if (IS_ERR(info))
> + goto out;
> +
> + while (info) {
> + down_write(&info->mm->mmap_sem);
> +
> + vma = sdt_find_vma(info->mm, tu);
> + vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + sdt_update_ref_ctr(info->mm, vaddr, 1);
> +
> + up_write(&info->mm->mmap_sem);
> + mmput(info->mm);
> + info = uprobe_free_map_info(info);
> + }
> +
> +out:
> + uprobe_end_dup_mmap();
> +}
> +
On 03/15/2018 09:57 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:56 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> No functionality changes.
> Please add a detailed explanation why this patch is needed. All commits
> should be self sufficient and stand on their own. If I were to come up
> to this patch via a git blame, I would be clueless to why it was done.
Sure Steve, Will add description it in next series.
Thanks for the review,
Ravi
On 03/15/2018 10:14 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:58 +0530
> Ravi Bangoria <[email protected]> wrote:
>> -static inline struct map_info *free_map_info(struct map_info *info)
>> +static inline struct uprobe_map_info *
>> +uprobe_free_map_info(struct uprobe_map_info *info)
>> {
>> - struct map_info *next = info->next;
>> + struct uprobe_map_info *next = info->next;
>> kfree(info);
>> return next;
>> }
>>
>> -static struct map_info *
>> -build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
>> +static struct uprobe_map_info *
>> +uprobe_build_map_info(struct address_space *mapping, loff_t offset,
> Also, as these functions have side effects (like you need to perform a
> mmput(info->mm), you need to add kerneldoc type comments to these
> functions, explaining how to use them.
>
> When you upgrade a function from static to use cases outside the file,
> it requires documenting that function for future users.
Sure, will add a comment here.
Thanks for the review,
Ravi
On 03/15/2018 09:58 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:57 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> No functionality changes.
> Again, please add an explanation to why this patch is done.
Sure. Will add.
Thanks for the review,
Ravi
> -- Steve
>
>> Signed-off-by: Ravi Bangoria <[email protected]>
On 03/15/2018 10:02 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:25:59 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> These exported data structure and functions will be used by other
>> files in later patches.
> I'm reluctantly OK with the above.
>
>> No functionality changes.
> Please remove this line. There are functionality changes. Turning a
> static inline into an exported function *is* a functionality change.
Sure. Will change it.
Thanks for the review,
Ravi
On 03/15/2018 10:18 PM, Steven Rostedt wrote:
> On Tue, 13 Mar 2018 18:26:00 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
>> +{
>> + struct uprobe_map_info *info;
>> + struct vm_area_struct *vma;
>> + unsigned long vaddr;
>> +
>> + uprobe_start_dup_mmap();
> Please add a comment here that this function ups the mm ref count for
> each info returned. Otherwise it's hard to know what that mmput() below
> matches.
Sure. Will add it.
Thanks for the review,
Ravi
On 03/15/2018 07:51 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> @@ -1053,6 +1056,9 @@ int uprobe_mmap(struct vm_area_struct *vma)
>> struct uprobe *uprobe, *u;
>> struct inode *inode;
>>
>> + if (uprobe_mmap_callback)
>> + uprobe_mmap_callback(vma);
>> +
>> if (no_uprobe_events() || !valid_vma(vma, true))
>> return 0;
> probe_event_enable() does
>
> uprobe_register();
> /* WINDOW */
> sdt_increment_ref_ctr();
>
> what if uprobe_mmap() is called in between? The counter(s) in this vma
> will be incremented twice, no?
I guess, it's a valid issue with PATCH 5 but should be taken care by PATCH 6.
>
>> +static struct vm_area_struct *
>> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
>> +{
>> + struct vm_area_struct *tmp;
>> +
>> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
>> + if (sdt_valid_vma(tu, tmp))
>> + return tmp;
>> +
>> + return NULL;
> I can't understand the logic... Lets ignore sdt_valid_vma() for now.
> The caller has uprobe_map_info, why it can't simply do
> vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
Yes. that should work. Will change it.
Thanks for the review,
Ravi
On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
> On 03/15, Oleg Nesterov wrote:
>>> +static struct vm_area_struct *
>>> +sdt_find_vma(struct mm_struct *mm, struct trace_uprobe *tu)
>>> +{
>>> + struct vm_area_struct *tmp;
>>> +
>>> + for (tmp = mm->mmap; tmp != NULL; tmp = tmp->vm_next)
>>> + if (sdt_valid_vma(tu, tmp))
>>> + return tmp;
>>> +
>>> + return NULL;
>> I can't understand the logic... Lets ignore sdt_valid_vma() for now.
>> The caller has uprobe_map_info, why it can't simply do
>> vma = find_vma(uprobe_map_info->vaddr)? and then check sdt_valid_vma().
> Note to mention that sdt_find_vma() can return NULL but the callers do
> vma_offset_to_vaddr(vma) without any check.
If the "mm" we are passing to sdt_find_vma() is returned by
uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
_not_ return NULL.
Thanks for the review,
Ravi
On 03/15/2018 08:31 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> +sdt_update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
>> +{
>> + void *kaddr;
>> + struct page *page;
>> + struct vm_area_struct *vma;
>> + int ret = 0;
>> + unsigned short orig = 0;
>> +
>> + if (vaddr == 0)
>> + return -EINVAL;
>> +
>> + ret = get_user_pages_remote(NULL, mm, vaddr, 1,
>> + FOLL_FORCE | FOLL_WRITE, &page, &vma, NULL);
>> + if (ret <= 0)
>> + return ret;
>> +
>> + kaddr = kmap_atomic(page);
>> + memcpy(&orig, kaddr + (vaddr & ~PAGE_MASK), sizeof(orig));
>> + orig += d;
>> + memcpy(kaddr + (vaddr & ~PAGE_MASK), &orig, sizeof(orig));
>> + kunmap_atomic(kaddr);
> Hmm. Why memcpy? You could simply do
>
> kaddr = kmap_atomic();
> unsigned short *ptr = kaddr + (vaddr & ~PAGE_MASK);
> *ptr += d;
> kunmap_atomic();
Yes, that should work. Will change it.
Thanks for the review,
Ravi
On 03/15/2018 06:17 PM, Masami Hiramatsu wrote:
> Hi Ravi,
>
> On Wed, 14 Mar 2018 20:52:59 +0530
> Ravi Bangoria <[email protected]> wrote:
>
>> On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
>>> On Tue, 13 Mar 2018 18:26:03 +0530
>>> Ravi Bangoria <[email protected]> wrote:
>>>
>>>> No functionality changes.
>>> Please consider to describe what is this change and why, here.
>> Will add in next version.
> Thanks, and could you also move this before perf-probe patch?
> Also Could you make perf-probe check the tracing/README whether
> the kernel supports reference counter syntax or not?
>
> perf-tool can be used on older (or stable) kernel.
Sure, Will do that.
Thanks,
Ravi
On 03/16/2018 05:09 PM, Oleg Nesterov wrote:
> On 03/16, Ravi Bangoria wrote:
>> On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
>>> Note to mention that sdt_find_vma() can return NULL but the callers do
>>> vma_offset_to_vaddr(vma) without any check.
>> If the "mm" we are passing to sdt_find_vma() is returned by
>> uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
>> _not_ return NULL.
> Not at all.
>
> Once build_map_info() returns any mapping can go away. Otherwise, why do
> you think the caller has to take ->mmap_sem and use find_vma()? If you
> were right, build_map_info() could just return the list of vma's instead
> of list of mm's.
Oh.. okay.. I was under wrong impression then. Will add a check there.
Thanks for the review :)
Ravi
On 03/16, Ravi Bangoria wrote:
>
> On 03/15/2018 08:00 PM, Oleg Nesterov wrote:
> > Note to mention that sdt_find_vma() can return NULL but the callers do
> > vma_offset_to_vaddr(vma) without any check.
>
> If the "mm" we are passing to sdt_find_vma() is returned by
> uprobe_build_map_info(ref_ctr_offset), sdt_find_vma() must
> _not_ return NULL.
Not at all.
Once build_map_info() returns any mapping can go away. Otherwise, why do
you think the caller has to take ->mmap_sem and use find_vma()? If you
were right, build_map_info() could just return the list of vma's instead
of list of mm's.
Oleg.
On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> For tiny binaries/libraries, different mmap regions points to the
>> same file portion. In such cases, we may increment reference counter
>> multiple times.
> Yes,
>
>> But while de-registration, reference counter will get
>> decremented only by once
> could you explain why this happens? sdt_increment_ref_ctr() and
> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> the same mappings?
Sorry, I thought this happens only for tiny binaries. But that is not the case.
This happens for binary / library of any length.
Also, it's not a problem with sdt_increment_ref_ctr() / sdt_increment_ref_ctr().
The problem happens with trace_uprobe_mmap_callback().
To illustrate in detail, I'm adding a pr_info() in trace_uprobe_mmap_callback():
vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
+ pr_info("0x%lx-0x%lx : 0x%lx\n", vma->vm_start, vma->vm_end, vaddr);
sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
Ok now, libpython has SDT markers with reference counter:
# readelf -n /usr/lib64/libpython2.7.so.1.0 | grep -A2 Provider
Provider: python
Name: function__entry
... Semaphore: 0x00000000002899d8
Probing on that marker:
# cd /sys/kernel/debug/tracing/
# echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
# echo 1 > events/sdt_python/function__entry/enable
When I run python:
# strace -o out python
mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
The first mmap() maps the whole library into one region. Second mmap()
and third mprotect() split out the whole region into smaller vmas and sets
appropriate protection flags.
Now, in this case, trace_uprobe_mmap_callback() updates reference counter
twice -- by second mmap() call and by third mprotect() call -- because both
regions contain reference counter offset. This I can verify in dmesg:
# dmesg | tail
trace_kprobe: 0x7fff926a0000-0x7fff926f0000 : 0x7fff926e99d8
trace_kprobe: 0x7fff926b0000-0x7fff926f0000 : 0x7fff926e99d8
Final vmas of libpython:
# cat /proc/`pgrep python`/maps | grep libpython
7fff92460000-7fff926a0000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fff926a0000-7fff926b0000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
7fff926b0000-7fff926f0000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
I see similar problem with normal binary as well. I'm using Brendan Gregg's
example[1]:
# readelf -n /tmp/tick | grep -A2 Provider
Provider: tick
Name: loop2
... Semaphore: 0x000000001005003c
Probing that marker:
# echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
# echo 1 > events/sdt_tick/loop2/enable
Now when I run the binary
# /tmp/tick
load_elf_binary() internally calls mmap() and I see trace_uprobe_mmap_callback()
updating reference counter twice:
# dmesg | tail
trace_kprobe: 0x10010000-0x10030000 : 0x10020036
trace_kprobe: 0x10020000-0x10030000 : 0x10020036
proc/<pid>/maps of the tick:
# cat /proc/`pgrep tick`/maps
10000000-10010000 r-xp 00000000 08:05 1335712 /tmp/tick
10010000-10020000 r--p 00000000 08:05 1335712 /tmp/tick
10020000-10030000 rw-p 00010000 08:05 1335712 /tmp/tick
[1] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
> Ether way, this patch doesn't look right at first glance... Just
> for example,
>
>> +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> + struct sdt_mm_list *tmp = tu->sml;
>> +
>> + if (!tu->sml || !mm)
>> + return false;
>> +
>> + while (tmp) {
>> + if (tmp->mm == mm)
>> + return true;
>> + tmp = tmp->next;
>> + }
>> +
>> + return false;
> ...
>
>> +}
>> +
>> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm)
>> +{
>> + struct sdt_mm_list *tmp;
>> +
>> + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
>> + if (!tmp)
>> + return;
>> +
>> + tmp->mm = mm;
>> + tmp->next = tu->sml;
>> + tu->sml = tmp;
>> +}
>> +
> ...
>
>> @@ -1020,8 +1104,16 @@ void trace_uprobe_mmap_callback(struct vm_area_struct *vma)
>> !trace_probe_is_enabled(&tu->tp))
>> continue;
>>
>> + down_write(&tu->sml_rw_sem);
>> + if (sdt_check_mm_list(tu, vma->vm_mm))
>> + goto cont;
>> +
>> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
>> - sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
>> + if (!sdt_update_ref_ctr(vma->vm_mm, vaddr, 1))
>> + sdt_add_mm_list(tu, vma->vm_mm);
>> +
>> +cont:
>> + up_write(&tu->sml_rw_sem);
> To simplify, suppose that tu->sml is empty.
>
> Some process calls this function, increments the counter and adds its ->mm into
> the list.
>
> Then it exits, ->mm is freed.
>
> The next fork/exec allocates the same memory for the new ->mm, the new process
> calls trace_uprobe_mmap_callback() and sdt_check_mm_list() returns T?
Yes. This can happen. May be we can use mmu_notifier for this?
We register a release() callback from trace_uprobe while adding mm
in tu->sml. When mm gets freed, trace_uprobe will get notified.
Though, I don't know much about mmu_notifier. I need to think on this.
Let me know if you have better ideas.
Thanks for the review :)
Ravi
On 03/16/2018 05:42 PM, Ravi Bangoria wrote:
>
> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
>> On 03/13, Ravi Bangoria wrote:
>>> For tiny binaries/libraries, different mmap regions points to the
>>> same file portion. In such cases, we may increment reference counter
>>> multiple times.
>> Yes,
>>
>>> But while de-registration, reference counter will get
>>> decremented only by once
>> could you explain why this happens? sdt_increment_ref_ctr() and
>> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
>> the same mappings?
> Sorry, I thought this happens only for tiny binaries. But that is not the case.
> This happens for binary / library of any length.
>
> Also, it's not a problem with sdt_increment_ref_ctr() / sdt_increment_ref_ctr().
> The problem happens with trace_uprobe_mmap_callback().
>
> To illustrate in detail, I'm adding a pr_info() in trace_uprobe_mmap_callback():
>
> vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> + pr_info("0x%lx-0x%lx : 0x%lx\n", vma->vm_start, vma->vm_end, vaddr);
> sdt_update_ref_ctr(vma->vm_mm, vaddr, 1);
>
>
> Ok now, libpython has SDT markers with reference counter:
>
> # readelf -n /usr/lib64/libpython2.7.so.1.0 | grep -A2 Provider
> Provider: python
> Name: function__entry
> ... Semaphore: 0x00000000002899d8
>
> Probing on that marker:
>
> # cd /sys/kernel/debug/tracing/
> # echo "p:sdt_python/function__entry /usr/lib64/libpython2.7.so.1.0:0x16a4d4(0x2799d8)" > uprobe_events
> # echo 1 > events/sdt_python/function__entry/enable
>
> When I run python:
>
> # strace -o out python
> mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
> mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
> mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
>
> The first mmap() maps the whole library into one region. Second mmap()
> and third mprotect() split out the whole region into smaller vmas and sets
> appropriate protection flags.
>
> Now, in this case, trace_uprobe_mmap_callback() updates reference counter
> twice -- by second mmap() call and by third mprotect() call -- because both
> regions contain reference counter offset. This I can verify in dmesg:
>
> # dmesg | tail
> trace_kprobe: 0x7fff926a0000-0x7fff926f0000 : 0x7fff926e99d8
> trace_kprobe: 0x7fff926b0000-0x7fff926f0000 : 0x7fff926e99d8
>
> Final vmas of libpython:
>
> # cat /proc/`pgrep python`/maps | grep libpython
> 7fff92460000-7fff926a0000 r-xp 00000000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
> 7fff926a0000-7fff926b0000 r--p 00230000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
> 7fff926b0000-7fff926f0000 rw-p 00240000 08:05 403934 /usr/lib64/libpython2.7.so.1.0
>
>
> I see similar problem with normal binary as well. I'm using Brendan Gregg's
> example[1]:
>
> # readelf -n /tmp/tick | grep -A2 Provider
> Provider: tick
> Name: loop2
> ... Semaphore: 0x000000001005003c
>
> Probing that marker:
>
> # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events
> # echo 1 > events/sdt_tick/loop2/enable
>
> Now when I run the binary
>
> # /tmp/tick
>
> load_elf_binary() internally calls mmap() and I see trace_uprobe_mmap_callback()
> updating reference counter twice:
>
> # dmesg | tail
> trace_kprobe: 0x10010000-0x10030000 : 0x10020036
> trace_kprobe: 0x10020000-0x10030000 : 0x10020036
>
> proc/<pid>/maps of the tick:
>
> # cat /proc/`pgrep tick`/maps
> 10000000-10010000 r-xp 00000000 08:05 1335712 /tmp/tick
> 10010000-10020000 r--p 00000000 08:05 1335712 /tmp/tick
> 10020000-10030000 rw-p 00010000 08:05 1335712 /tmp/tick
>
> [1] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
Also, while de-registration, we look for all existing mms using
uprobe_build_mmap_info() and decrement the counter in each
of the mm. i.e. we decrement the counter only once.
-Ravi
On Fri, 16 Mar 2018 15:12:38 +0530
Ravi Bangoria <[email protected]> wrote:
> On 03/15/2018 06:17 PM, Masami Hiramatsu wrote:
> > Hi Ravi,
> >
> > On Wed, 14 Mar 2018 20:52:59 +0530
> > Ravi Bangoria <[email protected]> wrote:
> >
> >> On 03/14/2018 07:20 PM, Masami Hiramatsu wrote:
> >>> On Tue, 13 Mar 2018 18:26:03 +0530
> >>> Ravi Bangoria <[email protected]> wrote:
> >>>
> >>>> No functionality changes.
> >>> Please consider to describe what is this change and why, here.
> >> Will add in next version.
> > Thanks, and could you also move this before perf-probe patch?
> > Also Could you make perf-probe check the tracing/README whether
> > the kernel supports reference counter syntax or not?
> >
> > perf-tool can be used on older (or stable) kernel.
>
> Sure, Will do that.
Please see scan_ftrace_readme@util/probe-file.c :)
It is easy to expand the pattern table.
Thank you,
--
Masami Hiramatsu <[email protected]>
On 03/15, Steven Rostedt wrote:
>
> On Tue, 13 Mar 2018 18:26:00 +0530
> Ravi Bangoria <[email protected]> wrote:
>
> > +static void sdt_increment_ref_ctr(struct trace_uprobe *tu)
> > +{
> > + struct uprobe_map_info *info;
> > + struct vm_area_struct *vma;
> > + unsigned long vaddr;
> > +
> > + uprobe_start_dup_mmap();
>
> Please add a comment here that this function ups the mm ref count for
> each info returned. Otherwise it's hard to know what that mmput() below
> matches.
You meant uprobe_build_map_info(), not uprobe_start_dup_mmap().
Yes, and if it gets more callers perhaps we should move this mmput() into
uprobe_free_map_info()...
Oleg.
--- x/kernel/events/uprobes.c
+++ x/kernel/events/uprobes.c
@@ -714,6 +714,7 @@ struct map_info {
static inline struct map_info *free_map_info(struct map_info *info)
{
struct map_info *next = info->next;
+ mmput(info->mm);
kfree(info);
return next;
}
@@ -783,8 +784,11 @@ build_map_info(struct address_space *map
goto again;
out:
- while (prev)
- prev = free_map_info(prev);
+ while (prev) {
+ info = prev;
+ prev = prev->next;
+ kfree(info);
+ }
return curr;
}
@@ -834,7 +838,6 @@ register_for_each_vma(struct uprobe *upr
unlock:
up_write(&mm->mmap_sem);
free:
- mmput(mm);
info = free_map_info(info);
}
out:
On 03/16, Ravi Bangoria wrote:
>
> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> > On 03/13, Ravi Bangoria wrote:
> >> For tiny binaries/libraries, different mmap regions points to the
> >> same file portion. In such cases, we may increment reference counter
> >> multiple times.
> > Yes,
> >
> >> But while de-registration, reference counter will get
> >> decremented only by once
> > could you explain why this happens? sdt_increment_ref_ctr() and
> > sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> > the same mappings?
...
> ??? # strace -o out python
> ?? ?? mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
> ????? mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
> ????? mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
Ah, in this case everything is clear, thanks.
I was confused by the changelog, I misinterpreted it as if inc/dec are not
balanced in case of multiple mappings even if the application doesn't play
with mmap/mprotect/etc.
And it seems that you are trying to confuse yourself, not only me ;) Just
suppose that an application does mmap+munmap in a loop and the mapped region
contains uprobe but not the counter.
And this all makes me think that we should do something else. Ideally,
install_breakpoint() and remove_breakpoint() should inc/dec the counter
if they do not fail...
Btw, why do we need a counter, not a boolean? Who else can modify it?
Or different uprobes can share the same counter?
Oleg.
Hi Oleg,
On 03/14/2018 10:29 PM, Oleg Nesterov wrote:
> On 03/13, Ravi Bangoria wrote:
>> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
>> +{
>> + unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
>> +
>> + return tu->ref_ctr_offset &&
>> + vma->vm_file &&
>> + file_inode(vma->vm_file) == tu->inode &&
>> + vma->vm_flags & VM_WRITE &&
>> + vma->vm_start <= vaddr &&
>> + vma->vm_end > vaddr;
>> +}
> Perhaps in this case a simple
>
> ref_ctr_offset < vma->vm_end - vma->vm_start
>
> check without vma_offset_to_vaddr() makes more sense, but I won't insist.
>
I still don't get this. This seems a comparison between file offset and size
of the vma. Shouldn't we need to consider pg_off here?
Thanks,
Ravi
Hi Oleg,
On 03/16/2018 11:20 PM, Oleg Nesterov wrote:
> On 03/16, Ravi Bangoria wrote:
>> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
>>> On 03/13, Ravi Bangoria wrote:
>>>> For tiny binaries/libraries, different mmap regions points to the
>>>> same file portion. In such cases, we may increment reference counter
>>>> multiple times.
>>> Yes,
>>>
>>>> But while de-registration, reference counter will get
>>>> decremented only by once
>>> could you explain why this happens? sdt_increment_ref_ctr() and
>>> sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
>>> the same mappings?
> ...
>
>> # strace -o out python
>> mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff92460000
>> mmap(0x7fff926a0000, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x230000) = 0x7fff926a0000
>> mprotect(0x7fff926a0000, 65536, PROT_READ) = 0
> Ah, in this case everything is clear, thanks.
>
> I was confused by the changelog, I misinterpreted it as if inc/dec are not
> balanced in case of multiple mappings even if the application doesn't play
> with mmap/mprotect/etc.
>
> And it seems that you are trying to confuse yourself, not only me ;) Just
> suppose that an application does mmap+munmap in a loop and the mapped region
> contains uprobe but not the counter.
this is fine because ...
>
> And this all makes me think that we should do something else. Ideally,
> install_breakpoint() and remove_breakpoint() should inc/dec the counter
> if they do not fail...
The whole point of adding this logic in trace_uprobe is we wanted to
decouple the counter inc/dec logic from uprobe patching. If user is just
doing mmap+munmap region in a loop which contains uprobe, the
instruction will be patched by the core uprobe infrastructure. Whenever
application mmap the region that holds to counter, it will be incremented.
Our initial design was to increment counter in install_breakpoint() but
uprobed instruction gets patched in a very early stage of binary loading
and vma that holds the counter may not be mapped yet.
>
> Btw, why do we need a counter, not a boolean? Who else can modify it?
> Or different uprobes can share the same counter?
Yes, multiple SDT markers can share the counter. Ex, there can be multiple
implementation of same function and thus each individual implementation
may contain marker which share the same counter. From mysql,
# readelf -n /usr/lib64/mysql/libmysqlclient.so.18.0.0 | grep -A2 Provider
Provider: mysql
Name: net__write__start
Location: 0x000000000003caa0, ..., Semaphore: 0x0000000000333532
--
Provider: mysql
Name: net__write__start
Location: 0x000000000003cd5c, ..., Semaphore: 0x0000000000333532
Here, both the markers has same name, but different location. Also they
share the counter (semaphore).
Apart from that, counter allows multiple tracers to trace on a single marker,
which is difficult with boolean flag.
Thanks,
Ravi
Hi Ravi,
On 03/19, Ravi Bangoria wrote:
>
> On 03/16/2018 11:20 PM, Oleg Nesterov wrote:
> >
> > And it seems that you are trying to confuse yourself, not only me ;) Just
> > suppose that an application does mmap+munmap in a loop and the mapped region
> > contains uprobe but not the counter.
>
> this is fine because ...
Yes, I guess I tried to say "counter but not uprobe" but possibly I was actually
confused.
> Our initial design was to increment counter in install_breakpoint() but
> uprobed instruction gets patched in a very early stage of binary loading
> and vma that holds the counter may not be mapped yet.
Yes, yes, I understand this is not that simple...
> > Btw, why do we need a counter, not a boolean? Who else can modify it?
> > Or different uprobes can share the same counter?
>
> Yes, multiple SDT markers can share the counter.
OK, thanks.
Oleg.
On 03/19, Ravi Bangoria wrote:
>
> Hi Oleg,
>
> On 03/14/2018 10:29 PM, Oleg Nesterov wrote:
> > On 03/13, Ravi Bangoria wrote:
> >> +static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma)
> >> +{
> >> + unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset);
> >> +
> >> + return tu->ref_ctr_offset &&
> >> + vma->vm_file &&
> >> + file_inode(vma->vm_file) == tu->inode &&
> >> + vma->vm_flags & VM_WRITE &&
> >> + vma->vm_start <= vaddr &&
> >> + vma->vm_end > vaddr;
> >> +}
> > Perhaps in this case a simple
> >
> > ref_ctr_offset < vma->vm_end - vma->vm_start
> >
> > check without vma_offset_to_vaddr() makes more sense, but I won't insist.
> >
>
> I still don't get this. This seems a comparison between file offset and size
> of the vma. Shouldn't we need to consider pg_off here?
Indeed, I am stupid ;)
Oleg.