2021-11-29 20:16:33

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

Introduce a method based on function tracer to trace any object and get
the value of the object dynamically. the object can be obtained from the
dynamic event (kprobe_event/uprobe_event) or the static event(tracepoint).

Usage:
When using the kprobe event, only need to set the objtrace(a new trigger),
we can get the value of the object. The object is from the setting of the
kprobe event.

For example:
For the function bio_add_page():

int bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)

Firstly, we can set the base of the object, thus the first string "arg1"
stands for the value of the first parameter of this function bio_add_gage(),

# echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events

Secondly, we can get the value dynamically base the above object.

find the offset of the bi_size in struct bio:
$ gdb vmlinux
(gdb) p &(((struct bio *)0)->bi_iter.bi_size)
$1 = (unsigned int *) 0x28

# echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
p_bio_add_page_0/trigger

The best way to use this is that we can set the entrance event and exit
event, for example, the following example is to set the read_papes as
the entrance event, and set the __blk_account_io_start as the exit event.

# cd /sys/kernel/debug/tracing/
# echo 0 > ./tracing_on
# echo 'p read_pages' >> ./kprobe_events
# echo 'p __blk_account_io_start' >> ./kprobe_events
# echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
# echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
# echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
# echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger

# du -sh /test.txt
12.0K /test.txt

# cat /test.txt > /dev/null
# cat ./trace

# tracer: nop
#
# entries-in-buffer/entries-written: 50/50 #P:1
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000

Jeff Xie (5):
trace: Add trace any kernel object
trace/objtrace: get the value of the object
trace/README: Document objtrace trigger syntax
trace/objtrace: Add testcases for objtrace
trace/objtrace: Add documentation for objtrace

Documentation/trace/events.rst | 114 +++
include/linux/trace_events.h | 1 +
kernel/trace/Kconfig | 10 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 3 +
kernel/trace/trace.h | 8 +
kernel/trace/trace_entries.h | 18 +
kernel/trace/trace_events_trigger.c | 1 +
kernel/trace/trace_object.c | 660 ++++++++++++++++++
kernel/trace/trace_output.c | 40 ++
.../ftrace/test.d/trigger/trigger-objtrace.tc | 39 ++
11 files changed, 895 insertions(+)
create mode 100644 kernel/trace/trace_object.c
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc

--
2.25.1



2021-11-29 20:17:09

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 5/5] trace/objtrace: Add documentation for objtrace

Added documentation explaining how to use objtrace trigger to get the value
of the object.

Signed-off-by: Jeff Xie <[email protected]>
---
Documentation/trace/events.rst | 114 +++++++++++++++++++++++++++++++++
1 file changed, 114 insertions(+)

diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 8ddb9b09451c..a6d8313ba24e 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -527,6 +527,120 @@ The following commands are supported:

See Documentation/trace/histogram.rst for details and examples.

+- objtrace
+
+ This command provides a way to get the value of any object, The object
+ can be obtained from the dynamic event(kprobe_event/uprobe_event) or the
+ static event(tracepoint).
+
+ Usage:
+ When using the kprobe event, only need to set the objtrace(a new trigger),
+ we can get the value of the object. The object is from the setting of the
+ kprobe event.
+
+ For example:
+ For the function bio_add_page():
+
+ int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+
+ Firstly, we can set the base of the object, thus the first string "arg1"
+ stands for the value of the first parameter of this function bio_add_gage(),
+
+ # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
+
+ Secondly, we can get the value dynamically base the above object.
+
+ find the offset of the bi_size in struct bio:
+ $ gdb vmlinux
+ (gdb) p &(((struct bio *)0)->bi_iter.bi_size)
+ $1 = (unsigned int *) 0x28
+
+ # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
+ p_bio_add_page_0/trigger
+
+ The best way to use this is that we can set the entrance event and exit
+ event, for example, the following example is to set the read_papes as
+ the entrance event, and set the __blk_account_io_start as the exit event.
+
+ # cd /sys/kernel/debug/tracing/
+ # echo 0 > ./tracing_on
+ # echo 'p read_pages' >> ./kprobe_events
+ # echo 'p __blk_account_io_start' >> ./kprobe_events
+ # echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
+ # echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
+ # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
+ # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger
+
+ # du -sh /test.txt
+ 12.0K /test.txt
+
+ # cat /test.txt > /dev/null
+ # cat ./trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 50/50 #P:1
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
+ cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
+ cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
+ cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
+ cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
+ cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
+ cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
+ cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
+ cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
+ cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
+ cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000
+
7. In-kernel trace event API
============================

--
2.25.1


2021-11-29 20:18:08

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 2/5] trace/objtrace: get the value of the object

Using objtrace trigger to get the value of the object which from the kernel
function parameter.

Signed-off-by: Jeff Xie <[email protected]>
---
kernel/trace/trace_entries.h | 5 +-
kernel/trace/trace_object.c | 184 +++++++++++++++++++++++++++++------
kernel/trace/trace_output.c | 6 +-
3 files changed, 161 insertions(+), 34 deletions(-)

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index bb120d9498a9..2407c45a568c 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -413,8 +413,9 @@ FTRACE_ENTRY(object, trace_object_entry,
__field( unsigned long, ip )
__field( unsigned long, parent_ip )
__field( unsigned long, object )
+ __field( unsigned long, value )
),

- F_printk(" %ps <-- %ps object:%lx\n",
- (void *)__entry->ip, (void *)__entry->parent_ip, __entry->object)
+ F_printk(" %ps <-- %ps object:%lx value:%lx\n", (void *)__entry->ip,
+ (void *)__entry->parent_ip, __entry->object, __entry->value)
);
diff --git a/kernel/trace/trace_object.c b/kernel/trace/trace_object.c
index 83fccd587b63..c66e53da6aff 100644
--- a/kernel/trace/trace_object.c
+++ b/kernel/trace/trace_object.c
@@ -12,14 +12,25 @@
static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
static DEFINE_RAW_SPINLOCK(object_spin_lock);
static struct trace_event_file event_trace_file;
-static const int max_args_num = 6;
static const int max_obj_pool = 10;
static atomic_t trace_object_ref;
static int exit_trace_object(void);
static int init_trace_object(void);

+struct objtrace_trigger_data {
+ struct ftrace_event_field *field;
+ long offset;
+ int type_size;
+};
+
+struct objtrace_fetch_type {
+ char *name;
+ int type_size;
+};
+
struct object_instance {
void *object;
+ int obj_type_size;
struct freelist_node free_list;
struct list_head active_list;
};
@@ -50,7 +61,7 @@ static bool object_empty(void)
return list_empty(&obj_pool->active_list);
}

-static void set_trace_object(void *obj)
+static void set_trace_object(void *obj, int type_size)
{
struct freelist_node *fn;
struct object_instance *ins;
@@ -73,6 +84,7 @@ static void set_trace_object(void *obj)

ins = container_of(fn, struct object_instance, free_list);
ins->object = obj;
+ ins->obj_type_size = type_size;

raw_spin_lock_irqsave(&object_spin_lock, flags);
list_add_rcu(&ins->active_list, &obj_pool->active_list);
@@ -135,7 +147,7 @@ static int init_object_pool(void)
}

static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
- unsigned long object)
+ unsigned long object, unsigned long value)
{

struct trace_buffer *buffer;
@@ -152,19 +164,68 @@ static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
entry->ip = ip;
entry->parent_ip = parent_ip;
entry->object = object;
+ entry->value = value;

event_trigger_unlock_commit(&event_trace_file, buffer, event,
entry, pc);
}

+static inline long get_object_value(unsigned long *val, void *obj, int type_size)
+{
+ long ret = 0;
+
+ switch (type_size) {
+ case 1: {
+ u8 tmp;
+
+ ret = copy_from_kernel_nofault(&tmp, obj, sizeof(tmp));
+ if (ret)
+ goto out;
+ *val = tmp;
+ break;
+ }
+ case 2: {
+ u16 tmp;
+
+ ret = copy_from_kernel_nofault(&tmp, obj, sizeof(tmp));
+ if (ret)
+ goto out;
+ *val = tmp;
+ break;
+ }
+ case 4: {
+ u32 tmp;
+
+ ret = copy_from_kernel_nofault(&tmp, obj, sizeof(tmp));
+ if (ret)
+ goto out;
+ *val = tmp;
+ break;
+ }
+ case 8: {
+ u64 tmp;
+
+ ret = copy_from_kernel_nofault(&tmp, obj, sizeof(tmp));
+ if (ret)
+ goto out;
+ *val = tmp;
+ break;
+ }
+ default:
+ return -EINVAL;
+ }
+out:
+ return ret;
+}
+
static void
trace_object_events_call(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *op, struct ftrace_regs *fregs)
{
- struct pt_regs *pt_regs = ftrace_get_regs(fregs);
- unsigned long obj;
+ struct object_instance *inst;
+ unsigned long val = 0;
long disabled;
- int cpu, n;
+ int cpu;

preempt_disable_notrace();

@@ -177,12 +238,11 @@ trace_object_events_call(unsigned long ip, unsigned long parent_ip,
if (object_empty())
goto out;

- for (n = 0; n < max_args_num; n++) {
- obj = regs_get_kernel_argument(pt_regs, n);
- if (object_exist((void *)obj))
- submit_trace_object(ip, parent_ip, obj);
+ list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
+ if (get_object_value(&val, inst->object, inst->obj_type_size))
+ goto out;
+ submit_trace_object(ip, parent_ip, (unsigned long)inst->object, val);
}
-
out:
atomic_dec(&per_cpu(trace_object_event_disable, cpu));
preempt_enable_notrace();
@@ -198,12 +258,14 @@ trace_object_trigger(struct event_trigger_data *data,
struct trace_buffer *buffer, void *rec,
struct ring_buffer_event *event)
{
+ struct objtrace_trigger_data *obj_data = data->private_data;
+ struct ftrace_event_field *field;
+ void *obj, *val = NULL;

- struct ftrace_event_field *field = data->private_data;
- void *obj = NULL;
-
- memcpy(&obj, rec + field->offset, sizeof(obj));
- set_trace_object(obj);
+ field = obj_data->field;
+ memcpy(&val, rec + field->offset, sizeof(val));
+ obj = val + obj_data->offset;
+ set_trace_object(obj, obj_data->type_size);
}

static void
@@ -349,6 +411,22 @@ static void unregister_object_trigger(char *glob, struct event_trigger_ops *ops,
}
}

+static const struct objtrace_fetch_type objtrace_fetch_types[] = {
+ {"u8", 1},
+ {"s8", 1},
+ {"x8", 1},
+ {"u16", 2},
+ {"s16", 2},
+ {"x16", 2},
+ {"u32", 4},
+ {"s32", 4},
+ {"x32", 4},
+ {"u64", 8},
+ {"s64", 8},
+ {"x64", 8},
+ {}
+};
+
static int
event_object_trigger_callback(struct event_command *cmd_ops,
struct trace_event_file *file,
@@ -356,19 +434,21 @@ event_object_trigger_callback(struct event_command *cmd_ops,
{
struct event_trigger_data *trigger_data;
struct event_trigger_ops *trigger_ops;
+ struct objtrace_trigger_data *obj_data;
struct trace_event_call *call;
struct ftrace_event_field *field;
- char *objtrace_cmd;
- char *trigger = NULL;
- char *arg;
- char *number;
- int ret;
+ char *type, *tr, *obj, *tmp, *trigger = NULL;
+ char *number, *objtrace_cmd;
+ int ret, i, def_type_size, type_size = 0;
+ long offset = 0;

ret = -EINVAL;
if (!param)
goto out;
-
- /* separate the trigger from the filter (c:a:n [if filter]) */
+ /*
+ * separate the trigger from the filter:
+ * objtrace:add:OBJ[,OFFS][:TYPE][:COUNT] [if filter]
+ */
trigger = strsep(&param, " \t");
if (!trigger)
goto out;
@@ -382,33 +462,78 @@ event_object_trigger_callback(struct event_command *cmd_ops,
if (!objtrace_cmd || strcmp(objtrace_cmd, "add"))
goto out;

- arg = strsep(&trigger, ":");
- if (!arg)
+ obj = strsep(&trigger, ":");
+ if (!obj)
goto out;
+
+ tr = strchr(obj, ',');
+ if (!tr)
+ offset = 0;
+ else {
+ *tr++ = '\0';
+ ret = kstrtol(tr, 0, &offset);
+ if (ret)
+ goto out;
+ }
+
+ ret = -EINVAL;
call = file->event_call;
- field = trace_find_event_field(call, arg);
+ field = trace_find_event_field(call, obj);
if (!field)
goto out;
-
if (field->size != sizeof(void *))
goto out;
+ def_type_size = sizeof(void *);
+ if (!trigger) {
+ type_size = def_type_size;
+ goto skip_get_type;
+ }

+ tmp = trigger;
+ type = strsep(&trigger, ":");
+ if (!type)
+ type_size = def_type_size;
+ else if (isdigit(type[0])) {
+ type_size = def_type_size;
+ trigger = tmp;
+ } else {
+ for (i = 0; objtrace_fetch_types[i].name; i++) {
+ if (strcmp(objtrace_fetch_types[i].name, type) == 0) {
+ type_size = objtrace_fetch_types[i].type_size;
+ break;
+ }
+ }
+ }
+ if (!type_size)
+ goto out;
+skip_get_type:
trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);

ret = -ENOMEM;
+ obj_data = kzalloc(sizeof(*obj_data), GFP_KERNEL);
+ if (!obj_data)
+ goto out;
+
+ obj_data->field = field;
+ obj_data->offset = offset;
+ obj_data->type_size = type_size;
+
trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);
- if (!trigger_data)
+ if (!trigger_data) {
+ kfree(obj_data);
goto out;
+ }

trigger_data->count = -1;
trigger_data->ops = trigger_ops;
trigger_data->cmd_ops = cmd_ops;
- trigger_data->private_data = field;
+ trigger_data->private_data = obj_data;
INIT_LIST_HEAD(&trigger_data->list);
INIT_LIST_HEAD(&trigger_data->named_list);

if (glob[0] == '!') {
cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);
+ kfree(obj_data);
kfree(trigger_data);
ret = 0;
goto out;
@@ -463,6 +588,7 @@ event_object_trigger_callback(struct event_command *cmd_ops,
out_free:
if (cmd_ops->set_filter)
cmd_ops->set_filter(NULL, trigger_data, NULL);
+ kfree(obj_data);
kfree(trigger_data);
goto out;
}
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d747aed27104..12a971927d8c 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1553,6 +1553,7 @@ static enum print_line_t trace_object_print(struct trace_iterator *iter, int fla
trace_assign_type(field, iter->ent);
print_fn_trace(s, field->ip, field->parent_ip, flags);
trace_seq_printf(s, " object:0x%lx", field->object);
+ trace_seq_printf(s, " value:0x%lx", field->value);
trace_seq_putc(s, '\n');

return trace_handle_return(s);
@@ -1565,9 +1566,8 @@ static enum print_line_t trace_object_raw(struct trace_iterator *iter, int flags

trace_assign_type(field, iter->ent);

- trace_seq_printf(&iter->seq, "%lx %lx\n",
- field->ip,
- field->parent_ip);
+ trace_seq_printf(&iter->seq, "%lx %lx %lx %lx\n", field->ip,
+ field->parent_ip, field->object, field->value);

return trace_handle_return(&iter->seq);
}
--
2.25.1


2021-11-29 20:18:19

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 3/5] trace/README: Document objtrace trigger syntax

Add a short description of the use of objtrace trigger.

Signed-off-by: Jeff Xie <[email protected]>
---
kernel/trace/trace.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 88de94da596b..a84b6813769e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5590,6 +5590,9 @@ static const char readme_msg[] =
"\t enable_hist:<system>:<event>\n"
"\t disable_hist:<system>:<event>\n"
#endif
+#ifdef CONFIG_TRACE_OBJECT
+ "\t objtrace:add:obj[,offset][:type][:count][if <filter>]\n"
+#endif
#ifdef CONFIG_STACKTRACE
"\t\t stacktrace\n"
#endif
--
2.25.1


2021-11-29 20:18:19

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Introduce objtrace trigger to get the execution flow by tracing
any kernel object in the kernel parameter.

Signed-off-by: Jeff Xie <[email protected]>
---
include/linux/trace_events.h | 1 +
kernel/trace/Kconfig | 10 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.h | 8 +
kernel/trace/trace_entries.h | 17 +
kernel/trace/trace_events_trigger.c | 1 +
kernel/trace/trace_object.c | 534 ++++++++++++++++++++++++++++
kernel/trace/trace_output.c | 40 +++
8 files changed, 612 insertions(+)
create mode 100644 kernel/trace/trace_object.c

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2d167ac3452c..54baec78ce62 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -684,6 +684,7 @@ enum event_trigger_type {
ETT_EVENT_HIST = (1 << 4),
ETT_HIST_ENABLE = (1 << 5),
ETT_EVENT_EPROBE = (1 << 6),
+ ETT_TRACE_OBJECT = (1 << 7),
};

extern int filter_match_preds(struct event_filter *filter, void *rec);
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 420ff4bc67fd..f69761cbfaaa 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -237,6 +237,16 @@ config FUNCTION_PROFILER

If in doubt, say N.

+config TRACE_OBJECT
+ bool "Trace kernel object in function parameter"
+ depends on FUNCTION_TRACER
+ depends on HAVE_FUNCTION_ARG_ACCESS_API
+ select TRACING
+ default y
+ help
+ You can trace the kernel object in the kernel function parameter.
+ The kernel object is dynamically specified via event trigger.
+
config STACK_TRACER
bool "Trace max stack"
depends on HAVE_FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index bedc5caceec7..b924b8e55922 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -67,6 +67,7 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o
+obj-$(CONFIG_TRACE_OBJECT) += trace_object.o
ifeq ($(CONFIG_BLOCK),y)
obj-$(CONFIG_EVENT_TRACING) += blktrace.o
endif
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 38715aa6cfdf..46a1301ac47a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -54,6 +54,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_OBJECT,

__TRACE_LAST_TYPE,
};
@@ -469,6 +470,7 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct trace_object_entry, TRACE_OBJECT);\
__ftrace_bad_type(); \
} while (0)

@@ -1531,6 +1533,12 @@ static inline int register_trigger_hist_cmd(void) { return 0; }
static inline int register_trigger_hist_enable_disable_cmds(void) { return 0; }
#endif

+#ifdef CONFIG_TRACE_OBJECT
+extern int register_trigger_object_cmd(void);
+#else
+static inline int register_trigger_object_cmd(void) { return 0; }
+#endif
+
extern int register_trigger_cmds(void);
extern void clear_event_triggers(struct trace_array *tr);

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index cd41e863b51c..bb120d9498a9 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -401,3 +401,20 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
__entry->context,
__entry->timer_latency)
);
+
+/*
+ * trace object entry:
+ */
+FTRACE_ENTRY(object, trace_object_entry,
+
+ TRACE_OBJECT,
+
+ F_STRUCT(
+ __field( unsigned long, ip )
+ __field( unsigned long, parent_ip )
+ __field( unsigned long, object )
+ ),
+
+ F_printk(" %ps <-- %ps object:%lx\n",
+ (void *)__entry->ip, (void *)__entry->parent_ip, __entry->object)
+);
diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 3d5c07239a2a..da2d871fa809 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1687,6 +1687,7 @@ __init int register_trigger_cmds(void)
register_trigger_enable_disable_cmds();
register_trigger_hist_enable_disable_cmds();
register_trigger_hist_cmd();
+ register_trigger_object_cmd();

return 0;
}
diff --git a/kernel/trace/trace_object.c b/kernel/trace/trace_object.c
new file mode 100644
index 000000000000..83fccd587b63
--- /dev/null
+++ b/kernel/trace/trace_object.c
@@ -0,0 +1,534 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * trace the kernel object in the kernel function parameter
+ * Copyright (C) 2021 Jeff Xie <[email protected]>
+ */
+
+#define pr_fmt(fmt) "trace_object: " fmt
+
+#include "trace_output.h"
+#include <linux/freelist.h>
+
+static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
+static DEFINE_RAW_SPINLOCK(object_spin_lock);
+static struct trace_event_file event_trace_file;
+static const int max_args_num = 6;
+static const int max_obj_pool = 10;
+static atomic_t trace_object_ref;
+static int exit_trace_object(void);
+static int init_trace_object(void);
+
+struct object_instance {
+ void *object;
+ struct freelist_node free_list;
+ struct list_head active_list;
+};
+
+struct obj_pool {
+ struct freelist_head free_list;
+ struct list_head active_list;
+};
+static struct obj_pool *obj_pool;
+
+static bool object_exist(void *obj)
+{
+ struct object_instance *inst;
+ bool ret = false;
+
+ list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
+ if (inst->object == obj) {
+ ret = true;
+ goto out;
+ }
+ }
+out:
+ return ret;
+}
+
+static bool object_empty(void)
+{
+ return list_empty(&obj_pool->active_list);
+}
+
+static void set_trace_object(void *obj)
+{
+ struct freelist_node *fn;
+ struct object_instance *ins;
+ unsigned long flags;
+
+ if (in_nmi())
+ return;
+
+ if (!obj)
+ return;
+
+ if (object_exist(obj))
+ return;
+
+ fn = freelist_try_get(&obj_pool->free_list);
+ if (!fn) {
+ trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
+ return;
+ }
+
+ ins = container_of(fn, struct object_instance, free_list);
+ ins->object = obj;
+
+ raw_spin_lock_irqsave(&object_spin_lock, flags);
+ list_add_rcu(&ins->active_list, &obj_pool->active_list);
+ raw_spin_unlock_irqrestore(&object_spin_lock, flags);
+}
+
+static inline void free_free_list_objects(struct freelist_head *head)
+{
+
+ struct object_instance *inst;
+ struct freelist_node *node = head->head;
+
+ while (node) {
+ inst = container_of(node, struct object_instance, free_list);
+ node = node->next;
+ kfree(inst);
+ }
+}
+
+static inline void free_active_list_objects(struct list_head *head)
+{
+ struct object_instance *inst;
+
+ list_for_each_entry_rcu(inst, head, active_list)
+ kfree(inst);
+}
+
+static inline void free_object_pool(void)
+{
+ free_free_list_objects(&obj_pool->free_list);
+ free_active_list_objects(&obj_pool->active_list);
+ kfree(obj_pool);
+}
+
+
+static int init_object_pool(void)
+{
+ struct object_instance *inst;
+ int i, ret = 0;
+
+ obj_pool = kzalloc(sizeof(*obj_pool), GFP_KERNEL);
+ if (!obj_pool) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ INIT_LIST_HEAD(&obj_pool->active_list);
+
+ for (i = 0; i < max_obj_pool; i++) {
+ inst = kzalloc(sizeof(*inst), GFP_KERNEL);
+ if (!inst) {
+ free_object_pool();
+ ret = -ENOMEM;
+ goto out;
+ }
+ freelist_add(&inst->free_list, &obj_pool->free_list);
+ }
+out:
+ return ret;
+}
+
+static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
+ unsigned long object)
+{
+
+ struct trace_buffer *buffer;
+ struct ring_buffer_event *event;
+ struct trace_object_entry *entry;
+ int pc;
+
+ pc = preempt_count();
+ event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
+ TRACE_OBJECT, sizeof(*entry), pc);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ entry->ip = ip;
+ entry->parent_ip = parent_ip;
+ entry->object = object;
+
+ event_trigger_unlock_commit(&event_trace_file, buffer, event,
+ entry, pc);
+}
+
+static void
+trace_object_events_call(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct ftrace_regs *fregs)
+{
+ struct pt_regs *pt_regs = ftrace_get_regs(fregs);
+ unsigned long obj;
+ long disabled;
+ int cpu, n;
+
+ preempt_disable_notrace();
+
+ cpu = raw_smp_processor_id();
+ disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
+
+ if (disabled != 1)
+ goto out;
+
+ if (object_empty())
+ goto out;
+
+ for (n = 0; n < max_args_num; n++) {
+ obj = regs_get_kernel_argument(pt_regs, n);
+ if (object_exist((void *)obj))
+ submit_trace_object(ip, parent_ip, obj);
+ }
+
+out:
+ atomic_dec(&per_cpu(trace_object_event_disable, cpu));
+ preempt_enable_notrace();
+}
+
+static struct ftrace_ops trace_ops = {
+ .func = trace_object_events_call,
+ .flags = FTRACE_OPS_FL_SAVE_REGS,
+};
+
+static void
+trace_object_trigger(struct event_trigger_data *data,
+ struct trace_buffer *buffer, void *rec,
+ struct ring_buffer_event *event)
+{
+
+ struct ftrace_event_field *field = data->private_data;
+ void *obj = NULL;
+
+ memcpy(&obj, rec + field->offset, sizeof(obj));
+ set_trace_object(obj);
+}
+
+static void
+trace_object_trigger_free(struct event_trigger_ops *ops,
+ struct event_trigger_data *data)
+{
+ if (WARN_ON_ONCE(data->ref <= 0))
+ return;
+
+ data->ref--;
+ if (!data->ref)
+ trigger_data_free(data);
+}
+
+static void
+trace_object_count_trigger(struct event_trigger_data *data,
+ struct trace_buffer *buffer, void *rec,
+ struct ring_buffer_event *event)
+{
+ if (!data->count)
+ return;
+
+ if (data->count != -1)
+ (data->count)--;
+
+ trace_object_trigger(data, buffer, rec, event);
+}
+
+static int event_object_trigger_init(struct event_trigger_ops *ops,
+ struct event_trigger_data *data)
+{
+ data->ref++;
+ return 0;
+}
+
+static int
+event_trigger_print(const char *name, struct seq_file *m,
+ void *data, char *filter_str)
+{
+ long count = (long)data;
+
+ seq_puts(m, name);
+
+ if (count == -1)
+ seq_puts(m, ":unlimited");
+ else
+ seq_printf(m, ":count=%ld", count);
+
+ if (filter_str)
+ seq_printf(m, " if %s\n", filter_str);
+ else
+ seq_putc(m, '\n');
+
+ return 0;
+}
+
+static int
+trace_object_trigger_print(struct seq_file *m, struct event_trigger_ops *ops,
+ struct event_trigger_data *data)
+{
+ return event_trigger_print("objtrace", m, (void *)data->count,
+ data->filter_str);
+}
+
+static struct event_trigger_ops objecttrace_trigger_ops = {
+ .func = trace_object_trigger,
+ .print = trace_object_trigger_print,
+ .init = event_object_trigger_init,
+ .free = trace_object_trigger_free,
+};
+
+static struct event_trigger_ops objecttrace_count_trigger_ops = {
+ .func = trace_object_count_trigger,
+ .print = trace_object_trigger_print,
+ .init = event_object_trigger_init,
+ .free = trace_object_trigger_free,
+};
+
+static struct event_trigger_ops *
+objecttrace_get_trigger_ops(char *cmd, char *param)
+{
+ return param ? &objecttrace_count_trigger_ops : &objecttrace_trigger_ops;
+}
+
+static int register_object_trigger(char *glob, struct event_trigger_ops *ops,
+ struct event_trigger_data *data,
+ struct trace_event_file *file)
+{
+ struct event_trigger_data *test;
+ int ret = 0;
+
+ lockdep_assert_held(&event_mutex);
+
+ list_for_each_entry(test, &file->triggers, list) {
+ if (test->cmd_ops->trigger_type == data->cmd_ops->trigger_type) {
+ ret = -EEXIST;
+ goto out;
+ }
+ }
+
+ if (data->ops->init) {
+ ret = data->ops->init(data->ops, data);
+ if (ret < 0)
+ goto out;
+ }
+
+ list_add_rcu(&data->list, &file->triggers);
+ ret++;
+
+ update_cond_flag(file);
+ if (trace_event_trigger_enable_disable(file, 1) < 0) {
+ list_del_rcu(&data->list);
+ update_cond_flag(file);
+ ret--;
+ }
+ init_trace_object();
+out:
+ return ret;
+}
+
+static void unregister_object_trigger(char *glob, struct event_trigger_ops *ops,
+ struct event_trigger_data *test,
+ struct trace_event_file *file)
+{
+ struct event_trigger_data *data;
+ bool unregistered = false;
+
+ lockdep_assert_held(&event_mutex);
+
+ list_for_each_entry(data, &file->triggers, list) {
+ if (data->cmd_ops->trigger_type == test->cmd_ops->trigger_type) {
+ unregistered = true;
+ list_del_rcu(&data->list);
+ trace_event_trigger_enable_disable(file, 0);
+ update_cond_flag(file);
+ break;
+ }
+ }
+
+ if (unregistered && data->ops->free) {
+ data->ops->free(data->ops, data);
+ exit_trace_object();
+ }
+}
+
+static int
+event_object_trigger_callback(struct event_command *cmd_ops,
+ struct trace_event_file *file,
+ char *glob, char *cmd, char *param)
+{
+ struct event_trigger_data *trigger_data;
+ struct event_trigger_ops *trigger_ops;
+ struct trace_event_call *call;
+ struct ftrace_event_field *field;
+ char *objtrace_cmd;
+ char *trigger = NULL;
+ char *arg;
+ char *number;
+ int ret;
+
+ ret = -EINVAL;
+ if (!param)
+ goto out;
+
+ /* separate the trigger from the filter (c:a:n [if filter]) */
+ trigger = strsep(&param, " \t");
+ if (!trigger)
+ goto out;
+ if (param) {
+ param = skip_spaces(param);
+ if (!*param)
+ param = NULL;
+ }
+
+ objtrace_cmd = strsep(&trigger, ":");
+ if (!objtrace_cmd || strcmp(objtrace_cmd, "add"))
+ goto out;
+
+ arg = strsep(&trigger, ":");
+ if (!arg)
+ goto out;
+ call = file->event_call;
+ field = trace_find_event_field(call, arg);
+ if (!field)
+ goto out;
+
+ if (field->size != sizeof(void *))
+ goto out;
+
+ trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);
+
+ ret = -ENOMEM;
+ trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);
+ if (!trigger_data)
+ goto out;
+
+ trigger_data->count = -1;
+ trigger_data->ops = trigger_ops;
+ trigger_data->cmd_ops = cmd_ops;
+ trigger_data->private_data = field;
+ INIT_LIST_HEAD(&trigger_data->list);
+ INIT_LIST_HEAD(&trigger_data->named_list);
+
+ if (glob[0] == '!') {
+ cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);
+ kfree(trigger_data);
+ ret = 0;
+ goto out;
+ }
+
+ if (trigger) {
+ number = strsep(&trigger, ":");
+
+ ret = -EINVAL;
+ if (!strlen(number))
+ goto out_free;
+
+ /*
+ * We use the callback data field (which is a pointer)
+ * as our counter.
+ */
+ ret = kstrtoul(number, 0, &trigger_data->count);
+ if (ret)
+ goto out_free;
+ }
+
+ if (!param) /* if param is non-empty, it's supposed to be a filter */
+ goto out_reg;
+
+ if (!cmd_ops->set_filter)
+ goto out_reg;
+
+ ret = cmd_ops->set_filter(param, trigger_data, file);
+ if (ret < 0)
+ goto out_free;
+
+ out_reg:
+ /* Up the trigger_data count to make sure reg doesn't free it on failure */
+ event_object_trigger_init(trigger_ops, trigger_data);
+ ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
+ /*
+ * The above returns on success the # of functions enabled,
+ * but if it didn't find any functions it returns zero.
+ * Consider no functions a failure too.
+ */
+ if (!ret) {
+ cmd_ops->unreg(glob, trigger_ops, trigger_data, file);
+ ret = -ENOENT;
+ } else if (ret > 0)
+ ret = 0;
+
+ /* Down the counter of trigger_data or free it if not used anymore */
+ trace_object_trigger_free(trigger_ops, trigger_data);
+ out:
+ return ret;
+
+ out_free:
+ if (cmd_ops->set_filter)
+ cmd_ops->set_filter(NULL, trigger_data, NULL);
+ kfree(trigger_data);
+ goto out;
+}
+
+static struct event_command trigger_object_cmd = {
+ .name = "objtrace",
+ .trigger_type = ETT_TRACE_OBJECT,
+ .flags = EVENT_CMD_FL_NEEDS_REC,
+ .func = event_object_trigger_callback,
+ .reg = register_object_trigger,
+ .unreg = unregister_object_trigger,
+ .get_trigger_ops = objecttrace_get_trigger_ops,
+ .set_filter = set_trigger_filter,
+};
+
+__init int register_trigger_object_cmd(void)
+{
+ int ret;
+
+ ret = register_event_command(&trigger_object_cmd);
+ WARN_ON(ret < 0);
+
+ return ret;
+}
+
+static int init_trace_object(void)
+{
+ int ret;
+
+ if (atomic_inc_return(&trace_object_ref) != 1) {
+ ret = 0;
+ goto out;
+ }
+
+ ret = init_object_pool();
+ if (ret)
+ goto out;
+
+ event_trace_file.tr = top_trace_array();
+ if (WARN_ON(!event_trace_file.tr)) {
+ ret = -1;
+ goto out;
+ }
+ ret = register_ftrace_function(&trace_ops);
+out:
+ return ret;
+}
+
+static int exit_trace_object(void)
+{
+ int ret;
+
+ if (WARN_ON_ONCE(atomic_read(&trace_object_ref) <= 0))
+ goto out;
+
+ if (atomic_dec_return(&trace_object_ref) != 0) {
+ ret = 0;
+ goto out;
+ }
+
+ ret = unregister_ftrace_function(&trace_ops);
+ if (ret) {
+ pr_err("can't unregister ftrace for trace object\n");
+ goto out;
+ }
+ free_object_pool();
+out:
+ return ret;
+}
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3547e7176ff7..d747aed27104 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1543,6 +1543,45 @@ static struct trace_event trace_func_repeats_event = {
.funcs = &trace_func_repeats_funcs,
};

+/* TRACE_OBJECT */
+static enum print_line_t trace_object_print(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_object_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ print_fn_trace(s, field->ip, field->parent_ip, flags);
+ trace_seq_printf(s, " object:0x%lx", field->object);
+ trace_seq_putc(s, '\n');
+
+ return trace_handle_return(s);
+}
+
+static enum print_line_t trace_object_raw(struct trace_iterator *iter, int flags,
+ struct trace_event *event)
+{
+ struct trace_object_entry *field;
+
+ trace_assign_type(field, iter->ent);
+
+ trace_seq_printf(&iter->seq, "%lx %lx\n",
+ field->ip,
+ field->parent_ip);
+
+ return trace_handle_return(&iter->seq);
+}
+
+static struct trace_event_functions trace_object_funcs = {
+ .trace = trace_object_print,
+ .raw = trace_object_raw,
+};
+
+static struct trace_event trace_object_event = {
+ .type = TRACE_OBJECT,
+ .funcs = &trace_object_funcs,
+};
+
static struct trace_event *events[] __initdata = {
&trace_fn_event,
&trace_ctx_event,
@@ -1557,6 +1596,7 @@ static struct trace_event *events[] __initdata = {
&trace_timerlat_event,
&trace_raw_data_event,
&trace_func_repeats_event,
+ &trace_object_event,
NULL
};

--
2.25.1


2021-11-29 20:18:19

by Jeff Xie

[permalink] [raw]
Subject: [RFC][PATCH v6 4/5] trace/objtrace: Add testcases for objtrace

Add a series of testcases to illustrate correct and incorrect usage of
objtrace trigger.

Signed-off-by: Jeff Xie <[email protected]>
---
.../ftrace/test.d/trigger/trigger-objtrace.tc | 39 +++++++++++++++++++
1 file changed, 39 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc

diff --git a/tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc b/tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
new file mode 100644
index 000000000000..d71eb157c51f
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
@@ -0,0 +1,39 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: event trigger - test objtrace-trigger
+# requires: kprobe_events "objtrace":README
+
+fail() { #msg
+ echo $1
+ exit_fail
+}
+
+echo 'p bio_add_page arg1=$arg1' > kprobe_events
+
+FEATURE=`grep objtrace events/kprobes/p_bio_add_page_0/trigger`
+if [ -z "$FEATURE" ]; then
+ echo "objtrace trigger is not supported"
+ exit_unsupported
+fi
+
+echo "Test objtrace trigger"
+echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > \
+ events/kprobes/p_bio_add_page_0/trigger
+if [ -z $? ]; then
+ fail "objtrace trigger syntax error"
+fi
+
+echo "Test objtrace semantic errors"
+
+# Being lack of objtrace command
+! echo 'objtrace:arg1,0x28:u32:1' > events/kprobes/p_bio_add_page_0/trigger
+# Bad parameter name
+! echo 'objtrace:add:argx:u32:1' > events/kprobes/p_bio_add_page_0/trigger
+
+echo "reset objtrace trigger"
+
+echo '!objtrace:add:arg1,0x28:u32' > \
+ events/kprobes/p_bio_add_page_0/trigger
+echo '-:p_bio_add_page_0' >> ./kprobe_events
+
+exit 0
--
2.25.1


2021-12-10 16:55:47

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

hi Steve and Masami,

I really don't know what else I can do to enhance this patchset, if
this idea is really valuable and
can play its role, it may need the help of you or other developers ;-)

On Tue, Nov 30, 2021 at 12:50 AM Jeff Xie <[email protected]> wrote:
>
> Introduce a method based on function tracer to trace any object and get
> the value of the object dynamically. the object can be obtained from the
> dynamic event (kprobe_event/uprobe_event) or the static event(tracepoint).
>
> Usage:
> When using the kprobe event, only need to set the objtrace(a new trigger),
> we can get the value of the object. The object is from the setting of the
> kprobe event.
>
> For example:
> For the function bio_add_page():
>
> int bio_add_page(struct bio *bio, struct page *page,
> unsigned int len, unsigned int offset)
>
> Firstly, we can set the base of the object, thus the first string "arg1"
> stands for the value of the first parameter of this function bio_add_gage(),
>
> # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
>
> Secondly, we can get the value dynamically base the above object.
>
> find the offset of the bi_size in struct bio:
> $ gdb vmlinux
> (gdb) p &(((struct bio *)0)->bi_iter.bi_size)
> $1 = (unsigned int *) 0x28
>
> # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
> p_bio_add_page_0/trigger
>
> The best way to use this is that we can set the entrance event and exit
> event, for example, the following example is to set the read_papes as
> the entrance event, and set the __blk_account_io_start as the exit event.
>
> # cd /sys/kernel/debug/tracing/
> # echo 0 > ./tracing_on
> # echo 'p read_pages' >> ./kprobe_events
> # echo 'p __blk_account_io_start' >> ./kprobe_events
> # echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
> # echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
> # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger
>
> # du -sh /test.txt
> 12.0K /test.txt
>
> # cat /test.txt > /dev/null
> # cat ./trace
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 50/50 #P:1
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / _-=> migrate-disable
> # |||| / delay
> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> # | | | ||||| | |
> cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000
>
> Jeff Xie (5):
> trace: Add trace any kernel object
> trace/objtrace: get the value of the object
> trace/README: Document objtrace trigger syntax
> trace/objtrace: Add testcases for objtrace
> trace/objtrace: Add documentation for objtrace
>
> Documentation/trace/events.rst | 114 +++
> include/linux/trace_events.h | 1 +
> kernel/trace/Kconfig | 10 +
> kernel/trace/Makefile | 1 +
> kernel/trace/trace.c | 3 +
> kernel/trace/trace.h | 8 +
> kernel/trace/trace_entries.h | 18 +
> kernel/trace/trace_events_trigger.c | 1 +
> kernel/trace/trace_object.c | 660 ++++++++++++++++++
> kernel/trace/trace_output.c | 40 ++
> .../ftrace/test.d/trigger/trigger-objtrace.tc | 39 ++
> 11 files changed, 895 insertions(+)
> create mode 100644 kernel/trace/trace_object.c
> create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
>
> --
> 2.25.1
>

Thanks,
---
JeffXie

2021-12-10 17:31:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

On Sat, 11 Dec 2021 00:55:31 +0800
Jeff Xie <[email protected]> wrote:

> hi Steve and Masami,
>
> I really don't know what else I can do to enhance this patchset, if
> this idea is really valuable and
> can play its role, it may need the help of you or other developers ;-)

Hi Jeff,

I've been very busy with other things, but I may have time soon to look at
your patch set. After I get a chance to play with it, I'm certain I will be
coming back with some more comments ;-)

Sorry about the delay, but my day job got in the way :-p

-- Steve

2021-12-13 10:39:04

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

Hi Jeff,

On Sat, 11 Dec 2021 00:55:31 +0800
Jeff Xie <[email protected]> wrote:

> hi Steve and Masami,
>
> I really don't know what else I can do to enhance this patchset, if
> this idea is really valuable and
> can play its role, it may need the help of you or other developers ;-)

Thanks for ping me. Let me review it this week.

Thank you!

>
> On Tue, Nov 30, 2021 at 12:50 AM Jeff Xie <[email protected]> wrote:
> >
> > Introduce a method based on function tracer to trace any object and get
> > the value of the object dynamically. the object can be obtained from the
> > dynamic event (kprobe_event/uprobe_event) or the static event(tracepoint).
> >
> > Usage:
> > When using the kprobe event, only need to set the objtrace(a new trigger),
> > we can get the value of the object. The object is from the setting of the
> > kprobe event.
> >
> > For example:
> > For the function bio_add_page():
> >
> > int bio_add_page(struct bio *bio, struct page *page,
> > unsigned int len, unsigned int offset)
> >
> > Firstly, we can set the base of the object, thus the first string "arg1"
> > stands for the value of the first parameter of this function bio_add_gage(),
> >
> > # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> >
> > Secondly, we can get the value dynamically base the above object.
> >
> > find the offset of the bi_size in struct bio:
> > $ gdb vmlinux
> > (gdb) p &(((struct bio *)0)->bi_iter.bi_size)
> > $1 = (unsigned int *) 0x28
> >
> > # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
> > p_bio_add_page_0/trigger
> >
> > The best way to use this is that we can set the entrance event and exit
> > event, for example, the following example is to set the read_papes as
> > the entrance event, and set the __blk_account_io_start as the exit event.
> >
> > # cd /sys/kernel/debug/tracing/
> > # echo 0 > ./tracing_on
> > # echo 'p read_pages' >> ./kprobe_events
> > # echo 'p __blk_account_io_start' >> ./kprobe_events
> > # echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
> > # echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
> > # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> > # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger
> >
> > # du -sh /test.txt
> > 12.0K /test.txt
> >
> > # cat /test.txt > /dev/null
> > # cat ./trace
> >
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 50/50 #P:1
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / _-=> migrate-disable
> > # |||| / delay
> > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> > # | | | ||||| | |
> > cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000
> >
> > Jeff Xie (5):
> > trace: Add trace any kernel object
> > trace/objtrace: get the value of the object
> > trace/README: Document objtrace trigger syntax
> > trace/objtrace: Add testcases for objtrace
> > trace/objtrace: Add documentation for objtrace
> >
> > Documentation/trace/events.rst | 114 +++
> > include/linux/trace_events.h | 1 +
> > kernel/trace/Kconfig | 10 +
> > kernel/trace/Makefile | 1 +
> > kernel/trace/trace.c | 3 +
> > kernel/trace/trace.h | 8 +
> > kernel/trace/trace_entries.h | 18 +
> > kernel/trace/trace_events_trigger.c | 1 +
> > kernel/trace/trace_object.c | 660 ++++++++++++++++++
> > kernel/trace/trace_output.c | 40 ++
> > .../ftrace/test.d/trigger/trigger-objtrace.tc | 39 ++
> > 11 files changed, 895 insertions(+)
> > create mode 100644 kernel/trace/trace_object.c
> > create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
> >
> > --
> > 2.25.1
> >
>
> Thanks,
> ---
> JeffXie


--
Masami Hiramatsu <[email protected]>

2021-12-16 14:50:25

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

Hi Jeff,

On Tue, 30 Nov 2021 00:49:46 +0800
Jeff Xie <[email protected]> wrote:

> Introduce a method based on function tracer to trace any object and get
> the value of the object dynamically. the object can be obtained from the
> dynamic event (kprobe_event/uprobe_event) or the static event(tracepoint).

Now you can remove [RFC] tag from this series, I think.
And can you also add a changelog (short description of the changes) from
the previous version, so that reviewers can find the difference
point easily? Unless that, I need to make a diff from your previous version. :(

Thank you,

>
> Usage:
> When using the kprobe event, only need to set the objtrace(a new trigger),
> we can get the value of the object. The object is from the setting of the
> kprobe event.
>
> For example:
> For the function bio_add_page():
>
> int bio_add_page(struct bio *bio, struct page *page,
> unsigned int len, unsigned int offset)
>
> Firstly, we can set the base of the object, thus the first string "arg1"
> stands for the value of the first parameter of this function bio_add_gage(),
>
> # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
>
> Secondly, we can get the value dynamically base the above object.
>
> find the offset of the bi_size in struct bio:
> $ gdb vmlinux
> (gdb) p &(((struct bio *)0)->bi_iter.bi_size)
> $1 = (unsigned int *) 0x28
>
> # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
> p_bio_add_page_0/trigger
>
> The best way to use this is that we can set the entrance event and exit
> event, for example, the following example is to set the read_papes as
> the entrance event, and set the __blk_account_io_start as the exit event.
>
> # cd /sys/kernel/debug/tracing/
> # echo 0 > ./tracing_on
> # echo 'p read_pages' >> ./kprobe_events
> # echo 'p __blk_account_io_start' >> ./kprobe_events
> # echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
> # echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
> # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger
>
> # du -sh /test.txt
> 12.0K /test.txt
>
> # cat /test.txt > /dev/null
> # cat ./trace
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 50/50 #P:1
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / _-=> migrate-disable
> # |||| / delay
> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> # | | | ||||| | |
> cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
> cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
> cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
> cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000
>
> Jeff Xie (5):
> trace: Add trace any kernel object
> trace/objtrace: get the value of the object
> trace/README: Document objtrace trigger syntax
> trace/objtrace: Add testcases for objtrace
> trace/objtrace: Add documentation for objtrace
>
> Documentation/trace/events.rst | 114 +++
> include/linux/trace_events.h | 1 +
> kernel/trace/Kconfig | 10 +
> kernel/trace/Makefile | 1 +
> kernel/trace/trace.c | 3 +
> kernel/trace/trace.h | 8 +
> kernel/trace/trace_entries.h | 18 +
> kernel/trace/trace_events_trigger.c | 1 +
> kernel/trace/trace_object.c | 660 ++++++++++++++++++
> kernel/trace/trace_output.c | 40 ++
> .../ftrace/test.d/trigger/trigger-objtrace.tc | 39 ++
> 11 files changed, 895 insertions(+)
> create mode 100644 kernel/trace/trace_object.c
> create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
>
> --
> 2.25.1
>


--
Masami Hiramatsu <[email protected]>

2021-12-16 15:02:41

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 3/5] trace/README: Document objtrace trigger syntax

Hi Jeff,

On Tue, 30 Nov 2021 00:49:49 +0800
Jeff Xie <[email protected]> wrote:

> Add a short description of the use of objtrace trigger.

Please merge this into the [1/5] and [2/5], because the changes of
"a new feature" and "the existance of the feature which
kernel indicates" must be atomic.

My recommendation is

- add following line in the first patch

"\t objtrace:add:obj[:count][if <filter>]\n"

- and update it as following in the next patch according to the changes

"\t objtrace:add:obj[,offset][:type][:count][if <filter>]\n"


Thank you!


>
> Signed-off-by: Jeff Xie <[email protected]>
> ---
> kernel/trace/trace.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 88de94da596b..a84b6813769e 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -5590,6 +5590,9 @@ static const char readme_msg[] =
> "\t enable_hist:<system>:<event>\n"
> "\t disable_hist:<system>:<event>\n"
> #endif
> +#ifdef CONFIG_TRACE_OBJECT
> + "\t objtrace:add:obj[,offset][:type][:count][if <filter>]\n"
> +#endif
> #ifdef CONFIG_STACKTRACE
> "\t\t stacktrace\n"
> #endif
> --
> 2.25.1
>


--
Masami Hiramatsu <[email protected]>

2021-12-17 01:43:29

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 0/5] trace: Introduce objtrace trigger to trace the kernel object

Hi Masami,

Thanks for your review.

On Thu, Dec 16, 2021 at 10:50 PM Masami Hiramatsu <[email protected]> wrote:
>
> Hi Jeff,
>
> On Tue, 30 Nov 2021 00:49:46 +0800
> Jeff Xie <[email protected]> wrote:
>
> > Introduce a method based on function tracer to trace any object and get
> > the value of the object dynamically. the object can be obtained from the
> > dynamic event (kprobe_event/uprobe_event) or the static event(tracepoint).
>
> Now you can remove [RFC] tag from this series, I think.
> And can you also add a changelog (short description of the changes) from
> the previous version, so that reviewers can find the difference
> point easily? Unless that, I need to make a diff from your previous version. :(

Ok, I will remove the [RFC] and add changelog.

> Thank you,
>
> >
> > Usage:
> > When using the kprobe event, only need to set the objtrace(a new trigger),
> > we can get the value of the object. The object is from the setting of the
> > kprobe event.
> >
> > For example:
> > For the function bio_add_page():
> >
> > int bio_add_page(struct bio *bio, struct page *page,
> > unsigned int len, unsigned int offset)
> >
> > Firstly, we can set the base of the object, thus the first string "arg1"
> > stands for the value of the first parameter of this function bio_add_gage(),
> >
> > # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> >
> > Secondly, we can get the value dynamically base the above object.
> >
> > find the offset of the bi_size in struct bio:
> > $ gdb vmlinux
> > (gdb) p &(((struct bio *)0)->bi_iter.bi_size)
> > $1 = (unsigned int *) 0x28
> >
> > # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/ \
> > p_bio_add_page_0/trigger
> >
> > The best way to use this is that we can set the entrance event and exit
> > event, for example, the following example is to set the read_papes as
> > the entrance event, and set the __blk_account_io_start as the exit event.
> >
> > # cd /sys/kernel/debug/tracing/
> > # echo 0 > ./tracing_on
> > # echo 'p read_pages' >> ./kprobe_events
> > # echo 'p __blk_account_io_start' >> ./kprobe_events
> > # echo 'traceon if comm == "cat"' > ./events/kprobes/p_read_pages_0/trigger
> > # echo 'traceoff if comm == "cat"' > ./events/kprobes/p___blk_account_io_start_0/trigger
> > # echo 'p bio_add_page arg1=$arg1' >> ./kprobe_events
> > # echo 'objtrace:add:arg1,0x28:u32:1 if comm == "cat"' > ./events/kprobes/p_bio_add_page_0/trigger
> >
> > # du -sh /test.txt
> > 12.0K /test.txt
> >
> > # cat /test.txt > /dev/null
> > # cat ./trace
> >
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 50/50 #P:1
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / _-=> migrate-disable
> > # |||| / delay
> > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> > # | | | ||||| | |
> > cat-95 [000] ..... 1.412065: _raw_spin_unlock_irqrestore <-event_triggers_call object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x0
> > cat-95 [000] ..... 1.412066: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412066: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412066: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: __bio_add_page <-bio_add_page object:0xffff888108af6328 value:0x1000
> > cat-95 [000] ..... 1.412067: rcu_read_unlock_strict <-xa_load object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: bio_add_page <-ext4_mpage_readpages object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412067: __bio_try_merge_page <-bio_add_page object:0xffff888108af6328 value:0x2000
> > cat-95 [000] ..... 1.412068: submit_bio <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: __submit_bio <-submit_bio_noacct object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: blk_try_enter_queue <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_read_unlock_strict <-blk_try_enter_queue object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: submit_bio_checks <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: __cond_resched <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: rcu_all_qs <-__cond_resched object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412068: should_fail_bio <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: create_task_io_context <-submit_bio_checks object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: kmem_cache_alloc_node <-create_task_io_context object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: should_failslab <-kmem_cache_alloc_node object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: _raw_spin_lock <-create_task_io_context object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: blk_mq_submit_bio <-__submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: __blk_queue_split <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: bvec_split_segs <-__blk_queue_split object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412069: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: dd_bio_merge <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: _raw_spin_lock <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: blk_mq_sched_try_merge <-dd_bio_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_rqhash_find <-elv_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: dd_request_merge <-blk_mq_sched_try_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: elv_rb_find <-dd_request_merge object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412070: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: dd_limit_depth <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: __blk_mq_get_tag <-blk_mq_get_tag object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: blk_mq_rq_ctx_init.isra.0 <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: ktime_get <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: dd_prepare_request <-blk_mq_rq_ctx_init.isra.0 object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: __blk_account_io_start <-blk_mq_submit_bio object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: kprobe_ftrace_handler <-ftrace_ops_list_func object:0xffff888108af6328 value:0x3000
> > cat-95 [000] ..... 1.412071: get_kprobe <-kprobe_ftrace_handler object:0xffff888108af6328 value:0x3000
> >
> > Jeff Xie (5):
> > trace: Add trace any kernel object
> > trace/objtrace: get the value of the object
> > trace/README: Document objtrace trigger syntax
> > trace/objtrace: Add testcases for objtrace
> > trace/objtrace: Add documentation for objtrace
> >
> > Documentation/trace/events.rst | 114 +++
> > include/linux/trace_events.h | 1 +
> > kernel/trace/Kconfig | 10 +
> > kernel/trace/Makefile | 1 +
> > kernel/trace/trace.c | 3 +
> > kernel/trace/trace.h | 8 +
> > kernel/trace/trace_entries.h | 18 +
> > kernel/trace/trace_events_trigger.c | 1 +
> > kernel/trace/trace_object.c | 660 ++++++++++++++++++
> > kernel/trace/trace_output.c | 40 ++
> > .../ftrace/test.d/trigger/trigger-objtrace.tc | 39 ++
> > 11 files changed, 895 insertions(+)
> > create mode 100644 kernel/trace/trace_object.c
> > create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-objtrace.tc
> >
> > --
> > 2.25.1
> >
>
>
> --
> Masami Hiramatsu <[email protected]>

---
JeffXie

2021-12-17 01:46:06

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 3/5] trace/README: Document objtrace trigger syntax

Hi Masami,

On Thu, Dec 16, 2021 at 11:02 PM Masami Hiramatsu <[email protected]> wrote:
>
> Hi Jeff,
>
> On Tue, 30 Nov 2021 00:49:49 +0800
> Jeff Xie <[email protected]> wrote:
>
> > Add a short description of the use of objtrace trigger.
>
> Please merge this into the [1/5] and [2/5], because the changes of
> "a new feature" and "the existance of the feature which
> kernel indicates" must be atomic.
>
> My recommendation is
>
> - add following line in the first patch
>
> "\t objtrace:add:obj[:count][if <filter>]\n"
>
> - and update it as following in the next patch according to the changes
>
> "\t objtrace:add:obj[,offset][:type][:count][if <filter>]\n"

I will merge them, Thanks.

>
> Thank you!
>
>
> >
> > Signed-off-by: Jeff Xie <[email protected]>
> > ---
> > kernel/trace/trace.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 88de94da596b..a84b6813769e 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -5590,6 +5590,9 @@ static const char readme_msg[] =
> > "\t enable_hist:<system>:<event>\n"
> > "\t disable_hist:<system>:<event>\n"
> > #endif
> > +#ifdef CONFIG_TRACE_OBJECT
> > + "\t objtrace:add:obj[,offset][:type][:count][if <filter>]\n"
> > +#endif
> > #ifdef CONFIG_STACKTRACE
> > "\t\t stacktrace\n"
> > #endif
> > --
> > 2.25.1
> >
>
>
> --
> Masami Hiramatsu <[email protected]>
---
JeffXie

2021-12-17 04:51:57

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

On Tue, 30 Nov 2021 00:49:47 +0800
Jeff Xie <[email protected]> wrote:

> Introduce objtrace trigger to get the execution flow by tracing
> any kernel object in the kernel parameter.

Here, please add simple usage (no need to add output) since we can
see only this comment after merging this on git log.
If you run `git log -- kernel/trace/` and search "tracing: Add ",
then you can see more examples to add a new feature. ;)

What would you think this?
-----
Introduce objtrace trigger to get the call flow by tracing
any kernel object in the function parameter.

The objtrace trigger makes a list of the target object address from
the given event parameter, and records all kernel function calls
which has the object address in its parameter.

User can identify the kernel call flow related to the object easily.

Syntax:
...

Usage:
...
-----

>
> Signed-off-by: Jeff Xie <[email protected]>
> ---
> include/linux/trace_events.h | 1 +
> kernel/trace/Kconfig | 10 +
> kernel/trace/Makefile | 1 +
> kernel/trace/trace.h | 8 +
> kernel/trace/trace_entries.h | 17 +
> kernel/trace/trace_events_trigger.c | 1 +
> kernel/trace/trace_object.c | 534 ++++++++++++++++++++++++++++
> kernel/trace/trace_output.c | 40 +++
> 8 files changed, 612 insertions(+)
> create mode 100644 kernel/trace/trace_object.c
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 2d167ac3452c..54baec78ce62 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -684,6 +684,7 @@ enum event_trigger_type {
> ETT_EVENT_HIST = (1 << 4),
> ETT_HIST_ENABLE = (1 << 5),
> ETT_EVENT_EPROBE = (1 << 6),
> + ETT_TRACE_OBJECT = (1 << 7),
> };
>
> extern int filter_match_preds(struct event_filter *filter, void *rec);
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 420ff4bc67fd..f69761cbfaaa 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -237,6 +237,16 @@ config FUNCTION_PROFILER
>
> If in doubt, say N.
>
> +config TRACE_OBJECT
> + bool "Trace kernel object in function parameter"
> + depends on FUNCTION_TRACER
> + depends on HAVE_FUNCTION_ARG_ACCESS_API
> + select TRACING
> + default y
> + help
> + You can trace the kernel object in the kernel function parameter.
> + The kernel object is dynamically specified via event trigger.
> +
> config STACK_TRACER
> bool "Trace max stack"
> depends on HAVE_FUNCTION_TRACER
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index bedc5caceec7..b924b8e55922 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -67,6 +67,7 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
> obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
> obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o
> +obj-$(CONFIG_TRACE_OBJECT) += trace_object.o
> ifeq ($(CONFIG_BLOCK),y)
> obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> endif
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 38715aa6cfdf..46a1301ac47a 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -54,6 +54,7 @@ enum trace_type {
> TRACE_TIMERLAT,
> TRACE_RAW_DATA,
> TRACE_FUNC_REPEATS,
> + TRACE_OBJECT,
>
> __TRACE_LAST_TYPE,
> };
> @@ -469,6 +470,7 @@ extern void __ftrace_bad_type(void);
> TRACE_GRAPH_RET); \
> IF_ASSIGN(var, ent, struct func_repeats_entry, \
> TRACE_FUNC_REPEATS); \
> + IF_ASSIGN(var, ent, struct trace_object_entry, TRACE_OBJECT);\
> __ftrace_bad_type(); \
> } while (0)
>
> @@ -1531,6 +1533,12 @@ static inline int register_trigger_hist_cmd(void) { return 0; }
> static inline int register_trigger_hist_enable_disable_cmds(void) { return 0; }
> #endif
>
> +#ifdef CONFIG_TRACE_OBJECT
> +extern int register_trigger_object_cmd(void);
> +#else
> +static inline int register_trigger_object_cmd(void) { return 0; }
> +#endif
> +
> extern int register_trigger_cmds(void);
> extern void clear_event_triggers(struct trace_array *tr);
>
> diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
> index cd41e863b51c..bb120d9498a9 100644
> --- a/kernel/trace/trace_entries.h
> +++ b/kernel/trace/trace_entries.h
> @@ -401,3 +401,20 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
> __entry->context,
> __entry->timer_latency)
> );
> +
> +/*
> + * trace object entry:
> + */
> +FTRACE_ENTRY(object, trace_object_entry,
> +
> + TRACE_OBJECT,
> +
> + F_STRUCT(
> + __field( unsigned long, ip )
> + __field( unsigned long, parent_ip )
> + __field( unsigned long, object )
> + ),
> +
> + F_printk(" %ps <-- %ps object:%lx\n",
> + (void *)__entry->ip, (void *)__entry->parent_ip, __entry->object)
> +);
> diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
> index 3d5c07239a2a..da2d871fa809 100644
> --- a/kernel/trace/trace_events_trigger.c
> +++ b/kernel/trace/trace_events_trigger.c
> @@ -1687,6 +1687,7 @@ __init int register_trigger_cmds(void)
> register_trigger_enable_disable_cmds();
> register_trigger_hist_enable_disable_cmds();
> register_trigger_hist_cmd();
> + register_trigger_object_cmd();
>
> return 0;
> }
> diff --git a/kernel/trace/trace_object.c b/kernel/trace/trace_object.c
> new file mode 100644
> index 000000000000..83fccd587b63
> --- /dev/null
> +++ b/kernel/trace/trace_object.c
> @@ -0,0 +1,534 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * trace the kernel object in the kernel function parameter
> + * Copyright (C) 2021 Jeff Xie <[email protected]>
> + */
> +
> +#define pr_fmt(fmt) "trace_object: " fmt
> +
> +#include "trace_output.h"
> +#include <linux/freelist.h>
> +
> +static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);

atomic_t is for atomic operation which must be shared among cpus. On the
other hand, per-cpu variable is used for the core-local storage or flags,
other cpus never touch it. Thus the per-cpu atomic_t is very strange.

> +static DEFINE_RAW_SPINLOCK(object_spin_lock);
> +static struct trace_event_file event_trace_file;
> +static const int max_args_num = 6;
> +static const int max_obj_pool = 10;
> +static atomic_t trace_object_ref;
> +static int exit_trace_object(void);
> +static int init_trace_object(void);
> +

Please add more comments to the code itself. Explain why this is needed
and how it works for which case. That will lead deeper understanding.

> +struct object_instance {
> + void *object;
> + struct freelist_node free_list;
> + struct list_head active_list;
> +};
> +
> +struct obj_pool {
> + struct freelist_head free_list;
> + struct list_head active_list;
> +};
> +static struct obj_pool *obj_pool;
> +
> +static bool object_exist(void *obj)
> +{
> + struct object_instance *inst;
> + bool ret = false;
> +
> + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> + if (inst->object == obj) {
> + ret = true;
> + goto out;
> + }
> + }
> +out:
> + return ret;
> +}
> +
> +static bool object_empty(void)
> +{
> + return list_empty(&obj_pool->active_list);
> +}
> +
> +static void set_trace_object(void *obj)
> +{
> + struct freelist_node *fn;
> + struct object_instance *ins;
> + unsigned long flags;
> +
> + if (in_nmi())
> + return;
> +
> + if (!obj)
> + return;
> +
> + if (object_exist(obj))
> + return;
> +
> + fn = freelist_try_get(&obj_pool->free_list);
> + if (!fn) {
> + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> + return;
> + }
> +
> + ins = container_of(fn, struct object_instance, free_list);
> + ins->object = obj;
> +
> + raw_spin_lock_irqsave(&object_spin_lock, flags);
> + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> + raw_spin_unlock_irqrestore(&object_spin_lock, flags);

Please add a comment that why this spinlock is needed here and why
other operation doesn't need it.

> +}
> +
> +static inline void free_free_list_objects(struct freelist_head *head)
> +{
> +
> + struct object_instance *inst;
> + struct freelist_node *node = head->head;
> +
> + while (node) {
> + inst = container_of(node, struct object_instance, free_list);
> + node = node->next;
> + kfree(inst);

Note that this is safe only if this free_list is no more accessed.

> + }
> +}
> +
> +static inline void free_active_list_objects(struct list_head *head)
> +{
> + struct object_instance *inst;
> +
> + list_for_each_entry_rcu(inst, head, active_list)
> + kfree(inst);

No, you can not do this, since the 'inst' is still on the active_list.
Before freeing the 'inst', you need to do list_del_rcu().
Moreover, the 'inst' might be still accessed from the callbacks,
you can not directly free it.

RCU ensures that looping on a list can be done safely before and after
removing/adding a node, but not ensure it is freeable soon, because
someone still looking the node.

I guess you may expect that all callbacks are stopped after calling

unregister_ftrace_function(&trace_ops);

but that is wrong. That don't synchronize the callbacks (= wait
for finishing all running callbacks).
So, you should use synchronize_rcu() after that.

> +}
> +
> +static inline void free_object_pool(void)
> +{
> + free_free_list_objects(&obj_pool->free_list);
> + free_active_list_objects(&obj_pool->active_list);

BTW, did you also ensure all the trigger callback has been done?
set_trace_object() has a small time window when the object_instance
is not linked to both of the free_list and the active_list.
Thus you have to make sure that the function is never called while
this function. (How to do it? please write the comment about it)

> + kfree(obj_pool);
> +}
> +
> +
> +static int init_object_pool(void)
> +{
> + struct object_instance *inst;
> + int i, ret = 0;
> +
> + obj_pool = kzalloc(sizeof(*obj_pool), GFP_KERNEL);
> + if (!obj_pool) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + INIT_LIST_HEAD(&obj_pool->active_list);
> +
> + for (i = 0; i < max_obj_pool; i++) {
> + inst = kzalloc(sizeof(*inst), GFP_KERNEL);
> + if (!inst) {
> + free_object_pool();
> + ret = -ENOMEM;
> + goto out;
> + }
> + freelist_add(&inst->free_list, &obj_pool->free_list);
> + }
> +out:
> + return ret;
> +}
> +
> +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> + unsigned long object)
> +{
> +
> + struct trace_buffer *buffer;
> + struct ring_buffer_event *event;
> + struct trace_object_entry *entry;
> + int pc;
> +
> + pc = preempt_count();
> + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> + TRACE_OBJECT, sizeof(*entry), pc);
> + if (!event)
> + return;
> + entry = ring_buffer_event_data(event);
> + entry->ip = ip;
> + entry->parent_ip = parent_ip;
> + entry->object = object;
> +
> + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> + entry, pc);
> +}
> +
> +static void
> +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> + struct ftrace_ops *op, struct ftrace_regs *fregs)
> +{
> + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> + unsigned long obj;
> + long disabled;
> + int cpu, n;
> +
> + preempt_disable_notrace();
> +
> + cpu = raw_smp_processor_id();
> + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));

So what is the purpose of this check? (are there any issue if the same
cpu reenter here?)

> +
> + if (disabled != 1)
> + goto out;
> +
> + if (object_empty())
> + goto out;
> +
> + for (n = 0; n < max_args_num; n++) {
> + obj = regs_get_kernel_argument(pt_regs, n);
> + if (object_exist((void *)obj))
> + submit_trace_object(ip, parent_ip, obj);

Please add a comment why don't you break the loop after finding
any trace object here.

> + }
> +
> +out:
> + atomic_dec(&per_cpu(trace_object_event_disable, cpu));
> + preempt_enable_notrace();
> +}
> +
> +static struct ftrace_ops trace_ops = {
> + .func = trace_object_events_call,
> + .flags = FTRACE_OPS_FL_SAVE_REGS,
> +};
> +
> +static void
> +trace_object_trigger(struct event_trigger_data *data,
> + struct trace_buffer *buffer, void *rec,
> + struct ring_buffer_event *event)
> +{
> +
> + struct ftrace_event_field *field = data->private_data;
> + void *obj = NULL;
> +
> + memcpy(&obj, rec + field->offset, sizeof(obj));
> + set_trace_object(obj);
> +}
> +

/* This is clone of event_trigger_free(). */

> +static void
> +trace_object_trigger_free(struct event_trigger_ops *ops,
> + struct event_trigger_data *data)
> +{
> + if (WARN_ON_ONCE(data->ref <= 0))
> + return;
> +
> + data->ref--;
> + if (!data->ref)
> + trigger_data_free(data);
> +}
> +
> +static void
> +trace_object_count_trigger(struct event_trigger_data *data,
> + struct trace_buffer *buffer, void *rec,
> + struct ring_buffer_event *event)
> +{
> + if (!data->count)
> + return;
> +
> + if (data->count != -1)
> + (data->count)--;
> +
> + trace_object_trigger(data, buffer, rec, event);
> +}
> +

/* This is the clone of event_trigger_init() */

> +static int event_object_trigger_init(struct event_trigger_ops *ops,
> + struct event_trigger_data *data)
> +{
> + data->ref++;
> + return 0;
> +}


At this moment, I think cloning those functions are OK. But after merged
and enough tested, trigger should expose these functions and this
should use it.


BTW, it seems you are sharing the trace object active_list (obj_pool::active_list)
among all object trace triggers. This means if user sets one objtrace trigger
on 2 events and remove from one event, the active_list still have the object
which sets by the event we removed. If it is correct, I think this should be
noted in the document.
Something like;

Note that currently this is not designed to trace objects of different types
at the same time. You must set an objtrace trigger to only one event, or
you must carefully choose the same type object when you set the objtrace on
differnt events. Also, the traced object list is not cleared unless all
triggers are removed.


> +
> +static int
> +event_trigger_print(const char *name, struct seq_file *m,
> + void *data, char *filter_str)
> +{
> + long count = (long)data;
> +
> + seq_puts(m, name);
> +
> + if (count == -1)
> + seq_puts(m, ":unlimited");
> + else
> + seq_printf(m, ":count=%ld", count);
> +
> + if (filter_str)
> + seq_printf(m, " if %s\n", filter_str);
> + else
> + seq_putc(m, '\n');
> +
> + return 0;
> +}
> +
> +static int
> +trace_object_trigger_print(struct seq_file *m, struct event_trigger_ops *ops,
> + struct event_trigger_data *data)
> +{
> + return event_trigger_print("objtrace", m, (void *)data->count,
> + data->filter_str);
> +}
> +
> +static struct event_trigger_ops objecttrace_trigger_ops = {
> + .func = trace_object_trigger,
> + .print = trace_object_trigger_print,
> + .init = event_object_trigger_init,
> + .free = trace_object_trigger_free,
> +};
> +
> +static struct event_trigger_ops objecttrace_count_trigger_ops = {
> + .func = trace_object_count_trigger,
> + .print = trace_object_trigger_print,
> + .init = event_object_trigger_init,
> + .free = trace_object_trigger_free,
> +};
> +
> +static struct event_trigger_ops *
> +objecttrace_get_trigger_ops(char *cmd, char *param)
> +{
> + return param ? &objecttrace_count_trigger_ops : &objecttrace_trigger_ops;
> +}
> +
> +static int register_object_trigger(char *glob, struct event_trigger_ops *ops,
> + struct event_trigger_data *data,
> + struct trace_event_file *file)
> +{
> + struct event_trigger_data *test;
> + int ret = 0;
> +
> + lockdep_assert_held(&event_mutex);
> +
> + list_for_each_entry(test, &file->triggers, list) {
> + if (test->cmd_ops->trigger_type == data->cmd_ops->trigger_type) {
> + ret = -EEXIST;
> + goto out;
> + }
> + }
> +
> + if (data->ops->init) {
> + ret = data->ops->init(data->ops, data);
> + if (ret < 0)
> + goto out;
> + }
> +
> + list_add_rcu(&data->list, &file->triggers);
> + ret++;
> +
> + update_cond_flag(file);
> + if (trace_event_trigger_enable_disable(file, 1) < 0) {
> + list_del_rcu(&data->list);
> + update_cond_flag(file);
> + ret--;
> + }
> + init_trace_object();
> +out:
> + return ret;
> +}
> +
> +static void unregister_object_trigger(char *glob, struct event_trigger_ops *ops,
> + struct event_trigger_data *test,
> + struct trace_event_file *file)
> +{
> + struct event_trigger_data *data;
> + bool unregistered = false;
> +
> + lockdep_assert_held(&event_mutex);
> +
> + list_for_each_entry(data, &file->triggers, list) {
> + if (data->cmd_ops->trigger_type == test->cmd_ops->trigger_type) {
> + unregistered = true;
> + list_del_rcu(&data->list);
> + trace_event_trigger_enable_disable(file, 0);
> + update_cond_flag(file);
> + break;
> + }
> + }
> +
> + if (unregistered && data->ops->free) {
> + data->ops->free(data->ops, data);
> + exit_trace_object();
> + }

isn't this the following? Can you skip exit_trace_object()
if data->ops->free == NULL?

if (unregistered) {
if (data->ops->free)
data->ops->free(data->ops, data);
exit_trace_object();
}

Thank you,


> +}
> +
> +static int
> +event_object_trigger_callback(struct event_command *cmd_ops,
> + struct trace_event_file *file,
> + char *glob, char *cmd, char *param)
> +{
> + struct event_trigger_data *trigger_data;
> + struct event_trigger_ops *trigger_ops;
> + struct trace_event_call *call;
> + struct ftrace_event_field *field;
> + char *objtrace_cmd;
> + char *trigger = NULL;
> + char *arg;
> + char *number;
> + int ret;
> +
> + ret = -EINVAL;
> + if (!param)
> + goto out;
> +
> + /* separate the trigger from the filter (c:a:n [if filter]) */
> + trigger = strsep(&param, " \t");
> + if (!trigger)
> + goto out;
> + if (param) {
> + param = skip_spaces(param);
> + if (!*param)
> + param = NULL;
> + }
> +
> + objtrace_cmd = strsep(&trigger, ":");
> + if (!objtrace_cmd || strcmp(objtrace_cmd, "add"))
> + goto out;
> +
> + arg = strsep(&trigger, ":");
> + if (!arg)
> + goto out;
> + call = file->event_call;
> + field = trace_find_event_field(call, arg);
> + if (!field)
> + goto out;
> +
> + if (field->size != sizeof(void *))
> + goto out;
> +
> + trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);
> +
> + ret = -ENOMEM;
> + trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);
> + if (!trigger_data)
> + goto out;
> +
> + trigger_data->count = -1;
> + trigger_data->ops = trigger_ops;
> + trigger_data->cmd_ops = cmd_ops;
> + trigger_data->private_data = field;
> + INIT_LIST_HEAD(&trigger_data->list);
> + INIT_LIST_HEAD(&trigger_data->named_list);
> +
> + if (glob[0] == '!') {
> + cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);
> + kfree(trigger_data);
> + ret = 0;
> + goto out;
> + }
> +
> + if (trigger) {
> + number = strsep(&trigger, ":");
> +
> + ret = -EINVAL;
> + if (!strlen(number))
> + goto out_free;
> +
> + /*
> + * We use the callback data field (which is a pointer)
> + * as our counter.
> + */
> + ret = kstrtoul(number, 0, &trigger_data->count);
> + if (ret)
> + goto out_free;
> + }
> +
> + if (!param) /* if param is non-empty, it's supposed to be a filter */
> + goto out_reg;
> +
> + if (!cmd_ops->set_filter)
> + goto out_reg;
> +
> + ret = cmd_ops->set_filter(param, trigger_data, file);
> + if (ret < 0)
> + goto out_free;
> +
> + out_reg:
> + /* Up the trigger_data count to make sure reg doesn't free it on failure */
> + event_object_trigger_init(trigger_ops, trigger_data);
> + ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
> + /*
> + * The above returns on success the # of functions enabled,
> + * but if it didn't find any functions it returns zero.
> + * Consider no functions a failure too.
> + */
> + if (!ret) {
> + cmd_ops->unreg(glob, trigger_ops, trigger_data, file);
> + ret = -ENOENT;
> + } else if (ret > 0)
> + ret = 0;
> +
> + /* Down the counter of trigger_data or free it if not used anymore */
> + trace_object_trigger_free(trigger_ops, trigger_data);
> + out:
> + return ret;
> +
> + out_free:
> + if (cmd_ops->set_filter)
> + cmd_ops->set_filter(NULL, trigger_data, NULL);
> + kfree(trigger_data);
> + goto out;
> +}
> +
> +static struct event_command trigger_object_cmd = {
> + .name = "objtrace",
> + .trigger_type = ETT_TRACE_OBJECT,
> + .flags = EVENT_CMD_FL_NEEDS_REC,
> + .func = event_object_trigger_callback,
> + .reg = register_object_trigger,
> + .unreg = unregister_object_trigger,
> + .get_trigger_ops = objecttrace_get_trigger_ops,
> + .set_filter = set_trigger_filter,
> +};
> +
> +__init int register_trigger_object_cmd(void)
> +{
> + int ret;
> +
> + ret = register_event_command(&trigger_object_cmd);
> + WARN_ON(ret < 0);
> +
> + return ret;
> +}
> +
> +static int init_trace_object(void)
> +{
> + int ret;
> +
> + if (atomic_inc_return(&trace_object_ref) != 1) {
> + ret = 0;
> + goto out;
> + }
> +
> + ret = init_object_pool();
> + if (ret)
> + goto out;
> +
> + event_trace_file.tr = top_trace_array();
> + if (WARN_ON(!event_trace_file.tr)) {
> + ret = -1;
> + goto out;
> + }
> + ret = register_ftrace_function(&trace_ops);
> +out:
> + return ret;
> +}
> +
> +static int exit_trace_object(void)
> +{
> + int ret;
> +
> + if (WARN_ON_ONCE(atomic_read(&trace_object_ref) <= 0))
> + goto out;
> +
> + if (atomic_dec_return(&trace_object_ref) != 0) {
> + ret = 0;
> + goto out;
> + }
> +
> + ret = unregister_ftrace_function(&trace_ops);
> + if (ret) {
> + pr_err("can't unregister ftrace for trace object\n");
> + goto out;
> + }
> + free_object_pool();
> +out:
> + return ret;
> +}
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 3547e7176ff7..d747aed27104 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -1543,6 +1543,45 @@ static struct trace_event trace_func_repeats_event = {
> .funcs = &trace_func_repeats_funcs,
> };
>
> +/* TRACE_OBJECT */
> +static enum print_line_t trace_object_print(struct trace_iterator *iter, int flags,
> + struct trace_event *event)
> +{
> + struct trace_object_entry *field;
> + struct trace_seq *s = &iter->seq;
> +
> + trace_assign_type(field, iter->ent);
> + print_fn_trace(s, field->ip, field->parent_ip, flags);
> + trace_seq_printf(s, " object:0x%lx", field->object);
> + trace_seq_putc(s, '\n');
> +
> + return trace_handle_return(s);
> +}
> +
> +static enum print_line_t trace_object_raw(struct trace_iterator *iter, int flags,
> + struct trace_event *event)
> +{
> + struct trace_object_entry *field;
> +
> + trace_assign_type(field, iter->ent);
> +
> + trace_seq_printf(&iter->seq, "%lx %lx\n",
> + field->ip,
> + field->parent_ip);
> +
> + return trace_handle_return(&iter->seq);
> +}
> +
> +static struct trace_event_functions trace_object_funcs = {
> + .trace = trace_object_print,
> + .raw = trace_object_raw,
> +};
> +
> +static struct trace_event trace_object_event = {
> + .type = TRACE_OBJECT,
> + .funcs = &trace_object_funcs,
> +};
> +
> static struct trace_event *events[] __initdata = {
> &trace_fn_event,
> &trace_ctx_event,
> @@ -1557,6 +1596,7 @@ static struct trace_event *events[] __initdata = {
> &trace_timerlat_event,
> &trace_raw_data_event,
> &trace_func_repeats_event,
> + &trace_object_event,
> NULL
> };
>
> --
> 2.25.1
>


--
Masami Hiramatsu <[email protected]>

2021-12-17 16:33:14

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Hi Masami,

Thanks for your review.

On Fri, Dec 17, 2021 at 12:51 PM Masami Hiramatsu <[email protected]> wrote:
>
> On Tue, 30 Nov 2021 00:49:47 +0800
> Jeff Xie <[email protected]> wrote:
>
> > Introduce objtrace trigger to get the execution flow by tracing
> > any kernel object in the kernel parameter.
>
> Here, please add simple usage (no need to add output) since we can
> see only this comment after merging this on git log.
> If you run `git log -- kernel/trace/` and search "tracing: Add ",
> then you can see more examples to add a new feature. ;)
>
> What would you think this?
> -----
> Introduce objtrace trigger to get the call flow by tracing
> any kernel object in the function parameter.
>
> The objtrace trigger makes a list of the target object address from
> the given event parameter, and records all kernel function calls
> which has the object address in its parameter.
>
> User can identify the kernel call flow related to the object easily.
>
> Syntax:
> ...
>
> Usage:
> ...

I agree, I will add it.

> -----
>
> >
> > Signed-off-by: Jeff Xie <[email protected]>
> > ---
> > include/linux/trace_events.h | 1 +
> > kernel/trace/Kconfig | 10 +
> > kernel/trace/Makefile | 1 +
> > kernel/trace/trace.h | 8 +
> > kernel/trace/trace_entries.h | 17 +
> > kernel/trace/trace_events_trigger.c | 1 +
> > kernel/trace/trace_object.c | 534 ++++++++++++++++++++++++++++
> > kernel/trace/trace_output.c | 40 +++
> > 8 files changed, 612 insertions(+)
> > create mode 100644 kernel/trace/trace_object.c
> >
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index 2d167ac3452c..54baec78ce62 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -684,6 +684,7 @@ enum event_trigger_type {
> > ETT_EVENT_HIST = (1 << 4),
> > ETT_HIST_ENABLE = (1 << 5),
> > ETT_EVENT_EPROBE = (1 << 6),
> > + ETT_TRACE_OBJECT = (1 << 7),
> > };
> >
> > extern int filter_match_preds(struct event_filter *filter, void *rec);
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 420ff4bc67fd..f69761cbfaaa 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -237,6 +237,16 @@ config FUNCTION_PROFILER
> >
> > If in doubt, say N.
> >
> > +config TRACE_OBJECT
> > + bool "Trace kernel object in function parameter"
> > + depends on FUNCTION_TRACER
> > + depends on HAVE_FUNCTION_ARG_ACCESS_API
> > + select TRACING
> > + default y
> > + help
> > + You can trace the kernel object in the kernel function parameter.
> > + The kernel object is dynamically specified via event trigger.
> > +
> > config STACK_TRACER
> > bool "Trace max stack"
> > depends on HAVE_FUNCTION_TRACER
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index bedc5caceec7..b924b8e55922 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -67,6 +67,7 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
> > obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
> > obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> > obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o
> > +obj-$(CONFIG_TRACE_OBJECT) += trace_object.o
> > ifeq ($(CONFIG_BLOCK),y)
> > obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> > endif
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 38715aa6cfdf..46a1301ac47a 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -54,6 +54,7 @@ enum trace_type {
> > TRACE_TIMERLAT,
> > TRACE_RAW_DATA,
> > TRACE_FUNC_REPEATS,
> > + TRACE_OBJECT,
> >
> > __TRACE_LAST_TYPE,
> > };
> > @@ -469,6 +470,7 @@ extern void __ftrace_bad_type(void);
> > TRACE_GRAPH_RET); \
> > IF_ASSIGN(var, ent, struct func_repeats_entry, \
> > TRACE_FUNC_REPEATS); \
> > + IF_ASSIGN(var, ent, struct trace_object_entry, TRACE_OBJECT);\
> > __ftrace_bad_type(); \
> > } while (0)
> >
> > @@ -1531,6 +1533,12 @@ static inline int register_trigger_hist_cmd(void) { return 0; }
> > static inline int register_trigger_hist_enable_disable_cmds(void) { return 0; }
> > #endif
> >
> > +#ifdef CONFIG_TRACE_OBJECT
> > +extern int register_trigger_object_cmd(void);
> > +#else
> > +static inline int register_trigger_object_cmd(void) { return 0; }
> > +#endif
> > +
> > extern int register_trigger_cmds(void);
> > extern void clear_event_triggers(struct trace_array *tr);
> >
> > diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
> > index cd41e863b51c..bb120d9498a9 100644
> > --- a/kernel/trace/trace_entries.h
> > +++ b/kernel/trace/trace_entries.h
> > @@ -401,3 +401,20 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
> > __entry->context,
> > __entry->timer_latency)
> > );
> > +
> > +/*
> > + * trace object entry:
> > + */
> > +FTRACE_ENTRY(object, trace_object_entry,
> > +
> > + TRACE_OBJECT,
> > +
> > + F_STRUCT(
> > + __field( unsigned long, ip )
> > + __field( unsigned long, parent_ip )
> > + __field( unsigned long, object )
> > + ),
> > +
> > + F_printk(" %ps <-- %ps object:%lx\n",
> > + (void *)__entry->ip, (void *)__entry->parent_ip, __entry->object)
> > +);
> > diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
> > index 3d5c07239a2a..da2d871fa809 100644
> > --- a/kernel/trace/trace_events_trigger.c
> > +++ b/kernel/trace/trace_events_trigger.c
> > @@ -1687,6 +1687,7 @@ __init int register_trigger_cmds(void)
> > register_trigger_enable_disable_cmds();
> > register_trigger_hist_enable_disable_cmds();
> > register_trigger_hist_cmd();
> > + register_trigger_object_cmd();
> >
> > return 0;
> > }
> > diff --git a/kernel/trace/trace_object.c b/kernel/trace/trace_object.c
> > new file mode 100644
> > index 000000000000..83fccd587b63
> > --- /dev/null
> > +++ b/kernel/trace/trace_object.c
> > @@ -0,0 +1,534 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * trace the kernel object in the kernel function parameter
> > + * Copyright (C) 2021 Jeff Xie <[email protected]>
> > + */
> > +
> > +#define pr_fmt(fmt) "trace_object: " fmt
> > +
> > +#include "trace_output.h"
> > +#include <linux/freelist.h>
> > +
> > +static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
>
> atomic_t is for atomic operation which must be shared among cpus. On the
> other hand, per-cpu variable is used for the core-local storage or flags,
> other cpus never touch it. Thus the per-cpu atomic_t is very strange.
>

From the patch V1, I cloned it from the function_test_events_call()
in kernel/trace/trace_events.c

commit: 9ea21c1ecdb35ecdcac5fd9d95f62a1f6a7ffec0
tracing/events: perform function tracing in event selftests
Author: Steven Rostedt <[email protected]>

It should be to prevent being preempted by the interrupt context in
the process of adding one

> > +static DEFINE_RAW_SPINLOCK(object_spin_lock);
> > +static struct trace_event_file event_trace_file;
> > +static const int max_args_num = 6;
> > +static const int max_obj_pool = 10;
> > +static atomic_t trace_object_ref;
> > +static int exit_trace_object(void);
> > +static int init_trace_object(void);
> > +
>
> Please add more comments to the code itself. Explain why this is needed
> and how it works for which case. That will lead deeper understanding.
>

I agree, I will add more comments in the next version.

> > +struct object_instance {
> > + void *object;
> > + struct freelist_node free_list;
> > + struct list_head active_list;
> > +};
> > +
> > +struct obj_pool {
> > + struct freelist_head free_list;
> > + struct list_head active_list;
> > +};
> > +static struct obj_pool *obj_pool;
> > +
> > +static bool object_exist(void *obj)
> > +{
> > + struct object_instance *inst;
> > + bool ret = false;
> > +
> > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > + if (inst->object == obj) {
> > + ret = true;
> > + goto out;
> > + }
> > + }
> > +out:
> > + return ret;
> > +}
> > +
> > +static bool object_empty(void)
> > +{
> > + return list_empty(&obj_pool->active_list);
> > +}
> > +
> > +static void set_trace_object(void *obj)
> > +{
> > + struct freelist_node *fn;
> > + struct object_instance *ins;
> > + unsigned long flags;
> > +
> > + if (in_nmi())
> > + return;
> > +
> > + if (!obj)
> > + return;
> > +
> > + if (object_exist(obj))
> > + return;
> > +
> > + fn = freelist_try_get(&obj_pool->free_list);
> > + if (!fn) {
> > + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> > + return;
> > + }
> > +
> > + ins = container_of(fn, struct object_instance, free_list);
> > + ins->object = obj;
> > +
> > + raw_spin_lock_irqsave(&object_spin_lock, flags);
> > + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> > + raw_spin_unlock_irqrestore(&object_spin_lock, flags);
>
> Please add a comment that why this spinlock is needed here and why
> other operation doesn't need it.

(Only this place has write operations, and it cannot be concurrent)
I agree, I will add it.

> > +}
> > +
> > +static inline void free_free_list_objects(struct freelist_head *head)
> > +{
> > +
> > + struct object_instance *inst;
> > + struct freelist_node *node = head->head;
> > +
> > + while (node) {
> > + inst = container_of(node, struct object_instance, free_list);
> > + node = node->next;
> > + kfree(inst);
>
> Note that this is safe only if this free_list is no more accessed.
>

Maybe it need to use rcu to handle it , I will think about how to fix
it in the next version.

> > + }
> > +}
> > +
> > +static inline void free_active_list_objects(struct list_head *head)
> > +{
> > + struct object_instance *inst;
> > +
> > + list_for_each_entry_rcu(inst, head, active_list)
> > + kfree(inst);
>
> No, you can not do this, since the 'inst' is still on the active_list.
> Before freeing the 'inst', you need to do list_del_rcu().
> Moreover, the 'inst' might be still accessed from the callbacks,
> you can not directly free it.
>
> RCU ensures that looping on a list can be done safely before and after
> removing/adding a node, but not ensure it is freeable soon, because
> someone still looking the node.
>
> I guess you may expect that all callbacks are stopped after calling

Yes you guessed it right ;-)

> unregister_ftrace_function(&trace_ops);
>
> but that is wrong. That don't synchronize the callbacks (= wait
> for finishing all running callbacks).
> So, you should use synchronize_rcu() after that.

Thanks, I will fix it.

> > +}
> > +
> > +static inline void free_object_pool(void)
> > +{
> > + free_free_list_objects(&obj_pool->free_list);
> > + free_active_list_objects(&obj_pool->active_list);
>
> BTW, did you also ensure all the trigger callback has been done?
> set_trace_object() has a small time window when the object_instance
> is not linked to both of the free_list and the active_list.
> Thus you have to make sure that the function is never called while
> this function. (How to do it? please write the comment about it)

I ignored this small time window, maybe use the rcu can handle it, I
will think about how to fix it.

> > + kfree(obj_pool);
> > +}
> > +
> > +
> > +static int init_object_pool(void)
> > +{
> > + struct object_instance *inst;
> > + int i, ret = 0;
> > +
> > + obj_pool = kzalloc(sizeof(*obj_pool), GFP_KERNEL);
> > + if (!obj_pool) {
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > +
> > + INIT_LIST_HEAD(&obj_pool->active_list);
> > +
> > + for (i = 0; i < max_obj_pool; i++) {
> > + inst = kzalloc(sizeof(*inst), GFP_KERNEL);
> > + if (!inst) {
> > + free_object_pool();
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + freelist_add(&inst->free_list, &obj_pool->free_list);
> > + }
> > +out:
> > + return ret;
> > +}
> > +
> > +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> > + unsigned long object)
> > +{
> > +
> > + struct trace_buffer *buffer;
> > + struct ring_buffer_event *event;
> > + struct trace_object_entry *entry;
> > + int pc;
> > +
> > + pc = preempt_count();
> > + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> > + TRACE_OBJECT, sizeof(*entry), pc);
> > + if (!event)
> > + return;
> > + entry = ring_buffer_event_data(event);
> > + entry->ip = ip;
> > + entry->parent_ip = parent_ip;
> > + entry->object = object;
> > +
> > + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> > + entry, pc);
> > +}
> > +
> > +static void
> > +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> > + struct ftrace_ops *op, struct ftrace_regs *fregs)
> > +{
> > + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> > + unsigned long obj;
> > + long disabled;
> > + int cpu, n;
> > +
> > + preempt_disable_notrace();
> > +
> > + cpu = raw_smp_processor_id();
> > + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
>
> So what is the purpose of this check? (are there any issue if the same
> cpu reenter here?)
>

There may be an interrupt context that can preempt it. I am not yet
sure whether the cpu reenter will affect it.
I will debug and test it. (Referred from function_test_events_call())

> > +
> > + if (disabled != 1)
> > + goto out;
> > +
> > + if (object_empty())
> > + goto out;
> > +
> > + for (n = 0; n < max_args_num; n++) {
> > + obj = regs_get_kernel_argument(pt_regs, n);
> > + if (object_exist((void *)obj))
> > + submit_trace_object(ip, parent_ip, obj);
>
> Please add a comment why don't you break the loop after finding
> any trace object here.
>
(because the parameters of a function may match multiple objects)
I will add the comment in the next version.

> > + }
> > +
> > +out:
> > + atomic_dec(&per_cpu(trace_object_event_disable, cpu));
> > + preempt_enable_notrace();
> > +}
> > +
> > +static struct ftrace_ops trace_ops = {
> > + .func = trace_object_events_call,
> > + .flags = FTRACE_OPS_FL_SAVE_REGS,
> > +};
> > +
> > +static void
> > +trace_object_trigger(struct event_trigger_data *data,
> > + struct trace_buffer *buffer, void *rec,
> > + struct ring_buffer_event *event)
> > +{
> > +
> > + struct ftrace_event_field *field = data->private_data;
> > + void *obj = NULL;
> > +
> > + memcpy(&obj, rec + field->offset, sizeof(obj));
> > + set_trace_object(obj);
> > +}
> > +
>
> /* This is clone of event_trigger_free(). */
>
> > +static void
> > +trace_object_trigger_free(struct event_trigger_ops *ops,
> > + struct event_trigger_data *data)
> > +{
> > + if (WARN_ON_ONCE(data->ref <= 0))
> > + return;
> > +
> > + data->ref--;
> > + if (!data->ref)
> > + trigger_data_free(data);
> > +}
> > +
> > +static void
> > +trace_object_count_trigger(struct event_trigger_data *data,
> > + struct trace_buffer *buffer, void *rec,
> > + struct ring_buffer_event *event)
> > +{
> > + if (!data->count)
> > + return;
> > +
> > + if (data->count != -1)
> > + (data->count)--;
> > +
> > + trace_object_trigger(data, buffer, rec, event);
> > +}
> > +
>
> /* This is the clone of event_trigger_init() */
>
> > +static int event_object_trigger_init(struct event_trigger_ops *ops,
> > + struct event_trigger_data *data)
> > +{
> > + data->ref++;
> > + return 0;
> > +}
>
>
> At this moment, I think cloning those functions are OK. But after merged
> and enough tested, trigger should expose these functions and this
> should use it.

I agree.

>
> BTW, it seems you are sharing the trace object active_list (obj_pool::active_list)
> among all object trace triggers. This means if user sets one objtrace trigger
> on 2 events and remove from one event, the active_list still have the object
> which sets by the event we removed. If it is correct, I think this should be
> noted in the document.
> Something like;
>
> Note that currently this is not designed to trace objects of different types
> at the same time. You must set an objtrace trigger to only one event, or
> you must carefully choose the same type object when you set the objtrace on
> differnt events. Also, the traced object list is not cleared unless all
> triggers are removed.
>
Thanks, I agree, I will add it.
>
> > +
> > +static int
> > +event_trigger_print(const char *name, struct seq_file *m,
> > + void *data, char *filter_str)
> > +{
> > + long count = (long)data;
> > +
> > + seq_puts(m, name);
> > +
> > + if (count == -1)
> > + seq_puts(m, ":unlimited");
> > + else
> > + seq_printf(m, ":count=%ld", count);
> > +
> > + if (filter_str)
> > + seq_printf(m, " if %s\n", filter_str);
> > + else
> > + seq_putc(m, '\n');
> > +
> > + return 0;
> > +}
> > +
> > +static int
> > +trace_object_trigger_print(struct seq_file *m, struct event_trigger_ops *ops,
> > + struct event_trigger_data *data)
> > +{
> > + return event_trigger_print("objtrace", m, (void *)data->count,
> > + data->filter_str);
> > +}
> > +
> > +static struct event_trigger_ops objecttrace_trigger_ops = {
> > + .func = trace_object_trigger,
> > + .print = trace_object_trigger_print,
> > + .init = event_object_trigger_init,
> > + .free = trace_object_trigger_free,
> > +};
> > +
> > +static struct event_trigger_ops objecttrace_count_trigger_ops = {
> > + .func = trace_object_count_trigger,
> > + .print = trace_object_trigger_print,
> > + .init = event_object_trigger_init,
> > + .free = trace_object_trigger_free,
> > +};
> > +
> > +static struct event_trigger_ops *
> > +objecttrace_get_trigger_ops(char *cmd, char *param)
> > +{
> > + return param ? &objecttrace_count_trigger_ops : &objecttrace_trigger_ops;
> > +}
> > +
> > +static int register_object_trigger(char *glob, struct event_trigger_ops *ops,
> > + struct event_trigger_data *data,
> > + struct trace_event_file *file)
> > +{
> > + struct event_trigger_data *test;
> > + int ret = 0;
> > +
> > + lockdep_assert_held(&event_mutex);
> > +
> > + list_for_each_entry(test, &file->triggers, list) {
> > + if (test->cmd_ops->trigger_type == data->cmd_ops->trigger_type) {
> > + ret = -EEXIST;
> > + goto out;
> > + }
> > + }
> > +
> > + if (data->ops->init) {
> > + ret = data->ops->init(data->ops, data);
> > + if (ret < 0)
> > + goto out;
> > + }
> > +
> > + list_add_rcu(&data->list, &file->triggers);
> > + ret++;
> > +
> > + update_cond_flag(file);
> > + if (trace_event_trigger_enable_disable(file, 1) < 0) {
> > + list_del_rcu(&data->list);
> > + update_cond_flag(file);
> > + ret--;
> > + }
> > + init_trace_object();
> > +out:
> > + return ret;
> > +}
> > +
> > +static void unregister_object_trigger(char *glob, struct event_trigger_ops *ops,
> > + struct event_trigger_data *test,
> > + struct trace_event_file *file)
> > +{
> > + struct event_trigger_data *data;
> > + bool unregistered = false;
> > +
> > + lockdep_assert_held(&event_mutex);
> > +
> > + list_for_each_entry(data, &file->triggers, list) {
> > + if (data->cmd_ops->trigger_type == test->cmd_ops->trigger_type) {
> > + unregistered = true;
> > + list_del_rcu(&data->list);
> > + trace_event_trigger_enable_disable(file, 0);
> > + update_cond_flag(file);
> > + break;
> > + }
> > + }
> > +
> > + if (unregistered && data->ops->free) {
> > + data->ops->free(data->ops, data);
> > + exit_trace_object();
> > + }
>
> isn't this the following? Can you skip exit_trace_object()
> if data->ops->free == NULL?
>
> if (unregistered) {
> if (data->ops->free)
> data->ops->free(data->ops, data);
> exit_trace_object();
> }

Thanks, this one would be better, although data->ops->free is not
empty at the moment.

>
> Thank you,
>
>
> > +}
> > +
> > +static int
> > +event_object_trigger_callback(struct event_command *cmd_ops,
> > + struct trace_event_file *file,
> > + char *glob, char *cmd, char *param)
> > +{
> > + struct event_trigger_data *trigger_data;
> > + struct event_trigger_ops *trigger_ops;
> > + struct trace_event_call *call;
> > + struct ftrace_event_field *field;
> > + char *objtrace_cmd;
> > + char *trigger = NULL;
> > + char *arg;
> > + char *number;
> > + int ret;
> > +
> > + ret = -EINVAL;
> > + if (!param)
> > + goto out;
> > +
> > + /* separate the trigger from the filter (c:a:n [if filter]) */
> > + trigger = strsep(&param, " \t");
> > + if (!trigger)
> > + goto out;
> > + if (param) {
> > + param = skip_spaces(param);
> > + if (!*param)
> > + param = NULL;
> > + }
> > +
> > + objtrace_cmd = strsep(&trigger, ":");
> > + if (!objtrace_cmd || strcmp(objtrace_cmd, "add"))
> > + goto out;
> > +
> > + arg = strsep(&trigger, ":");
> > + if (!arg)
> > + goto out;
> > + call = file->event_call;
> > + field = trace_find_event_field(call, arg);
> > + if (!field)
> > + goto out;
> > +
> > + if (field->size != sizeof(void *))
> > + goto out;
> > +
> > + trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);
> > +
> > + ret = -ENOMEM;
> > + trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);
> > + if (!trigger_data)
> > + goto out;
> > +
> > + trigger_data->count = -1;
> > + trigger_data->ops = trigger_ops;
> > + trigger_data->cmd_ops = cmd_ops;
> > + trigger_data->private_data = field;
> > + INIT_LIST_HEAD(&trigger_data->list);
> > + INIT_LIST_HEAD(&trigger_data->named_list);
> > +
> > + if (glob[0] == '!') {
> > + cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);
> > + kfree(trigger_data);
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > + if (trigger) {
> > + number = strsep(&trigger, ":");
> > +
> > + ret = -EINVAL;
> > + if (!strlen(number))
> > + goto out_free;
> > +
> > + /*
> > + * We use the callback data field (which is a pointer)
> > + * as our counter.
> > + */
> > + ret = kstrtoul(number, 0, &trigger_data->count);
> > + if (ret)
> > + goto out_free;
> > + }
> > +
> > + if (!param) /* if param is non-empty, it's supposed to be a filter */
> > + goto out_reg;
> > +
> > + if (!cmd_ops->set_filter)
> > + goto out_reg;
> > +
> > + ret = cmd_ops->set_filter(param, trigger_data, file);
> > + if (ret < 0)
> > + goto out_free;
> > +
> > + out_reg:
> > + /* Up the trigger_data count to make sure reg doesn't free it on failure */
> > + event_object_trigger_init(trigger_ops, trigger_data);
> > + ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
> > + /*
> > + * The above returns on success the # of functions enabled,
> > + * but if it didn't find any functions it returns zero.
> > + * Consider no functions a failure too.
> > + */
> > + if (!ret) {
> > + cmd_ops->unreg(glob, trigger_ops, trigger_data, file);
> > + ret = -ENOENT;
> > + } else if (ret > 0)
> > + ret = 0;
> > +
> > + /* Down the counter of trigger_data or free it if not used anymore */
> > + trace_object_trigger_free(trigger_ops, trigger_data);
> > + out:
> > + return ret;
> > +
> > + out_free:
> > + if (cmd_ops->set_filter)
> > + cmd_ops->set_filter(NULL, trigger_data, NULL);
> > + kfree(trigger_data);
> > + goto out;
> > +}
> > +
> > +static struct event_command trigger_object_cmd = {
> > + .name = "objtrace",
> > + .trigger_type = ETT_TRACE_OBJECT,
> > + .flags = EVENT_CMD_FL_NEEDS_REC,
> > + .func = event_object_trigger_callback,
> > + .reg = register_object_trigger,
> > + .unreg = unregister_object_trigger,
> > + .get_trigger_ops = objecttrace_get_trigger_ops,
> > + .set_filter = set_trigger_filter,
> > +};
> > +
> > +__init int register_trigger_object_cmd(void)
> > +{
> > + int ret;
> > +
> > + ret = register_event_command(&trigger_object_cmd);
> > + WARN_ON(ret < 0);
> > +
> > + return ret;
> > +}
> > +
> > +static int init_trace_object(void)
> > +{
> > + int ret;
> > +
> > + if (atomic_inc_return(&trace_object_ref) != 1) {
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > + ret = init_object_pool();
> > + if (ret)
> > + goto out;
> > +
> > + event_trace_file.tr = top_trace_array();
> > + if (WARN_ON(!event_trace_file.tr)) {
> > + ret = -1;
> > + goto out;
> > + }
> > + ret = register_ftrace_function(&trace_ops);
> > +out:
> > + return ret;
> > +}
> > +
> > +static int exit_trace_object(void)
> > +{
> > + int ret;
> > +
> > + if (WARN_ON_ONCE(atomic_read(&trace_object_ref) <= 0))
> > + goto out;
> > +
> > + if (atomic_dec_return(&trace_object_ref) != 0) {
> > + ret = 0;
> > + goto out;
> > + }
> > +
> > + ret = unregister_ftrace_function(&trace_ops);
> > + if (ret) {
> > + pr_err("can't unregister ftrace for trace object\n");
> > + goto out;
> > + }
> > + free_object_pool();
> > +out:
> > + return ret;
> > +}
> > diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> > index 3547e7176ff7..d747aed27104 100644
> > --- a/kernel/trace/trace_output.c
> > +++ b/kernel/trace/trace_output.c
> > @@ -1543,6 +1543,45 @@ static struct trace_event trace_func_repeats_event = {
> > .funcs = &trace_func_repeats_funcs,
> > };
> >
> > +/* TRACE_OBJECT */
> > +static enum print_line_t trace_object_print(struct trace_iterator *iter, int flags,
> > + struct trace_event *event)
> > +{
> > + struct trace_object_entry *field;
> > + struct trace_seq *s = &iter->seq;
> > +
> > + trace_assign_type(field, iter->ent);
> > + print_fn_trace(s, field->ip, field->parent_ip, flags);
> > + trace_seq_printf(s, " object:0x%lx", field->object);
> > + trace_seq_putc(s, '\n');
> > +
> > + return trace_handle_return(s);
> > +}
> > +
> > +static enum print_line_t trace_object_raw(struct trace_iterator *iter, int flags,
> > + struct trace_event *event)
> > +{
> > + struct trace_object_entry *field;
> > +
> > + trace_assign_type(field, iter->ent);
> > +
> > + trace_seq_printf(&iter->seq, "%lx %lx\n",
> > + field->ip,
> > + field->parent_ip);
> > +
> > + return trace_handle_return(&iter->seq);
> > +}
> > +
> > +static struct trace_event_functions trace_object_funcs = {
> > + .trace = trace_object_print,
> > + .raw = trace_object_raw,
> > +};
> > +
> > +static struct trace_event trace_object_event = {
> > + .type = TRACE_OBJECT,
> > + .funcs = &trace_object_funcs,
> > +};
> > +
> > static struct trace_event *events[] __initdata = {
> > &trace_fn_event,
> > &trace_ctx_event,
> > @@ -1557,6 +1596,7 @@ static struct trace_event *events[] __initdata = {
> > &trace_timerlat_event,
> > &trace_raw_data_event,
> > &trace_func_repeats_event,
> > + &trace_object_event,
> > NULL
> > };
> >
> > --
> > 2.25.1
> >
>
>
> --
> Masami Hiramatsu <[email protected]>

---
JeffXie

2021-12-19 03:07:32

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

On Sat, 18 Dec 2021 00:32:57 +0800
Jeff Xie <[email protected]> wrote:

> Hi Masami,
>
> Thanks for your review.
>
> On Fri, Dec 17, 2021 at 12:51 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Tue, 30 Nov 2021 00:49:47 +0800
> > Jeff Xie <[email protected]> wrote:
> >
> > > Introduce objtrace trigger to get the execution flow by tracing
> > > any kernel object in the kernel parameter.
> >
> > Here, please add simple usage (no need to add output) since we can
> > see only this comment after merging this on git log.
> > If you run `git log -- kernel/trace/` and search "tracing: Add ",
> > then you can see more examples to add a new feature. ;)
> >
> > What would you think this?
> > -----
> > Introduce objtrace trigger to get the call flow by tracing
> > any kernel object in the function parameter.
> >
> > The objtrace trigger makes a list of the target object address from
> > the given event parameter, and records all kernel function calls
> > which has the object address in its parameter.
> >
> > User can identify the kernel call flow related to the object easily.
> >
> > Syntax:
> > ...
> >
> > Usage:
> > ...
>
> I agree, I will add it.
>
> > -----
> >
> > >
> > > Signed-off-by: Jeff Xie <[email protected]>
> > > ---
> > > include/linux/trace_events.h | 1 +
> > > kernel/trace/Kconfig | 10 +
> > > kernel/trace/Makefile | 1 +
> > > kernel/trace/trace.h | 8 +
> > > kernel/trace/trace_entries.h | 17 +
> > > kernel/trace/trace_events_trigger.c | 1 +
> > > kernel/trace/trace_object.c | 534 ++++++++++++++++++++++++++++
> > > kernel/trace/trace_output.c | 40 +++
> > > 8 files changed, 612 insertions(+)
> > > create mode 100644 kernel/trace/trace_object.c
> > >
> > > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > > index 2d167ac3452c..54baec78ce62 100644
> > > --- a/include/linux/trace_events.h
> > > +++ b/include/linux/trace_events.h
> > > @@ -684,6 +684,7 @@ enum event_trigger_type {
> > > ETT_EVENT_HIST = (1 << 4),
> > > ETT_HIST_ENABLE = (1 << 5),
> > > ETT_EVENT_EPROBE = (1 << 6),
> > > + ETT_TRACE_OBJECT = (1 << 7),
> > > };
> > >
> > > extern int filter_match_preds(struct event_filter *filter, void *rec);
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 420ff4bc67fd..f69761cbfaaa 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -237,6 +237,16 @@ config FUNCTION_PROFILER
> > >
> > > If in doubt, say N.
> > >
> > > +config TRACE_OBJECT
> > > + bool "Trace kernel object in function parameter"
> > > + depends on FUNCTION_TRACER
> > > + depends on HAVE_FUNCTION_ARG_ACCESS_API
> > > + select TRACING
> > > + default y
> > > + help
> > > + You can trace the kernel object in the kernel function parameter.
> > > + The kernel object is dynamically specified via event trigger.
> > > +
> > > config STACK_TRACER
> > > bool "Trace max stack"
> > > depends on HAVE_FUNCTION_TRACER
> > > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > > index bedc5caceec7..b924b8e55922 100644
> > > --- a/kernel/trace/Makefile
> > > +++ b/kernel/trace/Makefile
> > > @@ -67,6 +67,7 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
> > > obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
> > > obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> > > obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += fgraph.o
> > > +obj-$(CONFIG_TRACE_OBJECT) += trace_object.o
> > > ifeq ($(CONFIG_BLOCK),y)
> > > obj-$(CONFIG_EVENT_TRACING) += blktrace.o
> > > endif
> > > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > > index 38715aa6cfdf..46a1301ac47a 100644
> > > --- a/kernel/trace/trace.h
> > > +++ b/kernel/trace/trace.h
> > > @@ -54,6 +54,7 @@ enum trace_type {
> > > TRACE_TIMERLAT,
> > > TRACE_RAW_DATA,
> > > TRACE_FUNC_REPEATS,
> > > + TRACE_OBJECT,
> > >
> > > __TRACE_LAST_TYPE,
> > > };
> > > @@ -469,6 +470,7 @@ extern void __ftrace_bad_type(void);
> > > TRACE_GRAPH_RET); \
> > > IF_ASSIGN(var, ent, struct func_repeats_entry, \
> > > TRACE_FUNC_REPEATS); \
> > > + IF_ASSIGN(var, ent, struct trace_object_entry, TRACE_OBJECT);\
> > > __ftrace_bad_type(); \
> > > } while (0)
> > >
> > > @@ -1531,6 +1533,12 @@ static inline int register_trigger_hist_cmd(void) { return 0; }
> > > static inline int register_trigger_hist_enable_disable_cmds(void) { return 0; }
> > > #endif
> > >
> > > +#ifdef CONFIG_TRACE_OBJECT
> > > +extern int register_trigger_object_cmd(void);
> > > +#else
> > > +static inline int register_trigger_object_cmd(void) { return 0; }
> > > +#endif
> > > +
> > > extern int register_trigger_cmds(void);
> > > extern void clear_event_triggers(struct trace_array *tr);
> > >
> > > diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
> > > index cd41e863b51c..bb120d9498a9 100644
> > > --- a/kernel/trace/trace_entries.h
> > > +++ b/kernel/trace/trace_entries.h
> > > @@ -401,3 +401,20 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
> > > __entry->context,
> > > __entry->timer_latency)
> > > );
> > > +
> > > +/*
> > > + * trace object entry:
> > > + */
> > > +FTRACE_ENTRY(object, trace_object_entry,
> > > +
> > > + TRACE_OBJECT,
> > > +
> > > + F_STRUCT(
> > > + __field( unsigned long, ip )
> > > + __field( unsigned long, parent_ip )
> > > + __field( unsigned long, object )
> > > + ),
> > > +
> > > + F_printk(" %ps <-- %ps object:%lx\n",
> > > + (void *)__entry->ip, (void *)__entry->parent_ip, __entry->object)
> > > +);
> > > diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
> > > index 3d5c07239a2a..da2d871fa809 100644
> > > --- a/kernel/trace/trace_events_trigger.c
> > > +++ b/kernel/trace/trace_events_trigger.c
> > > @@ -1687,6 +1687,7 @@ __init int register_trigger_cmds(void)
> > > register_trigger_enable_disable_cmds();
> > > register_trigger_hist_enable_disable_cmds();
> > > register_trigger_hist_cmd();
> > > + register_trigger_object_cmd();
> > >
> > > return 0;
> > > }
> > > diff --git a/kernel/trace/trace_object.c b/kernel/trace/trace_object.c
> > > new file mode 100644
> > > index 000000000000..83fccd587b63
> > > --- /dev/null
> > > +++ b/kernel/trace/trace_object.c
> > > @@ -0,0 +1,534 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * trace the kernel object in the kernel function parameter
> > > + * Copyright (C) 2021 Jeff Xie <[email protected]>
> > > + */
> > > +
> > > +#define pr_fmt(fmt) "trace_object: " fmt
> > > +
> > > +#include "trace_output.h"
> > > +#include <linux/freelist.h>
> > > +
> > > +static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
> >
> > atomic_t is for atomic operation which must be shared among cpus. On the
> > other hand, per-cpu variable is used for the core-local storage or flags,
> > other cpus never touch it. Thus the per-cpu atomic_t is very strange.
> >
>
> From the patch V1, I cloned it from the function_test_events_call()
> in kernel/trace/trace_events.c
>
> commit: 9ea21c1ecdb35ecdcac5fd9d95f62a1f6a7ffec0
> tracing/events: perform function tracing in event selftests
> Author: Steven Rostedt <[email protected]>

Hmm, OK.

>
> It should be to prevent being preempted by the interrupt context in
> the process of adding one

Yeah, I think so.

The commit says "some bugs" but it is not sure what actually needs to be
cared.

tracing/events: perform function tracing in event selftests

We can find some bugs in the trace events if we stress the writes as well.
The function tracer is a good way to stress the events.

Steve, can you tell me what was the problem?

I think we don't need per-cpu atomic_t because the counter is increment
and decrement. Thus when quiting the nested ftrace handler on the same CPU,
the counter comes back to the same value. We don't need to care about
atomic increment.

I mean, if we use the normal per-cpu "unsigned int" as a counter, the
operation of "counter++" becomes;

load counter to reg1
[1]
reg1 = reg1 + 1
store reg1 to counter

And if an interrupt occurs at [1], the following happens.

load counter to reg1 # counter = 0

(interrupt - save reg1)
load counter to reg1 # counter = 0
reg1 = reg1 + 1
store reg1 to counter # counter = 1
...
load counter to reg1 # counter = 1
reg1 = reg1 - 1
store reg1 to counter # counter = 0
(iret - restore reg1)

reg1 = reg1 + 1
store reg1 to counter

So, even if the operation is not atomic, there seems no problem.
What else scenario we have to worry?

(BTW, what is the ftrace_test_recursion_trylock()? Is that also
for detecting nesting case??)

> > > +static DEFINE_RAW_SPINLOCK(object_spin_lock);
> > > +static struct trace_event_file event_trace_file;
> > > +static const int max_args_num = 6;
> > > +static const int max_obj_pool = 10;
> > > +static atomic_t trace_object_ref;
> > > +static int exit_trace_object(void);
> > > +static int init_trace_object(void);
> > > +
> >
> > Please add more comments to the code itself. Explain why this is needed
> > and how it works for which case. That will lead deeper understanding.
> >
>
> I agree, I will add more comments in the next version.
>
> > > +struct object_instance {
> > > + void *object;
> > > + struct freelist_node free_list;
> > > + struct list_head active_list;
> > > +};
> > > +
> > > +struct obj_pool {
> > > + struct freelist_head free_list;
> > > + struct list_head active_list;
> > > +};
> > > +static struct obj_pool *obj_pool;
> > > +
> > > +static bool object_exist(void *obj)
> > > +{
> > > + struct object_instance *inst;
> > > + bool ret = false;
> > > +
> > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > + if (inst->object == obj) {
> > > + ret = true;
> > > + goto out;
> > > + }
> > > + }
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static bool object_empty(void)
> > > +{
> > > + return list_empty(&obj_pool->active_list);
> > > +}
> > > +
> > > +static void set_trace_object(void *obj)
> > > +{
> > > + struct freelist_node *fn;
> > > + struct object_instance *ins;
> > > + unsigned long flags;
> > > +
> > > + if (in_nmi())
> > > + return;
> > > +
> > > + if (!obj)
> > > + return;
> > > +
> > > + if (object_exist(obj))
> > > + return;
> > > +
> > > + fn = freelist_try_get(&obj_pool->free_list);
> > > + if (!fn) {
> > > + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> > > + return;
> > > + }
> > > +
> > > + ins = container_of(fn, struct object_instance, free_list);
> > > + ins->object = obj;
> > > +
> > > + raw_spin_lock_irqsave(&object_spin_lock, flags);
> > > + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> > > + raw_spin_unlock_irqrestore(&object_spin_lock, flags);
> >
> > Please add a comment that why this spinlock is needed here and why
> > other operation doesn't need it.
>
> (Only this place has write operations, and it cannot be concurrent)

Yes and such comment helps us when objtrace supports "del" command too. :)

> I agree, I will add it.
>
> > > +}
> > > +
> > > +static inline void free_free_list_objects(struct freelist_head *head)
> > > +{
> > > +
> > > + struct object_instance *inst;
> > > + struct freelist_node *node = head->head;
> > > +
> > > + while (node) {
> > > + inst = container_of(node, struct object_instance, free_list);
> > > + node = node->next;
> > > + kfree(inst);
> >
> > Note that this is safe only if this free_list is no more accessed.
> >
>
> Maybe it need to use rcu to handle it , I will think about how to fix
> it in the next version.

Yes, what you need is a synchronize_rcu().

>
> > > + }
> > > +}
> > > +
> > > +static inline void free_active_list_objects(struct list_head *head)
> > > +{
> > > + struct object_instance *inst;
> > > +
> > > + list_for_each_entry_rcu(inst, head, active_list)
> > > + kfree(inst);
> >
> > No, you can not do this, since the 'inst' is still on the active_list.
> > Before freeing the 'inst', you need to do list_del_rcu().
> > Moreover, the 'inst' might be still accessed from the callbacks,
> > you can not directly free it.
> >
> > RCU ensures that looping on a list can be done safely before and after
> > removing/adding a node, but not ensure it is freeable soon, because
> > someone still looking the node.
> >
> > I guess you may expect that all callbacks are stopped after calling
>
> Yes you guessed it right ;-)
>
> > unregister_ftrace_function(&trace_ops);
> >
> > but that is wrong. That don't synchronize the callbacks (= wait
> > for finishing all running callbacks).
> > So, you should use synchronize_rcu() after that.
>
> Thanks, I will fix it.

Thanks, it maybe hard to notice. Since the synchronize_rcu() is a time
taking operation, trace_event subsystem carefully chooses the calling
point. (e.g. where all user specified operation was done)

> > > +}
> > > +
> > > +static inline void free_object_pool(void)
> > > +{
> > > + free_free_list_objects(&obj_pool->free_list);
> > > + free_active_list_objects(&obj_pool->active_list);
> >
> > BTW, did you also ensure all the trigger callback has been done?
> > set_trace_object() has a small time window when the object_instance
> > is not linked to both of the free_list and the active_list.
> > Thus you have to make sure that the function is never called while
> > this function. (How to do it? please write the comment about it)
>
> I ignored this small time window, maybe use the rcu can handle it, I
> will think about how to fix it.
>
> > > + kfree(obj_pool);
> > > +}
> > > +
> > > +
> > > +static int init_object_pool(void)
> > > +{
> > > + struct object_instance *inst;
> > > + int i, ret = 0;
> > > +
> > > + obj_pool = kzalloc(sizeof(*obj_pool), GFP_KERNEL);
> > > + if (!obj_pool) {
> > > + ret = -ENOMEM;
> > > + goto out;
> > > + }
> > > +
> > > + INIT_LIST_HEAD(&obj_pool->active_list);
> > > +
> > > + for (i = 0; i < max_obj_pool; i++) {
> > > + inst = kzalloc(sizeof(*inst), GFP_KERNEL);
> > > + if (!inst) {
> > > + free_object_pool();
> > > + ret = -ENOMEM;
> > > + goto out;
> > > + }
> > > + freelist_add(&inst->free_list, &obj_pool->free_list);
> > > + }
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> > > + unsigned long object)
> > > +{
> > > +
> > > + struct trace_buffer *buffer;
> > > + struct ring_buffer_event *event;
> > > + struct trace_object_entry *entry;
> > > + int pc;
> > > +
> > > + pc = preempt_count();
> > > + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> > > + TRACE_OBJECT, sizeof(*entry), pc);
> > > + if (!event)
> > > + return;
> > > + entry = ring_buffer_event_data(event);
> > > + entry->ip = ip;
> > > + entry->parent_ip = parent_ip;
> > > + entry->object = object;
> > > +
> > > + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> > > + entry, pc);
> > > +}
> > > +
> > > +static void
> > > +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> > > + struct ftrace_ops *op, struct ftrace_regs *fregs)
> > > +{
> > > + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> > > + unsigned long obj;
> > > + long disabled;
> > > + int cpu, n;
> > > +
> > > + preempt_disable_notrace();
> > > +
> > > + cpu = raw_smp_processor_id();
> > > + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
> >
> > So what is the purpose of this check? (are there any issue if the same
> > cpu reenter here?)
> >
>
> There may be an interrupt context that can preempt it. I am not yet
> sure whether the cpu reenter will affect it.
> I will debug and test it. (Referred from function_test_events_call())

I think what you need is ftrace_test_recursion_trylock(). Please
check the "Protect your callback" section in Documentation/trace/ftrace-uses.rst.

>
> > > +
> > > + if (disabled != 1)
> > > + goto out;
> > > +
> > > + if (object_empty())
> > > + goto out;
> > > +
> > > + for (n = 0; n < max_args_num; n++) {
> > > + obj = regs_get_kernel_argument(pt_regs, n);
> > > + if (object_exist((void *)obj))
> > > + submit_trace_object(ip, parent_ip, obj);
> >
> > Please add a comment why don't you break the loop after finding
> > any trace object here.
> >
> (because the parameters of a function may match multiple objects)
> I will add the comment in the next version.

good :)

>
> > > + }
> > > +
> > > +out:
> > > + atomic_dec(&per_cpu(trace_object_event_disable, cpu));
> > > + preempt_enable_notrace();
> > > +}
> > > +
> > > +static struct ftrace_ops trace_ops = {
> > > + .func = trace_object_events_call,
> > > + .flags = FTRACE_OPS_FL_SAVE_REGS,
> > > +};
> > > +
> > > +static void
> > > +trace_object_trigger(struct event_trigger_data *data,
> > > + struct trace_buffer *buffer, void *rec,
> > > + struct ring_buffer_event *event)
> > > +{
> > > +
> > > + struct ftrace_event_field *field = data->private_data;
> > > + void *obj = NULL;
> > > +
> > > + memcpy(&obj, rec + field->offset, sizeof(obj));
> > > + set_trace_object(obj);
> > > +}
> > > +
> >
> > /* This is clone of event_trigger_free(). */
> >
> > > +static void
> > > +trace_object_trigger_free(struct event_trigger_ops *ops,
> > > + struct event_trigger_data *data)
> > > +{
> > > + if (WARN_ON_ONCE(data->ref <= 0))
> > > + return;
> > > +
> > > + data->ref--;
> > > + if (!data->ref)
> > > + trigger_data_free(data);
> > > +}
> > > +
> > > +static void
> > > +trace_object_count_trigger(struct event_trigger_data *data,
> > > + struct trace_buffer *buffer, void *rec,
> > > + struct ring_buffer_event *event)
> > > +{
> > > + if (!data->count)
> > > + return;
> > > +
> > > + if (data->count != -1)
> > > + (data->count)--;
> > > +
> > > + trace_object_trigger(data, buffer, rec, event);
> > > +}
> > > +
> >
> > /* This is the clone of event_trigger_init() */
> >
> > > +static int event_object_trigger_init(struct event_trigger_ops *ops,
> > > + struct event_trigger_data *data)
> > > +{
> > > + data->ref++;
> > > + return 0;
> > > +}
> >
> >
> > At this moment, I think cloning those functions are OK. But after merged
> > and enough tested, trigger should expose these functions and this
> > should use it.
>
> I agree.
>
> >
> > BTW, it seems you are sharing the trace object active_list (obj_pool::active_list)
> > among all object trace triggers. This means if user sets one objtrace trigger
> > on 2 events and remove from one event, the active_list still have the object
> > which sets by the event we removed. If it is correct, I think this should be
> > noted in the document.
> > Something like;
> >
> > Note that currently this is not designed to trace objects of different types
> > at the same time. You must set an objtrace trigger to only one event, or
> > you must carefully choose the same type object when you set the objtrace on
> > differnt events. Also, the traced object list is not cleared unless all
> > triggers are removed.
> >
> Thanks, I agree, I will add it.
> >
> > > +
> > > +static int
> > > +event_trigger_print(const char *name, struct seq_file *m,
> > > + void *data, char *filter_str)
> > > +{
> > > + long count = (long)data;
> > > +
> > > + seq_puts(m, name);
> > > +
> > > + if (count == -1)
> > > + seq_puts(m, ":unlimited");
> > > + else
> > > + seq_printf(m, ":count=%ld", count);
> > > +
> > > + if (filter_str)
> > > + seq_printf(m, " if %s\n", filter_str);
> > > + else
> > > + seq_putc(m, '\n');
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int
> > > +trace_object_trigger_print(struct seq_file *m, struct event_trigger_ops *ops,
> > > + struct event_trigger_data *data)
> > > +{
> > > + return event_trigger_print("objtrace", m, (void *)data->count,
> > > + data->filter_str);
> > > +}
> > > +
> > > +static struct event_trigger_ops objecttrace_trigger_ops = {
> > > + .func = trace_object_trigger,
> > > + .print = trace_object_trigger_print,
> > > + .init = event_object_trigger_init,
> > > + .free = trace_object_trigger_free,
> > > +};
> > > +
> > > +static struct event_trigger_ops objecttrace_count_trigger_ops = {
> > > + .func = trace_object_count_trigger,
> > > + .print = trace_object_trigger_print,
> > > + .init = event_object_trigger_init,
> > > + .free = trace_object_trigger_free,
> > > +};
> > > +
> > > +static struct event_trigger_ops *
> > > +objecttrace_get_trigger_ops(char *cmd, char *param)
> > > +{
> > > + return param ? &objecttrace_count_trigger_ops : &objecttrace_trigger_ops;
> > > +}
> > > +
> > > +static int register_object_trigger(char *glob, struct event_trigger_ops *ops,
> > > + struct event_trigger_data *data,
> > > + struct trace_event_file *file)
> > > +{
> > > + struct event_trigger_data *test;
> > > + int ret = 0;
> > > +
> > > + lockdep_assert_held(&event_mutex);
> > > +
> > > + list_for_each_entry(test, &file->triggers, list) {
> > > + if (test->cmd_ops->trigger_type == data->cmd_ops->trigger_type) {
> > > + ret = -EEXIST;
> > > + goto out;
> > > + }
> > > + }
> > > +
> > > + if (data->ops->init) {
> > > + ret = data->ops->init(data->ops, data);
> > > + if (ret < 0)
> > > + goto out;
> > > + }
> > > +
> > > + list_add_rcu(&data->list, &file->triggers);
> > > + ret++;
> > > +
> > > + update_cond_flag(file);
> > > + if (trace_event_trigger_enable_disable(file, 1) < 0) {
> > > + list_del_rcu(&data->list);
> > > + update_cond_flag(file);
> > > + ret--;
> > > + }
> > > + init_trace_object();
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static void unregister_object_trigger(char *glob, struct event_trigger_ops *ops,
> > > + struct event_trigger_data *test,
> > > + struct trace_event_file *file)
> > > +{
> > > + struct event_trigger_data *data;
> > > + bool unregistered = false;
> > > +
> > > + lockdep_assert_held(&event_mutex);
> > > +
> > > + list_for_each_entry(data, &file->triggers, list) {
> > > + if (data->cmd_ops->trigger_type == test->cmd_ops->trigger_type) {
> > > + unregistered = true;
> > > + list_del_rcu(&data->list);
> > > + trace_event_trigger_enable_disable(file, 0);
> > > + update_cond_flag(file);
> > > + break;
> > > + }
> > > + }
> > > +
> > > + if (unregistered && data->ops->free) {
> > > + data->ops->free(data->ops, data);
> > > + exit_trace_object();
> > > + }
> >
> > isn't this the following? Can you skip exit_trace_object()
> > if data->ops->free == NULL?
> >
> > if (unregistered) {
> > if (data->ops->free)
> > data->ops->free(data->ops, data);
> > exit_trace_object();
> > }
>
> Thanks, this one would be better, although data->ops->free is not
> empty at the moment.
>
>

Thank you,


--
Masami Hiramatsu <[email protected]>

2021-12-21 07:36:12

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Hi Jeff,

On Sat, 18 Dec 2021 00:32:57 +0800
Jeff Xie <[email protected]> wrote:

> > > +struct object_instance {
> > > + void *object;
> > > + struct freelist_node free_list;
> > > + struct list_head active_list;
> > > +};
> > > +
> > > +struct obj_pool {
> > > + struct freelist_head free_list;
> > > + struct list_head active_list;
> > > +};
> > > +static struct obj_pool *obj_pool;
> > > +
> > > +static bool object_exist(void *obj)
> > > +{
> > > + struct object_instance *inst;
> > > + bool ret = false;
> > > +
> > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > + if (inst->object == obj) {
> > > + ret = true;
> > > + goto out;
> > > + }
> > > + }
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static bool object_empty(void)
> > > +{
> > > + return list_empty(&obj_pool->active_list);
> > > +}
> > > +
> > > +static void set_trace_object(void *obj)
> > > +{
> > > + struct freelist_node *fn;
> > > + struct object_instance *ins;
> > > + unsigned long flags;
> > > +
> > > + if (in_nmi())
> > > + return;
> > > +
> > > + if (!obj)
> > > + return;
> > > +
> > > + if (object_exist(obj))
> > > + return;
> > > +
> > > + fn = freelist_try_get(&obj_pool->free_list);
> > > + if (!fn) {
> > > + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> > > + return;
> > > + }
> > > +
> > > + ins = container_of(fn, struct object_instance, free_list);
> > > + ins->object = obj;
> > > +
> > > + raw_spin_lock_irqsave(&object_spin_lock, flags);
> > > + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> > > + raw_spin_unlock_irqrestore(&object_spin_lock, flags);
> >
> > Please add a comment that why this spinlock is needed here and why
> > other operation doesn't need it.
>
> (Only this place has write operations, and it cannot be concurrent)
> I agree, I will add it.

BTW, I have another better solution for object pool. If the
object pool size is fixed (of course to avoid performance overhead,
it should be small enough) and it can not avoid using spinlock,
it is better to use fixed-size array. That makes the implementation
much simpler.

static struct object_instance {
void *obj; /* trace object */
// add offset in the next patch
} traced_obj[MAX_TRACE_OBJECT];

static atomic_t num_traced_obj;
static DEFINE_RAW_SPINLOCK(trace_obj_lock);

static void set_trace_object(void *obj)
{
...

raw_spin_lock_irqsave(&trace_obj_lock, flags);
if (num_traced_obj == MAX_TRACED_OBJECT)
goto out;

traced_obj[num_traced_obj].obj = obj;
smp_wmb(); // make sure the num_traced_obj update always appears after trace_obj update.
num_traced_obj++;
out:
raw_spin_unlock_irqrestore(&trace_obj_lock, flags);
}

static bool object_exist(void *obj)
{
...
max = num_traced_obj;
smp_rmb(); // then the num_traced_obj will cap the max.
for (i = 0; i < max; i++) {
if (traced_obj[i].obj == obj)
return true;
}
return false;
}

static inline void free_object_pool(void)
{
num_traced_obj = 0;
memset(traced_obj, 0, sizeof(traced_obj));
}

Sorry if I confuse you but I think you shouldn't take a time on those unneeded
complexity. :)


> > > +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> > > + unsigned long object)
> > > +{
> > > +
> > > + struct trace_buffer *buffer;
> > > + struct ring_buffer_event *event;
> > > + struct trace_object_entry *entry;
> > > + int pc;
> > > +
> > > + pc = preempt_count();
> > > + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> > > + TRACE_OBJECT, sizeof(*entry), pc);
> > > + if (!event)
> > > + return;
> > > + entry = ring_buffer_event_data(event);
> > > + entry->ip = ip;
> > > + entry->parent_ip = parent_ip;
> > > + entry->object = object;
> > > +
> > > + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> > > + entry, pc);
> > > +}
> > > +
> > > +static void
> > > +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> > > + struct ftrace_ops *op, struct ftrace_regs *fregs)
> > > +{
> > > + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> > > + unsigned long obj;
> > > + long disabled;
> > > + int cpu, n;
> > > +
> > > + preempt_disable_notrace();
> > > +
> > > + cpu = raw_smp_processor_id();
> > > + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
> >
> > So what is the purpose of this check? (are there any issue if the same
> > cpu reenter here?)
> >
>
> There may be an interrupt context that can preempt it. I am not yet
> sure whether the cpu reenter will affect it.
> I will debug and test it. (Referred from function_test_events_call())

Maybe you can use ftrace_test_recursion_trylock(), as kprobe_ftrace_handler()
does.

Thank you,

--
Masami Hiramatsu <[email protected]>

2021-12-21 10:29:43

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Hi Masami,

On Tue, Dec 21, 2021 at 3:36 PM Masami Hiramatsu <[email protected]> wrote:
>
> Hi Jeff,
>
> On Sat, 18 Dec 2021 00:32:57 +0800
> Jeff Xie <[email protected]> wrote:
>
> > > > +struct object_instance {
> > > > + void *object;
> > > > + struct freelist_node free_list;
> > > > + struct list_head active_list;
> > > > +};
> > > > +
> > > > +struct obj_pool {
> > > > + struct freelist_head free_list;
> > > > + struct list_head active_list;
> > > > +};
> > > > +static struct obj_pool *obj_pool;
> > > > +
> > > > +static bool object_exist(void *obj)
> > > > +{
> > > > + struct object_instance *inst;
> > > > + bool ret = false;
> > > > +
> > > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > > + if (inst->object == obj) {
> > > > + ret = true;
> > > > + goto out;
> > > > + }
> > > > + }
> > > > +out:
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static bool object_empty(void)
> > > > +{
> > > > + return list_empty(&obj_pool->active_list);
> > > > +}
> > > > +
> > > > +static void set_trace_object(void *obj)
> > > > +{
> > > > + struct freelist_node *fn;
> > > > + struct object_instance *ins;
> > > > + unsigned long flags;
> > > > +
> > > > + if (in_nmi())
> > > > + return;
> > > > +
> > > > + if (!obj)
> > > > + return;
> > > > +
> > > > + if (object_exist(obj))
> > > > + return;
> > > > +
> > > > + fn = freelist_try_get(&obj_pool->free_list);
> > > > + if (!fn) {
> > > > + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> > > > + return;
> > > > + }
> > > > +
> > > > + ins = container_of(fn, struct object_instance, free_list);
> > > > + ins->object = obj;
> > > > +
> > > > + raw_spin_lock_irqsave(&object_spin_lock, flags);
> > > > + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> > > > + raw_spin_unlock_irqrestore(&object_spin_lock, flags);
> > >
> > > Please add a comment that why this spinlock is needed here and why
> > > other operation doesn't need it.
> >
> > (Only this place has write operations, and it cannot be concurrent)
> > I agree, I will add it.
>
> BTW, I have another better solution for object pool. If the
> object pool size is fixed (of course to avoid performance overhead,
> it should be small enough) and it can not avoid using spinlock,
> it is better to use fixed-size array. That makes the implementation
> much simpler.

This looks really simpler, I will add it in the next version ;-)

> static struct object_instance {
> void *obj; /* trace object */
> // add offset in the next patch
> } traced_obj[MAX_TRACE_OBJECT];
>
> static atomic_t num_traced_obj;
> static DEFINE_RAW_SPINLOCK(trace_obj_lock);
>
> static void set_trace_object(void *obj)
> {
> ...
>
> raw_spin_lock_irqsave(&trace_obj_lock, flags);
> if (num_traced_obj == MAX_TRACED_OBJECT)
> goto out;
>
> traced_obj[num_traced_obj].obj = obj;
> smp_wmb(); // make sure the num_traced_obj update always appears after trace_obj update.
> num_traced_obj++;

I would like to ask whether this place need to add another smp_wmb()
to match the smp_rmb() in the object_exist().
I learned that rcu has a publish and subscribe mechanism ;-)

> out:
> raw_spin_unlock_irqrestore(&trace_obj_lock, flags);
> }
>
> static bool object_exist(void *obj)
> {
> ...
> max = num_traced_obj;
> smp_rmb(); // then the num_traced_obj will cap the max.
> for (i = 0; i < max; i++) {
> if (traced_obj[i].obj == obj)
> return true;
> }
> return false;
> }
>
> static inline void free_object_pool(void)
> {
> num_traced_obj = 0;
> memset(traced_obj, 0, sizeof(traced_obj));
> }
>
> Sorry if I confuse you but I think you shouldn't take a time on those unneeded
> complexity. :)

Thanks, at least I learned a different way to implement it ;-)

>
>
> > > > +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> > > > + unsigned long object)
> > > > +{
> > > > +
> > > > + struct trace_buffer *buffer;
> > > > + struct ring_buffer_event *event;
> > > > + struct trace_object_entry *entry;
> > > > + int pc;
> > > > +
> > > > + pc = preempt_count();
> > > > + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> > > > + TRACE_OBJECT, sizeof(*entry), pc);
> > > > + if (!event)
> > > > + return;
> > > > + entry = ring_buffer_event_data(event);
> > > > + entry->ip = ip;
> > > > + entry->parent_ip = parent_ip;
> > > > + entry->object = object;
> > > > +
> > > > + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> > > > + entry, pc);
> > > > +}
> > > > +
> > > > +static void
> > > > +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> > > > + struct ftrace_ops *op, struct ftrace_regs *fregs)
> > > > +{
> > > > + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> > > > + unsigned long obj;
> > > > + long disabled;
> > > > + int cpu, n;
> > > > +
> > > > + preempt_disable_notrace();
> > > > +
> > > > + cpu = raw_smp_processor_id();
> > > > + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
> > >
> > > So what is the purpose of this check? (are there any issue if the same
> > > cpu reenter here?)
> > >
> >
> > There may be an interrupt context that can preempt it. I am not yet
> > sure whether the cpu reenter will affect it.
> > I will debug and test it. (Referred from function_test_events_call())
>
> Maybe you can use ftrace_test_recursion_trylock(), as kprobe_ftrace_handler()
> does.

I will use it ,thanks.

>
> Thank you,
>
> --
> Masami Hiramatsu <[email protected]>
---
JeffXie

2021-12-23 14:12:37

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Hi Masami,

On Tue, Dec 21, 2021 at 6:29 PM Jeff Xie <[email protected]> wrote:
>
> Hi Masami,
>
> On Tue, Dec 21, 2021 at 3:36 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > Hi Jeff,
> >
> > On Sat, 18 Dec 2021 00:32:57 +0800
> > Jeff Xie <[email protected]> wrote:
> >
> > > > > +struct object_instance {
> > > > > + void *object;
> > > > > + struct freelist_node free_list;
> > > > > + struct list_head active_list;
> > > > > +};
> > > > > +
> > > > > +struct obj_pool {
> > > > > + struct freelist_head free_list;
> > > > > + struct list_head active_list;
> > > > > +};
> > > > > +static struct obj_pool *obj_pool;
> > > > > +
> > > > > +static bool object_exist(void *obj)
> > > > > +{
> > > > > + struct object_instance *inst;
> > > > > + bool ret = false;
> > > > > +
> > > > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > > > + if (inst->object == obj) {
> > > > > + ret = true;
> > > > > + goto out;
> > > > > + }
> > > > > + }
> > > > > +out:
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > > +static bool object_empty(void)
> > > > > +{
> > > > > + return list_empty(&obj_pool->active_list);
> > > > > +}
> > > > > +
> > > > > +static void set_trace_object(void *obj)
> > > > > +{
> > > > > + struct freelist_node *fn;
> > > > > + struct object_instance *ins;
> > > > > + unsigned long flags;
> > > > > +
> > > > > + if (in_nmi())
> > > > > + return;
> > > > > +
> > > > > + if (!obj)
> > > > > + return;
> > > > > +
> > > > > + if (object_exist(obj))
> > > > > + return;
> > > > > +
> > > > > + fn = freelist_try_get(&obj_pool->free_list);
> > > > > + if (!fn) {
> > > > > + trace_printk("object_pool is full, can't trace object:0x%px\n", obj);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + ins = container_of(fn, struct object_instance, free_list);
> > > > > + ins->object = obj;
> > > > > +
> > > > > + raw_spin_lock_irqsave(&object_spin_lock, flags);
> > > > > + list_add_rcu(&ins->active_list, &obj_pool->active_list);
> > > > > + raw_spin_unlock_irqrestore(&object_spin_lock, flags);
> > > >
> > > > Please add a comment that why this spinlock is needed here and why
> > > > other operation doesn't need it.
> > >
> > > (Only this place has write operations, and it cannot be concurrent)
> > > I agree, I will add it.
> >
> > BTW, I have another better solution for object pool. If the
> > object pool size is fixed (of course to avoid performance overhead,
> > it should be small enough) and it can not avoid using spinlock,
> > it is better to use fixed-size array. That makes the implementation
> > much simpler.
>
> This looks really simpler, I will add it in the next version ;-)
>
> > static struct object_instance {
> > void *obj; /* trace object */
> > // add offset in the next patch
> > } traced_obj[MAX_TRACE_OBJECT];
> >
> > static atomic_t num_traced_obj;
> > static DEFINE_RAW_SPINLOCK(trace_obj_lock);
> >
> > static void set_trace_object(void *obj)
> > {
> > ...
> >
> > raw_spin_lock_irqsave(&trace_obj_lock, flags);
> > if (num_traced_obj == MAX_TRACED_OBJECT)
> > goto out;
> >
> > traced_obj[num_traced_obj].obj = obj;
> > smp_wmb(); // make sure the num_traced_obj update always appears after trace_obj update.
> > num_traced_obj++;
>
> I would like to ask whether this place need to add another smp_wmb()
> to match the smp_rmb() in the object_exist().
> I learned that rcu has a publish and subscribe mechanism ;-)

There is no need to add another smp_wmb(). I understood.

Have a nice holiday;-) Thanks.

>
> > out:
> > raw_spin_unlock_irqrestore(&trace_obj_lock, flags);
> > }
> >
> > static bool object_exist(void *obj)
> > {
> > ...
> > max = num_traced_obj;
> > smp_rmb(); // then the num_traced_obj will cap the max.
> > for (i = 0; i < max; i++) {
> > if (traced_obj[i].obj == obj)
> > return true;
> > }
> > return false;
> > }
> >
> > static inline void free_object_pool(void)
> > {
> > num_traced_obj = 0;
> > memset(traced_obj, 0, sizeof(traced_obj));
> > }
> >
> > Sorry if I confuse you but I think you shouldn't take a time on those unneeded
> > complexity. :)
>
> Thanks, at least I learned a different way to implement it ;-)
>
> >
> >
> > > > > +static void submit_trace_object(unsigned long ip, unsigned long parent_ip,
> > > > > + unsigned long object)
> > > > > +{
> > > > > +
> > > > > + struct trace_buffer *buffer;
> > > > > + struct ring_buffer_event *event;
> > > > > + struct trace_object_entry *entry;
> > > > > + int pc;
> > > > > +
> > > > > + pc = preempt_count();
> > > > > + event = trace_event_buffer_lock_reserve(&buffer, &event_trace_file,
> > > > > + TRACE_OBJECT, sizeof(*entry), pc);
> > > > > + if (!event)
> > > > > + return;
> > > > > + entry = ring_buffer_event_data(event);
> > > > > + entry->ip = ip;
> > > > > + entry->parent_ip = parent_ip;
> > > > > + entry->object = object;
> > > > > +
> > > > > + event_trigger_unlock_commit(&event_trace_file, buffer, event,
> > > > > + entry, pc);
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +trace_object_events_call(unsigned long ip, unsigned long parent_ip,
> > > > > + struct ftrace_ops *op, struct ftrace_regs *fregs)
> > > > > +{
> > > > > + struct pt_regs *pt_regs = ftrace_get_regs(fregs);
> > > > > + unsigned long obj;
> > > > > + long disabled;
> > > > > + int cpu, n;
> > > > > +
> > > > > + preempt_disable_notrace();
> > > > > +
> > > > > + cpu = raw_smp_processor_id();
> > > > > + disabled = atomic_inc_return(&per_cpu(trace_object_event_disable, cpu));
> > > >
> > > > So what is the purpose of this check? (are there any issue if the same
> > > > cpu reenter here?)
> > > >
> > >
> > > There may be an interrupt context that can preempt it. I am not yet
> > > sure whether the cpu reenter will affect it.
> > > I will debug and test it. (Referred from function_test_events_call())
> >
> > Maybe you can use ftrace_test_recursion_trylock(), as kprobe_ftrace_handler()
> > does.
>
> I will use it ,thanks.
>
> >
> > Thank you,
> >
> > --
> > Masami Hiramatsu <[email protected]>
> ---
> JeffXie
---
JeffXie

2022-01-08 00:21:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Sorry for the late reply, I'm currently unemployed (for another week) and
was spending all my time renovating my office. I finished my office and I'm
now trying to get back up to speed.

On Sun, 19 Dec 2021 12:07:23 +0900
Masami Hiramatsu <[email protected]> wrote:


> > > > +#include "trace_output.h"
> > > > +#include <linux/freelist.h>
> > > > +
> > > > +static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
> > >
> > > atomic_t is for atomic operation which must be shared among cpus. On the
> > > other hand, per-cpu variable is used for the core-local storage or flags,
> > > other cpus never touch it. Thus the per-cpu atomic_t is very strange.
> > >
> >
> > From the patch V1, I cloned it from the function_test_events_call()
> > in kernel/trace/trace_events.c
> >
> > commit: 9ea21c1ecdb35ecdcac5fd9d95f62a1f6a7ffec0
> > tracing/events: perform function tracing in event selftests
> > Author: Steven Rostedt <[email protected]>
>
> Hmm, OK.

Ug, showing me my skeletons in my closet! That commit is from 2009, where I
didn't know any better ;-)

>
> >
> > It should be to prevent being preempted by the interrupt context in
> > the process of adding one
>
> Yeah, I think so.
>
> The commit says "some bugs" but it is not sure what actually needs to be
> cared.
>
> tracing/events: perform function tracing in event selftests
>
> We can find some bugs in the trace events if we stress the writes as well.
> The function tracer is a good way to stress the events.
>
> Steve, can you tell me what was the problem?
>
> I think we don't need per-cpu atomic_t because the counter is increment
> and decrement. Thus when quiting the nested ftrace handler on the same CPU,
> the counter comes back to the same value. We don't need to care about
> atomic increment.
>
> I mean, if we use the normal per-cpu "unsigned int" as a counter, the
> operation of "counter++" becomes;

Yes, that was from the days of being extra paranoid. A simple counter would
work, with a barrier() in place such that gcc doesn't cause any issues.

I may have to go back and revisit all that code and clean it up a bit.

>
> load counter to reg1
> [1]
> reg1 = reg1 + 1
> store reg1 to counter
>
> And if an interrupt occurs at [1], the following happens.
>
> load counter to reg1 # counter = 0
>
> (interrupt - save reg1)
> load counter to reg1 # counter = 0
> reg1 = reg1 + 1
> store reg1 to counter # counter = 1
> ...
> load counter to reg1 # counter = 1
> reg1 = reg1 - 1
> store reg1 to counter # counter = 0
> (iret - restore reg1)
>
> reg1 = reg1 + 1
> store reg1 to counter
>
> So, even if the operation is not atomic, there seems no problem.
> What else scenario we have to worry?
>
> (BTW, what is the ftrace_test_recursion_trylock()? Is that also
> for detecting nesting case??)

Yes, the ftrace_test_recursion_trylock() is for finding recursions.

The above code is from the early days of ftrace, and was only used in
testing at boot up. It's not something to copy from ;-)

>
> > > > +static DEFINE_RAW_SPINLOCK(object_spin_lock);
> > > > +static struct trace_event_file event_trace_file;
> > > > +static const int max_args_num = 6;
> > > > +static const int max_obj_pool = 10;
> > > > +static atomic_t trace_object_ref;
> > > > +static int exit_trace_object(void);
> > > > +static int init_trace_object(void);
> > > > +
> > >
> > > Please add more comments to the code itself. Explain why this is needed
> > > and how it works for which case. That will lead deeper understanding.
> > >
> >
> > I agree, I will add more comments in the next version.
> >
> > > > +struct object_instance {
> > > > + void *object;
> > > > + struct freelist_node free_list;
> > > > + struct list_head active_list;
> > > > +};
> > > > +
> > > > +struct obj_pool {
> > > > + struct freelist_head free_list;
> > > > + struct list_head active_list;
> > > > +};
> > > > +static struct obj_pool *obj_pool;
> > > > +
> > > > +static bool object_exist(void *obj)
> > > > +{
> > > > + struct object_instance *inst;
> > > > + bool ret = false;
> > > > +
> > > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > > + if (inst->object == obj) {
> > > > + ret = true;
> > > > + goto out;
> > > > + }
> > > > + }
> > > > +out:
> > > > + return ret;

BTW, the above really should be:

static bool object_exist(void *obj)
{
struct object_instance *inst;

list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
if (inst->object == obj)
return true;
}
return false;
}

-- Steve


2022-01-10 02:00:55

by Jeff Xie

[permalink] [raw]
Subject: Re: [RFC][PATCH v6 1/5] trace: Add trace any kernel object

Hi Steven,

Welcome back and look forward to playing this patch set with you again
in 2022 ;-)

Thank you and Masima for your guidance on this patchset in 2021. I
learned a lot.


On Sat, Jan 8, 2022 at 8:21 AM Steven Rostedt <[email protected]> wrote:
>
> Sorry for the late reply, I'm currently unemployed (for another week) and
> was spending all my time renovating my office. I finished my office and I'm
> now trying to get back up to speed.
>
> On Sun, 19 Dec 2021 12:07:23 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
>
> > > > > +#include "trace_output.h"
> > > > > +#include <linux/freelist.h>
> > > > > +
> > > > > +static DEFINE_PER_CPU(atomic_t, trace_object_event_disable);
> > > >
> > > > atomic_t is for atomic operation which must be shared among cpus. On the
> > > > other hand, per-cpu variable is used for the core-local storage or flags,
> > > > other cpus never touch it. Thus the per-cpu atomic_t is very strange.
> > > >
> > >
> > > From the patch V1, I cloned it from the function_test_events_call()
> > > in kernel/trace/trace_events.c
> > >
> > > commit: 9ea21c1ecdb35ecdcac5fd9d95f62a1f6a7ffec0
> > > tracing/events: perform function tracing in event selftests
> > > Author: Steven Rostedt <[email protected]>
> >
> > Hmm, OK.
>
> Ug, showing me my skeletons in my closet! That commit is from 2009, where I
> didn't know any better ;-)
>
> >
> > >
> > > It should be to prevent being preempted by the interrupt context in
> > > the process of adding one
> >
> > Yeah, I think so.
> >
> > The commit says "some bugs" but it is not sure what actually needs to be
> > cared.
> >
> > tracing/events: perform function tracing in event selftests
> >
> > We can find some bugs in the trace events if we stress the writes as well.
> > The function tracer is a good way to stress the events.
> >
> > Steve, can you tell me what was the problem?
> >
> > I think we don't need per-cpu atomic_t because the counter is increment
> > and decrement. Thus when quiting the nested ftrace handler on the same CPU,
> > the counter comes back to the same value. We don't need to care about
> > atomic increment.
> >
> > I mean, if we use the normal per-cpu "unsigned int" as a counter, the
> > operation of "counter++" becomes;
>
> Yes, that was from the days of being extra paranoid. A simple counter would
> work, with a barrier() in place such that gcc doesn't cause any issues.
>
> I may have to go back and revisit all that code and clean it up a bit.
>
> >
> > load counter to reg1
> > [1]
> > reg1 = reg1 + 1
> > store reg1 to counter
> >
> > And if an interrupt occurs at [1], the following happens.
> >
> > load counter to reg1 # counter = 0
> >
> > (interrupt - save reg1)
> > load counter to reg1 # counter = 0
> > reg1 = reg1 + 1
> > store reg1 to counter # counter = 1
> > ...
> > load counter to reg1 # counter = 1
> > reg1 = reg1 - 1
> > store reg1 to counter # counter = 0
> > (iret - restore reg1)
> >
> > reg1 = reg1 + 1
> > store reg1 to counter
> >
> > So, even if the operation is not atomic, there seems no problem.
> > What else scenario we have to worry?
> >
> > (BTW, what is the ftrace_test_recursion_trylock()? Is that also
> > for detecting nesting case??)
>
> Yes, the ftrace_test_recursion_trylock() is for finding recursions.
>
> The above code is from the early days of ftrace, and was only used in
> testing at boot up. It's not something to copy from ;-)
>
> >
> > > > > +static DEFINE_RAW_SPINLOCK(object_spin_lock);
> > > > > +static struct trace_event_file event_trace_file;
> > > > > +static const int max_args_num = 6;
> > > > > +static const int max_obj_pool = 10;
> > > > > +static atomic_t trace_object_ref;
> > > > > +static int exit_trace_object(void);
> > > > > +static int init_trace_object(void);
> > > > > +
> > > >
> > > > Please add more comments to the code itself. Explain why this is needed
> > > > and how it works for which case. That will lead deeper understanding.
> > > >
> > >
> > > I agree, I will add more comments in the next version.
> > >
> > > > > +struct object_instance {
> > > > > + void *object;
> > > > > + struct freelist_node free_list;
> > > > > + struct list_head active_list;
> > > > > +};
> > > > > +
> > > > > +struct obj_pool {
> > > > > + struct freelist_head free_list;
> > > > > + struct list_head active_list;
> > > > > +};
> > > > > +static struct obj_pool *obj_pool;
> > > > > +
> > > > > +static bool object_exist(void *obj)
> > > > > +{
> > > > > + struct object_instance *inst;
> > > > > + bool ret = false;
> > > > > +
> > > > > + list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> > > > > + if (inst->object == obj) {
> > > > > + ret = true;
> > > > > + goto out;
> > > > > + }
> > > > > + }
> > > > > +out:
> > > > > + return ret;
>
> BTW, the above really should be:
>
> static bool object_exist(void *obj)
> {
> struct object_instance *inst;
>
> list_for_each_entry_rcu(inst, &obj_pool->active_list, active_list) {
> if (inst->object == obj)
> return true;
> }
> return false;
> }

Thanks. Masami suggested that it is better to use fixed-size array,
I will be ready to send the next version.

> -- Steve
>
---
JeffXie