Hi Jeff,
On Fri, 5 Nov 2021 17:11:27 +0800
Jeff Xie <[email protected]> wrote:
> Hi Masami,
>
> Thank you very much for your patient guidance and help, It helped me a lot.
>
> On Thu, Nov 4, 2021 at 11:47 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > Hi Jeff,
> >
> > Thank you for updating!
> >
> > On Thu, 28 Oct 2021 17:27:02 +0800
> > Jeff Xie <[email protected]> wrote:
> >
> > > Introduce a method based on function tracer and kprobe/uprobe/trace_event/
> > > to trace any object in the linux kernel.
> > >
> > > Usage:
> > > When using kprobe/uprobe/trace_event/ we can use a new trigger(objtrace)
> > > to set object and trigger object trace.
> > >
> > > For example:
> > >
> > > For the function bio_add_page, we can trace the first argument:
> > >
> > > int bio_add_page(struct bio *bio, struct page *page,
> > > unsigned int len, unsigned int offset)
> > >
> > > [root@JeffXie ]# cd /sys/kernel/debug/tracing/
> > > [root@JeffXie tracing]# echo 'p bio_add_page arg1=$arg1' > kprobe_events
> > > [root@JeffXie tracing]# cd events/kprobes/p_bio_add_page_0/
> > > [root@JeffXie p_bio_add_page_0]# echo 'objtrace:arg1:1 if comm == "cat"' > ./trigger
> > > [root@JeffXie p_bio_add_page_0]# cat /test.txt
> > > [root@JeffXie p_bio_add_page_0]# cd /sys/kernel/debug/tracing/
> > > [root@JeffXie tracing]# cat ./trace
> > > # tracer: nop
> > > #
> > > # entries-in-buffer/entries-written: 81/81 #P:4
> > > #
> > > # _-----=> irqs-off
> > > # / _----=> need-resched
> > > # | / _---=> hardirq/softirq
> > > # || / _--=> preempt-depth
> > > # ||| / _-=> migrate-disable
> > > # |||| / delay
> > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> > > # | | | ||||| | |
> > > cat-122 [001] ..... 111.193997: bio_add_page <-ext4_mpage_readpages object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193998: __bio_try_merge_page <-bio_add_page object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193998: __bio_add_page <-bio_add_page object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193998: submit_bio <-ext4_mpage_readpages object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193998: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193999: __submit_bio <-submit_bio_noacct object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193999: submit_bio_checks <-__submit_bio object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193999: __cond_resched <-submit_bio_checks object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.193999: rcu_all_qs <-__cond_resched object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194000: should_fail_bio <-submit_bio_checks object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194001: blk_mq_submit_bio <-__submit_bio object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194001: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194001: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194002: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194002: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888102a4b900
> > > cat-122 [001] ..... 111.194003: blk_account_io_start <-blk_mq_submit_bio object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194212: bio_advance <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194213: bio_endio <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194213: mpage_end_io <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194213: __read_end_io <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194218: bio_put <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194218: bio_free <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194218: bio_free <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194218: bvec_free <-bio_free object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: mempool_free <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: mempool_free <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: mempool_free_slab <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: mempool_free_slab <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: kmem_cache_free <-blk_update_request object:0xffff888102a4b900
> > > <idle>-0 [001] d.... 111.194219: kmem_cache_free <-blk_update_request object:0xffff888102a4b900
> > >
> > > Signed-off-by: Jeff Xie <[email protected]>
> > > ---
> > > Currently only showing all the code I have, But I think that there are main three tasks
> > > to be completed next:
> > >
> > > 1. Masami suggested:
> > >
> > > <snip>
> > > objfilter:add:arg1
> > >
> > > and the exit event
> > >
> > > objfilter:del:arg1
> > >
> > > then user will only see the trace between the entrance and exit events.
> > > </snip>
> > >
> > > Actually, I don’t have a good idea to implement it now ;-)
> > > Looking for related existing code logic.
> >
> > Even though, please support 'add' command at least (no problem even if
> > there is only 'add' supported) because it defines its syntax.
> > It seems a bit redundant now, but if there is 'add', we can extend it to
> > support 'del', 'clear' or 'update' etc.
> >
>
> Thanks, I will add the "add" command.
Thanks!
>
> >
> > >
> > > 2. Steven suggested:
> > > <snip>
> > > For example:
> > >
> > > obj = arg1 + 0x64
> > > if (copy_from_kernel_nofault(&val, arg1 + 0x64, sizeof(val)))
> > > // faulted
> > > return;
> > >
> > > Now val has the content of __bi_cnt and we can print that!
> > > </snip>
> > >
> > > We need to implement a more complex parsing and also record the size of a arg
> > > or other related members.it may change the way I record the object.
> >
> > That can be done in additional patch in a series. I recommend you to send a
> > series of patches, which simply add a feature (or extend existing feature),
> > step by step.
>
> This is a good suggestion, although I have never submitted a patchset.
> I will try it. ;-)
>
> > >
> > >
> > > 3. Update trigger documentation
> > >
> > > https://lore.kernel.org/lkml/[email protected]/T/#t
> > >
> > > I'm very happy to work with you all, and got a lot of knowledge from your help.
> > >
> > > At present, I am the only person writing code using the free time, I usually need to do
> > > other work about linux and take care of a three-month-old boy. unlike a group, the progress
> > > of submitting patches may be slower.
> >
> > That is OK. You can start sowing the seeds, then we can grow it. :-)
>
> I never thought that my idea might become a relatively large project.
> With my current level of kernel programming, I might not be able to
> complete it alone.
>
> > And also, please add a (set of) testcase for this feature under
> > tools/testing/selftests/ftrace/test.d/, so that the future changes will not
> > cause any regression. You don't have to check the trace result (if you can,
> > it is recommended) but at least the usage (e.g. accepting correct syntax,
> > and rejecting wrong syntax) test case is helpful.
>
> ok, I will add a set of testcase.
Good :)
[...]
> > > +static void obj_refill_fn(struct work_struct *refill_work)
> > > +{
> > > + void **new_obj_element;
> > > + unsigned long flags;
> > > + int new_min_nr, new_curr_nr;
> > > + int used_nr;
> > > +
> > > + new_min_nr = obj_pool.min_nr * 2;
> >
> > This can easily cause OOM. I think we need a limit to cap.
> > Anyway, at the first step, you can drop this "refill" routine and
> > simply use fixed size array (or hash table?).
> > Anyway if the array becomes huge, your kernel may slow down.
>
> If we limit the max size, should be able to solve the OOM.
> The logic of this refill should look normal, But in order to simplify
> the implementation of the first step,
> I can remove the refill routine.
I would like to ask you to focus on the performance side, since
the search (object_exist) will be run on every function call.
You can make it a lockless, kretprobe_instance pool (in kernel/kprobes.c)
will be a good example for you.
[...]
> > > +
> > > +static void
> > > +trace_object_trigger(struct event_trigger_data *data,
> > > + struct trace_buffer *buffer, void *rec,
> > > + struct ring_buffer_event *event)
> > > +{
> > > +
> > > + struct ftrace_event_field *field = data->private_data;
> > > + void *obj = NULL;
> > > +
> > > + memcpy(&obj, (char *)rec + field->offset, field->size);
> >
> > Please use 'sizeof(obj)' instead of 'field->size' because 'field->size'
> > can be longer than that.
> >
> This is only one I can't understand.
>
> The field->size should be (1(u8/s8), 2(u16/s16), 4(u32/s32),
> 8(u64/s64)) , the sizeof(obj) is 8
> and it should never be longer than sizeof(obj).
You may forget 32bit CPUs..., and you don't need smaller values too,
since it will easily cause false positive. So tracing an object
address is good.
Thank you!
--
Masami Hiramatsu <[email protected]>
Hi Masami,
On Fri, Nov 5, 2021 at 9:50 PM Masami Hiramatsu <[email protected]> wrote:
>
> Hi Jeff,
>
> On Fri, 5 Nov 2021 17:11:27 +0800
> Jeff Xie <[email protected]> wrote:
>
> > Hi Masami,
> >
> > Thank you very much for your patient guidance and help, It helped me a lot.
> >
> > On Thu, Nov 4, 2021 at 11:47 PM Masami Hiramatsu <[email protected]> wrote:
> > >
> > > Hi Jeff,
> > >
> > > Thank you for updating!
> > >
> > > On Thu, 28 Oct 2021 17:27:02 +0800
> > > Jeff Xie <[email protected]> wrote:
> > >
> > > > Introduce a method based on function tracer and kprobe/uprobe/trace_event/
> > > > to trace any object in the linux kernel.
> > > >
> > > > Usage:
> > > > When using kprobe/uprobe/trace_event/ we can use a new trigger(objtrace)
> > > > to set object and trigger object trace.
> > > >
> > > > For example:
> > > >
> > > > For the function bio_add_page, we can trace the first argument:
> > > >
> > > > int bio_add_page(struct bio *bio, struct page *page,
> > > > unsigned int len, unsigned int offset)
> > > >
> > > > [root@JeffXie ]# cd /sys/kernel/debug/tracing/
> > > > [root@JeffXie tracing]# echo 'p bio_add_page arg1=$arg1' > kprobe_events
> > > > [root@JeffXie tracing]# cd events/kprobes/p_bio_add_page_0/
> > > > [root@JeffXie p_bio_add_page_0]# echo 'objtrace:arg1:1 if comm == "cat"' > ./trigger
> > > > [root@JeffXie p_bio_add_page_0]# cat /test.txt
> > > > [root@JeffXie p_bio_add_page_0]# cd /sys/kernel/debug/tracing/
> > > > [root@JeffXie tracing]# cat ./trace
> > > > # tracer: nop
> > > > #
> > > > # entries-in-buffer/entries-written: 81/81 #P:4
> > > > #
> > > > # _-----=> irqs-off
> > > > # / _----=> need-resched
> > > > # | / _---=> hardirq/softirq
> > > > # || / _--=> preempt-depth
> > > > # ||| / _-=> migrate-disable
> > > > # |||| / delay
> > > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> > > > # | | | ||||| | |
> > > > cat-122 [001] ..... 111.193997: bio_add_page <-ext4_mpage_readpages object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193998: __bio_try_merge_page <-bio_add_page object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193998: __bio_add_page <-bio_add_page object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193998: submit_bio <-ext4_mpage_readpages object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193998: submit_bio_noacct <-ext4_mpage_readpages object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193999: __submit_bio <-submit_bio_noacct object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193999: submit_bio_checks <-__submit_bio object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193999: __cond_resched <-submit_bio_checks object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.193999: rcu_all_qs <-__cond_resched object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194000: should_fail_bio <-submit_bio_checks object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194001: blk_mq_submit_bio <-__submit_bio object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194001: blk_attempt_plug_merge <-blk_mq_submit_bio object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194001: __blk_mq_sched_bio_merge <-blk_mq_submit_bio object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194002: __blk_mq_alloc_request <-blk_mq_submit_bio object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194002: blk_mq_get_tag <-__blk_mq_alloc_request object:0xffff888102a4b900
> > > > cat-122 [001] ..... 111.194003: blk_account_io_start <-blk_mq_submit_bio object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194212: bio_advance <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194213: bio_endio <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194213: mpage_end_io <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194213: __read_end_io <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194218: bio_put <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194218: bio_free <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194218: bio_free <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194218: bvec_free <-bio_free object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: mempool_free <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: mempool_free <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: mempool_free_slab <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: mempool_free_slab <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: kmem_cache_free <-blk_update_request object:0xffff888102a4b900
> > > > <idle>-0 [001] d.... 111.194219: kmem_cache_free <-blk_update_request object:0xffff888102a4b900
> > > >
> > > > Signed-off-by: Jeff Xie <[email protected]>
> > > > ---
> > > > Currently only showing all the code I have, But I think that there are main three tasks
> > > > to be completed next:
> > > >
> > > > 1. Masami suggested:
> > > >
> > > > <snip>
> > > > objfilter:add:arg1
> > > >
> > > > and the exit event
> > > >
> > > > objfilter:del:arg1
> > > >
> > > > then user will only see the trace between the entrance and exit events.
> > > > </snip>
> > > >
> > > > Actually, I don’t have a good idea to implement it now ;-)
> > > > Looking for related existing code logic.
> > >
> > > Even though, please support 'add' command at least (no problem even if
> > > there is only 'add' supported) because it defines its syntax.
> > > It seems a bit redundant now, but if there is 'add', we can extend it to
> > > support 'del', 'clear' or 'update' etc.
> > >
> >
> > Thanks, I will add the "add" command.
>
> Thanks!
>
> >
> > >
> > > >
> > > > 2. Steven suggested:
> > > > <snip>
> > > > For example:
> > > >
> > > > obj = arg1 + 0x64
> > > > if (copy_from_kernel_nofault(&val, arg1 + 0x64, sizeof(val)))
> > > > // faulted
> > > > return;
> > > >
> > > > Now val has the content of __bi_cnt and we can print that!
> > > > </snip>
> > > >
> > > > We need to implement a more complex parsing and also record the size of a arg
> > > > or other related members.it may change the way I record the object.
> > >
> > > That can be done in additional patch in a series. I recommend you to send a
> > > series of patches, which simply add a feature (or extend existing feature),
> > > step by step.
> >
> > This is a good suggestion, although I have never submitted a patchset.
> > I will try it. ;-)
> >
> > > >
> > > >
> > > > 3. Update trigger documentation
> > > >
> > > > https://lore.kernel.org/lkml/[email protected]/T/#t
> > > >
> > > > I'm very happy to work with you all, and got a lot of knowledge from your help.
> > > >
> > > > At present, I am the only person writing code using the free time, I usually need to do
> > > > other work about linux and take care of a three-month-old boy. unlike a group, the progress
> > > > of submitting patches may be slower.
> > >
> > > That is OK. You can start sowing the seeds, then we can grow it. :-)
> >
> > I never thought that my idea might become a relatively large project.
> > With my current level of kernel programming, I might not be able to
> > complete it alone.
> >
> > > And also, please add a (set of) testcase for this feature under
> > > tools/testing/selftests/ftrace/test.d/, so that the future changes will not
> > > cause any regression. You don't have to check the trace result (if you can,
> > > it is recommended) but at least the usage (e.g. accepting correct syntax,
> > > and rejecting wrong syntax) test case is helpful.
> >
> > ok, I will add a set of testcase.
>
> Good :)
>
>
> [...]
> > > > +static void obj_refill_fn(struct work_struct *refill_work)
> > > > +{
> > > > + void **new_obj_element;
> > > > + unsigned long flags;
> > > > + int new_min_nr, new_curr_nr;
> > > > + int used_nr;
> > > > +
> > > > + new_min_nr = obj_pool.min_nr * 2;
> > >
> > > This can easily cause OOM. I think we need a limit to cap.
> > > Anyway, at the first step, you can drop this "refill" routine and
> > > simply use fixed size array (or hash table?).
> > > Anyway if the array becomes huge, your kernel may slow down.
> >
> > If we limit the max size, should be able to solve the OOM.
> > The logic of this refill should look normal, But in order to simplify
> > the implementation of the first step,
> > I can remove the refill routine.
>
> I would like to ask you to focus on the performance side, since
> the search (object_exist) will be run on every function call.
> You can make it a lockless, kretprobe_instance pool (in kernel/kprobes.c)
> will be a good example for you.
Thanks, I will learn from this example.
>
> [...]
> > > > +
> > > > +static void
> > > > +trace_object_trigger(struct event_trigger_data *data,
> > > > + struct trace_buffer *buffer, void *rec,
> > > > + struct ring_buffer_event *event)
> > > > +{
> > > > +
> > > > + struct ftrace_event_field *field = data->private_data;
> > > > + void *obj = NULL;
> > > > +
> > > > + memcpy(&obj, (char *)rec + field->offset, field->size);
> > >
> > > Please use 'sizeof(obj)' instead of 'field->size' because 'field->size'
> > > can be longer than that.
> > >
> > This is only one I can't understand.
> >
> > The field->size should be (1(u8/s8), 2(u16/s16), 4(u32/s32),
> > 8(u64/s64)) , the sizeof(obj) is 8
> > and it should never be longer than sizeof(obj).
>
> You may forget 32bit CPUs..., and you don't need smaller values too,
> since it will easily cause false positive. So tracing an object
> address is good.
In this case, it is indeed better to use sizeof(obj) ;-)
Actually I have submitted the [RFC][PATCH V4], I will also modify it.
> Thank you!
>
>
> --
> Masami Hiramatsu <[email protected]>
Thanks,
---
JeffXie