LinuxLists.cc - krobe: __blkdev

2020-06-17 10:33:03

Subject: krobe: __blkdev_put probe is missed

Hello Guys,

I found probe on __blkdev_put is missed, which can be observed
via bcc/perf reliably:

1) start trace
- perf probe __blkdev_put
- perf trace -a -e probe:__blkdev_put

or

/usr/share/bcc/tools/stackcount __blkdev_put

2) run the following command:
blockdev --getbsz /dev/sda1

3) 'perf trace' or stackcount just dumps one trace event, and it
should have been two
__blkdev_put() traces, since one __blkdev_put() is called for
partition(/dev/sda1),
and another is for disk(/dev/sda). If trace_printk() is added in __blkdev_put(),
two events will be captured from ftrace.

The only special thing about __blkdev_put() is that the function will call into
itself. However, no such issue on __blkdev_get() which calls itself too.

Thanks,
Ming Lei

2020-06-18 12:57:45

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Wed, Jun 17, 2020 at 06:30:39PM +0800, Ming Lei wrote:
> Hello Guys,
>
> I found probe on __blkdev_put is missed, which can be observed
> via bcc/perf reliably:
>
> 1) start trace
> - perf probe __blkdev_put
> - perf trace -a -e probe:__blkdev_put
>
> or
>
> /usr/share/bcc/tools/stackcount __blkdev_put
>
> 2) run the following command:
> blockdev --getbsz /dev/sda1
>
> 3) 'perf trace' or stackcount just dumps one trace event, and it
> should have been two
> __blkdev_put() traces, since one __blkdev_put() is called for
> partition(/dev/sda1),
> and another is for disk(/dev/sda). If trace_printk() is added in __blkdev_put(),
> two events will be captured from ftrace.
>

The issue can be shown by loading a kprobe module which registers on
__blkdev_put(), just by replacing _do_fork with __blkdev_put on
samples/kprobes/kprobe_example.c.

So the issue is really in kprobe code.

Thanks,
Ming

2020-06-18 13:58:22

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Ming,

On Thu, 18 Jun 2020 20:54:38 +0800
Ming Lei <[email protected]> wrote:

> On Wed, Jun 17, 2020 at 06:30:39PM +0800, Ming Lei wrote:
> > Hello Guys,
> >
> > I found probe on __blkdev_put is missed, which can be observed
> > via bcc/perf reliably:
> >
> > 1) start trace
> > - perf probe __blkdev_put
> > - perf trace -a -e probe:__blkdev_put

Could you dump the kprobe_event as below?

# cat /sys/kernel/tracing/kprobe_events

> >
> > or
> >
> > /usr/share/bcc/tools/stackcount __blkdev_put
> >
> > 2) run the following command:
> > blockdev --getbsz /dev/sda1

And dump the kprobe profile?

# cat /sys/kernel/tracing/kprobe_profile

> >
> > 3) 'perf trace' or stackcount just dumps one trace event, and it
> > should have been two
> > __blkdev_put() traces, since one __blkdev_put() is called for
> > partition(/dev/sda1),
> > and another is for disk(/dev/sda). If trace_printk() is added in __blkdev_put(),
> > two events will be captured from ftrace.
> >
>
> The issue can be shown by loading a kprobe module which registers on
> __blkdev_put(), just by replacing _do_fork with __blkdev_put on
> samples/kprobes/kprobe_example.c.

Could you tell me what kernel are you using?

I'm using 5.4 on ubuntu and can not reproduce it with kprobe_event.

root@devnote2:/sys/kernel/tracing# uname -a
Linux devnote2 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
root@devnote2:/sys/kernel/tracing# echo p __blkdev_put > kprobe_events
root@devnote2:/sys/kernel/tracing# echo 1 > events/kprobes/p___blkdev_put_0/enable
root@devnote2:/sys/kernel/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
root@devnote2:/sys/kernel/tracing# blockdev --getbsz /dev/nvme0n1
4096
root@devnote2:/sys/kernel/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<...>-111740 [002] .... 301734.476991: p___blkdev_put_0: (__blkdev_put+0x0/0x1e0)

Hmm, maybe some issue in the latest kernel...?

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-18 23:22:45

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Thu, Jun 18, 2020 at 10:56:02PM +0900, Masami Hiramatsu wrote:
> Hi Ming,
>
> On Thu, 18 Jun 2020 20:54:38 +0800
> Ming Lei <[email protected]> wrote:
>
> > On Wed, Jun 17, 2020 at 06:30:39PM +0800, Ming Lei wrote:
> > > Hello Guys,
> > >
> > > I found probe on __blkdev_put is missed, which can be observed
> > > via bcc/perf reliably:
> > >
> > > 1) start trace
> > > - perf probe __blkdev_put
> > > - perf trace -a -e probe:__blkdev_put
>
> Could you dump the kprobe_event as below?
>
> # cat /sys/kernel/tracing/kprobe_events
>
>
> > >
> > > or
> > >
> > > /usr/share/bcc/tools/stackcount __blkdev_put
> > >
> > > 2) run the following command:
> > > blockdev --getbsz /dev/sda1
>
> And dump the kprobe profile?
>
> # cat /sys/kernel/tracing/kprobe_profile
>
> > >
> > > 3) 'perf trace' or stackcount just dumps one trace event, and it
> > > should have been two
> > > __blkdev_put() traces, since one __blkdev_put() is called for
> > > partition(/dev/sda1),
> > > and another is for disk(/dev/sda). If trace_printk() is added in __blkdev_put(),
> > > two events will be captured from ftrace.
> > >
> >
> > The issue can be shown by loading a kprobe module which registers on
> > __blkdev_put(), just by replacing _do_fork with __blkdev_put on
> > samples/kprobes/kprobe_example.c.
>
> Could you tell me what kernel are you using?
>
> I'm using 5.4 on ubuntu and can not reproduce it with kprobe_event.
>
> root@devnote2:/sys/kernel/tracing# uname -a
> Linux devnote2 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> root@devnote2:/sys/kernel/tracing# echo p __blkdev_put > kprobe_events
> root@devnote2:/sys/kernel/tracing# echo 1 > events/kprobes/p___blkdev_put_0/enable
> root@devnote2:/sys/kernel/tracing# cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 0/0 #P:8
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> root@devnote2:/sys/kernel/tracing# blockdev --getbsz /dev/nvme0n1
> 4096
> root@devnote2:/sys/kernel/tracing# cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1/1 #P:8
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> <...>-111740 [002] .... 301734.476991: p___blkdev_put_0: (__blkdev_put+0x0/0x1e0)
>
> Hmm, maybe some issue in the latest kernel...?

Hello Masami,

I am testing the latest upstream kernel, your trace actually reproduces
this issue.

After 'blockdev --getbsz /dev/nvme0n1' returns, __blkdev_put() should
have been called two times(one for partition, and the other for disk),
however kprobe trace just shows one time of calling this function.

If trace_printk() is added at the entry of __blkdev_put() manually,
you will see that __blkdev_put() is called two times in 'blockdev
--getbsz /dev/nvme0n1'.

Thanks,
Ming

2020-06-19 05:17:26

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Ming,

On Fri, 19 Jun 2020 07:19:01 +0800
Ming Lei <[email protected]> wrote:

> > I'm using 5.4 on ubuntu and can not reproduce it with kprobe_event.
> >
> > root@devnote2:/sys/kernel/tracing# uname -a
> > Linux devnote2 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> > root@devnote2:/sys/kernel/tracing# echo p __blkdev_put > kprobe_events
> > root@devnote2:/sys/kernel/tracing# echo 1 > events/kprobes/p___blkdev_put_0/enable
> > root@devnote2:/sys/kernel/tracing# cat trace
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 0/0 #P:8
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > root@devnote2:/sys/kernel/tracing# blockdev --getbsz /dev/nvme0n1
> > 4096
> > root@devnote2:/sys/kernel/tracing# cat trace
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 1/1 #P:8
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > <...>-111740 [002] .... 301734.476991: p___blkdev_put_0: (__blkdev_put+0x0/0x1e0)
> >
> > Hmm, maybe some issue in the latest kernel...?
>
> Hello Masami,
>
> I am testing the latest upstream kernel, your trace actually reproduces
> this issue.

OK.

>
> After 'blockdev --getbsz /dev/nvme0n1' returns, __blkdev_put() should
> have been called two times(one for partition, and the other for disk),
> however kprobe trace just shows one time of calling this function.
>
> If trace_printk() is added at the entry of __blkdev_put() manually,
> you will see that __blkdev_put() is called two times in 'blockdev
> --getbsz /dev/nvme0n1'.

OK, let me check again on the latest kernel.
Here I tested with qemu.

root@devnote2:/sys/kernel/debug/tracing# uname -a
Linux devnote2 5.8.0-rc1+ #26 SMP PREEMPT Fri Jun 19 12:12:53 JST 2020 x86_64 x86_64 x86_64 GNU/Linux

And we have a (virtual) sda with 1 partition.

root@devnote2:/sys/kernel/debug/tracing# cat /proc/partitions
major minor #blocks name

8 0 10240 sda
8 1 9216 sda1

OK, then let's make events (for sure)

root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events

There are 3 events in the kernel, blkdev_put() and __blkdev_put() and
the return of __blkdev_put().
Then enable it and access to */dev/sda* (a disk)

root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
root@devnote2:/sys/kernel/debug/tracing# blockdev --getbsz /dev/sda
4096
root@devnote2:/sys/kernel/debug/tracing# echo 0 > events/kprobes/enable
root@devnote2:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 3/3 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
blockdev-185 [002] ...1 72.604266: p_blkdev_put_0: (blkdev_put+0x0/0x130)
blockdev-185 [002] ...1 72.604276: p___blkdev_put_0: (__blkdev_put+0x0/0x220)
blockdev-185 [002] d..2 72.604288: r___blkdev_put_0: (blkdev_put+0x50/0x130 <- __blkdev_put)

So the __blkdev_put() is called once from blkdev_put().
Next, we do same trace with accessing */dev/sda1* (a partition).

root@devnote2:/sys/kernel/debug/tracing# echo > trace
root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
root@devnote2:/sys/kernel/debug/tracing# blockdev --getbsz /dev/sda1
4096
root@devnote2:/sys/kernel/debug/tracing# echo 0 > events/kprobes/enable
root@devnote2:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 5/5 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
blockdev-191 [005] ...1 194.401517: p_blkdev_put_0: (blkdev_put+0x0/0x130)
blockdev-191 [005] ...1 194.401527: p___blkdev_put_0: (__blkdev_put+0x0/0x220)
blockdev-191 [005] ...1 194.401529: p___blkdev_put_0: (__blkdev_put+0x0/0x220)
blockdev-191 [005] d..2 194.401535: r___blkdev_put_0: (__blkdev_put+0x1ea/0x220 <- __blkdev_put)
blockdev-191 [005] d..2 194.401536: r___blkdev_put_0: (blkdev_put+0x50/0x130 <- __blkdev_put)

In this case, we can see the __blkdev_put() is called twice, from
blkdev_put() and __blkdev_put() itself (nested call).

So, it seems that the kprobe event correctly works.

Could you do same thing on your environment, and share the result?

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-19 08:53:42

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Masami,

On Fri, Jun 19, 2020 at 02:12:39PM +0900, Masami Hiramatsu wrote:
> Hi Ming,
>
> On Fri, 19 Jun 2020 07:19:01 +0800
> Ming Lei <[email protected]> wrote:
>
> > > I'm using 5.4 on ubuntu and can not reproduce it with kprobe_event.
> > >
> > > root@devnote2:/sys/kernel/tracing# uname -a
> > > Linux devnote2 5.4.0-37-generic #41-Ubuntu SMP Wed Jun 3 18:57:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> > > root@devnote2:/sys/kernel/tracing# echo p __blkdev_put > kprobe_events
> > > root@devnote2:/sys/kernel/tracing# echo 1 > events/kprobes/p___blkdev_put_0/enable
> > > root@devnote2:/sys/kernel/tracing# cat trace
> > > # tracer: nop
> > > #
> > > # entries-in-buffer/entries-written: 0/0 #P:8
> > > #
> > > # _-----=> irqs-off
> > > # / _----=> need-resched
> > > # | / _---=> hardirq/softirq
> > > # || / _--=> preempt-depth
> > > # ||| / delay
> > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > > # | | | |||| | |
> > > root@devnote2:/sys/kernel/tracing# blockdev --getbsz /dev/nvme0n1
> > > 4096
> > > root@devnote2:/sys/kernel/tracing# cat trace
> > > # tracer: nop
> > > #
> > > # entries-in-buffer/entries-written: 1/1 #P:8
> > > #
> > > # _-----=> irqs-off
> > > # / _----=> need-resched
> > > # | / _---=> hardirq/softirq
> > > # || / _--=> preempt-depth
> > > # ||| / delay
> > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > > # | | | |||| | |
> > > <...>-111740 [002] .... 301734.476991: p___blkdev_put_0: (__blkdev_put+0x0/0x1e0)
> > >
> > > Hmm, maybe some issue in the latest kernel...?
> >
> > Hello Masami,
> >
> > I am testing the latest upstream kernel, your trace actually reproduces
> > this issue.
>
> OK.
>
> >
> > After 'blockdev --getbsz /dev/nvme0n1' returns, __blkdev_put() should
> > have been called two times(one for partition, and the other for disk),
> > however kprobe trace just shows one time of calling this function.
> >
> > If trace_printk() is added at the entry of __blkdev_put() manually,
> > you will see that __blkdev_put() is called two times in 'blockdev
> > --getbsz /dev/nvme0n1'.
>
> OK, let me check again on the latest kernel.
> Here I tested with qemu.
>
> root@devnote2:/sys/kernel/debug/tracing# uname -a
> Linux devnote2 5.8.0-rc1+ #26 SMP PREEMPT Fri Jun 19 12:12:53 JST 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> And we have a (virtual) sda with 1 partition.
>
> root@devnote2:/sys/kernel/debug/tracing# cat /proc/partitions
> major minor #blocks name
>
> 8 0 10240 sda
> 8 1 9216 sda1
>
> OK, then let's make events (for sure)
>
> root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
> root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
> root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events
>
> There are 3 events in the kernel, blkdev_put() and __blkdev_put() and
> the return of __blkdev_put().
> Then enable it and access to */dev/sda* (a disk)
>
> root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
> root@devnote2:/sys/kernel/debug/tracing# blockdev --getbsz /dev/sda
> 4096
> root@devnote2:/sys/kernel/debug/tracing# echo 0 > events/kprobes/enable
> root@devnote2:/sys/kernel/debug/tracing# cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 3/3 #P:8
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> blockdev-185 [002] ...1 72.604266: p_blkdev_put_0: (blkdev_put+0x0/0x130)
> blockdev-185 [002] ...1 72.604276: p___blkdev_put_0: (__blkdev_put+0x0/0x220)
> blockdev-185 [002] d..2 72.604288: r___blkdev_put_0: (blkdev_put+0x50/0x130 <- __blkdev_put)
>
> So the __blkdev_put() is called once from blkdev_put().
> Next, we do same trace with accessing */dev/sda1* (a partition).
>
> root@devnote2:/sys/kernel/debug/tracing# echo > trace
> root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable

I can't find 'events/kprobes' in my VM with upstream kernel, also not found
the dir under fedora31(5.5.15-200) & rhel8(v4.18 based).

Could you share me how to enable the kprobes event? I guess some kernel
config options are required.

> root@devnote2:/sys/kernel/debug/tracing# blockdev --getbsz /dev/sda1
> 4096
> root@devnote2:/sys/kernel/debug/tracing# echo 0 > events/kprobes/enable

I used samples/kprobes/kprobe_example.c by replacing __do_fork with
__blkdev_put for confirming this issue, and only one __blkdev_put trace is
observed in dmesg log when running 'blockdev --getbsz /dev/sda1'.

Thanks,
Ming

2020-06-19 13:03:53

by Steven Rostedt

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Fri, 19 Jun 2020 15:28:59 +0800
Ming Lei <[email protected]> wrote:

> >
> > OK, then let's make events (for sure)
> >
> > root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
> > root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
> > root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events

Hi Ming,

Do you have the kprobe_events file?

> > root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
>
> I can't find 'events/kprobes' in my VM with upstream kernel, also not found
> the dir under fedora31(5.5.15-200) & rhel8(v4.18 based).

The events/kprobes directly will be created when you create a
kprobe_event. It wont exist until then.

-- Steve

2020-06-19 13:35:34

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Fri, Jun 19, 2020 at 08:19:54AM -0400, Steven Rostedt wrote:
> On Fri, 19 Jun 2020 15:28:59 +0800
> Ming Lei <[email protected]> wrote:
>
> > >
> > > OK, then let's make events (for sure)
> > >
> > > root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
> > > root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
> > > root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events
>
> Hi Ming,
>
> Do you have the kprobe_events file?
>
> > > root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
> >
> > I can't find 'events/kprobes' in my VM with upstream kernel, also not found
> > the dir under fedora31(5.5.15-200) & rhel8(v4.18 based).
>
> The events/kprobes directly will be created when you create a
> kprobe_event. It wont exist until then.

Hi Steven and Masami,

Got it, thanks for your help, now I can run the test, follows the steps
and results, and there is still one __blkdev_put probed.

And it is observed in my VM reliably with 5.7+ or Fedora kernel reliably,
kernel config is attached.

[root@ktest-01 tracing]# uname -a
Linux ktest-01 5.7.0+ #1900 SMP Fri Jun 19 16:26:47 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
[root@ktest-01 tracing]#
[root@ktest-01 tracing]# cat kprobe_events
[root@ktest-01 tracing]#
[root@ktest-01 tracing]# echo p blkdev_put >> kprobe_events
[root@ktest-01 tracing]# echo p __blkdev_put >> kprobe_events
[root@ktest-01 tracing]# echo r __blkdev_put >> kprobe_events
[root@ktest-01 tracing]#
[root@ktest-01 tracing]# echo 1 > events/kprobes/enable
[root@ktest-01 tracing]# blockdev --getbsz /dev/sda1
4096
[root@ktest-01 tracing]# echo 0 > events/kprobes/enable
[root@ktest-01 tracing]# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 3/3 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
blockdev-970 [005] .... 17603.447236: p_blkdev_put_0: (blkdev_put+0x0/0xb4)
blockdev-970 [005] .... 17603.447244: p___blkdev_put_0: (__blkdev_put+0x0/0x19d)
blockdev-970 [005] d... 17603.447251: r___blkdev_put_0: (blkdev_close+0x22/0x25 <- __blkdev_put)

Thanks,
Ming

Attachments:

(No filename) (2.47 kB)
config.tar.gz (25.37 kB)
Download all attachments

2020-06-19 15:38:19

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Ming,

On Fri, 19 Jun 2020 21:32:40 +0800
Ming Lei <[email protected]> wrote:

> On Fri, Jun 19, 2020 at 08:19:54AM -0400, Steven Rostedt wrote:
> > On Fri, 19 Jun 2020 15:28:59 +0800
> > Ming Lei <[email protected]> wrote:
> >
> > > >
> > > > OK, then let's make events (for sure)
> > > >
> > > > root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
> > > > root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
> > > > root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events
> >
> > Hi Ming,
> >
> > Do you have the kprobe_events file?
> >
> > > > root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
> > >
> > > I can't find 'events/kprobes' in my VM with upstream kernel, also not found
> > > the dir under fedora31(5.5.15-200) & rhel8(v4.18 based).
> >
> > The events/kprobes directly will be created when you create a
> > kprobe_event. It wont exist until then.
>
> Hi Steven and Masami,
>
> Got it, thanks for your help, now I can run the test, follows the steps
> and results, and there is still one __blkdev_put probed.

Hmm, strange...

> And it is observed in my VM reliably with 5.7+ or Fedora kernel reliably,
> kernel config is attached.

Thanks for sharing it.

>
> [root@ktest-01 tracing]# uname -a
> Linux ktest-01 5.7.0+ #1900 SMP Fri Jun 19 16:26:47 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
> [root@ktest-01 tracing]#
> [root@ktest-01 tracing]# cat kprobe_events
> [root@ktest-01 tracing]#
> [root@ktest-01 tracing]# echo p blkdev_put >> kprobe_events
> [root@ktest-01 tracing]# echo p __blkdev_put >> kprobe_events
> [root@ktest-01 tracing]# echo r __blkdev_put >> kprobe_events
> [root@ktest-01 tracing]#
> [root@ktest-01 tracing]# echo 1 > events/kprobes/enable
> [root@ktest-01 tracing]# blockdev --getbsz /dev/sda1
> 4096
> [root@ktest-01 tracing]# echo 0 > events/kprobes/enable
> [root@ktest-01 tracing]# cat trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 3/3 #P:8
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> blockdev-970 [005] .... 17603.447236: p_blkdev_put_0: (blkdev_put+0x0/0xb4)
> blockdev-970 [005] .... 17603.447244: p___blkdev_put_0: (__blkdev_put+0x0/0x19d)
> blockdev-970 [005] d... 17603.447251: r___blkdev_put_0: (blkdev_close+0x22/0x25 <- __blkdev_put)

This shows __blkdev_put() is a tail-call. It is possible that the
internal (nested) __blkdev_put() call becomes a goto inside the
function by the gcc optimization.

Ah, after all it is as expected. With your kconfig, the kernel is
very agressively optimized.

$ objdump -dS vmlinux | less
...
ffffffff81256dc3 <__blkdev_put>:
{
ffffffff81256dc3: e8 98 85 df ff callq ffffffff8104f360 <__fentry__>
ffffffff81256dc8: 41 57 push %r15
ffffffff81256dca: 41 56 push %r14
ffffffff81256dcc: 41 55 push %r13
...
ffffffff81256f05: 75 02 jne ffffffff81256f09 <__blkdev_put+0x146>
struct block_device *victim = NULL;
ffffffff81256f07: 31 db xor %ebx,%ebx
bdev->bd_contains = NULL;
ffffffff81256f09: 48 c7 45 60 00 00 00 movq $0x0,0x60(%rbp)
ffffffff81256f10: 00
put_disk_and_module(disk);
ffffffff81256f11: 4c 89 f7 mov %r14,%rdi
ffffffff81256f14: e8 c6 3d 11 00 callq ffffffff8136acdf <put_disk_and_module>
mutex_unlock(&bdev->bd_mutex);
ffffffff81256f19: 4c 89 ff mov %r15,%rdi
__blkdev_put(victim, mode, 1);
ffffffff81256f1c: 41 bc 01 00 00 00 mov $0x1,%r12d
mutex_unlock(&bdev->bd_mutex);
ffffffff81256f22: e8 8d d7 48 00 callq ffffffff816e46b4 <mutex_unlock>
bdput(bdev);
ffffffff81256f27: 48 89 ef mov %rbp,%rdi
ffffffff81256f2a: e8 f0 e9 ff ff callq ffffffff8125591f <bdput>
if (victim)
ffffffff81256f2f: 48 85 db test %rbx,%rbx
ffffffff81256f32: 74 08 je ffffffff81256f3c <__blkdev_put+0x179>
ffffffff81256f34: 48 89 dd mov %rbx,%rbp
ffffffff81256f37: e9 b4 fe ff ff jmpq ffffffff81256df0 <__blkdev_put+0x2d> <<-----THIS!!
}
ffffffff81256f3c: 48 8b 44 24 28 mov 0x28(%rsp),%rax
ffffffff81256f41: 65 48 33 04 25 28 00 xor %gs:0x28,%rax
ffffffff81256f48: 00 00
ffffffff81256f4a: 74 05 je ffffffff81256f51 <__blkdev_put+0x18e>
ffffffff81256f4c: e8 5a 4e 48 00 callq ffffffff816dbdab <__stack_chk_fail>
ffffffff81256f51: 48 83 c4 30 add $0x30,%rsp
ffffffff81256f55: 5b pop %rbx
ffffffff81256f56: 5d pop %rbp
ffffffff81256f57: 41 5c pop %r12
ffffffff81256f59: 41 5d pop %r13
ffffffff81256f5b: 41 5e pop %r14
ffffffff81256f5d: 41 5f pop %r15
ffffffff81256f5f: c3 retq

As you can see, the nested __blkdev_put() is coverted to a loop.
If you put kprobe on __blkdev_put+0x2d, you'll see the event twice.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-20 04:58:50

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Masami,

On Sat, Jun 20, 2020 at 12:35:09AM +0900, Masami Hiramatsu wrote:
> Hi Ming,
>
> On Fri, 19 Jun 2020 21:32:40 +0800
> Ming Lei <[email protected]> wrote:
>
> > On Fri, Jun 19, 2020 at 08:19:54AM -0400, Steven Rostedt wrote:
> > > On Fri, 19 Jun 2020 15:28:59 +0800
> > > Ming Lei <[email protected]> wrote:
> > >
> > > > >
> > > > > OK, then let's make events (for sure)
> > > > >
> > > > > root@devnote2:/sys/kernel/debug/tracing# echo p __blkdev_put >> kprobe_events
> > > > > root@devnote2:/sys/kernel/debug/tracing# echo r __blkdev_put >> kprobe_events
> > > > > root@devnote2:/sys/kernel/debug/tracing# echo p blkdev_put >> kprobe_events
> > >
> > > Hi Ming,
> > >
> > > Do you have the kprobe_events file?
> > >
> > > > > root@devnote2:/sys/kernel/debug/tracing# echo 1 > events/kprobes/enable
> > > >
> > > > I can't find 'events/kprobes' in my VM with upstream kernel, also not found
> > > > the dir under fedora31(5.5.15-200) & rhel8(v4.18 based).
> > >
> > > The events/kprobes directly will be created when you create a
> > > kprobe_event. It wont exist until then.
> >
> > Hi Steven and Masami,
> >
> > Got it, thanks for your help, now I can run the test, follows the steps
> > and results, and there is still one __blkdev_put probed.
>
> Hmm, strange...
>
> > And it is observed in my VM reliably with 5.7+ or Fedora kernel reliably,
> > kernel config is attached.
>
> Thanks for sharing it.
>
> >
> > [root@ktest-01 tracing]# uname -a
> > Linux ktest-01 5.7.0+ #1900 SMP Fri Jun 19 16:26:47 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
> > [root@ktest-01 tracing]#
> > [root@ktest-01 tracing]# cat kprobe_events
> > [root@ktest-01 tracing]#
> > [root@ktest-01 tracing]# echo p blkdev_put >> kprobe_events
> > [root@ktest-01 tracing]# echo p __blkdev_put >> kprobe_events
> > [root@ktest-01 tracing]# echo r __blkdev_put >> kprobe_events
> > [root@ktest-01 tracing]#
> > [root@ktest-01 tracing]# echo 1 > events/kprobes/enable
> > [root@ktest-01 tracing]# blockdev --getbsz /dev/sda1
> > 4096
> > [root@ktest-01 tracing]# echo 0 > events/kprobes/enable
> > [root@ktest-01 tracing]# cat trace
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 3/3 #P:8
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > blockdev-970 [005] .... 17603.447236: p_blkdev_put_0: (blkdev_put+0x0/0xb4)
> > blockdev-970 [005] .... 17603.447244: p___blkdev_put_0: (__blkdev_put+0x0/0x19d)
> > blockdev-970 [005] d... 17603.447251: r___blkdev_put_0: (blkdev_close+0x22/0x25 <- __blkdev_put)
>
> This shows __blkdev_put() is a tail-call. It is possible that the
> internal (nested) __blkdev_put() call becomes a goto inside the
> function by the gcc optimization.
>
> Ah, after all it is as expected. With your kconfig, the kernel is
> very agressively optimized.
>
> $ objdump -dS vmlinux | less
> ...
> ffffffff81256dc3 <__blkdev_put>:
> {
> ffffffff81256dc3: e8 98 85 df ff callq ffffffff8104f360 <__fentry__>
> ffffffff81256dc8: 41 57 push %r15
> ffffffff81256dca: 41 56 push %r14
> ffffffff81256dcc: 41 55 push %r13
> ...
> ffffffff81256f05: 75 02 jne ffffffff81256f09 <__blkdev_put+0x146>
> struct block_device *victim = NULL;
> ffffffff81256f07: 31 db xor %ebx,%ebx
> bdev->bd_contains = NULL;
> ffffffff81256f09: 48 c7 45 60 00 00 00 movq $0x0,0x60(%rbp)
> ffffffff81256f10: 00
> put_disk_and_module(disk);
> ffffffff81256f11: 4c 89 f7 mov %r14,%rdi
> ffffffff81256f14: e8 c6 3d 11 00 callq ffffffff8136acdf <put_disk_and_module>
> mutex_unlock(&bdev->bd_mutex);
> ffffffff81256f19: 4c 89 ff mov %r15,%rdi
> __blkdev_put(victim, mode, 1);
> ffffffff81256f1c: 41 bc 01 00 00 00 mov $0x1,%r12d
> mutex_unlock(&bdev->bd_mutex);
> ffffffff81256f22: e8 8d d7 48 00 callq ffffffff816e46b4 <mutex_unlock>
> bdput(bdev);
> ffffffff81256f27: 48 89 ef mov %rbp,%rdi
> ffffffff81256f2a: e8 f0 e9 ff ff callq ffffffff8125591f <bdput>
> if (victim)
> ffffffff81256f2f: 48 85 db test %rbx,%rbx
> ffffffff81256f32: 74 08 je ffffffff81256f3c <__blkdev_put+0x179>
> ffffffff81256f34: 48 89 dd mov %rbx,%rbp
> ffffffff81256f37: e9 b4 fe ff ff jmpq ffffffff81256df0 <__blkdev_put+0x2d> <<-----THIS!!
> }
> ffffffff81256f3c: 48 8b 44 24 28 mov 0x28(%rsp),%rax
> ffffffff81256f41: 65 48 33 04 25 28 00 xor %gs:0x28,%rax
> ffffffff81256f48: 00 00
> ffffffff81256f4a: 74 05 je ffffffff81256f51 <__blkdev_put+0x18e>
> ffffffff81256f4c: e8 5a 4e 48 00 callq ffffffff816dbdab <__stack_chk_fail>
> ffffffff81256f51: 48 83 c4 30 add $0x30,%rsp
> ffffffff81256f55: 5b pop %rbx
> ffffffff81256f56: 5d pop %rbp
> ffffffff81256f57: 41 5c pop %r12
> ffffffff81256f59: 41 5d pop %r13
> ffffffff81256f5b: 41 5e pop %r14
> ffffffff81256f5d: 41 5f pop %r15
> ffffffff81256f5f: c3 retq
>
>
> As you can see, the nested __blkdev_put() is coverted to a loop.
> If you put kprobe on __blkdev_put+0x2d, you'll see the event twice.

Thanks for your investigation.

Some trace tools can just trace on function entry, such as bcc, and some
user script always trace on function entry.

I guess the issue should belong to kprobe implementation:

1) __blkdev_put() is capable of being kprobed, so from user view, the
probe on entry of __blkdev_put() should be triggered

2) from implementation view, I understand exception should be trapped
on the entry of __blkdev_put(), looks it isn't done.

Correct me if the above is wrong, and is it possible to fix it in kprobe?

Thanks,
Ming

2020-06-20 05:03:43

by Steven Rostedt

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Sat, 20 Jun 2020 07:28:20 +0800
Ming Lei <[email protected]> wrote:

> Thanks for your investigation.
>
> Some trace tools can just trace on function entry, such as bcc, and some
> user script always trace on function entry.
>
> I guess the issue should belong to kprobe implementation:

The issue is that kprobes is to dynamically add trace events into an
already compiled kernel. I'm guessing you would get the same result
from ftrace's function tracer. That's because the compiler has no idea
about what or where you intend to add these dynamically created trace
events (that's basically the point of kprobes). But if you add static
events (and this includes trace_printk()), the compiler knows about
them, because they exist at compile time, and will make sure they
execute the number of times the code shows it. But if you don't have
these events at compile time, the compiler is free to modify the code
in such a way it doesn't make sense if you add a probe at run time.

gdb suffers the same problem on user space debugging if you let the
compiler aggressively optimize the code. If you step through highly
optimized user space code with gdb, the position jumps all over the
place. This is basically the same effect.

>
> 1) __blkdev_put() is capable of being kprobed, so from user view, the
> probe on entry of __blkdev_put() should be triggered
>
> 2) from implementation view, I understand exception should be trapped
> on the entry of __blkdev_put(), looks it isn't done.

But it is done! But the compiler removed the second call and basically
just inlined it. So that entry no longer exists. When you added a
"trace_printk()" in there, the compiler is not allowed to skip that
trace call. But because there's nothing in here to tell the compiler
that it can't just remove the second call (which it did, and the kprobe
is just showing you what the compiler did!) then there's nothing
kprobes can do about it.

>
> Correct me if the above is wrong, and is it possible to fix it in kprobe?

No.

-- Steve

2020-06-20 05:04:57

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Ming,

On Sat, 20 Jun 2020 07:28:20 +0800
Ming Lei <[email protected]> wrote:

> >
> > Ah, after all it is as expected. With your kconfig, the kernel is
> > very agressively optimized.
> >
> > $ objdump -dS vmlinux | less
> > ...
> > ffffffff81256dc3 <__blkdev_put>:
> > {
> > ffffffff81256dc3: e8 98 85 df ff callq ffffffff8104f360 <__fentry__>
> > ffffffff81256dc8: 41 57 push %r15
> > ffffffff81256dca: 41 56 push %r14
> > ffffffff81256dcc: 41 55 push %r13
> > ...
> > ffffffff81256f05: 75 02 jne ffffffff81256f09 <__blkdev_put+0x146>
> > struct block_device *victim = NULL;
> > ffffffff81256f07: 31 db xor %ebx,%ebx
> > bdev->bd_contains = NULL;
> > ffffffff81256f09: 48 c7 45 60 00 00 00 movq $0x0,0x60(%rbp)
> > ffffffff81256f10: 00
> > put_disk_and_module(disk);
> > ffffffff81256f11: 4c 89 f7 mov %r14,%rdi
> > ffffffff81256f14: e8 c6 3d 11 00 callq ffffffff8136acdf <put_disk_and_module>
> > mutex_unlock(&bdev->bd_mutex);
> > ffffffff81256f19: 4c 89 ff mov %r15,%rdi
> > __blkdev_put(victim, mode, 1);
> > ffffffff81256f1c: 41 bc 01 00 00 00 mov $0x1,%r12d
> > mutex_unlock(&bdev->bd_mutex);
> > ffffffff81256f22: e8 8d d7 48 00 callq ffffffff816e46b4 <mutex_unlock>
> > bdput(bdev);
> > ffffffff81256f27: 48 89 ef mov %rbp,%rdi
> > ffffffff81256f2a: e8 f0 e9 ff ff callq ffffffff8125591f <bdput>
> > if (victim)
> > ffffffff81256f2f: 48 85 db test %rbx,%rbx
> > ffffffff81256f32: 74 08 je ffffffff81256f3c <__blkdev_put+0x179>
> > ffffffff81256f34: 48 89 dd mov %rbx,%rbp
> > ffffffff81256f37: e9 b4 fe ff ff jmpq ffffffff81256df0 <__blkdev_put+0x2d> <<-----THIS!!
> > }
> > ffffffff81256f3c: 48 8b 44 24 28 mov 0x28(%rsp),%rax
> > ffffffff81256f41: 65 48 33 04 25 28 00 xor %gs:0x28,%rax
> > ffffffff81256f48: 00 00
> > ffffffff81256f4a: 74 05 je ffffffff81256f51 <__blkdev_put+0x18e>
> > ffffffff81256f4c: e8 5a 4e 48 00 callq ffffffff816dbdab <__stack_chk_fail>
> > ffffffff81256f51: 48 83 c4 30 add $0x30,%rsp
> > ffffffff81256f55: 5b pop %rbx
> > ffffffff81256f56: 5d pop %rbp
> > ffffffff81256f57: 41 5c pop %r12
> > ffffffff81256f59: 41 5d pop %r13
> > ffffffff81256f5b: 41 5e pop %r14
> > ffffffff81256f5d: 41 5f pop %r15
> > ffffffff81256f5f: c3 retq
> >
> >
> > As you can see, the nested __blkdev_put() is coverted to a loop.
> > If you put kprobe on __blkdev_put+0x2d, you'll see the event twice.
>
> Thanks for your investigation.
>
> Some trace tools can just trace on function entry, such as bcc, and some
> user script always trace on function entry.
>
> I guess the issue should belong to kprobe implementation:
>
> 1) __blkdev_put() is capable of being kprobed, so from user view, the
> probe on entry of __blkdev_put() should be triggered

Yes, it is correctly triggered.

>
> 2) from implementation view, I understand exception should be trapped
> on the entry of __blkdev_put(), looks it isn't done.

No, it is correctly trapped the function entry address. The problem is
that the gcc optimized the nested function call into jump to the
beginning of function body (skip prologue).

Usually, a function is compiled as below

func() (1) the entry address (func:)
{ (2) the function prologue (setup stackframe)
int a (3) the beginning of function body
...
func() (4) the nested function call

And in this case, the gcc optimized (4) into jump to (3) instead of
actual function call instruction. Thus, for the nested case (1) and
(2) are skipped.
IOW, the code flow becomes
(1)->(2)->(3)->(4)->(3)
instead of
(1)->(2)->(3)->(4)->(1)->(2)->(3)

In this case, if we put a probe on (1) or (2), those are disappeared
in the nested call. Thus if you put a probe on (3) ('perf probe __blkdev_put:2')
you'll see the event twice.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-22 00:32:38

by Ming Lei

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

Hi Masami,

On Sat, Jun 20, 2020 at 10:37:47AM +0900, Masami Hiramatsu wrote:
> Hi Ming,
>
> On Sat, 20 Jun 2020 07:28:20 +0800
> Ming Lei <[email protected]> wrote:
>
> > >
> > > Ah, after all it is as expected. With your kconfig, the kernel is
> > > very agressively optimized.
> > >
> > > $ objdump -dS vmlinux | less
> > > ...
> > > ffffffff81256dc3 <__blkdev_put>:
> > > {
> > > ffffffff81256dc3: e8 98 85 df ff callq ffffffff8104f360 <__fentry__>
> > > ffffffff81256dc8: 41 57 push %r15
> > > ffffffff81256dca: 41 56 push %r14
> > > ffffffff81256dcc: 41 55 push %r13
> > > ...
> > > ffffffff81256f05: 75 02 jne ffffffff81256f09 <__blkdev_put+0x146>
> > > struct block_device *victim = NULL;
> > > ffffffff81256f07: 31 db xor %ebx,%ebx
> > > bdev->bd_contains = NULL;
> > > ffffffff81256f09: 48 c7 45 60 00 00 00 movq $0x0,0x60(%rbp)
> > > ffffffff81256f10: 00
> > > put_disk_and_module(disk);
> > > ffffffff81256f11: 4c 89 f7 mov %r14,%rdi
> > > ffffffff81256f14: e8 c6 3d 11 00 callq ffffffff8136acdf <put_disk_and_module>
> > > mutex_unlock(&bdev->bd_mutex);
> > > ffffffff81256f19: 4c 89 ff mov %r15,%rdi
> > > __blkdev_put(victim, mode, 1);
> > > ffffffff81256f1c: 41 bc 01 00 00 00 mov $0x1,%r12d
> > > mutex_unlock(&bdev->bd_mutex);
> > > ffffffff81256f22: e8 8d d7 48 00 callq ffffffff816e46b4 <mutex_unlock>
> > > bdput(bdev);
> > > ffffffff81256f27: 48 89 ef mov %rbp,%rdi
> > > ffffffff81256f2a: e8 f0 e9 ff ff callq ffffffff8125591f <bdput>
> > > if (victim)
> > > ffffffff81256f2f: 48 85 db test %rbx,%rbx
> > > ffffffff81256f32: 74 08 je ffffffff81256f3c <__blkdev_put+0x179>
> > > ffffffff81256f34: 48 89 dd mov %rbx,%rbp
> > > ffffffff81256f37: e9 b4 fe ff ff jmpq ffffffff81256df0 <__blkdev_put+0x2d> <<-----THIS!!
> > > }
> > > ffffffff81256f3c: 48 8b 44 24 28 mov 0x28(%rsp),%rax
> > > ffffffff81256f41: 65 48 33 04 25 28 00 xor %gs:0x28,%rax
> > > ffffffff81256f48: 00 00
> > > ffffffff81256f4a: 74 05 je ffffffff81256f51 <__blkdev_put+0x18e>
> > > ffffffff81256f4c: e8 5a 4e 48 00 callq ffffffff816dbdab <__stack_chk_fail>
> > > ffffffff81256f51: 48 83 c4 30 add $0x30,%rsp
> > > ffffffff81256f55: 5b pop %rbx
> > > ffffffff81256f56: 5d pop %rbp
> > > ffffffff81256f57: 41 5c pop %r12
> > > ffffffff81256f59: 41 5d pop %r13
> > > ffffffff81256f5b: 41 5e pop %r14
> > > ffffffff81256f5d: 41 5f pop %r15
> > > ffffffff81256f5f: c3 retq
> > >
> > >
> > > As you can see, the nested __blkdev_put() is coverted to a loop.
> > > If you put kprobe on __blkdev_put+0x2d, you'll see the event twice.
> >
> > Thanks for your investigation.
> >
> > Some trace tools can just trace on function entry, such as bcc, and some
> > user script always trace on function entry.
> >
> > I guess the issue should belong to kprobe implementation:
> >
> > 1) __blkdev_put() is capable of being kprobed, so from user view, the
> > probe on entry of __blkdev_put() should be triggered
>
> Yes, it is correctly triggered.

I mean it isn't from user's viewpoint, and the binary code is usually a
black box for final kprobe user.

IMO, all your and Steven's input are just from kprobe/trace developer's viewpoint.
Can you think about the issue from kprobe real/final user?

Trace is very useful tools to observe system internal, and people often
relies on trace to understand system. However, missed probe often causes
trouble for us to understand the system correctly.

>
> >
> > 2) from implementation view, I understand exception should be trapped
> > on the entry of __blkdev_put(), looks it isn't done.
>
> No, it is correctly trapped the function entry address. The problem is
> that the gcc optimized the nested function call into jump to the
> beginning of function body (skip prologue).
>
> Usually, a function is compiled as below
>
> func() (1) the entry address (func:)
> { (2) the function prologue (setup stackframe)
> int a (3) the beginning of function body
> ...
> func() (4) the nested function call
>
> And in this case, the gcc optimized (4) into jump to (3) instead of
> actual function call instruction. Thus, for the nested case (1) and
> (2) are skipped.
> IOW, the code flow becomes
> (1)->(2)->(3)->(4)->(3)
> instead of
> (1)->(2)->(3)->(4)->(1)->(2)->(3)
>
> In this case, if we put a probe on (1) or (2), those are disappeared
> in the nested call. Thus if you put a probe on (3) ('perf probe __blkdev_put:2')
> you'll see the event twice.

Thanks for your explanation.

Can you kprobe guys improve the implementation for covering this case?
For example, put probe on 3) in case the above situation is recognized.

Thanks,
Ming

2020-06-22 01:36:23

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Mon, 22 Jun 2020 08:27:53 +0800
Ming Lei <[email protected]> wrote:

> I mean it isn't from user's viewpoint, and the binary code is usually a
> black box for final kprobe user.
>
> IMO, all your and Steven's input are just from kprobe/trace developer's viewpoint.
> Can you think about the issue from kprobe real/final user?
>
> Trace is very useful tools to observe system internal, and people often
> relies on trace to understand system. However, missed probe often causes
> trouble for us to understand the system correctly.

Agreed. However, since kprobes related tracing tools are layered
to provide different features (e.g. kprobes abstructs sw breakpoint,
ftrace kprobe-events provides a minimum CUI, and perf-probe provides
binary analysis, etc.), this issue should be solved by user-level
binary analysis layer. (it is not good idea to analyze the optimized
code in kernel)

> > > 2) from implementation view, I understand exception should be trapped
> > > on the entry of __blkdev_put(), looks it isn't done.
> >
> > No, it is correctly trapped the function entry address. The problem is
> > that the gcc optimized the nested function call into jump to the
> > beginning of function body (skip prologue).
> >
> > Usually, a function is compiled as below
> >
> > func() (1) the entry address (func:)
> > { (2) the function prologue (setup stackframe)
> > int a (3) the beginning of function body
> > ...
> > func() (4) the nested function call
> >
> > And in this case, the gcc optimized (4) into jump to (3) instead of
> > actual function call instruction. Thus, for the nested case (1) and
> > (2) are skipped.
> > IOW, the code flow becomes
> > (1)->(2)->(3)->(4)->(3)
> > instead of
> > (1)->(2)->(3)->(4)->(1)->(2)->(3)
> >
> > In this case, if we put a probe on (1) or (2), those are disappeared
> > in the nested call. Thus if you put a probe on (3) ('perf probe __blkdev_put:2')
> > you'll see the event twice.
>
> Thanks for your explanation.
>
> Can you kprobe guys improve the implementation for covering this case?
> For example, put probe on 3) in case the above situation is recognized.

OK, let me try to fix this in perf-probe since that is the simplest
binary analysis part in user-space.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-22 13:04:15

by Steven Rostedt

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Mon, 22 Jun 2020 08:27:53 +0800
Ming Lei <[email protected]> wrote:

> Can you kprobe guys improve the implementation for covering this case?
> For example, put probe on 3) in case the above situation is recognized.

To do so would require solving the halting problem.

https://en.wikipedia.org/wiki/Halting_problem

Or perhaps reading the DWARF output of the compiler to determine if it
optimized the location you are looking for.

The first case is impossible to solve, the second would take a lot of
work, (are you going to fund it?)

Your comment about tracing internals is valid, but if you can't
understand the optimization of the compiler on the kernel, I suggest
you stick with the static trace events. kprobes can be added virtual
anywhere in the kernel. It's very function requires a kprobe *user* to
understand the internals of the kernel as well as its executable binary
code, and not expect the kprobe to figure it out for you.

We are all for adding infrastructure to make kprobes easier. But
figuring out that the kernel optimized a function call so that we can
add some wrapper to *simulate* the optimized out function call is
something I believe is out of scope for a kprobe. In fact, I would call
that a feature! I would like to know that a function was optimized out.
When I add a kprobe, I'm more interested in what the compiler actually
did to the kernel than what the source code shows us. That is very
useful information.

-- Steve

2020-06-23 02:02:46

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Mon, 22 Jun 2020 09:01:48 -0400
Steven Rostedt <[email protected]> wrote:

> On Mon, 22 Jun 2020 08:27:53 +0800
> Ming Lei <[email protected]> wrote:
>
> > Can you kprobe guys improve the implementation for covering this case?
> > For example, put probe on 3) in case the above situation is recognized.
>
> To do so would require solving the halting problem.
>
> https://en.wikipedia.org/wiki/Halting_problem
>
> Or perhaps reading the DWARF output of the compiler to determine if it
> optimized the location you are looking for.

As far as I can see, gcc-9.3 doesn't generate this information :(
Maybe the optimizer forgot to push the tail-call callsite information
to dwarf generator when making a recursive tail-call to a loop.

> The first case is impossible to solve, the second would take a lot of
> work, (are you going to fund it?)

What I can provide is "--skip-prologue" option for the perf-probe
which will be similar to the "-P" option. If the compiler correctly
generates the information, we can enable it automatically. But
as far as I can see, it doesn't.

[OT] DWARF has its option(and GNU extension) but it seems not correctly
implemented yet.

http://www.dwarfstd.org/ShowIssue.php?issue=100909.2

What I found on __blkdev_put and the callers, the "tail-call to other
function" were recorded as a call-site DIE with DW_AT_tail_call. But
if the "tail-call to itself (recursive call)" was expanded as a loop,
it was just disappeared. No call-site information, nor the DW_AT_tail_call.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-23 03:24:50

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Tue, 23 Jun 2020 08:47:06 +0900
Masami Hiramatsu <[email protected]> wrote:

> On Mon, 22 Jun 2020 09:01:48 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Mon, 22 Jun 2020 08:27:53 +0800
> > Ming Lei <[email protected]> wrote:
> >
> > > Can you kprobe guys improve the implementation for covering this case?
> > > For example, put probe on 3) in case the above situation is recognized.
> >
> > To do so would require solving the halting problem.
> >
> > https://en.wikipedia.org/wiki/Halting_problem
> >
> > Or perhaps reading the DWARF output of the compiler to determine if it
> > optimized the location you are looking for.
>
> As far as I can see, gcc-9.3 doesn't generate this information :(
> Maybe the optimizer forgot to push the tail-call callsite information
> to dwarf generator when making a recursive tail-call to a loop.
>
> > The first case is impossible to solve, the second would take a lot of
> > work, (are you going to fund it?)
>
> What I can provide is "--skip-prologue" option for the perf-probe
> which will be similar to the "-P" option. If the compiler correctly
> generates the information, we can enable it automatically. But
> as far as I can see, it doesn't.
>
> [OT] DWARF has its option(and GNU extension) but it seems not correctly
> implemented yet.
>
> http://www.dwarfstd.org/ShowIssue.php?issue=100909.2

Oops, sorry, I missed the following sentences.

"Tail calls are jump-like instructions which transfer control to the start
of some subprogram, but the call site location address isn't visible in the
unwind information."

"Tail recursion is a call to the current function which is compiled as a
loop into the middle of the current function."

"The DW_TAG_call_site entries describe normal and tail calls."

This means, the gcc is correctly implemented and this __blkdev_put() case
is NOT covered by DT_TAG_call_site.
So we can not detect it from the debuginfo.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-06-23 05:30:40

by Masami Hiramatsu

[permalink] [raw]

Subject: Re: kprobe: __blkdev_put probe is missed

On Tue, 23 Jun 2020 09:38:01 +0900
Masami Hiramatsu <[email protected]> wrote:

> On Tue, 23 Jun 2020 08:47:06 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
> > On Mon, 22 Jun 2020 09:01:48 -0400
> > Steven Rostedt <[email protected]> wrote:
> >
> > > On Mon, 22 Jun 2020 08:27:53 +0800
> > > Ming Lei <[email protected]> wrote:
> > >
> > > > Can you kprobe guys improve the implementation for covering this case?
> > > > For example, put probe on 3) in case the above situation is recognized.
> > >
> > > To do so would require solving the halting problem.
> > >
> > > https://en.wikipedia.org/wiki/Halting_problem
> > >
> > > Or perhaps reading the DWARF output of the compiler to determine if it
> > > optimized the location you are looking for.
> >
> > As far as I can see, gcc-9.3 doesn't generate this information :(
> > Maybe the optimizer forgot to push the tail-call callsite information
> > to dwarf generator when making a recursive tail-call to a loop.
> >
> > > The first case is impossible to solve, the second would take a lot of
> > > work, (are you going to fund it?)
> >
> > What I can provide is "--skip-prologue" option for the perf-probe
> > which will be similar to the "-P" option. If the compiler correctly
> > generates the information, we can enable it automatically. But
> > as far as I can see, it doesn't.
> >
> > [OT] DWARF has its option(and GNU extension) but it seems not correctly
> > implemented yet.
> >
> > http://www.dwarfstd.org/ShowIssue.php?issue=100909.2
>
> Oops, sorry, I missed the following sentences.
>
> "Tail calls are jump-like instructions which transfer control to the start
> of some subprogram, but the call site location address isn't visible in the
> unwind information."
>
> "Tail recursion is a call to the current function which is compiled as a
> loop into the middle of the current function."
>
> "The DW_TAG_call_site entries describe normal and tail calls."
>
> This means, the gcc is correctly implemented and this __blkdev_put() case
> is NOT covered by DT_TAG_call_site.
> So we can not detect it from the debuginfo.

Hmm, BTW, if optimization is further advanced, it is possible that
the loop start position is not always at the beginning of the function.
It is easy to provide --skip-prologue to perf probe but it doesn't
ensure that works always as you expected.

For example,

func()
{
1:
{ /* block which doesn't executed in tail-recursion call */
...
}
2:
{ /* block which always executed in tail-recursion call */
...
}
func()
}

In this case, it is natural that the optimizer put a jump to 2 instead
of 1. Moreover, if the number of recursion is fixed, the optimizer
can unroll the loop. In that case there are no jumps.

So, as Steve pointed, strictly speaking, the developer needs to understand
what the source code was compiled into, before tracing/debuging it.

For the perf-probe case, I'm now thinking it is better user to
choose the line in the function explicitly. I wish I had another flag
that there was a tail-recursion, then I can warn users...

Thank you,

--
Masami Hiramatsu <[email protected]>