LinuxLists.cc - [PATCH 0/2] trace/kprobe: Two fixes for kretprobes

2021-06-14 18:06:01

Subject: [PATCH 0/2] trace/kprobe: Two fixes for kretprobes

The first patch fixes accounting of missed kretprobes in kprobe_profile.
The second patch removes the limit on the maximum number of active
kretprobe instances when registering a kretprobe through tracefs.

- Naveen

Naveen N. Rao (2):
trace/kprobe: Fix count of missed kretprobes in kprobe_profile
trace/kprobe: Remove limit on kretprobe maxactive

kernel/trace/trace_kprobe.c | 11 ++---------
kernel/trace/trace_probe.h | 1 -
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 1 -
.../ftrace/test.d/kprobe/kretprobe_maxactive.tc | 3 ---
4 files changed, 2 insertions(+), 14 deletions(-)

base-commit: 0b42677e2e5d87c730ddc41544b289b88596738c
--
2.31.1

2021-06-14 18:08:31

by Naveen N. Rao

Subject: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

We currently limit maxactive for a kretprobe to 4096 when registering
the same through tracefs. The comment indicates that this is done so as
to keep list traversal reasonable. However, we don't ever iterate over
all kretprobe_instance structures. The core kprobes infrastructure also
imposes no such limitation.

Remove the limit from the tracefs interface. This limit is easy to hit
on large cpu machines when tracing functions that can sleep.

Reported-by: Anton Blanchard <[email protected]>
Signed-off-by: Naveen N. Rao <[email protected]>
---
kernel/trace/trace_kprobe.c | 8 --------
kernel/trace/trace_probe.h | 1 -
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 1 -
.../selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc | 3 ---
4 files changed, 13 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 0475e2a6d0825e..b3e214980eed3d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -21,7 +21,6 @@
#include "trace_probe_tmpl.h"

#define KPROBE_EVENT_SYSTEM "kprobes"
-#define KRETPROBE_MAXACTIVE_MAX 4096

/* Kprobe early definition from command line */
static char kprobe_boot_events_buf[COMMAND_LINE_SIZE] __initdata;
@@ -786,13 +785,6 @@ static int __trace_kprobe_create(int argc, const char *argv[])
trace_probe_log_err(1, BAD_MAXACT);
goto parse_error;
}
- /* kretprobes instances are iterated over via a list. The
- * maximum should stay reasonable.
- */
- if (maxactive > KRETPROBE_MAXACTIVE_MAX) {
- trace_probe_log_err(1, MAXACT_TOO_BIG);
- goto parse_error;
- }
}

/* try to parse an address. if that fails, try to read the
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 227d518e5ba521..e331017dc086ed 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -389,7 +389,6 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(BAD_UPROBE_OFFS, "Invalid uprobe offset"), \
C(MAXACT_NO_KPROBE, "Maxactive is not for kprobe"), \
C(BAD_MAXACT, "Invalid maxactive number"), \
- C(MAXACT_TOO_BIG, "Maxactive is too big"), \
C(BAD_PROBE_ADDR, "Invalid probed address or symbol"), \
C(BAD_RETPROBE, "Retprobe address must be an function entry"), \
C(BAD_ADDR_SUFFIX, "Invalid probed address suffix"), \
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index fa928b431555ca..be3360a258bae8 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -10,7 +10,6 @@ check_error() { # command-with-error-pos-by-^
if grep -q 'r\[maxactive\]' README; then
check_error 'p^100 vfs_read' # MAXACT_NO_KPROBE
check_error 'r^1a111 vfs_read' # BAD_MAXACT
-check_error 'r^100000 vfs_read' # MAXACT_TOO_BIG
fi

check_error 'p ^non_exist_func' # BAD_PROBE_ADDR (enoent)
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
index 4f0b268c12332a..f57c95bfc5ed5a 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
@@ -6,9 +6,6 @@
# Test if we successfully reject unknown messages
if echo 'a:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi

-# Test if we successfully reject too big maxactive
-if echo 'r1000000:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi
-
# Test if we successfully reject unparsable numbers for maxactive
if echo 'r10fuzz:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi

--
2.31.1

2021-06-14 18:08:42

by Naveen N. Rao

Subject: [PATCH 1/2] trace/kprobe: Fix count of missed kretprobes in kprobe_profile

For a kretprobe, the miss count includes the number of times the probe
on function entry was missed, as well as the number of times we ran out
of kretprobe_instance structures due to maxactive being too low.

Fixes: cd7e7bd5e44718 ("tracing: Add kprobes event profiling interface")
Signed-off-by: Naveen N. Rao <[email protected]>
---
kernel/trace/trace_kprobe.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index ea6178cb5e334d..0475e2a6d0825e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1192,7 +1192,8 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
seq_printf(m, " %-44s %15lu %15lu\n",
trace_probe_name(&tk->tp),
trace_kprobe_nhit(tk),
- tk->rp.kp.nmissed);
+ trace_kprobe_is_return(tk) ? tk->rp.kp.nmissed + tk->rp.nmissed
+ : tk->rp.kp.nmissed);

return 0;
}
--
2.31.1

2021-06-15 05:48:20

by Masami Hiramatsu

Subject: Re: [PATCH 1/2] trace/kprobe: Fix count of missed kretprobes in kprobe_profile

On Mon, 14 Jun 2021 23:33:28 +0530
"Naveen N. Rao" <[email protected]> wrote:

> For a kretprobe, the miss count includes the number of times the probe
> on function entry was missed, as well as the number of times we ran out
> of kretprobe_instance structures due to maxactive being too low.
>
> Fixes: cd7e7bd5e44718 ("tracing: Add kprobes event profiling interface")
> Signed-off-by: Naveen N. Rao <[email protected]>

Good catch!

> ---
> kernel/trace/trace_kprobe.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index ea6178cb5e334d..0475e2a6d0825e 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1192,7 +1192,8 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> seq_printf(m, " %-44s %15lu %15lu\n",
> trace_probe_name(&tk->tp),
> trace_kprobe_nhit(tk),
> - tk->rp.kp.nmissed);
> + trace_kprobe_is_return(tk) ? tk->rp.kp.nmissed + tk->rp.nmissed
> + : tk->rp.kp.nmissed);

Can you add a static trace_kprobe_nmissed(tk) helper to wrap this?

Thank you,

>
> return 0;
> }
> --
> 2.31.1
>

--
Masami Hiramatsu <[email protected]>
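
For illustration, the helper requested in this review might look roughly like the following. This is a userspace sketch only, not the code from any later revision of the patch: the struct definitions are cut-down stand-ins for the kernel's, and the is_return flag is an assumption standing in for the kernel's trace_kprobe_is_return() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down userspace stand-ins for the kernel structures; only the
 * fields touched by the patch are modelled. */
struct kprobe {
	unsigned long nmissed;		/* entry probe was missed */
};

struct kretprobe {
	struct kprobe kp;
	unsigned long nmissed;		/* ran out of kretprobe_instance slots */
};

struct trace_kprobe {
	struct kretprobe rp;
	bool is_return;			/* assumption: replaces the kernel's check */
};

/* Stand-in for the kernel's trace_kprobe_is_return(). */
static bool trace_kprobe_is_return(const struct trace_kprobe *tk)
{
	return tk->is_return;
}

/* Hypothetical helper in the shape suggested in the review. */
static unsigned long trace_kprobe_nmissed(const struct trace_kprobe *tk)
{
	assert(tk != NULL);
	return trace_kprobe_is_return(tk) ?
		tk->rp.kp.nmissed + tk->rp.nmissed : tk->rp.kp.nmissed;
}
```

With a helper of this shape, the seq_printf() call in probes_profile_seq_show() could pass trace_kprobe_nmissed(tk) as its last argument instead of the open-coded ternary.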

2021-06-15 09:38:04

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Mon, 14 Jun 2021 23:33:29 +0530
"Naveen N. Rao" <[email protected]> wrote:

> We currently limit maxactive for a kretprobe to 4096 when registering
> the same through tracefs. The comment indicates that this is done so as
> to keep list traversal reasonable. However, we don't ever iterate over
> all kretprobe_instance structures. The core kprobes infrastructure also
> imposes no such limitation.
>
> Remove the limit from the tracefs interface. This limit is easy to hit
> on large cpu machines when tracing functions that can sleep.
>
> Reported-by: Anton Blanchard <[email protected]>
> Signed-off-by: Naveen N. Rao <[email protected]>

OK, but I don't want to just remove the limit (since it can easily
cause a memory shortage).
Can't we make it configurable? I don't mean Kconfig, but
tracefs/options/kretprobe_maxactive, or a kprobes debugfs knob.

Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
it can limit both trace_kprobe and kprobes itself.

Let me fix that.

Thank you,

> ---
> kernel/trace/trace_kprobe.c | 8 --------
> kernel/trace/trace_probe.h | 1 -
> .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 1 -
> .../selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc | 3 ---
> 4 files changed, 13 deletions(-)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 0475e2a6d0825e..b3e214980eed3d 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -21,7 +21,6 @@
> #include "trace_probe_tmpl.h"
>
> #define KPROBE_EVENT_SYSTEM "kprobes"
> -#define KRETPROBE_MAXACTIVE_MAX 4096
>
> /* Kprobe early definition from command line */
> static char kprobe_boot_events_buf[COMMAND_LINE_SIZE] __initdata;
> @@ -786,13 +785,6 @@ static int __trace_kprobe_create(int argc, const char *argv[])
> trace_probe_log_err(1, BAD_MAXACT);
> goto parse_error;
> }
> - /* kretprobes instances are iterated over via a list. The
> - * maximum should stay reasonable.
> - */
> - if (maxactive > KRETPROBE_MAXACTIVE_MAX) {
> - trace_probe_log_err(1, MAXACT_TOO_BIG);
> - goto parse_error;
> - }
> }
>
> /* try to parse an address. if that fails, try to read the
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 227d518e5ba521..e331017dc086ed 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -389,7 +389,6 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
> C(BAD_UPROBE_OFFS, "Invalid uprobe offset"), \
> C(MAXACT_NO_KPROBE, "Maxactive is not for kprobe"), \
> C(BAD_MAXACT, "Invalid maxactive number"), \
> - C(MAXACT_TOO_BIG, "Maxactive is too big"), \
> C(BAD_PROBE_ADDR, "Invalid probed address or symbol"), \
> C(BAD_RETPROBE, "Retprobe address must be an function entry"), \
> C(BAD_ADDR_SUFFIX, "Invalid probed address suffix"), \
> diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> index fa928b431555ca..be3360a258bae8 100644
> --- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
> @@ -10,7 +10,6 @@ check_error() { # command-with-error-pos-by-^
> if grep -q 'r\[maxactive\]' README; then
> check_error 'p^100 vfs_read' # MAXACT_NO_KPROBE
> check_error 'r^1a111 vfs_read' # BAD_MAXACT
> -check_error 'r^100000 vfs_read' # MAXACT_TOO_BIG
> fi
>
> check_error 'p ^non_exist_func' # BAD_PROBE_ADDR (enoent)
> diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
> index 4f0b268c12332a..f57c95bfc5ed5a 100644
> --- a/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_maxactive.tc
> @@ -6,9 +6,6 @@
> # Test if we successfully reject unknown messages
> if echo 'a:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi
>
> -# Test if we successfully reject too big maxactive
> -if echo 'r1000000:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi
> -
> # Test if we successfully reject unparsable numbers for maxactive
> if echo 'r10fuzz:myprobeaccept inet_csk_accept' > kprobe_events; then false; else true; fi
>
> --
> 2.31.1
>

--
Masami Hiramatsu <[email protected]>

2021-06-15 17:44:01

by Naveen N. Rao

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

Masami Hiramatsu wrote:
> On Mon, 14 Jun 2021 23:33:29 +0530
> "Naveen N. Rao" <[email protected]> wrote:
>
>> We currently limit maxactive for a kretprobe to 4096 when registering
>> the same through tracefs. The comment indicates that this is done so as
>> to keep list traversal reasonable. However, we don't ever iterate over
>> all kretprobe_instance structures. The core kprobes infrastructure also
>> imposes no such limitation.
>>
>> Remove the limit from the tracefs interface. This limit is easy to hit
>> on large cpu machines when tracing functions that can sleep.
>>
>> Reported-by: Anton Blanchard <[email protected]>
>> Signed-off-by: Naveen N. Rao <[email protected]>
>
> OK, but I don't like to just remove the limit (since it can cause
> memory shortage easily.)
> Can't we make it configurable? I don't mean Kconfig, but
> tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
>
> Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
> it can limit both trace_kprobe and kprobes itself.

I don't think it is good to put a new tunable in debugfs -- we don't
have any kprobes tunable there, so this adds a dependency on debugfs
which shouldn't be necessary.

/proc/sys/debug/ may be a better fit since we have the
kprobes-optimization flag to disable optprobes there, though I'm not
sure if a new sysctl file is agreeable.

But, I'm not too sure this really is a problem. Maxactive is a user
_opt-in_ feature which needs to be explicitly added to an event
definition. In that sense, isn't this already a tunable?

- Naveen

2021-06-16 00:47:21

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Tue, 15 Jun 2021 23:11:27 +0530
"Naveen N. Rao" <[email protected]> wrote:

> Masami Hiramatsu wrote:
> > On Mon, 14 Jun 2021 23:33:29 +0530
> > "Naveen N. Rao" <[email protected]> wrote:
> >
> >> We currently limit maxactive for a kretprobe to 4096 when registering
> >> the same through tracefs. The comment indicates that this is done so as
> >> to keep list traversal reasonable. However, we don't ever iterate over
> >> all kretprobe_instance structures. The core kprobes infrastructure also
> >> imposes no such limitation.
> >>
> >> Remove the limit from the tracefs interface. This limit is easy to hit
> >> on large cpu machines when tracing functions that can sleep.
> >>
> >> Reported-by: Anton Blanchard <[email protected]>
> >> Signed-off-by: Naveen N. Rao <[email protected]>
> >
> > OK, but I don't like to just remove the limit (since it can cause
> > memory shortage easily.)
> > Can't we make it configurable? I don't mean Kconfig, but
> > tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
> >
> > Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
> > it can limit both trace_kprobe and kprobes itself.
>
> I don't think it is good to put a new tunable in debugfs -- we don't
> have any kprobes tunable there, so this adds a dependency on debugfs
> which shouldn't be necessary.
>
> /proc/sys/debug/ may be a better fit since we have the
> kprobes-optimization flag to disable optprobes there, though I'm not
> sure if a new sysfs file is agreeable.

Indeed.

> But, I'm not too sure this really is a problem. Maxactive is a user
> _opt-in_ feature which needs to be explicitly added to an event
> definition. In that sense, isn't this already a tunable?

Let me explain the background of the limitation.

Maxactive currently has no limit in the kprobe kernel module API,
because the kernel module developer must take care of the max memory
usage (and they can).

But the tracefs user may NOT have enough information about what
happens if they pass something like 10M for maxactive (it will consume
around 500MB kernel memory for one kretprobe).

To avoid such trouble, I set the 4096 limit on the maxactive
parameter. Of course, 4096 may not be enough for some use cases. I'm happy
to raise it (e.g. to 32k; wouldn't that be enough?), but removing the
limit entirely could easily cause OOM trouble.

Thank you,

>
>
> - Naveen
>

--
Masami Hiramatsu <[email protected]>
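
As a rough worked example of the memory concern above: the figures in this message (10M instances costing around 500MB) imply about 50 bytes per kretprobe_instance. The sketch below only restates that arithmetic. The 50-byte constant is derived from the message, not from the actual structure size on any architecture, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: ~50 bytes per instance, inferred from "10M -> ~500MB". */
#define ASSUMED_BYTES_PER_INSTANCE 50ULL

/* Estimated kernel memory preallocated for one kretprobe. */
static uint64_t kretprobe_pool_bytes(uint64_t maxactive)
{
	/* Refuse inputs that would overflow the multiplication. */
	assert(maxactive <= UINT64_MAX / ASSUMED_BYTES_PER_INSTANCE);
	return maxactive * ASSUMED_BYTES_PER_INSTANCE;
}
```

Under that assumption, the existing 4096 cap costs about 200KB per kretprobe and a 32k cap about 1.6MB, while 10M instances reach the 500MB quoted above.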

2021-06-16 01:05:22

by Steven Rostedt

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Wed, 16 Jun 2021 09:46:22 +0900
Masami Hiramatsu <[email protected]> wrote:

> To avoid such trouble, I had set the 4096 limitation for the maxactive
> parameter. Of course 4096 may not enough for some use-cases. I'm welcome
> to expand it (e.g. 32k, isn't it enough?), but removing the limitation
> may cause OOM trouble easily.

What if you just made the max 10 * the number of possible cpus, or 4096,
whichever is greater? Why would a user need more?

I'd still like to get a wrapper around function graph tracing so that
kretprobes could use it. I think that would get rid of the requirement
of maxactive, because isn't that just used to have a way to know the
original return value?

-- Steve
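
The cap proposed in this message can be written down directly. The following is a minimal userspace sketch, assuming the caller supplies the number of possible CPUs (in the kernel that would come from num_possible_cpus()). The function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Proposed cap: max(10 * possible CPUs, 4096). */
static unsigned long suggested_maxactive_cap(unsigned long possible_cpus)
{
	unsigned long scaled;

	/* Refuse inputs that would overflow the multiplication. */
	assert(possible_cpus <= ULONG_MAX / 10);
	scaled = 10 * possible_cpus;
	return scaled > 4096 ? scaled : 4096;
}
```

Note that under this formula the cap only rises above 4096 on machines with 410 or more possible CPUs.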

2021-06-16 02:28:23

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Tue, 15 Jun 2021 21:03:51 -0400
Steven Rostedt <[email protected]> wrote:

> On Wed, 16 Jun 2021 09:46:22 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
> > To avoid such trouble, I had set the 4096 limitation for the maxactive
> > parameter. Of course 4096 may not enough for some use-cases. I'm welcome
> > to expand it (e.g. 32k, isn't it enough?), but removing the limitation
> > may cause OOM trouble easily.
>
> What if you just made the max as 10 * number of possible cpus, or 4096,
> which ever is greater? Why would a user need more?

It could be. But actually, that is not the correct number, because the
number of instances depends on the number of processes and the possibility
of recursion. Thus a huge system which runs more than 64k processes
may need more than that.

> I'd still like to get a wrapper around function graph tracing so that
> kretprobes could use it. I think that would get rid of the requirement
> of maxactive, because isn't that just used to have a way to know the
> original return value?

Hmm, yes, on some architectures it can be done. But on other architectures
we still need the current implementation as a generic solution.
What I need is not a full wrapper around the function graph tracer, but
just to share the per-task (software) shadow stack.

Thank you,

--
Masami Hiramatsu <[email protected]>

2021-06-16 20:29:17

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Wed, 16 Jun 2021 11:27:11 +0900
Masami Hiramatsu <[email protected]> wrote:

> On Tue, 15 Jun 2021 21:03:51 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Wed, 16 Jun 2021 09:46:22 +0900
> > Masami Hiramatsu <[email protected]> wrote:
> >
> > > To avoid such trouble, I had set the 4096 limitation for the maxactive
> > > parameter. Of course 4096 may not enough for some use-cases. I'm welcome
> > > to expand it (e.g. 32k, isn't it enough?), but removing the limitation
> > > may cause OOM trouble easily.
> >
> > What if you just made the max as 10 * number of possible cpus, or 4096,
> > which ever is greater? Why would a user need more?
>
> It could be. But actually, that is not correct number because the
> number of instances depends on the number of processes and the possibility
> of recursive. Thus the huge system which runs more than 64k processes,
> may need more than that.
>
> > I'd still like to get a wrapper around function graph tracing so that
> > kretprobes could use it. I think that would get rid of the requirement
> > of maxactive, because isn't that just used to have a way to know the
> > original return value?
>
> Hmm, yes, on some arch, it can be done. But on other arch we still need
> current implementation for generic solution.
> What I need is not fully wrapped by the function graph, but just share
> the per-task (software) shadow stack.

BTW, I have 2 ideas to fix this other than the wrapper.

1. Use the func-graph tracer API directly from dynamic events instead of
kretprobes. This will be enabled only if the arch supports the fgraph
tracer and it is enabled. maxactive will be ignored if this is enabled,
and the tracefs user may not need anything other than the return value.
(BTW, is it possible to access the stack? In some cases, the return
value can be passed via the stack.)

2. Move the kretprobe instance pool from kretprobe to struct task.
This pool will allocate one page per task, and will be shared among all
kretprobes. The pool will be allocated when the first kretprobe
is registered. maxactive will be kept for anyone who wants to
use per-instance data. But since dynamic events don't use it,
it will be removed from tracefs and perf.

Thank you,

--
Masami Hiramatsu <[email protected]>

2021-06-17 16:35:32

by Naveen N. Rao

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

Masami Hiramatsu wrote:
> On Tue, 15 Jun 2021 23:11:27 +0530
> "Naveen N. Rao" <[email protected]> wrote:
>
>> Masami Hiramatsu wrote:
>> > On Mon, 14 Jun 2021 23:33:29 +0530
>> > "Naveen N. Rao" <[email protected]> wrote:
>> >
>> >> We currently limit maxactive for a kretprobe to 4096 when registering
>> >> the same through tracefs. The comment indicates that this is done so as
>> >> to keep list traversal reasonable. However, we don't ever iterate over
>> >> all kretprobe_instance structures. The core kprobes infrastructure also
>> >> imposes no such limitation.
>> >>
>> >> Remove the limit from the tracefs interface. This limit is easy to hit
>> >> on large cpu machines when tracing functions that can sleep.
>> >>
>> >> Reported-by: Anton Blanchard <[email protected]>
>> >> Signed-off-by: Naveen N. Rao <[email protected]>
>> >
>> > OK, but I don't like to just remove the limit (since it can cause
>> > memory shortage easily.)
>> > Can't we make it configurable? I don't mean Kconfig, but
>> > tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
>> >
>> > Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
>> > it can limit both trace_kprobe and kprobes itself.
>>
>> I don't think it is good to put a new tunable in debugfs -- we don't
>> have any kprobes tunable there, so this adds a dependency on debugfs
>> which shouldn't be necessary.
>>
>> /proc/sys/debug/ may be a better fit since we have the
>> kprobes-optimization flag to disable optprobes there, though I'm not
>> sure if a new sysfs file is agreeable.
>
> Indeed.
>
>> But, I'm not too sure this really is a problem. Maxactive is a user
>> _opt-in_ feature which needs to be explicitly added to an event
>> definition. In that sense, isn't this already a tunable?
>
> Let me explain the background of the limitation.

Thanks for the background on this.

>
> Maxactive is currently no limit for the kprobe kernel module API,
> because the kernel module developer must take care of the max memory
> usage (and they can).
>
> But the tracefs user may NOT have enough information about what
> happens if they pass something like 10M for maxactive (it will consume
> around 500MB kernel memory for one kretprobe).

Ok, thinking more about this...

Right now, the only way for a user to notice that kretprobe maxactive is
an issue is by looking at kprobe_profile. This is not even possible if
using a bcc tool, which uses perf_event_open(). It took the reporting
team some effort to even identify that the reason why they were getting
weird results when tracing was due to the default value used for
kretprobe maxactive; and then that 4096 was the hard limit through
tracefs.

So, IMO, anyone using any existing bcc tool, or a pre-canned perf script
will not even be able to identify this as a problem to begin with... at
least, not without some effort.

To address this, as a first step, we should probably consider parsing
kprobe_profile and printing a warning with 'perf' if we detect a
non-zero miss count for a probe -- both a regular probe, as well as a
retprobe.

If we do this, the nice thing with kprobe_profile is that the probe miss
count is available, and can serve as a good way to decide what a more
reasonable maxactive value should be. This should help prevent users
from trying with arbitrary maxactive values.

For perf_event_open(), perhaps we can introduce an ioctl to query the
probe miss count.

>
> To avoid such trouble, I had set the 4096 limitation for the maxactive
> parameter. Of course 4096 may not enough for some use-cases. I'm welcome
> to expand it (e.g. 32k, isn't it enough?), but removing the limitation
> may cause OOM trouble easily.

Do you have suggestions for how we can determine a better limit? As you
point out in the other email, there could very well be 64k or more
processes on a large machine. Since the primary concern is memory usage,
we probably need to decide this based on total memory. But, memory usage
will vary depending on system load...

Perhaps we can start by making maxactive limit be a tunable with a
default value of 4096, with the understanding that users will be careful
when bumping up this value. Hopefully, scripts won't simply start
writing into this file ;)

If we can feed back the probe miss count, tools should be able to guide
users on what would be a reasonable maxactive value to use.

Thanks,
Naveen
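
A minimal sketch of the kprobe_profile check suggested in this message, assuming the three-column layout produced by the seq_printf() in patch 1/2 (probe name, hit count, miss count). The struct and function names are hypothetical and this is not how perf is implemented; it only shows the parsing step a tool could build a warning on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One parsed line of kprobe_profile: name, hits, misses. */
struct profile_entry {
	char name[128];
	unsigned long nhit;
	unsigned long nmissed;
};

/* Parse one line; returns 0 on success, -1 if the line is malformed. */
static int parse_profile_line(const char *line, struct profile_entry *out)
{
	assert(line != NULL && out != NULL);
	if (sscanf(line, " %127s %lu %lu",
		   out->name, &out->nhit, &out->nmissed) != 3)
		return -1;
	return 0;
}

/* A tool could warn whenever a probe reports any misses. */
static int probe_needs_warning(const struct profile_entry *e)
{
	return e->nmissed != 0;
}
```

A caller would read the kprobe_profile file in tracefs line by line with fgets(), pass each line to parse_profile_line(), and report the probes for which probe_needs_warning() returns non-zero.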

2021-06-17 16:36:24

by Naveen N. Rao

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

Masami Hiramatsu wrote:
> On Wed, 16 Jun 2021 11:27:11 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
>> On Tue, 15 Jun 2021 21:03:51 -0400
>> Steven Rostedt <[email protected]> wrote:
>>
>> > On Wed, 16 Jun 2021 09:46:22 +0900
>> > Masami Hiramatsu <[email protected]> wrote:
>> >
>> > > To avoid such trouble, I had set the 4096 limitation for the maxactive
>> > > parameter. Of course 4096 may not enough for some use-cases. I'm welcome
>> > > to expand it (e.g. 32k, isn't it enough?), but removing the limitation
>> > > may cause OOM trouble easily.
>> >
>> > What if you just made the max as 10 * number of possible cpus, or 4096,
>> > which ever is greater? Why would a user need more?
>>
>> It could be. But actually, that is not correct number because the
>> number of instances depends on the number of processes and the possibility
>> of recursive. Thus the huge system which runs more than 64k processes,
>> may need more than that.
>>
>> > I'd still like to get a wrapper around function graph tracing so that
>> > kretprobes could use it. I think that would get rid of the requirement
>> > of maxactive, because isn't that just used to have a way to know the
>> > original return value?
>>
>> Hmm, yes, on some arch, it can be done. But on other arch we still need
>> current implementation for generic solution.
>> What I need is not fully wrapped by the function graph, but just share
>> the per-task (software) shadow stack.
>
> BTW, I have 2 ideas to fix this except for wrapper.
>
> 1. Use func-graph tracer API directly from dynamic event instead of
> kretprobes. This will be enabled only if the arch supports fgraph
> tracer and enable it. maxactive will be ignored if this is enabled,
> and tracefs user may not need except for the return value
> (BTW, is that possible to access the stack? In some case, return
> value can be passed via stack)
>
> 2. Move the kretprobe instance pool from kretprobe to struct task.
> This pool will allocates one page per task, and shared among all
> kretprobes. This pool will be allocated when the 1st kretprobe
> is registered. maxactive will be kept for someone who wants to
> use per-instance data. But since dynamic event doesn't use it,
> it will be removed from tracefs and perf.

Won't this result in _more_ memory usage compared to what we have now?

Thanks,
Naveen

2021-06-17 17:09:45

by Steven Rostedt

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Thu, 17 Jun 2021 22:04:34 +0530
"Naveen N. Rao" <[email protected]> wrote:

> > 2. Move the kretprobe instance pool from kretprobe to struct task.
> > This pool will allocates one page per task, and shared among all
> > kretprobes. This pool will be allocated when the 1st kretprobe
> > is registered. maxactive will be kept for someone who wants to
> > use per-instance data. But since dynamic event doesn't use it,
> > it will be removed from tracefs and perf.
>
> Won't this result in _more_ memory usage compared to what we have now?

Maybe or maybe not. At least with this approach (or the function graph
one), you will allocate enough for the environment involved. If there's
thousands of tasks, then yes, it will allocate more memory. But if you are
running thousands of tasks, you should have a lot of memory in the machine.

If you are only running a few tasks, it will be less than the current
approach.

-- Steve
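
The trade-off described in this message can be put into numbers with a small sketch. Both constants are assumptions: a 4096-byte page (architecture dependent) and roughly 50 bytes per kretprobe_instance (inferred from the "10M for maxactive, around 500MB" figure earlier in the thread). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_BYTES		4096ULL	/* assumption: 4K pages */
#define ASSUMED_BYTES_PER_INSTANCE	50ULL	/* assumption, see lead-in */

/* Proposed scheme: one page per task, shared by all kretprobes. */
static uint64_t per_task_pool_bytes(uint64_t tasks)
{
	assert(tasks <= UINT64_MAX / ASSUMED_PAGE_BYTES);
	return tasks * ASSUMED_PAGE_BYTES;
}

/* Current scheme: maxactive instances preallocated per kretprobe. */
static uint64_t per_probe_pool_bytes(uint64_t nprobes, uint64_t maxactive)
{
	assert(maxactive <= UINT64_MAX / ASSUMED_BYTES_PER_INSTANCE);
	assert(maxactive == 0 ||
	       nprobes <= UINT64_MAX / (maxactive * ASSUMED_BYTES_PER_INSTANCE));
	return nprobes * maxactive * ASSUMED_BYTES_PER_INSTANCE;
}
```

Under these assumptions, 10 tasks need about 40KB of per-task pools against about 200KB for a single kretprobe with maxactive=4096, whereas 64k tasks need about 256MB of per-task pools regardless of how many kretprobes are registered.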

2021-06-18 06:20:19

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Thu, 17 Jun 2021 13:07:13 -0400
Steven Rostedt <[email protected]> wrote:

> On Thu, 17 Jun 2021 22:04:34 +0530
> "Naveen N. Rao" <[email protected]> wrote:
>
> > > 2. Move the kretprobe instance pool from kretprobe to struct task.
> > > This pool will allocates one page per task, and shared among all
> > > kretprobes. This pool will be allocated when the 1st kretprobe
> > > is registered. maxactive will be kept for someone who wants to
> > > use per-instance data. But since dynamic event doesn't use it,
> > > it will be removed from tracefs and perf.
> >
> > Won't this result in _more_ memory usage compared to what we have now?
>
> Maybe or maybe not. At least with this approach (or the function graph
> one), you will allocate enough for the environment involved. If there's
> thousands of tasks, then yes, it will allocate more memory. But if you are
> running thousands of tasks, you should have a lot of memory in the machine.
>
> If you are only running a few tasks, it will be less than the current
> approach.

Right, this depends on how many tasks you are running on your machine.
Anyway, since you may not be sure how much maxactive is enough, you will
set maxactive high, and then it can consume more than that. Of course you
can optimize by trial and error. But that does not cover all cases,
because the number of tasks can increase while tracing. You might
need to re-configure it by checking the nmissed count again.

Thank you,

--
Masami Hiramatsu <[email protected]>

2021-06-18 06:44:03

by Masami Hiramatsu

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

On Thu, 17 Jun 2021 21:49:36 +0530
"Naveen N. Rao" <[email protected]> wrote:

> Masami Hiramatsu wrote:
> > On Tue, 15 Jun 2021 23:11:27 +0530
> > "Naveen N. Rao" <[email protected]> wrote:
> >
> >> Masami Hiramatsu wrote:
> >> > On Mon, 14 Jun 2021 23:33:29 +0530
> >> > "Naveen N. Rao" <[email protected]> wrote:
> >> >
> >> >> We currently limit maxactive for a kretprobe to 4096 when registering
> >> >> the same through tracefs. The comment indicates that this is done so as
> >> >> to keep list traversal reasonable. However, we don't ever iterate over
> >> >> all kretprobe_instance structures. The core kprobes infrastructure also
> >> >> imposes no such limitation.
> >> >>
> >> >> Remove the limit from the tracefs interface. This limit is easy to hit
> >> >> on large cpu machines when tracing functions that can sleep.
> >> >>
> >> >> Reported-by: Anton Blanchard <[email protected]>
> >> >> Signed-off-by: Naveen N. Rao <[email protected]>
> >> >
> >> > OK, but I don't like to just remove the limit (since it can cause
> >> > memory shortage easily.)
> >> > Can't we make it configurable? I don't mean Kconfig, but
> >> > tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
> >> >
> >> > Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
> >> > it can limit both trace_kprobe and kprobes itself.
> >>
> >> I don't think it is good to put a new tunable in debugfs -- we don't
> >> have any kprobes tunable there, so this adds a dependency on debugfs
> >> which shouldn't be necessary.
> >>
> >> /proc/sys/debug/ may be a better fit since we have the
> >> kprobes-optimization flag to disable optprobes there, though I'm not
> >> sure if a new sysfs file is agreeable.
> >
> > Indeed.
> >
> >> But, I'm not too sure this really is a problem. Maxactive is a user
> >> _opt-in_ feature which needs to be explicitly added to an event
> >> definition. In that sense, isn't this already a tunable?
> >
> > Let me explain the background of the limitation.
>
> Thanks for the background on this.
>
> >
> > Maxactive currently has no limit in the kprobe kernel module API,
> > because the kernel module developer must take care of the max memory
> > usage (and they can).
> >
> > But the tracefs user may NOT have enough information about what
> > happens if they pass something like 10M for maxactive (it will consume
> > around 500MB kernel memory for one kretprobe).
>
> Ok, thinking more about this...
>
> Right now, the only way for a user to notice that kretprobe maxactive is
> an issue is by looking at kprobe_profile. This is not even possible if
> using a bcc tool, which uses perf_event_open(). It took the reporting
> team some effort to even identify that the reason why they were getting
> weird results when tracing was due to the default value used for
> kretprobe maxactive; and then that 4096 was the hard limit through
> tracefs.
>
> So, IMO, anyone using any existing bcc tool, or a pre-canned perf script
> will not even be able to identify this as a problem to begin with... at
> least, not without some effort.

Yeah, the nmissed counter must be exposed in that case via tracefs or
debugfs. Maybe eBPF can also warn about it (by checking the nmissed count).

> To address this, as a first step, we should probably consider parsing
> kprobe_profile and printing a warning with 'perf' if we detect a
> non-zero miss count for a probe -- both a regular probe, as well as a
> retprobe.

Yeah, it is doable. Note that perf-probe only sets up the event;
perf-trace or other commands will then use it.
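
A minimal sketch of the check discussed here, assuming kprobe_profile keeps
its documented three-column layout (event name, hit count, miss count) and
is read from tracefs, typically /sys/kernel/tracing/kprobe_profile. The
function names and the sample event names are made up for illustration;
this is not how perf implements anything today.

```python
from typing import Dict, NamedTuple


class ProbeStats(NamedTuple):
    hits: int
    missed: int


def parse_kprobe_profile(text: str) -> Dict[str, ProbeStats]:
    """Parse kprobe_profile text: one 'event hits misses' triple per line."""
    stats: Dict[str, ProbeStats] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not 'name number number'.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        name, hits, missed = fields
        stats[name] = ProbeStats(int(hits), int(missed))
    return stats


def probes_with_misses(stats: Dict[str, ProbeStats]) -> Dict[str, int]:
    """Return only the probes whose miss count is non-zero."""
    return {name: s.missed for name, s in stats.items() if s.missed > 0}


# Hypothetical sample content; these event names are illustrative only.
SAMPLE = """\
  p_vfs_read_0                                       1200               0
  r_do_sys_open_0                                    5000             321
"""

if __name__ == "__main__":
    for name, missed in probes_with_misses(parse_kprobe_profile(SAMPLE)).items():
        print(f"warning: {name} missed {missed} hits; consider a larger maxactive")
```

A tool would read the real file instead of SAMPLE and print the warning
after the tracing session ends.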

> If we do this, the nice thing with kprobe_profile is that the probe miss
> count is available, and can serve as a good way to decide what a more
> reasonable maxactive value should be. This should help prevent users
> from trying with arbitrary maxactive values.

Such a feedback loop is an interesting idea.
Note that the nmissed count is an accumulated value, not the maximum number
of instances that will be needed.

> For perf_event_open(), perhaps we can introduce an ioctl to query the
> probe miss count.

Or, maybe we can expand maxactive at runtime, e.g. add a shortage
counter to the kretprobe, and run a monitor kernel thread (or kworker).
If the shortage counter has been incremented, the monitor allocates instances
(2x the counter) and gives them to the kretprobe, then resets the shortage
counter. This adaptive maxactive may miss some hits in the beginning,
but will eventually find the optimal maxactive value automatically.
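
The adaptive scheme can be modeled in a few lines. This is a user-space toy
model, not kernel code: the class and method names are hypothetical, and
there is no locking or real allocation. It only shows the bookkeeping: a
failed acquisition bumps the shortage counter, and the monitor adds twice
that count before resetting it.

```python
class AdaptivePool:
    """Toy model of a kretprobe instance pool that grows on shortage."""

    def __init__(self, initial: int) -> None:
        self.free = initial      # instances currently available
        self.capacity = initial  # total instances allocated so far
        self.shortage = 0        # failed acquisitions since the last monitor run

    def acquire(self) -> bool:
        """Take one instance on function entry; False models a missed probe."""
        if self.free == 0:
            self.shortage += 1
            return False
        self.free -= 1
        return True

    def release(self) -> None:
        """Give the instance back on function return."""
        self.free += 1

    def monitor(self) -> int:
        """Add 2x the observed shortage, then reset the shortage counter."""
        added = 2 * self.shortage
        self.free += added
        self.capacity += added
        self.shortage = 0
        return added
```

For example, a pool created with 4 instances that sees 10 overlapping
entries records 6 misses; the next monitor run adds 12 instances, for a
total capacity of 16.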

> > To avoid such trouble, I had set the 4096 limit for the maxactive
> > parameter. Of course 4096 may not be enough for some use-cases. I'm open
> > to expanding it (e.g. 32k, isn't that enough?), but removing the limit
> > may easily cause OOM trouble.
>
> Do you have suggestions for how we can determine a better limit? As you
> point out in the other email, there could very well be 64k or more
> processes on a large machine. Since the primary concern is memory usage,
> we probably need to decide this based on total memory. But, memory usage
> will vary depending on system load...

This is a very good question. IMHO, it might be better to calculate the total
maxactive from the system memory size. For example, if 1% of system memory
can be used for kretprobes, a 16GB system will allow using 160MB for
kretprobes, which means about "3M" is the max number of maxactive (at
roughly 50 bytes per instance), or multiple kretprobes can share it.
Doesn't that sound like enough? Of course this will need to show the
current usage of the kretprobe instance objects via tracefs or debugfs.
But this total cap seems reasonable to me to avoid OOM trouble.
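
The arithmetic behind a memory-based cap can be sketched as follows. The
50-byte instance size is an assumption derived from the earlier estimate in
this thread (10M instances consuming around 500MB); the real size depends
on the kernel version and on any per-instance data. The function and
parameter names are illustrative.

```python
# Assumption: ~500MB / 10M instances, per the estimate earlier in the thread.
INSTANCE_BYTES = 50


def maxactive_budget(total_mem_bytes: int, percent: int = 1,
                     instance_bytes: int = INSTANCE_BYTES) -> int:
    """Total number of kretprobe instances that fit in a share of memory."""
    budget_bytes = total_mem_bytes * percent // 100
    return budget_bytes // instance_bytes


if __name__ == "__main__":
    total = 16 * 10**9  # a 16GB system, using decimal gigabytes
    print(maxactive_budget(total))
```

With these assumptions, a 160MB budget comes to roughly 3 million
instances, shared across all registered kretprobes.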

> Perhaps we can start by making maxactive limit be a tunable with a
> default value of 4096, with the understanding that users will be careful
> when bumping up this value. Hopefully, scripts won't simply start
> writing into this file ;)

Yeah, that's what I suggested at first, because the best maxactive will
depend on the max number of the *processes* and the probed function.

If the probed function will NOT be preempted or sleep, maxactive will be
the number of *processor cores*. Or, if it can be preempted or sleep, it
will be the max number of *processes*. If the probed function can be
called recursively (Note: this is a rare case), the maxactive has to
be multiplied.

It is hard to estimate the max number of processes, since it depends
on the system. Small embedded systems don't run thousands of processes,
but big servers will run more than ten thousand processes.
Thus making it tunable will be a good idea.
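
The sizing rule above can be written down as a small helper. This is only a
sketch of the reasoning, with hypothetical names; treating the recursion
multiplier as a plain depth factor is an assumption.

```python
def estimate_maxactive(can_sleep: bool, num_cpus: int, max_tasks: int,
                       recursion_depth: int = 1) -> int:
    """Estimate maxactive for one kretprobe.

    A function that cannot sleep or be preempted has at most one active
    call per CPU; otherwise every task may be inside it at once.
    """
    base = max_tasks if can_sleep else num_cpus
    # Assumption: recursion multiplies the need by the expected depth.
    return base * recursion_depth


if __name__ == "__main__":
    print(estimate_maxactive(can_sleep=False, num_cpus=64, max_tasks=20000))
    print(estimate_maxactive(can_sleep=True, num_cpus=64, max_tasks=20000))
```

The hard part in practice is the max_tasks input, which is why a tunable
limit is attractive.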

Thank you,

>
> If we can feed back the probe miss count, tools should be able to guide
> users on what would be a reasonable maxactive value to use.
>
>
> Thanks,
> Naveen
>

--
Masami Hiramatsu <[email protected]>

2021-06-18 10:56:06

by Naveen N. Rao

[permalink] [raw]

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

Masami Hiramatsu wrote:
> On Thu, 17 Jun 2021 13:07:13 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> On Thu, 17 Jun 2021 22:04:34 +0530
>> "Naveen N. Rao" <[email protected]> wrote:
>>
>> > > 2. Move the kretprobe instance pool from kretprobe to struct task.
>> > > This pool will allocate one page per task, and will be shared among
>> > > all kretprobes. This pool will be allocated when the 1st kretprobe
>> > > is registered. maxactive will be kept for someone who wants to
>> > > use per-instance data. But since dynamic events don't use it,
>> > > it will be removed from tracefs and perf.
>> >
>> > Won't this result in _more_ memory usage compared to what we have now?
>>
>> Maybe or maybe not. At least with this approach (or the function graph
>> one), you will allocate enough for the environment involved. If there's
>> thousands of tasks, then yes, it will allocate more memory. But if you are
>> running thousands of tasks, you should have a lot of memory in the machine.
>>
>> If you are only running a few tasks, it will be less than the current
>> approach.
>
> Right, this depends on how many tasks you are running on your machine.
> Anyway, since you may not be sure how much maxactive is enough, you will
> set maxactive high, and then it can consume more than that. Of course you
> can optimize by trial and error. But that does not cover all cases,
> because the number of tasks can increase while tracing. You might
> need to re-configure it by checking the nmissed count again.

Yes. If we go down this route, we should limit the per-task allocation
to a more reasonable 4k -- powerpc uses 64k pages.
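
To put rough numbers on this, the per-task pool cost can be compared with
and without the 4k cap. The helper below is a hypothetical illustration and
the task count is an arbitrary example, not a measurement.

```python
from typing import Optional


def per_task_pool_bytes(num_tasks: int, page_size: int,
                        cap_bytes: Optional[int] = None) -> int:
    """Total memory used if each task gets one pool of at most cap_bytes."""
    per_task = page_size if cap_bytes is None else min(page_size, cap_bytes)
    return num_tasks * per_task


if __name__ == "__main__":
    tasks = 10_000  # arbitrary example task count
    print(per_task_pool_bytes(tasks, 64 * 1024))        # one 64k page per task
    print(per_task_pool_bytes(tasks, 64 * 1024, 4096))  # capped at 4k per task
```

With 10,000 tasks, one full 64k page per task is about 655MB, while a 4k
cap keeps the total near 41MB.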

Thanks,
Naveen

2021-06-18 13:21:47

by Naveen N. Rao

[permalink] [raw]

Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

Masami Hiramatsu wrote:
>
>> To address this, as a first step, we should probably consider parsing
>> kprobe_profile and printing a warning with 'perf' if we detect a
>> non-zero miss count for a probe -- both a regular probe, as well as a
>> retprobe.
>
> Yeah, it is doable. Note that perf-probe only sets up the event;
> perf-trace or other commands will then use it.
>
>
>> If we do this, the nice thing with kprobe_profile is that the probe miss
>> count is available, and can serve as a good way to decide what a more
>> reasonable maxactive value should be. This should help prevent users
>> from trying with arbitrary maxactive values.
>
> Such a feedback loop is an interesting idea.
> Note that the nmissed count is an accumulated value, not the maximum number
> of instances that will be needed.

Yes, we will have to factor in the duration during which the event was
active. This will still be an approximation, but it serves as a good
starting point. It may need a few tries to get this right, but more
importantly, the user knows instantly that there are missed probes.
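
One way to factor in the duration is to sample the accumulated nmissed
value twice and turn the difference into a rate. The sketch below uses a
hypothetical helper name, and how a rate should map to a maxactive value is
left open here; it only gives a first indication of how badly the pool is
undersized.

```python
def miss_rate(missed_before: int, missed_after: int, seconds: float) -> float:
    """Missed probe hits per second between two samples of nmissed."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return (missed_after - missed_before) / seconds


if __name__ == "__main__":
    # Example: nmissed went from 100 to 400 during a 60 second run.
    print(miss_rate(100, 400, 60.0))
```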

>
>> For perf_event_open(), perhaps we can introduce an ioctl to query the
>> probe miss count.
>
> Or, maybe we can expand maxactive at runtime, e.g. add a shortage
> counter to the kretprobe, and run a monitor kernel thread (or kworker).
> If the shortage counter has been incremented, the monitor allocates instances
> (2x the counter) and gives them to the kretprobe, then resets the shortage
> counter. This adaptive maxactive may miss some hits in the beginning,
> but will eventually find the optimal maxactive value automatically.

I like this idea and I have been thinking along these lines too. If we
start with a better default (rather than just num_possible_cpus() used
today), I suspect we may be able to get this to work well enough to not
have to miss any probes. Specifying 'maxactive' can still serve as a
workaround to allocate a larger initial set of kretprobe_instances in
case this doesn't work.

>
>
>> > To avoid such trouble, I had set the 4096 limit for the maxactive
>> > parameter. Of course 4096 may not be enough for some use-cases. I'm open
>> > to expanding it (e.g. 32k, isn't that enough?), but removing the limit
>> > may easily cause OOM trouble.
>>
>> Do you have suggestions for how we can determine a better limit? As you
>> point out in the other email, there could very well be 64k or more
>> processes on a large machine. Since the primary concern is memory usage,
>> we probably need to decide this based on total memory. But, memory usage
>> will vary depending on system load...
>
> This is a very good question. IMHO, it might be better to calculate the total
> maxactive from the system memory size. For example, if 1% of system memory
> can be used for kretprobes, a 16GB system will allow using 160MB for
> kretprobes, which means about "3M" is the max number of maxactive (at
> roughly 50 bytes per instance), or multiple kretprobes can share it.
> Doesn't that sound like enough? Of course this will need to show the
> current usage of the kretprobe instance objects via tracefs or debugfs.
> But this total cap seems reasonable to me to avoid OOM trouble.
>
>> Perhaps we can start by making maxactive limit be a tunable with a
>> default value of 4096, with the understanding that users will be careful
>> when bumping up this value. Hopefully, scripts won't simply start
>> writing into this file ;)
>
> Yeah, that's what I suggested at first, because the best maxactive will
> depend on the max number of the *processes* and the probed function.
>
> If the probed function will NOT be preempted or sleep, maxactive will be
> the number of *processor cores*. Or, if it can be preempted or sleep, it
> will be the max number of *processes*. If the probed function can be
> called recursively (Note: this is a rare case), the maxactive has to
> be multiplied.
>
> It is hard to estimate the max number of processes, since it depends
> on the system. Small embedded systems don't run thousands of processes,
> but big servers will run more than ten thousand processes.
> Thus making it tunable will be a good idea.

Agree.

Thanks,
Naveen